Member of Technical Staff, Pre-Training Data

Cohere · Remote (Toronto)

WFA Digital Insight

The demand for AI and machine learning specialists is skyrocketing, with a 25% growth in job postings over the past year. As a leader in natural language processing, Cohere is at the forefront of this trend. With a strong focus on innovation and research, this company stands out for its commitment to pushing the boundaries of AI capabilities. Candidates with a passion for bridging research and engineering to solve complex data-related challenges will thrive in this role. Before applying, consider the importance of data quality assessment and experimentation with data mixtures in driving advancements in AI model training.

Job Description

About the Role

As a Member of Technical Staff, Pre-Training Data at Cohere, you will play a pivotal role in developing the data pipeline that underpins the company's advanced language models. Your work will involve conducting data ablations to evaluate data quality and constructing pre-training data mixtures to enhance model performance. This role requires a strong foundation in software engineering, data modeling, and research, as well as the ability to collaborate with cross-functional teams.

The team at Cohere is made up of researchers, engineers, designers, and more, all passionate about their craft and dedicated to delivering efficient and reliable language understanding and generation capabilities. As a member of this team, you will be expected to contribute to the development of cutting-edge language models and drive innovation in natural language processing.

Cohere's mission is to scale intelligence to serve humanity, and the company is committed to creating a diverse and inclusive work environment that values and celebrates different perspectives. Cohere has offices in London, Paris, Toronto, San Francisco, and New York, and its remote-friendly culture lets you work from anywhere between the EST and EU time zones.

What You Will Do

  • Conduct data ablations to assess data quality and experiment with data mixtures to enhance model performance
  • Develop robust data modeling techniques to ensure datasets are structured and formatted for optimal training efficiency
  • Research and implement innovative data curation methods, leveraging Cohere's infrastructure to drive advancements in natural language processing
  • Collaborate with cross-functional teams, including researchers and engineers, to ensure data pipelines meet the demands of cutting-edge language models
  • Work closely with the team to identify and address data quality issues and develop solutions to improve model performance
  • Develop and maintain large-scale datasets, including web data, code data, and multilingual corpora
  • Build data processing pipelines using frameworks such as Apache Spark, Apache Beam, Pandas, or similar tools
  • Collaborate with the research team to develop and implement new data-related research projects
  • Participate in code reviews and contribute to the improvement of the codebase

What We Are Looking For

  • Strong software engineering skills, with proficiency in Python and experience building data pipelines
  • Familiarity with curriculum learning, data mixing, and data attribution
  • Experience working with large-scale datasets, including web data, code data, and multilingual corpora
  • Knowledge of data quality assessment techniques and experimentation with data mixtures
  • A passion for bridging research and engineering to solve complex data-related challenges in AI model training
  • Experience with data processing frameworks such as Apache Spark, Apache Beam, Pandas, or similar tools
  • Strong collaboration and communication skills, with the ability to work effectively with cross-functional teams
  • A bachelor's or master's degree in computer science, mathematics, or a related field

Nice to Have

  • Papers published at top-tier venues (such as NeurIPS, ICML, ICLR, AISTATS, MLSys, JMLR, AAAI, Nature, COLING, ACL, EMNLP)
  • Experience working with distributed computing systems and cloud-based infrastructure
  • Familiarity with containerization using Docker and Kubernetes

Benefits and Perks

  • Competitive salary and benefits package
  • Opportunity to work on cutting-edge projects and contribute to the development of innovative AI models
  • Collaborative and dynamic work environment with a team of experienced professionals
  • Flexible working hours and remote work options
  • Professional development opportunities, including training and conference attendance
  • Access to the latest technologies and tools
  • A culture that values and celebrates diversity and inclusion
  • Comprehensive health insurance and wellness programs
  • Generous paid time off and vacation policy

How to Stand Out

  • Make sure to highlight your experience with data pipelines and software engineering in your application
  • Showcase your knowledge of data quality assessment and experimentation with data mixtures
  • Be prepared to discuss your passion for bridging research and engineering to solve complex data-related challenges
  • Familiarize yourself with Cohere's mission and values, and be prepared to explain how your skills and experience align with the company's goals
  • Don't hesitate to ask about the company culture and remote work setup during the interview process
  • Consider sharing examples of your previous work, such as research papers or projects, to demonstrate your expertise
  • Be prepared to discuss your experience with distributed computing systems and cloud-based infrastructure, if applicable

This is a remote position listed on WFA Digital, the platform for professionals who work from anywhere.