Member of Technical Staff, Pre-Training Data

Cohere · Remote (Toronto)

WFA Digital Insight

The demand for AI and machine learning specialists is skyrocketing, with a 25% growth in job postings over the past year. As a leader in natural language processing, Cohere is at the forefront of this trend and stands out for its focus on research and its commitment to pushing the boundaries of AI capabilities. Candidates with a passion for bridging research and engineering to solve complex data-related challenges will thrive in this role. Before applying, consider how data quality assessment and experimentation with data mixtures drive advancements in AI model training.

Job Description

About the Role

As a Member of Technical Staff, Pre-Training Data at Cohere, you will play a pivotal role in developing the data pipeline that underpins the company's advanced language models. Your work will involve conducting data ablations to evaluate data quality and constructing pre-training data mixtures to enhance model performance. This role requires a strong foundation in software engineering, data modeling, and research, as well as the ability to collaborate with cross-functional teams.

The team at Cohere comprises researchers, engineers, designers, and more, all passionate about their craft and dedicated to delivering efficient and reliable language understanding and generation capabilities. As a member of this team, you will be expected to contribute to the development of cutting-edge language models and drive innovation in natural language processing.

Cohere's mission is to scale intelligence to serve humanity, and the company is committed to creating a diverse and inclusive work environment that values and celebrates different perspectives. The company has offices in London, Paris, Toronto, San Francisco, and New York, and a remote-friendly culture, so you can work from anywhere in the time zones between EST and the EU.

What You Will Do

Conduct data ablations to assess data quality and experiment with data mixtures to enhance model performance
Develop robust data modeling techniques to ensure datasets are structured and formatted for optimal training efficiency
Research and implement innovative data curation methods, leveraging Cohere's infrastructure to drive advancements in natural language processing
Collaborate with cross-functional teams, including researchers and engineers, to ensure data pipelines meet the demands of cutting-edge language models
Work closely with the team to identify and address data quality issues and develop solutions to improve model performance
Develop and maintain large-scale datasets, including web data, code data, and multilingual corpora
Build data processing workflows using frameworks such as Apache Spark, Apache Beam, Pandas, or similar tools
Collaborate with the research team to develop and implement new data-related research projects
Participate in code reviews and contribute to the improvement of the codebase

What We Are Looking For

Strong software engineering skills, with proficiency in Python and experience building data pipelines
Familiarity with curriculum learning, data mixing, and data attribution
Experience working with large-scale datasets, including web data, code data, and multilingual corpora
Knowledge of data quality assessment techniques and experimentation with data mixtures
A passion for bridging research and engineering to solve complex data-related challenges in AI model training
Experience with data processing frameworks such as Apache Spark, Apache Beam, Pandas, or similar tools
Strong collaboration and communication skills, with the ability to work effectively with cross-functional teams
A bachelor's or master's degree in computer science, mathematics, or a related field

Nice to Have

Papers published at top-tier venues (such as NeurIPS, ICML, ICLR, AISTATS, MLSys, JMLR, AAAI, Nature, COLING, ACL, EMNLP)
Experience working with distributed computing systems and cloud-based infrastructure
Familiarity with containerization using Docker and Kubernetes

Benefits and Perks

Competitive salary and benefits package
Opportunity to work on cutting-edge projects and contribute to the development of innovative AI models
Collaborative and dynamic work environment with a team of experienced professionals
Flexible working hours and remote work options
Professional development opportunities, including training and conference attendance
Access to the latest technologies and tools
A culture that values and celebrates diversity and inclusion
Comprehensive health insurance and wellness programs
Generous paid time off and vacation policy

How to Stand Out

Make sure to highlight your experience with data pipelines and software engineering in your application
Showcase your knowledge of data quality assessment and experimentation with data mixtures
Be prepared to discuss your passion for bridging research and engineering to solve complex data-related challenges
Familiarize yourself with Cohere's mission and values, and be prepared to explain how your skills and experience align with the company's goals
Don't hesitate to ask about the company culture and remote work setup during the interview process
Consider sharing examples of your previous work, such as research papers or projects, to demonstrate your expertise
Be prepared to discuss your experience with distributed computing systems and cloud-based infrastructure, if applicable

This is a remote position listed on WFA Digital, the platform for professionals who work from anywhere.