Member of Technical Staff, Synthetic Data

Cohere · Remote (Toronto)

WFA Digital Insight

Demand for machine learning engineers specializing in synthetic data has skyrocketed, with the global AI market expected to reach 90 billion by 2027. As remote work becomes the norm, companies like Cohere are leading the charge in developing innovative language models. With the rise of natural language processing, this role is at the forefront of AI advancements. Candidates with strong software engineering skills and experience working with large-scale datasets are in high demand. Before applying, consider highlighting your passion for bridging research and engineering to solve complex data-related challenges in AI model training. With the remote job market booming, it's essential to stand out with a unique blend of technical expertise and innovative thinking.

Job Description

## About the Role

As a Member of Technical Staff, Synthetic Data at Cohere, you will play a pivotal role in developing the synthetic data pipeline that drives the company's advanced language models. Your day-to-day responsibilities will involve managing the end-to-end synthetic data pipeline, conducting data analysis and generation, and collaborating with cross-functional teams to ensure data pipelines meet the demands of cutting-edge language models. You will work closely with researchers, engineers, and designers who are passionate about their craft, and each person is one of the best in the world at what they do.

Cohere's mission is to scale intelligence to serve humanity, and the company is committed to creating a diverse and inclusive work environment. The team is passionate about building great products, and a diverse range of perspectives is a requirement for achieving this goal. As a member of the technical staff, you will be responsible for contributing to the development of innovative language models that will power magical experiences like content generation, semantic search, RAG, and agents.

The company's culture is centered around working hard and moving fast to do what's best for customers. Cohere is committed to providing equal opportunities and values diversity and inclusion. If you are passionate about transforming data into the foundation of AI systems, this role offers a unique opportunity to make a meaningful impact.

## What You Will Do

- Design and build scalable inference pipelines that run on large GPU clusters
- Conduct data ablations to assess data quality and experiment with data mixtures to enhance model performance
- Research and implement innovative synthetic data curation methods, leveraging Cohere's infrastructure to drive advancements in natural language processing
- Collaborate with cross-functional teams, including researchers and engineers, to ensure data pipelines meet the demands of cutting-edge language models
- Work with diverse web data and code data and transform them using generative models to improve token efficiency and model quality
- Bridge the gap between raw data and cutting-edge AI models, directly contributing to improvements in critical training metrics like throughput and accelerator utilization
- Develop and maintain the synthetic data pipeline, including data analysis, generation, and model evaluation
- Work closely with the research team to develop new synthetic data methods and techniques
- Collaborate with the engineering team to ensure seamless integration of synthetic data pipelines with the company's infrastructure

## What We Are Looking For

- Strong software engineering skills, with proficiency in Python and experience building data pipelines
- Familiarity with data processing frameworks such as Apache Spark, Apache Beam, Pandas, or similar tools
- Experience working with large-scale datasets, including web data, code data, and multilingual corpora
- Experience working with LLMs through work projects, open-source contributions, or personal experimentation
- Familiarity with LLM inference frameworks such as vLLM and TensorRT
- A passion for bridging research and engineering to solve complex data-related challenges in AI model training
- Excellent collaboration and communication skills, with the ability to work effectively in a remote team environment
- Strong problem-solving skills, with the ability to analyze complex data-related issues and develop innovative solutions

## Nice to Have

- Papers at top-tier venues (such as NeurIPS, ICML, ICLR, AISTATS, MLSys, JMLR, AAAI, Nature, COLING, ACL, EMNLP)
- Experience working with cloud-based infrastructure and containerization (e.g., Docker)
- Familiarity with agile development methodologies and version control systems (e.g., Git)

## Benefits and Perks

- Competitive compensation package
- Opportunities for professional growth and development in a rapidly growing company
- Collaborative and dynamic work environment with a team of passionate and talented individuals
- Flexible working hours and remote work options
- Access to cutting-edge technologies and tools
- Comprehensive health insurance and benefits package
- Generous paid time off and vacation policy
- Opportunities for career advancement and professional development
- A fun and inclusive company culture with regular team-building activities and social events

How to Stand Out

- Highlight your experience with synthetic data pipelines: Showcase your skills in designing and building scalable inference pipelines, and highlight your experience working with large-scale datasets.
- Develop a strong understanding of LLMs: Familiarize yourself with LLM inference frameworks such as vLLM and TensorRT, and be prepared to discuss your experience working with LLMs.
- Emphasize your collaboration skills: As a member of the technical staff, you will be working closely with cross-functional teams, so be sure to highlight your excellent collaboration and communication skills.
- Be prepared to discuss your problem-solving skills: Showcase your ability to analyze complex data-related issues and develop innovative solutions.
- Showcase your passion for AI and machine learning: Demonstrate your passion for bridging research and engineering to solve complex data-related challenges in AI model training, and highlight your experience working with machine learning models.
- Research the company culture: Familiarize yourself with Cohere's mission, values, and culture, and be prepared to discuss how you align with the company's goals and values.
- Prepare your portfolio: Make sure your portfolio is up-to-date and showcases your experience working with synthetic data pipelines, LLMs, and large-scale datasets.

This is a remote position listed on WFA Digital, the platform for professionals who work from anywhere.