Senior Member of Technical Staff, Web Data

Cohere · Remote (Toronto)

Other

WFA Digital Insight

The demand for skilled professionals in AI and data engineering has skyrocketed, with a 25% increase in job postings over the past year. As companies like Cohere continue to push the boundaries of language models, the need for experts who can harness the power of web data has become critical. With its commitment to remote work and diversity, Cohere stands out as an attractive option for those looking to make a meaningful impact in the tech industry. For candidates looking to apply, it's essential to highlight a strong foundation in software engineering, data processing, and a passion for bridging research and engineering.

Job Description

About the Role

As a Senior Member of Technical Staff specializing in web data, you will play a pivotal role in developing the large-scale web data pipeline that underpins Cohere's advanced language models. This involves working extensively with Common Crawl and other large-scale web corpora to transform raw, noisy internet data into high-quality training data for pretraining. Your work will be essential to Cohere's mission of delivering efficient and reliable language understanding and generation capabilities.

The role involves collaborating closely with the broader data and evaluation teams to iterate on the training corpus. You will analyze the composition and quality of web data, studying its impact on downstream model performance. This position requires a deep understanding of data pipelines, extraction, parsing, deduplication, and filtering, as well as the ability to work with cross-functional teams.

Cohere's team is composed of researchers, engineers, designers, and more, all passionate about their craft and committed to building great products. The company values diversity and strives to create an inclusive work environment for all, welcoming applicants from all backgrounds.

What You Will Do

Maintain large-scale pipelines for processing web corpora, ensuring efficiency and reliability.
Develop and maintain highly performant deduplication pipelines to improve data quality.
Work on filtering and quality-scoring systems to identify high-value web documents.
Analyze web data composition across domains, languages, and time periods to understand trends and patterns.
Collaborate with cross-functional teams, including researchers and engineers, to ensure data pipelines meet the demands of cutting-edge language models.
Develop and implement data quality assessment techniques and experiment with data mixtures to improve model performance.
Contribute to the development of new data processing frameworks and tools.
Stay updated with the latest advancements in data engineering and AI model training.
Participate in code reviews and contribute to the improvement of the codebase.
Work on scaling the data pipeline to handle increasing volumes of data.

What We Are Looking For

Strong software engineering skills, with proficiency in Python.
Experience building data pipelines, preferably with large-scale web datasets.
Familiarity with data processing frameworks and libraries such as Apache Spark, Apache Beam, or pandas.
Knowledge of data quality assessment techniques and experience experimenting with data mixtures.
A passion for bridging research and engineering to solve complex data-related challenges in AI model training.
Experience working in a collaborative environment with cross-functional teams.
Strong analytical and problem-solving skills, with the ability to work independently.
Excellent communication skills, both written and verbal.
A degree in Computer Science, Engineering, or a related field.

Nice to Have

Experience with AI model training and deployment.
Knowledge of natural language processing techniques and tools.
Familiarity with cloud computing platforms such as AWS or GCP.
Experience working with containerization tools such as Docker.
Participation in open-source projects or personal projects related to data engineering or AI.

Benefits and Perks

The opportunity to work on cutting-edge AI models and contribute to the development of language understanding and generation capabilities.
Collaborative and dynamic work environment with a team of passionate professionals.
Support for professional development and continuous learning.
Flexible working hours and remote work options.
Access to the latest tools and technologies in data engineering and AI.
Competitive compensation package.
Equity options.
Comprehensive health insurance.
Generous PTO policy.

How to Stand Out

Highlight your experience with data processing frameworks and large-scale web datasets in your resume and cover letter.
Showcase your passion for bridging research and engineering by discussing personal projects or contributions to open-source projects related to data engineering or AI.
Prepare to discuss your approach to data quality assessment and experimentation with data mixtures during the interview.
Emphasize your ability to work collaboratively with cross-functional teams and communicate complex technical ideas effectively.
Be ready to provide examples of your experience with cloud computing platforms, containerization tools, and natural language processing techniques.
If possible, share a portfolio or GitHub repository demonstrating your work in data engineering or AI model training.
Research Cohere's mission and values, and be prepared to discuss how your skills and experience align with the company's goals.

This is a remote position listed on WFA Digital, the platform for professionals who work from anywhere.