Member of Technical Staff, Integration/RL Team (Research Engineer)

Cohere · Remote (Paris)

Software Development

WFA Digital Insight

In the rapidly evolving field of AI, Cohere stands out for its commitment to scaling intelligence to serve humanity. Demand for AI and machine learning specialists has grown 45% in the past two years, and this role offers an opportunity to contribute to the development of frontier models. As a member of the technical staff, you'll work on integrating and optimizing large-scale distributed RL methods, which requires a blend of software engineering skills, proficiency in Python and ML frameworks, and a passion for research. Cohere's emphasis on diversity and inclusivity makes this an attractive opportunity for those seeking a challenging and collaborative environment. Before applying, consider how your skills align with the company's mission and the role's requirements.

Job Description

About the Role

As a Member of the Technical Staff on the Integration/RL Team at Cohere, you'll play a pivotal role in developing and scaling machine learning algorithms and infrastructure. This involves designing experiments, crafting design documents, and implementing production code to support the team's research efforts. The position requires a unique blend of engineering prowess and scientific curiosity, as you'll be working on large-scale, distributed reinforcement learning (RL) methods. Your work will directly contribute to the post-training ecosystem, enhancing the quality and scalability of Cohere's models.

The Integration team at Cohere focuses on the critical post-training phase of model development, where AI systems are made practical and efficient. This involves not just optimizing algorithms but also designing and implementing tools that support and accelerate research. As part of this team, you'll work closely with other engineers, researchers, and scientists to ensure that the solutions developed are not only theoretically sound but also practically viable and scalable.

The role is based in Paris, but Cohere is remote-friendly, and applicants may work from locations in time zones between UTC−06:00 and UTC+01:00. This flexibility, combined with the company's mission to scale intelligence to serve humanity, makes for a compelling opportunity for those passionate about AI, machine learning, and making a meaningful impact.

What You Will Do

Design and write high-performing software for training models, focusing on efficiency, reliability, and scalability.
Develop new tools to support and accelerate research and Large Language Model (LLM) training, ensuring that these tools are user-friendly and meet the evolving needs of the research team.
Collaborate with other engineering teams, such as Infrastructure, Efficiency, and Serving, to create a cohesive post-training ecosystem that supports the entire lifecycle of model development and deployment.
Work closely with scientific teams (Agent, Multimodal, Multilingual, etc.) to integrate the post-training solutions with their research and development efforts, ensuring seamless interaction and maximum impact.
Craft and implement techniques to improve performance and speed up training cycles in both the offline preference and RL regimes, leveraging the latest advancements in ML and AI.
Research, implement, and experiment with new ideas on the cluster and data infrastructure, staying at the forefront of technological advancements and contributing to the body of knowledge in the field.
Participate in code reviews, ensuring that all code meets the highest standards of quality, readability, and maintainability, and contribute to reducing technical debt across the codebase.
Collaborate with other scientists, engineers, and teams to share knowledge, best practices, and lessons learned, fostering a culture of openness, collaboration, and continuous improvement.
Stay updated with the latest developments in ML, LLM, and RL, applying this knowledge to improve Cohere's models and solutions, and contributing to the company's thought leadership in the industry.

What We Are Looking For

Extremely strong software engineering skills, with a focus on designing and developing scalable, efficient software systems.
A strong belief in test-driven development methods, clean code, and the importance of reducing technical debt at all levels of the codebase.
Proficiency in Python and related ML frameworks such as JAX, PyTorch, and/or XLA/MLIR, with the ability to learn and adapt to new technologies and frameworks.
Experience using and debugging large-scale distributed training strategies, including memory and speed profiling, to optimize the performance of ML models.
[Bonus] Experience with distributed training infrastructures (Kubernetes) and associated frameworks (Ray), which is highly desirable for optimizing the deployment and scaling of models.
[Bonus] Hands-on experience with the post-training phase of model training, with a strong emphasis on scalability and performance, demonstrating an understanding of the challenges and opportunities in this critical phase.
[Bonus] Experience in ML, LLM, and RL academic research, providing a solid foundation in the theoretical underpinnings of AI and machine learning.

Nice to Have

Experience with agile development methodologies and version control systems like Git.
Familiarity with cloud-based services and platforms, particularly those used for AI and ML workloads.
Knowledge of databases and data storage solutions, especially those optimized for large-scale AI applications.
Experience with containerization using Docker and container orchestration using Kubernetes.
Participation in open-source projects or personal projects related to AI, ML, or software development, demonstrating a passion for innovation and community contribution.

Benefits and Perks

Competitive compensation package, reflecting the value you bring to the company and the industry.
Opportunity to work on cutting-edge AI and ML projects, making a real impact on the future of technology and society.
Collaborative and dynamic work environment, with a team of highly skilled and motivated professionals.
Flexible working hours and remote work options, allowing you to balance your work and personal life effectively.
Professional development opportunities, including training, mentorship, and conference attendance, to support your growth and career aspirations.
Access to the latest technologies and tools, ensuring you stay at the forefront of AI and ML advancements.
A culture that values diversity, inclusivity, and wellness, recognizing the importance of a healthy and supportive work environment.

How to Stand Out

Develop a strong foundation in Python and ML frameworks like JAX, PyTorch, and/or XLA/MLIR to increase your chances of success in this role.
Showcase your ability to work collaboratively in a fast-paced, technically challenging environment, highlighting any experience with agile development methodologies and version control systems.
Prepare to discuss your experience with large-scale distributed training strategies, including any challenges you've faced and how you overcame them, demonstrating your problem-solving skills and ability to learn.
Highlight any research or academic experience in ML, LLM, and RL, as this can be a significant advantage in understanding the theoretical underpinnings of the work.
Be ready to talk about your approach to test-driven development, clean code, and reducing technical debt, as these are core values for the team.
Demonstrate your passion for quality work and optimization, showing how you've applied these principles in previous roles or projects to achieve better outcomes.
Prepare questions about the company culture, team dynamics, and opportunities for growth and development, as these can give you valuable insights into whether the role and company are the right fit for you.

This is a remote position listed on WFA Digital, the platform for professionals who work from anywhere.