Senior ML Systems Engineer, Frameworks & Tooling
WFA Digital Insight
Demand for skilled ML systems engineers has grown sharply in recent years. As companies like Cohere push the boundaries of AI, professionals with expertise in large-scale distributed training and HPC systems are in especially high demand. With Cohere's mission to scale intelligence and serve humanity, this role offers a chance to work on cutting-edge projects alongside world-class researchers and engineers. Candidates should be prepared to showcase their experience with distributed training abstractions, multi-node cluster orchestration, and containerized environments. Before applying, take time to understand the current ML systems landscape and the skills this field demands.
Job Description
About the Role
The Senior ML Systems Engineer role at Cohere requires deep expertise in large-scale distributed training and HPC systems. As a key member of the team, you will design and maintain the core components that enable fast, reliable, and scalable model training, with a direct impact on the company's mission to scale intelligence and serve humanity. You will collaborate closely with the research, infra, and deployment teams to keep the training framework optimized for performance and efficiency.
Day to day, you will develop and maintain the training framework, design distributed training abstractions, and improve training throughput and stability on multi-node clusters. You will also build tooling for monitoring, logging, debugging, and developer ergonomics, resolve performance bottlenecks across the ML systems stack, and build robust systems that ensure reproducible, debuggable, large-scale runs.
Cohere is a team of passionate researchers, engineers, designers, and others dedicated to their craft. As a Senior ML Systems Engineer, you will work with a world-class team committed to building great products and driving innovation in AI. Your contributions will be valued, and you will have the autonomy to make a significant impact on the company's mission.
What You Will Do
- Design and maintain the training framework responsible for large-scale LLM training
- Develop and maintain tooling for monitoring, logging, debugging, and developer ergonomics
- Collaborate closely with infra teams to ensure that the cluster, container environments, and hardware configurations support high-performance training
- Investigate and resolve performance bottlenecks across the ML systems stack
- Build robust systems that ensure reproducible, debuggable, large-scale runs
- Improve training throughput and stability on multi-node clusters
- Design distributed training abstractions (data/tensor/pipeline parallelism, FSDP/ZeRO strategies, memory management, checkpointing)
- Develop and maintain data pipeline systems, including sharded datasets and caching strategies
- Work closely with the research team to develop and implement new training techniques and algorithms
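To make the distributed-training responsibilities above concrete, here is a minimal, hypothetical sketch of one of the bookkeeping problems a data-parallel training abstraction must solve: splitting a global batch evenly across ranks. The function name and numbers are illustrative, not part of Cohere's actual framework.

```python
# Hypothetical sketch: dividing a global batch across data-parallel ranks
# without dropping samples. Names and values are illustrative only.

def shard_batch(global_batch_size: int, world_size: int) -> list[int]:
    """Return per-rank batch sizes, spreading any remainder evenly."""
    base, remainder = divmod(global_batch_size, world_size)
    # The first `remainder` ranks each take one extra sample.
    return [base + (1 if rank < remainder else 0) for rank in range(world_size)]

sizes = shard_batch(global_batch_size=1030, world_size=8)
assert sum(sizes) == 1030            # no samples dropped
assert max(sizes) - min(sizes) <= 1  # balanced within one sample per rank
```

Real frameworks layer far more on top of this (tensor/pipeline parallelism, FSDP/ZeRO sharding, checkpointing), but interview discussions often start from exactly this kind of invariant: no data lost, load balanced.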
What We Are Looking For
- Strong engineering experience in large-scale distributed training or HPC systems
- Deep familiarity with JAX internals, distributed training libraries, or custom kernels/fused ops
- Experience with multi-node cluster orchestration (Slurm, Ray, Kubernetes, or similar)
- Comfort debugging performance issues across CUDA/NCCL, networking, IO, and data pipelines
- Experience working with containerized environments (Docker, Singularity/Apptainer)
- A track record of building tools that increase developer velocity for ML teams
- Excellent judgment around trade-offs: performance vs complexity, research velocity vs maintainability
- Strong collaboration skills — you’ll work closely with infra, research, and deployment teams
- Experience with training LLMs or other large transformer architectures
- Contributions to ML frameworks (PyTorch, JAX, DeepSpeed, Megatron, xFormers, etc.)
- Familiarity with evaluation and serving frameworks (vLLM, TensorRT-LLM, custom KV caches)
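The performance-debugging expectation above often starts with lightweight instrumentation before reaching for full profilers. The following is a hypothetical sketch (not any particular team's tooling) of timing named pipeline stages to find the slowest one; the `time.sleep` calls stand in for real IO and compute.

```python
# Hypothetical sketch: time named training-loop stages and report the slowest.
import time
from contextlib import contextmanager

timings: dict[str, float] = {}

@contextmanager
def stage(name: str):
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = timings.get(name, 0.0) + time.perf_counter() - start

with stage("data_loading"):
    time.sleep(0.02)   # stand-in for host-side IO
with stage("forward_backward"):
    time.sleep(0.01)   # stand-in for device compute

slowest = max(timings, key=timings.get)
```

In practice this kind of breadcrumb narrows a bottleneck to a stage (data loading vs. compute vs. communication) before deeper tools like Nsight or NCCL debug logs come out.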
Nice to Have
- Experience with data pipeline optimization, sharded datasets, or caching strategies
- Background in performance engineering, profiling, or low-level systems
- Contributions to open-source ML projects or research papers in top-tier venues
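As an illustration of the caching-strategy item above, here is a hypothetical sketch of a tiny LRU cache for dataset shards, keeping hot shards in memory and evicting the least recently used one. The class and loader names are invented for this example.

```python
# Hypothetical sketch: LRU cache keeping hot dataset shards in memory.
from collections import OrderedDict

class ShardCache:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self._shards: OrderedDict[int, bytes] = OrderedDict()

    def get(self, shard_id: int, load) -> bytes:
        if shard_id in self._shards:
            self._shards.move_to_end(shard_id)  # mark as most recently used
            return self._shards[shard_id]
        data = load(shard_id)                   # cache miss: load from storage
        self._shards[shard_id] = data
        if len(self._shards) > self.capacity:
            self._shards.popitem(last=False)    # evict least recently used
        return data

loads: list[int] = []

def fake_load(shard_id: int) -> bytes:
    loads.append(shard_id)          # record each storage read
    return bytes([shard_id])

cache = ShardCache(capacity=2)
for sid in (1, 2, 1, 3):
    cache.get(sid, fake_load)
# Shard 2 is least recently used when 3 arrives, so it gets evicted.
```

Production data pipelines add prefetching, sharding across workers, and eviction tuned to access patterns, but the eviction invariant shown here is the core idea.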
Benefits and Perks
- Opportunity to work on cutting-edge projects and collaborate with a world-class team
- Competitive compensation package
- Equity options
- Flexible working hours and remote work arrangements
- Access to the latest tools and technologies
- Professional development opportunities and training
- Health and wellness programs
- Remote stipend and equipment allowance
- Paid time off and holidays
How to Stand Out
- Highlight your experience with large-scale distributed training and HPC systems in your resume and cover letter.
- When preparing for the interview, focus on your ability to design and maintain complex systems, as well as your experience with containerized environments and multi-node cluster orchestration.
- Be prepared to provide specific examples of your experience with performance debugging and optimization.
- Showcasing your contributions to open-source ML projects or research papers in top-tier venues can be a significant advantage.
- Don't be afraid to ask questions about the company culture and team dynamics during the interview process.
- Be prepared to discuss your experience with ML frameworks and your ability to build tools that increase developer velocity for ML teams.
This is a remote position listed on WFA Digital, the platform for professionals who work from anywhere. Browse more remote jobs across all categories.