Senior ML Systems Engineer, Frameworks & Tooling
WFA Digital Insight
Demand for skilled ML systems engineers has grown sharply in recent years. As companies like Cohere push the boundaries of AI, professionals with expertise in large-scale distributed training and HPC systems are in especially high demand. With Cohere's mission to scale intelligence and serve humanity, this role offers a chance to work on cutting-edge projects alongside world-class researchers and engineers. Candidates should be prepared to showcase their experience with distributed training abstractions, multi-node cluster orchestration, and containerized environments. Before applying, take time to understand the current ML systems landscape and the skills this field demands.
Job Description
About the Role
The Senior ML Systems Engineer role at Cohere requires deep expertise in large-scale distributed training and HPC systems. As a key member of the team, you will design and maintain the core components that enable fast, reliable, and scalable model training, with a direct impact on the company's mission to scale intelligence and serve humanity. You will collaborate closely with the research, infra, and deployment teams to keep the training framework optimized for performance and efficiency.
Day to day, you will develop and maintain the training framework, design distributed training abstractions, and improve training throughput and stability on multi-node clusters. You will also build tooling for monitoring, logging, debugging, and developer ergonomics, resolve performance bottlenecks across the ML systems stack, and build robust systems that ensure reproducible, debuggable, large-scale runs.
Cohere is a team of passionate researchers, engineers, designers, and others dedicated to their craft. As a Senior ML Systems Engineer, you will work with a world-class team committed to building great products and driving innovation in AI. Your contributions will be valued, and you will have the autonomy to make a significant impact on the company's mission.
What You Will Do
- Design and maintain the training framework responsible for large-scale LLM training
- Develop and maintain tooling for monitoring, logging, debugging, and developer ergonomics
- Collaborate closely with infra teams to ensure that the cluster, container environments, and hardware configurations support high-performance training
- Investigate and resolve performance bottlenecks across the ML systems stack
- Build robust systems that ensure reproducible, debuggable, large-scale runs
- Improve training throughput and stability on multi-node clusters
- Design distributed training abstractions (data/tensor/pipeline parallelism, FSDP/ZeRO strategies, memory management, checkpointing)
- Develop and maintain data pipeline systems, including sharded datasets and caching strategies
- Work closely with the research team to develop and implement new training techniques and algorithms
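To make the distributed-training responsibilities above concrete, here is a minimal, hypothetical sketch of one of the bookkeeping problems a data-parallel training abstraction must solve: splitting a global batch evenly across ranks. The function name and numbers are illustrative, not part of Cohere's actual framework.

```python
# Hypothetical sketch: dividing a global batch across data-parallel ranks
# without dropping samples. Names and values are illustrative only.

def shard_batch(global_batch_size: int, world_size: int) -> list[int]:
    """Return per-rank batch sizes, spreading any remainder evenly."""
    base, remainder = divmod(global_batch_size, world_size)
    # The first `remainder` ranks each take one extra sample.
    return [base + (1 if rank < remainder else 0) for rank in range(world_size)]

sizes = shard_batch(global_batch_size=1030, world_size=8)
assert sum(sizes) == 1030            # no samples dropped
assert max(sizes) - min(sizes) <= 1  # balanced within one sample per rank
```

Real frameworks layer far more on top of this (tensor/pipeline parallelism, FSDP/ZeRO sharding, checkpointing), but interview discussions often start from exactly this kind of invariant: no data lost, load balanced.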
What We Are Looking For
- Strong engineering experience in large-scale distributed training or HPC systems
- Deep familiarity with JAX internals, distributed training libraries, or custom kernels/fused ops
- Experience with multi-node cluster orchestration (Slurm, Ray, Kubernetes, or similar)
- Comfort debugging performance issues across CUDA/NCCL, networking, IO, and data pipelines
- Experience working with containerized environments (Docker, Singularity/Apptainer)
- A track record of building tools that increase developer velocity for ML teams
- Excellent judgment around trade-offs: performance vs complexity, research velocity vs maintainability
- Strong collaboration skills — you’ll work closely with infra, research, and deployment teams
- Experience with training LLMs or other large transformer architectures
- Contributions to ML frameworks (PyTorch, JAX, DeepSpeed, Megatron, xFormers, etc.)
- Familiarity with evaluation and serving frameworks (vLLM, TensorRT-LLM, custom KV caches)
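The performance-debugging expectation above often starts with lightweight instrumentation before reaching for full profilers. The following is a hypothetical sketch (not any particular team's tooling) of timing named pipeline stages to find the slowest one; the `time.sleep` calls stand in for real IO and compute.

```python
# Hypothetical sketch: time named training-loop stages and report the slowest.
import time
from contextlib import contextmanager

timings: dict[str, float] = {}

@contextmanager
def stage(name: str):
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = timings.get(name, 0.0) + time.perf_counter() - start

with stage("data_loading"):
    time.sleep(0.02)   # stand-in for host-side IO
with stage("forward_backward"):
    time.sleep(0.01)   # stand-in for device compute

slowest = max(timings, key=timings.get)
```

In practice this kind of breadcrumb narrows a bottleneck to a stage (data loading vs. compute vs. communication) before deeper tools like Nsight or NCCL debug logs come out.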
Nice to Have
- Experience with data pipeline optimization, sharded datasets, or caching strategies
- Background in performance engineering, profiling, or low-level systems
- Contributions to open-source ML projects or research papers in top-tier venues
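As an illustration of the caching-strategy item above, here is a hypothetical sketch of a tiny LRU cache for dataset shards, keeping hot shards in memory and evicting the least recently used one. The class and loader names are invented for this example.

```python
# Hypothetical sketch: LRU cache keeping hot dataset shards in memory.
from collections import OrderedDict

class ShardCache:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self._shards: OrderedDict[int, bytes] = OrderedDict()

    def get(self, shard_id: int, load) -> bytes:
        if shard_id in self._shards:
            self._shards.move_to_end(shard_id)  # mark as most recently used
            return self._shards[shard_id]
        data = load(shard_id)                   # cache miss: load from storage
        self._shards[shard_id] = data
        if len(self._shards) > self.capacity:
            self._shards.popitem(last=False)    # evict least recently used
        return data

loads: list[int] = []

def fake_load(shard_id: int) -> bytes:
    loads.append(shard_id)          # record each storage read
    return bytes([shard_id])

cache = ShardCache(capacity=2)
for sid in (1, 2, 1, 3):
    cache.get(sid, fake_load)
# Shard 2 is least recently used when 3 arrives, so it gets evicted.
```

Production data pipelines add prefetching, sharding across workers, and eviction tuned to access patterns, but the eviction invariant shown here is the core idea.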
Benefits and Perks
- Opportunity to work on cutting-edge projects and collaborate with a world-class team
- Competitive compensation package
- Equity options
- Flexible working hours and remote work arrangements
- Access to the latest tools and technologies
- Professional development opportunities and training
- Health and wellness programs
- Remote stipend and equipment allowance
- Paid time off and holidays
How to Stand Out
- Highlight your experience with large-scale distributed training and HPC systems in your resume and cover letter.
- When preparing for the interview, focus on your ability to design and maintain complex systems, as well as your experience with containerized environments and multi-node cluster orchestration.
- Be prepared to provide specific examples of your experience with performance debugging and optimization.
- Showcasing your contributions to open-source ML projects or research papers in top-tier venues can be a significant advantage.
- Don't be afraid to ask questions about the company culture and team dynamics during the interview process.
- Be prepared to discuss your experience with ML frameworks and your ability to build tools that increase developer velocity for ML teams.
This is a remote position listed on WFA Digital, the platform for professionals who work from anywhere. Browse more remote jobs across all categories.