Software Engineer, GPU Infrastructure (HPC)
WFA Digital Insight
The demand for skilled professionals in AI and high-performance computing (HPC) has surged, with a notable 25% increase in job postings for ML engineers in the past year. Cohere, a leader in AI model development, is seeking a seasoned Staff Software Engineer to drive the evolution of its GPU infrastructure. This role stands out for its focus on scalability, innovation, and collaboration with AI researchers. Before applying, candidates should be aware of the challenges and opportunities specific to this field, including the need for expertise in Kubernetes, ML frameworks, and low-level systems.
Job Description
About the Role
As a Staff Software Engineer specializing in GPU Infrastructure at Cohere, you will play a pivotal role in the development and scaling of the company's foundational models. Your expertise will be crucial in ensuring the stability, scalability, and observability of the infrastructure that supports these models. This role is part of the internal infrastructure team, which is responsible for building world-class infrastructure and tools used to train, evaluate, and serve Cohere's models.
The internal infrastructure team collaborates closely with AI researchers to understand their workload needs and to support them with the cutting-edge infrastructure required for the development of industry-leading AI models. As a key member of this team, you will work on building and operating superclusters across multiple clouds, directly contributing to the acceleration of AI model development.
What You Will Do
- Design, deploy, and manage Kubernetes-based GPU/TPU superclusters across multiple clouds, focusing on high-throughput, low-latency performance for AI workloads.
- Collaborate with cloud providers to optimize infrastructure for cost efficiency, reliability, and performance, leveraging technologies such as RDMA, NCCL, and high-speed interconnects.
- Proactively identify and resolve infrastructure bottlenecks, performance degradation, and system failures to minimize disruption to AI/ML workflows.
- Develop intuitive interfaces and workflows that enable researchers to monitor, debug, and optimize their training jobs independently.
- Work closely with AI researchers to understand emerging needs and translate them into robust, scalable infrastructure solutions.
- Advocate for observability, automation, and infrastructure-as-code (IaC) across the organization to ensure systems are maintainable and resilient.
- Share expertise through code reviews, documentation, and cross-team collaboration to foster a culture of knowledge transfer and engineering excellence.
- Participate in a 24x7 on-call rotation to ensure the continuity of critical infrastructure services.
- Stay updated with the latest developments in ML/HPC infrastructure and apply this knowledge to drive innovation within the team.
What We Are Looking For
- Deep expertise in ML/HPC infrastructure, including experience with GPU/TPU clusters, distributed training frameworks (JAX, PyTorch, TensorFlow), and HPC environments.
- Proven ability to deploy, manage, and troubleshoot cloud-native Kubernetes clusters for AI workloads at scale.
- Strong programming skills in Python (for ML tooling) and Go (for systems engineering); open-source contributions are preferred.
- Familiarity with Linux internals, RDMA networking, and performance optimization for ML workloads.
- Experience collaborating with AI researchers or ML engineers to solve infrastructure challenges.
- Ability to identify bottlenecks, propose solutions, and drive impact in a fast-paced environment.
- Excellent problem-solving skills and the ability to work independently.
Nice to Have
- Experience with other programming languages, such as C++ or Rust, for specific infrastructure tasks.
- Knowledge of container technologies beyond Kubernetes orchestration, such as Docker.
- Familiarity with agile development methodologies and version control systems like Git.
- Participation in open-source projects related to ML/HPC infrastructure.
Benefits and Perks
- Competitive compensation package, reflecting the candidate's experience and skills.
- Opportunity to work on cutting-edge AI technology and infrastructure.
- Collaborative and dynamic work environment with a team of highly skilled professionals.
- Appropriate compensation for participation in the 24x7 on-call rotation.
- Professional development opportunities, including conferences, training, and workshops.
- Flexible remote work arrangements, providing a healthy work-life balance.
- Access to the latest tools and technologies in the field of AI and HPC.
- A culture that values innovation, collaboration, and knowledge sharing.
How to Stand Out
- Develop a strong foundation in Kubernetes and cloud computing to effectively manage and scale GPU/TPU superclusters for AI workloads.
- Stay updated with the latest advancements in ML frameworks such as JAX, PyTorch, and TensorFlow, as well as in HPC technologies, to drive innovation in infrastructure development.
- Highlight your experience in collaborative environments, particularly in working with AI researchers to understand their needs and develop tailored infrastructure solutions.
- Prepare to discuss specific examples of optimizing infrastructure for cost efficiency, reliability, and performance, demonstrating your problem-solving skills and ability to drive impact.
- Showcase your proficiency in programming languages such as Python and Go, and mention any contributions to open-source projects related to ML/HPC infrastructure.
- Be ready to talk about your approach to staying current with industry developments and how you apply this knowledge to improve infrastructure and processes.
- Emphasize your ability to work in a fast-paced environment and participate in on-call rotations, ensuring the reliability and availability of critical infrastructure services.
This is a remote position listed on WFA Digital, the platform for professionals who work from anywhere.