Software Engineer, Internal Infrastructure (Europe & UK)

Cohere · Remote (United Kingdom)
Software Development

WFA Digital Insight

As demand for AI-driven solutions surges, companies like Cohere are at the forefront and need skilled engineers to build and maintain the complex infrastructure involved. With the global AI market expected to reach $90 billion by 2027, demand for professionals who can handle the technical requirements of AI model training and deployment is rising. This role stands out for its focus on scalability, stability, and innovation in AI infrastructure. Before applying, candidates should be prepared to showcase their expertise in Kubernetes and cloud infrastructure, along with a passion for contributing to the development of industry-leading AI models.

Job Description

About the Role

The Software Engineer position on the Internal Infrastructure team at Cohere is a critical role that involves building and operating world-class infrastructure and tools to support the training, evaluation, and serving of Cohere's foundational models. This is a unique opportunity to work closely with AI researchers to support their cutting-edge AI workloads, with a focus on stability, scalability, and observability. The role requires a deep understanding of Kubernetes and cloud infrastructure, and the ability to design and build resilient, scalable systems for training AI models.

As part of this team, you will be responsible for ensuring the smooth operation of complex systems and for collaborating with cloud providers to optimize infrastructure costs, performance, and reliability for AI workloads. You will also partner with research teams to understand their infrastructure needs and identify ways to improve the stability, performance, and efficiency of novel model training techniques. The role requires a strong foundation in software engineering principles, a passion for innovation, and the ability to work in a fast-paced environment.

What You Will Do

  • Build and operate Kubernetes compute superclusters across multiple clouds to support AI model training and deployment
  • Partner with cloud providers to optimize infrastructure costs, performance, and reliability for AI workloads
  • Work closely with research teams to understand their infrastructure needs and identify ways to improve stability, performance, and efficiency of novel model training techniques
  • Design and build resilient, scalable systems for training AI models, focusing on intuitive user interfaces that let researchers self-serve when troubleshooting and resolving problems
  • Encourage software best practices across the company and participate in team processes such as knowledge sharing, reviews, and on-call
  • Participate in a 24x7 on-call rotation to ensure the continuous operation of critical infrastructure
  • Collaborate with cross-functional teams to identify and prioritize infrastructure needs and projects
  • Stay up-to-date with the latest developments in AI infrastructure and contribute to the company's knowledge base
  • Develop and maintain documentation for infrastructure systems and processes
  • Identify and mitigate potential security risks in the infrastructure

What We Are Looking For

  • Deep experience running Kubernetes clusters at scale and/or scaling and troubleshooting Cloud Native infrastructure
  • Strong programming skills in Go or Python
  • Experience with Infrastructure as Code (IaC) tools
  • Self-directed and adaptable with excellent problem-solving skills
  • Strong communication skills and the ability to thrive in fast-paced environments
  • Experience with cloud providers such as AWS, GCP, or Azure
  • Familiarity with CI/CD pipelines and agile development methodologies
  • Passion for building systems that help others be more productive
  • A commitment to mentorship, knowledge transfer, and review as essential ingredients of a healthy team

Nice to Have

  • Previous experience working with ML training infrastructure and GPU workloads
  • Familiarity with RDMA networking
  • Expertise in supporting and troubleshooting low-level Linux systems
  • Experience collaborating with research teams or machine learning engineers
  • Contributions to open-source projects

Benefits and Perks

  • Competitive compensation package
  • Opportunity to work on cutting-edge AI technology
  • Collaborative and dynamic work environment
  • Professional development opportunities
  • Flexible working hours and remote work options
  • Access to the latest tools and technologies
  • Recognition and reward for outstanding performance
  • Comprehensive health insurance
  • Generous paid time off

How to Stand Out

  • Tailor your resume: Ensure your resume highlights your experience with Kubernetes, cloud infrastructure, and software engineering principles.
  • Prepare for technical interviews: Be ready to discuss your experience with cloud providers, IaC tools, and troubleshooting complex systems.
  • Showcase your passion for AI: Demonstrate your interest in AI and machine learning, and how you see your role contributing to the development of AI models.
  • Develop a personal project: Having a personal project that showcases your skills in building and operating Kubernetes clusters or working with cloud infrastructure can be a significant plus.
  • Practice your communication skills: As this role involves working closely with research teams and cross-functional teams, being able to communicate complex technical ideas simply is crucial.
  • Stay updated on industry trends: Keep yourself informed about the latest developments in AI infrastructure and cloud computing to show your commitment to the field.
  • Be prepared to discuss your experience with on-call rotations: Highlight your ability to work in a 24x7 on-call rotation and ensure the continuous operation of critical infrastructure.
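For the personal-project suggestion above, even a small, self-contained tool can demonstrate the reliability mindset this role calls for. The sketch below is purely illustrative (the probe, attempt counts, and delays are made-up assumptions, not anything from the posting): it shows an exponential-backoff retry loop of the kind infrastructure engineers commonly use when polling a flaky cluster component.

```python
import time

def check_with_backoff(probe, max_attempts=5, base_delay=0.1):
    """Call `probe` until it returns True, backing off exponentially.

    Returns the number of attempts used, or raises RuntimeError if the
    probe never succeeds within `max_attempts` tries.
    """
    for attempt in range(1, max_attempts + 1):
        if probe():
            return attempt
        if attempt < max_attempts:
            # Double the wait after each failure: base, 2x, 4x, ...
            time.sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError(f"probe did not succeed after {max_attempts} attempts")

# Example: a fake probe that succeeds on its third call.
calls = {"n": 0}
def flaky_probe():
    calls["n"] += 1
    return calls["n"] >= 3

attempts = check_with_backoff(flaky_probe, base_delay=0.01)
print(attempts)  # 3
```

In an interview setting, being able to explain why backoff (rather than tight polling) protects a recovering service is often as valuable as the code itself.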

This is a remote position listed on WFA Digital, the platform for professionals who work from anywhere.