Software Engineer, Internal Infrastructure (Europe & UK)

Cohere · Remote (United Kingdom)
Software Development

WFA Digital Insight

As demand for AI-driven solutions surges, companies like Cohere are at the forefront and need skilled engineers to build and maintain the complex infrastructure involved. With the global AI market expected to reach $90 billion by 2027, demand for professionals who can handle the technical requirements of AI model training and deployment is rising. This role stands out for its focus on scalability, stability, and innovation in AI infrastructure. Before applying, candidates should be prepared to showcase their expertise in Kubernetes and cloud infrastructure, along with a passion for contributing to the development of industry-leading AI models.

Job Description

About the Role

The Software Engineer position on the Internal Infrastructure team at Cohere is a critical role that involves building and operating world-class infrastructure and tools to support the training, evaluation, and serving of Cohere's foundational models. This is a unique opportunity to work closely with AI researchers to support their cutting-edge AI workloads, with a focus on stability, scalability, and observability. The role requires a deep understanding of Kubernetes and cloud infrastructure, and the ability to design and build resilient, scalable systems for training AI models.

As part of this team, you will be responsible for ensuring the smooth operation of complex systems and for collaborating with cloud providers to optimize infrastructure costs, performance, and reliability for AI workloads. You will also partner with research teams to understand their infrastructure needs and identify ways to improve the stability, performance, and efficiency of novel model training techniques. The role requires a strong foundation in software engineering principles, a passion for innovation, and the ability to work in a fast-paced environment.

What You Will Do

  • Build and operate Kubernetes compute superclusters across multiple clouds to support AI model training and deployment
  • Partner with cloud providers to optimize infrastructure costs, performance, and reliability for AI workloads
  • Work closely with research teams to understand their infrastructure needs and identify ways to improve stability, performance, and efficiency of novel model training techniques
  • Design and build resilient, scalable systems for training AI models, focusing on intuitive user interfaces that let researchers self-serve when troubleshooting and resolving problems
  • Encourage software best practices across the company and participate in team processes such as knowledge sharing, reviews, and on-call
  • Participate in a 24x7 on-call rotation to ensure the continuous operation of critical infrastructure
  • Collaborate with cross-functional teams to identify and prioritize infrastructure needs and projects
  • Stay up-to-date with the latest developments in AI infrastructure and contribute to the company's knowledge base
  • Develop and maintain documentation for infrastructure systems and processes
  • Identify and mitigate potential security risks in the infrastructure

What We Are Looking For

  • Deep experience running Kubernetes clusters at scale and/or scaling and troubleshooting Cloud Native infrastructure
  • Strong programming skills in Go or Python
  • Experience with Infrastructure as Code (IaC) tools
  • Self-directed and adaptable with excellent problem-solving skills
  • Strong communication skills and the ability to thrive in fast-paced environments
  • Experience with cloud providers such as AWS, GCP, or Azure
  • Familiarity with CI/CD pipelines and agile development methodologies
  • Passion for building systems that help others be more productive
  • A commitment to mentorship, knowledge transfer, and review as essential ingredients of a healthy team

Nice to Have

  • Previous experience working with ML training infrastructure and GPU workloads
  • Familiarity with RDMA networking
  • Expertise in supporting and troubleshooting low-level Linux systems
  • Experience collaborating with research teams or machine learning engineers
  • Contributions to open-source projects

Benefits and Perks

  • Competitive compensation package
  • Opportunity to work on cutting-edge AI technology
  • Collaborative and dynamic work environment
  • Professional development opportunities
  • Flexible working hours and remote work options
  • Access to the latest tools and technologies
  • Recognition and reward for outstanding performance
  • Comprehensive health insurance
  • Generous paid time off

How to Stand Out

  • Tailor your resume: Ensure your resume highlights your experience with Kubernetes, cloud infrastructure, and software engineering principles.
  • Prepare for technical interviews: Be ready to discuss your experience with cloud providers, IaC tools, and troubleshooting complex systems.
  • Showcase your passion for AI: Demonstrate your interest in AI and machine learning, and how you see your role contributing to the development of AI models.
  • Develop a personal project: Having a personal project that showcases your skills in building and operating Kubernetes clusters or working with cloud infrastructure can be a significant plus.
  • Practice your communication skills: As this role involves working closely with research teams and cross-functional teams, being able to communicate complex technical ideas simply is crucial.
  • Stay updated on industry trends: Keep yourself informed about the latest developments in AI infrastructure and cloud computing to show your commitment to the field.
  • Be prepared to discuss your experience with on-call rotations: Highlight your ability to work in a 24x7 on-call rotation and ensure the continuous operation of critical infrastructure.
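For the personal-project suggestion above, even a small, self-contained tool can demonstrate the reliability mindset this role calls for. The sketch below is purely illustrative (the probe, attempt counts, and delays are made-up assumptions, not anything from the posting): it shows an exponential-backoff retry loop of the kind infrastructure engineers commonly use when polling a flaky cluster component.

```python
import time

def check_with_backoff(probe, max_attempts=5, base_delay=0.1):
    """Call `probe` until it returns True, backing off exponentially.

    Returns the number of attempts used, or raises RuntimeError if the
    probe never succeeds within `max_attempts` tries.
    """
    for attempt in range(1, max_attempts + 1):
        if probe():
            return attempt
        if attempt < max_attempts:
            # Double the wait after each failure: base, 2x, 4x, ...
            time.sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError(f"probe did not succeed after {max_attempts} attempts")

# Example: a fake probe that succeeds on its third call.
calls = {"n": 0}
def flaky_probe():
    calls["n"] += 1
    return calls["n"] >= 3

attempts = check_with_backoff(flaky_probe, base_delay=0.01)
print(attempts)  # 3
```

In an interview setting, being able to explain why backoff (rather than tight polling) protects a recovering service is often as valuable as the code itself.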

This is a remote position listed on WFA Digital, the platform for professionals who work from anywhere.