Software Engineer, Internal Infrastructure (North America)

Cohere · Remote (Toronto)
Software Development

WFA Digital Insight

As demand for AI and machine learning specialists surges, companies like Cohere are at the forefront of innovation. With a 25% increase in cloud infrastructure spending in 2025, skilled software engineers who can manage Kubernetes clusters and optimize infrastructure costs are in high demand. Cohere's commitment to diversity and inclusion, coupled with its cutting-edge AI models, makes this role particularly compelling. Candidates should be prepared to showcase their expertise in cloud native infrastructure, as well as their ability to collaborate with research teams.

Job Description

## About the Role

The Software Engineer role at Cohere centers on building and operating world-class infrastructure to support the company's AI models. As part of the internal infrastructure team, you will work closely with AI researchers to identify and address their infrastructure needs, ensuring stability, scalability, and observability. Your expertise in Kubernetes, cloud native infrastructure, and programming languages such as Go or Python will be essential in driving the development of industry-leading AI models.

The internal infrastructure team at Cohere is responsible for designing and implementing the underlying systems that power the company's AI platform. This includes building and operating Kubernetes GPU superclusters across multiple clouds, as well as partnering with cloud providers to optimize infrastructure costs and performance. As a software engineer on this team, you will have the opportunity to make a significant impact on the company's mission to scale intelligence and serve humanity.

Cohere's commitment to innovation and excellence is evident in its state-of-the-art AI models and its collaboration with top researchers in the field. As a software engineer at Cohere, you will be part of a team that values diversity, inclusivity, and knowledge sharing. You will have the opportunity to work with talented individuals from various backgrounds and contribute to the company's mission to drive AI adoption.

## What You Will Do

- Build and operate Kubernetes compute superclusters across multiple clouds to support AI model training and deployment
- Partner with cloud providers to optimize infrastructure costs, performance, and reliability for AI workloads
- Collaborate with research teams to understand their infrastructure needs and identify ways to improve stability, performance, and efficiency
- Design and implement resilient, scalable systems for training AI models, focusing on intuitive user interfaces
- Encourage software best practices across the company and participate in team processes such as knowledge sharing, reviews, and on-call rotations
- Troubleshoot and resolve complex infrastructure issues, drawing on expertise in Linux systems, networking, and cloud native technologies
- Participate in the design and implementation of new infrastructure projects, including the evaluation of new technologies and tools
- Develop and maintain documentation for infrastructure systems, including architecture diagrams and technical guides
- Collaborate with other teams to ensure seamless integration of infrastructure with other company systems
- Stay up to date with industry trends and advancements in cloud native infrastructure, AI, and machine learning
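For context on the GPU work described above: on Kubernetes, GPU workloads are typically requested as extended resources such as `nvidia.com/gpu` exposed by a device plugin. A minimal illustrative manifest (the pod name, image, and GPU count are hypothetical, not from the posting):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: training-worker          # hypothetical name
spec:
  containers:
    - name: trainer
      image: example.com/trainer:latest   # hypothetical image
      resources:
        limits:
          nvidia.com/gpu: 8      # request a full 8-GPU node
  tolerations:
    - key: nvidia.com/gpu        # schedule onto tainted GPU nodes
      operator: Exists
      effect: NoSchedule
```

In practice, multi-node training jobs on GPU superclusters are usually managed by batch schedulers layered on top of primitives like this, rather than raw Pods.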

## What We Are Looking For

- Deep experience running Kubernetes clusters at scale and/or scaling and troubleshooting cloud native infrastructure
- Strong programming skills in Go or Python, with expertise in software development and testing
- Experience with Infrastructure as Code (IaC) tools such as Terraform or CloudFormation
- Self-directed and adaptable, with excellent problem-solving skills and attention to detail
- Strong communication skills, with the ability to collaborate with cross-functional teams
- Experience with AI or machine learning infrastructure, including GPU workloads and RDMA networking
- Familiarity with agile development methodologies and version control systems such as Git
- Bachelor's or master's degree in computer science, engineering, or a related field
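The IaC requirement above refers to tools like Terraform. As a rough sketch of the style of configuration involved, here is a minimal Terraform fragment for a managed Kubernetes cluster; the project ID, cluster name, and region are hypothetical placeholders:

```hcl
provider "google" {
  project = "example-project"   # hypothetical project ID
  region  = "us-central1"
}

# Minimal managed Kubernetes cluster; a real training cluster
# would add GPU node pools, networking, and autoscaling.
resource "google_container_cluster" "training" {
  name               = "training-cluster"   # hypothetical name
  location           = "us-central1"
  initial_node_count = 1
}
```

This is an illustrative sketch only, not a description of Cohere's actual infrastructure.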

## Nice to Have

- Experience collaborating with research teams or machine learning engineers
- Expertise in supporting and troubleshooting low-level Linux systems
- Knowledge of cloud security and compliance frameworks such as HIPAA or PCI-DSS
- Experience with containerization using Docker or Kubernetes
- Familiarity with monitoring and logging tools such as Prometheus or Grafana

## Benefits and Perks

- Competitive salary and benefits package
- Opportunity to work with cutting-edge AI models and technologies
- Collaborative and dynamic work environment with a team of experts
- Flexible work arrangements, including remote work options
- Professional development opportunities, including training and conference attendance
- Access to the latest tools and technologies, including cloud native infrastructure and AI frameworks
- Recognition and reward for outstanding performance and contributions
- Comprehensive health and wellness benefits, including medical, dental, and vision coverage

How to Stand Out

- Showcase your expertise in Kubernetes and cloud native infrastructure by highlighting specific projects or experiences in your resume or cover letter.
- Be prepared to discuss your experience with Infrastructure as Code (IaC) tools and agile development methodologies during the interview process.
- Demonstrate your ability to collaborate with cross-functional teams, including research teams and machine learning engineers, by providing examples from your previous experience.
- Emphasize your problem-solving skills and attention to detail, as well as your ability to troubleshoot complex infrastructure issues.
- Research Cohere's AI models and technologies, and be prepared to discuss how your skills and experience align with the company's mission and goals.
- Highlight your experience with AI or machine learning infrastructure, including GPU workloads and RDMA networking, to stand out as a candidate.
- Prepare to discuss your experience with cloud security and compliance frameworks, as well as your knowledge of containerization using Docker or Kubernetes.

This is a remote position listed on WFA Digital, the platform for professionals who work from anywhere.