Site Reliability Engineer, Inference Infrastructure

Cohere · Remote (Toronto)

Software Development

Excel

WFA Digital Insight

As the demand for AI and machine learning specialists continues to surge, with a 25% growth in job postings over the past year, Cohere stands out for its commitment to scaling intelligence to serve humanity. With a strong focus on innovation and a dedication to building high-performance, scalable systems, this role offers a unique opportunity for Site Reliability Engineers to make a real impact. Candidates should be prepared to bring their expertise in Kubernetes, NLP, and distributed systems to the table, as well as a passion for collaborating with cross-functional teams to drive results. With the global AI market projected to reach $90 billion by 2025, this is an exciting time to join a company at the forefront of this revolution.

Job Description

About the Role

The Site Reliability Engineer, Inference Infrastructure role at Cohere is a critical position that requires a unique blend of technical expertise, collaboration, and innovation. As a member of the Model Serving team, you will be responsible for developing, deploying, and operating the AI platform that delivers Cohere's large language models through easy-to-use API endpoints. This is a high-visibility role that requires strong relationships with internal developers, a deep understanding of distributed systems, and a passion for building scalable, reliable, and high-performance machine learning systems.

The Model Serving team is a close-knit group of engineers, researchers, and designers who are passionate about their craft and dedicated to building the next generation of AI platforms. As a Site Reliability Engineer, you will work closely with this team to deploy optimized NLP models to production in low-latency, high-throughput, and high-availability environments. You will also have the opportunity to interface with customers and create customized deployments to meet their specific needs.

Cohere's mission is to scale intelligence to serve humanity, and this role is instrumental in achieving that goal. The company is committed to building a diverse and inclusive work environment, and this role offers a unique opportunity to join a team of talented individuals who are passionate about their work.

What You Will Do

Build self-service systems that automate managing, deploying, and operating services, including custom Kubernetes operators that support language model deployments.
Automate environment observability and resilience, enabling all developers to troubleshoot and resolve problems.
Take steps required to ensure defined SLOs are met, including participation in an on-call rotation.
Build strong relationships with internal developers and influence the Infrastructure team's roadmap based on their feedback.
Develop the team through knowledge sharing and an active review process.
Design and implement large, highly available distributed systems on Kubernetes, including the GPU workloads that run on those clusters.
Collaborate with cross-functional teams to deploy optimized NLP models to production in low-latency, high-throughput, and high-availability environments.
Interface with customers to create customized deployments that meet their specific needs.
Stay up-to-date with the latest developments in machine learning, NLP, and distributed systems, applying this knowledge to improve the efficiency and effectiveness of our systems.

What We Are Looking For

5+ years of engineering experience running production infrastructure at a large scale.
Experience designing large, highly available distributed systems on Kubernetes, including the GPU workloads that run on those clusters.
Experience with Kubernetes development, production coding, and support.
Experience with GCP, Azure, AWS, OCI, and multi-cloud, on-prem, or hybrid serving.
Experience in designing, deploying, supporting, and troubleshooting in complex Linux-based computing environments.
Experience in compute/storage/network resource and cost management.
Excellent collaboration and troubleshooting skills to build mission-critical systems and ensure smooth operations and efficient teamwork.
The grit and adaptability to solve complex technical challenges that evolve day to day.
Familiarity with computational characteristics of accelerators (GPUs, TPUs, and/or custom accelerators), especially how they influence latency and throughput of inference.
Strong understanding of, or working experience with, distributed systems.
Experience in Golang, C++, or other languages designed for high-performance scalable servers.

Nice to Have

Experience with Excel and other data analysis tools.
Knowledge of machine learning frameworks and libraries, such as TensorFlow or PyTorch.
Experience with agile development methodologies and version control systems, such as Git.
Familiarity with cloud-based services, such as AWS or GCP.
Experience with CI/CD pipelines and automation tools, such as Jenkins or Docker.

Benefits and Perks

Competitive salary and benefits package.
Opportunity to work with a talented team of engineers, researchers, and designers who are passionate about their craft.
Collaborative and dynamic work environment that encourages innovation and creativity.
Professional development opportunities, including training and education programs.
Flexible work arrangements, including remote work options.
Access to the latest tools and technologies, including cloud-based services and machine learning frameworks.
Comprehensive health and wellness programs, including mental health support and employee assistance programs.
Generous PTO and vacation policy, including paid holidays and sick leave.

How to Stand Out

Tip: Be prepared to provide specific examples of your experience with Kubernetes, NLP, and distributed systems, highlighting your ability to build scalable and reliable machine learning systems.
Tip: Showcase your passion for innovation and collaboration, demonstrating how you have worked with cross-functional teams to drive results in previous roles.
Tip: Familiarize yourself with Cohere's mission and values, and be prepared to discuss how your own values and goals align with those of the company.
Tip: Develop a strong understanding of the latest developments in machine learning, NLP, and distributed systems, and be prepared to discuss how you have applied this knowledge in previous roles.
Tip: Highlight your experience with agile development methodologies and version control systems, demonstrating your ability to work efficiently and effectively in a fast-paced environment.
Tip: Be prepared to discuss your experience with cloud-based services, including AWS or GCP, and demonstrate your ability to design and deploy scalable systems in these environments.
Tip: Showcase your ability to communicate complex technical ideas to non-technical stakeholders, demonstrating your ability to work effectively with customers and other teams.

This is a remote position listed on WFA Digital, the platform for professionals who work from anywhere.