Senior ML Infrastructure / DevOps Engineer
WFA Digital Insight
Demand for skilled ML infrastructure engineers has skyrocketed, with postings up 25% over the past year. As companies like Pathway push the boundaries of AI, engineers who can scale and optimize ML workloads are in short supply. With the remote job market booming, candidates experienced in Linux, distributed systems, and cloud providers are well-positioned. Before applying, consider highlighting your expertise in automation, monitoring, and collaboration with cross-functional teams.
Job Description
About the Role
The Senior ML Infrastructure / DevOps Engineer will play a critical role in designing, operating, and scaling Pathway's ML infrastructure. As a key member of the team, you will own the production infrastructure that powers Pathway's ML training and inference workloads, working closely with the R&D team to ensure seamless integration and optimization. Your expertise in Linux, distributed systems, and cloud providers will be essential to the company's innovative approach to AI.
In this role, you will work on cutting-edge projects, collaborating with a team of talented engineers and researchers to push the boundaries of what is possible with ML. Your work will directly affect the company's ability to train, ship, and iterate on models, making this a highly visible and rewarding position.
What You Will Do
- Design, operate, and scale GPU and CPU clusters for ML training and inference
- Automate infrastructure provisioning and configuration using infrastructure-as-code and configuration management
- Build and maintain robust ML pipelines with strong guarantees around reproducibility, traceability, and rollback
- Implement and evolve ML-centric CI/CD: testing, packaging, deployment of models and services
- Own monitoring, logging, and alerting across training and serving: GPU/CPU utilization, latency, throughput, failures, and data/model drift
- Work with terabyte-scale datasets and the associated storage, networking, and performance challenges
- Partner closely with ML engineers and researchers to productionize their work, translating experimental setups into robust, scalable systems
- Participate in on-call rotation for critical ML infrastructure and lead incident response and post-mortems when things break
What We Are Looking For
- 5+ years of experience in DevOps/SRE/Platform/Infrastructure roles running production systems
- Strong background in Linux, distributed systems, and cloud providers
- Experience with automation tools such as Terraform or CloudFormation, and with cluster tooling
- Knowledge of container orchestration and CI/CD pipelines
- Strong understanding of networking, storage, and performance optimization
- Experience working with large-scale datasets and ML workloads
- Excellent collaboration and communication skills
Nice to Have
- Experience with Kubernetes, autoscaling, and queueing
- Familiarity with ML frameworks and libraries such as TensorFlow or PyTorch
- Knowledge of security best practices and compliance frameworks
- Experience with agile development methodologies and version control systems
Benefits and Perks
- Competitive salary and benefits package
- Opportunity to work on cutting-edge projects with a talented team of engineers and researchers
- Flexible working hours and remote work options
- Professional development and growth opportunities
- Access to the latest tools and technologies
- Collaborative and dynamic work environment
How to Stand Out
- Be prepared to discuss your experience with automation tools and infrastructure-as-code
- Highlight your understanding of ML workloads and scalability challenges
- Show examples of your work with large-scale datasets and cloud providers
- Emphasize your collaboration and communication skills, as this role requires close work with cross-functional teams
- Research Pathway's innovative approach to AI and be ready to discuss how your skills and experience align with their goals
- Consider creating a portfolio or repository showcasing your DevOps and ML infrastructure projects
This is a remote position listed on WFA Digital, the platform for professionals who work from anywhere.