Senior ML Infrastructure / DevOps Engineer
WFA Digital Insight
Demand for skilled ML infrastructure engineers has skyrocketed, with postings up 25% over the past year. As companies like Pathway push the boundaries of AI, engineers who can scale and optimize ML workloads are in short supply. With the remote job market booming, candidates experienced in Linux, distributed systems, and cloud providers are well-positioned. Before applying, consider highlighting your expertise in automation, monitoring, and collaboration with cross-functional teams.
Job Description
About the Role
The Senior ML Infrastructure / DevOps Engineer will play a critical role in designing, operating, and scaling Pathway's ML infrastructure. As a key member of the team, you will own the production infrastructure that powers Pathway's ML training and inference workloads, working closely with the R&D team to ensure seamless integration and optimization. Your expertise in Linux, distributed systems, and cloud providers will be essential to the company's innovative approach to AI.
In this role, you will work on cutting-edge projects, collaborating with a team of talented engineers and researchers to push the boundaries of what is possible with ML. Your work will directly affect the company's ability to train, ship, and iterate on models, making this a highly visible and rewarding position.
What You Will Do
- Design, operate, and scale GPU and CPU clusters for ML training and inference
- Automate infrastructure provisioning and configuration using infrastructure-as-code and configuration management
- Build and maintain robust ML pipelines with strong guarantees around reproducibility, traceability, and rollback
- Implement and evolve ML-centric CI/CD: testing, packaging, deployment of models and services
- Own monitoring, logging, and alerting across training and serving: GPU/CPU utilization, latency, throughput, failures, and data/model drift
- Work with terabyte-scale datasets and the associated storage, networking, and performance challenges
- Partner closely with ML engineers and researchers to productionize their work, translating experimental setups into robust, scalable systems
- Participate in on-call rotation for critical ML infrastructure and lead incident response and post-mortems when things break
What We Are Looking For
- 5+ years of experience in DevOps/SRE/Platform/Infrastructure roles running production systems
- Strong background in Linux, distributed systems, and cloud providers
- Experience with automation tools such as Terraform or CloudFormation, and with cluster tooling
- Knowledge of container orchestration and CI/CD pipelines
- Strong understanding of networking, storage, and performance optimization
- Experience working with large-scale datasets and ML workloads
- Excellent collaboration and communication skills
Nice to Have
- Experience with Kubernetes, autoscaling, and queueing
- Familiarity with ML frameworks and libraries such as TensorFlow or PyTorch
- Knowledge of security best practices and compliance frameworks
- Experience with agile development methodologies and version control systems
Benefits and Perks
- Competitive salary and benefits package
- Opportunity to work on cutting-edge projects with a talented team of engineers and researchers
- Flexible working hours and remote work options
- Professional development and growth opportunities
- Access to the latest tools and technologies
- Collaborative and dynamic work environment
How to Stand Out
- Be prepared to discuss your experience with automation tools and infrastructure-as-code
- Highlight your understanding of ML workloads and scalability challenges
- Show examples of your work with large-scale datasets and cloud providers
- Emphasize your collaboration and communication skills, as this role requires close work with cross-functional teams
- Research Pathway's innovative approach to AI and be ready to discuss how your skills and experience align with their goals
- Consider creating a portfolio or repository showcasing your DevOps and ML infrastructure projects
This is a remote position listed on WFA Digital, the platform for professionals who work from anywhere.