Senior Site Reliability Engineer, AI Infrastructure
WFA Digital Insight
The remote job market is seeing a surge in demand for professionals with AI infrastructure expertise, with a 25% increase in postings over the past year. Andromeda Cluster is at the forefront of this trend, providing early-stage startups with access to scaled AI infrastructure. To succeed in this role, candidates need a deep understanding of GPU systems, high-performance networking, and distributed training.
Job Description
About the Role
As a Senior Site Reliability Engineer at Andromeda Cluster, you will play a critical role in designing, operating, and debugging the large-scale GPU infrastructure used for distributed training and inference. You will work directly with customers who are pushing the limits of modern AI systems, providing technical guidance and support to ensure their workloads run smoothly. The company is expanding its operations to new frontiers, and this role is an opportunity to join a team at the forefront of AI innovation.
The role requires a deep understanding of GPU systems, high-performance networking, and distributed training. You will design and evolve multi-provider, multi-region GPU compute clusters optimized for large-scale training, making topology-aware scheduling, networking, and storage decisions that directly impact training throughput and cost efficiency. You will also serve as the primary technical point of contact for customers running large-scale training workloads, providing onboarding, troubleshooting, and optimization support.
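To give a flavor of the topology-aware placement decisions this role involves, here is a toy sketch of packing a job's GPUs into as few interconnect domains as possible so collectives stay on fast paths. The inventory format, domain names, and greedy policy are illustrative assumptions, not Andromeda Cluster's actual scheduler:

```python
from collections import defaultdict

def place_job(free_gpus, gpus_needed):
    """free_gpus: list of (gpu_id, domain) pairs, where a domain might be
    an NVLink island or a rack. Returns gpu_ids drawn from as few domains
    as possible, or None if there aren't enough free GPUs."""
    by_domain = defaultdict(list)
    for gpu_id, domain in free_gpus:
        by_domain[domain].append(gpu_id)
    # Greedily draw from the fullest domains first to minimize spread
    # (and thus cross-domain traffic during collectives).
    chosen = []
    for domain in sorted(by_domain, key=lambda d: -len(by_domain[d])):
        take = min(gpus_needed - len(chosen), len(by_domain[domain]))
        chosen.extend(by_domain[domain][:take])
        if len(chosen) == gpus_needed:
            return chosen
    return None

inventory = [("g0", "rack-a"), ("g1", "rack-a"), ("g2", "rack-a"),
             ("g3", "rack-b")]
print(place_job(inventory, 2))  # -> ['g0', 'g1']: fits entirely inside rack-a
```

A production scheduler would weigh many more signals (fragmentation, preemption, per-link bandwidth), but the bin-packing instinct is the same.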
Andromeda Cluster is committed to delivering compute when and where it's needed most, and this role is crucial to achieving that goal. The company works with leading AI labs, data centers, and cloud providers to deliver compute resources that power the world's most advanced AI systems.
What You Will Do
- Design and evolve multi-provider, multi-region GPU compute clusters optimized for large-scale training
- Make topology-aware scheduling, networking, and storage decisions that directly impact training throughput and cost efficiency
- Serve as the primary technical point of contact for customers running large-scale training workloads
- Onboard, troubleshoot, and optimize customer workloads, often in real time
- Define SLOs and error budgets that account for the unique failure modes of GPU infrastructure
- Own capacity planning across heterogeneous GPU fleets optimized for training throughput
- Ensure the health and performance of high-speed interconnects that underpin distributed training
- Diagnose and resolve fabric-level issues that degrade collective operations
- Build deep visibility into GPU utilization, memory pressure, interconnect throughput, training job performance, and hardware health
- Build production-grade automation for cluster provisioning, GPU health checks, job scheduling, self-healing, and firmware/driver lifecycle management
- Lead incident response for complex, multi-layer failures spanning hardware, networking, orchestration, and ML frameworks
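As a taste of the GPU health-check automation above, here is a minimal sketch that parses the CSV output of `nvidia-smi --query-gpu=index,ecc.errors.uncorrected.volatile.total,temperature.gpu --format=csv,noheader` and flags GPUs that should be drained from the scheduler. The thresholds are illustrative assumptions; a real fleet would tune them per SKU:

```python
import csv
import io

# Illustrative thresholds -- not vendor-blessed values.
MAX_TEMP_C = 85          # sustained temps above this suggest thermal throttling
MAX_UNCORRECTED_ECC = 0  # any volatile uncorrected ECC error drains the GPU

def parse_gpu_health(nvidia_smi_csv: str) -> list[int]:
    """Given CSV rows of `index, uncorrected ECC errors, temperature (C)`,
    return the indices of GPUs that should be drained."""
    unhealthy = []
    for row in csv.reader(io.StringIO(nvidia_smi_csv)):
        index, ecc_errors, temp = (field.strip() for field in row)
        if int(ecc_errors) > MAX_UNCORRECTED_ECC or int(temp) > MAX_TEMP_C:
            unhealthy.append(int(index))
    return unhealthy

# GPU 1 has an uncorrected ECC error; GPU 2 is running hot.
sample = "0, 0, 62\n1, 1, 64\n2, 0, 91\n"
print(parse_gpu_health(sample))  # -> [1, 2]
```

In production this would run on a node agent, feed a metrics pipeline, and cordon nodes automatically rather than just print.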
What We Are Looking For
- Deep, hands-on experience operating large-scale GPU clusters (NVIDIA A100/H100/B200 or equivalent)
- Production experience with InfiniBand, RoCE, or NVLink fabrics in the context of distributed training
- Working knowledge of how large training jobs actually run - NCCL, CUDA, PyTorch distributed, DeepSpeed, Megatron, FSDP, or similar
- Expert-level knowledge of Linux and systems internals
- Strong understanding of GPU memory hierarchies, ECC behavior, thermal throttling, and hardware failure modes
- Experience with distributed training and ML frameworks
- Strong problem-solving skills, with the ability to diagnose and resolve complex issues
- Excellent communication skills, with the ability to work with customers and internal stakeholders
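For candidates gauging their grasp of "how large training jobs actually run," the following single-process toy walks through the ring all-reduce pattern (reduce-scatter followed by all-gather) that underlies NCCL-style gradient synchronization. Rank count and values are illustrative; real collectives run across processes and devices:

```python
# Toy, single-process simulation of ring all-reduce. Each "rank" holds one
# value per chunk; after the collective, every rank holds the elementwise sum.
def ring_allreduce(chunks_per_rank):
    """chunks_per_rank[r][c] is rank r's value for chunk c.
    Returns each rank's buffer after reduce-scatter + all-gather."""
    n = len(chunks_per_rank)
    buf = [list(rank) for rank in chunks_per_rank]
    # Phase 1: reduce-scatter. At step s, rank r passes chunk (r - s) mod n
    # to its right neighbour, which accumulates it. After n-1 steps, rank r
    # owns the fully reduced chunk (r + 1) mod n.
    for s in range(n - 1):
        for r in range(n):
            c = (r - s) % n
            buf[(r + 1) % n][c] += buf[r][c]
    # Phase 2: all-gather. Each rank circulates its reduced chunk around
    # the ring, overwriting stale values, until every rank has every chunk.
    for s in range(n - 1):
        for r in range(n):
            c = (r + 1 - s) % n
            buf[(r + 1) % n][c] = buf[r][c]
    return buf

ranks = [[1, 1, 1], [2, 2, 2], [3, 3, 3]]
print(ring_allreduce(ranks))  # every rank ends with [6, 6, 6]
```

The appeal of the ring, and why fabric health matters so much here, is that each rank only ever talks to its neighbours, so per-link bandwidth stays constant as the cluster grows while a single degraded link slows the entire collective.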
Nice to Have
- Experience with containerization and orchestration tools such as Docker and Kubernetes
- Knowledge of cloud computing platforms such as AWS, Azure, or Google Cloud
- Experience with monitoring and logging tools such as Prometheus and Grafana
- Familiarity with agile development methodologies and version control systems such as Git
Benefits and Perks
- Competitive salary and benefits package
- Opportunity to work with a cutting-edge AI infrastructure company
- Collaborative and dynamic work environment
- Flexible working hours and remote work options
- Professional development and training opportunities
- Access to the latest technologies and tools
- Recognition and reward for outstanding performance
- Comprehensive health insurance and wellness programs
- Generous paid time off and vacation policy
How to Stand Out
- Develop a deep understanding of GPU systems and high-performance networking.
- Showcase your experience with distributed training and ML frameworks in your portfolio or resume.
- Be prepared to provide specific examples of how you've troubleshot and optimized large-scale GPU clusters in the past.
- Highlight your problem-solving skills and ability to work in a fast-paced environment.
- Research the company's technology stack and be prepared to ask informed questions during the interview process.
- Emphasize your ability to work collaboratively with customers and internal stakeholders to provide technical support and guidance.
- Be prepared to discuss your experience with agile development methodologies and version control systems.
This is a remote position listed on WFA Digital, the platform for professionals who work from anywhere. Browse more remote jobs across all categories.