Senior Site Reliability Engineer, AI Infrastructure
WFA Digital Insight
The remote job market is seeing a surge in demand for professionals with AI infrastructure expertise, with a 25% increase in postings over the past year. Andromeda Cluster is at the forefront of this trend, providing early-stage startups with access to scaled AI infrastructure. To succeed in this role, candidates need a deep understanding of GPU systems, high-performance networking, and distributed training.
Job Description
About the Role
As a Senior Site Reliability Engineer at Andromeda Cluster, you will play a critical role in designing, operating, and debugging the large-scale GPU infrastructure used for distributed training and inference. You will work directly with customers who are pushing the limits of modern AI systems, providing technical guidance and support to ensure their workloads run smoothly. The company is expanding its operations to new frontiers, and this role is an opportunity to join a team at the forefront of AI innovation.
The role requires a deep understanding of GPU systems, high-performance networking, and distributed training. You will design and evolve multi-provider, multi-region GPU compute clusters optimized for large-scale training, making topology-aware scheduling, networking, and storage decisions that directly impact training throughput and cost efficiency. You will also serve as the primary technical point of contact for customers running large-scale training workloads, providing onboarding, troubleshooting, and optimization support.
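To give a flavor of the topology-aware placement decisions this role involves, here is a toy sketch of packing a job's GPUs into as few interconnect domains as possible so collectives stay on fast paths. The inventory format, domain names, and greedy policy are illustrative assumptions, not Andromeda Cluster's actual scheduler:

```python
from collections import defaultdict

def place_job(free_gpus, gpus_needed):
    """free_gpus: list of (gpu_id, domain) pairs, where a domain might be
    an NVLink island or a rack. Returns gpu_ids drawn from as few domains
    as possible, or None if there aren't enough free GPUs."""
    by_domain = defaultdict(list)
    for gpu_id, domain in free_gpus:
        by_domain[domain].append(gpu_id)
    # Greedily draw from the fullest domains first to minimize spread
    # (and thus cross-domain traffic during collectives).
    chosen = []
    for domain in sorted(by_domain, key=lambda d: -len(by_domain[d])):
        take = min(gpus_needed - len(chosen), len(by_domain[domain]))
        chosen.extend(by_domain[domain][:take])
        if len(chosen) == gpus_needed:
            return chosen
    return None

inventory = [("g0", "rack-a"), ("g1", "rack-a"), ("g2", "rack-a"),
             ("g3", "rack-b")]
print(place_job(inventory, 2))  # -> ['g0', 'g1']: fits entirely inside rack-a
```

A production scheduler would weigh many more signals (fragmentation, preemption, per-link bandwidth), but the bin-packing instinct is the same.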
Andromeda Cluster is committed to delivering compute when and where it's needed most, and this role is crucial to achieving that goal. The company works with leading AI labs, data centers, and cloud providers to deliver compute resources that power the world's most advanced AI systems.
What You Will Do
- Design and evolve multi-provider, multi-region GPU compute clusters optimized for large-scale training
- Make topology-aware scheduling, networking, and storage decisions that directly impact training throughput and cost efficiency
- Serve as the primary technical point of contact for customers running large-scale training workloads
- Onboard, troubleshoot, and optimize customer workloads, often in real time
- Define SLOs and error budgets that account for the unique failure modes of GPU infrastructure
- Own capacity planning across heterogeneous GPU fleets optimized for training throughput
- Ensure the health and performance of high-speed interconnects that underpin distributed training
- Diagnose and resolve fabric-level issues that degrade collective operations
- Build deep visibility into GPU utilization, memory pressure, interconnect throughput, training job performance, and hardware health
- Build production-grade automation for cluster provisioning, GPU health checks, job scheduling, self-healing, and firmware/driver lifecycle management
- Lead incident response for complex, multi-layer failures spanning hardware, networking, orchestration, and ML frameworks
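As a taste of the GPU health-check automation above, here is a minimal sketch that parses the CSV output of `nvidia-smi --query-gpu=index,ecc.errors.uncorrected.volatile.total,temperature.gpu --format=csv,noheader` and flags GPUs that should be drained from the scheduler. The thresholds are illustrative assumptions; a real fleet would tune them per SKU:

```python
import csv
import io

# Illustrative thresholds -- not vendor-blessed values.
MAX_TEMP_C = 85          # sustained temps above this suggest thermal throttling
MAX_UNCORRECTED_ECC = 0  # any volatile uncorrected ECC error drains the GPU

def parse_gpu_health(nvidia_smi_csv: str) -> list[int]:
    """Given CSV rows of `index, uncorrected ECC errors, temperature (C)`,
    return the indices of GPUs that should be drained."""
    unhealthy = []
    for row in csv.reader(io.StringIO(nvidia_smi_csv)):
        index, ecc_errors, temp = (field.strip() for field in row)
        if int(ecc_errors) > MAX_UNCORRECTED_ECC or int(temp) > MAX_TEMP_C:
            unhealthy.append(int(index))
    return unhealthy

# GPU 1 has an uncorrected ECC error; GPU 2 is running hot.
sample = "0, 0, 62\n1, 1, 64\n2, 0, 91\n"
print(parse_gpu_health(sample))  # -> [1, 2]
```

In production this would run on a node agent, feed a metrics pipeline, and cordon nodes automatically rather than just print.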
What We Are Looking For
- Deep, hands-on experience operating large-scale GPU clusters (NVIDIA A100/H100/B200 or equivalent)
- Production experience with InfiniBand, RoCE, or NVLink fabrics in the context of distributed training
- Working knowledge of how large training jobs actually run - NCCL, CUDA, PyTorch distributed, DeepSpeed, Megatron, FSDP, or similar
- Expert-level knowledge of Linux and systems internals
- Strong understanding of GPU memory hierarchies, ECC behavior, thermal throttling, and hardware failure modes
- Experience with distributed training and ML frameworks
- Strong problem-solving skills, with the ability to diagnose and resolve complex issues
- Excellent communication skills, with the ability to work with customers and internal stakeholders
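For candidates gauging their grasp of "how large training jobs actually run," the following single-process toy walks through the ring all-reduce pattern (reduce-scatter followed by all-gather) that underlies NCCL-style gradient synchronization. Rank count and values are illustrative; real collectives run across processes and devices:

```python
# Toy, single-process simulation of ring all-reduce. Each "rank" holds one
# value per chunk; after the collective, every rank holds the elementwise sum.
def ring_allreduce(chunks_per_rank):
    """chunks_per_rank[r][c] is rank r's value for chunk c.
    Returns each rank's buffer after reduce-scatter + all-gather."""
    n = len(chunks_per_rank)
    buf = [list(rank) for rank in chunks_per_rank]
    # Phase 1: reduce-scatter. At step s, rank r passes chunk (r - s) mod n
    # to its right neighbour, which accumulates it. After n-1 steps, rank r
    # owns the fully reduced chunk (r + 1) mod n.
    for s in range(n - 1):
        for r in range(n):
            c = (r - s) % n
            buf[(r + 1) % n][c] += buf[r][c]
    # Phase 2: all-gather. Each rank circulates its reduced chunk around
    # the ring, overwriting stale values, until every rank has every chunk.
    for s in range(n - 1):
        for r in range(n):
            c = (r + 1 - s) % n
            buf[(r + 1) % n][c] = buf[r][c]
    return buf

ranks = [[1, 1, 1], [2, 2, 2], [3, 3, 3]]
print(ring_allreduce(ranks))  # every rank ends with [6, 6, 6]
```

The appeal of the ring, and why fabric health matters so much here, is that each rank only ever talks to its neighbours, so per-link bandwidth stays constant as the cluster grows while a single degraded link slows the entire collective.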
Nice to Have
- Experience with containerization and orchestration tools such as Docker and Kubernetes
- Knowledge of cloud computing platforms such as AWS, Azure, or Google Cloud
- Experience with monitoring and logging tools such as Prometheus and Grafana
- Familiarity with agile development methodologies and version control systems such as Git
Benefits and Perks
- Competitive salary and benefits package
- Opportunity to work with a cutting-edge AI infrastructure company
- Collaborative and dynamic work environment
- Flexible working hours and remote work options
- Professional development and training opportunities
- Access to the latest technologies and tools
- Recognition and reward for outstanding performance
- Comprehensive health insurance and wellness programs
- Generous paid time off and vacation policy
How to Stand Out
- Develop a deep understanding of GPU systems and high-performance networking.
- Showcase your experience with distributed training and ML frameworks in your portfolio or resume.
- Be prepared to provide specific examples of how you've troubleshot and optimized large-scale GPU clusters in the past.
- Highlight your problem-solving skills and ability to work in a fast-paced environment.
- Research the company's technology stack and be prepared to ask informed questions during the interview process.
- Emphasize your ability to work collaboratively with customers and internal stakeholders to provide technical support and guidance.
- Be prepared to discuss your experience with agile development methodologies and version control systems.
This is a remote position listed on WFA Digital, the platform for professionals who work from anywhere. Browse more remote jobs across all categories.