Staff+ Software Engineer, Inference Runtime
WFA Digital Insight
As demand for AI specialists grows, Anthropic stands out with its mission to create reliable AI systems. With the rise of remote work, companies like Anthropic are looking for skilled engineers who can drive technical direction. The demand for professionals with expertise in systems engineering and ML infrastructure has increased by 27% in the past year, making this role highly competitive. Candidates should be prepared to showcase their technical expertise and experience working with large-scale distributed systems.
Job Description
About the Role
The Staff+ Software Engineer will be a technical lead for Inference Runtime, the team responsible for the shared, accelerator-agnostic core of Anthropic's inference serving stack. This role involves setting technical direction for the team, owning the architecture and roadmap for the shared runtime, and partnering with other teams to drive technical decisions. The ideal candidate will have experience working on large-scale distributed systems and a deep understanding of systems engineering and ML infrastructure.
The Inference organization at Anthropic serves Claude to millions of users and enterprise customers, requiring speed, reliability, and efficiency. The team is looking for a senior engineer who can drive the technical roadmap and represent the team in cross-org efforts. The role involves working closely with the Engineering Manager, who owns hiring and people development, and collaborating with other teams to make technical decisions.
What You Will Do
- Set technical direction for the team, owning the architecture and roadmap for the shared runtime of the inference serving stack
- Own and evolve the accelerator-agnostic runtime itself, including its interfaces, internal boundaries, and build structure
- Drive efficient accelerator usage, including utilization, scheduling, and memory management across GPU, TPU, and Trainium
- Build the runtime's validation surface around partitioned builds, change-scoped testing, and canary/shadow/rollback as first-class mechanisms
- Act as a technical counterpart to Anthropic's central Infrastructure org on compilers, build systems, and toolchains
- Mentor engineers on the team through design review, code review, and direct collaboration
- Partner with other teams to drive technical decisions and represent the team in cross-org efforts
- Keep the platform's expansion cost low by ensuring new models and deployment targets pay only for their own specialization
- Drive the adoption of new technologies and techniques to improve the performance and reliability of the inference serving stack
What We Are Looking For
- Deep background in systems engineering or ML infrastructure, with experience working on large-scale distributed systems
- Real depth in at least one accelerator ecosystem (CUDA/GPU, TPU, or Trainium/AWS Neuron) and a genuine appetite to keep the runtime agnostic across all of them
- Significant software engineering experience, with a strong background in high-performance, large-scale distributed systems serving millions of users
- A track record of defining and using engineering metrics to drive improvement, including setting SLOs on platform surfaces and improving escape rates, release times, latency, or throughput
- Experience working with performance profiling, latency and throughput optimization, and systems debugging at scale
- Strong background in programming languages such as Rust and Python
- Experience with Agile development methodologies and version control systems such as Git
Nice to Have
- Experience working with cloud-based infrastructure and containerization technologies such as Docker and Kubernetes
- Familiarity with machine learning frameworks and libraries such as TensorFlow and PyTorch
- Experience working with data storage and processing technologies such as Apache Cassandra and Apache Spark
- Strong understanding of computer architecture and operating system design
- Experience working with security and compliance frameworks such as HIPAA and PCI-DSS
Benefits and Perks
- Competitive salary and equity package
- Comprehensive health insurance, including medical, dental, and vision
- Generous PTO and holiday package, covering vacation and sick leave
- Opportunity to work on cutting-edge technology and collaborate with experienced engineers
- Flexible working hours and remote work options, including the ability to work from home or from one of our offices
- Access to professional development and training opportunities, including conferences, workshops, and online courses
- Collaborative and dynamic work environment, including regular team-building activities and social events
How to Stand Out
- Develop a strong understanding of systems engineering and ML infrastructure, including experience working with large-scale distributed systems and accelerator ecosystems.
- Showcase your technical expertise and experience working with performance profiling, latency and throughput optimization, and systems debugging at scale.
- Highlight your ability to drive technical direction and own the architecture and roadmap for a shared runtime.
- Prepare to discuss your experience working with Agile development methodologies and version control systems such as Git.
- Research Anthropic's mission and values, and be prepared to discuss how your skills and experience align with the company's goals.
- Be prepared to provide examples of your experience working with security and compliance frameworks, and your understanding of computer architecture and operating system design.
This is a remote position listed on WFA Digital, the platform for professionals who work from anywhere.