Staff+ Software Engineer Observability
WFA Digital Insight
Demand for software engineers skilled in AI observability is rising sharply, with some estimates suggesting a 25% increase in 2025. As companies like Anthropic pioneer reliable and interpretable AI, professionals with expertise in monitoring and telemetry are in high demand. With its commitment to beneficial AI, Anthropic stands out in the industry. Before applying, candidates should be prepared to showcase their technical skills, particularly in monitoring and telemetry, and understand the company's mission to create safe and beneficial AI systems.
Job Description
About the Role
Anthropic is at the forefront of creating reliable, interpretable, and steerable AI systems, and its Observability team plays a critical role in this mission. As a Staff+ Software Engineer in Observability, you will be part of a dynamic team that owns the monitoring and telemetry infrastructure used by every engineer and researcher at Anthropic. This infrastructure is crucial for the detection, diagnosis, and resolution of issues, ensuring the reliability and operational excellence of Anthropic's research and product systems.
The Observability team is tasked with building next-generation observability systems capable of handling the exponentially growing complexity of Anthropic's infrastructure. This includes developing high-throughput ingest pipelines, cost-efficient columnar storage, unified query layers across signals, and diagnostic tools that enable engineers to resolve issues in minutes rather than hours.
What You Will Do
- Design, develop, and maintain the monitoring and telemetry infrastructure that supports Anthropic's growing research and product systems
- Collaborate with engineers and researchers to identify and prioritize observability needs
- Develop and implement high-throughput ingest pipelines for operational data
- Design and implement cost-efficient columnar storage solutions
- Develop unified query layers across different signals to enhance data accessibility
- Create and maintain diagnostic tools to aid in issue detection and resolution
- Work on distributed tracing, error analytics, alerting, and dashboard development
- Engage in the development of agentic diagnostic tools for advanced issue resolution
- Participate in on-call rotations to ensure 24/7 support for Anthropic's systems
- Collaborate with cross-functional teams to ensure observability requirements are met across the organization
What We Are Looking For
- Proficiency in software development languages such as Python, Java, or C++
- Experience with monitoring and telemetry tools such as Prometheus, Grafana, or New Relic
- Knowledge of data storage solutions such as relational databases or NoSQL databases
- Understanding of distributed systems and cloud computing platforms like AWS or GCP
- Experience with containerization using Docker and orchestration using Kubernetes
- Strong problem-solving skills and the ability to work in a fast-paced environment
- Excellent communication skills to collaborate effectively with cross-functional teams
- Experience with Agile development methodologies and version control systems like Git
Nice to Have
- Experience with AI or machine learning technologies
- Knowledge of security practices and protocols for protecting sensitive data
- Familiarity with project management tools such as Jira or Asana
- Certification in cloud computing or containerization
Benefits and Perks
- Competitive salary package
- Equity opportunities
- Comprehensive health insurance
- Generous PTO policy
- Remote work stipend
- Access to cutting-edge technologies and tools
- Professional development opportunities
- Collaborative and dynamic work environment
How to Stand Out
- Tailor your resume and cover letter to highlight your experience in software development, monitoring, and telemetry, ensuring you mention specific tools and technologies relevant to the role.
- Prepare to discuss your problem-solving skills and how you handle complex issues in distributed systems, as well as your experience with data storage and query layers.
- Showcase your understanding of AI and machine learning concepts, even if it's not a direct requirement, to demonstrate your versatility and willingness to learn.
- Be ready to talk about your experience with containerization and orchestration, as these are key technologies in managing complex infrastructure.
- Ask about the team's dynamics and how the company supports professional growth, to understand the work environment and opportunities for development.
- Prepare examples of your work, such as personal projects or contributions to open-source projects, to demonstrate your coding skills and ability to work on complex systems.
- Don't hesitate to ask about the company's approach to AI ethics and safety, showing your interest in the company's mission and values.
This is a remote position listed on WFA Digital, the platform for professionals who work from anywhere.