Staff Site Reliability Engineer - Site Experience

Reddit · Remote (United Kingdom)
Software Development
Excel

WFA Digital Insight

As demand for reliable digital infrastructure grows, Reddit seeks a Staff Site Reliability Engineer to lead reliability work on its user-facing systems. With approximately 126 million daily active users, the platform's performance is crucial. Candidates with experience in large-scale distributed systems and a passion for solving complex reliability challenges will thrive in this role. The current remote job market sees a 25% increase in demand for site reliability engineers, making this a prime opportunity for those looking to make a significant impact.

Job Description

About the Role

The Staff Site Reliability Engineer position at Reddit is a technical leadership role focused on ensuring the reliability and performance of critical user-facing systems. As a key member of the Site Experience SRE team, you will partner with product and infrastructure teams to drive availability, latency, scalability, and operational excellence across Reddit's most business-critical experiences.

Reddit is a community of communities, built on shared interests, passion, and trust. With over 100,000 active communities and approximately 126 million daily active unique visitors, the platform is one of the largest and most influential on the internet. At that scale, reliability and performance are more critical than ever.

The Site Experience SRE team sits at the intersection of infrastructure, product engineering, and user experience, ensuring that every interaction across web, mobile, APIs, feeds, media delivery, and real-time systems is fast, reliable, and resilient.

What You Will Do

  • Lead reliability engineering initiatives for critical user-facing systems at internet scale
  • Partner with product and infrastructure teams to improve availability, latency, scalability, and operational excellence
  • Drive architectural decisions around failover, redundancy, graceful degradation, traffic management, and capacity planning
  • Identify systemic risks and reliability bottlenecks across services, dependencies, deployments, and infrastructure
  • Build proactive mitigation strategies and drive engineering improvements that reduce incidents and improve service health
  • Eliminate repetitive operational work through automation and tooling
  • Lead complex incident response efforts across engineering teams
  • Drive blameless postmortems, identify root causes, and ensure sustainable long-term fixes are implemented
  • Influence engineering standards and best practices across the organization

What We Are Looking For

  • 5+ years of experience in site reliability engineering or a related field
  • Strong technical leadership and architecture skills
  • Experience with large-scale distributed systems and cloud infrastructure
  • Proficiency in programming languages such as Python, Java, or C++
  • Strong understanding of reliability, scalability, and performance principles
  • Experience with automation and tooling, such as Ansible, Docker, or Kubernetes
  • Strong communication and collaboration skills
  • Experience with incident management and blameless postmortems

Nice to Have

  • Experience with Excel or other data analysis tools
  • Experience with machine learning or artificial intelligence
  • Experience with containerization and orchestration
  • Experience with agile development methodologies

Benefits and Perks

  • Competitive salary and benefits package
  • Opportunity to work on a high-impact team with a significant influence on the company's growth and success
  • Collaborative and dynamic work environment
  • Flexible working hours and remote work options
  • Access to cutting-edge technologies and tools
  • Professional development and growth opportunities
  • Recognition and reward for outstanding performance
  • Comprehensive health and wellness programs
  • Generous parental leave policy

How to Stand Out

  • Ensure you have a strong understanding of large-scale distributed systems and cloud infrastructure, as well as experience with automation and tooling.
  • Develop a portfolio that showcases your technical leadership and architecture skills, including examples of reliability engineering initiatives you've led.
  • Be prepared to discuss your experience with incident management and blameless postmortems, and how you've driven engineering improvements to reduce incidents and improve service health.
  • Research Reddit's company culture and values, and be prepared to discuss how your skills and experience align with its mission and goals.
  • Don't be afraid to ask questions during the interview process, such as what a typical day looks like in the role or what opportunities there are for growth and development.
  • Make sure to highlight your experience with data analysis tools like Excel, and be prepared to discuss how you've used data to drive technical decisions.
  • Be prepared to negotiate your salary and benefits package, and don't be afraid to ask about opportunities for professional development and growth.

This is a remote position listed on WFA Digital, the platform for professionals who work from anywhere.