Staff Site Reliability Engineer - Site Experience

Reddit · Remote (Remote - United Kingdom)

Software Development

Excel

WFA Digital Insight

As demand for reliable digital infrastructure grows, Reddit is seeking a Staff Site Reliability Engineer to lead reliability engineering for its critical user-facing systems. With over 126 million daily active users, the platform's performance is crucial. Candidates with experience in large-scale distributed systems and a passion for solving complex reliability challenges will thrive in this role. The current remote job market sees a 25% increase in demand for site reliability engineers, making this a prime opportunity for those looking to make a significant impact.

Job Description

About the Role

The Staff Site Reliability Engineer position at Reddit is a technical leadership role focused on ensuring the reliability and performance of critical user-facing systems. As a key member of the Site Experience SRE team, you will partner with product and infrastructure teams to drive availability, latency, scalability, and operational excellence across Reddit's most business-critical experiences.

Reddit is a community of communities, built on shared interests, passion, and trust. With over 100,000 active communities and approximately 126 million daily active unique visitors, the platform is one of the largest and most influential on the internet. At that scale, reliability and performance are critical.

The Site Experience SRE team sits at the intersection of infrastructure, product engineering, and user experience, ensuring that every interaction across web, mobile, APIs, feeds, media delivery, and real-time systems is fast, reliable, and resilient.

What You Will Do

Lead reliability engineering initiatives for critical user-facing systems at internet scale
Partner with product and infrastructure teams to improve availability, latency, scalability, and operational excellence
Drive architectural decisions around failover, redundancy, graceful degradation, traffic management, and capacity planning
Identify systemic risks and reliability bottlenecks across services, dependencies, deployments, and infrastructure
Build proactive mitigation strategies and drive engineering improvements that reduce incidents and improve service health
Eliminate repetitive operational work through automation and tooling
Lead complex incident response efforts across engineering teams
Drive blameless postmortems, identify root causes, and ensure sustainable long-term fixes are implemented
Influence engineering standards and best practices across the organization

What We Are Looking For

5+ years of experience in site reliability engineering or a related field
Strong technical leadership and architecture skills
Experience with large-scale distributed systems and cloud infrastructure
Proficiency in programming languages such as Python, Java, or C++
Strong understanding of reliability, scalability, and performance principles
Experience with automation and tooling, such as Ansible, Docker, or Kubernetes
Strong communication and collaboration skills
Experience with incident management and blameless postmortems

Nice to Have

Experience with Excel or other data analysis tools
Experience with machine learning or artificial intelligence
Experience with containerization and orchestration
Experience with agile development methodologies

Benefits and Perks

Competitive salary and benefits package
Opportunity to work on a high-impact team with a significant influence on the company's growth and success
Collaborative and dynamic work environment
Flexible working hours and remote work options
Access to cutting-edge technologies and tools
Professional development and growth opportunities
Recognition and reward for outstanding performance
Comprehensive health and wellness programs
Generous parental leave policy

How to Stand Out

Ensure you have a strong understanding of large-scale distributed systems and cloud infrastructure, as well as experience with automation and tooling.
Develop a portfolio that showcases your technical leadership and architecture skills, including examples of reliability engineering initiatives you've led.
Be prepared to discuss your experience with incident management and blameless postmortems, and how you've driven engineering improvements to reduce incidents and improve service health.
Research Reddit's company culture and values, and be prepared to discuss how your skills and experience align with their mission and goals.
Don't be afraid to ask questions during the interview process, such as what a typical day looks like in the role or what opportunities there are for growth and development.
Make sure to highlight your experience with data analysis tools like Excel, and be prepared to discuss how you've used data to drive technical decisions.
Be prepared to negotiate your salary and benefits package.

This is a remote position listed on WFA Digital, the platform for professionals who work from anywhere.