Site Reliability Engineer

Supabase · Remote
Software Development

WFA Digital Insight

As the demand for reliable digital infrastructure grows, so does the need for skilled Site Reliability Engineers. With a 25% increase in cloud computing adoption in 2025, companies like Supabase are looking for experts who can ensure seamless service delivery. Supabase stands out for its commitment to remote work and employee growth, with a unique approach to SRE that emphasizes collaboration and influence. Candidates should be prepared to demonstrate their ability to drive reliability and scalability in fast-paced environments, with a strong focus on automation, incident response, and team empowerment.

Job Description

About the Role

Supabase is a leading Postgres development platform that provides a complete backend solution for developers. As the company continues to grow, it's essential to have a strong Site Reliability Engineer who can ensure the platform's reliability and scalability. The successful candidate will be embedded within the Service Operations team and will work closely with various engineering teams to establish practices, frameworks, and feedback loops that promote reliability.

The role requires a deep understanding of SRE principles, as well as experience in defining and operationalizing SLOs/SLIs at scale. The ideal candidate will have a software engineering mindset, with hands-on experience in building tools and driving adoption across engineering teams. With a strong focus on async and globally distributed teams, Supabase is looking for someone who can influence without authority and drive systemic improvements.

What You Will Do

  • Partner with service teams to define meaningful SLIs and SLOs grounded in customer experience
  • Own and evolve the Operational Readiness Review (ORR) process, conducting reviews for new services and major changes
  • Strengthen the incident-to-improvement pipeline, connecting postmortem findings to operational readiness gaps
  • Act as the reliability expert for architecture reviews, failure mode analysis, dependency mapping, and resilience design
  • Identify and quantify operational toil across the org, building or advocating for automation that eliminates it
  • Help teams design sustainable on-call practices, including alert quality, escalation paths, runbook coverage, and noise reduction
  • Track and report on org-wide operational maturity, surfacing systemic gaps and driving remediation
  • Develop and maintain dashboards and metrics to measure reliability and performance
  • Collaborate with cross-functional teams to drive reliability and scalability initiatives

What We Are Looking For

  • 7+ years of experience in SRE, production engineering, or reliability-focused roles
  • Software engineering mindset, with hands-on experience in building tools and driving adoption
  • Experience defining and operationalizing SLOs/SLIs at scale, including error budget policies
  • Deep experience with incident response, postmortem facilitation, and turning incident learnings into systemic improvements
  • Proficiency with cloud infrastructure (AWS preferred) and infrastructure-as-code (Pulumi preferred)
  • Strong communication and influencing skills, with experience in async or globally distributed teams
  • Experience with large-scale multi-tenant systems, including managed database platforms or Postgres
  • Familiarity with OpenTelemetry, VictoriaMetrics, Grafana, or similar observability tooling

Nice to Have

  • Experience with Kubernetes-based platform operations
  • Familiarity with building developer-facing reliability tooling (SLO dashboards, ORR frameworks, toil tracking, DORA metrics)
  • Experience with Terraform or CDK for infrastructure-as-code

Benefits and Perks

  • Fully remote work arrangement, with a WeWork membership or co-working allowance
  • ESOP (equity ownership) in the company, with a shared vision for growth and success
  • Opportunities for professional development and growth, with a focus on employee empowerment
  • Access to cutting-edge technology and tools, with a strong focus on innovation and experimentation
  • Flexible working hours and a healthy work-life balance, with an emphasis on productivity and efficiency
  • Competitive compensation package, with a focus on performance and results
  • Comprehensive health and wellness benefits, with a focus on employee well-being
  • Generous PTO and holiday policy, with an emphasis on work-life balance

How to Stand Out

  • Be prepared to demonstrate your experience in defining and operationalizing SLOs/SLIs at scale, including error budget policies.
  • Showcase your ability to influence without authority and drive systemic improvements in a fast-paced environment.
  • Highlight your proficiency with cloud infrastructure and infrastructure-as-code, including AWS and Pulumi.
  • Emphasize your experience with large-scale multi-tenant systems and managed database platforms, such as Postgres.
  • Prepare to discuss your approach to incident response, postmortem facilitation, and turning incident learnings into systemic improvements.
  • Be ready to provide examples of how you've driven reliability and scalability initiatives in previous roles, and how you can apply those skills to Supabase.

This is a remote position listed on WFA Digital, the platform for professionals who work from anywhere.