Site Reliability Engineer

Software Development

WFA Digital Insight

As the demand for reliable digital infrastructure grows, so does the need for skilled Site Reliability Engineers. With a 25% increase in cloud computing adoption in 2025, companies like Supabase are looking for experts who can ensure seamless service delivery. Supabase stands out for its commitment to remote work and employee growth, with a unique approach to SRE that emphasizes collaboration and influence. Candidates should be prepared to demonstrate their ability to drive reliability and scalability in fast-paced environments, with a strong focus on automation, incident response, and team empowerment.

Job Description

About the Role

Supabase is a leading Postgres development platform that provides a complete backend solution for developers. As the company continues to grow, it's essential to have a strong Site Reliability Engineer who can ensure the platform's reliability and scalability. The successful candidate will be embedded within the Service Operations team and will work closely with various engineering teams to establish practices, frameworks, and feedback loops that promote reliability.

The role requires a deep understanding of SRE principles, as well as experience in defining and operationalizing SLOs/SLIs at scale. The ideal candidate will have a software engineering mindset, with hands-on experience in building tools and driving adoption across engineering teams. With a strong focus on async and globally distributed teams, Supabase is looking for someone who can influence without authority and drive systemic improvements.

What You Will Do

Partner with service teams to define meaningful SLIs and SLOs grounded in customer experience
Own and evolve the Operational Readiness Review (ORR) process, conducting reviews for new services and major changes
Strengthen the incident-to-improvement pipeline, connecting postmortem findings to operational readiness gaps
Act as the reliability expert for architecture reviews, failure mode analysis, dependency mapping, and resilience design
Identify and quantify operational toil across the org, building or advocating for automation that eliminates it
Help teams design sustainable on-call practices, including alert quality, escalation paths, runbook coverage, and noise reduction
Track and report on org-wide operational maturity, surfacing systemic gaps and driving remediation
Develop and maintain dashboards and metrics to measure reliability and performance
Collaborate with cross-functional teams to drive reliability and scalability initiatives

What We Are Looking For

7+ years of experience in SRE, production engineering, or reliability-focused roles
Software engineering mindset, with hands-on experience in building tools and driving adoption
Experience defining and operationalizing SLOs/SLIs at scale, including error budget policies
Deep experience with incident response, postmortem facilitation, and turning incident learnings into systemic improvements
Proficiency with cloud infrastructure (AWS preferred) and infrastructure-as-code (Pulumi preferred)
Strong communication and influencing skills, with experience in async or globally distributed teams
Experience with large-scale multi-tenant systems, including managed database platforms or Postgres
Familiarity with OpenTelemetry, VictoriaMetrics, Grafana, or similar observability tooling

Nice to Have

Experience with Kubernetes-based platform operations
Familiarity with building developer-facing reliability tooling (SLO dashboards, ORR frameworks, toil tracking, DORA metrics)
Experience with Terraform or CDK for infrastructure-as-code

Benefits and Perks

Fully remote work arrangement, with a WeWork membership or co-working allowance
ESOP (equity ownership) in the company, with a shared vision for growth and success
Opportunities for professional development and growth, with a focus on employee empowerment
Access to cutting-edge technology and tools, with a strong focus on innovation and experimentation
Flexible working hours and a healthy work-life balance, with an emphasis on productivity and efficiency
Competitive compensation package, with a focus on performance and results
Comprehensive health and wellness benefits, with a focus on employee well-being
Generous PTO and holiday policy, with an emphasis on work-life balance

How to Stand Out

Be prepared to demonstrate your experience in defining and operationalizing SLOs/SLIs at scale, including error budget policies.
Showcase your ability to influence without authority and drive systemic improvements in a fast-paced environment.
Highlight your proficiency with cloud infrastructure and infrastructure-as-code, including AWS and Pulumi.
Emphasize your experience with large-scale multi-tenant systems, including managed database platforms or Postgres.
Prepare to discuss your approach to incident response, postmortem facilitation, and turning incident learnings into systemic improvements.
Be ready to provide examples of how you've driven reliability and scalability initiatives in previous roles, and how you can apply those skills to Supabase.

This is a remote position listed on WFA Digital, the platform for professionals who work from anywhere.