Staff Site Reliability Engineer — Project Volcano
WFA Digital Insight
As demand for cloud infrastructure specialists grew 25% in 2025, companies like Kong are investing heavily in reliable, scalable platforms. With Project Volcano, Kong is pioneering an internal developer platform that requires top-notch Site Reliability Engineers. Candidates should know that this role demands a unique blend of technical expertise, particularly in Kubernetes and PostgreSQL, and strategic thinking to drive the platform's reliability posture. Kong's commitment to innovation and customer satisfaction makes this an attractive opportunity for those seeking a challenging, high-impact role in the remote job market.
Job Description
About the Role
Kong is seeking a highly skilled Staff Site Reliability Engineer to join the team behind Project Volcano, an internal developer platform designed to provide on-demand preview environments, edge deployments, and managed services like PostgreSQL and auth. This strategic initiative, driven by the Office of the CTO, aims to create a scalable, reliable platform that serves all of Kong's customers. As a founding member of the SRE team for Volcano, you will have the opportunity to define the platform's reliability posture, build its SRE practice from the ground up, and ensure Volcano's infrastructure can support the demands of a growing customer base.
The role of a Staff Site Reliability Engineer at Kong is multifaceted, requiring a deep understanding of cloud infrastructure, particularly Kubernetes, and the ability to architect and implement scalable, reliable systems. You will work closely with engineering leadership to drive the platform's technical vision and collaborate with cross-functional teams to ensure the integration of reliability and compliance into Volcano's architecture.
What You Will Do
- Own reliability for Volcano end-to-end, defining and driving SLOs, error budgets, and incident response practices.
- Architect the platform's infrastructure, including multi-region Kubernetes, networking, and data plane design.
- Establish deployment automation, canary pipelines, and preview environment provisioning using ArgoCD, Helm, and Terraform/Terragrunt.
- Design, operate, and harden multi-tenant PostgreSQL clusters, Redis caching layers, and object storage, focusing on data isolation, performance, and disaster recovery.
- Drive observability from day one, instrumenting every Volcano service with meaningful SLIs and building dashboards, alerts, and runbooks using Datadog, Prometheus, and Grafana.
- Lead cross-functional reliability work, collaborating with the OCTO team, product engineering, and security to integrate reliability and compliance into Volcano's architecture.
- Set SRE culture and standards, mentoring engineers on reliability principles, leading postmortems, defining on-call practices, and fostering a blameless engineering culture.
- Evaluate and adopt emerging technologies, making architectural decisions on edge runtimes, serverless compute, vector databases, and AI-native infrastructure components.
What We Are Looking For
- BS in Computer Science or equivalent, with substantial experience at the Staff or Principal IC level in SRE/Platform Engineering.
- Proven track record of building SRE or platform engineering practices for developer-facing platforms or PaaS/SaaS products, ideally at the greenfield stage.
- Deep Kubernetes expertise, including multi-tenant cluster design, networking, autoscaling, and security hardening.
- Experience with PostgreSQL, Redis, and object storage, as well as data isolation, performance, and disaster recovery strategies.
- Strong understanding of observability tools like Datadog, Prometheus, and Grafana, and the ability to instrument services for monitoring and alerting.
- Excellent collaboration and leadership skills, with the ability to mentor engineers and drive cross-functional projects.
Nice to Have
- Experience with edge runtimes, serverless compute, vector databases, and AI-native infrastructure components.
- Familiarity with ArgoCD, Helm, and Terraform/Terragrunt for deployment automation and infrastructure management.
- Knowledge of security practices and compliance standards relevant to cloud infrastructure and SaaS products.
Benefits and Perks
- Competitive salary range: 40K-97K.
- Opportunity to work on a high-impact, high-visibility project that is central to Kong's growth strategy.
- Collaborative, dynamic work environment with a team of experienced engineers and leaders.
- Access to cutting-edge technologies and tools, with the freedom to innovate and experiment.
- Flexible, remote work arrangements, with a stipend for home office setup and ongoing support for remote work.
- Comprehensive health insurance, retirement savings plans, and paid time off to support work-life balance.
How to Stand Out
- Highlight your experience with Kubernetes, PostgreSQL, and observability tools like Datadog and Prometheus in your resume and cover letter.
- Ensure you have a strong understanding of SRE principles and practices, including SLOs, error budgets, and incident response.
- Be prepared to discuss your approach to building and maintaining reliable, scalable systems, and how you stay current with emerging technologies.
- Showcase your ability to collaborate with cross-functional teams and drive technical vision and strategy.
- Research Kong's products and services, and be ready to discuss how your skills and experience align with the company's goals and mission.
- Consider creating a portfolio or repository of your work, especially if you have experience with open-source projects or personal initiatives related to SRE or cloud infrastructure.
This is a remote position listed on WFA Digital, the platform for professionals who work from anywhere.