Senior Site Reliability Engineer, Kong Konnect
WFA Digital Insight
The demand for skilled Site Reliability Engineers has surged, with a 27% increase in job postings over the last year. As companies like Kong continue to expand their remote workforces, the need for experts who can ensure the reliability and scalability of SaaS platforms has never been greater. With the rise of cloud computing and digital transformation, candidates with experience in Kubernetes, Terraform, and CI/CD pipelines are in high demand. Kong, a leader in API and AI connectivity, offers a unique opportunity for engineers to work on a global SaaS platform. Before applying, candidates should be prepared to showcase their experience in building and operating large-scale systems, as well as their ability to collaborate with cross-functional teams.
Job Description
About the Role
As a Senior Site Reliability Engineer at Kong, you will be responsible for building, operating, and scaling the company's multi-region SaaS platform, Kong Konnect. This platform powers the world's API connectivity, serving thousands of customers across AWS, GCP, and Azure. You will be part of the global Platform SRE team, working closely with development and security teams to ensure the smooth operation of SaaS services. Your primary focus will be on designing, automating, and running production systems, ensuring reliability, scalability, and security.
The role of a Site Reliability Engineer is critical to the success of Kong's SaaS offerings. You will be working on complex systems, including multi-region Kubernetes clusters, service mesh, and gateway architectures. Your expertise in troubleshooting and resolving issues will be essential in maintaining the high availability and performance of the platform. As a senior engineer, you will also be responsible for mentoring junior team members and contributing to the development of best practices and standards.
Kong's SaaS platform is built on a microservices architecture, with a focus on scalability, reliability, and security. As a Senior Site Reliability Engineer, you will be working on the design and implementation of new features, as well as the maintenance and improvement of existing systems. You will have the opportunity to work with a range of technologies, including Kubernetes, Terraform, and CI/CD pipelines.
What You Will Do
- Operate and scale Kong's global SaaS platform, ensuring reliability, availability, and performance across regions and clouds
- Build, automate, and maintain Kubernetes-based infrastructure and deployment workflows using Terraform/Terragrunt, Helm, and ArgoCD
- Design, maintain, and optimize multi-region data and caching layers for high availability and low latency
- Operate and improve Kong Gateway and Kong Mesh environments supporting hybrid and distributed architectures
- Develop and maintain CI/CD pipelines and GitOps workflows to automate service delivery and ensure consistent infrastructure changes
- Enhance observability and incident response readiness through systems like Datadog, Prometheus, Grafana, and Thanos
- Collaborate closely with development and security teams to ensure smooth operation of SaaS services in compliance with reliability, security, and regulatory standards
- Participate in a global 24/7 on-call rotation and drive continuous improvement of operational playbooks and postmortem practices
- Lead and contribute to scaling initiatives that improve elasticity, reliability, and cost-efficiency across the SaaS platform
What We Are Looking For
- BS in Computer Science or equivalent practical experience
- Proven experience managing SaaS or PaaS systems at enterprise scale
- Deep expertise in Kubernetes, including debugging cluster/networking issues and designing for fault tolerance and scalability
- Strong proficiency with Infrastructure as Code tools like Terraform or Terragrunt
- Experience with CI/CD pipelines and GitOps workflows
- Proficiency in one or more programming languages for automation and tooling
- Solid understanding of Linux/Unix systems, networking, load balancers, and distributed systems
- Experience working with API gateway and service mesh technologies
- Familiarity with streaming systems like Kafka and observability platforms
Nice to Have
- Hands-on experience with Kong Gateway, Kong Mesh, or similar service connectivity technologies
- Experience operating ClickHouse, Druid, or other time-series and analytics databases
- Experience managing PostgreSQL and Redis in multi-region configurations
- Working knowledge of AWS networking, Azure VNet, or GCP NCC
- Strong understanding of disaster recovery, resiliency testing, and compliance-driven reliability practices
Benefits and Perks
- Competitive salary and equity package
- Comprehensive health, dental, and vision insurance
- Flexible PTO and sick leave policy
- Remote work stipend and home office setup support
- Professional development opportunities, including conference sponsorships and training programs
- Access to cutting-edge technologies and tools
- Collaborative and dynamic work environment
- Recognition and reward programs for outstanding performance
- Flexible working hours and compressed workweek options
How to Stand Out
- Tip: Showcase your experience with Kubernetes and Infrastructure as Code tools like Terraform or Terragrunt in your resume and cover letter.
- Tip: Be prepared to explain your approach to troubleshooting and resolving issues in large-scale systems during the interview process.
- Tip: Highlight your understanding of microservices architecture and your experience with API gateway and service mesh technologies.
- Tip: Emphasize your ability to collaborate with cross-functional teams and contribute to the development of best practices and standards.
- Tip: Don't be afraid to ask about the company culture, values, and expectations during the interview process to ensure you're a good fit for the role and the organization.
- Tip: Consider creating a personal project or contributing to open-source projects to demonstrate your skills and passion for site reliability engineering.
- Tip: Prepare to discuss your experience with CI/CD pipelines and GitOps workflows, as well as your understanding of observability and incident response readiness.
This is a remote position listed on WFA Digital, the platform for professionals who work from anywhere.