Senior Site Reliability Engineer, Kong Konnect

Kong · Remote (Canada)

Software Development

WFA Digital Insight

The demand for skilled Site Reliability Engineers has surged, with a 27% increase in job postings over the last year. As companies like Kong continue to expand their remote workforces, the need for experts who can ensure the reliability and scalability of SaaS platforms has never been greater. With the rise of cloud computing and digital transformation, candidates with experience in Kubernetes, Terraform, and CI/CD pipelines are in high demand. Kong, a leader in API and AI connectivity, offers a unique opportunity for engineers to work on a global SaaS platform. Before applying, candidates should be prepared to showcase their experience in building and operating large-scale systems, as well as their ability to collaborate with cross-functional teams.

Job Description

About the Role

As a Senior Site Reliability Engineer at Kong, you will be responsible for building, operating, and scaling the company's multi-region SaaS platform, Kong Konnect. This platform powers the world's API connectivity, serving thousands of customers across AWS, GCP, and Azure. You will be part of the global Platform SRE team, working closely with development and security teams to ensure the smooth operation of SaaS services. Your primary focus will be on designing, automating, and running production systems, ensuring reliability, scalability, and security.

Site Reliability Engineers are critical to the success of Kong's SaaS offerings. You will work on complex systems, including multi-region Kubernetes clusters, service mesh, and gateway architectures, and your troubleshooting expertise will be essential to maintaining the platform's high availability and performance. As a senior engineer, you will also mentor junior team members and contribute to the development of best practices and standards.

Kong's SaaS platform is built on a microservices architecture, with a focus on scalability, reliability, and security. As a Senior Site Reliability Engineer, you will design and implement new features and maintain and improve existing systems, working with technologies that include Kubernetes, Terraform, and CI/CD pipelines.

What You Will Do

Operate and scale Kong's global SaaS platform, ensuring reliability, availability, and performance across regions and clouds
Build, automate, and maintain Kubernetes-based infrastructure and deployment workflows using Terraform/Terragrunt, Helm, and ArgoCD
Design, maintain, and optimize multi-region data and caching layers for high availability and low latency
Operate and improve Kong Gateway and Kong Mesh environments supporting hybrid and distributed architectures
Develop and maintain CI/CD pipelines and GitOps workflows to automate service delivery and ensure consistent infrastructure changes
Enhance observability and incident response readiness through systems like Datadog, Prometheus, Grafana, and Thanos
Collaborate closely with development and security teams to ensure smooth operation of SaaS services in compliance with reliability, security, and regulatory standards
Participate in a global 24/7 on-call rotation and drive continuous improvement of operational playbooks and postmortem practices
Lead and contribute to scaling initiatives that improve elasticity, reliability, and cost-efficiency across the SaaS platform

What We Are Looking For

BS in Computer Science or equivalent practical experience
Proven experience managing SaaS or PaaS systems at enterprise scale
Deep expertise in Kubernetes, including debugging cluster/networking issues and designing for fault tolerance and scalability
Strong proficiency with Infrastructure as Code tools like Terraform or Terragrunt
Experience with CI/CD pipelines and GitOps workflows
Proficiency in one or more programming languages for automation and tooling
Solid understanding of Linux/Unix systems, networking, load balancers, and distributed systems
Experience working with API gateway and service mesh technologies
Familiarity with streaming systems like Kafka and observability platforms

Nice to Have

Hands-on experience with Kong Gateway, Kong Mesh, or similar service connectivity technologies
Experience operating ClickHouse, Druid, or other time-series and analytics databases
Experience managing PostgreSQL and Redis in multi-region configurations
Working knowledge of AWS networking, Azure VNet, or GCP NCC
Strong understanding of disaster recovery, resiliency testing, and compliance-driven reliability practices

Benefits and Perks

Competitive salary and equity package
Comprehensive health, dental, and vision insurance
Flexible PTO and sick leave policy
Remote work stipend and home office setup support
Professional development opportunities, including conference sponsorships and training programs
Access to cutting-edge technologies and tools
Collaborative and dynamic work environment
Recognition and reward programs for outstanding performance
Flexible working hours and compressed workweek options

How to Stand Out

Tip: Showcase your experience with Kubernetes and Infrastructure as Code tools like Terraform or Terragrunt in your resume and cover letter.
Tip: Be prepared to explain your approach to troubleshooting and resolving issues in large-scale systems during the interview process.
Tip: Highlight your understanding of microservices architecture and your experience with API gateway and service mesh technologies.
Tip: Emphasize your ability to collaborate with cross-functional teams and contribute to the development of best practices and standards.
Tip: Don't be afraid to ask about the company culture, values, and expectations during the interview process to ensure you're a good fit for the role and the organization.
Tip: Consider creating a personal project or contributing to open-source projects to demonstrate your skills and passion for site reliability engineering.
Tip: Prepare to discuss your experience with CI/CD pipelines and GitOps workflows, as well as your understanding of observability and incident response readiness.

This is a remote position listed on WFA Digital, the platform for professionals who work from anywhere.