Senior Site Reliability Engineer - Hiring Sprint

Airbyte · Remote (San Francisco)

Software Development

WFA Digital Insight

The pandemic has accelerated digital transformation, with 75% of companies now relying on remote data integration. As demand for skilled tech professionals grows, Airbyte stands out as a pioneer in data movement and AI-powered solutions. With a strong focus on innovation and collaboration, this role offers a chance to work with a cutting-edge tech stack and contribute to the development of AI-driven tools. Before applying, candidates should be aware of the importance of hands-on experience with Kubernetes, Terraform, and observability stacks, as well as a willingness to learn and adapt to new technologies. The current remote job market favors candidates with expertise in cloud infrastructure, DevOps, and AI-powered automation, with a 25% increase in job postings for similar roles in the past quarter.

Job Description

About the Role

As a Senior Site Reliability Engineer at Airbyte, you will play a critical role in ensuring the reliability and efficiency of the company's data replication platform. The Data Replication team behind it is a full-stack product team, and the platform runs over 3 million sync jobs a week, powering thousands of data use cases across multiple regions and clouds. You will be responsible for building and maintaining the infrastructure underpinning this platform, partnering with product engineers to integrate product features with infrastructure, and maintaining and enhancing observability, alerting, and anomaly detection.

The Data Replication team is a collaborative and innovative group, with a strong focus on using AI as a force multiplier to automate toil, augment incident response, and build smarter internal tooling. As a Senior Site Reliability Engineer, you will be expected to actively use AI tools to improve the reliability and efficiency of the platform.

The role reports to the Engineering Manager, and you will work closely with the product engineers, DevOps team, and other stakeholders to ensure the smooth operation of the platform.

What You Will Do

Own the infrastructure underpinning the Data Replication platform, including Kubernetes clusters, CI/CD pipelines, secrets management, networking, and cloud resource configuration across AWS and GCP.
Partner with product engineers to reliably integrate product features with infrastructure.
Maintain and enhance observability, alerting, and anomaly detection with an eye towards LLM automation.
Maintain and enhance AI-augmented release and internal tooling, including canary deployments, progressive rollouts, automated release qualification, and rollback automation.
Set the infrastructure bar for the team, building self-serve tooling, writing runbooks, and coaching engineers to own more of their stack.
Collaborate with the DevOps team to ensure the smooth operation of the platform.
Participate in on-call operations, including responding to incidents and resolving issues.
Continuously monitor and improve the performance and reliability of the platform.
Develop and maintain documentation for the platform, including infrastructure diagrams, technical guides, and troubleshooting guides.

What We Are Looking For

7+ years of experience in infrastructure, platform engineering, SRE, or DevOps.
Hands-on ownership of Kubernetes, Helm, and Terraform in production environments.
Deep experience with observability stacks, including Prometheus, Grafana, and Datadog.
Experience with CI/CD pipeline ownership and developer tooling.
Ability and willingness to read backend code to understand how systems break and instrument them correctly.
Fluency with AI tools, including LLMs and agentic frameworks, used to automate work, debug faster, and reduce toil.
Strong understanding of cloud infrastructure, including AWS and GCP.
Experience with Agile development methodologies and version control systems, including Git.
Strong communication and collaboration skills, with the ability to work effectively with cross-functional teams.

Nice to Have

Experience with containerization, including Docker and rkt.
Knowledge of security best practices, including compliance and risk management.
Experience with IT service management, including incident management and problem management.
Familiarity with DevSecOps practices and tools, including security testing and vulnerability management.

Benefits and Perks

Competitive salary and equity package.
Comprehensive health insurance, including medical, dental, and vision.
Flexible PTO policy, including vacation days, sick leave, and holidays.
Remote work stipend, including reimbursement for home office expenses and internet connectivity.
Professional development opportunities, including training, mentorship, and conference attendance.
Access to cutting-edge technology and tools, including AI-powered automation and DevOps platforms.
Collaborative and dynamic work environment, with a strong focus on innovation and teamwork.

How to Stand Out

Highlight your experience with Kubernetes, Terraform, and observability stacks in your resume and cover letter.
To stand out, showcase your ability to use AI tools to automate toil, augment incident response, and build smarter internal tooling.
Be prepared to discuss your experience with CI/CD pipelines, developer tooling, and cloud infrastructure during the interview process.
If you have experience with containerization, security best practices, or IT service management, be sure to highlight these skills in your application.
During the interview, ask questions about the company culture, team dynamics, and opportunities for growth and development.
Be prepared to provide examples of your experience with Agile development methodologies, version control systems, and collaboration tools.
Show enthusiasm for learning and adapting to new technologies, including AI-powered automation and DevOps platforms.

This is a remote position listed on WFA Digital, the platform for professionals who work from anywhere.