Senior Site Reliability Engineer, Wikimedia Enterprise

Software Development

AdjustExcel

WFA Digital Insight

Demand for skilled site reliability engineers has grown significantly in recent years, driven by the need to build and maintain scalable infrastructure. As remote work becomes more common, organizations like Wikimedia are looking for experts who can keep their services highly available and reliable. With Wikimedia Enterprise aiming to revolutionize content distribution, this role offers a chance to work on a high-impact project. Before applying, candidates should be aware that the role calls for strong collaboration and communication skills and the ability to work with a distributed team.

Job Description

About the Role

The Senior Site Reliability Engineer is a key member of the team that designs, develops, and maintains the infrastructure behind Wikimedia's API services. You will be responsible for the reliability, scalability, and availability of these services, working closely with the engineering team to embed reliability best practices early in the development lifecycle and to drive the adoption of new technologies.

At the Wikimedia Foundation, you will join a distributed and diverse team of engineers with a drive to explore, experiment, and embrace new technologies. The team builds quickly, deploys often, and has a very high impact on the global knowledge ecosystem. Your work will range from designing and running infrastructure and services to participating in incident response and on-call rotations.

Wikimedia Enterprise is a new, revenue-generating product that provides fast, comprehensive, reliable, and secure data ingestion for organizations that wish to repurpose Wikimedia/Wikipedia content in third-party environments. You will play a key role in keeping this service reliable and available.

What You Will Do

Define, track, and improve Service Level Objectives (SLOs), Service Level Indicators (SLIs), and error budgets to ensure reliability targets are met
Build and enhance observability systems (metrics, logs, and distributed tracing) to enable proactive detection and faster troubleshooting
Drive reliability engineering practices, including capacity planning, load testing, and resilience validation (e.g., chaos testing)
Improve developer experience (DevEx) by enabling self-service infrastructure and streamlining deployment workflows
Partner with engineering team members to embed reliability best practices early in the development lifecycle
Design, implement, and optimize CI/CD and GitOps workflows using tools such as GitLab and ArgoCD (or similar), enabling automated, reliable deployments that support progressive delivery strategies like canary and blue-green releases
Implement secure-by-default infrastructure and enforce best practices (e.g., IAM, secrets management, encryption)
Continuously optimize infrastructure cost and efficiency using FinOps practices

What We Are Looking For

Experience with designing, building, and maintaining large-scale infrastructure and services
Strong understanding of reliability engineering principles and practices
Experience with cloud-based infrastructure (e.g., AWS, GCP, Azure) and containerization (e.g., Docker, Kubernetes)
Strong programming skills in languages such as Python, Java, or C++
Experience with CI/CD and GitOps workflows and tools such as GitLab, ArgoCD, or similar
Strong understanding of security best practices and experience with implementing secure-by-default infrastructure
Experience with incident response and being on-call
Strong communication and collaboration skills, with the ability to work with a distributed team

Nice to Have

Experience with Wikimedia's technology stack and open-source software development
Experience with chaos testing and resilience validation
Experience with FinOps and cost optimization
Experience with machine learning and artificial intelligence

Benefits and Perks

Competitive salary and benefits package
Remote work arrangement with flexible working hours
Opportunity to work on a high-impact project with a global reach
Collaborative and dynamic work environment with a distributed team
Professional development and growth opportunities
Access to the latest technologies and tools
Flexible paid time off and holidays
Health and wellness benefits, including mental health support
Remote stipend and home office setup support

How to Stand Out

Make sure to highlight your experience with reliability engineering principles and practices, as well as your ability to work with a distributed team.
Showcase your skills in designing, building, and maintaining large-scale infrastructure and services, and be prepared to provide examples of your work.
Be prepared to discuss your experience with cloud-based infrastructure, containerization, and CI/CD and GitOps workflows.
Emphasize your strong communication and collaboration skills, and highlight your ability to work with a diverse team.
Consider creating a portfolio of your work, including any open-source projects or contributions to showcase your skills and experience.
Don't be afraid to ask about the company culture and values, as well as the team's dynamics and communication style.
Be prepared to negotiate your salary and benefits package, and don't be afraid to ask about opportunities for professional development and growth.

This is a remote position listed on WFA Digital, the platform for professionals who work from anywhere.