Senior Site Reliability Engineer, Wikimedia Enterprise

Wikimedia · Remote
Software Development

WFA Digital Insight

Demand for skilled site reliability engineers has grown significantly in recent years, with a focus on building and maintaining scalable infrastructure. As remote work continues to rise, companies like Wikimedia are looking for experts who can ensure high availability and reliability of their services. With Wikimedia Enterprise aiming to revolutionize content distribution, this role offers a chance to work on a high-impact project. Before applying, candidates should be aware that the role calls for strong collaboration and communication skills and the ability to work with a distributed team.

Job Description

About the Role

The Senior Site Reliability Engineer at Wikimedia is a key member of the team that designs, develops, and maintains the infrastructure behind its API services. As a senior member of the team, you will be responsible for the reliability, scalability, and availability of these services. You will work closely with the engineering team to embed reliability best practices early in the development lifecycle and drive the adoption of new technologies.

The Wikimedia Foundation team is a distributed and diverse group of engineers with a drive to explore, experiment, and embrace new technologies. The team builds quickly, deploys often, and has a very high impact on the global knowledge ecosystem. Your work will range from designing and running infrastructure and services to participating in incident response and being on-call.

Wikimedia Enterprise is a new, revenue-generating product that provides fast, comprehensive, reliable, and secure data ingestion for organizations that wish to repurpose Wikimedia/Wikipedia content in third-party environments. As a senior site reliability engineer, you will play a key role in keeping this service reliable and available.

What You Will Do

  • Define, track, and improve Service Level Objectives (SLOs), Service Level Indicators (SLIs), and error budgets to ensure reliability targets are met
  • Build and enhance observability systems (metrics, logs, and distributed tracing) to enable proactive detection and faster troubleshooting
  • Drive reliability engineering practices, including capacity planning, load testing, and resilience validation (e.g., chaos testing)
  • Improve developer experience (DevEx) by enabling self-service infrastructure and streamlining deployment workflows
  • Partner with engineering team members to embed reliability best practices early in the development lifecycle
  • Design, implement, and optimize CI/CD and GitOps workflows using tools such as GitLab (or similar) and ArgoCD (or similar), enabling automated, reliable deployments with support for progressive delivery strategies like canary and blue-green releases
  • Implement secure-by-default infrastructure and enforce best practices (e.g., IAM, secrets management, encryption)
  • Continuously optimize infrastructure cost and efficiency using FinOps practices

What We Are Looking For

  • Experience with designing, building, and maintaining large-scale infrastructure and services
  • Strong understanding of reliability engineering principles and practices
  • Experience with cloud-based infrastructure (e.g., AWS, GCP, Azure) and containerization (e.g., Docker, Kubernetes)
  • Strong programming skills in languages such as Python, Java, or C++
  • Experience with CI/CD and GitOps workflows and tools such as GitLab, ArgoCD, or similar
  • Strong understanding of security best practices and experience with implementing secure-by-default infrastructure
  • Experience with incident response and being on-call
  • Strong communication and collaboration skills, with the ability to work with a distributed team

Nice to Have

  • Experience with Wikimedia's technology stack and open-source software development
  • Experience with chaos testing and resilience validation
  • Experience with FinOps and cost optimization
  • Experience with machine learning and artificial intelligence

Benefits and Perks

  • Competitive salary and benefits package
  • Remote work arrangement with flexible working hours
  • Opportunity to work on a high-impact project with a global reach
  • Collaborative and dynamic work environment with a distributed team
  • Professional development and growth opportunities
  • Access to the latest technologies and tools
  • Flexible paid time off and holidays
  • Health and wellness benefits, including mental health support
  • Remote stipend and home office setup support

How to Stand Out

  • Make sure to highlight your experience with reliability engineering principles and practices, as well as your ability to work with a distributed team.
  • Showcase your skills in designing, building, and maintaining large-scale infrastructure and services, and be prepared to provide examples of your work.
  • Be prepared to discuss your experience with cloud-based infrastructure, containerization, and CI/CD and GitOps workflows.
  • Emphasize your strong communication and collaboration skills, and highlight your ability to work with a diverse team.
  • Consider creating a portfolio of your work, including any open-source projects or contributions to showcase your skills and experience.
  • Don't be afraid to ask about the company culture and values, as well as the team's dynamics and communication style.
  • Be prepared to negotiate your salary and benefits package, and don't be afraid to ask about opportunities for professional development and growth.

This is a remote position listed on WFA Digital, the platform for professionals who work from anywhere.