Senior Site Reliability Engineer, Infrastructure Foundations

Software Development

WFA Digital Insight

As demand for reliable online platforms grows, so does the need for skilled site reliability engineers. With a 25% increase in remote IT jobs in 2025, Wikimedia's remote Senior Site Reliability Engineer role stands out. The position calls for a blend of technical expertise and collaboration skills: candidates should be prepared to work in a fast-paced, open-source environment and should have a strong understanding of infrastructure security and technical response during security incidents. As remote work grows, organizations like Wikimedia are looking for people who can work both independently and as part of a global team.

Job Description

About the Role

The Senior Site Reliability Engineer role at Wikimedia is a critical position that ensures the reliability and performance of the organization's infrastructure. As a member of the Site Reliability Engineering (SRE) team, you will be responsible for the day-to-day operations of Wikimedia's public-facing infrastructure, including deployment, maintenance, configuration, and troubleshooting. The SRE team is a globally distributed and diverse team of engineers who work in the open, publishing all documentation, code, and configuration as open source.

The SRE team at Wikimedia is responsible for keeping the organization's global top-10 website and its underlying infrastructure healthy and for continuing to develop them in support of Wikimedia's mission. The team works closely with product teams to bring scalable functionality to users, assisting in the architectural design of new services and making them operate at scale.

What You Will Do

Perform day-to-day operational/DevOps tasks on Wikimedia's public-facing infrastructure, including deployment, maintenance, configuration, and troubleshooting.
Implement and utilize configuration management and deployment tools, such as Puppet and Kubernetes.
Lead continuous improvement by automating the installation, configuration, and maintenance of services on the platform.
Work closely with product teams to help them bring scalable functionality to users by assisting in the architectural design of new services and making them operate at scale.
Participate in a 24/7 on-call rotation shared across the broader SRE team, including incident response, diagnosis, and follow-up on system outages or alerts across Wikimedia's production infrastructure.
Collaborate with a global, cross-functional team in an asynchronous communication environment.
Mentor peers in areas of technical and operational strength.
Travel 1-2 times a year for in-person events and team meetings.
Participate in incident response and post-incident review rituals, conducting root cause analysis and implementing preventive measures.
Design and manage infrastructure security for large fleets of diverse services.

What We Are Looking For

6+ years of experience in an SRE/Operations/DevOps role as part of a team.
Experience with shell scripting and with programming languages commonly used in an SRE context, such as Bash, Python, Go, or Ruby, and with configuration management tools like Puppet or Ansible.
Experience designing and managing infrastructure security for large fleets of diverse services.
Experience with technical response during security incidents.
Experience with package management on Linux systems, such as Debian.
Strong Linux system-level troubleshooting skills.
History of automating tasks and processes, identifying process gaps, and finding automation opportunities.
Strong English language skills (verbal and written) and ability to work independently as an effective part of a globally distributed team working across multiple time zones.
Experience leading and participating in incident response and post-incident review rituals, with the goal of conducting root cause analysis and implementing preventive measures.

Nice to Have

Experience setting and implementing fleet-wide security policies.
Experience with cloud infrastructure, such as AWS or GCP.
Experience with containerization and container orchestration, such as Docker or Kubernetes.
Experience with monitoring and logging tools, such as Prometheus or Grafana.

Benefits and Perks

Opportunity to work on a global, top-10 website and contribute to the Wikimedia mission.
Collaborative, dynamic work environment with a globally distributed team.
Professional development opportunities, including training and conference attendance.
Flexible, remote work arrangement with a stipend for remote work expenses.
Comprehensive health insurance and retirement plan.
Paid time off and holidays.
Access to the latest technologies and tools.

How to Stand Out

Develop a strong understanding of infrastructure security and technical response during security incidents.
Familiarize yourself with configuration management tools like Puppet and Ansible, and with languages like Python or Go.
Highlight your experience with Linux system-level troubleshooting and automation of tasks and processes.
Be prepared to provide examples of your experience with incident response and post-incident review rituals.
Showcase your ability to work independently and collaboratively in a remote environment.
Prepare to discuss your experience with cloud infrastructure, containerization, and monitoring and logging tools.
Research Wikimedia's values and mission to demonstrate your alignment with the organization's goals.

This is a remote position listed on WFA Digital, the platform for professionals who work from anywhere.