Senior Site Reliability Engineer

Wikimedia · Remote
Software Development

WFA Digital Insight

In the current remote job market, demand for skilled site reliability engineers has surged, driven by the need for seamless digital experiences. With the rise of distributed systems and cloud infrastructure, organizations like Wikimedia are looking for experts who can ensure high availability and scalability. As a senior SRE at Wikimedia, you'll be part of a globally distributed team working on one of the world's most visited websites. Before applying, consider your experience with automation, distributed caching systems, and open-source software, as well as your ability to thrive in a remote-first environment.

Job Description

About the Role

The Senior Site Reliability Engineer role at Wikimedia is a unique opportunity to join a team of passionate engineers dedicated to ensuring the reliability and scalability of one of the world's most beloved websites. As a senior member of the team, you will be responsible for the day-to-day operations of Wikimedia's public-facing infrastructure, working closely with product teams to bring new features and services to users. The team is globally distributed, and you will be working in an asynchronous communication environment, collaborating with colleagues across multiple time zones.

Wikimedia's SRE team is committed to working in the open, publishing all documentation, code, and configuration as open source. This approach not only reflects the organization's values but also ensures that the team is always learning and improving. As a senior SRE, you will be expected to mentor peers, share knowledge, and contribute to the growth and development of the team.

The role entails a mix of operational, technical, and collaborative work. You will perform day-to-day DevOps tasks, implement and use configuration management tools, and lead continuous improvement initiatives. You will also work closely with product teams to design and deploy new services, ensuring that they are scalable and reliable.

What You Will Do

  • Perform day-to-day operational and DevOps tasks on Wikimedia's public-facing infrastructure, including deployment, maintenance, configuration, and troubleshooting.
  • Implement and use configuration management and deployment tools, such as Puppet and Kubernetes.
  • Lead continuous improvement initiatives, automating the installation, configuration, and maintenance of services on the platform.
  • Work closely with product teams to design and deploy new services, ensuring they are scalable and reliable.
  • Participate in a 24/7 on-call rotation, responding to incidents, diagnosing issues, and following up on system outages or alerts.
  • Collaborate with a global, cross-functional team in an asynchronous communication environment.
  • Mentor peers in areas of technical and operational strength.
  • Develop and maintain documentation, ensuring that knowledge is shared across the team.
  • Stay up to date with the latest technologies and trends, applying this knowledge to improve the reliability and scalability of the platform.

What We Are Looking For

  • 6+ years of experience in an SRE, Operations, or DevOps role, preferably in a team environment.
  • Experience with scripting and programming languages, such as Python, Go, Bash, or Ruby.
  • Familiarity with configuration management tools, such as Puppet or Ansible.
  • Strong Linux system-level troubleshooting skills.
  • Experience with distributed caching systems, including their underlying algorithms and performance optimization.
  • History of automating tasks and processes, identifying gaps, and finding opportunities for improvement.
  • Strong English language skills, both verbal and written, with the ability to work independently and as part of a global team.
  • Experience with package management on Linux systems, preferably Debian.
  • Ability to travel 1-2 times a year for in-person events and team meetings.
  • Alignment with Wikimedia's values and a commitment to working in accordance with them.

Nice to Have

  • Experience with cloud infrastructure, such as AWS or GCP.
  • Familiarity with containerization technologies, such as Docker.
  • Knowledge of security best practices and experience with security tools and technologies.
  • Experience with monitoring and logging tools, such as Prometheus and Grafana.

Benefits and Perks

  • Remote work arrangement, with the flexibility to work from anywhere.
  • Competitive compensation package, reflecting your skills and experience.
  • Opportunities for professional growth and development, including training and mentorship.
  • Collaboration with a talented and dedicated team of engineers.
  • The chance to work on a high-impact project, contributing to the mission of making knowledge accessible to everyone.
  • Access to the latest technologies and tools, ensuring you stay up to date with industry trends.
  • A healthy work-life balance, with flexible working hours and generous PTO.

How to Stand Out

  • Highlight your automation skills: Showcase your experience with automation tools and scripts, and explain how you've applied them to improve efficiency in previous roles.
  • Be ready to talk about distributed systems: Make sure you can discuss your work with distributed caching systems, including how you've optimized their performance and handled scalability challenges.
  • Emphasize your collaboration and communication skills: As a remote worker, you'll need to be able to work effectively with a global team, so highlight your experience with asynchronous communication and collaboration tools.
  • Prepare to discuss your approach to incident response: Be ready to walk through your process for responding to incidents, including how you diagnose issues, communicate with the team, and implement preventive measures.
  • Showcase your knowledge of open-source technologies: Wikimedia is committed to working in the open, so demonstrate your familiarity with open-source software and your willingness to contribute to the community.
  • Be prepared to discuss your long-term career goals: Wikimedia is looking for candidates who are committed to the organization's mission and values, so be ready to discuss your long-term career aspirations and how they align with the organization's goals.

This is a remote position listed on WFA Digital, the platform for professionals who work from anywhere.