Manager Site Reliability Operations
WFA Digital Insight
As demand for reliable digital services grows, companies such as Mercury Insurance are looking for skilled professionals to lead their Site Reliability Operations teams. With the rise of remote work, keeping digital platforms running smoothly has become more pressing, and demand for site reliability engineering skills is expected to keep growing over the next few years. To succeed in this role, candidates need a strong background in computer science, engineering, or a related field, along with experience in observability tools, incident response, and leadership. Before applying, candidates should be prepared to showcase their technical skills and their ability to lead and collaborate with cross-functional teams.
Job Description
About the Role
The Manager of Site Reliability Operations at Mercury Insurance is a critical role that leads the team responsible for end-to-end observability, real-time monitoring, and operational response across the company's production and non-production platforms. The successful candidate will have a strong technical background, excellent leadership skills, and the ability to collaborate with cross-functional teams. The role is part of the Technology Operations team and will report to a senior leader in the organization.
Day-to-day, the Manager of Site Reliability Operations will be responsible for ensuring that services are well-instrumented, that alerts are actionable and tuned, and that root cause analysis and corrective actions are consistently executed. The role will also involve partnering with application development, DevOps, and infrastructure teams to build release and runtime practices that are observable by design, provide real-time operational support during deployments, and use data-driven insights and automation to continuously improve system resilience, change success rates, and time to recovery.
The ideal candidate will have a strong background in computer science, engineering, or a related field, as well as experience with observability tools, incident response, and leadership. They will also have excellent communication and collaboration skills, with the ability to work effectively with cross-functional teams.
What You Will Do
- Lead the Site Reliability Operations team, including the Network Operations Center (NOC), to ensure observability, real-time monitoring, and operational excellence for key enterprise services
- Partner with Product Management, Engineering, and other teams to embed CI/CD and release best practices into operations
- Oversee service reliability monitoring and incident management, ensuring appropriate observability, well-tuned alerting thresholds, escalation paths, and effective communications to stakeholders and leadership during incidents
- Drive root cause analysis of recurring or high-severity incidents, standardize post-incident reviews, and ensure corrective actions and follow-ups are implemented and verified
- Define, track, and report operational and reliability metrics, providing regular insights and recommendations to Technology Operations leadership
- Champion automation and "operations as code" (infrastructure as code, configuration as code, automated runbooks), working with engineering teams to reduce manual toil and improve consistency, speed, and safety of operations and releases
- Recruit, develop, coach, and evaluate team members, providing performance feedback, making salary and promotion recommendations, and fostering a high-performing, collaborative culture
- Provide leadership coverage for 7x24 mission-critical support through the NOC and on-call rotations, ensuring sustainable on-call practices, high-quality runbooks, and continuous improvement of tooling and processes
What We Are Looking For
- Bachelor's degree in Computer Science, Information Systems, Engineering, or a related field
- Minimum 5 years of experience in a related field, with at least 2 years of experience in a leadership role
- Strong technical background, including experience with observability tools, incident response, and automation
- Excellent leadership and collaboration skills, with the ability to work effectively with cross-functional teams
- Strong communication and problem-solving skills, with the ability to analyze complex problems and develop creative solutions
- Experience with CI/CD, release best practices, automation, and "operations as code"
- Strong understanding of IT service management principles and practices, including ITIL
Nice to Have
- Experience with cloud-based services, such as AWS or Azure
- Experience with containerization, such as Docker
- Experience with automation tools, such as Ansible or Puppet
- Certification in ITIL or a related field
Benefits and Perks
- Competitive salary and benefits package
- Opportunity to work with a leading insurance company
- Collaborative and dynamic work environment
- Professional development and growth opportunities
- Flexible working hours and remote work options
- Access to cutting-edge technology and tools
- Recognition and rewards for outstanding performance
How to Stand Out
- Make sure to highlight your experience with observability tools and incident response in your resume and cover letter.
- Be prepared to provide specific examples of times when you had to analyze complex problems and develop creative solutions.
- Show your passion for automation and "operations as code" and explain how you have implemented these principles in your previous roles.
- Emphasize your ability to collaborate with cross-functional teams and communicate effectively with stakeholders.
- Be ready to discuss your experience with CI/CD and release best practices, as well as your understanding of IT service management principles and practices.
- Research the company culture and values and be prepared to explain why you are a good fit for the organization.
This is a remote position listed on WFA Digital, the platform for professionals who work from anywhere.