Manager II, Engineering - Site Reliability Engineering
WFA Digital Insight
The demand for skilled site reliability engineers has grown 25% in the past year, driven by the need for robust and secure systems in the AI era. With the global shift to remote work, companies like Datadog are looking for leaders who can drive reliability and security initiatives across distributed teams. As a Manager II, Engineering at Datadog, you will have the opportunity to shape the strategic vision for reliability and security, working with a talented team of engineers and contributing to the company's mission to democratize access to observability and security. With a strong focus on employee growth and development, Datadog offers a unique opportunity for career advancement and skills growth in a rapidly evolving field.
Job Description
About the Role
The Manager II, Engineering - Site Reliability Engineering role at Datadog is a unique opportunity to lead a team of talented engineers and drive the strategic vision for reliability and security. As a leader in the site reliability engineering organization, you will be responsible for shaping the direction of reliability and security initiatives, working closely with cross-functional teams to drive organization-wide outcomes. With a focus on innovation and continuous improvement, you will have the opportunity to contribute to the development of new technologies and approaches that will help Datadog stay ahead of the curve in the rapidly evolving field of observability and security.
The site reliability engineering team at Datadog is responsible for ensuring the reliability and security of the company's systems and infrastructure. This includes developing and implementing strategies for resilience validation, chaos engineering, and penetration testing, as well as working closely with other teams to drive a culture of reliability and security across the organization. As a Manager II, Engineering, you will be responsible for leading a team of engineers and contributing to the development of the company's overall strategy for reliability and security.
What You Will Do
- Lead and mentor engineering managers, fostering their growth and development and building strong, high-performing teams across the organization
- Contribute to and advance the vision for reliability at Datadog, including adapting practices in response to the evolving impact of AI on software development
- Guide teams in defining and executing roadmaps that balance immediate impact with sustainable, long-term improvements
- Build strong cross-functional partnerships across engineering, security, and product teams to drive organization-wide reliability outcomes
- Champion a solution- and outcome-oriented approach by driving risk mitigation efforts and leading initiatives that result in measurable reliability improvements
- Collaborate with other leaders and teams to drive a culture of reliability and security across the organization
- Develop and implement strategies for resilience validation, chaos engineering, and penetration testing
- Stay up-to-date with industry trends and developments in site reliability engineering and contribute to the company's thought leadership in this area
What We Are Looking For
- Experience managing managers in dynamic, fast-evolving environments and a passion for developing leadership at multiple levels
- Deep expertise in site reliability engineering and experience working with large-scale distributed systems
- Ability to define and communicate a clear vision, aligning diverse stakeholders across a complex organization
- Exposure to security practices such as penetration testing or adversarial validation
- Ability to drive meaningful change and adjust approach based on context and organizational needs
- Strong leadership and management skills, with experience building and leading high-performing teams
- Excellent communication and collaboration skills, with the ability to work effectively with cross-functional teams
- Strong technical skills, with experience working with a range of technologies and systems
Nice to Have
- Experience working in a cloud-based environment and familiarity with cloud-based technologies
- Knowledge of machine learning and artificial intelligence and their applications in site reliability engineering
- Experience working in a rapid-growth company and familiarity with the challenges and opportunities of scaling quickly
- Familiarity with DevOps practices and tools, such as continuous integration and continuous deployment
- Experience working in a remote or distributed team environment and familiarity with the challenges and opportunities of remote work
Benefits and Perks
- Generous and competitive benefits package, including health insurance, retirement savings, and paid time off
- New hire stock equity (RSUs) and employee stock purchase plan, providing opportunities for long-term financial growth and success
- Continuous career development and pathing opportunities, including training and education programs, mentorship, and coaching
- Employee-focused best-in-class onboarding, providing a smooth transition into the company and role
- Internal mentor and cross-departmental buddy program, providing opportunities for networking and career growth
- Friendly and inclusive workplace culture, with a focus on diversity, equity, and inclusion
- Flexible and remote work arrangements, including the option to work from home and flexible hours and schedules
- Access to cutting-edge technologies and tools, including the latest software and hardware
- Opportunities for professional growth and development, including conferences, training, and education programs
- Collaborative and dynamic work environment, with a focus on teamwork, communication, and innovation
How to Stand Out
- Develop a strong understanding of site reliability engineering principles and practices, including resilience validation, chaos engineering, and penetration testing.
- Build a portfolio of your work, including examples of your experience with large-scale distributed systems and your approach to driving reliability and security initiatives.
- Prepare to talk about your experience working with cross-functional teams and driving organization-wide outcomes, including examples of your ability to collaborate and communicate effectively.
- Research the company and role, including the company's mission, values, and culture, and be prepared to talk about why you are a good fit for the role and company.
- Develop a strong understanding of the company's products and services, including the Datadog platform and its applications in observability and security.
- Be prepared to talk about your approach to leadership and management, including your experience building and leading high-performing teams and your approach to developing leadership at multiple levels.
- Highlight your ability to drive meaningful change and adjust your approach based on context and organizational needs, including examples of your experience working in dynamic, fast-evolving environments.
This is a remote position listed on WFA Digital, the platform for professionals who work from anywhere.