Sr. Site Reliability Engineer - SRE
WFA Digital Insight
As demand for reliable digital services surges, companies like QAD, Inc. are seeking skilled Senior Site Reliability Engineers to ensure seamless user experiences. With a 25% increase in SRE job postings in the past year, this role is in high demand. QAD, Inc. stands out for its commitment to innovation and customer satisfaction. Before applying, candidates should be prepared to showcase their expertise in automation, problem-solving, and data-driven decision-making.
Job Description
About the Role
The Senior Site Reliability Engineer will play a critical role in ensuring the reliability, scalability, and performance of QAD, Inc.'s mission-critical services. As a member of the growing SRE function, you will shape SRE practices and have a significant impact on the product's operational excellence. You will work closely with cross-functional teams to identify and eliminate toil through automation, process improvements, and systematic problem-solving.
The ideal candidate will have a strong background in operating and improving production systems at scale, with experience in defining, implementing, and using Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets to guide reliability decisions. You will be responsible for driving automation, making data-driven decisions, and fostering an SRE culture of shared ownership and continuous learning.
What You Will Do
- Drive operational excellence by designing, implementing, and maintaining highly available, scalable, and resilient systems
- Be a Datadog expert, defining, implementing, and enforcing best practices for monitoring, alerting, logging, tracing, and synthetic testing
- Develop robust, well-tested, and maintainable software and tooling to automate operational tasks
- Identify and eliminate toil through automation, process improvements, and systematic problem-solving
- Contribute to and evolve the incident response framework, and participate in on-call rotations
- Lead blameless post-mortems, extracting actionable insights and driving systemic improvements
- Collaborate with engineering teams to define, implement, and track Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets
- Leverage and contribute to infrastructure as code (IaC) efforts, moving towards a fully automated environment
- Provide SRE expertise in system design reviews, influencing architectural decisions to build in reliability, observability, and scalability
- Document processes, build runbooks, and share expertise with both the SRE team and broader engineering organization
What We Are Looking For
- Demonstrated experience operating and improving production systems at scale in an SRE, Production Engineering, or Platform Engineering role
- Proven ability to rapidly build accurate mental models of complex distributed systems
- Strong troubleshooting skills with a methodical, evidence-driven approach to incident response and root cause analysis
- Experience defining, implementing, and using Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets
- Excellent written and verbal communication skills, with the ability to explain complex technical issues clearly
- Experience across several technical domains, including Kubernetes platforms, cloud infrastructure, identity and access management systems, and networking fundamentals
- Strong understanding of Excel and its applications
Nice to Have
- Experience with Datadog, Terraform, and GitHub Actions
- Knowledge of system design and architecture principles
- Experience with infrastructure as code (IaC)
Benefits and Perks
- Competitive compensation package
- Opportunities for professional growth and development
- Collaborative and dynamic work environment
- Flexible working hours and remote work options
- Access to cutting-edge technologies and tools
- Comprehensive health insurance and benefits package
How to Stand Out
- Focus on showcasing your expertise in automation, problem-solving, and data-driven decision-making in your application and interview.
- Be prepared to provide specific examples of your experience with Datadog, Terraform, and GitHub Actions.
- Highlight your ability to communicate complex technical issues clearly to both technical and non-technical audiences.
- Emphasize your experience with system design and architecture principles, as well as your understanding of infrastructure as code (IaC).
- Research QAD, Inc.'s company culture and values to demonstrate your enthusiasm for the role and the company.
- Prepare to discuss your experience with Excel and its applications in the context of the role.
- Be ready to provide examples of how you have driven operational excellence and improved system reliability in previous roles.
This is a remote position listed on WFA Digital, the platform for professionals who work from anywhere.