Sr. Site Reliability Engineer - SRE

QAD, Inc. · Remote (Spain)

Software Development

Excel

WFA Digital Insight

As demand for reliable digital services surges, companies like QAD, Inc. are seeking skilled Senior Site Reliability Engineers to ensure seamless user experiences. With a 25% increase in SRE job postings in the past year, this role is in high demand. QAD, Inc. stands out for its commitment to innovation and customer satisfaction. Before applying, candidates should be prepared to showcase their expertise in automation, problem-solving, and data-driven decision-making.

Job Description

About the Role

The Senior Site Reliability Engineer will play a critical role in ensuring the reliability, scalability, and performance of QAD, Inc.'s mission-critical services. As a member of the growing SRE function, you will drive operational excellence, shape SRE practices, and have a significant impact on how the product is operated. You will work closely with cross-functional teams to identify and eliminate toil through automation, process improvements, and systematic problem-solving.

The ideal candidate will have a strong background in operating and improving production systems at scale, with experience in defining, implementing, and using Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets to guide reliability decisions. You will be responsible for driving automation, making data-driven decisions, and fostering an SRE culture of shared ownership and continuous learning.

What You Will Do

Drive operational excellence by designing, implementing, and maintaining highly available, scalable, and resilient systems
Be a Datadog expert, defining, implementing, and enforcing best practices for monitoring, alerting, logging, tracing, and synthetic testing
Develop robust, well-tested, and maintainable software and tooling to automate operational tasks
Identify and eliminate toil through automation, process improvements, and systematic problem-solving
Contribute to and evolve the incident response framework, and participate in on-call rotations
Lead blameless post-mortems, extracting actionable insights and driving systemic improvements
Collaborate with engineering teams to define, implement, and track Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Error Budgets
Leverage and contribute to infrastructure as code (IaC) efforts, moving towards a fully automated environment
Provide SRE expertise in system design reviews, influencing architectural decisions so that reliability, observability, and scalability are built in
Document processes, build runbooks, and share expertise with both the SRE team and the broader engineering organization

What We Are Looking For

Demonstrated experience operating and improving production systems at scale in an SRE, Production Engineering, or Platform Engineering role
Proven ability to rapidly build accurate mental models of complex distributed systems
Strong troubleshooting skills with a methodical, evidence-driven approach to incident response and root cause analysis
Experience defining, implementing, and using Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets
Excellent written and verbal communication skills, with the ability to explain complex technical issues clearly
Experience across several technical domains, including Kubernetes platforms, cloud infrastructure, identity and access management systems, and networking fundamentals
Strong understanding of Excel and its applications

Nice to Have

Experience with Datadog, Terraform, and GitHub Actions
Knowledge of system design and architecture principles
Experience with infrastructure as code (IaC) efforts

Benefits and Perks

Competitive compensation package
Opportunities for professional growth and development
Collaborative and dynamic work environment
Flexible working hours and remote work options
Access to cutting-edge technologies and tools
Comprehensive health insurance and benefits package

How to Stand Out

Focus on showcasing your expertise in automation, problem-solving, and data-driven decision-making in your application and interview.
Be prepared to provide specific examples of your experience with Datadog, Terraform, and GitHub Actions.
Highlight your ability to communicate complex technical issues clearly to both technical and non-technical audiences.
Emphasize your experience with system design and architecture principles, as well as your understanding of infrastructure as code (IaC).
Research QAD, Inc.'s company culture and values to demonstrate your enthusiasm for the role and the company.
Prepare to discuss your experience with Excel and its applications in the context of the role.
Be ready to provide examples of how you have driven operational excellence and improved system reliability in previous roles.

This is a remote position listed on WFA Digital, the platform for professionals who work from anywhere.