Senior Site Reliability Engineer

Transcend · Remote (United States)

Software Development

Adjust, Excel

WFA Digital Insight

Demand for skilled Site Reliability Engineers has surged, with the industry anticipating 25% growth over the next two years. As companies like Transcend continue to expand their digital presence, the need for professionals who can ensure the reliability, scalability, and performance of their infrastructure has never been more pressing. With the rise of remote work, the ability to manage and improve system reliability from any location has become a highly sought-after skill. Transcend, known for its innovative approach to privacy infrastructure, is now looking for a Senior Site Reliability Engineer to join its team, offering candidates an opportunity to make a significant impact in the field. Before applying, candidates should be aware of the importance of staying current with cloud infrastructure technologies and the value of effective communication in a remote work setting.

Job Description

About the Role

The Senior Site Reliability Engineer position at Transcend is a critical role that focuses on ensuring the reliability, scalability, and performance of the company's privacy infrastructure. As a seasoned technical leader, the successful candidate will partner closely with various teams including Product Engineering, Security, and Developer Experience to design, operate, and continuously improve the systems that Transcend's customers depend on daily. This role is remote, full-time, and based in the United States, requiring the candidate to have valid work authorization, as visa sponsorship is not available. The role reports directly to the Director of Information Systems and Head of Security.

The role entails leading reliability-focused design and readiness reviews for new and existing services, ensuring production readiness, and developing clear rollout and rollback strategies. It also involves building, operating, and continuously improving the observability stack to provide meaningful dashboards, alerts, and runbooks that enable fast, high-quality incident response across engineering teams.

As part of the team, the Senior Site Reliability Engineer will help define SRE practices, lead cross-team reliability initiatives, and turn incidents and risk analyses into durable improvements that keep the platform resilient as it grows. This involves collaborating closely with Developer Experience, Security, and product engineering teams to embed reliability best practices into shared tools and CI/CD pipelines.

What You Will Do

Lead reliability-focused design and readiness reviews for new and existing services, ensuring production readiness and clear rollout and rollback strategies.
Build, operate, and continuously improve the observability stack to provide meaningful dashboards, alerts, and runbooks.
Own and evolve incident management practices, including on-call participation, incident response processes, and post-incident reviews.
Plan and execute disaster recovery exercises and game days to validate resilience posture and test failover and backup strategies.
Perform capacity planning and cost optimization for cloud infrastructure, ensuring a cost-effective environment that meets performance and availability goals.
Identify and drive down systemic reliability risks across application, infrastructure, and process layers, owning cross-team projects to reduce incident frequency and severity over time.
Collaborate closely with Developer Experience, Security, and product engineering to embed reliability best practices into shared tools and CI/CD pipelines.
Participate in and help continuously improve the on-call rotation, using real incidents and near-misses to prioritize automation, better alerting, and clearer documentation.

What We Are Looking For

Required

5+ years of experience in Site Reliability Engineering, Production Engineering, Infrastructure Engineering, or a closely related role, including hands-on ownership of production systems.
Strong experience operating modern cloud infrastructure, ideally on AWS, including core services for compute, networking, storage, and security primitives.
Proficiency with at least one programming language used at Transcend (e.g., JavaScript, TypeScript, or Python), and comfort reading and reviewing application code for reliability and performance concerns.
Hands-on experience with infrastructure-as-code and CI/CD tooling (e.g., Terraform, CloudFormation, or similar; modern build/deploy pipelines) to reliably provision and change infrastructure.
Deep familiarity with observability and monitoring systems (e.g., Datadog or equivalent), including designing alerts that balance coverage and noise to avoid alert fatigue while protecting customer experience.
Proven track record running incident response and post-incident analysis, including root cause identification, clear documentation, and driving follow-through on remediation work.
Excellent communication and collaboration skills, with experience working across multiple engineering teams to align on reliability goals, share context, and influence technical direction without formal authority.

Nice to Have

Experience with Adjust and Excel.
Knowledge of cloud security best practices and compliance frameworks.
Experience with containerization (e.g., Docker) and orchestration (e.g., Kubernetes).

Benefits and Perks

Competitive compensation package.
Opportunities for professional growth and development in a rapidly expanding company.
Collaborative and dynamic work environment with a team of experienced professionals.
Flexible remote work arrangement, allowing for a better work-life balance.
Access to the latest technologies and tools to support your work.

How to Stand Out

Ensure you have a strong foundation in cloud infrastructure, particularly AWS, and experience with infrastructure-as-code tools like Terraform.
Develop a portfolio that showcases your ability to lead reliability-focused design and operations, including examples of incident management and post-incident analysis.
Highlight your proficiency in programming languages such as JavaScript, TypeScript, or Python, and your experience with observability and monitoring systems like Datadog.
Prepare to discuss your approach to capacity planning and cost optimization, and how you stay updated with the latest in cloud technology.
Be ready to share your experience with on-call rotations and how you contribute to improving incident response processes and documentation.
Consider emphasizing your excellent communication and collaboration skills, and how you influence technical direction across teams without formal authority.

This is a remote position listed on WFA Digital, the platform for professionals who work from anywhere.