Manager, Software Engineering (Resilience Engineering)

Affirm · Remote (Canada)
Software Development
Excel

WFA Digital Insight

As demand for skilled engineers in reliability and infrastructure grows, companies like Affirm are seeking leaders to ensure the safety and reliability of their systems. With the remote job market expanding, roles that combine technical expertise with strategic vision are highly sought after. Affirm's commitment to reinventing credit and its focus on honest and friendly consumer experiences make it an attractive option for those looking to make a meaningful impact. Candidates should be prepared to leverage their experience in production load testing, chaos engineering, and system validation to drive resilience and reliability improvements.

Job Description

About the Role

The Manager, Software Engineering (Resilience Engineering) at Affirm is a critical role focused on ensuring the safety and reliability of production systems through proactive validation techniques. This includes production load testing and chaos engineering, which are essential for discovering and mitigating issues before they impact real users. As the leader of the Resilience Engineering team, you will be responsible for defining and driving the vision for resilience engineering at Affirm, with a focus on making these practices first-class engineering disciplines.

The role involves partnering with various stakeholders, including infrastructure, product, and security leadership, to embed resilience validation into the software development lifecycle. This requires establishing best practices for safely testing system limits and failure scenarios in production, which is crucial for building trust with consumers and maintaining the integrity of Affirm's services.

What You Will Do

  • Define and drive the vision for resilience engineering at Affirm, focusing on production load testing and chaos engineering.
  • Lead and mentor a team of engineers building platforms and tooling for safe production experimentation.
  • Partner with infrastructure, product, and security leadership to embed resilience validation into the software development lifecycle.
  • Establish best practices for safely testing system limits and failure scenarios in production.
  • Own the design and evolution of platforms that enable safe, controlled production load testing and fault injection.
  • Ensure strong safeguards are in place, including isolation boundaries, approval workflows, and automated rollback mechanisms.
  • Build systems that provide end-to-end observability, traceability, and auditability for all resilience experiments.
  • Drive reliability improvements by systematically identifying weaknesses through load testing and chaos experiments.
  • Establish monitoring, alerting, and incident response practices tailored to proactive resilience validation.

What We Are Looking For

  • Proven experience leading engineering teams in reliability, infrastructure, or distributed systems.
  • Hands-on experience with production load testing, chaos engineering, or large-scale system validation.
  • Experience using a chaos engineering vendor such as Gremlin, Harness, or a similar tool.
  • Strong understanding of failure modes in distributed systems, including latency, partial failure, and cascading outages.
  • Experience building or operating systems with strong safety guarantees (isolation, rate limiting, guardrails, auditability).
  • Familiarity with cloud-native technologies and Excel for data analysis.

Nice to Have

  • Experience with agile development methodologies and version control systems like Git.
  • Knowledge of containerization using Docker and orchestration with Kubernetes.
  • Familiarity with CI/CD pipelines and automation tools.
  • Certifications in relevant engineering or management disciplines.

Benefits and Perks

  • Competitive salary package.
  • Comprehensive health insurance.
  • Remote work stipend.
  • Paid time off and holidays.
  • Professional development opportunities.
  • Access to the latest technologies and tools.
  • Collaborative and dynamic work environment.

How to Stand Out

  • Highlight your experience with chaos engineering vendors and production load testing tools in your resume and cover letter.
  • Ensure your portfolio includes examples of designing and implementing resilience engineering practices in previous roles.
  • Be prepared to discuss your approach to establishing best practices for safe production experimentation during the interview.
  • Familiarize yourself with Affirm's products and mission to demonstrate your passion for the company's vision.
  • Consider taking courses or certifications in resilience engineering and chaos engineering to enhance your skills and stand out as a candidate.
  • Prepare to negotiate your salary based on your experience and the market average for similar positions.

This is a remote position listed on WFA Digital, the platform for professionals who work from anywhere.