Manager, Software Engineering (Resilience Engineering)

Affirm · Remote (Remote US)
Software Development

WFA Digital Insight

As the demand for reliable and resilient systems grows, companies like Affirm are seeking seasoned engineering managers to lead their resilience engineering teams. With the rise of e-commerce and digital payments, the need for experts who can ensure system safety and reliability has never been more pressing. In this role, you'll have the opportunity to define and drive the vision for resilience engineering, leveraging production load testing and chaos engineering to ensure the stability of Affirm's production systems.

Job Description

About the Role

The Manager, Software Engineering leads Affirm's resilience engineering team, which is responsible for ensuring the safety and reliability of Affirm's production systems through proactive validation techniques. The ideal candidate will have proven experience leading engineering teams in reliability, infrastructure, or distributed systems, as well as hands-on experience with production load testing, chaos engineering, or large-scale system validation.

As a key member of the engineering team, you will work closely with infrastructure, product, and security leadership to embed resilience validation into the software development lifecycle. You will also establish best practices for safely testing system limits and failure scenarios in production.

What You Will Do

  • Define and drive the vision for resilience engineering at Affirm, with a focus on production load testing and chaos engineering as first-class engineering practices
  • Lead and mentor a team of engineers building platforms and tooling for safe production experimentation
  • Partner with infrastructure, product, and security leadership to embed resilience validation into the software development lifecycle
  • Establish best practices for safely testing system limits and failure scenarios in production
  • Own the design and evolution of platforms that enable safe, controlled production load testing and fault injection
  • Ensure strong safeguards are in place, including isolation boundaries, approval workflows, and automated rollback mechanisms to protect real users
  • Build systems that provide end-to-end observability, traceability, and auditability for all resilience experiments
  • Drive reliability improvements by systematically identifying weaknesses through load testing and chaos experiments
  • Establish monitoring, alerting, and incident response practices tailored to proactive resilience validation

What We Are Looking For

  • Proven experience leading engineering teams in reliability, infrastructure, or distributed systems
  • Hands-on experience with production load testing, chaos engineering, or large-scale system validation
  • Experience using a chaos engineering platform such as Gremlin or Harness, or a similar tool
  • Strong understanding of failure modes in distributed systems, including latency, partial failure, and cascading outages
  • Experience building or operating systems with strong safety guarantees (isolation, rate limiting, guardrails)
  • Excellent leadership and mentoring skills, with the ability to motivate and guide a team of engineers
  • Strong communication and collaboration skills, with the ability to work effectively with cross-functional teams

Nice to Have

  • Experience with cloud-based infrastructure and containerization (e.g., AWS, Kubernetes)
  • Familiarity with agile development methodologies and version control systems (e.g., Git)
  • Knowledge of programming languages such as Java, Python, or C++

Benefits and Perks

  • Competitive salary and equity package
  • Comprehensive health, dental, and vision insurance
  • Flexible PTO policy and remote work options
  • Access to professional development and training opportunities
  • Collaborative and dynamic work environment

How to Stand Out

  • Be prepared to discuss your experience with production load testing and chaos engineering, and how you've applied these techniques in previous roles
  • Emphasize your ability to lead and mentor a team of engineers, and provide examples of successful team management
  • Highlight your understanding of failure modes in distributed systems, and your experience with building systems with strong safety guarantees
  • Make sure your resume and online profiles are up to date and highlight your relevant experience and skills
  • Prepare to back up your claims with specific examples and metrics, such as 'Improved system reliability by 30% through targeted load testing and chaos engineering'

This is a remote position listed on WFA Digital, the platform for professionals who work from anywhere.