Manager, Software Engineering (Resilience Engineering)

Affirm · Remote (Remote US)

Software Development

WFA Digital Insight

As demand for reliable, resilient systems grows, companies like Affirm are seeking seasoned engineering managers to lead their resilience engineering teams. With the rise of e-commerce and digital payments, system safety and reliability matter more than ever. In this role, you will define and drive the vision for resilience engineering, using production load testing and chaos engineering to keep Affirm's production systems stable.

Job Description

About the Role

The Manager, Software Engineering leads Affirm's resilience engineering team, which is responsible for ensuring the safety and reliability of Affirm's production systems through proactive validation techniques. The ideal candidate has proven experience leading engineering teams in reliability, infrastructure, or distributed systems, as well as hands-on experience with production load testing, chaos engineering, or large-scale system validation.

As a key member of the engineering team, you will work closely with infrastructure, product, and security leadership to embed resilience validation into the software development lifecycle. You will also establish best practices for safely testing system limits and failure scenarios in production.

What You Will Do

Define and drive the vision for resilience engineering at Affirm, with a focus on production load testing and chaos engineering as first-class engineering practices
Lead and mentor a team of engineers building platforms and tooling for safe production experimentation
Partner with infrastructure, product, and security leadership to embed resilience validation into the software development lifecycle
Establish best practices for safely testing system limits and failure scenarios in production
Own the design and evolution of platforms that enable safe, controlled production load testing and fault injection
Ensure strong safeguards are in place to protect real users, including isolation boundaries, approval workflows, and automated rollback mechanisms
Build systems that provide end-to-end observability, traceability, and auditability for all resilience experiments
Drive reliability improvements by systematically identifying weaknesses through load testing and chaos experiments
Establish monitoring, alerting, and incident response practices tailored to proactive resilience validation

What We Are Looking For

Proven experience leading engineering teams in reliability, infrastructure, or distributed systems
Hands-on experience with production load testing, chaos engineering, or large-scale system validation
Experience using a chaos engineering tool from a vendor such as Gremlin or Harness, or a similar product
Strong understanding of failure modes in distributed systems, including latency, partial failure, and cascading outages
Experience building or operating systems with strong safety guarantees (isolation, rate limiting, guardrails)
Excellent leadership and mentoring skills, with the ability to motivate and guide a team of engineers
Strong communication and collaboration skills, with the ability to work effectively with cross-functional teams

Nice to Have

Experience with cloud-based infrastructure and containerization (e.g., AWS, Kubernetes)
Familiarity with agile development methodologies and version control systems (e.g., Git)
Knowledge of programming languages such as Java, Python, or C++

Benefits and Perks

Competitive salary and equity package
Comprehensive health, dental, and vision insurance
Flexible PTO policy and remote work options
Access to professional development and training opportunities
Collaborative and dynamic work environment

How to Stand Out

Be prepared to discuss your experience with production load testing and chaos engineering, and how you've applied these techniques in previous roles
Emphasize your ability to lead and mentor a team of engineers, and provide examples of successful team management
Highlight your understanding of failure modes in distributed systems, and your experience with building systems with strong safety guarantees
Make sure your resume and online profiles are up to date and showcase your relevant experience and skills
Prepare to back up your claims with specific examples and metrics, such as 'Improved system reliability by 30% through targeted load testing and chaos engineering'

This is a remote position listed on WFA Digital, the platform for professionals who work from anywhere.