Staff Security Reliability Engineer
WFA Digital Insight
As demand for AI and machine learning specialists continues to soar, with a 25% increase in job postings over the past year, companies like OpenAI are looking for skilled professionals to ensure the reliability and security of their infrastructure. The role of a Staff Security Reliability Engineer is particularly crucial, requiring a deep understanding of security principles, infrastructure design, and scalable systems. With the rise of remote work, the need for robust and secure infrastructure has never been more pressing, making this role an exciting opportunity for those looking to make a real impact. Candidates should be prepared to showcase their technical expertise, as well as their ability to work collaboratively in a fast-paced environment.
Job Description
About the Role
The Staff Security Reliability Engineer plays a critical role in designing, building, and operating the reliable infrastructure that underpins OpenAI's internal services and R&D environments. This is an early, high-leverage technical role that requires a strong background in Site Reliability Engineering principles, as well as a deep understanding of security and scalability. The successful candidate will be responsible for establishing standardized infrastructure patterns, owning the lifecycle of critical infrastructure platforms, and building durable, production-grade platforms that remove operational friction and enable teams to move faster with confidence.
As a key member of the Infrastructure Engineering function, the Staff Security Reliability Engineer will work closely with cross-functional partners across security, identity, network, and platform teams to design and implement secure and scalable infrastructure solutions. This will involve collaborating with identity engineering teams to build hardened, policy-enforced infrastructure, as well as partnering with platform teams to design and implement scalable and reliable systems.
The role is based in OpenAI's San Francisco HQ and requires in-office presence, although the company values flexibility and work-life balance.
What You Will Do
- Design, build, and operate reliable infrastructure across on-prem, hybrid, shared, and product-adjacent environments
- Establish standardized infrastructure patterns that replace bespoke implementations with repeatable, auditable, secure-by-default systems
- Own the lifecycle of critical infrastructure platforms, including provisioning, deployment, upgrades, patching, recovery, and long-term reliability
- Build infrastructure-as-code and configuration management using tools such as Terraform, Chef, and Ansible
- Mature identity-adjacent and policy-enforced infrastructure, including Microsoft Entra and Azure management patterns
- Build observability, alerting, and incident response mechanisms that improve availability, recoverability, and operational confidence
- Automate high-toil and high-risk workflows with guardrails, progressive rollout patterns, and safe rollback paths
- Translate incidents, design reviews, and operational learnings into durable fixes, reusable patterns, and stronger technical standards
What We Are Looking For
- 10+ years of hands-on experience operating and architecting mission-critical infrastructure in high-reliability environments
- Experience as the senior technical owner for the design and maturation of complex on-prem, hybrid, or cloud-integrated systems
- Strong background in Site Reliability Engineering principles, with a focus on security and scalability
- Experience with infrastructure-as-code and configuration management tools such as Terraform, Chef, and Ansible
- Strong understanding of identity and access management principles, including Microsoft Entra and Azure management patterns
- Experience with automation tools and scripting languages, such as Python and PowerShell
- Strong communication and collaboration skills, with the ability to work effectively with cross-functional partners
Nice to Have
- Experience operating infrastructure for R&D or specialized labs, manufacturing, or other safety-critical environments
- Experience with fleet, endpoint, or virtual desktop platforms such as FleetDM, Chef, or Azure Virtual Desktop
- Experience with cloud-based infrastructure, including AWS and Azure
- Strong understanding of security and compliance requirements, including regulations such as GDPR and HIPAA
Benefits and Perks
- Competitive salary and benefits package
- Opportunities for career growth and professional development
- Collaborative and dynamic work environment
- Flexible working hours and remote work options
- Access to cutting-edge technology and tools
- Comprehensive health and wellness programs
- Generous PTO and holiday allowance
- Professional development budget and conference attendance opportunities
How to Stand Out
- Ensure you have a strong understanding of Site Reliability Engineering principles and how they apply to security and scalability.
- Be prepared to showcase your experience with infrastructure-as-code and configuration management tools, such as Terraform and Ansible.
- Highlight your ability to work collaboratively with cross-functional partners, including security, identity, and platform teams.
- Emphasize your experience with automation tools and scripting languages, such as Python and PowerShell.
- Be prepared to discuss your approach to incident response and how you would handle a critical infrastructure failure.
- Make sure your resume and online profiles are up to date and showcase your technical skills and experience.
- Research OpenAI's company culture and values, and be prepared to discuss how you would contribute to and thrive in this environment.
This is a remote position listed on WFA Digital, the platform for professionals who work from anywhere.