Staff Security Reliability Engineer

OpenAI · Remote (San Francisco)

Software Development

WFA Digital Insight

As demand for AI and machine learning specialists continues to soar, with a 25% increase in job postings over the past year, companies like OpenAI are looking for skilled professionals to ensure the reliability and security of their infrastructure. The role of a Staff Security Reliability Engineer is particularly crucial, requiring a deep understanding of security principles, infrastructure design, and scalable systems. With the rise of remote work, the need for robust and secure infrastructure has never been more pressing, making this role an exciting opportunity for those looking to make a real impact. Candidates should be prepared to showcase their technical expertise, as well as their ability to work collaboratively in a fast-paced environment.

Job Description

About the Role

The Staff Security Reliability Engineer plays a critical role in designing, building, and operating the reliable infrastructure that underpins OpenAI's internal services and R&D environments. This is an early, high-leverage technical role that requires a strong background in Site Reliability Engineering principles, as well as a deep understanding of security and scalability. The successful candidate will be responsible for establishing standardized infrastructure patterns, owning the lifecycle of critical infrastructure platforms, and building durable, production-grade platforms that remove operational friction and enable teams to move faster with confidence.

As a key member of the Infrastructure Engineering function, the Staff Security Reliability Engineer will work closely with cross-functional partners across security, identity, network, and platform teams to design and implement secure and scalable infrastructure solutions. This will involve collaborating with identity engineering teams to build hardened, policy-enforced infrastructure, as well as partnering with platform teams to design and implement scalable and reliable systems.

The role is based in OpenAI's San Francisco HQ and requires in-office presence, although the company values flexibility and work-life balance.

What You Will Do

Design, build, and operate reliable infrastructure across on-prem, hybrid, shared, and product-adjacent environments
Establish standardized infrastructure patterns that replace bespoke implementations with repeatable, auditable, secure-by-default systems
Own the lifecycle of critical infrastructure platforms, including provisioning, deployment, upgrades, patching, recovery, and long-term reliability
Build infrastructure-as-code and configuration management using tools such as Terraform, Chef, and Ansible
Mature identity-adjacent and policy-enforced infrastructure, including Microsoft Entra and Azure management patterns
Build observability, alerting, and incident response mechanisms that improve availability, recoverability, and operational confidence
Automate high-toil and high-risk workflows with guardrails, progressive rollout patterns, and safe rollback paths
Translate incidents, design reviews, and operational learnings into durable fixes, reusable patterns, and stronger technical standards

What We Are Looking For

10+ years of hands-on experience operating and architecting mission-critical infrastructure in high-reliability environments
Experience as the senior technical owner for the design and maturation of complex on-prem, hybrid, or cloud-integrated systems
Strong background in Site Reliability Engineering principles, with a focus on security and scalability
Experience with infrastructure-as-code and configuration management tools such as Terraform, Chef, and Ansible
Strong understanding of identity and access management principles, including Microsoft Entra and Azure management patterns
Experience with automation tools and scripting languages, such as Python and PowerShell
Strong communication and collaboration skills, with the ability to work effectively with cross-functional partners

Nice to Have

Experience operating infrastructure for R&D or specialized labs, manufacturing, or other safety-critical environments
Experience with fleet, endpoint, or virtual desktop platforms such as FleetDM, Chef, or Azure Virtual Desktop
Experience with cloud-based infrastructure, including AWS and Azure
Strong understanding of security and compliance principles, including GDPR and HIPAA

Benefits and Perks

Competitive salary and benefits package
Opportunities for career growth and professional development
Collaborative and dynamic work environment
Flexible working hours and remote work options
Access to cutting-edge technology and tools
Comprehensive health and wellness programs
Generous PTO and holiday allowance
Professional development budget and conference attendance opportunities

How to Stand Out

Ensure you have a strong understanding of Site Reliability Engineering principles and how they apply to security and scalability.
Be prepared to showcase your experience with infrastructure-as-code and configuration management tools, such as Terraform and Ansible.
Highlight your ability to work collaboratively with cross-functional partners, including security, identity, and platform teams.
Emphasize your experience with automation tools and scripting languages, such as Python and PowerShell.
Be prepared to discuss your approach to incident response and how you would handle a critical infrastructure failure.
Make sure your resume and online profiles are up to date and showcase your technical skills and experience.
Research OpenAI's company culture and values, and be prepared to discuss how you would contribute to and thrive in this environment.

This is a remote position listed on WFA Digital, the platform for professionals who work from anywhere.