Software Engineer, Compute Infrastructure

OpenAI · Remote (San Francisco)

Software Development

WFA Digital Insight

The demand for skilled software engineers in AI and compute infrastructure has skyrocketed, with a 25% increase in job postings over the last year. As OpenAI continues to push the boundaries of AI research, professionals with expertise in distributed systems, high-performance computing, and reliability are in high demand. With the shift towards remote work, companies like OpenAI are looking for candidates who can adapt and thrive in a distributed environment. Before applying, candidates should be prepared to showcase their problem-solving skills and experience with complex systems, as well as their ability to collaborate remotely. With the AI market projected to reach 90 billion by 2025, this is an exciting time to join a company at the forefront of this revolution.

Job Description

About the Role

The Software Engineer, Compute Infrastructure role at OpenAI is a unique opportunity to build and optimize the compute platform that powers the company's AI research and products. As a member of the Compute Infrastructure team, you will work on designing, provisioning, scheduling, operating, and optimizing the systems that connect accelerators, CPUs, networks, storage, data centers, and orchestration software. Your work will have a direct impact on the company's ability to advance AI research and develop innovative products.

The Compute Infrastructure team is responsible for building the platform that turns enormous amounts of compute into a reliable engine for frontier AI. This involves working on the entire stack, from capacity planning and cluster lifecycle to bare-metal automation, distributed systems, Kubernetes and scheduling, deep system optimization, high-performance networking, storage, fleet health, reliability, workload profiling, benchmarking, and the developer experience.

As a Software Engineer on this team, you will have the opportunity to work on a wide range of projects, from building and optimizing reliable platform primitives to designing abstractions that make heterogeneous clusters feel like one coherent platform. Your day-to-day work will involve collaborating with cross-functional teams, including researchers, product teams, and other engineers, to identify and prioritize projects, design and implement solutions, and test and deploy new features.

What You Will Do

Design and implement scalable and efficient compute systems that can handle large workloads
Collaborate with researchers and product teams to identify and prioritize projects
Work on building and optimizing reliable platform primitives
Design abstractions that make heterogeneous clusters feel like one coherent platform
Develop and maintain tooling and workflows to support the developer experience
Participate in the design and implementation of new features and systems
Collaborate with other engineers to identify and prioritize projects
Work on testing and deploying new features and systems
Participate in code reviews and contribute to the improvement of the codebase
Collaborate with other teams to ensure seamless integration of new features and systems

What We Are Looking For

5+ years of experience in software engineering, with a focus on compute infrastructure, distributed systems, or high-performance computing
Strong understanding of computer systems, including operating systems, networking, and storage
Experience with cloud computing platforms, such as AWS or GCP
Strong programming skills in languages such as C++, Python, or Java
Experience with containerization and orchestration technologies, such as Docker or Kubernetes
Strong understanding of distributed systems and scalability
Experience with agile development methodologies and version control systems, such as Git
Strong communication and collaboration skills
Experience with testing and validation of software systems

Nice to Have

Experience with AI or machine learning workloads
Knowledge of GPU architecture and programming models
Experience with high-performance networking protocols, such as InfiniBand or RDMA
Experience with cloud-based storage solutions, such as AWS S3 or GCP Cloud Storage
Knowledge of container orchestration tools, such as Kubernetes or Mesos

Benefits and Perks

Competitive salary and equity package
Opportunity to work on cutting-edge AI research and products
Collaborative and dynamic work environment
Flexible working hours and remote work options
Access to cutting-edge technology and tools
Professional development opportunities, including training and conference sponsorships
Comprehensive health insurance and benefits package
Generous paid time off and vacation policy
Access to a diverse and talented team of engineers and researchers

How to Stand Out

To stand out as a candidate, be prepared to showcase your experience with complex systems and scalability, as well as your ability to collaborate remotely.
Make sure to highlight your understanding of computer systems, including operating systems, networking, and storage.
Showcase your programming skills in languages such as C++, Python, or Java, and be prepared to provide examples of your work.
Familiarize yourself with OpenAI's products and research, and be prepared to discuss how your skills and experience align with the company's mission and goals.
Be prepared to discuss your experience with agile development methodologies and version control systems, such as Git.
Don't be afraid to ask questions during the interview process, and be prepared to discuss your salary expectations and requirements.
Make sure to research the company culture and values, and be prepared to discuss how you align with them.

This is a remote position listed on WFA Digital, the platform for professionals who work from anywhere.