Software Engineer, Compute Infrastructure

OpenAI · Remote (San Francisco)
Software Development

WFA Digital Insight

The demand for skilled software engineers in AI and compute infrastructure has skyrocketed, with a 25% increase in job postings over the last year. As OpenAI continues to push the boundaries of AI research, professionals with expertise in distributed systems, high-performance computing, and reliability are in high demand. With the shift towards remote work, companies like OpenAI are looking for candidates who can adapt and thrive in a distributed environment. Before applying, candidates should be prepared to showcase their problem-solving skills and experience with complex systems, as well as their ability to collaborate remotely. With the AI market projected to reach 90 billion by 2025, this is an exciting time to join a company at the forefront of this revolution.

Job Description

About the Role

The Software Engineer, Compute Infrastructure role at OpenAI is a unique opportunity to build and optimize the compute platform that powers the company's AI research and products. As a member of the Compute Infrastructure team, you will work on designing, provisioning, scheduling, operating, and optimizing the systems that connect accelerators, CPUs, networks, storage, data centers, and orchestration software. Your work will have a direct impact on the company's ability to advance AI research and develop innovative products.

The Compute Infrastructure team is responsible for building the platform that turns enormous amounts of compute into a reliable engine for frontier AI. This involves working on the entire stack, from capacity planning and cluster lifecycle to bare-metal automation, distributed systems, Kubernetes and scheduling, deep system optimization, high-performance networking, storage, fleet health, reliability, workload profiling, benchmarking, and the developer experience.

As a Software Engineer on this team, you will have the opportunity to work on a wide range of projects, from building and optimizing reliable platform primitives to designing abstractions that make heterogeneous clusters feel like one coherent platform. Your day-to-day work will involve collaborating with cross-functional teams, including researchers, product teams, and other engineers, to identify and prioritize projects, design and implement solutions, and test and deploy new features.

What You Will Do

  • Design and implement scalable and efficient compute systems that can handle large workloads
  • Collaborate with researchers and product teams to identify and prioritize projects
  • Work on building and optimizing reliable platform primitives
  • Design abstractions that make heterogeneous clusters feel like one coherent platform
  • Develop and maintain tooling and workflows to support the developer experience
  • Participate in the design and implementation of new features and systems
  • Collaborate with other engineers to identify and prioritize projects
  • Work on testing and deploying new features and systems
  • Participate in code reviews and contribute to the improvement of the codebase
  • Collaborate with other teams to ensure seamless integration of new features and systems

What We Are Looking For

  • 5+ years of experience in software engineering, with a focus on compute infrastructure, distributed systems, or high-performance computing
  • Strong understanding of computer systems, including operating systems, networking, and storage
  • Experience with cloud computing platforms, such as AWS or GCP
  • Strong programming skills in languages such as C++, Python, or Java
  • Experience with container technologies, such as Docker or Kubernetes
  • Strong understanding of distributed systems and scalability
  • Experience with agile development methodologies and version control systems, such as Git
  • Strong communication and collaboration skills
  • Experience with testing and validation of software systems

Nice to Have

  • Experience with AI or machine learning workloads
  • Knowledge of GPU architecture and programming models
  • Experience with high-performance networking protocols, such as InfiniBand or RDMA
  • Experience with cloud-based storage solutions, such as AWS S3 or GCP Cloud Storage
  • Knowledge of container orchestration tools, such as Kubernetes or Mesos

Benefits and Perks

  • Competitive salary and equity package
  • Opportunity to work on cutting-edge AI research and products
  • Collaborative and dynamic work environment
  • Flexible working hours and remote work options
  • Access to cutting-edge technology and tools
  • Professional development opportunities, including training and conference sponsorships
  • Comprehensive health insurance and benefits package
  • Generous paid time off and vacation policy
  • Access to a diverse and talented team of engineers and researchers

How to Stand Out

  • Be prepared to showcase your experience with complex systems and scalability, as well as your ability to collaborate remotely.
  • Make sure to highlight your understanding of computer systems, including operating systems, networking, and storage.
  • Showcase your programming skills in languages such as C++, Python, or Java, and be prepared to provide examples of your work.
  • Familiarize yourself with OpenAI's products and research, and be prepared to discuss how your skills and experience align with the company's mission and goals.
  • Be prepared to discuss your experience with agile development methodologies and version control systems, such as Git.
  • Don't be afraid to ask questions during the interview process, and be prepared to discuss your salary expectations and requirements.
  • Make sure to research the company culture and values, and be prepared to discuss how you align with them.

This is a remote position listed on WFA Digital, the platform for professionals who work from anywhere.