Research Infrastructure Engineer, Training Systems
WFA Digital Insight
Demand for skilled ML engineers has surged in recent years, with expertise in building scalable training infrastructure a top priority, and professionals who combine strong systems instincts with ML knowledge remain in short supply. OpenAI's cutting-edge approach to AI research and deployment stands out in the industry, and this role offers a unique opportunity to contribute to advancing frontier models. With the remote job market growing, candidates should be prepared to showcase their skills in digital collaboration, problem-solving, and API design; some industry reports indicate an increase of roughly 25% in postings for related roles over the past year.
Job Description
About the Role
The Research Infrastructure Engineer role at OpenAI is an opportunity to work on the systems layer that enables the development of large-scale ML models. As part of the research team, you will build and maintain the infrastructure that supports training these models, working closely with researchers and engineers to design and implement scalable solutions. The team's work advances the state of the art in AI research, and this role plays a critical part in making that happen.

Day to day, you will design and implement APIs and interfaces that make complex training workflows easier to express and harder to misuse. You will also improve the reliability, debuggability, and performance of the training and data pipelines, and debug issues that span multiple systems and technologies. Collaboration with other teams and stakeholders is essential: the infrastructure you build must meet the needs of the researchers and engineers who use it.
What You Will Do
- Build and maintain infrastructure for large-scale model training and experimentation
- Design APIs and interfaces that make complex training workflows easier to express and harder to misuse
- Improve reliability, debuggability, and performance across training and data pipelines
- Debug issues spanning Python, PyTorch, distributed systems, GPUs, networking, and storage
- Write tests, benchmarks, and diagnostics that catch meaningful regressions
- Work with researchers and engineers to design and implement scalable solutions
- Collaborate with other teams to ensure seamless integration of the infrastructure
- Develop and maintain documentation for the infrastructure and its usage
- Participate in code reviews and contribute to the improvement of the codebase
- Stay up-to-date with the latest developments in ML and software engineering
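One of the responsibilities above is writing tests, benchmarks, and diagnostics that catch meaningful regressions. As a hypothetical sketch of that kind of work (not OpenAI's actual tooling; the function names and the 15% tolerance are illustrative assumptions), a minimal benchmark harness might discard warmup runs, report a median, and compare it against a stored baseline:

```python
import time
import statistics


def benchmark(fn, *, warmup=3, runs=10):
    """Time a callable, discarding warmup runs, and return the median seconds."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def is_regression(current, baseline, tolerance=0.15):
    """Flag a regression when the current median exceeds the baseline by
    more than the tolerance (here an assumed 15%)."""
    return current > baseline * (1 + tolerance)


# Example: time a toy workload and compare against a baseline.
median = benchmark(lambda: sum(i * i for i in range(10_000)))
baseline = median  # in practice, loaded from a previously recorded run
print(is_regression(median, baseline))
```

Using a median rather than a mean keeps one-off scheduler hiccups from tripping the check, which matters when benchmarks gate CI.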
What We Are Looking For
- Strong systems instincts and a deep concern for performance, reliability, and clean abstractions
- Experience with building and maintaining large-scale ML training infrastructure
- Good taste in API and interface design, with empathy for the researchers and engineers using your tools
- Comfort working across ML research code and production-quality infrastructure
- Experience with debugging from evidence: profiles, traces, logs, tests, and minimal reproductions
- Strong programming skills in languages such as Python and C++
- Experience with distributed systems, GPUs, and networking
- Strong understanding of software engineering principles and practices
- Experience with collaboration tools such as Git and GitHub
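"Good taste in API design" here means interfaces that are easy to use correctly and hard to misuse. As a hypothetical illustration (the `TrainingConfig` class and its fields are invented for this sketch, not an OpenAI API), a training config can validate itself at construction time so that an invalid workflow fails loudly before a long run starts:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingConfig:
    """Immutable training configuration that validates itself on construction,
    so bad settings fail at definition time rather than hours into a run."""
    batch_size: int
    learning_rate: float
    num_gpus: int = 1

    def __post_init__(self):
        if self.batch_size <= 0:
            raise ValueError(f"batch_size must be positive, got {self.batch_size}")
        if not 0.0 < self.learning_rate < 1.0:
            raise ValueError(f"learning_rate must be in (0, 1), got {self.learning_rate}")
        if self.num_gpus <= 0 or self.batch_size % self.num_gpus != 0:
            raise ValueError("batch_size must divide evenly across num_gpus")

    @property
    def per_gpu_batch_size(self):
        # Derived value exposed read-only, so callers cannot desynchronize it
        # from batch_size and num_gpus.
        return self.batch_size // self.num_gpus


cfg = TrainingConfig(batch_size=64, learning_rate=3e-4, num_gpus=2)
print(cfg.per_gpu_batch_size)  # 32
```

Freezing the dataclass and deriving `per_gpu_batch_size` instead of storing it are both small choices that remove whole classes of misuse, which is the spirit of the requirement above.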
Nice to Have
- Experience with cloud-based infrastructure and containerization
- Knowledge of ML frameworks such as PyTorch and TensorFlow
- Experience with automation tools such as Ansible and Terraform
- Familiarity with Agile development methodologies
Benefits and Perks
- Competitive salary and equity package
- Comprehensive health, dental, and vision insurance
- Generous PTO and holiday schedule
- Remote work stipend and support for home office setup
- Access to cutting-edge technologies and tools
- Opportunities for professional growth and development
- Collaborative and dynamic work environment
- Flexible working hours and autonomy to manage your schedule
How to Stand Out
- Make sure to highlight your experience with building and maintaining large-scale ML training infrastructure in your resume and cover letter.
- Showcase your skills in API design, debugging, and collaboration by including relevant examples in your portfolio.
- Be prepared to talk about your experience with distributed systems, GPUs, and networking during the interview process.
- Demonstrate your understanding of software engineering principles and practices, and be prepared to discuss your approach to coding and testing.
- Consider learning more about OpenAI's specific technologies and tools before applying, such as their use of PyTorch and containerization.
- Be prepared to discuss your experience with remote collaboration and working with distributed teams.
This is a remote position listed on WFA Digital, the platform for professionals who work from anywhere.