Technical Lead Manager - Training Runtime, Data(set) Movement

OpenAI · Remote (San Francisco)

WFA Digital Insight

With the explosive growth of AI research, demand for technical leaders in data movement and reliability has never been higher. According to industry reports, the need for skilled professionals in this space grew by 28% in 2025 alone. OpenAI, a pioneer in AI research, is now seeking a Technical Lead Manager to spearhead its training runtime and dataset movement efforts. As a hub for innovation, OpenAI offers tech professionals a unique opportunity to make a significant impact. Before applying, candidates should be aware that this role requires a deep understanding of distributed systems, data loading, and reliability engineering, as well as experience leading technical teams and managing complex data pipelines.

Job Description

About the Role

The Technical Lead Manager at OpenAI oversees the development and maintenance of the company's training runtime and dataset movement infrastructure. As a key member of the Training Runtime team, the successful candidate will be responsible for designing and building a unified dataset read platform for multiple current and future training frameworks.

The Training Runtime team builds the distributed systems that power OpenAI's largest model training runs. The Data Movement area, in particular, focuses on the infrastructure that keeps training jobs supplied with the right data at the right time, and keeps model state moving safely and efficiently across large clusters. The team's work spans machine learning systems, distributed storage, high-throughput data loading, reliability engineering, and developer experience.

What You Will Do

Design and build a unified dataset read platform for multiple current and future training frameworks
Define dataset APIs, storage-format expectations, registration/versioning, and migration paths that make data access reproducible and maintainable
Build reliability into the read path, including stateful iteration, caching, fast restart, recovery, and clear operational contracts
Build terminal and web-based visualizers that let teams inspect text, multimodal, and reinforcement learning data late in the pipeline, where bugs are most visible
Write and review production code in core data loading, service, caching, and reliability paths
Partner with teams working on training frameworks, reinforcement learning, multimodal models, storage, runtime, and cluster infrastructure
Lead the development of a durable platform that supports pretraining, reinforcement learning, and multimodal training
Collaborate with researchers, training framework owners, storage teams, and infrastructure partners to align around a unified platform
Identify and mitigate potential failure modes of large distributed training jobs and develop strategies to prevent them

What We Are Looking For

Experience building or owning dataset, data loading, storage, or distributed training infrastructure at large scale
Strong understanding of API design, debugging ergonomics, performance, and bit-level correctness
Knowledge of the failure modes of large distributed training jobs and how data systems can create or prevent them
Experience with stateful iterators, checkpoint/restart semantics, caching, remote services, or high-throughput storage reads
Comfort working across Python and lower-level systems code, with Rust or C++ experience being a plus
Experience working with multimodal, video, reinforcement learning, or pretraining data pipelines
Ability to lead through code and technical judgment, with a focus on eliminating friction and ensuring a reliable and efficient experience for researchers

Nice to Have

Experience with machine learning frameworks and technologies, such as TensorFlow or PyTorch
Knowledge of distributed systems and cloud computing platforms, such as AWS or GCP
Familiarity with data visualization tools and technologies, such as Tableau or D3.js
Experience with Agile development methodologies and version control systems, such as Git

Benefits and Perks

Competitive salary and equity package
Opportunity to work on cutting-edge AI research and development projects
Collaborative and dynamic work environment with a team of experienced professionals
Flexible working hours and remote work options
Access to professional development and training opportunities
Comprehensive health and wellness benefits package
Generous paid time off and vacation policy
Remote work stipend and equipment allowance
Opportunity to contribute to open-source projects and participate in industry conferences and events

How to Stand Out

Develop a strong understanding of distributed systems, data loading, and reliability engineering before you apply
Create a portfolio that showcases your experience with dataset, data loading, storage, or distributed training infrastructure at large scale
Be prepared to discuss your experience with stateful iterators, checkpoint/restart semantics, caching, remote services, or high-throughput storage reads during the interview process
Emphasize your ability to lead through code and technical judgment, with a focus on eliminating friction and ensuring a reliable and efficient experience for researchers
Research OpenAI's current projects and initiatives to demonstrate your passion for AI research and development
Practice whiteboarding exercises to improve your problem-solving skills and ability to communicate complex technical concepts
Be prepared to negotiate salary and benefits, and do not be afraid to ask about opportunities for professional development and growth

This is a remote position listed on WFA Digital, the platform for professionals who work from anywhere.