Datacenter Hardware Operations Technician, AI Compute Infrastructure - Stargate
WFA Digital Insight
The demand for skilled datacenter hardware operations technicians is on the rise, particularly in the AI sector, where companies like OpenAI are pushing the boundaries of innovation. With the global AI market expected to reach
Job Description
About the Role
The Datacenter Hardware Operations Technician role at OpenAI is a critical position that requires collaboration with Oracle teams and vendors to ensure the smooth operation of high-density compute environments. As a senior datacenter hardware operations technician, you will coordinate physical hardware activities at a large partner-operated campus, working side by side with Oracle and its delivery teams. Your primary focus will be technical alignment, shared problem-solving, and ensuring that maintenance, repairs, and lifecycle activities support the performance and reliability goals of both organizations.
The Stargate program at OpenAI is dedicated to building the world's most advanced AI infrastructure ecosystem. This involves developing and deploying massive, state-of-the-art data center campuses in partnership with industry leaders. As part of this program, you will play a key role in designing for scale, speed, and reliability, and in keeping the high-density compute environment operating at peak performance.
What You Will Do
- Serve as OpenAI's primary on-site hardware contact, collaborating with Oracle teams and vendors to plan and coordinate maintenance, repairs, and lifecycle activities.
- Share technical requirements and verify that work performed supports OpenAI's compute needs and agreed quality targets.
- Coordinate schedules, spare-parts planning, and issue escalation with partner teams to minimize downtime and keep operations running smoothly.
- Work with OpenAI fleet-health engineers to translate software-detected issues into on-site hardware actions in partnership with Oracle.
- Track hardware trends and provide joint recommendations with partner teams for design or operational improvements.
- Prepare documentation and runbooks that capture joint best practices and can be applied at additional campuses.
- Offer technical guidance and context to partner personnel while respecting their operational ownership.
- Collaborate with supply-chain teams to plan spares and manage hardware lifecycle activities.
- Develop and maintain relationships with key stakeholders, including Oracle teams, vendors, and internal engineering stakeholders.
- Participate in the development of standards and playbooks to guide hardware operations at future OpenAI infrastructure projects.
What We Are Looking For
- 7+ years of experience in datacenter hardware operations, hardware engineering, or large-scale server maintenance, with at least 2 years in a senior or lead technician capacity.
- Deep knowledge of high-density server hardware, including x86 platforms, GPUs, storage devices, and power/cooling systems.
- Experience in diagnosing hardware issues, coordinating complex repairs, and maintaining strong working relationships across organizations.
- Ability to set technical expectations and validate outcomes through collaboration, not direct management.
- Adaptability to changing operational conditions and enjoyment of solving problems at both strategic and on-site levels.
- Strong communication skills, with the ability to build trust across partner teams, vendors, and internal engineering stakeholders.
- Willingness to be based full-time at a partner-operated campus.
Nice to Have
- Familiarity with large-scale cluster management or monitoring tools (IPMI, BMC, Prometheus, Nagios) to interpret alerts and coordinate partner responses.
- Experience with GPU-accelerated compute clusters or other high-performance computing hardware.
- Knowledge of Linux/Unix system administration and command-line diagnostic tools for hardware validation.
- Industry certifications such as CompTIA Server+, OEM hardware certifications, or equivalent.
Benefits and Perks
- Opportunity to work with a leading AI research and deployment company.
- Collaborative and dynamic work environment.
- Professional development opportunities.
- Competitive compensation package.
- Health insurance benefits.
- Remote work stipend.
- Flexible PTO policy.
- Access to cutting-edge technology and tools.
How to Stand Out
- Ensure your resume and cover letter highlight your experience with high-density server hardware and large-scale cluster management.
- Be prepared to discuss specific examples of diagnosing hardware issues and coordinating complex repairs in your previous roles.
- Familiarize yourself with OpenAI's Stargate program and the company's mission to advance AI infrastructure.
- Develop a strong understanding of the skills and qualifications required for the role, and be prepared to explain how your experience aligns with them.
- Practice your communication skills, as building trust across partner teams, vendors, and internal engineering stakeholders is crucial in this role.
- Research the current market trends and demands for datacenter hardware operations technicians to negotiate your salary effectively.
- Be prepared to discuss your experience working in a fast-paced, dynamic environment and your ability to adapt to changing operational conditions.
This is a remote position listed on WFA Digital, the platform for professionals who work from anywhere.