AI Infrastructure & Platform Operations Engineer
WFA Digital Insight
The demand for skilled AI infrastructure engineers is on the rise, with a 25% increase in job openings in the past year alone. As companies invest heavily in AI infrastructure, professionals with expertise in NVIDIA GPUs, Kubernetes, and high-performance networking are in high demand. With the global AI market expected to reach
Job Description
About the Role
As an AI Infrastructure & Platform Operations Engineer, you will play a crucial role in the European AI Infrastructure & Platform Operations team. Your primary responsibility will be to monitor, operate, and support large-scale AI infrastructure environments powered by cutting-edge technologies such as NVIDIA GPUs, high-performance networking, and Kubernetes. The team is responsible for ensuring the smooth operation of these environments, and your expertise will be instrumental in resolving infrastructure-related incidents and improving overall system efficiency.
The AI Infrastructure & Platform Operations team is at the forefront of AI innovation, working with the latest technologies to drive business growth. As a key member of this team, you will have the opportunity to gain exposure to next-generation AI infrastructure and contribute to shaping the future of AI-powered operations. Your work will have a direct impact on the company's ability to deliver high-quality AI solutions, making this a highly rewarding role for those passionate about AI and infrastructure.
What You Will Do
- Monitor and operate production AI infrastructure platforms to ensure high availability and performance
- Investigate and resolve infrastructure, networking, hardware, and platform-related incidents
- Collaborate with cross-functional teams to implement new technologies and improve existing infrastructure
- Develop and maintain operational documentation and runbooks for AI infrastructure environments
- Work in a shift-based operations environment that provides 24/7 support for critical systems
- Work closely with the development team to ensure seamless integration of new features and technologies
- Analyze system performance and provide recommendations for optimization
- Develop and implement automation scripts to improve efficiency and reduce manual errors
- Stay up to date with the latest advancements in AI infrastructure and platform technologies
What We Are Looking For
- At least 3 years of experience in infrastructure operations, platform operations, network operations, site reliability engineering, cloud operations, or related technical roles
- Strong Linux administration and troubleshooting skills
- Good understanding of networking concepts and experience diagnosing infrastructure-related issues
- Working knowledge of Kubernetes in production environments
- Experience supporting production infrastructure and services
- Strong analytical and problem-solving skills
- Experience working within structured operational and incident management processes
- Excellent communication and collaboration skills
Nice to Have
- Experience with NVIDIA GPU technologies and high-performance computing environments
- Knowledge of cloud platforms such as AWS or Azure
- Familiarity with containerization technologies like Docker
- Experience with automation tools like Ansible or Terraform
Benefits and Perks
- Competitive salary and benefits package
- Opportunity to work with cutting-edge AI technologies and contribute to the development of next-generation AI infrastructure
- Collaborative and dynamic work environment with a team of experienced professionals
- Professional development opportunities, including training and conference attendance
- Flexible working hours and remote work options
- Access to the latest tools and technologies
- Recognition and reward for outstanding performance
- Comprehensive health insurance and retirement plan
How to Stand Out
- Ensure your resume highlights specific experience with Linux administration, Kubernetes, and high-performance networking.
- Be prepared to provide examples of complex infrastructure issues you've resolved in the past.
- Familiarize yourself with NVIDIA GPU technologies and their applications in AI infrastructure.
- Showcase your ability to work collaboratively in a team environment and effectively communicate technical concepts.
- Consider creating a personal project or contributing to open-source projects to demonstrate your skills in AI infrastructure and platform operations.
- Prepare to discuss your experience with automation tools and scripting languages.
- Research the company's approach to AI infrastructure and be ready to discuss how your skills align with their vision.
This is a remote position listed on WFA Digital, the platform for professionals who work from anywhere.