AI Infrastructure & Platform Operations Engineer

namename · Remote (Poland)
AI & Machine Learning
Excel

WFA Digital Insight

The demand for skilled AI infrastructure engineers is on the rise, with a 25% increase in job openings in the past year alone. As companies invest heavily in AI infrastructure, professionals with expertise in NVIDIA GPUs, Kubernetes, and high-performance networking are in high demand. With the global AI market expected to reach 90 billion by 2025, this role offers a unique opportunity to be at the forefront of AI innovation. Candidates should be prepared to showcase their technical expertise and experience working in complex infrastructure environments. Before applying, it's essential to understand the company's vision for AI infrastructure and how this role contributes to its growth.

Job Description

About the Role

As an AI Infrastructure & Platform Operations Engineer, you will play a crucial role in the European AI Infrastructure & Platform Operations team. Your primary responsibility will be to monitor, operate, and support large-scale AI infrastructure environments powered by cutting-edge technologies such as NVIDIA GPUs, high-performance networking, and Kubernetes. The team is responsible for ensuring the smooth operation of these environments, and your expertise will be instrumental in resolving infrastructure-related incidents and improving overall system efficiency.

The AI Infrastructure & Platform Operations team is at the forefront of AI innovation, working with the latest technologies to drive business growth. As a key member of this team, you will have the opportunity to gain exposure to next-generation AI infrastructure and contribute to shaping the future of AI-powered operations. Your work will have a direct impact on the company's ability to deliver high-quality AI solutions, making this a highly rewarding role for those passionate about AI and infrastructure.

What You Will Do

  • Monitor and operate production AI infrastructure platforms to ensure high availability and performance
  • Investigate and resolve infrastructure, networking, hardware, and platform-related incidents
  • Collaborate with cross-functional teams to implement new technologies and improve existing infrastructure
  • Develop and maintain operational documentation and runbooks for AI infrastructure environments
  • Work in a shift-based rotation that provides 24/7 support for critical systems
  • Work closely with the development team to ensure seamless integration of new features and technologies
  • Analyze system performance and provide recommendations for optimization
  • Develop and implement automation scripts to improve efficiency and reduce manual errors
  • Stay up-to-date with the latest advancements in AI infrastructure and platform technologies

What We Are Looking For

  • At least 3 years of experience in infrastructure operations, platform operations, network operations, site reliability engineering, cloud operations, or related technical roles
  • Strong Linux administration and troubleshooting skills
  • Good understanding of networking concepts and experience diagnosing infrastructure-related issues
  • Working knowledge of Kubernetes in production environments
  • Experience supporting production infrastructure and services
  • Strong analytical and problem-solving skills
  • Experience working within structured operational and incident management processes
  • Excellent communication and collaboration skills

Nice to Have

  • Experience with NVIDIA GPU technologies and high-performance computing environments
  • Knowledge of cloud platforms such as AWS or Azure
  • Familiarity with containerization technologies like Docker
  • Experience with automation tools like Ansible or Terraform

Benefits and Perks

  • Competitive salary and benefits package
  • Opportunity to work with cutting-edge AI technologies and contribute to the development of next-generation AI infrastructure
  • Collaborative and dynamic work environment with a team of experienced professionals
  • Professional development opportunities, including training and conference attendance
  • Flexible working hours and remote work options
  • Access to the latest tools and technologies
  • Recognition and reward for outstanding performance
  • Comprehensive health insurance and retirement plan

How to Stand Out

  • Ensure your resume highlights specific experience with Linux administration, Kubernetes, and high-performance networking.
  • Be prepared to provide examples of complex infrastructure issues you've resolved in the past.
  • Familiarize yourself with NVIDIA GPU technologies and their applications in AI infrastructure.
  • Showcase your ability to work collaboratively in a team environment and effectively communicate technical concepts.
  • Consider creating a personal project or contributing to open-source projects to demonstrate your skills in AI infrastructure and platform operations.
  • Prepare to discuss your experience with automation tools and scripting languages.
  • Research the company's approach to AI infrastructure and be ready to discuss how your skills align with their vision.

This is a remote position listed on WFA Digital, the platform for professionals who work from anywhere.