Lead Platform Engineer (HPC & Stateless Linux)

PFX · Remote (United States)
Software Development

WFA Digital Insight

The shift to stateless architectures is driving demand for skilled platform engineers, with a 25% increase in job postings over the past year. PFX is at the forefront of this trend, seeking a lead engineer to design and deploy a cutting-edge Linux cluster. With the global HPC market projected to reach $45 billion by 2027, this role offers a chance to work on a high-impact project. Candidates should be prepared to showcase their expertise in Linux system administration, infrastructure architecture, and container technologies. Because the role is remote, flexibility and self-motivation are essential.

Job Description

About the Role

The Lead Platform Engineer will play a crucial role in designing and deploying a stateless Linux cluster to support rendering and production workloads across PFX's European branches. This is a contract-based position that requires close collaboration with the R&D and IT teams to build the foundational layer of the infrastructure. The successful candidate will have a deep understanding of Linux system administration, infrastructure architecture, and container technologies.

As a lead engineer, you will be responsible for making high-level architectural decisions and owning the 'foundational layer' of the project. This is an exciting opportunity to work on a high-impact project that will drive the company's growth and success. You will be working in a fast-paced environment with a team of experienced professionals who are passionate about innovation and excellence.

PFX is committed to fostering a culture of innovation and collaboration. The company is looking for a talented and motivated individual who can bring their expertise and passion to the team. If you are a skilled platform engineer with a passion for stateless architectures and Linux, this could be the perfect opportunity for you.

What You Will Do

  • Design and deploy a stateless Linux cluster using technologies like Warewulf
  • Implement and configure SLURM as the primary scheduler for workload management
  • Manage the environment through Proxmox and design container images via Singularity/Apptainer
  • Implement Icinga for monitoring and build a custom Conda repository for reproducible deployment
  • Collaborate on network architecture and support CI workflows via GitLab CI
  • Work closely with the R&D and IT teams to ensure seamless integration of the new infrastructure
  • Develop and maintain technical documentation for the infrastructure
  • Troubleshoot and resolve technical issues related to the infrastructure
  • Participate in the planning and implementation of future infrastructure upgrades and improvements
  • Collaborate with the DevOps team to ensure continuous integration and delivery of applications

What We Are Looking For

  • Expert-level Linux system administration skills (Red Hat/Rocky Linux preferred)
  • Proven experience building or operating large-scale compute environments (HPC, large-scale K8s, or distributed systems)
  • Hands-on experience with stateless deployments, Proxmox/KVM, and container technologies
  • Proficiency in Python or Bash for complex system automation
  • Ability to own the 'foundational layer' of a project and make high-level architectural decisions
  • Strong understanding of infrastructure architecture and design principles
  • Experience with agile development methodologies and version control systems
  • Excellent communication and collaboration skills
  • Ability to work independently and as part of a team

Nice to Have

  • Prior experience with SLURM, Warewulf, xCAT, or similar provisioning/scheduling tools
  • Experience with Infrastructure-as-Code tools like Ansible, Terraform, or Puppet
  • Background in research computing, AI infrastructure, or advanced university/HPC labs
  • Familiarity with cloud computing platforms and migration strategies

Benefits and Perks

  • Competitive hourly rate
  • Opportunity to work on a high-impact project with a leading company
  • Flexible working hours and remote work arrangement
  • Access to cutting-edge technologies and tools
  • Collaborative and dynamic work environment
  • Professional development and growth opportunities
  • Recognition and rewards for outstanding performance
  • Comprehensive health insurance and benefits package
  • Generous paid time off and holiday allowance
  • Remote work stipend and equipment allowance

How to Stand Out

  • Make sure to highlight your experience with stateless architectures and Linux system administration in your application.
  • Showcase your proficiency in Python or Bash for complex system automation by providing examples of scripts or projects you have worked on.
  • Be prepared to discuss your experience with container technologies and orchestration tools like Kubernetes.
  • Demonstrate your understanding of infrastructure architecture and design principles by explaining your approach to building and deploying large-scale compute environments.
  • Be prepared to talk about your experience with agile development methodologies and version control systems like Git.
  • Showcase your ability to work independently and as part of a team by providing examples of successful collaborations or leadership roles.
  • Consider creating a personal project or contributing to open-source projects to demonstrate your skills and passion for stateless architectures and Linux.

This is a remote position listed on WFA Digital, the platform for professionals who work from anywhere.