Lead Platform Engineer (HPC & Stateless Linux)
WFA Digital Insight
The shift to stateless architectures is driving demand for skilled platform engineers, with a 25% increase in job postings over the past year. PFX is at the forefront of this trend, seeking a lead engineer to design and deploy a cutting-edge Linux cluster. With the global HPC market projected to reach $45 billion by 2027, this role offers a chance to work on a high-impact project. Candidates should be prepared to showcase their expertise in Linux system administration, infrastructure architecture, and container technologies. Because the role is remote, flexibility and self-motivation are essential.
Job Description
About the Role
The Lead Platform Engineer will play a crucial role in designing and deploying a stateless Linux cluster to support rendering and production workloads across PFX's European branches. This is a contract-based position that requires close collaboration with the R&D and IT teams to build the foundational layer of the infrastructure. The successful candidate will have a deep understanding of Linux system administration, infrastructure architecture, and container technologies.
As a lead engineer, you will be responsible for making high-level architectural decisions and owning the 'foundational layer' of the project. This is an exciting opportunity to work on a high-impact project that will drive the company's growth and success. You will be working in a fast-paced environment with a team of experienced professionals who are passionate about innovation and excellence.
PFX is committed to fostering a culture of innovation and collaboration. The company is looking for a talented and motivated individual who can bring their expertise and passion to the team. If you are a skilled platform engineer with a passion for stateless architectures and Linux, this could be the perfect opportunity for you.
What You Will Do
- Design and deploy a stateless Linux cluster using provisioning tools such as Warewulf
- Implement and configure SLURM as the primary scheduler for workload management
- Manage the environment through Proxmox and design container images via Singularity/Apptainer
- Implement Icinga for monitoring and build a custom Conda repository for reproducible deployment
- Collaborate on network architecture and support CI workflows via GitLab CI
- Work closely with the R&D and IT teams to ensure seamless integration of the new infrastructure
- Develop and maintain technical documentation for the infrastructure
- Troubleshoot and resolve technical issues related to the infrastructure
- Participate in the planning and implementation of future infrastructure upgrades and improvements
- Collaborate with the DevOps team to ensure continuous integration and delivery of applications
What We Are Looking For
- Expert-level Linux system administration skills (Red Hat/Rocky Linux preferred)
- Proven experience building or operating large-scale compute environments (HPC, large-scale K8s, or distributed systems)
- Hands-on experience with stateless deployments, Proxmox/KVM, and container technologies
- Proficiency in Python or Bash for complex system automation
- Ability to own the 'foundational layer' of a project and make high-level architectural decisions
- Strong understanding of infrastructure architecture and design principles
- Experience with agile development methodologies and version control systems
- Excellent communication and collaboration skills
- Ability to work independently and as part of a team
Nice to Have
- Prior experience with SLURM, Warewulf, xCAT, or similar provisioning/scheduling tools
- Experience with Infrastructure-as-Code tools like Ansible, Terraform, or Puppet
- Background in research computing, AI infrastructure, or advanced university/HPC labs
- Familiarity with cloud computing platforms and migration strategies
Benefits and Perks
- Competitive hourly rate
- Opportunity to work on a high-impact project with a leading company
- Flexible working hours and remote work arrangement
- Access to cutting-edge technologies and tools
- Collaborative and dynamic work environment
- Professional development and growth opportunities
- Recognition and rewards for outstanding performance
- Comprehensive health insurance and benefits package
- Generous paid time off and holiday allowance
- Remote work stipend and equipment allowance
How to Stand Out
- Highlight your experience with stateless architectures and Linux system administration in your application.
- Showcase your proficiency in Python or Bash for complex system automation by providing examples of scripts or projects you have worked on.
- Be prepared to discuss your experience with container technologies and orchestration tools like Kubernetes.
- Demonstrate your understanding of infrastructure architecture and design principles by explaining your approach to building and deploying large-scale compute environments.
- Be prepared to talk about your experience with agile development methodologies and version control systems like Git.
- Showcase your ability to work independently and as part of a team by providing examples of successful collaborations or leadership roles.
- Consider creating a personal project or contributing to open-source projects to demonstrate your skills and passion for stateless architectures and Linux.
This is a remote position listed on WFA Digital, the platform for professionals who work from anywhere.