Senior Linux Kernel Engineer - High-Performance Computing
WFA Digital Insight
As the demand for high-performance computing and AI technology continues to grow, the need for skilled Linux kernel engineers has never been more pressing. With a reported 25% increase in HPC investments in 2025, companies like The Next Chapter W&S are at the forefront of this revolution. To succeed in this role, candidates will need a deep understanding of Linux systems, performance optimization, and a passion for pushing the limits of what's possible. With the right skills and experience, this role offers a unique opportunity to work on cutting-edge infrastructure and make a real impact on the future of computing.
Job Description
About the Role
The Senior Linux Kernel Engineer will play a critical role in optimizing the performance of The Next Chapter W&S's high-performance computing infrastructure, which includes over 100,000 GPUs and 10+ InfiniBand fabrics across five global data centers. This is a hands-on, high-impact engineering role that requires a deep understanding of Linux systems, hardware, and software. The successful candidate will join a small, senior team that works between the hardware and Linux OS layers, solving complex performance problems that affect tens of thousands of GPUs.
The HPC cluster engineering team is responsible for enhancing and optimizing the core components of the Cloud platform, with a specific focus on High-Performance Computing, InfiniBand networks, and the KVM/QEMU stack. This role involves analyzing, troubleshooting, and improving infrastructure to support new hardware, fine-tuning system performance, and automating fault detection and resolution in a complex system.
As a key member of the team, the Senior Linux Kernel Engineer will work closely with hardware virtualization and device emulation technologies, ensuring high performance and security in multi-GPU, HPC environments.
What You Will Do
- Tune the performance of clusters and InfiniBand networks to ensure optimal operation in HPC and GPU-based environments.
- Analyze and troubleshoot issues related to GPUs and InfiniBand networks, identify their root causes, and propose corrective actions.
- Integrate new hardware into the existing infrastructure, including support for new GPU hardware through software stacks like Kubernetes, QEMU, and KVM.
- Enhance automation systems that proactively monitor, detect, and resolve issues in GPU and InfiniBand environments.
- Configure and manage GPU devices and InfiniBand fabrics, ensuring efficient and reliable operation.
- Collaborate with cross-functional teams to ensure seamless integration of new hardware and software components.
- Develop and maintain detailed documentation of system configurations, performance benchmarks, and troubleshooting procedures.
- Participate in the design and implementation of new features and technologies to improve system performance and scalability.
- Stay up-to-date with the latest developments in HPC, Linux, and related technologies, and apply this knowledge to continuously improve system performance and efficiency.
What We Are Looking For
- 5+ years of professional experience in system-level software development, with a focus on performance optimization and low-level programming.
- 3+ years of hands-on experience with Linux systems, including administration, troubleshooting, and performance tuning.
- Experience with relevant tools for kernel profiling and tuning, such as perf, ftrace, and eBPF.
- In-depth understanding of server architecture, including PCIe devices, NICs, Linux OS/Kernel, and related technologies.
- Strong proficiency in one or more performance-oriented programming languages, such as C/C++, Go, or Python.
- Excellent grasp of data structures and algorithms, with the ability to apply this knowledge to real-world problems.
- Experience with GPU end-to-end testing in a cluster environment using InfiniBand networking.
- Proven track record of analyzing and optimizing the performance of HPC workloads, such as simulations, data analysis, and AI/ML workloads.
Nice to Have
- Familiarity with RDMA, RoCE, and InfiniBand protocols for high-performance communication.
- Background in Software-Defined Networking (SDN) and experience with HPC cluster networking.
- Understanding of QEMU/KVM virtualization and managing virtualized environments.
- Experience with deep learning frameworks such as PyTorch and TensorFlow, and their integration with HPC systems.
- Familiarity with collective communication libraries like MPI and NCCL for distributed computing.
Benefits and Perks
- Competitive salary and benefits package.
- Flexible working arrangements, including remote work options.
- Opportunity to work on cutting-edge infrastructure and technology.
- Collaborative and dynamic work environment that values initiative and innovation.
- Professional development opportunities, including training and conference attendance.
- Access to the latest tools and technologies, including GPU hardware and software stacks.
- Recognition and reward for outstanding performance and contributions to the team.
- Comprehensive health and wellness package, including mental health support and employee assistance programs.
How to Stand Out
- Tip: Make sure your resume and cover letter are tailored to the specific requirements of the role, highlighting your experience with Linux systems, performance optimization, and HPC technology.
- Tip: Be prepared to provide specific examples of your experience with kernel profiling and tuning tools, as well as your understanding of server architecture and related technologies.
- Tip: Show enthusiasm and passion for working on cutting-edge infrastructure and technology, and be prepared to discuss your ideas for improving system performance and efficiency.
- Tip: Highlight your experience with collaborative development tools, such as Git, and your ability to work effectively in a team environment.
- Tip: Be prepared to discuss your experience with automation systems and scripting languages, such as Python or Bash, and how you have applied these skills in previous roles.
- Tip: Research the company and the role thoroughly, and be prepared to ask informed questions during the interview process.
- Tip: Consider creating a personal project or contributing to open-source projects to demonstrate your skills and experience with HPC and Linux technology.
This is a remote position listed on WFA Digital, the platform for professionals who work from anywhere.