Network Reliability Engineer

MARGO · Remote (Poland)
Software Development

WFA Digital Insight

The demand for skilled network reliability engineers with expertise in AI infrastructure has never been higher, with the global AI market projected to reach 90 billion by 2027. As remote work continues to reshape the digital landscape, companies like MARGO are seeking professionals who can ensure the stability, security, and scalability of their systems. MARGO is looking for proactive, solution-oriented individuals who can collaborate effectively and drive continuous improvement. Before applying, candidates should be aware of how quickly AI and digital technologies evolve and be prepared to showcase their hands-on experience with Linux systems, networking fundamentals, and automation tools.

Job Description

About the Role

The Network Reliability Engineer position at MARGO is a critical role focused on building, maintaining, and troubleshooting large AI infrastructures. This involves working closely with various engineering teams to ensure the reliability, scalability, and security of AI systems across different environments and countries. The successful candidate will be part of a dynamic team that values collaboration, continuous learning, and innovation.

Day-to-day work includes diagnosing and remediating production incidents, participating in on-call rotations, and implementing observability solutions to monitor infrastructure and application health. The ability to work both independently and as part of a team, along with a passion for automation and continuous improvement, is essential.

MARGO's commitment to best practices in stability, resiliency, scalability, and security means that the Network Reliability Engineer will play a pivotal role in maintaining and evolving the company's technical infrastructure. This includes promoting and applying these practices and keeping technical documentation for tools and procedures clear.

What You Will Do

  • Build and maintain large AI infrastructures with a focus on monitoring, diagnosis, and remediation of production incidents.
  • Troubleshoot high-impact production issues in collaboration with other engineering teams.
  • Participate in an on-call rotation to handle incidents and ensure service continuity.
  • Implement and maintain observability solutions to monitor AI infrastructure and application health.
  • Contribute to AI infrastructure lifecycle management across different environments and countries.
  • Promote and apply best practices for stability, resiliency, scalability, and security.
  • Maintain clear technical documentation for tools and procedures.
  • Contribute to system and tool evolution based on production feedback.
  • Collaborate closely with development teams to ensure infrastructure readiness.
  • Participate in team rituals and knowledge-sharing initiatives.

What We Are Looking For

  • Experience with Go or Python, with strong scripting skills (Bash, Python).
  • Hands-on experience with Linux systems (Ubuntu/Debian).
  • Hands-on experience with GPU and HPC infrastructure (preferred).
  • Knowledge of networking fundamentals (VLAN/LAN, TCP/IP, DNS, BGP, load-balancing, IPv6, etc.).
  • Familiarity with monitoring and logging tools (Prometheus, Grafana, Elastic, etc.).
  • Comfortable with Infrastructure-as-Code (Ansible, Salt, AWX, etc.).
  • Experience managing relational databases (MariaDB).
  • Understanding of CI/CD pipelines (GitLab).
  • Comfortable with English (written and spoken).
  • Proactive and solution-oriented mindset.
  • Strong collaboration and communication skills.

Nice to Have

  • Prior experience working in a remote or distributed team environment.
  • Familiarity with cloud computing platforms (AWS, Azure, Google Cloud).
  • Knowledge of containerization (Docker) and orchestration (Kubernetes).
  • Experience with agile development methodologies.

Benefits and Perks

  • Competitive compensation package.
  • Opportunities for professional growth and continuous learning.
  • Collaborative and dynamic work environment.
  • Flexible working hours and remote work options.
  • Access to cutting-edge technologies and tools.
  • Participation in company-wide initiatives and team-building activities.
  • Comprehensive health and wellness programs.
  • Generous paid time off policy.

How to Stand Out

  • Highlight your problem-solving skills: Be prepared to provide examples of how you've diagnosed and resolved complex technical issues in the past.
  • Emphasize your collaboration experience: Showcase your ability to work effectively in teams, including distributed teams, and your strong communication skills.
  • Showcase your automation skills: Demonstrate your knowledge of automation tools and languages, such as Python or Go, and how you've applied them to improve system efficiency.
  • Prepare to discuss your understanding of AI infrastructure: Be ready to talk about your experience with AI systems, including your knowledge of GPU and HPC infrastructure.
  • Be ready to talk about your continuous learning approach: Highlight your passion for staying updated with the latest technologies and methodologies in the field of network reliability engineering.
  • Prepare examples of your infrastructure management experience: Provide specific examples of how you've managed and improved the scalability, security, and reliability of systems in previous roles.
  • Research MARGO's technology stack and culture: Understanding the company's current projects, technologies, and values can help you tailor your application and prepare for interviews.

This is a remote position listed on WFA Digital, the platform for professionals who work from anywhere.