Senior Infra Engineer: Observability

Railway · Remote (Global)
Software Development

WFA Digital Insight

As the remote job market continues to evolve, demand for skilled infrastructure engineers has grown significantly. With the rise of complex digital systems, companies like Railway are seeking experts who can design and implement scalable observability solutions. According to recent trends, the need for professionals with expertise in distributed systems and fault-tolerant services has increased by over 25% in the past year. Railway stands out for its mission to empower software engineers with powerful tools, and this role offers a unique chance to contribute to that vision. Before applying, candidates should be prepared to showcase their understanding of distributed systems and experience with technologies like VictoriaMetrics and ClickHouse.

Job Description

About the Role

The Senior Infra Engineer: Observability role at Railway is a high-impact position that requires a deep understanding of distributed systems and a passion for building scalable, fault-tolerant services. As a member of the platform engineering team, you will design and implement observability solutions that can handle millions of requests per second. Your work will directly affect the company's trajectory, and you will be expected to own your solutions from concept to delivery.

The ideal candidate for this role is someone who enjoys building complex systems and is passionate about ensuring that they are scalable, reliable, and efficient. You should have a strong understanding of distributed systems and experience with technologies like VictoriaMetrics, ClickHouse, and other systems for building observability stacks from the ground up.

What You Will Do

  • Build ingestion pipelines to consume 1M+ RPS streams of logs, metrics, and other telemetry
  • Design and implement scalable, fault-tolerant alerting engines for notifying users of threshold breaches
  • Craft rich backend observability APIs that let users understand application performance at a glance
  • Provide APIs to access real-time log/metrics streams for consumption by the Dashboard and Product Teams
  • Build Golang/Rust gRPC services from scratch capable of supporting tens of thousands of users
  • Define infrastructure that can be torn down, failed over, and reconstituted from scratch using immutable infrastructure principles with Terraform and Ansible
  • Write Engineering Requirement Documents to take ideas from concept to implementation and monitor their success
  • Interface with our TypeScript and GraphQL edge to expose your microservice APIs for internal and external consumption

What We Are Looking For

  • A strong understanding of distributed systems and experience with building fault-tolerant, resilient, and scalable services
  • Interest in VictoriaMetrics, ClickHouse, and other systems for building observability stacks from the ground up
  • Solid intuition about the longevity of your solutions and the ability to plan for their maintenance and replacement
  • Excellent communication skills for getting your point across and driving solutions through to implementation
  • A great sense of direction and prioritization when dealing with ambiguity and uncertainty
  • Experience with Terraform, Ansible, and other infrastructure as code tools
  • Familiarity with Golang, Rust, and other programming languages

Nice to Have

  • Experience with Kubernetes and container orchestration
  • Knowledge of security best practices for infrastructure and applications
  • Familiarity with agile development methodologies and version control systems

Benefits and Perks

  • Competitive compensation package
  • Opportunities for professional growth and development in a rapidly evolving field
  • Flexible, remote work arrangements with a global team
  • Access to cutting-edge technologies and tools
  • Comprehensive health insurance and benefits package
  • Generous PTO and holidays

How to Stand Out

  • Be prepared to showcase your understanding of distributed systems and experience with technologies like VictoriaMetrics and ClickHouse.
  • Make sure your resume and cover letter highlight your ability to design and implement scalable, fault-tolerant services.
  • Prepare to discuss your experience with infrastructure as code tools like Terraform and Ansible.
  • Be ready to explain your approach to building ingestion pipelines and alerting engines.
  • Show your passion for building complex systems and your ability to communicate technical ideas effectively.
  • Highlight any experience you have with Golang, Rust, or other programming languages relevant to the role.
  • Emphasize your ability to work independently and prioritize tasks effectively in a remote work environment.

This is a remote position listed on WFA Digital, the platform for professionals who work from anywhere.