Senior Infra Engineer: Observability

Railway · Remote (Global)

Software Development

WFA Digital Insight

As the remote job market continues to evolve, demand for skilled infrastructure engineers has grown significantly. With the rise of complex digital systems, companies like Railway are seeking experts who can design and implement scalable observability solutions. According to recent trends, the need for professionals with expertise in distributed systems and fault-tolerant services has increased by over 25% in the past year. Railway stands out for its mission to empower software engineers with powerful tools, and this role offers a unique chance to contribute to that vision. Before applying, candidates should be prepared to showcase their understanding of distributed systems and experience with technologies like VictoriaMetrics and ClickHouse.

Job Description

About the Role

The Senior Infra Engineer: Observability role at Railway is a high-impact position that requires a deep understanding of distributed systems and a passion for building scalable, fault-tolerant services. As a member of the platform engineering team, you will design and implement observability solutions that can handle millions of requests per second. Your work will directly shape the company's trajectory, and you will be expected to own your solutions from concept to delivery.

The ideal candidate for this role is someone who enjoys building complex systems and is passionate about ensuring that they are scalable, reliable, and efficient. You should have a strong understanding of distributed systems and experience with technologies like VictoriaMetrics, ClickHouse, and other systems for building observability stacks from the ground up.

What You Will Do

Build ingestion pipelines to consume 1M+ RPS streams of logs, metrics, and other telemetry
Design and implement scalable, fault-tolerant alerting engines for notifying users of threshold breaches
Craft rich backend observability APIs that let users understand application performance at a glance
Provide APIs to access real-time log/metrics streams for consumption by the Dashboard and Product Teams
Build Golang/Rust gRPC services from scratch capable of supporting tens of thousands of users
Define infrastructure that can be torn down, failed over, and reconstituted from scratch using immutable infrastructure principles with Terraform and Ansible
Write Engineering Requirement Documents to take ideas from concept to implementation and monitor their success
Interface with our TypeScript and GraphQL edge to expose your microservice APIs for internal and external consumption

What We Are Looking For

A strong understanding of distributed systems and experience with building fault-tolerant, resilient, and scalable services
Interest in VictoriaMetrics, ClickHouse, and other systems for building observability stacks from the ground up
Solid intuition about the longevity of your solutions and the ability to plan for their maintenance and replacement
Excellent communication skills for getting your point across and driving solutions through to implementation
A great sense of direction and prioritization when dealing with ambiguity and uncertainty
Experience with Terraform, Ansible, and other infrastructure-as-code tools
Familiarity with Golang, Rust, and other programming languages

Nice to Have

Experience with Kubernetes and container orchestration
Knowledge of security best practices for infrastructure and applications
Familiarity with agile development methodologies and version control systems

Benefits and Perks

Competitive compensation package
Opportunities for professional growth and development in a rapidly evolving field
Flexible, remote work arrangements with a global team
Access to cutting-edge technologies and tools
Comprehensive health insurance and benefits package
Generous PTO and holidays

How to Stand Out

Be prepared to showcase your understanding of distributed systems and experience with technologies like VictoriaMetrics and ClickHouse.
Make sure your resume and cover letter highlight your ability to design and implement scalable, fault-tolerant services.
Prepare to discuss your experience with infrastructure-as-code tools like Terraform and Ansible.
Be ready to explain your approach to building ingestion pipelines and alerting engines.
Show your passion for building complex systems and your ability to communicate technical ideas effectively.
Highlight any experience you have with Golang, Rust, or other programming languages relevant to the role.
Emphasize your ability to work independently and prioritize tasks effectively in a remote work environment.

This is a remote position listed on WFA Digital, the platform for professionals who work from anywhere.