Staff Site Reliability Engineer, Core AI Infrastructure
WFA Digital Insight
The demand for skilled site reliability engineers in the AI sector has skyrocketed, with a 25% increase in job postings over the last year. As companies like Coinbase continue to invest in AI, experts with experience in cloud infrastructure, containerization, and automation are in high demand. With its commitment to innovation and remote work, Coinbase stands out as an attractive employer for those looking to make a real impact. Before applying, candidates should be prepared to showcase their technical expertise and experience in fast-paced, high-growth environments.
Job Description
About the Role
As a Staff Site Reliability Engineer at Coinbase, you will be part of a high-performing team driving AI transformation. Your primary focus will be on building and scaling the infrastructure that powers Coinbase's AI products, working closely with senior leadership in a fast-paced environment. This role offers a unique opportunity to own the reliability and automation of critical AI infrastructure, ensuring systems are resilient, observable, and secure at scale.
The IT Operations team is responsible for the development and maintenance of the infrastructure that supports Coinbase's AI products. As a Staff Site Reliability Engineer, you will play a key role in this team, working on the design, implementation, and operation of scalable and reliable systems. Your expertise in cloud infrastructure, containerization, and automation will be essential in driving the team's success.
Coinbase is a remote-first company, but you can expect to participate in quarterly in-person working sessions. This is a great opportunity to collaborate with your colleagues and contribute to the company's mission to increase economic freedom.
What You Will Do
- Own the reliability, monitoring, and incident response lifecycle for AI infrastructure services, including on-call support for AWS deployment pipelines, root cause analysis, and blameless retros.
- Build automation and tooling to streamline operational IT workflows, eliminate manual tasks, and improve deployment velocity across CI/CD frameworks and Kubernetes environments.
- Partner with the Coinbase Infrastructure team to extend CI/CD frameworks supporting IT services and enterprise network platforms, and with Security and Compliance to integrate surveillance tooling into deployment pipelines.
- Strengthen observability and documentation standards across IT engineering by defining metrics, implementing monitoring solutions, and maintaining technical documentation that sets a standard of excellence.
- Develop full-stack applications that power internal AI products and infrastructure with Go or Python.
- Collaborate with cross-functional teams to identify and prioritize infrastructure needs, and to develop and implement solutions that meet those needs.
- Participate in on-call rotations to ensure 24/7 coverage of critical systems and respond to incidents as needed.
- Develop and maintain technical documentation for infrastructure components and systems.
- Stay up to date with industry trends and emerging technologies, and apply that knowledge to improve the reliability and efficiency of Coinbase's AI infrastructure.
What We Are Looking For
- 8+ years of experience automating and supporting cloud infrastructure (AWS) and network environments, with hands-on use of infrastructure-as-code tools (Terraform, Ansible, Chef, Puppet, or Salt).
- Proven experience deploying, managing, and troubleshooting containerized workloads using Docker and Kubernetes in production environments.
- Proficiency in at least one scripting or programming language (Python, Bash, Ruby, or Go) and version control workflows using Git-based CI/CD pipelines.
- Track record of leading incident response in environments with strict SLAs, including root cause analysis, blameless retros, and measurable reliability improvements.
- Responsible use of generative AI, maintaining human oversight to deliver business-ready outputs and drive measurable improvements in workflow efficiency, cost, and quality.
- Experience with Linux, Bash, Ruby, Python, and/or Go.
- Strong understanding of network security fundamentals and experience with log aggregation.
Nice to Have
- Expertise automating EC2 or container deployments with Terraform.
- Strong network security fundamentals.
- Experience managing and leveraging log aggregation.
- Experience working in a highly regulated environment.
- Experience in a fast-paced, high-growth company.
- Experience in a remote-first IT environment.
Benefits and Perks
- Competitive salary and equity package.
- Comprehensive health insurance, including medical, dental, and vision.
- Flexible PTO policy, with a minimum of 4 weeks per year.
- Remote work stipend to support your home office setup.
- Access to professional development opportunities, including training and conference sponsorships.
- Quarterly in-person working sessions to collaborate with your colleagues.
- A dynamic and supportive work environment, with a team of experienced professionals who are passionate about what they do.
How to Stand Out
- Make sure to highlight your experience with cloud infrastructure, containerization, and automation in your resume and cover letter.
- Be prepared to talk about your experience with incident response and root cause analysis in your interview.
- Showcase your proficiency in at least one scripting or programming language and version control workflows using Git-based CI/CD pipelines.
- Emphasize your ability to work in a fast-paced, high-growth environment and your experience with remote work.
- Research Coinbase's company culture and values, and be prepared to discuss how you align with them.
- Prepare examples of your experience with network security fundamentals and log aggregation.
This is a remote position listed on WFA Digital, the platform for professionals who work from anywhere.