Staff Site Reliability Engineer, Production Engineering

Dropbox · Remote (Remote - US: Select locations)
Software Development
Excel

WFA Digital Insight

As the shift to AI-driven software development accelerates, demand for skilled site reliability engineers is on the rise. Dropbox is at the forefront of this trend, and this role offers a unique chance to shape the company's reliability strategy. With the global remote workforce projected to reach 73% by 2028, companies are looking for experts who can ensure seamless operations. To succeed in this field, candidates need a strong foundation in software engineering, a keen understanding of system reliability, and excellent collaboration skills. Before applying, candidates should be aware of the complexities of large-scale system management and the importance of effective communication in a distributed team environment.

Job Description

About the Role

The Staff Site Reliability Engineer position at Dropbox is a critical component of the company's Production Engineering team. As a key player in advancing Dropbox's stability, observability, incident response, and operational excellence, you will be instrumental in shaping the reliability strategy for the company's next phase of growth. This involves preparing Dropbox for the increased complexity and demand that come with AI-assisted software development. Your work will have a direct impact on the millions of users who rely on Dropbox for their daily operations.

The role of a Site Reliability Engineer at Dropbox is multifaceted, requiring a deep understanding of software engineering principles, a keen sense of operational efficiency, and the ability to collaborate effectively across various teams. You will be working closely with Engineering, Product, and leadership teams to define and implement multi-year reliability goals, standards, and roadmaps. This is an exciting opportunity for someone who is passionate about reliability, scalability, and the application of AI in software development.

What You Will Do

  • Define and evolve Dropbox’s company-wide technical reliability strategy to support the changing engineering environment.
  • Set multi-year reliability goals, standards, and roadmaps across observability, debugging, incident management, service health, and operational readiness.
  • Lead cross-team initiatives to reduce reliability risk as software delivery velocity, pull request volume, service complexity, and incident volume increase.
  • Partner with engineering leaders and platform teams to improve monitoring, alerting, debugging, SLOs, SLAs, and incident response systems at company scale.
  • Identify emerging reliability challenges and opportunities, proposing innovative solutions to address them.
  • Develop and maintain deep technical expertise in areas relevant to Dropbox’s reliability, such as cloud computing, distributed systems, and artificial intelligence.
  • Collaborate with external partners and the open-source community to leverage best practices and contribute to the advancement of site reliability engineering.
  • Stay up to date with industry trends and advancements, applying this knowledge to continuously improve Dropbox’s reliability posture.
  • Engage in incident response and post-incident analysis to identify areas for improvement and implement changes to prevent future incidents.

What We Are Looking For

  • 5+ years of experience in a Site Reliability Engineering role, preferably in a cloud-based environment.
  • Strong foundation in software engineering principles, including design patterns, testing, and validation.
  • Experience with distributed systems, microservices architecture, and containerization (e.g., Docker, Kubernetes).
  • Proficiency in programming languages such as Python, Java, C++, or equivalent.
  • Strong understanding of cloud platforms (AWS, GCP, Azure), including their services and limitations.
  • Experience with monitoring, logging, and alerting tools (e.g., Prometheus, Grafana, ELK Stack).
  • Knowledge of reliability engineering principles, including SLOs, SLAs, and error budgets.
  • Excellent communication and collaboration skills, with the ability to work effectively in a remote team environment.
  • Strong problem-solving skills, with the ability to debug complex issues in distributed systems.
  • Experience with incident management and post-incident reviews.

Nice to Have

  • Experience with AI and machine learning technologies, particularly in the context of software development and operations.
  • Knowledge of DevOps practices and tools (e.g., Jenkins, GitLab CI/CD, CircleCI).
  • Experience with security practices and compliance in a cloud environment.
  • Participation in open-source projects or contributions to the site reliability engineering community.

Benefits and Perks

  • Competitive salary and benefits package.
  • Opportunities for career growth and professional development.
  • Collaborative and dynamic work environment.
  • Flexible working hours and remote work options.
  • Access to the latest technologies and tools.
  • Health insurance, retirement plans, and other benefits.
  • Paid time off and holidays.
  • Opportunities for professional training and education.
  • Recognition and reward programs for outstanding performance.

How to Stand Out

  • Develop a strong understanding of cloud computing platforms and distributed systems.
  • Showcase your experience with monitoring, logging, and alerting tools in your portfolio or during interviews.
  • Highlight your problem-solving skills, particularly in the context of complex system issues.
  • Be prepared to discuss your approach to reliability engineering, including SLOs, SLAs, and error budgets.
  • Emphasize your ability to collaborate effectively in a remote team environment and your excellent communication skills.
  • Consider sharing your experience or interest in AI and machine learning as they relate to software development and operations.
  • Prepare examples of your experience with incident management and post-incident reviews to demonstrate your capabilities in handling critical system issues.

This is a remote position listed on WFA Digital, the platform for professionals who work from anywhere.