System Reliability Engineer
WFA Digital Insight
Demand for skilled System Reliability Engineers has grown significantly, with a 25% increase in the past year alone. As more companies shift to remote work, the need for experts who can keep systems running smoothly has never been higher. Zoom, a leader in communication and collaboration, is seeking a talented SRE to join its team. The ideal candidate has a strong background in Linux systems administration and scripting languages and is proficient with CI/CD pipelines and version control systems. Candidates should bring strong analytical and troubleshooting skills and be prepared to participate in on-call shifts and incident management.
Job Description
About the Role
As a System Reliability Engineer at Zoom, you will play a critical role in ensuring the smooth operation of the company's hybrid systems across the globe. This will involve installing, configuring, and monitoring new systems within a network of global data centers, as well as patching and maintaining thousands of physical and cloud systems worldwide. You will be part of a team committed to delivering customer happiness, improving business efficiency, and promoting agility through innovation, data-driven insights, and automation.
The SRE team at Zoom is focused on streamlining operations, which includes developing automation to reduce repetitive tasks and analyzing and addressing performance bottlenecks. You will also be responsible for updating and troubleshooting user access permissions, resolving network connectivity issues, and maintaining system firewalls. The goal is to provide a seamless user experience, optimize processes, and support Zoom's expansion in the realm of communication and collaboration.
Your day-to-day work will involve a mix of technical tasks, collaboration with other teams, and participation in on-call shifts and incident management. You will need to apply analytical and troubleshooting skills to diagnose complex system issues and utilize CI/CD pipelines and version control systems. Your expertise in Linux systems administration, scripting languages, and automation tools will be essential in managing bare metal infrastructure and datacenter operations.
What You Will Do
- Install, configure, and monitor new systems within a network of global data centers.
- Patch and maintain thousands of physical and cloud systems worldwide.
- Develop automation to reduce repetitive tasks and analyze and address performance bottlenecks.
- Update and troubleshoot user access permissions.
- Resolve network connectivity issues and maintain system firewalls.
- Participate in on-call shifts and incident management.
- Apply analytical and troubleshooting skills to diagnose complex system issues.
- Utilize CI/CD pipelines (e.g., Jenkins, GitLab CI) and version control systems (e.g., Git).
- Implement build automation, configuration management tools (e.g., Ansible), and IaC provisioning tools (e.g., Packer/Terraform).
- Manage bare metal infrastructure and datacenter operations, including operating system deployment with tools such as Foreman, Cobbler, or MAAS.
What We Are Looking For
- A BS/MS in Computer Science or a related field.
- 2-5 years of hands-on experience in Site Reliability Engineering, DevOps, or Production Operations roles.
- Proficiency in scripting languages, including Python and Shell.
- Expertise in Linux systems administration with a focus on Ubuntu.
- Willingness to participate in on-call shifts and incident management, including after-hours and weekend work for infrastructure changes and deployments.
- Analytical and troubleshooting skills to diagnose complex system issues.
- Experience with CI/CD pipelines and version control systems.
- Knowledge of build automation, configuration management tools, and IaC provisioning tools.
- Ability to manage bare metal infrastructure and datacenter operations.
Nice to Have
- Experience with Kubernetes.
- Linux certification.
- Knowledge of diverse cloud platforms.
- Experience with operating system deployment tools.
Benefits and Perks
- Competitive compensation package.
- Equity.
- Paid time off.
- Health benefits.
- Remote work stipend.
- Opportunities for professional growth and development.
- Collaborative, growth-focused environment.
- Access to cutting-edge technologies and tools.
- Recognition and rewards for outstanding performance.
How to Stand Out
- Ensure you have a strong foundation in Linux systems administration and scripting languages, such as Python and Shell, as these skills are crucial for the role.
- Familiarize yourself with CI/CD pipelines, version control systems, and automation tools, as experience with these technologies is highly valued.
- Highlight your ability to work independently and as part of a team, with strong analytical and troubleshooting skills, in your application and during interviews.
- Prepare examples of your experience with build automation, configuration management, and IaC provisioning tools to demonstrate your expertise.
- Be ready to discuss your approach to managing complex system issues and your experience with on-call shifts and incident management.
- Showcase any certifications, such as Linux certification, and experience with Kubernetes and cloud platforms, as these can be significant advantages.
- Research Zoom's company culture and values to understand how your skills and experience align with their mission and goals.
This is a remote position listed on WFA Digital, the platform for professionals who work from anywhere.