Database Reliability Engineer - Core Team

ClickHouse · Remote (Australia)
Software Development

WFA Digital Insight

The demand for skilled database reliability engineers has surged, with companies like ClickHouse leading the charge in real-time analytics and data warehousing. As the remote job market continues to evolve, professionals with expertise in distributed database internals and SQL are in high demand. With ClickHouse's impressive growth and innovative approach, this role stands out as an exciting opportunity for those looking to make a significant impact. Before applying, candidates should be prepared to showcase their problem-solving skills, experience with cloud computing platforms, and ability to thrive in a fast-paced environment.

Job Description

About the Role

The Database Reliability Engineer position at ClickHouse is a unique opportunity to join a fast-growing company that is revolutionizing the way businesses use data. As a key member of the Site Reliability Engineering team, you will play a crucial role in ensuring the reliability, availability, scalability, and performance of ClickHouse. You will collaborate with various teams, including Control Plane, Dataplane, Security, Support, and Operations, to implement best practices and guide them in using ClickHouse effectively. With a strong focus on innovation and customer satisfaction, ClickHouse is committed to providing its customers with reliable and secure services.

The successful candidate will be responsible for building and leading processes to improve the reliability and performance of ClickHouse. This will involve continuously monitoring and improving the system, identifying and preventing problems before they affect customers, and collaborating with the engineering team to implement fixes and improvements. You will also own engineering escalation management and response, investigations, post-mortem analysis, and continuous improvement of how ClickHouse is run and optimized in the cloud.

ClickHouse is a company that values innovation, teamwork, and customer satisfaction. As a member of the team, you will be expected to thrive in a fast-paced environment, be a strong problem-solver, and have solid production debugging skills. You will also be expected to have excellent communication skills, be able to work effectively with cross-functional teams, and be passionate about delivering high-quality results.

What You Will Do

  • Continuously improve the reliability and performance of ClickHouse core
  • Improve and create metrics and alerts for ClickHouse to identify and prevent problems in production before they affect customers
  • Dig deeper into the most common problems customers encounter in ClickHouse core to identify their root causes, then submit bug fixes, file issue reports, and suggest improvements
  • Enhance and refine incident response processes and post-mortem analysis for outages related to ClickHouse core
  • Plan, enable, and drive chaos engineering initiatives across Engineering teams based on internal priorities
  • Manage on-call processes for responding to performance and reliability issues, and establish best practices for coordinating escalations so that issues are resolved with minimal customer impact
  • Collaborate with the engineering team to implement fixes and improvements
  • Work with the support and cloud teams to communicate with impacted customers
  • Participate in the development of new features and services
  • Stay up-to-date with industry trends and emerging technologies

What We Are Looking For

  • Bachelor’s or Master’s degree in Computer Science or a related field
  • At least 5 years of experience in Reliability Engineering, QA, or customer-facing engineering
  • Previous experience operating ClickHouse or other SQL databases in production
  • Excellent understanding of distributed database internals and SQL, particularly ClickHouse
  • Scripting experience with Shell or Python, and ability to read and understand C++ code
  • Knowledge of cloud computing platforms such as AWS, Azure, or Google Cloud Platform
  • Strong problem-solving skills and solid production debugging skills
  • Excellent communication skills
  • Ability to work effectively with cross-functional teams

Nice to Have

  • Experience with chaos engineering and testing
  • Knowledge of DevOps practices and tools
  • Familiarity with Agile development methodologies
  • Certification in a related field, such as AWS or Google Cloud Platform

Benefits and Perks

  • Competitive salary and benefits package
  • Opportunity to work with a fast-growing and innovative company
  • Collaborative and dynamic work environment
  • Flexible working hours and remote work options
  • Professional development and growth opportunities
  • Access to the latest technologies and tools
  • Recognition and reward for outstanding performance

How to Stand Out

  • Highlight your experience with distributed database internals and SQL, particularly ClickHouse, in your resume and cover letter.
  • Be prepared to provide specific examples of how you have improved the reliability and performance of databases in previous roles.
  • Showcase your problem-solving skills and ability to work effectively with cross-functional teams.
  • Familiarize yourself with ClickHouse’s products and services, and be prepared to discuss how you can contribute to the company’s mission.
  • Consider creating a personal project or contributing to open-source projects to demonstrate your skills and passion for database reliability engineering.
  • Prepare to discuss your experience with cloud computing platforms, such as AWS, Azure, or Google Cloud Platform, and how you have used them to improve database reliability and performance.

This is a remote position listed on WFA Digital, the platform for professionals who work from anywhere.