Database Reliability Engineer - Core Team
WFA Digital Insight
As demand for cloud-based data solutions grows, so does the need for Database Reliability Engineers. ClickHouse, a leader in real-time analytics, is seeking an experienced engineer to join its core team. With the global cloud market projected to reach
Job Description
About the Role
The Database Reliability Engineer position at ClickHouse is a crucial part of the company's core team, focusing on ensuring the reliability, availability, scalability, and performance of ClickHouse's database solutions. This role involves collaborating with various teams, including Control Plane, Dataplane, Security, Support, and Operations, to implement best practices for ClickHouse deployment and management. The successful candidate will be responsible for leading processes that enhance the overall database experience for customers, which is vital for ClickHouse's continued growth and success in the competitive cloud and data analytics market.
As a member of the Reliability Engineering Team, the Database Reliability Engineer will play a key role in identifying and resolving issues before they impact customers. This includes developing and improving metrics and alerts, conducting deep dives into common problems, and suggesting improvements. The ability to work closely with the support and cloud teams to communicate issues and resolutions to customers is essential. The role also involves managing on-call processes, driving Chaos initiatives across engineering teams, and enhancing incident response processes and post-mortem analyses.
ClickHouse is committed to providing reliable and secure services, and this role is central to achieving that goal. The company has experienced rapid growth, with a $400M Series D financing round and notable customer acquisitions, including Capital One, Lovable, and Tesla. This position offers the opportunity to be part of a dynamic team that is shaping the future of real-time analytics and data management.
What You Will Do
- Continuously improve the reliability and performance of ClickHouse core services.
- Develop and enhance metrics and alerts for early identification and prevention of production issues.
- Investigate the root causes of common problems encountered by customers and propose solutions.
- Improve incident response processes, including post-mortem analyses and communication strategies.
- Manage and refine on-call processes to ensure timely and effective responses to performance and reliability issues.
- Collaborate with support and cloud teams to minimize customer impact during outages or service disruptions.
- Plan and execute Chaos initiatives to test system resilience and identify areas for improvement.
- Enhance and document best practices for database management and reliability engineering.
- Work closely with the engineering teams to implement reliability-focused changes and improvements.
What We Are Looking For
- Bachelor's or Master's degree in Computer Science or a related field.
- At least 5 years of experience in Reliability Engineering, QA, or customer-facing engineering roles.
- Previous experience operating ClickHouse or other SQL databases in production environments.
- Strong understanding of distributed database internals and SQL.
- Scripting experience with Shell or Python and the ability to read and understand C++ code.
- Knowledge of cloud computing platforms such as AWS, Azure, or Google Cloud Platform.
- Proven problem-solving skills and experience with production debugging.
- Ability to thrive in a fast-paced environment as part of a global team.
- High level of responsibility, ownership, and accountability.
- Excellent communication skills.
Nice to Have
- Experience with Adjust and Excel for data analysis and optimization.
- Familiarity with Agile development methodologies and version control systems like Git.
- Knowledge of containerization using Docker and orchestration with Kubernetes.
- Experience with monitoring and logging tools such as Prometheus and Grafana.
Benefits and Perks
- Competitive salary and benefits package.
- Opportunity to work with a cutting-edge, fast-growing company in the cloud and data analytics space.
- Collaborative and dynamic work environment with a global team.
- Professional development and growth opportunities.
- Flexible and remote work arrangements.
- Access to the latest technologies and tools.
- Recognition and reward for outstanding performance and contributions.
How to Stand Out
- Ensure your resume and cover letter highlight specific experiences with database reliability engineering, particularly with SQL databases and cloud computing platforms.
- Be prepared to discuss scenarios where you identified and resolved complex database issues, and how you communicated these resolutions to stakeholders.
- Familiarize yourself with ClickHouse's technology and mission to demonstrate your interest in, and understanding of, the company's goals and challenges.
- Showcase your problem-solving skills by describing your approach to debugging and optimizing database performance.
- Consider creating a personal project or contributing to open-source database reliability initiatives to demonstrate your skills and passion for the field.
- Prepare questions about the company culture, team dynamics, and opportunities for professional growth to ask during the interview.
- Highlight any experience with Adjust and Excel, as these are listed as nice-to-have skills in the job description.
This is a remote position listed on WFA Digital, the platform for professionals who work from anywhere.