Database Reliability Engineer - Core Team
WFA Digital Insight
The demand for skilled database reliability engineers has surged, with a 25% increase in job postings over the past year. As companies like ClickHouse continue to innovate in real-time analytics and data warehousing, the need for experts who can ensure seamless database operations has never been more pressing. With the rise of cloud computing, professionals with experience in distributed databases and cloud platforms are highly sought after. ClickHouse, a leader in its field, offers a unique opportunity for engineers to make a significant impact. Before applying, candidates should be prepared to demonstrate their problem-solving skills, knowledge of SQL, and experience with cloud computing platforms.
Job Description
About the Role
The Database Reliability Engineer role at ClickHouse is a critical position focused on ensuring the reliability, availability, scalability, and performance of ClickHouse's core database services. As a member of the Site Reliability Engineering team, you will work closely with various teams, including Control Plane, Dataplane, Security, Support, and Operations, to implement best practices and guide the adoption of ClickHouse. Your primary objective will be to build and lead processes that enhance the reliability and performance of ClickHouse, making it an indispensable tool for customers.
The role requires a deep understanding of distributed database internals, SQL, and cloud computing platforms. With the ability to work independently and collaboratively, you will drive initiatives to improve incident response, post-mortem analysis, and continuous improvement of ClickHouse operations in the cloud. Your expertise will be crucial in managing on-call processes, establishing best practices for issue escalation, and minimizing customer impact.
What You Will Do
- Continuously improve the reliability and performance of ClickHouse core services
- Develop and refine metrics and alerts to identify and prevent problems in production
- Conduct in-depth analyses of common issues to identify root causes and submit bug fixes or suggestions for improvement
- Enhance incident response processes and post-mortem analyses for ClickHouse core-related outages
- Plan and drive chaos engineering initiatives across engineering teams based on internal priorities
- Manage on-call processes and establish best practices for coordinating escalations to resolve issues promptly
- Collaborate with support and cloud teams to communicate with impacted customers
- Own areas of engineering escalation management and response
- Lead investigations and blameless post-mortem analyses
What We Are Looking For
- Bachelor's or Master's degree in Computer Science or a related field
- At least 5 years of experience in Reliability Engineering, QA, or customer-facing engineering
- Previous experience operating ClickHouse or other SQL databases in production
- Excellent understanding of distributed database internals and SQL
- Scripting experience with Shell or Python and the ability to read and understand C++ code
- Knowledge of cloud computing platforms such as AWS, Azure, or Google Cloud Platform
- Strong problem-solving skills and solid production debugging abilities
- Excellent communication skills and the ability to thrive in a fast-paced, global team environment
- High level of responsibility, ownership, and accountability
Nice to Have
- Experience with ClickHouse specifically
- Familiarity with chaos engineering principles and practices
- Certifications in cloud computing or database administration
- Experience with containerization and orchestration tools like Docker and Kubernetes
Benefits and Perks
- Opportunity to work with a leading, innovative company in real-time analytics and data warehousing
- Collaborative, global team environment
- Professional growth and development opportunities
- Flexible, remote work arrangements
- Competitive compensation and benefits package
- Access to cutting-edge technologies and tools
- Recognition and reward for outstanding performance and contributions
How to Stand Out
- Ensure your resume highlights specific experience with distributed databases, SQL, and cloud computing platforms.
- Prepare to discuss your approach to problem-solving and incident response in a cloud environment.
- Review ClickHouse's documentation and be ready to discuss its features and how you can contribute to its reliability and performance.
- Develop a portfolio or be prepared to provide examples of your work in database reliability engineering, including scripts or code snippets.
- Research the company culture and be prepared to discuss how your skills and experience align with ClickHouse's mission and values.
- Practice explaining complex technical concepts simply and clearly, as excellent communication skills are essential for this role.
This is a remote position listed on WFA Digital, the platform for professionals who work from anywhere.