We run one of the largest self-managed ClickHouse installations on AWS, already at petabyte scale, and we’re actively preparing it for 10–50× growth. This role sits at the centre of that effort. It is not a typical “keep the lights on” SRE role: the work is about turning a fast-growing, stateful system into a predictable, well-automated platform. You’ll work on the kind of problems that only show up at large scale (petabytes of data, thousands of cores, constant ingestion), and you’ll have room to design and automate, not just respond to alerts.
Job Responsibilities:
Turning a fast-growing, stateful system into a predictable, well-automated platform (provisioning, scaling, rebalancing, recovery)
Reducing operational stress, designing safe automation for data-heavy workloads, and building tooling and patterns for scale
Managing large fleets of EC2-based VMs, disks, and networking for data-intensive workloads
Improving operational tooling around deploys, schema changes, backups, restores, and incident response
Working closely with ClickHouse engineers to turn database-level needs into infra-level solutions
Reducing operational load by identifying repeat pain points and eliminating them through code and self-healing automation
Participating in on-call and incident response, with a focus on making incidents rarer over time
Requirements:
Strong experience operating production infrastructure on AWS
Hands-on experience with VM-based systems (EC2), not just managed PaaS
Experience automating infrastructure using tools like Terraform, Ansible, or similar
Solid understanding of Linux systems (disk, memory, networking, failure modes)
Experience supporting stateful systems (databases, queues, storage systems, etc.)
Ability to debug and reason about performance and reliability issues in production
Comfortable owning systems end-to-end, including on-call responsibilities
Nice to have:
Prior experience with ClickHouse or other analytical databases
Experience running systems at very large data scale
Familiarity with Kubernetes (helpful, but not the core of this role)