This role is critical to how our platform scales and stays reliable. As an SRE Lead, you will shape our infrastructure foundations, define our approach to resilience and observability, and ensure our systems are ready to grow at the same pace as our product. You will work closely with the engineering team and our DevOps partners, bringing both leadership and hands-on expertise. This is a role where strategic thinking meets real execution, with a direct impact on product stability and long-term growth.
Job Responsibilities:
Drive resilience and reliability improvements across our infrastructure
Design and implement monitoring, alerting, and observability practices (SLIs/SLOs, dashboards, alert hygiene)
Lead incident response culture: reduce MTTR, create playbooks, and ensure we learn from every incident
Collaborate with our external DevOps team by guiding priorities, reviewing proposals, and ensuring consistency
Keep our Kubernetes clusters (running on our own hardware) healthy, scalable, and predictable
Partner with product engineers to make deployments smooth and safe
Write and maintain technical documentation for infra and ops processes
Evolve our Helm-based deployments, GitHub CI/CD pipelines, and automation workflows
Requirements:
Solid experience running Kubernetes in production
Strong knowledge of observability stacks (Prometheus, Grafana, Loki, etc.)
Proven track record in incident management: designing processes, running postmortems, and preventing recurrence
Familiarity with Helm, GitHub Actions, and containerized workflows
Experience working in an SRE culture: SLOs, error budgets, and balancing reliability with delivery speed
Strong communication skills and the ability to confidently lead conversations with both engineers and external partners
Comfortable operating both hands-on (debugging pod crashes, tuning alert rules) and at a higher level (roadmaps, resilience strategies)
Nice to have:
Exposure to Cloudflare Workers/R2 or other edge compute/storage
Experience with Go services, Postgres, and ClickHouse
Background in cost optimization and capacity planning
Previous work in a startup or high-growth environment