The Director of Site Reliability Engineering (SRE) will provide strategic leadership and technical direction for the reliability, scalability, and performance of our mission‑critical systems and services. This role combines deep SRE expertise with strong engineering leadership to drive organizational transformation toward reliability-first principles. The ideal candidate brings a strong software engineering foundation, a passion for automation, and a proven ability to develop and lead high‑performing teams. The Director will partner with engineering, product, operations, and business stakeholders to design, deliver, and operate resilient, high‑availability systems that support our customers and business objectives at scale.
Job Responsibilities:
Drive organizational transformation toward SRE principles and own the strategic direction for reliability maturity, cultivating a culture centered on reliability, efficiency, and continuous improvement
Develop and oversee automation strategies, tools, and frameworks that improve system reliability, reduce operational toil, and enhance team productivity
Architect and evolve robust observability, monitoring, and alerting systems
Champion chaos engineering and resilience testing practices to proactively validate system behavior under failure conditions
Partner with engineering, product, and operations teams to embed SRE practices throughout the development lifecycle and influence architectural decisions for reliability
Build, mentor, and develop a high‑performing global SRE organization, fostering technical excellence, career growth, and a strong culture of knowledge sharing
Oversee capacity planning, scalability assessments, and future‑state demand forecasting across critical systems
Lead and govern high‑severity incident response practices, ensuring rapid triage, thorough root cause analysis, and follow‑through on corrective and preventative actions
Requirements:
BS, MS, or PhD degree in Computer Science, Engineering, or a related field, or equivalent experience
7+ years of experience in the field, including 3+ years leading SRE teams or a team in a similar role
Strong experience with container orchestration (Kubernetes), infrastructure as code (Terraform), and CI/CD pipelines
Hands-on experience with observability platforms (e.g., Datadog, Prometheus, Grafana) and incident management tools (e.g., incident.io, PagerDuty)
Proficiency in at least one programming language (Python, Go, or Java) with the ability to review code and guide system design decisions
Proven experience in architecting and managing highly available, scalable, and fault-tolerant systems
Ability to define a clear reliability vision and inspire teams and stakeholders toward long‑term reliability goals
Demonstrated sound judgment and calm decision‑making under pressure, particularly during high‑severity incidents
Strong people leadership skills, with experience coaching and mentoring engineering talent, developing future leaders, and aligning peer engineering managers and leaders on reliability best practices
Strategic planning skills with a track record of aligning technical direction with organizational objectives
Excellent communication skills, with the ability to translate complex technical issues into clear, actionable insights for executive and non‑technical audiences
Highly collaborative, with the ability to work effectively with leaders across engineering, product, operations, and business functions