As a Site Reliability Engineer (SRE), you will play a pivotal role in the design, implementation, and maintenance of the infrastructure that supports our software development lifecycle. You will work closely with software engineers, QA, and IT teams to ensure the availability, reliability, and performance of our systems. Your primary focus will be on streamlining our deployment processes, improving system scalability, and ensuring a robust, secure, and cost-efficient infrastructure.
Job Responsibilities:
Partner with product engineering squads to design, build, and operate highly reliable services
Own and improve production reliability end-to-end: define and measure SLOs/SLIs, error budgets, and reliability goals; lead incident response, postmortems, and follow-up action items; participate in the on-call rotation and drive rapid, effective resolution of production issues
Build and maintain world-class observability: create comprehensive dashboards, alerts, metrics, structured logging, and distributed tracing; enable squads to understand system behavior and debug effectively
Develop automation, tooling, and infrastructure as code to reduce toil and increase developer velocity
Collaborate closely with Staff Engineers / Team Leads to embed reliability best practices into the development lifecycle; review architectural decisions with a production lens; and mentor engineers on operational excellence, observability, and on-call mindset
Champion modern engineering and DevOps practices: CI/CD pipelines; progressive delivery (feature flags, canaries, blue-green); infrastructure as code (Terraform, Pulumi, CDK); and effective use of AI-assisted tools to accelerate scripting, debugging, and documentation
Proactively identify and eliminate classes of failure through chaos engineering, capacity planning, and performance tuning
Help evolve our technical strategy for reliability, scalability, and cost-efficiency
Requirements:
5+ years of professional experience in SRE, DevOps, or software engineering with a strong focus on production systems
Deep hands-on experience operating distributed cloud systems (AWS / GCP / Azure — at least one in depth, preferably AWS)
Proficiency in at least one modern programming language used for tooling & automation (Go, Python, TypeScript/JavaScript, Rust)
Strong observability expertise: building dashboards and alerts (Grafana, Groundcover, Datadog, New Relic, Prometheus, etc.); distributed tracing (OpenTelemetry, Jaeger, Zipkin); structured logging and metrics at scale
Proven track record of incident management, postmortems, and driving reliability improvements
Experience defining and working with SLOs, SLIs, and error budgets
Comfort with infrastructure as code and modern DevOps practices (CI/CD, GitOps, containers/Kubernetes)
Excellent collaboration skills — you enjoy partnering with product engineers and teaching reliability concepts
Bias toward automation and reducing manual toil
Effective Communication
Problem-Solving Attitude
Collaboration and Teamwork
Adaptability and Flexibility
Nice to have:
Previous on-call leadership or incident commander experience
Background in performance engineering or capacity planning at scale
Familiarity with service meshes, API gateways, or zero-trust networking
Contributions to open-source reliability/observability tools
Experience mentoring or embedding within product squads
What we offer:
Competitive compensation and performance-based incentives
Opportunities for professional growth through workshops and certifications
Flexible work-life balance with remote options
Collaborative culture
Exposure to diverse projects across various industries