Join a team of senior engineers operating in a large-scale, multi-cloud production environment supporting tens of thousands of enterprise customers worldwide. This is not a typical SRE role — you’ll work at the core of a complex, high-impact system alongside experienced DevOps professionals in a fast-paced, cybersecurity-focused organization.
Job Responsibilities:
Own and operate large-scale, global production environments across multiple cloud providers (GCP, AWS, Azure)
Actively monitor, investigate, and resolve incidents triggered by automated alerting systems (PagerDuty / Incident Response)
Drive end-to-end troubleshooting across complex, distributed systems with high context switching
Design, deploy, and improve monitoring and observability systems (e.g., Prometheus, Grafana) rather than only reacting to alerts
Collaborate closely with internal teams (CX, CS, Engineering) to ensure system reliability and performance
Work hands-on with modern DevOps and infrastructure tools including Kubernetes, Terraform, CI/CD pipelines, and GitOps workflows
Develop and maintain automation and tooling (primarily in Python)
Gain a deep understanding of the system architecture and its interconnected services
Contribute to a culture of operational excellence in a high-scale, high-availability environment
On-call responsibilities: daytime hours (12:00–20:00), plus occasional weekends and holidays on a rotation basis
Requirements:
5+ years of experience in SRE roles in production environments at scale
Strong hands-on experience with Kubernetes and Terraform
Strong hands-on experience with at least one major cloud platform (GCP or AWS required)
Experience building and configuring monitoring systems (e.g., Prometheus, Grafana)
Familiarity with CI/CD and GitOps tools (GitLab CI, GitHub Actions, Jenkins, Flux)
Proficiency in Python for scripting and automation
Strong troubleshooting and problem-solving skills with a passion for incident handling
Ability to work in fast-paced environments with high context switching
Highly responsive, proactive, and ownership-driven