As a Principal Site Reliability Engineer on the ADEM (Autonomous Digital Experience Management) team, you will support the services that provide end-to-end visibility and self-healing capabilities for our global customers. The role spans automation, architecture, performance, observability, troubleshooting, security, and reliability. Our Infrastructure Platform stack includes Terraform, Kubernetes, GitLab CI, ArgoCD, Prometheus, Grafana, Loki, Docker, GCP, AWS, Vault, Kafka, MySQL, Python, Bash, and Go.
Job Responsibilities:
Drive the success of SRE and DevOps through expert contributions in CI/CD and AIOps initiatives, moving the organization toward self-healing infrastructure
Architect "Golden Paths" for service delivery, ensuring that SLOs, error budgets, and automated canary analysis are integrated by default
Design, build, and operate reliable, secure Cloud infrastructure that supports high-scale synthetic monitoring and Real User Monitoring (RUM)
Ensure applications are production-ready, scalable, and resilient, collaborating closely with developers, researchers, and data scientists
Develop tools and automation frameworks that champion Infrastructure as Code (IaC) and Monitoring as Code (MaC)
Lead root cause analysis (RCA) of critical business and production issues, driving improvements that prevent recurrence
Requirements:
7+ years as an engineer in Infrastructure, Operations, DevOps, or Systems Engineering
Demonstrated proficiency with code-assist and AI productivity tools such as Claude Code, Cursor, Windsurf, or GitHub Copilot to accelerate development and troubleshooting
Expertise in building high-availability, scalable cloud-native applications on GCP (preferred) or AWS
Expertise in configuration management and IaC (Terraform, Helm, Ansible)
Strong proficiency in a programming language such as Python, Go, or Java
Deep experience in Kubernetes (GKE/EKS), container networking, and Linux internals
Experience with GitOps principles and tools like GitLab CI and ArgoCD
Familiarity with compliance and security frameworks (FedRAMP, SOC 2) and experience automating policy as code
Excellent communication skills, with a "rally support" mindset to collaborate across multi-functional teams
BS or MS in Computer Science, a related field, or equivalent professional/military experience
Nice to have:
Experience with data streaming frameworks like Kafka or Apache Pulsar