We are looking for a Senior Site Reliability Engineer to design, scale, and secure our internal infrastructure. You will bridge the gap between high-level system architecture and deep-dive technical troubleshooting, with a specific focus on observability and high availability.
Job Responsibilities:
Architect and manage highly available, distributed systems across multiple global data centers with a focus on optimised performance and disaster recovery
Define and enforce SLOs/SLIs, manage error budgets, and lead post-mortems
Participate in an on-call rotation, acting as a point of escalation for complex infrastructure outages
Identify and automate manual operations to effectively reduce toil
Design and implement multi-layered monitoring strategies (synthetic, blackbox, and whitebox) for both on-premises and SaaS services using tools such as Prometheus, Grafana, and ELK
Act as a technical mentor within the team, facilitating the upskilling of team members across different global regions
Requirements:
10+ years of experience in high-traffic environments where downtime has a direct financial or operational impact
Advanced experience managing production Kubernetes clusters and apps using Helm and ArgoCD
Proficient with Infrastructure as Code (IaC) for provisioning both cloud and on-premises resources, ideally with Terraform
Hands-on experience with Consul, Vault, and HAProxy
Experience managing and troubleshooting large-scale Mail Transfer Agents (MTAs), including Postfix
Proficiency in at least one of the following programming languages: Go or Python
Nice to have:
Experience managing Next Generation Firewalls (NGFW), ideally Palo Alto GlobalProtect
Experience managing and maintaining LDAP infrastructure