The Site Reliability Engineer (SRE) is a strategic professional accountable for the daily operations, architectural resilience, and overall implementation of SRE principles in a complex, critical, and large-scale multi-disciplinary environment. This role requires a comprehensive understanding of multiple technology domains and their interaction to achieve business objectives. As a recognized technical authority, you will apply an in-depth understanding of the business impact of technical contributions and provide advice and counsel on strategic solutions.
We are seeking a passionate and experienced SRE to join our Production Management team. In this role, you will be instrumental in enhancing the reliability, performance, and efficiency of our applications and services. You will drive our strategy for end-to-end observability and resiliency, collaborating across the organization to ensure our services are stable, scalable, and fault-tolerant. This is a key role that will influence strategic decisions and foster a culture of technical excellence and accountability.
Job Responsibilities:
Foster a culture of transparency, innovation, and accountability that encourages continuous improvement
Communicate the progress and impact of SRE initiatives to stakeholders at all levels
Operate effectively within a highly regulated environment, ensuring compliance with all relevant requirements
Ensure critical business applications meet stringent operational resilience requirements, including adherence to defined impact tolerances
Oversee advanced recovery testing, including Production Swing Tests, Data Recovery Tests, and chaos engineering practices
Drive the adoption and development of automation, such as One Touch Recovery solutions, to minimize recovery time
Partner with development teams to leverage cloud-native services and established resiliency patterns to enhance application reliability
Collaborate across the organization to develop and scale observability solutions using modern tools for metrics, logging, and tracing
Partner with development teams to effectively instrument applications, providing deep insights into system health and performance
Requirements:
Deep understanding of SRE concepts, including SLOs, SLIs, error budgets, and toil reduction
Demonstrable experience with Disaster Recovery planning, resiliency testing, and fault-tolerant distributed system design
Proficiency in deploying, managing, and troubleshooting applications on OpenShift/Kubernetes
Hands-on experience with modern observability tools (e.g., Prometheus, Grafana, Loki, Mimir, Tempo, AppDynamics)
Experience with Infrastructure as Code (IaC), configuration management, and automation tools (e.g., Ansible, Terraform)
Experience creating, modifying, and managing Helm charts for application deployment
10+ years of professional experience in production management, software development, or an equivalent field, with a strong focus on Site Reliability Engineering
Expertise in analyzing complex application, database, network, and OS issues within large-scale, customer-facing systems
A service-oriented attitude combined with excellent problem-solving and strategic thinking skills
Strong communication and diplomacy skills, with a proven ability to work effectively across multiple business and technical teams
Nice to have:
Experience with major public cloud providers (e.g., Google Cloud, AWS, Azure)
Proven experience delivering software and infrastructure using Agile frameworks
Experience presenting technical strategy to senior and executive-level audiences
Experience writing or maintaining code in Java, Python, Go, or similar languages
What we offer:
Medical, dental, and vision coverage
401(k)
Life, accident, and disability insurance
Wellness programs
Paid time off packages, including planned time off (vacation), unplanned time off (sick leave), and paid holidays
Discretionary and formulaic incentive and retention awards