Responsible for ensuring highly reliable, scalable, and resilient production systems across cloud and on‑prem environments. Ensures high availability, disaster recovery readiness, and continuous improvement of service performance. Leads automation initiatives for provisioning, deployment, monitoring, and self‑healing to reduce manual effort and improve stability. Owns the event catalog, operational readiness, and reliability engineering practices to prevent recurrence of incidents and strengthen system resilience. Drives collaboration across Product, Engineering, T&E ICE, and Service Support Architects to ensure provider‑grade reliability and seamless operational integration of new releases.
Job Responsibilities
Design and maintain resilient systems that ensure high availability, scalability, and fault tolerance
Ensure effective Disaster Recovery (DR), failover strategies, and resilience engineering across environments
Improve platform reliability, observability, and performance across cloud and on‑premises systems
Establish and maintain SLIs, SLOs, and error budgets to measure and govern service reliability
Take ownership of production availability, capacity planning, performance tuning, and long‑term reliability initiatives
Drive automation for infrastructure provisioning, deployment, monitoring, and operational workflows
Develop and implement auto‑remediation and self‑healing solutions to reduce manual intervention
Manage CI/CD pipelines and Infrastructure as Code (IaC) frameworks for secure, repeatable deployments
Implement and manage zero‑downtime deployment strategies (blue‑green, canary, rolling)
Support containerized and cloud‑native platforms including Kubernetes, Docker, and distributed systems
Support NetOps tooling and network observability, ensuring visibility into network performance, events, and operational health
Perform incident management and production troubleshooting, and lead RCA/PMIR (postmortem) reviews for critical outages
Proactively identify reliability gaps, performance bottlenecks, and operational risks
Optimize incident, event, and problem management processes to reduce MTTR and improve operational efficiency
Define and maintain the event catalog, thresholds, and remediation workflows
Develop event response protocols and ensure teams are trained for rapid incident handling
Build and maintain observability solutions using monitoring, logging, tracing, and alerting platforms
Implement APM, distributed tracing, and proactive alerting to detect issues early
Integrate network telemetry and NetOps monitoring tools into the overall observability stack
Collaborate with stakeholders to improve event coverage and post‑event learning
Apply AI‑assisted observability, anomaly detection, and predictive alerting
Own the quality of new release deployments for the PSO
Conduct operational readiness assessments and manage deployment risk
Ensure supportability for new applications, platform releases, and infrastructure changes
Coordinate with internal/external stakeholders to drive continuous service improvement
Work closely with Development, Platform Engineering, Product, T&E ICE, and Service Support Architects to embed reliability best practices
Collaborate with vendors and engineering teams to enhance system reliability and operational excellence
Support the productization of new products as the SGS technical expert and ensure operational readiness
Requirements
Bachelor’s degree in Computer Science, Information Technology, Engineering, or a related field; Master’s degree preferred for senior roles
Relevant certifications such as ITIL, CCNP/CCIE, Palo Alto Security, SASE, SD‑WAN, Juniper Mist/Aruba, CompTIA Security+, or Certified Kubernetes Administrator (CKA)
Certifications in cloud platforms (AWS, Azure, Google Cloud) or DevOps methodologies
Certifications in automation and IaC tools (Ansible, Terraform)
Certifications in observability and monitoring platforms (Dynatrace, Prometheus, Grafana, ELK)
Certifications in ServiceNow, Jira, or other operational tooling
8+ years of experience in IT operations, service management, or infrastructure reliability, including roles such as Site Reliability Engineer, Problem Manager, or DevOps Engineer
Strong experience with high availability systems, resilience engineering, and DR readiness
Deep expertise in RCA, incident management, PMIR, and implementing permanent fixes for recurring issues
Hands‑on experience with CI/CD, automation, IaC, and self‑healing/auto‑remediation workflows
Proficiency in observability platforms (APM, logging, tracing, alerting) and integrating network telemetry / NetOps monitoring
Experience defining and governing SLIs, SLOs, and error budgets to improve service reliability
Experience with Kubernetes, containerized workloads, and distributed systems