Lead Site Reliability Engineer/Expert Job at SITA (Cairo)

Job Description

Responsible for ensuring highly reliable, scalable, and resilient production systems across cloud and on‑prem environments. Ensures high availability, disaster recovery readiness, and continuous improvement of service performance. Leads automation initiatives for provisioning, deployment, monitoring, and self‑healing to reduce manual effort and improve stability. Owns the event catalog, operational readiness, and reliability engineering practices to prevent recurrence of incidents and strengthen system resilience. Drives collaboration across Product, Engineering, T&E ICE, and Service Support Architects to ensure provider‑grade reliability and seamless operational integration of new releases.

Job Responsibility

Design & maintain resilient systems ensuring high availability, scalability, and fault tolerance
Ensure effective Disaster Recovery (DR), failover strategies, and resilience engineering across environments
Improve platform reliability, observability, and performance across cloud and on‑premises systems
Establish and maintain SLIs, SLOs, and error budgets to measure and govern service reliability
Take ownership of production availability, capacity planning, performance tuning, and long‑term reliability initiatives
Drive automation for infrastructure provisioning, deployment, monitoring, and operational workflows
Develop and implement auto‑remediation and self‑healing solutions to reduce manual intervention
Manage CI/CD pipelines and Infrastructure as Code (IaC) frameworks for secure, repeatable deployments
Implement and manage zero‑downtime deployment strategies (blue‑green, canary, rolling)
Support containerized and cloud‑native platforms including Kubernetes, Docker, and distributed systems
Support NetOps tooling and network observability, ensuring visibility into network performance, events, and operational health
Perform incident management, production troubleshooting, and lead RCA/PMIR (Postmortem) for critical outages
Proactively identify reliability gaps, performance bottlenecks, and operational risks
Optimize incident, event, and problem management processes to reduce MTTR and improve operational efficiency
Define and maintain the event catalog, thresholds, and remediation workflows
Develop event response protocols and ensure teams are trained for rapid incident handling
Build and maintain observability solutions using monitoring, logging, tracing, and alerting platforms
Implement APM, distributed tracing, and proactive alerting to detect issues early
Integrate network telemetry and NetOps monitoring tools into the overall observability stack
Collaborate with stakeholders to improve event coverage and post‑event learning
Apply AI‑assisted observability, anomaly detection, and predictive alerting
Own the quality of new release deployments for the PSO
Conduct operational readiness assessments and manage deployment risk
Ensure supportability for new applications, platform releases, and infrastructure changes
Coordinate with internal/external stakeholders to drive continuous service improvement
Work closely with Development, Platform Engineering, Product, T&E ICE, and Service Support Architects to embed reliability best practices
Collaborate with vendors and engineering teams to enhance system reliability and operational excellence
Support the productization of new products as the SGS technical expert and ensure operational readiness

Requirements

Bachelor’s degree in Computer Science, Information Technology, Engineering, or a related field; Master’s degree preferred for senior roles
Relevant certifications such as ITIL, CCNP/CCIE, Palo Alto Security, SASE, SDWAN, Juniper Mist/Aruba, CompTIA Security+, or Certified Kubernetes Administrator (CKA)
Certifications in cloud platforms (AWS, Azure, Google Cloud) or DevOps methodologies
Certifications in automation and IaC tools (Ansible, Terraform)
Certifications in observability and monitoring platforms (Dynatrace, Prometheus, Grafana, ELK)
Certifications in ServiceNow, Jira, or other operational tooling
8+ years of experience in IT operations, service management, or infrastructure reliability, including roles such as Site Reliability Engineer, Problem Manager, or DevOps Engineer
Strong experience with high availability systems, resilience engineering, and DR readiness
Deep expertise in RCA, incident management, PMIR, and implementing permanent fixes for recurring issues
Hands‑on experience with CI/CD, automation, IaC, and self‑healing/auto‑remediation workflows
Proficiency in observability platforms (APM, logging, tracing, alerting) and integrating network telemetry / NetOps monitoring
Experience defining and governing SLIs, SLOs, and error budgets to improve service reliability
Experience with Kubernetes, containerized workloads, and distributed systems
Experience managing deployments, operational readiness, risk assessments, and improving event/problem management processes
Strong cross‑functional collaboration with Development, Operations, Engineering, Product, T&E ICE, and SSA
Familiarity with cloud platforms, scalable architectures, and zero‑downtime deployment strategies
Cloud Infrastructure — AWS/Azure, Linux, virtualization, HA/DR architecture
Automation & IaC — Ansible, Terraform, CI/CD pipelines, self‑healing workflows
Observability & Monitoring — APM, logging, tracing, alerting, Dynatrace, Prometheus, Grafana, ELK
NetOps Monitoring — network telemetry, event monitoring, and operational visibility tools
Containerization & Orchestration — Docker, Kubernetes, distributed systems
Deployment & Release Engineering — zero‑downtime strategies (blue‑green, canary), operational readiness
Programming & Scripting — Python, Bash, PowerShell for automation and tooling
Reliability Engineering — SLIs/SLOs, error budgets, capacity planning, performance tuning

What we offer

Work from home up to 2 days/week (depending on your team's needs)
Make your workday suit your life and plans
Take up to 30 days a year to work from any location in the world
Employee Assistance Program (EAP), for you and your dependents 24/7, 365 days/year
Champion Health - a personalized platform that supports a range of wellbeing needs
Access to world-class learning platforms and programs (LinkedIn Learning, Microsoft's Enterprise Skills Initiative, Airport Council International, Pluralsight, Harvard Business Publishing, Stanford)
Competitive benefits that make sense with both your local market and employment status
