Lead Site Reliability Engineer/ Expert

Egypt; India, Cairo · Job Posted June 09, 2026

Job Description

Responsible for ensuring highly reliable, scalable, and resilient production systems across cloud and on‑prem environments. Ensures high availability, disaster recovery readiness, and continuous improvement of service performance. Leads automation initiatives for provisioning, deployment, monitoring, and self‑healing to reduce manual effort and improve stability. Owns the event catalog, operational readiness, and reliability engineering practices to prevent recurrence of incidents and strengthen system resilience. Drives collaboration across Product, Engineering, T&E ICE, and Service Support Architects to ensure provider‑grade reliability and seamless operational integration of new releases.

Job Responsibility

  • Design & maintain resilient systems ensuring high availability, scalability, and fault tolerance
  • Ensure effective Disaster Recovery (DR), failover strategies, and resilience engineering across environments
  • Improve platform reliability, observability, and performance across cloud and on‑premises systems
  • Establish and maintain SLIs, SLOs, and error budgets to measure and govern service reliability
  • Take ownership of production availability, capacity planning, performance tuning, and long‑term reliability initiatives
  • Drive automation for infrastructure provisioning, deployment, monitoring, and operational workflows
  • Develop and implement auto‑remediation and self‑healing solutions to reduce manual intervention
  • Manage CI/CD pipelines and Infrastructure as Code (IaC) frameworks for secure, repeatable deployments
  • Implement and manage zero‑downtime deployment strategies (blue‑green, canary, rolling)
  • Support containerized and cloud‑native platforms including Kubernetes, Docker, and distributed systems
  • Support NetOps tooling and network observability, ensuring visibility into network performance, events, and operational health
  • Perform incident management, production troubleshooting, and lead RCA/PMIR (Postmortem) for critical outages
  • Proactively identify reliability gaps, performance bottlenecks, and operational risks
  • Optimize incident, event, and problem management processes to reduce MTTR and improve operational efficiency
  • Define and maintain the event catalog, thresholds, and remediation workflows
  • Develop event response protocols and ensure teams are trained for rapid incident handling
  • Build and maintain observability solutions using monitoring, logging, tracing, and alerting platforms
  • Implement APM, distributed tracing, and proactive alerting to detect issues early
  • Integrate network telemetry and NetOps monitoring tools into the overall observability stack
  • Collaborate with stakeholders to improve event coverage and post‑event learning
  • Apply AI‑assisted observability, anomaly detection, and predictive alerting
  • Own the quality of new release deployments for the PSO
  • Conduct operational readiness assessments and manage deployment risk
  • Ensure supportability for new applications, platform releases, and infrastructure changes
  • Coordinate with internal/external stakeholders to drive continuous service improvement
  • Work closely with Development, Platform Engineering, Product, T&E ICE, and Service Support Architects to embed reliability best practices
  • Collaborate with vendors and engineering teams to enhance system reliability and operational excellence
  • Support new product productization as SGS technical expert and ensure operational readiness

Requirements

  • Bachelor’s degree in Computer Science, Information Technology, Engineering, or a related field. Master’s degree preferred for senior roles
  • Relevant certifications such as ITIL, CCNP/CCIE, Palo Alto Security, SASE, SDWAN, Juniper Mist/Aruba, CompTIA Security+, or Certified Kubernetes Administrator (CKA)
  • Certifications in cloud platforms (AWS, Azure, Google Cloud) or DevOps methodologies
  • Certifications in automation and IaC tools (Ansible, Terraform)
  • Certifications in observability and monitoring platforms (Dynatrace, Prometheus, Grafana, ELK)
  • Certifications in ServiceNow, Jira, or other operational tooling
  • 8+ years in IT operations, service management, or infrastructure reliability, including roles such as Site Reliability Engineer, Problem Manager, or DevOps Engineer
  • Strong experience with high availability systems, resilience engineering, and DR readiness
  • Deep expertise in RCA, incident management, PMIR, and implementing permanent fixes for recurring issues
  • Hands‑on experience with CI/CD, automation, IaC, and self‑healing/auto‑remediation workflows
  • Proficiency in observability platforms (APM, logging, tracing, alerting) and integrating network telemetry / NetOps monitoring
  • Experience defining and governing SLIs, SLOs, and error budgets to improve service reliability
  • Experience with Kubernetes, containerized workloads, and distributed systems
  • Experience managing deployments, operational readiness, risk assessments, and improving event/problem management processes
  • Strong cross‑functional collaboration with Development, Operations, Engineering, Product, T&E ICE, and SSA
  • Familiarity with cloud platforms, scalable architectures, and zero‑downtime deployment strategies
  • Cloud Infrastructure — AWS/Azure, Linux, virtualization, HA/DR architecture
  • Automation & IaC — Ansible, Terraform, CI/CD pipelines, self‑healing workflows
  • Observability & Monitoring — APM, logging, tracing, alerting, Dynatrace, Prometheus, Grafana, ELK
  • NetOps Monitoring — network telemetry, event monitoring, and operational visibility tools
  • Containerization & Orchestration — Docker, Kubernetes, distributed systems
  • Deployment & Release Engineering — zero‑downtime strategies (blue‑green, canary), operational readiness
  • Programming & Scripting — Python, Bash, PowerShell for automation and tooling
  • Reliability Engineering — SLIs/SLOs, error budgets, capacity planning, performance tuning

What we offer

  • Work from home up to 2 days/week (depending on your team's needs)
  • Make your workday suit your life and plans
  • Take up to 30 days a year to work from any location in the world
  • Employee Assistance Program (EAP), for you and your dependents 24/7, 365 days/year
  • Champion Health - a personalized platform that supports a range of wellbeing needs
  • Access to world-class learning platforms and programs (LinkedIn Learning, Microsoft's Enterprise Skills Initiative, Airport Council International, Pluralsight, Harvard Business Publishing, Stanford)
  • Competitive benefits that make sense with both your local market and employment status

Similar Jobs for Lead Site Reliability Engineer/ Expert

8 matching positions

Lead Site Reliability Engineer

Trimble is looking for a Site Reliability Engineering Lead to join Business Syst...
Location: India, Chennai
Salary: Not provided
Company: Trimble Inc.
Expiration Date: Until further notice
Requirements
  • Bachelor's or Master's degree in Computer Engineering, Computer Science, or a related field
  • 7+ years in Site Reliability Engineering, DevOps, or Cloud Infrastructure roles with at least 2+ years in a leadership or mentoring capacity
  • Deep AWS expertise (EC2, S3, RDS, IAM, VPC, Lambda, CloudFormation/Terraform, etc.)
  • Strong knowledge of Infrastructure-as-Code (IaC) using Terraform, AWS CDK, or CloudFormation
  • Proven experience with CI/CD tools (Jenkins, GitHub Actions, GitLab CI, or similar)
  • Proficiency in containerization and orchestration (Docker, Kubernetes, ECS, or EKS)
  • Expertise in monitoring and observability tools (Datadog, New Relic, Prometheus, Grafana, ELK, CloudWatch, etc.)
  • Strong scripting or programming background (Python, Bash, or Go)
  • Sound understanding of networking, security, and identity/access management in the cloud
  • Experience designing high-availability and disaster recovery strategies for critical workloads
Job Responsibility
  • Become well-versed in the opportunities and challenges of the business and Trimble's customers
  • Become an expert in Business Systems services, especially the interfaces—APIs, protocols (e.g. OAuth), and user interfaces
  • Establish, then utilize tight working relationships with stakeholders across the company, especially Trimble's engineering community
  • Prototype and create proofs of concept as required
  • Scope and deploy new integrations
  • Investigate, diagnose, and solve customer integration issues
  • Effectively communicate technical issues with stakeholders in non-technical language
  • Contribute to utilities and SDKs to help integration and migration efforts
  • Fulltime

Technical Lead-Site Reliability Engineer

We are seeking an experienced Site Reliability Engineer to support Vodafone’s st...
Location: Egypt, Cairo
Salary: Not provided
Company: Vodafone
Expiration Date: Until further notice
Requirements
  • Experienced in Site Reliability Engineering, DevOps, or production support roles within complex, enterprise-scale environments
  • Skilled in Unix/Linux administration with strong shell scripting experience
  • Experienced with CI/CD tools such as Git, Jenkins, Nexus, SonarQube, and configuration or automation tools
  • Proficient in infrastructure as code using tools such as Terraform or CloudFormation
  • Comfortable working with public cloud platforms such as AWS or Azure
  • Able to develop using one or more high-level programming languages, including Python, Java, or JavaScript
  • Experienced in containerisation and orchestration technologies, including Docker and Kubernetes
  • Familiar with monitoring and observability tools such as Prometheus, Grafana, CloudWatch, or Centreon
  • Knowledgeable in microservices architecture, APIs, and web services (REST, SOAP, JSON, XML)
  • Experienced with relational and NoSQL data stores such as PostgreSQL, MariaDB, Redis, MongoDB, or similar technologies
Job Responsibility
  • Drive reliability, availability, and performance across IoT platforms through proactive monitoring, automation, and operational improvements
  • Design, deploy, review, and troubleshoot technical integrations with multiple platforms, services, and connected devices
  • Implement and enhance CI/CD practices to enable high levels of operational automation and zero-touch operations
  • Partner with development teams to improve services through rigorous testing, release management, and operational readiness
  • Act as a technical subject matter expert, supporting and coaching team members to build capability across relevant technologies
  • Lead and support incident and problem management activities, ensuring timely resolution, root cause analysis, and preventive actions in line with agreed SLAs
  • Contribute to system design reviews, including HLDs and LLDs, translating architectural decisions into operational requirements
  • Balance feature delivery speed with platform reliability through clearly defined service level objectives
  • Design, implement, and continuously enhance monitoring, alerting, and observability solutions to maintain a holistic view of system health
  • Manage production environments through proactive capacity planning, performance optimisation, and release deployments
What we offer
  • The opportunity to work on large-scale, business-critical IoT platforms with global reach
  • Exposure to modern cloud-native architectures, DevOps practices, and automation at enterprise scale
  • Collaboration with international teams across Vodafone Group and strategic partners
  • A role that blends hands-on engineering with system design, reliability strategy, and continuous improvement
  • A supportive environment that values learning, knowledge sharing, and professional growth

Site Reliability Engineer Application Development Technical Lead Analyst

The Applications Development Technology Lead Analyst is a senior level position ...
Location: Canada, Mississauga
Salary: 120800.00 - 170800.00 USD / Year
Company: Citi
Expiration Date: Until further notice
Requirements
  • 6+ years of relevant experience in an Apps Development or systems analysis role
  • 5+ years of extensive experience in systems analysis and in programming software applications with Python and RHEL
  • 5+ years of experience with site reliability and CI/CD pipelines
  • Previous experience with containerization and orchestration
  • Experience in managing and implementing successful projects
  • Subject Matter Expert (SME) in at least one area of Applications Development
  • Ability to adjust priorities quickly as circumstances dictate
  • Demonstrated leadership and project management skills
  • Consistently demonstrates clear and concise written and verbal communication
  • Bachelor's degree/University degree or equivalent experience
Job Responsibility
  • Partner with multiple management teams to ensure appropriate integration of functions to meet goals
  • Identify and define necessary system enhancements to deploy new products and process improvements
  • Resolve a variety of high-impact problems/projects through in-depth evaluation of complex business processes, system processes, and industry standards
  • Provide expertise in area and advanced knowledge of applications programming and ensure application design adheres to the overall architecture blueprint
  • Utilize advanced knowledge of system flow and develop standards for coding, testing, debugging, and implementation
  • Develop comprehensive knowledge of how areas of business integrate to accomplish business goals
  • Provide in-depth analysis with interpretive thinking to define issues and develop innovative solutions
  • Serve as advisor or coach to mid-level developers and analysts, allocating work as necessary
  • Appropriately assess risk when business decisions are made
  • Fulltime

Site Reliability Engineer

NetApp is looking for a Senior TechOps Engineer - Cassandra to join our growing ...
Location: India, Bengaluru
Salary: Not provided
Company: NetApp
Expiration Date: Until further notice
Requirements
  • Strong experience in Apache Cassandra administration and architecture, with a desire to continuously learn and develop to an expert level
  • Experience in diagnosing and recommending mitigation strategies for Cassandra-related issues, including performance degradation due to resource bottlenecks, suboptimal data modeling leading to hot partitions, excessive tombstones, and inefficiencies caused by range slices and poorly constructed queries
  • Hands-on experience with Cassandra architecture and core administrative tasks, including compactions, repairs, backup and recovery, schema disagreement resolution, and configuration management
  • Experience handling Cassandra maintenance activities, including upgrades and migrations
  • Ability to investigate and research Cassandra issues by reviewing the Apache Cassandra codebase
  • Strong knowledge and experience with Linux, with the ability to work comfortably from the command line
  • Exceptional ability to communicate clearly and professionally in written and verbal English
  • Experience working with at least one public cloud platform, preferably AWS
  • Prior IT customer service or support experience within an ITIL-based environment
  • Strong fundamental computer science and software engineering skills, particularly in operating system internals, memory management, and networking
Job Responsibility
  • Your work will ensure the security, reliability, and performance of world-class systems and databases
  • You will collaborate with the technical teams of our customers, who are globally recognized companies in the gaming, banking, and logistics industries, ranging from large multinationals to emerging start-ups
What we offer
  • Volunteer time off
  • Well-being
  • Time away
  • Fulltime

Site Reliability Engineer

As Site Reliability Engineer you will contribute to the overarching implementati...
Location: Romania, Bucuresti
Salary: Not provided
Company: NTT DATA
Expiration Date: Until further notice
Requirements
  • Bachelor's degree in Computer Science, Engineering, or related field
  • Minimum 5 years proven work experience as a Reliability Engineer or similar role
  • Expert knowledge and hands-on experience with applications hosted on cloud platforms such as Google Cloud Platform as well as with Docker / Kubernetes in combination with Google Kubernetes Engine (GKE), Terraform or similar technology
  • Experience in resilient software development in Python/Java and the use of modern CI/CD pipelines, e.g. GitHub, GitHub Actions, Bitbucket, Helm
  • Strong experience in setting up observability, monitoring, and self-healing solutions, for instance with New Relic, Splunk, Google Cloud Operations, Lightstep, and Ansible
  • Very good knowledge of security standards (e.g. TLS, OAuth2, KMS, Vault, admission controllers, Let's Encrypt) and microservice architectures, and experience with API management using Apigee or WSO2
  • Proactive attitude and collaborative team-player mindset paired with self-confidence
  • Ability to keep your cool and your eye for detail even in stressful, time-critical situations
  • Having a creative approach towards solving technical problems
  • Excellent communication skills in English
Job Responsibility
  • Define Service Level Objectives (SLOs), and enable an end-to-end view on customer satisfaction based on best practices for setting up Service Level Indicators (SLIs) to create effective strategies for maintaining and improving system performance and availability
  • Collaborate with Business Functional Analysts and Solution Architects to find improvements in the solution design to improve the resilience of technical solutions early on
  • Consult and guide the squad on the prioritization of reliability improvement and actively deliver them as part of the sprint
  • Hands-on experience in implementing reliability and resilience patterns such as auto-scaling, circuit breakers, bulkheads, rate limiters, retry mechanisms, etc.
  • Actively work on service request fulfilment and incident and problem management to identify and reduce toil and MTTR with engineering best practices
  • Align with and contribute to state-of-the-art SRE best practices, e.g. distributed tracing, OpenTelemetry, and chaos engineering, with the SRE chapter function
  • Be a knowledge and skill multiplier for your profession by leading the Site Reliability Engineer population
  • Increase the seniority of the overall Site Reliability Engineer chapter by establishing events and procedures, and foster a culture of high standards
  • Lead the engineers in your profession and help them improve each day
What we offer
  • Smooth integration and a supportive mentor
  • Pick your working style: choose from Remote, Hybrid or Office work opportunities
  • Our projects have different working hours to suit your needs
  • Sponsored certifications, trainings and top e-learning platforms
  • Private Health Insurance – custom-made for you
  • Individual coaching sessions or accredited Coaching School
  • Epic parties or themed events – lovingly designed for our people and their families
  • Fulltime

Principal Site Reliability Engineer

Palo Alto Networks runs a large hybrid infrastructure and is one of the largest ...
Location: United States, Santa Clara
Salary: 151600.00 - 245300.00 USD / Year
Company: Palo Alto Networks
Expiration Date: Until further notice
Requirements
  • BS or MS in Computer Science, a related field, or equivalent professional experience or equivalent military experience
  • Expertise in configuration management with a framework such as Ansible, Terraform, Helm, Kubernetes
  • Proficient in Python and/or Go
  • Expertise in managing applications in the Kubernetes cluster with autoscaling enabled
  • Experience in Production Engineering, DevOps, or Site Reliability
  • Expertise in the public cloud (GCP or AWS), especially in GCP
  • Strong Linux administration, internals, and network troubleshooting
  • Proficiency with programming languages like Python, Golang, and shell scripting to automate tasks
  • Experience with CI/CD pipelines, GitLab, and GitHub preferred
  • Ability to diagnose and troubleshoot complex distributed systems handling high-volume transactions
Job Responsibility
  • Contribute to the success of SRE and DevOps
  • Develop expertise in new technologies
  • Work with developers, researchers, data scientists, and security experts
  • Design, build, and operate reliable, secure Cloud infrastructure
  • Ensure that applications are production-ready, scalable, and reliable
  • Develop tools and automation frameworks
  • Automate robust deployment of robust services
  • Orchestrate end-to-end monitoring and alerting
  • Participate with SRE and Dev teams in the on-call rotation
  • Lead root cause analysis of critical business and production issues
  • Fulltime

Site Reliability Engineer

Engineering to make a system more resilient and efficient frees up time and mone...
Location: United States, Annapolis Junction
Salary: 86900.00 - 198000.00 USD / Year
Company: Booz Allen Hamilton
Expiration Date: Until further notice
Requirements
  • 5+ years of experience creating and maintaining highly reliable and scalable systems to reduce issues and downtime, including design and implementation of physical servers, storage systems, and network infrastructures
  • 5+ years of experience providing technical support for system upgrades, rollouts, and enhancements
  • 3+ years of experience developing and deploying infrastructure solutions
  • 3+ years of experience employing and sustaining VMware for v6.x and later, including the design and implementation of virtual data centers
  • 3+ years of experience designing and deploying highly available storage solutions for technologies, including SAN storage and high-capacity storage solutions
  • Experience with data center design and buildout
  • Experience transforming large-scale software, data center, or on-premises infrastructure programs to a virtualized architecture
  • Ability to interact with clients and lead, train, and mentor junior system administrators
  • Top Secret clearance
  • Bachelor's degree
Job Responsibility
  • Lead the development of more robust systems for Booz Allen by building a resilient infrastructure
  • Build in redundancy, implement monitoring tools, and automate wherever possible
  • Reduce toil by scripting routine tasks and automating self-repair
  • Support your team of engineers and act as a subject matter expert for our clients
What we offer
  • Health benefits
  • Life benefits
  • Disability benefits
  • Financial benefits
  • Retirement benefits
  • Paid leave
  • Professional development
  • Tuition assistance
  • Work-life programs
  • Dependent care
  • Fulltime

Principal Site Reliability Engineer

Palo Alto Networks runs a large hybrid infrastructure and is one of the largest ...
Location: United States, Santa Clara
Salary: 151600.00 - 245300.00 USD / Year
Company: Palo Alto Networks
Expiration Date: Until further notice
Requirements
  • BS or MS in Computer Science, a related field, or equivalent professional experience or equivalent military experience
  • Expertise in configuration management with a framework such as Ansible, Terraform, Helm, Kubernetes
  • Proficient in Python and/or Go
  • Expertise in managing applications in the Kubernetes cluster with autoscaling enabled
  • Experience in Production Engineering, DevOps, or Site Reliability
  • Expertise in the public cloud (GCP or AWS), especially in GCP
  • Strong Linux administration, internals, and network troubleshooting
  • Proficiency with programming languages like Python, Golang, and shell scripting to automate tasks
  • Ability to diagnose and troubleshoot complex distributed systems handling high-volume transactions
  • Excellent written and verbal communication, able to collaborate and rally support
Job Responsibility
  • Contribute to the success of SRE and DevOps
  • Develop expertise in new technologies
  • Work with developers, researchers, data scientists, and security experts
  • Design, build, and operate reliable, secure Cloud infrastructure
  • Ensure that applications are production-ready, scalable, and reliable
  • Develop tools and automation frameworks
  • Automate robust deployment of robust services
  • Orchestrate end-to-end monitoring and alerting
  • Participate with SRE and Dev teams in the on-call rotation
  • Lead root cause analysis of critical business and production issues
What we offer
  • Restricted stock units and a bonus
  • Fulltime