Site Reliability Engineering (SRE) / Observability Technical Lead

United Kingdom, London · Job Posted January 26, 2026

Job Description

Join a dynamic team as a Site Reliability Engineer, leading observability and reliability projects. Leverage your expertise in APM, IaC, and automation to enhance system performance and scalability. Collaborate with cross-functional teams and mentor junior engineers to foster a culture of operational excellence.

Job Responsibility

  • Lead the strategic development and management of observability and reliability frameworks across the organization, ensuring alignment with business goals and technical requirements
  • Design and implement monitoring and observability solutions, collaborating with engineering teams to define standards and best practices
  • Manage Infrastructure as Code (IaC) initiatives using Terraform, coordinating with cloud and infrastructure teams to ensure scalable and secure deployments
  • Drive automation strategies for monitoring, alerting, and logging pipelines, focusing on process improvements and operational efficiency
  • Develop and maintain comprehensive observability roadmaps, including distributed tracing, logging, and metrics collection strategies
  • Collaborate with product management, sales, and pre-sales teams to provide technical expertise and support during solution design and customer engagements
  • Lead cross-functional teams to enhance CI/CD pipelines and deployment reliability, ensuring smooth integration of observability tools and practices
  • Engage with vendors and strategic partners to evaluate, select, and integrate observability and monitoring solutions, ensuring alignment with organizational needs and fostering strong collaborative relationships
  • Mentor and develop junior engineers and analysts, fostering a culture of reliability, observability, and operational excellence

Requirements

  • 5+ years of experience in SRE, Observability, or DevOps roles, with leadership responsibilities
  • Proven expertise with Application Performance Monitoring (APM) tools such as New Relic, Datadog, AppDynamics, or Dynatrace
  • Hands-on experience with OpenTelemetry (OTel) for distributed tracing and observability instrumentation
  • Strong proficiency in Infrastructure as Code (IaC) using Terraform
  • Solid understanding of cloud platforms including AWS, GCP, or Azure
  • Experience with automation/configuration management tools like Ansible, Chef, or Puppet
  • Deep knowledge of CI/CD pipelines and tools such as GitHub Actions, Jenkins, or Azure DevOps
  • Experience managing Kubernetes and containerized environments (Docker, Helm)
  • Familiarity with log aggregation and analysis platforms like ELK Stack or Splunk
  • Excellent leadership, communication, and collaboration skills

What we offer

  • Tailored benefits that support your physical, emotional, and financial wellbeing
  • Continuous growth and development opportunities
  • Flexible work options

Similar Jobs for Site Reliability Engineering (SRE) / Observability Technical Lead

8 matching positions

Site Reliability Engineering (SRE) / Lead Engineer

We are currently seeking a Site Reliability Engineering (SRE) / Lead Engineer to...
Location: Mexico, Guadalajara
Salary: Not provided
Company: NTT DATA
Expiration Date: Until further notice
Requirements
  • 8-10+ years of experience in SRE, Observability, or DevOps roles, with leadership responsibilities
  • Hands-on experience with OpenTelemetry for distributed tracing and observability instrumentation
  • Proven expertise with Application Performance Monitoring (APM) tools such as New Relic, Datadog, AppDynamics, or Dynatrace
  • Strong proficiency in Infrastructure as Code (IaC) using Terraform
  • Solid understanding of cloud platforms including AWS, GCP, or Azure
  • Experience with automation/configuration management tools like Ansible, Chef, or Puppet
  • Deep knowledge of CI/CD pipelines and tools such as GitHub Actions, Jenkins, or Azure DevOps
  • Experience managing Kubernetes and containerized environments (Docker, Helm)
  • Familiarity with log aggregation and analysis platforms like ELK Stack or Splunk
  • Excellent leadership, communication, and collaboration skills
Job Responsibility
  • Lead the strategic development and management of observability and reliability frameworks across the organization, ensuring alignment with business goals and technical requirements
  • Design and implement monitoring and observability solutions, collaborating with engineering teams to define standards and best practices
  • Manage Infrastructure as Code (IaC) initiatives using Terraform, coordinating with cloud and infrastructure teams to ensure scalable and secure deployments
  • Drive automation strategies for monitoring, alerting, and logging pipelines, focusing on process improvements and operational efficiency
  • Develop and maintain comprehensive observability roadmaps, including distributed tracing, logging, and metrics collection strategies
  • Collaborate with product management, sales, and pre-sales teams to provide technical expertise and support during solution design and customer engagements
  • Lead cross-functional teams to enhance CI/CD pipelines and deployment reliability, ensuring smooth integration of observability tools and practices
  • Engage with vendors and strategic partners to evaluate, select, and integrate observability and monitoring solutions, ensuring alignment with organizational needs and fostering strong collaborative relationships
  • Mentor and develop junior engineers and analysts, fostering a culture of reliability, observability, and operational excellence
Employment Type: Full-time

Site Reliability Engineering Lead

Engineer the future of global finance. At Citi, our Tech team doesn’t just suppo...
Location: Canada, Mississauga
Salary: 120800.00 - 170800.00 USD / Year
Company: Citi
Expiration Date: Until further notice
Requirements
  • 6–10 years of relevant experience in a hands‑on technical role
  • Experience contributing to architecture discussions and ensuring solutions align with enterprise standards and long‑term maintainability
  • Experience working with senior stakeholders or technology partners
  • Demonstrated experience supporting IT service improvements or platform stability initiatives
  • Strong communication and presentation skills, with the ability to convey technical concepts clearly
  • Experience supporting or contributing to technical roadmaps or operational workstreams
  • Experience participating in resilience‑related activities such as incident simulations, disaster recovery exercises, or stability testing
  • Ability to collaborate with cross‑functional support teams and technology groups
  • Strong organizational and workload‑planning skills
  • Consistently demonstrates clear and concise written and verbal communication skills
Job Responsibility
  • Demonstrate a strong understanding of how application support contributes to the overall technology function and organizational objectives
  • Assist with vendor relationship management, including coordination with offshore managed services
  • Support efforts to improve service levels for end users by enhancing operational efficiencies and strengthening incident management, problem management, and knowledge‑sharing practices
  • Partner with development teams to guide improvements in application stability and supportability
  • Contribute to frameworks for managing capacity, throughput, and latency
  • Assist in defining and implementing application onboarding guidelines and standards
  • Support team members by fostering a collaborative environment and encouraging skill development
  • Participate in cost‑reduction efforts through Root Cause Analysis reviews, knowledge management, performance tuning, and user training
  • Participate in business review meetings to help align technology tools and strategies with business requirements
  • Ensure adherence to support processes and tool standards, and assist in enhancing processes to promote consistency and quality across the support program
Employment Type: Full-time

Site Reliability Engineering Lead

We are seeking an experienced and motivated team member to support our AI and De...
Location: Canada, Mississauga
Salary: 120800.00 - 170800.00 USD / Year
Company: Citi
Expiration Date: Until further notice
Requirements
  • 6+ years of relevant experience in a hands‑on technical or support leadership role
  • Experience contributing to architecture discussions and ensuring solutions align with enterprise standards and long‑term maintainability
  • Experience working with senior stakeholders or technology partners
  • Demonstrated experience supporting IT service improvements or platform stability initiatives
  • Strong communication and presentation skills, with the ability to convey technical concepts clearly
  • Experience supporting or contributing to technical roadmaps or operational workstreams
  • Experience participating in resilience‑related activities such as incident simulations, disaster recovery exercises, or stability testing
  • Ability to collaborate with cross‑functional support teams and technology groups
  • Strong organizational and workload‑planning skills
  • Consistently demonstrates clear and concise written and verbal communication skills
Job Responsibility
  • Demonstrate a strong understanding of how application support contributes to the overall technology function and organizational objectives
  • Assist with vendor relationship management, including coordination with offshore managed services
  • Support efforts to improve service levels for end users by enhancing operational efficiencies and strengthening incident management, problem management, and knowledge‑sharing practices
  • Partner with development teams to guide improvements in application stability and supportability
  • Contribute to frameworks for managing capacity, throughput, and latency
  • Assist in defining and implementing application onboarding guidelines and standards
  • Support team members by fostering a collaborative environment and encouraging skill development
  • Participate in cost‑reduction efforts through Root Cause Analysis reviews, knowledge management, performance tuning, and user training
  • Participate in business review meetings to help align technology tools and strategies with business requirements
  • Ensure adherence to support processes and tool standards, and assist in enhancing processes to promote consistency and quality across the support program
Employment Type: Full-time

Executive Principal, Site Reliability Engineering (SRE) – DevOps

The Executive Principal of Infra Engineering is a senior leader responsible for ...
Location: United States, Irvine
Salary: 180000.00 - 210000.00 USD / Year
Company: Hyundai AutoEver America
Expiration Date: Until further notice
Requirements
  • Bachelor's degree in IT/IS or equivalent experience
  • 10 years of infrastructure engineering experience
  • 8+ years of management experience required
  • High availability, fault tolerance, and incident management
  • Automation of infrastructure and operations
  • CI/CD pipeline design and maintenance
  • Monitoring, metrics, and performance tuning
  • Multi-platform expertise (Windows, Linux, VMware, cloud)
  • Security, audit, and identity/access management
  • Change control and risk management
Job Responsibility
  • Guide the Site Reliability Engineering (SRE) function, integrating DevOps principles to drive operational excellence, reliability, and innovation across infrastructure platforms
  • Lead multiple technical teams, including Platform Engineering, Data Center Management, Infrastructure Planning & Architecture and Network & Telecommunications, ensuring 24x7 support and continuous improvement within a complex, hybrid environment
  • Mentor and develop infrastructure managers and SMEs
  • Lead onshore/offshore teams and manage service providers
  • Oversee 24x7 operations, incident response, and problem management
  • Manage OpEx/CapEx, SLAs, KPIs, and OKRs
  • Ensure reliability, disaster recovery, and lifecycle management
  • Champion automation, CI/CD, and Infrastructure as Code
  • Direct monitoring, observability, and performance optimization
  • Align with security and compliance requirements
Employment Type: Full-time

Engineering Manager - Observability & Reliability Engineering Obsession

We are looking for an Engineering Manager to join the OREO (Observability Reliab...
Location: France, Paris
Salary: Not provided
Company: Doctolib
Expiration Date: Until further notice
Requirements
  • 5+ years of software engineering or SRE experience, with a strong technical background in cloud-native environments (preferably AWS, GCP, and/or Kubernetes-based)
  • 3+ years of engineering management experience, leading technical teams (ideally SRE, platform, or infrastructure teams)
  • Deep understanding of observability tooling and architecture (Fluent Bit, OpenTelemetry, Loki, Elasticsearch, Prometheus, Thanos, Datadog)
  • Experience with infrastructure as code (Terraform, OpenTofu) and secrets management systems (Vault, AWS Secrets Manager)
  • Proven ability to balance technical depth with people leadership, able to mentor engineers, review technical designs, and guide architectural decisions
Job Responsibility
  • Lead, coach, and grow a team of Site Reliability Engineers, supporting their technical development and career progression
  • Create a culture of operational excellence, continuous improvement, and psychological safety within the team
  • Conduct regular 1:1s, performance reviews, and career development conversations
  • Recruit, onboard, and retain top SRE talent aligned with Doctolib's mission and values
  • Partner with SREs and senior engineers to define and evolve the observability strategy across the platform, focusing on logging, metrics, tracing, and alerting
  • Own the strategy and evolution of critical transversal services including HashiCorp Vault and Terraform Enterprise
  • Drive prioritization and roadmap planning for large-scale reliability and observability initiatives
  • Ensure alignment between team objectives and broader engineering and business goals
  • Advocate for and allocate resources toward reducing technical debt and improving developer experience
  • Own the team's on-call experience and contribute to the incident response processes, ensuring sustainable practices and continuous improvement
What we offer
  • Free comprehensive health insurance for you and your children
  • Parent Care Program: receive one additional month of leave on top of the legal parental leave
  • Free mental health and coaching services through our partner Moka.care
  • For caregivers and workers with disabilities, a package including an adaptation of the remote policy, extra days off for medical reasons, and psychological support
  • Work from EU countries and the UK for up to 10 days per year, thanks to our flexibility days policy
  • Work Council subsidy to refund part of sport club membership or creative class
  • Up to 14 days of RTT
  • Lunch voucher with Swile card
Employment Type: Full-time

Engineering Manager - Observability & Reliability Engineering Obsession

We are looking for an Engineering Manager to join the OREO (Observability Reliab...
Location: Germany, Berlin
Salary: Not provided
Company: Doctolib
Expiration Date: Until further notice
Requirements
  • 5+ years of software engineering or SRE experience, with a strong technical background in cloud-native environments (preferably AWS, GCP, and/or Kubernetes-based)
  • 3+ years of engineering management experience, leading technical teams (ideally SRE, platform, or infrastructure teams)
  • Deep understanding of observability tooling and architecture (Fluent Bit, OpenTelemetry, Loki, Elasticsearch, Prometheus, Thanos, Datadog)
  • Experience with infrastructure as code (Terraform, OpenTofu) and secrets management systems (Vault, AWS Secrets Manager)
  • Proven ability to balance technical depth with people leadership, able to mentor engineers, review technical designs, and guide architectural decisions
Job Responsibility
  • Lead, coach, and grow a team of Site Reliability Engineers, supporting their technical development and career progression
  • Create a culture of operational excellence, continuous improvement, and psychological safety within the team
  • Conduct regular 1:1s, performance reviews, and career development conversations
  • Recruit, onboard, and retain top SRE talent aligned with Doctolib's mission and values
  • Partner with SREs and senior engineers to define and evolve the observability strategy across the platform, focusing on logging, metrics, tracing, and alerting
  • Own the strategy and evolution of critical transversal services including HashiCorp Vault and Terraform Enterprise
  • Drive prioritization and roadmap planning for large-scale reliability and observability initiatives
  • Ensure alignment between team objectives and broader engineering and business goals
  • Advocate for and allocate resources toward reducing technical debt and improving developer experience
  • Own the team's on-call experience and contribute to the incident response processes, ensuring sustainable practices and continuous improvement
What we offer
  • Free comprehensive health insurance for you and your children
  • Parent Care Program: receive one additional month of leave on top of the legal parental leave
  • Free mental health and coaching services through our partner Moka.care
  • For caregivers and workers with disabilities, a package including an adaptation of the remote policy, extra days off for medical reasons, and psychological support
  • Work from EU countries and the UK for up to 10 days per year, thanks to our flexibility days policy
  • Work Council subsidy to refund part of sport club membership or creative class
  • Up to 14 days of RTT
  • Lunch voucher with Swile card
Employment Type: Full-time

Director, Site Reliability Engineering

As our Director of Site Reliability Engineering, reporting to our VP of Platform...
Location: Germany, Berlin
Salary: Not provided
Company: Doctolib
Expiration Date: Until further notice
Requirements
  • 12+ years in software engineering, including 5+ years leading managers and running infrastructure or SRE organisations at scale
  • Track record of taking SRE practices from reactive to proactive — with measurable reductions in incidents and MTTR
  • Strong multi-cloud and network infrastructure experience: load balancing, CDN/WAF, VPCs, peering, at high-traffic scale
  • Deep database operations background: large-scale transactional systems (PostgreSQL, Aurora), streaming/CDC (Kafka), data layer FinOps
  • Experience building observability platforms that give teams genuine visibility — metrics, logs, traces, alerting
  • Sharp process thinking: SLOs, error budgets, incident management, blameless post-mortems
  • Outcome-driven: you track reliability, cost efficiency, and engineering velocity as business metrics, not just technical ones
  • Strong communicator and influencer at executive level — equally credible with senior engineers and business stakeholders
  • Builder of high-performing, people-first engineering cultures
  • Fluent in English
Job Responsibility
  • Build and run a world-class SRE org of 25+ engineers across Cloud Infrastructure, Database & Storage, Network Infrastructure, Observability Tooling, and the Doctolib Operations Center
  • Own the infrastructure strategy and roadmap — cloud, database, network, observability — and deliver against company OKRs
  • Lead the Doctolib Operations Center: set incident response standards, drive MTTR reduction, embed blameless post-mortem culture across engineering
  • Architect and execute our multi-cloud strategy — reducing vendor lock-in, cutting migration costs, and enabling international expansion
  • Own network infrastructure at scale: load balancing, CDN/WAF, VPCs, peering, zero-trust networking across a high-traffic, multi-country platform
  • Drive observability as a product — give 700+ engineers true visibility into system health and turn observability maturity into an operational excellence lever
  • Lead from the front as a senior technical voice in the Platform org and broader Tech leadership team
What we offer
  • A Deutschlandticket (Germany-wide public transport pass) fully paid for by Doctolib
  • 28 vacation days + 1 additional day for each full calendar year of employment (up to a maximum of 30 days)
  • Work from abroad for up to 10 days per year thanks to our flexibility days policy
  • Company health insurance with great supplementary benefits through our partner Allianz
  • Company pension scheme (bAV) through Allianz with an employer subsidy of 40% (15% within the probationary period)
  • Enrollment in Doctolib's long-term employee value sharing plan called DoctoGrowth
  • The Doctolib Parent Care program, which includes one month additional parental leave and much more
  • Free mental health and coaching services through our partner Moka.care
  • Subsidized sports membership through our partner Urban Sports Club
  • A flexible workplace policy offering both hybrid and office-based mode
Employment Type: Full-time

Director, Site Reliability Engineering

As our Director of Infrastructure platform, you will be a key driver of Doctolib...
Location: France, Paris
Salary: Not provided
Company: Doctolib
Expiration Date: Until further notice
Requirements
  • 12+ years in software engineering, including 6+ years leading large (30+) distributed, international platform or infrastructure teams
  • Proven experience driving platform-as-a-product transformations and modularizing large monolithic architectures at scale
  • Demonstrated ability to architect, deliver, and operate secure, reliable, and scalable developer platforms in SaaS, multi-product, or regulated environments
  • Strong process orientation: experience implementing OKRs, robust monitoring/observability, and best-in-class incident management
  • Measurable impact on developer productivity, platform adoption, reliability, and cost-efficiency
  • Effective communicator and influencer, with the ability to align and inspire cross-functional stakeholders
  • Experience leading change and building high-performing, people-first engineering cultures
  • Fluent in English and comfortable in fast-paced, international environments
Job Responsibility
  • Lead and scale a high-performing infrastructure organization of 30+ engineers across Infrastructure, Automation, SRE, and Database teams, while maintaining strong engagement and fostering a culture of excellence and ownership
  • Own the infrastructure platform strategy and roadmap that enables Doctolib's modularization journey, delivers on company OKRs, and ensures predictable execution across all infrastructure and automation initiatives
  • Champion platform-as-a-product by building self-service capabilities (infrastructure provisioning, CI/CD, observability, database management) that transform developer experience and unlock team autonomy across the engineering organization
  • Be the guardian of quality and reliability by establishing world-class incident management, driving measurable improvements in availability and performance, and ensuring infrastructure components operate at the highest standards of security and resilience
  • Accelerate engineering velocity by reducing platform friction, enabling faster modularization, and leveraging AI-augmented development tools to multiply productivity across feature teams
  • Drive the infrastructure transformation from monolith-supporting infrastructure to a modular, multi-service platform architecture - enabling international expansion, product velocity, and operational excellence at scale
  • Act as a senior technical leader within the Platform organization and broader Tech leadership team, bringing strong technical opinions and challenging architectural decisions while clearly articulating how infrastructure investments contribute to company strategy and business outcomes
What we offer
  • Free comprehensive health insurance for you and your children
  • Parent Care Program: receive additional leave on top of the legal parental leave
  • Free mental health and coaching services through our partner Moka.care
  • For caregivers and workers with disabilities, a package including an adaptation of the remote policy, extra days off for medical reasons, and psychological support
  • Work from abroad for up to 10 days per year thanks to our flexibility days policy
  • Work Council subsidy to refund part of sport club membership or creative class
  • Up to 14 days of RTT
  • Lunch voucher with Swile card
Employment Type: Full-time