Manager of Site Reliability Engineering (SRE) Job at Genuine Parts Company (Birmingham)

Site Reliability Engineering (SRE) / Lead Engineer

We are currently seeking a Site Reliability Engineering (SRE) / Lead Engineer to...

Location

Mexico, Guadalajara

Salary:

Not provided

NTT DATA

Expiration Date

Until further notice

Requirements

8-10+ years of experience in SRE, Observability, or DevOps roles, with leadership responsibilities
Hands-on experience with OpenTelemetry for distributed tracing and observability instrumentation
Proven expertise with Application Performance Monitoring (APM) tools such as New Relic, Datadog, AppDynamics, or Dynatrace
Strong proficiency in Infrastructure as Code (IaC) using Terraform
Solid understanding of cloud platforms including AWS, GCP, or Azure
Experience with automation/configuration management tools like Ansible, Chef, or Puppet
Deep knowledge of CI/CD pipelines and tools such as GitHub Actions, Jenkins, or Azure DevOps
Experience managing Kubernetes and containerized environments (Docker, Helm)
Familiarity with log aggregation and analysis platforms like ELK Stack or Splunk
Excellent leadership, communication, and collaboration skills

Job Responsibility

Lead the strategic development and management of observability and reliability frameworks across the organization, ensuring alignment with business goals and technical requirements
Design and implement monitoring and observability solutions, collaborating with engineering teams to define standards and best practices
Manage Infrastructure as Code (IaC) initiatives using Terraform, coordinating with cloud and infrastructure teams to ensure scalable and secure deployments
Drive automation strategies for monitoring, alerting, and logging pipelines, focusing on process improvements and operational efficiency
Develop and maintain comprehensive observability roadmaps, including distributed tracing, logging, and metrics collection strategies
Collaborate with product management, sales, and pre-sales teams to provide technical expertise and support during solution design and customer engagements
Lead cross-functional teams to enhance CI/CD pipelines and deployment reliability, ensuring smooth integration of observability tools and practices
Engage with vendors and strategic partners to evaluate, select, and integrate observability and monitoring solutions, ensuring alignment with organizational needs and fostering strong collaborative relationships
Mentor and develop junior engineers and analysts, fostering a culture of reliability, observability, and operational excellence

Full-time

Site Reliability Engineering (SRE) / Observability Technical Lead

Join a dynamic team as a Site Reliability Engineer, leading observability and re...

Location

United Kingdom, London

Salary:

Not provided

NTT DATA

Expiration Date

Until further notice

Requirements

5+ years of experience in SRE, Observability, or DevOps roles, with leadership responsibilities
Proven expertise with Application Performance Monitoring (APM) tools such as New Relic, Datadog, AppDynamics, or Dynatrace
Hands-on experience with OpenTelemetry (OTel) for distributed tracing and observability instrumentation
Strong proficiency in Infrastructure as Code (IaC) using Terraform
Solid understanding of cloud platforms including AWS, GCP, or Azure
Experience with automation/configuration management tools like Ansible, Chef, or Puppet
Deep knowledge of CI/CD pipelines and tools such as GitHub Actions, Jenkins, or Azure DevOps
Experience managing Kubernetes and containerized environments (Docker, Helm)
Familiarity with log aggregation and analysis platforms like ELK Stack or Splunk
Excellent leadership, communication, and collaboration skills

Job Responsibility

Lead the strategic development and management of observability and reliability frameworks across the organization, ensuring alignment with business goals and technical requirements
Design and implement monitoring and observability solutions, collaborating with engineering teams to define standards and best practices
Manage Infrastructure as Code (IaC) initiatives using Terraform, coordinating with cloud and infrastructure teams to ensure scalable and secure deployments
Drive automation strategies for monitoring, alerting, and logging pipelines, focusing on process improvements and operational efficiency
Develop and maintain comprehensive observability roadmaps, including distributed tracing, logging, and metrics collection strategies
Collaborate with product management, sales, and pre-sales teams to provide technical expertise and support during solution design and customer engagements
Lead cross-functional teams to enhance CI/CD pipelines and deployment reliability, ensuring smooth integration of observability tools and practices
Engage with vendors and strategic partners to evaluate, select, and integrate observability and monitoring solutions, ensuring alignment with organizational needs and fostering strong collaborative relationships
Mentor and develop junior engineers and analysts, fostering a culture of reliability, observability, and operational excellence

What we offer

Tailored benefits that support your physical, emotional, and financial wellbeing
Continuous growth and development opportunities
Flexible work options

Full-time

Executive Principal, Site Reliability Engineering (SRE) – DevOps

The Executive Principal of Infra Engineering is a senior leader responsible for ...

Location

United States, Irvine

Salary:

180000.00 - 210000.00 USD / Year

Hyundai AutoEver America

Expiration Date

Until further notice

Requirements

Bachelor's degree in IT/IS or equivalent experience
10 years of infrastructure engineering experience
8+ years of management experience required
High availability, fault tolerance, and incident management
Automation of infrastructure and operations
CI/CD pipeline design and maintenance
Monitoring, metrics, and performance tuning
Multi-platform expertise (Windows, Linux, VMware, cloud)
Security, audit, and identity/access management
Change control and risk management

Job Responsibility

Guide the Site Reliability Engineering (SRE) function, integrating DevOps principles to drive operational excellence, reliability, and innovation across infrastructure platforms
Lead multiple technical teams, including Platform Engineering, Data Center Management, Infrastructure Planning & Architecture, and Network & Telecommunications, ensuring 24x7 support and continuous improvement within a complex, hybrid environment
Mentor and develop infrastructure managers and SMEs
Lead onshore/offshore teams and manage service providers
Oversee 24x7 operations, incident response, and problem management
Manage OpEx/CapEx, SLAs, KPIs, and OKRs
Ensure reliability, disaster recovery, and lifecycle management
Champion automation, CI/CD, and Infrastructure as Code
Direct monitoring, observability, and performance optimization
Align with security and compliance requirements

Full-time

Site Reliability Engineering (SRE) Team Lead

We are looking for a highly skilled and experienced Site Reliability Engineering...

Location

United States, Irving

Salary:

Not provided

OneMain Financial

Expiration Date

Until further notice

Requirements

BA/BS in Computer Science, Engineering, related field, or equivalent experience
7+ years of experience in site reliability engineering, systems engineering, or related roles, with at least 2 years in a leadership position
Proven experience leading and scaling high-performing engineering teams
Deep expertise in cloud platforms (AWS, GCP, Azure) and container orchestration (Kubernetes, Docker)
Strong skills in infrastructure as code tools (Terraform, Ansible, CloudFormation) and CI/CD pipelines
Proficiency with monitoring and alerting systems (Prometheus, Grafana, ELK, Datadog)
Solid programming and scripting skills (Python, Go, Bash, or similar)
Strong understanding of distributed systems, networking, security, and databases
Excellent leadership, communication, and collaboration skills
Experience managing incident response and on-call rotations

Job Responsibility

Lead, mentor, and grow a team of site reliability engineers, promoting a culture of reliability, automation, and continuous improvement
Drive the design, implementation, and maintenance of scalable and fault-tolerant infrastructure to support high-availability services
Oversee incident management processes, including triage, root cause analysis, and postmortems to improve system reliability and prevent recurrence
Collaborate cross-functionally with software engineering, product, and operations teams to integrate reliability best practices into the software development lifecycle
Define and implement operational metrics, SLIs/SLOs, and dashboards to monitor system health and drive proactive improvements
Manage and assess the observability of critical environments, proactively addressing gaps as they arise
Oversee the release management processes, artifacts and tools that drive a repeatable software delivery lifecycle
Champion automation efforts to reduce manual intervention, improve deployment pipelines, and optimize infrastructure management
Lead capacity planning, disaster recovery, and performance tuning efforts
Ensure security and compliance standards are upheld across infrastructure and operations

What we offer

Health and wellbeing options including medical, prescription, dental, vision, hearing, accident, hospital indemnity, and life insurances
Up to 4% matching 401(k)
Employee Stock Purchase Plan (10% share discount)
Tuition reimbursement
Paid time off (15 days’ vacation per year, plus 2 personal days, prorated based on start date)
Paid sick leave as determined by state or local ordinance, prorated based on start date
Paid holidays (7 days per year, based on start date)
Paid volunteer time (3 days per year, prorated based on start date)
Access to Talkspace and Hinge for on-demand physical therapy via an app
Family back-up care

Full-time

Engineering Manager - Observability & Reliability Engineering Obsession

We are looking for an Engineering Manager to join the OREO (Observability Reliab...

Location

France, Paris

Salary:

Not provided

Doctolib

Expiration Date

Until further notice

Requirements

5+ years of software engineering or SRE experience, with a strong technical background in cloud-native environments (preferably AWS, GCP, and/or Kubernetes-based)
3+ years of engineering management experience, leading technical teams (ideally SRE, platform, or infrastructure teams)
Deep understanding of observability tooling and architecture (Fluent Bit, OpenTelemetry, Loki, Elasticsearch, Prometheus, Thanos, Datadog)
Experience with infrastructure as code (Terraform, OpenTofu) and secrets management systems (Vault, AWS Secrets Manager)
Proven ability to balance technical depth with people leadership, able to mentor engineers, review technical designs, and guide architectural decisions

Job Responsibility

Lead, coach, and grow a team of Site Reliability Engineers, supporting their technical development and career progression
Create a culture of operational excellence, continuous improvement, and psychological safety within the team
Conduct regular 1:1s, performance reviews, and career development conversations
Recruit, onboard, and retain top SRE talent aligned with Doctolib's mission and values
Partner with SREs and senior engineers to define and evolve the observability strategy across the platform, focusing on logging, metrics, tracing, and alerting
Own the strategy and evolution of critical transversal services including HashiCorp Vault and Terraform Enterprise
Drive prioritization and roadmap planning for large-scale reliability and observability initiatives
Ensure alignment between team objectives and broader engineering and business goals
Advocate for and allocate resources toward reducing technical debt and improving developer experience
Own the team's on-call experience and contribute to the incident response processes, ensuring sustainable practices and continuous improvement

What we offer

Free comprehensive health insurance for you and your children
Parent Care Program: receive one additional month of leave on top of the legal parental leave
Free mental health and coaching services through our partner Moka.care
For caregivers and workers with disabilities, a package including an adaptation of the remote policy, extra days off for medical reasons, and psychological support
Work from EU countries and the UK for up to 10 days per year, thanks to our flexibility days policy
Work Council subsidy to refund part of sport club membership or creative class
Up to 14 days of RTT
Lunch voucher with Swile card

Full-time

Engineering Manager - Observability & Reliability Engineering Obsession

We are looking for an Engineering Manager to join the OREO (Observability Reliab...

Location

Germany, Berlin

Salary:

Not provided

Doctolib

Expiration Date

Until further notice

Requirements

5+ years of software engineering or SRE experience, with a strong technical background in cloud-native environments (preferably AWS, GCP, and/or Kubernetes-based)
3+ years of engineering management experience, leading technical teams (ideally SRE, platform, or infrastructure teams)
Deep understanding of observability tooling and architecture (Fluent Bit, OpenTelemetry, Loki, Elasticsearch, Prometheus, Thanos, Datadog)
Experience with infrastructure as code (Terraform, OpenTofu) and secrets management systems (Vault, AWS Secrets Manager)
Proven ability to balance technical depth with people leadership, able to mentor engineers, review technical designs, and guide architectural decisions

Job Responsibility

Lead, coach, and grow a team of Site Reliability Engineers, supporting their technical development and career progression
Create a culture of operational excellence, continuous improvement, and psychological safety within the team
Conduct regular 1:1s, performance reviews, and career development conversations
Recruit, onboard, and retain top SRE talent aligned with Doctolib's mission and values
Partner with SREs and senior engineers to define and evolve the observability strategy across the platform, focusing on logging, metrics, tracing, and alerting
Own the strategy and evolution of critical transversal services including HashiCorp Vault and Terraform Enterprise
Drive prioritization and roadmap planning for large-scale reliability and observability initiatives
Ensure alignment between team objectives and broader engineering and business goals
Advocate for and allocate resources toward reducing technical debt and improving developer experience
Own the team's on-call experience and contribute to the incident response processes, ensuring sustainable practices and continuous improvement

What we offer

Free comprehensive health insurance for you and your children
Parent Care Program: receive one additional month of leave on top of the legal parental leave
Free mental health and coaching services through our partner Moka.care
For caregivers and workers with disabilities, a package including an adaptation of the remote policy, extra days off for medical reasons, and psychological support
Work from EU countries and the UK for up to 10 days per year, thanks to our flexibility days policy
Work Council subsidy to refund part of sport club membership or creative class
Up to 14 days of RTT
Lunch voucher with Swile card

Full-time

Director, Site Reliability Engineering

As our Director of Site Reliability Engineering, reporting to our VP of Platform...

Location

Germany, Berlin

Salary:

Not provided

Doctolib

Expiration Date

Until further notice

Requirements

12+ years in software engineering, including 5+ years leading managers and running infrastructure or SRE organisations at scale
Track record of taking SRE practices from reactive to proactive — with measurable reductions in incidents and MTTR
Strong multi-cloud and network infrastructure experience: load balancing, CDN/WAF, VPCs, peering, at high-traffic scale
Deep database operations background: large-scale transactional systems (PostgreSQL, Aurora), streaming/CDC (Kafka), data layer FinOps
Experience building observability platforms that give teams genuine visibility — metrics, logs, traces, alerting
Sharp process thinking: SLOs, error budgets, incident management, blameless post-mortems
Outcome-driven: you track reliability, cost efficiency, and engineering velocity as business metrics, not just technical ones
Strong communicator and influencer at executive level — equally credible with senior engineers and business stakeholders
Builder of high-performing, people-first engineering cultures
Fluent in English

Job Responsibility

Build and run a world-class SRE org of 25+ engineers across Cloud Infrastructure, Database & Storage, Network Infrastructure, Observability Tooling, and the Doctolib Operations Center
Own the infrastructure strategy and roadmap — cloud, database, network, observability — and deliver against company OKRs
Lead the Doctolib Operations Center: set incident response standards, drive MTTR reduction, embed blameless post-mortem culture across engineering
Architect and execute our multi-cloud strategy — reducing vendor lock-in, cutting migration costs, and enabling international expansion
Own network infrastructure at scale: load balancing, CDN/WAF, VPCs, peering, zero-trust networking across a high-traffic, multi-country platform
Drive observability as a product — give 700+ engineers true visibility into system health and turn observability maturity into an operational excellence lever
Lead from the front as a senior technical voice in the Platform org and broader Tech leadership team

What we offer

A Deutschlandticket (Germany-wide public transport pass) fully paid for by Doctolib
28 vacation days + 1 additional day for each full calendar year of employment (up to a maximum of 30 days)
Work from abroad for up to 10 days per year thanks to our flexibility days policy
Company health insurance with great supplementary benefits through our partner Allianz
Company pension scheme (bAV) through Allianz with an employer subsidy of 40% (15% within the probationary period)
Enrollment in Doctolib's long-term employee value sharing plan called DoctoGrowth
The Doctolib Parent Care program, which includes one month additional parental leave and much more
Free mental health and coaching services through our partner Moka.care
Subsidized sports membership through our partner Urban Sports Club
A flexible workplace policy offering both hybrid and office-based modes

Full-time

Director, Site Reliability Engineering

As our Director of Infrastructure platform, you will be a key driver of Doctolib...

Location

France, Paris

Salary:

Not provided

Doctolib

Expiration Date

Until further notice

Requirements

12+ years in software engineering, including 6+ years leading large (30+) distributed, international platform or infrastructure teams
Proven experience driving platform-as-a-product transformations and modularizing large monolithic architectures at scale
Demonstrated ability to architect, deliver, and operate secure, reliable, and scalable developer platforms in SaaS, multi-product, or regulated environments
Strong process orientation: experience implementing OKRs, robust monitoring/observability, and best-in-class incident management
Measurable impact on developer productivity, platform adoption, reliability, and cost-efficiency
Effective communicator and influencer, with the ability to align and inspire cross-functional stakeholders
Experience leading change and building high-performing, people-first engineering cultures
Fluent in English and comfortable in fast-paced, international environments

Job Responsibility

Lead and scale a high-performing infrastructure organization of 30+ engineers across Infrastructure, Automation, SRE, and Database teams, while maintaining strong engagement and fostering a culture of excellence and ownership
Own the infrastructure platform strategy and roadmap that enables Doctolib's modularization journey, delivers on company OKRs, and ensures predictable execution across all infrastructure and automation initiatives
Champion platform-as-a-product by building self-service capabilities (infrastructure provisioning, CI/CD, observability, database management) that transform developer experience and unlock team autonomy across the engineering organization
Be the guardian of quality and reliability by establishing world-class incident management, driving measurable improvements in availability and performance, and ensuring infrastructure components operate at the highest standards of security and resilience
Accelerate engineering velocity by reducing platform friction, enabling faster modularization, and leveraging AI-augmented development tools to multiply productivity across feature teams
Drive the infrastructure transformation from monolith-supporting infrastructure to a modular, multi-service platform architecture - enabling international expansion, product velocity, and operational excellence at scale
Act as a senior technical leader within the Platform organization and broader Tech leadership team, bringing strong technical opinions and challenging architectural decisions while clearly articulating how infrastructure investments contribute to company strategy and business outcomes

What we offer

Free comprehensive health insurance for you and your children
Parent Care Program: receive additional leave on top of the legal parental leave
Free mental health and coaching services through our partner Moka.care
For caregivers and workers with disabilities, a package including an adaptation of the remote policy, extra days off for medical reasons, and psychological support
Work from abroad for up to 10 days per year thanks to our flexibility days policy
Work Council subsidy to refund part of sport club membership or creative class
Up to 14 days of RTT
Lunch voucher with Swile card

Full-time
