Site Reliability Engineering (SRE) Team Lead Job at OneMain Financial (Irving)

Site Reliability Engineering (SRE) / Lead Engineer

We are currently seeking a Site Reliability Engineering (SRE) / Lead Engineer to...

Location

Mexico, Guadalajara

Salary:

Not provided

NTT DATA

Expiration Date

Until further notice

Requirements

8-10+ years of experience in SRE, Observability, or DevOps roles, with leadership responsibilities
Hands-on experience with OpenTelemetry for distributed tracing and observability instrumentation
Proven expertise with Application Performance Monitoring (APM) tools such as New Relic, Datadog, AppDynamics, or Dynatrace
Strong proficiency in Infrastructure as Code (IaC) using Terraform
Solid understanding of cloud platforms including AWS, GCP, or Azure
Experience with automation/configuration management tools like Ansible, Chef, or Puppet
Deep knowledge of CI/CD pipelines and tools such as GitHub Actions, Jenkins, or Azure DevOps
Experience managing Kubernetes and containerized environments (Docker, Helm)
Familiarity with log aggregation and analysis platforms like ELK Stack or Splunk
Excellent leadership, communication, and collaboration skills

Job Responsibility

Lead the strategic development and management of observability and reliability frameworks across the organization, ensuring alignment with business goals and technical requirements
Design and implement monitoring and observability solutions, collaborating with engineering teams to define standards and best practices
Manage Infrastructure as Code (IaC) initiatives using Terraform, coordinating with cloud and infrastructure teams to ensure scalable and secure deployments
Drive automation strategies for monitoring, alerting, and logging pipelines, focusing on process improvements and operational efficiency
Develop and maintain comprehensive observability roadmaps, including distributed tracing, logging, and metrics collection strategies
Collaborate with product management, sales, and pre-sales teams to provide technical expertise and support during solution design and customer engagements
Lead cross-functional teams to enhance CI/CD pipelines and deployment reliability, ensuring smooth integration of observability tools and practices
Engage with vendors and strategic partners to evaluate, select, and integrate observability and monitoring solutions, ensuring alignment with organizational needs and fostering strong collaborative relationships
Mentor and develop junior engineers and analysts, fostering a culture of reliability, observability, and operational excellence

Fulltime

Site Reliability Engineering (SRE) / Observability Technical Lead

Join a dynamic team as a Site Reliability Engineer, leading observability and re...

Location

United Kingdom, London

Salary:

Not provided

NTT DATA

Expiration Date

Until further notice

Requirements

5+ years of experience in SRE, Observability, or DevOps roles, with leadership responsibilities
Proven expertise with Application Performance Monitoring (APM) tools such as New Relic, Datadog, AppDynamics, or Dynatrace
Hands-on experience with OpenTelemetry (OTel) for distributed tracing and observability instrumentation
Strong proficiency in Infrastructure as Code (IaC) using Terraform
Solid understanding of cloud platforms including AWS, GCP, or Azure
Experience with automation/configuration management tools like Ansible, Chef, or Puppet
Deep knowledge of CI/CD pipelines and tools such as GitHub Actions, Jenkins, or Azure DevOps
Experience managing Kubernetes and containerized environments (Docker, Helm)
Familiarity with log aggregation and analysis platforms like ELK Stack or Splunk
Excellent leadership, communication, and collaboration skills

Job Responsibility

Lead the strategic development and management of observability and reliability frameworks across the organization, ensuring alignment with business goals and technical requirements
Design and implement monitoring and observability solutions, collaborating with engineering teams to define standards and best practices
Manage Infrastructure as Code (IaC) initiatives using Terraform, coordinating with cloud and infrastructure teams to ensure scalable and secure deployments
Drive automation strategies for monitoring, alerting, and logging pipelines, focusing on process improvements and operational efficiency
Develop and maintain comprehensive observability roadmaps, including distributed tracing, logging, and metrics collection strategies
Collaborate with product management, sales, and pre-sales teams to provide technical expertise and support during solution design and customer engagements
Lead cross-functional teams to enhance CI/CD pipelines and deployment reliability, ensuring smooth integration of observability tools and practices
Engage with vendors and strategic partners to evaluate, select, and integrate observability and monitoring solutions, ensuring alignment with organizational needs and fostering strong collaborative relationships
Mentor and develop junior engineers and analysts, fostering a culture of reliability, observability, and operational excellence

What we offer

Tailored benefits that support your physical, emotional, and financial wellbeing
Continuous growth and development opportunities
Flexible work options

Fulltime

Manager of Site Reliability Engineering (SRE)

The Manager of Site Reliability Engineering leads and develops a team of SRE pra...

Location

United States, Birmingham

Salary:

Not provided

Genuine Parts Company

Expiration Date

Until further notice

Requirements

Typically requires a bachelor's degree and 7 years of experience in a technology and/or software engineering role or an equivalent combination
Proven experience working in large, complex enterprise environments (Fortune 500 or equivalent)
Strong understanding and demonstrated implementation of Site Reliability Engineering (SRE) principles at scale
Hands-on experience with infrastructure-as-code (IaC) tools such as Terraform and ArgoCD
In-depth knowledge and practical experience with CI/CD pipelines and automation of software delivery
Experience championing DevOps practices and embedding reliability early in the SDLC
Significant hands-on experience in Site Reliability Engineering or related roles focused on cloud infrastructure reliability
Strong software engineering background with proficiency in infrastructure-as-code tools (e.g., Terraform, ArgoCD) and CI/CD automation
Deep knowledge of cloud platforms, specifically Google Cloud Platform (GCP), Kubernetes, container orchestration, and cloud-native architecture
Familiarity with monitoring and observability tools such as Dynatrace, Datadog, or equivalents

Job Responsibility

Lead, mentor, and grow a high-performing team of Site Reliability Engineers, fostering a culture of ownership, continuous improvement, and operational excellence
Implement and champion Site Reliability Engineering principles and DevOps best practices within the team to ensure service reliability, availability, and performance
Define and track key SRE metrics such as service uptime and incident response and resolution times
Drive automation efforts including CI/CD pipeline enhancements, infrastructure-as-code practices, and self-service infrastructure provisioning to increase deployment velocity while reducing manual toil
Own and continuously improve observability practices including system monitoring, logging, alerting, and diagnostics to ensure rapid issue detection and resolution
Participate in incident response processes including incident management, root cause analysis, post-mortems, and continuous improvement to enhance system resilience
Partner closely with software engineering, product management, architecture, and security teams to embed reliability and security early in the software development lifecycle (SDLC)
Oversee the management and scalability of cloud infrastructure environments, primarily on Google Cloud Platform (GCP), with a focus on Kubernetes, container orchestration, and hybrid cloud integrations
Advocate for and apply best practices in performance tuning, capacity planning, and system design for high availability
Develop and execute a long-term roadmap for our hybrid cloud platform, aligning with evolving business objectives and technology trends

What we offer

Comprehensive benefit plans and programs designed to support your health and wellness, provide income protection, and build financial security for your retirement

Fulltime

Site Reliability Engineering Lead

Engineer the future of global finance. At Citi, our Tech team doesn’t just suppo...

Location

Canada, Mississauga

Salary:

120800.00 - 170800.00 USD / Year

Citi

Expiration Date

Until further notice

Requirements

6–10 years of relevant experience in a hands‑on technical role
Experience contributing to architecture discussions and ensuring solutions align with enterprise standards and long‑term maintainability
Experience working with senior stakeholders or technology partners
Demonstrated experience supporting IT service improvements or platform stability initiatives
Strong communication and presentation skills, with the ability to convey technical concepts clearly
Experience supporting or contributing to technical roadmaps or operational workstreams
Experience participating in resilience‑related activities such as incident simulations, disaster recovery exercises, or stability testing
Ability to collaborate with cross‑functional support teams and technology groups
Strong organizational and workload‑planning skills
Consistently demonstrates clear and concise written and verbal communication skills

Job Responsibility

Demonstrate a strong understanding of how application support contributes to the overall technology function and organizational objectives
Assist with vendor relationship management, including coordination with offshore managed services
Support efforts to improve service levels for end users by enhancing operational efficiencies and strengthening incident management, problem management, and knowledge‑sharing practices
Partner with development teams to guide improvements in application stability and supportability
Contribute to frameworks for managing capacity, throughput, and latency
Assist in defining and implementing application onboarding guidelines and standards
Support team members by fostering a collaborative environment and encouraging skill development
Participate in cost‑reduction efforts through Root Cause Analysis reviews, knowledge management, performance tuning, and user training
Participate in business review meetings to help align technology tools and strategies with business requirements
Ensure adherence to support processes and tool standards, and assist in enhancing processes to promote consistency and quality across the support program

Fulltime

Site Reliability Engineering Lead

We are seeking an experienced and motivated team member to support our AI and De...

Location

Canada, Mississauga

Salary:

120800.00 - 170800.00 USD / Year

Citi

Expiration Date

Until further notice

Requirements

6+ years of relevant experience in a hands‑on technical or support leadership role
Experience contributing to architecture discussions and ensuring solutions align with enterprise standards and long‑term maintainability
Experience working with senior stakeholders or technology partners
Demonstrated experience supporting IT service improvements or platform stability initiatives
Strong communication and presentation skills, with the ability to convey technical concepts clearly
Experience supporting or contributing to technical roadmaps or operational workstreams
Experience participating in resilience‑related activities such as incident simulations, disaster recovery exercises, or stability testing
Ability to collaborate with cross‑functional support teams and technology groups
Strong organizational and workload‑planning skills
Consistently demonstrates clear and concise written and verbal communication skills

Job Responsibility

Demonstrate a strong understanding of how application support contributes to the overall technology function and organizational objectives
Assist with vendor relationship management, including coordination with offshore managed services
Support efforts to improve service levels for end users by enhancing operational efficiencies and strengthening incident management, problem management, and knowledge‑sharing practices
Partner with development teams to guide improvements in application stability and supportability
Contribute to frameworks for managing capacity, throughput, and latency
Assist in defining and implementing application onboarding guidelines and standards
Support team members by fostering a collaborative environment and encouraging skill development
Participate in cost‑reduction efforts through Root Cause Analysis reviews, knowledge management, performance tuning, and user training
Participate in business review meetings to help align technology tools and strategies with business requirements
Ensure adherence to support processes and tool standards, and assist in enhancing processes to promote consistency and quality across the support program

Fulltime

Executive Principal, Site Reliability Engineering (SRE) – DevOps

The Executive Principal of Infra Engineering is a senior leader responsible for ...

Location

United States, Irvine

Salary:

180000.00 - 210000.00 USD / Year

Hyundai AutoEver America

Expiration Date

Until further notice

Requirements

Bachelor's degree in IT/IS or equivalent experience
10 years of infrastructure engineering experience
8+ years of management experience required
High availability, fault tolerance, and incident management
Automation of infrastructure and operations
CI/CD pipeline design and maintenance
Monitoring, metrics, and performance tuning
Multi-platform expertise (Windows, Linux, VMware, cloud)
Security, audit, and identity/access management
Change control and risk management

Job Responsibility

Guide the Site Reliability Engineering (SRE) function, integrating DevOps principles to drive operational excellence, reliability, and innovation across infrastructure platforms
Lead multiple technical teams, including Platform Engineering, Data Center Management, Infrastructure Planning & Architecture and Network & Telecommunications, ensuring 24x7 support and continuous improvement within a complex, hybrid environment
Mentor and develop infrastructure managers and SMEs
Lead onshore/offshore teams and manage service providers
Oversee 24x7 operations, incident response, and problem management
Manage OpEx/CapEx, SLAs, KPIs, and OKRs
Ensure reliability, disaster recovery, and lifecycle management
Champion automation, CI/CD, and Infrastructure as Code
Direct monitoring, observability, and performance optimization
Align with security and compliance requirements

Fulltime

Intermediate Site Reliability Engineer SRE – AI Reliability & Automation

At PointClickCare our mission is simple: to help providers deliver exceptional c...

Location

Canada, Mississauga

Salary:

115000.00 - 128000.00 CAD / Year

PointClickCare

Expiration Date

Until further notice

Requirements

5+ years' experience in software engineering
Experience with SRE principles
Experience with AI/ML in production environments
A passion for automation, intelligent systems, and operational excellence
Strong debugging, problem-solving, and system design skills
Languages: Python, Java, Bash, Terraform
Platforms: Azure, Kubernetes, Docker
Tools: Datadog, Prometheus, AppDynamics, ELK, GitHub Actions
ML/AI: MCP framework, AI agents, Vector store, Agent orchestration (LangChain), RAG
CI/CD: Jenkins, ArgoCD, Spinnaker

Job Responsibility

Build ML-based anomaly detection and pattern recognition systems
Enhance telemetry with smart tagging and metadata for better AI insights
Develop event-driven workflows and self-healing systems using AI triggers
Automate incident response with generative AI and custom AI agent orchestration
Use time-series forecasting and predictive modelling to anticipate failures
Optimise infrastructure with AI-powered autoscaling and cost-aware resource allocation
Build scalable, fault-tolerant systems in a cloud-native environment
Participate in on-call rotations and lead incident response for critical systems
Integrate APIs to streamline data exchange and system connectivity
Run internal AIOps workshops and help teams adopt AI maturity models

What we offer

Benefits starting from Day 1!
Retirement Plan Matching
Flexible Paid Time Off
Wellness Support Programs and Resources
Parental & Caregiver Leaves
Fertility & Adoption Support
Continuous Development Support Program
Employee Assistance Program
Allyship and Inclusion Communities
Employee Recognition … and more!

Fulltime

Director, Site Reliability Engineering

As our Director of Site Reliability Engineering, reporting to our VP of Platform...

Location

Germany, Berlin

Salary:

Not provided

Doctolib

Expiration Date

Until further notice

Requirements

12+ years in software engineering, including 5+ years leading managers and running infrastructure or SRE organisations at scale
Track record of taking SRE practices from reactive to proactive — with measurable reductions in incidents and MTTR
Strong multi-cloud and network infrastructure experience: load balancing, CDN/WAF, VPCs, peering, at high-traffic scale
Deep database operations background: large-scale transactional systems (PostgreSQL, Aurora), streaming/CDC (Kafka), data layer FinOps
Experience building observability platforms that give teams genuine visibility — metrics, logs, traces, alerting
Sharp process thinking: SLOs, error budgets, incident management, blameless post-mortems
Outcome-driven: you track reliability, cost efficiency, and engineering velocity as business metrics, not just technical ones
Strong communicator and influencer at executive level — equally credible with senior engineers and business stakeholders
Builder of high-performing, people-first engineering cultures
Fluent in English

Job Responsibility

Build and run a world-class SRE org of 25+ engineers across Cloud Infrastructure, Database & Storage, Network Infrastructure, Observability Tooling, and the Doctolib Operations Center
Own the infrastructure strategy and roadmap — cloud, database, network, observability — and deliver against company OKRs
Lead the Doctolib Operations Center: set incident response standards, drive MTTR reduction, embed blameless post-mortem culture across engineering
Architect and execute our multi-cloud strategy — reducing vendor lock-in, cutting migration costs, and enabling international expansion
Own network infrastructure at scale: load balancing, CDN/WAF, VPCs, peering, zero-trust networking across a high-traffic, multi-country platform
Drive observability as a product — give 700+ engineers true visibility into system health and turn observability maturity into an operational excellence lever
Lead from the front as a senior technical voice in the Platform org and broader Tech leadership team

What we offer

A Deutschlandticket (Germany-wide public transport pass) fully paid for by Doctolib
28 vacation days + 1 additional day for each full calendar year of employment (up to a maximum of 30 days)
Work from abroad for up to 10 days per year thanks to our flexibility days policy
Company health insurance with great supplementary benefits through our partner Allianz
Company pension scheme (bAV) through Allianz with an employer subsidy of 40% (15% within the probationary period)
Enrollment in Doctolib's long-term employee value sharing plan called DoctoGrowth
The Doctolib Parent Care program, which includes one month additional parental leave and much more
Free mental health and coaching services through our partner Moka.care
Subsidized sports membership through our partner Urban Sports Club
A flexible workplace policy offering both hybrid and office-based modes

Fulltime
