Site Reliability Engineering (SRE) Team Lead

OneMain Financial

Location:
United States, Irving

Contract Type:
Not provided

Salary:

Not provided

Job Description:

We are looking for a highly skilled and experienced Site Reliability Engineering Team Lead to guide our SRE team, foster best practices, and ensure operational excellence across our infrastructure. As the SRE Team Lead, you will be responsible for the technical leadership of a talented team of site reliability engineers dedicated to maintaining and improving the reliability, scalability, and performance of our critical systems and services. You will serve as a technical leader and mentor, driving strategic initiatives around automation, incident management, observability and system design while collaborating closely with engineering, operations, and product teams.

Job Responsibility:

  • Lead, mentor, and grow a team of site reliability engineers, promoting a culture of reliability, automation, and continuous improvement
  • Drive the design, implementation, and maintenance of scalable and fault-tolerant infrastructure to support high-availability services
  • Oversee incident management processes, including triage, root cause analysis, and postmortems to improve system reliability and prevent recurrence
  • Collaborate cross-functionally with software engineering, product, and operations teams to integrate reliability best practices into the software development lifecycle
  • Define and implement operational metrics, SLIs/SLOs, and dashboards to monitor system health and drive proactive improvements
  • Manage and assess the observability of critical environments, proactively addressing gaps as they arise
  • Oversee the release management processes, artifacts and tools that drive a repeatable software delivery lifecycle
  • Champion automation efforts to reduce manual intervention, improve deployment pipelines, and optimize infrastructure management
  • Lead capacity planning, disaster recovery, and performance tuning efforts
  • Ensure security and compliance standards are upheld across infrastructure and operations

Requirements:

  • BA/BS in Computer Science, Engineering, related field, or equivalent experience
  • 7+ years of experience in site reliability engineering, systems engineering, or related roles, with at least 2 years in a leadership position
  • Proven experience leading and scaling high-performing engineering teams
  • Deep expertise in cloud platforms (AWS, GCP, Azure) and container orchestration (Kubernetes, Docker)
  • Strong skills in infrastructure as code tools (Terraform, Ansible, CloudFormation) and CI/CD pipelines
  • Proficiency with monitoring and alerting systems (Prometheus, Grafana, ELK, Datadog)
  • Solid programming and scripting skills (Python, Go, Bash, or similar)
  • Strong understanding of distributed systems, networking, security, and databases
  • Excellent leadership, communication, and collaboration skills
  • Experience managing incident response and on-call rotations

Nice to have:

  • Experience working with microservices and event-driven architectures
  • Familiarity with compliance frameworks such as GDPR, PCI, SOX, or SOC 2
  • Background in DevOps practices and tooling

What we offer:

  • Health and wellbeing options including medical, prescription, dental, vision, hearing, accident, hospital indemnity, and life insurance
  • Up to 4% matching 401(k)
  • Employee Stock Purchase Plan (10% share discount)
  • Tuition reimbursement
  • Paid time off (15 days’ vacation per year, plus 2 personal days, prorated based on start date)
  • Paid sick leave as determined by state or local ordinance, prorated based on start date
  • Paid holidays (7 days per year, based on start date)
  • Paid volunteer time (3 days per year, prorated based on start date)
  • Access to Talkspace and Hinge for on-demand physical therapy via an app
  • Family back-up care

Additional Information:

Job Posted:
January 16, 2026

Employment Type:
Full-time

Similar Jobs for Site Reliability Engineering (SRE) Team Lead

Staff Site Reliability Engineer

Affirm is reinventing credit to make it more honest and friendly, giving consume...
Location:
Spain
Salary:
101000.00 - 131000.00 EUR / Year
Affirm
Expiration Date:
Until further notice
Requirements:
  • 8+ years of experience designing, developing, and launching backend systems at scale using scripting and development languages like Bash, Python or Kotlin, while acting as a subject-matter point of reference
  • Extensive track record of developing highly available distributed systems using technologies like AWS, MySQL, Spark and Kubernetes
  • Track record of managing, driving and improving the Incident Lifecycle process, from live incident management through retrospective and post-incident analysis, to provide actionable insights that enhance overall system reliability, resilience, and performance
  • 7+ years of experience in Site Reliability or Production Engineering teams
  • Experience delivering major features, system components or deprecating existing functionality in a system through the definition of a technical and execution plan
  • Ability to write high quality code that is easily understood and used by others
  • Strong verbal and written communication skills that support effective collaboration with our global engineering team and key stakeholders of an organization
  • Equivalent practical experience or a Bachelor’s degree in a related field
  • Based in Spain for the role
Job Responsibility:
  • Set the technical strategy and vision for your team on a multi-year time scale, and help your team tie it together with critical, business-impacting projects
  • Collaborate across teams in the product development lifecycle, working with infrastructure, product management, developer experience & analytics to ensure technical sustainability, risks and trade-offs are well understood and managed
  • Act as a force-multiplier for your team through your definition and advocacy of technical solutions and operational processes
  • Take ownership of your team’s operations and availability by ensuring you have the right monitoring, triage rotations, playbooks, policies, testing and alerting in place to support “keep the lights on” & on-call efforts
  • Foster a culture of quality and ownership on your team by setting code review and design standards for your team, and advocating for them beyond your team through your writing and tech talks
  • Help develop talent on your team by providing feedback and guidance, and leading by example
  • Participate in an on-call rotation
What we offer:
  • Flexible Spending Wallets for tech, food and lifestyle
  • Away Days - wellness days to take off work and recharge
  • Learning & Development programs
  • Parental benefit
  • Employee Resource & Community Groups
  • Health care coverage - Affirm covers all premiums for all levels of coverage for you and your dependents
  • Flexible Spending Wallets - generous stipends for spending on Technology, Food, various Lifestyle needs, and family forming expenses
  • Time off - competitive vacation and holiday schedules allowing you to take time off to rest and recharge
  • ESPP - An employee stock purchase plan enabling you to buy shares of Affirm at a discount
  • Visa sponsorship
  • Full-time

Staff Site Reliability Engineer

Site Reliability Engineering at Affirm is a small, yet crucial, team that helps ...
Location:
Poland
Salary:
358000.00 - 458000.00 PLN / Year
Affirm
Expiration Date:
Until further notice
Requirements:
  • 8+ years of experience designing, developing, and launching backend systems at scale using scripting and development languages like Bash, Python or Kotlin, while acting as a subject-matter point of reference
  • Extensive track record of developing highly available distributed systems using technologies like AWS, MySQL, Spark and Kubernetes
  • Track record of managing, driving and improving the Incident Lifecycle process, from live incident management through retrospective and post-incident analysis, to provide actionable insights that enhance overall system reliability, resilience, and performance
  • 7+ years of experience in Site Reliability or Production Engineering teams
  • Demonstrate curiosity with empathy, and strong opinions loosely held
  • Experience delivering major features, system components or deprecating existing functionality in a system through the definition of a technical and execution plan
  • Write high quality code that is easily understood and used by others
  • Thrive in ambiguity, and are comfortable moving from low level language idioms all the way to the architecture of large systems to understand how they work
  • Growth and impact trajectory demonstrates that you have mastered gathering and iterating on feedback from your engineering and cross-functional peers
  • Strong verbal and written communication skills that support effective collaboration with our global engineering team and key stakeholders of an organization
Job Responsibility:
  • Set the technical strategy and vision for your team on a multi-year time scale, and help your team tie it together with critical, business-impacting projects
  • Collaborate across teams in the product development lifecycle, working with infrastructure, product management, developer experience & analytics to ensure technical sustainability, risks and trade-offs are well understood and managed
  • Act as a force-multiplier for your team through your definition and advocacy of technical solutions and operational processes
  • Take ownership of your team’s operations and availability by ensuring you have the right monitoring, triage rotations, playbooks, policies, testing and alerting in place to support “keep the lights on” & on-call efforts
  • Foster a culture of quality and ownership on your team by setting code review and design standards for your team, and advocating for them beyond your team through your writing and tech talks
  • Help develop talent on your team by providing feedback and guidance, and leading by example
What we offer:
  • Flexible Spending Wallets for tech, food and lifestyle
  • Away Days - wellness days to take off work and recharge
  • Learning & Development programs
  • Parental leave
  • Employee Resource & Community Groups
  • Health care coverage - Affirm covers all premiums for all levels of coverage for you and your dependents
  • Flexible Spending Wallets - generous stipends for spending on Technology, Food, various Lifestyle needs, and family forming expenses
  • Time off - competitive vacation and holiday schedules allowing you to take time off to rest and recharge
  • ESPP - An employee stock purchase plan enabling you to buy shares of Affirm at a discount
  • Full-time

Senior Site Reliability Engineer

Site Reliability Engineering at Affirm is a small, yet crucial, team that helps ...
Location:
Poland
Salary:
301000.00 - 401000.00 PLN / Year
Affirm
Expiration Date:
Until further notice
Requirements:
  • 4+ years of experience designing, developing and launching backend systems at scale using scripting and development languages like Bash, Python or Kotlin
  • A track record of developing highly available distributed systems using technologies like AWS, MySQL and Kubernetes
  • Meaningful experience contributing to or driving parts of the Incident Lifecycle process, enabling actionable insights that improve the quality culture, reliability, resilience, and system performance
  • 4+ years working in a Site Reliability or Production Engineering team
  • Experience defining a technical plan for the delivery of a significant feature or system component with an elegant, simple and extensible design
  • Experience making impactful changes in a large code base, having developed a suite of tools and practices that enable you and your team to do so safely
  • Strong verbal and written communication skills that support effective collaboration with our global engineering team
Job Responsibility:
  • Own and deliver quarterly goals for your team, lead engineers on your team through ambiguity to solve open-ended problems, and ensure that everyone is supported throughout delivery
  • Support your peers and stakeholders in the product development lifecycle by collaborating with infrastructure, product management, developer experience & analytics: participating in ideation, articulating technical constraints, and partnering on decisions that properly consider risks and trade-offs
  • Proactively identify technical solutions and operational processes that strengthen incident readiness, response, and post-incident analysis
  • Support the operations and availability of your team’s artifacts by creating and monitoring metrics, escalating when needed, and supporting “keep the lights on” & on-call efforts
  • Foster a culture of quality and ownership on your team by setting or improving code review and design standards for your team, and advocating for them beyond your team through your writing and tech talks
  • Help develop talent on your team by providing feedback and guidance, and leading by example
What we offer:
  • Flexible Spending Wallets for tech, food and lifestyle
  • Away Days - wellness days to take off work and recharge
  • Learning & Development programs
  • Parental benefits
  • Employee Resource & Community Groups
  • Health care coverage - Affirm covers all premiums for all levels of coverage for you and your dependents
  • Flexible Spending Wallets - generous stipends for spending on Technology, Food, various Lifestyle needs, and family forming expenses
  • Time off - competitive vacation and holiday schedules allowing you to take time off to rest and recharge
  • ESPP - An employee stock purchase plan enabling you to buy shares of Affirm at a discount
  • Full-time

Site Reliability Engineering Manager

The Wikimedia Foundation is looking for an Engineering Manager to join our SRE t...
Location:
United States of America
Salary:
132439.00 - 208378.00 USD / Year
Wikimedia Foundation
Expiration Date:
Until further notice
Requirements:
  • Prior experience managing teams
  • Prior hands-on experience with software or reliability engineering (within the last 3 years preferred)
  • Ability to analyze complex systems, troubleshoot issues, and devise effective solutions under pressure
  • Proficiency in project management methodologies to effectively plan, execute, and track new and existing initiatives
  • Strong understanding of cloud computing, networking, Linux systems administration, containerization (e.g., Docker, Kubernetes), and infrastructure as code (e.g., Terraform, Ansible) to be able to provide technical support to the team
  • Aptitude for automation and streamlining of tasks
  • Communicate effectively in both spoken and written English
  • Ability to work independently, as an effective part of a globally distributed team
  • Ability to travel several times a year for occasional in-person meetings
  • B.S. or M.S. in Computer Science or the equivalent in related work experience
Job Responsibility:
  • Managing one to two globally distributed teams within Wikimedia’s Site Reliability Engineering organization
  • Providing guidance, mentorship, and support to ensure the team's effectiveness and growth
  • Working with team members to set individual performance goals, and supporting them in meeting and evolving their goals and career path
  • Recruiting, hiring, and helping onboard new team members
  • Triaging incoming workload, maintaining focus on priorities, and setting realistic expectations for both peers and team members
  • Coordinating and communicating with other members of the Wikimedia product & engineering teams on relevant projects, executing complex projects and contributing to the organizational strategy
  • Continuously developing the roadmap of the team in alignment with other SRE and Product & Technology teams, and helping to draft and execute the team’s annual and quarterly plans
  • Project managing new and existing initiatives
  • Leading the definition, refinement, and execution of the processes through which the team manages and performs work
  • Leading incident response, diagnosis, and follow-up on system alerts and outages across Wikimedia’s production infrastructure
  • Full-time

Principal Site Reliability Engineer

Groupon is modernizing its global platform — and reliability is at the center of...
Location:
Ecuador
Salary:
Not provided
Groupon
Expiration Date:
Until further notice
Requirements:
  • 10+ years in software/systems engineering, including 5+ years in SRE or platform reliability
  • Strong experience with GCP (preferred) or AWS, Kubernetes, and Terraform
  • Proficiency in Python or Go for automation and tooling
  • Deep understanding of observability stacks (Prometheus, Grafana, OpenTelemetry) and service meshes (Istio, Envoy)
  • Hands-on AIOps experience: anomaly detection, predictive analytics, ML-assisted operations
  • Strong communication and influencing skills — data over hierarchy
Job Responsibility:
  • Architect and maintain self-healing systems with 99.9%+ availability targets
  • Use AI/ML to automate infrastructure governance and detect configuration or IaC anti-patterns
  • Implement adaptive SLIs/SLOs that evolve automatically from real-time data
  • Build AIOps-based observability and auto-remediation pipelines
  • Apply predictive modeling to forecast failures before they impact users
  • Lead chaos, performance, and resilience testing programs
  • Map platform and service behavior to revenue impact and drive improved revenue resilience through better infrastructure performance
  • Mentor engineers and drive reliability standards across teams
  • Partner with platform, data, and product teams to ensure stability aligns with business goals
  • Support major incident response and incident review, and participate in on-call rotations
What we offer:
  • The opportunity to work with cutting-edge technologies in a transformative environment
  • Professional growth and leadership development pathways tailored to your aspirations
  • A chance to leave a lasting impact by shaping the future of reliable and scalable systems

Principal Site Reliability Engineer (AI-first SRE)

Groupon is modernizing its global platform — and reliability is at the center of...
Location:
Peru
Salary:
Not provided
Groupon
Expiration Date:
Until further notice
Requirements:
  • 10+ years in software/systems engineering, including 5+ years in SRE or platform reliability
  • Strong experience with GCP (preferred) or AWS, Kubernetes, and Terraform
  • Proficiency in Python or Go for automation and tooling
  • Deep understanding of observability stacks (Prometheus, Grafana, OpenTelemetry) and service meshes (Istio, Envoy)
  • Hands-on AIOps experience: anomaly detection, predictive analytics, ML-assisted operations
  • Strong communication and influencing skills — data over hierarchy
Job Responsibility:
  • Architect and maintain self-healing systems with 99.9%+ availability targets
  • Use AI/ML to automate infrastructure governance and detect configuration or IaC anti-patterns
  • Implement adaptive SLIs/SLOs that evolve automatically from real-time data
  • Build AIOps-based observability and auto-remediation pipelines
  • Apply predictive modeling to forecast failures before they impact users
  • Lead chaos, performance, and resilience testing programs
  • Map platform and service behavior to revenue impact and drive improved revenue resilience through better infrastructure performance
  • Mentor engineers and drive reliability standards across teams
  • Partner with platform, data, and product teams to ensure stability aligns with business goals
  • Support major incident response and incident review, and participate in on-call rotations
What we offer:
  • The opportunity to work with cutting-edge technologies in a transformative environment
  • Professional growth and leadership development pathways tailored to your aspirations
  • A chance to leave a lasting impact by shaping the future of reliable and scalable systems

Principal Site Reliability Engineer

Groupon is modernizing its global platform — and reliability is at the center of...
Location:
Colombia
Salary:
Not provided
Groupon
Expiration Date:
Until further notice
Requirements:
  • 10+ years in software/systems engineering
  • 5+ years in SRE or platform reliability
  • Strong experience with GCP (preferred) or AWS, Kubernetes, and Terraform
  • Proficiency in Python or Go for automation and tooling
  • Deep understanding of observability stacks (Prometheus, Grafana, OpenTelemetry) and service meshes (Istio, Envoy)
  • Hands-on AIOps experience: anomaly detection, predictive analytics, ML-assisted operations
  • Strong communication and influencing skills — data over hierarchy
Job Responsibility:
  • Architect and maintain self-healing systems with 99.9%+ availability targets
  • Use AI/ML to automate infrastructure governance and detect configuration or IaC anti-patterns
  • Implement adaptive SLIs/SLOs that evolve automatically from real-time data
  • Build AIOps-based observability and auto-remediation pipelines
  • Apply predictive modeling to forecast failures before they impact users
  • Lead chaos, performance, and resilience testing programs
  • Map platform and service behavior to revenue impact and drive improved revenue resilience through better infrastructure performance
  • Mentor engineers and drive reliability standards across teams
  • Partner with platform, data, and product teams to ensure stability aligns with business goals
  • Support major incident response and incident review, and participate in on-call rotations
What we offer:
  • The opportunity to work with cutting-edge technologies in a transformative environment
  • Professional growth and leadership development pathways tailored to your aspirations
  • A chance to leave a lasting impact by shaping the future of reliable and scalable systems

Lead Site Reliability Engineer

Groupon is a marketplace where customers discover new experiences and services e...
Location:
India, Bangalore
Salary:
Not provided
Groupon
Expiration Date:
Until further notice
Requirements:
  • 10+ years in systems engineering
  • 5+ years in SRE or DevOps roles
  • Expertise in cloud platforms (GCP, AWS) and container orchestration (Kubernetes, Docker)
  • Proficiency in programming and scripting languages like Python, Go, and Bash
  • Advanced knowledge of Infrastructure as Code (IaC) tools such as Terraform and Ansible
  • Deep understanding of networking, DNS, load balancing, and security principles
  • Proven track record of managing high-availability systems in demanding environments
  • Exceptional analytical and problem-solving skills
Job Responsibility:
  • Architect and maintain fault-tolerant systems, ensuring uptime SLAs of 99.9% or higher
  • Drive automation in infrastructure management and deployment using Terraform, Ansible, Kubernetes, and similar tools
  • Create and optimize CI/CD pipelines to ensure reliable, secure, and efficient software delivery
  • Build and enhance comprehensive observability solutions, including monitoring, logging, and alerting systems using Prometheus, Grafana, and the ELK stack
  • Collaborate with stakeholders to define and achieve SLIs, SLOs, and error budgets aligned with business needs
  • Lead incident response during on-call rotations, ensuring rapid resolution and root cause analysis for critical issues
  • Design and execute performance testing, capacity planning, and scalability strategies for evolving workloads
  • Proactively identify and resolve bottlenecks, increasing system performance and developer efficiency
  • Mentor junior engineers, fostering a collaborative and growth-oriented team environment
  • Guide architectural decisions that drive innovation and enhance system reliability
What we offer:
  • The opportunity to work with cutting-edge technologies in a transformative environment
  • A collaborative and innovative work environment that values your expertise and contributions
  • Professional growth and leadership development pathways tailored to your aspirations
  • A chance to leave a lasting impact by shaping the future of reliable and scalable systems