Senior Manager, Hybrid Services & Reliability (SRE)

General Motors

Location:
United States, Austin, Texas

Contract Type:
Not provided

Salary:

201600.00 - 302000.00 USD / Year

Job Description:

As the Senior Engineering Manager for Hybrid Services & Reliability (HSR) within AV Core Infrastructure (ACI) at GM, you are the architect of our system trust. You will lead a newly seeded team responsible for the measurable availability of the hybrid cloud systems that underlie all autonomous vehicle development and operations. We need a leader who views reliability not as an afterthought, but as an inherent property of the platform, ensuring that all teams have a stable and ready-state engineering environment. You are comfortable operating systems at scale, not just designing them.

Job Responsibility:

  • Reliability Engineering: Define, measure, and enforce strict SLOs/SLIs for critical hybrid cloud services, including network connectivity and compute readiness
  • Foundational Utilities: Own and manage core on-prem utilities, such as DHCP, PXE, and CDN, to ensure seamless server auto-provisioning across the global fleet
  • Environment Integrity: Manage the entire data flow path, from initial ingestion at the test bench through the secure cloud network into production staging
  • HIL Readiness: Guarantee 99%+ availability and stability of the remote CI-based Hardware-in-the-Loop (HIL) benches required for AV safety validation
  • Organization Growth: Actively lead the recruitment and technical mentorship of Senior and Staff ICs as part of the team's expansion

Requirements:

  • Extensive background in Site Reliability Engineering (SRE) and defining SLO/SLI frameworks for hybrid cloud environments
  • Technical proficiency in managing on-prem Linux utilities (DHCP/PXE/NTP) and core development services
  • Opinionated view on automated observability, incident response, and MTTR reduction
  • Proven leadership experience

Nice to have:

  • Experience with configuration management tools (e.g., Chef, Ansible) for large-scale, remote hardware fleets

What we offer:
  • medical, dental, vision, Health Savings Account, Flexible Spending Accounts, retirement savings plan, sickness and accident benefits, life insurance, paid vacation & holidays, tuition assistance programs, employee assistance program, GM vehicle discounts
  • relocation benefits

Additional Information:

Job Posted:
March 03, 2026

Employment Type:
Fulltime
Work Type:
Hybrid work

Similar Jobs for Senior Manager, Hybrid Services & Reliability (SRE)

Senior Engineer, Hybrid Cloud Fabric

Become a key player in GEICO's tech transformation! We are seeking a Senior or S...
Location:
United States, Palo Alto, CA; Dallas, TX; Seattle, WA
Salary:
100000.00 - 215000.00 USD / Year
Geico
Expiration Date
Until further notice
Requirements
  • Service mesh expertise (dev): familiar with mesh architecture, components, and configuration options, including advanced traffic management, security policies, and telemetry customization
  • Service mesh experience (ops): designed, implemented, and managed service mesh solutions at scale, addressing challenges related to performance, security, and observability
  • Programming skills: Experience with Go is a must; Rust is a bonus
  • Linux OS: In-depth knowledge of Linux operating systems, including performance tuning, troubleshooting, and security best practices
  • Networking: Advanced understanding of networking concepts and tools (e.g., iptables, netfilter, traffic shaping) for analyzing and optimizing service mesh performance within the hybrid cloud environment
  • Kubernetes and containerization: Extensive experience with Kubernetes and container orchestration platforms, including networking, security, and service management
  • Microservices architecture: Deep understanding of microservices design patterns, service discovery mechanisms, API gateways, and distributed tracing
  • Observability and monitoring: Expertise in tools like Prometheus, Grafana, Jaeger, and Kiali to monitor service mesh performance and troubleshoot issues
  • Security best practices: Knowledge of zero-trust security principles, authentication and authorization mechanisms, and encryption technologies within the context of service mesh
Job Responsibility
  • Design and implement a robust service mesh architecture, encompassing traffic management, security, observability, and resilience for microservices across public and private clouds within our on-premises data centers
  • Integrate the service mesh with existing infrastructure and applications, ensuring seamless operation and interoperability with various platforms and technologies, including legacy systems
  • Establish and enforce service mesh best practices, including security policies, traffic routing rules, circuit breakers, and access control mechanisms, to maintain a secure and reliable application environment
  • Develop comprehensive monitoring and observability dashboards to provide deep insights into service mesh health, performance, and potential issues, enabling proactive problem identification and resolution
  • Guide and mentor engineers on service mesh principles and best practices, fostering knowledge sharing and expertise development within the team, empowering them to contribute effectively to the service mesh implementation
  • Work closely with networking and security teams to ensure secure and efficient integration of the service mesh with on-premises infrastructure and networks, addressing potential challenges and ensuring smooth operation
  • Partner with SREs to establish service mesh observability, monitoring, and alerting strategies for maintaining high availability and performance, collaborating to define SLOs, SLIs, and error budgets
  • Actively engage with the Istio community, contribute to open-source projects, and represent GEICO's leadership in service mesh adoption
What we offer
  • Comprehensive Total Rewards program that offers personalized coverage tailor-made for you and your family’s overall well-being
  • Financial benefits including market-competitive compensation; a 401K savings plan vested from day one that offers a 6% match; performance and recognition-based incentives; and tuition assistance
  • Access to additional benefits like mental healthcare as well as fertility and adoption assistance
  • Supports flexibility: We provide workplace flexibility as well as our GEICO Flex program, which offers the ability to work from anywhere in the US for up to four weeks per year
  • Fulltime

Senior Staff Engineer, Hybrid Cloud Fabric

Become a key player in GEICO's tech transformation! We are seeking a Senior or S...
Location:
United States, Palo Alto; Dallas; Chevy Chase; Seattle
Salary:
120000.00 - 260000.00 USD / Year
Geico
Expiration Date
Until further notice
Requirements
  • Service mesh expertise (dev): familiar with mesh architecture, components, and configuration options, including advanced traffic management, security policies, and telemetry customization
  • Service mesh experience (ops): designed, implemented, and managed service mesh solutions at scale, addressing challenges related to performance, security, and observability
  • Programming skills: Experience with Go is a must; Rust is a bonus
  • Linux OS: In-depth knowledge of Linux operating systems, including performance tuning, troubleshooting, and security best practices
  • Networking: Advanced understanding of networking concepts and tools (e.g., iptables, netfilter, traffic shaping) for analyzing and optimizing service mesh performance within the hybrid cloud environment
  • Kubernetes and containerization: Extensive experience with Kubernetes and container orchestration platforms, including networking, security, and service management
  • Microservices architecture: Deep understanding of microservices design patterns, service discovery mechanisms, API gateways, and distributed tracing
  • Observability and monitoring: Expertise in tools like Prometheus, Grafana, Jaeger, and Kiali to monitor service mesh performance and troubleshoot issues
  • Security best practices: Knowledge of zero-trust security principles, authentication and authorization mechanisms, and encryption technologies within the context of service mesh
Job Responsibility
  • Design and implement a robust service mesh architecture, encompassing traffic management, security, observability, and resilience for microservices across public and private clouds within our on-premises data centers
  • Integrate the service mesh with existing infrastructure and applications, ensuring seamless operation and interoperability with various platforms and technologies, including legacy systems
  • Establish and enforce service mesh best practices, including security policies, traffic routing rules, circuit breakers, and access control mechanisms, to maintain a secure and reliable application environment
  • Develop comprehensive monitoring and observability dashboards to provide deep insights into service mesh health, performance, and potential issues, enabling proactive problem identification and resolution
  • Guide and mentor engineers on service mesh principles and best practices, fostering knowledge sharing and expertise development within the team, empowering them to contribute effectively to the service mesh implementation
  • Work closely with networking and security teams to ensure secure and efficient integration of the service mesh with on-premises infrastructure and networks, addressing potential challenges and ensuring smooth operation
  • Partner with SREs to establish service mesh observability, monitoring, and alerting strategies for maintaining high availability and performance, collaborating to define SLOs, SLIs, and error budgets
  • Actively engage with the Istio community, contribute to open-source projects, and represent GEICO's leadership in service mesh adoption
What we offer
  • Comprehensive Total Rewards program that offers personalized coverage tailor-made for you and your family’s overall well-being
  • Financial benefits including market-competitive compensation; a 401K savings plan vested from day one that offers a 6% match; performance and recognition-based incentives; and tuition assistance
  • Access to additional benefits like mental healthcare as well as fertility and adoption assistance
  • Supports flexibility: We provide workplace flexibility as well as our GEICO Flex program, which offers the ability to work from anywhere in the US for up to four weeks per year
  • Fulltime

Orion Tech SRE Lead - Senior Vice President

The Orion Tech- SRE Lead is a hands-on leader responsible for shaping and delive...
Location:
India, Chennai; Pune
Salary:
Not provided
Citi
Expiration Date
Until further notice
Requirements
  • 16+ years of experience in Observability, SRE, Infrastructure Engineering, or Platform Architecture, including 5+ years in senior leadership roles
  • Deep expertise in observability tools and stacks such as Grafana, Prometheus, OpenTelemetry, ELK, Splunk, and similar platforms
  • Strong hands-on experience across hybrid infrastructure, including on-prem, cloud (AWS, Google Cloud), and container platforms (ECS, Kubernetes)
  • Proven ability to design scalable telemetry and instrumentation strategies, resolve production observability gaps, and integrate them into large-scale systems
  • Experience leading teams and managing people across geographically distributed locations
  • Strong ability to influence platform, cloud, and engineering leaders to ensure observability tooling is built for reuse and scale
  • Deep understanding of SRE fundamentals, including SLIs, SLOs, error budgets, and telemetry-driven operations
  • Strong collaboration skills and experience working across horizontal infrastructure teams, building consensus and delivering changes
  • Ability to stay up to date with market trends and apply them to improve internal tooling and design decisions
  • Good understanding of the AI tech stack, with the ability to build a business case and solve it using Citibank AI solutions
Job Responsibility
  • Define and own the roadmap for engineering enablers for the Project Orion team, aligned with enterprise reliability and SRE Services organization goals
  • Translate organization strategy into an actionable delivery plan in partnership with the Services Products, Operations & Engineering function, delivering incremental, high-value milestones
  • Understand the functional scope of Critical Business Services and translate it into end-to-end monitoring solutions
  • Periodically review and analyze application monitoring toil, and collaborate with stakeholders to remediate it in line with organization goals
  • Identify manual operations use cases performed by Level 1 functions and create a strategic plan to automate them
  • Drive reusability and efficiency by tracking problem statements raised by the Orion Level 1 function and providing a milestone delivery plan
  • Design and build a strategic observability dashboard that brings golden signals such as SLOs, SLIs, latency, and business metrics into a single pane of glass
  • Lead and mentor SREs, fostering technical growth and an SRE mindset
  • Work hands-on to troubleshoot telemetry and instrumentation issues across on-prem, cloud (AWS, GCP, etc.), and ECS/Kubernetes-based environments
  • Use Jira/Agile workflows to track and report on strategic enablers coverage, adoption, and contribution to improved client experience
  • Fulltime

SRE Observability Lead Engineer

The SRE Observability Lead Engineer is a hands-on leader responsible for shaping...
Location:
United Kingdom, London
Salary:
Not provided
Citi
Expiration Date
Until further notice
Requirements
  • Relevant experience in Observability, SRE, Infrastructure Engineering, or Platform Architecture, including several years in senior leadership roles
  • Deep expertise in observability tools and stacks such as Grafana, Prometheus, OpenTelemetry, ELK, Splunk, and similar platforms
  • Strong hands-on experience across hybrid infrastructure, including on-prem, cloud (AWS, GCP, Azure), and container platforms (ECS, Kubernetes)
  • Proven ability to design scalable telemetry and instrumentation strategies, resolve production observability gaps, and integrate them into large-scale systems
  • Experience leading teams and managing people across geographically distributed locations
  • Strong ability to influence platform, cloud, and engineering leaders to ensure observability tooling is built for reuse and scale
  • Deep understanding of SRE fundamentals, including SLIs, SLOs, error budgets, and telemetry-driven operations
  • Strong collaboration skills and experience working across federated teams, building consensus and delivering change
  • Ability to stay up to date with industry trends and apply them to improve internal tooling and design decisions
  • Excellent written and verbal communication skills
Job Responsibility
  • Define and own the strategic vision and multi-year roadmap for Observability across Services Technology, aligned with enterprise reliability and production goals
  • Translate strategy into an actionable delivery plan in partnership with Services Architecture & Engineering function, delivering incremental, high-value milestones toward a unified, scalable observability architecture
  • Lead and mentor SREs across Services, fostering technical growth and an SRE mindset
  • Build and offer a suite of central observability services across LoBs – including standardized telemetry libraries, onboarding templates, dashboard packs, and alerting standards
  • Drive reusability and efficiency by creating common patterns and golden paths for observability adoption across critical client flows and platforms
  • Partner with infrastructure, CTO, and other SMBF tooling teams to ensure observability tooling is scalable, resilient, and avoids duplication (“cottage industries”)
  • Work hands-on to troubleshoot telemetry and instrumentation issues across on-prem, cloud (AWS, GCP, etc.), and ECS/Kubernetes-based environments
  • Collaborate closely with the architecture function to support implementation of observability NFRs in the SDLC, ensuring new apps go live with sufficient coverage and insight
  • Support SRE Communities of Practice (CoP) and foster strong relationships with SREs, developers, and platform leads across Services and beyond to accelerate adoption & promote SRE best practices like SLO adoption, Capacity Planning
  • Use Jira/Agile workflows to track and report on observability maturity across Services LoBs – coverage, adoption, and contribution to improved client experience
What we offer
  • 27 days annual leave (plus bank holidays)
  • A discretionary annual performance-related bonus
  • Private Medical Care & Life Insurance
  • Employee Assistance Program
  • Pension Plan
  • Paid Parental Leave
  • Special discounts for employees, family, and friends
  • Access to an array of learning and development resources
  • Fulltime

Senior Staff Engineer – Change Management

GEICO is seeking an experienced Software Engineer who is passionate about buildi...
Location:
United States, Chevy Chase; Austin; New York City; Seattle; Palo Alto
Salary:
110000.00 - 260000.00 USD / Year
Geico
Expiration Date
Until further notice
Requirements
  • Expertise in at least two modern programming languages (Go, Python, Java, C, C++) and object-oriented design
  • Strong ownership and accountability with excellent communication and collaboration skills
  • Hands-on experience in incident response, troubleshooting, and root cause analysis
  • Experience managing distributed systems in public, private, or hybrid cloud environments
  • Experience with monitoring, logging, and observability tools (Prometheus, Grafana, OpenTelemetry, Loki)
  • Passion for automation and reducing manual operations using tools like Terraform and Ansible
  • Familiarity with configuration management and orchestration tools (Helm, Puppet, Spinnaker)
  • Experience with CI/CD pipelines, Infrastructure as Code (IaC), and cloud-based deployments
  • Ability to operate in a fast-paced, high-scale environment with a problem-solving mindset
  • 10+ years of professional experience in software development, platform architecture, and infrastructure management
Job Responsibility
  • Develop and drive the overall strategy for our enterprise Change and Approval Management, aligning it with the organization's business goals and objectives
  • Lead technical initiatives across multiple teams, providing strategic and technical guidance
  • Utilize programming languages like Go, Python, Java, and work with SQL/NoSQL databases
  • Work with container orchestration tools such as Docker, Kubernetes, and OpenStack
  • Architect and develop cloud-native applications using Azure services
  • Collaborate with product managers, engineering teams, and stakeholders to solve complex challenges
  • Ensure the quality, performance, and usability of engineering solutions
  • Serve as a mentor and thought leader, coaching engineers and influencing executives
  • Continuously improve processes, adopt best practices, and drive operational efficiency
  • Support and participate in on-call rotations, responding to incidents, diagnosing production issues, and conducting post-incident reviews to improve system reliability
What we offer
  • Comprehensive Total Rewards program that offers personalized coverage tailor-made for you and your family’s overall well-being
  • Financial benefits including market-competitive compensation; a 401K savings plan vested from day one that offers a 6% match; performance and recognition-based incentives; and tuition assistance
  • Access to additional benefits like mental healthcare as well as fertility and adoption assistance
  • Supports flexibility: We provide workplace flexibility as well as our GEICO Flex program, which offers the ability to work from anywhere in the US for up to four weeks per year
  • Fulltime

Senior Network Engineer

Bumble is seeking a Network Engineer to maintain a stable, predictable, controll...
Location:
United Kingdom, London
Salary:
Not provided
Bumble Inc.
Expiration Date
Until further notice
Requirements
  • 3+ years of hands-on Linux systems engineering experience (preferably rpm-based distributions such as RHEL or CentOS)
  • Strong diagnostic and troubleshooting skills spanning application performance, traffic-delivery issues, and complex multi-layer networking challenges
  • Deep understanding of networking across L1–L4 and L7, including copper/optics, Ethernet, and static/dynamic routing
  • Production experience with IS-IS and BGP (OSPF familiarity beneficial)
  • Extensive hands-on experience with Juniper MX, SRX, and QFX devices
  • Practical experience implementing and supporting EVPN-VXLAN architectures
  • Strong background in load balancing (CARP, IPVS, userspace, or enterprise solutions) and packet filtering
  • Experience building and supporting cloud networking architectures (VPC structures, virtual routing, firewalling, hybrid connectivity, etc.)
  • Proficiency with 802.1X, 802.1Q, and bonding/teaming at both the server and network hardware layers
  • Strong diagnostic capabilities with IPv4, ICMP, TCP, UDP, DHCP, and DNS (IPv6 is a plus)
Job Responsibility
  • Support and evolve Bumble’s global network infrastructure across multiple data centres and offices, including diagnostics of network subsystems within Linux servers (primarily CentOS/RHEL)
  • Improve network reliability and operational efficiency through configuration management, automation, and continuous optimisation of BAU tasks
  • Contribute to the design, implementation, and operation of cloud networking as we migrate a significant portion of our workloads into cloud environments
  • Collaborate closely with Systems Engineering and SRE teams, sharing networking expertise, participating in design reviews, and shaping resilient, secure platform architectures
  • Manage relationships with global service providers, including IP transit operators, to ensure optimal performance, availability, and accountability
  • Own IP address management, including subnet allocation, VLAN design, and maintaining accurate documentation
  • Strengthen Bumble’s security posture by contributing to perimeter defence, segmentation strategy, and proactive threat prevention
  • Participate in the on-call rota to maintain platform availability and support timely incident response

Senior Staff Engineer, Software

Our Senior Staff Software Engineer works with our Managers, Distinguished and Sr...
Location:
United States, Chevy Chase; Palo Alto; Dallas; Seattle
Salary:
120000.00 - 260000.00 USD / Year
Geico
Expiration Date
Until further notice
Requirements
  • Fluency in at least one modern language (Go/Python preferred)
  • Understanding of compression algorithms, deduplication, encryption, and error correction
  • Understanding of SQL and NoSQL databases, including stateful services management and storage
  • Understanding of networking, caches, key/value stores, load balancing, global load balancing, queues, DNS and CDN
  • Deep knowledge of SRE practices, methodologies, and principles, along with a solid understanding of on-prem and public cloud-based network, compute, and storage technologies
  • In-depth knowledge of hybrid cloud architecture, IaaS and PaaS technologies, container orchestration platforms (e.g., Kubernetes), cloud efficiency, and observability
  • Strong background in incident management
  • Ability to create incident response playbooks, runbooks, incident triaging strategies, and post-incident analysis to drive continuous improvement in system reliability and availability
  • Experience with open-source management and monitoring tools
  • Experience with infrastructure automation, tooling, and configuration management frameworks (e.g., Puppet, Chef, Ansible, Pulumi, Terraform, etc.)
Job Responsibility
  • Develop and drive the overall strategy for the Business Continuity and Disaster Recovery (BCDR) organization, aligning it with the organization's business goals and objectives
  • Provide thought leadership in BCDR, staying ahead of industry trends and emerging technologies to enhance our backup/restore posture
  • Conduct comprehensive risk assessments to identify potential threats and vulnerabilities
  • Design and implement robust strategies to ensure data safety, integrity and correctness
  • Lead the design and architecture of resilient and scalable systems, considering both on-premises and cloud-based solutions
  • Collaborate with cross-functional teams to integrate data safeguard best practices into the development and deployment processes
  • Develop and maintain comprehensive incident response plans to address various disaster scenarios on our orchestration and backup/restore systems
  • Conduct regular simulations and drills to ensure the readiness of the organization in the event of a disaster
  • Hands-on software engineering and SDLC best practices (Technical Review Documents, Architecture, Software Development, Software Reviews, Testing, Production Readiness Reviews, among others)
  • Evaluate, select, and implement cutting-edge technologies and tools to enhance our data safeguard capabilities including but not limited to processes, compliance, and visibility
What we offer
  • Comprehensive Total Rewards program that offers personalized coverage tailor-made for you and your family’s overall well-being
  • Financial benefits including market-competitive compensation; a 401K savings plan vested from day one that offers a 6% match; performance and recognition-based incentives; and tuition assistance
  • Access to additional benefits like mental healthcare as well as fertility and adoption assistance
  • Supports flexibility: We provide workplace flexibility as well as our GEICO Flex program, which offers the ability to work from anywhere in the US for up to four weeks per year
  • Fulltime

Senior IT Systems Engineer

We are looking for an experienced Senior Systems Engineer who feels comfortable ...
Location:
Germany, Munich
Salary:
Not provided
Brainlab
Expiration Date
Until further notice
Requirements
  • Very good hands-on experience with VMware Cloud Foundation (VCF) – ideally including recent versions (5.x / 9.x)
  • Solid understanding of core VMware technologies: vSphere, NSX, vSAN, Aria Operations / Aria Automation
  • Practical experience operating and troubleshooting Kubernetes clusters (preferably 1–4 years)
  • Good knowledge of container ecosystem concepts
  • Strong skills with Infrastructure as Code (Terraform, Ansible)
  • Experience with workflow orchestration / automation tools – Kestra is a strong plus
  • Very good practical knowledge of Grafana LGTM stack
  • Scripting & programming skills
  • Understanding of hybrid cloud architectures (on-prem ↔ public cloud connectivity patterns)
  • Familiar with SRE principles (SLI/SLO, error budgets, toil reduction, blameless post-mortems)
Job Responsibility
  • Design, operate and continuously improve the reliability & availability of our VMware Cloud Foundation (VCF) based platforms (on-prem and in interconnection with cloud environments)
  • Implement and extend our observability stack based on Grafana LGTM
  • Manage and automate VMware landscapes (vSphere, NSX, vSAN, Aria Suite etc.) in large-scale hybrid/multi-cloud setups
  • Build, operate and scale Kubernetes clusters, including day-2 operations, upgrades, capacity management and security hardening
  • Develop and maintain automation workflows primarily using Kestra (in conjunction with other tools such as Ansible, Terraform)
  • Drive incident response, post-mortem culture, error budgets and toil reduction according to SRE principles
  • Collaborate closely with development teams, platform teams and security to enable self-service capabilities and fast, safe releases
  • Participate in on-call rotation
What we offer
  • 30 vacation days, plus December 24th and December 31st
  • Flexible working hours as well as hybrid work model within Germany
  • Bike leasing via cooperation partner "BikeLeasing"
  • Parking garage and safe underground bike storage
  • Award-winning subsidized company restaurant and in-house cafes
  • Variety-rich fitness program in our ultra-modern 360 m² company gym
  • Regular after work, team, and company events
  • Comprehensive training and continuing education opportunities