We are hiring a Senior Site Reliability Engineer (SRE) to join the GM Motorsports Software Engineering Data Platform team. This team builds and operates the next-generation data infrastructure that powers analytics, simulation, and telemetry insights across GM’s racing programs, including Formula 1, NASCAR, IndyCar, and IMSA. As a foundational member of the reliability function within the Data Engineering organization, you will ensure the availability, performance, and resilience of high-throughput telemetry and analytics platforms that ingest, process, and deliver mission-critical motorsports data. Our environment handles high-frequency streaming telemetry, simulation outputs, and engineering datasets that must be reliable, observable, and scalable. You will play a key role in designing systems where resilience, automation, and observability are built in from the start. We are looking for engineers who are uncomfortable with manual toil and are driven to build platforms where scaling, recovery, and operational insight are inherent properties of the system architecture.
Job Responsibilities
Design and implement reliability practices across the motorsports data platform, including Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets for streaming and analytics workloads
Ensure reliability and performance of high-throughput streaming and batch data pipelines supporting telemetry ingestion, analytics processing, and simulation workloads using technologies such as Kafka, Flink, and Databricks
Build and maintain comprehensive observability frameworks including metrics, logs, and tracing across the platform. Develop dashboards, alerts, and automated responses that detect system degradation before it impacts engineering workflows
Drive the automation of platform infrastructure using Infrastructure as Code (IaC) and platform engineering best practices to enable consistent, reproducible environments across development, testing, and production
Identify operational friction and eliminate manual processes by implementing self-healing infrastructure, automation frameworks, and developer self-service capabilities
Own the reliability of data ingestion, transformation, and storage layers, ensuring stable and performant integration across distributed data systems
Continuously evaluate platform performance and scalability, ensuring the data platform can support high-frequency telemetry ingestion, real-time analytics, and large-scale historical analysis
Provide mentorship and peer review to engineers across the platform team, promoting strong operational discipline, resilient system design, and high-quality engineering practices
Requirements
Proven experience in Site Reliability Engineering (SRE), DevOps, or Platform Engineering supporting large-scale distributed systems
Strong experience with Linux systems administration and cloud-native infrastructure
Experience operating high-throughput data platforms or streaming systems (Kafka, Flink, Spark, etc.)
Hands-on experience with Infrastructure as Code tools such as Terraform or similar frameworks