We are looking for a hardware-focused Senior Reliability Engineer to own sensor and hardware system reliability, including the observability, alerting, and automation that ensure Uber’s in-vehicle sensor data collection systems operate reliably at scale. This role is centered on maximizing sensor uptime, data yield, and supply hours across a large, geographically distributed fleet. You will design systems that determine when to react to issues impacting data recording capability, whether caused by failing sensors, degraded onboard computers, software regressions, or systemic environmental factors. As the technical owner for sensor reliability and observability, you will build the infrastructure that converts low-level signals into actionable intelligence and automated responses. This is a senior role requiring strong software engineering fundamentals, deep systems thinking, and the ability to drive cross-team technical direction without direct authority.
Job Responsibilities
Architect Observability Systems: Design and scale an observability platform capable of ingesting and analyzing real-time health telemetry from thousands of distributed vehicle nodes
Build for Edge Constraints: Develop systems that remain performant despite hardware diversity, intermittent connectivity, and rapid fleet scaling
Define Criticality Models: Establish alerting strategies that distinguish transient anomalies from systemic issues impacting sensor uptime and data yield
Detect Complex Failure Modes: Design detection logic for 'silent' failures, such as sensor degradation, compute saturation, or recording pipeline stalls
Scale Through Automation: Design automated detection, triage, and mitigation mechanisms to eliminate manual intervention as the fleet grows
Partner on Mitigation: Collaborate with Operations and Engineering to build safe, automated responses to recurring hardware and software failure scenarios
Drive Operational Efficiency: Build technical interfaces that help Operations surface issues and help Engineering diagnose and deploy mitigations rapidly, reducing time to detect (TTD) and time to mitigate (TTM)
Lead Technical Strategy: Drive reliability-focused design reviews and translate operational pain points into concrete technical requirements and roadmaps
Uncover Proactive Insights: Apply advanced data analytics to identify latent patterns in fleet telemetry, enabling the proactive detection of systemic regressions and hardware degradation before they impact operations
Requirements
5+ years of relevant industry experience in software engineering, site reliability, or systems engineering
Experience with modern observability platforms (e.g., Prometheus, Grafana, ELK) in edge, IoT, or hardware-integrated environments
Coding skills in one or more of Go, Python, or C++, with experience building and operating production systems
Proficiency in Linux internals and shell scripting for triaging and debugging edge devices or hardware-adjacent systems
Ability to debug across services, containers (Docker), and networking stacks
Proven track record owning reliability, infrastructure, or platform systems for large-scale production workloads
Experience designing and operating observability systems (metrics, logging, alerting, and dashboards)
Experience defining and implementing SLIs and SLOs for system availability or data yield
Deep understanding of networking protocols (TCP/IP, gRPC, or MQTT) and data handling in bandwidth-constrained environments
Experience driving complex technical projects and architectural reviews across multiple teams from design through production
Nice to have
Knowledge of sensor data protocols (e.g., for camera, LiDAR, or radar) or hardware-to-cloud data ingestion pipelines
Experience with 'Grey Failure' detection and management in complex, distributed systems
Proven track record in 'Fleet Health' for large-scale hardware deployments (e.g., cloud infrastructure, global server fleets, or industrial IoT) where automation was used to replace manual intervention