Join us in building the future of finance. The Robinhood Command Center (RCC) is a newly formed reliability team that serves as the front line for detecting, coordinating, and mitigating production incidents across Robinhood. As part of Robinhood’s broader reliability initiative, RCC works closely with product engineering, reliability, observability, infrastructure, and business teams to reduce customer impact and shorten incident duration. As a Staff Reliability Engineer, you will be part of the founding RCC team, helping define how Robinhood responds to and learns from incidents at scale. This is a highly visible role focused on incident leadership, operational excellence, and reliability tooling. You will not own product services or core infrastructure, but you will own the processes and tools that enable fast, high-quality incident response.
Job Responsibilities:
Serve as a senior technical leader driving the long-term reliability and observability strategy across Robinhood’s infrastructure
Partner closely with engineers across many disciplines to raise the bar for operational excellence and incident response
Lead incident mitigation efforts by coordinating service owners, facilitating time-sensitive decisions such as rollbacks and traffic shifts, and maintaining a clear source of truth during active incidents
Develop and maintain incident management processes and procedures to ensure timely resolution and minimize customer impact
Own incident discovery at the company level by defining and maintaining global dashboards and alerts tied to critical user journeys (CUJs), availability, and business-impact metrics
Own and evolve incident response tooling and processes, including education, adoption, and measurement of MTTD/MTTR improvements
Drive post-incident governance and learning, defining standards for postmortems, SEV reviews, and follow-up tracking to ensure durable reliability improvements
Design and implement next-generation failure mitigation strategies that avoid full-region or full-datacenter failovers
Define and build frameworks to improve monitoring, alerting, and observability across hundreds of services and systems
Define and own the roadmap for bringing observability to critical user journeys across Robinhood’s products
Deliver key insights and executive-level reporting to enable better business decisions around service quality and reliability
Act as a force multiplier through mentoring, technical influence, and contributions to hiring and engineering culture
Requirements:
8+ years of software engineering experience, including significant experience operating production systems
4+ years focused on reliability engineering, infrastructure, distributed systems, or production operations