Staff Reliability Engineer

Robinhood

Location:
United States, New York City

Contract Type:
Not provided

Salary:

217000.00 - 255000.00 USD / Year

Job Description:

Join us in building the future of finance. The Robinhood Command Center (RCC) is a newly formed reliability team that serves as the front line for detecting, coordinating, and mitigating production incidents across Robinhood. As part of Robinhood’s broader reliability initiative, RCC works closely with product engineering, reliability, observability, infrastructure, and business teams to reduce customer impact and shorten incident duration. As a Staff Reliability Engineer, you will be part of the founding RCC team, helping define how Robinhood responds to and learns from incidents at scale. This is a highly visible role focused on incident leadership, operational excellence, and reliability tooling. You will not own product services or core infrastructure, but you will own the processes and tools that enable fast, high-quality incident response.

Job Responsibility:

  • Serve as a senior technical leader driving the long-term reliability and observability strategy across Robinhood’s infrastructure
  • Partner closely with engineers across many disciplines to raise the bar for operational excellence and incident response
  • Lead incident mitigation efforts by coordinating service owners, facilitating time-sensitive decisions such as rollbacks and traffic shifts, and maintaining a clear source of truth during active incidents
  • Develop and maintain incident management processes and procedures to ensure timely resolution and minimize customer impact
  • Own incident discovery at the company level by defining and maintaining global dashboards and alerts tied to critical user journeys (CUJs), availability, and business-impact metrics
  • Own and evolve incident response tooling and processes, including education, adoption, and measurement of MTTD/MTTR improvements
  • Drive post-incident governance and learning, defining standards for postmortems, SEV reviews, and follow-up tracking to ensure durable reliability improvements
  • Design and implement next-generation failure mitigation strategies that avoid full-region or full-datacenter failovers
  • Define and build frameworks to improve monitoring, alerting, and observability across hundreds of services and systems
  • Define and own the roadmap for bringing observability to critical user journeys for Robinhood’s products
  • Deliver key insights and executive-level reporting to enable better business decisions around service quality and reliability
  • Act as a force multiplier through mentoring, technical influence, and contributions to hiring and engineering culture

Requirements:

  • 8+ years of software engineering experience, including significant experience operating production systems
  • 4+ years focused on reliability engineering, infrastructure, distributed systems, or production operations
  • Hands-on experience serving in incident leadership roles (e.g., IMOC, incident commander, primary oncall)
  • Strong communication and cross-functional collaboration skills, especially during high-severity incidents
  • Deep knowledge of systems reliability, observability frameworks, and fault-tolerant architecture design
  • Experience with multi-region or multi-cluster architectures, capacity planning, and failover strategies
  • Familiarity with modern observability stacks (e.g., OpenTelemetry, Prometheus, Grafana)
  • Demonstrated ability to drive measurable improvements in MTTD, MTTR, availability, or customer impact

What we offer:

  • Performance driven compensation with multipliers for outsized impact, bonus programs, equity ownership, and 401(k) matching
  • 100% paid health insurance for employees with 90% coverage for dependents
  • Lifestyle wallet - a highly flexible benefits spending account for wellness, learning, and more
  • Employer-paid life & disability insurance, fertility benefits, and mental health benefits
  • Time off to recharge including company holidays, paid time off, sick time, parental leave, and more
  • Exceptional office experience with catered meals, events, and comfortable workspaces

Additional Information:

Job Posted:
February 14, 2026

Employment Type:
Fulltime
Work Type:
Hybrid work

Similar Jobs for Staff Reliability Engineer

Staff Site Reliability Engineer

We are looking for a Site Reliability Engineer to own our internal systems infra...
Location:
United States, Sunnyvale
Salary:
175000.00 - 250000.00 USD / Year
Figure
Expiration Date:
Until further notice
Requirements:
  • Strong experience with Linux/Unix systems administration
  • Proficiency in programming/scripting
  • Extensive experience with cloud platforms (Azure, AWS, GCP) and on-prem hardware architectures
  • Experience designing, deploying, and operating high-availability, fault-tolerant, and distributed systems
  • Mastery of infrastructure as code (Terraform, CloudFormation, Ansible…)
  • Familiarity with monitoring, logging, and alerting tools (Prometheus, Grafana, Datadog…)
  • Solid understanding of networking fundamentals (TCP/IP, DNS, HTTP, load balancers, firewalls)
  • Experience defining Service Level Objectives (SLO), developing runbooks/incident response plans, facilitating post-mortems and managing systems assets
  • Ability to work in cross-functional teams with developers, infra, and product teams
  • Excellent verbal and written communication skills
Job Responsibility:
  • Be the go-to person for mission-critical infrastructure enabling critical operations such as Source Configuration Management, CI/CD systems, software distribution, supplier portals, manufacturing and more
  • Migrate SaaS to self-hosted solutions to enhance security and reliability
  • Implement monitoring and alerting systems, and define incident response plans and runbooks
  • Reduce human workload by automating deployment and scaling
  • Establish strong relationships with stakeholders to identify infrastructure needs and establish Service Level Objectives
  • Use a data driven approach to demonstrate service robustness and track optimization work
  • Partner with the security team to ensure that security remediations and updates are applied in a timely manner
  • Fulltime

Staff Site Reliability Engineer

At Ledger, we are looking for an experienced Reliability Engineer to join our SR...
Location:
France, Paris
Salary:
Not provided
Ledger
Expiration Date:
Until further notice
Requirements:
  • 8+ years in cloud engineering at scale, in organizations operating SaaS solutions
  • Proficiency in working in Unix/Linux environments, Git, Python, Terraform, Kubernetes, AWS cloud solutions and architectures, CI/CD tools, Argo CD, Ansible, configuration management, etc.
  • Strong knowledge of observability practices, with experience implementing and managing Logging, Monitoring and Alerting frameworks with solutions such as Datadog or Prometheus/Grafana/Loki.
  • Experience of cross-functional work and the ability to demonstrate a collaborative approach to building key relationships across the organization and defining project scope, goals, plans and deliverables
  • Customer focused with the ability to identify and understand both internal and external customers' needs
  • Creative problem-solving and analysis skills with an ability to identify, develop, and implement solutions to meet the needs of the business
  • Excellent presentation and written communication
  • Ability to deal with ambiguity, high level of pressure and rapidly changing environments
  • Engineering degree.
Job Responsibility:
  • Participate in building a DevOps / SRE culture and enable the transition to modern infrastructure management and deployment practices
  • Participate in building the SRE team roadmap (vision and delivery accountability). Anticipate stakeholder needs and the emergence of game-changing technologies, and challenge scope / deadlines
  • Perform integration of platform software components
  • Participate in designing and delivering solutions to improve the availability, scalability, latency, and efficiency of systems
  • Influence and create standards & best practices in support of service level objectives
  • Automate key SRE metrics including SLOs/SLAs and error budgets
  • Provide expert support to our level-2/application support team, to troubleshoot priority incidents, and conduct post-mortems
  • Apply analytics on past incidents and usage patterns to predict issues and take proactive actions
  • Ensure control of technical debt and promote quality practices
  • Follow SRE and chaos engineering approaches across all strategic systems to predict and prevent outages, in coordination with Service Design, and improve solution availability
What we offer:
  • Equity: Employees are the foundation of our success, and we award stock options so you can share in that success as we grow
  • Flexibility: A hybrid work policy
  • Social: Annual company outing for Ledgerdary Days, plus frequent social events, snacks and drinks
  • Medical: Comprehensive health insurance policy offering extensive medical, dental and vision care coverage
  • Well-being: Personal development, coaching & fitness with our dedicated partners
  • Vacation: Five weeks of paid leave per year, in addition to national holidays and rest & relaxation (RTT) days
  • High tech: Access to high performance office equipment and gadgets, including Apple products
  • Transport: Ledger reimburses part of your preferred means of transportation
  • Discounts: Employee discount on all our products.
  • Fulltime

Staff Engineer, Technology Development Quality & Reliability

The Staff Engineer defines, develops and qualifies new semiconductor packages an...
Location:
Malaysia, Batu Kawan, Penang
Salary:
Not provided
Sandisk
Expiration Date:
Until further notice
Requirements:
  • MS / PhD degree in Mechanical/Materials Engineering
  • Knowledge of semiconductor packaging highly desired
  • Working knowledge of AutoCAD, Cadence APD, Finite Element Analysis, Design of Experiments, statistical techniques and package failure analysis techniques required
  • Ability to multi-task and meet tight deadlines
  • Excellent communication and interpersonal skills required
  • Preferred candidate will have worked on projects related to semiconductor packaging from both mechanical and electrical integrity perspective
Job Responsibility:
  • Defines, develops and qualifies new semiconductor packages and maintains quality of existing packages
  • Represents Package Engineering in cross-functional teams and ensures packages are characterized, qualified and introduced into production in a timely manner while meeting all electrical, performance, reliability and quality requirements
  • Take responsibility for product DPPM, as well as new Technology development programs and new NAND/ASIC development Quality and Reliability
  • Develop a comprehensive quality guidance library encompassing engineering work criteria, statistical Design of Experiments (DOE), Scorecard, sampling plans, and metric/methodology instructions
  • Address quality cases in technology and product development using DMAIC, RCA, and 8DR methodologies
  • Coordinates with factories worldwide on the high-volume introduction of new packages and assembly processes
  • Manages assembly yield improvement and package cost reduction programs
  • Fulltime

Staff Site Reliability Engineer

Affirm is reinventing credit to make it more honest and friendly, giving consume...
Location:
Spain
Salary:
101000.00 - 131000.00 EUR / Year
Affirm
Expiration Date:
Until further notice
Requirements:
  • 8+ years of experience designing, developing, and launching backend systems at scale, and serving as a subject-matter point of reference for them, using scripting and development languages like Bash, Python or Kotlin
  • Extensive track record of developing highly available distributed systems using technologies like AWS, MySQL, Spark and Kubernetes
  • Track record of managing, driving and improving the Incident Lifecycle process from live incident management through retrospective and post-incident analysis to provide actionable insights to enhance overall system reliability, resilience, and performance
  • 7+ years experience in Site Reliability or Production Engineering teams
  • Experience delivering major features, system components or deprecating existing functionality in a system through the definition of a technical and execution plan
  • Ability to write high quality code that is easily understood and used by others
  • Strong verbal and written communication skills that support effective collaboration with our global engineering team and key stakeholders of an organization
  • Equivalent practical experience or a Bachelor’s degree in a related field
  • Based in Spain for the role
Job Responsibility:
  • Set technical strategy vision for your team on a multi year-long time scale, and help your team tie it together with critical, business-impacting projects
  • Collaborate across teams in the product development lifecycle, working with infrastructure, product management, developer experience & analytics to ensure technical sustainability, risks and trade-offs are well understood and managed
  • Act as a force-multiplier for your team through your definition and advocacy of technical solutions and operational processes
  • Take ownership of your team’s operations and availability by ensuring you have the right monitoring, triage rotations, playbooks, policies, testing and alerting in place to support “keep the lights on” & on-call efforts
  • Foster a culture of quality and ownership on your team by setting code review and design standards for your team, and advocating for them beyond your team through your writing and tech talks
  • Help develop talent on your team by providing feedback and guidance, and leading by example
  • Participate in an on-call rotation
What we offer:
  • Flexible Spending Wallets for tech, food and lifestyle
  • Away Days - wellness days to take off work and recharge
  • Learning & Development programs
  • Parental benefit
  • Employee Resource & Community Groups
  • Health care coverage - Affirm covers all premiums for all levels of coverage for you and your dependents
  • Flexible Spending Wallets - generous stipends for spending on Technology, Food, various Lifestyle needs, and family forming expenses
  • Time off - competitive vacation and holiday schedules allowing you to take time off to rest and recharge
  • ESPP - An employee stock purchase plan enabling you to buy shares of Affirm at a discount
  • Visa sponsorship
  • Fulltime

Staff Site Reliability Engineer

Site Reliability Engineering at Affirm is a small, yet crucial, team that helps ...
Location:
Poland
Salary:
358000.00 - 458000.00 PLN / Year
Affirm
Expiration Date:
Until further notice
Requirements:
  • 8+ years of experience designing, developing, and launching backend systems at scale, and serving as a subject-matter point of reference for them, using scripting and development languages like Bash, Python or Kotlin
  • Extensive track record of developing highly available distributed systems using technologies like AWS, MySQL, Spark and Kubernetes
  • Track record of managing, driving and improving the Incident Lifecycle process from live incident management through retrospective and post-incident analysis to provide actionable insights to enhance overall system reliability, resilience, and performance
  • 7+ years experience in Site Reliability or Production Engineering teams
  • Demonstrate curiosity with empathy, and strong opinions loosely held
  • Experience delivering major features, system components or deprecating existing functionality in a system through the definition of a technical and execution plan
  • Write high quality code that is easily understood and used by others
  • Thrive in ambiguity, and are comfortable moving from low level language idioms all the way to the architecture of large systems to understand how they work
  • Growth and impact trajectory demonstrates that you have mastered gathering and iterating on feedback from your engineering and cross-functional peers
  • Strong verbal and written communication skills that support effective collaboration with our global engineering team and key stakeholders of an organization
Job Responsibility:
  • Set technical strategy vision for your team on a multi year-long time scale, and help your team tie it together with critical, business-impacting projects
  • Collaborate across teams in the product development lifecycle, working with infrastructure, product management, developer experience & analytics to ensure technical sustainability, risks and trade-offs are well understood and managed
  • Act as a force-multiplier for your team through your definition and advocacy of technical solutions and operational processes
  • Take ownership of your team’s operations and availability by ensuring you have the right monitoring, triage rotations, playbooks, policies, testing and alerting in place to support “keep the lights on” & on-call efforts
  • Foster a culture of quality and ownership on your team by setting code review and design standards for your team, and advocating for them beyond your team through your writing and tech talks
  • Help develop talent on your team by providing feedback and guidance, and leading by example
What we offer:
  • Flexible Spending Wallets for tech, food and lifestyle
  • Away Days - wellness days to take off work and recharge
  • Learning & Development programs
  • Parental leave
  • Employee Resource & Community Groups
  • Health care coverage - Affirm covers all premiums for all levels of coverage for you and your dependents
  • Flexible Spending Wallets - generous stipends for spending on Technology, Food, various Lifestyle needs, and family forming expenses
  • Time off - competitive vacation and holiday schedules allowing you to take time off to rest and recharge
  • ESPP - An employee stock purchase plan enabling you to buy shares of Affirm at a discount
  • Fulltime

Staff Engineer, Site Reliability

LearnUpon is looking for a Staff Site Reliability Engineer to join our team in I...
Location:
Ireland, Dublin
Salary:
Not provided
LearnUpon
Expiration Date:
Until further notice
Requirements:
  • 7+ years of experience in a software or Ops role
  • 5+ years of cloud engineering experience, with at least 2 years experience with AWS
  • Experience deploying Microservice environments, using containerisation technologies such as Kubernetes and Docker
  • Experience in designing and implementing Observability tech stacks
  • Have championed the benefits of Observability to Engineering teams
  • Can architect the design of SLO/SLI implementation that balances the needs of different teams
  • Familiar with cost analysis of Observability metrics gathering, Engineering effort, and tooling
  • Experience building and supporting large-scale distributed systems that back a consumer app or website with associated requirements of performance, security and disaster recovery
  • Experience with implementing IaC (e.g. CloudFormation, Terraform etc.), automation tooling (e.g. Puppet, Ansible etc.), CI/CD (e.g. Jenkins, Travis CI, GitLab etc.)
  • Able to effectively communicate technical ideas to and collaborate with both technical and non-technical peers
Job Responsibility:
  • Identifying opportunities to improve and scale our infrastructure for performance, observability, maintainability, and cost, by creating innovative solutions
  • Leading our efforts to build an observability function that incorporates application metrics, application transaction tracking, and event log management
  • Driving the processes to maintain resilient, scalable and cost-effective infrastructure
  • Working with other Engineering teams to provide infrastructure solutions that meet their ongoing requirements
  • Building tools focused on measuring, monitoring and alerting, with an eye towards self-service in order to promote Engineers’ ownership of observability
  • Reacting quickly to changing customer and business needs
  • Participating in the on-call rota
  • Mentoring junior talent
What we offer:
  • Work in a fun and supportive environment with regular team events
  • Excellent career progression
  • Structured learning environment
  • Competitive salary and company ESOP
  • Private health insurance
  • 26 days annual leave
  • Fulltime

Staff / Sr Staff / Principal Engineer - Lakehouse

Balbix is the world's leading platform for cybersecurity posture automation. Usi...
Location:
India, Bangalore; Gurgaon
Salary:
Not provided
Balbix
Expiration Date:
Until further notice
Requirements:
  • 8+ years of experience in backend software development dealing with large scale applications involving large scale data
  • Proven experience in defining and improving application design and architecture
  • Drive to discover and learn the required new technologies
  • Exposure to state of the art technologies for large scale data systems
  • Proficiency in programming languages such as Python, Scala or Java
  • Hands-on experience with large scale technologies such as Apache Spark, Apache Flink, Cassandra
  • Experience with cloud platforms such as AWS, Azure, or Google Cloud Platform
  • Bachelor's or Master's degree in Computer Science, Engineering, or related field
  • Excellent problem-solving and analytical skills
  • Strong communication and collaboration skills
Job Responsibility:
  • Collaborate with product managers, data scientists, and other stakeholders to understand requirements and translate them into technical solutions
  • Design, develop, and deploy high scale systems using state of the art technologies
  • Build reliable, consistent and high throughput data services and interfaces
  • Mentor junior developers and contribute to knowledge sharing within the team
  • Help define and ensure the best practices and guidelines across the systems
  • Optimize and tune applications for performance and scalability
  • Troubleshoot and resolve issues in production environments
  • Fulltime

ASIC Engineer Staff

Hewlett Packard Enterprise is seeking an ASIC Engineer Staff to lead projects in...
Location:
United States, Durham
Salary:
130500.00 - 300000.00 USD / Year
Hewlett Packard Enterprise
Expiration Date:
Until further notice
Requirements:
  • Bachelor's or Master's degree in Electrical Engineering, Computer engineering or equivalent
  • 6-10 years of experience in VLSI design, verification, or implementation
  • Expert level proficiency in electrical engineering fundamentals, VLSI principles, digital logic, and computer architecture
  • Expert level analytical and problem solving skills
  • Expert level knowledge of designing VLSI components, integrated circuitry, architectures and algorithms into VLSI solutions
  • Expert level knowledge of programming and scripting, hardware description languages, electronic design automation (EDA), and/or FPGA tools
  • Coursework in VLSI design or VLSI concepts
  • Executive-level written and verbal communication skills
  • Mastery of English and the local language
  • Subject matter expertise or discipline leadership as evidenced through patents/publications in the field of VLSI or electronic/hardware component designs and tools.
Job Responsibility:
  • Provides technical expertise and leads project teams of Electronic and VLSI engineers and internal and outsourced development partners responsible for all stages of VLSI design and development for complex products, solutions, and platforms, including design, validation, and testing
  • Reviews and evaluates designs and project activities for compliance with VLSI technology and development guidelines and standards
  • Provides tangible feedback to improve product quality
  • Provides VLSI-specific and technical expertise along with the overall architecture design and platform leadership to cross-organization projects, programs, and activities
  • Provides leadership of project team of other VLSI engineers and internal and outsourced development partners to develop reliable, cost effective and high quality solutions for VLSI prototypes and products
  • Provides guidance and mentoring to less experienced staff members to set an example of VLSI design and development innovation and excellence
  • Participates in and provides input on process for selection of future technical leaders
  • Drives VLSI innovation and integration of new technologies into projects and activities in the design organization.
What we offer:
  • Health & Wellbeing
  • Personal & Professional Development
  • Unconditional Inclusion
  • Fulltime