Staff Reliability Engineer Job at Robinhood (New York City)

Staff Reliability Engineer

ALSO is looking for a Reliability Engineer to play a key role in developing and ...

Location

United States, Palo Alto

Salary:

220000.00 - 255000.00 USD / Year

ALSO

Expiration Date

Until further notice

Requirements

Minimum Bachelor of Science in an engineering discipline or equivalent
Five or more years of industry experience in a reliability engineering role
Technical knowledge of one or more aspects related to PCBA reliability, energy storage systems, drive units, and chassis components
Working knowledge of a coding language, preferably Python
Working knowledge of statistical software for reliability, such as JMP or the ReliaSoft suite, or MATLAB
Working knowledge of failure analysis techniques such as SEM, CT, X-Ray, EDS, TDR
Experience with instrumentation for data collection and testing, such as thermocouples, strain gauges, and accelerometers
Experience with deploying reliability testing guidelines and inventing new ways of testing
Ability to use FEA software (e.g., Ansys Sherlock) for product life prediction when creating reliability tests

Job Responsibility

Establish reliability targets and metrics for new product development that include actuators, batteries, drive units, PCBAs, chassis, and various low voltage electronic systems
Develop new reliability test procedures and specifications, such as highly accelerated life testing, environmental testing, and other tests that demonstrate reliability
Apply various types of acceleration models for creation of reliability tests to accurately predict product lifetime under accelerated stress conditions
Prepare concise and detailed test plans and analyze test reports. Provide updates and progress on various testing campaigns
Guide the engineering teams on failure modes, typical countermeasures for reliability failures, and design guidelines. Provide risk analysis to guide design decisions
Monitor field performance and identify trends in failure
Lead and facilitate FMEAs with cross-functional teams and develop design verification plans from these activities

What we offer

Robust health coverage. Excellent health, dental and vision insurance covered up to 100% by ALSO with FSA & HSA options
One Medical membership and dedicated insurance advocates
Rich fertility and family building benefits with Progyny
Flexible time off
401(k) match

Fulltime

Staff Reliability Engineer

The Robinhood Command Center (RCC) is a newly formed reliability team that serve...

Location

United States, New York

Salary:

217000.00 - 255000.00 USD / Year

Robinhood

Expiration Date

Until further notice

Requirements

8+ years of software engineering experience, including significant experience operating production systems
4+ years focused on reliability engineering, infrastructure, distributed systems, or production operations
Hands-on experience serving in incident leadership roles (e.g., IMOC, incident commander, primary oncall)
Strong communication and cross-functional collaboration skills, especially during high-severity incidents
Deep knowledge of systems reliability, observability frameworks, and fault-tolerant architecture design
Experience with multi-region or multi-cluster architectures, capacity planning, and failover strategies
Familiarity with modern observability stacks (e.g., OpenTelemetry, Prometheus, Grafana)
Demonstrated ability to drive measurable improvements in MTTD, MTTR, availability, or customer impact

Job Responsibility

Serve as a senior technical leader driving the long-term reliability and observability strategy across Robinhood’s infrastructure
Partner closely across many different types of engineers to raise the bar for operational excellence and incident response
Lead incident mitigation efforts by coordinating service owners, facilitating time-sensitive decisions such as rollbacks and traffic shifts, and maintaining a clear source of truth during active incidents
Develop and maintain incident management processes and procedures to ensure timely resolution and minimize customer impact
Own incident discovery at the company level by defining and maintaining global dashboards and alerts tied to critical user journeys (CUJs), availability, and business-impact metrics
Own and evolve incident response tooling and processes, including education, adoption, and measurement of MTTD/MTTR improvements
Drive post-incident governance and learning, defining standards for postmortems, SEV reviews, and follow-up tracking to ensure durable reliability improvements
Design and implement next-generation failure mitigation strategies that avoid full-region or full-datacenter failovers
Define and build frameworks to improve monitoring, alerting, and observability across hundreds of services and systems
Define and own the roadmap of bringing observability to critical user journeys for Robinhood’s products

What we offer

Performance-driven compensation with multipliers for outsized impact, bonus programs, equity ownership, and 401(k) matching
100% paid health insurance for employees with 90% coverage for dependents
Lifestyle wallet - a highly flexible benefits spending account for wellness, learning, and more
Employer-paid life & disability insurance, fertility benefits, and mental health benefits
Time off to recharge including company holidays, paid time off, sick time, parental leave, and more
Exceptional office experience with catered meals, events, and comfortable workspaces

Fulltime

Staff Engineer, Site Reliability Engineer

OnStar is a cornerstone of General Motors' connected services—bringing safety, s...

Location

Ireland, Dublin

Salary:

Not provided

General Motors

Expiration Date

Until further notice

Requirements

8+ years in SRE, DevOps, or systems engineering, including experience managing or mentoring high-impact teams
Track record of building and maintaining high-scale, cloud-native systems (preferably AWS, GCP, or Azure)
Expertise in container orchestration and deployment strategies using Kubernetes and CI/CD pipelines
Proficiency in Python, Go, or Java, with strong code review and readability standards
Experience leading cross-functional infrastructure projects, configuration strategy, or organizational tooling initiatives
Ability to think and act under pressure
Strong communication skills

Job Responsibility

Lead the design and implementation of scalable, fault-tolerant, and observable infrastructure supporting OnStar mobile and web experiences, in-vehicle services, and the backend platforms and integrations that power them
Champion configuration management, infrastructure refactoring, and testing frameworks to strengthen system resilience
Partner across SRE, development, and product teams to improve service reliability, deployment safety, and incident response practices
Drive internal consultation and strategic planning on reliability standards for new OnStar capabilities, customer-facing releases, and platform initiatives
Define and evolve observability strategy using tools such as Prometheus, Grafana, and Datadog, with automated alerting and actionable SLO dashboards
Own and improve on-call practices, manage blameless postmortems, and guide root cause analysis to eliminate recurring failures
Mentor engineers and help shape a high-performance culture rooted in extreme ownership and operational excellence
Support compliance and privacy-driven engineering initiatives across connected services, with potential crossover into areas like data retention and safety certification tooling

Fulltime

Staff Site Reliability Engineer - Incident Management & Reliability

We’re not just building better tech. We’re rewriting how data moves and what the...

Location

Canada

Salary:

225100.00 - 264500.00 CAD / Year

Confluent

Expiration Date

Until further notice

Requirements

10+ years of relevant experience in SRE, incident management, or reliability engineering
Cloud experience with at least one of AWS, GCP, or Azure
Experience navigating reliability/incident programs at 500+ engineer organizations
Deep expertise with incident management tooling (Rootly, PagerDuty, or similar)
Strong understanding of distributed systems and failure modes at scale
Deep experience with observability: metrics, logging, tracing
Kubernetes and container orchestration experience
Understanding of CI/CD pipelines and release processes
Strong written communication (design docs, runbooks, post-mortems)
Experience driving org-wide process and cultural changes

Job Responsibility

Analyze systemic failure patterns and design reliability improvements that prevent incident recurrence
Own Rootly configuration, workflows, and integrations with PagerDuty, Jira, Confluence, and Slack
Define and maintain SLO/SLA frameworks; use error budgets to guide reliability investments
Own standards, practices, and continuous improvement of incident response across engineering
Edit and review customer-facing incident documents (CRCAs) to ensure quality and clarity
Develop and deliver training programs; coach teams through post-mortems
Partner with engineering leaders to elevate reliability practices org-wide

What we offer

Remote-First Work
Robust Insurance Benefits
Flexible Time Away
The Best Teammates
Experience Ambassadors
Open and Honest Culture
Well-Being and Growth
Offers Equity

Fulltime

Staff Engineer – Reliability Engineering

At GEICO, we offer a rewarding career where your ambitions are met with endless ...

Location

United States, Bethesda, MD; Seattle, WA

Salary:

115000.00 - 230000.00 USD / Year

Geico

Expiration Date

Until further notice

Requirements

Experience in at least two modern programming languages (Go, Python, Java, .NET) and object-oriented design
Advanced knowledge of web technologies such as HTML, CSS, and JavaScript is preferred
Understanding of open-source databases such as MySQL and PostgreSQL; familiarity with NoSQL databases such as ONgDB, Cassandra, MongoDB, and Elasticsearch
Deep hands-on experience in complex system design, data pipelines and architectures, scale, and performance tuning, with good knowledge of Docker and Kubernetes
Hands-on experience with major cloud platforms (Azure, AWS, GCP) or large-scale private data center environments
Experience managing distributed systems in public, private or hybrid cloud environments
Experience with monitoring, logging, and observability tools (Prometheus, Grafana, OpenTelemetry)
Passion for automation and reducing manual operations using tools like Terraform and Ansible
Familiarity with configuration management and orchestration tools like Helm, Puppet, Spinnaker
Experience with CI/CD pipelines, Infrastructure as Code (IaC), and cloud-based deployments

Job Responsibility

Focus on multiple areas and provide strategic and technical guidance
Use programming languages such as Go, Python, Java, .NET, or other object-oriented languages, along with SQL and NoSQL databases
Work with container and orchestration tools such as Docker and Kubernetes (K8s), OpenStack, and a variety of Azure tools and services
Architect and develop cloud-native applications using Azure Services
Collaborate with product managers, team members, customers, and other engineering teams to solve our toughest problems
Ensure the quality, performance and usability of the engineering solutions
Serve as a mentor and thought leader, coaching engineers and influencing and educating executives
Drive best practices for platform reliability, disaster recovery, monitoring, alerting, and incident management
Collaborate with cross-functional teams (Platform engineering, DevOps, SREs) to integrate, test, and improve platform reliability and performance
Determine and support resource requirements, evaluate operational processes, measure outcomes to ensure desired results, demonstrate adaptability and sponsor continuous learning

What we offer

Market-competitive compensation
401(k) savings plan vested from day one with 6% match
Performance and recognition-based incentives
Tuition assistance
Mental healthcare
Fertility and adoption assistance
Workplace flexibility
GEICO Flex program (ability to work from anywhere in the US for up to four weeks per year)

Fulltime

Site Reliability Engineer Staff

Site Reliability Engineer Staff. This role has been designed as 'Hybrid' with an...

Location

United States, San Juan

Salary:

Not provided

Hewlett Packard Enterprise

Expiration Date

Until further notice

Requirements

Minimum of 4 years of hands-on experience in Infra Ops, DevOps, or Site Reliability Engineering (SRE)
Proficiency with Linux systems, especially Debian-based distributions
Strong experience with cloud platforms such as AWS and GCP
Expertise in Infrastructure as Code tools like Terraform, Packer, and Ansible
Solid programming skills in Python and/or Golang
Deep understanding of containerization (Docker, Container) and orchestration tools (AWS EKS, GCP GKE)
Experience with GitOps workflows
Proven track record in implementing and maintaining CI/CD pipelines
Strong background in security and familiarity with security programs
Experience with monitoring and logging tools (Prometheus, Grafana, ELK)

Job Responsibility

Enhance Infrastructure as Code (IaC) and enforce best practices
Optimize cloud infrastructure for scalability, security, and cost-effectiveness
Develop internal tools to support and streamline cloud platform operations
Improve CI/CD pipelines and deployment workflows using FluxCD and Jenkins
Address container image vulnerabilities and standardize remediation processes
Build Amazon Machine Images (AMIs) aligned with CIS and STIG benchmarks
Strengthen monitoring, alerting, and observability using Prometheus, Grafana, and logging tools
Troubleshoot complex production issues to ensure system reliability and customer satisfaction
Fine-tune distributed systems such as Apache Kafka and Cassandra
Collaborate with development, security, and operations teams to align infrastructure with application needs

What we offer

Health & Wellbeing
Personal & Professional Development
Unconditional Inclusion

Fulltime

Senior Staff Reliability Engineer

Our reliability team is responsible to evaluate, develop, design, and implement ...

Location

United States, Secaucus

Salary:

124500.00 - 166000.00 USD / Year

Sanmina

Expiration Date

Until further notice

Requirements

Minimum B.S. in Electrical Engineering, Computer Science/Engineering, or Software Development and 8+ years of relevant work experience (alternatively, an M.S. and 6+ years)
Knowledge of computer systems/hardware structure, as well as switch/network interfaces
Knowledge of and/or experience with programming or scripting languages such as Python or shell scripting (Bash and/or PowerShell)
Knowledge of statistical & probability techniques and reliability modeling
Ability to communicate, collaborate and lead cross-functionally to resolve issues, including those with customers

Job Responsibility

Evaluate, develop, design, and implement software and product reliability test regimens
Use Design for Reliability principles to ensure cloud hardware meets specified use-conditions and stresses
Act as internal consultant on all reliability matters and interface with program management, vendors, and design engineering
Support the software/script development needs of the reliability team
Create or revise reliability engineering guidelines to improve product field performance
Use principles of performance evaluation and prediction to improve reliability and maintainability
Identify, collect, analyze, and manage various types of data to minimize failures and improve product performance
Develop scripts that represent the expected environment and operational conditions
Collaborate with other development functional teams and internal stakeholders regarding the application of Design for Reliability principles

What we offer

Performance-based annual bonus eligibility
401(k) retirement savings plan
Tuition reimbursement for eligible education programs
Comprehensive medical, dental, and vision coverage
Mental health resources and employee wellness support programs
Company-paid life and disability insurance
Paid time off (PTO) and company-paid holidays
Parental leave and family care support programs
Structured training programs and on-the-job learning opportunities
Matching gifts and volunteer programs

Fulltime

Staff Reliability Engineer - AI & Hyperscale Server NPI and Mfg.

Our reliability team is responsible to evaluate, develop, design, and implement ...

Location

United States, Secaucus

Salary:

105000.00 - 140000.00 USD / Year

Sanmina

Expiration Date

Until further notice

Requirements

Minimum B.S. in Electrical Engineering, Computer Science/Engineering, or Software Development and 5+ years of relevant work experience (alternatively, an M.S. degree and 3+ years of experience)
Knowledge of computer systems/hardware structure, as well as switch/network interfaces
Knowledge of and/or experience with programming or scripting languages such as Python or shell scripting (Bash and/or PowerShell)
Knowledge of statistical & probability techniques and reliability modeling
Ability to communicate, collaborate and lead cross-functionally to resolve issues, including those with customers

Job Responsibility

Evaluate, develop, design, and implement software and product reliability test regimens
Use Design for Reliability principles to ensure cloud hardware meets specified use-conditions and stresses
Act as the internal consultant on all reliability matters and interface with program management, vendors, and design engineering
Support the software/script development needs of the reliability team
Create or revise reliability engineering guidelines to improve product field performance
Use principles of performance evaluation and prediction to improve the reliability and maintainability of Cloud Infrastructure servers
Identify, collect, analyze, and manage various types of data to minimize failures and improve product performance
Develop scripts that represent the expected environment and operational conditions
Collaborate with other development functional teams and internal stakeholders regarding the application of Design for Reliability principles

What we offer

Performance-based annual bonus eligibility
401(k) retirement savings plan
Tuition reimbursement for eligible education programs
Comprehensive medical, dental, and vision coverage with access to leading providers
Mental health resources and employee wellness support programs
Company-paid life and disability insurance
Paid time off (PTO) and company-paid holidays
Parental leave and family care support programs
Structured training programs and on-the-job learning opportunities
Matching gifts and volunteer programs to support causes you care about

Fulltime
