Senior Staff Engineer - Availability and Incident Management

GEICO

Location:
United States, Chevy Chase

Contract Type:
Not provided

Salary:

110000.00 - 260000.00 USD / Year

Job Description:

GEICO is seeking an experienced Engineer with a passion for building high-performance, low-maintenance, zero-downtime platforms and applications. You will help drive our insurance business transformation as we transition from a traditional IT model to a tech organization with engineering excellence as its mission, while co-creating a culture of psychological safety and continuous improvement.

The Senior Staff Engineer in Availability and Incident Management will engineer solutions and empower the engineering community with automated processes, data-driven insights, and technical tools that reduce incident recurrence, improve system reliability, and accelerate incident resolution. This role centers on building automation platforms that streamline postmortem workflows, eliminate manual tracking, and provide fast feedback loops for incident prevention. You will lead the strategy and execution of a technical roadmap that increases the velocity of incident resolution, reduces repeat incidents, and unlocks new reliability engineering capabilities.

Job Responsibility:

  • Lead the strategy and execution for incident retrospective and correction of error (COE) processes across the engineering organization
  • Help conduct deep technical root cause analysis and incident forensics across distributed systems using observability data, logs, metrics, and traces
  • Establish continuous improvement loops through automated trend analysis, pattern recognition algorithms, and predictive analytics
  • Design, code, and deploy automation platforms and self-service tools using Python, Go, Java, or C# that scale incident retrospective workflows and eliminate manual tracking
  • Build production-grade data pipelines, analytics systems, and real-time dashboards to measure incident trends, COE effectiveness, and action item completion rates
  • Write code for workflow automation, integrations with observability platforms, and APIs that connect incident management tools across the engineering ecosystem
  • Leverage SQL and NoSQL databases to store, query, and analyze incident data at scale using Azure tools and cloud-native services
  • Develop and maintain systems that ensure rigorous follow-through on action items, remediation plans, and preventive measures with automated tracking
  • Partner with service engineering teams to implement preventive measures and architectural improvements based on incident patterns
  • Present data-driven insights and incident trend analysis to leadership and engineering teams to drive preventive action
  • Influence and educate leadership on incident patterns, prevention strategies, and reliability best practices
  • Mentor engineers on coding best practices and automation techniques, and strengthen technical expertise across the engineering community
  • Stay current with industry advances in SRE, observability, incident management, and automation, and educate teams on emerging practices

Requirements:

  • Experience building automation platforms and self-service tools for workflow management, analytics, or engineering productivity
  • Fluency in at least two modern languages such as Python, Go, Java, C++, or C# including object-oriented design
  • Experience building microservices architectures, REST APIs, and distributed systems
  • Experience with data pipelines, analytics platforms, and visualization tools for operational metrics and KPIs
  • Experience with SQL and NoSQL databases (e.g., PostgreSQL, MongoDB, Cassandra, CosmosDB) for data storage and analytics
  • Experience with observability platforms (Prometheus, Grafana, Datadog, Splunk, ELK) and distributed systems monitoring, logging, and tracing
  • Experience with cloud providers (Azure, AWS, or GCP) and cloud-native architectures
  • Experience with CI/CD pipelines, infrastructure as code, and container orchestration (Kubernetes, Docker)
  • Experience writing workflow automation code (YAML pipelines, GitHub Actions, Azure DevOps pipelines)
  • Strong understanding of distributed systems architecture, design patterns, reliability, and scaling
  • Knowledge of retrospective facilitation, continuous improvement processes, and blameless culture principles
  • Strong architecture and design skills with ability to influence engineering direction and technical roadmap
  • Experience solving complex analytical problems with data-driven approaches
  • Proven ability to partner with cross-functional engineering teams and drive systemic improvements
  • Excellent communication skills with ability to present technical insights to leadership and influence decision-making
  • 10+ years of professional platform development or general development experience
  • 8+ years of experience with architecture and design
  • 6+ years of experience in open-source frameworks
  • 4+ years of experience with AWS, GCP, Azure, or another cloud service
  • Bachelor’s degree in Computer Science, Information Systems, or equivalent education or work experience

Nice to have:

  • Experience leveraging GenAI or LLMs is a plus

What we offer:
  • Comprehensive Total Rewards program that offers personalized coverage tailor-made for you and your family’s overall well-being
  • Financial benefits including market-competitive compensation, a 401K savings plan vested from day one that offers a 6% match, performance and recognition-based incentives, and tuition assistance
  • Access to additional benefits like mental healthcare as well as fertility and adoption assistance
  • Flexibility: we provide workplace flexibility as well as our GEICO Flex program, which offers the ability to work from anywhere in the US for up to four weeks per year

Additional Information:

Job Posted:
February 21, 2026

Employment Type:
Fulltime
Work Type:
Hybrid work

Similar Jobs for Senior Staff Engineer - Availability and Incident Management

Engineering Manager, Infrastructure

As an Engineering Manager for the Infrastructure team, you’ll lead the engineers...
Location:
Canada; United States
Salary:
195000.00 - 285000.00 USD / Year
Apollo.io
Expiration Date:
Until further notice
Requirements:
  • 5+ years of hands-on software or infrastructure engineering experience
  • 2+ years of experience leading teams of senior and staff-level engineers in platform, SRE, or infrastructure domains
  • Proven ability to design and operate large-scale distributed systems in cloud environments (preferably GCP or AWS)
  • Expertise with Kubernetes, Docker, Terraform, Ubuntu, and CI/CD pipelines
  • Familiarity with observability tools (Grafana, Prometheus, ELK, Datadog, NewRelic) and performance tuning
  • Strong grounding in networking, security, and reliability principles
  • Experience managing infrastructure costs, availability SLAs, and high-throughput systems at scale
Job Responsibility:
  • Lead, coach, and grow a distributed team of high-impact Infrastructure Engineers
  • Partner with senior engineering leadership on strategic initiatives such as cloud migration, infrastructure scaling, platform reliability, and cost efficiency
  • Define and implement modern operational excellence practices, including SLOs, error budgets, incident reviews, and performance monitoring
  • Guide technical decision-making across key areas like Kubernetes, GCP, observability, networking, CI/CD, and IaC (Terraform, Ansible)
  • Collaborate with AI, Data, and Product Engineering teams to ensure infrastructure scalability for ML and AI-native workloads
  • Run effective 1:1s, career development conversations, and quarterly performance reviews
  • Support recruiting efforts to attract top engineering talent across time zones
What we offer:
  • Equity
  • Company bonus or sales commissions/bonuses
  • 401(k) plan
  • At least 10 paid holidays per year
  • Flex PTO
  • Parental leave
  • Employee assistance program and wellbeing benefits
  • Global travel coverage
  • Life/AD&D/STD/LTD insurance
  • FSA/HSA and medical, dental, and vision benefits
Employment Type:
Fulltime

HSEQ Senior Manager

We are currently looking for an HSEQ Senior Manager to join our team, part of ou...
Location:
Greece, Thessaloniki
Salary:
Not provided
Metlen Group
Expiration Date:
Until further notice
Requirements:
  • A bachelor's degree in Engineering, Chemical, Mechanical, Metallurgical or Environmental Science
  • 8+ years’ experience in HSEQ leadership roles within industrial or energy sectors
  • Strong knowledge of ISO standards and regulatory frameworks
  • Fluency in English (verbal and written) is necessary; knowledge of a second foreign language is desirable
  • Excellent leadership, communication, and analytical skills
  • Availability to travel within and outside Europe
  • MSc or MBA will be considered an asset
Job Responsibility:
  • Design and implement the company’s HSEQ strategy, policies, and objectives in alignment with ISO standards (45001, 14001, 9001, 17025)
  • Ensure full legal compliance with Greek, Romanian, and EU regulations on environmental protection, occupational health & safety, waste shipment, and industrial risk (e.g., SEVESO Directive)
  • Lead permitting processes for hazardous material handling, including environmental impact assessments, CO₂ emissions, and greenhouse gas allowances
  • Oversee implementation of HSEQ and ESG action plans across all Circular Metals facilities, standardizing systems while adapting to local legal and operational needs
  • Coordinate audits and inspections, including SEVESO reviews, and lead resolution of non-conformities
  • Develop emergency response plans for incidents involving hazardous waste, chemicals, or emissions
  • Lead a multidisciplinary team covering Health & Safety, Environmental Management, Quality Control, Systems & Occupational Medicine
  • Ensure training and upskilling of all plant staff and contractors and manage departmental budgeting and KPIs
  • Engage with authorities in Greece and Romania for licensing, inspections, and regulatory submissions
  • Track and report on key HSEQ and ESG indicators such as environmental performance, LTI, incident frequency, CO₂ footprint, and waste valorization
What we offer:
  • Competitive remuneration package
  • Ticket Restaurant Card
  • Group Health Insurance Plan
  • Preferential household electricity plan
  • Pension Plan
  • Company car
  • Fuel allowance
  • Performance bonus
Employment Type:
Fulltime

Assistant Engineering Manager

We’re looking for an Assistant Engineering Manager to support the Engineering Ma...
Location:
United Kingdom, Luton
Salary:
Not provided
Arriva London South Limited
Expiration Date:
Until further notice
Requirements:
  • Level 3 qualification in Mechanical Engineering, Electrical Engineering, or a related discipline
  • Proven experience in an engineering role within transport or a similar heavy industry environment
  • Strong understanding of vehicle maintenance, diagnostics, and repair procedures
  • Experience supervising or supporting a small team of technicians or engineers
  • Strong analytical and problem-solving skills with the ability to interpret technical data
  • Excellent communication, organisational, and stakeholder engagement skills
  • Competent IT skills, including Microsoft Office and engineering systems
  • Knowledge of relevant health & safety and environmental legislation
  • A valid UK driving licence is desirable
  • A proactive approach and commitment to continuous professional development
Job Responsibility:
  • Support the Engineering Manager in planning, organising, and overseeing day-to-day engineering activities, including preventative and reactive maintenance
  • Assist with the supervision, development, and performance management of engineering staff
  • Help implement engineering strategies, policies, and procedures to improve efficiency and control costs
  • Monitor fleet performance data, identify trends, and recommend actions to improve vehicle availability and reduce breakdowns
  • Ensure full compliance with health & safety legislation, industry standards, and company policies
  • Support procurement and management of spare parts, equipment, and external engineering services
  • Assist with engineering projects such as fleet upgrades, new equipment installations, and infrastructure improvements
  • Work collaboratively with Operations, Finance, and other departments to support business objectives
  • Participate in incident investigations and support corrective actions
  • Prepare reports and performance updates for senior management
Employment Type:
Fulltime

Project Manager

The Project Manager will be a leading part of the management team who are respon...
Location:
United Kingdom, Aberdeen
Salary:
Not provided
GCU UK Ltd
Expiration Date:
Until further notice
Requirements:
  • A proven track record of working at a senior management level on high value projects within the telecoms industry or other utility industries
  • Managers with a civils or a construction background will also be considered
  • NRSWA Supervisor certification is desirable
  • The ability to manage and supervise a project team within a fast-paced environment
  • Understand and implement agreed plans which support the needs of the business and support operational effectiveness
  • A track record of establishing a productive long-term working relationship with a client, resulting in the successful management and delivery of a multi-million-pound contract
Job Responsibility:
  • Manage productivity, assign roles, tasks and responsibilities required for the successful completion of the project in accordance with planning, specifications and requirements
  • Manage project Supervisors and ensure policies and procedures are being followed
  • Ensure Supervisors efficiently manage manpower for all projects
  • Lead support staff
  • Ensure adequate plant and equipment available for all necessary works
  • Hold team updates daily and arrange weekly and monthly meetings with Supervisors
  • Build, retain, and improve client and local authority relationships
  • Adhere to and ensure that all aspects of health and safety are followed in line with legislation and company procedures
  • Ensure all staff and sub-contractors under your control follow the procedures set out
  • Report all incidents and health and safety matters as required
What we offer:
  • Competitive basic salary
  • Company vehicle
  • Fuel card
  • Holiday allowance
  • Enrolment into the company pension scheme
  • Accommodation assistance can be offered where required
Employment Type:
Fulltime

Senior Engineering Manager, Cloud Enablement

We are hiring a Senior Engineering Manager to lead the Cloud Enablement team, pa...
Location:
United States
Salary:
225000.00 - 275000.00 USD / Year
Temporal
Expiration Date:
Until further notice
Requirements:
  • Strong background as a senior or staff-level software engineer before moving into engineering management
  • Deep experience building and operating distributed systems with a focus on reliability, scalability, and fault tolerance
  • Comfort working hands-on in the codebase, especially in complex, concurrency-heavy systems
  • Demonstrated experience leading teams that deliver production-grade cloud services
  • Strong understanding of replication, failover, and migration concepts in distributed systems
  • Proven ability to drive execution with a bias for action, even in ambiguous or fast-moving environments
  • Experience coaching engineers through technically challenging work and operational ownership
  • Clear, pragmatic communication skills across technical and non-technical audiences
Job Responsibility:
  • Lead, grow, and support a team of engineers working on solving core distributed system problems
  • Set clear technical direction for the team while aligning execution with CGS and company-wide priorities
  • Remain hands-on and technically engaged
  • Drive delivery of key Temporal Cloud capabilities, including High Availability namespaces and failover automation, migration tooling between self-hosted and cloud Temporal clusters, and namespace migration within Temporal Cloud for capacity management and data movement
  • Establish a strong culture of operational excellence, ensuring features are observable, safe to operate, and production-ready
  • Own execution and outcomes: planning, prioritization, delivery, and follow-through
  • Partner closely with Product, Infrastructure, Cloud and OSS teams to deliver cohesive solutions
  • Mentor and develop engineers, providing technical guidance, career growth support, and actionable feedback
What we offer:
  • Unlimited PTO, 12 Holidays + 2 Floating Holidays
  • 100% Premiums Coverage for Medical, Dental, and Vision
  • AD&D, LT & ST Disability, and Life Insurance (Standard & Supplemental Available)
  • Empower 401K Plan
  • Additional Perks for Learning & Development, Lifestyle Spending, In-Home Office Setup, Professional Memberships, WFH Meals, Internet Stipend and more
  • $3,600 / Year Work from Home Meals
  • $1,800 / Year Professional Enrichment (Career Development & Professional Memberships)
  • $1,200 / Year Lifestyle Spending Account
  • $1,000 / Year In-Home Office Setup (In addition to Temporal issued equipment)
  • $74 / Month Reimbursement for Internet
Employment Type:
Fulltime

Staff Reliability Engineer

The Robinhood Command Center (RCC) is a newly formed reliability team that serve...
Location:
United States, New York
Salary:
217000.00 - 255000.00 USD / Year
Robinhood
Expiration Date:
Until further notice
Requirements:
  • 8+ years of software engineering experience, including significant experience operating production systems
  • 4+ years focused on reliability engineering, infrastructure, distributed systems, or production operations
  • Hands-on experience serving in incident leadership roles (e.g., IMOC, incident commander, primary oncall)
  • Strong communication and cross-functional collaboration skills, especially during high-severity incidents
  • Deep knowledge of systems reliability, observability frameworks, and fault-tolerant architecture design
  • Experience with multi-region or multi-cluster architectures, capacity planning, and failover strategies
  • Familiarity with modern observability stacks (e.g., OpenTelemetry, Prometheus, Grafana)
  • Demonstrated ability to drive measurable improvements in MTTD, MTTR, availability, or customer impact
Job Responsibility:
  • Serve as a senior technical leader driving the long-term reliability and observability strategy across Robinhood’s infrastructure
  • Partner closely across many different types of engineers to raise the bar for operational excellence and incident response
  • Lead incident mitigation efforts by coordinating service owners, facilitating time-sensitive decisions like rollbacks and traffic shifts, and maintaining a clear source of truth during active incidents
  • Develop and maintain incident management processes and procedures to ensure timely resolution and minimize customer impact
  • Own incident discovery at the company level by defining and maintaining global dashboards and alerts tied to critical user journeys (CUJs), availability, and business-impact metrics
  • Own and evolve incident response tooling and processes, including education, adoption, and measurement of MTTD/MTTR improvements
  • Drive post-incident governance and learning, defining standards for postmortems, SEV reviews, and follow-up tracking to ensure durable reliability improvements
  • Design and implement next-generation failure mitigation strategies that avoid full-region or full-datacenter failovers
  • Define and build frameworks to improve monitoring, alerting, and observability across hundreds of services and systems
  • Define and own the roadmap of bringing observability to critical user journeys for Robinhood’s products
What we offer:
  • Performance driven compensation with multipliers for outsized impact, bonus programs, equity ownership, and 401(k) matching
  • 100% paid health insurance for employees with 90% coverage for dependents
  • Lifestyle wallet - a highly flexible benefits spending account for wellness, learning, and more
  • Employer-paid life & disability insurance, fertility benefits, and mental health benefits
  • Time off to recharge including company holidays, paid time off, sick time, parental leave, and more
  • Exceptional office experience with catered meals, events, and comfortable workspaces
Employment Type:
Fulltime

Staff Reliability Engineer

Join us in building the future of finance. The Robinhood Command Center (RCC) is...
Location:
United States, New York City
Salary:
217000.00 - 255000.00 USD / Year
Robinhood
Expiration Date:
Until further notice
Requirements:
  • 8+ years of software engineering experience, including significant experience operating production systems
  • 4+ years focused on reliability engineering, infrastructure, distributed systems, or production operations
  • Hands-on experience serving in incident leadership roles (e.g., IMOC, incident commander, primary oncall)
  • Strong communication and cross-functional collaboration skills, especially during high-severity incidents
  • Deep knowledge of systems reliability, observability frameworks, and fault-tolerant architecture design
  • Experience with multi-region or multi-cluster architectures, capacity planning, and failover strategies
  • Familiarity with modern observability stacks (e.g., OpenTelemetry, Prometheus, Grafana)
  • Demonstrated ability to drive measurable improvements in MTTD, MTTR, availability, or customer impact
Job Responsibility:
  • Serve as a senior technical leader driving the long-term reliability and observability strategy across Robinhood’s infrastructure
  • Partner closely across many different types of engineers to raise the bar for operational excellence and incident response
  • Lead incident mitigation efforts by coordinating service owners, facilitating time-sensitive decisions like rollbacks and traffic shifts, and maintaining a clear source of truth during active incidents
  • Develop and maintain incident management processes and procedures to ensure timely resolution and minimize customer impact
  • Own incident discovery at the company level by defining and maintaining global dashboards and alerts tied to critical user journeys (CUJs), availability, and business-impact metrics
  • Own and evolve incident response tooling and processes, including education, adoption, and measurement of MTTD/MTTR improvements
  • Drive post-incident governance and learning, defining standards for postmortems, SEV reviews, and follow-up tracking to ensure durable reliability improvements
  • Design and implement next-generation failure mitigation strategies that avoid full-region or full-datacenter failovers
  • Define and build frameworks to improve monitoring, alerting, and observability across hundreds of services and systems
  • Define and own the roadmap of bringing observability to critical user journeys for Robinhood’s products
What we offer:
  • Performance driven compensation with multipliers for outsized impact, bonus programs, equity ownership, and 401(k) matching
  • 100% paid health insurance for employees with 90% coverage for dependents
  • Lifestyle wallet - a highly flexible benefits spending account for wellness, learning, and more
  • Employer-paid life & disability insurance, fertility benefits, and mental health benefits
  • Time off to recharge including company holidays, paid time off, sick time, parental leave, and more
  • Exceptional office experience with catered meals, events, and comfortable workspaces
Employment Type:
Fulltime

Staff Software Engineer

As a Senior Staff Software Engineer at NMI, you operate beyond the scope of a si...
Location:
United States
Salary:
130000.00 - 160000.00 USD / Year
Parking Network B.V.
Expiration Date:
March 13, 2026
Requirements:
  • Bachelor’s degree in Computer Science, Information Technology, or equivalent practical experience
  • 8+ years of experience developing complex software applications in a commercial environment, with demonstrated impact at the Staff or Senior Staff engineer level
  • Advanced, hands-on experience building and maintaining large-scale systems using .NET Framework / C# (preferred) and/or PHP, with a strong understanding of object-oriented design principles and software architecture
  • Strong experience working with relational databases, particularly Microsoft SQL Server, including schema design, query optimization, performance tuning, and maintaining data integrity in production systems
  • Proven experience designing, coding, deploying, and operating cloud-based solutions hosted on AWS, with an understanding of scalability, fault tolerance, security, and cost-aware design
  • Experience designing and architecting scalable, distributed systems, with consideration for performance, reliability, and long-term maintainability
  • Deep understanding of the Software Development Life Cycle (SDLC) and agile development methodologies
  • Strong knowledge of security best practices, including secure coding principles and compliance requirements (e.g., OWASP Top Ten, PCI DSS, SOC 2, HIPAA, or similar)
  • Solid understanding of networking fundamentals, including HTTPS, DNS, SSL/TLS, and service-to-service communication patterns
  • Deep knowledge of design patterns and their practical application in real-world systems
Job Responsibility:
  • Provide technical leadership for the team, influencing architecture and design decisions that span multiple teams
  • Own and evolve critical platform areas including partner onboarding, developer tooling, authentication, user management, and the unified partner portal
  • Identify long-term technical risks and opportunities, and lead initiatives to address scalability, reliability, security, and maintainability
  • Set and reinforce engineering standards, patterns, and best practices across teams
  • Collaborate closely with Engineering Managers and Directors to align technical strategy with delivery plans and team goals
  • Partner with Product Managers, Directors, and Designers to translate product vision into technically sound, scalable solutions
  • Act as a trusted technical advisor across teams, helping resolve complex cross-team dependencies and tradeoffs
  • Drive alignment and consistency across partner-facing systems and experiences
  • Design, implement, and review high-impact code, particularly in complex or high-risk areas
  • Lead technical discovery and execution for ambiguous or strategically important initiatives
What we offer:
  • A remote first culture
  • Flex PTO
  • Health, Dental and Vision Insurance
  • 13 Paid Holidays
  • Company volunteer days
  • Bonus
Employment Type:
Fulltime