Senior Site Reliability Engineer - Fleet Reliability

Lambda

Location:
United States, San Francisco

Contract Type:
Not provided

Salary:

230000.00 - 345000.00 USD / Year

Job Description:

Lambda, The Superintelligence Cloud, is a leader in AI cloud infrastructure serving tens of thousands of customers. Our customers range from AI researchers to enterprises and hyperscalers. Lambda's mission is to make compute as ubiquitous as electricity and give everyone the power of superintelligence. One person, one GPU. If you'd like to build the world's best AI cloud, join us. Engineering at Lambda is responsible for building and scaling our cloud offering. Our scope includes the Lambda website, cloud APIs and systems as well as internal tooling for system deployment, management and maintenance.

Job Responsibility:

  • Define Fleet Health metrics and indicators to objectively measure and improve system availability
  • Collaborate with the observability team on comprehensive monitoring and alerting systems to proactively predict, detect and respond to issues or anomalies
  • Create runbooks and automated remediations for common failure scenarios
  • Build in automation and auditing to ensure compliance and improve efficiency and productivity
  • Participate in on-call rotations and provide support for incident response and resolution
  • Implement and integrate logging and metrics across platforms such as Datadog, Prometheus, OpenTelemetry, Grafana, and SumoLogic

Requirements:

  • 7+ years of experience in Site Reliability Engineering, DevOps, or a similar role
  • Strong understanding of modern AI infrastructure, from GPU architectures to hardware performance optimization
  • Strong understanding of Linux-based systems in a distributed environment
  • Solid understanding of Python and Go, with experience working with SWE teams to improve internal tooling
  • Experience with monitoring and alerting tools (e.g., Prometheus, Grafana, SumoLogic)
  • Proficiency in automation and configuration management tools (e.g., Ansible, Terraform)
  • Familiarity with cloud platforms (e.g., OCI, AWS, GCP, Azure)
  • Excellent problem-solving and troubleshooting skills
  • Strong communication and collaboration skills
  • Passion for continuous improvement and innovation

Nice to have:

  • Experience in the machine learning or computer hardware industry
  • Knowledge of containerization and orchestration technologies (e.g., Docker, Kubernetes)
  • Experience building and/or operating HPC resources
  • Background in chaos engineering or similar reliability testing methodologies
  • Understanding of compliance frameworks (SOC 2, ISO 27001, etc.)

What we offer:

  • Generous cash & equity compensation
  • Health, dental, and vision coverage for you and your dependents
  • Wellness and commuter stipends for select roles
  • 401k Plan with 2% company match (USA employees)
  • Flexible paid time off plan

Additional Information:

Job Posted:
February 18, 2026

Employment Type:
Full-time
Work Type:
Hybrid work

Similar Jobs for Senior Site Reliability Engineer - Fleet Reliability

Senior Software Engineer, Backend

As a Senior Software Engineer, Backend specializing in database architecture and...
Location:
United States, San Francisco
Salary:
150000.00 - 240000.00 USD / Year
Chef Robotics
Expiration Date
Until further notice
Requirements:
  • Bachelor's degree in Computer Science, Engineering, or equivalent practical experience
  • 7+ years of professional experience in backend development roles with demonstrated leadership experience
  • Expert knowledge of relational databases (MySQL, PostgreSQL) including schema design, optimization, and administration
  • Strong proficiency with Python and JavaScript/TypeScript with advanced software engineering skills
  • Extensive experience leading projects with at least two web frameworks: Flask, FastAPI, Django, Node.js, or Next.js
  • Proven experience designing and implementing RESTful and GraphQL APIs at scale
  • Advanced understanding of containerization (Docker) and orchestration (Kubernetes) technologies
  • Experience with cloud infrastructure and deployment (AWS, GCP, or Azure) in production environments
  • Proven experience leading complex backend projects and mentoring junior engineers
  • Understanding of data requirements for robotics or automation systems
Job Responsibility:
  • Lead the design, implementation, and optimization of database schemas to support robot operations, telemetry, recipe management, and system analytics
  • Develop robust data migration strategies and version control for database schema evolution
  • Implement efficient query optimization and indexing strategies to support high-throughput robot operations
  • Establish data integrity protocols and backup systems to ensure operational continuity across customer deployments
  • Create scalable data access layers that balance security, performance, and maintainability
  • Mentor team members on database design patterns and optimization techniques
  • Lead the development and maintenance of scalable APIs to serve robot control systems, dashboards, and monitoring tools
  • Design and implement secure authentication and authorization mechanisms across backend services
  • Develop robust middleware for processing and validating data between robotics subsystems
  • Create service interfaces that enable efficient communication between robotics components and cloud services
What we offer:
  • medical, dental, and vision insurance
  • commuter benefits
  • flexible paid time off (PTO)
  • catered lunch
  • 401(k) matching
  • early-stage equity
Employment Type:
Full-time

Principal Group Engineering Manager

Microsoft Specialized Clouds combines the power of edge platforms, devices, and ...
Location:
India, Bangalore
Salary:
Not provided
Microsoft Corporation
Expiration Date
Until further notice
Requirements:
  • 15+ years of professional software engineering experience, including designing, building, and operating distributed, cloud-scale services
  • 5+ years of engineering leadership experience, including managing managers and leading multi-team engineering organizations (M2+)
  • Deep experience with network device platforms — specifically Arista (EOS, eAPI, CloudVision) and/or Cisco (NX-OS, DCNM/NDFC) — including device programming, configuration management, and automation
  • Strong background in device programming and network automation — building systems that programmatically configure, validate, and manage network device state at scale
  • Experience with Azure Resource Provider (RP) engineering — ARM resource modeling, deployment pipelines, control-plane architecture, and resource lifecycle management
  • Solid understanding of L2/L3 networking fundamentals: spine-leaf architecture, VXLAN, overlay/underlay networking, BGP, and data center network design
  • Proven ability to set technical direction and architectural strategy for complex platforms spanning multiple components and partner teams
  • Demonstrated success owning end-to-end delivery of customer-critical services, including design, development, release, and live-site operations
  • Strong experience driving operational excellence, including reliability, incident management, automation, and cost optimization for production services
  • Proven track record of leading organizational transformation — such as quality resets, reliability turnarounds, code yellow resolution, or engineering culture change across an engineering org
Job Responsibility:
  • Lead engineering teams through the design, architecture, development, testing, and operations of the Network Fabric platform — the cloud-managed networking layer for Azure Operator Nexus and Azure Local
  • Drive execution excellence across the full software lifecycle: semester planning, feature delivery, release management, and live-site operations
  • Own engineering commitments across multiple workstreams including network device programming, Azure Resource Provider development, fabric orchestration, and network configuration management
  • Ensure services meet Microsoft standards for quality, reliability, security, and operational readiness
  • Establish and enforce engineering best practices — including test-driven development, automated validation, secure development lifecycle (SDL/SFI), and continuous integration
  • Continue and accelerate the ongoing engineering transformation: driving quality resets, improving release predictability, and reducing customer-impacting incidents
  • Own the resolution of code yellow and equivalent quality escalations, driving root cause analysis and systemic remediation across the engineering organization
  • Champion a culture of engineering fundamentals — ensuring that quality, security, and operational maturity are embedded into every sprint, not treated as afterthoughts
  • Drive measurable reduction in support costs through automation, improved test coverage, and process optimization
  • Provide technical leadership across device programming (Arista EOS, Cisco NX-OS), network fabric orchestration, and Azure Resource Provider engineering
Employment Type:
Full-time

Senior Site Reliability Engineer, Infrastructure Foundations

The Wikimedia Foundation is looking for a Senior Site Reliability Engineer to su...
Location:
United States
Salary:
113082.00 - 175725.00 USD / Year
Wikimedia Foundation
Expiration Date
Until further notice
Requirements:
  • 6+ years of experience in an SRE/Operations/DevOps role as part of a team
  • Experience with shell and any scripting languages used in an SRE context (Python, Go, Bash, Ruby; we primarily use Python) and configuration management tools (Puppet, Ansible; we use Puppet)
  • Experience designing and managing infrastructure security for large fleets of diverse services
  • Experience with technical response during security incidents
  • Experience with package management on Linux systems (we use Debian)
  • Strong Linux system-level troubleshooting skills
  • History of automating tasks and processes, identifying process gaps, and finding automation opportunities
  • Strong English language skills (verbal and written) and ability to work independently, as an effective part of a globally distributed team working across multiple time zones
Job Responsibility:
  • Performing day-to-day operational/DevOps tasks on Wikimedia’s public facing infrastructure (deployment, maintenance, configuration, troubleshooting)
  • Implementing and utilizing configuration management and deployment tools (Puppet, Kubernetes)
  • Leading continuous improvement, by automating the installation, configuration and maintenance of services on our platform
  • Working closely with product teams, helping them bring scalable functionality to our users by assisting in the architectural design of new services and making them operate at scale
  • Participating in a 24/7 on-call rotation shared across the broader SRE team. This includes taking part in incident response, diagnosis and follow-up on system outages or alerts across Wikimedia’s production infrastructure
  • Collaborating with a global, cross-functional team in an asynchronous communication environment
  • Mentoring peers in your areas of technical and operational strength
  • Ability and willingness to travel 1-2 times a year for in-person events and team meetings
  • Most importantly, share our values and work in accordance with them
Employment Type:
Full-time

Senior Manager, Hybrid Services & Reliability (SRE)

As the Senior Engineering Manager for Hybrid Services & Reliability (HSR) within...
Location:
United States, Austin, Texas; Sunnyvale, California
Salary:
201600.00 - 302000.00 USD / Year
General Motors
Expiration Date
Until further notice
Requirements:
  • Extensive background in Site Reliability Engineering (SRE) and defining SLO/SLI frameworks for hybrid cloud environments
  • Technical proficiency in managing on-prem Linux utilities (DHCP/PXE/NTP) and core development services
  • Opinionated view on automated observability, incident response, and MTTR reduction
  • Proven leadership experience
Job Responsibility:
  • Reliability Engineering: Define, measure, and enforce strict SLOs/SLIs for critical hybrid cloud services, including network connectivity and compute readiness
  • Foundational Utilities: Own and manage core on-prem utilities, such as DHCP, PXE, and CDN, to ensure seamless server auto-provisioning across the global fleet
  • Environment Integrity: Manage the entire data flow path, from initial ingestion at the test bench through the secure cloud network into production staging
  • HIL Readiness: Guarantee the 99%+ availability and stability of remote CI-based Hardware-in-the-Loop (HIL) benches required for AV safety validation
  • Organization Growth: Actively lead the recruitment and technical mentorship of Senior and Staff ICs as part of the team's expansion
What we offer:
  • medical, dental, vision, Health Savings Account, Flexible Spending Accounts, retirement savings plan, sickness and accident benefits, life insurance, paid vacation & holidays, tuition assistance programs, employee assistance program, GM vehicle discounts
  • relocation benefits
Employment Type:
Full-time

Senior Site Reliability Engineer

It's fun to work in a company where people truly believe in what they're doing! ...
Location:
India, Bengaluru
Salary:
Not provided
BlackLine
Expiration Date
Until further notice
Requirements:
  • 5–10+ years in SRE, DevOps, or systems engineering in production cloud environments
  • B.Tech/B.E. in Computer Science or related field
  • Expertise in automation, observability & monitoring, CI/CD pipelines, and incident management
  • Experience with SRE principles (SLI/SLO/error budgets/postmortems, etc.)
  • Proficient in IaC tools like Terraform, Ansible, Chef
  • Experience working with HashiCorp tools (Consul, Vault, Nomad, Packer)
  • Strong cloud knowledge (GCP preferred, AWS/Azure a plus)
  • Experience with containerization & orchestration (Docker, Kubernetes, ArgoCD, etc.)
  • Advanced scripting and automation (Python, Go, PowerShell)
  • Familiarity with cloud cost monitoring and optimization techniques
Job Responsibility:
  • Own performance, scalability, and operational excellence across critical services
  • Blend software engineering and systems engineering to build and run large-scale, fault-tolerant, distributed systems—focusing on performance, capacity, availability, and security
  • Own service reliability across the stack and collaborate closely with developers, architects, and infrastructure teams to ensure services are resilient by design and self-healing by default
  • Automate operational tasks to reduce toil and increase team velocity
  • Lead timely and reliable deployments, with emphasis on progressive delivery techniques (canary, blue/green, feature flags, zero outage, etc.)
  • Partner in blameless postmortems and ensure incident reviews lead to systemic fixes
  • Automate secure lifecycle of certificates, secrets, and credentials
  • Build and maintain cloud-native security stacks and compliance guardrails
  • Execute infrastructure rotation and automated rehydration to maintain fleet hygiene
  • Create and manage highly reproducible environment provisioning via Infrastructure as Code
What we offer:
  • A technology-based company with a sense of adventure and a vision for the future
  • A culture that is kind, open, and accepting
  • A culture where BlackLiners' continued growth and learning are empowered
  • BlackLine offers a wide variety of professional development seminars and inclusive affinity groups to celebrate and support our diversity

Asset Health and Reliability Specialist

Reporting to the Mobile Maintenance Manager, the Asset Health and Reliability Sp...
Location:
Australia, Pilbara
Salary:
Not provided
PLS
Expiration Date
Until further notice
Requirements:
  • Relevant nationally recognized trade qualification
  • Current Driver’s Licence (C Class minimum)
  • Minimum 5 years of experience in a mining or heavy industry environment, with a focus on HME reliability or maintenance
  • Strong knowledge of heavy mobile equipment systems and components (e.g., engines, hydraulics, powertrain, electrical, undercarriage)
  • Experience conducting root cause analysis and implementing reliability improvements
  • Excellent communication and interpersonal skills for cross-functional collaboration
  • Strong commitment to safety and continuous improvement
  • Highly developed attention to detail, with the ability to analyse condition monitoring information and provide accurate reported recommendations
  • Data exploration skills and capability to review datasets across systems, inclusive of time-series VIMS, KOMTRAX and equivalent data
Job Responsibility:
  • Monitor and analyse the reliability performance of Heavy Mobile Equipment (HME) fleet
  • Develop and maintain equipment health strategies using reliability tools such as RCM, FMEA, and condition monitoring techniques
  • Collaborate with maintenance, operations, and OEMs to drive continuous improvement in equipment performance and availability
  • Identify and implement initiatives to improve mean time between failures (MTBF) and reduce mean time to repair (MTTR)
  • Review and optimise preventative and predictive maintenance strategies and schedules
  • Prepare and present reliability reports, KPIs, and improvement plans to senior management
  • Support the implementation and usage of reliability software systems (e.g., Pronto, AMT or similar CMMS tools)
  • Ensure all activities comply with site safety standards, environmental policies, and legislative requirements
What we offer:
  • Quarterly short-term incentive bonus recognising individual and business performance
  • PLS employee share scheme
  • Access to newly refurbished facilities at Pilgangoora, including gym, tennis, pickleball and volleyball courts, sports oval and scenic walking tracks
  • 18 weeks parental leave for primary carers and 4 weeks for secondary carers
  • Health and wellbeing allowance
  • Novated leasing through salary sacrifice
  • Paid community leave
  • Monthly employee recognition awards
  • Access to PLS’ KidsCo School Holiday Program
  • Access to our Employee Assistance Program and Company Chaplains
Employment Type:
Full-time

Senior Network Technician

As Senior Network Technician, you would help support the rollout of GeniusIQ, ou...
Location:
United Kingdom, London
Salary:
Not provided
Genius Sports
Expiration Date
Until further notice
Requirements:
  • 5 years’ experience with system and network administration on infrastructure with 100+ Linux servers
  • Strong understanding of the entire Linux server stack: OS boot and installation, system, networking, container deployment, logging, metrics & monitoring, out-of-band management, etc.
  • Strong understanding of OSI network layers 2-3-4 and network configuration: switching, VLANs, routing, firewall rules, ARP, DHCP, DNS, TCP, switch command-line, etc.
  • Proficiency in Bash scripting
  • Ability to communicate efficiently and articulate concepts based on the audience, including remote hands, engineering and customers
Job Responsibility:
  • Supervise IT issue tracking and resolution for a large fleet of bare-metal Linux servers and network equipment in hundreds of sport venues in Europe
  • Assist venue operations coordinators with preparation of equipment and installation, based on automation processes developed by site reliability engineers
  • Communicate kindly with external venue IT and management staff
  • Partner with software engineers to eliminate common issues
Employment Type:
Full-time

Senior Maintenance Planner

We are currently seeking an experienced Senior Mobile Fleet Maintenance Planner ...
Location:
Australia, Mudgee
Salary:
Not provided
Peabody Energy
Expiration Date
Until further notice
Requirements:
  • Mechanical Trade or Engineering qualification
  • 3+ years' experience as a Maintenance Planner desirable
  • Strong working knowledge of SAP, maintenance planning and scheduling principles and procedures
  • Strong interpersonal and communication skills
  • Demonstrated experience in safety systems and processes including JSEAs, risk assessments and permits
  • Experience with Microsoft Project is not required but desirable
  • Goal-orientated, with the ability to work autonomously
Job Responsibility:
  • Ensuring maintenance "best practice" techniques are implemented to ensure equipment is maintained to a high safety, productive and reliable standard
  • Working with stakeholders to manage lead time on parts
  • Prioritisation of work and time management
  • Taking an active role in forecasting costs for field short- to mid-term work
  • Working with the Maintenance Execution Team to develop plans that support the Maintenance function to meet the needs of the business
  • Developing and maintaining relationships with our internal departments as well as our key suppliers
  • Ensuring compliance with relevant statutory, legislative, WH&S standards and site policies and procedures
  • Developing a high-performing planning and scheduling team
Employment Type:
Full-time