Senior Site Reliability Engineer - Fleet Reliability

Lambda

Location:
United States, San Francisco

Contract Type:
Not provided

Salary:

230000.00 - 345000.00 USD / Year

Job Description:

Lambda, The Superintelligence Cloud, is a leader in AI cloud infrastructure serving tens of thousands of customers. Our customers range from AI researchers to enterprises and hyperscalers. Lambda's mission is to make compute as ubiquitous as electricity and give everyone the power of superintelligence. One person, one GPU. If you'd like to build the world's best AI cloud, join us. Engineering at Lambda is responsible for building and scaling our cloud offering. Our scope includes the Lambda website, cloud APIs and systems as well as internal tooling for system deployment, management and maintenance.

Job Responsibility:

  • Define Fleet Health metrics and indicators to objectively measure and improve system availability
  • Collaborate with the observability team on comprehensive monitoring and alerting systems to proactively predict, detect and respond to issues or anomalies
  • Create runbooks and automated remediations for common failure scenarios
  • Build in automation and auditing to ensure compliance and improve efficiency and productivity
  • Participate in on-call rotations and provide support for incident response and resolution
  • Implement and integrate logging and metrics across platforms such as Datadog, Prometheus, OpenTelemetry, Grafana, and SumoLogic

Requirements:

  • 7+ years of experience in Site Reliability Engineering, DevOps, or a similar role
  • Strong understanding of modern AI infrastructure, from GPU architectures to hardware performance optimization
  • Strong understanding of Linux-based systems in a distributed environment
  • Solid understanding of Python and Go, with experience working with SWE teams to improve internal tooling
  • Experience with monitoring and alerting tools (e.g., Prometheus, Grafana, SumoLogic)
  • Proficiency in automation and configuration management tools (e.g., Ansible, Terraform)
  • Familiarity with cloud platforms (e.g., OCI, AWS, GCP, Azure)
  • Excellent problem-solving and troubleshooting skills
  • Strong communication and collaboration skills
  • Passion for continuous improvement and innovation

Nice to have:

  • Experience in the machine learning or computer hardware industry
  • Knowledge of containerization and orchestration technologies (e.g., Docker, Kubernetes)
  • Experience building and/or operating HPC resources
  • Background in chaos engineering or similar reliability testing methodologies
  • Understanding of compliance frameworks (SOC 2, ISO 27001, etc.)

What we offer:

  • Generous cash & equity compensation
  • Health, dental, and vision coverage for you and your dependents
  • Wellness and commuter stipends for select roles
  • 401(k) plan with 2% company match (USA employees)
  • Flexible paid time off plan

Additional Information:

Job Posted:
February 18, 2026

Employment Type:
Full-time

Work Type:
Hybrid work

Similar Jobs for Senior Site Reliability Engineer - Fleet Reliability

Senior Software Engineer, Backend

As a Senior Software Engineer, Backend specializing in database architecture and...
Location:
United States, San Francisco
Salary:
150000.00 - 240000.00 USD / Year
Chef Robotics
Expiration Date:
Until further notice
Requirements:
  • Bachelor's degree in Computer Science, Engineering, or equivalent practical experience
  • 7+ years of professional experience in backend development roles with demonstrated leadership experience
  • Expert knowledge of relational databases (MySQL, PostgreSQL) including schema design, optimization, and administration
  • Strong proficiency with Python and JavaScript/TypeScript with advanced software engineering skills
  • Extensive experience leading projects with at least two web frameworks: Flask, FastAPI, Django, Node.js, or Next.js
  • Proven experience designing and implementing RESTful and GraphQL APIs at scale
  • Advanced understanding of containerization (Docker) and orchestration (Kubernetes) technologies
  • Experience with cloud infrastructure and deployment (AWS, GCP, or Azure) in production environments
  • Proven experience leading complex backend projects and mentoring junior engineers
  • Understanding of data requirements for robotics or automation systems
Job Responsibility:
  • Lead the design, implementation, and optimization of database schemas to support robot operations, telemetry, recipe management, and system analytics
  • Develop robust data migration strategies and version control for database schema evolution
  • Implement efficient query optimization and indexing strategies to support high-throughput robot operations
  • Establish data integrity protocols and backup systems to ensure operational continuity across customer deployments
  • Create scalable data access layers that balance security, performance, and maintainability
  • Mentor team members on database design patterns and optimization techniques
  • Lead the development and maintenance of scalable APIs to serve robot control systems, dashboards, and monitoring tools
  • Design and implement secure authentication and authorization mechanisms across backend services
  • Develop robust middleware for processing and validating data between robotics subsystems
  • Create service interfaces that enable efficient communication between robotics components and cloud services
What we offer:
  • medical, dental, and vision insurance
  • commuter benefits
  • flexible paid time off (PTO)
  • catered lunch
  • 401(k) matching
  • early-stage equity
Employment Type:
Full-time

Principal Group Engineering Manager

Microsoft Specialized Clouds combines the power of edge platforms, devices, and ...
Location:
India, Bangalore
Salary:
Not provided
Microsoft Corporation
Expiration Date:
Until further notice
Requirements:
  • 15+ years of professional software engineering experience, including designing, building, and operating distributed, cloud-scale services
  • 5+ years of engineering leadership experience, including managing managers and leading multi-team engineering organizations (M2+)
  • Deep experience with network device platforms — specifically Arista (EOS, eAPI, CloudVision) and/or Cisco (NX-OS, DCNM/NDFC) — including device programming, configuration management, and automation
  • Strong background in device programming and network automation — building systems that programmatically configure, validate, and manage network device state at scale
  • Experience with Azure Resource Provider (RP) engineering — ARM resource modeling, deployment pipelines, control-plane architecture, and resource lifecycle management
  • Solid understanding of L2/L3 networking fundamentals: spine-leaf architecture, VXLAN, overlay/underlay networking, BGP, and data center network design
  • Proven ability to set technical direction and architectural strategy for complex platforms spanning multiple components and partner teams
  • Demonstrated success owning end-to-end delivery of customer-critical services, including design, development, release, and live-site operations
  • Strong experience driving operational excellence, including reliability, incident management, automation, and cost optimization for production services
  • Proven track record of leading organizational transformation — such as quality resets, reliability turnarounds, code yellow resolution, or engineering culture change across an engineering org
Job Responsibility:
  • Lead engineering teams through the design, architecture, development, testing, and operations of the Network Fabric platform — the cloud-managed networking layer for Azure Operator Nexus and Azure Local
  • Drive execution excellence across the full software lifecycle: semester planning, feature delivery, release management, and live-site operations
  • Own engineering commitments across multiple workstreams including network device programming, Azure Resource Provider development, fabric orchestration, and network configuration management
  • Ensure services meet Microsoft standards for quality, reliability, security, and operational readiness
  • Establish and enforce engineering best practices — including test-driven development, automated validation, secure development lifecycle (SDL/SFI), and continuous integration
  • Continue and accelerate the ongoing engineering transformation: driving quality resets, improving release predictability, and reducing customer-impacting incidents
  • Own the resolution of code yellow and equivalent quality escalations, driving root cause analysis and systemic remediation across the engineering organization
  • Champion a culture of engineering fundamentals — ensuring that quality, security, and operational maturity are embedded into every sprint, not treated as afterthoughts
  • Drive measurable reduction in support costs through automation, improved test coverage, and process optimization
  • Provide technical leadership across device programming (Arista EOS, Cisco NX-OS), network fabric orchestration, and Azure Resource Provider engineering
Employment Type:
Full-time

Senior Manager, Hybrid Services & Reliability (SRE)

As the Senior Engineering Manager for Hybrid Services & Reliability (HSR) within...
Location:
United States, Austin, Texas; Sunnyvale, California
Salary:
201600.00 - 302000.00 USD / Year
General Motors
Expiration Date:
Until further notice
Requirements:
  • Extensive background in Site Reliability Engineering (SRE) and defining SLO/SLI frameworks for hybrid cloud environments
  • Technical proficiency in managing on-prem Linux utilities (DHCP/PXE/NTP) and core development services
  • Opinionated view on automated observability, incident response, and MTTR reduction
  • Proven leadership experience
Job Responsibility:
  • Reliability Engineering: Define, measure, and enforce strict SLOs/SLIs for critical hybrid cloud services, including network connectivity and compute readiness
  • Foundational Utilities: Own and manage core on-prem utilities, such as DHCP, PXE, and CDN, to ensure seamless server auto-provisioning across the global fleet
  • Environment Integrity: Manage the entire data flow path, from initial ingestion at the test bench through the secure cloud network into production staging
  • HIL Readiness: Guarantee the 99%+ availability and stability of remote CI-based Hardware-in-the-Loop (HIL) benches required for AV safety validation
  • Organization Growth: Actively lead the recruitment and technical mentorship of Senior and Staff ICs as part of the team's expansion
What we offer:
  • medical, dental, vision, Health Savings Account, Flexible Spending Accounts, retirement savings plan, sickness and accident benefits, life insurance, paid vacation & holidays, tuition assistance programs, employee assistance program, GM vehicle discounts
  • relocation benefits
Employment Type:
Full-time

Senior Site Reliability Engineer

It's fun to work in a company where people truly believe in what they're doing! ...
Location:
India, Bengaluru
Salary:
Not provided
BlackLine
Expiration Date:
Until further notice
Requirements:
  • 5–10+ years in SRE, DevOps, or systems engineering in production cloud environments
  • B.Tech/B.E. in Computer Science or related field
  • Expertise in automation, observability & monitoring, CI/CD pipelines, and incident management
  • Experience with SRE principles (SLI/SLO/error budgets/postmortems, etc.)
  • Proficient in IaC tools like Terraform, Ansible, Chef
  • Experience working with HashiCorp tools (Consul, Vault, Nomad, Packer)
  • Strong cloud knowledge (GCP preferred, AWS/Azure a plus)
  • Experience with containerization & orchestration (Docker, Kubernetes, ArgoCD, etc)
  • Advanced scripting and automation (Python, Go, PowerShell)
  • Familiarity with cloud cost monitoring and optimization techniques
Job Responsibility:
  • Own performance, scalability, and operational excellence across critical services
  • Blend software engineering and systems engineering to build and run large-scale, fault-tolerant, distributed systems—focusing on performance, capacity, availability, and security
  • Own service reliability across the stack and collaborate closely with developers, architects, and infrastructure teams to ensure services are resilient by design and self-healing by default
  • Automate operational tasks to reduce toil and increase team velocity
  • Lead timely and reliable deployments, with emphasis on progressive delivery techniques (canary, blue/green, feature flags, zero outage, etc)
  • Partner in blameless postmortems and ensure incident reviews lead to systemic fixes
  • Automate secure lifecycle of certificates, secrets, and credentials
  • Build and maintain cloud-native security stacks and compliance guardrails
  • Execute infrastructure rotation and automated rehydration to maintain fleet hygiene
  • Create and manage highly reproducible environment provisioning via Infrastructure as Code
What we offer:
  • A technology-based company with a sense of adventure and a vision for the future
  • A culture that is kind, open, and accepting
  • A culture where BlackLiners' continued growth and learning are empowered
  • BlackLine offers a wide variety of professional development seminars and inclusive affinity groups to celebrate and support our diversity

Senior Maintenance Planner

We are currently seeking an experienced Senior Mobile Fleet Maintenance Planner ...
Location:
Australia, Mudgee
Salary:
Not provided
Peabody Energy
Expiration Date:
Until further notice
Requirements:
  • Mechanical Trade or Engineering qualification
  • 3+ years' experience as a Maintenance Planner (desirable)
  • Strong working knowledge of SAP, maintenance planning and scheduling principles and procedures
  • Strong interpersonal and communication skills
  • Demonstrated experience in safety systems and processes including JSEAs, risk assessments and permits
  • Experience with Microsoft Project is desirable but not required
  • Goal-orientated, with the ability to work autonomously
Job Responsibility:
  • Ensuring maintenance "best practice" techniques are implemented to ensure equipment is maintained to a high safety, productive and reliable standard
  • Working with stakeholders to manage lead time on parts
  • Prioritisation of work and time management
  • Taking an active role in forecasting costs for field short- to mid-term work
  • Working with the Maintenance Execution Team to develop plans that support the Maintenance function to meet the needs of the business
  • Developing and maintaining relationships with our internal departments as well as our key suppliers
  • Ensuring compliance with relevant statutory, legislative, WH&S standards and site policies and procedures
  • Developing a high-performing planning and scheduling team
Employment Type:
Full-time

Senior Data Scientist - Smart Charging

We are looking for a Senior Data Scientist with a background in Python-based mod...
Location:
United Kingdom, London
Salary:
Not provided
Zenobē
Expiration Date:
Until further notice
Requirements:
  • STEM degree (e.g. engineering, applied physics, data science)
  • 4+ years of relevant professional experience working in modelling, simulation, analytics and optimisation preferably in an EV or energy-adjacent domain
  • 5+ years’ experience with Python (pandas, scipy, plotly, scikit-learn, and other scientific / data libraries) and its developer tooling (e.g. uv, ruff, mypy)
  • A pragmatic approach to problem solving, follower of the 80/20 rule balancing outcome with effort and comfortable working with imperfect real-world data
  • Solid SDLC and collaborative software practices including Git for version control, testing, CI/CD, environment management etc. Ability to mentor others on advanced use of Python.
  • Technical background with good understanding of the underlying physical principles related to electric vehicles and the energy sector
  • Familiarity with energy markets, tariffs, grid services and commercial aspects of the energy sector
  • Confidence in working with autonomy, spearheading areas of technical development and representing our technical expertise both internally and externally
  • Demonstrable leadership in the planning and delivery of projects, accountability to senior stakeholders and line management of team members
  • A working knowledge of data engineering and cloud platform concepts
Job Responsibility:
  • Leading development of our charging strategy optimisation pipeline: writing Python code for data processing, physics-based simulation, commercial optimisation and data insight.
  • Diving into the operational, commercial and technical details of EV charging sites to tailor our modelling and optimisation pipelines to each customer and geography across our global portfolio, adapting to local constraints and opportunities in each case.
  • Managing the rollout of smart charging across our international portfolio and, in partnership with business development and customer success colleagues, using data and expert insight to highlight the value of smart charging to customers and help resolve operational concerns
  • Work with large fleet and charging datasets to train machine-learning models for predicting vehicle energy consumption or correlate physics-based models for virtual recreation of charging operations as a digital twin simulation.
  • Testing and cloud-deployment of code whilst ensuring alignment of ways of working with other technical teams
  • Pairing with other team members contributing to the smart charging codebase and reviewing pull requests to maintain coding standards.
  • Strategic planning of the smart charging analytical roadmap, aligning the delivery and dependencies of new features with product managers and balancing customer value delivery with reliability and effort
  • Management of team members working in the smart charging domain including work planning, reviewing and 1:1s.
  • Engage with onboarding and operational teams to define data requirements and testing plans in support of model development and correlation (e.g. charging or vehicle energy consumption).
  • Keep up to date with evolving smart charging opportunities and business cases, and the expanding needs of the business with a growing number of technologies and geographies supported.
What we offer:
  • Up to 33% annual bonus
  • 25 days holiday, increasing with length of service up to 30 days, plus bank holidays
  • Private Medical Insurance
  • £1,500 training budget per year
  • EV Salary Sacrifice Scheme
  • Pension scheme, up to 8% matched contributions
  • Enhanced parental leave
  • Cash back health plan

Viewing Assistant

Welcome to Charters, we're known for bringing local knowledge and strong values ...
Location:
United Kingdom, Bishop's Waltham
Salary:
15.00 GBP / Hour
360 Resourcing Solutions
Expiration Date:
Until further notice
Requirements:
  • Essential that you drive and have your own car
  • Excellent communication and interpersonal skills
  • Friendly and efficient
  • Demonstrate excellent timekeeping
  • Immaculate presentation and a professional manner
Job Responsibility:
  • Conduct viewings in a knowledgeable and professional manner
  • Familiarise yourself with the property credentials ahead of the viewings
  • Attend team meetings
Employment Type:
Part-time

Software Engineer

Technology sits at the heart of everything we do at MI5. We're looking for Softw...
Location:
United Kingdom, London; Manchester
Salary:
64005.00 - 70791.00 GBP / Year
Social Value Portal Ltd
Expiration Date:
Until further notice
Requirements:
  • Demonstrate ability in developing software in at least one common language
  • Demonstrate an understanding of the principles of modern standards approaches such as continuous integration and delivery, test driven development and cloud services
  • Demonstrate taking on a more senior role within a team. Provide technical direction and the ability to guide and support others with regard to software development
  • To work at MI5 you need to be a British citizen or hold dual British nationality
  • This role requires the highest security clearance, known as Developed Vetting (DV)
  • You’ll be proficient in developing enterprise or commercial software in at least one common language (for example Java, C#, Python or JavaScript) and are familiar with the principles of a modern standards approach
  • You can demonstrate proficiency in the use of the agile methodology and have awareness of design patterns and how to implement them appropriately with security in mind
  • You’ll demonstrate competency in leadership and you’re continuously looking for opportunities to develop and learn new engineering practices and approaches
Job Responsibility:
  • Develop solutions, mentor less experienced colleagues whilst working alongside a range of technical specialists including Product Owners, Business Analysts, Delivery Managers, Data Scientists and Machine Learning Engineers, to build and run secure applications and products
  • Using agile methodologies to deliver products that are core to MI5’s operations
  • Using cloud technologies such as AWS and Azure, as well as supporting on-premises platforms and long-established technologies and frameworks
  • Taking on ownership of large problems, breaking them down and working with the team to deliver new features throughout the engineering lifecycle
  • Support the products owned by the team, working with users to identify and fix defects (providing on-call support if necessary), developing automated tests to maintain the assurance of our products, and deploying through continuous integration pipelines
  • Support and mentor less experienced colleagues and help them to understand what great engineering looks like, promoting best practices, participating in our engineering community and guilds, and encouraging cross-organisation initiatives to help build our community of engineers
What we offer:
  • 25 Days Annual Leave automatically rising to 30 days after 5 years' service, and an additional 10.5 days public and privilege holidays
  • opportunities to be recognised through our employee performance scheme
  • dedicated development budget
  • interest-free season ticket loan
  • excellent pension scheme
  • cycle to work scheme
  • facilities such as a gym, restaurant and on-site coffee bars (at some locations)
  • paid parental and adoption leave
  • up to 20% innovation and personal development time
  • opportunities to gain qualifications and pursue specialist pathways, as well as undertaking tailored training, coaching and mentoring