Senior Site Reliability Engineer - Fleet Reliability

Lambda

Location:
United States, San Francisco

Contract Type:
Not provided

Salary:
230,000 - 345,000 USD / Year

Job Description:

Lambda, The Superintelligence Cloud, is a leader in AI cloud infrastructure serving tens of thousands of customers. Our customers range from AI researchers to enterprises and hyperscalers. Lambda's mission is to make compute as ubiquitous as electricity and give everyone the power of superintelligence. One person, one GPU. If you'd like to build the world's best AI cloud, join us.

Engineering at Lambda is responsible for building and scaling our cloud offering. Our scope includes the Lambda website, cloud APIs and systems, and internal tooling for system deployment, management, and maintenance.

Job Responsibilities:

  • Define Fleet Health metrics and indicators to objectively measure and improve system availability
  • Collaborate with the observability team on comprehensive monitoring and alerting systems to proactively predict, detect and respond to issues or anomalies
  • Create runbooks and automated remediations for common failure scenarios
  • Build in automation and auditing to ensure compliance and improve efficiency and productivity
  • Participate in on-call rotations and provide support for incident response and resolution
  • Implement and integrate logging and metrics across platforms such as Datadog, Prometheus, OpenTelemetry, Grafana, and SumoLogic

Requirements:

  • 7+ years of experience in Site Reliability Engineering, DevOps, or a similar role
  • Strong understanding of modern AI infrastructure, from GPU architectures to hardware performance optimization
  • Strong understanding of Linux-based systems in a distributed environment
  • Solid understanding of Python and Go, with experience working with SWE teams to improve internal tooling
  • Experience with monitoring and alerting tools (e.g., Prometheus, Grafana, SumoLogic)
  • Proficiency in automation and configuration management tools (e.g., Ansible, Terraform)
  • Familiarity with cloud platforms (e.g., OCI, AWS, GCP, Azure)
  • Excellent problem-solving and troubleshooting skills
  • Strong communication and collaboration skills
  • Passion for continuous improvement and innovation

Nice to have:

  • Experience in the machine learning or computer hardware industry
  • Knowledge of containerization and orchestration technologies (e.g., Docker, Kubernetes)
  • Experience building and/or operating HPC resources
  • Background in chaos engineering or similar reliability testing methodologies
  • Understanding of compliance frameworks (SOC 2, ISO 27001, etc.)

What we offer:

  • Generous cash & equity compensation
  • Health, dental, and vision coverage for you and your dependents
  • Wellness and commuter stipends for select roles
  • 401(k) plan with 2% company match (USA employees)
  • Flexible paid time off plan

Additional Information:

Job Posted:
February 18, 2026

Employment Type:
Full-time

Work Type:
Hybrid work