Lead Infrastructure and Reliability Engineer

Luma AI

Location:
United States, Palo Alto

Contract Type:
Not provided

Salary:

230000.00 - 360000.00 USD / Year

Job Description:

Our Infrastructure Engineering team is a systems engineering group with company-level responsibility. At Luma, reliability engineers work directly with the researchers and products pushing the limits of multimodal intelligence. We operate close to the metal: kernels, containers, schedulers, networking, storage, and GPU behavior. But we are also responsible for something bigger: turning deep systems knowledge into repeatable, scalable reliability for the entire company. We are hiring a leader who will define that direction. You will be a technical authority, an organizational force multiplier, and a magnet for other great engineers.

Job Responsibility:

Reliability of the Frontier:
  • Architect and operate large, heterogeneous GPU environments under extreme demand
  • Improve utilization and performance where small gains materially change company outcomes
  • Resolve failures that span hardware, OS, runtimes, and orchestration
  • Eliminate entire classes of instability
  • Build mechanisms that make heroics unnecessary

Scaling Training & Inference:
  • Define how infrastructure and workloads evolve as cluster size and concurrency grow
  • Design scheduling, placement, and resource management approaches for increasingly complex jobs
  • Work directly with research to build the systems required for new model capabilities
  • Ensure inference platforms scale rapidly without sacrificing reliability or latency
  • Anticipate where today’s abstractions will fail and redesign ahead of them

Building the Organization:
  • Hire and develop exceptional systems and reliability engineers
  • Set the bar for technical depth, judgment, and production ownership
  • Shape architecture early through strong partnerships with research and product
  • Translate reliability constraints into long-term platform strategy

Requirements:

  • Deep expertise in Linux and distributed systems
  • Experience operating GPU / accelerator clusters in real production environments
  • Strong fluency in Kubernetes and modern open-source infrastructure
  • Comfortable debugging across hardware → kernel → runtime → orchestration
  • You understand how systems behave under contention and at scale
  • You write code and build automation
  • You think in bottlenecks, failure modes, and tradeoffs
  • Engineers trust your judgment, especially when things break

Additional Information:

Job Posted:
February 21, 2026

Employment Type:
Full-time
Work Type:
Hybrid work

Similar Jobs for Lead Infrastructure and Reliability Engineer

Lead Infrastructure Engineer - Distributed Systems

Arcesium seeks an accomplished Lead Infrastructure Engineer to join our Infrastr...
Location:
Portugal, Lisbon
Salary:
Not provided
Arcesium
Expiration Date:
Until further notice
Requirements:
  • 5+ years of relevant infrastructure experience
  • Demonstrated desire to take ownership of critical systems and successfully deliver systems and features
  • Advanced proficiency in Python and an eagerness to work with languages like Java, Go, and Bash as needed
  • Advanced knowledge and demonstrated experience with HTTP and other network protocols, load balancing, ALBs, nginx, Apache httpd, Kubernetes ingress, Kerberos, SAML, and other SSO technologies, authentication, and authorization
  • Ability to craft reliable and performant solutions to problems using a combination of software you write and open-source technology
  • Display a curious mind, enthusiasm for technology infrastructure, a knack for problem-solving, and the drive to take initiative and work with the right combination of independence and collaboration
  • A bachelor’s degree in computer science, computer engineering or a related discipline is required
Job Responsibility:
  • Own the distributed systems ecosystem that powers our SaaS API-driven business
  • Construct innovative infrastructure systems, applications, and tools that streamline and enhance how Arcesium’s applications interoperate
  • Take ownership of our critical technology infrastructure
  • Provide highly available, reliable critical services to our customer-facing applications and developers

Site Reliability Engineering Support Lead

Site Reliability Engineering Support Lead role focused on application support, d...
Location:
Ireland, Dublin
Salary:
Not provided
Citi
Expiration Date:
Until further notice
Requirements:
  • Solid SRE process experience
  • 5+ years leading a high-performance, 24x7 DevOps or SysOps team
  • Proficiency in Windows administration, Office 365, Exchange, SharePoint, Active Directory, Backup, Networking and Infrastructure
  • Experience with Microsoft Windows and Windows Server
  • Experience tracking tickets and resolving them on time
  • Hands-on experience with ticketing tools (ServiceNow)
  • Excellent verbal, written, presentation and interpersonal communication skills
  • Ability to make complex technical matters easy to comprehend for non-technical persons
Job Responsibility:
  • Taking end-to-end ownership of application support and issue resolution for production systems
  • Implementing, monitoring, and maintaining CI/CD frameworks
  • Developing new capabilities and coordinating implementation across a large number of teams, including infrastructure, developer tools, and information security
  • Influencing a culture of Site Reliability Engineering, and training and mentoring other engineers to develop an SRE mindset
  • Providing first-line post-deployment technical support at the L1 and L2 levels for applications and/or associated production systems, diagnostics, and network health monitoring
  • Coordinating and/or deploying hands-on fixes, patches, and software updates at the application level and, as appropriate, at the network level
  • Managing a team of technical support engineers who provide technical support to users
  • Escalating complex problems to the L3 level of expertise within the organization, along with observations from investigative and diagnostic assessments
  • Coordinating the investigation of recurring technical issues affecting user systems and seeing them through to resolution
  • Escalating, resolving, guiding team, and tracking production incidents to closure
What we offer:
  • Competitive base salary (which is annually reviewed)
  • Hybrid working model (up to 2 days working at home per week)
  • Additional benefits to support you and your family to be well, live well and save well.
Employment Type:
Full-time

Lead Site Reliability Engineer

Groupon is a marketplace where customers discover new experiences and services e...
Location:
India, Bangalore
Salary:
Not provided
Groupon
Expiration Date:
Until further notice
Requirements:
  • 10+ years in systems engineering
  • 5+ years in SRE or DevOps roles
  • Expertise in cloud platforms (GCP, AWS) and container orchestration (Kubernetes, Docker)
  • Proficiency in programming and scripting languages like Python, Go, and Bash
  • Advanced knowledge of Infrastructure as Code (IaC) tools such as Terraform and Ansible
  • Deep understanding of networking, DNS, load balancing, and security principles
  • Proven track record of managing high-availability systems in demanding environments
  • Exceptional analytical and problem-solving skills
Job Responsibility:
  • Architect and maintain fault-tolerant systems, ensuring uptime SLAs of 99.9% or higher
  • Drive automation in infrastructure management and deployment using Terraform, Ansible, Kubernetes, and similar tools
  • Create and optimize CI/CD pipelines to ensure reliable, secure, and efficient software delivery
  • Build and enhance comprehensive observability solutions, including monitoring, logging, and alerting systems using Prometheus, Grafana, and the ELK stack
  • Collaborate with stakeholders to define and achieve SLIs, SLOs, and error budgets aligned with business needs
  • Lead incident response during on-call rotations, ensuring rapid resolution and root cause analysis for critical issues
  • Design and execute performance testing, capacity planning, and scalability strategies for evolving workloads
  • Proactively identify and resolve bottlenecks, increasing system performance and developer efficiency
  • Mentor junior engineers, fostering a collaborative and growth-oriented team environment
  • Guide architectural decisions that drive innovation and enhance system reliability
What we offer:
  • The opportunity to work with cutting-edge technologies in a transformative environment
  • A collaborative and innovative work environment that values your expertise and contributions
  • Professional growth and leadership development pathways tailored to your aspirations
  • A chance to leave a lasting impact by shaping the future of reliable and scalable systems

Lead Site Reliability Engineer

As a Lead Site Reliability Engineer (SRE), you will ensure the stability, perfor...
Location:
United States
Salary:
184000.00 - 229000.00 USD / Year
Corelight
Expiration Date:
Until further notice
Requirements:
  • 8+ years of experience building and operating FedRAMP environments or similarly regulated systems
  • Expertise in AWS services (e.g., EC2, S3, RDS, Lambda, ECS/EKS, Glue, EMR, Redshift, OpenSearch, VPC)
  • Deep understanding of the FedRAMP framework, controls, and compliance requirements
  • Proficiency in programming languages such as Python, Go, or Java
  • Experience with big data technologies (Hadoop, Spark, Kafka)
  • Strong skills in Infrastructure as Code (IaC) tools like Terraform, CloudFormation, or Ansible
  • Knowledge of containerization and orchestration tools like Docker and Kubernetes
  • Experience with CI/CD tools such as Jenkins, GitLab CI, or CircleCI
  • Proven track record in building and scaling platforms with high availability, resilience, and strict SLO objectives
  • Strong experience with Unix/Linux systems and cloud providers, ideally AWS
Job Responsibility:
  • Collaborate with software engineering teams to ensure the reliability, performance, and security of the Federal region’s infrastructure
  • Design, implement, and manage FedRAMP-compliant infrastructure and systems
  • Establish continuous monitoring, logging, and auditing processes to ensure compliance with FedRAMP controls
  • Partner with security teams to conduct security assessments and implement necessary controls
  • Design and implement scalable infrastructure solutions that support multi-region growth
  • Drive automation efforts, enabling infrastructure and platforms to scale efficiently with a focus on compliance
  • Stay up-to-date on best practices, evolving security threats, and FedRAMP guidelines to maintain a strong security posture
  • Deploy and maintain cloud-native services in AWS that are resilient and elastic
  • Participate in 24x7 incident response and on-call rotations
  • Plan for capacity and work with teams to prepare for platform growth
What we offer:
  • Equity and additional benefits will also be awarded
Employment Type:
Full-time

Lead Infrastructure and Automation Engineer

The Lead Infrastructure and Automation Engineer is responsible for developing, m...
Location:
United Kingdom, London
Salary:
Not provided
Community Fibre
Expiration Date:
Until further notice
Requirements:
  • Minimum of 7-10 years of experience working in a server and storage environment
  • Advanced understanding of Linux (Ubuntu / Debian, and CentOS / RHEL)
  • Experience with infrastructure as code (IaC) practices
  • Configuration management: Puppet and Ansible
  • Orchestration: Terraform
  • DCIM/IPAM tools, e.g. NetBox
  • Log ingestion: OpenSearch/Elasticsearch, Logstash, Kibana, Filebeat, Syslog, Graylog
  • Containerisation: Docker and / or Kubernetes
  • Virtualisation: VMware 7.x
  • Cloud: AWS
Job Responsibility:
  • Develop, maintain, improve, and support Community Fibre’s infrastructure service environments
  • Manage and maintain existing Linux based servers, both on-prem and cloud hosted
  • Work with the Network Technology Team
  • Build new servers / systems
  • Ensure existing ones are maintained, reliable and resilient
  • Cover backend systems engineering, infrastructure, and site reliability engineering
  • Provide guidance and mentoring to other engineers
  • Create and implement high and low-level designs
  • Act as a senior member of the Network Technology team
  • Provide reports to SLT, Exec, and Board members when required
What we offer:
  • 25 days holiday, increasing by 1 day for each year of service up to 28 days
  • Birthday leave
  • Cycle to work scheme
  • Flexible WFH policy
  • Private Health Cover
Employment Type:
Full-time

Site Reliability Engineering Manager

Hewlett Packard Enterprise (HPE) is looking for a Site Reliability Engineering M...
Location:
India, Bangalore
Salary:
Not provided
Hewlett Packard Enterprise
Expiration Date:
Until further notice
Requirements:
  • 7–10 years of experience in Site Reliability Engineering, DevOps, or Cloud Infrastructure roles
  • Minimum 2 years of experience managing or leading cloud operations teams
  • Deep understanding of cloud platforms (AWS, GCP, or Azure) and cloud-native architectures
  • Hands-on experience with Kubernetes, containers, infrastructure as code (e.g., Terraform), and configuration management tools
  • Strong foundation in observability (monitoring, logging, tracing), automation using Python, and incident response
  • Familiarity with modern CI/CD automation and tools
  • Excellent communication, stakeholder management, and team-building skills
  • Experience scaling SRE practices in high-growth or large-scale environments
  • Ability to balance long-term reliability initiatives with short-term delivery needs.
Job Responsibility:
  • Lead and mentor a team of Site Reliability Engineers, supporting their growth, performance, and well-being
  • Own the reliability strategy for SASE cloud infrastructure systems, including incident management, SLIs/SLOs, and capacity planning
  • Partner with Engineering, Product, and Security teams to design and deliver highly available, scalable, and resilient cloud-native services
  • Guide the team in building automation, improving observability, and improving the operational efficiency of our cloud infrastructure
  • Drive adoption of best practices in monitoring, alerting, on-call operations, and runbook development
  • Build and maintain a strong engineering culture based on ownership, collaboration, and continuous learning
  • Define and track key reliability metrics, and report on team performance and system health to leadership
  • Contribute to hiring, onboarding, and career development for SREs.
What we offer:
  • Health & Wellbeing benefits for physical, financial, and emotional wellbeing
  • Personal & Professional Development programs
  • Unconditional inclusion in the workplace.
Employment Type:
Full-time

Engineering Manager, Infrastructure

As an Engineering Manager for the Infrastructure team, you’ll lead the engineers...
Location:
Canada; United States
Salary:
195000.00 - 285000.00 USD / Year
Apollo.io
Expiration Date:
Until further notice
Requirements:
  • 5+ years of hands-on software or infrastructure engineering experience
  • 2+ years of experience leading teams of senior and staff-level engineers in platform, SRE, or infrastructure domains
  • Proven ability to design and operate large-scale distributed systems in cloud environments (preferably GCP or AWS)
  • Expertise with Kubernetes, Docker, Terraform, Ubuntu, and CI/CD pipelines
  • Familiarity with observability tools (Grafana, Prometheus, ELK, Datadog, New Relic) and performance tuning
  • Strong grounding in networking, security, and reliability principles
  • Experience managing infrastructure costs, availability SLAs, and high-throughput systems at scale
Job Responsibility:
  • Lead, coach, and grow a distributed team of high-impact Infrastructure Engineers
  • Partner with senior engineering leadership on strategic initiatives such as cloud migration, infrastructure scaling, platform reliability, and cost efficiency
  • Define and implement modern operational excellence practices, including SLOs, error budgets, incident reviews, and performance monitoring
  • Guide technical decision-making across key areas like Kubernetes, GCP, observability, networking, CI/CD, and IaC (Terraform, Ansible)
  • Collaborate with AI, Data, and Product Engineering teams to ensure infrastructure scalability for ML and AI-native workloads
  • Run effective 1:1s, career development conversations, and quarterly performance reviews
  • Support recruiting efforts to attract top engineering talent across time zones
What we offer:
  • Equity
  • Company bonus or sales commissions/bonuses
  • 401(k) plan
  • At least 10 paid holidays per year
  • Flex PTO
  • Parental leave
  • Employee assistance program and wellbeing benefits
  • Global travel coverage
  • Life/AD&D/STD/LTD insurance
  • FSA/HSA and medical, dental, and vision benefits
Employment Type:
Full-time

Senior Director of Engineering, Infrastructure

Senior Director of Engineering role leading the Infrastructure group at PagerDut...
Location:
United States, San Francisco
Salary:
233000.00 - 392000.00 USD / Year
PagerDuty
Expiration Date:
Until further notice
Requirements:
  • Proven experience in senior engineering leadership roles, managing multiple layers of managers
  • Significant experience as a hands-on technical contributor earlier in your career
  • Deep knowledge of modern infrastructure and software delivery: high availability, distributed systems, public cloud (AWS), microservices, containers, CI/CD pipelines, observability, and automation
  • Track record of building and scaling high-performing, inclusive engineering organizations
Job Responsibility:
  • Define and drive the multi-year strategy for PagerDuty's infrastructure and platform foundations
  • Strong ownership of PagerDuty's reliability patterns and practices
  • Bar raiser for all engineering functions
  • Lead, mentor, and scale a diverse team of Engineering Managers, Senior Managers, and technical leaders across multiple geographies
  • Ensure the reliability, scalability, and security of PagerDuty's global SaaS platform
  • Partner with peers in Engineering, Product, and Security to deliver large cross-functional initiatives
  • Champion engineering excellence: CI/CD maturity, observability best practices, operational rigor, and incident readiness
  • Manage budgets, headcount, and vendor relationships to optimize infrastructure investments
  • Represent Infrastructure externally with customers and partners, and internally with executives, as a trusted voice on technical and business tradeoffs
  • Foster a culture of inclusion, accountability, collaboration, and growth
What we offer:
  • Competitive salary
  • Comprehensive benefits package
  • Flexible work arrangements
  • Company equity
  • ESPP (Employee Stock Purchase Program)
  • Retirement or pension plan
  • Generous paid vacation time
  • Paid holidays and sick leave
  • Dutonian Wellness Days & HibernationDuty - companywide paid days off in addition to PTO
  • Paid parental leave: 22 weeks for pregnant parent, 12 weeks for non-pregnant parent
Employment Type:
Full-time