
Lead Infrastructure and Reliability Engineer

United States, Palo Alto 230000.00 - 360000.00 USD / Year · Job Posted March 13, 2026

Job Description

Our Infrastructure Engineering team is a systems engineering group with company-level responsibility. At Luma, reliability engineers work directly with the researchers and products pushing the limits of multimodal intelligence. We operate close to the metal: kernels, containers, schedulers, networking, storage, and GPU behavior. But we are also responsible for something bigger: turning deep systems knowledge into repeatable, scalable reliability for the entire company. We are hiring a leader who will define that direction. You will be a technical authority, an organizational force multiplier, and a magnet for other great engineers.

Job Responsibility

Reliability of the Frontier:
  • Architect and operate large, heterogeneous GPU environments under extreme demand
  • Improve utilization and performance where small gains materially change company outcomes
  • Resolve failures that span hardware, OS, runtimes, and orchestration
  • Eliminate entire classes of instability
  • Build mechanisms that make heroics unnecessary

Scaling Training & Inference:
  • Define how infrastructure and workloads evolve as cluster size and concurrency grow
  • Design scheduling, placement, and resource management approaches for increasingly complex jobs
  • Work directly with research to build the systems required for new model capabilities
  • Ensure inference platforms scale rapidly without sacrificing reliability or latency
  • Anticipate where today’s abstractions will fail and redesign ahead of them

Building the Organization:
  • Hire and develop exceptional systems and reliability engineers
  • Set the bar for technical depth, judgment, and production ownership
  • Shape architecture early through strong partnerships with research and product
  • Translate reliability constraints into long-term platform strategy

Requirements

  • Deep expertise in Linux and distributed systems
  • Experience operating GPU / accelerator clusters in real production environments
  • Strong fluency in Kubernetes and modern open-source infrastructure
  • Comfortable debugging across hardware → kernel → runtime → orchestration
  • You understand how systems behave under contention and at scale
  • You write code and build automation
  • You think in bottlenecks, failure modes, and tradeoffs
  • Engineers trust your judgment, especially when things break

Similar Jobs for Lead Infrastructure and Reliability Engineer

8 matching positions

Lead Infrastructure and Reliability Engineer

Our Infrastructure Engineering team is a systems engineering group with company-...
Location: United States, Palo Alto
Salary: 230000.00 - 360000.00 USD / Year
Company: Luma AI
Expiration Date: Until further notice
  • Fulltime

Senior Software Engineer, Infrastructure and Efficiency

Roku is seeking a world class engineer to be a true force multiplier by owning t...
Location: United Kingdom, Manchester
Salary: Not provided
Company: Roku
Expiration Date: Until further notice
Requirements
  • Bachelor's in Computer Science or Computer Engineering, or equivalent experience
  • Prior experience at Staff, Principal, or Architect level
  • Ownership in platform/infrastructure, developer productivity, CI/CD, build systems, or workflow automation, including work spanning globally distributed teams and systems
  • Strong, hands-on knowledge of: Git (workflows, branching strategies, release management)
  • Jenkins (CI/CD pipeline design, scaling, reliability)
  • Docker containers (reproducible build/test execution)
  • Build systems (BitBake/Yocto preferred)
  • Jira (workflow design/automation, cross-team visibility)
  • Test management systems (e.g., TestRail or equivalent)
  • Demonstrated experience delivering automation at scale, including building platforms/components used by multiple teams
Job Responsibility
  • Define and lead an AI-first automation roadmap for Engineering Infrastructure and Enterprise Tooling
  • Architect and ship AI/LLM-enabled workflow automation across the SDLC
  • Establish policies and guardrails for AI usage in internal tools
  • Set technical direction and standards for the globally used developer tools
  • Architect and evolve CI/CD and build/test pipelines using Jenkins, Docker, and build systems
  • Own and scale test and quality workflows leveraging test management systems
  • Drive Jira workflow design and automation
  • Design and implement automation spanning Engineering, Product, Marketing, and Partners
  • Replace manual coordination with durable automation and integrations
  • Embed secure-by-default controls into pipelines and tooling
What we offer
  • Global access to mental health and financial wellness support and resources
  • Statutory and voluntary benefits which may include healthcare (medical, dental, and vision), life, accident, disability, commuter, and retirement options (401(k)/pension)
  • Employees are supported in taking time off, in accordance with local leave policies and other personal needs
  • Fulltime

Senior Site Reliability Engineer, Infrastructure Foundations

The Wikimedia Foundation is looking for a Senior Site Reliability Engineer to su...
Location: United States
Salary: 113082.00 - 175725.00 USD / Year
Company: Wikimedia Foundation
Expiration Date: Until further notice
Requirements
  • 6+ years of experience in an SRE/Operations/DevOps role as part of a team
  • Experience with shell and any scripting languages used in an SRE context (Python, Go, Bash, Ruby; we primarily use Python) and configuration management tools (Puppet, Ansible; we use Puppet)
  • Experience designing and managing infrastructure security for large fleets of diverse services
  • Experience with technical response during security incidents
  • Experience with package management on Linux systems (we use Debian)
  • Strong Linux system-level troubleshooting skills
  • History of automating tasks and processes, identifying process gaps, and finding automation opportunities
  • Strong English language skills (verbal and written) and ability to work independently, as an effective part of a globally distributed team working across multiple time zones
Job Responsibility
  • Performing day-to-day operational/DevOps tasks on Wikimedia’s public facing infrastructure (deployment, maintenance, configuration, troubleshooting)
  • Implementing and utilizing configuration management and deployment tools (Puppet, Kubernetes)
  • Leading continuous improvement, by automating the installation, configuration and maintenance of services on our platform
  • Work closely with product teams helping them bring scalable functionality to our users by assisting in the architectural design of new services and making them operate at scale
  • Participating in a 24/7 on-call rotation shared across the broader SRE team. This includes taking part in incident response, diagnosis and follow-up on system outages or alerts across Wikimedia’s production infrastructure
  • Collaborating with a global, cross-functional team in an asynchronous communication environment
  • Mentoring peers in your areas of technical and operational strength
  • Ability and willingness to travel 1-2 times a year for in-person events and team meetings
  • Most importantly, share our values and work in accordance with them
  • Fulltime

Systems and Infrastructure Engineer

This position leverages expertise in system administration to maintain systems c...
Location: United States, Tucker
Salary: 70880.00 - 173900.00 USD / Year
Company: Georgia System Operations
Expiration Date: Until further notice
Requirements
  • Bachelor's Degree in Computer Science, Information Systems, Software Engineering, Electrical or Computer Engineering
  • Level I: 0 to 24 months work experience in system administration, cyber security or related
  • Level II: 2+ years work experience as stated
  • Level III: 4+ years work experience as stated
  • Level IV: 6+ years work experience as stated
  • Level V: 8+ years work experience as stated
  • or equivalent education and experience as per level
  • Must be able to pass NERC CIP PRA screening
Job Responsibility
  • Maintain systems critical to GSOC's system operations function
  • Perform system administration of Operational Technology systems (installation, patching, backup/recovery, performance monitoring, cyber security hardening)
  • Maintain awareness of NERC Reliability and CIP standards
  • Implement and manage infrastructure tools for system configuration consistency
  • Develop processes and documentation for systems management
  • Participate in Change Management Program
  • Collect evidence for NERC CIP compliance
  • Coordinate with Control Center, EMS, Security, Networking, Enterprise IT teams
  • Coordinate with GTC and OPC operations
  • Participate in on-call 24x7 support rotation
What we offer
  • Comprehensive medical, dental, and vision coverage
  • Strong retirement program
  • Career development
  • Flexible work schedules
  • Wellness focus
  • Supportive community membership
  • Fulltime

Principal, Systems and Infrastructure Engineer, Information Security

Are you driven to design durable, scalable, and well-governed cloud platforms th...
Location: United States of America, Denver
Salary: 121000.00 - 242000.00 USD / Year
Company: Walmart
Expiration Date: Until further notice
Requirements
  • Option 1: Bachelor's degree in computer science, information technology, engineering, information systems, cybersecurity, or related area and 5 years' experience in systems and infrastructure engineering or related area at a technology, retail, or data-driven company.
  • Option 2: 7 years' experience in systems and infrastructure engineering or related area at a technology, retail, or data-driven company.
Job Responsibility
  • Lead the migration and modernization of a large portfolio of applications and databases from AWS to GCP and Azure, ensuring reliability, security, and minimal disruption.
  • Design target-state architectures and migration patterns that balance scalability, resilience, cost, and operational simplicity.
  • Evaluate cloud-native services and guide architectural tradeoffs across AWS, GCP, and Azure.
  • Establish reference architectures, landing zone standards, and platform patterns used across the organization.
  • Architect, build, and maintain complex, reusable Infrastructure-as-Code solutions using Terraform and Terragrunt.
  • Develop Python and Bash automation to support infrastructure lifecycle management, migrations, governance, and operational workflows.
  • Drive consistency and quality through shared modules, versioning strategies, and code review standards.
  • Integrate IaC and automation into CI/CD pipelines using GitHub Actions and related tooling.
  • Drive containerization and platform adoption using Docker and Kubernetes, enabling scalable and resilient application deployments.
  • Design and maintain robust CI/CD pipelines that support fast, safe, and repeatable infrastructure and application delivery.
What we offer
  • Health benefits include medical, vision and dental coverage.
  • Financial benefits include 401(k), stock purchase and company-paid life insurance.
  • Paid time off benefits include PTO (including sick leave), parental leave, family care leave, bereavement, jury duty, and voting.
  • Other benefits include short-term and long-term disability, company discounts, Military Leave Pay, adoption and surrogacy expense reimbursement.
  • Live Better U education benefit program
  • Annual or quarterly performance bonuses
  • Stock
  • Fulltime

Site Reliability Engineer (Lead)

10Pearls is an award-winning end-to-end digital innovation company that helps bu...
Location: Pakistan, Islamabad
Salary: Not provided
Company: 10Pearls
Expiration Date: Until further notice
Requirements
  • Bachelor's degree in computer science or related field
  • 5–8 years in SRE or production-engineering roles running distributed systems at scale
  • Deep Kubernetes expertise — operators, RBAC, network policy, storage, upgrades
  • Hands-on with Keycloak / Vault / MinIO / Harbor / Kong or equivalent identity/secrets/storage/registry/gateway stacks
  • Strong Linux fundamentals and at least one systems language (Go, Rust) or shell/Python for tooling
  • Proven SLO/SLI authorship and error-budget-driven decision-making
  • Experience with observability stacks (Prometheus, Grafana, OpenTelemetry, Loki, Tempo)
  • Calm, clear communication during incidents; strong post-mortem writing
  • Hands-on with infra-as-code — Helm, Kustomize, Terraform
Job Responsibility
  • Substrate operation: own the Kubernetes cluster plus Keycloak (identity), Vault (secrets), MinIO (object storage), Harbor (registry), and Kong (gateway), from bootstrap to day-2 operations
  • SLO framework: define, publish, and defend SLOs for every tier-1 service; own error budgets and burn-rate alerting
  • Incident response: build the on-call rotation, paging, runbook library, and post-mortem culture; lead incident command during P1/P2 events
  • Release operations: co-own the blue-green / canary release model with L6 Delivery; sign off production-bound releases
  • Air-gap operations: ensure every operational runbook works in a fully offline environment, with no assumption of external dependencies
  • Lead the Platform squad: technically lead 1 Infrastructure Engineer, 1 Observability Engineer, 2 DevOps Engineers; set standards for infra-as-code and automation
  • Fulltime

Senior Software Engineer - Infrastructure Reliability

We are seeking a Senior Software Engineer to join our Security Product team, foc...
Location: India, Bangalore
Salary: Not provided
Company: JFrog
Expiration Date: Until further notice
Requirements
  • 7+ years of experience in software engineering, with at least 3+ years focused on debugging and solving infrastructure-level problems in distributed systems
  • Strong proficiency in Go; familiarity with Python and Helm is a plus
  • Deep hands-on experience with RabbitMQ or similar message brokers (Kafka, ActiveMQ) - including queue management, clustering, monitoring, and production troubleshooting
  • Solid working knowledge of Kubernetes (pod lifecycle, resource management, networking, debugging CrashLoopBackOff / OOMKilled scenarios) and Docker
  • Experience investigating production incidents and conducting post-incident reviews with clear root cause analysis and follow-through
  • Strong understanding of Linux systems, networking fundamentals, and cloud infrastructure (AWS, Azure, or GCP)
  • Ability to read and interpret logs, thread dumps, heap dumps, and system metrics to isolate root causes under time pressure
  • Excellent analytical and problem-solving skills with a methodical approach to debugging
  • Strong written and verbal communication skills - ability to produce clear incident reports, root cause analyses, and playbooks, and to communicate effectively across engineering, SRE, and customer-facing teams
Job Responsibility
  • Investigate system outages and production failures across customer environments (SaaS and self-hosted), spanning RabbitMQ, Kubernetes, Docker, Postgres, and cloud infrastructure (AWS, Azure, GCP)
  • Identify recurring failure patterns and systemic weaknesses from incident data, and drive them to resolution - whether by writing Go code yourself (resilience features, infrastructure fixes, observability) or by collaborating with service owners to prioritize and address reliability gaps
  • Lead and participate in post-incident reviews - document root causes, corrective actions, and follow through to ensure issues are properly resolved
  • Collaborate with production engineering and SRE teams to develop and maintain operational playbooks and runbooks that reduce time-to-resolution
  • Diagnose root causes across the full stack - message queue failures, container lifecycle issues, cloud networking, disk and memory pressure, and deployment topology mismatches
  • Design and implement data migrations and lifecycle management for infrastructure components such as queue management and vhost operations
  • Emit and monitor operational metrics to proactively detect infrastructure degradation and measure service reliability

Senior Software Engineer, Infrastructure and Security

At Vanta, our mission is to help businesses earn and prove trust. We believe tha...
Location: United States
Salary: 179000.00 - 211000.00 USD / Year
Company: Vanta
Expiration Date: Until further notice
Requirements
  • You’ve played technical leadership roles on Infrastructure or platform teams
  • You have experience with infrastructure, AWS services, and scaling platforms in fast-growing environments
  • You care deeply about performance and reliability
  • You’re thoughtful about trade-offs and have good product sense when creating new infrastructure
  • You’re open to using AI to amplify your skills and strengthen your work, demonstrating curiosity, a willingness to learn, and sound judgment in applying AI responsibly to improve efficiency and impact
Job Responsibility
  • Design and build scalable infrastructure to support rapid growth in data volume, service usage, and engineering velocity
  • Lead projects across our cloud infrastructure, including container orchestration (e.g., AWS Fargate, ECS), monitoring and alerting systems, networking, and database maintenance
  • Implement and maintain core security infrastructure and controls, including service-to-service authentication, secrets management, application security primitives (e.g., rate-limiting, encryption libraries), and infrastructure hardening
  • Identify and solve complex scalability and performance challenges, particularly related to service reliability and data throughput
  • Partner closely with Security Engineering to implement infrastructure that supports best-in-class security and compliance practices
  • Drive infrastructure design reviews and provide technical guidance on architectural decisions and trade-offs
  • Work with talented and kind engineers to make a significant impact on our customer base, enabling them to improve their security and prove it
  • Contribute to building Vanta’s engineering culture as we grow
What we offer
  • Offers Equity
  • medical benefits
  • 401(k) plan
  • other company perk programs
  • Comprehensive medical, dental, and vision coverage, with 100% of employee-only benefit premiums covered for most medical plans
  • 16 weeks fully-paid Parental Leave for all new parents
  • Health & wellness stipend
  • Remote workspace, internet, and cellphone stipend
  • Commuter benefits for team members who report to the SF and NYC office
  • Family planning benefits
  • Fulltime