Lead Infrastructure and Reliability Engineer Job at Luma AI (Palo Alto)

Lead Infrastructure and Reliability Engineer

Our Infrastructure Engineering team is a systems engineering group with company-...

Location

United States, Palo Alto

Salary:

230000.00 - 360000.00 USD / Year

Luma AI

Expiration Date

Until further notice

Requirements

Deep expertise in Linux and distributed systems
Experience operating GPU / accelerator clusters in real production environments
Strong fluency in Kubernetes and modern open-source infrastructure
Comfortable debugging across hardware → kernel → runtime → orchestration
You understand how systems behave under contention and at scale
You write code and build automation
You think in bottlenecks, failure modes, and tradeoffs
Engineers trust your judgment, especially when things break

Job Responsibility

Reliability of the Frontier: Architect and operate large, heterogeneous GPU environments under extreme demand
Improve utilization and performance where small gains materially change company outcomes
Resolve failures that span hardware, OS, runtimes, and orchestration
Eliminate entire classes of instability
Build mechanisms that make heroics unnecessary
Scaling Training & Inference: Define how infrastructure and workloads evolve as cluster size and concurrency grow
Design scheduling, placement, and resource management approaches for increasingly complex jobs
Work directly with research to build the systems required for new model capabilities
Ensure inference platforms scale rapidly without sacrificing reliability or latency
Anticipate where today’s abstractions will fail and redesign ahead of them

Fulltime

Senior Software Engineer, Infrastructure and Efficiency

Roku is seeking a world class engineer to be a true force multiplier by owning t...

Location

United Kingdom, Manchester

Salary:

Not provided

Roku

Expiration Date

Until further notice

Requirements

Bachelor's in Computer Science or Computer Engineering, or equivalent experience
Prior experience at Staff, Principal, or Architect level
Ownership in platform/infrastructure, developer productivity, CI/CD, build systems, or workflow automation, including work spanning globally distributed teams and systems
Strong, hands-on knowledge of: Git (workflows, branching strategies, release management)
Jenkins (CI/CD pipeline design, scaling, reliability)
Docker containers (reproducible build/test execution)
Build systems (BitBake/Yocto preferred)
Jira (workflow design/automation, cross-team visibility)
Test management systems (e.g., TestRail or equivalent)
Demonstrated experience delivering automation at scale, including building platforms/components used by multiple teams

Job Responsibility

Define and lead an AI-first automation roadmap for Engineering Infrastructure and Enterprise Tooling
Architect and ship AI/LLM-enabled workflow automation across the SDLC
Establish policies and guardrails for AI usage in internal tools
Set technical direction and standards for globally used developer tools
Architect and evolve CI/CD and build/test pipelines using Jenkins, Docker, and build systems
Own and scale test and quality workflows leveraging test management systems
Drive Jira workflow design and automation
Design and implement automation spanning Engineering, Product, Marketing, and Partners
Replace manual coordination with durable automation and integrations
Embed secure-by-default controls into pipelines and tooling

What we offer

Global access to mental health and financial wellness support and resources
Statutory and voluntary benefits which may include healthcare (medical, dental, and vision), life, accident, disability, commuter, and retirement options (401(k)/pension)
Employees are supported in taking time off, in accordance with local leave policies and other personal needs

Fulltime

Senior Site Reliability Engineer, Infrastructure Foundations

The Wikimedia Foundation is looking for a Senior Site Reliability Engineer to su...

Location

United States

Salary:

113082.00 - 175725.00 USD / Year

Wikimedia Foundation

Expiration Date

Until further notice

Requirements

6+ years of experience in an SRE/Operations/DevOps role as part of a team
Experience with shell and any scripting languages used in an SRE context (Python, Go, Bash, Ruby; we primarily use Python) and configuration management tools (Puppet, Ansible; we use Puppet)
Experience designing and managing infrastructure security for large fleets of diverse services
Experience with technical response during security incidents
Experience with package management on Linux systems (we use Debian)
Strong Linux system-level troubleshooting skills
History of automating tasks and processes, identifying process gaps, and finding automation opportunities
Strong English language skills (verbal and written) and ability to work independently, as an effective part of a globally distributed team working across multiple time zones

Job Responsibility

Performing day-to-day operational/DevOps tasks on Wikimedia’s public facing infrastructure (deployment, maintenance, configuration, troubleshooting)
Implementing and utilizing configuration management and deployment tools (Puppet, Kubernetes)
Leading continuous improvement, by automating the installation, configuration and maintenance of services on our platform
Working closely with product teams to bring scalable functionality to our users by assisting in the architectural design of new services and helping them operate at scale
Participating in a 24/7 on-call rotation shared across the broader SRE team. This includes taking part in incident response, diagnosis and follow-up on system outages or alerts across Wikimedia’s production infrastructure
Collaborating with a global, cross-functional team in an asynchronous communication environment
Mentoring peers in your areas of technical and operational strength
Ability and willingness to travel 1-2 times a year for in-person events and team meetings
Most importantly, share our values and work in accordance with them

Fulltime

Systems and Infrastructure Engineer

This position leverages expertise in system administration to maintain systems c...

Location

United States, Tucker

Salary:

70880.00 - 173900.00 USD / Year

Georgia System Operations

Expiration Date

Until further notice

Requirements

Bachelor's Degree in Computer Science, Information Systems, Software Engineering, Electrical or Computer Engineering
Level I: 0 to 24 months work experience in system administration, cyber security or related
Level II: 2+ years work experience as stated
Level III: 4+ years work experience as stated
Level IV: 6+ years work experience as stated
Level V: 8+ years work experience as stated
or equivalent education and experience as per level
Must be able to pass NERC CIP PRA screening

Job Responsibility

Maintain systems critical to GSOC's system operations function
Perform system administration of Operational Technology systems (installation, patching, backup/recovery, performance monitoring, cyber security hardening)
Maintain awareness of NERC Reliability and CIP standards
Implement and manage infrastructure tools for system configuration consistency
Develop processes and documentation for systems management
Participate in Change Management Program
Collect evidence for NERC CIP compliance
Coordinate with Control Center, EMS, Security, Networking, Enterprise IT teams
Coordinate with GTC and OPC operations
Participate in on-call 24x7 support rotation

What we offer

Comprehensive medical, dental, and vision coverage
Strong retirement program
Career development
Flexible work schedules
Wellness focus
Supportive community membership

Fulltime

Principal, Systems and Infrastructure Engineer, Information Security

Are you driven to design durable, scalable, and well-governed cloud platforms th...

Location

United States of America, Denver

Salary:

121000.00 - 242000.00 USD / Year

Walmart

Expiration Date

Until further notice

Requirements

Option 1: Bachelor's degree in computer science, information technology, engineering, information systems, cybersecurity, or related area and 5 years' experience in systems and infrastructure engineering or related area at a technology, retail, or data-driven company.
Option 2: 7 years' experience in systems and infrastructure engineering or related area at a technology, retail, or data-driven company.

Job Responsibility

Lead the migration and modernization of a large portfolio of applications and databases from AWS to GCP and Azure, ensuring reliability, security, and minimal disruption.
Design target-state architectures and migration patterns that balance scalability, resilience, cost, and operational simplicity.
Evaluate cloud-native services and guide architectural tradeoffs across AWS, GCP, and Azure.
Establish reference architectures, landing zone standards, and platform patterns used across the organization.
Architect, build, and maintain complex, reusable Infrastructure-as-Code solutions using Terraform and Terragrunt.
Develop Python and Bash automation to support infrastructure lifecycle management, migrations, governance, and operational workflows.
Drive consistency and quality through shared modules, versioning strategies, and code review standards.
Integrate IaC and automation into CI/CD pipelines using GitHub Actions and related tooling.
Drive containerization and platform adoption using Docker and Kubernetes, enabling scalable and resilient application deployments.
Design and maintain robust CI/CD pipelines that support fast, safe, and repeatable infrastructure and application delivery.

What we offer

Health benefits include medical, vision and dental coverage.
Financial benefits include 401(k), stock purchase and company-paid life insurance.
Paid time off benefits include PTO (including sick leave), parental leave, family care leave, bereavement, jury duty, and voting.
Other benefits include short-term and long-term disability, company discounts, Military Leave Pay, adoption and surrogacy expense reimbursement.
Live Better U education benefit program
Annual or quarterly performance bonuses
Stock

Fulltime

Site Reliability Engineer (Lead)

10Pearls is an award-winning end-to-end digital innovation company that helps bu...

Location

Pakistan, Islamabad

Salary:

Not provided

10Pearls

Expiration Date

Until further notice

Requirements

Bachelor's degree in computer science or related field
5–8 years in SRE or production-engineering roles running distributed systems at scale
Deep Kubernetes expertise — operators, RBAC, network policy, storage, upgrades
Hands-on with Keycloak / Vault / MinIO / Harbor / Kong or equivalent identity/secrets/storage/registry/gateway stacks
Strong Linux fundamentals and at least one systems language (Go, Rust) or shell/Python for tooling
Proven SLO/SLI authorship and error-budget-driven decision-making
Experience with observability stacks (Prometheus, Grafana, OpenTelemetry, Loki, Tempo)
Calm, clear communication during incidents; strong post-mortem writing
Hands-on with infra-as-code — Helm, Kustomize, Terraform

Job Responsibility

Substrate operation: own the Kubernetes cluster plus Keycloak (identity), Vault (secrets), MinIO (object storage), Harbor (registry), and Kong (gateway), from bootstrap to day-2 operations
SLO framework: define, publish, and defend SLOs for every tier-1 service; own error budgets and burn-rate alerting
Incident response: build the on-call rotation, paging, runbook library, and post-mortem culture; lead incident command during P1/P2 events
Release operations: co-own the blue-green / canary release model with L6 Delivery; sign off production-bound releases
Air-gap operations: ensure every operational runbook works in a fully offline environment, with no assumption of external dependencies
Lead the Platform squad: technically lead 1 Infrastructure Engineer, 1 Observability Engineer, and 2 DevOps Engineers; set standards for infra-as-code and automation

Fulltime

Senior Software Engineer - Infrastructure Reliability

We are seeking a Senior Software Engineer to join our Security Product team, foc...

Location

India, Bangalore

Salary:

Not provided

JFrog

Expiration Date

Until further notice

Requirements

7+ years of experience in software engineering, with at least 3 years focused on debugging and solving infrastructure-level problems in distributed systems
Strong proficiency in Go; familiarity with Python and Helm is a plus
Deep hands-on experience with RabbitMQ or similar message brokers (Kafka, ActiveMQ) - including queue management, clustering, monitoring, and production troubleshooting
Solid working knowledge of Kubernetes (pod lifecycle, resource management, networking, debugging CrashLoopBackOff / OOMKilled scenarios) and Docker
Experience investigating production incidents and conducting post-incident reviews with clear root cause analysis and follow-through
Strong understanding of Linux systems, networking fundamentals, and cloud infrastructure (AWS, Azure, or GCP)
Ability to read and interpret logs, thread dumps, heap dumps, and system metrics to isolate root causes under time pressure
Excellent analytical and problem-solving skills with a methodical approach to debugging
Strong written and verbal communication skills - ability to produce clear incident reports, root cause analyses, and playbooks, and to communicate effectively across engineering, SRE, and customer-facing teams

Job Responsibility

Investigate system outages and production failures across customer environments (SaaS and self-hosted), spanning RabbitMQ, Kubernetes, Docker, Postgres, and cloud infrastructure (AWS, Azure, GCP)
Identify recurring failure patterns and systemic weaknesses from incident data, and drive them to resolution - whether by writing Go code yourself (resilience features, infrastructure fixes, observability) or by collaborating with service owners to prioritize and address reliability gaps
Lead and participate in post-incident reviews - document root causes, corrective actions, and follow through to ensure issues are properly resolved
Collaborate with production engineering and SRE teams to develop and maintain operational playbooks and runbooks that reduce time-to-resolution
Diagnose root causes across the full stack - message queue failures, container lifecycle issues, cloud networking, disk and memory pressure, and deployment topology mismatches
Design and implement data migrations and lifecycle management for infrastructure components such as queue management and vhost operations
Emit and monitor operational metrics to proactively detect infrastructure degradation and measure service reliability

Senior Software Engineer, Infrastructure and Security

At Vanta, our mission is to help businesses earn and prove trust. We believe tha...

Location

United States

Salary:

179000.00 - 211000.00 USD / Year

Vanta

Expiration Date

Until further notice

Requirements

You’ve played technical leadership roles on Infrastructure or platform teams
You have experience with infrastructure, AWS services, and scaling platforms in fast-growing environments
You care deeply about performance and reliability
You’re thoughtful about trade-offs and have good product sense when creating new infrastructure
You’re open to using AI to amplify your skills and strengthen your work, demonstrating curiosity, a willingness to learn, and sound judgment in applying AI responsibly to improve efficiency and impact

Job Responsibility

Design and build scalable infrastructure to support rapid growth in data volume, service usage, and engineering velocity
Lead projects across our cloud infrastructure, including container orchestration (e.g., AWS Fargate, ECS), monitoring and alerting systems, networking, and database maintenance
Implement and maintain core security infrastructure and controls, including service-to-service authentication, secrets management, application security primitives (e.g., rate-limiting, encryption libraries), and infrastructure hardening
Identify and solve complex scalability and performance challenges, particularly related to service reliability and data throughput
Partner closely with Security Engineering to implement infrastructure that supports best-in-class security and compliance practices
Drive infrastructure design reviews and provide technical guidance on architectural decisions and trade-offs
Work with talented and kind engineers to make a significant impact on our customer base, enabling them to improve their security and prove it
Contribute to building Vanta’s engineering culture as we grow

What we offer

Equity
Medical benefits
401(k) plan
Other company perk programs
Comprehensive medical, dental, and vision coverage, with 100% of employee-only benefit premiums covered for most medical plans
16 weeks fully-paid Parental Leave for all new parents
Health & wellness stipend
Remote workspace, internet, and cellphone stipend
Commuter benefits for team members who report to the SF and NYC offices
Family planning benefits

Fulltime
