Software Engineer - Reliability GPU Infrastructure

United States; United Kingdom, Palo Alto · 170000.00 - 360000.00 USD / Year · Job Posted January 13, 2026

Job Description

Luma AI is a capital-intensive lab building the future of creative intelligence. This unique position offers the leverage to build systems of immense scale while retaining individual ownership over the architecture and strategy of our infrastructure.

Job Responsibility

  • Define the technical strategy for compute substrate
  • Determine how to provision, manage, and scale multi-cloud and on-premise GPU footprint
  • Bridge the gap between hardware vendors and software stack
  • Architect a seamless infrastructure mesh that spans multiple cloud providers and bare-metal environments
  • Design the logic that allocates massive compute resources across competing priorities
  • Lead the effort to define entire stack as code
  • Build rigorous CI/CD and GitOps workflows

Requirements

  • History of designing complex distributed systems
  • Deep expertise across various infrastructure providers
  • Ability to mentor the team and drive consensus on technical decisions

Similar Jobs for Software Engineer - Reliability GPU Infrastructure

8 matching positions

Staff Software Engineer, GPU Infrastructure (HPC)

The internal infrastructure team is responsible for building world-class infrast...
Salary: Not provided
Cohere
Expiration Date: Until further notice
Requirements
  • Deep expertise in ML/HPC infrastructure: Experience with GPU/TPU clusters, distributed training frameworks (JAX, PyTorch, TensorFlow), and high-performance computing (HPC) environments
  • Kubernetes at scale: Proven ability to deploy, manage, and troubleshoot cloud-native Kubernetes clusters for AI workloads
  • Strong programming skills: Proficiency in Python (for ML tooling) and Go (for systems engineering), with a preference for open-source contributions over reinventing solutions
  • Low-level systems knowledge: Familiarity with Linux internals, RDMA networking, and performance optimization for ML workloads
  • Research collaboration experience: A track record of working closely with AI researchers or ML engineers to solve infrastructure challenges
  • Self-directed problem-solving: The ability to identify bottlenecks, propose solutions, and drive impact in a fast-paced environment
Job Responsibility
  • Build and scale ML-optimized HPC infrastructure: Deploy and manage Kubernetes-based GPU/TPU superclusters across multiple clouds, ensuring high throughput and low-latency performance for AI workloads
  • Optimize for AI/ML training: Collaborate with cloud providers to fine-tune infrastructure for cost efficiency, reliability, and performance, leveraging technologies like RDMA, NCCL, and high-speed interconnects
  • Troubleshoot and resolve complex issues: Proactively identify and resolve infrastructure bottlenecks, performance degradation, and system failures to ensure minimal disruption to AI/ML workflows
  • Enable researchers with self-service tools: Design intuitive interfaces and workflows that allow researchers to monitor, debug, and optimize their training jobs independently
  • Drive innovation in ML infrastructure: Work closely with AI researchers to understand emerging needs (e.g., JAX, PyTorch, distributed training) and translate them into robust, scalable infrastructure solutions
  • Champion best practices: Advocate for observability, automation, and infrastructure-as-code (IaC) across the organization, ensuring systems are maintainable and resilient
  • Mentorship and collaboration: Share expertise through code reviews, documentation, and cross-team collaboration, fostering a culture of knowledge transfer and engineering excellence
What we offer
  • An open and inclusive culture and work environment
  • Work closely with a team on the cutting edge of AI research
  • Weekly lunch stipend, in-office lunches & snacks
  • Full health and dental benefits, including a separate budget to take care of your mental health
  • 100% Parental Leave top-up for up to 6 months
  • Personal enrichment benefits towards arts and culture, fitness and well-being, quality time, and workspace improvement
  • Remote-flexible, offices in Toronto, New York, San Francisco, London and Paris, as well as a co-working stipend
  • 6 weeks of vacation (30 working days!)
  • Fulltime

Software Engineer, Infrastructure - Autonomy & Robotics

DoorDash Labs is an independent team within DoorDash. We are working on building...
Location: United States, San Francisco
Salary: 159800.00 - 235000.00 USD / Year
DoorDash
Expiration Date: Until further notice
Requirements
  • B.S., M.S., or Ph.D. in Computer Science, Robotics or related technical field
  • In-depth knowledge of data structures and algorithms
  • Strong Python programming experience
  • Experience with operationalizing large-scale systems
  • Experience with at least one distributed data processing framework (Ray, Spark, Flink, etc)
  • Passionate about software quality and reliability
Job Responsibility
  • Have significant scope and decision-making responsibility
  • Design and implement infrastructure to enable autonomous vehicle development, including large-scale distributed simulation execution; ingest, processing, and organization of petabyte-scale datasets; and GPU-accelerated distributed computing for data preparation and training
  • Design and implement robot data and metrics pipelines
  • Collaborate with core autonomy teams: motion planning, perception, and simulation
What we offer
  • 401(k) plan with employer matching
  • 16 weeks of paid parental leave
  • Wellness benefits
  • Commuter benefits match
  • Paid time off
  • Paid sick leave
  • Medical, dental, and vision benefits
  • 11 paid holidays
  • Disability and basic life insurance
  • Family-forming assistance
  • Fulltime

Software Engineer, Reliability

Join the engineering teams that bring OpenAI’s ideas safely to the world. The Ap...
Location: United States, San Francisco
Salary: 230000.00 - 490000.00 USD / Year
OpenAI
Expiration Date: Until further notice
Requirements
  • Bachelor's degree in Computer Science, Information Technology, or a related field (or equivalent work experience)
  • Proven experience as an SWE focused on reliability or a similar role in a fast-paced, rapidly scaling company
  • Strong proficiency in cloud infrastructure
  • Proficiency in programming languages
  • Experience with containerization technologies and container orchestration platforms like Kubernetes
  • Knowledge of IaC tools such as Terraform or CloudFormation
  • Excellent problem-solving and troubleshooting skills
  • Strong communication and collaboration skills
  • Experience with observability tools such as DataDog, Prometheus, Grafana and Splunk
  • Experience with microservices architecture and service mesh technologies
Job Responsibility
  • Design and implement solutions to ensure the scalability of our infrastructure to meet rapidly increasing demands
  • Build and maintain the load, chaos and synthetic testing software leveraged by development teams to make the systems they design and operate more reliable
  • Build and maintain automation tools to streamline repetitive tasks and improve system reliability
  • Build and maintain the platform for CPU/storage, GPU, and network lifecycle management to drive efficiency, accountability and support dynamic optimization of our resources
  • Implement fault-tolerant and resilient design patterns to minimize service disruptions
  • Develop and maintain service level objectives (SLOs) and service level indicators (SLIs) to measure and ensure system reliability
  • Partner with researchers, engineers, product managers, and designers to bring new features and research capabilities to the world
  • Participate in an on-call rotation to respond to critical incidents and ensure 24/7 system availability
What we offer
  • Medical, dental, and vision insurance for you and your family, with employer contributions to Health Savings Accounts
  • Pre-tax accounts for Health FSA, Dependent Care FSA, and commuter expenses (parking and transit)
  • 401(k) retirement plan with employer match
  • Paid parental leave (up to 24 weeks for birth parents and 20 weeks for non-birthing parents), plus paid medical and caregiver leave (up to 8 weeks)
  • Paid time off: flexible PTO for exempt employees and up to 15 days annually for non-exempt employees
  • 13+ paid company holidays, and multiple paid coordinated company office closures throughout the year for focus and recharge, plus paid sick or safe time (1 hour per 30 hours worked, or more, as required by applicable state or local law)
  • Mental health and wellness support
  • Employer-paid basic life and disability coverage
  • Annual learning and development stipend to fuel your professional growth
  • Daily meals in our offices, and meal delivery credits as eligible
  • Fulltime

Senior Software Engineer (Infrastructure)

We're hiring a Senior Software Engineer (Infrastructure) to be a technical drive...
Location: United States, San Francisco
Salary: 160000.00 - 250000.00 USD / Year
Helpcare AI
Expiration Date: Until further notice
Requirements
  • You're a scrappy infrastructure generalist who has seen it all
  • You've worked with GPU cloud providers and understand what's needed to build reliable systems on top of them
  • You consider AWS your second home. You're comfortable spinning up new services and building simple repeatable processes for others to leverage
  • You thrive in an ambiguous and fast changing space
  • You bring a senior mindset: you set direction, own decisions, and get things over the finish line
  • You have incredible communication skills and can communicate complex technical ideas clearly to both technical and non-technical team members
Job Responsibility
  • Work across teams to own and extend our GPU infra as well as our traditional cloud infra (AWS)
  • Work closely with our external infrastructure partners to ensure stability and reliability for GPU deployments and GPU availability
  • Empower other engineers to move fast by building amazing developer experiences for setting up new systems
What we offer
  • flexible work schedule
  • unlimited PTO
  • competitive healthcare
  • gear stipends
  • Fulltime

Software Engineer - Reliability

We are looking for a hands-on, first-principles engineer who is fluent in Linux,...
Location: United States, Palo Alto
Salary: 170000.00 - 360000.00 USD / Year
Luma AI
Expiration Date: Until further notice
Requirements
  • 8+ years of experience as an SRE, production engineer, or infrastructure engineer in a fast-paced, large-scale environment
  • Deep, hands-on expertise in Linux, containerized systems, and debugging low-level system performance
  • Strong experience with providers like AWS or OCI
  • Thrive on solving complex, low-level problems where hardware and software intersect
  • Energetic and thrive in a less structured, fast-paced environment
  • Working knowledge of security best practices and familiarity with compliance frameworks, such as SOC 2 and ISO
  • Practical experience with InfiniBand, RDMA, or RoCE and an understanding of how to optimize throughput for massive distributed training jobs
Job Responsibility
  • Architect for Reliability & Scale: Participate in critical re-architecture sessions to redesign our systems for higher efficiency and scale
  • Own Multi-Cloud GPU Clusters: Take end-to-end ownership of our production clusters for training and inference across AWS and OCI, ensuring high availability and peak performance
  • Drive Security & Compliance: Assist in achieving and maintaining security certifications (SOC 2 Type 1 & 2, ISO standards) by implementing robust infrastructure security practices
  • Deep Linux Performance Tuning: Use your mastery of Linux systems to troubleshoot and optimize performance at the OS and kernel level
  • Build Robust Automation: Write high-quality tools and automation in Python, Go, or Bash to manage, monitor, and heal our infrastructure
  • Debug Complex Hardware/Software Failures: Serve as the final escalation point for the most challenging GPU, networking (InfiniBand/RDMA), and system-level issues
  • Fulltime

Staff Software Engineer, Core Infrastructure

Our Core Infrastructure team in Aarhus is at the forefront of building and scali...
Location: Denmark, Aarhus
Salary: Not provided
Uber
Expiration Date: Until further notice
Requirements
  • 8+ years of experience in backend software development with distributed systems, infrastructure, or cloud platforms
  • Strong expertise in Go, Java, or similar backend languages, with a deep understanding of Kubernetes, cloud infrastructure, and high-scale systems
  • Experience leading cross-team or team-wide projects focused on system modernization, performance optimizations, and deployment safety improvements
  • Experience designing and implementing highly available, efficient, and secure cloud-native/Kubernetes architectures
  • Deep understanding of safe deployment strategies, workload automation, and resilience engineering
  • Strong experience in scaling autoscaling solutions, ARM adoption, hybrid cloud, or GPU support for ML workloads
  • Ability to lead complex, cross-team engineering projects and build strategic relationships with stakeholders across platform, security, and infrastructure teams
Job Responsibility
  • Design and implement backend infrastructure components to support Uber’s growing workloads, including deployment engines, autoscalers, and hybrid cloud environments
  • Lead cross-team projects focused on safe deployment and rollback automation across stateless, stateful, and batch workloads, improving resilience and developer efficiency
  • Improve infrastructure security and compliance, including encryption-at-rest, ransomware mitigation, and cloud security best practices
  • Contribute to and drive modernization efforts within the team and across related teams, including Kubernetes migration, unified workload platforms, and PaaS improvements
  • Optimize Uber’s infrastructure efficiency, focusing on ARM adoption, autoscaling enhancements, and cost-effective compute allocation
  • Proactively mentor other engineers and help define the technical direction for your team, ensuring Uber’s backend infrastructure remains reliable, scalable, and efficient
  • Fulltime

Software Engineer - Cloud FinOps & Reliability

This is a foundational engineering position for a technical, data-driven expert ...
Location: United States, Palo Alto
Salary: 120000.00 - 255000.00 USD / Year
Luma AI
Expiration Date: Until further notice
Requirements
  • 5+ years of experience in a technical role such as Site Reliability Engineer, DevOps Engineer, Infrastructure Engineer, or a dedicated Cloud Cost Engineer
  • Deep, hands-on expertise with the cost models and optimization levers of at least one major cloud provider (AWS, GCP), and a willingness to learn others
  • Proficient in Python for the purpose of scripting, data analysis, and building automation tooling
  • Strong, foundational understanding of cloud infrastructure, including containerization (Docker, Kubernetes), networking, and storage
  • Not an accountant; you are a systems thinker who is passionate about applying engineering principles to solve financial challenges at scale
  • A tenacious troubleshooter and a data-driven decision-maker who thrives on finding the 'why' behind the numbers
Job Responsibility
  • Analyze & Optimize: Actively monitor and analyze costs across our entire technical ecosystem—including multi-cloud infrastructure (AWS, GCP, OCI), on-premise clusters, and third-party services—to identify and execute on opportunities for cost optimization. Develop forecasting models to predict future spend and inform our capacity planning
  • Manage & Commit: Develop and actively manage a multi-million dollar portfolio of Reserved Instances (RIs) and Savings Plans to maximize commitment-based discounts across our global GPU and CPU fleets
  • Automate & Build: Apply a software engineering approach to design, build, and maintain custom tools and automation in Python and SQL. Your systems will track, analyze, and report on costs across our entire fleet of providers and services, with a focus on detecting anomalies immediately
  • Partner & Advise: Working closely as an embedded member of the SRE team, you will partner with fellow SREs and research teams to model the cost implications of new models and infrastructure designs, providing expert guidance on cost-performance trade-offs
  • Visualize & Report: Create and manage a centralized observability stack for cloud costs, building dashboards in tools like Grafana to give a real-time, granular view of our financial posture to all stakeholders
  • Fulltime

Senior Software Engineer, Search

We are looking for a Senior Software Engineer dedicated to the core development ...
Location: Finland, Helsinki
Salary: Not provided
Aiven Deutschland GmbH
Expiration Date: Until further notice
Requirements
  • Search Mastery: Deep technical knowledge of OpenSearch or Elasticsearch internals, particularly in resource management, indexing strategies, and query optimization.
  • Reliability & Operations: A passion for building highly available, resilient distributed systems and a proven track record of solving complex operational challenges.
  • Hardware: Interest or experience in hardware acceleration, specifically deploying and managing GPU-based instances for data services.
  • Automation Mindset: Proven experience in designing automated recovery systems and performance-tuning algorithms.
  • Modern Search Patterns: A strong interest in the evolving landscape, including Vector Database implementations, hybrid search and RAG (Retrieval-Augmented Generation).
Job Responsibility
  • Autonomic Systems: Design and implement new, and improve existing, automated management, self-healing recovery mechanisms, and performance-tuning logic for OpenSearch clusters.
  • Next-Gen Infrastructure: Research, design, and deploy the first GPU-powered instances within the Aiven ecosystem to support intensive search and AI workloads.
  • AI Node Development: Develop and integrate specialized AI node types to enhance OpenSearch's capability in handling modern machine learning tasks.
  • Engineering Ownership: Own the health, reliability, and availability metrics of our OpenSearch platform.
What we offer
  • Participate in Aiven’s equity plan.
  • Balance work and life with our hybrid work policy.
  • Choose the equipment you need to set yourself up for success.
  • Use your Professional Development Plan budget for learning opportunities.
  • Receive holistic wellbeing support through our global Employee Assistance Program.
  • Inquire about our Global Time Off Commitment (Parental and Sick Leave, as well as Personal Time)
  • Enjoy country-specific benefits for our global cast.
  • Fulltime