
Software Engineer - Reliability GPU Infrastructure


Luma AI


Location:
United States; United Kingdom, Palo Alto


Contract Type:
Not provided


Salary:

170000.00 - 360000.00 USD / Year

Job Description:

Luma AI is a capital-intensive lab building the future of creative intelligence. This unique position offers the leverage to build systems of immense scale while retaining individual ownership over the architecture and strategy of our infrastructure.

Job Responsibility:

  • Define the technical strategy for the compute substrate
  • Determine how to provision, manage, and scale a multi-cloud and on-premise GPU footprint
  • Bridge the gap between hardware vendors and the software stack
  • Architect a seamless infrastructure mesh that spans multiple cloud providers and bare-metal environments
  • Design the logic that allocates massive compute resources across competing priorities
  • Lead the effort to define the entire stack as code
  • Build rigorous CI/CD and GitOps workflows

Requirements:

  • History of designing complex distributed systems
  • Deep expertise across various infrastructure providers
  • Ability to mentor the team and drive consensus on technical decisions

Additional Information:

Job Posted:
January 13, 2026

Employment Type:
Fulltime
Work Type:
Remote work



Similar Jobs for Software Engineer - Reliability GPU Infrastructure

Senior Software Engineer (Infrastructure)

We're hiring a Senior Software Engineer (Infrastructure) to be a technical drive...
Location:
United States, San Francisco
Salary:
160000.00 - 250000.00 USD / Year
Helpcare AI
Expiration Date
Until further notice
Requirements:
  • You're a scrappy infrastructure generalist who has seen it all
  • You've worked with GPU cloud providers and understand what's needed to build reliable systems on top of them
  • You consider AWS your second home. You're comfortable spinning up new services and building simple repeatable processes for others to leverage
  • You thrive in an ambiguous and fast changing space
  • You bring a senior mindset: you set direction, own decisions, and get things over the finish line
  • You have incredible communication skills and can convey complex technical ideas clearly to both technical and non-technical team members
Job Responsibility:
  • Work across teams to own and extend our GPU infra as well as our traditional cloud infra (AWS)
  • Work closely with our external infrastructure partners to ensure stability and reliability for GPU deployments and GPU availability
  • Empower other engineers to move fast by building amazing developer experiences for setting up new systems
What we offer:
  • flexible work schedule
  • unlimited PTO
  • competitive healthcare
  • gear stipends
Employment Type: Fulltime

Software Engineer - Cloud FinOps & Reliability

This is a foundational engineering position for a technical, data-driven expert ...
Location:
United States, Palo Alto
Salary:
120000.00 - 255000.00 USD / Year
Luma AI
Expiration Date
Until further notice
Requirements:
  • 5+ years of experience in a technical role such as Site Reliability Engineer, DevOps Engineer, Infrastructure Engineer, or a dedicated Cloud Cost Engineer
  • Deep, hands-on expertise with the cost models and optimization levers of at least one major cloud provider (AWS, GCP), and a willingness to learn others
  • Proficient in Python for the purpose of scripting, data analysis, and building automation tooling
  • Strong, foundational understanding of cloud infrastructure, including containerization (Docker, Kubernetes), networking, and storage
  • Not an accountant: you are a systems thinker who is passionate about applying engineering principles to solve financial challenges at scale
  • A tenacious troubleshooter and a data-driven decision-maker who thrives on finding the 'why' behind the numbers
Job Responsibility:
  • Analyze & Optimize: Actively monitor and analyze costs across our entire technical ecosystem—including multi-cloud infrastructure (AWS, GCP, OCI), on-premise clusters, and third-party services—to identify and execute on opportunities for cost optimization. Develop forecasting models to predict future spend and inform our capacity planning
  • Manage & Commit: Develop and actively manage a multi-million dollar portfolio of Reserved Instances (RIs) and Savings Plans to maximize commitment-based discounts across our global GPU and CPU fleets
  • Automate & Build: Apply a software engineering approach to design, build, and maintain custom tools and automation in Python and SQL. Your systems will track, analyze, and report on costs across our entire fleet of providers and services, with a focus on detecting anomalies immediately
  • Partner & Advise: Working closely as an embedded member of the SRE team, you will partner with fellow SREs and research teams to model the cost implications of new models and infrastructure designs, providing expert guidance on cost-performance trade-offs
  • Visualize & Report: Create and manage a centralized observability stack for cloud costs, building dashboards in tools like Grafana to give a real-time, granular view of our financial posture to all stakeholders
Employment Type: Fulltime

Machine Learning Platform / Backend Engineer

We are seeking a Machine Learning Platform/Backend Engineer to design, build, an...
Location:
Serbia; Romania, Belgrade; Timișoara
Salary:
Not provided
Everseen
Expiration Date
Until further notice
Requirements:
  • 4-5+ years of work experience in either ML infrastructure, MLOps, or Platform Engineering
  • Bachelor's degree or equivalent, preferably with a computer science focus
  • Excellent communication and collaboration skills
  • Expert knowledge of Python
  • Experience with CI/CD tools (e.g., GitLab, Jenkins)
  • Hands-on experience with Kubernetes, Docker, and cloud services
  • Understanding of ML training pipelines, data lifecycle, and model serving concepts
  • Familiarity with workflow orchestration tools (e.g., Airflow, Kubeflow, Ray, Vertex AI, Azure ML)
  • A demonstrated understanding of the ML lifecycle, model versioning, and monitoring
  • Experience with ML frameworks (e.g., TensorFlow, PyTorch)
Job Responsibility:
  • Design, build, and maintain scalable infrastructure that empowers data scientists and machine learning engineers
  • Own the design and implementation of the internal ML platform, enabling end-to-end workflow orchestration, resource management, and automation using cloud-native technologies (GCP/Azure)
  • Design and manage Kubernetes-based infrastructure for multi-tenant GPU and CPU workloads with strong isolation, quota control, and monitoring
  • Integrate and extend orchestration tools (Airflow, Kubeflow, Ray, Vertex AI, Azure ML or custom schedulers) to automate data processing, training, and deployment pipelines
  • Develop shared services for model behavior/performance tracking, data/datasets versioning, and artifact management (MLflow, DVC, or custom registries)
  • Build out documentation covering architecture, policies, and operations runbooks
  • Share skills, knowledge, and expertise with members of the data engineering team
  • Foster a culture of collaboration and continuous learning by organizing training sessions, workshops, and knowledge-sharing sessions
  • Collaborate and drive progress with cross-functional teams to design and develop new features and functionalities
  • Ensure that the developed solutions meet project objectives and enhance user experience
Employment Type: Fulltime

Principal Software Engineer, CoreAI

The CoreAI GPU Infrastructure team builds the foundational accelerated compute p...
Location:
United States, Redmond
Salary:
139900.00 - 274800.00 USD / Year
Microsoft Corporation
Expiration Date
Until further notice
Requirements:
  • Bachelor's Degree in Computer Science or related technical field and 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, Python or equivalent experience
  • Proven ability to design and operate large-scale, production infrastructure with high reliability and performance requirements
  • Strong problem-solving skills and the ability to debug complex, cross-layer systems issues
  • Demonstrated technical leadership, including mentoring engineers and driving cross-team architectural alignment
  • Hands-on experience with virtualization and/or container platforms (e.g., VMs, Kubernetes, container runtimes)
  • Strong collaboration and communication skills, with the ability to work across organizational boundaries
Job Responsibility:
  • Design and build GPU accelerated infrastructure for training and inference workloads, spanning bare metal, virtual machines, and containerized environments
  • Develop systems for GPU device management, scheduling, isolation, and sharing (e.g., partial GPU allocation, multi-tenant usage)
  • Build and operate advanced orchestration and resource governance scenarios using platforms such as AKS, Dynamic Resource Allocation (DRA), and related Kubernetes ecosystem capabilities to enable fair sharing, isolation, and efficient utilization of accelerated resources
  • Build and evolve virtualization and container stacks to support modern AI workloads, including secure and confidential compute scenarios
  • Optimize performance, reliability, and utilization across large GPU fleets, including scale-up and scale-out configurations
  • Partner with networking and storage teams to enable high-performance interconnects (e.g., RDMA/InfiniBand class networking) for distributed workloads
  • Drive end-to-end platform features from design through production, including observability, diagnostics, and operational excellence
  • Influence platform architecture and technical direction across teams through design reviews and technical leadership
Employment Type: Fulltime

Staff Software Engineer, GPU Infrastructure (HPC)

The internal infrastructure team is responsible for building world-class infrast...
Location:
Salary:
Not provided
Cohere
Expiration Date
Until further notice
Requirements:
  • Deep expertise in ML/HPC infrastructure: Experience with GPU/TPU clusters, distributed training frameworks (JAX, PyTorch, TensorFlow), and high-performance computing (HPC) environments
  • Kubernetes at scale: Proven ability to deploy, manage, and troubleshoot cloud-native Kubernetes clusters for AI workloads
  • Strong programming skills: Proficiency in Python (for ML tooling) and Go (for systems engineering), with a preference for open-source contributions over reinventing solutions
  • Low-level systems knowledge: Familiarity with Linux internals, RDMA networking, and performance optimization for ML workloads
  • Research collaboration experience: A track record of working closely with AI researchers or ML engineers to solve infrastructure challenges
  • Self-directed problem-solving: The ability to identify bottlenecks, propose solutions, and drive impact in a fast-paced environment
Job Responsibility:
  • Build and scale ML-optimized HPC infrastructure: Deploy and manage Kubernetes-based GPU/TPU superclusters across multiple clouds, ensuring high throughput and low-latency performance for AI workloads
  • Optimize for AI/ML training: Collaborate with cloud providers to fine-tune infrastructure for cost efficiency, reliability, and performance, leveraging technologies like RDMA, NCCL, and high-speed interconnects
  • Troubleshoot and resolve complex issues: Proactively identify and resolve infrastructure bottlenecks, performance degradation, and system failures to ensure minimal disruption to AI/ML workflows
  • Enable researchers with self-service tools: Design intuitive interfaces and workflows that allow researchers to monitor, debug, and optimize their training jobs independently
  • Drive innovation in ML infrastructure: Work closely with AI researchers to understand emerging needs (e.g., JAX, PyTorch, distributed training) and translate them into robust, scalable infrastructure solutions
  • Champion best practices: Advocate for observability, automation, and infrastructure-as-code (IaC) across the organization, ensuring systems are maintainable and resilient
  • Mentorship and collaboration: Share expertise through code reviews, documentation, and cross-team collaboration, fostering a culture of knowledge transfer and engineering excellence
What we offer:
  • An open and inclusive culture and work environment
  • Work closely with a team on the cutting edge of AI research
  • Weekly lunch stipend, in-office lunches & snacks
  • Full health and dental benefits, including a separate budget to take care of your mental health
  • 100% Parental Leave top-up for up to 6 months
  • Personal enrichment benefits towards arts and culture, fitness and well-being, quality time, and workspace improvement
  • Remote-flexible, offices in Toronto, New York, San Francisco, London and Paris, as well as a co-working stipend
  • 6 weeks of vacation (30 working days!)
Employment Type: Fulltime

Principal Software Engineer

In Azure Specialized we are collaboratively working to bring the next generation...
Location:
United States, Multiple Locations
Salary:
139900.00 - 274800.00 USD / Year
Microsoft Corporation
Expiration Date
Until further notice
Requirements:
  • Bachelor's Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python, OR equivalent experience
  • Ability to meet Microsoft, customer and/or government security screening requirements is required for this role
  • This position will be required to pass the Microsoft Cloud Background Check upon hire/transfer and every two years thereafter
Job Responsibility:
  • Willing to dive deeply into any level or layer of a problem.
  • Willing to learn emerging technologies, from hardware to software. Evaluate and make recommendations that advance Azure infrastructure for AI and other GPU-based workloads.
  • Leads by example within the team by producing extensible and maintainable code. Optimizes, debugs, refactors, and reuses code to improve performance, maintainability, effectiveness, and return on investment (ROI). Applies metrics to drive the quality and stability of code, as well as appropriate coding patterns and best practices.
  • Maintains communication with key partners across the Microsoft ecosystem of engineers. Acts as a key contact for leadership to ensure alignment with partners' expectations.
  • Considers partner teams across organizations and their end goals for products to drive and achieve desirable user experiences and fitting dynamic needs of partners/customers through product development.
  • Drives identification of dependencies and the development of design documents for a product, application, service, or platform.
  • Creates, implements, optimizes, debugs, refactors, and reuses code to establish and improve performance and maintainability, effectiveness, and return on investment (ROI).
  • Acts as a Designated Responsible Individual (DRI) and guides other engineers by developing and following the playbook; works on call to monitor systems, products, and services for degradation, downtime, or interruptions; alerts stakeholders about status; and initiates actions to restore service for simple and complex problems when appropriate
  • Help to establish innovative infrastructure to manage data at scale, work closely with lead customers and build reliable monitoring pipelines
Employment Type: Fulltime

Member of Technical Staff, Software Engineer

Help build the infrastructure that powers training, evaluation, and data platfor...
Location:
Switzerland, Zürich
Salary:
Not provided
Microsoft Corporation
Expiration Date
Until further notice
Requirements:
  • Strong software engineering background building reliable, scalable production systems (Python preferred)
  • Hands‑on experience supporting large‑scale ML / LLM training, evaluation, or experimentation infrastructure
  • Operating GPU‑heavy workloads in cloud environments using Docker and Kubernetes (scheduling, utilization, isolation)
  • Designing and running data / compute pipelines and orchestration (e.g., Airflow, Argo) with object storage (Azure Blob / S3)
  • Platform reliability and operability: observability, metrics, logging, tracing, alerting (Prometheus, Grafana, OpenTelemetry)
Job Responsibility:
  • Design and build core platform services for scalable training and evaluation, including cluster orchestration, job scheduling, data and compute pipelines, and artifact management
  • Standardize containerized workflows by maintaining Docker images, CI/CD, and runtime configurations, and advocate for best practices in security, reproducibility, and cost efficiency
  • Implement end-to-end observability and operations through metrics, tracing, logging, dashboard development, monitoring, and automated alerts for model training and platform health (using Prometheus, Grafana, OpenTelemetry)
  • Architect and operate services on Azure cloud platforms, managing infrastructure-as-code (Terraform/Helm), secrets, networking, and storage
  • Enhance developer experience by creating tools, CLIs, and portals that simplify job submission, metrics analysis, and experiment management for generalist software engineering and research teams
  • Enforce security and compliance policies for data access, container hardening, and supply-chain integrity, and partner with security and privacy teams to maintain robust practices in multi-tenant environments and secret management
  • Collaborate cross-functionally with data, model, and product teams to align infrastructure roadmaps with training needs, evaluation protocols, and Copilot product goals
Employment Type: Fulltime

Senior Manager, Performance AI/ML Network Deployment Engineering

The Senior Manager, DC GPU Advanced Forward Deployment and Systems Engineering i...
Location:
United States, Santa Clara
Salary:
210400.00 - 315600.00 USD / Year
AMD
Expiration Date
Until further notice
Requirements:
  • Expertise in networking and performance optimization for large-scale AI/ML networks, including network, compute, storage cluster design, modelling, analytics, performance tuning, convergence, scalability improvements
  • Solid, hands-on expertise in at least one of three domains (compute, network, storage) is preferred
  • Experience in working with large customers such as Cloud Service Providers and global enterprise customers
  • Proven leadership in engaging customers with diverse technical disciplines in avenues such as Proof of Concept, Competitive evaluations, Early Field Trials etc
  • Direct experience working with large customers; able to operate with a sense of urgency, own problems, and resolve them
  • Demonstrated leadership in network architecture, hands on experience in RoCEv2 Design, VXLAN-EVPN, BGP, and Lossless Fabrics
  • Proven ability to influence design and technology roadmaps, leveraging a deep understanding of datacenter products and market trends
  • Extensive hands-on Network deployment expertise and proven track record of delivering large projects on time. Cisco, Juniper or Arista experience is preferred
  • Direct, co-development/deployment experience in working with strategic customers/partners in bringing solutions to market
  • Excellent communication skills across engineer, mid-management, and C-level audiences
Job Responsibility:
  • Collaborate with strategic customers on scalable designs involving compute, networking, and storage environments; work with industry partners and internal teams to accelerate the deployment and adoption of various AI/ML models
  • Engage in system-level triage and at-scale debugging of complex issues across hardware, firmware, and software, ensuring rapid resolution and system reliability
  • Drive the ramp of Instinct-based large scale AI datacenter infrastructure based on NPI base platform hardware with ROCm, scaling up to pod and cluster level, leveraging the best in network architecture for AI/ML workloads
  • Enhance tools and methodologies for large-scale deployments to meet customer uptime goals and exceed performance expectations
  • Engage with clients to deeply understand their technical needs, ensuring their satisfaction with tailored solutions that leverage your past experience in strategic customer engagements and architectural wins
  • Provide domain specific knowledge to other groups at AMD, share the lessons learnt to drive continuous improvement
  • Engage with AMD product groups to drive resolution of application and customer issues
  • Develop and present training materials to internal audiences, at customer venues, and at industry conferences