Software Engineer - Reliability GPU Infrastructure

Luma AI

Location:
United States; United Kingdom, Palo Alto

Contract Type:
Not provided

Salary:
170000.00 - 360000.00 USD / Year

Job Description:

Luma AI is a capital-intensive lab building the future of creative intelligence. This unique position offers the leverage to build systems of immense scale while retaining individual ownership over the architecture and strategy of our infrastructure.

Job Responsibility:

  • Define the technical strategy for the compute substrate
  • Determine how to provision, manage, and scale a multi-cloud and on-premise GPU footprint
  • Bridge the gap between hardware vendors and the software stack
  • Architect a seamless infrastructure mesh that spans multiple cloud providers and bare-metal environments
  • Design the logic that allocates massive compute resources across competing priorities
  • Lead the effort to define the entire stack as code
  • Build rigorous CI/CD and GitOps workflows

Requirements:

  • History of designing complex distributed systems
  • Deep expertise across various infrastructure providers
  • Ability to mentor the team and drive consensus on technical decisions

Additional Information:

Job Posted:
January 13, 2026

Employment Type:
Fulltime
Work Type:
Remote work

Similar Jobs for Software Engineer - Reliability GPU Infrastructure

Software Engineer - Cloud FinOps & Reliability

This is a foundational engineering position for a technical, data-driven expert ...
Location:
United States, Palo Alto
Salary:
120000.00 - 255000.00 USD / Year
Luma AI
Expiration Date:
Until further notice
Requirements:
  • 5+ years of experience in a technical role such as Site Reliability Engineer, DevOps Engineer, Infrastructure Engineer, or a dedicated Cloud Cost Engineer
  • Deep, hands-on expertise with the cost models and optimization levers of at least one major cloud provider (AWS, GCP), and a willingness to learn others
  • Proficient in Python for the purpose of scripting, data analysis, and building automation tooling
  • Strong, foundational understanding of cloud infrastructure, including containerization (Docker, Kubernetes), networking, and storage
  • Not an accountant: you are a systems thinker who is passionate about applying engineering principles to solve financial challenges at scale
  • A tenacious troubleshooter and a data-driven decision-maker who thrives on finding the 'why' behind the numbers
Job Responsibility:
  • Analyze & Optimize: Actively monitor and analyze costs across our entire technical ecosystem—including multi-cloud infrastructure (AWS, GCP, OCI), on-premise clusters, and third-party services—to identify and execute on opportunities for cost optimization. Develop forecasting models to predict future spend and inform our capacity planning
  • Manage & Commit: Develop and actively manage a multi-million dollar portfolio of Reserved Instances (RIs) and Savings Plans to maximize commitment-based discounts across our global GPU and CPU fleets
  • Automate & Build: Apply a software engineering approach to design, build, and maintain custom tools and automation in Python and SQL. Your systems will track, analyze, and report on costs across our entire fleet of providers and services, with a focus on detecting anomalies immediately
  • Partner & Advise: Working closely as an embedded member of the SRE team, you will partner with fellow SREs and research teams to model the cost implications of new models and infrastructure designs, providing expert guidance on cost-performance trade-offs
  • Visualize & Report: Create and manage a centralized observability stack for cloud costs, building dashboards in tools like Grafana to give a real-time, granular view of our financial posture to all stakeholders

Machine Learning Platform / Backend Engineer

We are seeking a Machine Learning Platform/Backend Engineer to design, build, an...
Location:
Serbia; Romania, Belgrade; Timișoara
Salary:
Not provided
Everseen
Expiration Date:
Until further notice
Requirements:
  • 4-5+ years of work experience in either ML infrastructure, MLOps, or Platform Engineering
  • Bachelor's degree or equivalent, preferably with a focus on computer science
  • Excellent communication and collaboration skills
  • Expert knowledge of Python
  • Experience with CI/CD tools (e.g., GitLab, Jenkins)
  • Hands-on experience with Kubernetes, Docker, and cloud services
  • Understanding of ML training pipelines, data lifecycle, and model serving concepts
  • Familiarity with workflow orchestration tools (e.g., Airflow, Kubeflow, Ray, Vertex AI, Azure ML)
  • A demonstrated understanding of the ML lifecycle, model versioning, and monitoring
  • Experience with ML frameworks (e.g., TensorFlow, PyTorch)
Job Responsibility:
  • Design, build, and maintain scalable infrastructure that empowers data scientists and machine learning engineers
  • Own the design and implementation of the internal ML platform, enabling end-to-end workflow orchestration, resource management, and automation using cloud-native technologies (GCP/Azure)
  • Design and manage Kubernetes-based infrastructure for multi-tenant GPU and CPU workloads with strong isolation, quota control, and monitoring
  • Integrate and extend orchestration tools (Airflow, Kubeflow, Ray, Vertex AI, Azure ML or custom schedulers) to automate data processing, training, and deployment pipelines
  • Develop shared services for model behavior/performance tracking, data/datasets versioning, and artifact management (MLflow, DVC, or custom registries)
  • Build out documentation in relation to architecture, policies and operations runbooks
  • Share skills, knowledge, and expertise with members of the data engineering team
  • Foster a culture of collaboration and continuous learning by organizing training sessions, workshops, and knowledge-sharing sessions
  • Collaborate and drive progress with cross-functional teams to design and develop new features and functionalities
  • Ensure that the developed solutions meet project objectives and enhance user experience

Senior Manager, Performance AI/ML Network Deployment Engineering

The Senior Manager, DC GPU Advanced Forward Deployment and Systems Engineering i...
Location:
United States, Santa Clara
Salary:
210400.00 - 315600.00 USD / Year
AMD
Expiration Date:
Until further notice
Requirements:
  • Expertise in networking and performance optimization for large-scale AI/ML networks, including network, compute, storage cluster design, modelling, analytics, performance tuning, convergence, scalability improvements
  • Solid, hands-on expertise in at least one of three domains (compute, network, storage) is preferred
  • Experience in working with large customers such as Cloud Service Providers and global enterprise customers
  • Proven leadership in engaging customers with diverse technical disciplines in avenues such as Proof of Concept, Competitive evaluations, Early Field Trials etc
  • Direct experience working with large customers; able to operate with a sense of urgency, own problems, and resolve them
  • Demonstrated leadership in network architecture, hands on experience in RoCEv2 Design, VXLAN-EVPN, BGP, and Lossless Fabrics
  • Proven ability to influence design and technology roadmaps, leveraging a deep understanding of datacenter products and market trends
  • Extensive hands-on Network deployment expertise and proven track record of delivering large projects on time. Cisco, Juniper or Arista experience is preferred
  • Direct, co-development/deployment experience in working with strategic customers/partners in bringing solutions to market
  • Excellent communication level from engineer to mid-management to C-level of audience
Job Responsibility:
  • Collaborate with strategic customers on scalable designs involving compute, networking, and storage environments; work with industry partners and internal teams to accelerate the deployment and adoption of various AI/ML models
  • Engage in system-level triage and at-scale debugging of complex issues across hardware, firmware, and software, ensuring rapid resolution and system reliability
  • Drive the ramp of Instinct-based large scale AI datacenter infrastructure based on NPI base platform hardware with ROCm, scaling up to pod and cluster level, leveraging the best in network architecture for AI/ML workloads
  • Enhance tools and methodologies for large-scale deployments to meet customer uptime goals and exceed performance expectations
  • Engage with clients to deeply understand their technical needs, ensuring their satisfaction with tailored solutions that leverage your past experience in strategic customer engagements and architectural wins
  • Provide domain specific knowledge to other groups at AMD, share the lessons learnt to drive continuous improvement
  • Engage with AMD product groups to drive resolution of application and customer issues
  • Develop and present training materials to internal audiences, at customer venues, and at industry conferences

Software Engineer - Reliability

We are looking for a hands-on, first-principles engineer who is fluent in Linux,...
Location:
United States, Palo Alto
Salary:
170000.00 - 360000.00 USD / Year
Luma AI
Expiration Date:
Until further notice
Requirements:
  • 8+ years of experience as an SRE, production engineer, or infrastructure engineer in a fast-paced, large-scale environment
  • Deep, hands-on expertise in Linux, containerized systems, and debugging low-level system performance
  • Strong experience with providers like AWS or OCI
  • Thrive on solving complex, low-level problems where hardware and software intersect
  • Energetic, thriving in a less structured, fast-paced environment
  • Working knowledge of security best practices and familiarity with compliance frameworks, such as SOC 2 and ISO
  • Practical experience with InfiniBand, RDMA, or RoCE and understand how to optimize throughput for massive distributed training jobs
Job Responsibility:
  • Architect for Reliability & Scale: Participate in critical re-architecture sessions to redesign our systems for higher efficiency and scale
  • Own Multi-Cloud GPU Clusters: Take end-to-end ownership of our production clusters for training and inference across AWS and OCI, ensuring high availability and peak performance
  • Drive Security & Compliance: Assist in achieving and maintaining security certifications (SOC 2 Type 1 & 2, ISO standards) by implementing robust infrastructure security practices
  • Deep Linux Performance Tuning: Use your mastery of Linux systems to troubleshoot and optimize performance at the OS and kernel level
  • Build Robust Automation: Write high-quality tools and automation in Python, Go, or Bash to manage, monitor, and heal our infrastructure
  • Debug Complex Hardware/Software Failures: Serve as the final escalation point for the most challenging GPU, networking (InfiniBand/RDMA), and system-level issues

AI Infrastructure Engineer IV

At ASI, we are revolutionizing industries with state-of-the-art autonomous robot...
Location:
United States, Mendon
Salary:
Not provided
Autonomous Solutions
Expiration Date:
Until further notice
Requirements:
  • Bachelor's degree in Computer Science, Computer Engineering, or a related technical field
  • 10+ years of experience in cloud infrastructure, DevOps, or platform engineering with an emphasis on AI/ML systems
  • Strong understanding of modern AI infrastructure components, including distributed computing, GPU-accelerated systems, and large-scale storage
  • Hands-on experience with cloud platforms such as AWS, Azure, or Google Cloud
  • Proficiency with Kubernetes, Docker, Terraform, or similar containerization and orchestration tools
  • Strong programming skills in Python and/or C++, with experience supporting machine learning frameworks (TensorFlow, PyTorch, etc.)
  • Experience implementing CI/CD pipelines, MLOps practices, and automation tooling
Job Responsibility:
  • Design, build, and maintain high-performance computing infrastructure including CPUs, GPUs, storage, and networking to support AI and ML workloads
  • Deploy and manage AI systems within cloud environments (AWS, Azure, GCP), ensuring scalability, cost-efficiency, and high availability
  • Collaborate with data scientists, ML engineers, and software teams to support AI model development, training, and deployment workflows
  • Implement automation, CI/CD, DevOps, and MLOps practices to create efficient, repeatable, and reliable AI infrastructure processes
  • Optimize compute and storage systems to achieve maximum performance and throughput for AI/ML pipelines
  • Monitor system health and troubleshoot performance bottlenecks, infrastructure issues, and deployment challenges
What we offer:
  • Full Benefits - 90% Medical, ESOP, 401K, Generous PTO

AI Infrastructure Engineer IV

At ASI, we are revolutionizing industries with state-of-the-art autonomous robot...
Location:
United States, Lehi
Salary:
Not provided
Autonomous Solutions
Expiration Date:
Until further notice
Requirements:
  • Bachelor's degree in Computer Science, Computer Engineering, or a related technical field
  • 10+ years of experience in cloud infrastructure, DevOps, or platform engineering with an emphasis on AI/ML systems
  • Strong understanding of modern AI infrastructure components, including distributed computing, GPU-accelerated systems, and large-scale storage
  • Hands-on experience with cloud platforms such as AWS, Azure, or Google Cloud
  • Proficiency with Kubernetes, Docker, Terraform, or similar containerization and orchestration tools
  • Strong programming skills in Python and/or C++, with experience supporting machine learning frameworks (TensorFlow, PyTorch, etc.)
  • Experience implementing CI/CD pipelines, MLOps practices, and automation tooling
Job Responsibility:
  • Design, build, and maintain high-performance computing infrastructure including CPUs, GPUs, storage, and networking to support AI and ML workloads
  • Deploy and manage AI systems within cloud environments (AWS, Azure, GCP), ensuring scalability, cost-efficiency, and high availability
  • Collaborate with data scientists, ML engineers, and software teams to support AI model development, training, and deployment workflows
  • Implement automation, CI/CD, DevOps, and MLOps practices to create efficient, repeatable, and reliable AI infrastructure processes
  • Optimize compute and storage systems to achieve maximum performance and throughput for AI/ML pipelines
  • Monitor system health and troubleshoot performance bottlenecks, infrastructure issues, and deployment challenges
What we offer:
  • Full Benefits - 90% Medical, ESOP, 401K, Generous PTO

Senior MLOps Engineer

If you’re passionate about scalability, automated deployment, and well-optimized...
Location:
Romania, Bucharest
Salary:
Not provided
IT Genetics Romania
Expiration Date:
Until further notice
Requirements:
  • University degree, preferably in engineering (software, industrial, mechanical, process) or a related field
  • Over 5 years of experience in MLOps or machine learning engineering, with a focus on deploying and managing deep learning models at scale
  • Strong skills in Python, CI/CD pipelines, and ML frameworks (e.g., PyTorch, TensorFlow, OpenCV) for automating and scaling ML workflows
  • Expertise in monitoring and alert automation for ML workflows, including data pipelines, training processes, and model performance (e.g., Prometheus, Grafana)
  • Familiarity with distributed training techniques, multi-GPU strategies, and hardware optimization for deep learning
  • Strong communication and interpersonal skills
Job Responsibility:
  • Design end-to-end architecture for the automated training of ML models
  • Create data pipelines to build relevant datasets and data annotation flows
  • Monitor ML model performance and data drift
  • Handle versioning, deployment, and integration with the software team
  • Develop and manage CI/CD pipelines for building, testing, and deploying models
  • Apply best practices for model versioning, rollback, and A/B testing to ensure reliable and accurate production releases
  • Set up a robust monitoring system and develop automated alerting solutions to proactively identify issues in data pipelines, model training, validation, and data variation
  • Promote MLOps best practices (Infrastructure as Code, reproducibility, security) and continuously improve internal processes to increase reliability and efficiency
  • Research and implement cutting-edge technologies to improve training efficiency (e.g., distributed training, HPC, multi-GPU strategies) for the research team
  • Explore future MLOps frameworks and GPU-based cloud solutions as part of the scalability roadmap
What we offer:
  • Meal tickets
  • A place where your voice truly matters
  • Performance bonuses
  • A day off on your birthday
  • Private medical subscription
  • Trainings and learning resources
  • Hybrid work model
  • Bookster subscription
  • A friendly, passionate, and solution-oriented team
  • Opportunities to grow or change your role within the company

Product Architect - AI/ML

Product Architect - AI/ML. Define and drive the technical product vision for ent...
Location:
India, Pune
Salary:
Not provided
Genzeon
Expiration Date:
Until further notice
Requirements:
  • Expert Python development with PyTorch, Transformers, production ML frameworks
  • Comfortable handling NVIDIA/CUDA variants and their inner workings
  • Proven experience fine-tuning LLMs and deploying custom models in regulated environments
  • Track record of shipping production AI products with measurable business impact
  • Core software engineering skills, AI/ML expertise, and DevOps
  • Flexibility to work with US hours overlap
  • 12-15 years experience (Minimum 6-7 years in AI/ML Product Architecture)
  • Hands-on technical leadership and a user-centric approach to technical architecture
  • Ownership mindset with accountability for outcomes and the ability to balance innovation with pragmatic engineering
  • Effective communicator capable of engaging both technical and non-technical audiences
Job Responsibility:
  • Define and drive the technical product vision for enterprise AI/ML platforms in healthcare, translating business requirements into scalable architectures while ensuring delivery excellence and long-term product sustainability
  • Architect end-to-end product solutions for processing clinical records, claims, and healthcare documentation
  • Design hybrid Azure/on-premises architectures supporting multi-tenant SaaS and enterprise deployment models
  • Establish product architecture principles, design patterns, and technology stack decisions
  • Evaluate build-vs-buy decisions for AI capabilities; drive technical feasibility assessments and rapid prototyping for new product features
  • Own non-functional requirements: performance, security, compliance (HIPAA), scalability, reliability
  • Present architectural proposals and technical roadmaps to leadership through clear presentations and documentation
  • Design and implement fine-tuning pipelines for domain-specific LLMs (e.g. Llama, Mistral, Phi-3) on medical datasets
  • Architect MLOps frameworks enabling continuous model improvement: training, evaluation, deployment, monitoring
  • Build “built-in-intelligence-products” using advanced AI models