
Staff ML Infrastructure Engineer

United States, Austin, Texas · 197000.00 - 326000.00 USD / Year · Job Posted March 03, 2026

Job Description

The AI Validation Platform team owns the cloud-agnostic, reliable, and cost-efficient platform that powers GM’s AV efforts. We’re proud to serve as the infrastructure platform for teams developing autonomous vehicles (L3/L4/L5). Our platform supports the simulated validation of state-of-the-art (SOTA) machine learning models, with a focus on performance, availability, concurrency, and scalability. We enable rapid innovation and development by prioritizing high-impact, ML-centric use cases.

Job Responsibility

  • Collaborate with Simulation engineers, ML engineers, and researchers to understand critical workflows, translate them into platform requirements, and deliver incremental value
  • Own the technical roadmap and lead technical decisions on compute architecture, caching, capacity provisioning, and auto-scaling mechanisms
  • Drive the development of monitoring, observability, and metrics to ensure reliability, performance, and resource optimization
  • Proactively research and integrate frameworks, hardware accelerators, and distributed computing techniques
  • Lead large-scale technical initiatives across GM’s ML infrastructure
  • Raise the engineering bar through technical leadership and by establishing best practices

Requirements

  • 8+ years of industry experience, with a focus on high-performance backend services
  • Strong expertise in container technologies like Docker and Kubernetes
  • Strong expertise in Go or a similar programming language
  • Experience working with cloud platforms such as GCP, Azure, or AWS
  • Experience in delivering cross-functional initiatives
  • Strong communication skills and a proven ability to drive cross-functional initiatives
  • Ability to thrive in a dynamic, multi-tasking environment with ever-evolving priorities

Nice to have

  • Hands-on experience with cloud VM services (e.g., Google Compute Engine)
  • Experience with hardware-in-the-loop validation systems
  • Experience with high-performance computing (HPC)
  • Familiarity with telemetry and other feedback loops that inform product improvements
  • Familiarity with hardware acceleration (GPUs) and optimizations

What we offer

  • medical
  • dental
  • vision
  • Health Savings Account
  • Flexible Spending Accounts
  • retirement savings plan
  • sickness and accident benefits
  • life insurance
  • paid vacation & holidays
  • tuition assistance programs
  • employee assistance program
  • GM vehicle discounts


Similar Jobs for Staff ML Infrastructure Engineer

8 matching positions

Staff ML Infrastructure Engineer

Playlab seeks a Staff Machine Learning Engineer to join our growing Engineering ...
Salary: 180000.00 - 240000.00 USD / Year
Playlab
Expiration Date: Until further notice
Requirements
  • 7+ years building production ML/data systems, with experience in ML operations and infrastructure
  • Strong experience with model serving, orchestration, and optimization in production environments
  • Proficient in Python and data pipeline technologies (Airflow, ETL tools, etc.)
  • Experience with cloud infrastructure (AWS preferred) and containerization (Kubernetes, Docker)
  • Experience with cost optimization strategies for LLM-based systems
  • Thrive in high-agency, high-collaboration cultures
  • Strong communication skills that make remote-first work effective
Job Responsibility
  • Design, build, and maintain production ML infrastructure that balances performance, cost, and reliability
  • Own data quality and research dataset creation: ensure data is properly scrubbed, documented, and useful for research partners
  • Stay current with ML infrastructure technologies and techniques, from model serving to cost optimization to observability tools
  • Work cross-functionally with ML engineers, backend engineers, and product to ensure infrastructure supports real needs
  • Balance innovation with operational excellence: experiment with new approaches while maintaining system reliability and data quality
  • Mentor engineers on ML operations, cost optimization, and production ML best practices
Employment type: Full-time

Staff ML Infrastructure Engineer - Embodied AI

At General Motors, our product teams are redefining mobility. Through a human-ce...
Location: United States, Sunnyvale
Salary: 189300.00 - 290700.00 USD / Year
General Motors
Expiration Date: Until further notice
Requirements
  • 5+ years of experience building large-scale distributed systems, applications, or advanced ML systems
  • Proven track record of designing robust frameworks with high-quality, durable APIs
  • Deep understanding of machine learning algorithms with hands‑on application
  • Expertise in building reliable, high-performance, and cost-efficient systems on modern cloud infrastructure
  • End-to-end experience across the ML development lifecycle, including MLOps practices
  • Strong cross-functional collaboration skills across teams and organizations
  • Exceptional coding skills in Python or C++
  • Strong interest in autonomous driving and its transformative potential
  • BS, MS, or PhD in Computer Science, Mathematics, or equivalent practical experience
Job Responsibility
  • Lead the design, implementation, and deployment of scalable platforms and tools that drive machine learning model training and evaluation workflows across GM
  • Own complex technical projects end-to-end, making key architectural decisions and technical trade-offs
  • Take a holistic view of projects, considering their impact across multiple teams and over a longer timeline
  • Proactively drive technical prioritization
  • Collaborate closely with partner teams to ensure maximum benefit from the systems we build
  • Help shape our team through technical interviewing with high, well-calibrated standards, and play an essential role in recruiting
  • Mentor and onboard junior engineers and interns, helping them grow their careers
What we offer
  • Medical
  • Dental
  • Vision
  • Health Savings Account
  • Flexible Spending Accounts
  • Retirement savings plan
  • Sickness and accident benefits
  • Life insurance
  • Paid vacation & holidays
  • Tuition assistance programs
Employment type: Full-time

Staff Machine Learning Engineer - ML Training Infrastructure

The Role: We are seeking an experienced, technically strong, impact-driven ex...
Location: United States, Austin; Mountain View
Salary: 185000.00 - 335300.00 USD / Year
General Motors
Expiration Date: Until further notice
Requirements
  • Bachelor's degree or higher in Computer Science or a related field, or equivalent practical experience
  • 8+ years of professional software engineering experience
  • 5+ years of specialized experience in AI/ML infrastructure, such as enabling distributed training for large-scale ML models
  • Strong programming skills in Python, with deep proficiency in frameworks such as PyTorch (preferred), TensorFlow, or similar ML systems
  • Proven experience designing and operating distributed systems for ML training, including distributed computing, GPU computing, and cloud environments (AWS, GCP, Azure)
  • Demonstrated track record of leading technically ambiguous, cross-team infrastructure initiatives and driving them to measurable impact
  • Strong architectural judgment and ability to make sound technical tradeoffs across performance, reliability, usability, and cost
  • Willingness to travel to Sunnyvale, CA as needed
  • Comfortable operating in highly ambiguous and dynamic environments
Job Responsibility
  • Define and drive the architecture, design, and development of scalable, reliable, and high-performance ML frameworks and platform capabilities to support model training at scale
  • Lead model training performance analysis and optimization efforts across distributed training workflows, improving scalability, efficiency, and cost across heterogeneous hardware environments
  • Raise the bar on system observability, debuggability, operational excellence, and developer experience across the ML training stack
  • Own large, ambiguous, cross-functional technical initiatives from strategy through execution, including technical roadmap definition, tradeoff analysis, and delivery
  • Influence platform direction by identifying long-term infrastructure investments, setting engineering standards, and driving adoption of best practices across teams
  • Collaborate across organizational boundaries to align requirements, resolve technical disagreements, and integrate new capabilities into the platform ecosystem
  • Mentor engineers through design reviews, technical guidance, and hands-on partnership, while elevating engineering quality across the team
What we offer
  • medical
  • dental
  • vision
  • Health Savings Account
  • Flexible Spending Accounts
  • retirement savings plan
  • sickness and accident benefits
  • life insurance
  • paid vacation & holidays
  • tuition assistance programs
Employment type: Full-time

Staff Software Engineer (Distributed Systems & ML Infrastructure)

An Elite FinTech firm is expanding its world-class engineering team and looking ...
Location: France, Paris
Salary: 160000.00 EUR / Year
Hunter Bond
Expiration Date: Until further notice
Requirements
  • Open to all experience levels
  • Proven experience coding in Python
  • Strong understanding or interest in distributed systems and ML infrastructure
  • Enthusiasm to learn Rust (supported by internal mentorship and training)
  • Excellent academic background
  • Experience in high-stakes, low-latency, mission-critical environments where reliability and performance are non-negotiable
Job Responsibility
  • Design and build high-performance, distributed systems for large-scale ML infrastructure
  • Drive best practices in software architecture, testing, and scalability
  • Lead and collaborate on multiple greenfield initiatives focused on performance, reliability, and scale
What we offer
  • Up to €160,000 + Industry Leading Bonus
  • Work on next-gen distributed systems and ML infrastructure
  • Take ownership of multiple greenfield builds
  • Zero bureaucracy and a genuinely collaborative culture
  • Stunning offices
  • Dedicated time for personal projects every Friday
Employment type: Full-time

Staff ML Engineer, Inference Platform

The ML Inference Platform is part of the AI Compute Platforms organization withi...
Location: United States, Sunnyvale
Salary: 185500.00 - 270000.00 USD / Year
General Motors
Expiration Date: Until further notice
Requirements
  • 8+ years of industry experience, with a focus on machine learning systems or high-performance backend services
  • Expertise in Go, Python, C++, or another relevant programming language
  • Expertise in ML inference and model serving frameworks (Triton, Ray Serve, vLLM, etc.)
  • Strong communication skills and a proven ability to drive cross-functional initiatives
  • Experience working with cloud platforms such as GCP, Azure, or AWS
  • Ability to thrive in a dynamic, multi-tasking environment with ever-evolving priorities
Job Responsibility
  • Design and implement core platform backend software components
  • Collaborate with ML engineers and researchers to understand critical workflows, translate them into platform requirements, and deliver incremental value
  • Lead technical decision-making on model serving strategies, orchestration, caching, model versioning, and auto-scaling mechanisms
  • Drive the development of monitoring, observability, and metrics to ensure reliability, performance, and resource optimization of inference services
  • Proactively research and integrate state-of-the-art model serving frameworks, hardware accelerators, and distributed computing techniques
  • Lead large-scale technical initiatives across GM’s ML ecosystem
  • Raise the engineering bar through technical leadership and by establishing best practices
  • Contribute to open source projects and represent GM in relevant communities
What we offer
  • medical
  • dental
  • vision
  • Health Savings Account
  • Flexible Spending Accounts
  • retirement savings plan
  • sickness and accident benefits
  • life insurance
  • paid vacation & holidays
  • tuition assistance programs
Employment type: Full-time

Staff ML Engineer - Embodied AI Scaling Foundations

At General Motors, our product teams are redefining mobility. Through a human-ce...
Location: United States, Sunnyvale, California
Salary: 189000.00 - 300000.00 USD / Year
General Motors
Expiration Date: Until further notice
Requirements
  • Bachelor’s, Master’s, or PhD in Computer Science, Robotics, Machine Learning, or related field
  • Experience working with large-scale foundation models and alignment methods applied to real-world systems
  • Demonstrated ability to deliver applied ML solutions under real-world constraints and timelines
  • Proficiency in PyTorch and Python
  • Experience building and scaling model training pipelines enabling efficient iteration across teams
  • Strong data processing skills using tools such as NumPy, Pandas, and Apache Spark
  • Strong communication skills enabling effective collaboration across engineering teams
  • Experience deploying ML models into production environments and understanding end-to-end deployment workflows
Job Responsibility
  • Design and implement ML solutions aligned with GM’s autonomous driving objectives
  • Apply techniques such as unsupervised pre-training, imitation learning, reinforcement learning, model scaling/selection, and foundation modeling to solve problems in object detection/tracking/classification, trajectory generation, and safe AI
  • Collaborate with cross-functional teams to deploy models and algorithms into onboard driving systems
  • Contribute to applied research efforts and remain current with advancements in ML frameworks and methods
  • Design and build efficient infrastructure, pipelines, and tooling to facilitate fast-paced model iteration
  • Drive technical execution from prototyping through production deployment, documenting learnings and best practices
  • Support and mentor engineers through technical collaboration and code reviews, fostering knowledge sharing and engineering excellence
What we offer
  • medical
  • dental
  • vision
  • Health Savings Account
  • Flexible Spending Accounts
  • retirement savings plan
  • sickness and accident benefits
  • life insurance
  • paid vacation & holidays
  • tuition assistance programs
Employment type: Full-time

Staff ML Engineer - Applied AI

Applied AI is a horizontal AI team at Uber partnering with product and platform ...
Location: India, Bangalore
Salary: Not provided
Uber
Expiration Date: Until further notice
Requirements
  • 10+ years of industry experience in machine learning or software engineering, with a proven record of delivering ML solutions to production
  • Strong knowledge of machine learning, deep learning, and exposure to generative AI techniques (e.g., transformers, LLMs, diffusion)
  • Experience designing and scaling ML systems or platforms, including training pipelines, serving infrastructure, and model lifecycle tooling
  • Fluency in ML frameworks (e.g., PyTorch, TensorFlow, JAX) and development in Python and/or scalable backend languages (e.g., Java, Go)
  • Excellent collaboration and communication skills with the ability to work across teams and functions
Job Responsibility
  • Design and implement ML-driven systems that power core Uber experiences, with a focus on scalability, reliability, and performance
  • Lead the technical execution of key projects involving classical ML, deep learning, and generative AI technologies (e.g., LLMs, multimodal models)
  • Collaborate closely with product, data science, and infrastructure teams to develop AI solutions from ideation through production deployment
  • Contribute to and influence the technical direction for Applied AI, particularly around system design, model architecture, and infrastructure decisions
  • Champion engineering best practices in ML development, including experimentation workflows, model versioning, evaluation, monitoring, and responsible AI
  • Provide mentorship to engineers on the team and across partner orgs to help raise the technical bar
Employment type: Full-time

Sr Staff ML Engineer - Production & MLOps Focus - GenAI Security Platform

Join our team building a cutting-edge multi-tenanted GenAI Security Platform tha...
Location: India, Bengaluru
Salary: Not provided
Palo Alto Networks
Expiration Date: Until further notice
Requirements
  • 4+ years of ML engineering experience with hands-on LLM/NLP work
  • Practical experience building LLM-based applications (agents, multi-turn systems, evaluators)
  • Understanding of model fine-tuning, embedding optimization, and prompt engineering
  • Experience with LLM APIs (OpenAI, Anthropic, AWS Bedrock, Azure OpenAI)
  • Knowledge of LLM orchestration frameworks (LangChain, LlamaIndex, Pydantic AI, custom solutions)
  • Familiarity with model architectures and when to fine-tune vs prompt engineer
  • Strong experience deploying ML models to production at scale
  • Experience with model serving frameworks (vLLM preferred; TensorRT-LLM, Ray Serve, or similar a plus)
  • Kubernetes and Docker proficiency for ML workload orchestration
Job Responsibility
  • Build and deploy LLM-based agents and multi-step evaluation workflows
  • Fine-tune models, optimize embeddings, and manage model weights and artifacts
  • Deploy and scale ML services on Kubernetes with proper monitoring and resource management
  • Implement experiment tracking, model versioning, and deployment automation
  • Develop observability dashboards for ML metrics, costs, latency, and quality
  • Optimize LLM API usage through caching, batching, and intelligent routing strategies
  • Manage vector database infrastructure and semantic search systems
  • Create CI/CD pipelines for ML artifacts and automated testing frameworks
  • Collaborate with ML researchers to productionize prototypes and scale experiments
Employment type: Full-time