
ML Infrastructure Engineer

United States, Palo Alto · 124000.00 - 186000.00 USD / Year · Job Posted March 08, 2026

Job Description

As a member of our software engineering infra team, you'll solve technical challenges, including upgrading and implementing state-of-the-art software infrastructure. The team builds a high-performance, highly available, globally distributed ecosystem of services that in turn provides the foundation for rapid development of new systems that integrate into that ecosystem and improve it. Our infra team is responsible for providing and maintaining scalable, high-throughput, low-latency infrastructure for our bidding ecosystem. You will be exposed to the whole model delivery pipeline, including training, serving, and optimization.

Job Responsibility

  • Design, develop, and maintain large-scale distributed systems
  • Collaborate with various engineering teams to meet a wide range of technological challenges
  • Work closely with our research science team and backend team to contribute and influence the roadmap of our products and technologies
  • Influence and inspire team members
  • Speed up the performance of our online models
  • Optimize the model delivery pipeline

Requirements

  • 0-2 years of experience
  • Minimum of a BS and/or MS in Computer Science
  • Excellent knowledge of computer science fundamentals including data structures, algorithms, and coding
  • Experience with C++, Python, and/or Golang is a plus
  • Experience independently creating and maintaining projects

Nice to have

  • Good experience with C++, Python and/or Golang

What we offer

  • Equity eligible
  • Medical, Dental, Vision, Life, Disability insurance
  • 401(k) Retirement Plan
  • Unlimited Discretionary Time Off
  • 10 paid holidays per year
  • 80 hours of paid sick leave per year

Similar Jobs for ML Infrastructure Engineer

8 matching positions

Senior ML Infrastructure / ML DevOps Engineer

We are looking for a Senior ML Infrastructure / DevOps Engineer who loves Linux,...
Salary: Not provided
Pathway
Expiration Date: Until further notice
Requirements
  • Former or current Linux / systems / network administrator comfortable living in the shell and debugging at OS and network layers (systemd, filesystems, iptables/security groups, DNS, TLS, routing)
  • 5+ years of experience in DevOps/SRE/Platform/Infrastructure roles running production systems, ideally with high‑performance or ML workloads
  • Deep familiarity with Linux as a daily driver, including shell scripting and configuration of clusters and services
  • Strong experience with workload management, containerization, and orchestration (Slurm, Docker, Kubernetes) in production environments
  • Solid understanding of CI/CD tools and workflows (GitHub Actions, GitLab CI, Jenkins, etc.), including building pipelines from scratch
  • Hands-on cloud infrastructure experience (AWS, GCP, Azure), especially around GPU instances, VPC/networking, storage, and managed ML services (e.g., SageMaker HyperPod, Vertex AI)
  • Proficiency with infrastructure as code (Terraform, CloudFormation, or similar) and a bias toward automation over manual operations
  • Experience with monitoring and logging stacks (Grafana, Prometheus, Loki, CloudWatch, or equivalents)
  • Familiarity with ML pipeline and experiment orchestration tools (MLflow, Kubeflow, Airflow, Metaflow, etc.) and with model/version management
  • Solid programming skills in Python, plus the ability to read and debug code that uses common ML libraries (PyTorch, TensorFlow) even if you are not a full‑time model developer
Job Responsibility
  • Design, operate, and scale GPU and CPU clusters for ML training and inference (Slurm, Kubernetes, autoscaling, queueing, quota management)
  • Automate infrastructure provisioning and configuration using infrastructure‑as‑code (Terraform, CloudFormation, cluster‑tooling) and configuration management
  • Build and maintain robust ML pipelines (data ingestion, training, evaluation, deployment) with strong guarantees around reproducibility, traceability, and rollback
  • Implement and evolve ML‑centric CI/CD: testing, packaging, deployment of models and services
  • Own monitoring, logging, and alerting across training and serving: GPU/CPU utilization, latency, throughput, failures, and data/model drift (Grafana, Prometheus, Loki, CloudWatch)
  • Work with terabyte‑scale datasets and the associated storage, networking, and performance challenges
  • Partner closely with ML engineers and researchers to productionize their work, translating experimental setups into robust, scalable systems
  • Participate in on‑call rotation for critical ML infrastructure and lead incident response and post‑mortems when things break
What we offer
  • Intellectually stimulating work environment
  • Be a pioneer: you get to work with realtime data processing & AI
  • Work in one of the hottest AI startups, with exciting career prospects
  • Team members are distributed across the world
  • Responsibilities and ability to make significant contribution to the company’s success
  • Inclusive workplace culture
  • Fulltime

Senior ML Infrastructure Engineer, Inference Platform

About the Team: The ML Inference Platform is part of the AV ML Infrastructure or...
Location: United States, Austin, Texas; Mountain View, California; Sunnyvale, California
Salary: 155420.00 USD / Year
General Motors
Expiration Date: Until further notice
Requirements
  • 5+ years of industry experience, with focus on machine learning systems or high performance backend services
  • Expertise in either Python, C++ or other relevant coding languages
  • Expertise in ML inference and model serving frameworks (Triton, Ray Serve, vLLM, etc.)
  • Strong communication skills and a proven ability to drive cross-functional initiatives
  • Ability to thrive in a dynamic, multi-tasking environment with ever-evolving priorities
Job Responsibility
  • Design and implement core platform backend software components
  • Collaborate with ML engineers and researchers to understand critical workflows, parse them to platform requirements, and deliver incremental value
  • Lead technical decision-making on model serving strategies, orchestration, caching, model versioning, and auto-scaling mechanisms for highly optimized use of accelerators
  • Drive the development of monitoring, observability, and metrics to ensure reliability, performance, and resource optimization of inference services
  • Proactively research and integrate state-of-the-art model serving frameworks, hardware accelerators, and distributed computing techniques
  • Lead technical initiatives across GM’s ML ecosystem
  • Raise the engineering bar through technical leadership, establishing best practices
  • Contribute to open source projects
  • Represent GM in relevant communities
What we offer
  • Medical
  • Dental
  • Vision
  • Health Savings Account
  • Flexible Spending Accounts
  • Retirement savings plan
  • Sickness and accident benefits
  • Life insurance
  • Paid vacation & holidays
  • Tuition assistance programs
  • Fulltime

Senior ML Infrastructure Engineer - Embodied AI

At General Motors, our product teams are redefining mobility. Through a human-ce...
Location: United States, Sunnyvale
Salary: 153200.00 - 234100.00 USD / Year
General Motors
Expiration Date: Until further notice
Requirements
  • 3+ years of experience working on large-scale distributed systems, applications, or ML infrastructure
  • Experience designing robust services or frameworks with durable, well-designed APIs
  • Solid understanding of machine learning workflows and hands-on experience applying ML systems in production environments
  • Experience building reliable, high-performance, and cost-efficient systems on modern cloud infrastructure
  • Practical experience across the ML development lifecycle, including model training, deployment, and MLOps practices
  • Strong cross-functional collaboration skills across teams and organizations
  • Strong coding skills in Python or C++
  • Interest in autonomous driving and large-scale ML systems
  • BS, MS, or PhD in Computer Science, Mathematics, or equivalent practical experience
Job Responsibility
  • Design, implement, and deploy scalable platforms and tools supporting machine learning training and evaluation workflows across GM
  • Drive complex technical projects with strong ownership of implementation, code quality, and system reliability
  • Contribute to technical design discussions and architectural decisions while collaborating with senior engineers and technical leads
  • Work closely with partner teams to ensure platforms meet real-world ML development needs and maximize adoption
  • Identify technical improvements and help prioritize platform investments to improve performance, reliability, and developer productivity
  • Contribute to a strong engineering culture through high-quality code reviews, documentation, and operational excellence
  • Support onboarding and mentoring of junior engineers and interns
What we offer
  • Medical
  • Dental
  • Vision
  • Health Savings Account
  • Flexible Spending Accounts
  • Retirement savings plan
  • Sickness and accident benefits
  • Life insurance
  • Paid vacation & holidays
  • Tuition assistance programs
  • Fulltime

Staff ML Infrastructure Engineer - Embodied AI

At General Motors, our product teams are redefining mobility. Through a human-ce...
Location: United States, Sunnyvale
Salary: 189300.00 - 290700.00 USD / Year
General Motors
Expiration Date: Until further notice
Requirements
  • 5+ years of experience building large-scale distributed systems, applications, or advanced ML systems
  • Proven track record of designing robust frameworks with high-quality, durable APIs
  • Deep understanding of machine learning algorithms with hands‑on application
  • Expertise in building reliable, high-performance, and cost-efficient systems on modern cloud infrastructure
  • End-to-end experience across the ML development lifecycle, including MLOps practices
  • Strong cross-functional collaboration skills across teams and organizations
  • Exceptional coding skills in Python or C++
  • Strong interest in autonomous driving and its transformative potential
  • BS, MS, or PhD in Computer Science, Mathematics, or equivalent practical experience
Job Responsibility
  • Lead the design, implementation, and deployment of scalable platforms and tools that drive machine learning model training and evaluation workflows across GM
  • Own complex technical projects end-to-end, making key architectural decisions and technical trade-offs
  • Take a holistic view of projects, considering their impact across multiple teams, and across a longer timeline
  • Proactively drive technical prioritization
  • Collaborate closely with partner teams to ensure maximum benefit from the systems we build
  • Help shape our team through technical interviewing with high, well-calibrated standards, and play an essential role in recruiting
  • Mentor and onboard junior engineers and interns, helping them grow their careers
What we offer
  • Medical
  • Dental
  • Vision
  • Health Savings Account
  • Flexible Spending Accounts
  • Retirement savings plan
  • Sickness and accident benefits
  • Life insurance
  • Paid vacation & holidays
  • Tuition assistance programs
  • Fulltime

Senior Software Engineer - ML Infrastructure

We build simple yet innovative consumer products and developer APIs that shape h...
Location: United States, New York
Salary: 190800.00 - 286800.00 USD / Year
Plaid
Expiration Date: Until further notice
Requirements
  • 5+ years of industry experience as a software engineer, with strong focus on ML/AI infrastructure or large-scale distributed systems
  • Hands-on expertise in building and operating ML platforms (e.g., feature stores, data pipelines, training/inference frameworks)
  • Proven experience delivering reliable and scalable infrastructure in production
  • Solid understanding of ML Ops concepts and tooling, as well as best practices for observability, security, and reliability
  • Strong communication skills and ability to collaborate across teams
Job Responsibility
  • Design and implement large-scale ML infrastructure, including feature stores, pipelines, deployment tooling, and inference systems
  • Drive the rollout of Plaid’s next-generation feature store to improve reliability and velocity of model development
  • Help define and evangelize an ML Ops “golden path” for secure, scalable model training, deployment, and monitoring
  • Ensure operational excellence of ML pipelines and services, including reliability, scalability, performance, and cost efficiency
  • Collaborate with ML product teams to understand requirements and deliver solutions that accelerate experimentation and iteration
  • Contribute to technical strategy and architecture discussions within the team
  • Mentor and support other engineers through code reviews, design discussions, and technical guidance
  • Fulltime

Staff ML Infrastructure Engineer

The AI Validation Platform team owns the cloud-agnostic, reliable, and cost-effi...
Location: United States, Austin, Texas; Sunnyvale, California
Salary: 197000.00 - 326000.00 USD / Year
General Motors
Expiration Date: Until further notice
Requirements
  • 8+ years of industry experience, with a focus on high performance backend services
  • Strong expertise in container technologies like Docker and Kubernetes
  • Strong expertise in Go, or other similar coding languages
  • Experience working with cloud platforms such as GCP, Azure, or AWS
  • Experience in delivering cross-functional initiatives
  • Strong communication skills and a proven ability to drive cross-functional initiatives
  • Ability to thrive in a dynamic, multi-tasking environment with ever-evolving priorities
Job Responsibility
  • Collaborate with Simulation engineers, ML engineers and researchers to understand critical workflows, parse them to platform requirements, and deliver incremental value
  • Own the technical roadmap, lead technical decisions on Compute architecture, caching, capacity provisioning, and auto-scaling mechanisms
  • Drive the development of monitoring, observability, and metrics to ensure reliability, performance, and resource optimization
  • Proactively research and integrate frameworks, hardware accelerators, and distributed computing techniques
  • Lead large-scale technical initiatives across GM’s ML infrastructure
  • Raise the engineering bar through technical leadership and by establishing best practices
What we offer
  • Medical
  • Dental
  • Vision
  • Health Savings Account
  • Flexible Spending Accounts
  • Retirement savings plan
  • Sickness and accident benefits
  • Life insurance
  • Paid vacation & holidays
  • Tuition assistance programs
  • Fulltime

Staff ML Infrastructure Engineer

Playlab seeks a Staff Machine Learning Engineer to join our growing Engineering ...
Salary: 180000.00 - 240000.00 USD / Year
Playlab
Expiration Date: Until further notice
Requirements
  • 7+ years building production ML/data systems, with experience in ML operations and infrastructure
  • Strong experience with model serving, orchestration, and optimization in production environments
  • Proficient in Python and data pipeline technologies (Airflow, ETL tools, etc.)
  • Experience with cloud infrastructure (AWS preferred) and containerization (Kubernetes, Docker)
  • Experience with cost optimization strategies for LLM-based systems
  • Thrive in high-agency, high collaboration cultures
  • Great communication that makes working remote-first work
Job Responsibility
  • Design, build, and maintain production ML infrastructure that balances performance, cost, and reliability
  • Own data quality and research dataset creation - ensure data is properly scrubbed, documented, and useful for research partners
  • Stay on top of ML infrastructure technologies and techniques - from model serving to cost optimization to observability tools
  • Work cross-functionally with ML engineers, backend engineers, and product to ensure infrastructure supports real needs
  • Balance innovation with operational excellence - experiment with new approaches while maintaining system reliability and data quality
  • Mentor engineers on ML operations, cost optimization, and production ML best practices
  • Fulltime

Senior Software Engineer, ML Infrastructure

LMArena is seeking a Senior Software Engineer (Infrastructure) to lead the desig...
Location: United States, Bay Area
Salary: Not provided
Arena Intelligence, Inc.
Expiration Date: Until further notice
Requirements
  • 5+ years of experience in software engineering, with a focus on infrastructure or large-scale data and ML systems
  • Deep expertise in distributed systems, stream processing, and scalable backend architecture
  • Proven ability to design and operate low-latency, high-throughput, and fault-tolerant systems
  • Strong foundation in systems design, performance tuning, and building reliable, fault-tolerant services
  • Comfortable in a dynamic, high-ownership, fast-growth environment
Job Responsibility
  • Architect and scale high-performance, real-time API and data systems
  • Design and implement low-latency pipelines to process and analyze large-scale event streams
  • Ensure reliability through robust data integrity, availability, and consistency mechanisms
  • Mentor and guide engineers on infrastructure best practices, architecture, and performance tuning
  • Collaborate cross-functionally with AI researchers, product leaders, and engineers to anticipate evolving infrastructure needs and deliver resilient, extensible systems
What we offer
  • Comprehensive health and wellness benefits, including medical, dental, vision, and additional support programs.
  • The opportunity to work on cutting-edge AI with a small, mission-driven team
  • A culture that values transparency, trust, and community impact
  • Fulltime