ML Infrastructure Engineer Job at AppLovin (Palo Alto)

Senior ML Infrastructure / ML DevOps Engineer

We are looking for a Senior ML Infrastructure / DevOps Engineer who loves Linux,...

Location

Salary:

Not provided

Pathway

Expiration Date

Until further notice

Requirements

Former or current Linux / systems / network administrator comfortable living in the shell and debugging at OS and network layers (systemd, filesystems, iptables/security groups, DNS, TLS, routing)
5+ years of experience in DevOps/SRE/Platform/Infrastructure roles running production systems, ideally with high‑performance or ML workloads
Deep familiarity with Linux as a daily driver, including shell scripting and configuration of clusters and services
Strong experience with workload management, containerization, and orchestration (Slurm, Docker, Kubernetes) in production environments
Solid understanding of CI/CD tools and workflows (GitHub Actions, GitLab CI, Jenkins, etc.), including building pipelines from scratch
Hands-on cloud infrastructure experience (AWS, GCP, Azure), especially around GPU instances, VPC/networking, storage, and managed ML services (e.g., SageMaker HyperPod, Vertex AI)
Proficiency with infrastructure as code (Terraform, CloudFormation, or similar) and a bias toward automation over manual operations
Experience with monitoring and logging stacks (Grafana, Prometheus, Loki, CloudWatch, or equivalents)
Familiarity with ML pipeline and experiment orchestration tools (MLflow, Kubeflow, Airflow, Metaflow, etc.) and with model/version management
Solid programming skills in Python, plus the ability to read and debug code that uses common ML libraries (PyTorch, TensorFlow) even if you are not a full‑time model developer

Job Responsibility

Design, operate, and scale GPU and CPU clusters for ML training and inference (Slurm, Kubernetes, autoscaling, queueing, quota management)
Automate infrastructure provisioning and configuration using infrastructure‑as‑code (Terraform, CloudFormation, cluster‑tooling) and configuration management
Build and maintain robust ML pipelines (data ingestion, training, evaluation, deployment) with strong guarantees around reproducibility, traceability, and rollback
Implement and evolve ML‑centric CI/CD: testing, packaging, deployment of models and services
Own monitoring, logging, and alerting across training and serving: GPU/CPU utilization, latency, throughput, failures, and data/model drift (Grafana, Prometheus, Loki, CloudWatch)
Work with terabyte‑scale datasets and the associated storage, networking, and performance challenges
Partner closely with ML engineers and researchers to productionize their work, translating experimental setups into robust, scalable systems
Participate in on‑call rotation for critical ML infrastructure and lead incident response and post‑mortems when things break

What we offer

Intellectually stimulating work environment
Be a pioneer: you get to work with real-time data processing and AI
Work in one of the hottest AI startups, with exciting career prospects
Team members are distributed across the world
Responsibility and the ability to make a significant contribution to the company’s success
Inclusive workplace culture

Fulltime

Senior ML Infrastructure Engineer, Inference Platform

About the Team: The ML Inference Platform is part of the AV ML Infrastructure or...

Location

United States, Austin, Texas; Mountain View, California; Sunnyvale, California

Salary:

155420.00 USD / Year

General Motors

Expiration Date

Until further notice

Requirements

5+ years of industry experience, with focus on machine learning systems or high performance backend services
Expertise in Python, C++, or other relevant coding languages
Expertise in ML inference and model serving frameworks (Triton, Ray Serve, vLLM, etc.)
Strong communication skills and a proven ability to drive cross-functional initiatives
Ability to thrive in a dynamic, multi-tasking environment with ever-evolving priorities

Job Responsibility

Design and implement core platform backend software components
Collaborate with ML engineers and researchers to understand critical workflows, translate them into platform requirements, and deliver incremental value
Lead technical decision-making on model serving strategies, orchestration, caching, model versioning, and auto-scaling mechanisms for highly optimized use of accelerators
Drive the development of monitoring, observability, and metrics to ensure reliability, performance, and resource optimization of inference services
Proactively research and integrate state-of-the-art model serving frameworks, hardware accelerators, and distributed computing techniques
Lead technical initiatives across GM’s ML ecosystem
Raise the engineering bar through technical leadership, establishing best practices
Contribute to open source projects and represent GM in relevant communities

What we offer

medical
dental
vision
Health Savings Account
Flexible Spending Accounts
retirement savings plan
sickness and accident benefits
life insurance
paid vacation & holidays
tuition assistance programs

Fulltime

Senior ML Infrastructure Engineer - Embodied AI

At General Motors, our product teams are redefining mobility. Through a human-ce...

Location

United States, Sunnyvale

Salary:

153200.00 - 234100.00 USD / Year

General Motors

Expiration Date

Until further notice

Requirements

3+ years of experience working on large-scale distributed systems, applications, or ML infrastructure
Experience designing robust services or frameworks with durable, well-designed APIs
Solid understanding of machine learning workflows and hands-on experience applying ML systems in production environments
Experience building reliable, high-performance, and cost-efficient systems on modern cloud infrastructure
Practical experience across the ML development lifecycle, including model training, deployment, and MLOps practices
Strong cross-functional collaboration skills across teams and organizations
Strong coding skills in Python or C++
Interest in autonomous driving and large-scale ML systems
BS, MS, or PhD in Computer Science, Mathematics, or equivalent practical experience

Job Responsibility

Design, implement, and deploy scalable platforms and tools supporting machine learning training and evaluation workflows across GM
Drive complex technical projects with strong ownership of implementation, code quality, and system reliability
Contribute to technical design discussions and architectural decisions while collaborating with senior engineers and technical leads
Work closely with partner teams to ensure platforms meet real-world ML development needs and maximize adoption
Identify technical improvements and help prioritize platform investments to improve performance, reliability, and developer productivity
Contribute to a strong engineering culture through high-quality code reviews, documentation, and operational excellence
Support onboarding and mentoring of junior engineers and interns

What we offer

medical
dental
vision
Health Savings Account
Flexible Spending Accounts
retirement savings plan
sickness and accident benefits
life insurance
paid vacation & holidays
tuition assistance programs

Fulltime

Staff ML Infrastructure Engineer - Embodied AI

At General Motors, our product teams are redefining mobility. Through a human-ce...

Location

United States, Sunnyvale

Salary:

189300.00 - 290700.00 USD / Year

General Motors

Expiration Date

Until further notice

Requirements

5+ years of experience building large-scale distributed systems, applications, or advanced ML systems
Proven track record of designing robust frameworks with high-quality, durable APIs
Deep understanding of machine learning algorithms with hands‑on application
Expertise in building reliable, high-performance, and cost-efficient systems on modern cloud infrastructure
End-to-end experience across the ML development lifecycle, including MLOps practices
Strong cross-functional collaboration skills across teams and organizations
Exceptional coding skills in Python or C++
Strong interest in autonomous driving and its transformative potential
BS, MS, or PhD in Computer Science, Mathematics, or equivalent practical experience

Job Responsibility

Lead the design, implementation, and deployment of scalable platforms and tools that drive machine learning model training and evaluation workflows across GM
Own complex technical projects end-to-end, making key architectural decisions and technical trade-offs
Take a holistic view of projects, considering their impact across multiple teams, and across a longer timeline
Proactively drive technical prioritization
Collaborate closely with partner teams to ensure maximum benefit from the systems we build
Help shape our team through technical interviewing with high, well-calibrated standards, and play an essential role in recruiting
Mentor and onboard junior engineers and interns, helping them grow their careers

What we offer

Medical
Dental
Vision
Health Savings Account
Flexible Spending Accounts
Retirement savings plan
Sickness and accident benefits
Life insurance
Paid vacation & holidays
Tuition assistance programs

Fulltime

Senior Software Engineer - ML Infrastructure

We build simple yet innovative consumer products and developer APIs that shape h...

Location

United States, New York

Salary:

190800.00 - 286800.00 USD / Year

Plaid

Expiration Date

Until further notice

Requirements

5+ years of industry experience as a software engineer, with strong focus on ML/AI infrastructure or large-scale distributed systems
Hands-on expertise in building and operating ML platforms (e.g., feature stores, data pipelines, training/inference frameworks)
Proven experience delivering reliable and scalable infrastructure in production
Solid understanding of ML Ops concepts and tooling, as well as best practices for observability, security, and reliability
Strong communication skills and ability to collaborate across teams

Job Responsibility

Design and implement large-scale ML infrastructure, including feature stores, pipelines, deployment tooling, and inference systems
Drive the rollout of Plaid’s next-generation feature store to improve reliability and velocity of model development
Help define and evangelize an ML Ops “golden path” for secure, scalable model training, deployment, and monitoring
Ensure operational excellence of ML pipelines and services, including reliability, scalability, performance, and cost efficiency
Collaborate with ML product teams to understand requirements and deliver solutions that accelerate experimentation and iteration
Contribute to technical strategy and architecture discussions within the team
Mentor and support other engineers through code reviews, design discussions, and technical guidance

Fulltime

Staff ML Infrastructure Engineer

The AI Validation Platform team owns the cloud-agnostic, reliable, and cost-effi...

Location

United States, Austin, Texas; Sunnyvale, California

Salary:

197000.00 - 326000.00 USD / Year

General Motors

Expiration Date

Until further notice

Requirements

8+ years of industry experience, with a focus on high performance backend services
Strong expertise in container technologies like Docker and Kubernetes
Strong expertise in Go or similar programming languages
Experience working with cloud platforms such as GCP, Azure, or AWS
Experience in delivering cross-functional initiatives
Strong communication skills and a proven ability to drive cross-functional initiatives
Ability to thrive in a dynamic, multi-tasking environment with ever-evolving priorities

Job Responsibility

Collaborate with simulation engineers, ML engineers, and researchers to understand critical workflows, translate them into platform requirements, and deliver incremental value
Own the technical roadmap and lead technical decisions on compute architecture, caching, capacity provisioning, and auto-scaling mechanisms
Drive the development of monitoring, observability, and metrics to ensure reliability, performance, and resource optimization
Proactively research and integrate frameworks, hardware accelerators, and distributed computing techniques
Lead large-scale technical initiatives across GM’s ML infrastructure
Raise the engineering bar through technical leadership and by establishing best practices

What we offer

medical
dental
vision
Health Savings Account
Flexible Spending Accounts
retirement savings plan
sickness and accident benefits
life insurance
paid vacation & holidays
tuition assistance programs

Fulltime

Staff ML Infrastructure Engineer

Playlab seeks a Staff Machine Learning Engineer to join our growing Engineering ...

Location

Salary:

180000.00 - 240000.00 USD / Year

Playlab

Expiration Date

Until further notice

Requirements

7+ years building production ML/data systems, with experience in ML operations and infrastructure
Strong experience with model serving, orchestration, and optimization in production environments
Proficient in Python and data pipeline technologies (Airflow, ETL tools, etc.)
Experience with cloud infrastructure (AWS preferred) and containerization (Kubernetes, Docker)
Experience with cost optimization strategies for LLM-based systems
Thrive in high-agency, high-collaboration cultures
Strong communication skills that make remote-first work effective

Job Responsibility

Design, build, and maintain production ML infrastructure that balances performance, cost, and reliability
Own data quality and research dataset creation - ensure data is properly scrubbed, documented, and useful for research partners
Stay on top of ML infrastructure technologies and techniques - from model serving to cost optimization to observability tools
Work cross-functionally with ML engineers, backend engineers, and product to ensure infrastructure supports real needs
Balance innovation with operational excellence - experiment with new approaches while maintaining system reliability and data quality
Mentor engineers on ML operations, cost optimization, and production ML best practices

Fulltime

Senior Software Engineer, ML Infrastructure

LMArena is seeking a Senior Software Engineer (Infrastructure) to lead the desig...

Location

United States, Bay Area

Salary:

Not provided

Arena Intelligence, Inc.

Expiration Date

Until further notice

Requirements

5+ years of experience in software engineering, with a focus on infrastructure or large-scale data and ML systems
Deep expertise in distributed systems, stream processing, and scalable backend architecture
Proven ability to design and operate low-latency, high-throughput, and fault-tolerant systems
Strong foundation in systems design, performance tuning, and building reliable, fault-tolerant services
Comfortable in a dynamic, high-ownership, fast-growth environment

Job Responsibility

Architect and scale high-performance, real-time API and data systems
Design and implement low-latency pipelines to process and analyze large-scale event streams
Ensure reliability through robust data integrity, availability, and consistency mechanisms
Mentor and guide engineers on infrastructure best practices, architecture, and performance tuning
Collaborate cross-functionally with AI researchers, product leaders, and engineers to anticipate evolving infrastructure needs and deliver resilient, extensible systems

What we offer

Comprehensive health and wellness benefits, including medical, dental, vision, and additional support programs.
The opportunity to work on cutting-edge AI with a small, mission-driven team
A culture that values transparency, trust, and community impact

Fulltime
