Staff ML Infrastructure Engineer Job at Playlab

Staff ML Infrastructure Engineer

The AI Validation Platform team owns the cloud-agnostic, reliable, and cost-effi...

Location

United States, Austin, Texas; Sunnyvale, California

Salary:

197000.00 - 326000.00 USD / Year

General Motors

Expiration Date

Until further notice

Requirements

8+ years of industry experience, with a focus on high performance backend services
Strong expertise in container technologies like Docker and Kubernetes
Strong expertise in Go, or other similar coding languages
Experience working with cloud platforms such as GCP, Azure, or AWS
Experience in delivering cross-functional initiatives
Strong communication skills and a proven ability to drive cross-functional initiatives
Ability to thrive in a dynamic, multi-tasking environment with ever-evolving priorities

Job Responsibility

Collaborate with simulation engineers, ML engineers, and researchers to understand critical workflows, translate them into platform requirements, and deliver incremental value
Own the technical roadmap and lead technical decisions on compute architecture, caching, capacity provisioning, and auto-scaling mechanisms
Drive the development of monitoring, observability, and metrics to ensure reliability, performance, and resource optimization
Proactively research and integrate frameworks, hardware accelerators, and distributed computing techniques
Lead large-scale technical initiatives across GM’s ML infrastructure
Raise the engineering bar through technical leadership and by establishing best practices

What we offer

Medical
Dental
Vision
Health Savings Account
Flexible Spending Accounts
Retirement savings plan
Sickness and accident benefits
Life insurance
Paid vacation & holidays
Tuition assistance programs

Fulltime

Staff ML Infrastructure Engineer - Embodied AI

At General Motors, our product teams are redefining mobility. Through a human-ce...

Location

United States, Sunnyvale

Salary:

189300.00 - 290700.00 USD / Year

General Motors

Expiration Date

Until further notice

Requirements

5+ years of experience building large-scale distributed systems, applications, or advanced ML systems
Proven track record of designing robust frameworks with high-quality, durable APIs
Deep understanding of machine learning algorithms with hands-on application
Expertise in building reliable, high-performance, and cost-efficient systems on modern cloud infrastructure
End-to-end experience across the ML development lifecycle, including MLOps practices
Strong cross-functional collaboration skills across teams and organizations
Exceptional coding skills in Python or C++
Strong interest in autonomous driving and its transformative potential
BS, MS, or PhD in Computer Science, Mathematics, or equivalent practical experience

Job Responsibility

Lead the design, implementation, and deployment of scalable platforms and tools that drive machine learning model training and evaluation workflows across GM
Own complex technical projects end-to-end, making key architectural decisions and technical trade-offs
Take a holistic view of projects, considering their impact across multiple teams and over a longer time horizon
Proactively drive technical prioritization
Collaborate closely with partner teams to ensure maximum benefit from the systems we build
Help shape our team through technical interviewing with high, well-calibrated standards, and play an essential role in recruiting
Mentor and onboard junior engineers and interns, helping them grow their careers

What we offer

Medical
Dental
Vision
Health Savings Account
Flexible Spending Accounts
Retirement savings plan
Sickness and accident benefits
Life insurance
Paid vacation & holidays
Tuition assistance programs

Fulltime

Staff Machine Learning Engineer - ML Training Infrastructure

The Role: We are seeking an experienced, technically strong, impact-driven ex...

Location

United States, Austin; Mountain View

Salary:

185000.00 - 335300.00 USD / Year

General Motors

Expiration Date

Until further notice

Requirements

Bachelor's degree or higher in Computer Science or a related field, or equivalent practical experience
8+ years of professional software engineering experience
5+ years of specialized experience in AI/ML infrastructure, such as enabling distributed training for large-scale ML models
Strong programming skills in Python, with deep proficiency in frameworks such as PyTorch (preferred), TensorFlow, or similar ML systems
Proven experience designing and operating distributed systems for ML training, including distributed computing, GPU computing, and cloud environments (AWS, GCP, Azure)
Demonstrated track record of leading technically ambiguous, cross-team infrastructure initiatives and driving them to measurable impact
Strong architectural judgment and ability to make sound technical tradeoffs across performance, reliability, usability, and cost
Willingness to travel to Sunnyvale, CA as needed
Comfortable operating in highly ambiguous and dynamic environments

Job Responsibility

Define and drive the architecture, design, and development of scalable, reliable, and high-performance ML frameworks and platform capabilities to support model training at scale
Lead model training performance analysis and optimization efforts across distributed training workflows, improving scalability, efficiency, and cost across heterogeneous hardware environments
Raise the bar on system observability, debuggability, operational excellence, and developer experience across the ML training stack
Own large, ambiguous, cross-functional technical initiatives from strategy through execution, including technical roadmap definition, tradeoff analysis, and delivery
Influence platform direction by identifying long-term infrastructure investments, setting engineering standards, and driving adoption of best practices across teams
Collaborate across organizational boundaries to align requirements, resolve technical disagreements, and integrate new capabilities into the platform ecosystem
Mentor engineers through design reviews, technical guidance, and hands-on partnership, while elevating engineering quality across the team

What we offer

Medical
Dental
Vision
Health Savings Account
Flexible Spending Accounts
Retirement savings plan
Sickness and accident benefits
Life insurance
Paid vacation & holidays
Tuition assistance programs

Fulltime

Staff Software Engineer (Distributed Systems & ML Infrastructure)

An Elite FinTech firm is expanding its world-class engineering team and looking ...

Location

France, Paris

Salary:

160000.00 EUR / Year

Hunter Bond

Expiration Date

Until further notice

Requirements

Open to all experience levels
Proven experience coding in Python
Strong understanding or interest in distributed systems and ML infrastructure
Enthusiasm to learn Rust (supported by internal mentorship and training)
Excellent academic background
Experience in high-stakes, low-latency, mission-critical environments where reliability and performance are non-negotiable

Job Responsibility

Design and build high-performance, distributed systems for large-scale ML infrastructure
Drive best practices in software architecture, testing, and scalability
Lead and collaborate on multiple greenfield initiatives focused on performance, reliability, and scale

What we offer

Up to €160,000 + industry-leading bonus
Work on next-gen distributed systems and ML infrastructure
Take ownership of multiple greenfield builds
Zero bureaucracy and a genuinely collaborative culture
Stunning offices
Dedicated time for personal projects every Friday

Fulltime

Staff ML Engineer, Inference Platform

The ML Inference Platform is part of the AI Compute Platforms organization withi...

Location

United States, Sunnyvale

Salary:

185500.00 - 270000.00 USD / Year

General Motors

Expiration Date

Until further notice

Requirements

8+ years of industry experience, with a focus on machine learning systems or high performance backend services
Expertise in Go, Python, C++, or other relevant programming languages
Expertise in ML inference and model serving frameworks (Triton, Ray Serve, vLLM, etc.)
Strong communication skills and a proven ability to drive cross-functional initiatives
Experience working with cloud platforms such as GCP, Azure, or AWS
Ability to thrive in a dynamic, multi-tasking environment with ever-evolving priorities

Job Responsibility

Design and implement core platform backend software components
Collaborate with ML engineers and researchers to understand critical workflows, translate them into platform requirements, and deliver incremental value
Lead technical decision-making on model serving strategies, orchestration, caching, model versioning, and auto-scaling mechanisms
Drive the development of monitoring, observability, and metrics to ensure reliability, performance, and resource optimization of inference services
Proactively research and integrate state-of-the-art model serving frameworks, hardware accelerators, and distributed computing techniques
Lead large-scale technical initiatives across GM’s ML ecosystem
Raise the engineering bar through technical leadership and by establishing best practices
Contribute to open source projects and represent GM in relevant communities

What we offer

Medical
Dental
Vision
Health Savings Account
Flexible Spending Accounts
Retirement savings plan
Sickness and accident benefits
Life insurance
Paid vacation & holidays
Tuition assistance programs

Fulltime

Staff ML Engineer - Embodied AI Scaling Foundations

At General Motors, our product teams are redefining mobility. Through a human-ce...

Location

United States, Sunnyvale, California

Salary:

189000.00 - 300000.00 USD / Year

General Motors

Expiration Date

Until further notice

Requirements

Bachelor’s, Master’s, or PhD in Computer Science, Robotics, Machine Learning, or related field
Experience working with large-scale foundation models and alignment methods applied to real-world systems
Demonstrated ability to deliver applied ML solutions under real-world constraints and timelines
Proficiency in PyTorch and Python
Experience building and scaling model training pipelines enabling efficient iteration across teams
Strong data processing skills using tools such as NumPy, Pandas, and Apache Spark
Strong communication skills enabling effective collaboration across engineering teams
Experience deploying ML models into production environments and understanding end-to-end deployment workflows

Job Responsibility

Design and implement ML solutions aligned with GM’s autonomous driving objectives
Apply techniques such as unsupervised pre-training, imitation learning, reinforcement learning, model scaling/selection, and foundation modeling to solve problems in object detection/tracking/classification, trajectory generation, and safe AI
Collaborate with cross-functional teams to deploy models and algorithms into onboard driving systems
Contribute to applied research efforts and remain current with advancements in ML frameworks and methods
Design and build efficient infrastructure, pipelines, and tooling to facilitate fast-paced model iteration
Drive technical execution from prototyping through production deployment, documenting learnings and best practices
Support and mentor engineers through technical collaboration and code reviews, fostering knowledge sharing and engineering excellence

What we offer

Medical
Dental
Vision
Health Savings Account
Flexible Spending Accounts
Retirement savings plan
Sickness and accident benefits
Life insurance
Paid vacation & holidays
Tuition assistance programs

Fulltime

Staff ML Engineer - Applied AI

Applied AI is a horizontal AI team at Uber partnering with product and platform ...

Location

India, Bangalore

Salary:

Not provided

Uber

Expiration Date

Until further notice

Requirements

10+ years of industry experience in machine learning or software engineering, with a proven record of delivering ML solutions to production
Strong knowledge of machine learning, deep learning, and exposure to generative AI techniques (e.g., transformers, LLMs, diffusion)
Experience designing and scaling ML systems or platforms, including training pipelines, serving infrastructure, and model lifecycle tooling
Fluency in ML frameworks (e.g., PyTorch, TensorFlow, JAX) and development in Python and/or scalable backend languages (e.g., Java, Go)
Excellent collaboration and communication skills with the ability to work across teams and functions

Job Responsibility

Design and implement ML-driven systems that power core Uber experiences, with a focus on scalability, reliability, and performance
Lead the technical execution of key projects involving classical ML, deep learning, and generative AI technologies (e.g., LLMs, multimodal models)
Collaborate closely with product, data science, and infrastructure teams to develop AI solutions from ideation through production deployment
Contribute to and influence the technical direction for Applied AI, particularly around system design, model architecture, and infrastructure decisions
Champion engineering best practices in ML development — including experimentation workflows, model versioning, evaluation, monitoring, and responsible AI
Provide mentorship to engineers on the team and across partner orgs to help raise the technical bar

Fulltime

Sr Staff ML Engineer - Production & MLOps Focus - GenAI Security Platform

Join our team building a cutting-edge multi-tenanted GenAI Security Platform tha...

Location

India, Bengaluru

Salary:

Not provided

Palo Alto Networks

Expiration Date

Until further notice

Requirements

4+ years of ML engineering experience with hands-on LLM/NLP work
Practical experience building LLM-based applications (agents, multi-turn systems, evaluators)
Understanding of model fine-tuning, embedding optimization, and prompt engineering
Experience with LLM APIs (OpenAI, Anthropic, AWS Bedrock, Azure OpenAI)
Knowledge of LLM orchestration frameworks (LangChain, LlamaIndex, Pydantic AI, custom solutions)
Familiarity with model architectures and when to fine-tune vs prompt engineer
Strong experience deploying ML models to production at scale
Experience with model serving frameworks (vLLM preferred; TensorRT-LLM, Ray Serve, or similar a plus)
Kubernetes and Docker proficiency for ML workload orchestration

Job Responsibility

Build and deploy LLM-based agents and multi-step evaluation workflows
Fine-tune models, optimize embeddings, and manage model weights and artifacts
Deploy and scale ML services on Kubernetes with proper monitoring and resource management
Implement experiment tracking, model versioning, and deployment automation
Develop observability dashboards for ML metrics, costs, latency, and quality
Optimize LLM API usage through caching, batching, and intelligent routing strategies
Manage vector database infrastructure and semantic search systems
Create CI/CD pipelines for ML artifacts and automated testing frameworks
Collaborate with ML researchers to productionize prototypes and scale experiments

Fulltime
