In this role you will help scale and optimize our training systems and core model code. You'll own critical infrastructure for large-scale training, from managing GPU/TPU compute and job orchestration to building reusable and efficient JAX training pipelines. You'll work closely with researchers and model engineers to translate ideas into experiments, and those experiments into production training runs. This is a hands-on, high-leverage role at the intersection of ML, software engineering, and scalable infrastructure.
Job Responsibilities:
Own training/inference infrastructure: Design, implement, and maintain systems for large-scale model training, including scheduling, job management, checkpointing, and metrics/logging
Scale distributed training: Work with researchers to scale JAX-based training across TPU and GPU clusters with minimal friction (a minimal sketch of this kind of code follows this list)
Optimize performance: Profile and improve memory usage, device utilization, throughput, and distributed synchronization
Enable rapid iteration: Build abstractions for launching, monitoring, debugging, and reproducing experiments
Manage compute resources: Ensure efficient allocation and utilization of cloud-based GPU/TPU compute while controlling cost
Partner with researchers: Translate research needs into infra capabilities and guide best practices for training at scale
Contribute to core training code: Evolve JAX model and training code to support new architectures, modalities, and evaluation metrics
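
For flavor, here is a minimal sketch of the kind of JAX data-parallel training step this work involves. The linear model, shapes, and learning rate are illustrative placeholders, not our production code:

    import jax
    import jax.numpy as jnp
    import numpy as np
    from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

    # Build a 1-D device mesh over all available accelerators (GPU/TPU/CPU).
    mesh = Mesh(np.array(jax.devices()), axis_names=("data",))
    batch_sharding = NamedSharding(mesh, P("data"))  # shard the batch across devices
    replicated = NamedSharding(mesh, P())            # replicate params on every device

    def loss_fn(params, x, y):
        # Placeholder linear model; a real model would be a Flax/Haiku module.
        pred = x @ params["w"] + params["b"]
        return jnp.mean((pred - y) ** 2)

    @jax.jit
    def train_step(params, x, y, lr=1e-3):
        # jit compiles once per shape; under SPMD partitioning, XLA inserts
        # the cross-device gradient reduction automatically.
        loss, grads = jax.value_and_grad(loss_fn)(params, x, y)
        params = jax.tree_util.tree_map(lambda p, g: p - lr * g, params, grads)
        return params, loss

    # Place data and parameters explicitly, then step.
    x = jax.device_put(jax.random.normal(jax.random.PRNGKey(0), (64, 128)),
                       batch_sharding)
    y = jax.device_put(jnp.zeros((64, 1)), batch_sharding)
    params = jax.device_put({"w": jnp.zeros((128, 1)), "b": jnp.zeros(1)},
                            replicated)
    params, loss = train_step(params, x, y)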
Requirements:
Strong software engineering fundamentals and experience building ML training infrastructure or internal platforms
Hands-on experience with large-scale training in JAX (preferred) or PyTorch
Familiarity with distributed training, multi-host setups, data loaders, and evaluation pipelines
Experience managing training workloads with cluster schedulers and cloud platforms (e.g., SLURM, Kubernetes, GCP TPU/GKE, AWS)
Ability to debug and optimize performance bottlenecks across the training stack (see the profiling sketch after this list)
Strong cross-functional communication and ownership mindset
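
As one example of that profiling work, a hypothetical sketch using jax.profiler; the step function and trace path are made up for illustration:

    import jax
    import jax.numpy as jnp

    @jax.jit
    def step(x):
        # Stand-in for a real training step.
        return jnp.tanh(x @ x.T).sum()

    x = jnp.ones((2048, 2048))
    step(x).block_until_ready()  # warm-up call keeps compile time out of the trace

    # Writes a trace viewable in TensorBoard's profiler plugin (or Perfetto).
    with jax.profiler.trace("/tmp/jax-trace"):
        for _ in range(5):
            step(x).block_until_ready()  # block so device work lands in the trace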
Nice to have:
Deep ML systems background (e.g., training compilers, runtime optimization, custom kernels)
Experience operating close to hardware (GPU/TPU performance tuning)
Background in robotics, multimodal models, or large-scale foundation models
Experience designing abstractions that balance researcher flexibility with system reliability