Distributed Training Engineer Job at OpenAI (San Francisco)

Member of Technical Staff - Distributed Training Engineer

Our Training Infrastructure team is building the distributed systems that power ...

Location

United States, San Francisco; Boston

Salary:

Not provided

Liquid AI

Expiration Date

Until further notice

Requirements

Hands-on experience building distributed training infrastructure (PyTorch Distributed DDP/FSDP, DeepSpeed ZeRO, Megatron-LM TP/PP)
Experience diagnosing performance bottlenecks and failure modes (profiling, NCCL/collectives issues, hangs, OOMs, stragglers)
Understanding of hardware accelerators and networking topologies
Experience optimizing data pipelines for ML workloads

Job Responsibility

Design and build core systems that make large training runs fast and reliable
Build scalable distributed training infrastructure for GPU clusters
Implement and tune parallelism/sharding strategies for evolving architectures
Optimize distributed efficiency (topology-aware collectives, comm/compute overlap, straggler mitigation)
Build data loading systems that eliminate I/O bottlenecks for multimodal datasets
Develop checkpointing mechanisms balancing memory constraints with recovery needs
Create monitoring, profiling, and debugging tools for training stability and performance

What we offer

Competitive base salary with equity in a unicorn-stage company
We pay 100% of medical, dental, and vision premiums for employees and dependents
401(k) matching up to 4% of base pay
Unlimited PTO plus company-wide Refill Days throughout the year

Fulltime

Research Engineer - Distributed Training

Building Open Superintelligence Infrastructure. Prime Intellect is building the ...

Location

United States, San Francisco

Salary:

Not provided

Prime Intellect

Expiration Date

Until further notice

Requirements

Strong background in AI/ML engineering, with extensive experience in designing and implementing end-to-end pipelines for training and deploying large-scale AI models
Deep expertise in distributed training techniques, frameworks (e.g., PyTorch Distributed, DeepSpeed, MosaicML’s LLM Foundry), and tools (e.g. Ray) for optimizing the performance and scalability of AI workloads
Experience in large-scale model training, including distributed training techniques such as data, tensor, and pipeline parallelism
Solid understanding of MLOps best practices, including model versioning, experiment tracking, and continuous integration/deployment (CI/CD) pipelines
Passion for advancing the state-of-the-art in decentralized AI model training and democratizing access to AI capabilities for researchers, developers, and businesses worldwide

Job Responsibility

Lead and participate in novel research to build a massive-scale, highly reliable, and secure decentralized training orchestration solution
Optimize the performance, cost, and resource utilization of AI workloads by leveraging the most recent advances in compute and memory optimization techniques
Contribute to the development of our open-source libraries and frameworks for distributed model training
Publish research in top-tier AI conferences such as ICML & NeurIPS
Distill highly technical project outcomes into approachable technical blog posts for our customers and developers
Stay up to date with the latest advancements in AI/ML infrastructure, tools, and decentralized training research, and proactively identify opportunities to enhance our platform's capabilities and user experience

What we offer

Competitive compensation, including equity incentives, aligning your success with the growth and impact of Prime Intellect
Flexible work arrangements, with the option to work remotely or in-person at our offices in San Francisco
Visa sponsorship and relocation assistance for international candidates
Quarterly team off-sites, hackathons, conferences and learning opportunities
Opportunity to work with a talented, hard-working and mission-driven team, united by a shared passion for leveraging technology to accelerate science and AI

Fulltime

Senior AI Infrastructure Engineer - Training Platform

As a Software Engineer on the Machine Learning Infrastructure team, you will bui...

Location

United States, San Francisco; Seattle; New York

Salary:

216000.00 - 270000.00 USD / Year

Scale

Expiration Date

Until further notice

Requirements

5+ years of experience in backend or infrastructure engineering, with at least 2 years focused on orchestrating ML workloads at scale (100+ GPU nodes)
Strong programming skills in one or more languages (e.g. Python, Go, Rust, C++)
Experience with complex compute management systems that cover queueing, quotas, preemption, and gang scheduling
Experience with distributed training infrastructure, such as EFA, Infiniband, and topology-aware scheduling
Experience with distributed storage systems (e.g. Lustre, S3) as they relate to training throughput
Expert-level knowledge of Kubernetes internals (Custom Resources, Operators, Admission Controllers) and how they interact with device plugins for specialized hardware
Familiarity with cloud infrastructure (AWS, GCP) and infrastructure as code (e.g., Terraform)
Proven ability to solve complex problems and work independently in fast-moving environments

Job Responsibility

Architect and scale a multi-tenant orchestration layer that abstracts away the complexity of GPU clusters, ensuring high utilization and seamless job recovery
Design and implement scheduling primitives to optimize the lifecycle of training jobs
Develop deep observability and automated health-checking into the training stack to proactively identify and isolate hardware failures
Evaluate and integrate emerging technologies in the CNCF and AI ecosystem (e.g. Ray, Kueue), making data-driven build vs. buy decisions that balance velocity with long-term maintainability
Work closely with Finance and Procurement teams to drive our capacity planning process
Participate in our team's on-call process to ensure the availability of our services
Own projects end-to-end, from requirements, scoping, and design to implementation, in a highly collaborative and cross-functional environment

What we offer

Comprehensive health, dental and vision coverage
Retirement benefits
Learning and development stipend
Generous PTO
Commuter stipend (subject to eligibility)

Fulltime

Software Engineer (Distributed Systems & ML Infrastructure)

Overview: An Elite FinTech firm is expanding its world-class engineering team an...

Location

Singapore, Singapore

Salary:

250000.00 SGD / Year

Hunter Bond

Expiration Date

Until further notice

Requirements

Open to all experience levels
Proven experience coding in Python
Strong understanding or interest in distributed systems and ML infrastructure
Enthusiasm to learn Rust (supported by internal mentorship and training)
Excellent academic background
Experience in high-stakes, low-latency, mission-critical environments where reliability and performance are non-negotiable

Job Responsibility

Design and build high-performance, distributed systems for large-scale ML infrastructure
Drive best practices in software architecture, testing, and scalability
Lead and collaborate on multiple greenfield initiatives focused on performance, reliability, and scale

What we offer

Industry Leading Bonus
Work on next-gen distributed systems and ML infrastructure
Take ownership of multiple greenfield builds
Zero bureaucracy and a genuinely collaborative culture
Stunning offices
Dedicated time for personal projects every Friday!

Fulltime

Staff Software Engineer (Distributed Systems & ML Infrastructure)

An Elite FinTech firm is expanding its world-class engineering team and looking ...

Location

France, Paris

Salary:

160000.00 EUR / Year

Hunter Bond

Expiration Date

Until further notice

Requirements

Open to all experience levels
Proven experience coding in Python
Strong understanding or interest in distributed systems and ML infrastructure
Enthusiasm to learn Rust (supported by internal mentorship and training)
Excellent academic background
Experience in high-stakes, low-latency, mission-critical environments where reliability and performance are non-negotiable

Job Responsibility

Design and build high-performance, distributed systems for large-scale ML infrastructure
Drive best practices in software architecture, testing, and scalability
Lead and collaborate on multiple greenfield initiatives focused on performance, reliability, and scale

What we offer

Up to €160,000 + Industry Leading Bonus
Work on next-gen distributed systems and ML infrastructure
Take ownership of multiple greenfield builds
Zero bureaucracy and a genuinely collaborative culture
Stunning offices
Dedicated time for personal projects every Friday

Fulltime

ML Engineer, Training Infrastructure

You’ll take on challenging engineering tasks crucial to the development of tabul...

Location

Germany; United States, Berlin; Freiburg; New York; San Francisco

Salary:

Not provided

Prior Labs

Expiration Date

Until further notice

Requirements

Exceptional software engineering fundamentals and expert-level Python proficiency, with 5+ years of hands-on industry experience building and operating production systems
Proven track record of designing and building complex, scalable software, preferably for data processing or distributed systems
Deep, practical knowledge of the modern ML ecosystem (PyTorch, scikit-learn, etc.) and a genuine interest in applying systems thinking to solve hard problems in AI
Core MLOps Concepts: Strong understanding of the entire machine learning lifecycle (MLLC) from data ingestion and preparation to model deployment, monitoring, and retraining. Familiarity with MLOps principles and best practices (e.g., reproducibility, versioning, automation, continuous integration/delivery for ML)

Job Responsibility

Training & research compute infrastructure: Own our cloud GPU cluster (operations, reliability, and cost/performance), which is currently based on Slurm. Design and implement future versions as our compute needs scale and we expand across multiple cloud/HPC providers
Training & inference performance: Work closely with researchers to identify and resolve performance bottlenecks in distributed training and inference. Support high hardware utilization and efficient memory usage through systems-level debugging, profiling, and infrastructure improvements
Developer productivity: Manage our internal repositories on GitHub and keep their CI and other pipelines speedy. Ensure our experiment tracking, model registry, and data processing pipelines are working smoothly
Try out your own ideas! We operate an open environment. If you’ve got the next SOTA tabular architecture up your sleeve, go ahead and train it

What we offer

Competitive compensation package with meaningful equity
30 days of paid vacation + public holidays
Comprehensive benefits including healthcare, transportation, and fitness
Work with state-of-the-art ML architecture, substantial compute resources, and a world-class team

Fulltime

Software Engineer, Distributed Data Systems

The Sora team is pioneering multimodal capabilities for OpenAI’s foundation mode...

Location

United States, San Francisco

Salary:

230000.00 - 385000.00 USD / Year

OpenAI

Expiration Date

Until further notice

Requirements

Strong experience with distributed systems and large-scale infrastructure
Detail-oriented with rigor in building and maintaining reliable systems
Excellent software engineering fundamentals and organizational skills
Comfortable with ambiguity and rapid change
Strong interest in data

Job Responsibility

Design, build, and maintain data infrastructure systems such as distributed compute, data orchestration, distributed storage, streaming infrastructure, and machine learning infrastructure, while ensuring scalability, reliability, and security
Ensure our data platform can scale by orders of magnitude while remaining reliable and efficient
Partner with researchers to deeply understand requirements and translate them into production-ready systems
Harden, optimize, and maintain critical data infrastructure systems that power multimodal training and evaluation

What we offer

Medical, dental, and vision insurance for you and your family, with employer contributions to Health Savings Accounts
Pre-tax accounts for Health FSA, Dependent Care FSA, and commuter expenses (parking and transit)
401(k) retirement plan with employer match
Paid parental leave (up to 24 weeks for birth parents and 20 weeks for non-birthing parents), plus paid medical and caregiver leave (up to 8 weeks)
Paid time off: flexible PTO for exempt employees and up to 15 days annually for non-exempt employees
13+ paid company holidays, and multiple paid coordinated company office closures throughout the year for focus and recharge, plus paid sick or safe time (1 hour per 30 hours worked, or more, as required by applicable state or local law)
Mental health and wellness support
Employer-paid basic life and disability coverage
Annual learning and development stipend to fuel your professional growth
Daily meals in our offices, and meal delivery credits as eligible

Fulltime

Machine Learning Engineer, Distributed Data Systems

As a Research Engineer, Distributed Data Systems, you will design and scale the ...

Location

United States, San Francisco

Salary:

295000.00 - 445000.00 USD / Year

OpenAI

Expiration Date

Until further notice

Requirements

Strong experience with distributed systems and large-scale infrastructure
Detail-oriented, with rigor in building and maintaining reliable systems
Excellent software engineering fundamentals and organizational skills
Comfortable with ambiguity and rapid change

Job Responsibility

Design, build, and maintain data infrastructure systems such as distributed compute, data orchestration, distributed storage, streaming infrastructure, and machine learning infrastructure, while ensuring scalability, reliability, and security
Ensure our data platform can scale by orders of magnitude while remaining reliable and efficient
Partner with researchers to deeply understand requirements and translate them into production-ready systems
Harden, optimize, and maintain critical data infrastructure systems that power multimodal training and evaluation

What we offer

Medical, dental, and vision insurance for you and your family, with employer contributions to Health Savings Accounts
Pre-tax accounts for Health FSA, Dependent Care FSA, and commuter expenses (parking and transit)
401(k) retirement plan with employer match
Paid parental leave (up to 24 weeks for birth parents and 20 weeks for non-birthing parents), plus paid medical and caregiver leave (up to 8 weeks)
Paid time off: flexible PTO for exempt employees and up to 15 days annually for non-exempt employees
13+ paid company holidays, and multiple paid coordinated company office closures throughout the year for focus and recharge, plus paid sick or safe time (1 hour per 30 hours worked, or more, as required by applicable state or local law)
Mental health and wellness support
Employer-paid basic life and disability coverage
Annual learning and development stipend to fuel your professional growth
Daily meals in our offices, and meal delivery credits as eligible

Fulltime
