In this role you will help scale and optimize our training systems and core model code. You'll own critical infrastructure for large-scale training, from managing GPU/TPU compute and job orchestration to building reusable and efficient JAX training pipelines. You'll work closely with researchers and model engineers to translate ideas into experiments, and those experiments into production training runs. This is a hands-on, high-leverage role at the intersection of ML, software engineering, and scalable infrastructure.
Job Responsibilities:
Own training/inference infrastructure: Design, implement, and maintain systems for large-scale model training, including scheduling, job management, checkpointing, and metrics/logging
Scale distributed training: Work with researchers to scale JAX-based training across TPU and GPU clusters with minimal friction (a minimal sketch of this kind of code follows this list)
Optimize performance: Profile and improve memory usage, device utilization, throughput, and distributed synchronization
Enable rapid iteration: Build abstractions for launching, monitoring, debugging, and reproducing experiments
Manage compute resources: Ensure efficient allocation and utilization of cloud-based GPU/TPU compute while controlling cost
Partner with researchers: Translate research needs into infra capabilities and guide best practices for training at scale
Contribute to core training code: Evolve JAX model and training code to support new architectures, modalities, and evaluation metrics
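
For flavor, here is a minimal sketch of the kind of JAX data-parallel training step this work involves. The linear model, shapes, and learning rate are illustrative placeholders, not our production code:

    import jax
    import jax.numpy as jnp
    import numpy as np
    from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

    # Build a 1-D device mesh over all available accelerators (GPU/TPU/CPU).
    mesh = Mesh(np.array(jax.devices()), axis_names=("data",))
    batch_sharding = NamedSharding(mesh, P("data"))  # shard the batch across devices
    replicated = NamedSharding(mesh, P())            # replicate params on every device

    def loss_fn(params, x, y):
        # Placeholder linear model; a real model would be a Flax/Haiku module.
        pred = x @ params["w"] + params["b"]
        return jnp.mean((pred - y) ** 2)

    @jax.jit
    def train_step(params, x, y, lr=1e-3):
        # jit compiles once per shape; under SPMD partitioning, XLA inserts
        # the cross-device gradient reduction automatically.
        loss, grads = jax.value_and_grad(loss_fn)(params, x, y)
        params = jax.tree_util.tree_map(lambda p, g: p - lr * g, params, grads)
        return params, loss

    # Place data and parameters explicitly, then step.
    x = jax.device_put(jax.random.normal(jax.random.PRNGKey(0), (64, 128)),
                       batch_sharding)
    y = jax.device_put(jnp.zeros((64, 1)), batch_sharding)
    params = jax.device_put({"w": jnp.zeros((128, 1)), "b": jnp.zeros(1)},
                            replicated)
    params, loss = train_step(params, x, y)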
Requirements:
Strong software engineering fundamentals and experience building ML training infrastructure or internal platforms
Hands-on experience with large-scale training in JAX (preferred) or PyTorch
Familiarity with distributed training, multi-host setups, data loaders, and evaluation pipelines
Experience managing training workloads with cluster schedulers and cloud platforms (e.g., SLURM, Kubernetes, GCP TPU/GKE, AWS)
Ability to debug and optimize performance bottlenecks across the training stack (see the profiling sketch after this list)
Strong cross-functional communication and ownership mindset
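
As one example of that profiling work, a hypothetical sketch using jax.profiler; the step function and trace path are made up for illustration:

    import jax
    import jax.numpy as jnp

    @jax.jit
    def step(x):
        # Stand-in for a real training step.
        return jnp.tanh(x @ x.T).sum()

    x = jnp.ones((2048, 2048))
    step(x).block_until_ready()  # warm-up call keeps compile time out of the trace

    # Writes a trace viewable in TensorBoard's profiler plugin (or Perfetto).
    with jax.profiler.trace("/tmp/jax-trace"):
        for _ in range(5):
            step(x).block_until_ready()  # block so device work lands in the trace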
Nice to have:
Deep ML systems background (e.g., training compilers, runtime optimization, custom kernels)
Experience operating close to hardware (GPU/TPU performance tuning)
Background in robotics, multimodal models, or large-scale foundation models
Experience designing abstractions that balance researcher flexibility with system reliability