Drive the performance of post‑training workloads on AMD Instinct™ GPUs. You’ll work across kernels, distributed training, and framework integrations to deliver fast, stable, and reproducible training pipelines on ROCm.
Job Responsibilities:
Lead performance for fine-tuning and RL training solutions on AMD GPUs
Improve throughput, memory efficiency, and stability across data, model, and optimizer steps
Optimize multi-GPU/multi-node training and communication patterns
Contribute efficient kernels/ops and targeted graph-level optimizations
Profile, diagnose, and resolve bottlenecks using standard tooling, and prevent regressions in CI
Ship reproducible pipelines and documentation adopted by internal teams and external developers
Collaborate with framework, compiler, and model teams to land durable improvements
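The multi-GPU communication patterns mentioned above center on collectives such as all-reduce. As a hedged illustration of the kind of pattern involved (not AMD's implementation — real training would go through RCCL via torch.distributed), here is a minimal single-process simulation of ring all-reduce; the function name `ring_allreduce` is hypothetical:

```python
# Single-process simulation of ring all-reduce, the collective pattern
# behind multi-GPU gradient averaging. Illustrative sketch only: real
# training on AMD Instinct GPUs would use RCCL via torch.distributed.

def ring_allreduce(vectors):
    """Sum-reduce equal-length vectors across len(vectors) simulated ranks.

    Each vector is split into n chunks. n-1 reduce-scatter steps leave
    each rank with one fully reduced chunk; n-1 all-gather steps then
    circulate the reduced chunks. Per-rank traffic is ~2(n-1)/n of the
    vector size, roughly independent of n, which is why the ring
    pattern scales well.
    """
    n = len(vectors)
    assert n > 1 and all(len(v) == len(vectors[0]) for v in vectors)
    assert len(vectors[0]) % n == 0, "vector length must divide evenly"
    csize = len(vectors[0]) // n
    # data[r][c] = chunk c currently held by simulated rank r
    data = [[list(v[c * csize:(c + 1) * csize]) for c in range(n)]
            for v in vectors]

    # Reduce-scatter: at step s, rank r forwards chunk (r - s) mod n to
    # its ring neighbour, which accumulates it into its own copy.
    for s in range(n - 1):
        for r in range(n):
            c = (r - s) % n
            dst = (r + 1) % n
            data[dst][c] = [a + b for a, b in zip(data[dst][c], data[r][c])]

    # Rank r now holds the fully reduced chunk (r + 1) mod n.
    # All-gather: circulate reduced chunks so every rank has all of them.
    for s in range(n - 1):
        for r in range(n):
            c = (r + 1 - s) % n
            data[(r + 1) % n][c] = list(data[r][c])

    return [sum(data[r], []) for r in range(n)]
```

For example, `ring_allreduce([[1, 2], [3, 4]])` returns `[[4, 6], [4, 6]]`: every simulated rank ends with the element-wise sum.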
Requirements:
Proven GPU performance engineering for deep learning (ROCm/HIP, Triton, or similar)
Hands-on with SFT, LoRA, and RL-based training at scale
Strong PyTorch experience (torch.distributed, FSDP/ZeRO or equivalent)
Proficient in Python and C++; comfortable reading/writing kernels when needed
Experience with distributed systems and collective communication libraries
Track record of turning profiles into fixes, upstreaming changes, and documenting results
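For context on the LoRA requirement above: the idea is to freeze a full weight matrix and train only a low-rank update. A minimal pure-Python sketch of that arithmetic follows (real workloads use PyTorch modules; the helper names here are hypothetical):

```python
# Sketch of the LoRA idea: instead of updating a full weight matrix W
# (d_out x d_in), train two small factors B (d_out x r) and A (r x d_in).
# The effective weight is W + (alpha / r) * B @ A, so only
# r * (d_out + d_in) parameters are trained instead of d_out * d_in.
# Pure Python for illustration only.

def matmul(X, Y):
    """Plain matrix product of nested lists X (m x k) and Y (k x p)."""
    rows, inner, cols = len(X), len(Y), len(Y[0])
    return [[sum(X[i][k] * Y[k][j] for k in range(inner))
             for j in range(cols)] for i in range(rows)]

def lora_effective_weight(W, A, B, alpha):
    """Return W + (alpha / r) * B @ A, with rank r taken from A."""
    r = len(A)  # number of rows of A = rank of the update
    delta = matmul(B, A)
    scale = alpha / r
    return [[w + scale * d for w, d in zip(wrow, drow)]
            for wrow, drow in zip(W, delta)]
```

With rank r much smaller than the matrix dimensions, the trainable parameter count drops from d_out * d_in to r * (d_out + d_in), which is what makes fine-tuning large models memory-efficient.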