We are looking for a Principal Machine Learning Engineer to join our Models and Applications team. If you are excited by the challenge of distributed training of large models across a large number of GPUs, and you are passionate about improving training efficiency while innovating and generating new ideas, then this role is for you. You will be part of a world-class team focused on training generative AI at scale.
Job Responsibilities:
Train large models to convergence on AMD GPUs at scale
Improve the end-to-end training pipeline performance
Optimize the distributed training pipeline and algorithm to scale out
Contribute your changes to open source
Stay up-to-date with the latest training algorithms
Influence the direction of the AMD AI platform
Collaborate across teams with various groups and stakeholders
Requirements:
Experience with ML/DL frameworks such as PyTorch, JAX, or TensorFlow
Experience with distributed training and distributed training frameworks, such as Megatron-LM, MaxText, or TorchTitan
Experience with LLMs or computer vision, especially large models
Experience with GPU kernel optimization
Excellent Python or C++ programming skills, including debugging, profiling, and performance analysis at scale
Experience with ML infra at kernel, framework, or system level
Strong communication and problem-solving skills
A master's degree or PhD in Computer Science, Artificial Intelligence, Machine Learning, or a related field
Nice to have:
Experience with LLMs or computer vision, especially large models, is a plus