We’re looking for a Machine Learning Systems Engineer to strengthen the performance and scalability of our distributed training infrastructure. In this role, you'll work closely with researchers to streamline the development and execution of large-scale training runs, helping them make the most of our compute resources. You’ll contribute to building tools that make distributed training more efficient and accessible, while continuously refining system performance through careful analysis and optimization. This position is a great fit for someone who enjoys working at the intersection of distributed systems and machine learning, values high-performance code, and has an interest in supporting innovative machine learning efforts.
Job Responsibilities:
Collaborate with researchers to enable them to develop systems-efficient models and architectures
Apply the latest techniques to our internal training runs to achieve high hardware efficiency
Create tooling to help researchers distribute their training jobs more effectively
Profile and optimize our training runs
Requirements:
Experience with large-scale ML training pipelines and distributed training frameworks
Strong software engineering skills in Python
Passion for diving deep into systems implementations and understanding fundamentals to improve their performance and maintainability
Experience improving resource efficiency across distributed computing environments through profiling, benchmarking, and system-level optimization