We are seeking an experienced, technically oriented, impact-driven expert in ML Training Infrastructure with a strong ability to execute hands-on technical work. In this role, you will be responsible for designing and building scalable, reliable, and high-performance AI/ML platform infrastructure to support advanced AI research and model development initiatives. As a Senior ML Engineer, you will collaborate closely with machine learning engineers, research scientists, and other partners to develop state-of-the-art AI solutions that enable the future of intelligent driving technologies across General Motors vehicles.
Job Responsibilities:
Design and develop a scalable, reliable, high-performance ML framework to support model training at scale
Analyze and optimize model training performance to scale distributed training workflows, maximize resource utilization across heterogeneous hardware environments, and reduce cost
Raise the bar on system observability, debuggability, operational excellence, and user experience
Collaborate with cross-functional teams to integrate new features and technologies into the platform
Requirements:
Bachelor's degree or higher in Computer Science or an equivalent major, or equivalent relevant experience
3+ years of professional software engineering experience
2+ years of specialized experience in AI/ML infrastructure, e.g., enabling distributed training for scaling large ML models
Strong programming skills in Python, with proficiency in frameworks such as PyTorch (preferred), TensorFlow, or similar
Experience with distributed computing, GPU computing, and cloud environments (AWS, GCP, Azure)
Willingness to travel to Sunnyvale, CA as needed
Comfortable working in highly ambiguous and dynamic environments
Nice to have:
5+ years of professional software engineering experience
Extensive knowledge of and experience with PyTorch 2.x+ and distributed training frameworks
Experience designing and developing training frameworks that support FSDP, Pipeline Parallelism, and other scalable solutions for training large foundation models
Experience profiling, analyzing, debugging, and optimizing training and data loading performance
Excellent communication skills to resolve disagreements, build consensus, communicate risks, and give constructive feedback