Principal ML Engineer - Large Scale Training Performance Optimization

United States, San Jose · 226400.00 - 339600.00 USD / Year · Job Posted March 25, 2026

Job Description

We are looking for a Principal Machine Learning Engineer to join our Models and Applications team. If you are excited by the challenge of distributed training of large models on a large number of GPUs, and if you are passionate about improving training efficiency while innovating and generating new ideas, then this role is for you. You will be part of a world-class team focused on addressing the challenge of training generative AI at scale.

Job Responsibility

  • Train large models to convergence on AMD GPUs at scale
  • Improve the end-to-end training pipeline performance
  • Optimize the distributed training pipeline and algorithm to scale out
  • Contribute your changes to open source
  • Stay up-to-date with the latest training algorithms
  • Influence the direction of the AMD AI platform
  • Collaborate across teams with various groups and stakeholders

Requirements

  • Experience with ML/DL frameworks such as PyTorch, JAX, or TensorFlow
  • Experience with distributed training and distributed training frameworks, such as Megatron-LM, MaxText, TorchTitan
  • Experience with LLMs or computer vision, especially large models
  • Experience with GPU kernel optimization
  • Excellent Python or C++ programming skills, including debugging, profiling, and performance analysis at scale
  • Experience with ML infra at kernel, framework, or system level
  • Strong communication and problem-solving skills
  • A master's or PhD degree in Computer Science, Artificial Intelligence, Machine Learning, or a related field

Nice to have

  • Experience with LLMs or computer vision, especially large models, is a plus
  • Experience with GPU kernel optimization is a plus

Similar Jobs for Principal ML Engineer - Large Scale Training Performance Optimization

8 matching positions

Principal Engineer - Evaluation & Simulation

As a Principal Engineer in Evaluation & Simulation, you will drive the architect...
Location: United States, San Francisco; Sunnyvale
Salary: 302000.00 - 336000.00 USD / Year
Company: Uber
Expiration Date: Until further notice
Requirements
  • 10+ years of working experience in Software Engineering, Autonomous Systems, Simulation, or Robotics
  • Proven experience leading the architecture and delivery of large-scale distributed systems or complex simulation platforms from conception to production
  • Bachelor's degree in Computer Science, Computer Engineering, or related fields
  • Expert-level proficiency in C++ and Python within Linux environments
  • Deep expertise in high-performance computing, system optimization, and cloud architecture (AWS, GCP, etc.)
Job Responsibility
  • Strategic Simulation Architecture: Lead the technical roadmap for our large-scale, cloud-based simulation platform, ensuring it can efficiently scale to run millions of closed-loop scenarios and validate complex urban edge cases
  • High-Fidelity Virtual Validation: Design and oversee the implementation of advanced simulation frameworks that integrate sensor data (LiDAR, camera, radar), cutting-edge neural rendering, and highly realistic traffic agent behaviors
  • Metrics & Scenario Generation: Define the deterministic and probabilistic evaluation metrics used to score autonomous behavior. Pioneer the systems used for procedural and data-driven generation of rare, long-tail edge-case scenarios
  • End-to-End System Integration: Act as the crucial bridge between simulation infrastructure and the core ML stack, ensuring seamless integration so that onboard models can be trained, tested, and validated in highly accurate virtual environments prior to field deployment
  • Technical Mentorship & Influence: Mentor senior and lead engineers, fostering a culture of rigorous software architecture, testing, and engineering excellence. You will influence the technical direction of multiple infrastructure and autonomy teams
What we offer
  • Bonus program
  • Equity award
  • 401(k) plan
  • Various benefits
  • Fulltime

Principal Engineer, ASIC Development Engineering (Frontend Architect - AI Storage Solutions)

In this Frontend Architect position, you will develop AI Storage Solutions based...
Location: India, Bangalore
Salary: Not provided
Company: Sandisk
Expiration Date: Until further notice
Requirements
  • Bachelor's, Master's, or PhD in Computer/Electrical Engineering with 8+ years of hands-on architecture experience authoring specifications
  • Strong technical background architecting SoC and I/O subsystems involving PCIe and PCIe-DMA engines, or UCIe or CXL or UAL
  • Strong IO subsystem microarchitecture, technical, and working knowledge of the PCIe/UCIe protocol specifications
  • Knowledge of I/O Subsystem and DMA interactions with internal embedded processor-subsystems (x86, RISC-V or ARM) and external host CPU
  • Good understanding of computer/graphics architecture, ML, LLM
  • Architecting GPU/TPU/xPU accelerator systems with an optimized high-bandwidth memory hierarchy and frontend architecture for multi-trillion-parameter LLM training/inference, including Dense and Mixture of Experts (MoE) models with multiple modalities (text, vision, speech)
  • Deep experience optimizing large-scale ML systems, GPU architectures
  • Proficiency in principles and methods of microarchitecture, software, and hardware relevant to performance engineering
  • Multi-disciplinary experience, including familiarity with Firmware and ASIC design
  • Expertise in CUDA programming, GPU memory hierarchies, and hardware-specific optimizations
Job Responsibility
  • Responsible for driving the SoC architecture, with a particular focus on I/O subsystems connected over UCIe, PCIe, UAL or CXL
  • Define I/O subsystem and PCIe DMA architectures, including their interactions with internal embedded processor-subsystems, Network on Chip, Memory controllers, and FPGA fabric
  • Create flexible and modular I/O subsystem architectures that can be deployed in either chiplet, monolithic or 3D form factors
  • Work with customers and cross-functional teams to scope SoC requirements, analyze PPA tradeoffs, and then define architectural requirements that meet the PPA and schedule targets
  • Define I/O subsystem and DMA hardware, software, and firmware interactions with embedded processing subsystems and SoC CPUs on the device side and Host CPUs
  • Author architecture specifications in clear and concise language. Guide and assist pre-silicon design/verification and post-silicon validation during the execution phase
  • Responsible for improving the AI/ML ASIC Architecture performance through hardware & software co-optimization, post-silicon performance analysis, and influencing the strategic product roadmap
  • LLM Workload analysis and characterization of ASIC and competitive datacenter and AI solutions to identify opportunities for performance improvement in our products
  • Experience architecting one or more components of AI/ML accelerator ASICs such as HBM, PCIe/UCIe/CXL, NoC, DMA, Firmware Interactions, NAND, xPU, fabrics, etc.
  • Drive the AI Storage Solutions frontend system architecture with GPU/TPU/NPU/xPU to match or exceed the next-gen HBM bandwidth
  • Fulltime

Principal Engineer - Marketplace

Principal Engineer role in the Marketplace Engineering team to lead breakthrough...
Location: United States, San Francisco; Sunnyvale
Salary: 302000.00 - 336000.00 USD / Year
Company: Uber
Expiration Date: Until further notice
Requirements
  • PhD in Computer Science, Machine Learning, Operations Research, or related quantitative field OR Master’s degree with 12+ years of industry experience
  • 10+ years of experience building and deploying ML models in large-scale production environments
  • Expert-level proficiency in modern ML frameworks (TensorFlow, PyTorch, JAX) and distributed computing platforms (Spark, Ray)
  • Deep expertise across multiple areas including: Deep Learning, Causal Inference, Reinforcement Learning, Multi-objective Optimization, Algorithmic Game Theory, and Large-scale Ads Ranking/Auction Systems
  • Proven track record of leading complex ML projects from research through production with significant measurable business impact
  • Strong programming skills in Python, Java, or Go with experience building production ML systems
  • Experience with feature engineering, model serving, and ML infrastructure at scale (handling millions of predictions per second)
  • Technical leadership experience including mentoring senior engineers and driving cross-team technical initiatives
  • Advanced Deep Learning and Neural Network architectures
  • Scalable ML architecture and distributed model training
Job Responsibility
  • Lead the design and implementation of advanced ML systems for dynamic pricing algorithms serving millions of drivers across 70+ countries around the world
  • Architect real-time ML infrastructure handling 1M+ pricing decisions per second with sub-50ms latency requirements
  • Drive breakthrough research in causal ML, reinforcement learning, algorithmic game theory, and multi-objective optimization for marketplace optimization with strategic agents
  • Own end-to-end ML model lifecycle from research through production deployment and continuous optimization
  • Develop and enforce best practices in system design, ensuring data integrity, security, and optimal performance
  • Serve as a representative for the Marketplace organization to the broader internal and external technical community
  • Contribute to the eng brand for Marketplace and serve as a talent magnet to help attract and retain talent for the team
  • Stay abreast of industry trends and emerging technologies in software engineering, focused particularly on ML/AI, to enhance our systems and processes continually
  • Build scalable ML architecture and feature management systems supporting Driver Pricing and broader Marketplace teams
  • Design experimentation frameworks enabling rapid testing of pricing algorithms using A/B, Switchback, Synthetic Control, and other experimental methodologies
What we offer
  • Eligible to participate in Uber's bonus program
  • May be offered an equity award & other types of comp
  • Eligible to participate in a 401(k) plan
  • Eligible for various benefits (details at provided link)
  • Fulltime

Principal Engineer, Model Dev Platform

As the Principal Engineer for the Model Development Platform at Wayve, you will ...
Location: United States, Sunnyvale
Salary: Not provided
Company: Wayve
Expiration Date: Until further notice
Requirements
  • Technical Leadership at Scale – 10+ years of experience designing and building large-scale distributed systems, ML/AI infrastructure, full-stack web applications, or developer platforms, including at least 3 years as a staff or principal-level engineer
  • Architectural Depth & Breadth – Proven ability to design systems spanning web platforms, ML pipelines, and large-scale compute orchestration (e.g., Spark, Ray, Kubernetes, Airflow, MLflow)
  • Reliability & Performance Mindset – Experience driving platform reliability improvements, defining SLAs/SLOs, and building self-healing and observable systems that operate at “four nines” availability or better
  • Hands-On Systems Design – Deep understanding of distributed computing, workflow orchestration, data modeling, and API design, with the ability to write and review production-quality code
  • Collaborative Influence – Excellent communication and cross-functional collaboration skills; ability to guide engineers, managers, and researchers toward unified technical direction
  • Mentorship & Culture – Demonstrated success in mentoring engineers across levels and cultivating a culture of engineering excellence
  • Education – Bachelor’s degree in Computer Science, Software Engineering, or related field (advanced degree preferred, or equivalent experience)
Job Responsibility
  • Design and evolve the overarching architecture of the model development platform, ensuring system-wide reliability, observability, and scalability
  • Work across disciplines—from front-end web UIs to large-scale distributed training, from Spark-based data pipelines to experiment scheduling algorithms using linear optimization—to unify the platform’s architecture and ensure smooth interoperability between systems
  • Dive deep into the thorniest technical challenges faced by individual subteams, bringing your expertise in distributed systems, large-scale compute, and system design to bear
  • Develop and refine systems that optimize how models are tested—whether in simulation or on-road—balancing constraints like hardware availability, safety requirements, and research priorities
  • Architect data processing pipelines capable of ingesting, transforming, and enriching petabytes of sensor data from the global fleet
  • Serve as a mentor and coach for engineers across the organization—developing technical talent, improving design practices, and fostering a culture of learning and technical excellence
  • Partner with Product Management, Research, and Operations to align technical architecture with user needs and product vision

Principal Research Engineer

As a Principal Research Engineer at Microsoft, you will set the technical vision...
Location: United States, Redmond
Salary: 163000.00 - 296400.00 USD / Year
Company: Microsoft Corporation
Expiration Date: Until further notice
Requirements
  • Bachelor's Degree in Computer Science or related technical field AND 8+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
  • Ability to meet Microsoft, customer and/or government security screening requirements
  • Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud background check upon hire/transfer and every two years thereafter
Job Responsibility
  • Define and execute technical strategy for foundational models, multi-agent systems, and next-generation Copilot experiences, especially within Business & Industry Copilot
  • Lead cross-team efforts to deliver scalable, reliable, and responsible AI systems
  • Advance the state of the art and translate breakthroughs into measurable customer and business impact
  • Architect and deliver complex AI systems across model development, data, infra, evaluation, and deployment spanning multiple product lines
  • Set technical direction for large programs; drive alignment across Research, Engineering, and Product
  • Integrate LLMs, multimodal models, multi-agent architectures, and RAG into Microsoft’s ecosystem
  • Establish best practices for MLOps, governance, and Responsible AI, compliant with Microsoft principles and industry standards
  • Drive original research and thought leadership (whitepapers, internal notes, patents); convert insights into shipped capabilities
  • Fulltime

Principal Research Engineer

As a Principal Research Engineer at Microsoft, you will set the technical vision...
Location: United States, Redmond
Salary: 163000.00 - 296400.00 USD / Year
Company: Microsoft Corporation
Expiration Date: Until further notice
Requirements
  • Bachelor's Degree in Computer Science or related technical field AND 8+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
  • Ability to meet Microsoft, customer and/or government security screening requirements
  • Microsoft Cloud Background Check
Job Responsibility
  • Define and execute technical strategy for foundational models, multi-agent systems, and next-generation Copilot experiences, especially within Business & Industry Copilot
  • Lead cross-team efforts to deliver scalable, reliable, and responsible AI systems
  • Advance the state of the art and translate breakthroughs into measurable customer and business impact
  • Architect and deliver complex AI systems across model development, data, infra, evaluation, and deployment spanning multiple product lines
  • Set technical direction for large programs; drive alignment across Research, Engineering, and Product
  • Integrate LLMs, multimodal models, multi-agent architectures, and RAG into Microsoft’s ecosystem
  • Establish best practices for MLOps, governance, and Responsible AI, compliant with Microsoft principles and industry standards
  • Drive original research and thought leadership (whitepapers, internal notes, patents); convert insights into shipped capabilities
  • Fulltime

Principal Research Engineer

As a Principal Research Engineer at Microsoft, you will set the technical vision...
Location: United States, Redmond
Salary: 139900.00 - 274800.00 USD / Year
Company: Microsoft Corporation
Expiration Date: Until further notice
Requirements
  • Bachelor's Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience.
  • Ability to meet Microsoft, customer, and/or government security screening requirements is required for this role. These requirements include, but are not limited to, the following specialized security screenings: Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud background check upon hire/transfer and every two years thereafter.
Job Responsibility
  • Define and execute technical strategy for foundational models, multi-agent systems, and next-generation Copilot experiences, especially within Business & Industry Copilot.
  • Lead cross-team efforts to deliver scalable, reliable, and responsible AI systems.
  • Advance the state of the art and translate breakthroughs into measurable customer and business impact.
  • Architect and deliver complex AI systems across model development, data, infra, evaluation, and deployment spanning multiple product lines.
  • Set technical direction for large programs; drive alignment across Research, Engineering, and Product.
  • Integrate LLMs, multimodal models, multi-agent architectures, and RAG into Microsoft’s ecosystem.
  • Establish best practices for MLOps, governance, and Responsible AI, compliant with Microsoft principles and industry standards.
  • Drive original research and thought leadership (whitepapers, internal notes, patents); convert insights into shipped capabilities.
  • Fulltime

Principal Research Engineer

As a Principal Research Engineer at Microsoft, you will set the technical vision...
Location: United States, Redmond
Salary: 139900.00 - 274800.00 USD / Year
Company: Microsoft Corporation
Expiration Date: Until further notice
Requirements
  • Bachelor's Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
  • Ability to meet Microsoft, customer, and/or government security screening requirements is required for this role. These requirements include, but are not limited to, the following specialized security screenings: Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud background check upon hire/transfer and every two years thereafter
Job Responsibility
  • Define and execute technical strategy for foundational models, multi-agent systems, and next-generation Copilot experiences, especially within Business & Industry Copilot
  • Lead cross-team efforts to deliver scalable, reliable, and responsible AI systems
  • Advance the state of the art and translate breakthroughs into measurable customer and business impact
  • Architect and deliver complex AI systems across model development, data, infra, evaluation, and deployment spanning multiple product lines
  • Set technical direction for large programs; drive alignment across Research, Engineering, and Product
  • Integrate LLMs, multimodal models, multi-agent architectures, and RAG into Microsoft’s ecosystem
  • Establish best practices for MLOps, governance, and Responsible AI, compliant with Microsoft principles and industry standards
  • Drive original research and thought leadership (whitepapers, internal notes, patents); convert insights into shipped capabilities
  • Fulltime