Principal ML Engineer - Large Scale Training Performance Optimization Job at AMD (San Jose)

Principal Engineer - Evaluation & Simulation

As a Principal Engineer in Evaluation & Simulation, you will drive the architect...

Location

United States, San Francisco; Sunnyvale

Salary:

302000.00 - 336000.00 USD / Year

Uber

Expiration Date

Until further notice

Requirements

10+ years of working experience in Software Engineering, Autonomous Systems, Simulation, or Robotics
Proven experience leading the architecture and delivery of large-scale distributed systems or complex simulation platforms from conception to production
Bachelor's degree in Computer Science, Computer Engineering, or related fields
Expert-level proficiency in C++ and Python within Linux environments
Deep expertise in high-performance computing, system optimization, and cloud architecture (AWS, GCP, etc.)

Job Responsibility

Strategic Simulation Architecture: Lead the technical roadmap for our large-scale, cloud-based simulation platform, ensuring it can efficiently scale to run millions of closed-loop scenarios and validate complex urban edge cases
High-Fidelity Virtual Validation: Design and oversee the implementation of advanced simulation frameworks that integrate sensor data (LiDAR, camera, radar), cutting-edge neural rendering, and highly realistic traffic agent behaviors
Metrics & Scenario Generation: Define the deterministic and probabilistic evaluation metrics used to score autonomous behavior. Pioneer the systems used for procedural and data-driven generation of rare, long-tail edge-case scenarios
End-to-End System Integration: Act as the crucial bridge between simulation infrastructure and the core ML stack, ensuring seamless integration so that onboard models can be trained, tested, and validated in highly accurate virtual environments prior to field deployment
Technical Mentorship & Influence: Mentor senior and lead engineers, fostering a culture of rigorous software architecture, testing, and engineering excellence. You will influence the technical direction of multiple infrastructure and autonomy teams

What we offer

Bonus program
Equity award
401(k) plan
Various benefits

Fulltime

Principal Engineer, ASIC Development Engineering (Frontend Architect - AI Storage Solutions)

In this Frontend Architect position, you will develop AI Storage Solutions based...

Location

India, Bangalore

Salary:

Not provided

Sandisk

Expiration Date

Until further notice

Requirements

Bachelor's, Master's, or PhD in Computer/Electrical Engineering with 8+ years of hands-on architecture experience authoring specifications
Strong technical background architecting SoC and I/O subsystems involving PCIe and PCIe-DMA engines, or UCIe or CXL or UAL
Strong IO subsystem microarchitecture, technical, and working knowledge of the PCIe/UCIe protocol specifications
Knowledge of I/O Subsystem and DMA interactions with internal embedded processor-subsystems (x86, RISC-V or ARM) and external host CPU
Good understanding of computer/graphics architecture, ML, LLM
Architecting GPU/TPU/xPU accelerator systems with an optimized high-bandwidth memory hierarchy and frontend architecture for multi-trillion-parameter LLM training/inference, including Dense and Mixture of Experts (MoE) models with multiple modalities (text, vision, speech)
Deep experience optimizing large-scale ML systems, GPU architectures
Proficiency in principles and methods of microarchitecture, software, and hardware relevant to performance engineering
Multi-disciplinary experience, including familiarity with Firmware and ASIC design
Expertise in CUDA programming, GPU memory hierarchies, and hardware-specific optimizations

Job Responsibility

Responsible for driving the SoC architecture, with a particular focus on I/O subsystems connected over UCIe, PCIe, UAL or CXL
Define I/O subsystem and PCIe DMA architectures, including their interactions with internal embedded processor-subsystems, Network on Chip, Memory controllers, and FPGA fabric
Create flexible and modular I/O subsystem architectures that can be deployed in either chiplet, monolithic or 3D form factors
Work with customers and cross-functional teams to scope SoC requirements, analyze PPA tradeoffs, and then define architectural requirements that meet the PPA and schedule targets
Define I/O subsystem and DMA hardware, software, and firmware interactions with embedded processing subsystems and SoC CPUs on the device side and Host CPUs
Author architecture specifications in clear and concise language. Guide and assist pre-silicon design/verification and post-silicon validation during the execution phase
Responsible for improving the AI/ML ASIC Architecture performance through hardware & software co-optimization, post-silicon performance analysis, and influencing the strategic product roadmap
LLM Workload analysis and characterization of ASIC and competitive datacenter and AI solutions to identify opportunities for performance improvement in our products
Experience architecting one or more components of AI/ML accelerator ASICs such as HBM, PCIe/UCIe/CXL, NoC, DMA, Firmware Interactions, NAND, xPU, fabrics, etc.
Drive the AI Storage Solutions frontend system architecture with GPU/TPU/NPU/xPU to match or exceed the next-gen HBM bandwidth

Fulltime

Principal Engineer - Marketplace

Principal Engineer role in the Marketplace Engineering team to lead breakthrough...

Location

United States, San Francisco; Sunnyvale

Salary:

302000.00 - 336000.00 USD / Year

Uber

Expiration Date

Until further notice

Requirements

PhD in Computer Science, Machine Learning, Operations Research, or related quantitative field OR Master’s degree with 12+ years of industry experience
10+ years of experience building and deploying ML models in large-scale production environments
Expert-level proficiency in modern ML frameworks (TensorFlow, PyTorch, JAX) and distributed computing platforms (Spark, Ray)
Deep expertise across multiple areas including: Deep Learning, Causal Inference, Reinforcement Learning, Multi-objective Optimization, Algorithmic Game Theory, and Large-scale Ads Ranking/Auction Systems
Proven track record of leading complex ML projects from research through production with significant measurable business impact
Strong programming skills in Python, Java, or Go with experience building production ML systems
Experience with feature engineering, model serving, and ML infrastructure at scale (handling millions of predictions per second)
Technical leadership experience including mentoring senior engineers and driving cross-team technical initiatives
Advanced Deep Learning and Neural Network architectures
Scalable ML architecture and distributed model training

Job Responsibility

Lead the design and implementation of advanced ML systems for dynamic pricing algorithms serving millions of drivers across 70+ countries around the world
Architect real-time ML infrastructure handling 1M+ pricing decisions per second with sub-50ms latency requirements
Drive breakthrough research in causal ML, reinforcement learning, algorithmic game theory, and multi-objective optimization for marketplace optimization with strategic agents
Own end-to-end ML model lifecycle from research through production deployment and continuous optimization
Develop and enforce best practices in system design, ensuring data integrity, security, and optimal performance
Serve as a representative for the Marketplace organization to the broader internal and external technical community
Contribute to the eng brand for Marketplace and serve as a talent magnet to help attract and retain talent for the team
Stay abreast of industry trends and emerging technologies in software engineering, focused particularly on ML/AI, to enhance our systems and processes continually
Build scalable ML architecture and feature management systems supporting Driver Pricing and broader Marketplace teams
Design experimentation frameworks enabling rapid testing of pricing algorithms using A/B, Switchback, Synthetic Control, and other experimental methodologies

What we offer

Eligible to participate in Uber's bonus program
May be offered an equity award & other types of comp
Eligible to participate in a 401(k) plan
Eligible for various benefits (details at provided link)

Fulltime

Principal Engineer, Model Dev Platform

As the Principal Engineer for the Model Development Platform at Wayve, you will ...

Location

United States, Sunnyvale

Salary:

Not provided

Wayve

Expiration Date

Until further notice

Requirements

Technical Leadership at Scale – 10+ years of experience designing and building large-scale distributed systems, ML/AI infrastructure, full-stack web applications, or developer platforms, including at least 3 years as a staff or principal-level engineer
Architectural Depth & Breadth – Proven ability to design systems spanning web platforms, ML pipelines, and large-scale compute orchestration (e.g., Spark, Ray, Kubernetes, Airflow, MLflow)
Reliability & Performance Mindset – Experience driving platform reliability improvements, defining SLAs/SLOs, and building self-healing and observable systems that operate at “four nines” availability or better
Hands-On Systems Design – Deep understanding of distributed computing, workflow orchestration, data modeling, and API design, with the ability to write and review production-quality code
Collaborative Influence – Excellent communication and cross-functional collaboration skills; ability to guide engineers, managers, and researchers toward unified technical direction
Mentorship & Culture – Demonstrated success in mentoring engineers across levels and cultivating a culture of engineering excellence
Education – Bachelor’s degree in Computer Science, Software Engineering, or related field (advanced degree preferred, or equivalent experience)

Job Responsibility

Design and evolve the overarching architecture of the model development platform, ensuring system-wide reliability, observability, and scalability
Work across disciplines—from front-end web UIs to large-scale distributed training, from Spark-based data pipelines to experiment scheduling algorithms using linear optimization—to unify the platform’s architecture and ensure smooth interoperability between systems
Dive deep into the thorniest technical challenges faced by individual subteams, bringing your expertise in distributed systems, large-scale compute, and system design to bear
Develop and refine systems that optimize how models are tested—whether in simulation or on-road—balancing constraints like hardware availability, safety requirements, and research priorities
Architect data processing pipelines capable of ingesting, transforming, and enriching petabytes of sensor data from the global fleet
Serve as a mentor and coach for engineers across the organization—developing technical talent, improving design practices, and fostering a culture of learning and technical excellence
Partner with Product Management, Research, and Operations to align technical architecture with user needs and product vision

Principal Research Engineer

As a Principal Research Engineer at Microsoft, you will set the technical vision...

Location

United States, Redmond

Salary:

163000.00 - 296400.00 USD / Year

Microsoft Corporation

Expiration Date

Until further notice

Requirements

Bachelor's Degree in Computer Science or related technical field AND 8+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
Ability to meet Microsoft, customer and/or government security screening requirements
Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud background check upon hire/transfer and every two years thereafter

Job Responsibility

Define and execute technical strategy for foundational models, multi-agent systems, and next-generation Copilot experiences, especially within Business & Industry Copilot
Lead cross-team efforts to deliver scalable, reliable, and responsible AI systems
Advance the state of the art and translate breakthroughs into measurable customer and business impact
Architect and deliver complex AI systems across model development, data, infra, evaluation, and deployment spanning multiple product lines
Set technical direction for large programs; drive alignment across Research, Engineering, and Product
Integrate LLMs, multimodal models, multi-agent architectures, and RAG into Microsoft’s ecosystem
Establish best practices for MLOps, governance, and Responsible AI, compliant with Microsoft principles and industry standards
Drive original research and thought leadership (whitepapers, internal notes, patents); convert insights into shipped capabilities

Fulltime

Principal Research Engineer

As a Principal Research Engineer at Microsoft, you will set the technical vision...

Location

United States, Redmond

Salary:

163000.00 - 296400.00 USD / Year

Microsoft Corporation

Expiration Date

Until further notice

Requirements

Bachelor's Degree in Computer Science or related technical field AND 8+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
Ability to meet Microsoft, customer and/or government security screening requirements
Microsoft Cloud Background Check

Job Responsibility

Define and execute technical strategy for foundational models, multi-agent systems, and next-generation Copilot experiences, especially within Business & Industry Copilot
Lead cross-team efforts to deliver scalable, reliable, and responsible AI systems
Advance the state of the art and translate breakthroughs into measurable customer and business impact
Architect and deliver complex AI systems across model development, data, infra, evaluation, and deployment spanning multiple product lines
Set technical direction for large programs; drive alignment across Research, Engineering, and Product
Integrate LLMs, multimodal models, multi-agent architectures, and RAG into Microsoft’s ecosystem
Establish best practices for MLOps, governance, and Responsible AI, compliant with Microsoft principles and industry standards
Drive original research and thought leadership (whitepapers, internal notes, patents); convert insights into shipped capabilities

Fulltime

Principal Research Engineer

As a Principal Research Engineer at Microsoft, you will set the technical vision...

Location

United States, Redmond

Salary:

139900.00 - 274800.00 USD / Year

Microsoft Corporation

Expiration Date

Until further notice

Requirements

Bachelor's Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience.
Ability to meet Microsoft, customer and/or government security screening requirements is required for this role. These requirements include but are not limited to the following specialized security screenings: Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud background check upon hire/transfer and every two years thereafter.

Job Responsibility

Define and execute technical strategy for foundational models, multi-agent systems, and next-generation Copilot experiences, especially within Business & Industry Copilot.
Lead cross-team efforts to deliver scalable, reliable, and responsible AI systems.
Advance the state of the art and translate breakthroughs into measurable customer and business impact.
Architect and deliver complex AI systems across model development, data, infra, evaluation, and deployment spanning multiple product lines.
Set technical direction for large programs; drive alignment across Research, Engineering, and Product.
Integrate LLMs, multimodal models, multi-agent architectures, and RAG into Microsoft’s ecosystem.
Establish best practices for MLOps, governance, and Responsible AI, compliant with Microsoft principles and industry standards.
Drive original research and thought leadership (whitepapers, internal notes, patents); convert insights into shipped capabilities.

Fulltime

Principal Research Engineer

As a Principal Research Engineer at Microsoft, you will set the technical vision...

Location

United States, Redmond

Salary:

139900.00 - 274800.00 USD / Year

Microsoft Corporation

Expiration Date

Until further notice

Requirements

Bachelor's Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
Ability to meet Microsoft, customer and/or government security screening requirements is required for this role
These requirements include but are not limited to the following specialized security screenings: Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud background check upon hire/transfer and every two years thereafter

Job Responsibility

Define and execute technical strategy for foundational models, multi-agent systems, and next-generation Copilot experiences, especially within Business & Industry Copilot
Lead cross-team efforts to deliver scalable, reliable, and responsible AI systems
Advance the state of the art and translate breakthroughs into measurable customer and business impact
Architect and deliver complex AI systems across model development, data, infra, evaluation, and deployment spanning multiple product lines
Set technical direction for large programs; drive alignment across Research, Engineering, and Product
Integrate LLMs, multimodal models, multi-agent architectures, and RAG into Microsoft’s ecosystem
Establish best practices for MLOps, governance, and Responsible AI, compliant with Microsoft principles and industry standards
Drive original research and thought leadership (whitepapers, internal notes, patents); convert insights into shipped capabilities

Fulltime
