Principal ML Engineer - Large Scale Training Performance Optimization

AMD

Location:
United States, San Jose

Contract Type:
Not provided

Salary:

226400.00 - 339600.00 USD / Year

Job Description:

We are looking for a Principal Machine Learning Engineer to join our Models and Applications team. If you are excited by the challenge of distributed training of large models across a large number of GPUs, and you are passionate about improving training efficiency while generating new ideas, this role is for you. You will be part of a world-class team focused on training generative AI at scale.

Job Responsibility:

  • Train large models to convergence on AMD GPUs at scale
  • Improve the end-to-end training pipeline performance
  • Optimize the distributed training pipeline and algorithm to scale out
  • Contribute your changes to open source
  • Stay up-to-date with the latest training algorithms
  • Influence the direction of the AMD AI platform
  • Collaborate across teams with various groups and stakeholders

Requirements:

  • Experience with ML/DL frameworks such as PyTorch, JAX, or TensorFlow
  • Experience with distributed training and distributed training frameworks, such as Megatron-LM, MaxText, TorchTitan
  • Experience with LLMs or computer vision, especially large models
  • Experience with GPU kernel optimization
  • Excellent Python or C++ programming skills, including debugging, profiling, and performance analysis at scale
  • Experience with ML infra at kernel, framework, or system level
  • Strong communication and problem-solving skills
  • A master's degree or PhD in Computer Science, Artificial Intelligence, Machine Learning, or a related field

Nice to have:

  • Experience with LLMs or computer vision, especially large models, is a plus
  • Experience with GPU kernel optimization is a plus

Additional Information:

Job Posted:
March 25, 2026

Employment Type:
Fulltime
Work Type:
On-site work

Similar Jobs for Principal ML Engineer - Large Scale Training Performance Optimization

Senior Principal Machine Learning Engineer - LLM Post-Training and Optimization

Atlassian is seeking a highly skilled and experienced Senior Principal Machine L...
Location:
United States, Mountain View
Salary:
243100.00 - 407200.00 USD / Year
Atlassian
Expiration Date
Until further notice
Requirements:
  • Ph.D. or Master’s degree in Computer Science, Machine Learning, Artificial Intelligence, or a related field
  • 8+ years of experience in machine learning, with a focus on large-scale model development and optimization
  • Deep expertise in LLM and transformer architectures (e.g., GPT, BERT, T5)
  • Strong proficiency in Python and ML frameworks such as PyTorch, JAX, or TensorFlow
  • Experience with distributed training techniques and large-scale data processing pipelines
  • Proven track record of deploying machine learning models in production environments
  • Familiarity with model optimization techniques, including quantization, pruning, and knowledge distillation
  • Strong problem-solving skills and ability to work in a fast-paced, collaborative environment
  • Excellent communication skills and ability to translate technical concepts for diverse audiences
Job Responsibility:
  • Lead the fine-tuning and post-training optimization of large language models (LLMs) for diverse applications
  • Develop and implement techniques for model compression, quantization, pruning, and knowledge distillation to optimize performance and reduce computational costs
  • Conduct research on advanced techniques in transfer learning, reinforcement learning, and prompt engineering for LLMs
  • Design and execute rigorous benchmarking and evaluation frameworks to assess model performance across multiple dimensions
  • Collaborate with infrastructure teams to optimize LLM deployment pipelines, ensuring scalability and efficiency in production environments
  • Stay at the forefront of advancements in LLM technologies, sharing insights, driving innovation within the team, and leading agile development
  • Mentor other team members, facilitate workshops within and across teams, and foster a culture of technical excellence and continuous learning
What we offer:
  • health coverage
  • paid volunteer days
  • wellness resources
Employment Type:
Fulltime

Principal Detection Engineer

We are seeking a highly skilled Principal Cyber Detection Engineer to join our t...
Location:
United States, Spring
Salary:
117500.00 - 270000.00 USD / Year
Hewlett Packard Enterprise
Expiration Date
Until further notice
Requirements:
  • Bachelor’s or master’s degree in computer science, cybersecurity, data science, or related engineering field
  • Certifications such as CISSP, CISM, CEH or OSCP preferred
  • Proven experience (8+ years) in cybersecurity, with a focus on threat detection and response
  • Deep understanding of cybersecurity frameworks and concepts, including attack vectors, threat landscapes, and defense mechanisms
  • Familiarity with SIEM/SOAR and EDR/XDR platforms
  • Strong expertise in Machine Learning (ML) and Artificial Intelligence (AI), including model design, training, and deployment
  • Knowledge of adversarial machine learning and techniques for defending against model exploitation
  • Experience with anomaly detection, behavioral modeling, and predictive analytics in cybersecurity contexts
  • Experience with deep learning architectures or natural language processing (NLP) applied to cybersecurity
  • Experience integrating machine learning models into security operations workflows in enterprise environments
Job Responsibility:
  • Design, develop, and implement advanced threat detection systems leveraging ML/AI techniques to identify malicious activity, anomalies, and emerging risks
  • Build and optimize machine learning models for real-time detection, including supervised, unsupervised, and reinforcement learning approaches
  • Data engineering and pre-processing for cybersecurity applications
  • Analyze large-scale datasets to extract meaningful insights, detect patterns, and enhance the accuracy of detection systems
  • Develop and refine detection algorithms for intrusion detection, anomaly detection, endpoint security, behavioral analysis, and other cybersecurity applications
  • Automate detection workflows and processes to improve efficiency and scalability of security monitoring systems
  • Work closely with threat intelligence, red team, security operations, and data scientists to integrate detection models into security platforms and tools
  • Test, validate, and monitor the performance of detection models, ensuring reliability and minimizing false positives/negatives
  • Stay up to date with emerging threats, ML/AI technologies, and advancements in cybersecurity to continuously improve detection systems
  • Maintain clear documentation of models, processes, and methodologies for knowledge sharing across teams
What we offer:
  • Comprehensive suite of benefits that supports physical, financial and emotional wellbeing
  • Programs catered to helping you reach career goals
  • Flexibility to manage work and personal needs
Employment Type:
Fulltime

Principal Engineer - Marketplace

Principal Engineer role in the Marketplace Engineering team to lead breakthrough...
Location:
United States, San Francisco; Sunnyvale
Salary:
302000.00 - 336000.00 USD / Year
Uber
Expiration Date
Until further notice
Requirements:
  • PhD in Computer Science, Machine Learning, Operations Research, or related quantitative field OR Master’s degree with 12+ years of industry experience
  • 10+ years of experience building and deploying ML models in large-scale production environments
  • Expert-level proficiency in modern ML frameworks (TensorFlow, PyTorch, JAX) and distributed computing platforms (Spark, Ray)
  • Deep expertise across multiple areas including: Deep Learning, Causal Inference, Reinforcement Learning, Multi-objective Optimization, Algorithmic Game Theory, and Large-scale Ads Ranking/Auction Systems
  • Proven track record of leading complex ML projects from research through production with significant measurable business impact
  • Strong programming skills in Python, Java, or Go with experience building production ML systems
  • Experience with feature engineering, model serving, and ML infrastructure at scale (handling millions of predictions per second)
  • Technical leadership experience including mentoring senior engineers and driving cross-team technical initiatives
  • Advanced Deep Learning and Neural Network architectures
  • Scalable ML architecture and distributed model training
Job Responsibility:
  • Lead the design and implementation of advanced ML systems for dynamic pricing algorithms serving millions of drivers across 70+ countries around the world
  • Architect real-time ML infrastructure handling 1M+ pricing decisions per second with sub-50ms latency requirements
  • Drive breakthrough research in causal ML, reinforcement learning, algorithmic game theory, and multi-objective optimization for marketplace optimization with strategic agents
  • Own end-to-end ML model lifecycle from research through production deployment and continuous optimization
  • Develop and enforce best practices in system design, ensuring data integrity, security, and optimal performance
  • Serve as a representative for the Marketplace organization to the broader internal and external technical community
  • Contribute to the eng brand for Marketplace and serve as a talent magnet to help attract and retain talent for the team
  • Stay abreast of industry trends and emerging technologies in software engineering, focused particularly on ML/AI, to enhance our systems and processes continually
  • Build scalable ML architecture and feature management systems supporting Driver Pricing and broader Marketplace teams
  • Design experimentation frameworks enabling rapid testing of pricing algorithms using A/B, Switchback, Synthetic Control, and other experimental methodologies
What we offer:
  • Eligible to participate in Uber's bonus program
  • May be offered an equity award & other types of comp
  • Eligible to participate in a 401(k) plan
  • Eligible for various benefits (details at provided link)
Employment Type:
Fulltime

Principal Engineer, ASIC Development Engineering (Frontend Architect - AI Storage Solutions)

In this Frontend Architect position, you will develop AI Storage Solutions based...
Location:
India, Bangalore
Salary:
Not provided
Sandisk
Expiration Date
Until further notice
Requirements:
  • Bachelor's, Master's, or PhD in Computer/Electrical Engineering with 8+ years of hands-on architecture experience authoring specifications
  • Strong technical background architecting SoC and I/O subsystems involving PCIe and PCIe-DMA engines, or UCIe or CXL or UAL
  • Strong IO subsystem microarchitecture, technical, and working knowledge of the PCIe/UCIe protocol specifications
  • Knowledge of I/O Subsystem and DMA interactions with internal embedded processor-subsystems (x86, RISC-V or ARM) and external host CPU
  • Good understanding of computer/graphics architecture, ML, LLM
  • Architecting GPU/TPU/xPU accelerator systems with an optimized high-bandwidth memory hierarchy and frontend architecture for multi-trillion-parameter LLM training/inference, including Dense and Mixture of Experts (MoE) models with multiple modalities (text, vision, speech)
  • Deep experience optimizing large-scale ML systems, GPU architectures
  • Proficiency in principles and methods of microarchitecture, software, and hardware relevant to performance engineering
  • Multi-disciplinary experience, including familiarity with Firmware and ASIC design
  • Expertise in CUDA programming, GPU memory hierarchies, and hardware-specific optimizations
Job Responsibility:
  • Responsible for driving the SoC architecture, with a particular focus on I/O subsystems connected over UCIe, PCIe, UAL or CXL
  • Define I/O subsystem and PCIe DMA architectures, including their interactions with internal embedded processor-subsystems, Network on Chip, Memory controllers, and FPGA fabric
  • Create flexible and modular I/O subsystem architectures that can be deployed in either chiplet, monolithic or 3D form factors
  • Work with customers and cross-functional teams to scope SoC requirements, analyze PPA tradeoffs, and then define architectural requirements that meet the PPA and schedule targets
  • Define I/O subsystem and DMA hardware, software, and firmware interactions with embedded processing subsystems and SoC CPUs on the device side and Host CPUs
  • Author architecture specifications in clear and concise language. Guide and assist pre-silicon design/verification and post-silicon validation during the execution phase
  • Responsible for improving the AI/ML ASIC Architecture performance through hardware & software co-optimization, post-silicon performance analysis, and influencing the strategic product roadmap
  • LLM Workload analysis and characterization of ASIC and competitive datacenter and AI solutions to identify opportunities for performance improvement in our products
  • Experience architecting one or more components of AI/ML accelerator ASICs such as HBM, PCIe/UCIe/CXL, NoC, DMA, Firmware Interactions, NAND, xPU, fabrics, etc.
  • Drive the AI Storage Solutions frontend system architecture with GPU/TPU/NPU/xPU to match or exceed the nextgen HBM bandwidth
Employment Type:
Fulltime

Principal Engineer, Model Dev Platform

As the Principal Engineer for the Model Development Platform at Wayve, you will ...
Location:
United States, Sunnyvale
Salary:
Not provided
Wayve
Expiration Date
Until further notice
Requirements:
  • Technical Leadership at Scale – 10+ years of experience designing and building large-scale distributed systems, ML/AI infrastructure, full-stack web applications, or developer platforms, including at least 3 years as a staff or principal-level engineer
  • Architectural Depth & Breadth – Proven ability to design systems spanning web platforms, ML pipelines, and large-scale compute orchestration (e.g., Spark, Ray, Kubernetes, Airflow, MLflow)
  • Reliability & Performance Mindset – Experience driving platform reliability improvements, defining SLAs/SLOs, and building self-healing and observable systems that operate at “four nines” availability or better
  • Hands-On Systems Design – Deep understanding of distributed computing, workflow orchestration, data modeling, and API design, with the ability to write and review production-quality code
  • Collaborative Influence – Excellent communication and cross-functional collaboration skills; ability to guide engineers, managers, and researchers toward unified technical direction
  • Mentorship & Culture – Demonstrated success in mentoring engineers across levels and cultivating a culture of engineering excellence
  • Education – Bachelor’s degree in Computer Science, Software Engineering, or related field (advanced degree preferred, or equivalent experience)
Job Responsibility:
  • Design and evolve the overarching architecture of the model development platform, ensuring system-wide reliability, observability, and scalability
  • Work across disciplines—from front-end web UIs to large-scale distributed training, from Spark-based data pipelines to experiment scheduling algorithms using linear optimization—to unify the platform’s architecture and ensure smooth interoperability between systems
  • Dive deep into the thorniest technical challenges faced by individual subteams, bringing your expertise in distributed systems, large-scale compute, and system design to bear
  • Develop and refine systems that optimize how models are tested—whether in simulation or on-road—balancing constraints like hardware availability, safety requirements, and research priorities
  • Architect data processing pipelines capable of ingesting, transforming, and enriching petabytes of sensor data from the global fleet
  • Serve as a mentor and coach for engineers across the organization—developing technical talent, improving design practices, and fostering a culture of learning and technical excellence
  • Partner with Product Management, Research, and Operations to align technical architecture with user needs and product vision

Principal Machine Learning Engineer

This is a high-leverage leadership role that spans architecture, execution, and ...
Location:
India, Bengaluru
Salary:
Not provided
Ema
Expiration Date
Until further notice
Requirements:
  • Bachelor’s or Master’s (or PhD) degree in Computer Science, Machine Learning, Statistics, or a related field
  • A strong track record (usually 10-12+ years) of applied experience with ML techniques, especially in large-scale settings
  • Experience building production ML systems that operate at scale (latency / throughput / cost constraints)
  • Experience in the knowledge retrieval and search space
  • Exposure to building agentic systems and frameworks
  • Proficiency in relevant programming languages (e.g. Python, C++, Java) and ML frameworks (TensorFlow, PyTorch, etc.)
  • Strong understanding of the full ML lifecycle: data pipelines, feature engineering, model training, serving, monitoring, maintenance
  • Experience designing systems for monitoring, diagnostics, logging, model versioning, etc.
  • Deep knowledge of computational trade-offs: distributed training, inference, optimizations (e.g. quantization, pruning, batching)
  • Excellent communication skills
Job Responsibility:
  • Lead the technical direction of GenAI and agentic ML systems that power enterprise-grade AI agents — spanning reasoning, retrieval, tool use, and integrations across various SaaS products
  • Architect, design, and implement scalable production pipelines for model training, fine-tuning, retrieval (RAG), agent orchestration, and evaluation — ensuring robustness, latency efficiency, and continuous learning
  • Define and own the multi-year ML roadmap for GenAI infrastructure — including agent frameworks, RAG systems, world-class evaluation loops, and integration with MCP, browser, and vision pipelines
  • Identify and integrate cutting-edge ML methods / research (deep learning, large models, recommender systems, LLMs, etc.) into Ema’s products or infrastructure
  • Research, prototype, and integrate cutting-edge ML and LLM advancements (reasoning, memory architectures, multi-modal perception, long-context models, autonomous agents) into the platform
  • Optimize trade-offs between accuracy, latency, cost, interpretability, and real-world reliability across the agent lifecycle — from prompt design to orchestration and execution
  • Champion engineering excellence — drive observability, reproducibility, versioning, testing, and bias-aware development across ML and agentic systems
  • Mentor and elevate senior engineers and researchers, fostering a culture of scientific rigor, experimentation, and system-level thinking
  • Collaborate cross-functionally with product, infra, and research teams to align ML innovation with enterprise needs — enabling secure integrations, privacy-aware deployments, and scalable use cases
  • Influence data strategy — guide how retrieval indices, embeddings, structured/unstructured corpora, and feedback loops evolve to improve grounding, factuality, and reasoning depth
Employment Type:
Fulltime

Principal Software Engineering Manager - AI Frameworks

As a Principal Software Engineering Manager - AI Frameworks on the team, you wil...
Location:
United States, Redmond
Salary:
139900.00 - 304200.00 USD / Year
Microsoft Corporation
Expiration Date
Until further notice
Requirements:
  • Bachelor's Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
  • Master’s Degree in Computer Science or related technical field AND 10+ years of software engineering experience, including 6+ years in engineering management, OR Bachelor’s Degree in Computer Science or related technical field AND 12+ years of software engineering experience, including 6+ years in engineering management, or equivalent experience
  • Strong technical foundation in software engineering principles, computer architecture, GPU architecture, and hardware acceleration for neural networks, with the ability to guide teams working in these areas
  • Experience leading teams responsible for end-to-end performance analysis and optimization of LLMs, AI systems, or HPC workloads, including use of GPU profiling and performance analysis tools
  • Demonstrated ability to lead cross-team initiatives, align stakeholders, and translate research or platform capabilities into scalable, production-ready solutions
  • Proven people leadership skills, including hiring, coaching, performance management, and career development, with a track record of building high-performing, inclusive teams
  • Exposure to AI / ML infrastructure, including DNN or LLM training and/or inference systems, and experience with at least one modern deep learning framework (e.g., PyTorch, TensorFlow, ONNX Runtime)
  • Familiarity with GPU software stacks and acceleration technologies such as CUDA, ROCm, Triton, or equivalent, sufficient to guide technical direction and evaluate tradeoffs
Job Responsibility:
  • Lead and develop a team of engineers working across multiple layers of the AI software stack to enable large-scale training and inference
  • Set technical vision and execution strategy for model performance benchmarking, optimization, and deployment across GPUs and Microsoft hardware
  • Drive performance outcomes by prioritizing and overseeing efforts to benchmark, profile, debug, and optimize training and inference workloads
  • Own performance health by establishing mechanisms to monitor regressions, measure impact, and continuously improve time-to-deploy and hardware efficiency
  • Partner cross-functionally with research, product, infrastructure, and hardware teams to deliver scalable, production-ready AI performance improvements
  • Balance short-term delivery and long-term investments, ensuring the team’s work aligns with organizational goals, platform roadmaps, and Azure capex objectives
  • Build a strong engineering culture through coaching, feedback, hiring, and career development, enabling the team to operate with increasing autonomy and impact
Employment Type:
Fulltime

Senior Principal Engineering Manager

Microsoft Research (MSR) is working to transform the future of artificial intell...
Location:
United States, Redmond
Salary:
163000.00 - 296400.00 USD / Year
Microsoft Corporation
Expiration Date
Until further notice
Requirements:
  • Bachelor's Degree in Computer Science or related technical field AND 8+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
  • 5+ years of people management experience leading software engineering teams, including managing principal engineers
  • Experience building or operating infrastructure for large-scale distributed systems, cloud platforms, or artificial intelligence (AI)/machine learning (ML) workloads
  • Track record of driving execution on complex, multi-workstream infrastructure projects with clear milestones and accountability
  • Technical fluency in one or more of: large-scale compute clusters, GPU infrastructure, scheduling and orchestration (Kubernetes, Volcano), or High-Performance Compute (HPC) environments
  • Experience with GPU programming (CUDA, NCCL) and frameworks such as PyTorch
  • Expertise in networking (InfiniBand, NVLink), storage systems, or distributed training parallelisms
  • A track record of strong cross-functional partnerships, including the ability to align on strategic direction, deliver joint accountabilities, and develop relationships with staff members with widely varied expertise
  • Experience scaling engineering teams through significant growth phases (hiring, onboarding, and integrating new engineers into a high-performing team)
  • Master's Degree in Computer Science or related technical field AND 12+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR Bachelor's Degree in Computer Science or related technical field AND 15+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
Job Responsibility:
  • Lead, mentor, and grow the engineering team that builds MSR’s AI research infrastructure
  • Recruit and develop exceptional engineering talent, building a diverse team - including hiring, onboarding, career development, and performance management
  • Drive execution across the team by setting clear goals, tracking milestones, managing dependencies, and ensuring accountability for delivering complex infrastructure projects on time and at high quality
  • Lead team culture and process changes, cultivating an AI-first mentality that accelerates our progress through agentic coding, automation, and skills development
  • Provide technical vision and judgment on the team's architecture, strategy, and roadmap — spanning supercomputer GPU clusters, high performance networking, workload optimization, researcher tools, and agentic workflows — while empowering engineers to own deep technical details
  • Collaborate closely cross-discipline with engineers, program managers, and research and science teams to align priorities, resolve dependencies, and build better solutions together
  • Foster a team culture of operational excellence, continuous improvement, and high psychological safety where engineers are empowered to take ownership and innovate
Employment Type:
Fulltime