Principal ML Engineer - Large Scale Training Performance Optimization

AMD

Location:
United States, San Jose

Contract Type:
Not provided

Salary:

226400.00 - 339600.00 USD / Year

Job Description:

We are looking for a Principal Machine Learning Engineer to join our Models and Applications team. If you are excited by the challenge of distributed training of large models on a large number of GPUs, and if you are passionate about improving training efficiency while innovating and generating new ideas, then this role is for you. You will be part of a world-class team focused on addressing the challenge of training generative AI at scale.

Job Responsibility:

  • Train large models to convergence on AMD GPUs at scale
  • Improve the end-to-end training pipeline performance
  • Optimize the distributed training pipeline and algorithm to scale out
  • Contribute your changes to open source
  • Stay up-to-date with the latest training algorithms
  • Influence the direction of the AMD AI platform
  • Collaborate across teams with various groups and stakeholders

Requirements:

  • Experience with ML/DL frameworks such as PyTorch, JAX, or TensorFlow
  • Experience with distributed training and distributed training frameworks, such as Megatron-LM, MaxText, TorchTitan
  • Experience with LLMs or computer vision, especially large models
  • Experience with GPU kernel optimization
  • Excellent Python or C++ programming skills, including debugging, profiling, and performance analysis at scale
  • Experience with ML infra at kernel, framework, or system level
  • Strong communication and problem-solving skills
  • A master's or PhD degree in Computer Science, Artificial Intelligence, Machine Learning, or a related field

Nice to have:

  • Experience with LLMs or computer vision, especially large models, is a plus
  • Experience with GPU kernel optimization is a plus

Additional Information:

Job Posted:
March 25, 2026

Employment Type:
Fulltime
Work Type:
On-site work

Similar Jobs for Principal ML Engineer - Large Scale Training Performance Optimization

Senior Principal Machine Learning Engineer - LLM Post-Training and Optimization

Atlassian is seeking a highly skilled and experienced Senior Principal Machine L...
Location:
United States, Mountain View
Salary:
243100.00 - 407200.00 USD / Year
Atlassian
Expiration Date:
Until further notice
Requirements:
  • Ph.D. or Master’s degree in Computer Science, Machine Learning, Artificial Intelligence, or a related field
  • 8+ years of experience in machine learning, with a focus on large-scale model development and optimization
  • Deep expertise in LLM and transformer architectures (e.g., GPT, BERT, T5)
  • Strong proficiency in Python and ML frameworks such as PyTorch, JAX, or TensorFlow
  • Experience with distributed training techniques and large-scale data processing pipelines
  • Proven track record of deploying machine learning models in production environments
  • Familiarity with model optimization techniques, including quantization, pruning, and knowledge distillation
  • Strong problem-solving skills and ability to work in a fast-paced, collaborative environment
  • Excellent communication skills and ability to translate technical concepts for diverse audiences
Job Responsibility:
  • Lead the fine-tuning and post-training optimization of large language models (LLMs) for diverse applications
  • Develop and implement techniques for model compression, quantization, pruning, and knowledge distillation to optimize performance and reduce computational costs
  • Conduct research on advanced techniques in transfer learning, reinforcement learning, and prompt engineering for LLMs
  • Design and execute rigorous benchmarking and evaluation frameworks to assess model performance across multiple dimensions
  • Collaborate with infrastructure teams to optimize LLM deployment pipelines, ensuring scalability and efficiency in production environments
  • Stay at the forefront of advancements in LLM technologies, sharing insights, driving innovation within the team, and leading agile development
  • Mentor other team members, facilitate workshops within and across teams, and foster a culture of technical excellence and continuous learning
What we offer:
  • health coverage
  • paid volunteer days
  • wellness resources
Employment Type:
Fulltime

Principal Detection Engineer

We are seeking a highly skilled Principal Cyber Detection Engineer to join our t...
Location:
United States, Spring
Salary:
117500.00 - 270000.00 USD / Year
Hewlett Packard Enterprise
Expiration Date:
Until further notice
Requirements:
  • Bachelor’s or master’s degree in computer science, cybersecurity, data science, or related engineering field
  • Certifications such as CISSP, CISM, CEH or OSCP preferred
  • Proven experience (8+ years) in cybersecurity, with a focus on threat detection and response
  • Deep understanding of cybersecurity frameworks and concepts, including attack vectors, threat landscapes, and defense mechanisms
  • Familiarity with SIEM/SOAR and EDR/XDR platforms
  • Strong expertise in Machine Learning (ML) and Artificial Intelligence (AI), including model design, training, and deployment
  • Knowledge of adversarial machine learning and techniques for defending against model exploitation
  • Experience with anomaly detection, behavioral modeling, and predictive analytics in cybersecurity contexts
  • Experience with deep learning architectures or natural language processing (NLP) applied to cybersecurity
  • Experience integrating machine learning models into security operations workflows in enterprise environments
Job Responsibility:
  • Design, develop, and implement advanced threat detection systems leveraging ML/AI techniques to identify malicious activity, anomalies, and emerging risks
  • Build and optimize machine learning models for real-time detection, including supervised, unsupervised, and reinforcement learning approaches
  • Data engineering and pre-processing for cybersecurity applications
  • Analyze large-scale datasets to extract meaningful insights, detect patterns, and enhance the accuracy of detection systems
  • Develop and refine detection algorithms for intrusion detection, anomaly detection, endpoint security, behavioral analysis, and other cybersecurity applications
  • Automate detection workflows and processes to improve efficiency and scalability of security monitoring systems
  • Work closely with threat intelligence, red team, security operations, and data scientists to integrate detection models into security platforms and tools
  • Test, validate, and monitor the performance of detection models, ensuring reliability and minimizing false positives/negatives
  • Stay up to date with emerging threats, ML/AI technologies, and advancements in cybersecurity to continuously improve detection systems
  • Maintain clear documentation of models, processes, and methodologies for knowledge sharing across teams
What we offer:
  • Comprehensive suite of benefits that supports physical, financial and emotional wellbeing
  • Programs catered to helping you reach career goals
  • Flexibility to manage work and personal needs
Employment Type:
Fulltime

Principal Engineer - Marketplace

Principal Engineer role in the Marketplace Engineering team to lead breakthrough...
Location:
United States, San Francisco; Sunnyvale
Salary:
302000.00 - 336000.00 USD / Year
Uber
Expiration Date:
Until further notice
Requirements:
  • PhD in Computer Science, Machine Learning, Operations Research, or related quantitative field OR Master’s degree with 12+ years of industry experience
  • 10+ years of experience building and deploying ML models in large-scale production environments
  • Expert-level proficiency in modern ML frameworks (TensorFlow, PyTorch, JAX) and distributed computing platforms (Spark, Ray)
  • Deep expertise across multiple areas including: Deep Learning, Causal Inference, Reinforcement Learning, Multi-objective Optimization, Algorithmic Game Theory, and Large-scale Ads Ranking/Auction Systems
  • Proven track record of leading complex ML projects from research through production with significant measurable business impact
  • Strong programming skills in Python, Java, or Go with experience building production ML systems
  • Experience with feature engineering, model serving, and ML infrastructure at scale (handling millions of predictions per second)
  • Technical leadership experience including mentoring senior engineers and driving cross-team technical initiatives
  • Advanced Deep Learning and Neural Network architectures
  • Scalable ML architecture and distributed model training
Job Responsibility:
  • Lead the design and implementation of advanced ML systems for dynamic pricing algorithms serving millions of drivers across 70+ countries around the world
  • Architect real-time ML infrastructure handling 1M+ pricing decisions per second with sub-50ms latency requirements
  • Drive breakthrough research in causal ML, reinforcement learning, algorithmic game theory, and multi-objective optimization for marketplace optimization with strategic agents
  • Own end-to-end ML model lifecycle from research through production deployment and continuous optimization
  • Develop and enforce best practices in system design, ensuring data integrity, security, and optimal performance
  • Serve as a representative for the Marketplace organization to the broader internal and external technical community
  • Contribute to the eng brand for Marketplace and serve as a talent magnet to help attract and retain talent for the team
  • Stay abreast of industry trends and emerging technologies in software engineering, focused particularly on ML/AI, to enhance our systems and processes continually
  • Build scalable ML architecture and feature management systems supporting Driver Pricing and broader Marketplace teams
  • Design experimentation frameworks enabling rapid testing of pricing algorithms using A/B, Switchback, Synthetic Control, and other experimental methodologies
What we offer:
  • Eligible to participate in Uber's bonus program
  • May be offered an equity award & other types of comp
  • Eligible to participate in a 401(k) plan
  • Eligible for various benefits (details at provided link)
Employment Type:
Fulltime

Principal Engineer, Model Dev Platform

As the Principal Engineer for the Model Development Platform at Wayve, you will ...
Location:
United States, Sunnyvale
Salary:
Not provided
Wayve
Expiration Date:
Until further notice
Requirements:
  • Technical Leadership at Scale – 10+ years of experience designing and building large-scale distributed systems, ML/AI infrastructure, full-stack web applications, or developer platforms, including at least 3 years as a staff or principal-level engineer
  • Architectural Depth & Breadth – Proven ability to design systems spanning web platforms, ML pipelines, and large-scale compute orchestration (e.g., Spark, Ray, Kubernetes, Airflow, MLflow)
  • Reliability & Performance Mindset – Experience driving platform reliability improvements, defining SLAs/SLOs, and building self-healing and observable systems that operate at “four nines” availability or better
  • Hands-On Systems Design – Deep understanding of distributed computing, workflow orchestration, data modeling, and API design, with the ability to write and review production-quality code
  • Collaborative Influence – Excellent communication and cross-functional collaboration skills; ability to guide engineers, managers, and researchers toward unified technical direction
  • Mentorship & Culture – Demonstrated success in mentoring engineers across levels and cultivating a culture of engineering excellence
  • Education – Bachelor’s degree in Computer Science, Software Engineering, or related field (advanced degree preferred, or equivalent experience)
Job Responsibility:
  • Design and evolve the overarching architecture of the model development platform, ensuring system-wide reliability, observability, and scalability
  • Work across disciplines—from front-end web UIs to large-scale distributed training, from Spark-based data pipelines to experiment scheduling algorithms using linear optimization—to unify the platform’s architecture and ensure smooth interoperability between systems
  • Dive deep into the thorniest technical challenges faced by individual subteams, bringing your expertise in distributed systems, large-scale compute, and system design to bear
  • Develop and refine systems that optimize how models are tested—whether in simulation or on-road—balancing constraints like hardware availability, safety requirements, and research priorities
  • Architect data processing pipelines capable of ingesting, transforming, and enriching petabytes of sensor data from the global fleet
  • Serve as a mentor and coach for engineers across the organization—developing technical talent, improving design practices, and fostering a culture of learning and technical excellence
  • Partner with Product Management, Research, and Operations to align technical architecture with user needs and product vision

Principal Machine Learning Engineer

This is a high-leverage leadership role that spans architecture, execution, and ...
Location:
India, Bengaluru
Salary:
Not provided
Ema
Expiration Date:
Until further notice
Requirements:
  • Bachelor’s or Master’s (or PhD) degree in Computer Science, Machine Learning, Statistics, or a related field
  • A strong track record (usually 10-12+ years) of applied experience with ML techniques, especially in large-scale settings
  • Experience building production ML systems that operate at scale (latency / throughput / cost constraints)
  • Experience in the knowledge retrieval and search space
  • Exposure to building agentic systems and frameworks
  • Proficiency in relevant programming languages (e.g. Python, C++, Java) and ML frameworks (TensorFlow, PyTorch, etc.)
  • Strong understanding of the full ML lifecycle: data pipelines, feature engineering, model training, serving, monitoring, maintenance
  • Experience designing systems for monitoring, diagnostics, logging, model versioning, etc.
  • Deep knowledge of computational trade-offs: distributed training, inference, optimizations (e.g. quantization, pruning, batching)
  • Excellent communication skills
Job Responsibility:
  • Lead the technical direction of GenAI and agentic ML systems that power enterprise-grade AI agents — spanning reasoning, retrieval, tool use, and integrations across various SaaS products
  • Architect, design, and implement scalable production pipelines for model training, fine-tuning, retrieval (RAG), agent orchestration, and evaluation — ensuring robustness, latency efficiency, and continuous learning
  • Define and own the multi-year ML roadmap for GenAI infrastructure — including agent frameworks, RAG systems, world-class evaluation loops, and integration with MCP, browser, and vision pipelines
  • Identify and integrate cutting-edge ML methods / research (deep learning, large models, recommender systems, LLMs, etc.) into Ema’s products or infrastructure
  • Research, prototype, and integrate cutting-edge ML and LLM advancements (reasoning, memory architectures, multi-modal perception, long-context models, autonomous agents) into the platform
  • Optimize trade-offs between accuracy, latency, cost, interpretability, and real-world reliability across the agent lifecycle — from prompt design to orchestration and execution
  • Champion engineering excellence — drive observability, reproducibility, versioning, testing, and bias-aware development across ML and agentic systems
  • Mentor and elevate senior engineers and researchers, fostering a culture of scientific rigor, experimentation, and system-level thinking
  • Collaborate cross-functionally with product, infra, and research teams to align ML innovation with enterprise needs — enabling secure integrations, privacy-aware deployments, and scalable use cases
  • Influence data strategy — guide how retrieval indices, embeddings, structured/unstructured corpora, and feedback loops evolve to improve grounding, factuality, and reasoning depth
Employment Type:
Fulltime

Principal Applied Researcher AI/NLP

At PointClickCare our mission is simple: to help providers deliver exceptional c...
Location:
United States
Salary:
195800.00 - 217500.00 USD / Year
PointClickCare
Expiration Date:
Until further notice
Requirements:
  • PhD or comparable level of experience in Computer Science, Math, Physics, Engineering or a related field
  • 4-10+ years of industry experience building solutions in commercial SaaS, including at least 4 years working in applications of NLP, Search or AI/ML technologies for healthcare
  • Strong interest in applying AI/ML/NLP to healthcare related problems and data
  • Expert-level practical, hands-on experience developing and applying a wide range of techniques in Natural Language Processing, including fine tuning of LLMs and other Transformer models, plus one or more additional AI/ML or Search related areas of expertise to solve real-world problems at scale
  • Demonstrated ability to lead and perform research and experimentation to select appropriate approaches, algorithms, evaluation methods, and frameworks, as well as tasks such as feature selection, language modeling, evaluation and fine tuning or training models, applying standard approaches or developing new tools or workflows as needed to meet project requirements
  • Significant experience building and deploying AI/machine learning and NLP models for large-scale SaaS products, including familiarity with industry standard software development concepts such as scaling issues, version control, CI/CD pipelines, and security
  • Solid understanding and experience with transformer models and multiple kinds of NLP and ML models and approaches including logistic regression, random forest, ensemble methods, SVM, KNN, reinforcement learning, and other ML techniques
  • Proficiency in Python and Java required. Proficiency in JavaScript or TypeScript and modern UI frameworks for building prototype or tool front ends desired
  • Proficiency doing data engineering for ML and NLP applications, including exposure to database systems and proficiency with SQL
  • Proficiency building models from big data using modern packages, models and data analysis stacks such as NumPy, SciPy, Pandas, Scikit-learn, PyTorch, Keras, LightGBM, fastText, NLTK, and spaCy. Proficiency fine tuning Hugging Face Transformers required
Job Responsibility:
  • You will be applying NLP including GenAI and other AI/ML techniques to develop model systems and solutions, collaborating across functions to scale and integrate advanced solutions into successful end user experiences in large-scale cloud based SaaS production environments for healthcare
  • You will be working with product leaders, clinical informaticists, data scientists, UI/UX researchers and designers, other AI and machine learning and domain experts, engineering teams and others, including work with customers and users who are healthcare professionals
  • Design, build and evaluate solutions that may involve structured or unstructured data including speech or natural language for healthcare use cases, delivering capabilities such as summarization, predictive models, recommenders, semantic search, extraction, classification or other NLP, AI or machine learning based techniques
  • You will be performing research and experimentation to select appropriate approaches, algorithms, evaluation methods and frameworks and doing the R&D to deliver model systems
  • You will perform, oversee and assist in data collection, data cleaning, data analysis, algorithm selection or design, prompt tuning, parameter fine tuning, training, development and evaluation of systems that deliver responsible AI solutions at scale, using existing or developing new tools or workflows as needed
  • As a principal applied researcher, you will bring deep technical expertise and also provide mentorship on advanced AI, NLP, data science, statistical and machine learning methods and technologies, helping the organization develop new capabilities for innovative solutions
  • You will have substantial independence and responsibility from day one
What we offer:
  • Benefits starting from Day 1
  • Retirement Plan Matching
  • Flexible Paid Time Off
  • Wellness Support Programs and Resources
  • Parental & Caregiver Leaves
  • Fertility & Adoption Support
  • Continuous Development Support Program
  • Employee Assistance Program
  • Allyship and Inclusion Communities
  • Employee Recognition … and more
Employment Type:
Fulltime

Principal Research Engineer - Agent 365

Copilot usage is growing rapidly across Microsoft 365 and custom agent experienc...
Location:
United States, Redmond
Salary:
139900.00 - 274800.00 USD / Year
Microsoft Corporation
Expiration Date:
Until further notice
Requirements:
  • Bachelor's Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
  • Ability to meet Microsoft, customer and/or government security screening requirements is required for this role
  • This position will be required to pass the Microsoft Cloud background check upon hire/transfer and every two years thereafter
Job Responsibility:
  • Architect and deliver AI systems across model development, data, infra, evaluation, and deployment spanning multiple product lines
  • Set technical direction for large programs; drive alignment across Research, Engineering, and Product
  • Integrate LLMs, multimodal models, multi-agent architectures, and RAG into Microsoft’s ecosystem
  • Establish standards for MLOps, governance, and Responsible AI, compliant with Microsoft principles and industry standards
  • Drive original research and thought leadership (whitepapers, internal notes, patents); convert insights into shipped capabilities
  • Research Translation: Continuously review emerging work; identify high-potential methods and adapt them to Microsoft problem spaces
  • Production Integration: Turn research prototypes into production-quality code optimized for scale, latency, and maintainability
Employment Type:
Fulltime

Principal Research Engineer

As a Principal Research Engineer at Microsoft, you will set the technical vision...
Location:
United States, Redmond
Salary:
139900.00 - 274800.00 USD / Year
Microsoft Corporation
Expiration Date:
Until further notice
Requirements:
  • Bachelor's Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience.
  • Ability to meet Microsoft, customer and/or government security screening requirements is required for this role. These requirements include but are not limited to the following specialized security screenings: Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud background check upon hire/transfer and every two years thereafter.
Job Responsibility:
  • Define and execute technical strategy for foundational models, multi-agent systems, and next-generation Copilot experiences, especially within Business & Industry Copilot.
  • Lead cross-team efforts to deliver scalable, reliable, and responsible AI systems.
  • Advance the state of the art and translate breakthroughs into measurable customer and business impact.
  • Architect and deliver complex AI systems across model development, data, infra, evaluation, and deployment spanning multiple product lines.
  • Set technical direction for large programs; drive alignment across Research, Engineering, and Product.
  • Integrate LLMs, multimodal models, multi-agent architectures, and RAG into Microsoft’s ecosystem.
  • Establish best practices for MLOps, governance, and Responsible AI, compliant with Microsoft principles and industry standards.
  • Drive original research and thought leadership (whitepapers, internal notes, patents); convert insights into shipped capabilities.
Employment Type:
Fulltime