
GPU Training Optimization Engineer

China, Shanghai 800000.00 - 1000000.00 CNY / Year · Job Posted May 05, 2026

Job Description

The company has almost 100 million customers in Japan and 1 billion globally, and provides more than 70 services across areas such as e-commerce, payment services, financial services, telecommunications, media, and sports.

Job Responsibility

  • Optimize LLM training frameworks (e.g., PyTorch, DeepSpeed, Megatron-LM, FSDP) to maximize GPU utilization and reduce training time
  • Profile and optimize distributed training bottlenecks (e.g., NCCL issues, CUDA kernel efficiency, communication overhead)
  • Implement and tune inference optimizations (e.g., quantization, dynamic batching, KV caching) for low-latency, high-throughput LLM serving (vLLM, TensorRT-LLM, Triton, SGLang)
  • Collaborate with infrastructure teams to improve GPU cluster scheduling, resource allocation, and fault tolerance for large-scale training jobs
  • Develop benchmarking tools to measure and improve training throughput, memory efficiency, and inference latency
  • Research and apply cutting-edge techniques (e.g., mixture-of-experts, speculative decoding) to optimize LLM performance.

Requirements

  • 3+ years of hands-on experience in GPU-accelerated ML training & inference optimization, preferably for LLMs or large-scale deep learning models
  • Deep expertise in PyTorch, DeepSpeed, FSDP, or Megatron-LM, with experience in distributed training optimizations
  • Strong knowledge of LLM inference optimizations (e.g., quantization, pruning, KV caching, continuous batching)
  • Bachelor’s or higher degree in Computer Science, Engineering, or related field.

Similar Jobs for GPU Training Optimization Engineer

8 matching positions

AI/ML Performance Engineer - GPU Optimization

As an AI Performance Engineer, you will focus on pushing machine learning worklo...
Location: Finland, Helsinki
Salary: Not provided
Company: AMD
Expiration Date: Until further notice
Requirements
  • Experience with profiling, debugging, benchmarking, and optimization tools
  • Familiarity with ML frameworks (e.g., PyTorch, JAX, TF) and inference serving frameworks (e.g., vLLM, SGLang)
  • Strong C++ and/or Python skills, along with the basics: Unix, Git, terminal, debugging, testing, thinking
  • Experience with Docker, container orchestration (Kubernetes), and job schedulers (Slurm)
  • Ability to work independently and collaboratively in a multi-cultural team
  • Excellent communication skills in a fast-moving environment
  • BSc, MSc, PhD or equivalent experience in Computer Science, Electrical Engineering or a related field
Job Responsibility
  • Explore and benchmark ML models and workloads (including diffusion models, LLMs, and multimodal systems) to identify bottlenecks across compute, memory, and networking layers
  • Optimize performance for inference and training on AMD GPUs, including parallelization strategies, quantization techniques, serving orchestration, network communication and distributed execution
  • Perform deep profiling to uncover inefficiencies in ML frameworks, data pipelines, compiler tools, and key tensor operations such as GEMMs, convolutions, and attention
  • Support AMD top-tier customers to improve model throughput, reduce latency, and optimize resource utilization across multi-GPU and cluster environments
  • Work closely with hardware, compiler, and software teams to drive improvements across the full ROCm stack
  • Communicate performance bottlenecks, solutions, and optimization strategies to stakeholders
  • Work with international teams located across Europe, US and Asia

Senior GPU Software Performance Engineer — Post‑Training

Drive the performance of post‑training workloads on AMD Instinct™ GPUs. You’ll w...
Location: United States, San Jose
Salary: 204000.00 - 306000.00 USD / Year
Company: AMD
Expiration Date: Until further notice
Requirements
  • Proven GPU performance engineering for deep learning (ROCm/HIP, Triton, or similar)
  • Hands-on with SFT, LoRA, and RL-based training at scale
  • Strong PyTorch experience (torch.distributed, FSDP/ZeRO or equivalent)
  • Proficient in Python and C++; comfortable reading/writing kernels when needed
  • Experience with distributed systems and collective communication libraries
  • Track record of turning profiles into fixes, upstreaming changes, and documenting results
Job Responsibility
  • Lead performance for finetuning and RL training solutions on AMD GPUs
  • Improve throughput, memory efficiency, and stability across data, model, and optimizer steps
  • Optimize multi-GPU/multi-node training and communication patterns
  • Contribute efficient kernels/ops and targeted graph-level optimizations
  • Profile, diagnose, and resolve bottlenecks using standard tooling; prevent regressions in CI
  • Ship reproducible pipelines and documentation adopted by internal teams and external developers
  • Collaborate with framework, compiler, and model teams to land durable improvements
Employment type: Full-time

Principal ML Engineer - Large Scale Training Performance Optimization

We are looking for a Principal Machine Learning Engineer to join our Models and ...
Location: United States, San Jose; Bellevue
Salary: 226400.00 - 339600.00 USD / Year
Company: AMD
Expiration Date: Until further notice
Requirements
  • Experience with ML/DL frameworks such as PyTorch, JAX, or TensorFlow
  • Experience with distributed training and distributed training frameworks, such as Megatron-LM, MaxText, TorchTitan
  • Experience with LLMs or computer vision, especially large models
  • Experience with GPU kernel optimization
  • Excellent Python or C++ programming skills, including debugging, profiling, and performance analysis at scale
  • Experience with ML infra at kernel, framework, or system level
  • Strong communication and problem-solving skills
  • A master's or PhD degree in Computer Science, Artificial Intelligence, Machine Learning, or a related field
Job Responsibility
  • Train large models to convergence on AMD GPUs at scale
  • Improve the end-to-end training pipeline performance
  • Optimize the distributed training pipeline and algorithm to scale out
  • Contribute your changes to open source
  • Stay up-to-date with the latest training algorithms
  • Influence the direction of the AMD AI platform
  • Collaborate across teams with various groups and stakeholders
Employment type: Full-time

Senior AI Infrastructure Engineer - Training Platform

As a Software Engineer on the Machine Learning Infrastructure team, you will bui...
Location: United States, San Francisco; Seattle; New York
Salary: 216000.00 - 270000.00 USD / Year
Company: Scale
Expiration Date: Until further notice
Requirements
  • 5+ years of experience in backend or infrastructure engineering, with at least 2 years focused on orchestrating ML workloads at scale (100+ GPU nodes)
  • Strong programming skills in one or more languages (e.g. Python, Go, Rust, C++)
  • Experience with complex compute management systems that cover queueing, quotas, preemption, and gang scheduling
  • Experience with distributed training infrastructure, such as EFA, InfiniBand, and topology-aware scheduling
  • Experience with distributed storage systems (e.g. Lustre, S3) as they relate to training throughput
  • Expert-level knowledge of Kubernetes internals (Custom Resources, Operators, Admission Controllers) and how they interact with device plugins for specialized hardware
  • Familiarity with cloud infrastructure (AWS, GCP) and infrastructure as code (e.g., Terraform)
  • Proven ability to solve complex problems and work independently in fast-moving environments
Job Responsibility
  • Architect and scale a multi-tenant orchestration layer that abstracts away the complexity of GPU clusters, ensuring high utilization and seamless job recovery
  • Design and implement scheduling primitives to optimize the lifecycle of training jobs
  • Develop deep observability and automated health-checking into the training stack to proactively identify and isolate hardware failures
  • Evaluate and integrate emerging technologies in the CNCF and AI ecosystem (e.g. Ray, Kueue), making data-driven build vs. buy decisions that balance velocity with long-term maintainability
  • Work closely with Finance and Procurement teams to drive our capacity planning process
  • Participate in our team's on-call process to ensure the availability of our services
  • Own projects end-to-end, from requirements, scoping, design, to implementation, in a highly collaborative and cross-functional environment
What we offer
  • Comprehensive health, dental and vision coverage
  • retirement benefits
  • a learning and development stipend
  • generous PTO
  • commuter stipend (may be eligible)
Employment type: Full-time

Member of Technical Staff - Distributed Training Engineer

Our Training Infrastructure team is building the distributed systems that power ...
Location: United States, San Francisco; Boston
Salary: Not provided
Company: Liquid AI
Expiration Date: Until further notice
Requirements
  • Hands-on experience building distributed training infrastructure (PyTorch Distributed DDP/FSDP, DeepSpeed ZeRO, Megatron-LM TP/PP)
  • Experience diagnosing performance bottlenecks and failure modes (profiling, NCCL/collectives issues, hangs, OOMs, stragglers)
  • Understanding of hardware accelerators and networking topologies
  • Experience optimizing data pipelines for ML workloads
Job Responsibility
Job Responsibility
  • Design and build core systems that make large training runs fast and reliable
  • Build scalable distributed training infrastructure for GPU clusters
  • Implement and tune parallelism/sharding strategies for evolving architectures
  • Optimize distributed efficiency (topology-aware collectives, comm/compute overlap, straggler mitigation)
  • Build data loading systems that eliminate I/O bottlenecks for multimodal datasets
  • Develop checkpointing mechanisms balancing memory constraints with recovery needs
  • Create monitoring, profiling, and debugging tools for training stability and performance
What we offer
  • Competitive base salary with equity in a unicorn-stage company
  • We pay 100% of medical, dental, and vision premiums for employees and dependents
  • 401(k) matching up to 4% of base pay
  • Unlimited PTO plus company-wide Refill Days throughout the year
Employment type: Full-time

Staff Software Engineer, GPU Infrastructure (HPC)

The internal infrastructure team is responsible for building world-class infrast...
Location: not listed
Salary: Not provided
Company: Cohere
Expiration Date: Until further notice
Requirements
  • Deep expertise in ML/HPC infrastructure: Experience with GPU/TPU clusters, distributed training frameworks (JAX, PyTorch, TensorFlow), and high-performance computing (HPC) environments
  • Kubernetes at scale: Proven ability to deploy, manage, and troubleshoot cloud-native Kubernetes clusters for AI workloads
  • Strong programming skills: Proficiency in Python (for ML tooling) and Go (for systems engineering), with a preference for open-source contributions over reinventing solutions
  • Low-level systems knowledge: Familiarity with Linux internals, RDMA networking, and performance optimization for ML workloads
  • Research collaboration experience: A track record of working closely with AI researchers or ML engineers to solve infrastructure challenges
  • Self-directed problem-solving: The ability to identify bottlenecks, propose solutions, and drive impact in a fast-paced environment
Job Responsibility
  • Build and scale ML-optimized HPC infrastructure: Deploy and manage Kubernetes-based GPU/TPU superclusters across multiple clouds, ensuring high throughput and low-latency performance for AI workloads
  • Optimize for AI/ML training: Collaborate with cloud providers to fine-tune infrastructure for cost efficiency, reliability, and performance, leveraging technologies like RDMA, NCCL, and high-speed interconnects
  • Troubleshoot and resolve complex issues: Proactively identify and resolve infrastructure bottlenecks, performance degradation, and system failures to ensure minimal disruption to AI/ML workflows
  • Enable researchers with self-service tools: Design intuitive interfaces and workflows that allow researchers to monitor, debug, and optimize their training jobs independently
  • Drive innovation in ML infrastructure: Work closely with AI researchers to understand emerging needs (e.g., JAX, PyTorch, distributed training) and translate them into robust, scalable infrastructure solutions
  • Champion best practices: Advocate for observability, automation, and infrastructure-as-code (IaC) across the organization, ensuring systems are maintainable and resilient
  • Mentorship and collaboration: Share expertise through code reviews, documentation, and cross-team collaboration, fostering a culture of knowledge transfer and engineering excellence
What we offer
  • An open and inclusive culture and work environment
  • Work closely with a team on the cutting edge of AI research
  • Weekly lunch stipend, in-office lunches & snacks
  • Full health and dental benefits, including a separate budget to take care of your mental health
  • 100% Parental Leave top-up for up to 6 months
  • Personal enrichment benefits towards arts and culture, fitness and well-being, quality time, and workspace improvement
  • Remote-flexible, offices in Toronto, New York, San Francisco, London and Paris, as well as a co-working stipend
  • 6 weeks of vacation (30 working days!)
Employment type: Full-time

Research Scientist / Engineer – Training Infrastructure

Luma’s mission is to build multimodal AI to expand human imagination and capabil...
Location: United States, Palo Alto
Salary: 187500.00 - 395000.00 USD / Year
Company: Luma AI
Expiration Date: Until further notice
Requirements
  • Extensive experience with distributed PyTorch training and parallelisms in foundation model training
  • Deep understanding of GPU clusters, networking, and storage systems
  • Familiarity with communication libraries (NCCL, MPI) and distributed system optimization
Job Responsibility
  • Design, implement, and optimize efficient distributed training systems for models trained on thousands of GPUs
  • Research and implement advanced parallelization techniques (FSDP, Tensor Parallel, Pipeline Parallel, Expert Parallel)
  • Build monitoring, visualization, and debugging tools for large-scale training runs
  • Optimize training stability, convergence, and resource utilization across massive clusters
Employment type: Full-time

Staff Machine Learning Engineer - ML Training Infrastructure

The Role: We are seeking an experienced, technically strong, impact-driven ex...
Location: United States, Austin; Mountain View
Salary: 185000.00 - 335300.00 USD / Year
Company: General Motors
Expiration Date: Until further notice
Requirements
  • Bachelor's degree or higher in Computer Science or a related field, or equivalent practical experience
  • 8+ years of professional software engineering experience
  • 5+ years of specialized experience in AI/ML infrastructure, such as enabling distributed training for large-scale ML models
  • Strong programming skills in Python, with deep proficiency in frameworks such as PyTorch (preferred), TensorFlow, or similar ML systems
  • Proven experience designing and operating distributed systems for ML training, including distributed computing, GPU computing, and cloud environments (AWS, GCP, Azure)
  • Demonstrated track record of leading technically ambiguous, cross-team infrastructure initiatives and driving them to measurable impact
  • Strong architectural judgment and ability to make sound technical tradeoffs across performance, reliability, usability, and cost
  • Willingness to travel to Sunnyvale, CA as needed
  • Comfortable operating in highly ambiguous and dynamic environments
Job Responsibility
  • Define and drive the architecture, design, and development of scalable, reliable, and high-performance ML frameworks and platform capabilities to support model training at scale
  • Lead model training performance analysis and optimization efforts across distributed training workflows, improving scalability, efficiency, and cost across heterogeneous hardware environments
  • Raise the bar on system observability, debuggability, operational excellence, and developer experience across the ML training stack
  • Own large, ambiguous, cross-functional technical initiatives from strategy through execution, including technical roadmap definition, tradeoff analysis, and delivery
  • Influence platform direction by identifying long-term infrastructure investments, setting engineering standards, and driving adoption of best practices across teams
  • Collaborate across organizational boundaries to align requirements, resolve technical disagreements, and integrate new capabilities into the platform ecosystem
  • Mentor engineers through design reviews, technical guidance, and hands-on partnership, while elevating engineering quality across the team
What we offer
  • medical
  • dental
  • vision
  • Health Savings Account
  • Flexible Spending Accounts
  • retirement savings plan
  • sickness and accident benefits
  • life insurance
  • paid vacation & holidays
  • tuition assistance programs
Employment type: Full-time