The company has almost 100 million customers based in Japan and 1 billion globally, and provides more than 70 services across areas such as e-commerce, payment services, financial services, telecommunications, media, and sports.
Job Responsibilities:
Optimize LLM training frameworks (e.g., PyTorch, DeepSpeed, Megatron-LM, FSDP) to maximize GPU utilization and reduce training time
Profile and optimize distributed training bottlenecks (e.g., NCCL issues, CUDA kernel efficiency, communication overhead)
Implement and tune inference optimizations (e.g., quantization, dynamic batching, KV caching) for low-latency, high-throughput LLM serving (vLLM, TensorRT-LLM, Triton, SGLang)
Collaborate with infrastructure teams to improve GPU cluster scheduling, resource allocation, and fault tolerance for large-scale training jobs
Develop benchmarking tools to measure and improve training throughput, memory efficiency, and inference latency (a minimal sketch of such a measurement appears after this list)
Research and apply cutting-edge techniques (e.g., mixture-of-experts, speculative decoding) to optimize LLM performance
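As an illustration of the benchmarking responsibility above, the sketch below times repeated forward/backward steps on a toy model and reports throughput and peak GPU memory. It is only a minimal example under assumed conditions (a single CUDA device, a small MLP, synthetic data); the function name benchmark_training and all sizes are hypothetical and not taken from the team's actual tooling.

```python
# Minimal benchmarking sketch: per-step training throughput (samples/sec)
# and peak GPU memory for a toy model. Model, batch shape, and step counts
# are illustrative assumptions, not details of the actual role or codebase.
import time
import torch
import torch.nn as nn


def benchmark_training(model, inputs, targets, optimizer, warmup=5, iters=20):
    """Run warmup + timed forward/backward steps; return (samples/sec, peak GiB)."""
    device = inputs.device
    loss_fn = nn.CrossEntropyLoss()
    torch.cuda.reset_peak_memory_stats(device)

    for step in range(warmup + iters):
        if step == warmup:                      # start timing only after warmup
            torch.cuda.synchronize(device)
            start = time.perf_counter()
        optimizer.zero_grad(set_to_none=True)
        loss = loss_fn(model(inputs), targets)
        loss.backward()
        optimizer.step()

    torch.cuda.synchronize(device)              # wait for all queued kernels to finish
    elapsed = time.perf_counter() - start
    samples_per_sec = inputs.size(0) * iters / elapsed
    peak_gib = torch.cuda.max_memory_allocated(device) / 2**30
    return samples_per_sec, peak_gib


if __name__ == "__main__" and torch.cuda.is_available():
    model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 10)).cuda()
    inputs = torch.randn(64, 1024, device="cuda")
    targets = torch.randint(0, 10, (64,), device="cuda")
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    throughput, peak = benchmark_training(model, inputs, targets, optimizer)
    print(f"{throughput:.1f} samples/sec, peak memory {peak:.2f} GiB")
```

In practice the same timing pattern extends to distributed runs (wrapping the model in FSDP or DeepSpeed and aggregating per-rank throughput), but that is beyond this short sketch.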
Requirements:
3+ years of hands-on experience in GPU-accelerated ML training & inference optimization, preferably for LLMs or large-scale deep learning models
Deep expertise in PyTorch, DeepSpeed, FSDP, or Megatron-LM, with experience in distributed training optimizations