GPU Kernel Performance Engineer

China, Beijing · Job Posted March 19, 2026

Job Description

AMD is looking for an influential software engineer who is passionate about improving the performance of key applications and benchmarks. You will be a member of a core team of incredibly talented industry specialists and will work with the very latest hardware and software technology. You will deploy models on AMD Ryzen AI series devices to deliver high-performance, highly reliable deployment solutions; design high-performance operators for GPUs and NPUs; and design and develop inference frameworks and inference compilers. Strong cross-team collaboration experience is expected.

Job Responsibility

  • Design and deliver high‑performance computing solutions, providing competitive architectures and implementations for customers
  • Develop high‑performance operators across GPU/NPU platforms, including GEMM, MHA, and CONV
  • Build and optimize inference frameworks and inference compilers
  • Conduct performance evaluation and benchmarking of models and operators
  • Track and study cutting‑edge research papers, reproduce key methodologies, and integrate them into production solutions
  • Document technical work, summarize team achievements, and contribute to patents and publications
  • Build and maintain strong technical relationships with internal teams, industry peers, and ecosystem partners

Requirements

  • Strong expertise in GPU, NPU, and FPGA architectures, with a deep understanding of accelerator micro‑architecture and computation pipelines
  • Solid knowledge of AI inference, including operator/kernel development, AI compilers, and inference frameworks such as PyTorch and ONNX Runtime
  • Extensive experience in GPU kernel development, with strong proficiency in CUDA and/or HIP programming models
  • Strong object‑oriented programming background; proficiency in C/C++ is highly preferred
  • Proven ability to write high‑quality, efficient, and maintainable code, with strong attention to detail and robustness
  • Excellent communication skills and strong analytical/problem‑solving capabilities
  • Doctoral or Master's degree in Computer Science, Computer Engineering, Electrical Engineering, or an equivalent field

Similar Jobs for GPU Kernel Performance Engineer

8 matching positions

Member of Technical Staff - GPU Performance Engineer

Our models and workflows require performance work that generic frameworks don’t ...
Location
United States, San Francisco; Boston
Salary
Not provided
Liquid AI
Expiration Date
Until further notice
Requirements
  • Authored custom CUDA kernels (not only calling cuDNN/cuBLAS)
  • Strong understanding of GPU architecture and performance: memory hierarchy, warps, shared memory/register pressure, bandwidth vs compute limits
  • Proficiency with low-level profiling (Nsight Systems/Compute) and performance methodology
  • Strong C/C++ skills
Job Responsibility
  • Write high-performance GPU kernels for our novel model architectures
  • Integrate kernels into PyTorch pipelines (custom ops, extensions, dispatch, benchmarking)
  • Profile and optimize training and inference workflows to eliminate bottlenecks
  • Build correctness tests and numerics checks
  • Build/maintain performance benchmarks and guardrails to prevent regressions
  • Collaborate closely with researchers to turn promising ideas into shipped speedups
What we offer
  • Competitive base salary with equity in a unicorn-stage company
  • We pay 100% of medical, dental, and vision premiums for employees and dependents
  • 401(k) matching up to 4% of base pay
  • Unlimited PTO plus company-wide Refill Days throughout the year
  • Full-time

Founding GPU Kernel Engineer

We're looking for a Founding GPU Kernel Engineer who lives right at the boundary...
Location
United States, San Francisco
Salary
285000.00 - 315000.00 USD / Year
YC Work at a Startup
Expiration Date
Until further notice
Requirements
  • Deep expertise in GPU architecture
  • Proven track record of hand-writing kernels that match or beat vendor libraries (cuBLAS, cuDNN, CUTLASS)
  • Strong skills with low-level profiling tools: Nsight Compute, Nsight Systems, rocprof, or equivalents
  • Experience reading and reasoning about PTX/SASS or GPU assembly
  • Solid systems programming in C++ and CUDA (or ROCm/HIP)
  • Good understanding of how high-level ML operations map to hardware execution
  • Experience with distributed training systems: collective ops like all-reduce and all-gather, NCCL/RCCL, multi-node communication patterns
Job Responsibility
  • Write and hand-optimize GPU kernels for ML workloads (matmuls, attention, normalization, etc.) to set the performance ceilings
  • Profile at the microarchitectural level: look into SM utilization, warp stalls, memory bank conflicts, register pressure, instruction throughput
  • Debug performance issues by digging deep into things like clock speeds, thermal throttling, driver behavior, hardware errata
  • Turn your hand-optimization insights into automated compiler passes (working closely with our compiler team)
  • Develop performance models that predict how kernels will behave across different GPU architectures
  • Build tools and methods for systematic kernel optimization
  • Work with NVIDIA, AMD, and emerging AI accelerators - understand the common parts and what's vendor-specific
What we offer
  • bonus
  • equity
  • benefits
  • relocation assistance
  • Full-time

GPU Kernel Development Engineer

As a core member of the team, you will play a pivotal role in optimizing and dev...
Location
China, Shanghai
Salary
Not provided
AMD
Expiration Date
Until further notice
Requirements
  • Master's and/or PhD degree in Computer Science, Computer Engineering, Electrical Engineering, or a related field
  • 5+ years of professional experience in technical software development, with a focus on GPU optimization, performance engineering, and framework development
  • Strong technical and analytical expertise in C++ development within Linux environments
  • Expert skills in Python and C++
  • Strong experience in designing and optimizing GPU kernels for deep learning on AMD GPUs using HIP, CUDA, and assembly (ASM)
  • Strong knowledge of AMD architectures (GCN, RDNA)
  • Sound understanding of compiler theory and tools like LLVM and ROCm
Job Responsibility
  • Optimize Deep Learning Frameworks: Enhance and optimize frameworks like TensorFlow and PyTorch for AMD GPUs in open-source repositories
  • Develop GPU Kernels: Create and optimize GPU kernels to maximize performance for specific AI operations
  • Develop & Optimize Models: Design and optimize deep learning models specifically for AMD GPU performance
  • Collaborate with GPU Library Teams: Work closely with internal teams to analyze and improve training and inference performance on AMD GPUs
  • Collaborate with Open-Source Maintainers: Engage with framework maintainers to ensure code changes are aligned with requirements and integrated upstream
  • Work in Distributed Computing Environments: Optimize deep learning performance on both scale-up (multi-GPU) and scale-out (multi-node) systems
  • Utilize Cutting-Edge Compiler Tech: Leverage advanced compiler technologies to improve deep learning performance
  • Optimize Deep Learning Pipeline: Enhance the full pipeline, including integrating graph compilers
  • Software Engineering Best Practices: Apply sound engineering principles to ensure robust, maintainable solutions
  • Mentor and Guide: Provide mentorship to junior team members, fostering growth and collaboration through code reviews, knowledge sharing, and technical guidance
What we offer
  • Benefits are described in "AMD benefits at a glance"
  • Full-time

Senior GPU Software Performance Engineer — Post‑Training

Drive the performance of post‑training workloads on AMD Instinct™ GPUs. You’ll w...
Location
United States, San Jose
Salary
204000.00 - 306000.00 USD / Year
AMD
Expiration Date
Until further notice
Requirements
  • Proven GPU performance engineering for deep learning (ROCm/HIP, Triton, or similar)
  • Hands-on experience with SFT, LoRA, and RL-based training at scale
  • Strong PyTorch experience (torch.distributed, FSDP/ZeRO or equivalent)
  • Proficient in Python and C++; comfortable reading/writing kernels when needed
  • Experience with distributed systems and collective communication libraries
  • Track record of turning profiles into fixes, upstreaming changes, and documenting results
Job Responsibility
  • Lead performance for finetuning and RL training solutions on AMD GPUs
  • Improve throughput, memory efficiency, and stability across data, model, and optimizer steps
  • Optimize multi-GPU/multi-node training and communication patterns
  • Contribute efficient kernels/ops and targeted graph-level optimizations
  • Profile, diagnose, and resolve bottlenecks using standard tooling; prevent regressions in CI
  • Ship reproducible pipelines and documentation adopted by internal teams and external developers
  • Collaborate with framework, compiler, and model teams to land durable improvements
  • Full-time

GPU Performance Attainment Engineer

As a senior member of the pre-silicon performance attainment team, you will be a...
Location
India, Hyderabad
Salary
Not provided
AMD
Expiration Date
Until further notice
Requirements
  • Several years of experience in GPU pre-silicon performance analysis and debug
  • Proficiency with performance modeling and simulation tools
  • Strong understanding of GPGPU programming APIs and Machine Learning workloads
  • Expertise in C/C++ and scripting (Python, Perl, shell, etc.)
  • Experience with hardware description languages such as Verilog is a plus
  • Familiarity with the software stack is a plus, preferably related to GPUs—such as applications, drivers, compilers, and firmware
  • Bachelor's or higher degree in Computer Science, Electrical Engineering, or a closely related field
Job Responsibility
  • Debug performance issues and analyze data from the full-chip Emulation Platform, RTL Simulator, and Architecture and Roofline Models
  • Analyze model projection results and identify algorithm issues to find novel solutions for improving the accuracy of projection for different families of products, and over multiple generations
  • Get performance projections for kernels using an analytical model
  • Identify technical problems, break them down, summarize multiple possible solutions, and help the team to make progress
  • Automate processes related to performance infrastructure and data collection tasks, to enhance productivity and refine processes for improved efficiency
  • Engage with the workloads team to acquire and align on required workloads, run the selected workload traces on the performance simulator, analyze the performance results and metrics to root cause any anomalies
  • Collaborate with simulator team to bridge gaps between the performance numbers and the performance targets
  • Influence design trade-offs and optimizations by working closely with compiler, driver, library, and hardware engineers to achieve the highest performance for selected workloads
  • Innovate new algorithmic improvements that exploit the strengths of the hardware architecture to deliver the best possible machine learning performance

Power and Performance Engineer

AMD's Computing and Graphics business unit is seeking a technical leader to driv...
Location
India, Bangalore
Salary
Not provided
AMD
Expiration Date
Until further notice
Requirements
  • PhD or MS in Electrical Engineering, Computer Engineering, Computer Science, Physics, or a related technical field
  • Prefer 8+ years of industry experience
  • Experience profiling software workloads on CPUs and GPUs
  • Knowledge of defining low-level software interfaces to influence hardware performance and power behavior
  • Background in SOC hardware design, clock distribution, power delivery and performance
  • Experience profiling and tuning hardware/software stacks to achieve power and performance efficiency on benchmark workloads
  • Prior algorithm development in C or Python; MATLAB experience preferred
  • Expertise in algorithm development for predictive systems using regression and classification frameworks for complex datasets
  • Experience with system and SoC control firmware, including RTOS and bare-metal development
Job Responsibility
  • Develop and architect new power algorithms and software features using state-of-the-art techniques, including classifiers, regression models, and machine learning approaches, for CPU/GPU power and performance optimization
  • Define and implement software interfaces that influence and optimize hardware behavior across the SoC
  • Prototype and develop new firmware for complex power-control algorithms throughout the SoC roadmap
  • Profile emerging NN and LLM (Agentic) workloads on client platforms; tune algorithms to optimize power during active, high-load, and interactive usage scenarios
  • Analyze power and performance benchmarks to identify opportunities for efficiency improvements
  • Architect, implement, and optimize control software and firmware features to achieve product-level power targets
  • Drive cross-organization alignment for innovative power-reduction methodologies and ensure convergence across design teams
  • Collaborate with global architecture and design teams to implement and validate SoC power-management solutions
  • Optimize real PC use cases for both absolute power and performance-per-watt
  • Full-time

AI Product Performance Engineer

WHAT YOU DO AT AMD CHANGES EVERYTHING. At AMD, our mission is to build great pro...
Location
China, Shenzhen
Salary
Not provided
AMD
Expiration Date
Until further notice
Requirements
  • Deep knowledge of Data Center AI workloads such as LLM, Generative AI, Recommendation, NLP, Video Analytics, and/or transformers
  • Hands-on experience with various AI models, end-to-end pipelines, industry frameworks/SDKs, and solutions
  • GPU Architecture Mastery
  • Kernel Programming Expertise: Strong proficiency in C++ and parallel computing, with extensive hands-on experience in NVIDIA CUDA or AMD HIP kernel programming
  • Performance Engineering: Demonstrated ability to debug and profile complex GPU workloads
  • Systems Knowledge: Familiarity with asynchronous execution, stream management, and host-device memory transfers
  • Python DSLs & Triton: Experience implementing kernels using OpenAI Triton or other Python-based DSLs
  • Inference Engine Experience: Hands-on experience integrating custom kernels into large-scale inference frameworks such as vLLM, SGLang, or TensorRT-LLM
  • Deep Learning Frameworks: Familiarity with writing custom extensions or operators for PyTorch (C++/CUDA extensions)
  • Hardware Agnosticism: Experience porting kernels between NVIDIA and AMD architectures or working with cross-platform HPC libraries
Job Responsibility
  • High-Performance Kernel Development: Design, implement, and optimize high-performance GPU kernels for AI/ML workloads to maximize hardware utilization
  • Performance Optimization: Analyze and optimize kernel execution for latency and throughput, addressing bottlenecks in memory bandwidth, instruction latency, and thread divergence
  • Workload Analysis: Evaluate the end-to-end performance impact of individual kernels on full-stack AI models, ensuring that micro-optimizations translate to application-level speedups
  • Profiling & Tuning: Utilize advanced GPU profiling tools (e.g., ROCm Profiler, PyTorch Profiler) to identify performance cliffs, pipeline stalls, and memory hierarchy inefficiencies
  • Architecture Adaptation: Tailor implementation strategies to leverage specific features of modern GPU architectures (e.g., Matrix Cores, HBM characteristics)
  • Framework Integration: Collaborate with software stack teams to expose optimized kernels within high-level frameworks and inference engines
What we offer
  • AMD benefits at a glance
  • Full-time

GPU Performance Architect

WHAT YOU DO AT AMD CHANGES EVERYTHING. At AMD, our mission is to build great pro...
Location
United States, Folsom
Salary
102400.00 - 153600.00 USD / Year
AMD
Expiration Date
Until further notice
Requirements
  • Experience spanning architecture, performance analysis and AI/ML/graphics/compute algorithms
  • Experience in kernel development for LLMs using PyTorch, Triton, CUDA, HIP, etc.
  • Excellent C/C++/Scripting (Python, etc.) experience
  • Knowledge of Graphics/Compute APIs (DirectX/Vulkan/CUDA/HIP etc.)
  • Knowledge of GPU architecture and Compilers
  • Experience in GFX profiling tools is a plus (PIX, RenderDoc, AMD tools, etc.)
  • Undergraduate degree required; a Bachelor of Science, Master's, or PhD degree with emphasis in Electrical Engineering, Computer Architecture, or Computer Science is preferred
Job Responsibility
  • Work on workload/competitive analysis of contemporary and futuristic game/AI/ML applications
  • Identify complex technical problems, break them down, summarize multiple possible solutions, and help the team make progress
  • Work with architects to understand bottlenecks in graphics cores and SoCs
  • Analyze existing and emerging graphics/compute paradigms and algorithms
  • Implement and run simulation models - i.e., both RTL and high-level simulation to estimate potential gains
  • Propose innovative solutions for PPA improvements
  • Drive initiatives to improve RTG analysis tools
  • Collaborate with engineers and managers on multiple sites
What we offer
  • Benefits are described in "AMD benefits at a glance"
  • Full-time