GPU Kernel Performance Engineer Job at AMD (Beijing)

Member of Technical Staff - GPU Performance Engineer

Our models and workflows require performance work that generic frameworks don’t ...

Location

United States, San Francisco; Boston

Salary:

Not provided

Liquid AI

Expiration Date

Until further notice

Requirements

Authored custom CUDA kernels (not only calling cuDNN/cuBLAS)
Strong understanding of GPU architecture and performance: memory hierarchy, warps, shared memory/register pressure, bandwidth vs compute limits
Proficiency with low-level profiling (Nsight Systems/Compute) and performance methodology
Strong C/C++ skills

Job Responsibility

Write high-performance GPU kernels for our novel model architectures
Integrate kernels into PyTorch pipelines (custom ops, extensions, dispatch, benchmarking)
Profile and optimize training and inference workflows to eliminate bottlenecks
Build correctness tests and numerics checks
Build/maintain performance benchmarks and guardrails to prevent regressions
Collaborate closely with researchers to turn promising ideas into shipped speedups

What we offer

Competitive base salary with equity in a unicorn-stage company
We pay 100% of medical, dental, and vision premiums for employees and dependents
401(k) matching up to 4% of base pay
Unlimited PTO plus company-wide Refill Days throughout the year

Fulltime

Founding GPU Kernel Engineer

We're looking for a Founding GPU Kernel Engineer who lives right at the boundary...

Location

United States, San Francisco

Salary:

285000.00 - 315000.00 USD / Year

YC Work at a Startup

Expiration Date

Until further notice

Requirements

Deep expertise in GPU architecture
Proven track record of hand-writing kernels that match or beat vendor libraries (cuBLAS, cuDNN, CUTLASS)
Strong skills with low-level profiling tools: Nsight Compute, Nsight Systems, rocprof, or equivalents
Experience reading and reasoning about PTX/SASS or GPU assembly
Solid systems programming in C++ and CUDA (or ROCm/HIP)
Good understanding of how high-level ML operations map to hardware execution
Experience with distributed training systems: collective ops like all-reduce and all-gather, NCCL/RCCL, multi-node communication patterns

Job Responsibility

Write and hand-optimize GPU kernels for ML workloads (matmuls, attention, normalization, etc.) to set the performance ceilings
Profile at the microarchitectural level: look into SM utilization, warp stalls, memory bank conflicts, register pressure, instruction throughput
Debug performance issues by digging deep into things like clock speeds, thermal throttling, driver behavior, hardware errata
Turn your hand-optimization insights into automated compiler passes (working closely with our compiler team)
Develop performance models that predict how kernels will behave across different GPU architectures
Build tools and methods for systematic kernel optimization
Work with NVIDIA, AMD, and emerging AI accelerators - understand the common parts and what's vendor-specific

What we offer

Bonus
Equity
Benefits
Relocation assistance

Fulltime

GPU Kernel Development Engineer

As a core member of the team, you will play a pivotal role in optimizing and dev...

Location

China, Shanghai

Salary:

Not provided

AMD

Expiration Date

Until further notice

Requirements

Master's and/or PhD degree in Computer Science, Computer Engineering, Electrical Engineering, or a related field
5+ years of professional experience in technical software development, with a focus on GPU optimization, performance engineering, and framework development
Strong technical and analytical expertise in C++ development within Linux environments
Expert skills in Python and C++
Strong experience in designing and optimizing GPU kernels for deep learning on AMD GPUs using HIP, CUDA, and assembly (ASM)
Strong knowledge of AMD architectures (GCN, RDNA)
Sound understanding of compiler theory and tools like LLVM and ROCm

Job Responsibility

Optimize Deep Learning Frameworks: Enhance and optimize frameworks like TensorFlow and PyTorch for AMD GPUs in open-source repositories
Develop GPU Kernels: Create and optimize GPU kernels to maximize performance for specific AI operations
Develop & Optimize Models: Design and optimize deep learning models specifically for AMD GPU performance
Collaborate with GPU Library Teams: Work closely with internal teams to analyze and improve training and inference performance on AMD GPUs
Collaborate with Open-Source Maintainers: Engage with framework maintainers to ensure code changes are aligned with requirements and integrated upstream
Work in Distributed Computing Environments: Optimize deep learning performance on both scale-up (multi-GPU) and scale-out (multi-node) systems
Utilize Cutting-Edge Compiler Tech: Leverage advanced compiler technologies to improve deep learning performance
Optimize Deep Learning Pipeline: Enhance the full pipeline, including integrating graph compilers
Software Engineering Best Practices: Apply sound engineering principles to ensure robust, maintainable solutions
Mentor and Guide: Provide mentorship to junior team members, fostering growth and collaboration through code reviews, knowledge sharing, and technical guidance

What we offer

Benefits offered are described: AMD benefits at a glance

Fulltime

Senior GPU Software Performance Engineer — Post‑Training

Drive the performance of post‑training workloads on AMD Instinct™ GPUs. You’ll w...

Location

United States, San Jose

Salary:

204000.00 - 306000.00 USD / Year

AMD

Expiration Date

Until further notice

Requirements

Proven GPU performance engineering for deep learning (ROCm/HIP, Triton, or similar)
Hands-on with SFT, LoRA, and RL-based training at scale
Strong PyTorch experience (torch.distributed, FSDP/ZeRO or equivalent)
Proficient in Python and C++; comfortable reading/writing kernels when needed
Experience with distributed systems and collective communication libraries
Track record of turning profiles into fixes, upstreaming changes, and documenting results

Job Responsibility

Lead performance for finetuning and RL training solutions on AMD GPUs
Improve throughput, memory efficiency, and stability across data, model, and optimizer steps
Optimize multi-GPU/multi-node training and communication patterns
Contribute efficient kernels/ops and targeted graph-level optimizations
Profile, diagnose, and resolve bottlenecks using standard tooling; prevent regressions in CI
Ship reproducible pipelines and documentation adopted by internal teams and external developers
Collaborate with framework, compiler, and model teams to land durable improvements

Fulltime

GPU Performance Attainment Engineer

As a senior member of the pre-silicon performance attainment team, you will be a...

Location

India, Hyderabad

Salary:

Not provided

AMD

Expiration Date

Until further notice

Requirements

Several years of experience in GPU pre-silicon performance analysis and debug
Proficiency with performance modeling and simulation tools
Strong understanding of GPGPU programming APIs and Machine Learning workloads
Expertise in C/C++ and scripting (Python, Perl, shell, etc.)
Experience with hardware description languages such as Verilog is a plus
Familiarity with the software stack is a plus, preferably related to GPUs—such as applications, drivers, compilers, and firmware
Bachelor's or higher degree in Computer Science, Electrical Engineering, or a closely related field

Job Responsibility

Debug performance issues and analyze data from the full-chip Emulation Platform, RTL Simulator, and Architecture and Roofline Models
Analyze model projection results and identify algorithm issues to find novel ways of improving projection accuracy across product families and over multiple generations
Get performance projections for kernels using an analytical model
Identify technical problems, break them down, summarize multiple possible solutions, and help the team to make progress
Automate processes related to performance infrastructure and data collection tasks, to enhance productivity and refine processes for improved efficiency
Engage with the workloads team to acquire and align on required workloads, run the selected workload traces on the performance simulator, and analyze the performance results and metrics to root-cause any anomalies
Collaborate with simulator team to bridge gaps between the performance numbers and the performance targets
Influence design trade-offs and optimizations by working closely with compiler, driver, library, and hardware engineers to achieve the highest performance for selected workloads
Innovate new algorithmic improvements that exploit the strengths of the hardware architecture to deliver the best possible machine learning performance

Power and Performance Engineer

AMD's Computing and Graphics business unit is seeking a technical leader to driv...

Location

India, Bangalore

Salary:

Not provided

AMD

Expiration Date

Until further notice

Requirements

PhD or MS in Electrical Engineering, Computer Engineering, Computer Science, Physics, or a related technical field
Prefer 8+ years of industry experience
Experience profiling software workloads on CPUs and GPUs
Knowledge of defining low-level software interfaces to influence hardware performance and power behavior
Background in SOC hardware design, clock distribution, power delivery and performance
Experience profiling and tuning hardware/software stacks to achieve power and performance efficiency on benchmark workloads
Prior algorithm development in C or Python; MATLAB experience preferred
Expertise in algorithm development for predictive systems using regression and classification frameworks for complex datasets
Experience with system and SoC control firmware, including RTOS and bare-metal development

Job Responsibility

Develop and architect new power algorithms and software features using state-of-the-art techniques, including classifiers, regression models, and machine learning approaches, for CPU/GPU power and performance optimization
Define and implement software interfaces that influence and optimize hardware behavior across the SoC
Prototype and develop new firmware for complex power-control algorithms throughout the SoC roadmap
Profile emerging NN and LLM (agentic) workloads on client platforms; tune algorithms to optimize power during active, high-load, and interactive usage scenarios
Analyze power and performance benchmarks to identify opportunities for efficiency improvements
Architect, implement, and optimize control software and firmware features to achieve product-level power targets
Drive cross-organization alignment for innovative power-reduction methodologies and ensure convergence across design teams
Collaborate with global architecture and design teams to implement and validate SoC power-management solutions
Optimize real PC use cases for both absolute power and performance-per-watt

Fulltime

AI Product Performance Engineer

WHAT YOU DO AT AMD CHANGES EVERYTHING. At AMD, our mission is to build great pro...

Location

China, Shenzhen

Salary:

Not provided

AMD

Expiration Date

Until further notice

Requirements

Deep knowledge of data center AI workloads such as LLMs, generative AI, recommendation, NLP, video analytics, and/or transformers
Hands-on experience with various AI models, end-to-end pipelines, industry frameworks/SDKs, and solutions
GPU Architecture Mastery
Kernel Programming Expertise: Strong proficiency in C++ and parallel computing, with extensive hands-on experience in NVIDIA CUDA or AMD HIP kernel programming
Performance Engineering: Demonstrated ability to debug and profile complex GPU workloads
Systems Knowledge: Familiarity with asynchronous execution, stream management, and host-device memory transfers
Python DSLs & Triton: Experience implementing kernels using OpenAI Triton or other Python-based DSLs
Inference Engine Experience: Hands-on experience integrating custom kernels into large-scale inference frameworks such as vLLM, SGLang, or TensorRT-LLM
Deep Learning Frameworks: Familiarity with writing custom extensions or operators for PyTorch (C++/CUDA extensions)
Hardware Agnosticism: Experience porting kernels between NVIDIA and AMD architectures or working with cross-platform HPC libraries

Job Responsibility

High-Performance Kernel Development: Design, implement, and optimize high-performance GPU kernels for AI/ML workloads to maximize hardware utilization
Performance Optimization: Analyze and optimize kernel execution for latency and throughput, addressing bottlenecks in memory bandwidth, instruction latency, and thread divergence
Workload Analysis: Evaluate the end-to-end performance impact of individual kernels on full-stack AI models, ensuring that micro-optimizations translate to application-level speedups
Profiling & Tuning: Utilize advanced GPU profiling tools (e.g., ROCm Profiler, PyTorch Profiler) to identify performance cliffs, pipeline stalls, and memory hierarchy inefficiencies
Architecture Adaptation: Tailor implementation strategies to leverage specific features of modern GPU architectures (e.g., Matrix Cores, HBM characteristics)
Framework Integration: Collaborate with software stack teams to expose optimized kernels within high-level frameworks and inference engines

What we offer

AMD benefits at a glance

Fulltime

GPU Performance Architect

WHAT YOU DO AT AMD CHANGES EVERYTHING At AMD, our mission is to build great pro...

Location

United States, Folsom

Salary:

102400.00 - 153600.00 USD / Year

AMD

Expiration Date

Until further notice

Requirements

Experience spanning architecture, performance analysis and AI/ML/graphics/compute algorithms
Experience in kernel development for LLMs using PyTorch, Triton, CUDA, HIP, etc.
Excellent C/C++/Scripting (Python, etc.) experience
Knowledge of Graphics/Compute APIs (DirectX/Vulkan/CUDA/HIP etc.)
Knowledge of GPU architecture and Compilers
Experience in GFX profiling tools is a plus (PIX, RenderDoc, AMD tools, etc.)
Undergraduate degree required. A Bachelor of Science, Master's, or PhD degree with an emphasis in Electrical Engineering, Computer Architecture, or Computer Science is preferred

Job Responsibility

Work on workload/competitive analysis of contemporary and futuristic game/AI/ML applications
Identify complex technical problems, break them down, summarize multiple possible solutions, and help the team make progress
Work with architects to understand bottlenecks in graphics cores and SoCs
Analyze existing and emerging graphics/compute paradigms and algorithms
Implement and run simulation models (both RTL and high-level simulation) to estimate potential gains
Propose innovative solutions for PPA improvements
Drive initiatives to improve RTG analysis tools
Collaborate with engineers and managers on multiple sites

What we offer

Benefits offered are described: AMD benefits at a glance

Fulltime
