We are seeking an expert GPU Engineer 2 to join our AI Infrastructure team. In this role, you will architect and optimize the core inference engine that powers our large-scale AI models. You will be responsible for pushing the boundaries of hardware performance, reducing latency, and maximizing throughput for Generative AI and Deep Learning workloads. You will work at the intersection of Deep Learning algorithms and low-level hardware, designing custom operators and building a highly efficient training/inference execution engine from the ground up.
Job Responsibilities:
Custom Operator Development: Design and implement highly optimized GPU kernels (CUDA/Triton) for critical deep learning operations (e.g., FlashAttention, GEMM, LayerNorm) to outperform standard libraries
Inference Engine Architecture: Contribute to the development of our high-performance inference engine, focusing on graph optimizations, operator fusion, and dynamic memory management (e.g., KV Cache optimization)
Performance Optimization: Profile and analyze model performance in depth using tools such as Nsight Systems and Nsight Compute. Identify bottlenecks in memory bandwidth, instruction throughput, and kernel launch overhead
Model Acceleration: Implement advanced acceleration techniques such as Quantization (INT8, FP8, AWQ), Kernel Fusion, and continuous batching
Distributed Computing: Optimize communication primitives (NCCL) to enable efficient multi-GPU and multi-node inference (Tensor Parallelism, Pipeline Parallelism)
Hardware Adaptation: Ensure the software stack fully utilizes modern GPU architecture features (e.g., NVIDIA Hopper/Ampere Tensor Cores, Asynchronous Copy)
Requirements:
Bachelor's Degree in Computer Science or related technical field AND 2+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
Architectural Mastery: Expertise in the CUDA programming model and NVIDIA GPU architectures (specifically Ampere/Hopper)
Deep understanding of the memory hierarchy (Shared Memory, L2 cache, Registers), warp-level primitives, occupancy optimization, and bank conflict resolution
Familiarity with advanced hardware features: Tensor Cores, TMA (Tensor Memory Accelerator), and asynchronous copy
Proven ability to navigate and modify complex, large-scale codebases (e.g., PyTorch internals, Linux kernel)
Experience with build and binding ecosystems: CMake, pybind11, and CI/CD for GPU workloads
Performance Engineering: Mastery of NVIDIA Nsight Systems/Compute
Ability to mathematically reason about performance using the Roofline Model, memory bandwidth utilization, and compute throughput