We are looking for a dynamic, upbeat software engineer to join our growing team. Your work will focus on building robust, efficient software components that enable high-performance execution of large language models and multimodal models across multi-GPU systems. You’ll collaborate with internal GPU library teams and open-source maintainers to implement features that improve throughput, latency, and scalability. This role emphasizes full-stack development within AI inference systems, with a strong focus on model behavior and framework integration.
Job Responsibilities:
Deep Learning & LLM Framework Optimization for AMD GPUs
Model-Aware Implementation with LLMs and multimodal architectures
Performance-Conscious Coding in multi-GPU environments
Profiling with GPU and framework tools to evaluate the impact of changes
End-to-End Performance Engineering across multi-GPU and multi-node setups
Compiler & Pipeline Acceleration using graph compilers and compiler-driven techniques
Research & Advanced Techniques like speculative decoding and weight-only quantization
Cross-Team & Open-Source Collaboration with internal GPU library teams and open-source maintainers
Software Engineering Excellence for maintainable and production-quality performance optimizations
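Among the techniques named above, speculative decoding is perhaps the least self-explanatory. The toy sketch below illustrates the core idea only: a cheap draft model proposes several tokens ahead, and the expensive target model verifies them in one pass, accepting the agreed prefix plus one bonus token. The lookup-table "models" and all function names are illustrative stand-ins, not anything from this role's actual codebase; a real system runs two LLMs and verifies probabilistically.

```python
# Toy sketch of speculative decoding (illustrative only).
# The "models" are deterministic next-token lookup tables standing in
# for a small draft LLM and a large target LLM.

DRAFT_TABLE = {"the": "cat", "cat": "sat", "sat": "on", "on": "a"}
TARGET_TABLE = {"the": "cat", "cat": "sat", "sat": "on", "on": "the"}

def draft_propose(token, k):
    """Draft model greedily proposes up to k tokens ahead."""
    out = []
    for _ in range(k):
        token = DRAFT_TABLE.get(token)
        if token is None:
            break
        out.append(token)
    return out

def target_verify(token, proposal):
    """Target model accepts the longest prefix it agrees with,
    then emits its own next token (the 'bonus' token)."""
    accepted = []
    for tok in proposal:
        if tok != TARGET_TABLE.get(token):
            break
        accepted.append(tok)
        token = tok
    return accepted, TARGET_TABLE.get(token)

def speculative_decode(prompt_token, k=3, max_len=6):
    out = [prompt_token]
    while len(out) < max_len:
        accepted, bonus = target_verify(out[-1], draft_propose(out[-1], k))
        out.extend(accepted)
        if bonus is None:
            break
        out.append(bonus)
    return out[:max_len]

print(speculative_decode("the"))
```

The speedup in a real system comes from the target model scoring all proposed tokens in a single batched forward pass instead of one pass per token.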
Requirements:
Familiarity with Python
Familiarity with C++ or async programming
Understanding of LLM or multimodal model concepts
Knowledge of transformer architectures, attention mechanisms, vision-language alignment, and inference pipelines
Theoretical grounding in Transformer/Attention/MoE/KV Cache, and quantization (FP8/FP4)
Experience developing in a Linux environment
Experience with profiling and diagnosing compute, memory, and communication bottlenecks across multi-GPU and multi-node environments
Solid Python/C++ coding skills and sound debugging and testing practices
Experience with multimodal models (e.g., Qwen-VL, Qwen-Image-Edit, Wan) or diffusion-based generative models
Familiarity with techniques like quantization, PagedAttention, continuous batching, or speculative decoding
Exposure to GPU computing (ROCm, CUDA) or performance profiling tools (e.g., PyTorch Profiler)
Experience with distributed inference for large-scale models (e.g., Tensor Parallel, Pipeline Parallel)
Bachelor's in Computer Science, Computer Engineering, Electrical Engineering, or a related field
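Of the techniques listed in the requirements, weight-only quantization is the most compact to illustrate. The sketch below shows the basic idea under simple assumptions (symmetric, per-tensor int8 with a single float scale); all names are illustrative, and production systems typically use per-channel or per-group scales and formats like FP8/FP4.

```python
# Minimal sketch of symmetric per-tensor int8 weight-only quantization.
# Weights are stored as 8-bit integers plus one float scale; activations
# stay in floating point and weights are dequantized on the fly.

def quantize_weights(weights):
    """Quantize floats to int8: w_q = round(w / scale), scale = max|w| / 127."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize_weights(q, scale):
    """Recover approximate float weights: w ≈ w_q * scale."""
    return [wq * scale for wq in q]

weights = [0.5, -1.27, 0.003, 1.0]
q, scale = quantize_weights(weights)
recovered = dequantize_weights(q, scale)
# Reconstruction error is bounded by scale / 2 per weight.
assert all(abs(w - r) <= scale / 2 + 1e-9 for w, r in zip(weights, recovered))
```

The payoff is a roughly 4x reduction in weight memory versus FP32, which matters because LLM inference is usually bound by memory bandwidth rather than compute.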
Nice to have:
GPU Kernel Development & Optimization using HIP, CUDA, ASM, and tools like CK, CUTLASS, and Triton
Compiler & System-Level Optimization knowledge of LLVM, ROCm, and compiler-driven techniques
Software Engineering Excellence & Community Contribution