We are looking for a systems-minded engineer who works at the intersection of large-scale model inference, distributed systems, and performance optimization. This role focuses on post-training and inference infrastructure, with particular emphasis on prefill/decode (P/D) disaggregation, KV cache lifecycle management, and efficient offloading mechanisms across both inference and reinforcement learning (RL) systems.
Job Responsibilities
Research and deeply understand modern LLM inference frameworks
Analyze and compare inference execution paths to identify performance bottlenecks and inefficiencies
Develop and implement infrastructure-level features to improve inference latency, throughput, and memory efficiency
Optimize KV cache management and offloading strategies
Enhance scalability across multi-GPU and multi-node deployments
Apply the same research-driven approach to RL frameworks
Study post-training and RL systems
Debug performance and correctness issues in distributed RL pipelines
Optimize inference, rollout efficiency, and memory usage during training
Collaborate with research and applied ML teams
Translate model-level requirements into infrastructure capabilities
Validate performance gains with benchmarks and real workloads
Document findings, architectural insights, and best practices to guide future system design
Requirements
Strong background in systems engineering, distributed systems, or ML infrastructure
Hands-on experience with GPU-accelerated workloads and memory-constrained systems
Solid understanding of LLM inference workflows (prefill vs. decode), attention mechanisms and KV cache behavior, and multi-process / multi-GPU execution models
Proficiency in Python and C++ (or similar systems languages)
Experience debugging performance issues using profiling tools (GPU, CPU, memory)
Ability to read, understand, and modify complex open-source codebases
Strong analytical skills and comfort working in research-heavy, ambiguous problem spaces
Bachelor's or master's degree in computer science, computer engineering, electrical engineering, or equivalent
Nice to have
Direct experience with LLM inference frameworks or serving stacks
Familiarity with GPU memory hierarchies (HBM, pinned memory, NUMA considerations) and with KV cache compression, paging, or eviction strategies