We are building the next generation of large-scale AI systems that power training and inference workloads at unprecedented scale and efficiency. You will design and develop high-performance distributed software that orchestrates massive compute and data pipelines across heterogeneous clusters. Your work will push the limits of concurrency, throughput, and scalability—enabling efficient execution of models at massive scale. This role sits at the intersection of systems engineering and machine learning performance, demanding both architectural depth and low-level implementation skills. You will help shape how models are executed and optimized end-to-end, from data ingestion to distributed execution, across cutting-edge hardware platforms. We’re hiring for runtime roles across both Training and Inference.
Job Responsibilities:
Design and implement distributed runtime components to efficiently manage large-scale execution workloads
Develop and optimize high-performance data and communication pipelines that fully utilize CPU, memory, storage, and network resources
Enable scalable execution across multiple compute nodes, ensuring high concurrency and minimal bottlenecks
Collaborate closely with ML and compiler teams to integrate new model architectures, training regimes, and hardware-specific optimizations
Diagnose and resolve complex performance issues across the software stack using profiling and instrumentation tools
Contribute to overall system design, architecture reviews, and roadmap planning for large-scale AI workloads
Requirements:
3+ years of experience developing high-performance or distributed system software
Strong programming skills in C/C++, with expertise in multi-threading, memory management, and performance optimization
Experience with distributed systems, networking, or inter-process communication
Solid understanding of data structures, concurrency, and system-level resource management (CPU, I/O, and memory)
Proven ability to debug, profile, and optimize code across scales—from threads to clusters
Bachelor’s, Master’s, or equivalent experience in Computer Science, Electrical Engineering, or a related field
Nice to have:
Familiarity with machine learning training or inference pipelines, especially distributed training and large-model scaling
Exposure to Python and PyTorch, particularly in the context of model training or performance tuning
Experience with compiler internals, custom hardware interfaces, or low-level protocol design
Prior work on high-performance clusters, HPC systems, or custom hardware/software co-design
Deep curiosity about how to unlock new levels of performance for large-scale AI workloads
What we offer:
Build a breakthrough AI platform beyond the constraints of the GPU
Publish and open-source your cutting-edge AI research
Work on one of the fastest AI supercomputers in the world
Enjoy job stability with startup vitality
A simple, non-corporate work culture that respects individual beliefs