This list contains only the countries for which job offers have been published in the selected language (e.g., in the French version, only job offers written in French are displayed, and in the English version, only those in English).
Our Training Infrastructure team is building the distributed systems that power our next-generation Liquid Foundation Models. As we scale, we need to design, implement, and optimize the infrastructure that enables large-scale training. This is a high-ownership training systems role focused on runtime/performance/reliability (not a general platform/SRE role). You’ll work on a small team with fast feedback loops, building critical systems from the ground up rather than inheriting mature infrastructure.
Job Responsibility:
Design and build core systems that make large training runs fast and reliable
Build scalable distributed training infrastructure for GPU clusters
Implement and tune parallelism/sharding strategies for evolving architectures