This list contains only the countries for which job offers have been published in the selected language (e.g., in the French version, only job offers written in French are displayed, and in the English version, only those in English).
Robust, reliable, and scalable distributed systems form the backbone of Krea. These systems support the infrastructure that powers our AI research, real-time user experiences, and large-scale model deployments.
Job Responsibility:
Design, build, and maintain large-scale distributed infrastructure to reliably support AI research and real-time model serving
Own and scale our multi-thousand-node Kubernetes GPU clusters, ensuring efficient and fault-tolerant operations
Collaborate closely with ML engineers and researchers to architect systems that enable rapid experimentation and deployment
Improve network architecture, optimize load balancing, and streamline operational practices across multi-zone cloud deployments
Requirements:
Kubernetes at scale (thousands of nodes)
Cloud infrastructure management (AWS/GCP/Azure)
High-performance and fault-tolerant networking
Low-level Linux interfaces and administration
Debugging complex distributed systems in production
Python, Golang, Ruby, Rust, and similar systems languages