As a Training Infrastructure Engineer, you'll design, build, and optimize the infrastructure that powers our large-scale model training operations. Your work will be essential to developing high-performance AI training infrastructure. You'll collaborate with AI researchers and engineers to create robust training pipelines, optimize distributed training workloads, and ensure reliable model development.
Job Responsibilities:
Design and implement scalable infrastructure for large-scale model training workloads
Develop and maintain distributed training pipelines for LLMs and multimodal models
Optimize training performance across multiple GPUs, nodes, and data centers
Implement monitoring, logging, and debugging tools for training operations
Architect and maintain data storage solutions for large-scale training datasets
Automate infrastructure provisioning, scaling, and orchestration for model training
Collaborate with researchers to implement and optimize training methodologies
Analyze and improve efficiency, scalability, and cost-effectiveness of training systems
Troubleshoot complex performance issues in distributed training environments
Requirements:
Bachelor's degree in Computer Science, Computer Engineering, or related field, or equivalent practical experience
3+ years of experience with distributed systems and ML infrastructure
Experience with PyTorch
Proficiency in cloud platforms (AWS, GCP, Azure)
Experience with containerization and orchestration (Docker, Kubernetes)
Knowledge of distributed training techniques (data parallelism, model parallelism, FSDP)
Nice to have:
Master's or PhD in Computer Science or related field
Experience training large language models or multimodal AI systems
Experience with ML workflow orchestration tools
Background in optimizing high-performance distributed computing systems
Familiarity with ML DevOps practices
Contributions to open-source ML infrastructure or related projects