Member of Technical Staff, AI Training Infrastructure Jobs

1 Job Offer

Member of Technical Staff, AI Training Infrastructure
Join our team in San Mateo as a Training Infrastructure Engineer. You will design and optimize high-performance, scalable infrastructure for large-scale AI model training. Leverage your expertise in PyTorch, distributed systems, and cloud platforms to build robust pipelines. This role offers mean...
Location
United States, San Mateo
Salary
175,000 - 220,000 USD / Year
Fireworks AI
Expiration Date
Until further notice
Explore Member of Technical Staff, AI Training Infrastructure jobs and discover a pivotal career at the intersection of cutting-edge artificial intelligence and robust systems engineering. Professionals in this role are the architects and builders of the foundational platforms that enable the training of large-scale AI models, such as large language models (LLMs) and complex multimodal systems. Their core mission is to design, implement, and optimize the high-performance, distributed computing environments that allow AI research to scale from concept to reality. This profession is critical for any organization aiming to push the boundaries of AI, making these jobs highly sought-after in the tech industry.

The typical responsibilities for a Member of Technical Staff in AI Training Infrastructure are centered around creating scalable and efficient systems. A primary duty involves designing and implementing the underlying infrastructure that can handle massive, distributed training workloads spanning thousands of GPUs across multiple data centers or cloud regions. These engineers develop and maintain robust training pipelines, ensuring data flows seamlessly from storage to processing. They are deeply involved in performance optimization, employing techniques like data parallelism, model parallelism, and pipeline parallelism to minimize training time and maximize hardware utilization. Furthermore, they build critical tooling for monitoring, logging, and debugging to maintain system reliability and provide visibility into complex training runs. Automating infrastructure provisioning, scaling, and orchestration using modern technologies is also a standard part of the role, as is collaborating closely with AI researchers to translate novel training methodologies into stable, production-grade systems.

To succeed in these jobs, individuals typically possess a strong blend of software engineering and machine learning infrastructure expertise. A background in computer science or a related field is common. Essential technical skills include deep proficiency in distributed systems principles and hands-on experience with ML frameworks like PyTorch or TensorFlow, particularly their distributed training capabilities. Expertise in cloud platforms (AWS, GCP, Azure) and containerization technologies like Docker and orchestration systems like Kubernetes is fundamental. Professionals in this field must also have a solid understanding of high-performance computing, networking for low-latency communication, and storage solutions for enormous datasets. Strong problem-solving skills are paramount for troubleshooting intricate performance bottlenecks and system failures in live training environments.

For those passionate about building the engines that power the AI revolution, Member of Technical Staff, AI Training Infrastructure jobs offer a challenging and impactful career path where their work directly accelerates the development of transformative artificial intelligence.
