The goal of this role is to build, scale, and optimise next-generation world model architectures (e.g. GAIA and successors) and bridge them into high-throughput training infrastructure, enabling synthetic data and simulation to dramatically accelerate autonomy development. You’ll design systems to acquire, process, and curate multimodal data at scale. You’ll turn raw experience into the high-quality datasets that fuel our models. You’ll sit at the intersection of machine learning research and data engineering, collaborating closely with scientists and infrastructure teams to ensure our workflows are robust, efficient, and deeply integrated with our model training stack. Your work will directly impact how quickly and effectively we can train, evaluate, and deploy embodied AI systems in the real world.
Job Responsibilities:
Design and implement large-scale data acquisition, processing, and curation pipelines, owning the full lifecycle of high-quality datasets used to train advanced robotics and foundation models
Continuously improve dataset quality and utility through sophisticated data analysis, debugging, and experimentation, developing metrics, tests, and monitoring mechanisms that directly drive model performance improvements
Develop and scale multimodal data pipelines for ingestion, preprocessing, filtering, annotation, and storage across video, LiDAR, and telemetry modalities
Run systematic experiments on data ablations and composition to assess their impact on model training dynamics, generalisation, and downstream performance
Collaborate with ML researchers and platform engineers to ensure datasets are fit for purpose and efficiently integrated into large-scale training workflows
Build internal tools and workflows for dataset auditing, visualisation, and versioning to streamline iteration and reproducibility
Advance best practices for data governance, reliability, and scalability across the data lifecycle, ensuring data safety, privacy, and long-term maintainability
Requirements:
Experience in ML engineering, data engineering, or applied ML roles focused on large-scale data systems
Proven experience building and maintaining large-scale data pipelines for machine learning, including data ingestion, transformation, and validation
Strong Python fundamentals and experience with modern ML and data frameworks (e.g. PyTorch, Ray, Dask, Spark, or equivalent)
Solid understanding of multimodal data (video, LiDAR, sensor telemetry) and its challenges in large-scale training
Experience defining and tracking data quality metrics, conducting dataset analysis, and driving data-informed improvements in model performance
Demonstrated ability to work collaboratively with ML researchers, platform engineers, and product teams in a fast-paced, experimental environment
Strong problem-solving skills, a data-driven mindset, and the ability to translate research needs into reliable data solutions
Nice to have:
Exposure to large-scale storage, distributed training systems, or cloud compute environments (Azure, AWS, GCP)
Experience designing high-throughput, distributed data pipelines (e.g. with Spark, Ray, Beam, or similar frameworks)
Familiarity with data versioning, lineage, and governance tools (e.g. LakeFS, DVC, MLflow, Delta Lake)
Experience in AVs, robotics, simulation, or other embodied AI domains
Familiarity with foundation models, generative models, or simulation-based data pipelines