We are seeking a Product-Minded Junior Research Infrastructure Engineer to join our growing team. This is a '70/30' role: you will spend 70% of your time on hardcore backend and infrastructure work, tackling complex distributed systems, and 30% building intuitive internal tools that turn our platform capabilities into a seamless product experience for researchers. You will design, build, and operate the distributed data systems that power large-scale ingestion, processing, and transformation of datasets used for AI model training. The role is versatile: you'll own pipelines end to end, ensure data quality and scalability, and collaborate closely with ML researchers to prepare diverse datasets for cutting-edge model training. You'll thrive in our fast-paced startup environment, where problem-solving, adaptability, and wearing multiple hats are the norm.
Job Responsibilities:
Participate in the design and implementation of distributed task orchestration systems using Temporal or Celery
Architect pipelines across cloud object storage (S3, GCS), data lakes, and metadata catalogs
Implement partitioning, sharding, and caching strategies to ensure data processing pipelines are resilient, highly available, and consistent
Design, implement, and maintain distributed ingestion pipelines for structured and unstructured data (images, 3D/2D assets, binaries)
Build scalable ETL/ELT workflows to transform, validate, and enrich datasets for AI/ML model training and analytics
Support preprocessing of unstructured assets (e.g., images, 3D/2D models, video) for training pipelines, including format conversion, normalization, augmentation, and metadata extraction
Implement validation and quality checks to ensure datasets meet ML training requirements
Collaborate with ML researchers to quickly adapt pipelines to evolving pretraining and evaluation needs
Use infrastructure as code and declarative configuration (Terraform, Kubernetes manifests, etc.) to manage scalable, reproducible environments
Manage Databricks assets and deployments using Databricks Asset Bundles (DABs) and build rigorous CI/CD pipelines (GitHub Actions)
Focus on maximizing cluster utilization (CPU/Memory) and optimizing EC2 instance allocation to aggressively reduce compute costs
Take ownership of the platform's user-facing interface by building data explorers and management consoles using React or Next.js
Actively listen to researchers and data scientists to iterate on UI/UX based on their feedback
Simplify complex CLI operations into intuitive GUI interactions to boost overall developer experience (DevEx)
Requirements:
2+ years of experience in software engineering, backend development, or distributed systems
Strong programming skills in Python (Scala/Java/C++ a plus)
Familiarity with distributed frameworks (Spark, Dask, Ray) and cloud platforms (AWS/GCP/Azure)
Experience with workflow orchestration tools (Temporal, Celery, or Airflow)
Proficiency with Infrastructure as Code (Terraform) and CI/CD tools (GitHub Actions)
Experience building web applications or internal tools using React or Next.js
A 'product-first' mindset: an interest in how users interact with infrastructure and a desire to build clean, functional interfaces