At Doctolib, we're on a mission to transform healthcare through the power of AI. As a Senior Data Engineer, you'll play a key role in building and optimizing the data foundations within the AI Team to deliver safe, scalable, and impactful models. You will join a dedicated team working on data infrastructure for LLM-, VLM-, and RAG-based systems, powering our new AI Medical Companion. Your work will ensure that our engineers and data scientists can train, evaluate, and deploy AI models efficiently on high-quality, well-structured, and compliant data.
Job Responsibilities:
Ensure high standards of data quality for AI model inputs
Design, build, and maintain scalable data pipelines on Google Cloud Platform (GCP) for AI and machine learning use cases
Implement data ingestion and transformation frameworks that power retrieval systems and training datasets for LLMs and multimodal models
Architect and manage NoSQL and vector databases to store and retrieve embeddings, documents, and model inputs efficiently
Collaborate with ML and platform teams to define data schemas, partitioning strategies, and governance rules that ensure privacy, scalability, and reliability
Integrate unstructured and structured data sources (text, speech, image, documents, metadata) into unified data models ready for AI consumption
Optimize performance and cost of data pipelines using GCP native services (BigQuery, Dataflow, Pub/Sub, Cloud Storage, Vertex AI)
Contribute to data quality and lineage frameworks, ensuring AI models are trained on validated, auditable, and compliant datasets
Continuously evaluate and improve our data stack to accelerate AI experimentation and deployment
Requirements:
Master’s or Ph.D. degree in Computer Science, Data Engineering, or a related field
5+ years of experience in Data Engineering, ideally supporting AI or ML workloads
Strong experience with the GCP data ecosystem
Proficiency in Python and SQL, with experience in data pipeline orchestration (e.g., Airflow, Dagster, Cloud Composer)
Deep understanding of NoSQL systems (e.g., MongoDB) and vector databases (e.g., FAISS, Vector Search)
Experience designing data architectures for RAG, embeddings, or model training pipelines
Knowledge of data governance, security, and compliance for sensitive or regulated data
Familiarity with W&B / MLflow / Braintrust / DVC for experiment tracking and dataset versioning (extract snapshots, change tracking, reproducibility)
Familiarity with containerized environments (Docker, Kubernetes) and CI/CD for data workflows
A collaborative mindset and passion for building the data foundations of next-generation AI systems
What we offer:
Free comprehensive health insurance for you and your children
Parent Care Program: additional leave on top of the legal parental leave
Free mental health and coaching services through our partner Moka.care
For caregivers and workers with disabilities, a package including an adaptation of the remote policy, extra days off for medical reasons, and psychological support
Work from EU countries and the UK for up to 10 days per year, thanks to our flexibility days policy
Works Council subsidy to refund part of a sports club membership or a creative class