We are seeking a skilled and passionate ML Platform Engineer to join our team and build the next generation of our machine learning infrastructure. You will be responsible for designing, implementing, and maintaining the core MLOps platform that empowers our Data Science and ML Engineering teams to rapidly develop, deploy, and monitor high-performance models at scale. Crucially, you will contribute to the evolution of our unified AI Platform, covering both traditional ML and our growing LLM (Large Language Model) platform.
Job Responsibilities:
Platform Development: Design, build, and maintain the end-to-end MLOps platform using Kubernetes and Cloud Services
Infrastructure as Code (IaC): Use Terraform or similar tools to manage, provision, and scale all ML-related infrastructure securely and efficiently
Pipeline Automation: Implement and optimize CI/CD/CT (Continuous Integration, Delivery, Training) pipelines to automate model training, testing, packaging, and deployment using tools like Argo and Kubeflow Pipelines
Serving Infrastructure: Build highly available, low-latency, and high-throughput model serving infrastructure
Observability: Implement robust monitoring, alerting, and logging solutions to track infrastructure health, model performance, and data/model drift
Tooling & Support: Evaluate, integrate, and support ML tools such as Feature Stores and distributed model training pipelines
Security & Compliance: Ensure platform security, implement RBAC (Role-Based Access Control), and manage secrets for sensitive data and production environments
Collaboration: Work closely with Data Scientists and ML Engineers to understand their needs and provide technical guidance on best practices for scaling their models
Requirements:
5+ years of experience in backend software development
2+ years focused on AI/ML platforms or MLOps infrastructure
Deep expertise in MLOps practices, including automated deployment pipelines, model optimization, and production lifecycle management
Proven experience designing and implementing low-latency model serving solutions
Proficiency in Python
Skill in writing high-quality, maintainable code
Experience designing and developing large-scale, distributed, highly concurrent, low-latency, high-availability inference systems
Excellent communication and mentoring abilities
A relevant degree in Computer Science, Mathematics, or a related field
Nice to have:
Familiarity with distributed compute/training frameworks (e.g., Ray, Spark)
Experience configuring and managing ML workflows on cloud infrastructure (e.g., Kubernetes, Kubeflow)
Working knowledge of LLM serving optimization (e.g., vLLM, TGI, Triton) and GPU resource management