We are seeking a highly motivated, experienced SRE/MLOps engineer with Python and Ray.io skills to build and maintain the next-generation AI platform. The role focuses on developing software on top of open-source libraries such as Ray so that internal teams can run ML workloads efficiently.
Job Responsibilities:
Build, refactor, and release software for the AI platform (feature development and bug fixes)
Deploy and manage applications on Ray.io, including workload management, cluster deployment, distributed task scheduling, and troubleshooting
Use Ray Dashboard and CLI tools to monitor and debug distributed jobs
Work with Ray ecosystem libraries: Ray Train, Ray Tune, Ray Serve, Ray Data
Integrate with tools such as Airflow, MLflow, Dask, DeepSpeed (a plus)
Collaborate with AI platform developers to provide CI/CD pipelines for automated deployment and configuration
Ensure high availability (target 99.999%) and monitor production systems
Develop automation for problem management and operational efficiency
Write documentation and provide technical support for internal users
Follow best practices for development: versioning, source control, branching, and merging patterns
Requirements:
Main coding language: Python (C++ good to have)
Strong experience with Ray.io, covering at least two of its libraries (for example, Ray Train and Ray Serve)
Proficiency with Kubernetes and Docker
Hands-on experience with distributed systems, cluster management, and cloud technologies
Familiarity with DevOps practices, CI/CD pipelines, and test automation
Excellent problem-solving, debugging, and triaging skills
Strong communication skills for collaboration with partners, customers, and engineers
Ability to manage multiple projects in a fast-paced environment
English proficiency (oral and written)
Nice to have:
Experience with TensorRT, DeepSpeed, or PyTorch Distributed
C++ experience
What we offer:
Flexible working format: remote, office-based, or flexible
A competitive salary and good compensation package
Personalized career growth
Professional development tools (mentorship program, tech talks and trainings, centers of excellence, and more)
Active tech communities with regular knowledge sharing