As an MLOps Engineer in the DAMO service line, you will be responsible for ensuring the reliability, safety, performance and continuous improvement of large-scale machine learning and AI systems in production, including both generative AI and traditional ML systems such as computer vision and recommendation models. You will work across the full software delivery lifecycle, contributing to design, implementation, deployment and ongoing operational excellence.
Job Responsibilities:
Design, implement and maintain monitoring and alerting for ML and AI operational signals
Build and operate robust evaluation and testing pipelines for all ML and AI systems
Investigate and resolve production issues related to model behaviour
Collaborate with infrastructure and platform teams to ensure stable, performant and cost-efficient AI inference
Manage the lifecycle of ML models, prompts, embeddings, vector indices and associated components
Design and operate effective feedback loops that incorporate real user interactions
Uphold governance, safety and compliance standards
Maintain clear, comprehensive documentation
Communicate system health, risks, upcoming changes and operational insights clearly
Support the growth and development of junior team members
Requirements:
High proficiency in Python (Pandas, NumPy, Scikit-learn) for scripting, analysis, and maintaining production models
Strong SQL skills for querying, data manipulation, and operational data checks
Experience building or maintaining GenAI / agentic solutions (e.g., RAG, LlamaIndex, CrewAI, or similar orchestration/RAG tooling)
Solid understanding of classical ML algorithms, model evaluation, and challenges like drift and bias
Hands-on experience with model monitoring (data quality, prediction quality, latency) using Prometheus, Grafana, or cloud-native tools
Experience with Azure (Databricks, Azure Machine Learning, etc.) for deployment and resource management
Familiarity with Agile methodologies (Scrum/Kanban)
Must be a Singapore citizen or already hold Singapore Permanent Residency (PR) at the time of application
Willingness to be part of a 24x7 on-call rotation, as needed
Nice to have:
Experience with big data frameworks (Spark, Dask) for large-scale processing
Understanding of containerization and orchestration (Docker, basic Kubernetes)
Exposure to workflow/pipeline or IaC tooling (Airflow, Kubeflow, MLflow, Terraform)