10Pearls is seeking a Staff/Senior MLOps Engineer – Azure ML Platform & LLMOps to design, build, and operate production-grade ML and LLM infrastructure at scale. This role is ideal for an experienced MLOps engineer who understands how to take machine learning and Generative AI systems from experimentation to reliable, secure, and scalable production environments. You will own critical platform capabilities including ML infrastructure, deployment automation, monitoring, observability, scalability, and operational excellence across Azure-based AI systems. This is a highly hands-on engineering role focused on enabling fast, safe, and cost-effective ML operations while partnering closely with ML Engineers, Data Engineers, and platform teams.
Job Responsibilities:
Design and operate end-to-end ML infrastructure on Microsoft Azure, including training environments, model registries, deployment workflows, and scalable inference systems on Azure Kubernetes Service (AKS)
Own and evolve MLflow and Kubeflow platforms, including experiment tracking, model registry management, reproducible training workflows, and pipeline orchestration
Build and maintain robust CI/CD pipelines in GitLab for ML models and AI services, including validation gates, canary deployments, progressive delivery, and automated rollback strategies
Design scalable inference systems using AKS autoscaling, GPU scheduling, Redis caching, asynchronous processing with Azure Service Bus, and cost-aware infrastructure planning
Implement comprehensive monitoring and observability for ML and LLM systems, covering infrastructure metrics, latency, drift detection, token usage, quality metrics, and operational cost tracking
Define and enforce platform-level security controls including IAM policies, secrets management, network segmentation, audit logging, dependency scanning, and model access governance
Build highly available and fault-tolerant ML serving infrastructure with a strong focus on scalability, disaster recovery, resilience, and platform reliability
Define and maintain platform SLOs for ML services, including incident response processes, postmortems, and operational improvement initiatives
Partner closely with ML Engineers to productionize new ML models, LLM systems, and agentic AI workflows with safe rollout and evaluation patterns
Optimize infrastructure utilization and operational cost across compute, GPU workloads, and LLM provider usage through batching, caching, autoscaling, and routing strategies
Ensure all production ML and AI services have actionable dashboards, alerts, observability standards, and operational playbooks for on-call readiness
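As one concrete example of the drift detection mentioned in the monitoring responsibilities above, here is a minimal plain-Python sketch of the Population Stability Index (PSI), a common signal for comparing a live feature distribution against its training baseline. This is an illustrative sketch only; the function names, bin count, and thresholds are assumptions, not part of the role's actual stack:

```python
import math

def psi(expected, actual, bins=10):
    """Population Stability Index between a baseline (training-time) and a
    live feature distribution -- values near 0 mean no drift; values above
    roughly 0.2 are commonly treated as significant drift."""
    lo = min(min(expected), min(actual))
    hi = max(max(expected), max(actual))
    width = (hi - lo) / bins or 1.0

    def bucket_fracs(values):
        counts = [0] * bins
        for v in values:
            i = min(int((v - lo) / width), bins - 1)
            counts[i] += 1
        total = len(values)
        # Small floor avoids log(0) for empty buckets.
        return [max(c / total, 1e-6) for c in counts]

    e, a = bucket_fracs(expected), bucket_fracs(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

# Identical distributions score near 0; a shifted distribution scores high.
baseline = [x / 100 for x in range(100)]
shifted = [x / 100 + 0.5 for x in range(100)]
assert psi(baseline, baseline) < 0.01
assert psi(baseline, shifted) > 0.2
```

In a real platform this comparison would run on a schedule against feature logs, with the score exported as a metric (e.g., to Prometheus or Azure Monitor) and alerted on.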
Requirements:
Bachelor's degree in Computer Science, Engineering, or a related field (preferred)
5+ years of professional experience in MLOps, DevOps, SRE, Platform Engineering, or ML Infrastructure roles
Minimum 3 years of hands-on experience supporting production-grade ML systems and AI platforms
Strong hands-on experience with Microsoft Azure, including Azure Kubernetes Service (AKS), Azure Service Bus, Azure Storage, networking, identity management, and cloud cost optimization
Strong Kubernetes operational expertise including Helm, Ingress Controllers, autoscaling (HPA/VPA/KEDA), GPU scheduling, workload troubleshooting, and large-scale container orchestration
Production experience with MLflow, Kubeflow, or equivalent ML platform tooling for experiment tracking, model registries, and ML pipeline orchestration
Strong expertise in GitLab CI/CD or equivalent CI/CD tooling for automated deployments, validation gates, rollback workflows, and progressive delivery patterns
Hands-on experience with monitoring and observability platforms including Prometheus, Grafana, OpenTelemetry, Azure Monitor, Datadog, New Relic, or Elastic
Experience monitoring ML/LLM systems including latency, model performance, drift, token usage, infrastructure health, and operational costs
Strong proficiency in Python and shell scripting for automation and operational tooling
Experience with Infrastructure-as-Code tools such as Terraform, Bicep, or ARM templates
Strong troubleshooting, debugging, and incident response capabilities across distributed systems and cloud-native environments
Excellent written and verbal communication skills, including technical documentation, runbooks, and incident reporting
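The validation gates and rollback workflows listed in the requirements above can be reduced to a simple decision rule. The sketch below is a hypothetical illustration (the function name and the 10% margin are assumptions): a canary is promoted only if its error rate stays within a relative margin of the baseline's, otherwise a rollback is signaled:

```python
def canary_gate(baseline_errors, baseline_total, canary_errors, canary_total,
                max_relative_increase=0.10):
    """Promotion gate for a canary deployment: promote only if the canary
    error rate does not exceed the baseline rate by more than the
    configured relative margin; otherwise signal a rollback."""
    baseline_rate = baseline_errors / baseline_total
    canary_rate = canary_errors / canary_total
    threshold = baseline_rate * (1 + max_relative_increase)
    return "promote" if canary_rate <= threshold else "rollback"

# Baseline: 50 errors in 10,000 requests (0.5%); threshold is 0.55%.
assert canary_gate(50, 10_000, 5, 1_000) == "promote"   # canary at 0.5%
assert canary_gate(50, 10_000, 8, 1_000) == "rollback"  # canary at 0.8%
```

In a GitLab CI/CD pipeline this kind of check would typically run as a job between the canary and full-rollout stages, reading both rates from the monitoring backend.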
Nice to have:
Experience operating production-grade LLM or Generative AI systems, including prompt versioning, evaluation frameworks, routing layers, and vector store operations
Experience with Azure AI Foundry, AWS AgentCore, SageMaker, or similar AI platform services
Exposure to GPU infrastructure and inference tooling such as NVIDIA GPU Operator, Triton Inference Server, vLLM, or TGI
Familiarity with model observability and evaluation platforms such as Arize, Fiddler, WhyLabs, or Evidently
Experience implementing security and compliance controls for enterprise ML environments
Experience working with vector databases, semantic search systems, or Retrieval-Augmented Generation (RAG) architectures
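To make the RAG retrieval step mentioned above concrete, here is a toy in-memory sketch. It substitutes a bag-of-words counter for a real embedding model and a sorted list for a real vector database, so every name and function here is illustrative only; production systems would use an actual embedding model and a vector store:

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words 'embedding' standing in for a real embedding model."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, documents, k=2):
    """Rank documents by similarity to the query -- the retrieval step of a
    RAG pipeline, here with a naive in-memory 'vector store'."""
    q = embed(query)
    scored = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return scored[:k]

docs = [
    "AKS autoscaling with KEDA",
    "MLflow model registry basics",
    "GPU scheduling on Kubernetes",
]
assert retrieve("kubernetes gpu scheduling", docs, k=1) == \
    ["GPU scheduling on Kubernetes"]
```

The retrieved documents would then be injected into the LLM prompt as context, which is where the prompt versioning and evaluation frameworks listed above come into play.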