The Principal AI/ML Operations Engineer leads the architecture, automation, and operationalization of both machine learning and AI systems at scale. This role defines the strategy and technical standards for MLOps and AIOps across the organization, ensuring models and agents are evaluated, deployed, governed, and monitored with reliability, efficiency, and compliance. The candidate will collaborate across AI, data, and product engineering teams to drive best practices for serving, observability, automated retraining, evaluation flywheels, and operational guardrails for AI systems in production.
Job Responsibilities:
Define enterprise-level standards and reference architectures for MLOps and AIOps systems
Partner with data science, security, and product teams to set evaluation and governance standards (Guardrails, Bias, Drift, Latency SLAs)
Mentor senior engineers and drive design reviews for ML pipelines, model registries, and agentic runtime environments
Lead incident response and reliability strategies for ML/AI systems
Lead the deployment of AI models and systems in various environments
Collaborate with development teams to integrate AI solutions into existing workflows and applications
Ensure seamless integration with different platforms and technologies
Define and manage MCP Registry for agentic component onboarding, lifecycle versioning, and dependency governance
Build CI/CD pipelines that automate LLM agent deployment, policy validation, and prompt evaluation workflows
Develop and operationalize experimentation frameworks for agent evaluations, scenario regression, and performance analytics
Implement logging, metering, and auditing for agent behavior, function calls, and compliance alignment
Architect end-to-end guardrails for AI agents including prompt injection protection, identity-aware routing, and tool usage authorization
Collaborate cross-functionally to standardize authentication, authorization, and session governance for multi-agent runtimes
Architect and standardize model registries and feature stores to support version tracking, lineage, and reproducibility across environments
Lead the deployment of machine learning models into production environments, ensuring scalability, reliability, and efficiency
Collaborate with software engineers to integrate machine learning models into existing applications and systems
Implement and maintain APIs for model inference
Design and manage training infrastructure including distributed training orchestration, GPU/TPU resource allocation, and automatic scaling
Implement CI/CD for model workflows using pipelines integrated with model validation, bias checks, and rollback automation
Build standardized experimentation frameworks for reproducible training, tuning, and deployment cycles (MLflow, W&B, Kubeflow)
Manage and optimize the infrastructure required for machine learning operations in cloud
Work closely with other teams to ensure the availability, security, and performance of machine learning systems
Implement robust monitoring solutions for deployed machine learning models to detect issues and ensure performance
Collaborate with data scientists and engineers to address and resolve model performance and data quality issues
Conduct regular system maintenance, updates, and optimizations to ensure optimal performance of machine learning solutions
Develop and maintain automation scripts and tools for managing machine learning workflows
Implement orchestration systems to streamline the end-to-end machine learning lifecycle, from data preparation to model deployment
Collaborate with data scientists to understand model requirements and constraints for deployment
Facilitate the transition of machine learning models from research to production, ensuring scalability and efficiency
Identify and implement optimizations to enhance the performance and efficiency of machine learning models in production
Conduct performance analysis and implement improvements based on resource utilization metrics
Implement security measures to protect machine learning systems and data
Ensure compliance with regulatory requirements and industry standards related to machine learning and data privacy
Integrate audit controls, metadata storage, and lineage tracking across ML and AI workflows
Ensure complete monitoring and feedback loops including event logs, evaluations, and automated retraining triggers
Enforce secure deployment patterns with Infrastructure-as-Code and cloud-native secrets management
Define SLAs, error budgets, and compliance reporting mechanisms for ML and AI systems
Requirements:
Bachelor’s or Master’s degree in Computer Science, Machine Learning, Data Science, or a related field
10+ years in ML infrastructure, DevOps, and software system architecture
4+ years leading MLOps or AIOps platforms
Strong programming skills in languages such as Python, Java, or Scala
Expertise in ML frameworks (TensorFlow, PyTorch, scikit-learn) and orchestration tools (Airflow, Kubeflow, Vertex AI, MLflow)
Proven experience operating production pipelines for ML and LLM-based systems across cloud ecosystems (GCP, AWS, Azure)
Deep familiarity with LangChain, LangGraph, ADK, or similar frameworks for managing agentic system runtimes
Strong competencies in CI/CD, IaC, and DevSecOps pipelines integrating testing, compliance, and deployment automation
Hands-on experience with observability stacks (Prometheus, Grafana, New Relic) for model and agent performance tracking
Understanding of governance frameworks for Responsible AI, auditability, and cost metering across training and inference workloads
Proficiency in containerization technologies (e.g., Docker, Kubernetes)
Proficiency in scripting languages (e.g., Bash, Python) for automation
Experience with workflow orchestration tools (e.g., Apache Airflow)
Expertise in managing and optimizing cloud-based infrastructure
Familiarity with DevOps practices and tools for automated deployment
Understanding of network configurations and security protocols
Ability to define problems, collect and analyze data, and propose innovative solutions. Strong critical thinking skills to evaluate models and identify their limitations
Comfortable working in a fast-paced, rapidly evolving environment. Proactive in staying up to date with the latest trends, techniques, and technologies in AI/data science