Lead the transformation of how applications and AI systems are delivered, operated, and evolved at enterprise scale. This role owns the design and execution of AI‑powered DevOps for Marriott’s AI Platform. The goal is to enable teams to ship production‑grade, observable, self‑healing services with minimal human toil. You will partner closely with the Kubernetes platform team, the DevOps platform team, and other organizational leaders to deliver safe, scalable solutions that protect the core AI platform services, so that we can provide high‑value business interactions across the organization.
Job Responsibilities:
Build resilient CI/CD pipelines for platform services that include testing, monitoring and auto‑rollback
Deploy models and workloads via Kubernetes and serving tools such as SageMaker, KFServing, and Ray Serve
Sustain latency and error budgets
Work with other platform teams to advance their innovation roadmaps as an early adopter
Embed OpenTelemetry traces, vector metrics, and cost monitors into unified dashboards
Implement MCP‑compliant gateways for safe invocation by both humans and agents
Champion the use of internal autonomous agents to eliminate repetitive DevOps and SRE toil across build, deploy, and runtime operations
Serve as a thought leader for AI‑based operations, influencing architecture standards, platform roadmaps, and engineering culture
Coach senior engineers and platform teams on modern DevOps, SRE, and AI‑Ops patterns
Own platform delivery and reliability: lead post‑incident learning and drive systemic improvements through blameless retrospectives and automation
Requirements:
Extensive software engineering experience building highly scalable, highly available systems
Deep knowledge of standard DevOps practices and cloud infrastructure, including identity management and networking
MLOps experience working with models in production
IaC mastery (CDK/Terraform) and secrets management (Vault, AWS Secrets Manager)
Proven track record of meeting SLOs for containerized ML services at fleet scale
Deep experience working with cloud platforms
A strong servant leader with a passion for work automation and incident retrospectives
A strong desire to be part of a committed team that is building for global scale to change the way the world travels
Excellent verbal communication skills, with the ability to articulate complex architectural decisions clearly
Ability to write and review clear, well‑structured software documentation
Ability to communicate effectively and asynchronously with remote team members around the globe
Nice to have:
Experience moving legacy CI to agent‑augmented pipelines