The Security AI Platform team builds and operates production infrastructure that powers AI-native security capabilities at Microsoft scale. We are organized into two focused groups: Platform + Apps develops the core product, microservices, and architecture; AI Operations ensures reliability, deployments, and operational excellence. Together, we deliver mission-critical services that process millions of requests daily. We are seeking a Principal AI Operations Engineer to define the technical direction for the AI Operations group. In this role, you will design and architect operational systems and establish standards for branch health, CI/CD pipelines, production deployments, and on-call processes. You will drive reliability initiatives, maintain production health and uptime, and ensure the platform meets its SLOs. You will also be the escalation point for complex incidents and will work closely with the Platform team to ensure services are operationally ready.
Job Responsibilities:
Define the operational vision, standards, and roadmap for the platform
Establish SLOs, error budgets, and reliability targets
Drive technical direction for the AI Operations group: architecture for deployments, pipelines, branch health, and production reliability
Own CI/CD pipeline architecture: Azure DevOps/GitHub Actions pipelines, build optimization, artifact management, and deployment automation
Manage Kubernetes infrastructure: AKS cluster operations, Helm chart management, node pool configuration, GPU resource allocation, and autoscaling (KEDA)
Drive production deployments: canary/ring rollouts, safe deployment practices, rollback procedures, and release coordination with the Platform team
Establish and operate first-level on-call: incident response procedures, escalation paths, runbooks, and post-incident reviews
Build and maintain observability infrastructure: Prometheus, Grafana, OpenTelemetry collectors, alerting rules, and dashboard curation
Manage infrastructure as code: Bicep templates for Azure resources, Helm charts for Kubernetes deployments, and environment parity
Ensure branch health and code quality gates: PR validation pipelines, automated testing, security scanning, and merge policies
Debug and diagnose production issues: analyze logs (Kusto/ADX), traces, and metrics to identify root causes and drive resolution
Collaborate with the Platform team on operational readiness: review service designs for operability, define deployment requirements, and validate runbooks
Establish effective operational practices and a culture of continuous improvement
Embody our culture and values
Requirements:
Bachelor's Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience.
6+ years technical engineering experience in DevOps, SRE, or platform operations
6+ years driving complex operational initiatives across teams
Demonstrated success leading without authority
4+ years hands-on experience with Kubernetes in production environments
3+ years building and maintaining CI/CD pipelines at scale
Ability to meet Microsoft, customer, and/or government security screening requirements is required for this role
This position will be required to pass the Microsoft Cloud background check upon hire/transfer and every two years thereafter.
Nice to have:
Experience with Kubernetes: cluster operations, Helm, troubleshooting, autoscaling, and production management
Proficiency with CI/CD platforms: Azure DevOps, GitHub Actions, or similar pipeline tooling
Experience with cloud platforms (Azure preferred): AKS, networking, identity management, and resource provisioning
Infrastructure as Code: Bicep, Terraform, or Helm chart development
Observability tooling: Prometheus, Grafana, OpenTelemetry, and log analytics (Kusto/KQL)