Help build the infrastructure that powers training, evaluation, and data platforms for reliable deployment of world-class foundational AI models. We are on a mission to create state-of-the-art AI models and deploy them across Microsoft products at an unprecedented scale. You’ll collaborate across engineering and research to design, evolve, and operate core research infrastructure, so that product teams can train faster, evaluate more rigorously, and ship with confidence. You’ll work closely with the teams that transform pre-trained models into the consumer Copilot experience.
Job Responsibilities:
Design and build core platform services for scalable training and evaluation, including cluster orchestration, job scheduling, data and compute pipelines, and artifact management
Standardize containerized workflows by maintaining Docker images, CI/CD, and runtime configurations
Advocate for best practices in security, reproducibility, and cost efficiency
Implement end-to-end observability and operations through metrics, tracing, logging, dashboards, and automated alerting for model training and platform health (using Prometheus, Grafana, and OpenTelemetry)
Architect and operate services on Azure cloud platforms, managing infrastructure-as-code (Terraform/Helm), secrets, networking, and storage
Enhance developer experience by creating tools, CLIs, and portals that simplify job submission, metrics analysis, and experiment management for generalist software engineering and research teams
Enforce security and compliance policies for data access, container hardening, and supply-chain integrity, and partner with security and privacy teams to maintain robust practices in multi-tenant environments and secret management
Collaborate cross-functionally with data, model, and product teams to align infrastructure roadmaps with training needs, evaluation protocols, and Copilot product goals
Requirements:
Strong software engineering background building reliable, scalable production systems (Python preferred)
Hands‑on experience supporting large‑scale ML / LLM training, evaluation, or experimentation infrastructure
Experience operating GPU‑heavy workloads in cloud environments using Docker and Kubernetes (scheduling, utilization, isolation)
Experience designing and running data / compute pipelines and orchestration (e.g., Airflow, Argo) with object storage (Azure Blob / S3)
Experience building secure, reproducible platforms using CI/CD, infrastructure‑as‑code (Terraform, Helm), container security, and secrets management
Experience working closely with AI researchers in fast‑moving, experimental, frontier‑scale research environments and building internal tools (CLIs, portals, APIs) to boost productivity