Microsoft AI is seeking an experienced HPC Operations Engineering Manager to join our Infrastructure Team. In this role, you’ll lead a team of Site Reliability Engineers who blend software engineering and systems engineering to keep our large-scale distributed AI infrastructure reliable and efficient. You’ll work closely with ML researchers, data engineers, and product developers to design and operate the platforms that power the training, fine-tuning, and serving of generative AI models.
Job Responsibilities:
Team leadership: Lead a team of experienced SREs to ensure uptime, resiliency, and fault tolerance of AI model training and inference systems
Observability: Design and help maintain monitoring, alerting, and logging systems that provide real-time visibility into model serving pipelines and infrastructure
Automation & Tooling: Lead the development of automation for deployments, incident response, scaling, and failover in hybrid cloud/on-prem CPU+GPU environments
Incident Management: Lead on-call rotations, troubleshoot production issues, conduct blameless postmortems, and drive continuous improvements
Security & Compliance: Ensure data privacy, compliance, and secure operations across model training and serving environments
Collaboration: Partner with ML engineers and platform teams to improve developer experience and accelerate research-to-production workflows
Requirements:
Bachelor's Degree in Computer Science or a related technical field AND 8+ years of technical engineering experience in Site Reliability Engineering, DevOps, or Infrastructure Engineering leadership roles AND 8+ years of experience with Kubernetes, Docker, and container orchestration AND 8+ years of experience with public cloud platforms such as Azure/AWS/GCP and infrastructure-as-code AND 6+ years of experience with programming/scripting languages, including but not limited to Python, Go, or Bash
OR equivalent experience
Master's Degree in Computer Science or a related technical field AND 12+ years of technical engineering experience AND 10+ years of experience with Kubernetes, Docker, and container orchestration AND 10+ years of experience with public cloud platforms such as Azure/AWS/GCP and infrastructure-as-code
OR equivalent experience
6+ years people management experience
Experience with monitoring & observability tools (Grafana, Datadog, OpenTelemetry, etc.)
Experience running large-scale GPU clusters for ML/AI workloads
Experience with high-performance computing (HPC) and workload schedulers (e.g., Kubernetes operators)
Knowledge of CI/CD pipelines for inference and ML model deployment
Solid knowledge of distributed systems, networking, and storage
Familiarity with ML training/inference pipelines
Background in capacity planning & cost optimization for GPU-heavy environments