Senior Supercomputing Operations Engineer

Microsoft Corporation

Location:
United States, Multiple Locations

Contract Type:
Not provided

Salary:

119800.00 - 234700.00 USD / Year

Job Description:

Microsoft Azure’s Artificial Intelligence and High-Performance Computing (AI/HPC) organization powers some of the world’s largest cloud-native supercomputers used for frontier AI training, scientific computing, and large-scale distributed simulations. Our team builds and operates hyperscale GPU clusters that consistently place Azure among global leaders in the Top500, MLPerf, and Graph500 benchmarks. By joining us, you step into the engineering core responsible for ensuring these systems remain reliable, performant, and ready for the next wave of AI innovation. At this supercomputing scale, reliability and operational excellence are engineering challenges of their own.

As a Senior Supercomputing Operations Engineer, you will own day-to-day operations of InfiniBand and GPU interconnect fabrics, treating them as a single, mission-critical reliability domain that directly impacts GPU availability, training throughput, and customer SLAs. You will lead incident triage and mitigation, debug complex fabric-layer failures, and correlate telemetry across nodes, switches, Subnet Manager (SM) behavior, and GPU subsystems to identify true root causes. Your work will focus on resolving real production incidents at scale, improving operational readiness, and preventing recurrence through better tooling, automation, and deep systems understanding.

You will build and use state-of-the-art tools to detect issues proactively, close operational gaps, and improve observability across our fabrics. You will contribute to TSGs, operational playbooks, and escalation guides while partnering with internal engineering teams and industry-leading manufacturers to drive meaningful fixes. The solutions you develop and the operational improvements you drive will uplift the reliability of Azure’s largest supercomputing deployments and directly support the most compute-intensive AI workloads running in the cloud.

Job Responsibility:

  • Act as DRI for InfiniBand and GPU interconnect fabric operations, ensuring GPU availability and AI training stability
  • Lead incident triage, mitigation, recovery, and root cause analysis for fabric-related production issues
  • Perform deep multi-layer debugging across InfiniBand, Subnet Manager, GPU interconnect, PCIe, and GPU interactions
  • Drive operational excellence and prevention by identifying systemic failure patterns and authoring TSGs, playbooks, and escalation guides
  • Build and leverage automation, telemetry, and tooling to improve detection, debuggability, and mean time to mitigation

Requirements:

  • Bachelor's Degree in Computer Science or related technical field AND 4+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
  • Ability to meet Microsoft, customer and/or government security screening requirements
  • Microsoft Cloud Background Check upon hire/transfer and every two years thereafter
  • Master's Degree in Computer Science or related technical field AND 8+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR Bachelor's Degree in Computer Science or related technical field AND 12+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
  • 4+ years of experience operating high performance computing (HPC), artificial intelligence (AI), or large-scale distributed systems in production environments
  • Hands-on experience operating interconnect fabrics for HPC, AI, or large-scale distributed systems in production
  • Strong Linux systems knowledge with demonstrated experience debugging low-level infrastructure issues
  • Demonstrated ability to reason across hardware, firmware, drivers, and software stacks to diagnose and resolve production issues
  • Familiarity with InfiniBand Subnet Manager behavior, including routing, congestion control, and fabric telemetry

Additional Information:

Job Posted:
March 04, 2026

Employment Type:
Full-time
Work Type:
Remote work

Similar Jobs for Senior Supercomputing Operations Engineer

Senior Research Engineer

The HPE HPC & AI EMEA Research Lab (ERL) is characterized by a unique blend of i...
Location:
Germany, Munich, Berlin
Salary:
Not provided
Hewlett Packard Enterprise
Expiration Date
Until further notice
Requirements:
  • Development experience in compiled languages such as C, C++ or Fortran and experience with interpreted environments such as Python
  • At least a B.Sc. equivalent in a Science, Technology, Engineering or Mathematical discipline
  • Parallel programming experience, with programming models such as OpenMP, MPI, CUDA, OpenACC, HIP, PGAS languages, etc.
  • An understanding of AI/ML frameworks, experience with frameworks such as TensorFlow or PyTorch is highly desirable
  • An interest in system- and data center monitoring and operational data analysis
  • Professional language skills in English and German
Job Responsibility:
  • Perform world-class research while also shaping products of the future
  • Work with the most esteemed research partners across Europe
  • Enable high performance research software on pre-Exascale and Exascale supercomputers
  • Provide new environments/abstractions to support application developers to build, deploy, and run applications taking advantage of leading-edge hardware at scale
  • Make and operate HPC/AI systems and datacenters in a sustainable way
  • Manage modern data-intensive workloads in high performance environments
What we offer:
  • Competitive salary and extensive benefits package (pension scheme, insurances, bike and car leasing, and other fringe benefits)
  • Work-life balance (flexible working time and hybrid workplace model, 30 vacation days, four HPE Wellness-Fridays, up to six months paid parental leave)
  • Support for education, training, and career development
  • Diverse and dynamic work environment

Principal Supercomputing Operations Engineering Manager

Microsoft Azure’s Artificial Intelligence and High Performance Computing (AI/HPC...
Location:
United States, Multiple Locations
Salary:
139900.00 - 274800.00 USD / Year
Microsoft Corporation
Expiration Date
Until further notice
Requirements:
  • Bachelor's Degree in Computer Science or related technical discipline AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
  • Ability to meet Microsoft, customer and/or government security screening requirements is required for this role. These requirements include, but are not limited to, the following specialized security screening: Microsoft Cloud Background Check. This position will be required to pass the Microsoft Cloud Background Check upon hire/transfer and every two years thereafter.
Job Responsibility:
  • Own and drive the end-to-end operational strategy for InfiniBand and GPU interconnect fabric reliability across large-scale AI supercomputing environments, ensuring sustained GPU availability, training stability, and SLA compliance
  • Lead, manage, and grow a team of senior and principal engineers responsible for fabric operations, setting clear expectations, developing talent, and holding the organization accountable for outcomes
  • Provide senior technical leadership and executive decision-making during high-severity fabric incidents, guiding investigation strategy, escalation paths, and risk trade-offs while ensuring effective execution through the team
  • Ensure consistent, high-quality incident response, root cause analysis, and post-incident follow-through across the organization, with a strong emphasis on systemic prevention over one-off fixes
  • Drive operational excellence by defining reliability models, failure domains, and long-term corrective strategies, and ensuring adoption of authoritative TSGs, playbooks, and escalation frameworks
  • Partner deeply with platform, hardware, firmware, and service teams to align roadmaps, influence design decisions, and close systemic reliability gaps impacting interconnect fabrics at scale
  • Sponsor and prioritize automation, telemetry, diagnostics, and tooling investments that improve detection, observability, debuggability, and time to mitigation across the fleet
Employment Type:
Full-time

Senior Technical Program Manager (Supercomputing Operations)

Azure Specialized Compute drives the hardware roadmap, software and services tha...
Location:
United States, Multiple Locations
Salary:
119800.00 - 234700.00 USD / Year
Microsoft Corporation
Expiration Date
Until further notice
Requirements:
  • Bachelor's Degree AND 4+ years experience in engineering, product/technical program management, data analysis, or product development OR equivalent experience
  • 2+ years of experience managing cross-functional and/or cross-team projects
  • Ability to meet Microsoft, customer and/or government security screening requirements is required for this role. These requirements include, but are not limited to, the following specialized security screening: Microsoft Cloud Background Check. This position will be required to pass the Microsoft Cloud Background Check upon hire/transfer and every two years thereafter.
Job Responsibility:
  • Drive highly impactful cross-organization initiatives around end-to-end experiences spanning multiple Azure products and services
  • Manage complex programs and projects to successful completion
  • Mentor junior members of the team and provide support to our Product Management team
What we offer:
  • Certain roles may be eligible for benefits and other compensation.
Employment Type:
Full-time

Senior Principal Engineering Manager

Microsoft Research (MSR) is working to transform the future of artificial intell...
Location:
United States, Redmond
Salary:
163000.00 - 296400.00 USD / Year
Microsoft Corporation
Expiration Date
Until further notice
Requirements:
  • Bachelor's Degree in Computer Science or related technical field AND 8+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
  • 5+ years of people management experience leading software engineering teams, including managing principal engineers
  • Experience building or operating infrastructure for large-scale distributed systems, cloud platforms, or artificial intelligence (AI)/machine learning (ML) workloads
  • Track record of driving execution on complex, multi-workstream infrastructure projects with clear milestones and accountability
  • Technical fluency in one or more of: large-scale compute clusters, GPU infrastructure, scheduling and orchestration (Kubernetes, Volcano), or High-Performance Compute (HPC) environments
  • Experience with GPU programming (CUDA, NCCL) and frameworks such as PyTorch
  • Expertise in networking (InfiniBand, NVLink), storage systems, or distributed training parallelisms
  • A track record of strong cross-functional partnerships, including the ability to align on strategic direction, deliver joint accountabilities, and develop relationships with staff members with widely varied expertise
  • Experience scaling engineering teams through significant growth phases (hiring, onboarding, and integrating new engineers into a high-performing team)
  • Master's Degree in Computer Science or related technical field AND 12+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR Bachelor's Degree in Computer Science or related technical field AND 15+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
Job Responsibility:
  • Lead, mentor, and grow the engineering team that builds MSR’s AI research infrastructure
  • Recruit and develop exceptional engineering talent, building a diverse team - including hiring, onboarding, career development, and performance management
  • Drive execution across the team by setting clear goals, tracking milestones, managing dependencies, and ensuring accountability for delivering complex infrastructure projects on time and at high quality
  • Lead team culture and process changes, cultivating an AI-first mentality that accelerates our progress through agentic coding, automation, and skills development
  • Provide technical vision and judgment on the team's architecture, strategy, and roadmap — spanning supercomputer GPU clusters, high performance networking, workload optimization, researcher tools, and agentic workflows — while empowering engineers to own deep technical details
  • Collaborate closely cross-discipline with engineers, program managers, and research and science teams to align priorities, resolve dependencies, and build better solutions together
  • Foster a team culture of operational excellence, continuous improvement, and high psychological safety where engineers are empowered to take ownership and innovate
Employment Type:
Full-time

Member of Technical Staff, Infrastructure Data & Analytics

We are seeking experienced Infrastructure Data & Analytics Engineers to join our...
Location:
United States, Multiple Locations; Mountain View; San Francisco Bay area; New York City metropolitan area
Salary:
139900.00 - 274800.00 USD / Year
Microsoft Corporation
Expiration Date
Until further notice
Requirements:
  • Bachelor’s degree in computer science or related technical field AND 8+ years technical engineering experience with data engineering, analytics, or data science, with increasing technical ownership in a startup environment AND 6+ years experience with distributed data processing frameworks and large-scale data systems OR equivalent experience
  • Master's Degree in Computer Science or related technical field AND 12+ years technical engineering experience with data engineering, analytics, or data science, with increasing technical ownership in a startup environment AND 10+ years experience with distributed data processing frameworks and large-scale data systems OR equivalent experience
  • Proven technical leadership in data engineering, analytics platforms, or large-scale telemetry systems
  • Hands-on experience with ETL orchestration frameworks such as Airflow, Dagster, or similar
  • Strong communication skills; can explain complex systems clearly to senior leaders
Job Responsibility:
  • Act as the technical lead and owner for infrastructure analytics across compute, storage, and networking
  • Design and build durable, scalable data pipelines that ingest telemetry from clusters, schedulers, health systems, and capacity trackers into Data Warehouse
  • Define and standardize core metrics and semantics (e.g., utilization, occupancy, MFU, goodput, capacity readiness, delivery-to-production)
  • Architect and maintain self-service dashboards and APIs for fleet, cluster, and squad-level visibility
  • Partner closely with stakeholders across Supercomputing Infra, Researchers, Strategy and Executives to ensure metrics reflect operational and business reality
  • Implement robust and fault-tolerant systems for data ingestion and processing
  • Lead data architecture and engineering decisions, applying strong technical judgment to proactively shape executive-level discussions and decisions
  • Identify data gaps and instrumentation issues; drive fixes by influencing upstream engineering teams
  • Establish data quality, validation, documentation, and governance so metrics are trusted and repeatable
Employment Type:
Full-time

HPC Senior Technical Writer

In this position you will collaborate with knowledge management project leads an...
Location:
United States of America, Chippewa Falls
Salary:
81500.00 - 187500.00 USD / Year
Hewlett Packard Enterprise
Expiration Date
Until further notice
Requirements:
  • Bachelor's degree in Technical Communications, Computer Science, or related technical/communications field with 4-6 years related experience
  • Advanced University degree and 2-4 years' experience or equivalent
  • Understands concepts and develops in-depth working knowledge of products, applications, and systems in assigned area of responsibility
  • Ability to deliver on multiple project technical requirements, schedules, and information formats
  • Codes in HTML, DHTML, XML, JavaScript or similar as required
  • Applies developed subject matter knowledge to solve common and complex business issues and recommends appropriate alternatives
  • Works on problems of diverse complexity and scope
  • May act as a team or project leader providing direction to team activities and facilitates information validation and team decision making process
  • Exercises independent judgment to identify and select a solution
  • Knowledge of HPC system software and hardware components, including operating systems, programming languages, system monitoring applications, HPC storage, chassis, servers, compute nodes, blades, coolant systems, power supplies, high speed network switches and cabling, and more
Job Responsibility:
  • Create technical product documentation for software products and hardware
  • Analyze customer information requirements and product specifications to define scope of work and documentation plan
  • Identify and address the needs of all user groups, including end users, system administrators, internal support engineers, product developers, integration test teams, and training developers
  • Test documentation for install or administrative tasks to improve information deliverables and provide feedback on ease of use and user interfaces to product development
  • Manage workload in Jira and source management tools, including SDL, Oxygen, Git, and GitHub, to manage changes in the shared work environment
  • Create, revise, and manage content in Oxygen Author (DITA), Markdown, and other content tools
  • Work with developers, testers, product managers, technical support, and training to identify new features and content that needs to be reworked
What we offer:
  • Health & Wellbeing
  • Personal & Professional Development
  • Unconditional Inclusion
Employment Type:
Full-time

Technical Support Engineer

Arketa is building the operating system for modern fitness and wellness. Our mis...
Location:
Mexico, Mexico City
Salary:
45000.00 - 60000.00 USD / Year
Helpcare AI
Expiration Date
Until further notice
Requirements:
  • 2+ years experience as a software engineer
  • Strong written and spoken English fluency
  • Production experience in Node.js and React/TypeScript
  • High autonomy and ability to navigate ambiguity
  • Experience in a fast-moving environment, such as a small company or start-up
  • AI-tooling proficiency: AI as a force multiplier
  • Eligibility to work out of our Mexico City office
Job Responsibility:
  • Triage, investigate, and resolve bugs reported by customers and the support team
  • Own issues end-to-end: reproduce → diagnose → fix → validate → communicate
  • Collaborate closely with Support to provide timely updates and clarity on issues
  • Improve system reliability by identifying root causes (not just patching symptoms)
  • Maintain high-quality standards (testing, edge cases, regressions)
  • 60% talking to customers and fixing either bugs or user experience edge cases
  • 40% actually improving systems long term
What we offer:
  • Competitive Salary
  • Stock Options
  • Unlimited PTO
  • Ownership and Opportunity for Advancement
Employment Type:
Full-time

Production Specialist

This position (Production Specialist) is responsible for the preparation and com...
Location:
United States, Perrysburg
Salary:
Not provided
Formlabs GmbH
Expiration Date
Until further notice
Requirements:
  • Associate's degree or high school degree with production experience
  • Strong working knowledge of spreadsheets
  • Excellent verbal and written communication skills
  • Solid problem-solving abilities
  • Self-motivated with the ability to work on an independent basis
  • Strong organizational and time management skills
Job Responsibility:
  • Ensure a reliable live inventory of raw materials, WIP, or completed batches in the areas under his/her responsibility
  • Provide first-hand support to Production on resin issues, based on the actual process parameters, that may result in significant downtime
  • Suggest procedural and process improvements to meet the plant’s productivity and quality goals
  • Ensure compliance with all Spectra health and safety and local, state, and federal regulatory requirements
  • Responsible for creating, reviewing, submitting and completing the resin recipes required to accomplish the production schedules
  • Submit and complete all the sampling associated with the batching process so batches are fully approved before use
  • Ensure the effective utilization of raw materials in the batching area so that waste and obsolescence are minimized while accomplishing his/her duties
  • Follow and maintain all the documentation associated with his/her activities as defined in the Quality Management System
  • Create or complete projects or reports assigned or requested by Supervisor
  • Address any potential equipment or batching process issues that may affect or delay the batching schedule
What we offer:
  • Medical, Dental, Vision
  • 3% match of yearly salary for 401(k)
  • EAP Program
  • In-office catering 3x per week
  • 120 hours of PTO per year
Employment Type:
Full-time