Senior Supercomputing Operations Engineer

Microsoft Corporation

Location:
United States, Multiple Locations

Contract Type:
Not provided

Salary:

119800.00 - 234700.00 USD / Year

Job Description:

Microsoft Azure’s Artificial Intelligence and High‑Performance Computing (AI/HPC) organization powers some of the world’s largest cloud‑native supercomputers used for frontier AI training, scientific computing, and large‑scale distributed simulations. Our team builds and operates hyperscale GPU clusters that consistently place Azure among global leaders in the Top500, MLPerf, and Graph500 benchmarks. By joining us, you step into the engineering core responsible for ensuring these systems remain reliable, performant, and ready for the next wave of AI innovation. At this supercomputing scale, reliability and operational excellence are engineering challenges of their own.

As a Senior Supercomputing Operations Engineer, you will own day‑to‑day operations of InfiniBand and GPU interconnect fabrics, treating them as a single, mission‑critical reliability domain that directly impacts GPU availability, training throughput, and customer SLAs. You will lead incident triage and mitigation, debug complex fabric‑layer failures, and correlate telemetry across nodes, switches, Subnet Manager (SM) behavior, and GPU subsystems to identify true root causes. Your work will focus on resolving real production incidents at scale, improving operational readiness, and preventing recurrence through better tooling, automation, and deep systems understanding.

You will build and use state‑of‑the‑art tools to detect issues proactively, close operational gaps, and improve observability across our fabrics. You will contribute to TSGs, operational playbooks, and escalation guides while partnering with internal engineering teams and industry‑leading manufacturers to drive meaningful fixes. The solutions you develop and the operational improvements you drive will uplift the reliability of Azure’s largest supercomputing deployments and directly support the most compute‑intensive AI workloads running in the cloud.

Job Responsibility:

  • Act as DRI for InfiniBand and GPU interconnect fabric operations, ensuring GPU availability and AI training stability
  • Lead incident triage, mitigation, recovery, and root cause analysis for fabric-related production issues
  • Perform deep multi-layer debugging across InfiniBand, Subnet Manager, GPU interconnect, PCIe, and GPU interactions
  • Drive operational excellence and prevention by identifying systemic failure patterns and authoring TSGs, playbooks, and escalation guides
  • Build and leverage automation, telemetry, and tooling to improve detection, debuggability, and mean time to mitigation

Requirements:

  • Bachelor's Degree in Computer Science or related technical field AND 4+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
  • Ability to meet Microsoft, customer and/or government security screening requirements
  • Microsoft Cloud Background Check upon hire/transfer and every two years thereafter
  • Master's Degree in Computer Science or related technical field AND 8+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR Bachelor's Degree in Computer Science or related technical field AND 12+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
  • 4+ years of experience operating high-performance computing (HPC), artificial intelligence (AI), or large-scale distributed systems in production environments
  • Hands-on experience operating interconnect fabrics for HPC, AI, or large-scale distributed systems in production
  • Strong Linux systems knowledge with demonstrated experience debugging low-level infrastructure issues
  • Demonstrated ability to reason across hardware, firmware, drivers, and software stacks to diagnose and resolve production issues
  • Familiarity with InfiniBand Subnet Manager behavior, including routing, congestion control, and fabric telemetry

Additional Information:

Job Posted:
March 04, 2026

Employment Type:
Full-time
Work Type:
Remote work

Similar Jobs for Senior Supercomputing Operations Engineer

Senior Research Engineer

The HPE HPC & AI EMEA Research Lab (ERL) is characterized by a unique blend of i...
Location:
Germany, Munich, Berlin
Salary:
Not provided
Hewlett Packard Enterprise
Expiration Date:
Until further notice
Requirements:
  • Development experience in compiled languages such as C, C++ or Fortran and experience with interpreted environments such as Python
  • At least a B.Sc. equivalent in a Science, Technology, Engineering or Mathematical discipline
  • Parallel programming experience, with programming models such as OpenMP, MPI, CUDA, OpenACC, HIP, PGAS languages, etc.
  • An understanding of AI/ML frameworks, experience with frameworks such as TensorFlow or PyTorch is highly desirable
  • An interest in system- and data center monitoring and operational data analysis
  • Professional language skills in English and German
Job Responsibility:
  • Perform world-class research while also shaping products of the future
  • Work with the most esteemed research partners across Europe
  • Enable high performance research software on pre-Exascale and Exascale supercomputers
  • Provide new environments/abstractions to support application developers to build, deploy, and run applications taking advantage of leading-edge hardware at scale
  • Make and operate HPC/AI systems and datacenters in a sustainable way
  • Manage modern data-intensive workloads in high performance environments
What we offer:
  • Competitive salary and extensive benefits package (pension scheme, insurances, bike and car leasing, and other fringe benefits)
  • Work-life balance (flexible working time and hybrid workplace model, 30 vacation days, four HPE Wellness-Fridays, up to six months paid parental leave)
  • Support for education, training, and career development
  • Diverse and dynamic work environment

Principal Supercomputing Operations Engineering Manager

Microsoft Azure’s Artificial Intelligence and High Performance Computing (AI/HPC...
Location:
United States, Multiple Locations
Salary:
139900.00 - 274800.00 USD / Year
Microsoft Corporation
Expiration Date:
Until further notice
Requirements:
  • Bachelor's Degree in Computer Science or related technical discipline AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
  • Ability to meet Microsoft, customer, and/or government security screening requirements is required for this role. These requirements include, but are not limited to, the following specialized security screening: Microsoft Cloud Background Check. This position will be required to pass the Microsoft Cloud Background Check upon hire/transfer and every two years thereafter.
Job Responsibility:
  • Own and drive the end-to-end operational strategy for InfiniBand and GPU interconnect fabric reliability across large-scale AI supercomputing environments, ensuring sustained GPU availability, training stability, and SLA compliance
  • Lead, manage, and grow a team of senior and principal engineers responsible for fabric operations, setting clear expectations, developing talent, and holding the organization accountable for outcomes
  • Provide senior technical leadership and executive decision-making during high-severity fabric incidents, guiding investigation strategy, escalation paths, and risk trade-offs while ensuring effective execution through the team
  • Ensure consistent, high-quality incident response, root cause analysis, and post-incident follow-through across the organization, with a strong emphasis on systemic prevention over one-off fixes
  • Drive operational excellence by defining reliability models, failure domains, and long-term corrective strategies, and ensuring adoption of authoritative TSGs, playbooks, and escalation frameworks
  • Partner deeply with platform, hardware, firmware, and service teams to align roadmaps, influence design decisions, and close systemic reliability gaps impacting interconnect fabrics at scale
  • Sponsor and prioritize automation, telemetry, diagnostics, and tooling investments that improve detection, observability, debuggability, and time to mitigation across the fleet
Employment Type:
Full-time

Senior Technical Program Manager (Supercomputing Operations)

Azure Specialized Compute drives the hardware roadmap, software and services tha...
Location:
United States, Multiple Locations
Salary:
119800.00 - 234700.00 USD / Year
Microsoft Corporation
Expiration Date:
Until further notice
Requirements:
  • Bachelor's Degree AND 4+ years experience in engineering, product/technical program management, data analysis, or product development OR equivalent experience
  • 2+ years of experience managing cross-functional and/or cross-team projects
  • Ability to meet Microsoft, customer, and/or government security screening requirements is required for this role. These requirements include, but are not limited to, the following specialized security screening: Microsoft Cloud Background Check. This position will be required to pass the Microsoft Cloud Background Check upon hire/transfer and every two years thereafter.
Job Responsibility:
  • Drive highly impactful cross-organization initiatives around end-to-end experiences spanning multiple Azure products and services
  • Manage complex programs and projects to successful completion
  • Mentor junior members of the team and provide support to our Product Management team
What we offer:
  • Certain roles may be eligible for benefits and other compensation.
Employment Type:
Full-time

Senior Principal Engineering Manager

Microsoft Research (MSR) is working to transform the future of artificial intell...
Location:
United States, Redmond
Salary:
163000.00 - 296400.00 USD / Year
Microsoft Corporation
Expiration Date:
Until further notice
Requirements:
  • Bachelor's Degree in Computer Science or related technical field AND 8+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
  • 5+ years of people management experience leading software engineering teams, including managing principal engineers
  • Experience building or operating infrastructure for large-scale distributed systems, cloud platforms, or artificial intelligence (AI)/machine learning (ML) workloads
  • Track record of driving execution on complex, multi-workstream infrastructure projects with clear milestones and accountability
  • Technical fluency in one or more of: large-scale compute clusters, GPU infrastructure, scheduling and orchestration (Kubernetes, Volcano), or High-Performance Compute (HPC) environments
  • Experience with GPU programming (CUDA, NCCL) and frameworks such as PyTorch
  • Expertise in networking (InfiniBand, NVLink), storage systems, or distributed training parallelisms
  • A track record of strong cross-functional partnerships, including the ability to align on strategic direction, deliver joint accountabilities, and develop relationships with staff members with widely varied expertise
  • Experience scaling engineering teams through significant growth phases (hiring, onboarding, and integrating new engineers into a high-performing team)
  • Master's Degree in Computer Science or related technical field AND 12+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR Bachelor's Degree in Computer Science or related technical field AND 15+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
Job Responsibility:
  • Lead, mentor, and grow the engineering team that builds MSR’s AI research infrastructure
  • Recruit and develop exceptional engineering talent, building a diverse team - including hiring, onboarding, career development, and performance management
  • Drive execution across the team by setting clear goals, tracking milestones, managing dependencies, and ensuring accountability for delivering complex infrastructure projects on time and at high quality
  • Lead team culture and process changes, cultivating an AI-first mentality that accelerates our progress through agentic coding, automation, and skills development
  • Provide technical vision and judgment on the team's architecture, strategy, and roadmap — spanning supercomputer GPU clusters, high performance networking, workload optimization, researcher tools, and agentic workflows — while empowering engineers to own deep technical details
  • Collaborate closely cross-discipline with engineers, program managers, and research and science teams to align priorities, resolve dependencies, and build better solutions together
  • Foster a team culture of operational excellence, continuous improvement, and high psychological safety where engineers are empowered to take ownership and innovate
Employment Type:
Full-time

Member of Technical Staff, Infrastructure Data & Analytics

We are seeking experienced Infrastructure Data & Analytics Engineers to join our...
Location:
United States, Multiple Locations; Mountain View; San Francisco Bay area; New York City metropolitan area
Salary:
139900.00 - 274800.00 USD / Year
Microsoft Corporation
Expiration Date:
Until further notice
Requirements:
  • Bachelor’s degree in computer science or related technical field AND 8+ years technical engineering experience with data engineering, analytics, or data science, with increasing technical ownership in a startup environment AND 6+ years experience with distributed data processing frameworks and large-scale data systems OR equivalent experience
  • Master's Degree in Computer Science or related technical field AND 12+ years technical engineering experience with data engineering, analytics, or data science, with increasing technical ownership in a startup environment AND 10+ years experience with distributed data processing frameworks and large-scale data systems OR equivalent experience
  • Proven technical leadership in data engineering, analytics platforms, or large-scale telemetry systems
  • Hands-on experience with ETL orchestration frameworks such as Airflow, Dagster, or similar
  • Strong communication skills; can explain complex systems clearly to senior leaders
Job Responsibility:
  • Act as the technical lead and owner for infrastructure analytics across compute, storage, and networking
  • Design and build durable, scalable data pipelines that ingest telemetry from clusters, schedulers, health systems, and capacity trackers into Data Warehouse
  • Define and standardize core metrics and semantics (e.g., utilization, occupancy, MFU, goodput, capacity readiness, delivery-to-production)
  • Architect and maintain self-service dashboards and APIs for fleet, cluster, and squad-level visibility
  • Partner closely with stakeholders across Supercomputing Infra, Researchers, Strategy and Executives to ensure metrics reflect operational and business reality
  • Implement robust and fault-tolerant systems for data ingestion and processing
  • Lead data architecture and engineering decisions, applying strong technical judgment to proactively shape executive-level discussions and decisions
  • Identify data gaps and instrumentation issues; drive fixes by influencing upstream engineering teams
  • Establish data quality, validation, documentation, and governance so metrics are trusted and repeatable
Employment Type:
Full-time

HPC Senior Technical Writer

In this position you will collaborate with knowledge management project leads an...
Location:
United States of America, Chippewa Falls
Salary:
81500.00 - 187500.00 USD / Year
Hewlett Packard Enterprise
Expiration Date:
Until further notice
Requirements:
  • Bachelor's degree in Technical Communications, Computer Science, or related technical/communications field with 4-6 years related experience
  • Advanced University degree and 2-4 years' experience or equivalent
  • Understands concepts and develops in-depth working knowledge of products, applications, and systems in assigned area of responsibility
  • Ability to deliver on multiple project technical requirements, schedules, and information formats
  • Codes in HTML, DHTML, XML, JavaScript or similar as required
  • Applies developed subject matter knowledge to solve common and complex business issues and recommends appropriate alternatives
  • Works on problems of diverse complexity and scope
  • May act as a team or project leader providing direction to team activities and facilitates information validation and team decision making process
  • Exercises independent judgment to identify and select a solution
  • Knowledge of HPC system software and hardware components, including operating systems, programming languages, system monitoring applications, HPC storage, chassis, servers, compute nodes, blades, coolant systems, power supplies, high-speed network switches and cabling, and more
Job Responsibility:
  • Create technical product documentation for software products and hardware
  • Analyze customer information requirements and product specifications to define scope of work and documentation plan
  • Identify and address the needs of all user groups, including end users, system administrators, internal support engineers, product developers, integration test teams, and training developers
  • Test documentation for install or administrative tasks to improve information deliverables and provide feedback on ease of use and user interfaces to product development
  • Manage workload in Jira and source management tools, including SDL, Oxygen, Git, and GitHub, to manage changes in the shared work environment
  • Create, revise, and manage content in Oxygen Author (DITA), Markdown, and other content tools
  • Work with developers, testers, product managers, technical support, and training to identify new features and content that needs to be reworked
What we offer:
  • Health & Wellbeing
  • Personal & Professional Development
  • Unconditional Inclusion
Employment Type:
Full-time

Receptionist

We are looking for a detail-oriented Receptionist to join our team in Miami, Flo...
Location:
United States, Miami
Salary:
Not provided
Robert Half
Expiration Date:
Until further notice
Requirements:
  • Proficiency in using a multi-line phone system for managing calls
  • Strong customer service skills with the ability to handle inquiries professionally
  • Experience in data entry with attention to detail and accuracy
  • Ability to communicate effectively through email correspondence
  • Excellent interpersonal skills to interact with staff and visitors
  • Competence in Microsoft Excel, Outlook, and Word for administrative tasks
  • Organizational skills to manage files and maintain office order
  • Capability to schedule appointments and coordinate meetings efficiently
Job Responsibility:
  • Oversee access to the office, ensuring security and proper protocols are followed
  • Manage the stocking and organization of supplies in the kitchens to maintain efficiency
  • Maintain the cleanliness and orderliness of the office environment to ensure a neat appearance
  • Handle incoming phone calls using a multi-line phone system, providing courteous and efficient service
  • Assist with scheduling appointments and coordinating meetings as needed
  • Perform accurate data entry tasks to support administrative functions
  • Organize and maintain files, ensuring easy accessibility and proper documentation
  • Communicate effectively via email to address inquiries and provide information
  • Execute various ad hoc projects and tasks as assigned to support office operations
What we offer:
  • Medical, vision, dental, life, and disability insurance
  • Eligibility to enroll in the company 401(k) plan

Psychiatrist

Astrya Global, a San Diego–based medical staffing agency, is hiring Psychiatrist...
Location:
United States, San Bernardino
Salary:
Not provided
Astrya Global
Expiration Date:
Until further notice
Requirements:
  • Active CA licensure
  • Board certification
  • Active CA DEA
  • New graduates eligible
Job Responsibility:
  • Evaluate and diagnose mental health disorders
  • Develop and implement treatment plans
  • See 15+ patients per day, completing initial and follow-up appointments
  • Prescribe and refill medications
  • Collaborate with up to 4 NPs as needed
What we offer:
  • Malpractice Insurance
  • Weekly pay
  • Full-service credentialing and licensing department
  • Dedicated corporate travel team with airfare, car rental and hotel booking
  • Referral Bonus up to $5,000
Employment Type:
Full-time