Senior Supercomputing Operations Engineer

Microsoft Corporation

Location:
United States, Multiple Locations

Contract Type:
Not provided

Salary:
119800.00 - 234700.00 USD / Year

Job Description:

Microsoft Azure’s Artificial Intelligence and High‑Performance Computing (AI/HPC) organization powers some of the world’s largest cloud‑native supercomputers used for frontier AI training, scientific computing, and large‑scale distributed simulations. Our team builds and operates hyperscale GPU clusters that consistently place Azure among global leaders in the Top500, MLPerf, and Graph500 benchmarks. By joining us, you step into the engineering core responsible for ensuring these systems remain reliable, performant, and ready for the next wave of AI innovation. At this supercomputing scale, reliability and operational excellence are engineering challenges of their own.

As a Senior Supercomputing Operations Engineer, you will own day‑to‑day operations of InfiniBand and GPU interconnect fabrics and treat them as a single, mission‑critical reliability domain that directly impacts GPU availability, training throughput, and customer SLAs. You will lead incident triage and mitigation, debug complex fabric‑layer failures, and correlate telemetry across nodes, switches, Subnet Manager (SM) behavior, and GPU subsystems to identify true root causes. Your work will focus on resolving real production incidents at scale, improving operational readiness, and preventing recurrence through better tooling, automation, and deep systems understanding.

You will build and use state‑of‑the‑art tools to detect issues proactively, close operational gaps, and improve observability across our fabrics. You will contribute to troubleshooting guides (TSGs), operational playbooks, and escalation guides while partnering with internal engineering teams and industry‑leading manufacturers to drive meaningful fixes. The solutions you develop and the operational improvements you drive will improve the reliability of Azure’s largest supercomputing deployments and directly support the most compute‑intensive AI workloads running in the cloud.

Job Responsibility:

  • Act as DRI for InfiniBand and GPU interconnect fabric operations, ensuring GPU availability and AI training stability
  • Lead incident triage, mitigation, recovery, and root cause analysis for fabric-related production issues
  • Perform deep multi-layer debugging across InfiniBand, Subnet Manager, GPU interconnect, PCIe, and GPU interactions
  • Drive operational excellence and prevention by identifying systemic failure patterns and authoring TSGs, playbooks, and escalation guides
  • Build and leverage automation, telemetry, and tooling to improve detection, debuggability, and mean time to mitigation

Requirements:

  • Bachelor's Degree in Computer Science or related technical field AND 4+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python, OR equivalent experience
  • Ability to meet Microsoft, customer and/or government security screening requirements
  • Microsoft Cloud Background Check upon hire/transfer and every two years thereafter
  • Master's Degree in Computer Science or related technical field AND 8+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR Bachelor's Degree in Computer Science or related technical field AND 12+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
  • 4+ years of experience operating high-performance computing (HPC), artificial intelligence (AI), or large-scale distributed systems in production environments
  • Hands-on experience operating interconnect fabrics for HPC, AI, or large-scale distributed systems in production
  • Strong Linux systems knowledge with demonstrated experience debugging low-level infrastructure issues
  • Demonstrated ability to reason across hardware, firmware, drivers, and software stacks to diagnose and resolve production issues
  • Familiarity with InfiniBand Subnet Manager behavior, including routing, congestion control, and fabric telemetry

Additional Information:

Job Posted:
March 04, 2026

Employment Type:
Full-time

Work Type:
Remote work