CrawlJobs Logo

Principal Supercomputing Operations Engineering Manager

https://www.microsoft.com/ Logo

Microsoft Corporation

Location Icon

Location:
United States , Multiple Locations

Category Icon

Job Type Icon

Contract Type:
Not provided

Salary Icon

Salary:

139900.00 - 274800.00 USD / Year

Job Description:

Microsoft Azure’s Artificial Intelligence and High Performance Computing (AI/HPC) organization powers some of the world’s largest cloud native supercomputers used for frontier AI training, scientific computing, and large scale distributed simulations. Our team builds and operates hyperscale GPU clusters that consistently place Azure among global leaders in the Top500, MLPerf, and Graph500 benchmarks. By joining us, you step into the engineering core responsible for ensuring these systems remain reliable, performant, and ready for the next wave of AI innovation. At this scale, interconnect fabrics are a first order reliability system that directly determines GPU availability, training throughput, and customer SLAs. As a Principal Supercomputing Operations Engineering Manager, you own the operational strategy and organizational execution for interconnect fabric reliability across flagship AI supercomputing environments. You lead teams that operate InfiniBand and GPU interconnect fabrics as a single end to end reliability domain, defining how they are operated, debugged, hardened, and scaled in production. This is a hands on technical leadership role combined with people and operational management. You are accountable not only for technical outcomes, but for building and leading high performing engineering teams that consistently deliver availability, correctness, and resilience under extreme scale and ambiguity. You set expectations, drive execution through others, and ensure your team is prepared to respond decisively to the most complex production failures. You lead and oversee the most severe fabric related incidents, guiding technical direction, escalation strategy, and risk trade offs while empowering senior engineers to execute deep investigations. Beyond incident response, you define operational strategy, reliability models, and systemic prevention mechanisms that reduce recurrence at fleet scale. Your impact multiplies through organizational leadership: developing talent, setting operational standards, influencing engineering direction across organizations, and partnering deeply with platform, hardware, firmware, and service teams to deliver durable reliability improvements. You are responsible for ensuring that your organization produces high quality automation, diagnostics, telemetry, playbooks, and escalation models that materially improve operability and debuggability across the platform. Through your leadership, judgment, and technical direction, Azure’s largest AI supercomputing platforms scale safely, predictably, and sustainably to meet the demands of next generation AI workloads.

Job Responsibility:

  • Own and drive the end to end operational strategy for InfiniBand and GPU interconnect fabric reliability across large scale AI supercomputing environments, ensuring sustained GPU availability, training stability, and SLA compliance
  • Lead, manage, and grow a team of senior and principal engineers responsible for fabric operations, setting clear expectations, developing talent, and holding the organization accountable for outcomes
  • Provide senior technical leadership and executive decision making during high severity fabric incidents, guiding investigation strategy, escalation paths, and risk trade offs while ensuring effective execution through the team
  • Ensure consistent, high quality incident response, root cause analysis, and post incident follow through across the organization, with a strong emphasis on systemic prevention over one off fixes
  • Drive operational excellence by defining reliability models, failure domains, and long term corrective strategies, and ensuring adoption of authoritative TSGs, playbooks, and escalation frameworks
  • Partner deeply with platform, hardware, firmware, and service teams to align roadmaps, influence design decisions, and close systemic reliability gaps impacting interconnect fabrics at scale
  • Sponsor and prioritize automation, telemetry, diagnostics, and tooling investments that improve detection, observability, debuggability, and time to mitigation across the fleet

Requirements:

  • Bachelor's Degree in Computer Science, or related technical discipline AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python
  • OR equivalent experience.
  • Ability to meet Microsoft, customer and/or government security screening requirements are required for this role. These requirements include, but are not limited to the following specialized security screenings: Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud Background Check upon hire/transfer and every two years thereafter.

Nice to have:

  • Bachelor's Degree in Computer Science OR related technical field AND 10+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, OR Python
  • OR Master's Degree in Computer Science or related technical field AND 8+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python
  • OR equivalent experience.
  • 4+ years people management experience.
  • 6+ years of experience operating largescale distributed systems, highperformance computing (HPC), or artificial intelligence (AI) infrastructure in production environments
  • Demonstrated experience leading engineering teams responsible for mission critical production infrastructure with direct impact on service availability, GPU workloads, and customer SLAs
  • Strong hands-on background in operating and debugging interconnect fabrics or similarly complex infrastructure supporting largescale compute workloads
  • Solid Linux systems knowledge with experience reasoning across operating systems, drivers, services, and hardware layers
  • Proven ability to make highimpact technical and organizational decisions under ambiguity while balancing availability, risk, longterm correctness, and business impact

Additional Information:

Job Posted:
March 01, 2026

Employment Type:
Fulltime
Work Type:
Remote work
Job Link Share:

Looking for more opportunities? Search for other job offers that match your skills and interests.

Briefcase Icon

Similar Jobs for Principal Supercomputing Operations Engineering Manager

Senior Principal Engineering Manager

Microsoft Research (MSR) is working to transform the future of artificial intell...
Location
Location
United States , Redmond
Salary
Salary:
163000.00 - 296400.00 USD / Year
https://www.microsoft.com/ Logo
Microsoft Corporation
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • Bachelor's Degree in Computer Science or related technical field AND 8+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
  • 5+ years of people management experience leading software engineering teams, including managing principal engineers
  • Experience building or operating infrastructure for large-scale distributed systems, cloud platforms, or artificial intelligence (AI)/machine learning(ML) workloads
  • Track record of driving execution on complex, multi-workstream infrastructure projects with clear milestones and accountability
  • Technical fluency in one or more of: large-scale compute clusters, GPU infrastructure, scheduling and orchestration (Kubernetes, Volcano), or High-Performance Compute (HPC) environments
  • Experience with GPU programming (CUDA, NCCL) and frameworks such as PyTorch
  • Expertise in networking (InfiniBand, NVLink), storage systems, or distributed training parallelisms
  • A track record of strong cross-functional partnerships, including the ability to align on strategic direction, deliver joint accountabilities, and develop relationships with staff members with widely varied expertise
  • Experience scaling engineering teams through significant growth phases (hiring, onboarding, and integrating new engineers into a high-performing team)
  • Master's Degree in Computer Science or related technical field AND 12+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR Bachelor's Degree in Computer Science or related technical field AND 15+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
Job Responsibility
Job Responsibility
  • Lead, mentor, and grow the engineering team that builds MSR’s AI research infrastructure
  • Recruit and develop exceptional engineering talent, building a diverse team - including hiring, onboarding, career development, and performance management
  • Drive execution across the team by setting clear goals, tracking milestones, managing dependencies, and ensuring accountability for delivering complex infrastructure projects on time and at high quality
  • Lead team culture and process changes, cultivating an AI-first mentality that accelerates our progress through agentic coding, automation, and skills development
  • Provide technical vision and judgment on the team's architecture, strategy, and roadmap — spanning supercomputer GPU clusters, high performance networking, workload optimization, researcher tools, and agentic workflows — while empowering engineers to own deep technical details
  • Collaborate closely cross-discipline with engineers, program managers, and research and science teams to align priorities, resolve dependencies, and build better solutions together
  • Foster a team culture of operational excellence, continuous improvement, and high psychological safety where engineers are empowered to take ownership and innovate
  • Fulltime
Read More
Arrow Right

Principal Supercomputing Operations Software Engineer

Microsoft Azure’s Artificial Intelligence and High Performance Computing (AI/HPC...
Location
Location
United States , Multiple Locations
Salary
Salary:
139900.00 - 274800.00 USD / Year
https://www.microsoft.com/ Logo
Microsoft Corporation
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • Bachelor's Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python
  • OR equivalent experience
  • Ability to meet Microsoft, customer and/or government security screening requirements
  • Microsoft Cloud Background Check
  • 6+ years of experience operating large‑scale distributed systems, high‑performance computing (HPC), or artificial intelligence (AI) infrastructure in production environments
  • Demonstrated ownership of mission‑critical production infrastructure with direct impact on service availability, GPU workloads, and customer SLAs
  • Hands‑on experience operating and debugging interconnect fabrics supporting large‑scale compute workloads
  • Strong Linux systems knowledge with experience debugging low‑level infrastructure issues across operating systems, drivers, and services
  • Proven ability to reason across hardware, firmware, drivers, and software stacks to diagnose and resolve complex production issues
Job Responsibility
Job Responsibility
  • Serve as the technical authority and DRI for InfiniBand and GPU interconnect fabric operations across large scale AI supercomputing environments, ensuring sustained GPU availability, training stability, and SLA compliance
  • Lead and orchestrate complex, high severity fabric incidents end to end, including detection, triage, mitigation, recovery, and root cause analysis, making high impact decisions under ambiguity
  • Perform deep, multi layer systems debugging across InfiniBand, Subnet Manager, GPU interconnect, PCIe, GPUs, firmware, drivers, and OS layers to identify true root causes at fleet scale
  • Drive operational excellence and systemic prevention by identifying recurring failure patterns, defining reliability models and failure domains, and authoring authoritative TSGs, playbooks, and escalation frameworks adopted across teams
  • Architect and drive automation, telemetry, diagnostics, and tooling that materially improve detection, observability, debuggability, and mean time to mitigation, raising the operational bar for interconnect fabrics across the platform
  • Fulltime
Read More
Arrow Right

Principal Software Engineer

Microsoft Azure High Performance Computing & AI Engineering (HPC & AI Eng) team ...
Location
Location
United States , Multiple Locations
Salary
Salary:
139900.00 - 274800.00 USD / Year
https://www.microsoft.com/ Logo
Microsoft Corporation
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • Bachelor's Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python - OR equivalent experience
  • 5+ years hands on experience designing and developing high volume low latency pipelines using products such as AzPubSub, Event Hubs, Azure Stream Analytics, Kafka, Grafana, Event Hubs, Prometheus or equivalent products
  • 3+ years of experience with one of AI/HPC system management OR High-Speed Networks OR HPC Storage OR managing Cloud Infrastructure
  • Ability to meet Microsoft, customer and/or government security screening requirements are required for this role
  • Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud Background Check upon hire/transfer and every two years thereafter
Job Responsibility
Job Responsibility
  • Architect, design and develop high volume low latency end to end event pipelines that can provide first-to-know-insights on events causing job interrupts and job reliability
  • Conduct analysis of existing event pipelines to evaluate fidelity, granularity and latency of critical events
  • Contribute to improving key metrics such as Job Mean Time to Interrupt, Nodes in Service, Mean Time to Resolve on flagship supercomputers by enabling data scientists and domain experts to use the telemetry to identify events & issues at the intersection of datacenter and hardware, develop hypothesis, conduct A/B tests and synthesize results
  • Partner with cross organizational teams to evaluate available telemetry and latency drive architecture, design, development and deployment of end-to-end solutions to manage core infrastructure including current & next generation datacenter, IT hardware, power & cooling technologies
  • Drive engineering and operational excellence based on issues and learnings from strategic customers on their usage scenarios to improve product features and capabilities
  • Partner with teams on continuous learning and continuous improvement programs by leading the resolution of complex incidents, driving root cause analyses and championing initiatives to minimize future customer impact
  • Fulltime
Read More
Arrow Right

Principal Software Engineer

Microsoft Azure Artificial Intelligence/High Performance Computing (AI/HPC) team...
Location
Location
United States , Multiple Locations
Salary
Salary:
139900.00 - 274800.00 USD / Year
https://www.microsoft.com/ Logo
Microsoft Corporation
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • Bachelor's Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python
  • OR equivalent experience.
  • Ability to meet Microsoft, customer and/or government security screening requirements are required for this role. These requirements include, but are not limited to the following specialized security screenings: Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud Background Check upon hire/transfer and every two years thereafter.
Job Responsibility
Job Responsibility
  • Partner with appropriate stakeholders to determine user requirements for a set of scenarios.
  • Lead identification of dependencies and the development of design documents for a product, application, service, or platform, primarily catering towards exhaustive health monitoring of AI training supercomputers.
  • Build AI Supercomputer observability solutions at scale, with deep focus on actionability to improve availability and reliability of supercomputers.
  • Lead by example and mentor others to produce extensible and maintainable code used across products.
  • Leverage subject-matter expertise of cross-product features with appropriate stakeholders (e.g., project managers) to drive multiple groups’ project plans, release plans, and work items.
  • Hold accountability as a Designated Responsible Individual (DRI), mentoring engineers across products/solutions, working on-call to monitor system/product/service for degradation, downtime, or interruptions.
  • Proactively seek new knowledge and adapt to new trends, technical solutions, and patterns that will improve the availability, reliability, efficiency, observability, and performance of products while also driving consistency in monitoring and operations at scale and share knowledge with other engineers.
  • Fulltime
Read More
Arrow Right
New

Esri/ArcGIS Specialist

The Software Development Specialist Advisor role at NTT DATA involves maintainin...
Location
Location
United States , Arlington
Salary
Salary:
105840.00 - 245000.00 USD / Year
nttdata.com Logo
NTT DATA
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • Minimum of 10 years' experience with application development roles/responsibilities
  • Master's degree
  • One-and-one-half (1.5) years of additional experience can substitute for one (1) year of a typical degree program
  • Ability to obtain a Public Trust clearance
Job Responsibility
Job Responsibility
  • Maintaining and enhancing the ArcGIS Enterprise platform
  • Providing geospatial analysis, data management, and visualization capabilities
  • Developing and managing geospatial data workflows, integrating data from systems like CAPTURE
  • Creating custom scripts using Python for Extract, Transform, Load (ETL) processes
  • Configuring and optimizing ArcGIS components, including ArcGIS Server, Portal for ArcGIS, and ArcGIS Data Store
  • Ensuring compliance with FedRAMP and other government security standards
  • Troubleshooting system performance
  • Supporting role-based access control (RBAC)
  • Creating advanced mapping and spatial analysis solutions to support law enforcement and fugitive apprehension missions
  • Collaborating with stakeholders to deliver secure, scalable, and reliable geospatial solutions
What we offer
What we offer
  • Medical, dental, and vision insurance with an employer contribution
  • Flexible spending or health savings account
  • Life and AD&D insurance
  • Short and long term disability coverage
  • Paid time off
  • Employee assistance
  • Participation in a 401k program with company match
  • Additional voluntary or legally-required benefits
  • Incentive compensation based on individual and/or company performance
  • Fulltime
Read More
Arrow Right
New

Senior Workday Analyst

JOB PROFILE SUMMARY THE SENIOR ERP SYSTEMS ANALYST IS A KEY TECHNICAL RESOURCE ...
Location
Location
United States , Fort Worth
Salary
Salary:
60.00 - 72.00 USD / Hour
medasource.com Logo
Medasource
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • Bachelor's degree required
  • 3+ years of direct ERP systems experience
  • working knowledge of Workday, UKG WFM Pro or comparable ERP system
Job Responsibility
Job Responsibility
  • Analysis, design, configuration, and ongoing support of the organization’s Enterprise Resource Planning (ERP) systems
  • translating complex business requirements into effective system solutions
  • ensuring optimal performance
  • configuring the systems
  • administering system security
  • working closely with cross-functional teams to implement system enhancements
  • troubleshooting complex issues
  • ensuring the ERP system aligns with strategic business objectives
  • assisting in mentoring and training ERP Systems Analysts
What we offer
What we offer
  • Competitive medical, dental, vision, Health Savings Account, Dependent Care FSA, and supplemental coverage
  • 401k plan with company match
  • paid time off
  • sick time
  • paid company holidays
  • Employee Assistance Program (EAP)
  • Fulltime
Read More
Arrow Right
New

Pharmacy Technician

We’re building a world of health around every individual — shaping a more connec...
Location
Location
United States , Boston
Salary
Salary:
17.00 - 27.00 USD / Hour
https://www.cvshealth.com/ Logo
CVS Health
Expiration Date
June 06, 2026
Flip Icon
Requirements
Requirements
  • Must comply with any state board of pharmacy requirements or laws governing the practice of pharmacy, which includes but is not limited to, age, education, and licensure/certification
  • If the state board of pharmacy does not address or mandate a minimum age requirement, must be at least 16 years of age
  • If the state board of pharmacy does not address or mandate a minimum educational requirement, must have a high school diploma or equivalent, or be actively enrolled in high school or high school equivalency program
  • Regular and predictable attendance, including nights and weekends
  • Ability to complete required training within designated timeframe
  • Attention and Focus: Ability to concentrate on a task over a period of time
  • Ability to pivot quickly from one task to another to meet patient and business needs
  • Ability to confirm prescription information and label accuracy, ensuring patient safety
  • Customer Service and Team Orientation: Actively look for ways to help people, and do so in a friendly manner
  • Notice and understand patients’ reactions, and respond appropriately
Job Responsibility
Job Responsibility
  • Living our purpose by following all company SOPs at each workstation to help our Pharmacists manage and improve patient health
  • Following pharmacy workflow procedures at each pharmacy workstation (i.e., production, pick-up, drive-thru, and drop-off) for safe and accurate prescription fulfillment
  • Contributing to positive patient experiences by showing empathy and genuine care: creating heartfelt and personalized moments while serving patients at pick-up, drive-thru, and over the phone
  • keeping patients healthy by offering immunizations and other services at the register and over the phone
  • and demonstrating compassionate care by solving or escalating patient problems
  • Completing basic inventory activities, as permitted by law, and as directed by the pharmacy leadership team, such as accurately putting away medication deliveries and completing cycle counts, returns-to-stocks, waiting bin inventories, etc.
  • Contributing to a high-performing team, embracing a growth mindset, and being receptive to feedback
  • actively seeking opportunities to expand clinical and technical knowledge needed to better assist patients
  • Remaining flexible for both scheduling and business needs, while contributing to a safe, inclusive, and engaging team dynamic
  • voluntarily traveling to stores in the market to work shifts as needed by the business
What we offer
What we offer
  • dental
  • vision
  • wellness resources
  • employee discounts
  • access to certain voluntary benefits
  • other programs
  • Parttime
Read More
Arrow Right
New

It Field Technician

Our client, a leading health system, is seeking IT Field Technicians for a 4-wee...
Location
Location
United States , West Valley City
Salary
Salary:
23.00 - 28.00 USD / Hour
medasource.com Logo
Medasource
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • Prior experience in IT asset management, hardware decommissioning, or IT equipment handling
  • Ability to lift and carry equipment (monitors, arms, peripherals)
  • physical work is a core part of this role
  • Comfortable working in an active office/IT environment with shifting daily priorities
  • Familiarity with barcode scanners and asset tagging systems
  • Strong organizational skills and attention to detail
  • all equipment must be accurately stored and accounted for
  • Ability to follow direction from on-site leads and adjust priorities as the scope evolves
Job Responsibility
Job Responsibility
  • Remove monitor arms and monitors from cubicle walls and desks throughout the building
  • Conduct physical inventory of all on-site equipment, including monitors, docking stations, cables, peripherals, and other IT assets
  • Scan asset tags on monitors using a barcode scanner and document findings in Excel/Smartsheet
  • Sort and stage equipment by disposition: keep, consolidate, trash, or flag for e-recycling pickup
  • Move equipment within the building as directed, including between floors and towers (North to South) as needed
  • Support the DXC area on the North side of the building with staging and sorting
  • Assist the Media Services team with sorting and disposing of AV storage across the building
  • Perform general cleanup and trash removal throughout the facility
  • Maintain accurate, up-to-date inventory documentation throughout the project
  • Participate in daily stand-up meetings to align on priorities and report on progress
What we offer
What we offer
  • competitive medical, dental, vision, Health Savings Account, Dependent Care FSA, and supplemental coverage with plans that can fit each employee’s needs
  • 401k plan that includes a company match and is fully vested after you become eligible
  • paid time off
  • sick time
  • paid company holidays
  • Employee Assistance Program (EAP) that provides services like virtual counseling, financial services, legal services, life coaching, etc.
  • Fulltime
Read More
Arrow Right