Principal Supercomputing Operations Engineering Manager

Microsoft Corporation

Location:
United States, Multiple Locations

Contract Type:
Not provided

Salary:
139900.00 - 274800.00 USD / Year

Job Description:

Microsoft Azure’s Artificial Intelligence and High Performance Computing (AI/HPC) organization powers some of the world’s largest cloud-native supercomputers used for frontier AI training, scientific computing, and large-scale distributed simulations. Our team builds and operates hyperscale GPU clusters that consistently place Azure among global leaders in the Top500, MLPerf, and Graph500 benchmarks. By joining us, you step into the engineering core responsible for ensuring these systems remain reliable, performant, and ready for the next wave of AI innovation.

At this scale, interconnect fabrics are a first-order reliability system that directly determines GPU availability, training throughput, and customer SLAs. As a Principal Supercomputing Operations Engineering Manager, you own the operational strategy and organizational execution for interconnect fabric reliability across flagship AI supercomputing environments. You lead teams that operate InfiniBand and GPU interconnect fabrics as a single end-to-end reliability domain, defining how they are operated, debugged, hardened, and scaled in production.

This is a hands-on technical leadership role combined with people and operational management. You are accountable not only for technical outcomes, but also for building and leading high-performing engineering teams that consistently deliver availability, correctness, and resilience under extreme scale and ambiguity. You set expectations, drive execution through others, and ensure your team is prepared to respond decisively to the most complex production failures.

You lead and oversee the most severe fabric-related incidents, guiding technical direction, escalation strategy, and risk trade-offs while empowering senior engineers to execute deep investigations. Beyond incident response, you define operational strategy, reliability models, and systemic prevention mechanisms that reduce recurrence at fleet scale.

Your impact multiplies through organizational leadership: developing talent, setting operational standards, influencing engineering direction across organizations, and partnering deeply with platform, hardware, firmware, and service teams to deliver durable reliability improvements. You are responsible for ensuring that your organization produces high-quality automation, diagnostics, telemetry, playbooks, and escalation models that materially improve operability and debuggability across the platform. Through your leadership, judgment, and technical direction, Azure’s largest AI supercomputing platforms scale safely, predictably, and sustainably to meet the demands of next-generation AI workloads.

Job Responsibility:

  • Own and drive the end-to-end operational strategy for InfiniBand and GPU interconnect fabric reliability across large-scale AI supercomputing environments, ensuring sustained GPU availability, training stability, and SLA compliance
  • Lead, manage, and grow a team of senior and principal engineers responsible for fabric operations, setting clear expectations, developing talent, and holding the organization accountable for outcomes
  • Provide senior technical leadership and executive decision-making during high-severity fabric incidents, guiding investigation strategy, escalation paths, and risk trade-offs while ensuring effective execution through the team
  • Ensure consistent, high-quality incident response, root cause analysis, and post-incident follow-through across the organization, with a strong emphasis on systemic prevention over one-off fixes
  • Drive operational excellence by defining reliability models, failure domains, and long-term corrective strategies, and ensuring adoption of authoritative troubleshooting guides (TSGs), playbooks, and escalation frameworks
  • Partner deeply with platform, hardware, firmware, and service teams to align roadmaps, influence design decisions, and close systemic reliability gaps impacting interconnect fabrics at scale
  • Sponsor and prioritize automation, telemetry, diagnostics, and tooling investments that improve detection, observability, debuggability, and time to mitigation across the fleet

Requirements:

  • Bachelor's Degree in Computer Science or related technical discipline AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python
  • OR equivalent experience.
  • Ability to meet Microsoft, customer, and/or government security screening requirements is required for this role. These requirements include, but are not limited to, the following specialized security screenings: Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud Background Check upon hire/transfer and every two years thereafter.

Nice to have:

  • Bachelor's Degree in Computer Science OR related technical field AND 10+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, OR Python
  • OR Master's Degree in Computer Science or related technical field AND 8+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python
  • OR equivalent experience.
  • 4+ years people management experience.
  • 6+ years of experience operating large-scale distributed systems, high-performance computing (HPC), or artificial intelligence (AI) infrastructure in production environments
  • Demonstrated experience leading engineering teams responsible for mission-critical production infrastructure with direct impact on service availability, GPU workloads, and customer SLAs
  • Strong hands-on background in operating and debugging interconnect fabrics or similarly complex infrastructure supporting large-scale compute workloads
  • Solid Linux systems knowledge with experience reasoning across operating systems, drivers, services, and hardware layers
  • Proven ability to make high-impact technical and organizational decisions under ambiguity while balancing availability, risk, long-term correctness, and business impact

Additional Information:

Job Posted:
March 01, 2026

Employment Type:
Full-time
Work Type:
Remote work

Similar Jobs for Principal Supercomputing Operations Engineering Manager

Principal Software Engineer

Microsoft Azure High Performance Computing & AI Engineering (HPC & AI Eng) team ...
Location:
United States, Multiple Locations
Salary:
139900.00 - 274800.00 USD / Year
Microsoft Corporation
Expiration Date:
Until further notice
Requirements:
  • Bachelor's Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python - OR equivalent experience
  • 5+ years hands-on experience designing and developing high-volume, low-latency pipelines using products such as AzPubSub, Event Hubs, Azure Stream Analytics, Kafka, Grafana, Prometheus, or equivalent products
  • 3+ years of experience with one of AI/HPC system management OR High-Speed Networks OR HPC Storage OR managing Cloud Infrastructure
  • Ability to meet Microsoft, customer, and/or government security screening requirements is required for this role
  • Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud Background Check upon hire/transfer and every two years thereafter
Job Responsibility:
  • Architect, design, and develop high-volume, low-latency end-to-end event pipelines that can provide first-to-know insights on events that cause job interrupts and affect job reliability
  • Conduct analysis of existing event pipelines to evaluate the fidelity, granularity, and latency of critical events
  • Contribute to improving key metrics such as Job Mean Time to Interrupt, Nodes in Service, and Mean Time to Resolve on flagship supercomputers by enabling data scientists and domain experts to use the telemetry to identify events and issues at the intersection of datacenter and hardware, develop hypotheses, conduct A/B tests, and synthesize results
  • Partner with cross-organizational teams to evaluate available telemetry and latency, and drive the architecture, design, development, and deployment of end-to-end solutions to manage core infrastructure, including current and next-generation datacenter, IT hardware, power, and cooling technologies
  • Drive engineering and operational excellence based on issues and learnings from strategic customers on their usage scenarios to improve product features and capabilities
  • Partner with teams on continuous learning and continuous improvement programs by leading the resolution of complex incidents, driving root cause analyses, and championing initiatives to minimize future customer impact
Employment Type:
Full-time

Principal Software Engineer

Microsoft Azure Artificial Intelligence/High Performance Computing (AI/HPC) team...
Location:
United States, Multiple Locations
Salary:
139900.00 - 274800.00 USD / Year
Microsoft Corporation
Expiration Date:
Until further notice
Requirements:
  • Bachelor's Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python
  • OR equivalent experience.
  • Ability to meet Microsoft, customer, and/or government security screening requirements is required for this role. These requirements include, but are not limited to, the following specialized security screenings: Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud Background Check upon hire/transfer and every two years thereafter.
Job Responsibility:
  • Partner with appropriate stakeholders to determine user requirements for a set of scenarios.
  • Lead identification of dependencies and the development of design documents for a product, application, service, or platform, primarily catering towards exhaustive health monitoring of AI training supercomputers.
  • Build AI Supercomputer observability solutions at scale, with deep focus on actionability to improve availability and reliability of supercomputers.
  • Lead by example and mentor others to produce extensible and maintainable code used across products.
  • Leverage subject-matter expertise of cross-product features with appropriate stakeholders (e.g., project managers) to drive multiple groups’ project plans, release plans, and work items.
  • Hold accountability as a Designated Responsible Individual (DRI), mentoring engineers across products/solutions, working on-call to monitor system/product/service for degradation, downtime, or interruptions.
  • Proactively seek new knowledge and adapt to new trends, technical solutions, and patterns that will improve the availability, reliability, efficiency, observability, and performance of products while also driving consistency in monitoring and operations at scale and share knowledge with other engineers.
Employment Type:
Full-time

Customer Experience Specialist

Are you a car enthusiast with a passion for delivering an exceptional customer e...
Location:
Canada, Markham
Salary:
24.00 - 26.00 CAD / Hour
Randstad
Expiration Date:
April 20, 2026
Requirements:
  • Exceptional communication skills with a "customer-first" mindset
  • A genuine passion for vehicles, cars, or the automotive aftermarket industry
  • Technical aptitude—the ability to learn and explain automotive part differences
  • Proficiency in POS systems and basic computer applications (Outlook, Excel)
  • Problem-solving skills to handle customer inquiries with empathy and professionalism
  • Availability to work the full-time shift and occasionally stay a few minutes late to support a late-arriving customer
  • Ability to lift up to 50 lbs
Job Responsibility:
  • Welcoming customers to the immersive showroom and providing an engaging, high-touch brand experience
  • Educating customers on various brake rotors, pads, and accessories, explaining technical OEM specifications in simple, friendly terms
  • Managing the end-to-end sales process, from initial inquiry to final payment using POS systems
  • Assisting customers with order pickups and ensuring the showroom remains organized, visually appealing, and stocked
  • Physically handling inventory, including lifting and moving automotive parts weighing up to 50 lbs
  • Ensuring every customer leaves with a positive impression of the brand and its expertise.
What we offer:
  • Competitive hourly rate of $24.00 - $26.00 per hour
  • Full-time, permanent stability with a consistent Monday to Friday schedule (9:30 AM – 6:00 PM)
  • Comprehensive health benefits and employee purchase discounts
  • A vibrant workplace culture with frequent team celebrations and social events
  • Located in a transit-accessible area of Markham with free on-site parking
  • Real career growth opportunities within a scaling global distributor.
Employment Type:
Full-time

Mobile Associate - Retail Sales

Mobile Associates (MA) work as a member of a Retail Team of Experts to bring the...
Location:
United States, Yreka, California
Salary:
17.50 USD / Hour
T-Mobile
Expiration Date:
Until further notice
Requirements:
  • High School Diploma/GED
  • 6 months of customer service and/or sales experience, Retail environment preferred
  • Passionate customer advocate with the desire to be yourself when connecting and having fun doing it
  • Competitive drive and proven ability to succeed in a fast-paced sales environment
  • Willingness to work alongside peers and store leaders, learning and sharing ideas, while serving customers and providing resolutions to issues
  • Effective at balancing customer needs and performance goals
  • At least 18 years of age
  • Legally authorized to work in the United States
Job Responsibility:
  • Builds proficiency related to serving and selling to our customers, while providing a world-class customer experience and building loyalty
  • Helping customers pick up right where they left off in their shopping journey, whether online, through Customer Care or in-store
  • Exploring individual needs and providing hands-on demonstrations of the latest and greatest technology in-store
  • Side-by-side selling to find personalized solutions beyond the bare-bones device and service plan
  • Approaching service and sales needs with composure, integrity and compassion
  • Becomes skilled with and consistently uses digital tools in interactions and onboarding
  • Completes training on T-Mobile in-store experience, new skills and processes, knowledge of systems and reference resources
  • Makes the most of their time on shift, consistently seeking out information between customers, learning about innovations in wireless and technology
  • Establishes relationships with and partners with T-Mobile employees across channels, including business and customer service
  • Collectively own the customer experience and resolve issues, creating a seamless, run-around-free environment
What we offer:
  • Competitive base salary and compensation package
  • Annual stock grant
  • Employee stock purchase plan
  • 401(k)
  • Access to free, year-round money coaches
  • Medical, dental and vision insurance
  • Flexible spending account
  • Paid time off and up to 12 paid holidays
  • Paid parental and family leave
  • Family building benefits
Employment Type:
Part-time

Pharmacy Technician

We’re building a world of health around every individual — shaping a more connec...
Location:
United States, Hamden
Salary:
18.94 - 28.94 USD / Hour
CVS Health
Expiration Date:
April 27, 2026
Requirements:
  • Must comply with any state board of pharmacy requirements or laws governing the practice of pharmacy, which includes but is not limited to, age, education, and licensure/certification
  • If the state board of pharmacy does not address or mandate a minimum age requirement, must be at least 16 years of age
  • If the state board of pharmacy does not address or mandate a minimum educational requirement, must have a high school diploma or equivalent, or be actively enrolled in high school or high school equivalency program
  • Regular and predictable attendance, including nights and weekends
  • Ability to complete required training within designated timeframe
  • Attention and Focus: Ability to concentrate on a task over a period of time
  • Ability to pivot quickly from one task to another to meet patient and business needs
  • Ability to confirm prescription information and label accuracy, ensuring patient safety
  • Customer Service and Team Orientation: Actively look for ways to help people, and do so in a friendly manner
  • Notice and understand patients’ reactions, and respond appropriately
Job Responsibility:
  • Living our purpose by following all company SOPs at each workstation to help our Pharmacists manage and improve patient health
  • Following pharmacy workflow procedures at each pharmacy workstation (i.e., production, pick-up, drive-thru, and drop-off) for safe and accurate prescription fulfillment
  • Contributing to positive patient experiences by showing empathy and genuine care: creating heartfelt and personalized moments while serving patients at pick-up, drive-thru, and over the phone; keeping patients healthy by offering immunizations and other services at the register and over the phone; and demonstrating compassionate care by solving or escalating patient problems
  • Completing basic inventory activities, as permitted by law, and as directed by the pharmacy leadership team, such as accurately putting away medication deliveries and completing cycle counts, returns-to-stocks, waiting bin inventories, etc.
  • Contributing to a high-performing team, embracing a growth mindset, and being receptive to feedback; actively seeking opportunities to expand clinical and technical knowledge needed to better assist patients
  • Remaining flexible for both scheduling and business needs, while contributing to a safe, inclusive, and engaging team dynamic; voluntarily traveling to stores in the market to work shifts as needed by the business
What we offer:
  • Affordable medical plan options, a 401(k) plan (including matching company contributions), and an employee stock purchase plan
  • No-cost programs for all colleagues including wellness screenings, tobacco cessation and weight management programs, confidential counseling and financial coaching
  • Benefit solutions that address the different needs and preferences of our colleagues including paid time off, flexible work schedules, family leave, dependent care resources, colleague assistance programs, tuition assistance, retiree medical access and many other benefits depending on eligibility
Employment Type:
Full-time

Customer Support - French Speaker

Are you fluent in both French and English? Are you looking for an opportunity fo...
Location:
Portugal, Lisboa
Salary:
Not provided
Randstad
Expiration Date:
March 15, 2026
Requirements:
  • excellent communication skills in French and English
  • solution-oriented
  • basic knowledge and experience in a call center
  • strong commitment to a long-term project
  • good PC skills
Job Responsibility:
  • ensure a high solution rate and customer satisfaction at each contact through excellent know-how
  • responsibility for dealing with customer contacts by telephone
  • serving the customer to ensure their long-term loyalty
  • management of administrative issues related to administrative changes, invoicing (incl. collection) and complaints
Employment Type:
Full-time

District Support Pharmacist

We’re building a world of health around every individual — shaping a more connec...
Location:
United States, Yonkers
Salary:
65.00 - 81.00 USD / Hour
CVS Health
Expiration Date:
April 28, 2026
Requirements:
  • Active Pharmacist License in the state where the Store is located
  • Active National Provider Identifier (NPI)
  • Not on the DEA Excluded Parties list
  • Ability to travel within a reasonable radius to support market staffing as business needs require
  • Regular and predictable attendance, including nights and weekends
  • Ability to complete required training within designated timeframe
  • Attention and Focus: Ability to concentrate on a task over a period of time
  • Ability to pivot quickly from one task to another to meet patient and business needs
  • Ability to confirm prescription information and label accuracy, ensuring patient safety
  • Customer Service and Team Orientation: Actively look for ways to help people, and do so in a friendly manner
Job Responsibility:
  • Traveling the district to fill pharmacist shifts as scheduled by the District Performance Coordinator (DPC); overseeing the pharmacy and serving as the Pharmacy Manager’s proxy during bench shifts without overlap
  • Supporting safe and accurate prescription fulfillment by following—and directing the pharmacy team to follow—pharmacy workflow procedures and utilizing the safety guardrails at every workstation
  • Assumes Pharmacy Manager’s day-to-day duties when serving as the only or the primary pharmacist-on-duty
  • Contributing to positive patient experiences by showing empathy and genuine care and coaching the pharmacy team to do the same: demonstrating compassionate care, collaborating with the patient’s total healthcare team, and proactively resolving insurance and/or medication issues
  • Proactively offering and delivering immunizations to keep patients healthy; engaging and supporting Pharmacy Technicians to learn to immunize
  • Supporting the effective management of pharmacy inventory in all pharmacies worked by following—and guiding the pharmacy team to follow—all inventory best practices, with a special focus on protecting cold chain products for our patients and our business
  • Remaining flexible for both scheduling and business needs, while contributing to a safe, inclusive, and engaging team dynamic
  • Maintaining relevant clinical and technical skills for the job as the industry evolves (including but not limited to company-required trainings and CMEs)
What we offer:
  • Affordable medical plan options, a 401(k) plan (including matching company contributions), and an employee stock purchase plan
  • No-cost programs for all colleagues including wellness screenings, tobacco cessation and weight management programs, confidential counseling and financial coaching
  • Benefit solutions that address the different needs and preferences of our colleagues including paid time off, flexible work schedules, family leave, dependent care resources, colleague assistance programs, tuition assistance, retiree medical access and many other benefits depending on eligibility
Employment Type:
Full-time

Mobile Associate - Retail Sales

Mobile Associates (MA) work as a member of a Retail Team of Experts to bring the...
Location:
United States, Charleston
Salary:
17.50 USD / Hour
T-Mobile
Expiration Date:
Until further notice
Requirements:
  • High School Diploma/GED
  • 6 months of customer service and/or sales experience, Retail environment preferred
  • Passionate customer advocate
  • Competitive drive and proven ability to succeed in a fast-paced sales environment
  • Willingness to work alongside peers and store leaders
  • Effective at balancing customer needs and performance goals
  • At least 18 years of age
  • Legally authorized to work in the United States
Job Responsibility:
  • Builds proficiency related to serving and selling to our customers
  • Helping customers pick up right where they left off in their shopping journey
  • Exploring individual needs and providing hands-on demonstrations of the latest technology
  • Side-by-side selling to find personalized solutions
  • Approaching service and sales needs with composure, integrity and compassion
  • Becomes skilled with and consistently uses digital tools in interactions and onboarding
  • Completes training on T-Mobile in-store experience, new skills and processes
  • Makes the most of their time on shift
  • Establishes relationships with and partners with T-Mobile employees across channels
  • Collectively own the customer experience and resolve issues
What we offer:
  • Competitive base salary and compensation package
  • Annual stock grant
  • Employee stock purchase plan
  • 401(k)
  • Access to free, year-round money coaches
  • Medical, dental and vision insurance
  • Flexible spending account
  • Paid time off
  • Up to 12 paid holidays
  • Paid parental and family leave
Employment Type:
Part-time