CrawlJobs Logo

Principal Supercomputing Operations Software Engineer

United States, Multiple Locations 139900.00 - 274800.00 USD / Year · Job Posted March 22, 2026
Apply Position
Job Link Share

Job Description

Microsoft Azure’s Artificial Intelligence and High Performance Computing (AI/HPC) organization powers some of the world’s largest cloud native supercomputers used for frontier AI training, scientific computing, and large scale distributed simulations. Our team builds and operates hyperscale GPU clusters that consistently place Azure among global leaders in the Top500, MLPerf, and Graph500 benchmarks. By joining us, you step into the engineering core responsible for ensuring these systems remain reliable, performant, and ready for the next wave of AI innovation.

Job Responsibility

  • Serve as the technical authority and DRI for InfiniBand and GPU interconnect fabric operations across large scale AI supercomputing environments, ensuring sustained GPU availability, training stability, and SLA compliance
  • Lead and orchestrate complex, high severity fabric incidents end to end, including detection, triage, mitigation, recovery, and root cause analysis, making high impact decisions under ambiguity
  • Perform deep, multi layer systems debugging across InfiniBand, Subnet Manager, GPU interconnect, PCIe, GPUs, firmware, drivers, and OS layers to identify true root causes at fleet scale
  • Drive operational excellence and systemic prevention by identifying recurring failure patterns, defining reliability models and failure domains, and authoring authoritative TSGs, playbooks, and escalation frameworks adopted across teams
  • Architect and drive automation, telemetry, diagnostics, and tooling that materially improve detection, observability, debuggability, and mean time to mitigation, raising the operational bar for interconnect fabrics across the platform

Requirements

  • Bachelor's Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python
  • OR equivalent experience
  • Ability to meet Microsoft, customer and/or government security screening requirements
  • Microsoft Cloud Background Check
  • 6+ years of experience operating large‑scale distributed systems, high‑performance computing (HPC), or artificial intelligence (AI) infrastructure in production environments
  • Demonstrated ownership of mission‑critical production infrastructure with direct impact on service availability, GPU workloads, and customer SLAs
  • Hands‑on experience operating and debugging interconnect fabrics supporting large‑scale compute workloads
  • Strong Linux systems knowledge with experience debugging low‑level infrastructure issues across operating systems, drivers, and services
  • Proven ability to reason across hardware, firmware, drivers, and software stacks to diagnose and resolve complex production issues

Nice to have

  • Bachelor's Degree in Computer Science OR related technical field AND 10+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, OR Python
  • OR Master's Degree in Computer Science or related technical field AND 8+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python
  • OR equivalent experience

Looking for more opportunities?

Search for other job offers that match your skills and interests.

Similar Jobs for

Principal Supercomputing Operations Software Engineer

8 matching positions

Principal Software Engineer

Microsoft Azure High Performance Computing & AI Engineering (HPC & AI Eng) team ...
Location
Location
United States , Multiple Locations
Salary
Salary:
139900.00 - 274800.00 USD / Year
https://www.microsoft.com/ Logo
Microsoft Corporation
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • Bachelor's Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python - OR equivalent experience
  • 5+ years hands on experience designing and developing high volume low latency pipelines using products such as AzPubSub, Event Hubs, Azure Stream Analytics, Kafka, Grafana, Event Hubs, Prometheus or equivalent products
  • 3+ years of experience with one of AI/HPC system management OR High-Speed Networks OR HPC Storage OR managing Cloud Infrastructure
  • Ability to meet Microsoft, customer and/or government security screening requirements are required for this role
  • Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud Background Check upon hire/transfer and every two years thereafter
Job Responsibility
Job Responsibility
  • Architect, design and develop high volume low latency end to end event pipelines that can provide first-to-know-insights on events causing job interrupts and job reliability
  • Conduct analysis of existing event pipelines to evaluate fidelity, granularity and latency of critical events
  • Contribute to improving key metrics such as Job Mean Time to Interrupt, Nodes in Service, Mean Time to Resolve on flagship supercomputers by enabling data scientists and domain experts to use the telemetry to identify events & issues at the intersection of datacenter and hardware, develop hypothesis, conduct A/B tests and synthesize results
  • Partner with cross organizational teams to evaluate available telemetry and latency drive architecture, design, development and deployment of end-to-end solutions to manage core infrastructure including current & next generation datacenter, IT hardware, power & cooling technologies
  • Drive engineering and operational excellence based on issues and learnings from strategic customers on their usage scenarios to improve product features and capabilities
  • Partner with teams on continuous learning and continuous improvement programs by leading the resolution of complex incidents, driving root cause analyses and championing initiatives to minimize future customer impact
  • Fulltime
Read More
Arrow Right

Principal Software Engineer

Microsoft Azure Artificial Intelligence/High Performance Computing (AI/HPC) team...
Location
Location
United States , Multiple Locations
Salary
Salary:
139900.00 - 274800.00 USD / Year
https://www.microsoft.com/ Logo
Microsoft Corporation
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • Bachelor's Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python
  • OR equivalent experience.
  • Ability to meet Microsoft, customer and/or government security screening requirements are required for this role. These requirements include, but are not limited to the following specialized security screenings: Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud Background Check upon hire/transfer and every two years thereafter.
Job Responsibility
Job Responsibility
  • Partner with appropriate stakeholders to determine user requirements for a set of scenarios.
  • Lead identification of dependencies and the development of design documents for a product, application, service, or platform, primarily catering towards exhaustive health monitoring of AI training supercomputers.
  • Build AI Supercomputer observability solutions at scale, with deep focus on actionability to improve availability and reliability of supercomputers.
  • Lead by example and mentor others to produce extensible and maintainable code used across products.
  • Leverage subject-matter expertise of cross-product features with appropriate stakeholders (e.g., project managers) to drive multiple groups’ project plans, release plans, and work items.
  • Hold accountability as a Designated Responsible Individual (DRI), mentoring engineers across products/solutions, working on-call to monitor system/product/service for degradation, downtime, or interruptions.
  • Proactively seek new knowledge and adapt to new trends, technical solutions, and patterns that will improve the availability, reliability, efficiency, observability, and performance of products while also driving consistency in monitoring and operations at scale and share knowledge with other engineers.
  • Fulltime
Read More
Arrow Right

Senior Principal Engineering Manager

Microsoft Research (MSR) is working to transform the future of artificial intell...
Location
Location
United States , Redmond
Salary
Salary:
163000.00 - 296400.00 USD / Year
https://www.microsoft.com/ Logo
Microsoft Corporation
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • Bachelor's Degree in Computer Science or related technical field AND 8+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
  • 5+ years of people management experience leading software engineering teams, including managing principal engineers
  • Experience building or operating infrastructure for large-scale distributed systems, cloud platforms, or artificial intelligence (AI)/machine learning(ML) workloads
  • Track record of driving execution on complex, multi-workstream infrastructure projects with clear milestones and accountability
  • Technical fluency in one or more of: large-scale compute clusters, GPU infrastructure, scheduling and orchestration (Kubernetes, Volcano), or High-Performance Compute (HPC) environments
  • Experience with GPU programming (CUDA, NCCL) and frameworks such as PyTorch
  • Expertise in networking (InfiniBand, NVLink), storage systems, or distributed training parallelisms
  • A track record of strong cross-functional partnerships, including the ability to align on strategic direction, deliver joint accountabilities, and develop relationships with staff members with widely varied expertise
  • Experience scaling engineering teams through significant growth phases (hiring, onboarding, and integrating new engineers into a high-performing team)
  • Master's Degree in Computer Science or related technical field AND 12+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR Bachelor's Degree in Computer Science or related technical field AND 15+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
Job Responsibility
Job Responsibility
  • Lead, mentor, and grow the engineering team that builds MSR’s AI research infrastructure
  • Recruit and develop exceptional engineering talent, building a diverse team - including hiring, onboarding, career development, and performance management
  • Drive execution across the team by setting clear goals, tracking milestones, managing dependencies, and ensuring accountability for delivering complex infrastructure projects on time and at high quality
  • Lead team culture and process changes, cultivating an AI-first mentality that accelerates our progress through agentic coding, automation, and skills development
  • Provide technical vision and judgment on the team's architecture, strategy, and roadmap — spanning supercomputer GPU clusters, high performance networking, workload optimization, researcher tools, and agentic workflows — while empowering engineers to own deep technical details
  • Collaborate closely cross-discipline with engineers, program managers, and research and science teams to align priorities, resolve dependencies, and build better solutions together
  • Foster a team culture of operational excellence, continuous improvement, and high psychological safety where engineers are empowered to take ownership and innovate
  • Fulltime
Read More
Arrow Right
New

IT Training Lead

The IT Training Lead will drive technology learning and user adoption across the...
Location
Location
United States , Delray Beach
Salary
Salary:
Not provided
https://www.roberthalf.com Logo
Robert Half
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • Experience in IT training, instructional design, technical enablement, or learning and development
  • Strong knowledge of Microsoft 365
  • Excellent communication, facilitation, and content development skills
  • Ability to translate technical concepts into practical, user-friendly training.
Job Responsibility
Job Responsibility
  • Design, develop, and deliver IT training programs in instructor-led, virtual, and self-paced formats
  • Take lead in the Microsoft Copilot and AI training strategy, including onboarding, advanced use cases, responsible AI usage, and ongoing enablement
  • Partner with IT leadership to support new technology rollouts, system upgrades, and digital transformation initiatives
  • Create and maintain training content, including videos, guides, tutorials, and job aids
  • Identify skill gaps and develop targeted learning solutions to improve adoption and productivity
  • Gather feedback and measure training effectiveness to continuously improve programs.
Read More
Arrow Right
New

K Kitchen Representative

The position includes, but is not limited to, the following essential job duties...
Location
Location
United States , New Albany
Salary
Salary:
Not provided
https://www.circlek.com Logo
Circle K
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • Excellent communication skills
  • Team player who can work well with others or independently
  • Acts with integrity
  • keeps commitments
  • Contagious positive attitude
  • Focuses on achieving results while having fun
  • Frequently bend, twist at waist, kneel, squat, stand, and walk
  • Occasionally climb and descend ladders
  • Tolerate extreme cold and hot temperatures and work in and around fryers, ovens, grills, coolers, freezers, sharp objects, and loud noises
  • Reach, grasp, and manipulate objects with hands for entire shift, including reaching for objects overhead
Job Responsibility
Job Responsibility
  • Provides excellent guest service in a fast and friendly manner
  • Maintains a clean restaurant environment by cleaning and performing general housekeeping duties
  • Prepares and serves food items in accordance with all Brand, Company, and health department regulations
  • Ensures product quality, food safety, and operational standards are met
  • Keeps accurate cash, sales, and inventory control records
  • Follows all government laws and safety codes
  • Completes reports on all incidents following our 5-minute rule policy
  • Lives our Company values: One Team, Do the Right Thing, Takes Ownership, Play to Win
What we offer
What we offer
  • Medical, Dental, Vision, Term Life and AD&D plans
  • Flexible spending and health savings accounts (FT)
  • Vacation paid time off
  • Company holidays paid at time and a half
  • Matching 401(k)
  • Tuition Reimbursement
  • Stock Purchase Plan
  • Employee Discount Program
  • Discount Meal Benefit
  • Wellness Plan
Read More
Arrow Right
New

K Kitchen Representative

Location
Location
United States , Decatur
Salary
Salary:
Not provided
https://www.circlek.com Logo
Circle K
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • Excellent communication skills
  • Team player who can work well with others or independently
  • Acts with integrity
  • keeps commitments
  • Contagious positive attitude
  • Focuses on achieving results while having fun
  • Frequently bend, twist at waist, kneel, squat, stand, and walk
  • Occasionally climb and descend ladders
  • Tolerate extreme cold and hot temperatures and work in and around fryers, ovens, grills, coolers, freezers, sharp objects, and loud noises
  • Reach, grasp, and manipulate objects with hands for entire shift, including reaching for objects overhead
Job Responsibility
Job Responsibility
  • Provides excellent guest service in a fast and friendly manner
  • Maintains a clean restaurant environment by cleaning and performing general housekeeping duties
  • Prepares and serves food items in accordance with all Brand, Company, and health department regulations
  • Ensures product quality, food safety, and operational standards are met
  • Keeps accurate cash, sales, and inventory control records
  • Follows all government laws and safety codes
  • Completes reports on all incidents following our 5-minute rule policy
  • Lives our Company values: One Team, Do the Right Thing, Takes Ownership, Play to Win
What we offer
What we offer
  • Medical, Dental, Vision, Term Life and AD&D plans
  • Flexible spending and health savings accounts (FT)
  • Vacation paid time off
  • Company holidays paid at time and a half
  • Matching 401(k)
  • Tuition Reimbursement
  • Stock Purchase Plan
  • Employee Discount Program
  • Discount Meal Benefit
  • Wellness Plan
Read More
Arrow Right
New

Restaurant Assistant Manager

This position assists the Restaurant Manager (RM) with daily operations of the r...
Location
Location
United States , Holly Springs
Salary
Salary:
Not provided
https://www.circlek.com Logo
Circle K
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • Full time required
  • availability during all hours of operation and at least one hour pre-opening and post-closing required
  • Valid state Driver's License required
  • Excellent communication skills
  • Motivates, coaches, and leads team members
  • Acts with integrity
  • keeps commitments
  • Contagious positive attitude
  • Focuses on achieving results while having fun
  • Ability to gain control during stressful situations
Job Responsibility
Job Responsibility
  • Assists the Restaurant Manager with daily operations of the restaurant and supervises the team in their absence
  • Leads and coaches Restaurant Team Members and partners with the management team to maintain the Company and Brand operational standards
  • Provides excellent guest service in a fast and friendly manner
  • coaches and corrects team
  • Conducts second interviews for team members and shift leads
  • Maintains a clean restaurant environment by cleaning and performing general housekeeping duties
  • Assigns shift duties to team members and follows up to ensure completion
  • Directs team and ensures all food items are prepared and served in accordance with all Brand, Company, and health department regulations
  • Coaches team members to follow guidelines for food preparation and production management
  • Cascades relevant information to team members and assists with new product training
What we offer
What we offer
  • Unlimited tip pooling
  • Medical, Dental, Vision, Term Life and AD&D plans
  • Flexible spending and health savings accounts
  • Short-Term Disability
  • Vacation paid time off
  • Company holidays paid at time and a half
  • Matching 401(k)
  • Tuition Reimbursement
  • Stock Purchase Plan
  • Employee Discount Program
  • Fulltime
Read More
Arrow Right
New

Plant Operator - Crushing and Screen

Are you an experienced and ticketed Machine Operator looking for stable, high-ho...
Location
Location
Australia , Petrie
Salary
Salary:
42.00 - 52.00 AUD / Hour
https://www.randstad.com Logo
Randstad
Expiration Date
July 09, 2026
Flip Icon
Requirements
Requirements
  • Proven Experience working in a quarry, concrete recycling, or heavy industrial yard
  • Current tickets for Front-End Loader (LL) and Excavator (LE)
  • Truck License: Heavy Rigid (HR) or higher is highly regarded
  • Reliability with strong work ethic and punctuality
  • Own reliable vehicle and current driver's license
Job Responsibility
Job Responsibility
  • Safe and efficient operation of heavy machinery in a fast-paced recycling and quarry environment
  • Operating Front-End Loaders
  • Operating Excavators utilized as material handlers
  • Operating Moxy (Articulated Dump Trucks) and other yard machinery as required
  • Assisting with daily machinery pre-starts, basic maintenance, and ensuring the yard runs smoothly
  • Adhering strictly to site health and safety protocols
What we offer
What we offer
  • Top Rates: $42.00 to $52.00 per hour + overtime penalties
  • Big Hours: Consistent 40 to 55-hour work weeks
  • Career Progression: Pathway from casual to permanent full-time employment within 3-6 months
  • Local Work: Convenient Brisbane Northside location (Petrie)
  • Immediate Start
  • Fulltime
Read More
Arrow Right