Senior Supercomputing Operations Engineer Job at Microsoft Corporation (Multiple Locations)

Senior Technical Program Manager (Supercomputing Operations)

Azure Specialized Compute drives the hardware roadmap, software and services tha...

Location

United States, Multiple Locations

Salary:

119800.00 - 234700.00 USD / Year

Microsoft Corporation

Expiration Date

Until further notice

Requirements

Bachelor's Degree AND 4+ years experience in engineering, product/technical program management, data analysis, or product development OR equivalent experience
2+ years of experience managing cross-functional and/or cross-team projects
Ability to meet Microsoft, customer, and/or government security screening requirements is required for this role. These requirements include, but are not limited to, the following specialized security screening: Microsoft Cloud Background Check. This position will be required to pass the Microsoft Cloud Background Check upon hire/transfer and every two years thereafter.

Job Responsibility

Drive highly impactful cross-organization initiatives around end-to-end experiences spanning multiple Azure products and services
Manage complex programs and projects to successful completion
Mentor junior members of the team and provide support to our Product Management team

What we offer

Certain roles may be eligible for benefits and other compensation.

Full-time

Principal Supercomputing Operations Engineering Manager

Microsoft Azure’s Artificial Intelligence and High Performance Computing (AI/HPC...

Location

United States, Multiple Locations

Salary:

139900.00 - 274800.00 USD / Year

Microsoft Corporation

Expiration Date

Until further notice

Requirements

Bachelor's Degree in Computer Science or related technical discipline AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
Ability to meet Microsoft, customer, and/or government security screening requirements is required for this role. These requirements include, but are not limited to, the following specialized security screening: Microsoft Cloud Background Check. This position will be required to pass the Microsoft Cloud Background Check upon hire/transfer and every two years thereafter.

Job Responsibility

Own and drive the end-to-end operational strategy for InfiniBand and GPU interconnect fabric reliability across large-scale AI supercomputing environments, ensuring sustained GPU availability, training stability, and SLA compliance
Lead, manage, and grow a team of senior and principal engineers responsible for fabric operations, setting clear expectations, developing talent, and holding the organization accountable for outcomes
Provide senior technical leadership and executive decision-making during high-severity fabric incidents, guiding investigation strategy, escalation paths, and risk trade-offs while ensuring effective execution through the team
Ensure consistent, high-quality incident response, root cause analysis, and post-incident follow-through across the organization, with a strong emphasis on systemic prevention over one-off fixes
Drive operational excellence by defining reliability models, failure domains, and long-term corrective strategies, and ensuring adoption of authoritative TSGs, playbooks, and escalation frameworks
Partner deeply with platform, hardware, firmware, and service teams to align roadmaps, influence design decisions, and close systemic reliability gaps impacting interconnect fabrics at scale
Sponsor and prioritize automation, telemetry, diagnostics, and tooling investments that improve detection, observability, debuggability, and time to mitigation across the fleet

Full-time

Senior Research Engineer

The HPE HPC & AI EMEA Research Lab (ERL) is characterized by a unique blend of i...

Location

Germany, Munich, Berlin

Salary:

Not provided

Hewlett Packard Enterprise

Expiration Date

Until further notice

Requirements

Development experience in compiled languages such as C, C++ or Fortran and experience with interpreted environments such as Python
At least a B.Sc. equivalent in a Science, Technology, Engineering or Mathematical discipline
Parallel programming experience, with programming models such as OpenMP, MPI, CUDA, OpenACC, HIP, PGAS languages, etc.
An understanding of AI/ML frameworks; experience with frameworks such as TensorFlow or PyTorch is highly desirable
An interest in system and data center monitoring and operational data analysis
Professional language skills in English and German

Job Responsibility

Perform world-class research while also shaping products of the future
Work with the most esteemed research partners across Europe
Enable high performance research software on pre-Exascale and Exascale supercomputers
Provide new environments/abstractions to support application developers to build, deploy, and run applications taking advantage of leading-edge hardware at scale
Make and operate HPC/AI systems and datacenters in a sustainable way
Manage modern data-intensive workloads in high performance environments

What we offer

Competitive salary and extensive benefits package (pension scheme, insurance, bike and car leasing, and other fringe benefits)
Work-life balance (flexible working time and hybrid workplace model, 30 vacation days, four HPE Wellness-Fridays, up to six months paid parental leave)
Support for education, training, and career development
Diverse and dynamic work environment

Senior Principal Engineering Manager

Microsoft Research (MSR) is working to transform the future of artificial intell...

Location

United States, Redmond

Salary:

163000.00 - 296400.00 USD / Year

Microsoft Corporation

Expiration Date

Until further notice

Requirements

Bachelor's Degree in Computer Science or related technical field AND 8+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
5+ years of people management experience leading software engineering teams, including managing principal engineers
Experience building or operating infrastructure for large-scale distributed systems, cloud platforms, or artificial intelligence (AI)/machine learning (ML) workloads
Track record of driving execution on complex, multi-workstream infrastructure projects with clear milestones and accountability
Technical fluency in one or more of: large-scale compute clusters, GPU infrastructure, scheduling and orchestration (Kubernetes, Volcano), or High-Performance Compute (HPC) environments
Experience with GPU programming (CUDA, NCCL) and frameworks such as PyTorch
Expertise in networking (InfiniBand, NVLink), storage systems, or distributed training parallelisms
A track record of strong cross-functional partnerships, including the ability to align on strategic direction, deliver joint accountabilities, and develop relationships with staff members with widely varied expertise
Experience scaling engineering teams through significant growth phases (hiring, onboarding, and integrating new engineers into a high-performing team)
Master's Degree in Computer Science or related technical field AND 12+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR Bachelor's Degree in Computer Science or related technical field AND 15+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience

Job Responsibility

Lead, mentor, and grow the engineering team that builds MSR’s AI research infrastructure
Recruit and develop exceptional engineering talent, building a diverse team through hiring, onboarding, career development, and performance management
Drive execution across the team by setting clear goals, tracking milestones, managing dependencies, and ensuring accountability for delivering complex infrastructure projects on time and at high quality
Lead team culture and process changes, cultivating an AI-first mentality that accelerates our progress through agentic coding, automation, and skills development
Provide technical vision and judgment on the team's architecture, strategy, and roadmap — spanning supercomputer GPU clusters, high performance networking, workload optimization, researcher tools, and agentic workflows — while empowering engineers to own deep technical details
Collaborate closely cross-discipline with engineers, program managers, and research and science teams to align priorities, resolve dependencies, and build better solutions together
Foster a team culture of operational excellence, continuous improvement, and high psychological safety where engineers are empowered to take ownership and innovate

Full-time

HPC Senior Technical Writer

In this position you will collaborate with knowledge management project leads an...

Location

United States of America, Chippewa Falls

Salary:

81500.00 - 187500.00 USD / Year

Hewlett Packard Enterprise

Expiration Date

Until further notice

Requirements

Bachelor's degree in Technical Communications, Computer Science, or related technical/communications field with 4-6 years related experience
Advanced University degree and 2-4 years' experience or equivalent
Understands concepts and develops in-depth working knowledge of products, applications, and systems in assigned area of responsibility
Ability to deliver on multiple project technical requirements, schedules, and information formats
Codes in HTML, DHTML, XML, JavaScript or similar as required
Applies developed subject matter knowledge to solve common and complex business issues and recommends appropriate alternatives
Works on problems of diverse complexity and scope
May act as a team or project leader providing direction to team activities and facilitates information validation and team decision making process
Exercises independent judgment to identify and select a solution
Knowledge of HPC system software and hardware components, including operating systems, programming languages, system monitoring applications, HPC storage, chassis, servers, compute nodes, blades, coolant systems, power supplies, high-speed network switches and cabling, and more

Job Responsibility

Create technical product documentation for software products and hardware
Analyze customer information requirements and product specifications to define scope of work and documentation plan
Identify and address the needs of all user groups, including end users, system administrators, internal support engineers, product developers, integration test teams, and training developers
Test documentation for install or administrative tasks to improve information deliverables and provide feedback on ease of use and user interfaces to product development
Manage workload in Jira and source management tools, including SDL, Oxygen, Git, and GitHub, to track changes in the shared work environment
Create, revise, and manage content in Oxygen Author (DITA), Markdown, and other content tools
Work with developers, testers, product managers, technical support, and training to identify new features and content that needs to be reworked

What we offer

Health & Wellbeing
Personal & Professional Development
Unconditional Inclusion

Full-time

Member of Technical Staff, Infrastructure Data & Analytics

We are seeking experienced Infrastructure Data & Analytics Engineers to join our...

Location

United States, Multiple Locations; Mountain View; San Francisco Bay area; New York City metropolitan area

Salary:

139900.00 - 274800.00 USD / Year

Microsoft Corporation

Expiration Date

Until further notice

Requirements

Bachelor’s degree in computer science or related technical field AND 8+ years technical engineering experience with data engineering, analytics, or data science, with increasing technical ownership in a startup environment AND 6+ years experience with distributed data processing frameworks and large-scale data systems OR equivalent experience
Master's Degree in Computer Science or related technical field AND 12+ years technical engineering experience with data engineering, analytics, or data science, with increasing technical ownership in a startup environment AND 10+ years experience with distributed data processing frameworks and large-scale data systems OR equivalent experience
Proven technical leadership in data engineering, analytics platforms, or large-scale telemetry systems
Hands-on experience with ETL orchestration frameworks such as Airflow, Dagster, or similar
Strong communication skills; can explain complex systems clearly to senior leaders

Job Responsibility

Act as the technical lead and owner for infrastructure analytics across compute, storage, and networking
Design and build durable, scalable data pipelines that ingest telemetry from clusters, schedulers, health systems, and capacity trackers into the data warehouse
Define and standardize core metrics and semantics (e.g., utilization, occupancy, MFU, goodput, capacity readiness, delivery-to-production)
Architect and maintain self-service dashboards and APIs for fleet, cluster, and squad-level visibility
Partner closely with stakeholders across Supercomputing Infra, Researchers, Strategy and Executives to ensure metrics reflect operational and business reality
Implement robust and fault-tolerant systems for data ingestion and processing
Lead data architecture and engineering decisions, applying strong technical judgment to proactively shape executive-level discussions and decisions
Identify data gaps and instrumentation issues; drive fixes by influencing upstream engineering teams
Establish data quality, validation, documentation, and governance so metrics are trusted and repeatable

Full-time

IT Training Lead

The IT Training Lead will drive technology learning and user adoption across the...

Location

United States, Delray Beach

Salary:

Not provided

Robert Half

Expiration Date

Until further notice

Requirements

Experience in IT training, instructional design, technical enablement, or learning and development
Strong knowledge of Microsoft 365
Excellent communication, facilitation, and content development skills
Ability to translate technical concepts into practical, user-friendly training

Job Responsibility

Design, develop, and deliver IT training programs in instructor-led, virtual, and self-paced formats
Lead the Microsoft Copilot and AI training strategy, including onboarding, advanced use cases, responsible AI usage, and ongoing enablement
Partner with IT leadership to support new technology rollouts, system upgrades, and digital transformation initiatives
Create and maintain training content, including videos, guides, tutorials, and job aids
Identify skill gaps and develop targeted learning solutions to improve adoption and productivity
Gather feedback and measure training effectiveness to continuously improve programs

K Kitchen Representative

The position includes, but is not limited to, the following essential job duties...

Location

United States, New Albany

Salary:

Not provided

Circle K

Expiration Date

Until further notice

Requirements

Excellent communication skills
Team player who can work well with others or independently
Acts with integrity; keeps commitments
Contagious positive attitude
Focuses on achieving results while having fun
Frequently bend, twist at waist, kneel, squat, stand, and walk
Occasionally climb and descend ladders
Tolerate extreme cold and hot temperatures and work in and around fryers, ovens, grills, coolers, freezers, sharp objects, and loud noises
Reach, grasp, and manipulate objects with hands for entire shift, including reaching for objects overhead

Job Responsibility

Provides excellent guest service in a fast and friendly manner
Maintains a clean restaurant environment by cleaning and performing general housekeeping duties
Prepares and serves food items in accordance with all Brand, Company, and health department regulations
Ensures product quality, food safety, and operational standards are met
Keeps accurate cash, sales, and inventory control records
Follows all government laws and safety codes
Completes reports on all incidents following our 5-minute rule policy
Lives our Company values: One Team, Do the Right Thing, Takes Ownership, Play to Win

What we offer

Medical, Dental, Vision, Term Life and AD&D plans
Flexible spending and health savings accounts (FT)
Vacation paid time off
Company holidays paid at time and a half
Matching 401(k)
Tuition Reimbursement
Stock Purchase Plan
Employee Discount Program
Discount Meal Benefit
Wellness Plan
