Principal Supercomputing Operations Engineering Manager Job at Microsoft Corporation (Multiple Locations)

Senior Principal Engineering Manager

Microsoft Research (MSR) is working to transform the future of artificial intell...

Location

United States, Redmond

Salary:

163000.00 - 296400.00 USD / Year

Microsoft Corporation

Expiration Date

Until further notice

Requirements

Bachelor's Degree in Computer Science or related technical field AND 8+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
5+ years of people management experience leading software engineering teams, including managing principal engineers
Experience building or operating infrastructure for large-scale distributed systems, cloud platforms, or artificial intelligence (AI)/machine learning (ML) workloads
Track record of driving execution on complex, multi-workstream infrastructure projects with clear milestones and accountability
Technical fluency in one or more of: large-scale compute clusters, GPU infrastructure, scheduling and orchestration (Kubernetes, Volcano), or High-Performance Compute (HPC) environments
Experience with GPU programming (CUDA, NCCL) and frameworks such as PyTorch
Expertise in networking (InfiniBand, NVLink), storage systems, or distributed training parallelisms
A track record of strong cross-functional partnerships, including the ability to align on strategic direction, deliver joint accountabilities, and develop relationships with staff members with widely varied expertise
Experience scaling engineering teams through significant growth phases (hiring, onboarding, and integrating new engineers into a high-performing team)
Master's Degree in Computer Science or related technical field AND 12+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR Bachelor's Degree in Computer Science or related technical field AND 15+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience

Job Responsibility

Lead, mentor, and grow the engineering team that builds MSR’s AI research infrastructure
Recruit and develop exceptional engineering talent, building a diverse team - including hiring, onboarding, career development, and performance management
Drive execution across the team by setting clear goals, tracking milestones, managing dependencies, and ensuring accountability for delivering complex infrastructure projects on time and at high quality
Lead team culture and process changes, cultivating an AI-first mentality that accelerates our progress through agentic coding, automation, and skills development
Provide technical vision and judgment on the team's architecture, strategy, and roadmap — spanning supercomputer GPU clusters, high performance networking, workload optimization, researcher tools, and agentic workflows — while empowering engineers to own deep technical details
Collaborate closely cross-discipline with engineers, program managers, and research and science teams to align priorities, resolve dependencies, and build better solutions together
Foster a team culture of operational excellence, continuous improvement, and high psychological safety where engineers are empowered to take ownership and innovate

Fulltime

Principal Supercomputing Operations Software Engineer

Microsoft Azure’s Artificial Intelligence and High Performance Computing (AI/HPC...

Location

United States, Multiple Locations

Salary:

139900.00 - 274800.00 USD / Year

Microsoft Corporation

Expiration Date

Until further notice

Requirements

Bachelor's Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
Ability to meet Microsoft, customer and/or government security screening requirements
Microsoft Cloud Background Check
6+ years of experience operating large‑scale distributed systems, high‑performance computing (HPC), or artificial intelligence (AI) infrastructure in production environments
Demonstrated ownership of mission‑critical production infrastructure with direct impact on service availability, GPU workloads, and customer SLAs
Hands‑on experience operating and debugging interconnect fabrics supporting large‑scale compute workloads
Strong Linux systems knowledge with experience debugging low‑level infrastructure issues across operating systems, drivers, and services
Proven ability to reason across hardware, firmware, drivers, and software stacks to diagnose and resolve complex production issues

Job Responsibility

Serve as the technical authority and DRI for InfiniBand and GPU interconnect fabric operations across large-scale AI supercomputing environments, ensuring sustained GPU availability, training stability, and SLA compliance
Lead and orchestrate complex, high-severity fabric incidents end to end, including detection, triage, mitigation, recovery, and root cause analysis, making high-impact decisions under ambiguity
Perform deep, multi-layer systems debugging across InfiniBand, Subnet Manager, GPU interconnect, PCIe, GPUs, firmware, drivers, and OS layers to identify true root causes at fleet scale
Drive operational excellence and systemic prevention by identifying recurring failure patterns, defining reliability models and failure domains, and authoring authoritative TSGs, playbooks, and escalation frameworks adopted across teams
Architect and drive automation, telemetry, diagnostics, and tooling that materially improve detection, observability, debuggability, and mean time to mitigation, raising the operational bar for interconnect fabrics across the platform

Fulltime

Principal Software Engineer

Microsoft Azure High Performance Computing & AI Engineering (HPC & AI Eng) team ...

Location

United States, Multiple Locations

Salary:

139900.00 - 274800.00 USD / Year

Microsoft Corporation

Expiration Date

Until further notice

Requirements

Bachelor's Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python - OR equivalent experience
5+ years of hands-on experience designing and developing high-volume, low-latency pipelines using products such as AzPubSub, Event Hubs, Azure Stream Analytics, Kafka, Grafana, Prometheus, or equivalent products
3+ years of experience with one of AI/HPC system management OR High-Speed Networks OR HPC Storage OR managing Cloud Infrastructure
Ability to meet Microsoft, customer, and/or government security screening requirements is required for this role
Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud Background Check upon hire/transfer and every two years thereafter

Job Responsibility

Architect, design, and develop high-volume, low-latency, end-to-end event pipelines that provide first-to-know insights into events that cause job interrupts and affect job reliability
Conduct analysis of existing event pipelines to evaluate fidelity, granularity and latency of critical events
Contribute to improving key metrics such as Job Mean Time to Interrupt, Nodes in Service, and Mean Time to Resolve on flagship supercomputers by enabling data scientists and domain experts to use the telemetry to identify events and issues at the intersection of datacenter and hardware, develop hypotheses, conduct A/B tests, and synthesize results
Partner with cross-organizational teams to evaluate available telemetry and latency, and drive the architecture, design, development, and deployment of end-to-end solutions to manage core infrastructure, including current and next-generation datacenter, IT hardware, and power and cooling technologies
Drive engineering and operational excellence based on issues and learnings from strategic customers on their usage scenarios to improve product features and capabilities
Partner with teams on continuous learning and continuous improvement programs by leading the resolution of complex incidents, driving root cause analyses and championing initiatives to minimize future customer impact

Fulltime

Principal Software Engineer

Microsoft Azure Artificial Intelligence/High Performance Computing (AI/HPC) team...

Location

United States, Multiple Locations

Salary:

139900.00 - 274800.00 USD / Year

Microsoft Corporation

Expiration Date

Until further notice

Requirements

Bachelor's Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience.
Ability to meet Microsoft, customer, and/or government security screening requirements is required for this role. These requirements include, but are not limited to, the following specialized security screenings: Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud Background Check upon hire/transfer and every two years thereafter.

Job Responsibility

Partner with appropriate stakeholders to determine user requirements for a set of scenarios.
Lead identification of dependencies and the development of design documents for a product, application, service, or platform, primarily catering towards exhaustive health monitoring of AI training supercomputers.
Build AI Supercomputer observability solutions at scale, with deep focus on actionability to improve availability and reliability of supercomputers.
Lead by example and mentor others to produce extensible and maintainable code used across products.
Leverage subject-matter expertise of cross-product features with appropriate stakeholders (e.g., project managers) to drive multiple groups’ project plans, release plans, and work items.
Hold accountability as a Designated Responsible Individual (DRI), mentoring engineers across products/solutions, working on-call to monitor system/product/service for degradation, downtime, or interruptions.
Proactively seek new knowledge and adapt to new trends, technical solutions, and patterns that will improve the availability, reliability, efficiency, observability, and performance of products, while driving consistency in monitoring and operations at scale and sharing knowledge with other engineers.

Fulltime

Logistics Technician

As a Microsoft Data Center Inventory & Asset Technician (DIAT), you will perform...

Location

Austria, Vienna

Salary:

37500.00 - 49200.00 EUR / Year

Microsoft Corporation

Expiration Date

Until further notice

Requirements

High School Diploma AND 6 months of experience or an internship in inventory management, retail, warehouse management, or a related field OR equivalent experience.
Ability to meet Microsoft, customer, and/or government security screening requirements is required for this role. These requirements include, but are not limited to, the following specialized security screenings: Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud Background Check upon hire/transfer and every two years thereafter.

Job Responsibility

Perform assigned tasks and escalate issues during high-volume work activity or escalation-based situations under the guidance of more experienced Data Center colleagues
Coordinate with suppliers to initiate warranty claims and process failed vendor hardware devices. This includes information processing, packaging, shipment, and receipt of returns for Return Merchandise Authorization (RMA) devices, following all Service Level Agreements (SLAs) related to the RMA warranty process
Develop working knowledge of stock control and inventory management practices and procedures
Ensure accurate documentation of incoming and outgoing deliveries as well as records
Become familiar with guidelines set by Microsoft contractual agreements with suppliers and maintain a strong customer focus
Perform cycle audits and data corrections to ensure all inventory controls are met
Help to reconcile and report inventory discrepancies
Ensure detailed physical inventory tracking and staging
Under the supervision of more experienced Data Center colleagues, destroy data-bearing devices (DBD) following all Service Level Agreements (SLAs) and Microsoft policies
Comply with all security and data management policies

Fulltime

Director Associate Experience - Talent Insights

The Director of Associate Experience provides enterprise leadership for the desi...

Location

United States, Irving

Salary:

Not provided

CHRISTUS Health

Expiration Date

Until further notice

Requirements

Bachelor’s degree in human resources, psychology, organizational development, marketing, or a related field required
Master’s degree in human resources, psychology, organizational development, marketing, or a related field preferred
8+ years of progressive experience in Associate Experience, organizational research, or a related discipline in a complex organization required
Demonstrated experience leading enterprise-wide programs and influencing senior and executive leaders
Strong understanding of quantitative and qualitative research methodologies, survey design principles, and applied organizational research

Job Responsibility

Meets expectations of the applicable OneCHRISTUS Competencies: Leader of Self, Leader of Others, or Leader of Leaders
Provide enterprise leadership for the Associate Experience strategy, including vision, guiding principles, success metrics, and a multi-year roadmap aligned with CHRISTUS Health’s mission and strategic priorities
Establish and maintain governance, standards, and best practices for global Associate listening programs to ensure consistency, data integrity, and Associate trust across the system
Oversee the design, administration, and continuous evolution of enterprise-wide Associate listening programs, applying human-centered and evidence-based approaches to ensure programs remain relevant, effective, and impactful
Translate Associate Experience data into actionable insights that inform executive decision-making, workforce strategies, and organizational priorities
Serve as a strategic advisor and thought partner to senior and executive leaders, providing perspective on Associate Experience trends, risks, and opportunities
Provide enterprise oversight of Associate recognition programs, ensuring alignment with organizational values and the promotion of fair, inclusive, and consistent recognition practices
Partner with Human Resources, business leaders, and cross-functional teams to align Associate Experience, recognition, and related initiatives, including coordinated improvement of key moments across the Associate lifecycle
Drive adoption, participation, and sustained engagement in feedback and recognition programs through leader enablement, effective communication, and targeted engagement strategies
Ensure Associate Experience insights are translated into action by supporting action planning, accountability, and measurement of outcomes at the enterprise and local levels

Fulltime

Payroll Specialist

We are looking for a detail-oriented Payroll Specialist to join a software compa...

Location

United States, New York

Salary:

Not provided

Robert Half

Expiration Date

Until further notice

Requirements

At least 3 years of experience in payroll, with strong exposure to payroll tax reporting and filing
Demonstrated knowledge of multi-state payroll tax requirements and quarterly tax processes
Hands-on experience using Gusto for payroll administration or payroll tax activities
Strong understanding of payroll compliance standards and reporting accuracy requirements
Ability to review detailed payroll data, identify discrepancies, and resolve issues efficiently
Strong organizational skills with the ability to manage deadlines in a fast-paced contract assignment
Effective communication skills and the ability to collaborate with cross-functional teams

Job Responsibility

Manage payroll tax reporting activities for quarterly deadlines, ensuring complete and accurate submission of required filings
Prepare, review, and submit payroll tax documents across multiple states in compliance with applicable regulations
Use Gusto to process payroll-related tax tasks, validate data, and resolve reporting discrepancies
Reconcile payroll tax balances and investigate variances to support accurate financial and compliance records
Partner with HR and Operations stakeholders to gather payroll information and maintain alignment on tax-related deliverables
Monitor filing timelines and help ensure all payroll tax obligations are completed within established deadlines
Research and address payroll tax issues, including notices, exceptions, and reporting inconsistencies
Support project-based payroll tax initiatives tied to first-quarter reporting requirements

What we offer

medical
vision
dental
life and disability insurance
enrollment in company 401(k) plan

Interface Analyst Senior

The Interface Analyst Senior supports the business goals and objectives for the ...

Location

United States, Irving

Salary:

Not provided

CHRISTUS Health

Expiration Date

Until further notice

Requirements

Bachelor's degree in Information Systems, Computer Science, Computer Applications, Electrical Engineering, Electronics Engineering, or related area, or foreign equivalent
Demonstrated knowledge of clinical and healthcare messaging and all data formats used in healthcare integration messaging (HL7, EDI, CSV, XML, etc.)
Minimum of 7 years IT experience in technical analysis, design and integration implementation
Minimum of 3 years prior experience integrating messaging between healthcare systems (EMRs, Applications, third parties)
Minimum of 2 years working with relational databases, Transact-SQL, and SQL Stored Procedures or any other database
Minimum of 2 years of experience working in MS Office with a focus on Excel, Word and Visio

Job Responsibility

Leads groups of individuals in the analysis, design, development, and delivery of Integration Solutions
Works closely with project teams on customer-specific initiatives which involve message movement, translation, or integration development to solve highly complex technical messaging and transformation issues
Assists team members with troubleshooting technical issues related to message transformation and flows as well as analyzing customer message specifications and providing gap analysis
Creates documentation for future reference, training and support purposes
Communicates with customers and vendors to clarify messaging format requirements
Writes translation plans to move development work into various environments
Supports or leads integration testing as required

Fulltime
