Software Engineer, GPU Infrastructure - HPC

United States, San Francisco · 230000.00 - 490000.00 USD / Year · Job Posted February 21, 2026

Job Description

As a software engineer on the Fleet High Performance Computing (HPC) team, you will be responsible for the reliability and uptime of OpenAI’s entire compute fleet. Minimizing hardware failures is key to research training progress and stable services, as even a single hardware hiccup can cause significant disruptions. With increasingly large supercomputers, the stakes continue to rise. Being at the forefront of technology means that we are often the first to troubleshoot these state-of-the-art systems at scale. This is a unique opportunity to work with cutting-edge technologies and devise innovative solutions to maintain the health and efficiency of our supercomputing infrastructure. Our team empowers strong engineers with a high degree of autonomy and ownership, as well as the ability to effect change. This role requires a keen focus on comprehensive system-level investigations and the development of automated solutions. We want people who go deep on problems, investigate as thoroughly as possible, and build automation for detection and remediation at scale.

Job Responsibility

  • Build and maintain automation systems for provisioning and managing server fleets
  • Develop tools to monitor server health, performance, and lifecycle events
  • Collaborate with clusters, networking, and infrastructure teams
  • Partner with external operators to ensure a high level of quality
  • Identify and fix performance bottlenecks and inefficiencies
  • Continuously improve automation to reduce manual work

Requirements

  • Experience managing large-scale server environments
  • A balance of strengths in building systems and operationalizing them
  • Proficiency in Python, Go, or similar languages
  • Strong Linux, networking, and server hardware knowledge
  • Comfort digging into noisy data with SQL, PromQL, Pandas, or similar tools

Nice to have

  • Experience with low-level details of hardware components, protocols, and associated Linux tooling (e.g., PCIe, InfiniBand, networking, power management, kernel perf tuning)
  • Knowledge of hardware management protocols (e.g., IPMI, Redfish)
  • High-performance computing (HPC) or distributed systems experience
  • Prior experience developing, managing, or designing hardware
  • Familiarity with monitoring tools (e.g., Prometheus, Grafana)

What we offer

  • Medical, dental, and vision insurance for you and your family, with employer contributions to Health Savings Accounts
  • Pre-tax accounts for Health FSA, Dependent Care FSA, and commuter expenses (parking and transit)
  • 401(k) retirement plan with employer match
  • Paid parental leave (up to 24 weeks for birth parents and 20 weeks for non-birthing parents), plus paid medical and caregiver leave (up to 8 weeks)
  • Paid time off: flexible PTO for exempt employees and up to 15 days annually for non-exempt employees
  • 13+ paid company holidays, and multiple paid coordinated company office closures throughout the year for focus and recharge, plus paid sick or safe time (1 hour per 30 hours worked, or more, as required by applicable state or local law)
  • Mental health and wellness support
  • Employer-paid basic life and disability coverage
  • Annual learning and development stipend to fuel your professional growth
  • Daily meals in our offices, and meal delivery credits as eligible
  • Relocation support for eligible employees
  • Additional taxable fringe benefits, such as charitable donation matching and wellness stipends, may also be provided
  • Equity

Similar Jobs for Software Engineer, GPU Infrastructure - HPC

8 matching positions

Staff Software Engineer, GPU Infrastructure (HPC)

The internal infrastructure team is responsible for building world-class infrast...
Company: Cohere (cohere.com)
Salary: Not provided
Expiration Date: Until further notice

Requirements

  • Deep expertise in ML/HPC infrastructure: Experience with GPU/TPU clusters, distributed training frameworks (JAX, PyTorch, TensorFlow), and high-performance computing (HPC) environments
  • Kubernetes at scale: Proven ability to deploy, manage, and troubleshoot cloud-native Kubernetes clusters for AI workloads
  • Strong programming skills: Proficiency in Python (for ML tooling) and Go (for systems engineering), with a preference for open-source contributions over reinventing solutions
  • Low-level systems knowledge: Familiarity with Linux internals, RDMA networking, and performance optimization for ML workloads
  • Research collaboration experience: A track record of working closely with AI researchers or ML engineers to solve infrastructure challenges
  • Self-directed problem-solving: The ability to identify bottlenecks, propose solutions, and drive impact in a fast-paced environment

Job Responsibility

  • Build and scale ML-optimized HPC infrastructure: Deploy and manage Kubernetes-based GPU/TPU superclusters across multiple clouds, ensuring high throughput and low-latency performance for AI workloads
  • Optimize for AI/ML training: Collaborate with cloud providers to fine-tune infrastructure for cost efficiency, reliability, and performance, leveraging technologies like RDMA, NCCL, and high-speed interconnects
  • Troubleshoot and resolve complex issues: Proactively identify and resolve infrastructure bottlenecks, performance degradation, and system failures to ensure minimal disruption to AI/ML workflows
  • Enable researchers with self-service tools: Design intuitive interfaces and workflows that allow researchers to monitor, debug, and optimize their training jobs independently
  • Drive innovation in ML infrastructure: Work closely with AI researchers to understand emerging needs (e.g., JAX, PyTorch, distributed training) and translate them into robust, scalable infrastructure solutions
  • Champion best practices: Advocate for observability, automation, and infrastructure-as-code (IaC) across the organization, ensuring systems are maintainable and resilient
  • Mentorship and collaboration: Share expertise through code reviews, documentation, and cross-team collaboration, fostering a culture of knowledge transfer and engineering excellence

What we offer

  • An open and inclusive culture and work environment
  • Work closely with a team on the cutting edge of AI research
  • Weekly lunch stipend, in-office lunches & snacks
  • Full health and dental benefits, including a separate budget to take care of your mental health
  • 100% Parental Leave top-up for up to 6 months
  • Personal enrichment benefits towards arts and culture, fitness and well-being, quality time, and workspace improvement
  • Remote-flexible, offices in Toronto, New York, San Francisco, London and Paris, as well as a co-working stipend
  • 6 weeks of vacation (30 working days!)

Employment Type: Full-time

Principal Supercomputing Operations Software Engineer

Microsoft Azure’s Artificial Intelligence and High Performance Computing (AI/HPC...
Location: United States, Multiple Locations
Salary: 139900.00 - 274800.00 USD / Year
Company: Microsoft Corporation (https://www.microsoft.com/)
Expiration Date: Until further notice

Requirements

  • Bachelor's Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python, OR equivalent experience
  • Ability to meet Microsoft, customer and/or government security screening requirements
  • Microsoft Cloud Background Check
  • 6+ years of experience operating large‑scale distributed systems, high‑performance computing (HPC), or artificial intelligence (AI) infrastructure in production environments
  • Demonstrated ownership of mission‑critical production infrastructure with direct impact on service availability, GPU workloads, and customer SLAs
  • Hands‑on experience operating and debugging interconnect fabrics supporting large‑scale compute workloads
  • Strong Linux systems knowledge with experience debugging low‑level infrastructure issues across operating systems, drivers, and services
  • Proven ability to reason across hardware, firmware, drivers, and software stacks to diagnose and resolve complex production issues

Job Responsibility

  • Serve as the technical authority and DRI for InfiniBand and GPU interconnect fabric operations across large-scale AI supercomputing environments, ensuring sustained GPU availability, training stability, and SLA compliance
  • Lead and orchestrate complex, high-severity fabric incidents end to end, including detection, triage, mitigation, recovery, and root cause analysis, making high-impact decisions under ambiguity
  • Perform deep, multi-layer systems debugging across InfiniBand, Subnet Manager, GPU interconnect, PCIe, GPUs, firmware, drivers, and OS layers to identify true root causes at fleet scale
  • Drive operational excellence and systemic prevention by identifying recurring failure patterns, defining reliability models and failure domains, and authoring authoritative TSGs, playbooks, and escalation frameworks adopted across teams
  • Architect and drive automation, telemetry, diagnostics, and tooling that materially improve detection, observability, debuggability, and mean time to mitigation, raising the operational bar for interconnect fabrics across the platform

Employment Type: Full-time

Principal Software Engineer

We are seeking an expert Senior GPU Engineer to join our AI Infrastructure team....
Location: China, Shanghai
Salary: Not provided
Company: Microsoft Corporation (https://www.microsoft.com/)
Expiration Date: Until further notice

Requirements

  • Bachelor's Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
  • 5+ years of experience in systems programming, HPC, or GPU software development, featuring at least 5 years of hands-on CUDA/C++ kernel development
  • Expertise in the CUDA programming model and NVIDIA GPU architectures (specifically Ampere/Hopper)
  • Deep understanding of the memory hierarchy (Shared Memory, L2 cache, Registers), warp-level primitives, occupancy optimization, and bank conflict resolution
  • Familiarity with advanced hardware features: Tensor Cores, TMA (Tensor Memory Accelerator), and asynchronous copy
  • Proven ability to navigate and modify complex, large-scale codebases (e.g., PyTorch internals, Linux kernel)
  • Experience with build and binding ecosystems: CMake, pybind11, and CI/CD for GPU workloads
  • Mastery of NVIDIA Nsight Systems/Compute
  • Ability to mathematically reason about performance using the Roofline Model, memory bandwidth utilization, and compute throughput
  • Ability to meet Microsoft, customer, and/or government security screening requirements is required for this role

Job Responsibility

  • Custom Operator Development: Design and implement highly optimized GPU kernels (CUDA/Triton) for critical deep learning operations (e.g., FlashAttention, GEMM, LayerNorm) to outperform standard libraries
  • Inference Engine Architecture: Contribute to the development of our high-performance inference engine, focusing on graph optimizations, operator fusion, and dynamic memory management (e.g., KV Cache optimization)
  • Performance Optimization: Deeply analyze and profile model performance using tools like Nsight Systems/Compute. Identify bottlenecks in memory bandwidth, instruction throughput, and kernel launch overheads
  • Model Acceleration: Implement advanced acceleration techniques such as Quantization (INT8, FP8, AWQ), Kernel Fusion, and continuous batching
  • Distributed Computing: Optimize communication primitives (NCCL) to enable efficient multi-GPU and multi-node inference (Tensor Parallelism, Pipeline Parallelism)
  • Hardware Adaptation: Ensure the software stack fully utilizes modern GPU architecture features (e.g., NVIDIA Hopper/Ampere Tensor Cores, Asynchronous Copy)

Employment Type: Full-time

Principal Software Engineer

Azure High Performance Computing and AI Platform (HPC/AI) group is the team behi...
Location: United States, Multiple Locations
Salary: 139900.00 - 274800.00 USD / Year
Company: Microsoft Corporation (https://www.microsoft.com/)
Expiration Date: Until further notice

Requirements

  • Bachelor's Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python, OR equivalent experience
  • Ability to meet Microsoft, customer, and/or government security screening requirements is required for this role. These requirements include, but are not limited to, the following specialized security screening: Microsoft Cloud Background Check. This position will be required to pass the Microsoft Cloud Background Check upon hire/transfer and every two years thereafter.

Job Responsibility

  • Willing to dive deeply into any level or layer of a problem.
  • Willing to learn emerging technologies, from hardware to software. Evaluate and make recommendations that advance Azure infrastructure for AI and other GPU-based workloads.
  • Leads by example within the team by producing extensible and maintainable code. Optimizes, debugs, refactors, and reuses code to improve performance, maintainability, effectiveness, and return on investment (ROI). Applies metrics to drive the quality and stability of code, as well as appropriate coding patterns and best practices.
  • Maintains communication with key partners across the Microsoft ecosystem of engineers. Acts as a key contact for leadership to ensure alignment with partners' expectations. Considers partner teams across organizations and their end goals for products in order to deliver desirable user experiences and meet the dynamic needs of partners and customers through product development.

Employment Type: Full-time

Staff Software Engineer, Slurm

We are actively seeking an exceptional Staff Software Engineer to join our cloud...
Location: United States, San Francisco
Salary: 185000.00 - 224000.00 USD / Year
Company: Crusoe (crusoe.ai)
Expiration Date: Until further notice

Requirements

  • 7+ years of experience working in software engineering, with strong experience in Systems Engineering
  • Experience in distributed systems, cloud, or HPC environments is a must
  • 2+ years of programming experience in Go
  • Strong proficiency in other systems languages (Rust, C++, Python for HPC tooling) is also beneficial
  • Extensive experience with Kubernetes and Linux Engineering and debugging
  • Deep knowledge of Slurm (Simple Linux Utility for Resource Management) administration and the architecture required for managing compute jobs in high-performance environments
  • Skilled in infrastructure as code and familiar with systems-level challenges, ideally with experience utilizing Terraform
  • Understanding of Argo, CI/CD, and automated testing pipelines
  • Ability to design and take ownership of system architecture, including CI/CD pipelines, while ensuring adherence to security standards
  • Strong knowledge of container networking (CNI plugins, service meshes) and Linux networking fundamentals

Job Responsibility

  • Lead the development and engineering of our managed Slurm offering, providing a seamless experience for AI/ML and HPC customers who rely on robust Slurm job scheduling
  • Contribute to the development of scalable and robust software solutions, closely aligning with the strategic objectives outlined in the Crusoe Cloud roadmap
  • Design, build, and maintain Kubernetes operators and controllers dedicated to managing the lifecycle, configuration, and state of large-scale Slurm clusters
  • Drive the integration of GPU acceleration in the Slurm environment, including device plugin architecture, GPU operators, accelerator-aware scheduling, and resource allocation
  • Ensure that high-performance networking technologies, such as InfiniBand and RoCE, are correctly leveraged for distributed GPU workloads running through Slurm
  • Implement and manage features such as multi-tenancy, cluster lifecycle management, auto-scaling, and high availability for the managed Slurm control plane services
  • Develop scalable systems to compete with leading managed services
  • Support the development of your peers by sharing knowledge and providing guidance in technical discussions

What we offer

  • Restricted Stock Units in a fast-growing, well-funded technology company
  • Health insurance package options that include HDHP and PPO, vision, and dental for you and your dependents
  • Employer contributions to HSA accounts
  • Paid Parental Leave
  • Paid life insurance, short-term and long-term disability
  • Teladoc
  • 401(k) with a 100% match up to 4% of salary
  • Generous paid time off and holiday schedule
  • Cell phone reimbursement
  • Tuition reimbursement

Employment Type: Full-time

Sr/Staff Software Engineer, Observability

We are looking for a highly skilled engineer with deep expertise in building and...
Location: United States, San Francisco
Salary: 172000.00 - 253000.00 USD / Year
Company: Crusoe (crusoe.ai)
Expiration Date: Until further notice

Requirements

  • 6+ years of experience with distributed systems, with a focus on observability and monitoring systems
  • Deep expertise with metrics systems (Prometheus, Thanos, Mimir, Cortex), logging pipelines (Fluent Bit, Vector, Loki, ELK/OpenSearch), and tracing platforms (Jaeger, Tempo, OpenTelemetry)
  • Strong programming skills in Go or Python for automation, operators, and custom integrations
  • Experience running observability platforms on Kubernetes and operating them at scale across multi-datacenter environments
  • Proven ability to design, optimize, and scale telemetry pipelines handling high cardinality and high throughput data
  • Solid understanding of distributed systems, performance engineering, and debugging complex workloads
  • Familiarity with service meshes, networking, and workload instrumentation (Envoy, Istio, OpenTelemetry SDKs)
  • Strong collaboration skills and the ability to influence engineering teams to adopt observability best practices

Job Responsibility

  • Designing and operating scalable observability systems (metrics, logging, tracing) across multi-datacenter Kubernetes environments
  • Architecting end-to-end telemetry pipelines, including ingestion, storage, querying, and visualization
  • Extending monitoring and alerting with Prometheus, Alertmanager, Thanos/Cortex, Grafana, and OpenTelemetry
  • Building scalable log collection and processing pipelines with Fluent Bit, Vector, Loki, or ELK/OpenSearch stacks
  • Implementing distributed tracing platforms (Tempo, Jaeger, OpenTelemetry) and integrating with service meshes, load balancers, and APIs
  • Defining and driving adoption of SLOs, SLIs, and error budgets across services and teams
  • Automating provisioning and scaling of observability infrastructure with Kubernetes, Terraform, and custom tooling (Go, Python)
  • Ensuring reliability and cost efficiency of telemetry pipelines while supporting high-volume workloads (AI/ML, HPC clusters, GPU infrastructure)
  • Embedding security best practices into observability platforms, including RBAC, TLS, secret management, and multi-tenant access controls
  • Mentoring engineers and shaping Crusoe’s observability strategy and technical roadmap

What we offer

  • Restricted Stock Units in a fast-growing, well-funded technology company
  • Health insurance package options that include HDHP and PPO, vision, and dental for you and your dependents
  • Employer contributions to HSA accounts
  • Paid Parental Leave
  • Paid life insurance, short-term and long-term disability
  • Teladoc
  • 401(k) with a 100% match up to 4% of salary
  • Generous paid time off and holiday schedule
  • Cell phone reimbursement
  • Tuition reimbursement

Employment Type: Full-time

Senior+ Software Engineer - Cloud Availability Platform Engineering (Observability)

We are looking for a highly skilled engineer with deep expertise in building and...
Location: United States, San Francisco
Salary: 166000.00 - 201000.00 USD / Year
Company: Crusoe (crusoe.ai)
Expiration Date: Until further notice

Requirements

  • 7+ years of experience in infrastructure or platform engineering, with a focus on observability and monitoring systems
  • Deep expertise with metrics systems (Prometheus, Thanos, Mimir, Cortex), logging pipelines (Fluent Bit, Vector, Loki, ELK/OpenSearch), and tracing platforms (Jaeger, Tempo, OpenTelemetry)
  • Strong programming skills in Go or Python for automation, operators, and custom integrations
  • Experience running observability platforms on Kubernetes and operating them at scale across multi-datacenter environments
  • Proven ability to design, optimize, and scale telemetry pipelines handling high cardinality and high throughput data
  • Solid understanding of distributed systems, performance engineering, and debugging complex workloads
  • Strong collaboration skills and the ability to influence engineering teams to adopt observability best practices

Job Responsibility

  • Designing and operating scalable observability systems (metrics, logging, tracing) across multi-datacenter Kubernetes environments
  • Architecting end-to-end telemetry pipelines, including ingestion, storage, querying, and visualization
  • Extending monitoring and alerting with Prometheus, Alertmanager, Thanos/Cortex, Grafana, and OpenTelemetry
  • Building scalable log collection and processing pipelines with Fluent Bit, Vector, Loki, or ELK/OpenSearch stacks
  • Implementing distributed tracing platforms (Tempo, Jaeger, OpenTelemetry) and integrating with service meshes, load balancers, and APIs
  • Defining and driving adoption of SLOs, SLIs, and error budgets across services and teams
  • Automating provisioning and scaling of observability infrastructure with Kubernetes, Terraform, and custom tooling (Go, Python)
  • Ensuring reliability and cost efficiency of telemetry pipelines while supporting high-volume workloads (AI/ML, HPC clusters, GPU infrastructure)
  • Embedding security best practices into observability platforms, including RBAC, TLS, secret management, and multi-tenant access controls
  • Partnering with engineering teams to embed observability into applications, services, and infrastructure

What we offer

  • Restricted Stock Units in a fast-growing, well-funded technology company
  • Health insurance package options that include HDHP and PPO, vision, and dental for you and your dependents
  • Employer contributions to HSA accounts
  • Paid Parental Leave
  • Paid life insurance, short-term and long-term disability
  • Teladoc
  • 401(k) with a 100% match up to 4% of salary
  • Generous paid time off and holiday schedule
  • Cell phone reimbursement
  • Tuition reimbursement

Employment Type: Full-time

Senior GPU Engineer

We are seeking an expert Senior GPU Engineer to join our AI Infrastructure team....
Location: China, Beijing
Salary: Not provided
Company: Microsoft Corporation (https://www.microsoft.com/)
Expiration Date: Until further notice

Requirements

  • Bachelor's Degree in Computer Science or related technical field AND 4+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
  • 4+ years of experience in systems programming, HPC, or GPU software development, featuring at least 5 years of hands-on CUDA/C++ kernel development
  • Expertise in the CUDA programming model and NVIDIA GPU architectures (specifically Ampere/Hopper)
  • Deep understanding of the memory hierarchy (Shared Memory, L2 cache, Registers), warp-level primitives, occupancy optimization, and bank conflict resolution
  • Familiarity with advanced hardware features: Tensor Cores, TMA (Tensor Memory Accelerator), and asynchronous copy
  • Proven ability to navigate and modify complex, large-scale codebases (e.g., PyTorch internals, Linux kernel)
  • Experience with build and binding ecosystems: CMake, pybind11, and CI/CD for GPU workloads
  • Mastery of NVIDIA Nsight Systems/Compute
  • Ability to mathematically reason about performance using the Roofline Model, memory bandwidth utilization, and compute throughput

Job Responsibility

  • Custom Operator Development: Design and implement highly optimized GPU kernels (CUDA/Triton) for critical deep learning operations (e.g., FlashAttention, GEMM, LayerNorm) to outperform standard libraries
  • Inference Engine Architecture: Contribute to the development of our high-performance inference engine, focusing on graph optimizations, operator fusion, and dynamic memory management (e.g., KV Cache optimization)
  • Performance Optimization: Deeply analyze and profile model performance using tools like Nsight Systems/Compute. Identify bottlenecks in memory bandwidth, instruction throughput, and kernel launch overheads
  • Model Acceleration: Implement advanced acceleration techniques such as Quantization (INT8, FP8, AWQ), Kernel Fusion, and continuous batching
  • Distributed Computing: Optimize communication primitives (NCCL) to enable efficient multi-GPU and multi-node inference (Tensor Parallelism, Pipeline Parallelism)
  • Hardware Adaptation: Ensure the software stack fully utilizes modern GPU architecture features (e.g., NVIDIA Hopper/Ampere Tensor Cores, Asynchronous Copy)

Employment Type: Full-time