HPC Storage Performance Engineer

United States · Job Posted April 24, 2026

Job Description

A global technology company supports organizations in managing and using data across different environments. Its solutions help connect, secure, and analyze applications and information. The organization values diverse perspectives, offers flexible work options, and encourages collaboration. It provides opportunities for career growth and development.

Job Responsibility

  • Successfully complete long- and short-term benchmark projects involving some of the largest HPC systems in the world that utilize the latest HPC technologies
  • Understand HPC architectural components and features, as well as performance estimation methodologies used to provide required information and performance assessments for storage benchmarks on future and competitive systems
  • Provide technical analysis of I/O in standard HPC storage and application benchmarks
  • Identify solutions, define action plans, and help coordinate and deliver optimal benchmark enhancements and solutions in partnership with account teams
  • Develop and maintain current knowledge of competitors’ products and relevant HPC performance optimization techniques in order to deliver high-quality benchmark performance results

Requirements

  • 10+ years of related working experience is required
  • Deep understanding of the HPC storage environment, including Lustre architecture, tuning, and metrics, gained through experience in HPC and AI environments
  • Experience with storage benchmarks, profiling tools and/or scientific/engineering software for HPC systems
  • Familiarity with analyzing the role of storage in I/O synthetic benchmarks and end-user application performance

Nice to have

  • Experience in one or more of the following storage solutions: DAOS, GPFS, WekaIO, VAST, BeeGFS
  • Knowledge of how HPC system components beyond storage, including processor, accelerator, memory, and software technologies, interact with HPC benchmarks
  • Understanding of parallel programming techniques and algorithms
  • Ability to triage complex issues, provide test cases and interact with R&D groups as part of a process to report and fix bugs
  • Expertise with software used by the HPC community, including compilers (C++, C, Fortran), OpenMP, MPI, MPI-IO, Python, and other Linux-based scripting languages
  • Expertise in GPU programming (CUDA and HIP)
  • BS degree (Master's preferred) in a Science, Technology, Engineering, or Mathematics discipline

What we offer

  • Health & Wellbeing
  • Learning & Career Growth
  • Inclusive Culture

Similar Jobs for

HPC Storage Performance Engineer

8 matching positions

HPC Storage Engineer

We are looking for an experienced HPC Storage Engineer to design, implement, and...
Location
United States, Bala Cynwyd (Philadelphia Area), Pennsylvania
Salary
Not provided
Susquehanna International Group
Expiration Date
Until further notice
Requirements
  • Hands-on experience with parallel or distributed filesystems in production environments
  • Strong understanding of Linux systems administration
  • Experience with high-performance I/O, data locality, and throughput optimization
  • Proficiency in large-scale distributed systems development, preferably in C++
  • Proven ability to troubleshoot complex performance and reliability issues across storage and compute stacks
  • Experience with data transfer and movement tools
Job Responsibility
  • Design, deploy, and operate HPC storage systems and parallel/distributed filesystems (e.g., Lustre, GPFS/IBM Spectrum Scale, BeeGFS, Ceph)
  • Own data movement workflows across environments, including data ingest, replication, tiering, and archiving
  • Optimize filesystem and storage performance for large-scale parallel workloads
  • Design and tune load-balancing strategies across storage targets, metadata services, and data movement pipelines to ensure even utilization, high throughput, and predictable performance at scale
  • Troubleshoot storage, I/O, and data movement issues across HPC compute clusters
  • Develop and maintain automation for storage provisioning, monitoring, and lifecycle management
  • Partner with compute and networking teams to ensure end-to-end performance and reliability
  • Advise users and application teams on best practices for I/O patterns, data layout, and performance tuning
  • Evaluate and integrate new storage technologies and architectures as requirements evolve

Staff Engineer, Distributed Storage, HPC & AI Infrastructure

In this role, you will design and deliver multi-petabyte storage systems purpose...
Location
Netherlands, Amsterdam
Salary
Not provided
Together AI
Expiration Date
Until further notice
Requirements
  • 8+ years in storage engineering with 3+ years managing distributed storage at multi-petabyte scale
  • Proven track record deploying and operating high-performance storage for GPU/HPC clusters
  • Deep Kubernetes and cloud-native storage experience in production environments
  • Strong coding skills in Go and Python with demonstrated ability to build production-grade tools
  • BS/MS in Computer Science, Engineering, or equivalent practical experience
  • History of technical leadership: designing systems that significantly improved performance (>3x), reliability (99.9%+ uptime), or cost efficiency
  • Distributed Storage Systems: Deep expertise in WekaFS, Lustre, GPFS, BeeGFS, or similar parallel filesystems at multi-petabyte scale
  • Object Storage: Production experience with S3, MinIO, Ceph, or R2 including performance optimization and cost management
  • Kubernetes Storage: CSI drivers, StatefulSets, PersistentVolumes, storage operators, and custom controllers
  • Storage optimization for GPU workloads, RDMA/InfiniBand networking, parallel filesystem optimization (100+ GB/s aggregate cluster throughput)
Job Responsibility
  • Design multi-petabyte AI/ML storage systems
  • Integrate WekaFS, Ceph, etc.
  • Lead capacity planning and cost optimization (30-50% savings via tiering, lifecycle policies, right-sizing)
  • Design and optimize RDMA, InfiniBand, and 400GbE networks
  • Tune for maximum throughput and minimum latency
  • Implement NVMe-oF/iSCSI
  • Troubleshoot bottlenecks
  • Optimize TCP/IP for storage
  • Build Kubernetes storage operators and controllers
  • Enable automated provisioning, self-service abstractions, multi-tenant isolation, and quotas

HPC Engineer

Location
India, Chennai
Salary
Not provided
WhiteBlue
Expiration Date
Until further notice
Requirements
  • Experience in designing, implementing, and supporting high-performance computing (HPC) clusters with strong knowledge of CPU/GPU architecture, scalable storage, interconnects, and cloud-based systems
  • Solid knowledge of HPC systems, including CPU/GPU architecture, scalable and robust storage, high-bandwidth interconnects, and cloud-based computing architectures
  • Apply attention to detail to generate hardware BOMs for the HPC clusters, manage vendors, and oversee hardware release activities
  • Use strong Linux skills to configure appropriate operating systems for the HPC system
  • Understand and assemble the project specifications and performance requirements at the subsystem and system levels
  • Adhere to and drive project timelines to ensure program milestones are completed on time
  • Support design and release of new products to manufacturing and ultimately the customer, providing quality golden images, procedures, scripts and documentation to the manufacturing team and customer support team
  • Validated in-depth and flavor agnostic knowledge of Linux systems (SuSE, RedHat, Rocky, Ubuntu)
  • Experience of crafting and maintaining robust storage
  • Strong HPC HW knowledge especially in the server, GPU, networking, Storage, BIOS & BMC arenas
Job Responsibility
  • Design, implementation & support of high-performance compute clusters
  • Solid knowledge of HPC systems, including CPU/GPU architecture, scalable and robust storage, high-bandwidth interconnects, and cloud-based computing architectures
  • Apply attention to detail to generate hardware BOMs for the HPC clusters, manage vendors, and oversee hardware release activities
  • Use strong Linux skills to configure appropriate operating systems for the HPC system
  • Understand and assemble the project specifications and performance requirements at the subsystem and system levels
  • Adhere to and drive project timelines to ensure program milestones are completed on time
  • Support design and release of new products to manufacturing and ultimately the customer, providing quality golden images, procedures, scripts and documentation to the manufacturing team and customer support team
  • Validated in-depth and flavor agnostic knowledge of Linux systems (SuSE, RedHat, Rocky, Ubuntu)
  • Experience of crafting and maintaining robust storage
  • Strong HPC HW knowledge especially in the server, GPU, networking, Storage, BIOS & BMC arenas
  • Fulltime

Member of Technical Staff, Site Reliability Engineer (HPC)

As Microsoft continues to push the boundaries of AI, we are on the lookout for p...
Location
United States, Mountain View
Salary
139900.00 - 274800.00 USD / Year
Microsoft Corporation
Expiration Date
Until further notice
Requirements
  • Master's Degree in Computer Science, Information Technology, or related field AND 2+ years technical experience in Site Reliability Engineering, DevOps, or Infrastructure Engineering
  • OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 4+ years technical experience in Site Reliability Engineering, DevOps, or Infrastructure Engineering
  • OR equivalent experience
  • Strong proficiency in Kubernetes, Docker, and container orchestration
  • Knowledge of CI/CD pipelines for Inference and ML model deployment
  • Hands-on experience with public cloud platforms like Azure/AWS/GCP and infrastructure-as-code
  • Expertise in monitoring & observability tools (Grafana, Datadog, OpenTelemetry, etc.)
  • Strong programming/scripting skills in Python, Go, or Bash
  • Solid knowledge of distributed systems, networking, and storage
  • Experience running large-scale GPU clusters for ML/AI workloads (preferred)
Job Responsibility
  • Reliability & Availability: Ensure uptime, resiliency, and fault tolerance of HPC clusters powering MAI model training and inference
  • Observability: Design and maintain monitoring, alerting, and logging systems to provide real-time visibility into all aspects of HPC systems including GPU, clusters, storage and networking
  • Automation & Tooling: Build automation for deployments, incident response, scaling, and failover in CPU+GPU environments
  • Incident Management: Lead on-call rotations, troubleshoot production issues, conduct blameless postmortems, and drive continuous improvements
  • Security & Compliance: Ensure data privacy, compliance, and secure operations across model training and serving environments
  • Collaboration: Partner with ML engineers and platform teams to improve developer experience and accelerate research-to-production workflows
What we offer
  • Competitive compensation, equity options, and comprehensive benefits
  • Fulltime

AI/HPC Systems Performance Engineer

Meta's AI Training and Inference Infrastructure is growing exponentially to supp...
Location
United States, Menlo Park
Salary
122000.00 - 181000.00 USD / Year
Meta
Expiration Date
Until further notice
Requirements
  • Currently has, or is in the process of obtaining a Bachelor's degree in Computer Science, Computer Engineering, relevant technical field, or equivalent practical experience. Degree must be completed prior to joining Meta
  • Bachelor's degree in Computer Science, Computer Engineering, or other relevant technical field, with 2+ years work experience
  • Experience with using communication libraries, such as MPI, NCCL, and UCX
  • Experience with developing, evaluating and debugging host networking protocols such as RDMA
  • Experience with triaging performance issues in complex scale-out distributed applications
Job Responsibility
  • Collaborate with hardware and software teams to optimize end-to-end communication pathways for large-scale distributed training workloads, ensuring seamless integration between compute, storage, and networking components
  • Design, implement, and validate new collective communication algorithms tailored for AI/HPC workloads, leveraging RDMA and advanced networking technologies to maximize throughput and minimize latency
  • Develop and maintain automated performance testing frameworks for continuous benchmarking of communication libraries and RDMA transport layers, enabling rapid identification of regressions and bottlenecks
  • Analyze and profile communication patterns in real-world training jobs, using telemetry and tracing tools to uncover inefficiencies and recommend architectural improvements
  • Drive adoption of best practices for scalable, fault-tolerant communication in production environments, including tuning RDMA parameters, optimizing network fabric configurations, and ensuring robust error handling
  • Work closely with vendors and internal teams to evaluate and integrate new hardware features (e.g., NICs, switches, accelerators) that can enhance communication performance for AI/HPC clusters
  • Contribute to documentation and knowledge sharing by authoring technical guides, performance reports, and internal wiki pages to educate peers and stakeholders on communication system optimizations
  • Participate in code reviews and design discussions to ensure high-quality, maintainable solutions that meet the evolving needs of large-scale AI/HPC infrastructure
What we offer
  • bonus
  • equity
  • benefits

Training Performance Engineer

As a Training Performance Engineer, you’ll drive efficiency improvements across ...
Location
United States, San Francisco
Salary
250000.00 - 445000.00 USD / Year
OpenAI
Expiration Date
Until further notice
Requirements
  • Love optimizing performance and digging into systems to understand how every layer interacts
  • Have strong programming skills in Python and C++ (Rust or CUDA a plus)
  • Have experience running distributed training jobs on multi-GPU systems or HPC clusters
  • Enjoy debugging complex distributed systems and measuring efficiency rigorously
  • Have exposure to frameworks like PyTorch, JAX, or TensorFlow and an understanding of how large-scale training loops are built
  • Are comfortable collaborating across teams and translating raw profiling data into practical engineering improvements
Job Responsibility
  • Profile end-to-end training runs to identify performance bottlenecks across compute, communication, and storage
  • Optimize GPU utilization and throughput for large-scale distributed model training
  • Collaborate with runtime and systems engineers to improve kernel efficiency, scheduling, and collective communication performance
  • Implement model graph transforms to improve end to end throughput
  • Build tooling to monitor and visualize MFU, throughput, and uptime across clusters
  • Partner with researchers to ensure new model architectures scale efficiently during pre-training
  • Contribute to infrastructure decisions that improve reliability and efficiency of large training jobs
What we offer
  • Medical, dental, and vision insurance for you and your family, with employer contributions to Health Savings Accounts
  • Pre-tax accounts for Health FSA, Dependent Care FSA, and commuter expenses (parking and transit)
  • 401(k) retirement plan with employer match
  • Paid parental leave (up to 24 weeks for birth parents and 20 weeks for non-birthing parents), plus paid medical and caregiver leave (up to 8 weeks)
  • Paid time off: flexible PTO for exempt employees and up to 15 days annually for non-exempt employees
  • 13+ paid company holidays, and multiple paid coordinated company office closures throughout the year for focus and recharge, plus paid sick or safe time (1 hour per 30 hours worked, or more, as required by applicable state or local law)
  • Mental health and wellness support
  • Employer-paid basic life and disability coverage
  • Annual learning and development stipend to fuel your professional growth
  • Daily meals in our offices, and meal delivery credits as eligible
  • Fulltime

HPC Storage Benchmark Expert

HPC Storage Benchmark Expert. This role has been designed as ‘Onsite’ with an e...
Location
United Kingdom, Edinburgh
Salary
Not provided
Hewlett Packard Enterprise
Expiration Date
Until further notice
Requirements
  • B.Sc. or equivalent in Science, Technology, Engineering, or Mathematics
  • Deep knowledge of HPC storage environments, gained through work with large-scale HPC systems, including Linux kernel fundamentals; high-performance interconnects; parallel file systems (e.g., Lustre, GPFS); software-defined storage (e.g., DAOS, VAST, Weka; desirable); and I/O and communication libraries
  • Proven ability to triage complex issues, create reproducible test cases, and collaborate with R&D to resolve bugs
  • Experience delivering technical content through presentations or training sessions
  • A willingness to adopt and apply new tools, programming paradigms, and technologies
  • Strong team player with the ability to operate independently when needed
Job Responsibility
  • Supporting tender responses by analysing customer storage requirements
  • Designing and proposing optimal storage architectures alongside Presales teams
  • Running benchmarks on existing systems and projecting performance for future systems
  • Developing detailed, accurate reports to support solution recommendations
  • Validating commitments by executing customer tests on delivered systems
  • Sharing insights, knowledge, and best practices internally and with customers
  • Collaborating with top‑tier researchers, software engineers, and technical teams to help shape future products
What we offer
  • Health & Wellbeing
  • Personal & Professional Development
  • Unconditional Inclusion
  • Fulltime

GCP DevOps HPC Engineer

Location
Spain
Salary
70000.00 - 80000.00 EUR / Year
Signify Technology
Expiration Date
Until further notice
Requirements
  • 5+ years’ experience in HPC environments (SLURM, MPI, parallel workloads)
  • Strong Linux systems expertise in performance-critical environments
  • Hands-on experience running or migrating HPC workloads in the cloud (GCP preferred)
  • Solid experience with Terraform and Ansible
  • Strong scripting skills (Python, Bash)
  • Deep understanding of GCP services (GCE, VPC, Cloud Storage)
Job Responsibility
  • Lead end-to-end migrations of SLURM-based HPC clusters from on-prem to GCP
  • Design, build, and operate secure, scalable HPC architectures in the cloud
  • Optimise SLURM scheduling, workload performance, and resource utilisation
  • Automate cluster deployment and operations using Terraform, Ansible, Python, and Bash
  • Manage HPC software stacks using Spack
  • Deploy and support parallel workloads using MPI, OpenMP, and related frameworks
  • Troubleshoot performance issues and drive continuous optimisation
  • Collaborate with engineering teams and stakeholders in a fully remote environment
  • Fulltime