HPC Storage Engineer

United States, Bala Cynwyd (Philadelphia Area), Pennsylvania · Job Posted February 01, 2026

Job Description

We are looking for an experienced HPC Storage Engineer to design, implement, and optimize the storage and data movement infrastructure that underpins our high-performance computing (HPC) environment. This role focuses on distributed and parallel filesystems, storage systems, and large-scale data movement, ensuring reliable, high-throughput access to data for compute-intensive workloads. You will work closely with HPC platform engineers, compute and networking teams, and application users to deliver scalable, performant, and resilient storage solutions that tightly integrate the storage layer with compute nodes.

Job Responsibility

  • Design, deploy, and operate HPC storage systems and parallel/distributed filesystems (e.g., Lustre, GPFS/IBM Spectrum Scale, BeeGFS, Ceph)
  • Own data movement workflows across environments, including data ingest, replication, tiering, and archiving
  • Optimize filesystem and storage performance for large-scale parallel workloads
  • Design and tune load-balancing strategies across storage targets, metadata services, and data movement pipelines to ensure even utilization, high throughput, and predictable performance at scale
  • Troubleshoot storage, I/O, and data movement issues across HPC compute clusters
  • Develop and maintain automation for storage provisioning, monitoring, and lifecycle management
  • Partner with compute and networking teams to ensure end-to-end performance and reliability
  • Advise users and application teams on best practices for I/O patterns, data layout, and performance tuning
  • Evaluate and integrate new storage technologies and architectures as requirements evolve

Requirements

  • Hands-on experience with parallel or distributed filesystems in production environments
  • Strong understanding of Linux systems administration
  • Experience with high-performance I/O, data locality, and throughput optimization
  • Proficiency in large-scale distributed systems development, preferably in C++
  • Proven ability to troubleshoot complex performance and reliability issues across storage and compute stacks
  • Experience with data transfer and movement tools

Nice to have

  • Familiarity with object storage and hierarchical storage management (HSM)
  • Experience integrating storage with HPC schedulers (e.g., Slurm) and compute workflows
  • Background supporting scientific, ML/AI, or other data-intensive workloads

Similar Jobs for HPC Storage Engineer

8 matching positions

HPC Storage Performance Engineer

A global technology company supports organizations in managing and using data ac...
Location: United States
Salary: Not provided
Company: Salt (welovesalt.com)
Expiration Date: Until further notice
Requirements
  • 10+ years of related working experience is required
  • Deep understanding of the HPC Storage environment, including Lustre architecture, tuning, and metrics gained through experience in HPC and AI environments
  • Experience with storage benchmarks, profiling tools and/or scientific/engineering software for HPC systems
  • Familiarity with analyzing the role of storage in I/O synthetic benchmarks and end-user application performance
Job Responsibility
  • Successfully complete long and short-term benchmark projects involving some of the largest HPC systems in the world that utilize the latest HPC technologies
  • Understand HPC architectural components and features, as well as performance estimation methodologies used to provide required information and performance assessments for storage benchmarks on future and competitive systems
  • Provide technical analysis of I/O in standard HPC storage and application benchmarks
  • Identify solutions, define action plans, and help coordinate and deliver optimal benchmark enhancements and solutions in partnership with account teams
  • Develop and maintain current knowledge of competitors’ products and relevant HPC performance optimization techniques to ensure ability to provide high-quality benchmark performance results
What we offer
  • Health & Wellbeing
  • Learning & Career Growth
  • Inclusive Culture
  • Fulltime

Staff Engineer, Distributed Storage, HPC & AI Infrastructure

In this role, you will design and deliver multi-petabyte storage systems purpose...
Location: Netherlands, Amsterdam
Salary: Not provided
Company: Together AI (together.ai)
Expiration Date: Until further notice
Requirements
  • 8+ years in storage engineering with 3+ years managing distributed storage at multi-petabyte scale
  • Proven track record deploying and operating high-performance storage for GPU/HPC clusters
  • Deep Kubernetes and cloud-native storage experience in production environments
  • Strong coding skills in Go and Python with demonstrated ability to build production-grade tools
  • BS/MS in Computer Science, Engineering, or equivalent practical experience
  • History of technical leadership: designing systems that significantly improved performance (>3x), reliability (99.9%+ uptime), or cost efficiency
  • Distributed Storage Systems: Deep expertise in WekaFS, Lustre, GPFS, BeeGFS, or similar parallel filesystems at multi-petabyte scale
  • Object Storage: Production experience with S3, MinIO, Ceph, or R2 including performance optimization and cost management
  • Kubernetes Storage: CSI drivers, StatefulSets, PersistentVolumes, storage operators, and custom controllers
  • Storage optimization for GPU workloads, RDMA/InfiniBand networking, parallel filesystem optimization (100+ GB/s aggregate cluster throughput)
Job Responsibility
  • Design multi-petabyte AI/ML storage systems; integrate WekaFS, Ceph, etc.; lead capacity planning and cost optimization (30-50% savings via tiering, lifecycle policies, right-sizing)
  • Design/optimize RDMA, InfiniBand, 400GbE networks; tune for max throughput/min latency; implement NVMe-oF/iSCSI; troubleshoot bottlenecks; optimize TCP/IP for storage
  • Build Kubernetes storage operators/controllers; enable automated provisioning, self-service abstractions, multi-tenant isolation, quotas

HPC Engineer

Location: India, Chennai
Salary: Not provided
Company: WhiteBlue (whiteblue.com)
Expiration Date: Until further notice
Requirements
  • Experience designing, implementing, and supporting high-performance computing (HPC) clusters, with solid knowledge of CPU/GPU architecture, scalable and robust storage, high-bandwidth interconnects, and cloud-based computing architectures
  • Proven in-depth, distribution-agnostic knowledge of Linux systems (SUSE, Red Hat, Rocky, Ubuntu)
  • Experience building and maintaining robust storage
  • Strong HPC hardware knowledge, especially in the server, GPU, networking, storage, BIOS, and BMC areas
Job Responsibility
  • Design, implement, and support high-performance compute clusters
  • Generate hardware BOMs for the HPC clusters with attention to detail, manage vendors, and oversee hardware release activities
  • Use strong Linux skills to configure appropriate operating systems for the HPC system
  • Understand and assemble the project specifications and performance requirements at the subsystem and system levels
  • Adhere to and drive project timelines to ensure program milestones are completed on time
  • Support the design and release of new products to manufacturing and ultimately the customer, providing quality golden images, procedures, scripts, and documentation to the manufacturing and customer support teams
  • Fulltime

Member of Technical Staff, Site Reliability Engineer (HPC)

As Microsoft continues to push the boundaries of AI, we are on the lookout for p...
Location: United States, Mountain View
Salary: 139900.00 - 274800.00 USD / Year
Company: Microsoft Corporation (https://www.microsoft.com/)
Expiration Date: Until further notice
Requirements
  • Master's Degree in Computer Science, Information Technology, or related field AND 2+ years technical experience in Site Reliability Engineering, DevOps, or Infrastructure Engineering
  • OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 4+ years technical experience in Site Reliability Engineering, DevOps, or Infrastructure Engineering
  • OR equivalent experience
  • Strong proficiency in Kubernetes, Docker, and container orchestration
  • Knowledge of CI/CD pipelines for Inference and ML model deployment
  • Hands-on experience with public cloud platforms like Azure/AWS/GCP and infrastructure-as-code
  • Expertise in monitoring & observability tools (Grafana, Datadog, OpenTelemetry, etc.)
  • Strong programming/scripting skills in Python, Go, or Bash
  • Solid knowledge of distributed systems, networking, and storage
  • Experience running large-scale GPU clusters for ML/AI workloads (preferred)
Job Responsibility
  • Reliability & Availability: Ensure uptime, resiliency, and fault tolerance of HPC clusters powering MAI model training and inference
  • Observability: Design and maintain monitoring, alerting, and logging systems to provide real-time visibility into all aspects of HPC systems including GPU, clusters, storage and networking
  • Automation & Tooling: Build automation for deployments, incident response, scaling, and failover in CPU+GPU environments
  • Incident Management: Lead on-call rotations, troubleshoot production issues, conduct blameless postmortems, and drive continuous improvements
  • Security & Compliance: Ensure data privacy, compliance, and secure operations across model training and serving environments
  • Collaboration: Partner with ML engineers and platform teams to improve developer experience and accelerate research-to-production workflows
What we offer
  • Competitive compensation, equity options, and comprehensive benefits
  • Fulltime

HPC Storage Benchmark Expert

HPC Storage Benchmark Expert. This role has been designed as ‘Onsite’ with an e...
Location: United Kingdom, Edinburgh
Salary: Not provided
Company: Hewlett Packard Enterprise (https://www.hpe.com/)
Expiration Date: Until further notice
Requirements
  • B.Sc. or equivalent in Science, Technology, Engineering, or Mathematics
  • Deep knowledge of HPC storage environments, gained through work with large‑scale HPC systems, including: Linux kernel fundamentals; high‑performance interconnects; parallel file systems (e.g., Lustre, GPFS); I/O and communication libraries; and, desirably, software‑defined storage (e.g., DAOS, VAST, Weka)
  • Proven ability to triage complex issues, create reproducible test cases, and collaborate with R&D to resolve bugs
  • Experience delivering technical content through presentations or training sessions
  • A willingness to adopt and apply new tools, programming paradigms, and technologies
  • Strong team player with the ability to operate independently when needed
Job Responsibility
  • Supporting tender responses by analysing customer storage requirements
  • Designing and proposing optimal storage architectures alongside Presales teams
  • Running benchmarks on existing systems and projecting performance for future systems
  • Developing detailed, accurate reports to support solution recommendations
  • Validating commitments by executing customer tests on delivered systems
  • Sharing insights, knowledge, and best practices internally and with customers
  • Collaborating with top‑tier researchers, software engineers, and technical teams to help shape future products
What we offer
  • Health & Wellbeing
  • Personal & Professional Development
  • Unconditional Inclusion
  • Fulltime

GCP DevOps HPC Engineer

Location: Spain
Salary: 70000.00 - 80000.00 EUR / Year
Company: Signify Technology (signifytechnology.com)
Expiration Date: Until further notice
Requirements
  • 5+ years’ experience in HPC environments (SLURM, MPI, parallel workloads)
  • Strong Linux systems expertise in performance-critical environments
  • Hands-on experience running or migrating HPC workloads in the cloud (GCP preferred)
  • Solid experience with Terraform and Ansible
  • Strong scripting skills (Python, Bash)
  • Deep understanding of GCP services (GCE, VPC, Cloud Storage)
Job Responsibility
  • Lead end-to-end migrations of SLURM-based HPC clusters from on-prem to GCP
  • Design, build, and operate secure, scalable HPC architectures in the cloud
  • Optimise SLURM scheduling, workload performance, and resource utilisation
  • Automate cluster deployment and operations using Terraform, Ansible, Python, and Bash
  • Manage HPC software stacks using Spack
  • Deploy and support parallel workloads using MPI, OpenMP, and related frameworks
  • Troubleshoot performance issues and drive continuous optimisation
  • Collaborate with engineering teams and stakeholders in a fully remote environment
  • Fulltime

HPC Systems Engineer

The Consumer Products Infrastructure team builds and operates the high-performan...
Location: United States, San Francisco
Salary: Not provided
Company: OpenAI (openai.com)
Expiration Date: Until further notice
Requirements
  • 7+ years of experience designing and operating large-scale HPC clusters (1,000+ nodes)
  • Deep expertise with NC, IBM/Platform LSF, and Slurm workload managers
  • Strong Linux system administration experience (RHEL-family preferred)
  • Hands-on experience with MPI, parallel scaling, and performance tuning for simulation workloads
  • Experience using Azure CycleCloud to provision and manage HPC clusters in hybrid cloud environments
  • Proven experience operating InfiniBand or other high-speed interconnects
  • Strong Python and Bash skills for automation, tooling, and workflow optimization
  • Experience with distributed filesystems (NFS, DFS, Lustre, GPFS, BeeGFS)
  • Deep familiarity with HPC licensing systems (FlexLM, DSLS, RLM, LUM)
  • Experience supporting product-oriented engineering or simulation teams
Job Responsibility
  • Architect, deploy, and operate large-scale HPC clusters (1,000+ nodes) supporting simulation workloads critical to consumer product development
  • Optimize workload management using NC, IBM/Platform LSF, and Slurm, with a focus on throughput, fairness, and minimizing queue wait times for product teams
  • Design and implement strategies for workload balancing, cluster federation, and multi-scheduler environments that support diverse product workflows
  • Partner closely with product design, mechanical, electrical, and simulation engineers to debug jobs, improve parallel scaling, and accelerate design-to-validation cycles
  • Administer and harden Linux-based HPC systems (RHEL, Rocky Linux, AlmaLinux), including patching, kernel tuning, and performance optimization
  • Operate and optimize software licensing infrastructure (FlexLM, DSLS, LUM, RLM) to maximize utilization and prevent license-related development bottlenecks
  • Deploy and manage Azure CycleCloud and/or TotalCAE to enable elastic capacity, cloud bursting, and hybrid HPC workflows during peak product development cycles
  • Configure and tune high-speed interconnects, including InfiniBand (HDR/EDR/FDR), to support low-latency, tightly coupled simulation workloads
  • Design and maintain high-performance storage systems (NFS, DFS, Lustre, GPFS / Spectrum Scale, BeeGFS, Azure NetApp) optimized for simulation I/O patterns
  • Build automation and internal tooling using Python and Bash to streamline provisioning, monitoring, diagnostics, and job submission workflows
What we offer
  • Medical, dental, and vision insurance for you and your family, with employer contributions to Health Savings Accounts
  • Pre-tax accounts for Health FSA, Dependent Care FSA, and commuter expenses (parking and transit)
  • 401(k) retirement plan with employer match
  • Paid parental leave (up to 24 weeks for birth parents and 20 weeks for non-birthing parents), plus paid medical and caregiver leave (up to 8 weeks)
  • Paid time off: flexible PTO for exempt employees and up to 15 days annually for non-exempt employees
  • 13+ paid company holidays, and multiple paid coordinated company office closures throughout the year for focus and recharge, plus paid sick or safe time (1 hour per 30 hours worked, or more, as required by applicable state or local law)
  • Mental health and wellness support
  • Employer-paid basic life and disability coverage
  • Annual learning and development stipend to fuel your professional growth
  • Daily meals in our offices, and meal delivery credits as eligible
  • Fulltime

HPC Systems Engineer

As a member of our Platform Development team, you will be instrumental in buildi...
Location: United States, Bala Cynwyd (Philadelphia Area), Pennsylvania
Salary: Not provided
Company: Susquehanna International Group (sig.com)
Expiration Date: Until further notice
Requirements
  • A Bachelor’s degree in Engineering, Computer Science, Information Systems, or a related discipline
  • 5-7 years of progressive experience building Linux- and/or Windows-based HPC platforms
  • Familiarity with kernel-level and I/O subsystem tweaks and tools such as sysctl, strace, tcpdump, and netstat
  • Recent hands-on experience with automation in Python or other tools
  • Experience administering Lustre, GPFS, VAST, or other parallel filesystems
  • Understanding of resource schedulers like HTCondor, SLURM, or similar
Job Responsibility
  • Contribute to our library of home-grown tools, written primarily in Python and Bash, to automate monitoring and maintenance
  • Work closely with Strategy Developers, Quantitative Researchers, and trade-supporting application teams to translate complex problems into scalable solutions
  • Coordinate with IT infrastructure teams, including storage and networking, to identify and implement the best solutions
  • Tune operating systems and batch workflows for performance
  • Dive deep on root-cause analysis of systems issues
  • Integrate all of these solutions into our systems effectively and efficiently
  • Oversee all aspects of our HPC environment, including the scheduler, parallel filesystems, GPUs, and interconnects
  • Implement and optimize high-performance storage solutions, including Lustre, VAST, and GPFS
  • Develop strategies to ensure optimal resource allocation and scalability
  • Utilize monitoring and diagnostic tools to quickly pinpoint failures, streamline troubleshooting processes, and ensure the timely recovery of disrupted workflows