Staff Engineer, Distributed Storage, HPC & AI Infrastructure Job at Together AI (Amsterdam)

Job Description

In this role, you will design and deliver multi-petabyte storage systems purpose-built for the world’s largest AI training and inference workloads. You’ll architect high-performance parallel filesystems and object stores, evaluate and integrate cutting-edge technologies such as WekaFS, Ceph, and Lustre, and drive aggressive cost optimization, routinely achieving 30-50% savings through intelligent tiering, lifecycle policies, capacity forecasting, and right-sizing. You will also build Kubernetes-native storage operators and self-service platforms that provide automated provisioning, strict multi-tenancy, performance isolation, and quota enforcement at cluster scale. Day-to-day, you’ll optimize end-to-end data paths for 10-50 GB/s per node, design multi-tier caching architectures, implement intelligent prefetching and model-weight distribution, and tune parallel filesystems for AI workloads.

Job Responsibilities

Design multi-petabyte AI/ML storage systems; integrate WekaFS, Ceph, etc.; lead capacity planning and cost optimization (30-50% savings via tiering, lifecycle policies, right-sizing)
Design/optimize RDMA, InfiniBand, 400GbE networks; tune for max throughput/min latency; implement NVMe-oF/iSCSI; troubleshoot bottlenecks; optimize TCP/IP for storage
Build Kubernetes storage operators/controllers; enable automated provisioning, self-service abstractions, multi-tenant isolation, quotas; create reusable Helm/Terraform patterns
Deliver 10-50 GB/s per GPU node; optimize caching (weights/datasets/checkpoints), parallel filesystems, and data paths; troubleshoot with profiling tools; scale to thousands of nodes
Build multi-tier caches (local NVMe, distributed, object); optimize data locality and model-weight distribution; implement smart prefetching/eviction
Implement monitoring, alerting, SLOs; design DR/backups with runbooks; run chaos engineering; ensure 99.9%+ uptime via proactive/automated remediation
Partner with ML/SRE teams; mentor on storage best practices; contribute to open-source; write docs, postmortems, and public learnings

Requirements

8+ years in storage engineering with 3+ years managing distributed storage at multi-petabyte scale
Proven track record deploying and operating high-performance storage for GPU/HPC clusters
Deep Kubernetes and cloud-native storage experience in production environments
Strong coding skills in Go and Python with demonstrated ability to build production-grade tools
BS/MS in Computer Science, Engineering, or equivalent practical experience
History of technical leadership: designing systems that significantly improved performance (>3x), reliability (99.9%+ uptime), or cost efficiency
Distributed Storage Systems: Deep expertise in WekaFS, Lustre, GPFS, BeeGFS, or similar parallel filesystems at multi-petabyte scale
Object Storage: Production experience with S3, MinIO, Ceph, or R2 including performance optimization and cost management
Kubernetes Storage: CSI drivers, StatefulSets, PersistentVolumes, storage operators, and custom controllers
Storage optimization for GPU workloads, RDMA/InfiniBand networking, parallel filesystem optimization (100+ GB/s aggregate cluster throughput)
Programming: Go and Python for automation, operators, and tooling
Infrastructure as Code: Terraform, Ansible, Helm, GitOps (ArgoCD)
Linux Storage Stack: Advanced knowledge of filesystems (ext4, xfs), LVM, NVMe optimization, RAID configurations
Observability: Prometheus, Grafana, Thanos architecture and operations

Nice to have

GPU Direct Storage (GDS), NVMe-oF, storage networking (100GbE/400GbE)
ML/AI storage patterns (model weights, checkpointing, dataset caching)
Kubernetes operator development (controller-runtime, kubebuilder)
Storage snapshots, cloning, and thin provisioning
Backup and disaster recovery (Velero, Restic, cross-region replication)
Storage encryption (at-rest and in-transit), security and compliance
Storage benchmarking and profiling tools (fio, iperf3, iostat, blktrace)
