CrawlJobs Logo

Staff Engineer, Distributed Storage, HPC & AI Infrastructure

Netherlands, Amsterdam · Job Posted February 18, 2026
Apply Position
Job Link Share

Job Description

In this role, you will design and deliver multi-petabyte storage systems purpose-built for the world’s largest AI training and inference workloads. You’ll architect high-performance parallel filesystems and object stores, evaluate and integrate cutting-edge technologies such as WekaFS, Ceph, and Lustre, and drive aggressive cost optimization-routinely achieving 30-50% savings through intelligent tiering, lifecycle policies, capacity forecasting, and right-sizing. You will also build Kubernetes-native storage operators and self-service platforms that provide automated provisioning, strict multi-tenancy, performance isolation, and quota enforcement at cluster scale. Day-to-day, you’ll optimize end-to-end data paths for 10-50 GB/s per node, design multi-tier caching architectures, implement intelligent prefetching and model-weight distribution, and tune parallel filesystems for AI workloads.

Job Responsibility

  • Design multi-petabyte AI/ML storage systems
  • integrate WekaFS, Ceph, etc.
  • lead capacity planning and cost optimization (30-50% savings via tiering, lifecycle policies, right-sizing)
  • Design/optimize RDMA, InfiniBand, 400GbE networks
  • tune for max throughput/min latency
  • implement NVMe-oF/iSCSI
  • troubleshoot bottlenecks
  • optimize TCP/IP for storage
  • Build Kubernetes storage operators/controllers
  • enable automated provisioning, self-service abstractions, multi-tenant isolation, quotas
  • create reusable Helm/Terraform patterns
  • Deliver 10-50 GB/s per GPU node
  • optimize caching (weights/datasets/checkpoints), parallel filesystems, and data paths
  • troubleshoot with profiling tools
  • scale to thousands of nodes
  • Build multi-tier caches (local NVMe, distributed, object)
  • optimize data locality and model-weight distribution
  • implement smart prefetching/eviction
  • Implement monitoring, alerting, SLOs
  • design DR/backups with runbooks
  • run chaos engineering
  • ensure 99.9%+ uptime via proactive/automated remediation
  • Partner with ML/SRE teams
  • mentor on storage best practices
  • contribute to open-source
  • write docs, postmortems, and public learnings

Requirements

  • 8+ years in storage engineering with 3+ years managing distributed storage at multi-petabyte scale
  • Proven track record deploying and operating high-performance storage for GPU/HPC clusters
  • Deep Kubernetes and cloud-native storage experience in production environments
  • Strong coding skills in Go and Python with demonstrated ability to build production-grade tools
  • BS/MS in Computer Science, Engineering, or equivalent practical experience
  • History of technical leadership: designing systems that significantly improved performance (>3x), reliability (99.9%+ uptime), or cost efficiency
  • Distributed Storage Systems: Deep expertise in WekaFS, Lustre, GPFS, BeeGFS, or similar parallel filesystems at multi-petabyte scale
  • Object Storage: Production experience with S3, MinIO, Ceph, or R2 including performance optimization and cost management
  • Kubernetes Storage: CSI drivers, StatefulSets, PersistentVolumes, storage operators, and custom controllers
  • Storage optimization for GPU workloads, RDMA/InfiniBand networking, parallel filesystem optimization (100+ GB/s aggregate cluster throughput)
  • Programming: Go and Python for automation, operators, and tooling
  • Infrastructure as Code: Terraform, Ansible, Helm, GitOps (ArgoCD)
  • Linux Storage Stack: Advanced knowledge of filesystems (ext4, xfs), LVM, NVMe optimization, RAID configurations
  • Observability: Prometheus, Grafana, Thanos architecture and operations

Nice to have

  • GPU Direct Storage (GDS), NVMe-oF, storage networking (100GbE/400GbE)
  • ML/AI storage patterns (model weights, checkpointing, dataset caching)
  • Kubernetes operator development (controller-runtime, kubebuilder)
  • Storage snapshots, cloning, and thin provisioning
  • Backup and disaster recovery (Velero, Restic, cross-region replication)
  • Storage encryption (at-rest and in-transit), security and compliance
  • Storage benchmarking and profiling tools (fio, iperf3, iostat, blktrace)

Looking for more opportunities?

Search for other job offers that match your skills and interests.

Similar Jobs for

Staff Engineer, Distributed Storage, HPC & AI Infrastructure

8 matching positions

Member of Technical Staff, Site Reliability Engineer (HPC)

As Microsoft continues to push the boundaries of AI, we are on the lookout for p...
Location
Location
United States , Mountain View
Salary
Salary:
139900.00 - 274800.00 USD / Year
https://www.microsoft.com/ Logo
Microsoft Corporation
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • Master's Degree in Computer Science, Information Technology, or related field AND 2+ years technical experience in Site Reliability Engineering, DevOps, or Infrastructure Engineering
  • OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 4+ years technical experience in Site Reliability Engineering, DevOps, or Infrastructure Engineering
  • OR equivalent experience
  • Strong proficiency in Kubernetes, Docker, and container orchestration
  • Knowledge of CI/CD pipelines for Inference and ML model deployment
  • Hands-on experience with public cloud platforms like Azure/AWS/GCP and infrastructure-as-code
  • Expertise in monitoring & observability tools (Grafana, Datadog, OpenTelemetry, etc.)
  • Strong programming/scripting skills in Python, Go, or Bash
  • Solid knowledge of distributed systems, networking, and storage
  • Experience running large-scale GPU clusters for ML/AI workloads (preferred)
Job Responsibility
Job Responsibility
  • Reliability & Availability: Ensure uptime, resiliency, and fault tolerance of HPC clusters powering MAI model training and inference
  • Observability: Design and maintain monitoring, alerting, and logging systems to provide real-time visibility into all aspects of HPC systems including GPU, clusters, storage and networking
  • Automation & Tooling: Build automation for deployments, incident response, scaling, and failover in CPU+GPU environments
  • Incident Management: Lead on-call rotations, troubleshoot production issues, conduct blameless postmortems, and drive continuous improvements
  • Security & Compliance: Ensure data privacy, compliance, and secure operations across model training and serving environments
  • Collaboration: Partner with ML engineers and platform teams to improve developer experience and accelerate research-to-production workflows
What we offer
What we offer
  • Competitive compensation, equity options, and comprehensive benefits
  • Fulltime
Read More
Arrow Right

Member of Technical Staff, Software Co-Design AI HPC Systems

Our team’s mission is to architect, co-design, and productionize next-generation...
Location
Location
United States , Mountain View
Salary
Salary:
139900.00 - 274800.00 USD / Year
https://www.microsoft.com/ Logo
Microsoft Corporation
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • Bachelor's Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience.
  • Master's Degree in Computer Science or related technical field AND 8+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR Bachelor's Degree in Computer Science or related technical field AND 12+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience.
  • Strong background in one or more of the following areas: AI accelerator or GPU architectures
  • Distributed systems and large-scale AI training/inference
  • High-performance computing (HPC) and collective communications
  • ML systems, runtimes, or compilers
  • Performance modeling, benchmarking, and systems analysis
  • Hardware–software co-design for AI workloads
  • Proficiency in systems-level programming (e.g., C/C++, CUDA, Python) and performance-critical software development.
  • Proven ability to work across organizational boundaries and influence technical decisions involving multiple stakeholders.
Job Responsibility
Job Responsibility
  • Lead the co-design of AI systems across hardware and software boundaries, spanning accelerators, interconnects, memory systems, storage, runtimes, and distributed training/inference frameworks.
  • Drive architectural decisions by analyzing real workloads, identifying bottlenecks across compute, communication, and data movement, and translating findings into actionable system and hardware requirements.
  • Co-design and optimize parallelism strategies, execution models, and distributed algorithms to improve scalability, utilization, reliability, and cost efficiency of large-scale AI systems.
  • Develop and evaluate what-if performance models to project system behavior under future workloads, model architectures, and hardware generations, providing early guidance to hardware and platform roadmaps.
  • Partner with compiler, kernel, and runtime teams to unlock the full performance of current and next-generation accelerators, including custom kernels, scheduling strategies, and memory optimizations.
  • Influence and guide AI hardware design at system and silicon levels, including accelerator microarchitecture, interconnect topology, memory hierarchy, and system integration trade-offs.
  • Lead cross-functional efforts to prototype, validate, and productionize high-impact co-design ideas, working across infrastructure, hardware, and product teams.
  • Mentor senior engineers and researchers, set technical direction, and raise the overall bar for systems rigor, performance engineering, and co-design thinking across the organization.
  • Fulltime
Read More
Arrow Right

Senior Staff Cloud Support Engineer

As a Senior Staff Cloud Support Engineer, you are a technical authority within C...
Location
Location
United States , San Francisco; Sunnyvale
Salary
Salary:
180000.00 - 220000.00 USD / Year
crusoe.ai Logo
Crusoe
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • 8+ years experience in SRE, DevOps, HPC, or Cloud Infrastructure roles
  • Advanced Linux systems expertise
  • Deep Kubernetes operational experience (CKA-level or higher)
  • Strong networking knowledge: Infiniband, RDMA, RoCE, SDN
  • Experience supporting AI/ML workloads at scale (GPU clusters)
  • Proven track record of resolving multi-layer, distributed system failures
  • Strong customer communication and executive-facing presence
Job Responsibility
Job Responsibility
  • Serve as highest-level escalation point for complex P1/P0 incidents
  • Lead cross-functional root cause investigations involving compute, networking (IB/RDMA/RoCE), storage, and orchestration layers
  • Partner with SRE, Software teams (Storage, Networking, Compute, K8) to design systemic fixes rather than recurring workarounds
  • Design and improve node validation, burn-in processes, performance baselining, and release readiness
  • Influence Kubernetes architecture, workload orchestration (Slurm, Terraform), and AI/ML cluster stability
  • Reduce MTTR and incident recurrence through structural improvements
  • Troubleshoot NCCL, IB, GPU driver/firmware issues, distributed training failures
  • Support complex AI workloads (training + inference) with performance tuning and observability improvements
  • Act as senior technical advisor during high-risk customer incidents
  • Deliver executive-ready RCAs with clarity and confidence
What we offer
What we offer
  • Restricted Stock Units in a fast growing, well-funded technology company
  • Health insurance package options that include HDHP and PPO, vision, and dental for you and your dependents
  • Employer contributions to HSA accounts
  • Paid Parental Leave
  • Paid life insurance, short-term and long-term disability
  • Teladoc
  • 401(k) with a 100% match up to 4% of salary
  • Generous paid time off and holiday schedule
  • Cell phone reimbursement
  • Tuition reimbursement
  • Fulltime
Read More
Arrow Right

Senior Principal Engineering Manager

Microsoft Research (MSR) is working to transform the future of artificial intell...
Location
Location
United States , Redmond
Salary
Salary:
163000.00 - 296400.00 USD / Year
https://www.microsoft.com/ Logo
Microsoft Corporation
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • Bachelor's Degree in Computer Science or related technical field AND 8+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
  • 5+ years of people management experience leading software engineering teams, including managing principal engineers
  • Experience building or operating infrastructure for large-scale distributed systems, cloud platforms, or artificial intelligence (AI)/machine learning(ML) workloads
  • Track record of driving execution on complex, multi-workstream infrastructure projects with clear milestones and accountability
  • Technical fluency in one or more of: large-scale compute clusters, GPU infrastructure, scheduling and orchestration (Kubernetes, Volcano), or High-Performance Compute (HPC) environments
  • Experience with GPU programming (CUDA, NCCL) and frameworks such as PyTorch
  • Expertise in networking (InfiniBand, NVLink), storage systems, or distributed training parallelisms
  • A track record of strong cross-functional partnerships, including the ability to align on strategic direction, deliver joint accountabilities, and develop relationships with staff members with widely varied expertise
  • Experience scaling engineering teams through significant growth phases (hiring, onboarding, and integrating new engineers into a high-performing team)
  • Master's Degree in Computer Science or related technical field AND 12+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR Bachelor's Degree in Computer Science or related technical field AND 15+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
Job Responsibility
Job Responsibility
  • Lead, mentor, and grow the engineering team that builds MSR’s AI research infrastructure
  • Recruit and develop exceptional engineering talent, building a diverse team - including hiring, onboarding, career development, and performance management
  • Drive execution across the team by setting clear goals, tracking milestones, managing dependencies, and ensuring accountability for delivering complex infrastructure projects on time and at high quality
  • Lead team culture and process changes, cultivating an AI-first mentality that accelerates our progress through agentic coding, automation, and skills development
  • Provide technical vision and judgment on the team's architecture, strategy, and roadmap — spanning supercomputer GPU clusters, high performance networking, workload optimization, researcher tools, and agentic workflows — while empowering engineers to own deep technical details
  • Collaborate closely cross-discipline with engineers, program managers, and research and science teams to align priorities, resolve dependencies, and build better solutions together
  • Foster a team culture of operational excellence, continuous improvement, and high psychological safety where engineers are empowered to take ownership and innovate
  • Fulltime
Read More
Arrow Right
New

Application Analyst III, Perioperative Applications - Clinical Applications/Information Solutions

The Application Analyst III, Clinical Applications, reports to the leader of the...
Location
Location
United States
Salary
Salary:
Not provided
muschealth.org Logo
MUSC Health
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • A bachelor’s degree and five years directly related experience (IS or clinical)
  • or a high school diploma and nine years directly related experience or a Masters’ degree and three years directly related experience required (IS or clinical)
  • Must possess strong interpersonal, project management, analytical and communication skills
  • Epic certification in OpTime and Anesthesia strongly preferred
Job Responsibility
Job Responsibility
  • Provide expertise in evaluating and resolving complex technical issues
  • Demonstrate strong analytical and communication skills to maximize the benefit of Information Solutions services
  • Generate appropriate communication, process and educational plans to identify and remove obstacles to change
  • Serve as the subject matter expert in all areas of application clinical program supporting applications, maintaining system updates, supporting operational end users
  • Take a lead role in projects
  • Participate in team and project meetings and provide input on intelligent solutions to improve efficiency
  • Coach and transfer knowledge to peers and staff
  • Maintain high professional standards
  • Exhibit excellent customer service skills while performing assigned tasks
  • Fulltime
Read More
Arrow Right
New

Application Analyst III, Rev Cycle - Business Applications/Information Solutions

The Application Analyst III, Business Applications, reports to the Leader of the...
Location
Location
United States
Salary
Salary:
Not provided
muschealth.org Logo
MUSC Health
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • Bachelor's degree and five years directly related experience
  • or a high school diploma and nine years directly related experience
  • or a Master's degree and 3 years directly related experience
  • strong interpersonal, project management, analytical and communication skills
  • application specific certifications preferred
Job Responsibility
Job Responsibility
  • Evaluating and resolving complex technical issues
  • demonstrating strong analytical and communication skills to maximize benefit of Information Solutions services
  • generating appropriate communication, process and educational plans to identify and remove obstacles to change
  • serving as subject matter expert in all areas of the Business Delivery program supporting applications, maintaining system updates, supporting operational end users
  • taking a lead role in projects
  • participating in team and project meetings and providing input on intelligent solutions to improve efficiency
  • coaching and transferring knowledge to peers and staff
  • maintaining high professional standards
  • exhibiting excellent customer service skills
  • Fulltime
Read More
Arrow Right
New

Program Manager III - Analytics/Information Solutions

The Information Solutions Program Manager III reports to the Director of Analyti...
Location
Location
United States , South Carolina
Salary
Salary:
Not provided
muschealth.org Logo
MUSC Health
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • Bachelor's degree (BA/BS) and a minimum of 5 years of experience in health information systems, Information Technology, project management, informatics, analytics, system education, business administration, or similar field
  • Strong communication and interpersonal skills, with a proven ability to work collaboratively with stakeholders at all levels of the organization
  • Excellent project management skills, with a track record of managing complex projects on time and on budget
  • Experience developing training programs and resources to encourage the adoption of new systems and processes
  • Ability to analyze data and present findings to senior management in a clear and concise manner
  • Strong multitasking and time management skills
  • Ability to work independently, taking initiative and driving projects forward
  • EPIC experienced required
Job Responsibility
Job Responsibility
  • Facilitating organizational change across a complex, multi-site, statewide health system
  • Working directly with both operational and clinical teams, identifying project champions, determining key performance indicators and developing strategies to track and measure success
  • Responsible for year‑round orchestration governance, prioritization, risk management, and submission readiness across these stakeholders
  • Workflow design, user education, informatics, and implementation throughout the product lifecycle
  • Closely collaborates with all teams in IS and key stakeholders as needed for project success
  • Fulltime
Read More
Arrow Right
New

Information Security Specialist/Analyst II

The Information Security Analyst II reports to an Information Security Manager o...
Location
Location
United States
Salary
Salary:
Not provided
muschealth.org Logo
MUSC Health
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • A Bachelor's degree in information security, information assurance, computer science, or a related field with at least 2 years of IT security experience or 4-7 years of hands-on experience in information security or related IT experience required
  • Advanced analytical and problem-solving skills along with a solid understanding of information risks concepts and principles
  • Epic Security and Data Courier certifications are required, or the candidate must otherwise be able to obtain such certifications within 6 months
Job Responsibility
Job Responsibility
  • Provides a variety of operational, compliance and consultative functions to enable safe and secure access to information services to support the academic, research, and healthcare automated and manual relative to Epic Security and Access and provides general Tier 2 identity management operational support
  • Fulltime
Read More
Arrow Right