
Staff Engineer, Distributed Storage, HPC & AI Infrastructure

Netherlands, Amsterdam · Job Posted February 18, 2026

Job Description

In this role, you will design and deliver multi-petabyte storage systems purpose-built for the world’s largest AI training and inference workloads. You’ll architect high-performance parallel filesystems and object stores, evaluate and integrate cutting-edge technologies such as WekaFS, Ceph, and Lustre, and drive aggressive cost optimization, routinely achieving 30-50% savings through intelligent tiering, lifecycle policies, capacity forecasting, and right-sizing. You will also build Kubernetes-native storage operators and self-service platforms that provide automated provisioning, strict multi-tenancy, performance isolation, and quota enforcement at cluster scale. Day-to-day, you’ll optimize end-to-end data paths for 10-50 GB/s per node, design multi-tier caching architectures, implement intelligent prefetching and model-weight distribution, and tune parallel filesystems for AI workloads.

Job Responsibility

  • Design multi-petabyte AI/ML storage systems
  • Integrate WekaFS, Ceph, etc.
  • Lead capacity planning and cost optimization (30-50% savings via tiering, lifecycle policies, right-sizing)
  • Design/optimize RDMA, InfiniBand, 400GbE networks
  • Tune for max throughput/min latency
  • Implement NVMe-oF/iSCSI
  • Troubleshoot bottlenecks
  • Optimize TCP/IP for storage
  • Build Kubernetes storage operators/controllers
  • Enable automated provisioning, self-service abstractions, multi-tenant isolation, quotas
  • Create reusable Helm/Terraform patterns
  • Deliver 10-50 GB/s per GPU node
  • Optimize caching (weights/datasets/checkpoints), parallel filesystems, and data paths
  • Troubleshoot with profiling tools
  • Scale to thousands of nodes
  • Build multi-tier caches (local NVMe, distributed, object)
  • Optimize data locality and model-weight distribution
  • Implement smart prefetching/eviction
  • Implement monitoring, alerting, SLOs
  • Design DR/backups with runbooks
  • Run chaos engineering
  • Ensure 99.9%+ uptime via proactive/automated remediation
  • Partner with ML/SRE teams
  • Mentor on storage best practices
  • Contribute to open-source
  • Write docs, postmortems, and public learnings

Requirements

  • 8+ years in storage engineering with 3+ years managing distributed storage at multi-petabyte scale
  • Proven track record deploying and operating high-performance storage for GPU/HPC clusters
  • Deep Kubernetes and cloud-native storage experience in production environments
  • Strong coding skills in Go and Python with demonstrated ability to build production-grade tools
  • BS/MS in Computer Science, Engineering, or equivalent practical experience
  • History of technical leadership: designing systems that significantly improved performance (>3x), reliability (99.9%+ uptime), or cost efficiency
  • Distributed Storage Systems: Deep expertise in WekaFS, Lustre, GPFS, BeeGFS, or similar parallel filesystems at multi-petabyte scale
  • Object Storage: Production experience with S3, MinIO, Ceph, or R2 including performance optimization and cost management
  • Kubernetes Storage: CSI drivers, StatefulSets, PersistentVolumes, storage operators, and custom controllers
  • Storage optimization for GPU workloads, RDMA/InfiniBand networking, parallel filesystem optimization (100+ GB/s aggregate cluster throughput)
  • Programming: Go and Python for automation, operators, and tooling
  • Infrastructure as Code: Terraform, Ansible, Helm, GitOps (ArgoCD)
  • Linux Storage Stack: Advanced knowledge of filesystems (ext4, xfs), LVM, NVMe optimization, RAID configurations
  • Observability: Prometheus, Grafana, Thanos architecture and operations

Nice to have

  • GPU Direct Storage (GDS), NVMe-oF, storage networking (100GbE/400GbE)
  • ML/AI storage patterns (model weights, checkpointing, dataset caching)
  • Kubernetes operator development (controller-runtime, kubebuilder)
  • Storage snapshots, cloning, and thin provisioning
  • Backup and disaster recovery (Velero, Restic, cross-region replication)
  • Storage encryption (at-rest and in-transit), security and compliance
  • Storage benchmarking and profiling tools (fio, iperf3, iostat, blktrace)


Similar Jobs for Staff Engineer, Distributed Storage, HPC & AI Infrastructure

8 matching positions

Senior Principal Engineering Manager

Microsoft Research (MSR) is working to transform the future of artificial intell...
Location
United States, Redmond
Salary
163000.00 - 296400.00 USD / Year
Microsoft Corporation (https://www.microsoft.com/)
Expiration Date
Until further notice
Requirements
  • Bachelor's Degree in Computer Science or related technical field AND 8+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
  • 5+ years of people management experience leading software engineering teams, including managing principal engineers
  • Experience building or operating infrastructure for large-scale distributed systems, cloud platforms, or artificial intelligence (AI)/machine learning (ML) workloads
  • Track record of driving execution on complex, multi-workstream infrastructure projects with clear milestones and accountability
  • Technical fluency in one or more of: large-scale compute clusters, GPU infrastructure, scheduling and orchestration (Kubernetes, Volcano), or High-Performance Compute (HPC) environments
  • Experience with GPU programming (CUDA, NCCL) and frameworks such as PyTorch
  • Expertise in networking (InfiniBand, NVLink), storage systems, or distributed training parallelisms
  • A track record of strong cross-functional partnerships, including the ability to align on strategic direction, deliver joint accountabilities, and develop relationships with staff members with widely varied expertise
  • Experience scaling engineering teams through significant growth phases (hiring, onboarding, and integrating new engineers into a high-performing team)
  • Master's Degree in Computer Science or related technical field AND 12+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR Bachelor's Degree in Computer Science or related technical field AND 15+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
Job Responsibility
  • Lead, mentor, and grow the engineering team that builds MSR’s AI research infrastructure
  • Recruit and develop exceptional engineering talent, building a diverse team, including hiring, onboarding, career development, and performance management
  • Drive execution across the team by setting clear goals, tracking milestones, managing dependencies, and ensuring accountability for delivering complex infrastructure projects on time and at high quality
  • Lead team culture and process changes, cultivating an AI-first mentality that accelerates our progress through agentic coding, automation, and skills development
  • Provide technical vision and judgment on the team's architecture, strategy, and roadmap — spanning supercomputer GPU clusters, high performance networking, workload optimization, researcher tools, and agentic workflows — while empowering engineers to own deep technical details
  • Collaborate closely cross-discipline with engineers, program managers, and research and science teams to align priorities, resolve dependencies, and build better solutions together
  • Foster a team culture of operational excellence, continuous improvement, and high psychological safety where engineers are empowered to take ownership and innovate
  • Fulltime

Member of Technical Staff, Site Reliability Engineer (HPC)

As Microsoft continues to push the boundaries of AI, we are on the lookout for p...
Location
United States, Mountain View
Salary
139900.00 - 274800.00 USD / Year
Microsoft Corporation (https://www.microsoft.com/)
Expiration Date
Until further notice
Requirements
  • Master's Degree in Computer Science, Information Technology, or related field AND 2+ years technical experience in Site Reliability Engineering, DevOps, or Infrastructure Engineering
  • OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 4+ years technical experience in Site Reliability Engineering, DevOps, or Infrastructure Engineering
  • OR equivalent experience
  • Strong proficiency in Kubernetes, Docker, and container orchestration
  • Knowledge of CI/CD pipelines for Inference and ML model deployment
  • Hands-on experience with public cloud platforms like Azure/AWS/GCP and infrastructure-as-code
  • Expertise in monitoring & observability tools (Grafana, Datadog, OpenTelemetry, etc.)
  • Strong programming/scripting skills in Python, Go, or Bash
  • Solid knowledge of distributed systems, networking, and storage
  • Experience running large-scale GPU clusters for ML/AI workloads (preferred)
Job Responsibility
  • Reliability & Availability: Ensure uptime, resiliency, and fault tolerance of HPC clusters powering MAI model training and inference
  • Observability: Design and maintain monitoring, alerting, and logging systems to provide real-time visibility into all aspects of HPC systems including GPU, clusters, storage and networking
  • Automation & Tooling: Build automation for deployments, incident response, scaling, and failover in CPU+GPU environments
  • Incident Management: Lead on-call rotations, troubleshoot production issues, conduct blameless postmortems, and drive continuous improvements
  • Security & Compliance: Ensure data privacy, compliance, and secure operations across model training and serving environments
  • Collaboration: Partner with ML engineers and platform teams to improve developer experience and accelerate research-to-production workflows
What we offer
  • Competitive compensation, equity options, and comprehensive benefits
  • Fulltime

Member of Technical Staff, Software Co-Design AI HPC Systems

Our team’s mission is to architect, co-design, and productionize next-generation...
Location
United States, Mountain View
Salary
139900.00 - 274800.00 USD / Year
Microsoft Corporation (https://www.microsoft.com/)
Expiration Date
Until further notice
Requirements
  • Bachelor's Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience.
  • Master's Degree in Computer Science or related technical field AND 8+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR Bachelor's Degree in Computer Science or related technical field AND 12+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience.
  • Strong background in one or more of the following areas: AI accelerator or GPU architectures
  • Distributed systems and large-scale AI training/inference
  • High-performance computing (HPC) and collective communications
  • ML systems, runtimes, or compilers
  • Performance modeling, benchmarking, and systems analysis
  • Hardware–software co-design for AI workloads
  • Proficiency in systems-level programming (e.g., C/C++, CUDA, Python) and performance-critical software development.
  • Proven ability to work across organizational boundaries and influence technical decisions involving multiple stakeholders.
Job Responsibility
  • Lead the co-design of AI systems across hardware and software boundaries, spanning accelerators, interconnects, memory systems, storage, runtimes, and distributed training/inference frameworks.
  • Drive architectural decisions by analyzing real workloads, identifying bottlenecks across compute, communication, and data movement, and translating findings into actionable system and hardware requirements.
  • Co-design and optimize parallelism strategies, execution models, and distributed algorithms to improve scalability, utilization, reliability, and cost efficiency of large-scale AI systems.
  • Develop and evaluate what-if performance models to project system behavior under future workloads, model architectures, and hardware generations, providing early guidance to hardware and platform roadmaps.
  • Partner with compiler, kernel, and runtime teams to unlock the full performance of current and next-generation accelerators, including custom kernels, scheduling strategies, and memory optimizations.
  • Influence and guide AI hardware design at system and silicon levels, including accelerator microarchitecture, interconnect topology, memory hierarchy, and system integration trade-offs.
  • Lead cross-functional efforts to prototype, validate, and productionize high-impact co-design ideas, working across infrastructure, hardware, and product teams.
  • Mentor senior engineers and researchers, set technical direction, and raise the overall bar for systems rigor, performance engineering, and co-design thinking across the organization.
  • Fulltime

Senior Staff Cloud Support Engineer

As a Senior Staff Cloud Support Engineer, you are a technical authority within C...
Location
United States, San Francisco; Sunnyvale
Salary
180000.00 - 220000.00 USD / Year
Crusoe (crusoe.ai)
Expiration Date
Until further notice
Requirements
  • 8+ years experience in SRE, DevOps, HPC, or Cloud Infrastructure roles
  • Advanced Linux systems expertise
  • Deep Kubernetes operational experience (CKA-level or higher)
  • Strong networking knowledge: InfiniBand, RDMA, RoCE, SDN
  • Experience supporting AI/ML workloads at scale (GPU clusters)
  • Proven track record of resolving multi-layer, distributed system failures
  • Strong customer communication and executive-facing presence
Job Responsibility
  • Serve as highest-level escalation point for complex P1/P0 incidents
  • Lead cross-functional root cause investigations involving compute, networking (IB/RDMA/RoCE), storage, and orchestration layers
  • Partner with SRE and Software teams (Storage, Networking, Compute, K8s) to design systemic fixes rather than recurring workarounds
  • Design and improve node validation, burn-in processes, performance baselining, and release readiness
  • Influence Kubernetes architecture, workload orchestration (Slurm, Terraform), and AI/ML cluster stability
  • Reduce MTTR and incident recurrence through structural improvements
  • Troubleshoot NCCL, IB, GPU driver/firmware issues, distributed training failures
  • Support complex AI workloads (training + inference) with performance tuning and observability improvements
  • Act as senior technical advisor during high-risk customer incidents
  • Deliver executive-ready RCAs with clarity and confidence
What we offer
  • Restricted Stock Units in a fast growing, well-funded technology company
  • Health insurance package options that include HDHP and PPO, vision, and dental for you and your dependents
  • Employer contributions to HSA accounts
  • Paid Parental Leave
  • Paid life insurance, short-term and long-term disability
  • Teladoc
  • 401(k) with a 100% match up to 4% of salary
  • Generous paid time off and holiday schedule
  • Cell phone reimbursement
  • Tuition reimbursement
  • Fulltime

Project Finance & Operations Analyst

Our client is a globally recognized powerhouse in the technology and innovative ...
Location
Poland, Warszawa
Salary
Not provided
Randstad (https://www.randstad.com)
Expiration Date
July 31, 2026
Requirements
  • over 3 years of professional experience in corporate finance operations, project controlling, procurement support, or contractor administration
  • advanced Excel skills, including a strong command of pivot tables, lookups, data reconciliations, and building trackers
  • experience working with SAP is highly preferred
  • sharp analytical skills, a great eye for detail, and the ability to comfortably handle multiple operational tasks while meeting deadlines
  • fluent in both English and Polish, with the communication skills needed to collaborate effectively across teams
  • experience working within a multinational environment or a Shared Services Center (SSC) is a major plus
Job Responsibility
  • preparing and keeping track of monthly P&L statements alongside detailed margin calculations, as well as delivering accurate monthly forecasts and financial analysis for management
  • verifying timesheets, billing records, contractor data, outsourcing invoices, and finder fees
  • raising, monitoring, and approving purchase orders (POs) directly in SAP, while matching and reconciling supplier invoices against approvals and supporting documentation
  • actively supporting GR/IR and month-end closing activities, investigating missing approvals or invoice discrepancies, and keeping financial documentation flawlessly organized and audit-ready
  • managing settlements for overtime, shift work, extra projects, and business trips, alongside tracking contractor onboarding and offboarding documentation
  • partnering closely with HR, Payroll, Finance, Procurement, and Project Managers to keep operations running smoothly, while identifying opportunities for process and reporting improvements
What we offer
  • 12-month B2B contract with a high likelihood of long-term extension
  • the autonomy of a remote work model with occasional face-to-face team alignment
  • opportunities for local and international business travel
  • access to private medical care and a sports card
  • the prestige of working with a premier global technology leader on new operational structures
  • Fulltime

Document Services Administrator

We've been helping our members save for their future and buy a home of their own...
Location
United Kingdom, Leeds
Salary
25000.00 GBP / Year
Leeds Building Society (leedsbuildingsociety.co.uk)
Expiration Date
June 08, 2026
Requirements
  • Administrative experience
  • Experience of working under pressure in a fast and accurate manner, to meet set deadlines
  • Excellent organisational skills
  • Previous experience working with Microsoft Office and internal systems
Job Responsibility
  • Efficiently handle inbound Society documents ensuring these are forwarded to the relevant departments as well as processing outbound mail for our members and internal departments
  • Build strong relationships with operational colleagues across the Society and become knowledgeable about the Society's different business areas
What we offer
  • An annual colleague bonus of up to 12%
  • Matched pension contributions of up to 10%
  • 26 days holiday, plus bank holidays and holiday purchase scheme of up to 5 days each year
  • Colleague Mortgage and Saver products
  • Electric vehicle scheme/ Cycle to Work scheme
  • 2 days' volunteering per year
  • Fulltime

Line Cook

The Renaissance Minneapolis Bloomington Hotel is seeking a creative and skilled ...
Location
United States, Bloomington
Salary
Not provided
Spire Hospitality (spirehotels.com)
Expiration Date
Until further notice
Requirements
  • Completion of a Culinary or Apprenticeship Program is preferred
  • A minimum of 1 year of cooking experience in a hotel is preferred
  • Minimum of 1 year of cooking experience in a similar role and at an operation of similar size is required
  • Food Handlers Certification preferred
  • Basic mathematical skills are necessary to understand recipes, measurements, requisition amounts, and portion sizes.
Job Responsibility
  • Equip assigned workstations with essential products and culinary equipment to ensure efficient production and exceptional service
  • Complete all prep work efficiently for soups, sauces, salads, and various ingredients
  • Ensure meticulous product storage and precise portion control for each dish
  • Minimize spoilage and waste through effective product rotation
  • Monitor food, produce, and cooking supply levels to facilitate timely reordering
  • Maintain impeccable cleanliness and ensure functionality of refrigeration, storage, and work areas
  • Assist with dishwashing and other stewarding duties as assigned.
What we offer
  • EARLY PAY OR EARNED WAGE ACCESS (we get paid before payday)
  • medical, dental, and vision benefits through UNITE HERE Local 17, Minnesota’s Hospitality Union
  • Fulltime

Vice President, Sales

The Vice President of Sales — Full-Service Hotels is the senior sales leader res...
Location
United States, Irving
Salary
Not provided
Spire Hospitality (spirehotels.com)
Expiration Date
Until further notice
Requirements
  • Minimum 10 years of progressive hotel sales experience, with at least 4 years in a multi-property or regional sales leadership role
  • Minimum 6 years of direct full-service hotel sales experience
  • Deep understanding of full-service group, catering, and corporate transient dynamics
  • Demonstrated success leading both group and corporate/transient sales functions with measurable revenue and market share results
  • Proven experience managing and developing a layered sales team (regional managers and/or multiple Directors of Sales simultaneously)
  • Strong knowledge of the group sales process from prospecting through contract execution, including intermediary relationships and RFP management
  • Fluency with major brand sales platforms and distribution tools (Marriott, Hilton, IHG, or comparable)
  • Experience navigating brand loyalty and preferred program environments
  • Proficiency with CRM platforms (Salesforce, Amadeus Delphi FDC, or comparable) and STR/competitive set reporting
  • Established network of group intermediaries, corporate travel buyers, and national account contacts
Job Responsibility
  • Develop and execute a comprehensive sales strategy for all full-service properties spanning group, corporate transient, and consortia segments
  • Establish annual sales production goals, booking pace targets, and account penetration benchmarks by property, region, and segment
  • Translate portfolio-level commercial priorities into clear, property-specific sales plans with measurable outcomes
  • Monitor competitive set performance and market demand trends
  • Proactively adjust sales tactics to capture share and defend existing accounts
  • Partner with the COO to align full-service sales execution with Spire's enterprise commercial strategy
  • Represent the sales function in owner presentations, asset management reviews, and Spire executive leadership meetings
  • Drive group room night production across all full-service properties, including meetings, conventions, association business, corporate groups, and social/wedding segments
  • Build and maintain a robust national account base and key intermediary relationships (HelmsBriscoe, ConferenceDirect, Maritz, and other third-party planners)
  • Establish and enforce group booking pace standards, lead-to-contract conversion benchmarks, and group displacement guidelines in coordination with revenue management
What we offer
  • Medical
  • Dental
  • Vision
  • Pet discount program
  • Identity theft protection
  • Pre-paid legal support
  • Flexible spending accounts
  • Matched 401K
  • Life insurance
  • Critical accident or illness
  • Fulltime