Staff Engineer, Distributed Storage, HPC & AI Infrastructure

Together AI

Location:
Netherlands, Amsterdam

Contract Type:
Not provided

Salary:
Not provided

Job Description:

In this role, you will design and deliver multi-petabyte storage systems purpose-built for the world’s largest AI training and inference workloads. You’ll architect high-performance parallel filesystems and object stores, evaluate and integrate cutting-edge technologies such as WekaFS, Ceph, and Lustre, and drive aggressive cost optimization, routinely achieving 30-50% savings through intelligent tiering, lifecycle policies, capacity forecasting, and right-sizing. You will also build Kubernetes-native storage operators and self-service platforms that provide automated provisioning, strict multi-tenancy, performance isolation, and quota enforcement at cluster scale. Day to day, you’ll optimize end-to-end data paths for 10-50 GB/s per node, design multi-tier caching architectures, implement intelligent prefetching and model-weight distribution, and tune parallel filesystems for AI workloads.

Job Responsibilities:

  • Design multi-petabyte AI/ML storage systems
  • Integrate WekaFS, Ceph, etc.
  • Lead capacity planning and cost optimization (30-50% savings via tiering, lifecycle policies, right-sizing)
  • Design/optimize RDMA, InfiniBand, 400GbE networks
  • Tune for max throughput/min latency
  • Implement NVMe-oF/iSCSI
  • Troubleshoot bottlenecks
  • Optimize TCP/IP for storage
  • Build Kubernetes storage operators/controllers
  • Enable automated provisioning, self-service abstractions, multi-tenant isolation, quotas
  • Create reusable Helm/Terraform patterns
  • Deliver 10-50 GB/s per GPU node
  • Optimize caching (weights/datasets/checkpoints), parallel filesystems, and data paths
  • Troubleshoot with profiling tools
  • Scale to thousands of nodes
  • Build multi-tier caches (local NVMe, distributed, object)
  • Optimize data locality and model-weight distribution
  • Implement smart prefetching/eviction
  • Implement monitoring, alerting, SLOs
  • Design DR/backups with runbooks
  • Run chaos engineering
  • Ensure 99.9%+ uptime via proactive/automated remediation
  • Partner with ML/SRE teams
  • Mentor on storage best practices
  • Contribute to open-source
  • Write docs, postmortems, and public learnings

Requirements:

  • 8+ years in storage engineering with 3+ years managing distributed storage at multi-petabyte scale
  • Proven track record deploying and operating high-performance storage for GPU/HPC clusters
  • Deep Kubernetes and cloud-native storage experience in production environments
  • Strong coding skills in Go and Python with demonstrated ability to build production-grade tools
  • BS/MS in Computer Science, Engineering, or equivalent practical experience
  • History of technical leadership: designing systems that significantly improved performance (>3x), reliability (99.9%+ uptime), or cost efficiency
  • Distributed Storage Systems: Deep expertise in WekaFS, Lustre, GPFS, BeeGFS, or similar parallel filesystems at multi-petabyte scale
  • Object Storage: Production experience with S3, MinIO, Ceph, or R2 including performance optimization and cost management
  • Kubernetes Storage: CSI drivers, StatefulSets, PersistentVolumes, storage operators, and custom controllers
  • Storage optimization for GPU workloads, RDMA/InfiniBand networking, parallel filesystem optimization (100+ GB/s aggregate cluster throughput)
  • Programming: Go and Python for automation, operators, and tooling
  • Infrastructure as Code: Terraform, Ansible, Helm, GitOps (ArgoCD)
  • Linux Storage Stack: Advanced knowledge of filesystems (ext4, xfs), LVM, NVMe optimization, RAID configurations
  • Observability: Prometheus, Grafana, Thanos architecture and operations

Nice to have:

  • GPU Direct Storage (GDS), NVMe-oF, storage networking (100GbE/400GbE)
  • ML/AI storage patterns (model weights, checkpointing, dataset caching)
  • Kubernetes operator development (controller-runtime, kubebuilder)
  • Storage snapshots, cloning, and thin provisioning
  • Backup and disaster recovery (Velero, Restic, cross-region replication)
  • Storage encryption (at-rest and in-transit), security and compliance
  • Storage benchmarking and profiling tools (fio, iperf3, iostat, blktrace)

Additional Information:

Job Posted:
February 18, 2026

Work Type:
Hybrid work