Staff Engineer, Distributed Storage, HPC & AI Infrastructure

Together AI

Location:
Netherlands, Amsterdam

Contract Type:
Not provided

Salary:
Not provided

Job Description:

In this role, you will design and deliver multi-petabyte storage systems purpose-built for the world’s largest AI training and inference workloads. You’ll architect high-performance parallel filesystems and object stores, evaluate and integrate cutting-edge technologies such as WekaFS, Ceph, and Lustre, and drive aggressive cost optimization, routinely achieving 30-50% savings through intelligent tiering, lifecycle policies, capacity forecasting, and right-sizing. You will also build Kubernetes-native storage operators and self-service platforms that provide automated provisioning, strict multi-tenancy, performance isolation, and quota enforcement at cluster scale. Day-to-day, you’ll optimize end-to-end data paths for 10-50 GB/s per node, design multi-tier caching architectures, implement intelligent prefetching and model-weight distribution, and tune parallel filesystems for AI workloads.

Job Responsibility:

  • Design multi-petabyte AI/ML storage systems
  • Integrate technologies such as WekaFS, Ceph, and Lustre
  • Lead capacity planning and cost optimization (30-50% savings via tiering, lifecycle policies, right-sizing)
  • Design and optimize RDMA, InfiniBand, and 400GbE networks
  • Tune for maximum throughput and minimum latency
  • Implement NVMe-oF/iSCSI
  • Troubleshoot bottlenecks
  • Optimize TCP/IP for storage
  • Build Kubernetes storage operators and controllers
  • Enable automated provisioning, self-service abstractions, multi-tenant isolation, and quotas
  • Create reusable Helm/Terraform patterns
  • Deliver 10-50 GB/s per GPU node
  • Optimize caching (weights, datasets, checkpoints), parallel filesystems, and data paths
  • Troubleshoot with profiling tools
  • Scale to thousands of nodes
  • Build multi-tier caches (local NVMe, distributed, object)
  • Optimize data locality and model-weight distribution
  • Implement smart prefetching and eviction
  • Implement monitoring, alerting, and SLOs
  • Design DR and backups with runbooks
  • Run chaos engineering
  • Ensure 99.9%+ uptime via proactive and automated remediation
  • Partner with ML and SRE teams
  • Mentor on storage best practices
  • Contribute to open source
  • Write docs, postmortems, and public learnings

Requirements:

  • 8+ years in storage engineering with 3+ years managing distributed storage at multi-petabyte scale
  • Proven track record deploying and operating high-performance storage for GPU/HPC clusters
  • Deep Kubernetes and cloud-native storage experience in production environments
  • Strong coding skills in Go and Python with demonstrated ability to build production-grade tools
  • BS/MS in Computer Science, Engineering, or equivalent practical experience
  • History of technical leadership: designing systems that significantly improved performance (>3x), reliability (99.9%+ uptime), or cost efficiency
  • Distributed Storage Systems: Deep expertise in WekaFS, Lustre, GPFS, BeeGFS, or similar parallel filesystems at multi-petabyte scale
  • Object Storage: Production experience with S3, MinIO, Ceph, or R2 including performance optimization and cost management
  • Kubernetes Storage: CSI drivers, StatefulSets, PersistentVolumes, storage operators, and custom controllers
  • Storage optimization for GPU workloads, RDMA/InfiniBand networking, parallel filesystem optimization (100+ GB/s aggregate cluster throughput)
  • Programming: Go and Python for automation, operators, and tooling
  • Infrastructure as Code: Terraform, Ansible, Helm, GitOps (ArgoCD)
  • Linux Storage Stack: Advanced knowledge of filesystems (ext4, xfs), LVM, NVMe optimization, RAID configurations
  • Observability: Prometheus, Grafana, Thanos architecture and operations

Nice to have:

  • GPU Direct Storage (GDS), NVMe-oF, storage networking (100GbE/400GbE)
  • ML/AI storage patterns (model weights, checkpointing, dataset caching)
  • Kubernetes operator development (controller-runtime, kubebuilder)
  • Storage snapshots, cloning, and thin provisioning
  • Backup and disaster recovery (Velero, Restic, cross-region replication)
  • Storage encryption (at-rest and in-transit), security and compliance
  • Storage benchmarking and profiling tools (fio, iperf3, iostat, blktrace)

Additional Information:

Job Posted:
February 18, 2026

Work Type:
Hybrid work
