Staff Engineer, Distributed Storage and HPC & AI Infrastructure

Together AI

Location:
United States, San Francisco

Contract Type:
Not provided

Salary:

160,000 - 260,000 USD / Year

Job Description:

In this role, you will design and deliver multi-petabyte storage systems purpose-built for the world’s largest AI training and inference workloads. You’ll architect high-performance parallel filesystems and object stores, evaluate and integrate cutting-edge technologies such as WekaFS, Ceph, and Lustre, and drive aggressive cost optimization, routinely achieving 30-50% savings through intelligent tiering, lifecycle policies, capacity forecasting, and right-sizing. You will also build Kubernetes-native storage operators and self-service platforms that provide automated provisioning, strict multi-tenancy, performance isolation, and quota enforcement at cluster scale. Day-to-day, you’ll optimize end-to-end data paths to deliver 10-50 GB/s per GPU node, design multi-tier caching architectures, implement intelligent prefetching and model-weight distribution, and tune parallel filesystems for AI workloads.

Job Responsibility:

  • Design multi-petabyte AI/ML storage systems
  • Integrate WekaFS, Ceph, etc.
  • Lead capacity planning and cost optimization (30-50% savings via tiering, lifecycle policies, right-sizing)
  • Design/optimize RDMA, InfiniBand, 400GbE networks
  • Tune for max throughput/min latency
  • Implement NVMe-oF/iSCSI
  • Troubleshoot bottlenecks
  • Optimize TCP/IP for storage
  • Build Kubernetes storage operators/controllers
  • Enable automated provisioning, self-service abstractions, multi-tenant isolation, quotas
  • Create reusable Helm/Terraform patterns
  • Deliver 10-50 GB/s per GPU node
  • Optimize caching (weights/datasets/checkpoints), parallel filesystems, and data paths
  • Troubleshoot with profiling tools
  • Scale to thousands of nodes
  • Build multi-tier caches (local NVMe, distributed, object)
  • Optimize data locality and model-weight distribution
  • Implement smart prefetching/eviction
  • Implement monitoring, alerting, SLOs
  • Design DR/backups with runbooks
  • Run chaos engineering
  • Ensure 99.9%+ uptime via proactive/automated remediation
  • Partner with ML/SRE teams
  • Mentor on storage best practices
  • Contribute to open-source
  • Write docs, postmortems, and public learnings

Requirements:

  • 8+ years in storage engineering with 3+ years managing distributed storage at multi-petabyte scale
  • Proven track record deploying and operating high-performance storage for GPU/HPC clusters
  • Deep Kubernetes and cloud-native storage experience in production environments
  • Strong coding skills in Go and Python with demonstrated ability to build production-grade tools
  • BS/MS in Computer Science, Engineering, or equivalent practical experience
  • History of technical leadership: designing systems that significantly improved performance (>3x), reliability (99.9%+ uptime), or cost efficiency
  • Distributed Storage Systems: Deep expertise in WekaFS, Lustre, GPFS, BeeGFS, or similar parallel filesystems at multi-petabyte scale
  • Object Storage: Production experience with S3, MinIO, Ceph, or R2 including performance optimization and cost management
  • Kubernetes Storage: CSI drivers, StatefulSets, PersistentVolumes, storage operators, and custom controllers
  • Storage optimization for GPU workloads, RDMA/InfiniBand networking, parallel filesystem optimization (100+ GB/s aggregate cluster throughput)
  • Programming: Go and Python for automation, operators, and tooling
  • Infrastructure as Code: Terraform, Ansible, Helm, GitOps (ArgoCD)
  • Linux Storage Stack: Advanced knowledge of filesystems (ext4, xfs), LVM, NVMe optimization, RAID configurations
  • Observability: Prometheus, Grafana, Thanos architecture and operations

Nice to have:

  • GPU Direct Storage (GDS), NVMe-oF, storage networking (100GbE/400GbE)
  • ML/AI storage patterns (model weights, checkpointing, dataset caching)
  • Kubernetes operator development (controller-runtime, kubebuilder)
  • Storage snapshots, cloning, and thin provisioning
  • Backup and disaster recovery (Velero, Restic, cross-region replication)
  • Storage encryption (at-rest and in-transit), security and compliance
  • Storage benchmarking and profiling tools (fio, iperf3, iostat, blktrace)

What we offer:

  • Competitive compensation
  • Startup equity
  • Health insurance
  • Flexibility in terms of remote work

Additional Information:

Job Posted:
February 18, 2026

Employment Type:
Full-time

Work Type:
On-site work