We are seeking a seasoned Storage Engineering Manager with experience in the specification, evaluation, deployment, and management of HPC storage solutions across multiple datacenters to build out a world-class team. You will hire and guide a team of storage engineers in building storage infrastructure that serves our AI/ML infrastructure products, ensuring the seamless deployment and operational excellence of both the physical and logical storage infrastructure (including proprietary and open-source solutions). Your role is not just to manage people, but to serve as the ultimate technical and operational authority for our high-performance, petabyte-scale storage solutions. Your leadership will be pivotal in ensuring our systems are not just high-performing, but also reliable, scalable, and manageable as we grow toward exascale. This is a unique opportunity to work at the intersection of large-scale distributed systems and the rapidly evolving field of artificial intelligence infrastructure, and to have a significant impact on the future of AI. You will be building the foundational infrastructure that powers some of the most advanced AI research and products in the world.
Job Responsibilities:
Hire, grow, lead, and mentor a team of high-performing storage engineers delivering petabyte-scale HPC storage solutions
Foster a high-velocity culture of innovation, technical excellence, and collaboration
Conduct regular one-on-one meetings, provide constructive feedback, and support career development for team members
Drive outcomes by managing project priorities, deadlines, and deliverables using Agile methodologies
Drive the technical vision and strategy for Lambda's distributed storage solutions
Define storage vendor selection criteria, and lead vendor selection and vendor relationship management (support, installation, scheduling, specification, procurement)
Manage the team in storage lifecycle management (installation, cabling, capacity upgrades, service, RMA, and updates to both hardware and software components as needed)
Guide choices around optimization of storage pools, sharding, and tiering/caching strategies
Lead team in tasks related to multi-tenant security, tenant provisioning, metering integration, storage protocol interconnection, and customer data-migration
Guide Storage SREs in development of scripting and automation tools for configuration management, monitoring, and operational tasks
Guide team in problem identification, requirements gathering, solution ideation, and stakeholder alignment on engineering RFCs
Lead the team in supporting customers
Collaborate with the HPC Architecture team on drive selection, capacity determination, storage networking, cache placement, and rack layouts
Work closely with the storage software teams and networking teams to execute on cross-functional infrastructure initiatives and new data-center deployments, including integration of storage protocols across a variety of on-prem storage solutions
Work with procurement, data-center operations, and fleet engineering teams to deploy storage solutions into new and existing data centers
Work with vendors to troubleshoot customer performance, reliability, and data-integrity issues
Work closely with Networking, Compute, and Storage Software Engineering teams to deploy high-performance distributed storage solutions to serve AI/ML workloads
Partner with the fleet engineering team to ensure seamless deployment, monitoring, and maintenance of the distributed storage solutions
Stay current with the latest trends and research into AI and HPC storage technologies and vendor solutions
Guide the team in investigating strategies for using NVIDIA SuperNIC DPUs for storage edge-caching, offloading, and GPUDirect Storage capabilities
Work with the Lambda product team to uncover new trends in the AI inference and training product category that will inform emerging storage solutions
Encourage and support the team in exploring new technologies and approaches to improve system performance and efficiency
Requirements:
10+ years of experience in storage engineering, with at least 5 years in a management or lead role
Demonstrated experience leading a team of storage engineers and storage SREs on complex, cross-functional projects in a fast-paced startup environment
Extensive hands-on experience in designing, deploying, and maintaining distributed storage solutions in a CSP (Cloud Service Provider), NCP (Neo-Cloud Provider), HPC-infrastructure integrator, or AI-infrastructure company
Experience with storage solutions serving storage volumes at a scale greater than 20PB
Strong project management skills, leading high-confidence planning, project execution, and delivery of team outcomes on schedule
Extensive experience with storage site reliability engineering
Experience with one or more of the following in an HPC or AI Infrastructure environment: Vast, DDN, Pure Storage, NetApp, Weka
Experience deploying Ceph at a scale greater than 25PB
Experience in serving one or more of the following storage protocols: object storage (e.g., S3), block storage (e.g., iSCSI), or file storage (e.g., NFS, SMB, Lustre)
Professional individual contributor experience as a storage engineer or storage SRE
Familiarity with modern storage technologies (e.g., NVMe, RDMA, DPUs) and their role in optimizing performance
Experience building a high-performance team through deliberate hiring, upskilling, planned skills redundancy, performance management, and expectation setting