CrawlJobs Logo

Senior HPC Systems and Storage Engineer

ucsd.edu Logo

UC San Diego

Location Icon

Location:
United States , San Diego

Category Icon

Job Type Icon

Contract Type:
Not provided

Salary Icon

Salary:

108100.00 - 160000.00 USD / Year
Save Job
Save Icon
Job offer has expired

Job Description:

The Senior HPC Systems and Storage Engineer will apply advanced systems and software integration concepts, and location or institutional objectives, to resolve highly complex issues where analysis of systems and software requires an in-depth evaluation of variable factors to resolve and implement medium to large projects of broad scope and complexity. They will regularly resolve highly complex business processes, system functionality, implementation issues and system and software integration issues where analysis of situations or data requires an in-depth evaluation of variable factors and select tools, methods, techniques and evaluation criteria to obtain results. They will also give technical presentations to associated team, other technical units and management as well as evaluate new technologies including performing moderate to complex cost / benefit analyses. They may lead a team of systems / infrastructure professionals.

Job Responsibility:

  • Designing, deploying, and operating SDSC HPC compute clusters and their associated storage systems
  • Maintaining their performance, reliability, and availability at the national, state, and campus level
  • Contributing to the design, deployment, and operation of high-performance HPC systems and storage environments, including parallel file systems operating at scale across high-speed networks
  • Planning and executing system lifecycles, including deployment, upgrades, and decommissioning of HPC systems and storage services
  • Contribute to technical planning and effort estimation for new deployments, proposals, and recharge-based services
  • Evaluating and recommending improvements to tools and workflows
  • Participates in the selection and integration of new technologies
  • Working with vendors and SDSC staff to benchmark and evaluate storage systems and cluster platforms
  • Maintaining current knowledge of emerging technologies
  • Developing advanced processes and scripts for system analysis, testing, and automation
  • Leading efforts to integrate monitoring and alerting, improving incident detection, response, and user communication
  • Overseeing collaboration with SDSC security teams to implement best practices for system deployment, identity management, and software updates
  • Overseeing development and maintenance of related documentation

Requirements:

  • Bachelor’s degree in related area and / or equivalent experience / training
  • Proven experience administering and supporting large-scale HPC clusters or other distributed POSIX (Linux) systems, including advanced knowledge of Linux system administration, primarily Red Hat and its derivatives (e.g., Rocky Linux)
  • Proven experience designing, deploying, and operating large-scale (petabyte-class) high-performance parallel and distributed file systems (e.g., Lustre, Ceph, BeeGFS, GPFS), as well as enterprise and local file systems (e.g., NFS, ZFS, ext4, XFS) in Linux-based environments, including troubleshooting and performance tuning
  • Demonstrated experience with scripting and automation using languages such as Bash and Python
  • use of configuration management tools (e.g., Ansible, CFEngine)
  • and version control systems (e.g., Git) to manage and maintain system configurations and infrastructure
  • Advanced knowledge of HPC middleware stack including cluster management tools, job schedulers and resources managers. Examples include: Slurm, PBS, HPCM, and Bright Cluster Manager
  • Demonstrated knowledge of TCP/IP networking, including sockets, VLANs, and firewalls

Additional Information:

Job Posted:
May 04, 2026

Expiration:
May 07, 2026

Employment Type:
Fulltime
Work Type:
Hybrid work
Job Link Share:

Looking for more opportunities? Search for other job offers that match your skills and interests.

Briefcase Icon

Similar Jobs for Senior HPC Systems and Storage Engineer

Senior HPC Deployment Engineer

As a High Performance Computer (HPC) Solution Installation and Deployment Engine...
Location
Location
Australia , Melbourne
Salary
Salary:
Not provided
https://www.hpe.com/ Logo
Hewlett Packard Enterprise
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • Proven experience in installing, configuring, and deploying HPC systems
  • strong knowledge of HPC architectures, parallel computing, and cluster management
  • proficiency in Linux/Unix operating systems
  • experience with HPC software tools and libraries (e.g., MPI, OpenMP, SLURM, Torque)
  • familiarity with high-speed networking technologies (e.g., InfiniBand, Ethernet)
  • excellent problem-solving skills and attention to detail
  • strong communication and interpersonal skills
  • ability to work independently and as part of a team
  • certifications in relevant technologies (e.g., Red Hat Certified Engineer, Certified HPC Professional)
  • experience with cloud-based HPC solutions
Job Responsibility
Job Responsibility
  • Install and configure HPC hardware and software components, including servers, storage, and networking equipment
  • set up and manage high-speed interconnects (e.g., InfiniBand, Ethernet)
  • deploy operating systems, cluster management software, and parallel file systems
  • coordinate with clients and project managers to understand deployment requirements and timelines
  • implement and document HPC deployment processes and best practices
  • perform system testing and validation to ensure optimal performance and reliability
  • provide technical support to clients during the installation and deployment phases
  • conduct training sessions for clients on HPC system usage and maintenance
  • develop and maintain user documentation and guides
  • monitor and analyze system performance to identify and resolve bottlenecks
What we offer
What we offer
  • Comprehensive suite of benefits supporting physical, financial, and emotional wellbeing
  • specific programs for personal and professional development
  • inclusion and flexibility to manage work and personal needs
  • Fulltime
Read More
Arrow Right

Senior Solution Architect AI & HPC

AI is a high-growth market for HPE, and we believe we are uniquely suited to bri...
Location
Location
Salary
Salary:
Not provided
https://www.hpe.com/ Logo
Hewlett Packard Enterprise
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • Bachelor's or Master's degree in Engineering, Computer Science, or similar quantitative focus preferred
  • Ability to quickly prototype functionality into scripts for demos, integrations, troubleshooting, etc.
  • Expertise in cloud architectures, specifically with public cloud platforms such as AWS, Azure, or Google Cloud
  • Strong understanding of AI technologies, including machine learning, deep learning, and neural networks
  • Experience participating in solution configurations and the creation of PoCs to meet customer requirements
  • Solid knowledge of infrastructure components, including servers, storage, networking, and virtualization
  • Experience with high-performance computing (HPC) and GPU-accelerated systems is advantageous
  • Demonstrates expert technical skills in assigned area of specialization
  • Expert knowledge of the company offerings, strategic initiatives, current trends, competitor products and strategies within area of responsibility
  • Expert level written and verbal communication skills and mastery over English and local language
Job Responsibility
Job Responsibility
  • Collaborate with sales teams to understand customer requirements and develop tailored solutions for their AI infrastructure needs
  • Engage in pre-sales activities, including technical presentations, demonstrations, and proof-of-concepts
  • Act as a trusted advisor to customers, addressing their questions, concerns, and technical challenges effectively
  • Stay up-to-date with the latest advancements in AI technologies, cloud architectures, and infrastructure trends
  • Lead Proof-of-Concepts (PoC) for HPE customers expanding into Deep Learning or Machine Learning use cases
  • Architect reusable end-to-end AI solutions for HPE customers and prospects
  • Lead technical discussions with customers and partners to propose HPE and partner Integrated solutions
  • Identify solutions, define action plans, and help coordinate and deliver optimal solutions and enhancements
  • Recommend configurations and settings for different types of hardware and interconnect fabrics
  • Assist in any product or technical issue towards an initial sale or renewal of a customer
What we offer
What we offer
  • Health & Wellbeing benefits
  • Personal & Professional Development programs
  • Unconditional Inclusion environment
  • Comprehensive suite of benefits that supports physical, financial and emotional wellbeing
  • Fulltime
Read More
Arrow Right

HPC Principal Federal Technical Consultant

In this role, you will serve as a trusted technical advisor for customers, guidi...
Location
Location
United States
Salary
Salary:
115500.00 - 266000.00 USD / Year
https://www.hpe.com/ Logo
Hewlett Packard Enterprise
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • 8+ years of professional experience, with at least 3+ in HPC architecture, systems engineering, or large-scale infrastructure design
  • advanced degree in Computer Science, Engineering, Physics, or related technical field (or equivalent experience)
  • proven ability to design and deliver complex, multi-vendor HPC solutions at scale
  • demonstrated ability to independently complete solution implementations and application design deliverables
  • must be United States Citizen due to the responsibilities and requirements of the role as this will be supporting a Federal site
  • Top Secret Clearance, TS/SCI with Full Scope Polygraph (FSP)
  • must be willing to travel as the business dictates
  • expertise in one or more of the following: parallel computing, MPI/OpenMP, GPU acceleration, workload schedulers (Slurm, Altair PBS Pro, Torque/MOAB, etc.), or large-scale data storage systems (Lustre, GPFS, Ceph)
  • experience with Network boot technologies (PXE or gPXE/Etherboot etc)
  • storage specific knowledge: LVM, RAID, iSCSI, Disk partitioning (GPT, MBR)
Job Responsibility
Job Responsibility
  • Lead the technical implementation design and delivery of world-class scale HPC solutions, from requirements gathering to implementation
  • provide architectural guidance on compute, storage, networking, and workload management tailored to customer use cases
  • configure, deploy, and maintain Linux-based HPC clusters, associated storage, and network infrastructure
  • work in close collaboration with customers on finalizing and deploying HPC software applications, hosting platforms, and management systems that enable customer research and production workloads
  • provide technical support and troubleshooting for HPC implementation in secure locations
  • work on both operational support and strategic HPC projects
  • actively participate in customer user group environments
  • evaluate and implement new tools, middleware, and methodologies to improve operations and service delivery
  • ensure compliance with enterprise IT security and technology controls
  • act as principal consultant in customer engagements, often leading cross-functional project teams
What we offer
What we offer
  • comprehensive suite of benefits that supports physical, financial, and emotional wellbeing
  • programs catered to helping employees reach any career goals
  • inclusive work environment.
  • Fulltime
Read More
Arrow Right

Customer Support Engineer

As a Customer Support Engineer at a pioneering AI company, you'll be the first l...
Location
Location
India
Salary
Salary:
Not provided
together.ai Logo
Together AI
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • 5+ years of experience in a customer-facing technical role with at least 1 year in a support function in AI
  • Strong technical background, with knowledge of AI, ML, GPU technologies and their integration into high-performance computing (HPC) environments
  • Familiarity with infrastructure services (e.g., Kubernetes, SLURM), infrastructure as code solutions (e.g., Ansible) high-performance network fabrics, NFS-based storage management, container infrastructure, and scripting and programming languages
  • Familiarity with operating storage systems in HPC environments such as Vast and Weka
  • Familiarity with inspecting and resolving network-related errors
  • Strong knowledge of Python, TypeScript, and/or JavaScript with testing/debugging experience using curl and Postman-like tools
  • Foundational understanding in the installation, configuration, administration, troubleshooting, and securing of compute clusters
  • Complex technical problem solving and troubleshooting, with a proactive approach to issue resolution
  • Ability to work cross-functionally with teams such as Sales, Engineering, Support, Product and Research to drive customer success
  • Strong sense of ownership and willingness to learn new skills to ensure both team and customer success
Job Responsibility
Job Responsibility
  • Engage directly with customers to tackle and resolve complex technical challenges involving our cutting-edge GPU clusters and our inference and fine-tuning services
  • ensure swift and effective solutions every time
  • Become a product expert in all of our Gen AI solutions, serving as the last line of technical defense before issues are escalated to Engineering and Product teams
  • Collaborate seamlessly across Engineering, Research, and Product teams to address customer concerns
  • collaborate with senior leaders both internally and externally to ensure the highest levels of customer satisfaction
  • Transform customer insights into action by identifying patterns in support cases and working with Engineering and Go-To-Market teams to drive Together’s roadmap (e.g., future models to support)
  • Maintain detailed documentation of system configurations, procedures, troubleshooting guides, and FAQs to facilitate knowledge sharing with team and customers
  • Be flexible in providing support coverage during holidays, nights and weekends as required by business needs to ensure consistent and reliable service for our customers
What we offer
What we offer
  • competitive compensation
  • startup equity
  • health insurance
  • flexibility in terms of remote work for the respective hiring region
Read More
Arrow Right

HPC Senior Technical Writer

In this position you will collaborate with knowledge management project leads an...
Location
Location
United States of America , Chippewa Falls
Salary
Salary:
81500.00 - 187500.00 USD / Year
https://www.hpe.com/ Logo
Hewlett Packard Enterprise
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • Bachelor's degree in Technical Communications, Computer Science, or related technical/communications field with 4-6 years related experience
  • Advanced University degree and 2-4 years' experience or equivalent
  • Understands concepts and develops in-depth working knowledge of products, applications, and systems in assigned area of responsibility
  • Ability to deliver on multiple project technical requirements, schedules, and information formats
  • Codes in HTML, DHTML, XML, JavaScript or similar as required
  • Applies developed subject matter knowledge to solve common and complex business issues and recommends appropriate alternatives
  • Works on problems of diverse complexity and scope
  • May act as a team or project leader providing direction to team activities and facilitates information validation and team decision making process
  • Exercises independent judgment to identify and select a solution
  • Knowledge of HPC system software and hardware components, including operating systems, programming languages, system monitoring applications, HPC storage, chassis, servers, compute nodes, blades, HPC storage, coolant systems, power supplies, high speed network switches and cabling, and more
Job Responsibility
Job Responsibility
  • Create technical product documentation for software products and hardware
  • Analyze customer information requirements and product specifications to define scope of work and documentation plan
  • Identify and address the needs of all user groups, including end users, system administrators, internal support engineers, product developers, integration test teams, and training developers
  • Test documentation for install or administrative tasks to improve information deliverables and provide feedback on ease of use and user interfaces to product development
  • Manage workload in Jira and source management tools, including SDL, Oxygen, Git, and Github, to manage changes in the shared work environment
  • Create, revise, and manage content in Oxygen Author (DITA), Markdown, and other content tools
  • Work with developers, testers, product managers, technical support, and training to identify new features and content that needs to be reworked
What we offer
What we offer
  • Health & Wellbeing
  • Personal & Professional Development
  • Unconditional Inclusion
  • Fulltime
Read More
Arrow Right

Customer Support Engineer

As a Customer Support Engineer at a pioneering AI company, you'll be the first l...
Location
Location
United States , San Francisco
Salary
Salary:
180000.00 - 260000.00 USD / Year
together.ai Logo
Together AI
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • 5+ years of experience in a customer-facing technical role with at least 1 year in a support function in AI
  • Strong technical background, with knowledge of AI, ML, GPU technologies and their integration into high-performance computing (HPC) environments
  • Familiarity with infrastructure services (e.g., Kubernetes, SLURM), infrastructure as code solutions (e.g., Ansible) high-performance network fabrics, NFS-based storage management, container infrastructure, and scripting and programming languages
  • Familiarity with operating storage systems in HPC environments such as Vast and Weka
  • Familiarity with inspecting and resolving network-related errors
  • Strong knowledge of Python, TypeScript, and/or JavaScript with testing/debugging experience using curl and Postman-like tools
  • Foundational understanding in the installation, configuration, administration, troubleshooting, and securing of compute clusters
  • Complex technical problem solving and troubleshooting, with a proactive approach to issue resolution
  • Ability to work cross-functionally with teams such as Sales, Engineering, Support, Product and Research to drive customer success
  • Strong sense of ownership and willingness to learn new skills to ensure both team and customer success
Job Responsibility
Job Responsibility
  • Engage directly with customers to tackle and resolve complex technical challenges involving our cutting-edge GPU clusters and our inference and fine-tuning services
  • ensure swift and effective solutions every time
  • Become a product expert in all of our Gen AI solutions, serving as the last line of technical defense before issues are escalated to Engineering and Product teams
  • Collaborate seamlessly across Engineering, Research, and Product teams to address customer concerns
  • collaborate with senior leaders both internally and externally to ensure the highest levels of customer satisfaction
  • Transform customer insights into action by identifying patterns in support cases and working with Engineering and Go-To-Market teams to drive Together’s roadmap (e.g., future models to support)
  • Maintain detailed documentation of system configurations, procedures, troubleshooting guides, and FAQs to facilitate knowledge sharing with team and customers
  • Be flexible in providing support coverage during holidays, nights and weekends as required by business needs to ensure consistent and reliable service for our customers
What we offer
What we offer
  • competitive compensation
  • startup equity
  • health insurance
  • flexibility in terms of remote work
  • Fulltime
Read More
Arrow Right

HPC Principal Federal Technical Consultant

Principal Consultant to join our High-Performance Computing (HPC) team. In this ...
Location
Location
United States
Salary
Salary:
115500.00 - 266000.00 USD / Year
https://www.hpe.com/ Logo
Hewlett Packard Enterprise
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • 8+ years of professional experience, with at least 3+ in HPC architecture, systems engineering, or large-scale infrastructure design
  • Advanced degree in Computer Science, Engineering, Physics, or related technical field (or equivalent experience)
  • Proven ability to design and deliver complex, multi-vendor HPC solutions at scale
  • Demonstrated ability to independently complete solution implementations and application design deliverables
  • Must be United States Citizen due to the responsibilities and requirements of the role as this will be supporting a Federal site
  • Top Secret Clearance, TS/SCI with Full Scope Polygraph (FSP)
  • Must be willing to travel as the business dictates
  • Expertise in one or more of the following: parallel computing, MPI/OpenMP, GPU acceleration, workload schedulers (Slurm, Altair PBS Pro, Torque/MOAB, etc.), or large-scale data storage systems (Lustre, GPFS, Ceph)
  • Experience with Network boot technologies (PXE or gPXE/Etherboot etc)
  • Storage specific knowledge: LVM, RAID, iSCSI, Disk partitioning (GPT, MBR)
Job Responsibility
Job Responsibility
  • Lead the technical implementation design and delivery of world class scale HPC solutions, from requirements gathering to implementation
  • Provide architectural guidance on compute, storage, networking, and workload management tailored to customer use cases
  • Configure, deploy, and maintain Linux-based HPC clusters, associated storage, and network infrastructure
  • Work in close collaboration with customers on finalizing and deploying HPC software applications, hosting platforms, and management systems that enable customer research and production workloads
  • Provide technical support and troubleshooting for HPC implementation in secure locations
  • Work on both operational support and strategic HPC projects
  • actively participate in customer user group environments
  • Evaluate and implement new tools, middleware, and methodologies to improve operations and service delivery
  • Ensure compliance with enterprise IT security and technology controls
  • Act as principal consultant in customer engagements, often leading cross-functional project teams (including customer staff)
What we offer
What we offer
  • Health & Wellbeing benefits
  • Personal & Professional Development programs
  • Unconditional Inclusion environment
  • Comprehensive suite of benefits supporting physical, financial and emotional wellbeing
  • Fulltime
Read More
Arrow Right

Member of Technical Staff, Software Co-Design AI HPC Systems

Our team’s mission is to architect, co-design, and productionize next-generation...
Location
Location
United States , Mountain View
Salary
Salary:
139900.00 - 274800.00 USD / Year
https://www.microsoft.com/ Logo
Microsoft Corporation
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • Bachelor's Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience.
  • Master's Degree in Computer Science or related technical field AND 8+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR Bachelor's Degree in Computer Science or related technical field AND 12+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience.
  • Strong background in one or more of the following areas: AI accelerator or GPU architectures
  • Distributed systems and large-scale AI training/inference
  • High-performance computing (HPC) and collective communications
  • ML systems, runtimes, or compilers
  • Performance modeling, benchmarking, and systems analysis
  • Hardware–software co-design for AI workloads
  • Proficiency in systems-level programming (e.g., C/C++, CUDA, Python) and performance-critical software development.
  • Proven ability to work across organizational boundaries and influence technical decisions involving multiple stakeholders.
Job Responsibility
Job Responsibility
  • Lead the co-design of AI systems across hardware and software boundaries, spanning accelerators, interconnects, memory systems, storage, runtimes, and distributed training/inference frameworks.
  • Drive architectural decisions by analyzing real workloads, identifying bottlenecks across compute, communication, and data movement, and translating findings into actionable system and hardware requirements.
  • Co-design and optimize parallelism strategies, execution models, and distributed algorithms to improve scalability, utilization, reliability, and cost efficiency of large-scale AI systems.
  • Develop and evaluate what-if performance models to project system behavior under future workloads, model architectures, and hardware generations, providing early guidance to hardware and platform roadmaps.
  • Partner with compiler, kernel, and runtime teams to unlock the full performance of current and next-generation accelerators, including custom kernels, scheduling strategies, and memory optimizations.
  • Influence and guide AI hardware design at system and silicon levels, including accelerator microarchitecture, interconnect topology, memory hierarchy, and system integration trade-offs.
  • Lead cross-functional efforts to prototype, validate, and productionize high-impact co-design ideas, working across infrastructure, hardware, and product teams.
  • Mentor senior engineers and researchers, set technical direction, and raise the overall bar for systems rigor, performance engineering, and co-design thinking across the organization.
  • Fulltime
Read More
Arrow Right