The Senior HPC Systems and Storage Engineer applies advanced systems and software integration concepts, together with location and institutional objectives, to resolve highly complex issues whose analysis requires in-depth evaluation of variable factors, and to implement medium to large projects of broad scope and complexity. They regularly resolve complex business-process, system-functionality, implementation, and system and software integration issues, selecting the tools, methods, techniques, and evaluation criteria needed to obtain results. They also give technical presentations to the associated team, other technical units, and management, and evaluate new technologies, including performing moderate to complex cost/benefit analyses. They may lead a team of systems/infrastructure professionals.
Job Responsibility:
Designing, deploying, and operating SDSC HPC compute clusters and their associated storage systems
Maintaining their performance, reliability, and availability at the national, state, and campus levels
Contributing to the design, deployment, and operation of high-performance HPC systems and storage environments, including parallel file systems operating at scale across high-speed networks
Planning and executing system lifecycles, including deployment, upgrades, and decommissioning of HPC systems and storage services
Contributing to technical planning and effort estimation for new deployments, proposals, and recharge-based services
Evaluating and recommending improvements to tools and workflows
Participating in the selection and integration of new technologies
Working with vendors and SDSC staff to benchmark and evaluate storage systems and cluster platforms
Maintaining current knowledge of emerging technologies
Developing advanced processes and scripts for system analysis, testing, and automation
Leading efforts to integrate monitoring and alerting, improving incident detection, response, and user communication
Overseeing collaboration with SDSC security teams to implement best practices for system deployment, identity management, and software updates
Overseeing development and maintenance of related documentation
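To illustrate the scripting, monitoring, and alerting duties above, a minimal node health-check sketch in Python (the check names and thresholds are hypothetical, not SDSC's actual tooling; real deployments would pull thresholds from configuration management and feed results into an alerting pipeline):

```python
import os
import shutil

# Hypothetical thresholds; in practice these would be managed via
# configuration management (e.g., Ansible-deployed config files).
MAX_DISK_USED_FRACTION = 0.90
MAX_LOAD_PER_CPU = 2.0

def check_disk(path="/"):
    """Return (ok, used_fraction) for the filesystem containing `path`."""
    usage = shutil.disk_usage(path)
    used = usage.used / usage.total
    return used < MAX_DISK_USED_FRACTION, used

def check_load():
    """Return (ok, 1-minute load average normalized per CPU)."""
    load1, _, _ = os.getloadavg()
    per_cpu = load1 / (os.cpu_count() or 1)
    return per_cpu < MAX_LOAD_PER_CPU, per_cpu

def node_health():
    """Aggregate individual checks into an alert-friendly summary dict."""
    disk_ok, disk_used = check_disk()
    load_ok, load = check_load()
    return {
        "disk_ok": disk_ok, "disk_used": round(disk_used, 3),
        "load_ok": load_ok, "load_per_cpu": round(load, 3),
    }

if __name__ == "__main__":
    print(node_health())
```

A script like this would typically run under cron or a monitoring agent, with the summary dict exported to whatever alerting system the site uses.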
Requirements:
Bachelor’s degree in a related area and/or equivalent experience/training
Proven experience administering and supporting large-scale HPC clusters or other distributed POSIX (Linux) systems, including advanced knowledge of Linux system administration, primarily Red Hat and its derivatives (e.g., Rocky Linux)
Proven experience designing, deploying, and operating large-scale (petabyte-class) high-performance parallel and distributed file systems (e.g., Lustre, Ceph, BeeGFS, GPFS), as well as enterprise and local file systems (e.g., NFS, ZFS, ext4, XFS) in Linux-based environments, including troubleshooting and performance tuning
Demonstrated experience with scripting and automation using languages such as Bash and Python; with configuration management tools (e.g., Ansible, CFEngine); and with version control systems (e.g., Git) to manage and maintain system configurations and infrastructure
Advanced knowledge of the HPC middleware stack, including cluster management tools, job schedulers, and resource managers. Examples include Slurm, PBS, HPCM, and Bright Cluster Manager
Demonstrated knowledge of TCP/IP networking, including sockets, VLANs, and firewalls
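As a concrete instance of the Slurm-facing scripting skills listed above, here is a minimal Python sketch that summarizes node states from `sinfo -N -h -o "%n %t"` output (the sample text below is illustrative, not from a real cluster):

```python
from collections import Counter

def summarize_node_states(sinfo_text):
    """Count nodes per state from `sinfo -N -h -o "%n %t"` style output."""
    counts = Counter()
    for line in sinfo_text.strip().splitlines():
        parts = line.split()
        if len(parts) == 2:
            _node, state = parts
            counts[state] += 1
    return dict(counts)

# Illustrative output of: sinfo -N -h -o "%n %t"
sample = """\
node001 alloc
node002 idle
node003 idle
node004 drain
"""

print(summarize_node_states(sample))  # {'alloc': 1, 'idle': 2, 'drain': 1}
```

In production such a summary would typically be read live via `subprocess` and fed into monitoring dashboards or capacity reports; parsing from text keeps the sketch self-contained.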