Senior HPC Deployment Engineer

Hewlett Packard Enterprise

Location:
Australia, Melbourne

Category:
-

Contract Type:
Not provided

Salary:
Not provided

Job Description:

As a High Performance Computing (HPC) Solution Installation and Deployment Engineer, you will be responsible for the installation, configuration, and deployment of HPC systems. You will work closely with clients, project managers, and other technical staff to ensure that HPC solutions meet performance, reliability, and scalability requirements. This role demands a strong understanding of HPC architectures, networks, and software, along with excellent problem-solving skills.

Job Responsibility:

  • Install and configure HPC hardware and software components, including servers, storage, and networking equipment
  • Set up and manage high-speed interconnects (e.g., InfiniBand, Ethernet)
  • Deploy operating systems, cluster management software, and parallel file systems
  • Coordinate with clients and project managers to understand deployment requirements and timelines
  • Implement and document HPC deployment processes and best practices
  • Perform system testing and validation to ensure optimal performance and reliability
  • Provide technical support to clients during the installation and deployment phases
  • Conduct training sessions for clients on HPC system usage and maintenance
  • Develop and maintain user documentation and guides
  • Monitor and analyze system performance to identify and resolve bottlenecks
  • Optimize HPC configurations for specific applications and workloads
  • Implement performance tuning techniques for hardware and software
  • Work closely with hardware and software vendors to troubleshoot and resolve issues
  • Collaborate with internal teams to integrate HPC solutions with existing infrastructure
  • Communicate effectively with stakeholders to provide updates on project status and technical issues
  • Stay updated on the latest HPC technologies and trends
  • Recommend improvements to enhance system performance, reliability, and scalability
  • Participate in the evaluation and testing of new HPC products and solutions

Requirements:

  • Proven experience in installing, configuring, and deploying HPC systems
  • Strong knowledge of HPC architectures, parallel computing, and cluster management
  • Proficiency in Linux/Unix operating systems
  • Experience with HPC software tools and libraries (e.g., MPI, OpenMP, SLURM, Torque)
  • Familiarity with high-speed networking technologies (e.g., InfiniBand, Ethernet)
  • Excellent problem-solving skills and attention to detail
  • Strong communication and interpersonal skills
  • Ability to work independently and as part of a team

Nice to have:

  • Certifications in relevant technologies (e.g., Red Hat Certified Engineer, Certified HPC Professional)
  • Experience with cloud-based HPC solutions
  • Knowledge of scripting languages (e.g., Python, Bash)

What we offer:
  • Comprehensive suite of benefits supporting physical, financial, and emotional wellbeing
  • Specific programs for personal and professional development
  • Inclusion and flexibility to manage work and personal needs

Additional Information:

Job Posted:
September 10, 2025

Employment Type:
Full-time
Work Type:
On-site work

Similar Jobs for Senior HPC Deployment Engineer

Senior Research Engineer

The HPE HPC & AI EMEA Research Lab (ERL) is characterized by a unique blend of i...
Location:
Germany, Munich, Berlin
Salary:
Not provided
Hewlett Packard Enterprise
Expiration Date
Until further notice
Requirements:
  • Development experience in compiled languages such as C, C++ or Fortran and experience with interpreted environments such as Python
  • At least a B.Sc. equivalent in a Science, Technology, Engineering or Mathematical discipline
  • Parallel programming experience, with programming models such as OpenMP, MPI, CUDA, OpenACC, HIP, PGAS languages, etc.
  • An understanding of AI/ML frameworks; experience with frameworks such as TensorFlow or PyTorch is highly desirable
  • An interest in system- and data center monitoring and operational data analysis
  • Professional language skills in English and German
Job Responsibility:
  • Perform world-class research while also shaping products of the future
  • Work with the most esteemed research partners across Europe
  • Enable high performance research software on pre-Exascale and Exascale supercomputers
  • Provide new environments/abstractions to support application developers to build, deploy, and run applications taking advantage of leading-edge hardware at scale
  • Make and operate HPC/AI systems and datacenters in a sustainable way
  • Manage modern data-intensive workloads in high performance environments
What we offer:
  • Competitive salary and extensive benefits package (pension scheme, insurances, bike and car leasing, and other fringe benefits)
  • Work-life balance (flexible working time and hybrid workplace model, 30 vacation days, four HPE Wellness-Fridays, up to six months paid parental leave)
  • Support for education, training, and career development
  • Diverse and dynamic work environment

Senior Linux System Administrator - Support Engineer

Senior Linux System Administrator/System Support Engineer with expertise support...
Location:
Australia, Canberra
Salary:
Not provided
Hewlett Packard Enterprise
Expiration Date
Until further notice
Requirements:
  • Bachelor's degree in Computer Science, Information Technology, or related field, or equivalent work experience
  • At least 5 years of hands-on experience managing Linux systems in production environments, including HPC systems
  • Expertise in Linux/Unix operating systems, parallel file systems (Lustre, GPFS), and networking technologies
  • Proficiency in scripting/programming languages (Bash, Python, Perl, C++)
  • Experience with automation/configuration management tools (Ansible, Puppet, Chef, Terraform)
  • Strong understanding of networking concepts (TCP/IP, DNS, DHCP, firewalls, VPNs)
  • Familiarity with monitoring/logging tools (Nagios, Grafana, ELK Stack)
  • Experience with containerization technologies (Docker, Kubernetes)
  • Excellent problem-solving, analytical, and communication skills
  • Demonstrated ability to work independently in multi-technology environments and collaborate across teams
Job Responsibility:
  • Deploy, configure, maintain, and troubleshoot Linux servers (Red Hat, CentOS, Ubuntu, or others) across physical, virtual, and cloud environments
  • Support, maintain, and optimize HPC systems, including installation, servicing, and advanced technical troubleshooting of hardware/software and parallel file systems
  • Monitor system performance, availability, and security using industry-standard tools and practices
  • Plan and execute upgrades, patches, enhancements, and migrations to ensure systems are current, secure, and optimized
  • Automate system administration tasks using scripting languages and configuration management tools
  • Implement and maintain backup/recovery strategies, disaster recovery plans, and system documentation
  • Collaborate with development, network, and security teams to support application deployments and troubleshoot issues
  • Provide technical consulting, mentoring, and guidance to junior team members
  • Ensure compliance with strict security protocols in sensitive environments
  • Participate in on-call rotation and respond to system incidents and outages
What we offer:
  • Competitive salary and performance-based bonuses
  • Comprehensive health, dental, and vision insurance
  • Retirement plan options
  • Paid time off and holidays
  • Professional development opportunities
  • Flexible work arrangements
Employment Type:
Full-time

HPC Principal Federal Technical Consultant

Principal Consultant to join our High-Performance Computing (HPC) team. In this ...
Location:
United States
Salary:
115500.00 - 266000.00 USD / Year
Hewlett Packard Enterprise
Expiration Date
Until further notice
Requirements:
  • 8+ years of professional experience, with at least 3+ in HPC architecture, systems engineering, or large-scale infrastructure design
  • Advanced degree in Computer Science, Engineering, Physics, or related technical field (or equivalent experience)
  • Proven ability to design and deliver complex, multi-vendor HPC solutions at scale
  • Demonstrated ability to independently complete solution implementations and application design deliverables
  • Must be a United States citizen due to the responsibilities and requirements of the role, as this will be supporting a Federal site
  • Top Secret Clearance, TS/SCI with Full Scope Polygraph (FSP)
  • Must be willing to travel as the business dictates
  • Expertise in one or more of the following: parallel computing, MPI/OpenMP, GPU acceleration, workload schedulers (Slurm, Altair PBS Pro, Torque/MOAB, etc.), or large-scale data storage systems (Lustre, GPFS, Ceph)
  • Experience with network boot technologies (PXE or gPXE/Etherboot, etc.)
  • Storage specific knowledge: LVM, RAID, iSCSI, Disk partitioning (GPT, MBR)
Job Responsibility:
  • Lead the technical implementation design and delivery of world-class scale HPC solutions, from requirements gathering to implementation
  • Provide architectural guidance on compute, storage, networking, and workload management tailored to customer use cases
  • Configure, deploy, and maintain Linux-based HPC clusters, associated storage, and network infrastructure
  • Work in close collaboration with customers on finalizing and deploying HPC software applications, hosting platforms, and management systems that enable customer research and production workloads
  • Provide technical support and troubleshooting for HPC implementation in secure locations
  • Work on both operational support and strategic HPC projects
  • Actively participate in customer user group environments
  • Evaluate and implement new tools, middleware, and methodologies to improve operations and service delivery
  • Ensure compliance with enterprise IT security and technology controls
  • Act as principal consultant in customer engagements, often leading cross-functional project teams (including customer staff)
What we offer:
  • Health & Wellbeing benefits
  • Personal & Professional Development programs
  • Unconditional Inclusion environment
  • Comprehensive suite of benefits supporting physical, financial and emotional wellbeing
Employment Type:
Full-time

Senior Systems Engineer HPC

Location:
India, Gurgaon
Salary:
Not provided
Rackspace
Expiration Date
Until further notice
Requirements:
  • Bachelor’s degree in Computer Science, Engineering, or a related field (equivalent experience may substitute for degree)
  • Minimum of 10 years of systems experience, including at least 5 years working specifically with HPC
  • Strong knowledge of Linux operating systems (e.g., Rocky Linux, Ubuntu) with a fundamental understanding of Linux internals, system administration, and performance tuning
  • Experience building and managing RPM and DEB packages
  • Experience with cluster management tools such as Bright Cluster Manager, OpenHPC stack, or Warewulf
  • Proficiency with job schedulers and resource managers such as Slurm and LSF
  • Strong understanding of Linux networking (e.g., TCP/IP, DNS, routing) and HPC interconnects (e.g., InfiniBand, Ethernet) including performance tuning
  • Knowledge of parallel file systems such as Lustre, Ceph, or GPFS
  • Working knowledge of Linux authentication and directory services such as LDAP and Active Directory
  • Strong experience with DevOps and configuration management tools, including Ansible, Terraform, Jenkins, and Git
Job Responsibility:
  • System Administration & Maintenance: Install, configure, and maintain HPC clusters (hardware, software, operating systems), perform regular updates/patching, manage user accounts and permissions, and troubleshoot/resolve hardware or software issues
  • Performance & Optimization: Monitor and analyse system and application performance, identify bottlenecks, implement tuning solutions, and profile workloads to improve efficiency
  • Cluster & Resource Management: Manage and optimize job scheduling, resource allocation, and cluster operations using tools such as Slurm, LSF, Bright Cluster Manager / Base Command Manager, OpenHPC, and Warewulf
  • Networking & Interconnects: Configure, manage, and tune Linux networking (TCP/IP, DNS, routing) and high-speed HPC interconnects (InfiniBand, Ethernet) to ensure low-latency, high-bandwidth communication
  • Storage & Data Management: Implement and maintain large-scale storage and parallel file systems (Lustre, Ceph, GPFS), ensure data integrity, manage backups, and support disaster recovery
  • Security & Authentication: Implement security controls, ensure compliance with policies, and manage authentication and directory services such as LDAP and Active Directory
  • DevOps & Automation: Use configuration management and DevOps practices (Ansible, Terraform, Jenkins, Git) to automate deployments, application packaging (RPM/DEB), and system configurations
  • User Support & Collaboration: Provide technical support, documentation, and training to researchers; collaborate with scientists, HPC architects, and engineers to align infrastructure with research needs
  • Planning & Innovation: Contribute to the design and planning of HPC infrastructure upgrades, evaluate and recommend hardware/software solutions, and explore cloud-based HPC solutions where applicable
Employment Type:
Full-time

HPC Principal Federal Technical Consultant

In this role, you will serve as a trusted technical advisor for customers, guidi...
Location:
United States
Salary:
115500.00 - 266000.00 USD / Year
Hewlett Packard Enterprise
Expiration Date
Until further notice
Requirements:
  • 8+ years of professional experience, with at least 3+ in HPC architecture, systems engineering, or large-scale infrastructure design
  • Advanced degree in Computer Science, Engineering, Physics, or related technical field (or equivalent experience)
  • Proven ability to design and deliver complex, multi-vendor HPC solutions at scale
  • Demonstrated ability to independently complete solution implementations and application design deliverables
  • Must be a United States citizen due to the responsibilities and requirements of the role, as this will be supporting a Federal site
  • Top Secret Clearance, TS/SCI with Full Scope Polygraph (FSP)
  • Must be willing to travel as the business dictates
  • Expertise in one or more of the following: parallel computing, MPI/OpenMP, GPU acceleration, workload schedulers (Slurm, Altair PBS Pro, Torque/MOAB, etc.), or large-scale data storage systems (Lustre, GPFS, Ceph)
  • Experience with network boot technologies (PXE or gPXE/Etherboot, etc.)
  • Storage-specific knowledge: LVM, RAID, iSCSI, disk partitioning (GPT, MBR)
Job Responsibility:
  • Lead the technical implementation design and delivery of world-class scale HPC solutions, from requirements gathering to implementation
  • Provide architectural guidance on compute, storage, networking, and workload management tailored to customer use cases
  • Configure, deploy, and maintain Linux-based HPC clusters, associated storage, and network infrastructure
  • Work in close collaboration with customers on finalizing and deploying HPC software applications, hosting platforms, and management systems that enable customer research and production workloads
  • Provide technical support and troubleshooting for HPC implementation in secure locations
  • Work on both operational support and strategic HPC projects
  • Actively participate in customer user group environments
  • Evaluate and implement new tools, middleware, and methodologies to improve operations and service delivery
  • Ensure compliance with enterprise IT security and technology controls
  • Act as principal consultant in customer engagements, often leading cross-functional project teams
What we offer:
  • Comprehensive suite of benefits that supports physical, financial, and emotional wellbeing
  • Programs catered to helping employees reach any career goals
  • Inclusive work environment
Employment Type:
Full-time

Senior Network Engineer, Deployment

Crusoe's mission is to accelerate the abundance of energy and intelligence. We’r...
Location:
United States, San Francisco, Sunnyvale
Salary:
162000.00 - 196000.00 USD / Year
Crusoe
Expiration Date
Until further notice
Requirements:
  • 10+ years of related experience building and operating at scale in a production environment
  • In-depth knowledge of network protocols including TCP/IP, QoS, BGP, OSPF/IS-IS, EVPN, VXLAN, and MPLS-related technologies like RSVP-TE, LDP, etc.
  • Good understanding of network monitoring protocols and tools, such as SNMP, IPFIX, sFlow/NetFlow, and telemetry
  • Familiar with data center network architecture, such as Fat Tree architecture, CLOS, BGP-TE, and peering for edge
  • Hands-on experience with major network devices like Mellanox, Cisco, Arista, Juniper, and other mainstream vendors
  • Familiar with mainstream commercial switch/router chipsets, such as Broadcom, Barefoot, etc.
  • In-depth knowledge of public cloud architecture connectivity options to AWS, GCP, Azure, Ali Cloud, OCI, etc.
  • Good understanding of IPv6 and IPv4-IPv6 coexistence technologies
  • Self-motivated, with good communication and writing skills
  • Team player who will participate in the Crusoe Energy Cloud network global on-call rotation
Job Responsibility:
  • Deploy, build, and optimize Crusoe Energy Cloud's global network, including edge, backbone, data center, and public cloud connectivity
  • Work with cross-functional teams, including but not limited to Software Infrastructure and Product, to drive the innovation and evolution of the Crusoe Energy Cloud network
  • Work with external vendors and ISPs to test and verify device and carrier selection
  • Be part of the 24/7 on-call support for the Crusoe network
What we offer:
  • Restricted Stock Units in a fast growing, well-funded technology company
  • Health insurance package options that include HDHP and PPO, vision, and dental for you and your dependents
  • Employer contributions to HSA accounts
  • Paid Parental Leave
  • Paid life insurance, short-term and long-term disability
  • Teladoc
  • 401(k) with a 100% match up to 4% of salary
  • Generous paid time off and holiday schedule
  • Cell phone reimbursement
  • Tuition reimbursement
Employment Type:
Full-time

Senior MLOps Engineer

If you’re passionate about scalability, automated deployment, and well-optimized...
Location:
Romania, Bucharest
Salary:
Not provided
IT Genetics Romania
Expiration Date
Until further notice
Requirements:
  • University degree, preferably in engineering (software, industrial, mechanical, process) or a related field
  • Over 5 years of experience in MLOps or machine learning engineering, with a focus on deploying and managing deep learning models at scale
  • Strong skills in Python, CI/CD pipelines, and ML frameworks (e.g., PyTorch, TensorFlow, OpenCV) for automating and scaling ML workflows
  • Expertise in monitoring and alert automation for ML workflows, including data pipelines, training processes, and model performance (e.g., Prometheus, Grafana)
  • Familiarity with distributed training techniques, multi-GPU strategies, and hardware optimization for deep learning
  • Strong communication and interpersonal skills
Job Responsibility:
  • Design end-to-end architecture for the automated training of ML models
  • Create data pipelines to build relevant datasets and data annotation flows
  • Monitor ML model performance and data drift
  • Handle versioning, deployment, and integration with the software team
  • Develop and manage CI/CD pipelines for building, testing, and deploying models
  • Apply best practices for model versioning, rollback, and A/B testing to ensure reliable and accurate production releases
  • Set up a robust monitoring system and develop automated alerting solutions to proactively identify issues in data pipelines, model training, validation, and data variation
  • Promote MLOps best practices (Infrastructure as Code, reproducibility, security) and continuously improve internal processes to increase reliability and efficiency
  • Research and implement cutting-edge technologies to improve training efficiency (e.g., distributed training, HPC, multi-GPU strategies) for the research team
  • Explore future MLOps frameworks and GPU-based cloud solutions as part of the scalability roadmap
What we offer:
  • Meal tickets
  • A place where your voice truly matters
  • Performance bonuses
  • A day off on your birthday
  • Private medical subscription
  • Training and learning resources
  • Hybrid work model
  • Bookster subscription
  • A friendly, passionate, and solution-oriented team
  • Opportunities to grow or change your role within the company

Senior ML Systems Engineer, Frameworks & Tooling

We’re looking for a senior engineer to help build, maintain and evolve the train...
Salary:
Not provided
Cohere
Expiration Date
Until further notice
Requirements:
  • Strong engineering experience in large-scale distributed training or HPC systems
  • Deep familiarity with JAX internals, distributed training libraries, or custom kernels/fused ops
  • Experience with multi-node cluster orchestration (Slurm, Ray, Kubernetes, or similar)
  • Comfort debugging performance issues across CUDA/NCCL, networking, IO, and data pipelines
  • Experience working with containerized environments (Docker, Singularity/Apptainer)
  • A track record of building tools that increase developer velocity for ML teams
  • Excellent judgment around trade-offs: performance vs complexity, research velocity vs maintainability
  • Strong collaboration skills — you’ll work closely with infra, research, and deployment teams
Job Responsibility:
  • Build and own the training framework responsible for large-scale LLM training
  • Design distributed training abstractions (data/tensor/pipeline parallelism, FSDP/ZeRO strategies, memory management, checkpointing)
  • Improve training throughput and stability on multi-node clusters (e.g., GB200/300, AMD, H200/100)
  • Develop and maintain tooling for monitoring, logging, debugging, and developer ergonomics
  • Collaborate closely with infra teams to ensure our cluster, container environments, and hardware configurations support high-performance training
  • Investigate and resolve performance bottlenecks across the ML systems stack
  • Build robust systems that ensure reproducible, debuggable, large-scale runs
What we offer:
  • An open and inclusive culture and work environment
  • Work closely with a team on the cutting edge of AI research
  • Weekly lunch stipend, in-office lunches & snacks
  • Full health and dental benefits, including a separate budget to take care of your mental health
  • 100% Parental Leave top-up for up to 6 months
  • Personal enrichment benefits towards arts and culture, fitness and well-being, quality time, and workspace improvement
  • Remote-flexible, offices in Toronto, New York, San Francisco, London and Paris, as well as a co-working stipend
  • 6 weeks of vacation (30 working days!)
Employment Type:
Full-time