
Senior HPC Deployment Engineer

Australia, Melbourne · Job Posted September 10, 2025

Job Description

As a High-Performance Computing (HPC) Solution Installation and Deployment Engineer, you will be responsible for the installation, configuration, and deployment of HPC systems. You will work closely with clients, project managers, and other technical staff to ensure that HPC solutions meet performance, reliability, and scalability requirements. This role demands a strong understanding of HPC architectures, networks, and software, along with excellent problem-solving skills.

Job Responsibility

  • Install and configure HPC hardware and software components, including servers, storage, and networking equipment
  • Set up and manage high-speed interconnects (e.g., InfiniBand, Ethernet)
  • Deploy operating systems, cluster management software, and parallel file systems
  • Coordinate with clients and project managers to understand deployment requirements and timelines
  • Implement and document HPC deployment processes and best practices
  • Perform system testing and validation to ensure optimal performance and reliability
  • Provide technical support to clients during the installation and deployment phases
  • Conduct training sessions for clients on HPC system usage and maintenance
  • Develop and maintain user documentation and guides
  • Monitor and analyze system performance to identify and resolve bottlenecks
  • Optimize HPC configurations for specific applications and workloads
  • Implement performance tuning techniques for hardware and software
  • Work closely with hardware and software vendors to troubleshoot and resolve issues
  • Collaborate with internal teams to integrate HPC solutions with existing infrastructure
  • Communicate effectively with stakeholders to provide updates on project status and technical issues
  • Stay updated on the latest HPC technologies and trends
  • Recommend improvements to enhance system performance, reliability, and scalability
  • Participate in the evaluation and testing of new HPC products and solutions

Requirements

  • Proven experience in installing, configuring, and deploying HPC systems
  • Strong knowledge of HPC architectures, parallel computing, and cluster management
  • Proficiency in Linux/Unix operating systems
  • Experience with HPC software tools and libraries (e.g., MPI, OpenMP, SLURM, Torque)
  • Familiarity with high-speed networking technologies (e.g., InfiniBand, Ethernet)
  • Excellent problem-solving skills and attention to detail
  • Strong communication and interpersonal skills
  • Ability to work independently and as part of a team

Nice to have

  • Certifications in relevant technologies (e.g., Red Hat Certified Engineer, Certified HPC Professional)
  • Experience with cloud-based HPC solutions
  • Knowledge of scripting languages (e.g., Python, Bash)

What we offer

  • Comprehensive suite of benefits supporting physical, financial, and emotional wellbeing
  • Specific programs for personal and professional development
  • Inclusion and flexibility to manage work and personal needs

Similar Jobs for Senior HPC Deployment Engineer

8 matching positions

Senior Network Engineer, Deployment

Crusoe's mission is to accelerate the abundance of energy and intelligence. We’r...
Location: United States, San Francisco, Sunnyvale
Salary: 162000.00 - 196000.00 USD / Year
Company: Crusoe (crusoe.ai)
Expiration Date: Until further notice
Requirements
  • 10+ years of related experience building and operating at scale in a production environment
  • In-depth knowledge of network protocols including TCP/IP, QoS, BGP, OSPF/IS-IS, EVPN, VXLAN, and MPLS-related technologies like RSVP-TE, LDP, etc.
  • Good understanding of network monitoring protocols and tools, such as SNMP, IPFIX, sFlow/NetFlow, and telemetry
  • Familiar with data center network architecture, such as Fat Tree architecture, CLOS, BGP-TE, and peering for edge
  • Hands-on experience with major network devices like Mellanox, Cisco, Arista, Juniper, and other mainstream vendors
  • Familiar with mainstream commercial switch/router chipsets, such as Broadcom, Barefoot, etc.
  • In-depth knowledge of public cloud architecture connectivity options to AWS, GCP, Azure, Ali Cloud, OCI, etc.
  • Good understanding of IPv6 and IPv4-IPv6 coexistence technologies
  • Self-motivated, with good communication and writing skills
  • Team player, willing to participate in the Crusoe Energy Cloud network global on-call rotation
Job Responsibility
  • Deploy, build, and optimize Crusoe Energy Cloud's global network, including edge, backbone, data center, and public cloud connectivity
  • Work with cross-functional teams, including but not limited to Software Infrastructure and Product, to drive the innovation and evolution of the Crusoe Energy Cloud network
  • Work with external vendors and ISPs to test and verify device and carrier selection
  • Participate in 24/7 on-call support for the Crusoe network
What we offer
  • Restricted Stock Units in a fast growing, well-funded technology company
  • Health insurance package options that include HDHP and PPO, vision, and dental for you and your dependents
  • Employer contributions to HSA accounts
  • Paid Parental Leave
  • Paid life insurance, short-term and long-term disability
  • Teladoc
  • 401(k) with a 100% match up to 4% of salary
  • Generous paid time off and holiday schedule
  • Cell phone reimbursement
  • Tuition reimbursement
  • Fulltime

Senior Cybersecurity Engineer

Senior Cybersecurity Engineer LOCATION: Eglin AFB, FL JOB STATUS: Full-time C...
Location: United States, Eglin Air Force Base
Salary: Not provided
Company: Astrion (astrion.us)
Expiration Date: Until further notice
Requirements
  • Master’s Degree (in Computer Science, Cybersecurity or a related field). Relevant experience may be substituted for the degree
  • 10 Years’ total experience, at least 8 of which is in cybersecurity engineering, architecture or R&D infrastructure
  • Top Secret Clearance with SCI. Eligible for Special Access Program (SAP) access. US Citizenship is required
  • DoD 8570/8140 IAT Level III (CISSP, CISM, or equivalent). Certifications: Security+, CEH, or other relevant security certifications
  • Expert-level knowledge of cybersecurity principles, risk management, and secure computing architectures
  • Hands-on experience with security tools and technologies, such as SIEM, intrusion detection/prevention systems, vulnerability scanners, and endpoint protection solutions. Experience with Host-Based Security System (HBSS), Assured Compliance Assessment Solution (ACAS), Nessus, Tenable.sc, Tenable.io, NNM, LCE, Nessus Manager, Agents, and Scanner
  • Experience with scripting (Python, PowerShell) and automation tools (Ansible, Chef)
  • Familiarity with Risk Management Framework (RMF), Authority to Operate (ATO) documentation, and enclave compliance management
  • Physically able to lift up to 50 lbs; adaptable to fieldwork and hands-on installations
Job Responsibility
  • Collaborate with network engineers to architect secure network topologies for current and future connected and isolated environments, ensuring security is embedded in the design phase
  • Design and deploy security solutions for S&T environments that support continuous research, development, and DevSecOps, working closely with network engineers to implement and maintain these solutions
  • Advise on security planning for long-term initiatives, including SDREN integration and the Weapons Technology Integration Center (WTIC) and other facility projects, in conjunction with network planning efforts
  • Develop security innovation roadmaps aligned with mission goals and emerging technologies, coordinating with network engineers to ensure alignment with network modernization efforts
  • Coordinate with facilities, engineering, and network teams to ensure robust infrastructure supports secure research operations, focusing on the security aspects of network hardware/power/cooling needs and structured cabling
  • Lead security aspects of containerization, virtualization, and orchestration of systems to support laboratory computing, HPC, and edge devices, working with network engineers to implement secure configurations
  • Engineer the security architecture of multiple S&T networks in compliance with NIST 800-series, DoD RMF, DISA Security Technical Implementation Guides (STIGs), and cybersecurity best practices, collaborating with network engineers to ensure seamless integration. Review engineering, architecture, and designs to ensure DoD security policies are met
  • Implement DevSecOps pipelines to automate security scans and CI/CD deployments, working with network engineers to integrate security into existing pipelines
  • Manage ATO package development and collaborate with ISSMs, network engineers, and cybersecurity stakeholders to ensure compliance. Review and develop RMF Assessment and Authorization (A&A) documentation, e.g. System Security Plans (SSPs), Security Assessment Reports (SARs), and Plans of Action and Milestones (POA&Ms)
  • Integrate identity management and single sign-on solutions across enclaves and hybrid environments, coordinating with network engineers to implement and maintain these solutions. Analyze and tune HBSS policies for assets during integration test events. Perform verification and troubleshooting across all HBSS modules. Install updates to HBSS software as released and in compliance with STIG requirements. Monitor HBSS software to ensure that the clients/servers are operational and reporting properly
What we offer
  • Competitive salaries
  • Continuing education assistance
  • Professional development
  • Multiple healthcare benefits package options
  • 401K with employer matching
  • Competitive time off policy along with a federally recognized holiday schedule
  • Fulltime

Senior Software Engineer

We are looking for a dynamic, energetic Sr. Software Systems Design Engineer to ...
Location: India, Bangalore
Salary: Not provided
Company: AMD (amd.com)
Expiration Date: Until further notice
Requirements
  • Very strong data structure and algorithmic skills
  • Experience in software development using C/C++ and debugging skills on multicore systems
  • Experience in identifying performance bottlenecks, and designing/implementing optimizations to relieve analyzed bottlenecks
  • Experience in x86 (or other architecture based) optimizations
  • Understanding of Cache sub-system, Instruction Set Architecture, pipeline (for any CPU)
  • Experience in performance analysis for data center, HPC (High Performance Computing), and MPI (Message Passing Interface) applications
  • Bachelor's or Master's degree in Computer Science Engineering or related field.
Job Responsibility
  • Problem solving across multiple software layers (user space, kernel, applications, libraries) and hardware
  • Optimization/development of the CPU performance stack (applications, libraries) for AMD server processors
  • Analyze and solve performance and scalability bottlenecks when code is running on multi-core, multi-node deployments
  • Innovate and publish papers, patents and participate in technical conferences to advance AMD technologies
  • Continuously learn and grow along with evolving X86 server CPU architecture and application landscape
  • Lead collaborative approaches with multiple teams
  • Mentor others to achieve integrated projects.
  • Fulltime

Senior Software Engineer - Performance Tooling

The Artificial Intelligence (AI) Frameworks team at Microsoft develops AI softwa...
Location: United States, Redmond
Salary: 119800.00 - 234700.00 USD / Year
Company: Microsoft Corporation (https://www.microsoft.com/)
Expiration Date: Until further notice
Requirements
  • Bachelor's Degree in Computer Science or related technical field AND 4+ years technical engineering experience with coding in languages including, but not limited to, C++, or Python OR equivalent experience
  • Ability to meet Microsoft, customer and/or government security screening requirements is required for this role. This includes passing the Microsoft Cloud background check upon hire/transfer and every two years thereafter
  • Master's Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C++, or Python OR Bachelor's Degree in Computer Science or related technical field AND 8+ years technical engineering experience with coding in languages including, but not limited to, C++, or Python OR equivalent experience
  • 4+ years’ practical experience working on high performance applications and performance debugging and optimization on CPUs/GPUs
  • Experience in DNN/LLM inference and in one or more DL frameworks such as PyTorch, TensorFlow, or ONNX Runtime, and familiarity with CUDA, ROCm, Triton
  • Technical background and solid foundation in software engineering principles, computer architecture, GPU architecture, hardware neural net acceleration
  • Experience in end-to-end performance analysis and optimization of state of the art LLMs and HPC applications, including proficiency using GPU profiling tools
  • Cross-team collaboration skills and the desire to collaborate in a team of researchers and developers
  • Ability to independently lead projects
Job Responsibility
  • Work across multiple layers of the AI software stack (abstractions, programming models, compilers, runtimes, libraries, and APIs) to enable large-scale model training and inference
  • Benchmark OpenAI and other LLMs for performance on GPUs and Microsoft hardware
  • Debug, profile, and optimize performance for training/inference workloads on Central Processing Units (CPUs)/Graphics Processing Units (GPUs)
  • Monitor performance regressions and drive continuous improvements to reduce time-to-deploy and hardware footprint
  • Collaborate across teams of researchers and engineers to deliver scalable, production-ready AI performance improvements
  • Fulltime

Senior Power Engineer

Microsoft Silicon, Cloud Hardware, and Infrastructure Engineering (SCHIE) is the...
Location: United States, Redmond
Salary: 119800.00 - 234700.00 USD / Year
Company: Microsoft Corporation (https://www.microsoft.com/)
Expiration Date: Until further notice
Requirements
  • Master's Degree in Electrical Engineering, Computer Engineering, or related field AND 3+ years technical engineering experience OR Bachelor's Degree in Electrical Engineering, Computer Engineering, or related field AND 5+ years technical engineering experience OR equivalent experience
  • Ability to meet Microsoft, customer and/or government security screening requirements is required for this role
  • These requirements include but are not limited to the following specialized security screenings: Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud Background Check upon hire/transfer and every two years thereafter
Job Responsibility
  • Define and design rack-level power systems to enable integration of a variety of IT gear into Microsoft data centers
  • Work with cross-functional teams to drive product qualification with full test coverage to meet product requirements and ensure product deployment
  • Lead rack level power system testing and identify integration issues between PSU, power shelf, firmware and system load
  • Contribute to rack level power system solution roadmap with cross-functional teams and vendor partners to ensure long term scalability of Microsoft power infrastructure
  • Collaborate with cross-functional teams on power delivery for high-performance compute (HPC), AI workloads, and ODM platforms
  • Define and enforce power system design standards, safety protocols, and compliance with global regulatory requirements
  • Engage with external partners, suppliers, and academic collaborators to advance power system innovation

Senior Software Engineer - Copilot Security

Copilot Security is at the core of Microsoft’s mission to deliver trusted, human...
Location: United States, Redmond
Salary: 119800.00 - 234700.00 USD / Year
Company: Microsoft Corporation (https://www.microsoft.com/)
Expiration Date: Until further notice
Requirements
  • Bachelor's Degree in Computer Science or related technical field AND 4+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience.
  • 3+ years in technical engineering roles building large-scale services.
  • Hands-on experience designing and operating security-critical or AI-powered systems at scale, including agentic AI, secure orchestration, or advanced threat defenses.
  • Proven ability to design, build, and ship agentic AI features or frameworks.
  • Ability to clearly explain complex systems and security concepts to technical and non-technical stakeholders and influence cross-org roadmaps.
  • Agentic AI Development & Orchestration: Experience building production agent systems using frameworks such as LangGraph, Amazon Strands SDK, or similar platforms; familiarity with agentic design patterns including tool calling, multi-agent coordination, and secure delegation patterns.
  • Hands-on experience with distributed training frameworks (Ray, Slurm, HPC), containerization and orchestration technologies (Docker, Kubernetes) for ML model deployment, and ML lifecycle management in production environments.
  • Experience designing evaluation frameworks for LLM-based applications and implementing observability for agent systems using tools such as Phoenix, MLflow, Langfuse, or custom eval harnesses; understanding of AI safety evaluation methodologies including adversarial testing and red-teaming.
Job Responsibility
  • Develop and ship agentic AI-powered security features that protect users from threats such as prompt injection, adversarial manipulation, and abuse of agentic workflows.
  • Implement secure orchestration frameworks that enable Copilot to safely delegate, coordinate, and execute actions across devices, services, and platforms.
  • Invent and apply new intelligent agents that leverage information flow analysis and apply common sense and judgement guardrails for security and privacy.
  • Collaborate with product, engineering, security, privacy, and AI teams to adopt agentic security patterns and best practices across Copilot and MAI.
  • Monitor key metrics for agentic AI security and innovation, using data-driven insights to improve defenses and enablement.
  • Document secure agentic AI patterns, ensuring they address novel risks, support safe delegation, and enable responsible orchestration of actions.
  • Fulltime

Senior DevOps Engineer (AI & Cloud Infrastructure)

We are seeking a Senior DevOps Engineer to design, deploy, and operate the next ...
Location: United States, Palo Alto
Salary: 175000.00 - 250000.00 USD / Year
Company: Inflection AI (inflection.ai)
Expiration Date: Until further notice
Requirements
  • 5+ years of hands-on experience in DevOps, Site Reliability Engineering, or ML Infrastructure supporting high-scale, production systems
  • Deep expertise in Azure and AWS, including storage, compute, networking, databases, and cloud-native monitoring services
  • Strong Kubernetes administration experience, including GPU scheduling, operator deployment, and management of core infrastructure components; experience with Slurm is highly desirable
  • Proven experience deploying, scaling, and operating Large Language Models (LLMs) and inference engines such as vLLM, TGI, or Triton
  • Strong experience with modern DevOps tooling: Terraform, Helm, Kustomize, ArgoCD, GitHub Actions or GitLab CI, Prometheus, Grafana, and ClickHouse
  • Advanced scripting and automation skills in Python and Bash, with the ability to debug complex distributed systems and optimize performance at scale
  • Demonstrated ability to troubleshoot LLM servers, Kubernetes workloads, GPU utilization, and cloud infrastructure bottlenecks
  • Bachelor’s degree or equivalent in a field related to the position.
Job Responsibility
  • Architect, deploy, and operate large-scale LLM inference servers and AI applications with a focus on low latency, high availability, and production reliability
  • Design, provision, and maintain complex cloud architectures across Azure and AWS, including storage, compute, networking, databases, and native LLM services
  • Manage GPU-enabled Kubernetes clusters and Slurm-based HPC environments, optimizing resource allocation for AI training and inference workloads
  • Deploy and operate core Kubernetes infrastructure components and operators (GPU operators, ingress controllers, service meshes, CNIs, CSIs, and storage drivers)
  • Build scalable infrastructure-as-code and deployment workflows using Terraform, Helm, Kustomize, ArgoCD, and GitOps best practices
  • Design and maintain centralized observability systems using Prometheus, Grafana, ClickHouse, and cloud-native monitoring tools
  • Participate in on-call rotations, lead incident response, perform post-mortems, and continuously improve system reliability and SLAs.
What we offer
  • Diverse medical, dental and vision options
  • 401k matching program
  • Unlimited paid time off
  • Parental leave and flexibility for all parents and caregivers
  • Support of country-specific visa needs for international employees living in the Bay Area
  • Meaningful equity component.
  • Fulltime

Senior Research Engineer

The HPE HPC & AI EMEA Research Lab (ERL) is characterized by a unique blend of i...
Location: Germany, Munich, Berlin
Salary: Not provided
Company: Hewlett Packard Enterprise (https://www.hpe.com/)
Expiration Date: Until further notice
Requirements
  • Development experience in compiled languages such as C, C++ or Fortran and experience with interpreted environments such as Python
  • At least a B.Sc. equivalent in a Science, Technology, Engineering or Mathematical discipline
  • Parallel programming experience, with programming models such as OpenMP, MPI, CUDA, OpenACC, HIP, PGAS languages, etc.
  • An understanding of AI/ML frameworks, experience with frameworks such as TensorFlow or PyTorch is highly desirable
  • An interest in system- and data center monitoring and operational data analysis
  • Professional language skills in English and German
Job Responsibility
  • Perform world-class research while also shaping products of the future
  • Work with the most esteemed research partners across Europe
  • Enable high performance research software on pre-Exascale and Exascale supercomputers
  • Provide new environments/abstractions to support application developers to build, deploy, and run applications taking advantage of leading-edge hardware at scale
  • Make and operate HPC/AI systems and datacenters in a sustainable way
  • Manage modern data-intensive workloads in high performance environments
What we offer
  • Competitive salary and extensive benefits package (pension scheme, insurances, bike and car leasing, and other fringe benefits)
  • Work-life balance (flexible working time and hybrid workplace model, 30 vacation days, four HPE Wellness-Fridays, up to six months paid parental leave)
  • Support for education, training, and career development
  • Diverse and dynamic work environment