HPC Operations Lead

Linux Recruit

Location:
United Kingdom, London

Contract Type:
Employment contract

Salary:

73000.00 - 82000.00 GBP / Year

Job Description:

Lead the systems that power discovery. Behind every breakthrough in modern science sits the computational infrastructure that makes it possible: the platforms, clusters and storage environments that turn bold ideas into real progress. This is an opportunity to lead that foundation, working at the intersection of technology and discovery.

You will join a world-leading research institute where scientists and engineers work side by side to tackle some of the most complex challenges in science and technology. The culture is open, collaborative and deeply curious, designed to remove barriers and enable innovation at scale.

As HPC Operations Lead, you will play a central role in shaping how research computing services are delivered and evolved. Reporting into the Head of Research Computing Platforms, you will take ownership of the operational performance of a large-scale HPC and storage environment, ensuring systems are robust, responsive and continuously improving.

This is a leadership role with real breadth. You will guide a specialist team, oversee service delivery and act as a key point of connection between technical teams and scientific users. From managing incidents and service performance to influencing long-term technology direction and strategy, your work will directly support research outcomes across the organisation.

A key part of the role is ensuring that complex infrastructure remains accessible and usable. You will engage closely with researchers to understand their needs, translate technical concepts into clear language and help shape platforms that genuinely enable scientific progress. Alongside this, you will lead on the design and operation of high-performance storage services, supporting both internal workloads and external collaboration.

The environment includes large-scale HPC clusters, Linux-based systems, workload schedulers such as Slurm, networking with InfiniBand and parallel file systems such as GPFS. Experience with high-performance storage at petabyte scale is particularly relevant, alongside a broader understanding of automation, data centre environments or networking.

You will bring proven leadership experience, strong operational awareness and the ability to manage complex services with limited resources and competing priorities. Just as important is your ability to work collaboratively across teams, balancing technical depth with a clear focus on outcomes.

This is a role for someone who wants their work to matter. Every system you improve and every service you shape will contribute to research that has the potential to change lives.

Job Responsibility:

  • Play a central role in shaping how research computing services are delivered and evolved
  • Take ownership of the operational performance of a large-scale HPC and storage environment
  • Ensure systems are robust, responsive and continuously improving
  • Guide a specialist team
  • Oversee service delivery
  • Act as a key point of connection between technical teams and scientific users
  • Manage incidents and service performance
  • Influence long-term technology direction and strategy
  • Ensure complex infrastructure remains accessible and usable
  • Engage closely with researchers to understand their needs
  • Translate technical concepts into clear language
  • Help shape platforms that genuinely enable scientific progress
  • Lead on the design and operation of high-performance storage services
  • Support both internal workloads and external collaboration

Requirements:

  • Proven leadership experience
  • Strong operational awareness
  • Ability to manage complex services with limited resources and competing priorities
  • Ability to work collaboratively across teams
  • Experience with large-scale HPC clusters
  • Experience with Linux-based systems
  • Experience with workload schedulers such as Slurm
  • Experience with InfiniBand networking
  • Experience with parallel file systems such as GPFS
  • Experience with high-performance storage at petabyte scale
  • Broader understanding of automation, data centre environments or networking

Additional Information:

Job Posted:
May 04, 2026

Employment Type:
Fulltime
Work Type:
On-site work

Similar Jobs for HPC Operations Lead

Lead Solution Architect

Lead Solution Architect role at Hewlett Packard Enterprise focusing on designing...
Location:
India, Bangalore
Salary:
Not provided
Hewlett Packard Enterprise
Expiration Date:
Until further notice
Requirements:
  • 8+ years of professional experience
  • Bachelor of Arts/Science or equivalent degree in computer science or related area of study
  • without a degree, 11+ years of relevant professional experience
  • Proficiency in container orchestration platforms (Red Hat OpenShift, SUSE Rancher, CNCF Kubernetes)
  • Experience with GPU-accelerated workloads and tools like NVIDIA GPU Operator and DCGM
  • Ability to integrate Kubernetes with AI/ML workloads and GPU infrastructure in hybrid or private cloud environments
  • Experience architecting HPC clusters including GPU/compute nodes and HPC storage technologies (Lustre, WEKA, Parallel Filesystems)
  • Understanding of high-speed networking (InfiniBand, Mellanox, RoCE)
  • Experience with HPC cluster management tools (HPE Cluster Management, NVIDIA Base Command Manager)
  • Familiarity with HPC workload schedulers (Slurm, Altair PBS Pro)
Job Responsibility:
  • Design and scope multiple deliverables across multiple technologies
  • Lead team in delivery of multiple deliverables
  • Develop solutions that enhance availability, performance, maintainability and agility of customer's enterprise
  • Contribute to design and application of new tools
  • Re-use existing experience to develop new solutions
  • Understand architectural dependencies of technologies in customer's IT environment
  • Advise, integrate, and accelerate customers' outcomes from digital transformation
What we offer:
  • Health & Wellbeing benefits
  • Personal & Professional Development programs
  • Comprehensive suite of benefits supporting physical, financial and emotional wellbeing
  • Career development programs
  • Flexible work arrangements
  • Inclusive work environment
Employment Type:
Fulltime

HPC System Software Analyst

Provide technology consulting to external customers and internal project teams. ...
Location:
United States, Los Alamos
Salary:
101900.00 - 234500.00 USD / Year
Hewlett Packard Enterprise
Expiration Date:
Until further notice
Requirements:
  • Active Department of Energy (DOE) Q Clearance or have held one in the past 3 years
  • if previous clearance, must not foresee a problem with it being reinstated
  • duties require US Citizenship
  • Bachelor’s degree in Computer Science, Engineering, or related area of study
  • 4+ years HPC-related experience, ideally with large-scale HPC and parallel file system administration and support
  • without a degree, three additional years of relevant professional experience (7+ years in total)
  • understanding of a HPC Data Center IT Operations environment
  • expertise in HPC application consulting and support
  • strong system administration skills, particularly in HPC environments
  • extensive knowledge and experience with Linux operating systems (RHEL or SLES)
Job Responsibility:
  • Provide on-site system administration and HPC application consulting services
  • address and resolve the current top issues in the HPC environment
  • maintain the HPC systems availability to the customer
  • monitor system performance and provide recommendations for improvements
  • collaborate with team members and stakeholders to deliver high-quality support and solutions
  • create and document site procedures, system diagrams, and other configuration or support documents
  • maintain system software and firmware revisions, including patches, updates, and OS upgrades
  • solve system hardware, software, and third-party software issues, and provide detailed and thoughtful analysis of problem and solution
  • gather data, perform analysis, and escalate problems to higher-level product support groups and appropriate management when necessary to ensure timely resolution of system or customer issues
  • provide solutions and implement repair or workarounds when possible, fully documenting steps taken when required
What we offer:
  • Comprehensive suite of benefits supporting physical, financial and emotional wellbeing
  • Specific programs to help you reach your career goals
  • Unconditionally inclusive way of working that celebrates individual uniqueness
  • Diverse backgrounds are valued and succeed here
Employment Type:
Fulltime

HPC Operations Lead

One of Europe’s most exciting research organisations is on the hunt for a Lead E...
Location:
United Kingdom, London
Salary:
70000.00 - 80000.00 GBP / Year
Linux Recruit
Expiration Date:
Until further notice
Requirements:
  • Knowledge of HPC environments and large-scale storage
  • Experience leading people and platforms
  • Ability to communicate with clarity and warmth
  • Comfortable juggling priorities and working with different stakeholders
  • Ability to find practical solutions in a fast-moving research setting
  • Experience in science or biomedical research is beneficial
  • Curiosity and a collaborative mindset
Job Responsibility:
  • Take ownership of high-performance compute and large-scale storage platforms
  • Ensure platforms are reliable, responsive, and ready
  • Work closely with researchers and technology teams
  • Oversee the HPC service desk
  • Guide incident response
  • Help shape the future direction of the platforms
  • Design and deliver training
  • Support users
  • Step into a wider leadership role when required
What we offer:
  • Excellent benefits
  • Culture that encourages ideas, learning and teamwork
Employment Type:
Fulltime

HPC AI-BU District Service Manager

Responsible for the overall management of a service segment of significant scope...
Location:
United States, Memphis
Salary:
101900.00 - 234500.00 USD / Year
Hewlett Packard Enterprise
Expiration Date:
Until further notice
Requirements:
  • Bachelor's degree preferred or equivalent experience
  • Five to ten years of related experience in customer support in a technical environment with proven managerial abilities
  • 3-10 years of team lead experience with small to medium teams (3-25 people), as HPC Field Team Lead or similar
  • Experience in a dynamic environment / adaptable to change / receptive to constructive feedback
  • 360-degree relationships and communication with peers, junior staff, leaders, and customers
Job Responsibility:
  • Work closely with the Site Team Leads to plan, direct, and monitor operational/tactical activities of technical on-site team
  • Manage / coordinate customer escalations, and escalations of technical, process, or materials issues encountered by field team
  • Provide guidance on process improvements and recommend changes in alignment with business tactics and strategy for area of responsibility
  • Responsible for the full understanding of the service contract and associated terms and conditions
  • Proactively identify, report on, and close risks to Service Level Agreement (SLA) or customer satisfaction
  • Meet business and operational targets by managing core site and business metrics - Key Performance Indicators (KPIs)
  • Routine status updates to Services Geo Lead (Director)
  • Establish and manage relationships with customers
  • Establish and maintain close collaborative relationship with the sales account team and stakeholders
  • Regularly visit sites to meet field teams and customers (approximately 25% travel)
What we offer:
  • Health & Wellbeing
  • Personal & Professional Development
  • Unconditional Inclusion
Employment Type:
Fulltime

Principal Software Automation Engineer

Microsoft Silicon Cloud Hardware Infrastructure Engineering (SCHIE) is the team ...
Location:
United States, Mountain View
Salary:
139900.00 - 274800.00 USD / Year
Microsoft Corporation
Expiration Date:
Until further notice
Requirements:
  • Bachelor's Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
  • Ability to meet Microsoft, customer and/or government security screening requirements is required for this role
  • These requirements include but are not limited to the following specialized security screenings: Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud Background Check upon hire/transfer and every two years thereafter
Job Responsibility:
  • CoE Leadership & Technical Authority: Own the end-to-end automation strategy for HPC, operational platforms, and Azure integrations. Define reference architectures, standards, and coding methodologies. Serve as the highest-level technical escalation point for automation, reliability, and integration challenges across the org
  • Roadmaps & Standards: Create and maintain multi-year automation roadmaps aligned to business priorities. Establish coding standards, testing strategies, code quality, security baselines, and operational readiness criteria adopted across teams
  • Team Leadership: Build, mentor, and technically lead a software automation team over time. Set hiring bar, role definitions, and career paths; coach senior engineers; lead by example through hands-on contributions
  • Hands-on Engineering (Principal IC): Architect, design, implement, and operate production-grade automation platforms for HPC infrastructure and cloud services
  • Operational Automation at Scale: Eliminate manual and error-prone work by codifying provisioning, imaging, patching, validation, break/fix, incident response, and self-healing remediation workflows
  • Platform & Service Integrations: Design robust API-first, event-driven, and asynchronous integrations across internal platforms for HPC services, and Azure-native services
  • ETL & Data Engineering: Build and evolve data pipelines that ingest, transform, and validate telemetry, logs, metrics, and operational signals. Enable reliability analysis, capacity forecasting, cost optimization, and executive reporting
  • Azure Automation & Governance: Lead infrastructure-as-code, CI/CD pipelines, identity and access automation (RBAC), policy enforcement, secrets management, and monitoring with security-by-default and compliance-aware practices
Employment Type:
Fulltime

Senior Supercomputing Operations Engineer

Microsoft Azure’s Artificial Intelligence and High‑Performance Computing (AI/HPC...
Location:
United States, Multiple Locations
Salary:
119800.00 - 234700.00 USD / Year
Microsoft Corporation
Expiration Date:
Until further notice
Requirements:
  • Bachelor's Degree in Computer Science or related technical field AND 4+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python, OR equivalent experience
  • Ability to meet Microsoft, customer and/or government security screening requirements
  • Microsoft Cloud Background Check upon hire/transfer and every two years thereafter
  • Master's Degree in Computer Science or related technical field AND 8+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR Bachelor's Degree in Computer Science or related technical field AND 12+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
  • 4+ years of experience operating high performance computing (HPC), artificial intelligence (AI), or large-scale distributed systems in production environments
  • Hands-on experience operating interconnect fabrics for HPC, AI, or large-scale distributed systems in production
  • Strong Linux systems knowledge with demonstrated experience debugging low-level infrastructure issues
  • Demonstrated ability to reason across hardware, firmware, drivers, and software stacks to diagnose and resolve production issues
  • Familiarity with InfiniBand Subnet Manager behavior, including routing, congestion control, and fabric telemetry
Job Responsibility:
  • Act as DRI for InfiniBand and GPU interconnect fabric operations, ensuring GPU availability and AI training stability
  • Lead incident triage, mitigation, recovery, and root cause analysis for fabric-related production issues
  • Perform deep multi-layer debugging across InfiniBand, Subnet Manager, GPU interconnect, PCIe, and GPU interactions
  • Drive operational excellence and prevention by identifying systemic failure patterns and authoring TSGs, playbooks, and escalation guides
  • Build and leverage automation, telemetry, and tooling to improve detection, debuggability, and mean time to mitigation
Employment Type:
Fulltime

HPC SME

HPE Operations is our innovative IT services organization. It provides the exper...
Location:
India, Bangalore
Salary:
Not provided
Hewlett Packard Enterprise
Expiration Date:
Until further notice
Requirements:
  • 8-12 years of experience with different flavours of Linux like SLES, RHEL and Ubuntu/Debian
  • 5-8 years experience in managing HPC/Linux clusters with good understanding of its architecture
  • Skilled in installation and configuration of various applications on Linux
  • Install, administer, and maintain hardware, system software, networking, accounts, and security measures on VMWare configuration
  • Diagnose and resolve system issues and performance issues
  • Experience in drafting technical SOPs, action plans and knowledge documents
  • Good understanding of different cloud platforms
  • Reinstate integrity of system as quickly as possible following an outage
  • Triage and solve user-submitted tickets
  • Track resource usage using monitoring and queuing software
Job Responsibility:
  • Review and validate HPC solutions and environments through POCs and benchmarking
  • Architect and design HPC solutions tailored to the customer's needs
  • Oversee solution implementation, integration and testing
  • Diagnose and correct solution issues during implementation
  • Provide training, documentation and ongoing support
  • Maintain life-cycle management of the HPC environment
  • Oversee the team's operations and deliverables
  • Lead the team with technical expertise and ensure regular technical sessions and case reviews
  • Demonstrate a high level of technical and communication skills in critical situations
  • Take responsibility for end-to-end problem ownership and its solutions
What we offer:
  • Health & Wellbeing benefits
  • Personal & Professional Development programs
  • Unconditional Inclusion environment
  • Comprehensive suite of benefits supporting physical, financial and emotional wellbeing
Employment Type:
Fulltime

HPC SME

HPE Operations is our innovative IT services organization. It provides the exper...
Location:
India, Bangalore
Salary:
Not provided
Hewlett Packard Enterprise
Expiration Date:
Until further notice
Requirements:
  • 8-12 years of experience with different flavours of Linux like SLES, RHEL and Ubuntu/Debian
  • 5-8 years of experience in managing HPC/Linux clusters, with a good understanding of their architecture
  • Skilled in installation and configuration of various applications on Linux
  • Install, administer, and maintain hardware, system software, networking, accounts, and security measures on VMWare configuration
  • Diagnose and resolve system issues and performance issues
  • Should have experience in drafting technical SOPs, action plans and knowledge documents
  • Should have good understanding of different cloud platforms
  • Reinstate integrity of system as quickly as possible following an outage in order to minimize downtime
  • Triage and solve user-submitted tickets, especially when they relate to the infrastructure
  • Track resource usage using monitoring and queuing software
Job Responsibility:
  • Review and validate HPC solutions and environments through POCs and benchmarking
  • Architect and design HPC solutions tailored to the customer’s needs
  • Oversee solution implementation, integration and testing
  • Diagnose and correct solution issues during implementation
  • Provide training, documentation and ongoing support
  • Maintain life-cycle management of the HPC environment
  • Oversee the team's operations and deliverables
  • Lead the team with technical expertise and ensure regular technical sessions and case reviews
  • Demonstrate a high level of technical and communication skills in critical situations
  • Take responsibility for end-to-end problem ownership and its solutions
What we offer:
  • Health & Wellbeing
  • Personal & Professional Development
  • Unconditional Inclusion