
HPC Operations Lead

United Kingdom, London · Employment contract · 73000.00 - 82000.00 GBP / Year · Job Posted May 04, 2026

Job Description

Lead the systems that power discovery. Behind every breakthrough in modern science sits the computational infrastructure that makes it possible: the platforms, clusters and storage environments that turn bold ideas into real progress. This is an opportunity to lead that foundation, working at the intersection of technology and discovery.

You will join a world-leading research institute where scientists and engineers work side by side to tackle some of the most complex challenges in science and technology. The culture is open, collaborative and deeply curious, designed to remove barriers and enable innovation at scale.

As HPC Operations Lead, you will play a central role in shaping how research computing services are delivered and evolved. Reporting to the Head of Research Computing Platforms, you will take ownership of the operational performance of a large-scale HPC and storage environment, ensuring systems are robust, responsive and continuously improving.

This is a leadership role with real breadth. You will guide a specialist team, oversee service delivery and act as a key point of connection between technical teams and scientific users. From managing incidents and service performance to influencing long-term technology direction and strategy, your work will directly support research outcomes across the organisation.

A key part of the role is ensuring that complex infrastructure remains accessible and usable. You will engage closely with researchers to understand their needs, translate technical concepts into clear language and help shape platforms that genuinely enable scientific progress. Alongside this, you will lead on the design and operation of high-performance storage services, supporting both internal workloads and external collaboration.

The environment includes large-scale HPC clusters, Linux-based systems, workload schedulers such as Slurm, InfiniBand networking and parallel file systems such as GPFS. Experience with high-performance storage at petabyte scale is particularly relevant, alongside a broader understanding of automation, data centre environments or networking.

You will bring proven leadership experience, strong operational awareness and the ability to manage complex services with limited resources and competing priorities. Just as important is your ability to work collaboratively across teams, balancing technical depth with a clear focus on outcomes.

This is a role for someone who wants their work to matter. Every system you improve and every service you shape will contribute to research that has the potential to change lives.

Job Responsibility

  • Play a central role in shaping how research computing services are delivered and evolved
  • Take ownership of the operational performance of a large-scale HPC and storage environment
  • Ensure systems are robust, responsive and continuously improving
  • Guide a specialist team
  • Oversee service delivery
  • Act as a key point of connection between technical teams and scientific users
  • Manage incidents and service performance
  • Influence long-term technology direction and strategy
  • Ensure complex infrastructure remains accessible and usable
  • Engage closely with researchers to understand their needs
  • Translate technical concepts into clear language
  • Help shape platforms that genuinely enable scientific progress
  • Lead on the design and operation of high-performance storage services, supporting both internal workloads and external collaboration

Requirements

  • Proven leadership experience
  • Strong operational awareness
  • Ability to manage complex services with limited resources and competing priorities
  • Ability to work collaboratively across teams
  • Experience with large-scale HPC clusters
  • Linux-based systems
  • Workload schedulers such as Slurm
  • Networking with InfiniBand
  • Parallel file systems such as GPFS
  • Experience with high-performance storage at petabyte scale
  • Broader understanding of automation, data centre environments or networking

Similar Jobs for HPC Operations Lead

8 matching positions

HPC Operations Lead

One of Europe’s most exciting research organisations is on the hunt for a Lead E...
Location: United Kingdom, London
Salary: 70000.00 - 80000.00 GBP / Year
Linux Recruit
Expiration Date: Until further notice
Requirements
  • Knowledge of HPC environments and large-scale storage
  • Experience leading people and platforms
  • Ability to communicate with clarity and warmth
  • Comfortable juggling priorities and working with different stakeholders
  • Ability to find practical solutions in a fast-moving research setting
  • Experience in science or biomedical research is beneficial
  • Curiosity and a collaborative mindset
Job Responsibility
  • Take ownership of high-performance compute and large-scale storage platforms
  • Ensure platforms are reliable, responsive, and ready
  • Work closely with researchers and technology teams
  • Oversee the HPC service desk
  • Guide incident response
  • Help shape the future direction of the platforms
  • Design and deliver training
  • Support users
  • Step into a wider leadership role when required
What we offer
  • Excellent benefits
  • Culture that encourages ideas, learning and teamwork
Full-time

HPC Operations Engineering Manager

Microsoft AI is seeking an experienced HPC Operations Engineering Manager to joi...
Location: United States, Mountain View
Salary: 139900.00 - 274800.00 USD / Year
Microsoft Corporation
Expiration Date: Until further notice
Requirements
  • Bachelor's Degree in Computer Science or related technical field AND 8+ years technical engineering experience with Site Reliability Engineering, DevOps, or Infrastructure Engineering Leadership roles AND 8+ years experience with Kubernetes, Docker, and container orchestration, AND 8+ years experience with public cloud platforms like Azure/AWS/GCP and infrastructure-as-code, AND 6+ years experience with programming/scripting skills not limited to Python, Go, or Bash, OR equivalent experience
  • Master's Degree in Computer Science or related technical field AND 12+ years technical engineering experience AND 10+ years experience with Kubernetes, Docker, and container orchestration, AND 10+ years experience with public cloud platforms like Azure/AWS/GCP and infrastructure-as-code, OR equivalent experience
  • 6+ years people management experience
  • Experience in monitoring & observability tools (Grafana, Datadog, OpenTelemetry, etc.)
  • Experience running large-scale GPU clusters for ML/AI workloads
  • Experience with high-performance computing (HPC) and workload schedulers (Kubernetes operators)
  • Knowledge of CI/CD pipelines for Inference and ML model deployment
  • Solid knowledge of distributed systems, networking, and storage
Job Responsibility
  • Team leadership: Lead a team of experienced SREs to ensure uptime, resiliency and fault tolerance of AI model training and inference systems
  • Observability: Design and help maintain monitoring, alerting, and logging systems to provide real-time visibility into model serving pipelines and infra
  • Automation & Tooling: Lead building of automation for deployments, incident response, scaling, and failover in hybrid cloud/on-prem CPU+GPU environments
  • Incident Management: Lead on-call rotations, troubleshoot production issues, conduct blameless postmortems, and drive continuous improvements
  • Security & Compliance: Ensure data privacy, compliance, and secure operations across model training and serving environments
  • Collaboration: Partner with ML engineers and platform teams to improve developer experience and accelerate research-to-production workflows
Full-time

Systems and Operations QA Lead

The Systems and Operations QA Lead will oversee the regulatory and systems re...
Location: Philippines, Gateway, Cavite
Salary: Not provided
Unilever
Expiration Date: Until further notice
Requirements
  • Graduate of any Science course, preferably a licensed Engineer or Chemist (not required)
  • Good people handling, coaching, and leadership skills
  • Problem solving and data analysis skills
  • Presentation skills (well versed in MS Office and Power Apps)
  • Reading, comprehension, and communication skills
  • Working knowledge in GLP, WCM, TPM, HACCP and HPC 420
  • Minimum of 5 years experience in FMCG
Job Responsibility
  • Implements a program to ensure the Quality Team is trained and properly executing all necessary quality checks based on standards
  • Ensures all relevant personnel are knowledgeable of the Quality Management System
  • Liaises with BU/BG Quality team, R&D, Procurement, Planning and other groups at the factory
  • Coordinates with relevant regulatory groups on any important updates and actions needed to ensure the site is fully compliant with the latest standards
  • Develops and builds quality protocols aligned with category requirements and protocols
  • Ensures the HACCP and Quality Plan are available, updated based on the established program, effective and communicated to all relevant personnel
  • Ensures internal audit program is in place
  • Manages daily schedule of Quality Engineers, Innovation & Change QA and Systems QA
  • Provides timely disposition on held finished products
  • Coordinates with other factory departments on any operations-related concerns and ensures necessary actions are taken
Full-time

Senior Software Engineer - ML Network Stack

We are seeking an experienced engineer to join our team that owns the network st...
Location: Israel, Tel Aviv
Salary: Not provided
Amazon
Expiration Date: Until further notice
Requirements
  • 5+ years of non-internship professional software development experience
  • 5+ years of leading design or architecture (design patterns, reliability and scaling) of new and existing systems experience
  • 5+ years of full software development life cycle, including coding standards, code reviews, source control management, build processes, testing, and operations experience
  • 3+ years as a mentor, tech lead or leading engineering teams
  • 3+ years experience in SW/HW Co-Design
Job Responsibility
  • Be a senior engineer on a team that builds and maintains the infrastructure that monitors and reports on functionality and performance of massive testing workloads run at scale
  • Use internal Amazon CI/CD tools, Linux, and public AWS products to automate the delivery of our software to customers, saving developer time
  • Write Python code that effortlessly spools up large clusters and runs benchmarks and applications for ML and HPC workloads
  • Use AWS Managed Grafana and Athena to digest the massive amount of performance data generated by these workloads and create dashboards for developers and stakeholders
  • Invent automatic mechanisms to alert developers to functional and performance regressions so they never reach customers
  • Manage the complexity of infrastructure that covers many instance types, software stacks, Linux operating systems and cutting-edge releases, and make it easy to evolve

Logistics Chargehand CLS - Days

As a Logistics Chargehand within the Construction Logistics Space, you will lead...
Location: United Kingdom, Somerset
Salary: 24.24 GBP / Hour
Wilson James
Expiration Date: June 25, 2026
Requirements
  • CITB HS&E Supervisor Test (to be completed before start date)
  • SSSTS (must be obtained within the first 3 months of employment)
  • CPCS Card (Traffic Marshal) or willing to work towards
  • Previous Supervisory Experience
  • Demonstrate the ability to think and act quickly in emergencies or under pressure
  • Able to successfully execute SOPs, Risk Assessments, and Method Statements with teams
  • Excellent communication skills, written & verbal, with internal and external stakeholders
  • Demonstrate reliability and strong time management skills to ensure team effectiveness
  • Understanding of Health and Safety responsibilities and obligations
  • You must be able to provide a 3-year work/unemployment/education history and complete the required vetting process in line with HPC protocols
Job Responsibility
  • Lead a team within a dynamic construction logistics environment
  • Oversee the coordination and safe movement of permanent materials
  • Traffic marshalling duties
  • Ensure all logistical operations run smoothly and safely
  • Combine team supervision with some independent working across multiple logistics areas
What we offer
  • £24.24 per hour
  • Working on a rotation of 12-hour shifts (4 on, 5 off; 5 on, 4 off; 5 on, 5 off), providing a positive work/life balance
  • 23 days annual leave including bank holidays
  • Life assurance scheme
  • Pension Scheme 5% employer contribution
  • Employee Assistance Programme that provides a health and wellbeing support service, including access to an online GP
  • Access to an industry leading Employee Benefits Platform offering lifestyle savings and discounts on most high street retailers, a Reward and Recognition programme
  • The opportunity to develop your career with access to training and development programmes
  • As an employer of choice, we focus on wellbeing, training, and career progression
Full-time

Principal Product Manager - Virtualization Architect

Designs, plans, develops, and manages a product or portfolio of virtualization p...
Location: United States, All
Salary: Not provided
Hewlett Packard Enterprise
Expiration Date: Until further notice
Requirements
  • Bachelor's degree or equivalent in computer science, engineering or related field of study
  • 10+ years of experience in product management, engineering, or a related technical role, with significant exposure to virtualization platforms and hypervisor technologies
  • Demonstrated hands-on or architectural familiarity with KVM, QEMU, libvirt, and the broader Linux virtualization stack
  • Deep technical knowledge of KVM hypervisor architecture, virtual machine lifecycle, vCPU scheduling, memory management (huge pages, NUMA), virtio device emulation, and hardware-assisted virtualization (Intel VT-x/AMD-V, IOMMU)
  • Strong understanding of virtualized networking (OVS, macvtap, SR-IOV, DPDK) and storage virtualization (virtio, iSCSI, NVMe-oF, Ceph/RBD) as they apply to KVM guest workloads
  • Familiarity with virtualization management and orchestration ecosystems - including libvirt APIs, oVirt, OpenStack Nova, and KubeVirt - and the ability to define product integration requirements across these layers
  • Extensive cross-functional leadership skills: ability to drive alignment across engineering, field, and partner organizations on complex, technically ambiguous virtualization platform initiatives
  • Strong financial and business acumen, including experience building business cases, defining performance metrics, and analyzing competitive positioning for infrastructure software products
  • Ability to provide product-specific technical training and enablement to sales, partners, and customer-facing teams on KVM and virtualization platform capabilities
  • Experience engaging with open-source ecosystems and upstream communities (Linux kernel, QEMU, libvirt, oVirt, OpenStack) as a product stakeholder
Job Responsibility
  • Independently leads end-to-end strategy and operational roadmap for one or more KVM-based virtualization products or a broader virtualization platform portfolio, spanning hypervisor core, management APIs, and guest ecosystem
  • Defines and drives the virtualization platform value proposition - including performance benchmarks, TCO advantages, and feature differentiation versus VMware, Hyper-V, and other competing hypervisor stacks - to support go-to-market and sales enablement
  • Synthesizes market and customer requirements (MRDs) by maintaining deep knowledge of enterprise virtualization use cases: VDI, server consolidation, cloud-native workloads, telco NFV/edge, and HPC virtualization
  • Translates KVM/QEMU/libvirt engineering capabilities into customer-facing requirements and product specifications, ensuring technical feasibility and roadmap alignment with upstream open-source communities (e.g., Linux kernel KVM subsystem, QEMU project)
  • Guides key stakeholders through all lifecycle phases - from hypervisor feature planning and kernel integration to product launch, sustaining engineering, and platform end-of-life planning
  • Collaborates across engineering, supply chain, and marketing to optimize product configuration, SKU design, pricing, and go-to-market strategies for virtualization platform offerings
  • Acts as a subject matter authority on KVM virtualization architecture, providing technical direction to internal teams, enabling sales and partner technical communities, and representing the product externally with customers and at industry forums
What we offer
  • Health & Wellbeing
  • Personal & Professional Development
  • Unconditional Inclusion
Full-time

Senior Cybersecurity Engineer

Senior Cybersecurity Engineer LOCATION: Eglin AFB, FL JOB STATUS: Full-time C...
Location: United States, Eglin Air Force Base
Salary: Not provided
Astrion
Expiration Date: Until further notice
Requirements
  • Master’s Degree (in Computer Science, Cybersecurity or a related field). Relevant experience may be substituted for the degree
  • 10 years’ total experience, at least 8 of which are in cybersecurity engineering, architecture or R&D infrastructure
  • Top Secret Clearance with SCI. Eligible for Special Access Program (SAP) access. US Citizenship is required
  • DoD 8570/8140 IAT Level III (CISSP, CISM, or equivalent). Certifications: Security+, CEH, or other relevant security certifications
  • Expert-level knowledge of cybersecurity principles, risk management, and secure computing architectures
  • Hands-on experience with security tools and technologies, such as SIEM, intrusion detection/prevention systems, vulnerability scanners, and endpoint protection solutions. Experience with Host-Based Security System (HBSS), Assured Compliance Assessment Solution (ACAS), Nessus, Tenable.sc, Tenable.io, NNM, LCE, Nessus Manager, Agents, and Scanner
  • Experience with scripting (Python, PowerShell) and automation tools (Ansible, Chef)
  • Familiarity with Risk Management Framework (RMF), Authority to Operate (ATO) documentation, and enclave compliance management
  • Physically able to lift up to 50 lbs and adaptable to fieldwork and hands-on installations
Job Responsibility
  • Collaborate with network engineers to architect secure network topologies for current and future connected and isolated environments, ensuring security is embedded in the design phase
  • Design and deploy security solutions for S&T environments that support continuous research, development, and DevSecOps, working closely with network engineers to implement and maintain these solutions
  • Advise on security planning for long-term initiatives, including SDREN integration and the Weapons Technology Integration Center (WTIC) and other facility projects, in conjunction with network planning efforts
  • Develop security innovation roadmaps aligned with mission goals and emerging technologies, coordinating with network engineers to ensure alignment with network modernization efforts
  • Coordinate with facilities, engineering, and network teams to ensure robust infrastructure supports secure research operations, focusing on the security aspects of network hardware/power/cooling needs and structured cabling
  • Lead security aspects of containerization, virtualization, and orchestration of systems to support laboratory computing, HPC, and edge devices, working with network engineers to implement secure configurations
  • Engineer the security architecture of multiple S&T networks in compliance with NIST 800-series, DoD RMF, DISA Security Technical Implementation Guides (STIGs), and cybersecurity best practices, collaborating with network engineers to ensure seamless integration. Review engineering, architecture, and designs to ensure DoD security policies are met
  • Implement DevSecOps pipelines to automate security scans and CI/CD deployments, working with network engineers to integrate security into existing pipelines
  • Manage ATO package development and collaborate with ISSMs, network engineers, and cybersecurity stakeholders to ensure compliance. Review and develop RMF Assessment and Authorization (A&A) documentation, e.g. System Security Plans (SSPs), Security Assessment Reports (SARs), and Plans of Action and Milestones (POA&Ms)
  • Integrate identity management and single sign-on solutions across enclaves and hybrid environments, coordinating with network engineers to implement and maintain these solutions. Analyze and tune HBSS policies for assets during integration test events. Perform verification and troubleshooting across all HBSS modules. Install updates to HBSS software as released and in compliance with STIG requirements. Monitor HBSS software to ensure that the clients/servers are operational and reporting properly
What we offer
  • Competitive salaries
  • Continuing education assistance
  • Professional development
  • Multiple healthcare benefits package options
  • 401K with employer matching
  • Competitive time off policy along with a federally recognized holiday schedule
Full-time

Lead Engineer, ML Network Stack - Annapurna Labs

We are seeking an experienced engineer and technical leader to join our team tha...
Location: United States, Seattle; Cupertino
Salary: 168100.00 - 261500.00 USD / Year
Amazon
Expiration Date: Until further notice
Requirements
  • 5+ years of non-internship professional software development experience
  • 5+ years of leading design or architecture (design patterns, reliability and scaling) of new and existing systems experience
  • 5+ years of full software development life cycle, including coding standards, code reviews, source control management, build processes, testing, and operations experience
  • 3+ years as a mentor, tech lead or leading engineering teams
  • 3+ years experience in SW/HW Co-Design
Job Responsibility
  • Be the lead engineer on a team that builds and maintains the infrastructure that monitors and reports on functionality and performance of massive testing workloads run at scale
  • Use internal Amazon CI/CD tools, Linux, and public AWS products to automate the delivery of our software to customers, saving developer time
  • Write Python code that effortlessly spools up large clusters and runs benchmarks and applications for ML and HPC workloads
  • Use AWS Managed Grafana and Athena to digest the massive amount of performance data generated by these workloads and create dashboards for developers and stakeholders
  • Invent automatic mechanisms to alert developers to functional and performance regressions so they never reach customers
  • Manage the complexity of infrastructure that covers many instance types, software stacks, Linux operating systems and cutting-edge releases, and make it easy to evolve
What we offer
  • health insurance (medical, dental, vision, prescription, Basic Life & AD&D insurance and option for Supplemental life plans, EAP, Mental Health Support, Medical Advice Line, Flexible Spending Accounts, Adoption and Surrogacy Reimbursement coverage)
  • 401(k) matching
  • paid time off
  • parental leave
  • sign-on payments
  • restricted stock units (RSUs)
Full-time