HPC AI & Kubernetes Platform Engineer

Australia, Canberra · Employment contract · 118102.00 - 127808.00 AUD / Year · Job Posted May 15, 2026

Job Description

Build and run Kubernetes and HPC platforms at national scale. Deliver secure, reliable and automated compute environments. Grow your skills across on‑prem and cloud at CSIRO.

Job Responsibility

  • Design, deploy, and manage Run:ai and AI development tools and environments on GPU clusters
  • Design, deploy, and manage K8s across various environments (on-premises, cloud, hybrid)
  • Implement and maintain K8s best practices to ensure efficient and reliable cluster operations
  • Develop and maintain automation scripts and tools for provisioning, configuration, and management of Run:ai and K8s environments
  • Leverage Infrastructure as Code (IaC) tools such as Helm, Ansible or Terraform
  • Implement monitoring and logging solutions to ensure the health and performance of GPU clusters
  • Troubleshoot and resolve issues related to cluster operations, application deployments, and performance bottlenecks
  • Ensure that environments adhere to security best practices and compliance requirements
  • Implement and manage security controls such as role-based access control (RBAC), network policies, and image scanning
  • Work closely with DevOps, development teams, research users and other stakeholders to understand requirements, optimise workflows, and support scientific applications and workflows
  • Provide guidance and support for containerisation, K8s, and Run:ai-related issues

Requirements

  • Relevant Bachelor’s degree or equivalent relevant work experience in Information Technology, Computer Science, Mathematics, Physics or Engineering
  • Knowledge of containerisation technologies (Docker, containers) and microservices architecture
  • Knowledge of Run:ai and AI development tools and environments
  • Proficiency in scripting and automation using tools such as Bash, Python, or Go
  • Familiarity with Infrastructure as Code (IaC) tools like Helm, Ansible or Terraform
  • Experience in Linux system administration
  • Understanding of networking concepts, security practices, and CI/CD pipelines
  • Strong problem-solving, analytical and communication skills
  • Demonstrated ability to work with independence and self-motivation within a distributed team environment

Nice to have

  • Kubernetes certification (CKA or CKAD), NVIDIA certification, or equivalent
  • Experience with public cloud platforms (AWS, Azure, GCP) and associated services related to K8s and ML

What we offer

  • 15.4% superannuation
  • flexible work arrangements
  • range of leave entitlements
  • career development opportunities
  • comprehensive training and development portfolio

Similar Jobs for HPC AI & Kubernetes Platform Engineer

8 matching positions

Kubernetes Platform Engineer

Kubernetes Platform Engineer. This role has been designed as ‘Hybrid’ with an ex...
Location: United States, Bloomington
Salary: 111500.00 - 211500.00 USD / Year
Hewlett Packard Enterprise (https://www.hpe.com/)
Expiration Date: Until further notice
Requirements
  • Cloud Architectures
  • Cross Domain Knowledge
  • Design Thinking
  • Development Fundamentals
  • DevOps
  • Distributed Computing
  • Microservices Fluency
  • Full Stack Development
  • Security-First Mindset
  • Solutions Design
Job Responsibility
  • Lead Kubernetes‑native, RDMA‑class networking for distributed AI inference platforms on HPC clusters
  • Own the end‑to‑end technical design that allows Kubernetes‑orchestrated inference workloads (NVIDIA NIMs, vLLM, TensorRT‑LLM) to transparently consume high‑speed fabrics (e.g., HPE Slingshot/CXI) using Operators, DRA, CDI, Multus/secondary CNI, and Kubernetes networking abstractions—without container rebuilds, privileged pods, or manual tuning
  • Make HPC fabric capabilities consumable from standard containers
  • Design the mechanisms to expose RDMA‑capable NIC resources and required runtime components without baking the fabric into images, including mounting/injecting host user‑space libraries (e.g., libcxi + libfabric) in a controlled, supportable way
  • Define and implement the reference design for Kubernetes‑native RDMA enablement across Dynamic Resource Allocation (DRA), Container Device Interface (CDI), Multus + secondary CNIs, and Operator‑driven lifecycle management
  • Own API and CRD design (ResourceClaims, DeviceClasses, custom CRDs) with long‑term compatibility guarantees
  • Make and defend architectural tradeoffs between Device plugins vs DRA, CDI vs runtime hooks vs admission webhooks, Shared vs exclusive NIC models, and Performance vs operability vs isolation
  • Define how distributed inference patterns (KV‑cache movement, prefill/decode separation) map onto Kubernetes primitives
  • Ensure out-of-the-box compatibility with NVIDIA NIMs and the NIM Operator, KServe ServingRuntime / InferenceService, and GPU Operator (CDI mode)
  • Publish deployment patterns and validated manifests for inference workloads using RDMA fast paths
What we offer
  • Health & Wellbeing
  • Personal & Professional Development
  • Unconditional Inclusion
Employment type: Full-time

Systems Design Engineer - AI Cluster Software

WHAT YOU DO AT AMD CHANGES EVERYTHING At AMD, our mission is to build great prod...
Location: United States, Austin
Salary: 163200.00 - 244800.00 USD / Year
AMD (amd.com)
Expiration Date: Until further notice
Requirements
  • Bachelor's or Master's degree in electrical or computer engineering
  • Evidence of end-to-end systems thinking, debugging, and tradeoff decisions
  • Hands-on familiarity with at least two schedulers and/or orchestration systems (e.g., Slurm, Kubernetes), MPI/OpenMP, distributed storage patterns, or performance analysis
  • Experience writing evaluation docs/RFCs with clear criteria, benchmarks, risks, and recommendations
  • Strong Linux fundamentals: Linux operating systems, networking, filesystems, containers, performance tooling (perf, flamegraphs, nvprof/rocprof, basic eBPF)
  • Ability to turn complex systems into accessible, structured documentation with diagrams and reproducible steps
  • ROCm, RCCL, Instinct GPUs, EPYC platforms, compiler/toolchain impacts, and performance tuning
  • DDP, collective comms, sharded/stateful optimizers
  • NCCL/RCCL behavior and transport considerations (PCIe, NVLink, IF)
  • Slurm configuration patterns, Kubernetes for HPC/AI (GPU operators, device plugins), Apptainer/Singularity
Job Responsibility
  • Apply your expertise to shape AI infrastructure by creating reference architectures, configuration guides, and deployment blueprints that help internal teams and customers make informed hardware and software decisions
  • Perform deep technical evaluations of AI stacks across compute, storage, networking, and observability layers, documenting how they work, where they fit, and the tradeoffs involved
  • Design and execute reproducible experiments and benchmarking harnesses to compare technologies such as schedulers, distributed training libraries, and observability stacks
  • Develop small reference implementations and tools to validate performance hypotheses and analyze system behavior
  • Build a library of technical artifacts, including presentations, design documents, and “how it works” guides, to support pre-sales engineers and enable others to skill up from an HPC perspective
  • Present findings through demos, documentation, and internal talks, and create templates and checklists to support repeatable evaluations and cluster designs
Employment type: Full-time

Member of Technical Staff, Site Reliability Engineer (HPC)

As Microsoft continues to push the boundaries of AI, we are on the lookout for p...
Location: United States, Mountain View
Salary: 139900.00 - 274800.00 USD / Year
Microsoft Corporation (https://www.microsoft.com/)
Expiration Date: Until further notice
Requirements
  • Master's Degree in Computer Science, Information Technology, or related field AND 2+ years technical experience in Site Reliability Engineering, DevOps, or Infrastructure Engineering
  • OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 4+ years technical experience in Site Reliability Engineering, DevOps, or Infrastructure Engineering
  • OR equivalent experience
  • Strong proficiency in Kubernetes, Docker, and container orchestration
  • Knowledge of CI/CD pipelines for Inference and ML model deployment
  • Hands-on experience with public cloud platforms like Azure/AWS/GCP and infrastructure-as-code
  • Expertise in monitoring & observability tools (Grafana, Datadog, OpenTelemetry, etc.)
  • Strong programming/scripting skills in Python, Go, or Bash
  • Solid knowledge of distributed systems, networking, and storage
  • Experience running large-scale GPU clusters for ML/AI workloads (preferred)
Job Responsibility
  • Reliability & Availability: Ensure uptime, resiliency, and fault tolerance of HPC clusters powering MAI model training and inference
  • Observability: Design and maintain monitoring, alerting, and logging systems to provide real-time visibility into all aspects of HPC systems including GPU, clusters, storage and networking
  • Automation & Tooling: Build automation for deployments, incident response, scaling, and failover in CPU+GPU environments
  • Incident Management: Lead on-call rotations, troubleshoot production issues, conduct blameless postmortems, and drive continuous improvements
  • Security & Compliance: Ensure data privacy, compliance, and secure operations across model training and serving environments
  • Collaboration: Partner with ML engineers and platform teams to improve developer experience and accelerate research-to-production workflows
What we offer
  • Competitive compensation, equity options, and comprehensive benefits
Employment type: Full-time

Global Lead Architect – Hybrid Cloud, AI & HPE Platform Delivery

A highly senior, customer-facing architecture and delivery leadership role respo...
Location: Bulgaria, Sofia
Salary: Not provided
Hewlett Packard Enterprise (https://www.hpe.com/)
Expiration Date: Until further notice
Requirements
  • 12–15+ years in enterprise IT, with strong focus on solution architecture and delivery leadership, and on hybrid cloud, AI/HPC, and infrastructure platforms
  • Proven background in professional services / delivery-led roles, not purely presales
  • Demonstrated experience leading large-scale, multi-technology programs end-to-end
  • Strong consulting mindset with excellent stakeholder and executive communication skills
  • Deep expertise in enterprise private cloud platforms and hybrid architectures
  • Strong understanding of workload migration, interoperability, and governance
  • AI platform design (GPU-based infrastructure, NVIDIA ecosystem)
  • HPC cluster architecture, workload schedulers (Slurm, PBS Pro), and performance tuning
  • Kubernetes ecosystems (OpenShift, Rancher, CNCF stack)
Job Responsibility
  • Serve as the technical validation authority during early sales cycles
  • Lead technical governance from opportunity qualification through delivery execution
  • Own solution integrity across the lifecycle—design, validation, implementation, and optimization
  • Architect and oversee end-to-end hybrid and private cloud solutions
  • Drive adoption of cloud-native, automated, and scalable architectures
  • Lead delivery teams across complex engagements
  • Act as the lead design authority ensuring delivery success for AI infrastructure and HPC deployments, containerized platforms and cloud-native environments, and enterprise hybrid cloud transformations
  • Provide hands-on guidance during critical phases (design reviews, PoCs, escalations)
  • Lead technical due diligence during RFP/RFI responses, solution workshops and discovery sessions, and proof-of-concept engagements
  • Translate business requirements into deliverable, production-ready architectures
What we offer
  • Health & Wellbeing
  • Personal & Professional Development
  • Unconditional Inclusion
Employment type: Full-time

AI Infra Engineer

We are looking for an AI Infra engineer to join our growing team. We work with K...
Location: United States, San Francisco; Palo Alto
Salary: 210000.00 - 385000.00 USD / Year
Perplexity (perplexity.ai)
Expiration Date: Until further notice
Requirements
  • Strong expertise in Kubernetes administration, including custom resource definitions, operators, and cluster management
  • Hands-on experience with Slurm workload management, including job scheduling, resource allocation, and cluster optimization
  • Experience with deploying and managing distributed training systems at scale
  • Deep understanding of container orchestration and distributed systems architecture
  • High-level familiarity with LLM architecture and training processes (Multi-Head Attention, Multi/Grouped-Query Attention, distributed training strategies)
  • Experience managing GPU clusters and optimizing compute resource utilization
  • Expert-level Kubernetes administration and YAML configuration management
  • Proficiency with Slurm job scheduling, resource management, and cluster configuration
  • Python and C++ programming with focus on systems and infrastructure automation
  • Hands-on experience with ML frameworks such as PyTorch in distributed training contexts
Job Responsibility
  • Design, deploy, and maintain scalable Kubernetes clusters for AI model inference and training workloads
  • Manage and optimize Slurm-based HPC environments for distributed training of large language models
  • Develop robust APIs and orchestration systems for both training pipelines and inference services
  • Implement resource scheduling and job management systems across heterogeneous compute environments
  • Benchmark system performance, diagnose bottlenecks, and implement improvements across both training and inference infrastructure
  • Build monitoring, alerting, and observability solutions tailored to ML workloads running on Kubernetes and Slurm
  • Respond swiftly to system outages and collaborate across teams to maintain high uptime for critical training runs and inference services
  • Optimize cluster utilization and implement autoscaling strategies for dynamic workload demands
What we offer
  • Equity
  • Health
  • Dental
  • Vision
  • Retirement
  • Fitness
  • Commuter and dependent care accounts
Employment type: Full-time

Advisory and Professional Services Sovereign AI Enterprise Architect

USA Advisory and Professional Services Sovereign AI Enterprise Architect (US). T...
Location: United States, Spring
Salary: 161000.00 - 378000.00 USD / Year
Hewlett Packard Enterprise (https://www.hpe.com/)
Expiration Date: Until further notice
Requirements
  • Strong demonstrated background deploying and configuring enterprise Kubernetes (Rancher RKE2, Red Hat OpenShift, CNCF Kubernetes)
  • 5+ years hands-on Linux experience
  • Familiarity with SLURM on Kubernetes frameworks (Slinky, SUNK, etc.)
  • Strong demonstrated background working with Ansible automation frameworks
  • Familiarity with leveraging REST APIs for product integrations
  • Familiarity with HPC architectures, network topologies, and high performance storage platforms
  • Knowledge of NVIDIA's AI Enterprise tools, specifically BCM and DCGM for deploying and managing
  • Experience with Sovereign Cloud, MLOps, or compliance-led AI deployments
  • Familiarity with EU AI Act, GDPR, HIPAA, FedRAMP, CJIS, or related frameworks
  • Background in professional services, consulting, or large-scale transformation programs
Job Responsibility
  • Lead end-to-end architecture design for sovereign AI deployments, ensuring compliance with data residency, privacy, and regulatory requirements
  • Develop scalable, secure, and robust AI/ML reference architectures, including model lifecycle management, data pipelines, inference infrastructure, and governance frameworks
  • Evaluate and select technologies, cloud/on-prem infrastructure, and tooling aligned to sovereign and risk-management constraints
  • Provide architectural oversight across implementation teams, ensuring alignment with enterprise standards and risk mitigation strategies
  • Identify, assess, and manage risks across the AI solution lifecycle
  • Develop and implement governance frameworks that embed responsible AI principles, auditability, and continuous risk monitoring
  • Partner with security, compliance, and legal teams to ensure adherence to sovereignty mandates and enterprise risk frameworks
  • Embed security-by-design and risk-by-design in architecture, deployment, and operations
  • Create architectural documentation, diagrams, and governance frameworks for AI model development, deployment, and monitoring
  • Drive security-by-design principles into all AI components, including encryption, access control, auditing, and safe model operation
What we offer
  • Health & Wellbeing
  • Personal & Professional Development
  • Unconditional Inclusion
Employment type: Full-time

OpenShift Architect

We are currently seeking an OpenShift Architect to join our team in Bangalore, Ka...
Location: India, Bangalore
Salary: Not provided
NTT DATA (nttdata.com)
Expiration Date: Until further notice
Requirements
  • Must be a graduate (B.Tech/B.E./MCA or equivalent)
  • Post-graduate degree in Computer Science or related field is highly preferred
  • 10 to 15 years of experience in Infrastructure Engineering, Unix/Linux Systems Architecture, and Cloud-Native platforms
  • 5+ years of experience as a primary Architect leading enterprise-scale Red Hat OpenShift (OCP 4.x) environments
  • Red Hat Certified Architect (RHCA) – Level II or higher (Cloud/Datacenter)
  • Red Hat Certified Specialist in MultiCluster Management (EX432) or Automation (EX380)
  • Solutions Architect Professional (AWS SAP-C02, Azure AZ-305, or GCP Professional Architect)
  • Willingness to work in rotational shifts/on-call as a technical lead in a 24x7 support window
Job Responsibility
  • Serve as the global SME for RHEL/RHCOS, architecting kernel-level optimizations, advanced system tuning, and high-performance computing (HPC) configurations
  • Define the strategy for transitioning legacy UNIX (AIX/Solaris/HP-UX) and monolithic Linux workloads into containerized or virtualized environments on OpenShift
  • Lead architectural decisions for Bare Metal, VMware, and KVM integration
  • Design global, highly available OpenShift architectures across hybrid and multi-cloud environments (IPI/UPI)
  • Direct architectural oversight for ROSA (AWS) and ARO (Azure)
  • Drive the roadmap for OpenShift Virtualization (KubeVirt) to unify VM and container management
  • Architect software-defined networking (SDN/OVN) and enterprise storage strategies using OpenShift Data Foundation (ODF)
  • Architect global automation frameworks using Ansible Automation Platform and Terraform
  • Establish organizational standards for OpenShift GitOps (ArgoCD)
  • Expert-level implementation of Red Hat Advanced Cluster Management (RHACM) for global governance
Employment type: Full-time

APJ Advisory and Professional Services Sovereign AI Enterprise Architect

APJ Advisory and Professional Services Sovereign AI Enterprise Architect. This r...
Location: India, Mumbai
Salary: Not provided
Hewlett Packard Enterprise (https://www.hpe.com/)
Expiration Date: Until further notice
Requirements
  • Strong demonstrated background deploying and configuring enterprise Kubernetes (Rancher RKE2, Red Hat OpenShift, CNCF Kubernetes) - microk8s and kubespray should not be considered demonstrated background
  • Familiarity with SLURM on Kubernetes frameworks (Slinky, SUNK, etc.)
  • Strong demonstrated background working with Ansible automation frameworks
  • Familiarity with leveraging REST APIs for product integrations
  • Familiarity with HPC architectures, network topologies, and high performance storage platforms
  • Knowledge of NVIDIA's AI Enterprise tools, specifically BCM and DCGM for deploying and managing
  • 10+ years’ experience in enterprise sales or sales leadership within IT services, cloud, AI/ML, or data/analytics domains
  • 5+ years hands-on Linux experience
  • Strong grounding in data architecture, ML lifecycle, cloud platforms (AWS/Azure/GCP/Greenlake), MLOps tooling, security/privacy frameworks, APIs/integration patterns, with the ability to translate technical concepts into business value
  • Proven experience selling complex technology solutions to enterprise and/or public sector organizations
Job Responsibility
  • Lead end-to-end architecture design for sovereign AI deployments, ensuring compliance with data residency, privacy, and regulatory requirements
  • Develop scalable, secure, and robust AI/ML reference architectures, including model lifecycle management, data pipelines, inference infrastructure, and governance frameworks
  • Evaluate and select technologies, cloud/on-prem infrastructure, and tooling aligned to sovereign and risk-management constraints
  • Provide architectural oversight across implementation teams, ensuring alignment with enterprise standards and risk mitigation strategies
  • Identify, assess, and manage risks across the AI solution lifecycle
  • Develop and implement governance frameworks that embed responsible AI principles, auditability, and continuous risk monitoring
  • Partner with security, compliance, and legal teams to ensure adherence to sovereignty mandates and enterprise risk frameworks
  • Embed security-by-design and risk-by-design in architecture, deployment, and operations
  • Create architectural documentation, diagrams, and governance frameworks for AI model development, deployment, and monitoring
  • Drive security-by-design principles into all AI components, including encryption, access control, auditing, and safe model operation
What we offer
  • Health & Wellbeing
  • Personal & Professional Development
  • Unconditional Inclusion