HPC AI & Kubernetes Platform Engineer

CSIRO

Location:
Australia, Canberra

Contract Type:
Employment contract

Salary:
118,102.00 - 127,808.00 AUD / Year

Job Description:

Build and run Kubernetes and HPC platforms at national scale. Deliver secure, reliable and automated compute environments. Grow your skills across on‑prem and cloud at CSIRO.

Job Responsibility:

  • Design, deploy, and manage Run:ai and AI development tools and environments on GPU clusters
  • Design, deploy, and manage K8s across various environments (on-premises, cloud, hybrid)
  • Implement and maintain K8s best practices to ensure efficient and reliable cluster operations
  • Develop and maintain automation scripts and tools for provisioning, configuration, and management of Run:ai and K8s environments
  • Leverage Infrastructure as Code (IaC) tools such as Helm, Ansible or Terraform
  • Implement monitoring and logging solutions to ensure the health and performance of GPU clusters
  • Troubleshoot and resolve issues related to cluster operations, application deployments, and performance bottlenecks
  • Ensure that environments adhere to security best practices and compliance requirements
  • Implement and manage security controls such as role-based access control (RBAC), network policies, and image scanning
  • Work closely with DevOps, development teams, research users and other stakeholders to understand requirements, optimise workflows, and support scientific applications
  • Provide guidance and support for containerisation, K8s, and Run:ai-related issues
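
The RBAC and automation bullets above can be sketched in code. The snippet below builds a minimal namespaced Role and RoleBinding as plain dicts, ready to serialize for `kubectl apply`; the namespace and subject names are hypothetical examples, not anything from this posting.

```python
# Minimal sketch: build a Kubernetes RBAC Role and RoleBinding as plain
# dicts and serialize them to JSON. Namespace, role, and user names are
# hypothetical placeholders.
import json

def make_role(namespace: str, name: str, verbs: list, resources: list) -> dict:
    """Return a namespaced Role manifest granting `verbs` on `resources`."""
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "Role",
        "metadata": {"name": name, "namespace": namespace},
        "rules": [{"apiGroups": [""], "resources": resources, "verbs": verbs}],
    }

def bind_role(namespace: str, role_name: str, user: str) -> dict:
    """Return a RoleBinding tying `role_name` to a single user."""
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "RoleBinding",
        "metadata": {"name": f"{role_name}-binding", "namespace": namespace},
        "subjects": [{"kind": "User", "name": user,
                      "apiGroup": "rbac.authorization.k8s.io"}],
        "roleRef": {"kind": "Role", "name": role_name,
                    "apiGroup": "rbac.authorization.k8s.io"},
    }

role = make_role("ml-research", "pod-reader", ["get", "list", "watch"], ["pods"])
binding = bind_role("ml-research", "pod-reader", "researcher@example.org")
print(json.dumps(role, indent=2))
```

Rendering manifests from code like this is one common way the "automation scripts and tools" bullet is realised in practice.
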

Requirements:

  • Relevant Bachelor’s degree or equivalent relevant work experience in Information Technology, Computer Science, Mathematics, Physics or Engineering
  • Knowledge of containerisation technologies (e.g. Docker) and microservices architecture
  • Knowledge of Run:ai and AI development tools and environments
  • Proficiency in scripting and automation using tools such as Bash, Python, or Go
  • Familiarity with Infrastructure as Code (IaC) tools like Helm, Ansible or Terraform
  • Experience in Linux system administration
  • Understanding of networking concepts, security practices, and CI/CD pipelines
  • Strong problem-solving, analytical and communication skills
  • Demonstrated ability to work with independence and self-motivation within a distributed team environment

Nice to have:

  • Kubernetes certification (CKA or CKAD), NVIDIA certification, or equivalent
  • Experience with public cloud platforms (AWS, Azure, GCP) and associated services related to K8s and ML

What we offer:

  • 15.4% superannuation
  • flexible work arrangements
  • range of leave entitlements
  • career development opportunities
  • comprehensive training and development portfolio

Additional Information:

Job Posted:
May 15, 2026

Work Type:
On-site work

Similar Jobs for HPC AI & Kubernetes Platform Engineer

AI/ML Enterprise Solution Architect

As an AI/ML Enterprise Solution Architect – HPE and NVIDIA Alliance (APAC) for t...
Location: Singapore, Central Singapore
Salary: Not provided
Hewlett Packard Enterprise
Expiration Date: Until further notice

Requirements:
  • 8+ years in technical architecture, presales, or solution architecture roles within AI, HPC, or data infrastructure
  • Strong working knowledge of NVIDIA technologies (DGX, HGX, NVLink, CUDA, NGC, NIM, NeMo, the Omniverse platform, etc.)
  • Experience architecting AI or HPC solutions in enterprise or cloud/hybrid environments
  • Comfortable engaging both technical and executive stakeholders, from CIOs to principal engineers
  • Familiarity with AI frameworks (TensorFlow, PyTorch), container orchestration (Kubernetes), and MLOps a strong plus
  • Ability to influence regional teams across multiple cultures and time zones
  • Exceptional communication and storytelling skills to translate technical value into business outcomes
  • Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field
  • MBA a plus
Job Responsibility:
  • Act as the technical lead for NVIDIA solutions (HGX, NVIDIA AI Enterprise, CUDA stack) within the HPE alliance GTM in APAC
  • Collaborate with regional sales teams to position HPE-NVIDIA joint solutions in key customer opportunities
  • Support lighthouse accounts and pilot programs by providing architecture guidance, proof-of-concept oversight, and solution differentiation
  • Partner with NVIDIA and HPE technical teams to localize and scale global offerings such as AI Factory, RAG blueprints, and PCAI
  • Deliver technical enablement across HPE and NVIDIA field sellers, presales engineers, and partner ecosystems
  • Serve as a trusted advisor to customers on emerging AI infrastructure needs, workload optimization, and deployment patterns
  • Drive technical alignment with NVIDIA’s solution architects and GPU-accelerated software stack teams
  • Provide feedback from the field to influence joint solution roadmaps, content development, and strategic investments
What we offer:
  • Health & Wellbeing
  • Personal & Professional Development
  • Unconditional Inclusion
  • Fulltime

Global Lead Architect – Hybrid Cloud, AI & HPE Platform Delivery

A highly senior, customer-facing architecture and delivery leadership role respo...
Location: Bulgaria, Sofia
Salary: Not provided
Hewlett Packard Enterprise
Expiration Date: Until further notice

Requirements:
  • 12–15+ years in enterprise IT, with strong focus on: Solution architecture and delivery leadership
  • Hybrid cloud, AI/HPC, and infrastructure platforms
  • Proven background in professional services / delivery-led roles, not purely presales
  • Demonstrated experience leading large-scale, multi-technology programs end-to-end
  • Strong consulting mindset with excellent stakeholder and executive communication skills
  • Deep expertise in enterprise private cloud platforms and hybrid architectures
  • Strong understanding of workload migration, interoperability, and governance
  • AI platform design (GPU-based infrastructure, NVIDIA ecosystem)
  • HPC cluster architecture, workload schedulers (Slurm, PBS Pro), and performance tuning
  • Kubernetes ecosystems (OpenShift, Rancher, CNCF stack)
Job Responsibility:
  • Serve as the technical validation authority during early sales cycles
  • Lead technical governance from opportunity qualification through delivery execution
  • Own solution integrity across the lifecycle—design, validation, implementation, and optimization
  • Architect and oversee end-to-end hybrid and private cloud solutions
  • Drive adoption of cloud-native, automated, and scalable architectures
  • Lead delivery teams across complex engagements
  • Act as the lead design authority ensuring delivery success for AI infrastructure and HPC deployments, Containerized platforms and cloud-native environments, Enterprise hybrid cloud transformations
  • Provide hands-on guidance during critical phases (design reviews, PoCs, escalations)
  • Lead technical due diligence during RFP/RFI responses, Solution workshops and discovery sessions, Proof-of-concept engagements
  • Translate business requirements into deliverable, production-ready architectures
What we offer:
  • Health & Wellbeing
  • Personal & Professional Development
  • Unconditional Inclusion
  • Fulltime

Senior AI Presales Consultant

We are seeking a high-impact, strategic AI Presales Consultant to join our elite...
Location: India, Mumbai
Salary: Not provided
Eviden
Expiration Date: Until further notice

Requirements:
  • 7+ years in a customer-facing technical role (e.g., Presales, Solutions Architecture, AI Specialist, or Technical Consulting), with a proven track record of designing large-scale AI, ML, or HPC solutions
  • Deep, hands-on understanding of LLM architectures. Must be able to architect, explain, and build PoCs for RAG pipelines, including vector databases (e.g., Milvus, Pinecone, Chroma), embedding models, and data ingestion strategies
  • Direct experience in sizing AI infrastructure. Must be able to perform "napkin math" and detailed calculations for GPU, CPU, memory, and network requirements
  • Must be able to fluently discuss performance metrics (tokens/second, latency, throughput, TFLOPS) and their relationship to hardware choice (e.g., NVIDIA H100 vs. A100, memory bandwidth, interconnects like NVLink/InfiniBand)
  • Expertise in the AI software stack. Strong understanding of MLOps principles (Kubeflow, MLflow), Kubernetes (K8s) for AI workloads, and model serving platforms (NVIDIA Triton, KServe, or similar)
  • Strong, current knowledge of the AI model landscape (e.g., Llama family, Mistral, GPT-family, foundation models). Ability to discuss fine-tuning techniques, quantization, and pruning
  • Exceptional communication, whiteboarding, and presentation skills. Ability to translate executive-level business needs into detailed technical architecture and build a compelling C-level value proposition
  • Bachelor's or Master's degree in Computer Science, AI, Data Science, or a related engineering field
Job Responsibility:
  • Strategic Client Advisory: Lead executive-level "Art of the Possible" workshops and technical discovery sessions to understand a client's business goals, data readiness, and AI maturity
  • Full-Stack Solution Architecture: Design holistic, end-to-end AI solutions that synergize our supercomputing hardware, AI software platform, and MLOps capabilities to meet specific client needs
  • Generative AI & LLM Expertise: Act as the subject matter expert on Generative AI. Architect and evangelize scalable data ingestion and preparation pipelines, specializing in Retrieval-Augmented Generation (RAG) frameworks
  • Infrastructure Sizing & Performance Modelling: Analyse customer workloads (data volume, model complexity, training frequency, inference throughput) to accurately size the required platform infrastructure, including Kubernetes clusters, data storage, and software licenses. This includes calculating compute, storage, and network requirements based on key performance metrics like model parameters, token performance (tokens/sec), desired latency, and concurrent user load
  • Model & Software Consultation: Advise clients on AI model selection, comparing the trade-offs of open-source vs. proprietary LLMs, fine-tuning vs. foundation models, and model quantization
  • Position and demonstrate our proprietary AI software platform, MLOps tools, and libraries, integrating them into the client's ecosystem
  • Inference Optimization: Design and architect robust, low-latency, and high-throughput inference solutions for complex AI models, including large-scale LLM serving
  • User Experience (UX) Advocacy: Collaborate with client teams to define the end-user experience, ensuring the solution delivers tangible business value and a seamless interface for data scientists, analysts, and application users
  • Sales Cycle Enablement: Own the technical narrative throughout the sales cycle. Build and deliver compelling presentations, custom demonstrations, and Proofs of Concept (PoCs). Lead the technical response to complex RFIs/RFPs
  • Fulltime
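
The "napkin math" sizing called for above can be illustrated with a short calculation: FP16 weight memory plus KV-cache memory for LLM serving. The model shape (32 layers, 32 KV heads, head dimension 128 for a 7B-parameter model) is an illustrative Llama-style assumption, not vendor guidance.

```python
# Back-of-envelope GPU memory sizing for LLM inference: FP16 weights
# plus KV cache. All figures are illustrative assumptions.

BYTES_FP16 = 2

def weights_gib(params_billion: float) -> float:
    """Memory for model weights at FP16, in GiB."""
    return params_billion * 1e9 * BYTES_FP16 / 2**30

def kv_cache_gib(layers: int, kv_heads: int, head_dim: int,
                 context_len: int, batch: int) -> float:
    """KV cache: 2 (K and V) * layers * kv_heads * head_dim bytes per token."""
    per_token = 2 * layers * kv_heads * head_dim * BYTES_FP16
    return per_token * context_len * batch / 2**30

# Hypothetical 7B model: 32 layers, 32 KV heads, head_dim 128,
# 4096-token context, batch of 8 concurrent sequences.
w = weights_gib(7)                       # ~13 GiB of weights
kv = kv_cache_gib(32, 32, 128, 4096, 8)  # ~16 GiB of KV cache
print(f"weights ~{w:.1f} GiB, KV cache ~{kv:.1f} GiB, total ~{w + kv:.1f} GiB")
```

Numbers like these are what determine whether a workload fits one GPU's HBM or needs tensor parallelism across several.
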

Member of Technical Staff, Site Reliability Engineer (HPC)

As Microsoft continues to push the boundaries of AI, we are on the lookout for p...
Location: United States, Mountain View
Salary: 139,900.00 - 274,800.00 USD / Year
Microsoft Corporation
Expiration Date: Until further notice

Requirements:
  • Master's Degree in Computer Science, Information Technology, or related field AND 2+ years technical experience in Site Reliability Engineering, DevOps, or Infrastructure Engineering
  • OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 4+ years technical experience in Site Reliability Engineering, DevOps, or Infrastructure Engineering
  • OR equivalent experience
  • Strong proficiency in Kubernetes, Docker, and container orchestration
  • Knowledge of CI/CD pipelines for Inference and ML model deployment
  • Hands-on experience with public cloud platforms like Azure/AWS/GCP and infrastructure-as-code
  • Expertise in monitoring & observability tools (Grafana, Datadog, OpenTelemetry, etc.)
  • Strong programming/scripting skills in Python, Go, or Bash
  • Solid knowledge of distributed systems, networking, and storage
  • Experience running large-scale GPU clusters for ML/AI workloads (preferred)
Job Responsibility:
  • Reliability & Availability: Ensure uptime, resiliency, and fault tolerance of HPC clusters powering MAI model training and inference
  • Observability: Design and maintain monitoring, alerting, and logging systems to provide real-time visibility into all aspects of HPC systems including GPU, clusters, storage and networking
  • Automation & Tooling: Build automation for deployments, incident response, scaling, and failover in CPU+GPU environments
  • Incident Management: Lead on-call rotations, troubleshoot production issues, conduct blameless postmortems, and drive continuous improvements
  • Security & Compliance: Ensure data privacy, compliance, and secure operations across model training and serving environments
  • Collaboration: Partner with ML engineers and platform teams to improve developer experience and accelerate research-to-production workflows
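
GPU observability of the kind described above often starts from something as simple as parsing `nvidia-smi` query output. The sketch below flags idle GPUs from CSV text in the shape produced by `nvidia-smi --query-gpu=index,utilization.gpu,memory.used --format=csv,noheader,nounits`; the sample data is hardcoded and invented, since no real cluster is assumed.

```python
# Minimal GPU health-check sketch: parse nvidia-smi-style CSV output
# (index, utilization %, memory used MiB) and flag idle GPUs.
# SAMPLE is invented, illustrative data.
import csv, io

SAMPLE = """0, 97, 74218
1, 95, 73990
2, 0, 312
3, 99, 74102
"""

def idle_gpus(report: str, util_threshold: int = 5) -> list:
    """Return indices of GPUs whose utilization is below the threshold."""
    idle = []
    for row in csv.reader(io.StringIO(report), skipinitialspace=True):
        if not row:
            continue
        index, util = int(row[0]), int(row[1])
        if util < util_threshold:
            idle.append(index)
    return idle

print(idle_gpus(SAMPLE))  # → [2]
```

In production this check would feed an alerting pipeline (Grafana, Datadog, etc.) rather than a print statement.
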
What we offer:
  • Competitive compensation, equity options, and comprehensive benefits
  • Fulltime

Kubernetes Platform Engineer

Kubernetes Platform Engineer. This role has been designed as ‘Hybrid’ with an ex...
Location: United States, Bloomington
Salary: 111,500.00 - 211,500.00 USD / Year
Hewlett Packard Enterprise
Expiration Date: Until further notice

Requirements:
  • Cloud Architectures
  • Cross Domain Knowledge
  • Design Thinking
  • Development Fundamentals
  • DevOps
  • Distributed Computing
  • Microservices Fluency
  • Full Stack Development
  • Security-First Mindset
  • Solutions Design
Job Responsibility:
  • Lead Kubernetes‑native, RDMA‑class networking for distributed AI inference platforms on HPC clusters
  • Own the end‑to‑end technical design that allows Kubernetes‑orchestrated inference workloads (NVIDIA NIMs, vLLM, TensorRT‑LLM) to transparently consume high‑speed fabrics (e.g., HPE Slingshot/CXI) using Operators, DRA, CDI, Multus/secondary CNI, and Kubernetes networking abstractions—without container rebuilds, privileged pods, or manual tuning
  • Make HPC fabric capabilities consumable from standard containers
  • Design the mechanisms to expose RDMA‑capable NIC resources and required runtime components without baking the fabric into images, including mounting/injecting host user‑space libraries (e.g., libcxi + libfabric) in a controlled, supportable way
  • Define the reference design and implement for Kubernetes‑native RDMA enablement across Dynamic Resource Allocation (DRA), Container Device Interface (CDI), Multus + secondary CNIs, and Operator‑driven lifecycle management
  • Own API and CRD design (ResourceClaims, DeviceClasses, custom CRDs) with long‑term compatibility guarantees
  • Make and defend architectural tradeoffs between Device plugins vs DRA, CDI vs runtime hooks vs admission webhooks, Shared vs exclusive NIC models, and Performance vs operability vs isolation
  • Define how distributed inference patterns (KV‑cache movement, prefill/decode separation) map onto Kubernetes primitives
  • Ensure out-of-the-box compatibility with NVIDIA NIMs and the NIM Operator, KServe ServingRuntime / InferenceService, and GPU Operator (CDI mode)
  • Publish deployment patterns and validated manifests for inference workloads using RDMA fast paths
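
The CDI mechanism mentioned in these bullets can be sketched as data. The snippet below builds a Container Device Interface spec, as a Python dict, that exposes an RDMA-capable NIC device node and bind-mounts host user-space libraries (libfabric here) into containers without rebuilding images. The vendor kind, device paths, library path, and pinned CDI version are all illustrative assumptions.

```python
# Sketch: a CDI (Container Device Interface) spec exposing an RDMA NIC
# and injecting host user-space libraries into containers. Vendor name,
# device paths, and the CDI version are illustrative assumptions.
import json

def cdi_spec(vendor_kind: str, device: str, dev_path: str,
             host_libs: list) -> dict:
    """Return a CDI spec dict for one device plus read-only library mounts."""
    return {
        "cdiVersion": "0.6.0",   # assumed; pin to what your runtime supports
        "kind": vendor_kind,     # e.g. "example.com/hsn" (hypothetical)
        "devices": [{
            "name": device,
            "containerEdits": {
                "deviceNodes": [{"path": dev_path}],
                "mounts": [
                    {"hostPath": lib, "containerPath": lib,
                     "options": ["ro", "bind"]}
                    for lib in host_libs
                ],
            },
        }],
    }

spec = cdi_spec("example.com/hsn", "hsn0", "/dev/hsn0",
                ["/usr/lib64/libfabric.so.1"])
print(json.dumps(spec, indent=2))
```

A generated spec like this is what lets standard, unprivileged pods request the fabric by device name instead of baking host libraries into every image.
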
What we offer:
  • Health & Wellbeing
  • Personal & Professional Development
  • Unconditional Inclusion
  • Fulltime

Senior Manager, AI Infrastructure and Operations

The Sr. Manager/Staff Engineer, AI Infrastructure & MLOps Engineering is a senio...
Location: Japan, Tokyo
Salary: Not provided
Pfizer
Expiration Date: Until further notice

Requirements:
  • 8+ years of hands-on software engineering experience in cloud infrastructure, DevOps, and MLOps
  • Deep expertise in Python, Kubernetes, Terraform, Helm, and CI/CD pipeline development
  • Proven experience architecting and operating containerized solutions on AWS, GCP, and Azure
  • Strong knowledge of Infrastructure-as-Code, distributed systems, and production system reliability
  • Bachelor’s or Master’s degree in Computer Science, Engineering, or related field
Job Responsibility:
  • Design, implement, and own large-scale cloud-based HPC and MLOps platforms supporting AI model training, genomic sequencing, and precision medicine
  • Architect multi-environment clusters (AWS, GCP, Azure), enabling GPU/FPGA workloads and advanced observability
  • Lead the development of developer and cloud platforms, including internal engineering accelerators and reusable toolsets
  • Design, implement, and manage unified platform catalogs using Backstage, enhancing developer experience and application metadata management
  • Develop custom plugins and APIs for Backstage to support internal engineering workflows and documentation
  • Build and maintain Python-based automation frameworks, CI/CD pipelines, and Infrastructure-as-Code (Terraform, Helm, Pulumi, AWS CDK)
  • Operationalize containerized solutions using Docker and Kubernetes, integrating MLflow, Kubeflow, and other orchestration platforms
  • Implement robust automation for provisioning, configuring, and managing cloud resources across multiple environments
  • Lead the implementation of Service Level Indicators (SLIs), Service Level Objectives (SLOs), and advanced observability (Prometheus, Grafana, PagerDuty)
  • Develop and maintain APIs and services for model management, feature stores, and inference pipelines
  • Fulltime
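
The SLI/SLO bullet above rests on simple arithmetic: an availability target implies an error budget per period. A minimal sketch, with example target and downtime values:

```python
# Sketch of SLO arithmetic: convert an availability target into a monthly
# error budget and check how much of it observed downtime has consumed.
# The 99.9% target and 10-minute downtime below are example values.

def error_budget_minutes(slo: float, period_minutes: float = 30 * 24 * 60) -> float:
    """Allowed downtime per period for an availability SLO (e.g. 0.999)."""
    return (1.0 - slo) * period_minutes

def budget_remaining(slo: float, downtime_minutes: float) -> float:
    """Fraction of the error budget still unspent (negative = budget blown)."""
    budget = error_budget_minutes(slo)
    return (budget - downtime_minutes) / budget

# A 99.9% SLO allows ~43.2 minutes of downtime in a 30-day month:
print(round(error_budget_minutes(0.999), 1))    # → 43.2
print(round(budget_remaining(0.999, 10.0), 2))  # → 0.77
```

Teams typically gate risky changes (upgrades, migrations) on how much of this budget remains.
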

AI Infra Engineer

We are looking for an AI Infra engineer to join our growing team. We work with K...
Location: United States, San Francisco; Palo Alto
Salary: 210,000.00 - 385,000.00 USD / Year
Perplexity
Expiration Date: Until further notice

Requirements:
  • Strong expertise in Kubernetes administration, including custom resource definitions, operators, and cluster management
  • Hands-on experience with Slurm workload management, including job scheduling, resource allocation, and cluster optimization
  • Experience with deploying and managing distributed training systems at scale
  • Deep understanding of container orchestration and distributed systems architecture
  • High level familiarity with LLM architecture and training processes (Multi-Head Attention, Multi/Grouped-Query, distributed training strategies)
  • Experience managing GPU clusters and optimizing compute resource utilization
  • Expert-level Kubernetes administration and YAML configuration management
  • Proficiency with Slurm job scheduling, resource management, and cluster configuration
  • Python and C++ programming with focus on systems and infrastructure automation
  • Hands-on experience with ML frameworks such as PyTorch in distributed training contexts
Job Responsibility:
  • Design, deploy, and maintain scalable Kubernetes clusters for AI model inference and training workloads
  • Manage and optimize Slurm-based HPC environments for distributed training of large language models
  • Develop robust APIs and orchestration systems for both training pipelines and inference services
  • Implement resource scheduling and job management systems across heterogeneous compute environments
  • Benchmark system performance, diagnose bottlenecks, and implement improvements across both training and inference infrastructure
  • Build monitoring, alerting, and observability solutions tailored to ML workloads running on Kubernetes and Slurm
  • Respond swiftly to system outages and collaborate across teams to maintain high uptime for critical training runs and inference services
  • Optimize cluster utilization and implement autoscaling strategies for dynamic workload demands
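
The Slurm job-management work above often reduces to programmatically rendering batch scripts. A minimal sketch for a multi-node distributed training job; the job name, resource shape, and launch command are hypothetical placeholders.

```python
# Sketch: render a Slurm batch script for a multi-node training job.
# Job name, node counts, and the training command are placeholders.

def render_sbatch(job: str, nodes: int, gpus_per_node: int,
                  time_limit: str, command: str) -> str:
    """Return an sbatch script launching `command` with one rank per GPU."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job}",
        f"#SBATCH --nodes={nodes}",
        f"#SBATCH --gpus-per-node={gpus_per_node}",
        f"#SBATCH --ntasks-per-node={gpus_per_node}",  # one rank per GPU
        f"#SBATCH --time={time_limit}",
        "",
        f"srun {command}",
    ]
    return "\n".join(lines)

script = render_sbatch("llm-pretrain", nodes=16, gpus_per_node=8,
                       time_limit="48:00:00",
                       command="python train.py --config cfg.yaml")
print(script)
```

Generating scripts this way keeps resource requests consistent across experiments and makes them easy to validate before submission.
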
What we offer:
  • Equity
  • Health
  • Dental
  • Vision
  • Retirement
  • Fitness
  • Commuter and dependent care accounts
  • Fulltime

PCAI and AI Factory Expert

We are seeking a Subject Matter Expert (SME) – Admin, Operate & Manage (HPE PCAI...
Location: India, Bengaluru
Salary: Not provided
Hewlett Packard Enterprise
Expiration Date: Until further notice

Requirements:
  • Bachelor’s / Master’s Degree in Computer Science, IT, or equivalent field
  • 8+ years of IT infrastructure administration experience, including 3+ years in AI/HPC or GPU-based environments
  • Proven experience in platform operations, monitoring, and lifecycle management of enterprise-grade AI and HPC environments
  • Hands-on experience in automation and orchestration across bare metal and containerized infrastructure
Job Responsibility:
  • Administer and maintain HPE PCAI and AI Factory environments, ensuring optimal uptime and performance
  • Manage compute nodes (HPE DL380a, DL325, Cray XD670), GPU clusters (NVIDIA L40S/H100/H200), and InfiniBand NDR networks
  • Administer virtualization and container platforms such as vSphere, RHEL/RHOS, Ezmeral Runtime Enterprise, Kubernetes, and Rancher Harvester
  • Perform configuration, patching, version upgrades, and firmware updates across hardware and software layers
  • Proactively monitor system health using DCGM, NetQ, Grafana, and Exivity dashboards
  • Handle alerts, performance anomalies, and incidents across GPU, network, and storage layers
  • Lead root cause analysis (RCA) and corrective action plans to prevent recurring issues
  • Maintain operational documentation, runbooks, and incident logs
  • Manage cluster lifecycle through Ansible, AWX, HPE Performance Cluster Manager (HPCM), and SLURM
  • Oversee automation for provisioning, scaling, and patch management of Compute and Containerized workloads
What we offer:
  • Health & Wellbeing
  • Personal & Professional Development
  • Unconditional Inclusion
  • Fulltime