HPC AI & Kubernetes Platform Engineer

CSIRO

Location:
Australia, Canberra

Contract Type:
Employment contract

Salary:
118,102.00 - 127,808.00 AUD / Year

Job Description:

Build and run Kubernetes and HPC platforms at national scale. Deliver secure, reliable and automated compute environments. Grow your skills across on‑prem and cloud at CSIRO.

Job Responsibility:

  • Design, deploy, and manage Run:ai and AI development tools and environments on GPU clusters
  • Design, deploy, and manage K8s across various environments (on-premises, cloud, hybrid)
  • Implement and maintain K8s best practices to ensure efficient and reliable cluster operations
  • Develop and maintain automation scripts and tools for provisioning, configuration, and management of Run:ai and K8s environments
  • Leverage Infrastructure as Code (IaC) tools such as Helm, Ansible or Terraform
  • Implement monitoring and logging solutions to ensure the health and performance of GPU clusters
  • Troubleshoot and resolve issues related to cluster operations, application deployments, and performance bottlenecks
  • Ensure that environments adhere to security best practices and compliance requirements
  • Implement and manage security controls such as role-based access control (RBAC), network policies, and image scanning
  • Work closely with DevOps, development teams, research users and other stakeholders to understand requirements, optimise workflows, and support scientific applications
  • Provide guidance and support for containerisation, K8s, and Run:ai-related issues
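
The RBAC and automation bullets above can be sketched in code. The snippet below builds a minimal namespaced Role and RoleBinding as plain dicts, ready to serialize for `kubectl apply`; the namespace and subject names are hypothetical examples, not anything from this posting.

```python
# Minimal sketch: build a Kubernetes RBAC Role and RoleBinding as plain
# dicts and serialize them to JSON. Namespace, role, and user names are
# hypothetical placeholders.
import json

def make_role(namespace: str, name: str, verbs: list, resources: list) -> dict:
    """Return a namespaced Role manifest granting `verbs` on `resources`."""
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "Role",
        "metadata": {"name": name, "namespace": namespace},
        "rules": [{"apiGroups": [""], "resources": resources, "verbs": verbs}],
    }

def bind_role(namespace: str, role_name: str, user: str) -> dict:
    """Return a RoleBinding tying `role_name` to a single user."""
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "RoleBinding",
        "metadata": {"name": f"{role_name}-binding", "namespace": namespace},
        "subjects": [{"kind": "User", "name": user,
                      "apiGroup": "rbac.authorization.k8s.io"}],
        "roleRef": {"kind": "Role", "name": role_name,
                    "apiGroup": "rbac.authorization.k8s.io"},
    }

role = make_role("ml-research", "pod-reader", ["get", "list", "watch"], ["pods"])
binding = bind_role("ml-research", "pod-reader", "researcher@example.org")
print(json.dumps(role, indent=2))
```

Rendering manifests from code like this is one common way the "automation scripts and tools" bullet is realised in practice.
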

Requirements:

  • Relevant Bachelor’s degree or equivalent relevant work experience in Information Technology, Computer Science, Mathematics, Physics or Engineering
  • Knowledge of containerisation technologies (e.g. Docker) and microservices architecture
  • Knowledge of Run:ai and AI development tools and environments
  • Proficiency in scripting and automation using tools such as Bash, Python, or Go
  • Familiarity with Infrastructure as Code (IaC) tools like Helm, Ansible or Terraform
  • Experience in Linux system administration
  • Understanding of networking concepts, security practices, and CI/CD pipelines
  • Strong problem-solving, analytical and communication skills
  • Demonstrated ability to work with independence and self-motivation within a distributed team environment

Nice to have:

  • Kubernetes certification (CKA or CKAD), NVIDIA certification, or equivalent
  • Experience with public cloud platforms (AWS, Azure, GCP) and associated services related to K8s and ML

What we offer:

  • 15.4% superannuation
  • flexible work arrangements
  • range of leave entitlements
  • career development opportunities
  • comprehensive training and development portfolio

Additional Information:

Job Posted:
May 15, 2026

Work Type:
On-site work

Similar Jobs for HPC AI & Kubernetes Platform Engineer

AI/ML Enterprise Solution Architect

As an AI/ML Enterprise Solution Architect – HPE and NVIDIA Alliance (APAC) for t...
Location: Singapore, Central Singapore
Salary: Not provided
Hewlett Packard Enterprise
Expiration Date: Until further notice

Requirements:
  • 8+ years in technical architecture, presales, or solution architecture roles within AI, HPC, or data infrastructure
  • Strong working knowledge of NVIDIA technologies (DGX, HGX, NVLink, CUDA, NGC, NIM, NeMo, the Omniverse platform, etc.)
  • Experience architecting AI or HPC solutions in enterprise or cloud/hybrid environments
  • Comfortable engaging both technical and executive stakeholders, from CIOs to principal engineers
  • Familiarity with AI frameworks (TensorFlow, PyTorch), container orchestration (Kubernetes), and MLOps a strong plus
  • Ability to influence regional teams across multiple cultures and time zones
  • Exceptional communication and storytelling skills to translate technical value into business outcomes
  • Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field
  • MBA a plus
Job Responsibility:
  • Act as the technical lead for NVIDIA solutions (HGX, NVIDIA AI Enterprise, CUDA stack) within the HPE alliance GTM in APAC
  • Collaborate with regional sales teams to position HPE-NVIDIA joint solutions in key customer opportunities
  • Support lighthouse accounts and pilot programs by providing architecture guidance, proof-of-concept oversight, and solution differentiation
  • Partner with NVIDIA and HPE technical teams to localize and scale global offerings such as AI Factory, RAG blueprints, and PCAI
  • Deliver technical enablement across HPE and NVIDIA field sellers, presales engineers, and partner ecosystems
  • Serve as a trusted advisor to customers on emerging AI infrastructure needs, workload optimization, and deployment patterns
  • Drive technical alignment with NVIDIA’s solution architects and GPU-accelerated software stack teams
  • Provide feedback from the field to influence joint solution roadmaps, content development, and strategic investments
What we offer:
  • Health & Wellbeing
  • Personal & Professional Development
  • Unconditional Inclusion
  • Fulltime

Global Lead Architect – Hybrid Cloud, AI & HPE Platform Delivery

A highly senior, customer-facing architecture and delivery leadership role respo...
Location: Bulgaria, Sofia
Salary: Not provided
Hewlett Packard Enterprise
Expiration Date: Until further notice

Requirements:
  • 12–15+ years in enterprise IT, with strong focus on: Solution architecture and delivery leadership
  • Hybrid cloud, AI/HPC, and infrastructure platforms
  • Proven background in professional services / delivery-led roles, not purely presales
  • Demonstrated experience leading large-scale, multi-technology programs end-to-end
  • Strong consulting mindset with excellent stakeholder and executive communication skills
  • Deep expertise in enterprise private cloud platforms and hybrid architectures
  • Strong understanding of workload migration, interoperability, and governance
  • AI platform design (GPU-based infrastructure, NVIDIA ecosystem)
  • HPC cluster architecture, workload schedulers (Slurm, PBS Pro), and performance tuning
  • Kubernetes ecosystems (OpenShift, Rancher, CNCF stack)
Job Responsibility:
  • Serve as the technical validation authority during early sales cycles
  • Lead technical governance from opportunity qualification through delivery execution
  • Own solution integrity across the lifecycle—design, validation, implementation, and optimization
  • Architect and oversee end-to-end hybrid and private cloud solutions
  • Drive adoption of cloud-native, automated, and scalable architectures
  • Lead delivery teams across complex engagements
  • Act as the lead design authority ensuring delivery success for AI infrastructure and HPC deployments, Containerized platforms and cloud-native environments, Enterprise hybrid cloud transformations
  • Provide hands-on guidance during critical phases (design reviews, PoCs, escalations)
  • Lead technical due diligence during RFP/RFI responses, Solution workshops and discovery sessions, Proof-of-concept engagements
  • Translate business requirements into deliverable, production-ready architectures
What we offer:
  • Health & Wellbeing
  • Personal & Professional Development
  • Unconditional Inclusion
  • Fulltime

Senior AI Presales Consultant

We are seeking a high-impact, strategic AI Presales Consultant to join our elite...
Location: India, Mumbai
Salary: Not provided
Eviden
Expiration Date: Until further notice

Requirements:
  • 7+ years in a customer-facing technical role (e.g., Presales, Solutions Architecture, AI Specialist, or Technical Consulting), with a proven track record of designing large-scale AI, ML, or HPC solutions
  • Deep, hands-on understanding of LLM architectures. Must be able to architect, explain, and build PoCs for RAG pipelines, including vector databases (e.g., Milvus, Pinecone, Chroma), embedding models, and data ingestion strategies
  • Direct experience in sizing AI infrastructure. Must be able to perform "napkin math" and detailed calculations for GPU, CPU, memory, and network requirements
  • Must be able to fluently discuss performance metrics (tokens/second, latency, throughput, TFLOPS) and their relationship to hardware choice (e.g., NVIDIA H100 vs. A100, memory bandwidth, interconnects like NVLink/InfiniBand)
  • Expertise in the AI software stack. Strong understanding of MLOps principles (Kubeflow, MLflow), Kubernetes (K8s) for AI workloads, and model serving platforms (NVIDIA Triton, KServe, or similar)
  • Strong, current knowledge of the AI model landscape (e.g., Llama family, Mistral, GPT-family, foundation models). Ability to discuss fine-tuning techniques, quantization, and pruning
  • Exceptional communication, whiteboarding, and presentation skills. Ability to translate executive-level business needs into detailed technical architecture and build a compelling C-level value proposition
  • Bachelor's or Master's degree in Computer Science, AI, Data Science, or a related engineering field
Job Responsibility:
  • Strategic Client Advisory: Lead executive-level "Art of the Possible" workshops and technical discovery sessions to understand a client's business goals, data readiness, and AI maturity
  • Full-Stack Solution Architecture: Design holistic, end-to-end AI solutions that synergize our supercomputing hardware, AI software platform, and MLOps capabilities to meet specific client needs
  • Generative AI & LLM Expertise: Act as the subject matter expert on Generative AI. Architect and evangelize scalable data ingestion and preparation pipelines, specializing in Retrieval-Augmented Generation (RAG) frameworks
  • Infrastructure Sizing & Performance Modelling: Analyse customer workloads (data volume, model complexity, training frequency, inference throughput) to accurately size the required platform infrastructure, including Kubernetes clusters, data storage, and software licenses. This includes calculating compute, storage, and network requirements based on key performance metrics like model parameters, token performance (tokens/sec), desired latency, and concurrent user load
  • Model & Software Consultation: Advise clients on AI model selection, comparing the trade-offs of open-source vs. proprietary LLMs, fine-tuning vs. foundation models, and model quantization
  • Position and demonstrate our proprietary AI software platform, MLOps tools, and libraries, integrating them into the client's ecosystem
  • Inference Optimization: Design and architect robust, low-latency, and high-throughput inference solutions for complex AI models, including large-scale LLM serving
  • User Experience (UX) Advocacy: Collaborate with client teams to define the end-user experience, ensuring the solution delivers tangible business value and a seamless interface for data scientists, analysts, and application users
  • Sales Cycle Enablement: Own the technical narrative throughout the sales cycle. Build and deliver compelling presentations, custom demonstrations, and Proofs of Concept (PoCs). Lead the technical response to complex RFIs/RFPs
  • Fulltime
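
The "napkin math" sizing called for above can be illustrated with a short calculation: FP16 weight memory plus KV-cache memory for LLM serving. The model shape (32 layers, 32 KV heads, head dimension 128 for a 7B-parameter model) is an illustrative Llama-style assumption, not vendor guidance.

```python
# Back-of-envelope GPU memory sizing for LLM inference: FP16 weights
# plus KV cache. All figures are illustrative assumptions.

BYTES_FP16 = 2

def weights_gib(params_billion: float) -> float:
    """Memory for model weights at FP16, in GiB."""
    return params_billion * 1e9 * BYTES_FP16 / 2**30

def kv_cache_gib(layers: int, kv_heads: int, head_dim: int,
                 context_len: int, batch: int) -> float:
    """KV cache: 2 (K and V) * layers * kv_heads * head_dim bytes per token."""
    per_token = 2 * layers * kv_heads * head_dim * BYTES_FP16
    return per_token * context_len * batch / 2**30

# Hypothetical 7B model: 32 layers, 32 KV heads, head_dim 128,
# 4096-token context, batch of 8 concurrent sequences.
w = weights_gib(7)                       # ~13 GiB of weights
kv = kv_cache_gib(32, 32, 128, 4096, 8)  # ~16 GiB of KV cache
print(f"weights ~{w:.1f} GiB, KV cache ~{kv:.1f} GiB, total ~{w + kv:.1f} GiB")
```

Numbers like these are what determine whether a workload fits one GPU's HBM or needs tensor parallelism across several.
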

Member of Technical Staff, Site Reliability Engineer (HPC)

As Microsoft continues to push the boundaries of AI, we are on the lookout for p...
Location: United States, Mountain View
Salary: 139,900.00 - 274,800.00 USD / Year
Microsoft Corporation
Expiration Date: Until further notice

Requirements:
  • Master's Degree in Computer Science, Information Technology, or related field AND 2+ years technical experience in Site Reliability Engineering, DevOps, or Infrastructure Engineering
  • OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 4+ years technical experience in Site Reliability Engineering, DevOps, or Infrastructure Engineering
  • OR equivalent experience
  • Strong proficiency in Kubernetes, Docker, and container orchestration
  • Knowledge of CI/CD pipelines for Inference and ML model deployment
  • Hands-on experience with public cloud platforms like Azure/AWS/GCP and infrastructure-as-code
  • Expertise in monitoring & observability tools (Grafana, Datadog, OpenTelemetry, etc.)
  • Strong programming/scripting skills in Python, Go, or Bash
  • Solid knowledge of distributed systems, networking, and storage
  • Experience running large-scale GPU clusters for ML/AI workloads (preferred)
Job Responsibility:
  • Reliability & Availability: Ensure uptime, resiliency, and fault tolerance of HPC clusters powering MAI model training and inference
  • Observability: Design and maintain monitoring, alerting, and logging systems to provide real-time visibility into all aspects of HPC systems including GPU, clusters, storage and networking
  • Automation & Tooling: Build automation for deployments, incident response, scaling, and failover in CPU+GPU environments
  • Incident Management: Lead on-call rotations, troubleshoot production issues, conduct blameless postmortems, and drive continuous improvements
  • Security & Compliance: Ensure data privacy, compliance, and secure operations across model training and serving environments
  • Collaboration: Partner with ML engineers and platform teams to improve developer experience and accelerate research-to-production workflows
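
GPU observability of the kind described above often starts from something as simple as parsing `nvidia-smi` query output. The sketch below flags idle GPUs from CSV text in the shape produced by `nvidia-smi --query-gpu=index,utilization.gpu,memory.used --format=csv,noheader,nounits`; the sample data is hardcoded and invented, since no real cluster is assumed.

```python
# Minimal GPU health-check sketch: parse nvidia-smi-style CSV output
# (index, utilization %, memory used MiB) and flag idle GPUs.
# SAMPLE is invented, illustrative data.
import csv, io

SAMPLE = """0, 97, 74218
1, 95, 73990
2, 0, 312
3, 99, 74102
"""

def idle_gpus(report: str, util_threshold: int = 5) -> list:
    """Return indices of GPUs whose utilization is below the threshold."""
    idle = []
    for row in csv.reader(io.StringIO(report), skipinitialspace=True):
        if not row:
            continue
        index, util = int(row[0]), int(row[1])
        if util < util_threshold:
            idle.append(index)
    return idle

print(idle_gpus(SAMPLE))  # → [2]
```

In production this check would feed an alerting pipeline (Grafana, Datadog, etc.) rather than a print statement.
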
What we offer:
  • Competitive compensation, equity options, and comprehensive benefits
  • Fulltime

Kubernetes Platform Engineer

Kubernetes Platform Engineer. This role has been designed as ‘Hybrid’ with an ex...
Location: United States, Bloomington
Salary: 111,500.00 - 211,500.00 USD / Year
Hewlett Packard Enterprise
Expiration Date: Until further notice

Requirements:
  • Cloud Architectures
  • Cross Domain Knowledge
  • Design Thinking
  • Development Fundamentals
  • DevOps
  • Distributed Computing
  • Microservices Fluency
  • Full Stack Development
  • Security-First Mindset
  • Solutions Design
Job Responsibility:
  • Lead Kubernetes‑native, RDMA‑class networking for distributed AI inference platforms on HPC clusters
  • Own the end‑to‑end technical design that allows Kubernetes‑orchestrated inference workloads (NVIDIA NIMs, vLLM, TensorRT‑LLM) to transparently consume high‑speed fabrics (e.g., HPE Slingshot/CXI) using Operators, DRA, CDI, Multus/secondary CNI, and Kubernetes networking abstractions—without container rebuilds, privileged pods, or manual tuning
  • Make HPC fabric capabilities consumable from standard containers
  • Design the mechanisms to expose RDMA‑capable NIC resources and required runtime components without baking the fabric into images, including mounting/injecting host user‑space libraries (e.g., libcxi + libfabric) in a controlled, supportable way
  • Define the reference design and implement for Kubernetes‑native RDMA enablement across Dynamic Resource Allocation (DRA), Container Device Interface (CDI), Multus + secondary CNIs, and Operator‑driven lifecycle management
  • Own API and CRD design (ResourceClaims, DeviceClasses, custom CRDs) with long‑term compatibility guarantees
  • Make and defend architectural tradeoffs between Device plugins vs DRA, CDI vs runtime hooks vs admission webhooks, Shared vs exclusive NIC models, and Performance vs operability vs isolation
  • Define how distributed inference patterns (KV‑cache movement, prefill/decode separation) map onto Kubernetes primitives
  • Ensure out-of-the-box compatibility with NVIDIA NIMs and the NIM Operator, KServe ServingRuntime / InferenceService, and GPU Operator (CDI mode)
  • Publish deployment patterns and validated manifests for inference workloads using RDMA fast paths
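
The CDI mechanism mentioned in these bullets can be sketched as data. The snippet below builds a Container Device Interface spec, as a Python dict, that exposes an RDMA-capable NIC device node and bind-mounts host user-space libraries (libfabric here) into containers without rebuilding images. The vendor kind, device paths, library path, and pinned CDI version are all illustrative assumptions.

```python
# Sketch: a CDI (Container Device Interface) spec exposing an RDMA NIC
# and injecting host user-space libraries into containers. Vendor name,
# device paths, and the CDI version are illustrative assumptions.
import json

def cdi_spec(vendor_kind: str, device: str, dev_path: str,
             host_libs: list) -> dict:
    """Return a CDI spec dict for one device plus read-only library mounts."""
    return {
        "cdiVersion": "0.6.0",   # assumed; pin to what your runtime supports
        "kind": vendor_kind,     # e.g. "example.com/hsn" (hypothetical)
        "devices": [{
            "name": device,
            "containerEdits": {
                "deviceNodes": [{"path": dev_path}],
                "mounts": [
                    {"hostPath": lib, "containerPath": lib,
                     "options": ["ro", "bind"]}
                    for lib in host_libs
                ],
            },
        }],
    }

spec = cdi_spec("example.com/hsn", "hsn0", "/dev/hsn0",
                ["/usr/lib64/libfabric.so.1"])
print(json.dumps(spec, indent=2))
```

A generated spec like this is what lets standard, unprivileged pods request the fabric by device name instead of baking host libraries into every image.
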
What we offer:
  • Health & Wellbeing
  • Personal & Professional Development
  • Unconditional Inclusion
  • Fulltime

Senior Manager, AI Infrastructure and Operations

The Sr. Manager/Staff Engineer, AI Infrastructure & MLOps Engineering is a senio...
Location: Japan, Tokyo
Salary: Not provided
Pfizer
Expiration Date: Until further notice

Requirements:
  • 8+ years of hands-on software engineering experience in cloud infrastructure, DevOps, and MLOps
  • Deep expertise in Python, Kubernetes, Terraform, Helm, and CI/CD pipeline development
  • Proven experience architecting and operating containerized solutions on AWS, GCP, and Azure
  • Strong knowledge of Infrastructure-as-Code, distributed systems, and production system reliability
  • Bachelor’s or Master’s degree in Computer Science, Engineering, or related field
Job Responsibility:
  • Design, implement, and own large-scale cloud-based HPC and MLOps platforms supporting AI model training, genomic sequencing, and precision medicine
  • Architect multi-environment clusters (AWS, GCP, Azure), enabling GPU/FPGA workloads and advanced observability
  • Lead the development of developer and cloud platforms, including internal engineering accelerators and reusable toolsets
  • Design, implement, and manage unified platform catalogs using Backstage, enhancing developer experience and application metadata management
  • Develop custom plugins and APIs for Backstage to support internal engineering workflows and documentation
  • Build and maintain Python-based automation frameworks, CI/CD pipelines, and Infrastructure-as-Code (Terraform, Helm, Pulumi, AWS CDK)
  • Operationalize containerized solutions using Docker and Kubernetes, integrating MLflow, Kubeflow, and other orchestration platforms
  • Implement robust automation for provisioning, configuring, and managing cloud resources across multiple environments
  • Lead the implementation of Service Level Indicators (SLIs), Service Level Objectives (SLOs), and advanced observability (Prometheus, Grafana, PagerDuty)
  • Develop and maintain APIs and services for model management, feature stores, and inference pipelines
  • Fulltime
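
The SLI/SLO bullet above rests on simple arithmetic: an availability target implies an error budget per period. A minimal sketch, with example target and downtime values:

```python
# Sketch of SLO arithmetic: convert an availability target into a monthly
# error budget and check how much of it observed downtime has consumed.
# The 99.9% target and 10-minute downtime below are example values.

def error_budget_minutes(slo: float, period_minutes: float = 30 * 24 * 60) -> float:
    """Allowed downtime per period for an availability SLO (e.g. 0.999)."""
    return (1.0 - slo) * period_minutes

def budget_remaining(slo: float, downtime_minutes: float) -> float:
    """Fraction of the error budget still unspent (negative = budget blown)."""
    budget = error_budget_minutes(slo)
    return (budget - downtime_minutes) / budget

# A 99.9% SLO allows ~43.2 minutes of downtime in a 30-day month:
print(round(error_budget_minutes(0.999), 1))    # → 43.2
print(round(budget_remaining(0.999, 10.0), 2))  # → 0.77
```

Teams typically gate risky changes (upgrades, migrations) on how much of this budget remains.
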

AI Infra Engineer

We are looking for an AI Infra engineer to join our growing team. We work with K...
Location: United States, San Francisco; Palo Alto
Salary: 210,000.00 - 385,000.00 USD / Year
Perplexity
Expiration Date: Until further notice

Requirements:
  • Strong expertise in Kubernetes administration, including custom resource definitions, operators, and cluster management
  • Hands-on experience with Slurm workload management, including job scheduling, resource allocation, and cluster optimization
  • Experience with deploying and managing distributed training systems at scale
  • Deep understanding of container orchestration and distributed systems architecture
  • High level familiarity with LLM architecture and training processes (Multi-Head Attention, Multi/Grouped-Query, distributed training strategies)
  • Experience managing GPU clusters and optimizing compute resource utilization
  • Expert-level Kubernetes administration and YAML configuration management
  • Proficiency with Slurm job scheduling, resource management, and cluster configuration
  • Python and C++ programming with focus on systems and infrastructure automation
  • Hands-on experience with ML frameworks such as PyTorch in distributed training contexts
Job Responsibility:
  • Design, deploy, and maintain scalable Kubernetes clusters for AI model inference and training workloads
  • Manage and optimize Slurm-based HPC environments for distributed training of large language models
  • Develop robust APIs and orchestration systems for both training pipelines and inference services
  • Implement resource scheduling and job management systems across heterogeneous compute environments
  • Benchmark system performance, diagnose bottlenecks, and implement improvements across both training and inference infrastructure
  • Build monitoring, alerting, and observability solutions tailored to ML workloads running on Kubernetes and Slurm
  • Respond swiftly to system outages and collaborate across teams to maintain high uptime for critical training runs and inference services
  • Optimize cluster utilization and implement autoscaling strategies for dynamic workload demands
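
The Slurm job-management work above often reduces to programmatically rendering batch scripts. A minimal sketch for a multi-node distributed training job; the job name, resource shape, and launch command are hypothetical placeholders.

```python
# Sketch: render a Slurm batch script for a multi-node training job.
# Job name, node counts, and the training command are placeholders.

def render_sbatch(job: str, nodes: int, gpus_per_node: int,
                  time_limit: str, command: str) -> str:
    """Return an sbatch script launching `command` with one rank per GPU."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job}",
        f"#SBATCH --nodes={nodes}",
        f"#SBATCH --gpus-per-node={gpus_per_node}",
        f"#SBATCH --ntasks-per-node={gpus_per_node}",  # one rank per GPU
        f"#SBATCH --time={time_limit}",
        "",
        f"srun {command}",
    ]
    return "\n".join(lines)

script = render_sbatch("llm-pretrain", nodes=16, gpus_per_node=8,
                       time_limit="48:00:00",
                       command="python train.py --config cfg.yaml")
print(script)
```

Generating scripts this way keeps resource requests consistent across experiments and makes them easy to validate before submission.
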
What we offer:
  • Equity
  • Health
  • Dental
  • Vision
  • Retirement
  • Fitness
  • Commuter and dependent care accounts
  • Fulltime

PCAI and AI Factory Expert

We are seeking a Subject Matter Expert (SME) – Admin, Operate & Manage (HPE PCAI...
Location: India, Bengaluru
Salary: Not provided
Hewlett Packard Enterprise
Expiration Date: Until further notice

Requirements:
  • Bachelor’s / Master’s Degree in Computer Science, IT, or equivalent field
  • 8+ years of IT infrastructure administration experience, including 3+ years in AI/HPC or GPU-based environments
  • Proven experience in platform operations, monitoring, and lifecycle management of enterprise-grade AI and HPC environments
  • Hands-on experience in automation and orchestration across bare metal and containerized infrastructure
Job Responsibility:
  • Administer and maintain HPE PCAI and AI Factory environments, ensuring optimal uptime and performance
  • Manage compute nodes (HPE DL380a, DL325, Cray XD670), GPU clusters (NVIDIA L40S/H100/H200), and InfiniBand NDR networks
  • Administer virtualization and container platforms such as vSphere, RHEL/RHOS, Ezmeral Runtime Enterprise, Kubernetes, and Rancher Harvester
  • Perform configuration, patching, version upgrades, and firmware updates across hardware and software layers
  • Proactively monitor system health using DCGM, NetQ, Grafana, and Exivity dashboards
  • Handle alerts, performance anomalies, and incidents across GPU, network, and storage layers
  • Lead root cause analysis (RCA) and corrective action plans to prevent recurring issues
  • Maintain operational documentation, runbooks, and incident logs
  • Manage cluster lifecycle through Ansible, AWX, HPE Performance Cluster Manager (HPCM), and SLURM
  • Oversee automation for provisioning, scaling, and patch management of Compute and Containerized workloads
What we offer:
  • Health & Wellbeing
  • Personal & Professional Development
  • Unconditional Inclusion
  • Fulltime