Build and run Kubernetes and HPC platforms at national scale. Deliver secure, reliable and automated compute environments. Grow your skills across on‑prem and cloud at CSIRO.
Job Responsibilities:
Design, deploy, and manage Run:ai and AI development tools and environments on GPU clusters
Design, deploy, and manage K8s across various environments (on-premises, cloud, hybrid)
Implement and maintain K8s best practices to ensure efficient and reliable cluster operations
Develop and maintain automation scripts and tools for provisioning, configuration, and management of Run:ai and K8s environments
Leverage Infrastructure as Code (IaC) tools such as Helm, Ansible or Terraform
Implement monitoring and logging solutions to ensure the health and performance of GPU clusters
Troubleshoot and resolve issues related to cluster operations, application deployments, and performance bottlenecks
Ensure that environments adhere to security best practices and compliance requirements
Implement and manage security controls such as role-based access control (RBAC), network policies, and image scanning
Work closely with DevOps, development teams, research users and other stakeholders to understand requirements, optimise workflows, and support scientific applications
Provide guidance and support for containerisation, K8s, and Run:ai-related issues
Requirements:
Relevant Bachelor’s degree or equivalent relevant work experience in Information Technology, Computer Science, Mathematics, Physics or Engineering
Knowledge of containerisation technologies (Docker, containers) and microservices architecture
Knowledge of Run:ai and AI development tools and environments
Proficiency in scripting and automation using tools such as Bash, Python, or Go
Familiarity with Infrastructure as Code (IaC) tools like Helm, Ansible or Terraform
Experience in Linux system administration
Understanding of networking concepts, security practices, and CI/CD pipelines
Strong problem-solving, analytical and communication skills
Demonstrated ability to work with independence and self-motivation within a distributed team environment
Nice to have:
Kubernetes certification (CKA or CKAD), NVIDIA certification, or equivalent
Experience with public cloud platforms (AWS, Azure, GCP) and associated services related to K8s and ML