Senior Compute Cluster Administrator

AMD

Location:
United States, Austin

Contract Type:
Not provided

Salary:

109760.00 - 164640.00 USD / Year

Job Description:

We are looking for a Senior Compute Cluster Administrator responsible for operating and supporting compute clusters used in upcoming datacenter buildouts leveraging AMD Instinct products. This role owns Day Two and beyond operations, encompassing both proactive maintenance and reactive support across complex, highly technical environments. This is an operational role supporting a demanding user base of AI server hardware, software, and firmware developers. You will manage a mix of R&D lab and production lab environments, each with distinct release cycles, stability requirements, and operational expectations. The role requires close collaboration with IT, Infosec, infrastructure automation teams, and deeply technical end users to ensure service quality, delivery commitments, and governance standards are consistently met.

Job Responsibility:

  • Work directly with tenants and stakeholders to maximize service quality, utilization, and availability of managed compute clusters
  • Collaborate with highly technical users working deep within AMD’s Instinct platform (e.g., ROCm) to troubleshoot misconfigurations impacting HPC performance
  • Lead the resolution of complex issues during new deployments and ongoing operations
  • Partner with hardware vendors on technical escalations involving third‑party OEM platforms and coordinate maintenance cycles aligned with upstream releases
  • Support multiple Linux distributions across Red Hat and Ubuntu/Debian families
  • Act as a subject matter expert in one or more cluster scheduling technologies such as Slurm, LSF, Sun Grid Engine, OpenLava, or Kubernetes
  • Compare configurations and behaviors across heterogeneous clusters within AMD’s compute estate
  • Engage with emerging technologies where formal documentation may be limited, including white‑box platforms and pre‑beta hardware
  • Maintain and evolve compute images using automated CI/CD pipelines, or deploy software manually where automation is not available
  • Monitor cluster health, performance, and availability using standard tooling such as Grafana, Prometheus, and Zabbix
  • Work collaboratively with team members to reproduce and resolve difficult or intermittent issues
  • Train and enable on‑site L1 support teams
  • Participate in on‑call incident response as L2 support when required

Requirements:

  • Hands‑on experience administering or supporting HPC clusters in production, research, or academic environments
  • Practical experience working as an HPC user combined with Linux system administration in enterprise or lab environments
  • Background in software development combined with deep Linux systems exposure in server or infrastructure contexts
  • Demonstrated intermediate to advanced Linux expertise
  • Strong understanding of networking fundamentals, including the OSI model, multi‑homed systems, firewall troubleshooting, and high‑speed interconnects
  • Willingness to experiment with open‑source and emerging technologies
  • Experience supporting infrastructure services such as DNS, DHCP, BOOTP, PXE, TFTP, NTP, and PAM
  • Understanding of interprocess communication and familiarity with MPI implementations such as OpenMPI or MPICH
  • Proficiency with Linux troubleshooting tools such as nmap, gdb, lsof, sar, and server management interfaces including IPMI, iDRAC, and iLO
  • Working knowledge of virtualization, VLANs, and directory services
  • Strong written communication skills with the ability to produce clear technical documentation
  • Experience developing automation using Python and/or Ansible
  • Familiarity with version control systems such as Git
  • Self‑directed, analytical, dependable, and comfortable working both independently and in a team‑based environment
  • Bachelor’s or Master’s degree in Computer Science, Computer Engineering, or a related technical discipline

Nice to have:

  • Experience with RDMA
  • Familiarity with PCIe, I2C, compiler optimization, or other low‑level system components

Additional Information:

Job Posted:
March 25, 2026

Employment Type:
Fulltime

Similar Jobs for Senior Compute Cluster Administrator

Senior Database Administrator

Senior Database Administrator role focusing on PostgreSQL and ClickHouse databas...
Location:
India, Bangalore
Salary:
Not provided
Hewlett Packard Enterprise
Expiration Date:
Until further notice
Requirements:
  • Bachelor's or Master's degree in Computer Science, Information Technology, or a related field
  • 5-8 years of experience in database administration with strong PostgreSQL and ClickHouse expertise
  • Deep understanding of PostgreSQL architecture, configuration, replication, and tuning
  • Experience with ClickHouse configuration and optimization for large-scale analytics
  • Strong SQL and query optimization capabilities
  • Familiarity with backup/recovery tools such as pgBackRest, Barman, and ClickHouse utilities
  • Proficiency in Linux environments and shell scripting
  • Exposure to cloud-hosted database services like AWS RDS/Aurora, Azure Database, or GCP
  • Strong analytical thinking and problem-solving ability
  • Clear communication and effective cross-team collaboration
Job Responsibility:
  • Install, configure, and maintain PostgreSQL and ClickHouse across development, staging, and production environments
  • Monitor database health, performance, and resource usage
  • Manage schemas, indexes, roles, and permissions for performance and security
  • Analyze and tune queries, indexing strategies, and configurations for low-latency, high-throughput workloads
  • Optimize ClickHouse for analytical and OLAP workloads on large datasets
  • Implement backup strategies and disaster recovery solutions
  • Configure and manage replication, clustering, and failover setups
  • Apply database security best practices including encryption, access controls, and audit logging
  • Ensure compliance with industry standards and regulations
  • Investigate and resolve database-related incidents and performance issues
What we offer:
  • Health & Wellbeing benefits
  • Personal & Professional Development programs
  • Unconditional Inclusion environment
  • Career growth opportunities
  • Comprehensive benefits suite supporting physical, financial and emotional wellbeing
  • Fulltime

Senior HPC Administrator Technology Consultant

Provide technology consulting to external customers and internal project teams. ...
Location:
Saudi Arabia, Riyadh
Salary:
Not provided
Hewlett Packard Enterprise
Expiration Date:
Until further notice
Requirements:
  • 8+ years of professional experience
  • Bachelor of Arts/Science or equivalent degree in computer science or a related area of study; without a degree, 11+ years of relevant professional experience
  • Technical background and knowledge of industry trends
  • Experience in HPC services, including hardware and software for massively parallel processing (MPP) supercomputer systems, clusters, and storage systems
  • Ability to work on a 24x7 basis and be on standby when needed
  • Willingness to learn new HPC technologies, including Cray EX liquid-cooled systems and Shasta
  • Ability to manage customer relationships and communication
  • Ability to analyze, qualify, troubleshoot, and resolve incidents
  • Ability to collaborate with team members, other internal organizations, customers, and third parties
Job Responsibility:
  • Verify and implement the detailed technical design solution
  • Provide a detailed technical design for enterprise solutions
  • Analyze and develop enterprise technology solutions
  • Lead in the technical assessment and delivery of specific technical solutions to the customer
  • Provide a team structure conducive to high performance, and manage the team lifecycle stages
  • Coordinate implementation of new installations, designs, and migrations for technology solutions
  • Provide advanced technical consulting and advice to others on proposal efforts, solution design, system management, tuning and modification of solutions
  • Provide input to the company strategy moving forward
  • Collect and determine data from appropriate sources to assist in determining customer needs and requirements
  • Respond to requests for technical information from customers
What we offer:
  • Health & Wellbeing
  • Personal & Professional Development
  • Unconditional Inclusion
  • Fulltime

Compute Engineer

This role involves providing advanced technical expertise in Compute infrastruct...
Location:
India, Bangalore
Salary:
Not provided
Hewlett Packard Enterprise
Expiration Date:
Until further notice
Requirements:
  • Deep expertise in HPE Compute platforms (C7000, Synergy, Virtual Connect, ProLiant)
  • Advanced Linux administration (RHEL, SUSE) including kernel tuning, system hardening, and troubleshooting
  • Strong virtualization experience in VMware (vSphere, SRM, Horizon), KVM, and Hyper-V
  • Proficient in VMware infrastructure management: VM lifecycle operations, cluster management, performance monitoring, capacity planning, patching, backup/restore, and snapshot handling
  • Skilled in analyzing logs (VM-support, HPSreport, SOSreport) and performing root cause analysis
  • Solid understanding of storage technologies (SAN/NAS/DAS) and protocols (FC, iSCSI, FCoE)
  • Experience with Red Hat Satellite, SUSE Manager, and patch lifecycle management
  • Expertise in HA/DR solutions using Serviceguard, Pacemaker, and Linux clustering
  • Familiarity with networking fundamentals (VLANs, MTU, flow control) and troubleshooting
  • Strong scripting and automation skills using Bash, Python, and Ansible
Job Responsibility:
  • Acts as a senior technical expert in Compute infrastructure, VMware virtualization, and Linux-based operating systems, providing advanced support and strategic guidance
  • Leads complex troubleshooting, root cause analysis, and performance tuning across enterprise environments
  • Provides architectural input and contributes to the design and implementation of infrastructure solutions
  • Supports transition and transformation initiatives, including migrations, upgrades, and automation efforts
  • Ensures compliance with ITIL processes and industry best practices
  • Acts as a technical liaison between internal teams, customers, and third-party vendors
  • Mentors junior engineers and contributes to knowledge sharing and process improvement
  • Lead resolution of critical incidents and escalations, ensuring minimal business impact
  • Perform in-depth analysis of system logs, kernel dumps, and performance metrics
  • Design and implement automation for routine tasks using Ansible, Shell, Python, etc.
What we offer:
  • Health & Wellbeing
  • Personal & Professional Development
  • Unconditional Inclusion
  • Fulltime

Compute, VMware, and Linux Engineer

Compute, VMware, and Linux Engineer role focusing on technical expertise in ente...
Location:
India, Bangalore
Salary:
Not provided
Hewlett Packard Enterprise
Expiration Date:
Until further notice
Requirements:
  • Deep expertise in HPE Compute platforms (C7000, Synergy, Virtual Connect, ProLiant)
  • Advanced Linux administration (RHEL, SUSE) including kernel tuning, system hardening, and troubleshooting
  • Strong virtualization experience in VMware (vSphere, SRM, Horizon), KVM, and Hyper-V
  • Proficient in VMware infrastructure management: VM lifecycle operations, cluster management, performance monitoring, capacity planning, patching, backup/restore, and snapshot handling
  • Skilled in analyzing logs (VM-support, HPSreport, SOSreport) and performing root cause analysis
  • Solid understanding of storage technologies (SAN/NAS/DAS) and protocols (FC, iSCSI, FCoE)
  • Experience with Red Hat Satellite, SUSE Manager, and patch lifecycle management
  • Expertise in HA/DR solutions using Serviceguard, Pacemaker, and Linux clustering
  • Familiarity with networking fundamentals (VLANs, MTU, flow control) and troubleshooting
  • Strong scripting and automation skills using Bash, Python, and Ansible
Job Responsibility:
  • Acts as a senior technical expert in Compute infrastructure, VMware virtualization, and Linux-based operating systems, providing advanced support and strategic guidance
  • Leads complex troubleshooting, root cause analysis, and performance tuning across enterprise environments
  • Provides architectural input and contributes to the design and implementation of infrastructure solutions
  • Supports transition and transformation initiatives, including migrations, upgrades, and automation efforts
  • Ensures compliance with ITIL processes and industry best practices
  • Acts as a technical liaison between internal teams, customers, and third-party vendors
  • Mentors junior engineers and contributes to knowledge sharing and process improvement
  • Lead resolution of critical incidents and escalations, ensuring minimal business impact
  • Perform in-depth analysis of system logs, kernel dumps, and performance metrics
  • Design and implement automation for routine tasks using Ansible, Shell, Python, etc.
What we offer:
  • Health & Wellbeing
  • Personal & Professional Development
  • Unconditional Inclusion
  • Fulltime

Customer Support Engineer

As a Customer Support Engineer at a pioneering AI company, you'll be the first l...
Location:
United States, San Francisco
Salary:
180000.00 - 260000.00 USD / Year
Together AI
Expiration Date:
Until further notice
Requirements:
  • 5+ years of experience in a customer-facing technical role with at least 1 year in a support function in AI
  • Strong technical background, with knowledge of AI, ML, GPU technologies and their integration into high-performance computing (HPC) environments
  • Familiarity with infrastructure services (e.g., Kubernetes, SLURM), infrastructure-as-code solutions (e.g., Ansible), high-performance network fabrics, NFS-based storage management, container infrastructure, and scripting and programming languages
  • Familiarity with operating storage systems in HPC environments such as Vast and Weka
  • Familiarity with inspecting and resolving network-related errors
  • Strong knowledge of Python, TypeScript, and/or JavaScript with testing/debugging experience using curl and Postman-like tools
  • Foundational understanding in the installation, configuration, administration, troubleshooting, and securing of compute clusters
  • Complex technical problem solving and troubleshooting, with a proactive approach to issue resolution
  • Ability to work cross-functionally with teams such as Sales, Engineering, Support, Product and Research to drive customer success
  • Strong sense of ownership and willingness to learn new skills to ensure both team and customer success
Job Responsibility:
  • Engage directly with customers to tackle and resolve complex technical challenges involving our cutting-edge GPU clusters and our inference and fine-tuning services, ensuring swift and effective solutions every time
  • Become a product expert in all of our Gen AI solutions, serving as the last line of technical defense before issues are escalated to Engineering and Product teams
  • Collaborate seamlessly across Engineering, Research, and Product teams to address customer concerns, and with senior leaders both internally and externally to ensure the highest levels of customer satisfaction
  • Transform customer insights into action by identifying patterns in support cases and working with Engineering and Go-To-Market teams to drive Together’s roadmap (e.g., future models to support)
  • Maintain detailed documentation of system configurations, procedures, troubleshooting guides, and FAQs to facilitate knowledge sharing with team and customers
  • Be flexible in providing support coverage during holidays, nights and weekends as required by business needs to ensure consistent and reliable service for our customers
What we offer:
  • Competitive compensation
  • Startup equity
  • Health insurance
  • Flexibility in terms of remote work
  • Fulltime

Customer Support Engineer

As a Customer Support Engineer at a pioneering AI company, you'll be the first l...
Location:
India
Salary:
Not provided
Together AI
Expiration Date:
Until further notice
Requirements:
  • 5+ years of experience in a customer-facing technical role with at least 1 year in a support function in AI
  • Strong technical background, with knowledge of AI, ML, GPU technologies and their integration into high-performance computing (HPC) environments
  • Familiarity with infrastructure services (e.g., Kubernetes, SLURM), infrastructure-as-code solutions (e.g., Ansible), high-performance network fabrics, NFS-based storage management, container infrastructure, and scripting and programming languages
  • Familiarity with operating storage systems in HPC environments such as Vast and Weka
  • Familiarity with inspecting and resolving network-related errors
  • Strong knowledge of Python, TypeScript, and/or JavaScript with testing/debugging experience using curl and Postman-like tools
  • Foundational understanding in the installation, configuration, administration, troubleshooting, and securing of compute clusters
  • Complex technical problem solving and troubleshooting, with a proactive approach to issue resolution
  • Ability to work cross-functionally with teams such as Sales, Engineering, Support, Product and Research to drive customer success
  • Strong sense of ownership and willingness to learn new skills to ensure both team and customer success
Job Responsibility:
  • Engage directly with customers to tackle and resolve complex technical challenges involving our cutting-edge GPU clusters and our inference and fine-tuning services, ensuring swift and effective solutions every time
  • Become a product expert in all of our Gen AI solutions, serving as the last line of technical defense before issues are escalated to Engineering and Product teams
  • Collaborate seamlessly across Engineering, Research, and Product teams to address customer concerns, and with senior leaders both internally and externally to ensure the highest levels of customer satisfaction
  • Transform customer insights into action by identifying patterns in support cases and working with Engineering and Go-To-Market teams to drive Together’s roadmap (e.g., future models to support)
  • Maintain detailed documentation of system configurations, procedures, troubleshooting guides, and FAQs to facilitate knowledge sharing with team and customers
  • Be flexible in providing support coverage during holidays, nights and weekends as required by business needs to ensure consistent and reliable service for our customers
What we offer:
  • Competitive compensation
  • Startup equity
  • Health insurance
  • Flexibility in terms of remote work for the respective hiring region

Compute Linux Virtualization Engineer

This role has been designed as ‘Hybrid’ with an expectation that you will work o...
Location:
India, Bangalore
Salary:
Not provided
Hewlett Packard Enterprise
Expiration Date:
Until further notice
Requirements:
  • Deep expertise in HPE Compute platforms (C7000, Synergy, Virtual Connect, ProLiant)
  • Advanced Linux administration (RHEL, SUSE) including kernel tuning, system hardening, and troubleshooting
  • Strong virtualization experience in VMware (vSphere, SRM, Horizon), KVM, and Hyper-V
  • Proficient in VMware infrastructure management: VM lifecycle operations, cluster management, performance monitoring, capacity planning, patching, backup/restore, and snapshot handling
  • Skilled in analyzing logs (VM-support, HPSreport, SOSreport) and performing root cause analysis
  • Solid understanding of storage technologies (SAN/NAS/DAS) and protocols (FC, iSCSI, FCoE)
  • Experience with Red Hat Satellite, SUSE Manager, and patch lifecycle management
  • Expertise in HA/DR solutions using Serviceguard, Pacemaker, and Linux clustering
  • Familiarity with networking fundamentals (VLANs, MTU, flow control) and troubleshooting
  • Strong scripting and automation skills using Bash, Python, and Ansible
Job Responsibility:
  • Acts as a senior technical expert in Compute infrastructure, VMware virtualization, and Linux-based operating systems, providing advanced support and strategic guidance
  • Leads complex troubleshooting, root cause analysis, and performance tuning across enterprise environments
  • Provides architectural input and contributes to the design and implementation of infrastructure solutions
  • Supports transition and transformation initiatives, including migrations, upgrades, and automation efforts
  • Ensures compliance with ITIL processes and industry best practices
  • Acts as a technical liaison between internal teams, customers, and third-party vendors
  • Mentors junior engineers and contributes to knowledge sharing and process improvement
  • Resolve customers’ issues via telephone, email, or remote sessions
  • Reproduce issues in-house and respond in a timely manner
  • Follow up regularly with customers with recommendations, updates, and action plans
What we offer:
  • Health & Wellbeing
  • Personal & Professional Development
  • Unconditional Inclusion
  • Fulltime

Senior Systems Administrator

The role of the System Administrator includes supporting the implementation, tro...
Location:
United States, Laurel
Salary:
Not provided
Wrench Technology
Expiration Date:
Until further notice
Requirements:
  • Fourteen (10) years of professional experience as a System Administrator (SA)
  • Bachelor’s degree in Computer Science or related discipline from an accredited college or university is required
  • Five (5) years of additional SA experience may be substituted for a bachelor’s degree
  • Provide expertise in troubleshooting IT systems
  • Provide thorough analysis and feedback to management and internal customers regarding escalated tickets
  • Extend support for dispatch system and hardware issues, remaining actively engaged in the resolution process
  • Handle configuration and management of UNIX and Windows (or other relevant) operating systems, including installation/loading of software, troubleshooting, maintaining integrity, configuring network components, and implementing enhancements to improve reliability and performance
  • NetApp experience required
  • Able to write scripts in Python, Ruby, and Perl
Job Responsibility:
  • Supporting the implementation, troubleshooting, and upkeep of Information Technology (IT) systems
  • Overseeing the IT system infrastructure and associated processes
  • Providing assistance for day-to-day operations, monitoring, and resolving issues related to client/server/storage/network devices, as well as mobile devices
  • Diagnosing and resolving problems
  • Configuring and managing UNIX and Windows operating systems
  • Installing and maintaining operating system software
  • Ensuring integrity and configuring network components
  • Implementing enhancements to operating systems to improve reliability and performance
  • Assisting with the installation, configuration, optimization, and administration of extensive Hadoop (Apache Accumulo) clusters dedicated to data-intensive computing tasks