Technical Program Manager - AI Cluster Validation Job at AMD (Austin)

Technical Program Manager - AI Cluster Validation

AMD

Location:
United States, Austin

Category:
IT - Software Development

Contract Type:
Employment contract

Salary:

162,640.00 - 243,960.00 USD / Year

Job Description:

We are seeking a Technical Program Manager to lead execution of AI cluster engineering programs with deep focus on GPU platforms, rack-level solutions, and AI Cluster validation. This role is responsible for driving end-to-end delivery from GPU + server integration through rack bring-up, scale testing, failure analysis, and system debug closure, ensuring platform readiness for hyperscale and enterprise AI deployments. This role operates at the intersection of hardware, firmware, networking, and scale-test execution, and requires strong technical depth combined with disciplined program execution.

Job Responsibility:

Define and drive program plans for AI infrastructure systems validation and readiness, including server integration, rack bring-up, and cluster-scale deployment readiness
Create and maintain core PM artifacts: schedules, dependency maps, resource forecasts, risk/issue logs, and program dashboards/status reports
Identify and drive mitigation plans for issues/risks, including cross-team escalations and corrective actions across multiple engineering areas
Drive regular execution reviews with engineering teams and provide concise, data-driven updates to senior leadership
Own program execution for GPU-based AI platforms, spanning system bring-up, qualification, scale readiness, and deployment validation across server, rack, and cluster levels
Drive alignment across GPU, CPU, firmware, BIOS/BMC, and system teams to ensure readiness for scale testing and customer workloads
Track platform issues and debug dependencies; ensure risks are clearly documented, owned, and mitigated
Own program planning and execution for multi-node and multi-rack scale testing, including test strategy, scheduling, coverage tracking, and readiness gates
Lead end-to-end delivery of rack-level AI solutions, including compute trays, switch trays, cabling, power, cooling, and management infrastructure
Ensure rack bring-up plans are executable, resourced, and gated with clear entry/exit criteria across EVT, DVT, and scale phases
Drive coordination across lab operations, infrastructure, and engineering teams to unblock rack access, power, networking, and test readiness
Partner with scale, performance, and automation teams to ensure workloads, stress tests, and regression plans are ready before hardware arrives
Act as the execution lead for platform debug, coordinating across engineering teams to ensure fast triage, root-cause analysis, and resolution of system-level issues
Track high-impact failures (GPU, HSIO, FW, rack, network) through debug forums, ensuring clear ownership and closure plans
Balance debug depth vs. program timelines, escalating tradeoffs when needed and ensuring leadership has a clear view of risk and impact

Requirements:

Experience leading complex hardware or AI infrastructure programs with ownership across bring-up, validation, and deployment phases
Strong technical understanding of GPU-based AI systems, rack architectures, and datacenter infrastructure
Proven ability to manage ambiguity, drive debug execution, and lead cross-functional teams without direct authority
Strong written and verbal communication skills, including executive-level status reporting
Proficiency with program management and execution tools (Jira, Confluence, dashboards, Excel/PowerPoint)
Bachelor's or master's degree in systems, EE, CS, or related engineering discipline
PMP, Scrum Master, or equivalent program management training

Nice to have:

Hands-on experience with GPU cluster scale testing, system stress, or performance validation
Familiarity with rack-level bring-up, power/cooling constraints, networking, and failure modes at scale
Experience working through hardware/firmware debug cycles in pre-production or customer-facing environments

Additional Information:

Job Posted:
May 04, 2026

Employment Type:

Full-time

Work Type:

On-site work
