
Technical Program Manager - AI Cluster Validation


AMD


Location:
United States, Austin


Contract Type:
Employment contract


Salary:

162640.00 - 243960.00 USD / Year

Job Description:

We are seeking a Technical Program Manager to lead the execution of AI cluster engineering programs, with a deep focus on GPU platforms, rack-level solutions, and AI cluster validation. This role is responsible for driving end-to-end delivery from GPU and server integration through rack bring-up, scale testing, failure analysis, and system debug closure, ensuring platform readiness for hyperscale and enterprise AI deployments. The role operates at the intersection of hardware, firmware, networking, and scale-test execution, and requires strong technical depth combined with disciplined program execution.

Job Responsibility:

  • Define, plan, and drive program plans for AI infrastructure systems validation and readiness, including server integration, rack bring-up, and cluster-scale deployment readiness
  • Create and maintain core PM artifacts: schedules, dependency maps, resource forecasts, risk/issue logs, and program dashboards/status reports
  • Identify and drive mitigation plans for issues/risks, including cross-team escalations and corrective actions across multiple engineering areas
  • Drive regular execution reviews with engineering teams and provide concise, data-driven updates to senior leadership
  • Own program execution for GPU-based AI platforms, spanning system bring-up, qualification, scale readiness, and deployment validation across server, rack, and cluster levels
  • Drive alignment across GPU, CPU, firmware, BIOS/BMC, and system teams to ensure readiness for scale testing and customer workloads
  • Track platform issues and debug dependencies; ensure risks are clearly documented, owned, and mitigated
  • Own program planning and execution for multi-node and multi-rack scale testing, including test strategy, scheduling, coverage tracking, and readiness gates
  • Lead end-to-end delivery of rack-level AI solutions, including compute trays, switch trays, cabling, power, cooling, and management infrastructure
  • Ensure rack bring-up plans are executable, resourced, and gated with clear entry/exit criteria across EVT, DVT, and scale phases
  • Drive coordination across lab operations, infrastructure, and engineering teams to unblock rack access, power, networking, and test readiness
  • Partner with scale, performance, and automation teams to ensure workloads, stress tests, and regression plans are ready before hardware arrives
  • Act as the execution lead for platform debug, coordinating across engineering teams to ensure fast triage, root-cause analysis, and resolution of system-level issues
  • Track high-impact failures (GPU, HSIO, FW, rack, network) through debug forums, ensuring clear ownership and closure plans
  • Balance debug depth vs. program timelines, escalating tradeoffs when needed and ensuring leadership has a clear view of risk and impact

Requirements:

  • Experience leading complex hardware or AI infrastructure programs with ownership across bring-up, validation, and deployment phases
  • Strong technical understanding of GPU-based AI systems, rack architectures, and datacenter infrastructure
  • Proven ability to manage ambiguity, drive debug execution, and lead cross-functional teams without direct authority
  • Strong written and verbal communication skills, including executive-level status reporting
  • Proficiency with program management and execution tools (Jira, Confluence, dashboards, Excel/PowerPoint)
  • Bachelor's or Master's degree in systems engineering, EE, CS, or a related engineering discipline
  • PMP, Scrum Master, or equivalent program management training

Nice to have:

  • Hands-on experience with GPU cluster scale testing, system stress, or performance validation
  • Familiarity with rack-level bring-up, power/cooling constraints, networking, and failure modes at scale
  • Experience working through hardware/firmware debug cycles in pre-production or customer-facing environments

Additional Information:

Job Posted:
May 04, 2026

Employment Type:
Full-time
Work Type:
On-site work

Similar Jobs for Technical Program Manager - AI Cluster Validation

Director, Technical Program Management — Global Cluster Engineering

AMD’s Global Cluster Engineering (GCE) team designs, validates, and deploys larg...
Location:
United States, Seattle, Washington or Austin, Texas
Salary:
224640.00 - 336960.00 USD / Year
AMD
Expiration Date:
Until further notice
Requirements:
  • 12+ years of experience in technical program management, engineering program leadership, infrastructure delivery, or adjacent roles
  • Proven track record delivering large-scale infrastructure programs (datacenter, cloud, AI/HPC clusters, platforms, or complex hardware/software systems)
  • Demonstrated experience partnering with supply chain organizations (procurement, sourcing, planning, manufacturing, logistics) and managing long-lead constraints and supplier dependencies
  • Strong program fundamentals: scope definition, critical path, integrated schedules, RAID management, executive communications, and stakeholder alignment in a matrix environment
  • Comfort with technical depth across compute platforms, networking/storage concepts, and operational tooling—enough to drive decisions and resolve ambiguity
  • Undergraduate degree is preferred
  • Applied Science Degree, PMP, and/or MBA are desired
Job Responsibility:
  • Own a multi-year program portfolio for global cluster initiatives (new cluster builds, cluster validation and operational excellence), including critical milestones, dependencies, risk management, and executive reporting
  • Establish program governance (operating rhythms, QBRs, escalation paths, decision logs) across engineering, operations, finance, procurement, and suppliers
  • Lead end-to-end supply chain planning and execution for cluster infrastructure: server/GPU platforms, networking, storage, racks, power/cooling, spares, and long-lead components
  • Drive build readiness and NPI-style execution: BOM maturity, lead-time management, contract manufacturer alignment, and deployment sequencing
  • Partner with sourcing/procurement to optimize cost, availability, and resiliency across suppliers, balancing time-to-deploy with design and qualification constraints
  • Build and scale supply chain product automation for cluster delivery: forecasting, allocation, inventory visibility, exception management, and ETA/lead-time prediction
  • Own “product-like” delivery of internal platforms and tools (dashboards, APIs, workflow automation, digital-twin planning models) that improve supply chain decisions and reduce manual overhead
  • Define KPIs and data products for planning accuracy, schedule predictability, cost-to-serve, inventory health, and deployment velocity
  • Translate business and engineering objectives into executable program plans, including infrastructure requirements, capacity models, and deployment playbooks
  • Drive technical and operational trade-offs across performance, reliability, cost, availability, and schedule

Senior Technical Program Manager

We are seeking a Senior Technical Program Manager (L64) to join the AI Delivery ...
Location:
United States, Redmond
Salary:
119800.00 - 234700.00 USD / Year
Microsoft Corporation
Expiration Date:
Until further notice
Requirements:
  • Bachelor's Degree AND 4+ years experience in engineering, product/technical program management, data analysis, or product development OR equivalent experience
  • 2+ years of experience managing cross-functional and/or cross-team projects
Job Responsibility:
  • Own Regional GPU Feasibility Portfolio
  • Lead a regional portfolio of GPU feasibility assessment programs, ensuring consistent, predictable, and high‑quality execution across multiple parallel initiatives
  • Drive early feasibility validation across power, cooling, space, network, colo sequencing, and deployment constraints to improve execution signal quality and reduce late‑stage risk
  • Establish and maintain a standardized feasibility pre‑check framework that enables faster decision‑making and minimizes canceled or reworked execute signals
  • Accelerate GPU Cluster Design Readiness
  • Partner across organizations to accelerate cluster layout design creation and evolution for next‑generation GPU platforms
  • Create and maintain runbooks, templates, and governance mechanisms that serve as a single source of truth for cluster design workflows, change management, and handoffs
  • Drive clarity and predictability in change management, balancing speed with quality and downstream execution impact
  • Cross‑Team Orchestration & Governance
  • Act as a connective tissue across partner teams, aligning priorities, dependencies, and execution signals

Principal AI Safety Engineer for Autonomous Vehicles Technical Lead

The AV Safety Strategy and Assessment team is seeking an AI Safety Technical Lea...
Location:
United States
Salary:
250600.00 - 384600.00 USD / Year
General Motors
Expiration Date:
Until further notice
Requirements:
  • Bachelor’s degree in Computer Science, Electrical Engineering, Mathematics, Physics, or a related field, or equivalent practical experience
  • 10+ years of experience in AI/ML, engineering, or a related field
  • 5+ years in autonomous vehicles, robotics, or a related field
  • Experience in Machine Learning & AI: Extensive experience in building large-scale models with significant focus on E2E validation. Experience using Large Language Models (LLMs), Generative AI, RAG, Deep learning, Reinforcement Learning, Natural Language Processing (NLP), SVM, XGBoost, Random Forest, Decision Trees, Clustering
  • AI Standards and Evolving Regulations: Understanding of ISO/PAS 8800, the NIST AI Risk Management Framework, the EU AI Act (2024-2027), and other applicable industry standards and best practices for autonomous vehicles, aerospace, and/or robotics
  • Programming & Frameworks: Python, R, Java, PySpark, PyTorch, TensorFlow, Scikit-learn, LangChain, SQL
  • Cloud & Big Data Platforms: preferred Microsoft Azure (Data Lake, Machine Learning, Databricks)
  • Deployment & MLOps: MLflow, Model Monitoring & Versioning, Docker & Kubernetes, GitHub, Jira
  • Data Analysis & Visualization: Tableau, PowerBI, Pandas, NumPy
Job Responsibility:
  • Lead the development of AI safety strategies for ADS and establish safety engineering guidance and sufficiency criteria.
  • Actively engage with partners and seek input, provide technical expertise to inform leadership decision-making, and take ownership of technical projects
  • Define GM’s strategy for AI safety standards, engage externally to influence evolving standards, and contribute to internal and external thought leadership that strengthens GM’s position in the autonomous vehicle ecosystem.
  • Support regulatory rulemaking and policy responses related to AI safety-critical systems.
  • Establish an assurance plan and process to evaluate AI-related safety case evidence and verify that sufficiency criteria are met.
  • Provide AI expertise and safety guidance across Global Product Safety, Systems, and Certification activities.
  • Identify and drive opportunities to improve the efficiency and quality of safety work through the application of AI methodologies.
  • Mentor and develop team members, fostering a culture of technical excellence and continuous learning.
What we offer:
  • medical
  • dental
  • vision
  • Health Savings Account
  • Flexible Spending Accounts
  • retirement savings plan
  • sickness and accident benefits
  • life insurance
  • paid vacation & holidays
  • tuition assistance programs

Principal Consultant A2 - Infra

Microsoft Industry Solution - Global Center Innovation and Delivery Center (GCID...
Location:
India, Hyderabad
Salary:
Not provided
Microsoft Corporation
Expiration Date:
Until further notice
Requirements:
  • Bachelor’s degree in Computer Science, Engineering, or a related field AND 3+ years of leadership experience in a relevant area of business (higher education preferred)
  • OR Master’s degree in Computer Science, Information Technology, Engineering, or a related field AND 6+ years of experience in technology solutions, practice development, architecture, consulting, and/or the Cloud Infrastructure domain
  • Highly proficient, with solid customer-facing project experience (minimum 10+ years) involving solution design, project envisioning, planning, development, and deployment of complex solutions
  • Must have a proven record of delivering technical solutions
  • 2+ years managing multiple projects or portfolios
  • 1+ year(s) experience leading blended, multidisciplinary teams
  • Preferred Qualifications: 20+ years of overall industry experience
  • Technical or Professional Certification in Cloud Infrastructure domain
  • Open to travel domestically and internationally and work with different cultures and customers
  • Technical certifications based on domain/service line (e.g., Azure, Security, Dynamics)
Job Responsibility:
  • AI-First Delivery Leadership: Embed AI-first principles into delivery workflows, leveraging automation and intelligent orchestration where applicable
  • Lead end-to-end delivery of complex projects, ensuring solutions are scalable, robust, and aligned with client business outcomes
  • Drive engineering excellence through reusable components, accelerators, and scalable architecture
  • Oversee technical execution across multiple projects, ensuring adherence to best practices, quality standards, and compliance requirements
  • Collaborate with clients and internal stakeholders to define strategies, delivery plans, milestones, and risk mitigation approaches
  • Act as a technical point of contact for clients, translating business requirements into scalable technical solutions
  • Ensure delivery models are optimized for modern, AI-native execution, including integration of automation and intelligent processes
  • Ability to step into at-risk projects, quickly assess issues, and establish a credible path to recovery or exit
  • Engineering Excellence: Champion high-quality engineering practices across all delivery engagements
  • Ensure adherence to coding standards, architectural integrity, and performance benchmarks

Network Engineer, Engineering R&D Environments

Meta's Network and Infrastructure Services (NIS) team is actively looking for a ...
Location:
United States, Menlo Park
Salary:
135000.00 - 191000.00 USD / Year
Meta
Expiration Date:
Until further notice
Requirements:
  • Bachelor's degree in Computer Science, Computer Engineering, relevant technical field, or equivalent practical experience
  • 6+ years of experience designing, deploying, and operating network infrastructure in production or lab environments
  • Demonstrated expertise in IPv4/IPv6, L2/L3 protocols, including STP, OSPF, BGP, TCP/IP, DHCP, DNS, VLANs, VRRP, LACP, MC-LAG, ACLs, MACSEC and EVPN/VXLAN
  • Hands-on experience with backend cluster networking, including scale-out fabrics, RDMA networks, and congestion management
  • Experience working in multi-vendor environments, including Arista, FBOSS-based platforms, and lab networking hardware
  • Experience gathering requirements and developing network architectures for engineering lab build-outs
  • Experience with config management, code repositories, and ZTP for network infrastructure
  • Working knowledge of scripting or programming languages (e.g., Python, shell) for automation and tooling
  • Demonstrated ability to operate consistently under your own initiative, seeking feedback and input where appropriate, in a global, time-critical environment while managing multiple priorities and mission-critical timelines
Job Responsibility:
  • Own end-to-end Frontend (FE) and Backend (BE) network design, deployment, and operations for AI and compute lab clusters
  • Serve as a primary networking point of contact for backend fabrics, including Arista- and FBOSS-based scale-out networks supporting AI workloads
  • Design, deploy, and support high-throughput, low-latency cluster networking, including congestion management (PFC/ECN), RDMA validation, and lossless transport
  • Perform hands-on troubleshooting and root-cause analysis across L1–L4 using packet captures, telemetry, and vendor tools to resolve complex lab issues
  • Support silicon, hardware, and software bring-ups, ensuring reliable connectivity and on-time validation
  • Lead and execute lab network lifecycle activities, including upgrades, migrations, capacity expansions, and decommissioning across regions
  • Develop and maintain network automation, configuration templates, and zero-touch provisioning (ZTP) workflows
  • Create and maintain MOPs, runbooks, and readiness checklists for internal teams and vendor execution
  • Provide direct consultation and training to cross-functional partners, enabling teams to operate and troubleshoot lab networks
  • End-to-end ownership of projects from requirements definition through customer handoff
What we offer:
  • bonus
  • equity
  • benefits

Data Scientist

Inetum is a European leader in digital services. Inetum’s team of 28,000 consult...
Location:
Portugal, Lisbon
Salary:
Not provided
Inetum
Expiration Date:
Until further notice
Requirements:
  • Bachelor’s or Master’s degree in Computer Science, Data Science, Machine Learning, or a related field
  • 3+ years of hands-on experience in machine learning and AI development
  • Strong expertise in Generative AI (model training, fine-tuning, and deployment)
  • Experience with open-source and commercial LLMs
  • Proficiency in implementing RAG techniques
  • Experience with vector databases
  • Advanced programming skills in Python with frameworks like TensorFlow, PyTorch, and Hugging Face
  • Deep understanding of NLP techniques and architectures
  • Practical experience with traditional AI/ML methods such as decision trees, SVMs, clustering, and neural networks
  • Familiarity with cloud platforms (AWS, GCP, or Azure) for AI/ML deployments
Job Responsibility:
  • Create machine learning models and tailored AI solutions
  • Fine-tune and extend Large Language Models (LLMs) including open-source and commercial models
  • Implement Retrieval-Augmented Generation (RAG) techniques for enhanced model efficiency
  • Perform exploratory data analysis and feature engineering to unlock better model performance
  • Develop Generative AI models for content creation, process automation, and innovation
  • Utilize decision trees, SVMs, neural networks, and more to solve diverse problems
  • Work alongside data engineers, product managers, and stakeholders
  • Ensure high-quality AI solutions through rigorous testing and validation
  • Maintain technical documentation for knowledge sharing and reproducibility

Staff Software Engineer - AI/ML Infra

GEICO AI platform and Infrastructure team is seeking an exceptional Senior ML Pl...
Location:
United States, Palo Alto
Salary:
90000.00 - 300000.00 USD / Year
Geico
Expiration Date:
Until further notice
Requirements:
  • Bachelor’s degree in computer science, Engineering, or related technical field (or equivalent experience)
  • 8+ years of software engineering experience with focus on infrastructure, platform engineering, or MLOps
  • 3+ years of hands-on experience with machine learning infrastructure and deployment at scale
  • 2+ years of experience working with Large Language Models and transformer architectures
  • Proficient in Python; strong skills in Go, Rust, or Java preferred
  • Proven experience working with open source LLMs (Llama 2/3, Qwen, Mistral, Gemma, Code Llama, etc.)
  • Proficient in Kubernetes including custom operators, helm charts, and GPU scheduling
  • Deep expertise in Azure services (AKS, Azure ML, Container Registry, Storage, Networking)
  • Experience implementing and operating feature stores (Chronon, Feast, Tecton, Azure ML Feature Store, or custom solutions)
Job Responsibility:
  • Design and implement scalable infrastructure for training, fine-tuning, and serving open source LLMs (Llama, Mistral, Gemma, etc.)
  • Architect and manage Kubernetes clusters for ML workloads, including GPU scheduling, autoscaling, and resource optimization
  • Design, implement, and maintain feature stores for ML model training and inference pipelines
  • Build and optimize LLM inference systems using frameworks like vLLM, TensorRT-LLM, and custom serving solutions
  • Ensure 99.9%+ uptime for ML platforms through robust monitoring, alerting, and incident response procedures
  • Design and implement ML platforms using DataRobot, Azure Machine Learning, Azure Kubernetes Service (AKS), and Azure Container Instances
  • Develop and maintain infrastructure using Terraform, ARM templates, and Azure DevOps
  • Implement cost-effective solutions for GPU compute, storage, and networking across Azure regions
  • Ensure ML platforms meet enterprise security standards and regulatory compliance requirements
  • Evaluate and potentially implement hybrid cloud solutions with AWS/GCP as backup or specialized use cases
What we offer:
  • Comprehensive Total Rewards program that offers personalized coverage tailor-made for you and your family’s overall well-being
  • Financial benefits including market-competitive compensation, a 401(k) savings plan vested from day one that offers a 6% match, performance and recognition-based incentives, and tuition assistance
  • Access to additional benefits like mental healthcare as well as fertility and adoption assistance
  • Supports flexibility: we provide workplace flexibility as well as our GEICO Flex program, which offers the ability to work from anywhere in the US for up to four weeks per year

Staff Software Engineer - AI/ML Platform

GEICO AI platform and Infrastructure team is seeking an exceptional Senior ML Pl...
Location:
United States, Chevy Chase; New York City; Palo Alto
Salary:
115000.00 - 300000.00 USD / Year
Geico
Expiration Date:
Until further notice
Requirements:
  • Bachelor’s degree in computer science, Engineering, or related technical field (or equivalent experience)
  • 8+ years of software engineering experience with focus on infrastructure, platform engineering, or MLOps
  • 3+ years of hands-on experience with machine learning infrastructure and deployment at scale
  • 2+ years of experience working with Large Language Models and transformer architectures
  • Proficient in Python; strong skills in Go, Rust, or Java preferred
  • Proven experience working with open source LLMs (Llama 2/3, Qwen, Mistral, Gemma, Code Llama, etc.)
  • Proficient in Kubernetes including custom operators, helm charts, and GPU scheduling
  • Deep expertise in Azure services (AKS, Azure ML, Container Registry, Storage, Networking)
  • Experience implementing and operating feature stores (Chronon, Feast, Tecton, Azure ML Feature Store, or custom solutions)
Job Responsibility:
  • Design and implement scalable infrastructure for training, fine-tuning, and serving open source LLMs (Llama, Mistral, Gemma, etc.)
  • Architect and manage Kubernetes clusters for ML workloads, including GPU scheduling, autoscaling, and resource optimization
  • Design, implement, and maintain feature stores for ML model training and inference pipelines
  • Build and optimize LLM inference systems using frameworks like vLLM, TensorRT-LLM, and custom serving solutions
  • Ensure 99.9%+ uptime for ML platforms through robust monitoring, alerting, and incident response procedures
  • Design and implement ML platforms using DataRobot, Azure Machine Learning, Azure Kubernetes Service (AKS), and Azure Container Instances
  • Develop and maintain infrastructure using Terraform, ARM templates, and Azure DevOps
  • Implement cost-effective solutions for GPU compute, storage, and networking across Azure regions
  • Ensure ML platforms meet enterprise security standards and regulatory compliance requirements
  • Evaluate and potentially implement hybrid cloud solutions with AWS/GCP as backup or specialized use cases
What we offer:
  • Comprehensive Total Rewards program that offers personalized coverage tailor-made for you and your family’s overall well-being
  • Financial benefits including market-competitive compensation, a 401(k) savings plan vested from day one that offers a 6% match, performance and recognition-based incentives, and tuition assistance
  • Access to additional benefits like mental healthcare as well as fertility and adoption assistance
  • Supports flexibility: we provide workplace flexibility as well as our GEICO Flex program, which offers the ability to work from anywhere in the US for up to four weeks per year