
Technical Program Manager - AI Cluster Validation

United States, Austin · Employment contract · 162640.00 - 243960.00 USD / Year · Job Posted May 04, 2026

Job Description

We are seeking a Technical Program Manager to lead execution of AI cluster engineering programs with a deep focus on GPU platforms, rack-level solutions, and AI cluster validation. This role is responsible for driving end-to-end delivery from GPU and server integration through rack bring-up, scale testing, failure analysis, and system debug closure, ensuring platform readiness for hyperscale and enterprise AI deployments. The role operates at the intersection of hardware, firmware, networking, and scale-test execution, and requires strong technical depth combined with disciplined program execution.

Job Responsibility

  • Define and drive program plans for AI infrastructure systems validation and readiness, including server integration, rack bring-up, and cluster-scale deployment readiness
  • Create and maintain core PM artifacts: schedules, dependency maps, resource forecasts, risk/issue logs, and program dashboards/status reports
  • Identify and drive mitigation plans for issues/risks, including cross-team escalations and corrective actions across multiple engineering areas
  • Drive regular execution reviews with engineering teams and provide concise, data-driven updates to senior leadership
  • Own program execution for GPU-based AI platforms, spanning system bring-up, qualification, scale readiness, and deployment validation across server, rack, and cluster levels
  • Drive alignment across GPU, CPU, firmware, BIOS/BMC, and system teams to ensure readiness for scale testing and customer workloads
  • Track platform issues and debug dependencies; ensure risks are clearly documented, owned, and mitigated
  • Own program planning and execution for multi-node and multi-rack scale testing, including test strategy, scheduling, coverage tracking, and readiness gates
  • Lead end-to-end delivery of rack-level AI solutions, including compute trays, switch trays, cabling, power, cooling, and management infrastructure
  • Ensure rack bring-up plans are executable, resourced, and gated with clear entry/exit criteria across EVT, DVT, and scale phases
  • Drive coordination across lab operations, infrastructure, and engineering teams to unblock rack access, power, networking, and test readiness
  • Partner with scale, performance, and automation teams to ensure workloads, stress tests, and regression plans are ready before hardware arrives
  • Act as the execution lead for platform debug, coordinating across engineering teams to ensure fast triage, root-cause analysis, and resolution of system-level issues
  • Track high-impact failures (GPU, HSIO, FW, rack, network) through debug forums, ensuring clear ownership and closure plans
  • Balance debug depth vs. program timelines, escalating tradeoffs when needed and ensuring leadership has a clear view of risk and impact

Requirements

  • Experience leading complex hardware or AI infrastructure programs with ownership across bring-up, validation, and deployment phases
  • Strong technical understanding of GPU-based AI systems, rack architectures, and datacenter infrastructure
  • Proven ability to manage ambiguity, drive debug execution, and lead cross-functional teams without direct authority
  • Strong written and verbal communication skills, including executive-level status reporting
  • Proficiency with program management and execution tools (Jira, Confluence, dashboards, Excel/PowerPoint)
  • Bachelor's or master's degree in systems, EE, CS, or related engineering discipline
  • PMP, Scrum Master, or equivalent program management training

Nice to have

  • Hands-on experience with GPU cluster scale testing, system stress, or performance validation
  • Familiarity with rack-level bring-up, power/cooling constraints, networking, and failure modes at scale
  • Experience working through hardware/firmware debug cycles in pre-production or customer-facing environments


Similar Jobs for Technical Program Manager - AI Cluster Validation

8 matching positions

Senior Technical Program Manager

We are seeking a Senior Technical Program Manager (L64) to join the AI Delivery ...
Location: United States, Redmond
Salary: 119800.00 - 234700.00 USD / Year
Microsoft Corporation (https://www.microsoft.com/)
Expiration Date: Until further notice
Requirements
  • Bachelor's Degree AND 4+ years experience in engineering, product/technical program management, data analysis, or product development OR equivalent experience
  • 2+ years of experience managing cross-functional and/or cross-team projects
Job Responsibility
  • Own Regional GPU Feasibility Portfolio
  • Lead a regional portfolio of GPU feasibility assessment programs, ensuring consistent, predictable, and high‑quality execution across multiple parallel initiatives
  • Drive early feasibility validation across power, cooling, space, network, colo sequencing, and deployment constraints to improve execution signal quality and reduce late‑stage risk
  • Establish and maintain a standardized feasibility pre‑check framework that enables faster decision‑making and minimizes canceled or reworked execute signals
  • Accelerate GPU Cluster Design Readiness
  • Partner across organizations to accelerate cluster layout design creation and evolution for next‑generation GPU platforms
  • Create and maintain runbooks, templates, and governance mechanisms that serve as a single source of truth for cluster design workflows, change management, and handoffs
  • Drive clarity and predictability in change management, balancing speed with quality and downstream execution impact
  • Cross‑Team Orchestration & Governance
  • Act as connective tissue across partner teams, aligning priorities, dependencies, and execution signals
  • Fulltime

AI Technical Architect

Location: United States, Auburn Hills
Salary: Not provided
NTT DATA (nttdata.com)
Expiration Date: Until further notice
Requirements
  • 20+ years in software engineering with 5+ years focused on AI/ML systems
  • 3+ years hands-on experience architecting and shipping production LLM and agentic AI applications
  • Demonstrated success leading enterprise-scale AI platform builds with measurable business outcomes
  • Track record architecting scalable cloud-native systems on AWS in regulated or large-enterprise environments
  • Experience leading technical teams, mentoring engineers, and engaging executive stakeholders
  • Bachelor's or Master's degree in Computer Science, AI/ML, or a related technical field
  • Expert proficiency with LangGraph, LangChain, and agent orchestration frameworks
  • Deep experience with Amazon Bedrock, SageMaker, and Amazon Q, including Bedrock Agents and Knowledge Bases
  • Hands-on experience with Model Context Protocol (MCP), function calling, tool use, and structured output patterns
  • Strong command of prompt engineering, evaluation harnesses, fine-tuning, and model optimization
Job Responsibility
  • Design the enterprise AI platform architecture spanning the LLM API gateway, GPU and compute allocation pools, sandbox provisioning, model registry, and security gate automation
  • Define infrastructure standards, API gateway patterns, and reference architectures consumed by all AI delivery towers and partner integrations
  • Establish guardrails for token metering, rate limiting, audit logging, DLP validation, SAST, DAST, dependency scanning, and model card review embedded in CI/CD
  • Review security posture across all AI workloads with mapping to NIST AI RMF, AWS Well-Architected (including the Machine Learning Lens), and applicable enterprise compliance baselines
  • Architect multi-agent systems using LangGraph, LangChain, and Model Context Protocol (MCP) for complex workflow orchestration, planning, and tool use
  • Define patterns for ReAct, Chain-of-Thought, Tree-of-Thoughts, and agent-to-agent coordination across enterprise and customer-facing use cases
  • Design and optimize Retrieval-Augmented Generation (RAG) systems, embedding strategies, and semantic search across structured and unstructured enterprise data
  • Establish MLOps and AgentOps practices for deployment, evaluation, observability, and continuous improvement of agents and models in production
  • Architect solutions on Amazon Bedrock, Amazon SageMaker, Amazon Q, Bedrock Agents, and Bedrock Knowledge Bases
  • Define infrastructure patterns using Amazon EKS, AWS Lambda, ECS Fargate, API Gateway, EventBridge, SNS/SQS, Kinesis, S3, DynamoDB, Aurora, Redshift, Athena, OpenSearch, and Kendra
  • Fulltime

Principal AI Safety Engineer for Autonomous Vehicles Technical Lead

The AV Safety Strategy and Assessment team is seeking an AI Safety Technical Lea...
Location: United States
Salary: 250600.00 - 384600.00 USD / Year
General Motors (gm.com)
Expiration Date: Until further notice
Requirements
  • Bachelor’s degree in Computer Science, Electrical Engineering, Mathematics, Physics, or a related field, or equivalent practical experience
  • 10+ years of experience in AI/ML, engineering or a related field
  • 5+ years in autonomous vehicles, robotics or related field
  • Experience in Machine Learning & AI: Extensive experience in building large-scale models with significant focus on E2E validation. Experience using Large Language Models (LLMs), Generative AI, RAG, Deep learning, Reinforcement Learning, Natural Language Processing (NLP), SVM, XGBoost, Random Forest, Decision Trees, Clustering
  • AI Standards and Evolving Regulations: Understanding of ISO/PAS 8800, NIST AI Risk Management Framework, EU AI Act (2024-2027), other applicable industry standards and best practices for autonomous vehicles, aerospace and/or robotics.
  • Programming & Frameworks: Python, R, Java, PySpark, PyTorch, TensorFlow, Scikit-learn, LangChain, SQL
  • Cloud & Big Data Platforms: Microsoft Azure preferred (Data Lake, Machine Learning, Databricks)
  • Deployment & MLOps: MLflow, Model Monitoring & Versioning, Docker & Kubernetes, GitHub, Jira
  • Data Analysis & Visualization: Tableau, PowerBI, Pandas, NumPy
Job Responsibility
  • Lead the development of AI safety strategies for ADS and establish safety engineering guidance and sufficiency criteria.
  • Actively engage with partners and seek input, provide technical expertise to inform leadership decision-making, and take ownership of technical projects
  • Define GM’s strategy for AI safety standards, engage externally to influence evolving standards, and contribute to internal and external thought leadership that strengthens GM’s position in the autonomous vehicle ecosystem.
  • Support regulatory rulemaking and policy responses related to AI safety-critical systems.
  • Establish an assurance plan and process to evaluate AI-related safety case evidence and verify that sufficiency criteria are met.
  • Provide AI expertise and safety guidance across Global Product Safety, Systems, and Certification activities.
  • Identify and drive opportunities to improve the efficiency and quality of safety work through the application of AI methodologies.
  • Mentor and develop team members, fostering a culture of technical excellence and continuous learning.
What we offer
  • medical
  • dental
  • vision
  • Health Savings Account
  • Flexible Spending Accounts
  • retirement savings plan
  • sickness and accident benefits
  • life insurance
  • paid vacation & holidays
  • tuition assistance programs
  • Fulltime

Network Engineer, Engineering R&D Environments

Meta's Lab Infrastructure, Network, Compliance, and Security team is seeking ...
Location: United States, Menlo Park
Salary: 162000.00 - 227000.00 USD / Year
Meta (meta.com)
Expiration Date: Until further notice
Requirements
  • Bachelor's degree in Computer Science, Computer Engineering, relevant technical field, or equivalent practical experience
  • 8+ years of experience designing, deploying, and operating network infrastructure in production or lab environments
  • Experience working in multi-vendor environments, including network operating systems and switching platforms
  • Experience with configuration management, code repositories, and zero-touch provisioning (ZTP) for network infrastructure
  • Experience with IPv4/IPv6, L2/L3 protocols, including STP, OSPF, BGP, TCP/IP, DHCP, DNS, VLANs, VRRP, LACP, MC-LAG, ACLs, MACsec, and EVPN/VXLAN
  • Working knowledge of scripting or programming languages (e.g., Python, shell) for automation and tooling
  • Experience prioritizing competing workstreams based on impact, deadlines, and stakeholder needs in a global environment, with a track record of driving work independently while engaging cross-functional partners as needed
Job Responsibility
  • Own end-to-end frontend and backend network design, deployment, and operations for AI and compute lab clusters
  • Serve as a primary networking point of contact for backend fabrics, including Arista- and internally developed network OS-based scale-out networks supporting AI workloads
  • Design, deploy, and support high-throughput, low-latency cluster networking, including congestion management (PFC/ECN), RDMA validation, and lossless transport
  • Perform hands-on troubleshooting and root-cause analysis across L1–L4 using packet captures, telemetry, and vendor tools to resolve complex lab issues
  • Support silicon, hardware, and software bring-ups, ensuring reliable connectivity and on-time validation
  • Lead and execute lab network lifecycle activities, including upgrades, migrations, capacity expansions, and decommissioning across regions
  • Develop and maintain network automation, configuration templates, and zero-touch provisioning (ZTP) workflows
  • Create and maintain MOPs, runbooks, and readiness checklists for internal teams and vendor executions
  • Provide direct consultation and training to cross-functional partners, enabling teams to operate and troubleshoot lab networks
  • Own projects end to end, from requirements definition through customer handoff
What we offer
  • bonus
  • equity
  • benefits
  • Fulltime

ML Engineer Senior - GenAI Solutions

As a Senior Machine Learning Engineer at NTT DATA, you will work alongside exper...
Location: Italy, Milano
Salary: Not provided
NTT DATA (nttdata.com)
Expiration Date: Until further notice
Requirements
  • At least 5 years of production experience working in Data Science or Software Engineering
  • Deep knowledge of math, probability, statistics and algorithms
  • At least 6 to 12 months of experience in Generative AI deployment and underlying architecture handling
  • Vector database knowledge is appreciated
  • Understanding of data structures, data modeling and software architecture
  • Fluent in at least two mainstream programming languages (Python, Scala, Java, C++)
  • Experience in building infrastructure for technical users, such as data scientists, ML practitioners, or data consumers/producers
  • Strong knowledge of Spark; Databricks is a strong plus
  • Experience developing/deploying ML solutions on one of the public cloud platforms and on a cross-cloud basis; Snowflake knowledge is a plus
  • Deep knowledge of machine learning frameworks (such as Keras or PyTorch)
Job Responsibility
  • Apply hands-on Generative AI capabilities, preferably on Azure/GCP and on-premise GenAI architectures and MLOps
  • Leverage a strong mathematical background
  • Work on classification, information retrieval, clustering and optimization problems
  • Establish scalable, efficient and automated processes for large-scale data analysis
  • Contribute to model development, model validation and model implementation
  • Identify business opportunities
  • Design and create new data pipelines from scratch, from experiments to production deployment
  • Manage multiple projects
  • Lead ML Engineers
  • Connect with stakeholders
  • Fulltime

Staff Software Engineer - AI/ML Infra

GEICO AI platform and Infrastructure team is seeking an exceptional Senior ML Pl...
Location: United States, Chevy Chase; New York City; Palo Alto
Salary: 115000.00 USD / Year
Geico (geico.com)
Expiration Date: Until further notice
Requirements
  • Bachelor’s degree in computer science, Engineering, or related technical field (or equivalent experience)
  • 8+ years of software engineering experience with focus on infrastructure, platform engineering, or MLOps
  • 3+ years of hands-on experience with machine learning infrastructure and deployment at scale
  • 2+ years of experience working with Large Language Models and transformer architectures
  • Proficient in Python; strong skills in Go, Rust, or Java preferred
  • Proven experience working with open source LLMs (Llama 2/3, Qwen, Mistral, Gemma, Code Llama, etc.)
  • Proficient in Kubernetes including custom operators, helm charts, and GPU scheduling
  • Deep expertise in Azure services (AKS, Azure ML, Container Registry, Storage, Networking)
  • Experience implementing and operating feature stores (Chronon, Feast, Tecton, Azure ML Feature Store, or custom solutions)
Job Responsibility
  • Design and implement scalable infrastructure for training, fine-tuning, and serving open source LLMs (Llama, Mistral, Gemma, etc.)
  • Architect and manage Kubernetes clusters for ML workloads, including GPU scheduling, autoscaling, and resource optimization
  • Design, implement, and maintain feature stores for ML model training and inference pipelines
  • Build and optimize LLM inference systems using frameworks like vLLM, TensorRT-LLM, and custom serving solutions
  • Ensure 99.9%+ uptime for ML platforms through robust monitoring, alerting, and incident response procedures
  • Design and implement ML platforms using DataRobot, Azure Machine Learning, Azure Kubernetes Service (AKS), and Azure Container Instances
  • Develop and maintain infrastructure using Terraform, ARM templates, and Azure DevOps
  • Implement cost-effective solutions for GPU compute, storage, and networking across Azure regions
  • Ensure ML platforms meet enterprise security standards and regulatory compliance requirements
  • Evaluate and potentially implement hybrid cloud solutions with AWS/GCP as backup or specialized use cases
What we offer
  • Comprehensive Total Rewards program that offers personalized coverage tailor-made for you and your family’s overall well-being
  • Financial benefits including market-competitive compensation, a 401K savings plan vested from day one that offers a 6% match, performance and recognition-based incentives, and tuition assistance
  • Access to additional benefits like mental healthcare as well as fertility and adoption assistance
  • Supports flexibility: we provide workplace flexibility as well as our GEICO Flex program, which offers the ability to work from anywhere in the US for up to four weeks per year
  • Fulltime

Staff Software Engineer - AI/ML Platform

GEICO AI platform and Infrastructure team is seeking an exceptional Senior ML Pl...
Location: United States, Chevy Chase; New York City; Palo Alto
Salary: 115000.00 USD / Year
Geico (geico.com)
Expiration Date: Until further notice
Requirements
  • Bachelor’s degree in computer science, Engineering, or related technical field (or equivalent experience)
  • 8+ years of software engineering experience with focus on infrastructure, platform engineering, or MLOps
  • 3+ years of hands-on experience with machine learning infrastructure and deployment at scale
  • 2+ years of experience working with Large Language Models and transformer architectures
  • Proficient in Python; strong skills in Go, Rust, or Java preferred
  • Proven experience working with open source LLMs (Llama 2/3, Qwen, Mistral, Gemma, Code Llama, etc.)
  • Proficient in Kubernetes including custom operators, helm charts, and GPU scheduling
  • Deep expertise in Azure services (AKS, Azure ML, Container Registry, Storage, Networking)
  • Experience implementing and operating feature stores (Chronon, Feast, Tecton, Azure ML Feature Store, or custom solutions)
Job Responsibility
  • Design and implement scalable infrastructure for training, fine-tuning, and serving open source LLMs (Llama, Mistral, Gemma, etc.)
  • Architect and manage Kubernetes clusters for ML workloads, including GPU scheduling, autoscaling, and resource optimization
  • Design, implement, and maintain feature stores for ML model training and inference pipelines
  • Build and optimize LLM inference systems using frameworks like vLLM, TensorRT-LLM, and custom serving solutions
  • Ensure 99.9%+ uptime for ML platforms through robust monitoring, alerting, and incident response procedures
  • Design and implement ML platforms using DataRobot, Azure Machine Learning, Azure Kubernetes Service (AKS), and Azure Container Instances
  • Develop and maintain infrastructure using Terraform, ARM templates, and Azure DevOps
  • Implement cost-effective solutions for GPU compute, storage, and networking across Azure regions
  • Ensure ML platforms meet enterprise security standards and regulatory compliance requirements
  • Evaluate and potentially implement hybrid cloud solutions with AWS/GCP as backup or specialized use cases
What we offer
  • Comprehensive Total Rewards program that offers personalized coverage tailor-made for you and your family’s overall well-being
  • Financial benefits including market-competitive compensation, a 401K savings plan vested from day one that offers a 6% match, performance and recognition-based incentives, and tuition assistance
  • Access to additional benefits like mental healthcare as well as fertility and adoption assistance
  • Supports flexibility: we provide workplace flexibility as well as our GEICO Flex program, which offers the ability to work from anywhere in the US for up to four weeks per year
  • Fulltime

Staff Software Engineer - AI/ML Infra

GEICO AI platform and Infrastructure team is seeking an exceptional Senior ML Pl...
Location: United States, Palo Alto
Salary: 90000.00 USD / Year
Geico (geico.com)
Expiration Date: Until further notice
Requirements
  • Bachelor’s degree in computer science, Engineering, or related technical field (or equivalent experience)
  • 8+ years of software engineering experience with focus on infrastructure, platform engineering, or MLOps
  • 3+ years of hands-on experience with machine learning infrastructure and deployment at scale
  • 2+ years of experience working with Large Language Models and transformer architectures
  • Proficient in Python; strong skills in Go, Rust, or Java preferred
  • Proven experience working with open source LLMs (Llama 2/3, Qwen, Mistral, Gemma, Code Llama, etc.)
  • Proficient in Kubernetes including custom operators, helm charts, and GPU scheduling
  • Deep expertise in Azure services (AKS, Azure ML, Container Registry, Storage, Networking)
  • Experience implementing and operating feature stores (Chronon, Feast, Tecton, Azure ML Feature Store, or custom solutions)
Job Responsibility
  • Design and implement scalable infrastructure for training, fine-tuning, and serving open source LLMs (Llama, Mistral, Gemma, etc.)
  • Architect and manage Kubernetes clusters for ML workloads, including GPU scheduling, autoscaling, and resource optimization
  • Design, implement, and maintain feature stores for ML model training and inference pipelines
  • Build and optimize LLM inference systems using frameworks like vLLM, TensorRT-LLM, and custom serving solutions
  • Ensure 99.9%+ uptime for ML platforms through robust monitoring, alerting, and incident response procedures
  • Design and implement ML platforms using DataRobot, Azure Machine Learning, Azure Kubernetes Service (AKS), and Azure Container Instances
  • Develop and maintain infrastructure using Terraform, ARM templates, and Azure DevOps
  • Implement cost-effective solutions for GPU compute, storage, and networking across Azure regions
  • Ensure ML platforms meet enterprise security standards and regulatory compliance requirements
  • Evaluate and potentially implement hybrid cloud solutions with AWS/GCP as backup or specialized use cases
What we offer
  • Comprehensive Total Rewards program that offers personalized coverage tailor-made for you and your family’s overall well-being
  • Financial benefits including market-competitive compensation, a 401K savings plan vested from day one that offers a 6% match, performance and recognition-based incentives, and tuition assistance
  • Access to additional benefits like mental healthcare as well as fertility and adoption assistance
  • Supports flexibility: we provide workplace flexibility as well as our GEICO Flex program, which offers the ability to work from anywhere in the US for up to four weeks per year
  • Fulltime