HPC & AI System Test Engineering Manager

United States, Chippewa Falls · 137000.00 - 315000.00 USD / Year · Job Posted December 01, 2025

Job Description

Manages a team of systems engineers for high-performance computing (HPC) server platforms, networking, storage, and software product solutions. Provides operational direction, leadership, and mentoring. Works on HPE & AI product offerings in a rapidly evolving market.

Job Responsibility

  • Provides direct and ongoing leadership for a team of individual contributors testing and validating new products, enhancements and updates
  • Manages headcount, deliverables, schedules, and costs for multiple ongoing projects
  • Communicates project status and escalates issues to direct managers, program managers, and internal and external development partners
  • Manages relationships with outsourced partners and suppliers
  • Proactively identifies opportunities for process improvement and cost reduction
  • Provides people-care management for assigned team members, including hiring, setting and monitoring of annual performance plans, coaching, and career development
  • Coordinates with third-party product vendors and engineering managers to track development issues and implement solutions

Requirements

  • First level university degree or equivalent experience required
  • May have advanced university degree
  • Typically 5 or more years of related work experience, including 0-2 years of people management experience
  • Strong leadership skills, including coaching, team building, and conflict resolution
  • Advanced project management skills including time and risk management, resource prioritization, and project structuring
  • Strong analytical and problem-solving skills
  • Ability to manage human capital across geographies to drive workforce development and achieve desired results
  • Strong verbal and written communication skills, including negotiation, presentation, and influence skills
  • Advanced business acumen, technical knowledge, and extensive knowledge in applications and technologies
  • Strong multi-tasking and prioritization skills
  • Strong communication skills (e.g. written, verbal, presentation)
  • Mastery of English and the local language
  • Good understanding of the company's policies and processes
  • Experience with Industry Standard Server, Storage, and Networking products
  • Experience with certification of major Operating Systems (OS) such as Linux (Ubuntu, RHEL, SUSE, etc.)
  • Ability to effectively manage diverse test tasks and priorities in a fast-paced, fluid environment
  • Understands new product development and test processes required for release to production

Nice to have

  • Accountability
  • Action Planning
  • Active Learning
  • Active Listening
  • Agile Methodology
  • Agile Scrum Development
  • Analytical Thinking
  • Bias
  • Coaching
  • Creativity
  • Critical Thinking
  • Cross-Functional Teamwork
  • Data Analysis Management
  • Design
  • Design Thinking
  • Empathy
  • Follow-Through
  • Group Problem Solving
  • Growth Mindset
  • Long Term Planning
  • Managing Ambiguity

What we offer

  • Health & Wellbeing benefits
  • Personal & Professional Development programs
  • Unconditional Inclusion environment

Similar Jobs for HPC & AI System Test Engineering Manager

8 matching positions

HPC & AI System Test Engineering Manager

HPC & AI System Test Engineering Manager role at Hewlett Packard Enterprise. Man...
Location: United States, Aguadilla
Salary: Not provided
Company: Hewlett Packard Enterprise (https://www.hpe.com/)
Expiration Date: Until further notice
Requirements
  • First level university degree or equivalent experience required
  • May have advanced university degree
  • Typically 10 or more years of related work experience
  • People management experience
  • Strong leadership skills including coaching, team building, and conflict resolution
  • Advanced project management skills including time and risk management, resource prioritization, and project structuring
  • Ability to manage human capital across geographies
  • Strong analytical and problem-solving skills
  • Excellent understanding of testing methodologies
  • Great understanding of hardware and software interactions
Job Responsibility
  • Provides direct and ongoing leadership for a team of individual contributors designing and developing new products, enhancements and updates
  • Manages headcount, deliverables, schedules, and costs for multiple ongoing projects
  • Communicates project status and escalates issues to direct managers, program managers, and development partners
  • Manages relationships with outsourced partners and suppliers
  • Proactively identifies opportunities for process improvement and cost reductions
  • Provides people-care management for assigned team members including hiring, performance plans, coaching, and career development
  • Writes and executes complete testing plans, protocols, and documentation
  • Works with systems engineers and development partners to develop reliable, cost effective and high-quality solutions
  • Collaborates and communicates with management regarding systems design status, project progress, and issue resolution
  • Represents the systems engineering team for all phases of larger development projects
What we offer
  • Health & Wellbeing benefits
  • Personal & Professional Development programs
  • Unconditional Inclusion environment
  • Comprehensive suite of benefits supporting physical, financial and emotional wellbeing
  • Fulltime

System Design & Debug Manager – AI Customer Engineering

This role serves as the debug execution backbone of AMD's AI Customer Engineerin...
Location: United States, Santa Clara
Salary: 186080.00 - 279120.00 USD / Year
Company: AMD (amd.com)
Expiration Date: Until further notice
Requirements
  • Deep understanding of data center system architecture (CPU, GPU, FPGA, memory, connectivity, RAS, hotplug)
  • Familiarity with hardware bring up, validation, manufacturing, and test flows
  • Knowledge of reliability and quality metrics (yield, DPM, FIT)
  • Proven years of experience in the semiconductor industry
  • Deep hands-on experience with silicon debug (pre-silicon and post-silicon)
  • Strong background in product development, debug tools, validation, failure analysis, or customer engineering
  • Proven experience managing complex debug programs across multiple customer segments
  • Strong functional team and project management skills with ability to drive execution across global, cross-functional teams
  • Excellent written and verbal communication skills, including executive-level engagement
  • Bachelor's degree in Electrical Engineering, Computer Engineering, Computer Science, or related field required
Job Responsibility
  • Debug Program Leadership - Lead debug execution across hyperscale, OEM, HPC, and enterprise customer programs. Own high-impact, cross-customer and systemic issues and maintain visibility into top risks and trends
  • Customer Program Integration - Partner with Customer Program Managers to align debug execution with customer deliverables, platform readiness, and deployment schedules. Support escalations and executive-level customer engagements
  • Technical Debug Coordination - Drive cross-functional debug efforts across design, validation, product engineering, and failure analysis. Align pre- and post-silicon debug strategies and connect lab debug to real-world customer environments
  • Field Failure & Fleet Quality Management - Lead resolution of field failures, fleet anomalies, and data center reliability issues. Aggregate fleet, RMA, and production signals and feed learnings back into design, validation, and manufacturing
  • Governance & Process Improvement - Own debug tracking, prioritization, risk management, and executive reporting. Apply structured methodologies (8D, CAPA, FMEA) and drive continuous improvement in execution speed and consistency
  • Fulltime

HPC & AI Systems Engineer for Integrated Systems Test

HPC & AI Systems Engineer for Integrated Systems Test role at Hewlett Packard En...
Location: Puerto Rico, Aguadilla
Salary: Not provided
Company: Hewlett Packard Enterprise (https://www.hpe.com/)
Expiration Date: Until further notice
Requirements
  • Bachelor's or master's degree in Computer Engineering, Computer Science, Electrical Engineering, Information Systems, or equivalent
  • Minimum 4 years of experience
  • Experience with certification and submission to OS vendors for Linux (Red Hat, SLES, Ubuntu, etc.), Windows Server and Windows Client operating systems, and VMware (ESXi)
  • Experience installing and working with Linux, Windows, and VMware operating systems
  • Experience with programming or scripting languages and databases (Python, PowerShell, Perl, Linux shell, Java, MySQL, MS SQL Server)
  • Understanding of Redfish commands, RESTful API, and JSON format
  • Knowledge of creating and using Docker containers and VMs
  • Experience in configuring Storage (internal/external storage, file systems, and raid/non-raid settings) and Networking devices (iSCSI, FCoE, IPs, VLANs, Bonding, Jumbo Frames, LAGs)
  • Knowledge of networking concepts such as NIC teaming, VLANs, IPv4, IPv6
  • Excellent written and verbal communication skills in English
Job Responsibility
  • Work with Program & Product Management, technical leads, and product development teams to obtain product feature requirements
  • Design and implement new test features in existing and new test cases
  • Analyze, debug and provide feedback/resolution on issues uncovered by test team prior to submission of results to OS vendors for approval
  • Implement software solutions for multiple test programs/projects with internal and outsourced development partners
  • Review and evaluate the implementation and use of test automation and test tools
  • Plan, develop, and implement software tools for testing and evaluating current and next-generation HPE HPC products
  • Debug and analyze issues to a successful resolution
  • Perform testing in local and remote labs
  • Drive appropriate automated test execution to test engineers at various global locations
  • Provide training and guidance to test teams both onshore and offshore
What we offer
  • Health & Wellbeing benefits
  • Personal & Professional Development programs
  • Unconditional Inclusion environment
  • Comprehensive suite of benefits that supports physical, financial and emotional wellbeing
  • Fulltime

Product Manager - AI Data Center Infrastructure

Product Manager - AI Data Center Infrastructure. We are seeking a Product Line M...
Location: India, Bangalore
Salary: Not provided
Company: Hewlett Packard Enterprise (https://www.hpe.com/)
Expiration Date: Until further notice
Requirements
  • 5–10+ years of experience in data center networking, AI infrastructure, or HPC environments
  • Strong hands-on experience with Juniper QFX platforms and JunOS
  • Deep understanding of GPU architectures: NVIDIA (H100/H200, GB200/GB300, NVLink/NVSwitch) and AMD (MI300/MI400, Pollara NICs, Infinity Fabric)
  • Proven expertise in scale-up GPU interconnects and scale-out Ethernet fabrics
  • Strong knowledge of RDMA/RoCEv2, ECN, PFC, and buffer management
  • Familiarity with distributed AI workloads, collective operations (NCCL, RCCL)
  • Hands-on troubleshooting experience with high-speed optics, AEC cables, link training, and NIC firmware
  • Proficiency in automation and scripting (Python, Ansible, Bash, Terraform)
Job Responsibility
  • AI Data Center & Fabric Architecture: Define product requirements for AI data center network architectures supporting thousands of GPUs
  • Develop requirements for low-latency Ethernet fabrics using Juniper QFX platforms and Apstra-based automation
  • Enable high-bandwidth GPU and NIC interconnects optimized for large-scale distributed training and inference workloads
  • GPU, NIC & Interconnect Strategy: Lead requirements definition for next-generation GPUs, NICs, and interconnect technologies, staying ahead of industry roadmaps
  • Drive alignment with NVIDIA and AMD ecosystems
  • Ensure interoperability across DAC, AEC, ACC, and optical transceivers between switches and NIC endpoints
  • Define scale-up paths using PCIe, NVLink, NVSwitch, ensuring GPU-to-GPU symmetry, consistency, and bandwidth determinism
  • Switching, Routing & Telemetry: Specify and optimize L2/L3 architectures, including EVPN-VXLAN, Class-E IPv4, and AI-optimized buffer tuning
  • Leverage hardware telemetry, streaming sensors, and analytics for proactive performance assurance
  • Drive automation using Python, Ansible, Apstra, Terraform, and related tools to enforce configuration consistency and compliance
What we offer
  • Health & Wellbeing: comprehensive suite of benefits that supports physical, financial and emotional wellbeing
  • Personal & Professional Development: specific programs catered to helping you reach any career goals
  • Unconditional Inclusion: unconditionally inclusive in the way we work and celebrate individual uniqueness

Senior Manager, AI Infrastructure and Operations

The Sr. Manager/Staff Engineer, AI Infrastructure & MLOps Engineering is a senio...
Location: Japan, Tokyo
Salary: Not provided
Company: Pfizer (pfizer.de)
Expiration Date: Until further notice
Requirements
  • 8+ years of hands-on software engineering experience in cloud infrastructure, DevOps, and MLOps
  • Deep expertise in Python, Kubernetes, Terraform, Helm, and CI/CD pipeline development
  • Proven experience architecting and operating containerized solutions on AWS, GCP, and Azure
  • Strong knowledge of Infrastructure-as-Code, distributed systems, and production system reliability
  • Bachelor’s or Master’s degree in Computer Science, Engineering, or related field
Job Responsibility
  • Design, implement, and own large-scale cloud-based HPC and MLOps platforms supporting AI model training, genomic sequencing, and precision medicine
  • Architect multi-environment clusters (AWS, GCP, Azure), enabling GPU/FPGA workloads and advanced observability
  • Lead the development of developer and cloud platforms, including internal engineering accelerators and reusable toolsets
  • Design, implement, and manage unified platform catalogs using Backstage, enhancing developer experience and application metadata management
  • Develop custom plugins and APIs for Backstage to support internal engineering workflows and documentation
  • Build and maintain Python-based automation frameworks, CI/CD pipelines, and Infrastructure-as-Code (Terraform, Helm, Pulumi, AWS CDK)
  • Operationalize containerized solutions using Docker and Kubernetes, integrating MLflow, Kubeflow, and other orchestration platforms
  • Implement robust automation for provisioning, configuring, and managing cloud resources across multiple environments
  • Lead the implementation of Service Level Indicators (SLIs), Service Level Objectives (SLOs), and advanced observability (Prometheus, Grafana, PagerDuty)
  • Develop and maintain APIs and services for model management, feature stores, and inference pipelines
  • Fulltime

HPC AI Electrical Engineer

Designs, analyzes, develops, modifies and evaluates electrical/electronic parts,...
Location: United States, Spring
Salary: 92600.00 - 213500.00 USD / Year
Company: Hewlett Packard Enterprise (https://www.hpe.com/)
Expiration Date: Until further notice
Requirements
  • Bachelor's or Master's degree in Electrical Engineering
  • Typically 4-6 years experience
  • Experience with lab equipment - oscilloscopes, TDR, power supplies, grounding schemes, logic analyzer, data recorders
  • Linux - how networking/software/firmware/hardware interact
  • Report writing and data analysis and research - comparing measured data to specifications
  • Reading schematics and PCB layout files
  • Understanding of signal integrity and how measurement equipment can affect signal integrity
  • Using electrical design tools and software packages
  • Strong analytical and problem solving skills
  • Designing electronic components, integrated circuitry, and algorithms
Job Responsibility
  • Designs engineering solutions for electrical and electronic parts, subsystems, integrated circuitry, and algorithms based on established engineering principles
  • Develops and implements parameters and test plans for new and existing designs, including validation of tolerances, form/fit/function, shock and vibration, electromagnetic interference, safety, reliability, thermal generation, and system power measurements
  • Collaborates and communicates with management, internal, and outsourced development partners regarding design status, project progress, and issue resolution
  • Leads a project team of other electrical hardware engineers and internal and outsourced development partners to develop reliable, cost effective and high quality solutions for moderately-complex products
  • Represents the electrical hardware team for all phases of larger and more-complex development projects
  • Provides guidance and mentoring to less-experienced staff members
What we offer
  • Health & Wellbeing
  • Personal & Professional Development
  • Unconditional Inclusion
  • Fulltime

HPC AI Electrical Engineer

Designs, analyzes, develops, modifies and evaluates electrical/electronic parts,...
Location: United States, Spring
Salary: 119500.00 - 275000.00 USD / Year
Company: Hewlett Packard Enterprise (https://www.hpe.com/)
Expiration Date: Until further notice
Requirements
  • Bachelor's or Master's degree in Electrical Engineering
  • Typically 6-10 years experience
  • Knowledgeable of full board design process
  • Contribute and drive architecture
  • Design to established specifications as well as have flexibility to create new solutions
  • Ability to identify HW, FW, and SW issues and assign to correct engineering teams
  • Schematic capture and review experience
  • Layout review experience
  • Signal and Power integrity knowledge
  • Understands “Gerber Release” process from a board designer standpoint
Job Responsibility
  • Leads multiple project teams of other electrical hardware engineers and internal and outsourced development partners responsible for all stages of electrical hardware design and development for complex products, solutions, and platforms, including design, validation, tooling and testing
  • Manages and expands relationships with internal and outsourced development partners on electrical hardware design and development
  • Reviews and evaluates designs and project activities for compliance with technology and development guidelines and standards
  • Provides tangible feedback to improve product quality
  • Provides domain-specific expertise and overall electrical/electronic hardware and platform leadership and perspective to cross-organization projects, programs, and activities
  • Drives innovation and integration of new technologies into projects and activities in the electrical hardware design organization
  • Provides guidance and mentoring to less-experienced staff members
What we offer
  • Health & Wellbeing
  • Personal & Professional Development
  • Unconditional Inclusion
  • Fulltime

Developer Experience Engineer

We are looking for a Developer Experience Engineer to enhance developer producti...
Location: United States, San Jose
Salary: Not provided
Company: Etched (etched.com)
Expiration Date: Until further notice
Requirements
  • Strong Python skills for automation, scripting, and infrastructure development
  • Experience with Slurm job scheduling in an HPC or hybrid environment
  • Hands-on experience with observability and monitoring tools like Prometheus, Grafana, and OpenTelemetry
  • Expertise with Docker and Kubernetes, including Helm charts and cluster management
  • Proficiency in modern CI/CD pipeline management with tools like GitHub Actions, Jenkins, or Buildkite
  • Experience with infrastructure-as-code tools like Terraform or Ansible
  • Knowledge of cloud infrastructure, compute, and storage optimization on AWS or GCP
Job Responsibility
  • Develop and maintain automation tools to streamline development, testing, and deployment workflows
  • Optimize and manage Slurm-based job scheduling for AI workloads, simulation, and chip design workflows
  • Build observability solutions using Grafana, Prometheus, and OpenTelemetry for monitoring pipelines, infrastructure, and compute clusters
  • Manage and optimize containerized environments using Docker and Kubernetes to enhance scalability and reproducibility
  • Enhance build, test, and deployment pipelines with CI/CD tools like GitHub Actions, Jenkins, Buildkite, or Bazel
  • Develop caching and artifact management systems to reduce build times and improve dependency resolution
  • Integrate and manage cloud resources (AWS, GCP) for scaling compute, storage, and hybrid workloads
  • Support security and compliance efforts including secrets management and access control
  • Document and share best practices for efficient developer tooling and workflows
What we offer
  • Full medical, dental, and vision packages, with generous premium coverage
  • Housing subsidy of $2,000/month for those living within walking distance of the office
  • Daily lunch and dinner in our office
  • Relocation support for those moving to West San Jose
  • Unlimited compute budget subject to ROI justification
  • Fulltime