Performance Infrastructure Engineer - Data Center GPU

AMD

Location:
United States, Santa Clara

Contract Type:
Not provided

Salary:
192000.00 - 288000.00 USD / Year

Job Description:

You will be part of a small but dedicated team driving performance attainment solutions for discrete GPU products across hardware, software, and the platform. We are seeking a highly skilled engineer to join our Infrastructure team, focused on building scalable solutions for workload automation and performance analysis in support of advanced machine learning workloads.

Job Responsibility:

  • Technical team lead for a team of 5-6 engineers
  • Assess and understand the current automation and performance analysis infrastructure, identifying strengths, gaps, and opportunities for improvement
  • Collaborate with internal teams to gather technical requirements and understand evolving needs
  • Develop a forward-looking plan that balances reusing existing systems with building new infrastructure where appropriate
  • Design, develop, and maintain automation and performance analysis tooling using Python, Bash, Make, and related technologies
  • Build and enhance workflow automation solutions using internally developed tools to orchestrate ML workloads
  • Develop new techniques and tooling to optimize ML workload execution, profiling, and analysis at scale
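The responsibilities above center on Python-based automation for running ML workloads and measuring their performance. As a purely illustrative sketch (the function name, command, and report fields are assumptions, not tooling named in the posting), a minimal harness might time repeated runs of a workload and emit a machine-readable summary:

```python
import json
import subprocess
import sys
import time

def run_workload(cmd, runs=3):
    """Time repeated runs of a workload command and summarize the results.

    `cmd` is a list of arguments; everything here is an illustrative
    stand-in, not tooling described in the posting.
    """
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(cmd, check=True)  # raises if the workload fails
        timings.append(time.perf_counter() - start)
    return {
        "cmd": " ".join(cmd),
        "runs": runs,
        "min_s": min(timings),
        "mean_s": sum(timings) / len(timings),
    }

if __name__ == "__main__":
    # Use a trivial no-op workload so the sketch runs anywhere.
    report = run_workload([sys.executable, "-c", "pass"], runs=2)
    print(json.dumps(report, indent=2))
```

In practice a harness like this would wrap profiling and analysis steps around the timed run; the sketch shows only the orchestration-and-measurement skeleton.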

Requirements:

  • Strong development experience in Python and/or Bash (or equivalent scripting languages)
  • Experience with GitHub, Jenkins, or similar CI/CD and code review systems
  • Linux system administration experience preferred
  • Experience developing automated test infrastructure and orchestrating multisystem workflows is preferred
  • Ansible experience is a bonus
  • Strong analytical, problem solving, and debugging skills
  • Excellent communication skills
  • Must be a critical thinker and a self-starter
  • Ability to quickly learn and apply new tools, technologies, and frameworks
  • Networking experience preferred, including common protocols and basic debugging
  • Experience with Docker/containers and/or virtualization technologies preferred
  • Motivating leader with good interpersonal skills
  • Bachelor’s degree in a Computer Engineering/Computer Science field with 9+ years of hands-on experience, or a Master’s degree with 7+ years of relevant experience

Additional Information:

Job Posted:
March 19, 2026

Similar Jobs for Performance Infrastructure Engineer - Data Center GPU

Product Manager - AI Data Center Infrastructure

Product Manager - AI Data Center Infrastructure. We are seeking a Product Line M...
Location:
India, Bangalore
Salary:
Not provided
Hewlett Packard Enterprise
Expiration Date:
Until further notice
Requirements:
  • 5–10+ years of experience in data center networking, AI infrastructure, or HPC environments
  • Strong hands-on experience with Juniper QFX platforms and JunOS
  • Deep understanding of GPU architectures: NVIDIA (H100/H200, GB200/GB300, NVLink/NVSwitch) and AMD (MI300/MI400, Pollara NICs, Infinity Fabric)
  • Proven expertise in scale-up GPU interconnects and scale-out Ethernet fabrics
  • Strong knowledge of RDMA/RoCEv2, ECN, PFC, and buffer management
  • Familiarity with distributed AI workloads, collective operations (NCCL, RCCL)
  • Hands-on troubleshooting experience with high-speed optics, AEC cables, link training, and NIC firmware
  • Proficiency in automation and scripting (Python, Ansible, Bash, Terraform)
Job Responsibility:
  • AI Data Center & Fabric Architecture: Define product requirements for AI data center network architectures supporting thousands of GPUs
  • Develop requirements for low-latency Ethernet fabrics using Juniper QFX platforms and Apstra-based automation
  • Enable high-bandwidth GPU and NIC interconnects optimized for large-scale distributed training and inference workloads
  • GPU, NIC & Interconnect Strategy: Lead requirements definition for next-generation GPUs, NICs, and interconnect technologies, staying ahead of industry roadmaps
  • Drive alignment with NVIDIA and AMD ecosystems
  • Ensure interoperability across DAC, AEC, ACC, and optical transceivers between switches and NIC endpoints
  • Define scale-up paths using PCIe, NVLink, NVSwitch, ensuring GPU-to-GPU symmetry, consistency, and bandwidth determinism
  • Switching, Routing & Telemetry: Specify and optimize L2/L3 architectures, including EVPN-VXLAN, Class-E IPv4, and AI-optimized buffer tuning
  • Leverage hardware telemetry, streaming sensors, and analytics for proactive performance assurance
  • Drive automation using Python, Ansible, Apstra, Terraform, and related tools to enforce configuration consistency and compliance
What we offer:
  • Health & Wellbeing: comprehensive suite of benefits that supports physical, financial and emotional wellbeing
  • Personal & Professional Development: specific programs catered to helping you reach any career goals
  • Unconditional Inclusion: unconditionally inclusive in the way we work and celebrate individual uniqueness

Senior Infrastructure Engineer

We are seeking a highly skilled and motivated GPU Fleet Operations Engineer to j...
Location:
United States, San Francisco; Sunnyvale
Salary:
183000.00 - 210000.00 USD / Year
Crusoe
Expiration Date:
Until further notice
Requirements:
  • Proven experience diagnosing and repairing high-density, rack-mounted compute hardware in production environments
  • Deep understanding of GPU architectures and hands-on experience with GPU-based systems
  • Experience supporting NVIDIA A100, H200, GB200, B200 and AMD 350X / 355X series platforms
  • Familiarity with high-speed interconnects such as InfiniBand, NVLink, and RDMA over Converged Ethernet (RoCE)
  • Strong Linux experience (Ubuntu, Rocky Linux, CentOS) using the command line for diagnostics and testing
  • Proficiency with GPU and system diagnostic tools such as NVIDIA DCGM and NVIDIA field diagnostic utilities
  • Experience working with enterprise server hardware, power delivery, and cooling systems
  • Strong analytical and problem-solving skills
  • Excellent communication and collaboration skills
  • Ability to work independently in a fast-paced data center or operations environment
Job Responsibility:
  • Perform deep-level diagnosis and troubleshooting of hardware faults within GPU racks and high-density compute systems
  • Troubleshoot and support GPU platforms including NVIDIA A100, H200, GB200, B200 and AMD 350X / 355X
  • Execute component-level diagnosis and remediation for failed or degraded hardware
  • Partner with data center operations to manage and perform field-replaceable unit (FRU) repairs for GPUs, power supplies, cooling systems, interconnects, and networking hardware
  • Conduct post-repair validation, burn-in testing, torch testing, and NVIDIA NCCL testing to ensure system stability and performance
  • Implement and execute preventative maintenance procedures to improve fleet reliability and extend hardware lifespan
  • Perform firmware and BIOS upgrades across the GPU fleet
  • Maintain detailed documentation of maintenance activities, failures, and resolutions in ticketing and asset management systems
  • Develop and update standard operating procedures (SOPs) for troubleshooting, repair, and validation workflows
  • Collaborate with engineering, software, and data center operations teams to identify root causes of systemic failures and implement preventative solutions
What we offer:
  • Restricted Stock Units in a fast growing, well-funded technology company
  • Health insurance package options that include HDHP and PPO, vision, and dental for you and your dependents
  • Employer contributions to HSA accounts
  • Paid Parental Leave
  • Paid life insurance, short-term and long-term disability
  • Teladoc
  • 401(k) with a 100% match up to 4% of salary
  • Generous paid time off and holiday schedule
  • Cell phone reimbursement
  • Tuition reimbursement
Contract Type:
Fulltime

Solutions Architect – Campus, DCN Switching & Routing

We are looking for a seasoned TME/Networking Solutions Architect with deep exper...
Location:
China, Beijing
Salary:
Not provided
Hewlett Packard Enterprise
Expiration Date:
Until further notice
Requirements:
  • Deep knowledge and hands-on experience in networking protocols: BGP, OSPF, EVPN, VXLAN, MCLAG, DRNI, ISSU, MACSec, DCI
  • Experience in Day 0 to Day 1 deployment of spine-leaf fabrics with any SDN controllers, micro segmentation, and service chaining
  • Working knowledge of automation and orchestration tools used in data center deployments
  • Familiarity with SDN controller architecture and integration with third-party services
  • Proven ability to engage with both technical and business stakeholders to design and defend high-impact networking solutions
  • Strong competitive knowledge of other vendor offerings — including campus solutions, 400G/800G switching platforms, and transceivers such as but not limited to QSFP-DD and OSFP
  • Excellent written and verbal communication skills
  • Ability to create compelling documentation and technical collateral
Job Responsibility:
  • Serve as a trusted technical advisor for customers across AI data centers, enterprise campus networks, and service provider environments — identifying technical requirements, resolving pain points, and showcasing HPE’s end-to-end networking capabilities
  • Architect and support AI-ready Ethernet data center deployments using leaf-spine topologies, EVPN-VXLAN overlays, and RoCEv2 fabrics optimized for GPU-based workloads
  • Lead and participate in customer-facing workshops, whiteboard sessions, and technical deep dives across campus switching, data center fabrics, and edge routing solutions
  • Conduct Proof of Concepts (PoCs) and hands-on validations to assess performance, scale, Day-0 automation, telemetry, and orchestration tools in both data center and campus environments
  • Create and maintain design guidelines, infrastructure blueprints, and best practices for performance-optimized and scalable networking deployments across AI DC, enterprise, and routers use cases
  • Collaborate with pre-sales and go-to-market teams to drive solution adoption and ensure alignment with customer needs and competitive differentiators
  • Contribute to RFP/RFI responses, creating comprehensive solution documentation including Bill of Materials (BoM), redundancy and topology planning
  • Work closely with product management and engineering, providing real-world field feedback to enhance product roadmaps around automation, telemetry, security, and feature development
  • Represent HPE at industry events, AI summits, and technology forums, highlighting the value of HPE’s networking portfolio in comparison to competitors
  • Stay ahead of the curve by tracking emerging trends, analysing the competitive landscape, and influencing internal strategies for next-gen network innovation
What we offer:
  • Health & Wellbeing
  • Personal & Professional Development
  • Unconditional Inclusion
Contract Type:
Fulltime

Staff Strategic Sourcing Manager

Together AI is rapidly scaling its infrastructure, and we need a senior supply c...
Location:
United States, San Francisco
Salary:
220000.00 - 260000.00 USD / Year
Together AI
Expiration Date:
Until further notice
Requirements:
  • 7-10+ years of experience in hardware strategic sourcing, procurement, or supply chain management within data center infrastructure, cloud computing, or high-performance computing environments
  • Deep and direct experience with the full compute hardware stack (GPUs, servers, networking, storage), including expertise in major OEM/ODM supplier landscapes, semiconductor supply chain dynamics, and hands-on experience managing GPU sourcing and allocation at scale
  • Track record of personally leading and closing complex, high-value hardware deals. Experience structuring long-term supply agreements across pricing, delivery, and risk dimensions
  • Experience building supply chain models for technical hardware at scale: demand forecasting, inventory strategy, logistics coordination, and supply risk mitigation
  • Strong executive presence with the ability to partner with and influence C-level leaders, senior engineering teams, and cross-functional stakeholders. Comfortable presenting supply chain strategy and risk assessments to senior leadership
  • Advanced analytical skills with fluency in total cost of ownership modeling, financial trade-off analysis, and procurement performance metrics
  • Ability to travel to supplier sites
  • Must have recent experience in high-growth, ambiguous environments where processes are being defined rather than inherited
Job Responsibility:
  • Lead the full strategic sourcing and procurement lifecycle for GPUs, servers, networking equipment, storage, and supporting components across large-scale cluster builds. Own sourcing, negotiation, contracting, delivery coordination, and acceptance
  • Negotiate and structure multi-million dollar supply agreements across several categories of hardware vendors. Secure pricing, volume commitments, lead times, and warranty terms that protect the company's cost position and supply continuity
  • Design and scale the company's hardware supply chain, including inventory planning, supplier diversification, and logistics. Build procurement infrastructure that keeps pace with rapid capacity expansion across multiple sites and geographies
  • Track GPU and data center commodity hardware markets for supply shifts, pricing dynamics, component roadmaps, and geopolitical risks. Translate market intelligence into sourcing recommendations and present findings to executive leadership. Optimize TCO
  • Own strategic supplier relationships and drive performance through regular executive reviews, joint planning, and clear accountability frameworks. Qualify and onboard new vendors as the hardware supply chain diversifies
  • Align supply chain strategy with technical roadmaps and capital plans. Provide visibility into supply chain status, risks, and investment trade-offs at the executive level
  • Stand up the supply chain tools, workflows, and reporting systems needed to manage hardware spending at scale, including cost tracking, order management, and vendor benchmarking
What we offer:
  • Competitive compensation
  • Startup equity
  • Health insurance
  • Other benefits
  • Flexibility in terms of remote work
Contract Type:
Fulltime

Solutions Architect

TME/Solutions Architect – DCN Switching & Solution role at Hewlett Packard Enter...
Location:
China, Beijing
Salary:
Not provided
Hewlett Packard Enterprise
Expiration Date:
Until further notice
Requirements:
  • Deep knowledge and hands-on experience in networking protocols: BGP, OSPF, EVPN, VXLAN, MCLAG, DRNI, ISSU, MACSec, DCI, MPLS and SDN based solutions
  • Experience in Day 0 to Day 1 deployment of spine-leaf fabrics with any SDN controllers, micro segmentation, and service chaining
  • Working knowledge of automation and orchestration tools used in data center deployments
  • Familiarity with SDN controller architecture and integration with third-party services
  • Proven ability to engage with both technical and business stakeholders to design and defend high-impact networking solutions
  • Strong competitive knowledge of other vendor offerings including 100G/400G/800G switching platforms, transceivers and cables
  • Excellent written and verbal communication skills in English
  • Good presentation and event management skills
Job Responsibility:
  • Serve as a trusted technical advisor for customers across AI data centers, and service provider and enterprise environments
  • Architect and support AI-ready Ethernet data center deployments using leaf-spine topologies, EVPN-VXLAN overlays, and RoCEv2 fabrics optimized for GPU-based workloads
  • Lead and participate in customer-facing workshops, whiteboard sessions, and technical deep dives across campus switching, data center fabrics, and edge routing solutions
  • Conduct Proof of Concepts (PoCs) and hands-on validations to assess performance, scale, Day-0 automation, telemetry, and orchestration tools
  • Create and maintain design guidelines, infrastructure blueprints, and best practices for performance-optimized and scalable networking deployments
  • Collaborate with pre-sales and go-to-market teams to drive solution adoption and ensure alignment with customer needs
  • Contribute to RFP/RFI responses, creating comprehensive solution documentation including Bill of Materials (BoM), redundancy and topology planning
  • Work closely with product management and engineering, providing real-world field feedback to enhance product roadmaps and feature development
  • Represent HPE at industry events, AI summits, and technology forums
  • Stay ahead of the curve by tracking emerging trends, analysing the competitive landscape, and influencing internal strategies for next-gen network innovation
What we offer:
  • Health & Wellbeing benefits
  • Personal & Professional Development programs
  • Unconditional Inclusion environment
  • Comprehensive suite of benefits supporting physical, financial and emotional wellbeing
Contract Type:
Fulltime

Compute Partnerships Lead

We are looking for a Compute Partnerships Lead to architect and operate our glob...
Location:
United States, San Francisco
Salary:
Not provided
Prime Intellect
Expiration Date:
Until further notice
Requirements:
  • 3–7+ years in infrastructure partnerships, business development, commercial sourcing, or AI infrastructure strategy
  • Direct experience negotiating GPU/cloud/data center agreements strongly preferred
  • Strong understanding of AI workloads (training vs inference, memory constraints, networking, utilization economics)
  • Experience working cross-functionally with engineering and finance
  • High commercial discipline — comfortable modeling margin, utilization, and contract tradeoffs
  • Comfortable operating in constrained supply environments
  • Strong ownership mentality — you build systems, not just deals
  • Ability to travel and manage global partnerships across time zones
Job Responsibility:
  • Develop and execute Prime Intellect’s global GPU sourcing strategy across H100/H200/B200-class infrastructure and beyond
  • Structure commercial agreements that balance cost, flexibility, term length, and growth optionality
  • Identify and evaluate infrastructure partners across hyperscalers, specialized AI clouds, data centers, colocation providers, and hardware vendors
  • Lead negotiations on pricing, SLAs, capacity reservations, expansion rights, and risk allocation
  • Continuously optimize blended gross margins through disciplined sourcing and contract structuring
  • Secure capacity for internal frontier RL research and model training
  • Coordinate closely with research and engineering teams to understand workload requirements (training vs inference vs long-context deployments)
  • Align capacity planning with enterprise deployment roadmaps
  • Ensure compute supply keeps pace with customer expansion and new model launches
  • Work with infrastructure, platform, and DevOps teams to ensure partner capacity is onboarded efficiently and runs reliably in production
What we offer:
  • Competitive Compensation + equity incentives
  • Flexible Work (remote or San Francisco)
  • Visa Sponsorship and relocation support
  • Professional Development budget
  • Team off-sites and conference attendance
  • Opportunity to shape decentralized AI at Prime Intellect
Contract Type:
Fulltime

Principal Engineer

The Senior Data Center Operations Engineer is responsible for the bedrock of our...
Location:
United States, Santa Clara
Salary:
147000.00 - 237500.00 USD / Year
Palo Alto Networks
Expiration Date:
Until further notice
Requirements:
  • Bachelor's degree in Computer Science, IT, or equivalent experience
  • 5+ years of experience specifically operating Red Hat OpenShift (OCP) in a production environment
  • Deep experience racking/stacking and cabling high-density GPU systems (e.g., NVIDIA DGX or similar) and specialized AI/ML hardware
  • Advanced proficiency in Ansible or Pulumi for automating bare-metal provisioning and cluster configuration
  • Strong Python and Bash skills for developing custom health-check scripts and API integrations
  • Expert-level CoreOS and RHEL administration, including kernel tuning and systemd management
  • Solid understanding of BGP, VLAN tagging, LACP, and Load Balancing (F5/NGINX) essential for cluster ingress
  • Experience with vSphere or KVM, and persistent storage solutions like OpenShift Data Foundation (ODF) or Ceph
  • Familiarity with DCIM tools (Netbox) and monitoring stacks (ELK, Loki, etc.)
  • Ability to lift and move equipment up to 50 pounds (e.g., high-density 2U/4U servers)
Job Responsibility:
  • Design and development of a scalable distributed management plane infrastructure to manage Palo Alto Networks’ next-generation network security solutions
  • Ensure 99.99% availability by architecting resilient physical layouts and automating the deployment, scaling, and self-healing capabilities of our production clusters
  • Monitor and maintain data center systems with a focus on 'Zero Single Point of Failure' (ZSPoF) architecture for OpenShift control planes and worker nodes
  • Implement and manage OpenShift 4.x clusters across multiple power and cooling zones to ensure 99.99% uptime
  • Design, test, and execute automated failover strategies and backup/restore procedures using tools like OADP (Velero) and Red Hat ACM
  • Perform routine maintenance and upgrades using GitOps (ArgoCD) and the Machine Config Operator to ensure zero-downtime node evacuations and patching
  • Resolve deep-stack hardware and software issues, from faulty GPU firmware to OpenShift SDN (OVN-Kubernetes) network latencies
  • Coordinate with vendors for specialized hardware (e.g., NVIDIA, Dell, Cisco) while maintaining strict security and firmware compliance
  • Optimize rack density for high-performance GPU clusters while managing thermal loads and power distribution (PDU) to prevent circuit-trip outages
  • Maintain accurate documentation and integrate hardware health metrics (IPMI/SNMP) into Prometheus/Grafana for proactive alerting
Contract Type:
Fulltime

Senior Software Engineer - Together Cloud Infrastructure

Together AI is building the AI Acceleration Cloud, an end-to-end platform for th...
Location:
United States, San Francisco
Salary:
160000.00 - 230000.00 USD / Year
Together AI
Expiration Date:
Until further notice
Requirements:
  • 5+ years of professional software development experience and proficiency in at least one backend programming language (Golang desired)
  • 5+ years experience writing high-performance, well-tested, production quality code
  • Demonstrated experience with building and operating high-performance and/or globally distributed micro-service architectures across one or more cloud providers (AWS, Azure, GCP)
  • Excellent communication skills – able to write clear design docs and work effectively with both technical and non-technical team members
  • Deep experience with Kubernetes internals is a big plus, such as implementing non-trivial Kubernetes operators, device/storage/network plugins, custom schedulers, or patches to Kubernetes itself
  • Deep experience with VMs/hypervisors a big plus, such as QEMU/KVM, cloud-hypervisor, VFIO, virtio, PCIE passthrough, Kubevirt, SR-IOV
  • Deep experience with DC networking tech + solutions a big plus, such as VLAN, VXLAN, VPN, VPC, OVS/OVN
  • Experience with Cluster API or similar a big plus
  • Experience working on high-performance compute, networking, and/or storage a big plus
  • Experience virtualizing GPUs and/or Infiniband a big plus
Job Responsibility:
  • Design, build, and maintain performant, secure, and highly-available backend services/operators that run in our data centers and automate hardware management, such as Infiniband partitioning, in-DC parallel storage provisioning, and VM provisioning
  • Design and build out the IaaS software layer for a new GB200 data center with thousands of GPUs
  • Work on a global multi-exabyte high-performance object store, serving massive datasets for pretraining
  • Build advanced observability stacks for our customers with automated node lifecycle management for fault-tolerant distributed pretraining
  • Perform architecture and research work for decentralized AI workloads
  • Work on the core, open-source Together AI platform
  • Create services, tools, and developer documentation
  • Create testing frameworks for robustness and fault-tolerance
What we offer:
  • Competitive compensation
  • Startup equity
  • Health insurance
  • Other benefits
  • Flexibility in terms of remote work
Contract Type:
Fulltime