Performance Infrastructure Engineer - Data Center GPU

AMD

Location:
United States, Santa Clara

Contract Type:
Not provided

Salary:

192000.00 - 288000.00 USD / Year

Job Description:

You will be part of a small but dedicated team driving discrete GPU products’ performance attainment solutions across hardware, software and the platform. We are seeking a highly skilled engineer to join our Infrastructure team, focused on building scalable solutions for workload automation and performance analysis supporting advanced machine learning workloads.

Job Responsibility:

  • Technical team lead for a team of 5-6 engineers
  • Assess and understand the current automation and performance analysis infrastructure, identifying strengths, gaps, and opportunities for improvement
  • Collaborate with internal teams to gather technical requirements and understand evolving needs
  • Develop a forward-looking plan that balances reusing existing systems with building new infrastructure where appropriate
  • Design, develop, and maintain automation and performance analysis tooling using Python, Bash, Make, and related technologies
  • Build and enhance workflow automation solutions using internally developed tools to orchestrate ML workloads
  • Develop new techniques and tooling to optimize ML workload execution, profiling, and analysis at scale

Requirements:

  • Strong development experience in Python and/or Bash (or equivalent scripting languages)
  • Experience with GitHub, Jenkins, or similar CI/CD and code review systems
  • Linux system administration experience preferred
  • Experience developing automated test infrastructure and orchestrating multisystem workflows is preferred
  • Ansible experience is a bonus
  • Strong analytical, problem-solving, and debugging skills
  • Excellent communication skills
  • Must be a critical thinker and self-starter
  • Ability to quickly learn and apply new tools, technologies, and frameworks
  • Networking experience preferred, including common protocols and basic debugging
  • Experience with Docker/containers and/or virtualization technologies preferred
  • Motivating leader with good interpersonal skills
  • Bachelor’s degree in a Computer Engineering/Computer Science field with 9+ years of hands-on experience, or a Master’s degree with 7+ years of relevant experience

Additional Information:

Job Posted:
March 19, 2026

Similar Jobs for Performance Infrastructure Engineer - Data Center GPU

Business Development Manager – HPE POD (Modular Data Center Solutions)

Develop and grow the HPE Modular Data Center (POD) and AI infrastructure busines...
Location:
Japan, Tokyo
Salary:
Not provided
Hewlett Packard Enterprise
Expiration Date
Until further notice
Requirements
  • Bachelor’s degree in engineering, computer science, or related technical field, or equivalent industry experience
  • Typically 8+ years of professional experience in data center infrastructure, HPC environments, modular infrastructure, or enterprise technology solutions
  • Experience in business development, solution sales, or infrastructure consulting within enterprise IT, cloud, or data center industries
  • Experience engaging with senior technical and executive stakeholders on infrastructure strategy and large-scale technology investments
  • Experience supporting complex infrastructure deals involving multiple stakeholders, partners, and delivery organizations
  • Strong understanding of modern data center architecture including modular data centers and containerized infrastructure, high-density GPU and HPC environments, liquid cooling and advanced thermal management, AI infrastructure and accelerated computing platforms, enterprise and hyperscale data center operations
  • Ability to translate complex technical infrastructure concepts into business value and strategic outcomes for customers
  • Strong commercial acumen with the ability to structure large infrastructure deals and navigate enterprise procurement processes
  • Experience working across multi-technology environments including compute, networking, storage, cooling systems, and facility infrastructure
  • Ability to develop scalable infrastructure solutions that enhance performance, efficiency, and time-to-deployment for AI and HPC workloads
Job Responsibility
  • Drive business development activities for HPE POD and modular data center solutions across targeted industries including AI, education, research, and enterprise environments
  • Identify, qualify, and develop new opportunities for modular data center deployments including AI factory infrastructure, HPC clusters, GPU environments, and edge data center solutions
  • Lead engagement with customers to understand technical, operational, and business requirements for large-scale data center deployments and translate these into POD-based solutions
  • Work closely with HPE account teams, solution architects, and partners to develop end-to-end proposals including infrastructure architecture, modular facility design, and lifecycle services
  • Act as a trusted advisor to customer executives, infrastructure teams, and decision makers on modern data center architecture, capacity scaling strategies, and AI-ready infrastructure
  • Coordinate cross-functional resources including engineering, supply chain, manufacturing partners, and delivery teams to ensure solutions are feasible, scalable, and aligned with customer timelines
  • Lead or support major proposal efforts and RFP responses for modular data center solutions, including technical positioning, commercial structuring, and value articulation
  • Support the creation of detailed solution architectures including modular data center configurations, cooling strategies (air and liquid)
  • Develop and maintain relationships with strategic ecosystem partners including cooling technology providers, modular construction manufacturers, and infrastructure integrators
  • Provide market intelligence and customer feedback to influence the evolution of the HPE POD portfolio
What we offer
  • Health & Wellbeing (comprehensive suite of benefits supporting physical, financial and emotional wellbeing)
  • Personal & Professional Development (programs to help reach career goals)
  • Unconditional Inclusion
  • Fulltime

Product Manager - AI Data Center Infrastructure

Product Manager - AI Data Center Infrastructure. We are seeking a Product Line M...
Location:
India, Bangalore
Salary:
Not provided
Hewlett Packard Enterprise
Expiration Date
Until further notice
Requirements
  • 5–10+ years of experience in data center networking, AI infrastructure, or HPC environments
  • Strong hands-on experience with Juniper QFX platforms and JunOS
  • Deep understanding of GPU architectures: NVIDIA (H100/H200, GB200/GB300, NVLink/NVSwitch) and AMD (MI300/MI400, Pollara NICs, Infinity Fabric)
  • Proven expertise in scale-up GPU interconnects and scale-out Ethernet fabrics
  • Strong knowledge of RDMA/RoCEv2, ECN, PFC, and buffer management
  • Familiarity with distributed AI workloads, collective operations (NCCL, RCCL)
  • Hands-on troubleshooting experience with high-speed optics, AEC cables, link training, and NIC firmware
  • Proficiency in automation and scripting (Python, Ansible, Bash, Terraform)
Job Responsibility
  • AI Data Center & Fabric Architecture: Define product requirements for AI data center network architectures supporting thousands of GPUs
  • Develop requirements for low-latency Ethernet fabrics using Juniper QFX platforms and Apstra-based automation
  • Enable high-bandwidth GPU and NIC interconnects optimized for large-scale distributed training and inference workloads
  • GPU, NIC & Interconnect Strategy: Lead requirements definition for next-generation GPUs, NICs, and interconnect technologies, staying ahead of industry roadmaps
  • Drive alignment with NVIDIA and AMD ecosystems
  • Ensure interoperability across DAC, AEC, ACC, and optical transceivers between switches and NIC endpoints
  • Define scale-up paths using PCIe, NVLink, NVSwitch, ensuring GPU-to-GPU symmetry, consistency, and bandwidth determinism
  • Switching, Routing & Telemetry: Specify and optimize L2/L3 architectures, including EVPN-VXLAN, Class-E IPv4, and AI-optimized buffer tuning
  • Leverage hardware telemetry, streaming sensors, and analytics for proactive performance assurance
  • Drive automation using Python, Ansible, Apstra, Terraform, and related tools to enforce configuration consistency and compliance
What we offer
  • Health & Wellbeing: comprehensive suite of benefits that supports physical, financial and emotional wellbeing
  • Personal & Professional Development: specific programs catered to helping you reach any career goals
  • Unconditional Inclusion: unconditionally inclusive in the way we work and celebrate individual uniqueness

Procurement/Inventory Manager

This role is responsible for managing end-to-end strategic procurement of GPUs, ...
Location:
India, Indore
Salary:
Not provided
RackBank
Expiration Date
Until further notice
Requirements
  • Bachelor’s degree in Engineering, Supply Chain, Technology, or a related field
  • 5–8 years of procurement or strategic sourcing experience in AI infrastructure, GPU hardware, data center, server, cloud, telecom, or related technology domains
  • Strong understanding of GPU hardware, server platforms, networking equipment, storage systems, and data center infrastructure
  • Ability to understand technical specifications and connect them with commercial decisions
  • Proven experience in strategic sourcing, vendor development, supplier management, and commercial negotiations
  • Experience engaging with OEMs, distributors, manufacturers, and hardware supply chain partners
  • Exposure to global sourcing and supplier development across international markets
  • Strong negotiation capability covering price, payment terms, landed cost, warranty, delivery commitments, and contractual risk
  • Hands-on experience in import execution, shipment coordination, customs clearance, SEZ processes, documentation, and vendor follow-up
  • Strong ownership mindset with proactive supplier discovery, structured follow-up, and practical problem-solving ability
Job Responsibility
  • Manage end-to-end procurement of GPUs, servers, storage, networking hardware, racks, PDUs, cooling systems, and other data center infrastructure
  • Develop and execute sourcing strategies to ensure competitive pricing, quality, supply continuity, and timely delivery
  • Identify, evaluate, and onboard suppliers, OEMs, manufacturers, distributors, and channel partners across domestic and global markets
  • Build direct manufacturer relationships to reduce intermediary dependency and improve commercial outcomes
  • Collaborate with engineering, infrastructure, and leadership teams to translate technical requirements into effective procurement decisions
  • Evaluate hardware based on performance, compatibility, deployment fitment, power/cooling impact, support, warranty, lead time, and total commercial value
  • Optimize bill-of-materials through alternative comparisons to prevent over-specification, delays, or excessive costs
  • Engage OEMs for customized GPU, server, storage, and infrastructure configurations aligned with business needs
  • Drive custom build discussions covering specifications, pricing, delivery commitments, warranty terms, and commercial feasibility
  • Lead negotiations on pricing, payment terms, contracts, logistics, lead times, and service levels
What we offer
  • Be part of a high-trust, ownership-driven work culture
  • Gain exposure to real-world infrastructure operations and decision-making
  • Grow with the organization as we expand our data center footprint
  • Freedom to take initiative and own outcomes, not just tasks
  • Make a meaningful impact in building India’s digital infrastructure backbone
  • Fulltime

Senior Infrastructure Engineer

We are seeking a highly skilled and motivated GPU Fleet Operations Engineer to j...
Location:
United States, San Francisco; Sunnyvale
Salary:
183000.00 - 210000.00 USD / Year
Crusoe
Expiration Date
Until further notice
Requirements
  • Proven experience diagnosing and repairing high-density, rack-mounted compute hardware in production environments
  • Deep understanding of GPU architectures and hands-on experience with GPU-based systems
  • Experience supporting NVIDIA A100, H200, GB200, B200 and AMD 350X / 355X series platforms
  • Familiarity with high-speed interconnects such as InfiniBand, NVLink, and RDMA over Converged Ethernet (RoCE)
  • Strong Linux experience (Ubuntu, Rocky Linux, CentOS) using the command line for diagnostics and testing
  • Proficiency with GPU and system diagnostic tools such as NVIDIA DCGM and NVIDIA field diagnostic utilities
  • Experience working with enterprise server hardware, power delivery, and cooling systems
  • Strong analytical and problem-solving skills
  • Excellent communication and collaboration skills
  • Ability to work independently in a fast-paced data center or operations environment
Job Responsibility
  • Perform deep-level diagnosis and troubleshooting of hardware faults within GPU racks and high-density compute systems
  • Troubleshoot and support GPU platforms including NVIDIA A100, H200, GB200, B200 and AMD 350X / 355X
  • Execute component-level diagnosis and remediation for failed or degraded hardware
  • Partner with data center operations to manage and perform field-replaceable unit (FRU) repairs for GPUs, power supplies, cooling systems, interconnects, and networking hardware
  • Conduct post-repair validation, burn-in testing, torch testing, and NVIDIA NCCL testing to ensure system stability and performance
  • Implement and execute preventative maintenance procedures to improve fleet reliability and extend hardware lifespan
  • Perform firmware and BIOS upgrades across the GPU fleet
  • Maintain detailed documentation of maintenance activities, failures, and resolutions in ticketing and asset management systems
  • Develop and update standard operating procedures (SOPs) for troubleshooting, repair, and validation workflows
  • Collaborate with engineering, software, and data center operations teams to identify root causes of systemic failures and implement preventative solutions
What we offer
  • Restricted Stock Units in a fast-growing, well-funded technology company
  • Health insurance package options that include HDHP and PPO, vision, and dental for you and your dependents
  • Employer contributions to HSA accounts
  • Paid Parental Leave
  • Paid life insurance, short-term and long-term disability
  • Teladoc
  • 401(k) with a 100% match up to 4% of salary
  • Generous paid time off and holiday schedule
  • Cell phone reimbursement
  • Tuition reimbursement
  • Fulltime

Solutions Architect – Campus, DCN Switching & Routing

We are looking for a seasoned TME/Networking Solutions Architect with deep exper...
Location:
China, Beijing
Salary:
Not provided
Hewlett Packard Enterprise
Expiration Date
Until further notice
Requirements
  • Deep knowledge and hands-on experience in networking protocols: BGP, OSPF, EVPN, VXLAN, MCLAG, DRNI, ISSU, MACSec, DCI
  • Experience in Day 0 to Day 1 deployment of spine-leaf fabrics with any SDN controllers, micro segmentation, and service chaining
  • Working knowledge of automation and orchestration tools used in data center deployments
  • Familiarity with SDN controller architecture and integration with third-party services
  • Proven ability to engage with both technical and business stakeholders to design and defend high-impact networking solutions
  • Strong competitive knowledge of other vendor offerings — including campus solutions, 400G/800G switching platforms, and transceivers such as but not limited to QSFP-DD and OSFP
  • Excellent written and verbal communication skills, with the ability to create compelling documentation and technical collateral
Job Responsibility
  • Serve as a trusted technical advisor for customers across AI data centers, enterprise campus networks, and service provider environments — identifying technical requirements, resolving pain points, and showcasing HPE’s end-to-end networking capabilities
  • Architect and support AI-ready Ethernet data center deployments using leaf-spine topologies, EVPN-VXLAN overlays, and RoCEv2 fabrics optimized for GPU-based workloads
  • Lead and participate in customer-facing workshops, whiteboard sessions, and technical deep dives across campus switching, data center fabrics, and edge routing solutions
  • Conduct Proof of Concepts (PoCs) and hands-on validations to assess performance, scale, Day-0 automation, telemetry, and orchestration tools in both data center and campus environments
  • Create and maintain design guidelines, infrastructure blueprints, and best practices for performance-optimized and scalable networking deployments across AI DC, enterprise, and router use cases
  • Collaborate with pre-sales and go-to-market teams to drive solution adoption and ensure alignment with customer needs and competitive differentiators
  • Contribute to RFP/RFI responses, creating comprehensive solution documentation including Bill of Materials (BoM), redundancy and topology planning
  • Work closely with product management and engineering, providing real-world field feedback to enhance product roadmaps around automation, telemetry, security, and feature development
  • Represent HPE at industry events, AI summits, and technology forums, highlighting the value of HPE’s networking portfolio in comparison to competitors
  • Stay ahead of the curve by tracking emerging trends, analysing the competitive landscape, and influencing internal strategies for next-gen network innovation
What we offer
  • Health & Wellbeing
  • Personal & Professional Development
  • Unconditional Inclusion
  • Fulltime

Staff Strategic Sourcing Manager

Together AI is rapidly scaling its infrastructure, and we need a senior supply c...
Location:
United States, San Francisco
Salary:
220000.00 - 260000.00 USD / Year
Together AI
Expiration Date
Until further notice
Requirements
  • 7-10+ years of experience in hardware strategic sourcing, procurement, or supply chain management within data center infrastructure, cloud computing, or high-performance computing environments
  • Deep and direct experience with the full compute hardware stack (GPUs, servers, networking, storage), including expertise in major OEM/ODM supplier landscapes, semiconductor supply chain dynamics, and hands-on experience managing GPU sourcing and allocation at scale
  • Track record of personally leading and closing complex, high-value hardware deals. Experience structuring long-term supply agreements across pricing, delivery, and risk dimensions
  • Experience building supply chain models for technical hardware at scale: demand forecasting, inventory strategy, logistics coordination, and supply risk mitigation
  • Strong executive presence with the ability to partner with and influence C-level leaders, senior engineering teams, and cross-functional stakeholders. Comfortable presenting supply chain strategy and risk assessments to senior leadership
  • Advanced analytical skills with fluency in total cost of ownership modeling, financial trade-off analysis, and procurement performance metrics
  • Ability to travel to supplier sites
  • Must have recent experience in high-growth, ambiguous environments where processes are being defined rather than inherited
Job Responsibility
  • Lead the full strategic sourcing and procurement lifecycle for GPUs, servers, networking equipment, storage, and supporting components across large-scale cluster builds. Own sourcing, negotiation, contracting, delivery coordination, and acceptance
  • Negotiate and structure multi-million dollar supply agreements across several categories of hardware vendors. Secure pricing, volume commitments, lead times, and warranty terms that protect the company's cost position and supply continuity
  • Design and scale the company's hardware supply chain, including inventory planning, supplier diversification, and logistics. Build procurement infrastructure that keeps pace with rapid capacity expansion across multiple sites and geographies
  • Track GPU and data center commodity hardware markets for supply shifts, pricing dynamics, component roadmaps, and geopolitical risks. Translate market intelligence into sourcing recommendations and present findings to executive leadership. Optimize TCO
  • Own strategic supplier relationships and drive performance through regular executive reviews, joint planning, and clear accountability frameworks. Qualify and onboard new vendors as the hardware supply chain diversifies
  • Align supply chain strategy with technical roadmaps and capital plans. Provide visibility into supply chain status, risks, and investment trade-offs at the executive level
  • Stand up the supply chain tools, workflows, and reporting systems needed to manage hardware spending at scale, including cost tracking, order management, and vendor benchmarking
What we offer
  • Competitive compensation
  • Startup equity
  • Health insurance
  • Other benefits
  • Flexibility in terms of remote work
  • Fulltime

Solutions Architect

TME/Solutions Architect – DCN Switching & Solution role at Hewlett Packard Enter...
Location:
China, Beijing
Salary:
Not provided
Hewlett Packard Enterprise
Expiration Date
Until further notice
Requirements
  • Deep knowledge and hands-on experience in networking protocols: BGP, OSPF, EVPN, VXLAN, MCLAG, DRNI, ISSU, MACSec, DCI, MPLS and SDN based solutions
  • Experience in Day 0 to Day 1 deployment of spine-leaf fabrics with any SDN controllers, micro segmentation, and service chaining
  • Working knowledge of automation and orchestration tools used in data center deployments
  • Familiarity with SDN controller architecture and integration with third-party services
  • Proven ability to engage with both technical and business stakeholders to design and defend high-impact networking solutions
  • Strong competitive knowledge of other vendor offerings including 100G/400G/800G switching platforms, transceivers and cables
  • Excellent written and verbal communication skills in English
  • Good presentation and event management skills
Job Responsibility
  • Serve as a trusted technical advisor for customers across AI data center, service provider, and enterprise environments
  • Architect and support AI-ready Ethernet data center deployments using leaf-spine topologies, EVPN-VXLAN overlays, and RoCEv2 fabrics optimized for GPU-based workloads
  • Lead and participate in customer-facing workshops, whiteboard sessions, and technical deep dives across campus switching, data center fabrics, and edge routing solutions
  • Conduct Proof of Concepts (PoCs) and hands-on validations to assess performance, scale, Day-0 automation, telemetry, and orchestration tools
  • Create and maintain design guidelines, infrastructure blueprints, and best practices for performance-optimized and scalable networking deployments
  • Collaborate with pre-sales and go-to-market teams to drive solution adoption and ensure alignment with customer needs
  • Contribute to RFP/RFI responses, creating comprehensive solution documentation including Bill of Materials (BoM), redundancy and topology planning
  • Work closely with product management and engineering, providing real-world field feedback to enhance product roadmaps and feature development
  • Represent HPE at industry events, AI summits, and technology forums
  • Stay ahead of the curve by tracking emerging trends, analysing the competitive landscape, and influencing internal strategies for next-gen network innovation
What we offer
  • Health & Wellbeing benefits
  • Personal & Professional Development programs
  • Unconditional Inclusion environment
  • Comprehensive suite of benefits supporting physical, financial and emotional wellbeing
  • Fulltime

Machine Learning Infrastructure Engineer

At Boeing, we innovate and collaborate to make the world a better place. We’re c...
Location:
United States, Huntsville, Alabama
Salary:
Not provided
Boeing
Expiration Date
May 08, 2026
Requirements
  • Bachelor's degree
  • Ability to obtain a U.S. Security Clearance for which the U.S. Government requires U.S. Citizenship
  • 1+ years of experience with Linux system administration
  • 1+ years of experience developing software using Docker or Kubernetes for container-based applications
  • 1+ years of experience with computing networking/storage concepts and architecture
Job Responsibility
  • Supports Linux and Windows system administration tasks including system monitoring, patching, updates, and routine maintenance
  • Supports compliance with enterprise IT policies, cybersecurity standards, and regulatory requirements
  • Assists with deployment and support of ML Ops tooling used to manage GPU computing resources for AI/ML workloads
  • Supports management and operation of computing infrastructure used by AI and ML development teams
  • Assists in configuring and maintaining network devices (firewalls, switches) to ensure secure and reliable operations
  • Helps troubleshoot network connectivity issues, including VPN access, escalating complex issues to senior engineers as required
  • Assists in optimizing cloud infrastructure resources to improve performance, cost efficiency, and scalability
  • Supports virtualization platforms and cluster technologies, ensuring availability and performance
  • Assists with administration of distributed storage and storage networking systems
  • Supports Kubernetes cluster operations using platforms such as Rancher and OpenShift, ensuring cluster health and security
What we offer
  • Generous Paid Time Off (PTO)
  • Flexible work environment
  • Paid parental leave
  • Industry-leading retirement benefits with strong matching
  • Very generous tuition assistance for earning advanced degrees
  • Paid medical leave programs
  • Health insurance
  • Flexible spending accounts
  • Health savings accounts
  • Retirement savings plans
  • Fulltime