Performance Infrastructure Engineer - Data Center GPU

AMD

Location:
United States, Santa Clara

Contract Type:
Not provided

Salary:
192000.00 - 288000.00 USD / Year

Job Description:

You will be part of a small but dedicated team driving performance attainment solutions for discrete GPU products across hardware, software, and the platform. We are seeking a highly skilled engineer to join our Infrastructure team, focused on building scalable solutions for workload automation and performance analysis in support of advanced machine learning workloads.

Job Responsibility:

  • Technical team lead for a team of 5-6 engineers
  • Assess and understand the current automation and performance analysis infrastructure, identifying strengths, gaps, and opportunities for improvement
  • Collaborate with internal teams to gather technical requirements and understand evolving needs
  • Develop a forward-looking plan that balances reusing existing systems with building new infrastructure where appropriate
  • Design, develop, and maintain automation and performance analysis tooling using Python, Bash, Make, and related technologies
  • Build and enhance workflow automation solutions using internally developed tools to orchestrate ML workloads
  • Develop new techniques and tooling to optimize ML workload execution, profiling, and analysis at scale
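The responsibilities above center on Python-based automation for running ML workloads and measuring their performance. As a purely illustrative sketch (the function name, command, and report fields are assumptions, not tooling named in the posting), a minimal harness might time repeated runs of a workload and emit a machine-readable summary:

```python
import json
import subprocess
import sys
import time

def run_workload(cmd, runs=3):
    """Time repeated runs of a workload command and summarize the results.

    `cmd` is a list of arguments; everything here is an illustrative
    stand-in, not tooling described in the posting.
    """
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(cmd, check=True)  # raises if the workload fails
        timings.append(time.perf_counter() - start)
    return {
        "cmd": " ".join(cmd),
        "runs": runs,
        "min_s": min(timings),
        "mean_s": sum(timings) / len(timings),
    }

if __name__ == "__main__":
    # Use a trivial no-op workload so the sketch runs anywhere.
    report = run_workload([sys.executable, "-c", "pass"], runs=2)
    print(json.dumps(report, indent=2))
```

In practice a harness like this would wrap profiling and analysis steps around the timed run; the sketch shows only the orchestration-and-measurement skeleton.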

Requirements:

  • Strong development experience in Python and/or Bash (or equivalent scripting languages)
  • Experience with GitHub, Jenkins, or similar CI/CD and code review systems
  • Linux system administration experience preferred
  • Experience developing automated test infrastructure and orchestrating multisystem workflows is preferred
  • Ansible experience is a bonus
  • Strong analytical, problem solving, and debugging skills
  • Excellent communication skills
  • Must be a critical thinker and a self-starter
  • Ability to quickly learn and apply new tools, technologies, and frameworks
  • Networking experience preferred, including common protocols and basic debugging
  • Experience with Docker/containers and/or virtualization technologies preferred
  • Motivating leader with good interpersonal skills
  • Bachelor’s degree in a Computer Engineering/Computer Science field with 9+ years of hands-on experience, or a Master’s degree with 7+ years of relevant experience

Additional Information:

Job Posted:
March 19, 2026

Similar Jobs for Performance Infrastructure Engineer - Data Center GPU

Product Manager - AI Data Center Infrastructure

Product Manager - AI Data Center Infrastructure. We are seeking a Product Line M...
Location:
India, Bangalore
Salary:
Not provided
Hewlett Packard Enterprise
Expiration Date:
Until further notice
Requirements:
  • 5–10+ years of experience in data center networking, AI infrastructure, or HPC environments
  • Strong hands-on experience with Juniper QFX platforms and JunOS
  • Deep understanding of GPU architectures: NVIDIA (H100/H200, GB200/GB300, NVLink/NVSwitch) and AMD (MI300/MI400, Pollara NICs, Infinity Fabric)
  • Proven expertise in scale-up GPU interconnects and scale-out Ethernet fabrics
  • Strong knowledge of RDMA/RoCEv2, ECN, PFC, and buffer management
  • Familiarity with distributed AI workloads, collective operations (NCCL, RCCL)
  • Hands-on troubleshooting experience with high-speed optics, AEC cables, link training, and NIC firmware
  • Proficiency in automation and scripting (Python, Ansible, Bash, Terraform)
Job Responsibility:
  • AI Data Center & Fabric Architecture: Define product requirements for AI data center network architectures supporting thousands of GPUs
  • Develop requirements for low-latency Ethernet fabrics using Juniper QFX platforms and Apstra-based automation
  • Enable high-bandwidth GPU and NIC interconnects optimized for large-scale distributed training and inference workloads
  • GPU, NIC & Interconnect Strategy: Lead requirements definition for next-generation GPUs, NICs, and interconnect technologies, staying ahead of industry roadmaps
  • Drive alignment with NVIDIA and AMD ecosystems
  • Ensure interoperability across DAC, AEC, ACC, and optical transceivers between switches and NIC endpoints
  • Define scale-up paths using PCIe, NVLink, NVSwitch, ensuring GPU-to-GPU symmetry, consistency, and bandwidth determinism
  • Switching, Routing & Telemetry: Specify and optimize L2/L3 architectures, including EVPN-VXLAN, Class-E IPv4, and AI-optimized buffer tuning
  • Leverage hardware telemetry, streaming sensors, and analytics for proactive performance assurance
  • Drive automation using Python, Ansible, Apstra, Terraform, and related tools to enforce configuration consistency and compliance
What we offer:
  • Health & Wellbeing: comprehensive suite of benefits that supports physical, financial and emotional wellbeing
  • Personal & Professional Development: specific programs catered to helping you reach any career goals
  • Unconditional Inclusion: unconditionally inclusive in the way we work and celebrate individual uniqueness

Senior Infrastructure Engineer

We are seeking a highly skilled and motivated GPU Fleet Operations Engineer to j...
Location:
United States, San Francisco; Sunnyvale
Salary:
183000.00 - 210000.00 USD / Year
Crusoe
Expiration Date:
Until further notice
Requirements:
  • Proven experience diagnosing and repairing high-density, rack-mounted compute hardware in production environments
  • Deep understanding of GPU architectures and hands-on experience with GPU-based systems
  • Experience supporting NVIDIA A100, H200, GB200, B200 and AMD 350X / 355X series platforms
  • Familiarity with high-speed interconnects such as InfiniBand, NVLink, and RDMA over Converged Ethernet (RoCE)
  • Strong Linux experience (Ubuntu, Rocky Linux, CentOS) using the command line for diagnostics and testing
  • Proficiency with GPU and system diagnostic tools such as NVIDIA DCGM and NVIDIA field diagnostic utilities
  • Experience working with enterprise server hardware, power delivery, and cooling systems
  • Strong analytical and problem-solving skills
  • Excellent communication and collaboration skills
  • Ability to work independently in a fast-paced data center or operations environment
Job Responsibility:
  • Perform deep-level diagnosis and troubleshooting of hardware faults within GPU racks and high-density compute systems
  • Troubleshoot and support GPU platforms including NVIDIA A100, H200, GB200, B200 and AMD 350X / 355X
  • Execute component-level diagnosis and remediation for failed or degraded hardware
  • Partner with data center operations to manage and perform field-replaceable unit (FRU) repairs for GPUs, power supplies, cooling systems, interconnects, and networking hardware
  • Conduct post-repair validation, burn-in testing, torch testing, and NVIDIA NCCL testing to ensure system stability and performance
  • Implement and execute preventative maintenance procedures to improve fleet reliability and extend hardware lifespan
  • Perform firmware and BIOS upgrades across the GPU fleet
  • Maintain detailed documentation of maintenance activities, failures, and resolutions in ticketing and asset management systems
  • Develop and update standard operating procedures (SOPs) for troubleshooting, repair, and validation workflows
  • Collaborate with engineering, software, and data center operations teams to identify root causes of systemic failures and implement preventative solutions
What we offer:
  • Restricted Stock Units in a fast growing, well-funded technology company
  • Health insurance package options that include HDHP and PPO, vision, and dental for you and your dependents
  • Employer contributions to HSA accounts
  • Paid Parental Leave
  • Paid life insurance, short-term and long-term disability
  • Teladoc
  • 401(k) with a 100% match up to 4% of salary
  • Generous paid time off and holiday schedule
  • Cell phone reimbursement
  • Tuition reimbursement
Contract Type:
Fulltime

Solutions Architect – Campus, DCN Switching & Routing

We are looking for a seasoned TME/Networking Solutions Architect with deep exper...
Location:
China, Beijing
Salary:
Not provided
Hewlett Packard Enterprise
Expiration Date:
Until further notice
Requirements:
  • Deep knowledge and hands-on experience in networking protocols: BGP, OSPF, EVPN, VXLAN, MCLAG, DRNI, ISSU, MACSec, DCI
  • Experience in Day 0 to Day 1 deployment of spine-leaf fabrics with any SDN controllers, micro segmentation, and service chaining
  • Working knowledge of automation and orchestration tools used in data center deployments
  • Familiarity with SDN controller architecture and integration with third-party services
  • Proven ability to engage with both technical and business stakeholders to design and defend high-impact networking solutions
  • Strong competitive knowledge of other vendor offerings — including campus solutions, 400G/800G switching platforms, and transceivers such as but not limited to QSFP-DD and OSFP
  • Excellent written and verbal communication skills
  • Ability to create compelling documentation and technical collateral
Job Responsibility:
  • Serve as a trusted technical advisor for customers across AI data centers, enterprise campus networks, and service provider environments — identifying technical requirements, resolving pain points, and showcasing HPE’s end-to-end networking capabilities
  • Architect and support AI-ready Ethernet data center deployments using leaf-spine topologies, EVPN-VXLAN overlays, and RoCEv2 fabrics optimized for GPU-based workloads
  • Lead and participate in customer-facing workshops, whiteboard sessions, and technical deep dives across campus switching, data center fabrics, and edge routing solutions
  • Conduct Proof of Concepts (PoCs) and hands-on validations to assess performance, scale, Day-0 automation, telemetry, and orchestration tools in both data center and campus environments
  • Create and maintain design guidelines, infrastructure blueprints, and best practices for performance-optimized and scalable networking deployments across AI DC, enterprise, and routers use cases
  • Collaborate with pre-sales and go-to-market teams to drive solution adoption and ensure alignment with customer needs and competitive differentiators
  • Contribute to RFP/RFI responses, creating comprehensive solution documentation including Bill of Materials (BoM), redundancy and topology planning
  • Work closely with product management and engineering, providing real-world field feedback to enhance product roadmaps around automation, telemetry, security, and feature development
  • Represent HPE at industry events, AI summits, and technology forums, highlighting the value of HPE’s networking portfolio in comparison to competitors
  • Stay ahead of the curve by tracking emerging trends, analysing the competitive landscape, and influencing internal strategies for next-gen network innovation
What we offer:
  • Health & Wellbeing
  • Personal & Professional Development
  • Unconditional Inclusion
Contract Type:
Fulltime

Staff Strategic Sourcing Manager

Together AI is rapidly scaling its infrastructure, and we need a senior supply c...
Location:
United States, San Francisco
Salary:
220000.00 - 260000.00 USD / Year
Together AI
Expiration Date:
Until further notice
Requirements:
  • 7-10+ years of experience in hardware strategic sourcing, procurement, or supply chain management within data center infrastructure, cloud computing, or high-performance computing environments
  • Deep and direct experience with the full compute hardware stack (GPUs, servers, networking, storage), including expertise in major OEM/ODM supplier landscapes, semiconductor supply chain dynamics, and hands-on experience managing GPU sourcing and allocation at scale
  • Track record of personally leading and closing complex, high-value hardware deals. Experience structuring long-term supply agreements across pricing, delivery, and risk dimensions
  • Experience building supply chain models for technical hardware at scale: demand forecasting, inventory strategy, logistics coordination, and supply risk mitigation
  • Strong executive presence with the ability to partner with and influence C-level leaders, senior engineering teams, and cross-functional stakeholders. Comfortable presenting supply chain strategy and risk assessments to senior leadership
  • Advanced analytical skills with fluency in total cost of ownership modeling, financial trade-off analysis, and procurement performance metrics
  • Ability to travel to supplier sites
  • Must have recent experience in high-growth, ambiguous environments where processes are being defined rather than inherited
Job Responsibility:
  • Lead the full strategic sourcing and procurement lifecycle for GPUs, servers, networking equipment, storage, and supporting components across large-scale cluster builds. Own sourcing, negotiation, contracting, delivery coordination, and acceptance
  • Negotiate and structure multi-million dollar supply agreements across several categories of hardware vendors. Secure pricing, volume commitments, lead times, and warranty terms that protect the company's cost position and supply continuity
  • Design and scale the company's hardware supply chain, including inventory planning, supplier diversification, and logistics. Build procurement infrastructure that keeps pace with rapid capacity expansion across multiple sites and geographies
  • Track GPU and data center commodity hardware markets for supply shifts, pricing dynamics, component roadmaps, and geopolitical risks. Translate market intelligence into sourcing recommendations and present findings to executive leadership. Optimize TCO
  • Own strategic supplier relationships and drive performance through regular executive reviews, joint planning, and clear accountability frameworks. Qualify and onboard new vendors as the hardware supply chain diversifies
  • Align supply chain strategy with technical roadmaps and capital plans. Provide visibility into supply chain status, risks, and investment trade-offs at the executive level
  • Stand up the supply chain tools, workflows, and reporting systems needed to manage hardware spending at scale, including cost tracking, order management, and vendor benchmarking
What we offer:
  • Competitive compensation
  • Startup equity
  • Health insurance
  • Other benefits
  • Flexibility in terms of remote work
Contract Type:
Fulltime

Solutions Architect

TME/Solutions Architect – DCN Switching & Solution role at Hewlett Packard Enter...
Location:
China, Beijing
Salary:
Not provided
Hewlett Packard Enterprise
Expiration Date:
Until further notice
Requirements:
  • Deep knowledge and hands-on experience in networking protocols: BGP, OSPF, EVPN, VXLAN, MCLAG, DRNI, ISSU, MACSec, DCI, MPLS and SDN based solutions
  • Experience in Day 0 to Day 1 deployment of spine-leaf fabrics with any SDN controllers, micro segmentation, and service chaining
  • Working knowledge of automation and orchestration tools used in data center deployments
  • Familiarity with SDN controller architecture and integration with third-party services
  • Proven ability to engage with both technical and business stakeholders to design and defend high-impact networking solutions
  • Strong competitive knowledge of other vendor offerings including 100G/400G/800G switching platforms, transceivers and cables
  • Excellent written and verbal communication skills in English
  • Good presentation and event management skills
Job Responsibility:
  • Serve as a trusted technical advisor for customers across AI data centers, and service provider and enterprise environments
  • Architect and support AI-ready Ethernet data center deployments using leaf-spine topologies, EVPN-VXLAN overlays, and RoCEv2 fabrics optimized for GPU-based workloads
  • Lead and participate in customer-facing workshops, whiteboard sessions, and technical deep dives across campus switching, data center fabrics, and edge routing solutions
  • Conduct Proof of Concepts (PoCs) and hands-on validations to assess performance, scale, Day-0 automation, telemetry, and orchestration tools
  • Create and maintain design guidelines, infrastructure blueprints, and best practices for performance-optimized and scalable networking deployments
  • Collaborate with pre-sales and go-to-market teams to drive solution adoption and ensure alignment with customer needs
  • Contribute to RFP/RFI responses, creating comprehensive solution documentation including Bill of Materials (BoM), redundancy and topology planning
  • Work closely with product management and engineering, providing real-world field feedback to enhance product roadmaps and feature development
  • Represent HPE at industry events, AI summits, and technology forums
  • Stay ahead of the curve by tracking emerging trends, analysing the competitive landscape, and influencing internal strategies for next-gen network innovation
What we offer:
  • Health & Wellbeing benefits
  • Personal & Professional Development programs
  • Unconditional Inclusion environment
  • Comprehensive suite of benefits supporting physical, financial and emotional wellbeing
Contract Type:
Fulltime

Compute Partnerships Lead

We are looking for a Compute Partnerships Lead to architect and operate our glob...
Location:
United States, San Francisco
Salary:
Not provided
Prime Intellect
Expiration Date:
Until further notice
Requirements:
  • 3–7+ years in infrastructure partnerships, business development, commercial sourcing, or AI infrastructure strategy
  • Direct experience negotiating GPU/cloud/data center agreements strongly preferred
  • Strong understanding of AI workloads (training vs inference, memory constraints, networking, utilization economics)
  • Experience working cross-functionally with engineering and finance
  • High commercial discipline — comfortable modeling margin, utilization, and contract tradeoffs
  • Comfortable operating in constrained supply environments
  • Strong ownership mentality — you build systems, not just deals
  • Ability to travel and manage global partnerships across time zones
Job Responsibility:
  • Develop and execute Prime Intellect’s global GPU sourcing strategy across H100/H200/B200-class infrastructure and beyond
  • Structure commercial agreements that balance cost, flexibility, term length, and growth optionality
  • Identify and evaluate infrastructure partners across hyperscalers, specialized AI clouds, data centers, colocation providers, and hardware vendors
  • Lead negotiations on pricing, SLAs, capacity reservations, expansion rights, and risk allocation
  • Continuously optimize blended gross margins through disciplined sourcing and contract structuring
  • Secure capacity for internal frontier RL research and model training
  • Coordinate closely with research and engineering teams to understand workload requirements (training vs inference vs long-context deployments)
  • Align capacity planning with enterprise deployment roadmaps
  • Ensure compute supply keeps pace with customer expansion and new model launches
  • Work with infrastructure, platform, and DevOps teams to ensure partner capacity is onboarded efficiently and runs reliably in production
What we offer:
  • Competitive Compensation + equity incentives
  • Flexible Work (remote or San Francisco)
  • Visa Sponsorship and relocation support
  • Professional Development budget
  • Team off-sites and conference attendance
  • Opportunity to shape decentralized AI at Prime Intellect
Contract Type:
Fulltime

Principal Engineer

The Senior Data Center Operations Engineer is responsible for the bedrock of our...
Location:
United States, Santa Clara
Salary:
147000.00 - 237500.00 USD / Year
Palo Alto Networks
Expiration Date:
Until further notice
Requirements:
  • Bachelor's degree in Computer Science, IT, or equivalent experience
  • 5+ years of experience specifically operating Red Hat OpenShift (OCP) in a production environment
  • Deep experience racking/stacking and cabling high-density GPU systems (e.g., NVIDIA DGX or similar) and specialized AI/ML hardware
  • Advanced proficiency in Ansible or Pulumi for automating bare-metal provisioning and cluster configuration
  • Strong Python and Bash skills for developing custom health-check scripts and API integrations
  • Expert-level CoreOS and RHEL administration, including kernel tuning and systemd management
  • Solid understanding of BGP, VLAN tagging, LACP, and Load Balancing (F5/NGINX) essential for cluster ingress
  • Experience with vSphere or KVM, and persistent storage solutions like OpenShift Data Foundation (ODF) or Ceph
  • Familiarity with DCIM tools (Netbox) and monitoring stacks (ELK, Loki, etc.)
  • Ability to lift and move equipment up to 50 pounds (e.g., high-density 2U/4U servers)
Job Responsibility:
  • Design and development of a scalable distributed management plane infrastructure to manage Palo Alto Networks’ next-generation network security solutions
  • Ensure 99.99% availability by architecting resilient physical layouts and automating the deployment, scaling, and self-healing capabilities of our production clusters
  • Monitor and maintain data center systems with a focus on 'Zero Single Point of Failure' (ZSPoF) architecture for OpenShift control planes and worker nodes
  • Implement and manage OpenShift 4.x clusters across multiple power and cooling zones to ensure 99.99% uptime
  • Design, test, and execute automated failover strategies and backup/restore procedures using tools like OADP (Velero) and Red Hat ACM
  • Perform routine maintenance and upgrades using GitOps (ArgoCD) and the Machine Config Operator to ensure zero-downtime node evacuations and patching
  • Resolve deep-stack hardware and software issues, from faulty GPU firmware to OpenShift SDN (OVN-Kubernetes) network latencies
  • Coordinate with vendors for specialized hardware (e.g., NVIDIA, Dell, Cisco) while maintaining strict security and firmware compliance
  • Optimize rack density for high-performance GPU clusters while managing thermal loads and power distribution (PDU) to prevent circuit-trip outages
  • Maintain accurate documentation and integrate hardware health metrics (IPMI/SNMP) into Prometheus/Grafana for proactive alerting
Contract Type:
Fulltime

Senior Software Engineer - Together Cloud Infrastructure

Together AI is building the AI Acceleration Cloud, an end-to-end platform for th...
Location:
United States, San Francisco
Salary:
160000.00 - 230000.00 USD / Year
Together AI
Expiration Date:
Until further notice
Requirements:
  • 5+ years of professional software development experience and proficiency in at least one backend programming language (Golang desired)
  • 5+ years experience writing high-performance, well-tested, production quality code
  • Demonstrated experience with building and operating high-performance and/or globally distributed micro-service architectures across one or more cloud providers (AWS, Azure, GCP)
  • Excellent communication skills – able to write clear design docs and work effectively with both technical and non-technical team members
  • Deep experience with Kubernetes internals is a big plus, such as implementing non-trivial Kubernetes operators, device/storage/network plugins, custom schedulers, or patches to Kubernetes itself
  • Deep experience with VMs/hypervisors a big plus, such as QEMU/KVM, cloud-hypervisor, VFIO, virtio, PCIE passthrough, Kubevirt, SR-IOV
  • Deep experience with DC networking tech + solutions a big plus, such as VLAN, VXLAN, VPN, VPC, OVS/OVN
  • Experience with Cluster API or similar a big plus
  • Experience working on high-performance compute, networking, and/or storage a big plus
  • Experience virtualizing GPUs and/or Infiniband a big plus
Job Responsibility:
  • Design, build, and maintain performant, secure, and highly-available backend services/operators that run in our data centers and automate hardware management, such as Infiniband partitioning, in-DC parallel storage provisioning, and VM provisioning
  • Design and build out the IaaS software layer for a new GB200 data center with thousands of GPUs
  • Work on a global multi-exabyte high-performance object store, serving massive datasets for pretraining
  • Build advanced observability stacks for our customers with automated node lifecycle management for fault-tolerant distributed pretraining
  • Perform architecture and research work for decentralized AI workloads
  • Work on the core, open-source Together AI platform
  • Create services, tools, and developer documentation
  • Create testing frameworks for robustness and fault-tolerance
What we offer:
  • Competitive compensation
  • Startup equity
  • Health insurance
  • Other benefits
  • Flexibility in terms of remote work
Contract Type:
Fulltime