Performance Infrastructure Engineer - Data Center GPU

United States, Santa Clara · 192000.00 - 288000.00 USD / Year · Job Posted March 19, 2026

Job Description

You will be part of a small but dedicated team driving performance attainment solutions for discrete GPU products across hardware, software, and the platform. We are seeking a highly skilled engineer to join our Infrastructure team, focused on building scalable solutions for workload automation and performance analysis in support of advanced machine learning workloads.

Job Responsibility

  • Serve as technical lead for a team of 5-6 engineers
  • Assess and understand the current automation and performance analysis infrastructure, identifying strengths, gaps, and opportunities for improvement
  • Collaborate with internal teams to gather technical requirements and understand evolving needs
  • Develop a forward-looking plan that balances reusing existing systems with building new infrastructure where appropriate
  • Design, develop, and maintain automation and performance analysis tooling using Python, Bash, Make, and related technologies
  • Build and enhance workflow automation solutions using internally developed tools to orchestrate ML workloads
  • Develop new techniques and tooling to optimize ML workload execution, profiling, and analysis at scale

Requirements

  • Strong development experience in Python and/or Bash (or equivalent scripting languages)
  • Experience with GitHub, Jenkins, or similar CI/CD and code review systems
  • Linux system administration experience preferred
  • Experience developing automated test infrastructure and orchestrating multisystem workflows is preferred
  • Ansible experience is a bonus
  • Strong analytical, problem-solving, and debugging skills
  • Excellent communication skills
  • Critical thinker and self-starter
  • Ability to quickly learn and apply new tools, technologies, and frameworks
  • Networking experience preferred, including common protocols and basic debugging
  • Experience with Docker/containers and/or virtualization technologies preferred
  • Motivating leader with good interpersonal skills
  • Bachelor’s degree in a Computer Engineering/Computer Science field with 9+ years of hands-on experience, or a Master’s degree with 7+ years of relevant experience

Similar Jobs for Performance Infrastructure Engineer - Data Center GPU

8 matching positions

Product Manager - AI Data Center Infrastructure

Product Manager - AI Data Center Infrastructure. We are seeking a Product Line M...
Location: India, Bangalore
Salary: Not provided
Hewlett Packard Enterprise
Expiration Date: Until further notice
Requirements
  • 5–10+ years of experience in data center networking, AI infrastructure, or HPC environments
  • Strong hands-on experience with Juniper QFX platforms and JunOS
  • Deep understanding of GPU architectures: NVIDIA (H100/H200, GB200/GB300, NVLink/NVSwitch) and AMD (MI300/MI400, Pollara NICs, Infinity Fabric)
  • Proven expertise in scale-up GPU interconnects and scale-out Ethernet fabrics
  • Strong knowledge of RDMA/RoCEv2, ECN, PFC, and buffer management
  • Familiarity with distributed AI workloads and collective operations (NCCL, RCCL)
  • Hands-on troubleshooting experience with high-speed optics, AEC cables, link training, and NIC firmware
  • Proficiency in automation and scripting (Python, Ansible, Bash, Terraform)
Job Responsibility
  • AI Data Center & Fabric Architecture: Define product requirements for AI data center network architectures supporting thousands of GPUs
  • Develop requirements for low-latency Ethernet fabrics using Juniper QFX platforms and Apstra-based automation
  • Enable high-bandwidth GPU and NIC interconnects optimized for large-scale distributed training and inference workloads
  • GPU, NIC & Interconnect Strategy: Lead requirements definition for next-generation GPUs, NICs, and interconnect technologies, staying ahead of industry roadmaps
  • Drive alignment with NVIDIA and AMD ecosystems
  • Ensure interoperability across DAC, AEC, ACC, and optical transceivers between switches and NIC endpoints
  • Define scale-up paths using PCIe, NVLink, NVSwitch, ensuring GPU-to-GPU symmetry, consistency, and bandwidth determinism
  • Switching, Routing & Telemetry: Specify and optimize L2/L3 architectures, including EVPN-VXLAN, Class-E IPv4, and AI-optimized buffer tuning
  • Leverage hardware telemetry, streaming sensors, and analytics for proactive performance assurance
  • Drive automation using Python, Ansible, Apstra, Terraform, and related tools to enforce configuration consistency and compliance
What we offer
  • Health & Wellbeing: comprehensive suite of benefits that supports physical, financial and emotional wellbeing
  • Personal & Professional Development: specific programs catered to helping you reach any career goals
  • Unconditional Inclusion: unconditionally inclusive in the way we work and celebrate individual uniqueness

Business Development Manager – HPE POD (Modular Data Center Solutions)

Develop and grow the HPE Modular Data Center (POD) and AI infrastructure busines...
Location: Japan, Tokyo
Salary: Not provided
Hewlett Packard Enterprise
Expiration Date: Until further notice
Requirements
  • Bachelor’s degree in engineering, computer science, or related technical field, or equivalent industry experience
  • Typically 8+ years of professional experience in data center infrastructure, HPC environments, modular infrastructure, or enterprise technology solutions
  • Experience in business development, solution sales, or infrastructure consulting within enterprise IT, cloud, or data center industries
  • Experience engaging with senior technical and executive stakeholders on infrastructure strategy and large-scale technology investments
  • Experience supporting complex infrastructure deals involving multiple stakeholders, partners, and delivery organizations
  • Strong understanding of modern data center architecture including modular data centers and containerized infrastructure, high-density GPU and HPC environments, liquid cooling and advanced thermal management, AI infrastructure and accelerated computing platforms, enterprise and hyperscale data center operations
  • Ability to translate complex technical infrastructure concepts into business value and strategic outcomes for customers
  • Strong commercial acumen with the ability to structure large infrastructure deals and navigate enterprise procurement processes
  • Experience working across multi-technology environments including compute, networking, storage, cooling systems, and facility infrastructure
  • Ability to develop scalable infrastructure solutions that enhance performance, efficiency, and time-to-deployment for AI and HPC workloads
Job Responsibility
  • Drive business development activities for HPE POD and modular data center solutions across targeted industries including AI, education, research, and enterprise environments
  • Identify, qualify, and develop new opportunities for modular data center deployments including AI factory infrastructure, HPC clusters, GPU environments, and edge data center solutions
  • Lead engagement with customers to understand technical, operational, and business requirements for large-scale data center deployments and translate these into POD-based solutions
  • Work closely with HPE account teams, solution architects, and partners to develop end-to-end proposals including infrastructure architecture, modular facility design, and lifecycle services
  • Act as a trusted advisor to customer executives, infrastructure teams, and decision makers on modern data center architecture, capacity scaling strategies, and AI-ready infrastructure
  • Coordinate cross-functional resources including engineering, supply chain, manufacturing partners, and delivery teams to ensure solutions are feasible, scalable, and aligned with customer timelines
  • Lead or support major proposal efforts and RFP responses for modular data center solutions, including technical positioning, commercial structuring, and value articulation
  • Support the creation of detailed solution architectures including modular data center configurations, cooling strategies (air and liquid)
  • Develop and maintain relationships with strategic ecosystem partners including cooling technology providers, modular construction manufacturers, and infrastructure integrators
  • Provide market intelligence and customer feedback to influence the evolution of the HPE POD portfolio
What we offer
  • Health & Wellbeing (comprehensive suite of benefits supporting physical, financial and emotional wellbeing)
  • Personal & Professional Development (programs to help reach career goals)
  • Unconditional Inclusion
  • Full-time

Senior Infrastructure Engineer

We are seeking a highly skilled and motivated GPU Fleet Operations Engineer to j...
Location: United States, San Francisco; Sunnyvale
Salary: 183000.00 - 210000.00 USD / Year
Crusoe
Expiration Date: Until further notice
Requirements
  • Proven experience diagnosing and repairing high-density, rack-mounted compute hardware in production environments
  • Deep understanding of GPU architectures and hands-on experience with GPU-based systems
  • Experience supporting NVIDIA A100, H200, GB200, B200 and AMD 350X / 355X series platforms
  • Familiarity with high-speed interconnects such as InfiniBand, NVLink, and RDMA over Converged Ethernet (RoCE)
  • Strong Linux experience (Ubuntu, Rocky Linux, CentOS) using the command line for diagnostics and testing
  • Proficiency with GPU and system diagnostic tools such as NVIDIA DCGM and NVIDIA field diagnostic utilities
  • Experience working with enterprise server hardware, power delivery, and cooling systems
  • Strong analytical and problem-solving skills
  • Excellent communication and collaboration skills
  • Ability to work independently in a fast-paced data center or operations environment
Job Responsibility
  • Perform deep-level diagnosis and troubleshooting of hardware faults within GPU racks and high-density compute systems
  • Troubleshoot and support GPU platforms including NVIDIA A100, H200, GB200, B200 and AMD 350X / 355X
  • Execute component-level diagnosis and remediation for failed or degraded hardware
  • Partner with data center operations to manage and perform field-replaceable unit (FRU) repairs for GPUs, power supplies, cooling systems, interconnects, and networking hardware
  • Conduct post-repair validation, burn-in testing, torch testing, and NVIDIA NCCL testing to ensure system stability and performance
  • Implement and execute preventative maintenance procedures to improve fleet reliability and extend hardware lifespan
  • Perform firmware and BIOS upgrades across the GPU fleet
  • Maintain detailed documentation of maintenance activities, failures, and resolutions in ticketing and asset management systems
  • Develop and update standard operating procedures (SOPs) for troubleshooting, repair, and validation workflows
  • Collaborate with engineering, software, and data center operations teams to identify root causes of systemic failures and implement preventative solutions
What we offer
  • Restricted Stock Units in a fast-growing, well-funded technology company
  • Health insurance package options that include HDHP and PPO, vision, and dental for you and your dependents
  • Employer contributions to HSA accounts
  • Paid Parental Leave
  • Paid life insurance, short-term and long-term disability
  • Teladoc
  • 401(k) with a 100% match up to 4% of salary
  • Generous paid time off and holiday schedule
  • Cell phone reimbursement
  • Tuition reimbursement
  • Full-time

Senior Software Engineer - Together Cloud Infrastructure

Together AI is building the AI Acceleration Cloud, an end-to-end platform for th...
Location: United States, San Francisco
Salary: 160000.00 - 230000.00 USD / Year
Together AI
Expiration Date: Until further notice
Requirements
  • 5+ years of professional software development experience and proficiency in at least one backend programming language (Golang desired)
  • 5+ years experience writing high-performance, well-tested, production quality code
  • Demonstrated experience with building and operating high-performance and/or globally distributed micro-service architectures across one or more cloud providers (AWS, Azure, GCP)
  • Excellent communication skills – able to write clear design docs and work effectively with both technical and non-technical team members
  • Deep experience with Kubernetes internals a big plus, such as implementing non-trivial Kubernetes operators, device/storage/network plugins, custom schedulers, or patches to these or to Kubernetes itself
  • Deep experience with VMs/hypervisors a big plus, such as QEMU/KVM, cloud-hypervisor, VFIO, virtio, PCIe passthrough, KubeVirt, SR-IOV
  • Deep experience with DC networking tech + solutions a big plus, such as VLAN, VXLAN, VPN, VPC, OVS/OVN
  • Experience with Cluster API or similar a big plus
  • Experience working on high-performance compute, networking, and/or storage a big plus
  • Experience virtualizing GPUs and/or InfiniBand a big plus
Job Responsibility
  • Design, build, and maintain performant, secure, and highly-available backend services/operators that run in our data centers and automate hardware management, such as InfiniBand partitioning, in-DC parallel storage provisioning, and VM provisioning
  • Design and build out the IaaS software layer for a new GB200 data center with thousands of GPUs
  • Work on a global multi-exabyte high-performance object store, serving massive datasets for pretraining
  • Build advanced observability stacks for our customers with automated node lifecycle management for fault-tolerant distributed pretraining
  • Perform architecture and research work for decentralized AI workloads
  • Work on the core, open-source Together AI platform
  • Create services, tools, and developer documentation
  • Create testing frameworks for robustness and fault-tolerance
What we offer
  • competitive compensation
  • startup equity
  • health insurance
  • other benefits
  • flexibility in terms of remote work
  • Full-time

Enterprise Territory Executive

AMD is seeking a high-impact Enterprise Territory Executive to drive growth acro...
Location: United States, Texas
Salary: 224560.00 - 336840.00 USD / Year
AMD
Expiration Date: Until further notice
Requirements
  • Proven success selling enterprise technology solutions and consistently exceeding revenue objectives
  • Strong ability to build executive relationships and influence business and technology decision-makers
  • Strategic mindset with a track record of identifying, developing, and closing complex opportunities
  • Passion for emerging technologies including AI, cloud computing, data center modernization, and digital transformation
  • Strong consultative selling skills focused on customer outcomes and business value
  • Collaborative approach with the ability to work effectively across highly matrixed organizations
  • Excellent communication, presentation, and relationship-building skills
  • Willingness and ability to travel approximately 50%
  • Proven success selling Data Center, AI Infrastructure, Enterprise Compute, Cloud, Storage, Networking, or Commercial Client solutions into enterprise organizations
  • Demonstrated ability to consistently exceed sales targets and drive business growth
Job Responsibility
  • Develop and execute territory growth strategies that expand AMD's presence across enterprise customers throughout the Central U.S. region
  • Identify, qualify, and close opportunities that drive revenue growth, market share expansion, and long-term customer value
  • Build and maintain a healthy pipeline across Data Center, AI, Cloud, and Commercial Client opportunities
  • Develop territory plans that align customer priorities with AMD's strategic growth objectives
  • Build trusted advisor relationships with CIOs, CTOs, Chief Architects, Infrastructure Leaders, Procurement Organizations, and Line-of-Business stakeholders
  • Understand customer technology roadmaps, business priorities, and modernization initiatives
  • Position AMD solutions as strategic enablers of AI adoption, cloud transformation, infrastructure modernization, and workforce productivity
  • Drive adoption of AMD EPYC™ processors across virtualization, cloud, storage, enterprise applications, and high-performance computing workloads
  • Position AMD Instinct™ accelerators to support AI training, inference, advanced analytics, and emerging enterprise AI initiatives
  • Engage customers on AI infrastructure strategies, GPU acceleration, software ecosystems, and future workload requirements
What we offer
  • Benefits offered are described in "AMD benefits at a glance."
  • Full-time

AI Network Engineer

We are seeking an experienced AI Network Engineer to support and optimize high-p...
Location: United States, Houston
Salary: Not provided
Robert Half
Expiration Date: Until further notice
Requirements
  • 5+ years of experience in network engineering or infrastructure engineering
  • Hands-on experience with high-performance networking (InfiniBand, RDMA, RoCE)
  • Experience supporting GPU-based or HPC environments
  • Strong knowledge of data center networking (L2/L3, BGP, EVPN, VXLAN)
  • Familiarity with Linux systems and performance tuning
  • Experience with NVIDIA ecosystems (DGX, CUDA, NCCL, or similar)
  • Ability to diagnose low-latency and high-throughput network issues
Job Responsibility
  • Design, implement, and support high-performance networks for AI/ML workloads, including GPU clusters and distributed training environments
  • Deploy and optimize NVIDIA-based infrastructure (DGX systems, HGX platforms, or GPU clusters)
  • Configure and manage high-speed networking technologies such as InfiniBand, RoCE, and 100/200/400Gb Ethernet
  • Optimize network performance for east-west traffic, low latency, and large data throughput required for AI model training
  • Integrate NVIDIA software stack (CUDA, NCCL, GPU Cloud, AI Enterprise) with networking and compute environments
  • Troubleshoot performance bottlenecks across network, storage, and GPU interconnects
  • Collaborate with AI/ML engineers to ensure infrastructure meets training and inference demands
  • Support automation and infrastructure-as-code initiatives for scalable AI environments
What we offer
  • medical, vision, dental, and life and disability insurance
  • company 401(k) plan

Sr. Specialist Solutions Architect, High Performance Computing and Machine Learning

AWS Worldwide Specialist Solutions Architects (SSAs) are technologists with deep...
Location: Singapore, Singapore
Salary: Not provided
Amazon
Expiration Date: Until further notice
Requirements
  • 8+ years of experience in specific technology domain areas (e.g. software development, cloud computing, systems engineering, infrastructure, security, networking, data & analytics)
  • 3+ years of experience in design, implementation, or consulting for applications and infrastructure
  • 10+ years of experience in IT development or implementation/consulting in the software or Internet industries
Job Responsibility
  • Design Customer Solutions – Collaborate with the wider AWS teams to help customers and partners architect HPC Solutions that leverage AWS Services
  • Engage with Solution Architects, Account Managers, Professional Services, and Partners to define an HPC Engagement strategy for AWS operational territories and key accounts, with emphasis on public sector National Super Computing Centers, Government agencies, and/or AI/ML, CAE, Weather and accelerated computing with GPU
  • Thought Leadership – Provide thought leadership on solutions that benefit customers through the use of AWS Services
  • Serve as a key member of the business development and account management team in helping to ensure customer success in building and migrating applications, software and services on the AWS platform
  • Assist solution providers with the definition and implementation of technical and business strategies
  • Capture and share best-practice knowledge amongst the worldwide AWS solution architect community
  • Understand AWS market segments, and industry verticals
  • Understand and exploit the use of internal business support systems
  • Full-time

Technical Sourcing Manager - AI GPU & Cloud

The Global Technology Sourcing Team moves fast and leverages key partnerships to...
Location: United States, Menlo Park
Salary: 208000.00 - 289000.00 USD / Year
Meta
Expiration Date: Until further notice
Requirements
  • Bachelor's or Master's degree in Electrical Engineering, Mechanical Engineering or relevant technical field, and/or equivalent practical experience
  • 12+ years of technical experience in engineering, product management, or supply chain roles within data center, infrastructure, cloud, AI silicon semiconductor, or associated high-tech hardware industries
  • Demonstrated understanding of relevant supplier technology roadmaps, performance drivers and a proven pragmatic approach to influencing supplier product roadmaps
  • Experience moving seamlessly from strategy to execution and delivering tangible results in complex, cross-functional, and fast-paced environments
  • Interpersonal and communication skills, with experience influencing, negotiating, building consensus and making key strategic decisions
  • Experience interfacing with internal and external partners in a fast-paced, often ambiguous, entrepreneurial and cross-functional environment, requiring a wide latitude for independent judgment while coordinating people and technical resources
Job Responsibility
  • Maintain current knowledge of the technology and industry trends and perform competitive analysis and due diligence on relevant products
  • Develop, manage and refresh individual and customized technology and commodity sourcing strategies and roadmaps and in-depth understanding of the actively managed adjacent technologies
  • Develop and maintain in-depth technical relationships with executive management at relevant suppliers. Influence supplier technology roadmaps to ensure Meta's system architecture and technical requirements are met
  • In conjunction with Hardware/Software Engineering, Technology Strategy, and Technical Program Management, lead the creation and definition of future technology directions in the assigned commodity
  • Provide technical commodity and supplier expertise and consulting to research & design engineering for technology migration and lower TCO
  • Provide technical review and analysis for, and responses to, supplier proposals and RFQs. Resolve all technical queries arising from the quote process
  • Drive pre-POR evaluation of key technologies usability/suitability within the Meta infrastructure
  • Own technical expertise and due diligence in supplier negotiations and provide review and input for product development and master supply agreements
  • Work with hardware engineers to ensure part specifications and requirements are within broad commodity and supplier capabilities, avoiding special or unique SKUs
  • Develop supplier technical process improvement plans to drive positive gain on cost, quality, and reduced qualification and time to deployment
What we offer
  • bonus
  • equity
  • benefits
  • Full-time