HPC Fabrics Engineer

India, Bangalore · Job Posted March 21, 2026

Job Description

High Performance Computing, AI and Labs is a critical element of HPE. We are focused on delivering innovative solutions that accelerate our customers’ digital transformation, enabling them to tackle their complex, data-intensive workloads. By combining deep expertise with the development of the world’s most cutting-edge, high-performance supercomputers, we are defining the next era of computing and delivering valuable insight and innovation. Join us and redefine what’s next for you.

Job Responsibility

  • Develop, test and release Firmware and Driver components for HPC Option cards (InfiniBand, High Speed Ethernet adapters) on Linux, Windows and VMware OS
  • Qualify HPC Option card components on HPE Server platforms
  • Handle Level-4 support for HPC Fabrics components. Collaborate with Level-3 support, account teams and customers as needed, and provide technical assistance on escalated issues
  • Work closely with partners to ensure product quality requirements and release timelines are met
  • Collaborate with Engineering teams (Platform, Thermal, Factory, Benchmarking and test teams) on issues related to HPC Option card firmware and driver components
  • Create and contribute to Trainings, Advisories and knowledge base articles on HPC Fabrics Technology

Requirements

  • Bachelor's or Master's degree in Computer Science, Information Systems, or equivalent
  • Typically 2-4 years of experience
  • Experience with Linux, Windows and VMware ESXi platforms
  • Scripting knowledge (Shell, Perl, Python)
  • Working knowledge of virtualization environments and hypervisors
  • Knowledge of server hardware architecture, PCIe and NVLink speeds, and concepts like NUMA
  • Hands-on experience with InfiniBand and Ethernet (RoCE) networks
  • Experience in configuring, troubleshooting and tuning InfiniBand and Ethernet networks
  • Hands-on experience with network performance testing (RDMA Perftest, iPerf, netperf) and application-level testing (NCCL, RCCL)
  • Knowledge of High Performance Computing (HPC) stack components, infrastructure and scale-out deployments
  • Ability to work with multiple internal teams and interact with customers/partners
  • Ability to prioritize and handle multiple tasks simultaneously
  • Excellent communication, collaboration and interpersonal skills
  • Ability to work well in a team environment, take on challenges, and work comfortably and effectively in new areas that require experimentation and rapid problem solving

Nice to have

  • Experience with GPUs, Compute accelerators, SSD drives and Storage controllers would be a plus
  • Cloud Architectures, Cross Domain Knowledge, Design Thinking, Development Fundamentals, DevOps, Distributed Computing, Microservices Fluency, Full Stack Development, Security-First Mindset, Solutions Design, Testing & Automation, User Experience (UX)

What we offer

  • Health & Wellbeing
  • Personal & Professional Development
  • Unconditional Inclusion

Similar Jobs for HPC Fabrics Engineer

8 matching positions

Sr Kubernetes Engineer for HPC

Our client, located in Milpitas, CA, is currently in need of a Sr Kubernetes Eng...
Location: United States, Milpitas
Salary: 70.00 - 90.00 USD / Hour
ClearBridge Technology Group
Expiration Date: Until further notice
Requirements
  • Ability to work onsite
  • Understand Kubernetes deeply; support, troubleshoot and optimize a Kubernetes-driven HPC system
  • Containers: Kubernetes, Docker, etc.
  • Automation Tools: Grafana, Ansible
  • Cluster Management
  • Open-source data tools: Kafka
  • Cloud Databases: AWS Databases
  • Linux
  • HPC related tools
Job Responsibility
  • Help the customer gain long term confidence in the network stability, and detect marginal ports, packet drops and flapping links
  • Comprehensive fabric health monitoring, and integration with existing observability tools
  • Perform post-deployment integration testing beyond PS baseline tests
  • Validate end to end system behavior in real customer environments
  • Validate sustained performance targets
  • Diagnose performance issues across layers
  • Perform root-cause analysis
What we offer
  • excellent benefits and compensation packages

Infrastructure Hardware Technical Program Manager (Server And Network Systems)

As an Infrastructure Hardware Technical Program Manager (Server and Network Syst...
Location: United States; Canada, Sunnyvale; Toronto
Salary: Not provided
Cerebras Systems
Expiration Date: Until further notice
Requirements
  • B.S. or M.S. in Computer Science, Electrical/Computer Engineering, or equivalent experience
  • 8+ years in Technical Program Management (or similar delivery leadership) for server, network, or infrastructure platforms from concept through production
  • Experience coordinating complex server and/or datacenter network programs across OEM/ODMs, switch vendors, and internal engineering teams
  • Working knowledge of server architecture (CPU/NUMA, memory bandwidth, PCIe, NIC and storage IO) and enough networking fundamentals (leaf-spine fabrics, switch platforms, high-performance interconnects) to run effective technical reviews
  • Familiarity with Linux server fleet management (provisioning, firmware/BIOS, drivers, field triage)
  • Strong multi-team program execution skills: integrated plans, risk management, dependency tracking, and executive-level communication
  • Ability to operate in ambiguity and keep parallel server and network workstreams aligned
Job Responsibility
  • Own end-to-end program execution for server systems and network equipment in Cerebras clusters, including new platforms, refreshes, and major component/config changes
  • Drive requirements gathering and convert inputs into executable plans with clear milestones, readiness gates, and cross-functional deliverables
  • Represent Cluster Architecture in executive reviews, OKR cycles, and leadership/customer forums as needed
  • Build and manage integrated schedules across vendors and internal teams, track dependencies, critical path, and risks
  • Manage OEM/ODM and switch/vendor engagements (RFI/RFP, samples, escalations, roadmap alignment)
  • Partner with Compute / Server Platform / Network Architects to turn architectural decisions into qualification plans, acceptance criteria, and rollout strategies
  • Lead qualification and release readiness (lab/staging validation, regression tracking, go/no-go decisions)
  • Own risk and change management into production, including versioning, rollout sequencing, and stakeholder communication
  • Ensure operational readiness with deployment and fleet teams and maintain alignment with rack/physical DC owners on power, cooling, space, and cabling constraints
What we offer
  • Build a breakthrough AI platform beyond the constraints of the GPU
  • Publish and open source their cutting-edge AI research
  • Work on one of the fastest AI supercomputers in the world
  • Enjoy job stability with startup vitality
  • Our simple, non-corporate work culture that respects individual beliefs
  • Fulltime

Data Center GPU Performance Attainment Lead

The successful candidate will assume responsibility for post-silicon activities ...
Location: United States, Austin
Salary: 178400.00 - 267600.00 USD / Year
AMD
Expiration Date: Until further notice
Requirements
  • Proven leadership skills with experience mentoring junior engineers, coordinating cross-functional teams, and driving complex performance characterization and optimization efforts across multiple locations
  • Strong programming skills, with preference for Python and experience with ML frameworks (e.g., TensorFlow or PyTorch)
  • Proficiency in C/C++, scripting (Shell), and familiarity with performance tooling and automation workflows
  • Strong understanding of computer architecture and system organization
  • Deep knowledge of HPC and ML workloads, including scaling behavior and performance bottlenecks
  • Experience with scale-up and scale-out performance analysis at rack-level and cluster-level deployments
  • Strong analytical and problem-solving skills, with a high level of attention to detail
  • Excellent interpersonal, collaboration, and communication skills
  • Bachelor's or Master's degree in Computer Engineering, Electrical Engineering, Computer Science, or related field
Job Responsibility
  • Develop and maintain automation frameworks for workload execution and performance data collection, enabling scalable and repeatable characterization across configurations
  • Become a key stakeholder in the product power and performance definition process, ensuring alignment between architectural goals and measured silicon performance
  • Develop, execute, and evolve performance characterization and optimization test plans across diverse usage scenarios, including High Performance Computing (HPC) and Machine Learning (ML) workloads
  • Drive performance attainment for both scale-up (intra-node) and scale-out (multi-node) configurations, including: multi-GPU scaling efficiency within a node; interconnect bandwidth utilization (e.g., XGMI / Infinity Fabric); collective communication efficiency and communication-compute overlap; workload scaling behavior (strong and weak scaling); and identification and mitigation of system-level bottlenecks across distributed environments
  • Analyze interactions between power management features and performance behavior, optimizing configurations to achieve the best performance and performance-per-watt tradeoffs
  • Identify architectural and system-level bottlenecks and develop strategies to stress, expose, and mitigate worst-case performance scenarios
  • Support prototyping and experimentation efforts to evaluate enhancements and new features that impact performance
  • Debug and troubleshoot system-level issues across hardware, firmware, and software stacks observed in lab and production test environments
  • Collaborate with cross-functional teams (architecture, firmware, drivers, platform, and workload teams) to drive root-cause analysis through to resolution and performance closure
  • Proactively drive continuous improvement of post-silicon performance methodologies, tools, and workflows
  • Fulltime

AI/HPC System Performance Engineer

Meta is building some of the world's largest AI and high-performance computing i...
Location: United States, Menlo Park
Salary: 154000.00 - 217000.00 USD / Year
Meta
Expiration Date: Until further notice
Requirements
  • Experience profiling and optimizing distributed AI or HPC workloads, including familiarity with GPU interconnects, RDMA networking, and collective communication frameworks such as NCCL or MPI
  • Experience debugging complex, non-reproducible performance issues across multi-layer systems including network fabric, operating system, and application layers
  • Experience designing and implementing performance monitoring systems, including instrumentation, telemetry pipelines, and alerting for large-scale infrastructure
  • Experience driving cross-functional technical projects from requirements definition through production deployment, including communicating performance findings and trade-offs to diverse stakeholders
  • Bachelor's degree in Computer Science, Computer Engineering, relevant technical field, or equivalent practical experience
  • 6+ years of experience in system performance engineering, network infrastructure engineering, or a related field within large-scale distributed computing or HPC environments
Job Responsibility
  • Profile and benchmark AI training and inference workloads across large-scale HPC clusters to identify network, compute, and memory bottlenecks
  • Develop and maintain performance analysis frameworks and dashboards to track system-level metrics including GPU utilization, network bandwidth, latency, and collective communication efficiency
  • Investigate and resolve performance regressions in distributed AI training environments, including issues related to RDMA fabrics, collective communication libraries, and job scheduling
  • Collaborate with network infrastructure, hardware, and AI research teams to define performance requirements and validate new HPC cluster configurations
  • Design and execute capacity and scalability experiments to inform network topology decisions for AI supercomputing infrastructure
  • Build tooling and automation to continuously monitor HPC system health, detect anomalies, and reduce mean time to mitigation during performance incidents
  • Establish service level objectives for AI cluster network performance and drive cross-functional alignment on reliability and efficiency targets
  • Lead technical design reviews for network and system architecture changes affecting AI workload performance, communicating trade-offs clearly to engineering and product stakeholders
  • Mentor other engineers on HPC performance methodologies, debugging techniques, and instrumentation best practices
  • Leverage AI-assisted workflows to accelerate root cause analysis, automate routine performance reporting, and expand coverage across the HPC stack
What we offer
  • bonus + equity + benefits
  • Fulltime

Kubernetes Platform Engineer

Kubernetes Platform Engineer. This role has been designed as ‘Hybrid’ with an ex...
Location: United States, Bloomington
Salary: 111500.00 - 211500.00 USD / Year
Hewlett Packard Enterprise
Expiration Date: Until further notice
Requirements
  • Cloud Architectures
  • Cross Domain Knowledge
  • Design Thinking
  • Development Fundamentals
  • DevOps
  • Distributed Computing
  • Microservices Fluency
  • Full Stack Development
  • Security-First Mindset
  • Solutions Design
Job Responsibility
  • Lead Kubernetes‑native, RDMA‑class networking for distributed AI inference platforms on HPC clusters
  • Own the end‑to‑end technical design that allows Kubernetes‑orchestrated inference workloads (NVIDIA NIMs, vLLM, TensorRT‑LLM) to transparently consume high‑speed fabrics (e.g., HPE Slingshot/CXI) using Operators, DRA, CDI, Multus/secondary CNI, and Kubernetes networking abstractions—without container rebuilds, privileged pods, or manual tuning
  • Make HPC fabric capabilities consumable from standard containers
  • Design the mechanisms to expose RDMA‑capable NIC resources and required runtime components without baking the fabric into images, including mounting/injecting host user‑space libraries (e.g., libcxi + libfabric) in a controlled, supportable way
  • Define and implement the reference design for Kubernetes‑native RDMA enablement across Dynamic Resource Allocation (DRA), Container Device Interface (CDI), Multus + secondary CNIs, and Operator‑driven lifecycle management
  • Own API and CRD design (ResourceClaims, DeviceClasses, custom CRDs) with long‑term compatibility guarantees
  • Make and defend architectural tradeoffs between Device plugins vs DRA, CDI vs runtime hooks vs admission webhooks, Shared vs exclusive NIC models, and Performance vs operability vs isolation
  • Define how distributed inference patterns (KV‑cache movement, prefill/decode separation) map onto Kubernetes primitives
  • Ensure out-of-the-box compatibility with NVIDIA NIMs and the NIM Operator, KServe ServingRuntime / InferenceService, and GPU Operator (CDI mode)
  • Publish deployment patterns and validated manifests for inference workloads using RDMA fast paths
What we offer
  • Health & Wellbeing
  • Personal & Professional Development
  • Unconditional Inclusion
  • Fulltime

PCAI and AI Factory Expert

We are seeking a Subject Matter Expert (SME) – Admin, Operate & Manage (HPE PCAI...
Location: India, Bengaluru
Salary: Not provided
Hewlett Packard Enterprise
Expiration Date: Until further notice
Requirements
  • Bachelor’s / Master’s Degree in Computer Science, IT, or equivalent field
  • 8+ years of IT infrastructure administration experience, including 3+ years in AI/HPC or GPU-based environments
  • Proven experience in platform operations, monitoring, and lifecycle management of enterprise-grade AI and HPC environments
  • Hands-on experience in automation and orchestration across bare metal and containerized infrastructure
Job Responsibility
  • Administer and maintain HPE PCAI and AI Factory environments, ensuring optimal uptime and performance
  • Manage compute nodes (HPE DL380a, DL325, Cray XD670), GPU clusters (NVIDIA L40S/H100/H200), and InfiniBand NDR networks
  • Administer virtualization and container platforms such as vSphere, RHEL/RHOS, Ezmeral Runtime Enterprise, Kubernetes, and Rancher Harvester
  • Perform configuration, patching, version upgrades, and firmware updates across hardware and software layers
  • Proactively monitor system health using DCGM, NetQ, Grafana, and Exivity dashboards
  • Handle alerts, performance anomalies, and incidents across GPU, network, and storage layers
  • Lead root cause analysis (RCA) and corrective action plans to prevent recurring issues
  • Maintain operational documentation, runbooks, and incident logs
  • Manage cluster lifecycle through Ansible, AWX, HPE Performance Cluster Manager (HPCM), and SLURM
  • Oversee automation for provisioning, scaling, and patch management of Compute and Containerized workloads
What we offer
  • Health & Wellbeing
  • Personal & Professional Development
  • Unconditional Inclusion
  • Fulltime

Principal Supercomputing Operations Software Engineer

Microsoft Azure’s Artificial Intelligence and High Performance Computing (AI/HPC...
Location: United States, Multiple Locations
Salary: 139900.00 - 274800.00 USD / Year
Microsoft Corporation
Expiration Date: Until further notice
Requirements
  • Bachelor's Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python
  • OR equivalent experience
  • Ability to meet Microsoft, customer and/or government security screening requirements
  • Microsoft Cloud Background Check
  • 6+ years of experience operating large‑scale distributed systems, high‑performance computing (HPC), or artificial intelligence (AI) infrastructure in production environments
  • Demonstrated ownership of mission‑critical production infrastructure with direct impact on service availability, GPU workloads, and customer SLAs
  • Hands‑on experience operating and debugging interconnect fabrics supporting large‑scale compute workloads
  • Strong Linux systems knowledge with experience debugging low‑level infrastructure issues across operating systems, drivers, and services
  • Proven ability to reason across hardware, firmware, drivers, and software stacks to diagnose and resolve complex production issues
Job Responsibility
  • Serve as the technical authority and DRI for InfiniBand and GPU interconnect fabric operations across large-scale AI supercomputing environments, ensuring sustained GPU availability, training stability, and SLA compliance
  • Lead and orchestrate complex, high-severity fabric incidents end to end, including detection, triage, mitigation, recovery, and root cause analysis, making high-impact decisions under ambiguity
  • Perform deep, multi-layer systems debugging across InfiniBand, Subnet Manager, GPU interconnect, PCIe, GPUs, firmware, drivers, and OS layers to identify true root causes at fleet scale
  • Drive operational excellence and systemic prevention by identifying recurring failure patterns, defining reliability models and failure domains, and authoring authoritative TSGs, playbooks, and escalation frameworks adopted across teams
  • Architect and drive automation, telemetry, diagnostics, and tooling that materially improve detection, observability, debuggability, and mean time to mitigation, raising the operational bar for interconnect fabrics across the platform
  • Fulltime

Senior Director, CTIO Engineering Technologists

From applied research to advanced engineering, the Engineering Technologist team...
Location: United States, Austin; Santa Clara
Salary: 277000.00 - 358000.00 USD / Year
Dell
Expiration Date: Until further notice
Requirements
  • 18 years of overall experience, including 5 years leading and directing the strategic and operational objectives of an organization related to HPC (high-performance compute) clusters, AI compute, AI datacenter, AI storage, etc.
  • Demonstrated experience delivering AI Solutions
  • Experience developing long-term technology strategies based on the technical and business information
  • Drives for internal and external alignment of the strategy
  • Identifies and develops differentiation opportunities, provides technical information, and makes recommendations to marketing, procurement, engineering, customers, and business executives
  • Participates as required in strategic initiatives for improvements in process, quality, and cost
Job Responsibility
  • Lead a team of highly skilled SMEs in the development of next-generation, large-scale AI systems, including accelerated compute, AI fabrics, AI-optimized storage and the AI software stack
  • Responsibilities include the assimilation and understanding of the industry and competitive environment for a given technology or product line, and the derivation of a technology/product strategy from this information.
  • Leads technology investigations, performs a strategic analysis of the industry capabilities, and develops recommendations, which influence the technical product strategy and/or definition of products for a given product line, including evaluation of potential acquisitions and vendor partner opportunities.
  • Engages design teams, systems engineering, marketing teams, suppliers, and business unit leaders and executives to ensure the strategy or product architecture meets Dell’s requirement of product leadership for the given technology area or product line.
What we offer
  • Comprehensive Healthcare Programs
  • Award Winning Financial Wellness Tools and Resources
  • Generous Leave of Absence for New Parents and Caregivers
  • Industry Leading Wellness Platform
  • Employee Assistance Program
  • Fulltime