HPC Fabrics Engineer

Hewlett Packard Enterprise

Location:
India, Bangalore

Contract Type:
Not provided

Salary:
Not provided

Job Description:

High Performance Computing, AI and Labs is a critical element of HPE. We are focused on delivering innovative solutions that accelerate our customers’ digital transformation, enabling them to tackle their complex, data-intensive workloads. By combining deep expertise with the development of the world’s most cutting-edge, high-performance supercomputers, we are defining the next era of computing and delivering valuable insight and innovation. Join us and redefine what’s next for you.

Job Responsibility:

  • Develop, test and release Firmware and Driver components for HPC Option cards (InfiniBand, High Speed Ethernet adapters) on Linux, Windows and VMware OS
  • Qualify HPC Option card components on HPE Server platforms
  • Handle Level-4 support for HPC Fabrics components. Collaborate with Level-3 support, account teams and customers as needed and provide technical assistance on escalated issues
  • Work closely with partners to ensure product quality requirements and release timelines are met
  • Collaborate with Engineering teams (Platform, Thermal, Factory, Benchmarking and test teams) on issues related to HPC Option card firmware and driver components
  • Create and contribute to training materials, advisories and knowledge base articles on HPC Fabrics technology

Requirements:

  • Bachelor's or Master's degree in Computer Science, Information Systems, or equivalent
  • Typically 2-4 years of experience
  • Experience with Linux, Windows and VMware ESXi platforms
  • Scripting knowledge (Shell, Perl, Python)
  • Working knowledge of virtualization environments and hypervisors
  • Knowledge of server hardware architecture, PCIe and NVLink speeds, and concepts such as NUMA
  • Hands-on experience with InfiniBand and Ethernet (RoCE) networks
  • Experience in configuring, troubleshooting and tuning InfiniBand and Ethernet networks
  • Hands-on experience with network performance testing (RDMA Perftest, iPerf, netperf) and application-level testing (NCCL, RCCL)
  • Knowledge of High Performance Computing (HPC) stack components, infrastructure and scale-out deployments
  • Ability to work with multiple internal teams and interact with customers/partners
  • Ability to prioritize and handle multiple tasks simultaneously
  • Excellent communication, collaboration and interpersonal skills
  • Ability to work well in a team environment, take on challenges, and work comfortably and effectively in new areas that require experimentation and rapid problem solving

Nice to have:

  • Experience with GPUs, compute accelerators, SSDs and storage controllers
  • Cloud Architectures, Cross Domain Knowledge, Design Thinking, Development Fundamentals, DevOps, Distributed Computing, Microservices Fluency, Full Stack Development, Security-First Mindset, Solutions Design, Testing & Automation, User Experience (UX)

What we offer:

  • Health & Wellbeing
  • Personal & Professional Development
  • Unconditional Inclusion

Additional Information:

Job Posted:
March 21, 2026

Work Type:
Hybrid work

Similar Jobs for HPC Fabrics Engineer

Customer Support Engineer

As a Customer Support Engineer at a pioneering AI company, you'll be the first l...
Location:
India
Salary:
Not provided
Together AI
Expiration Date:
Until further notice
Requirements:
  • 5+ years of experience in a customer-facing technical role with at least 1 year in a support function in AI
  • Strong technical background, with knowledge of AI, ML, GPU technologies and their integration into high-performance computing (HPC) environments
  • Familiarity with infrastructure services (e.g., Kubernetes, SLURM), infrastructure-as-code solutions (e.g., Ansible), high-performance network fabrics, NFS-based storage management, container infrastructure, and scripting and programming languages
  • Familiarity with operating storage systems in HPC environments such as Vast and Weka
  • Familiarity with inspecting and resolving network-related errors
  • Strong knowledge of Python, TypeScript, and/or JavaScript with testing/debugging experience using curl and Postman-like tools
  • Foundational understanding of the installation, configuration, administration, troubleshooting, and securing of compute clusters
  • Complex technical problem solving and troubleshooting, with a proactive approach to issue resolution
  • Ability to work cross-functionally with teams such as Sales, Engineering, Support, Product and Research to drive customer success
  • Strong sense of ownership and willingness to learn new skills to ensure both team and customer success
Job Responsibility:
  • Engage directly with customers to tackle and resolve complex technical challenges involving our cutting-edge GPU clusters and our inference and fine-tuning services
  • Ensure swift and effective solutions every time
  • Become a product expert in all of our Gen AI solutions, serving as the last line of technical defense before issues are escalated to Engineering and Product teams
  • Collaborate seamlessly across Engineering, Research, and Product teams to address customer concerns
  • Collaborate with senior leaders both internally and externally to ensure the highest levels of customer satisfaction
  • Transform customer insights into action by identifying patterns in support cases and working with Engineering and Go-To-Market teams to drive Together’s roadmap (e.g., future models to support)
  • Maintain detailed documentation of system configurations, procedures, troubleshooting guides, and FAQs to facilitate knowledge sharing with team and customers
  • Be flexible in providing support coverage during holidays, nights and weekends as required by business needs to ensure consistent and reliable service for our customers
What we offer:
  • competitive compensation
  • startup equity
  • health insurance
  • flexibility in terms of remote work for the respective hiring region

Customer Support Engineer

As a Customer Support Engineer at a pioneering AI company, you'll be the first l...
Location:
United States, San Francisco
Salary:
180000.00 - 260000.00 USD / Year
Together AI
Expiration Date:
Until further notice
Requirements:
  • 5+ years of experience in a customer-facing technical role with at least 1 year in a support function in AI
  • Strong technical background, with knowledge of AI, ML, GPU technologies and their integration into high-performance computing (HPC) environments
  • Familiarity with infrastructure services (e.g., Kubernetes, SLURM), infrastructure-as-code solutions (e.g., Ansible), high-performance network fabrics, NFS-based storage management, container infrastructure, and scripting and programming languages
  • Familiarity with operating storage systems in HPC environments such as Vast and Weka
  • Familiarity with inspecting and resolving network-related errors
  • Strong knowledge of Python, TypeScript, and/or JavaScript with testing/debugging experience using curl and Postman-like tools
  • Foundational understanding of the installation, configuration, administration, troubleshooting, and securing of compute clusters
  • Complex technical problem solving and troubleshooting, with a proactive approach to issue resolution
  • Ability to work cross-functionally with teams such as Sales, Engineering, Support, Product and Research to drive customer success
  • Strong sense of ownership and willingness to learn new skills to ensure both team and customer success
Job Responsibility:
  • Engage directly with customers to tackle and resolve complex technical challenges involving our cutting-edge GPU clusters and our inference and fine-tuning services
  • Ensure swift and effective solutions every time
  • Become a product expert in all of our Gen AI solutions, serving as the last line of technical defense before issues are escalated to Engineering and Product teams
  • Collaborate seamlessly across Engineering, Research, and Product teams to address customer concerns
  • Collaborate with senior leaders both internally and externally to ensure the highest levels of customer satisfaction
  • Transform customer insights into action by identifying patterns in support cases and working with Engineering and Go-To-Market teams to drive Together’s roadmap (e.g., future models to support)
  • Maintain detailed documentation of system configurations, procedures, troubleshooting guides, and FAQs to facilitate knowledge sharing with team and customers
  • Be flexible in providing support coverage during holidays, nights and weekends as required by business needs to ensure consistent and reliable service for our customers
What we offer:
  • competitive compensation
  • startup equity
  • health insurance
  • flexibility in terms of remote work
  • Full-time

Senior Network Fabric Engineer

Our client, located in Milpitas, CA, is currently in need of a Sr Network Engine...
Location:
United States, Milpitas
Salary:
70.00 - 90.00 USD / Hour
ClearBridge Technology Group
Expiration Date:
Until further notice
Requirements:
  • Ability to work 100% onsite
  • Experience supporting leaf-spine style architecture
  • Experience supporting low-speed control / management networks and high-speed data acquisition paths
  • Experience working within HPC networks
  • Experience working within Linux container environments automated with Grafana and Ansible
Job Responsibility:
  • Build post-handoff network observability
  • Implement and enhance telemetry, advanced monitoring, health dashboards
  • Detect and alert on fiber/optic degradation, port failures, link state changes, environmental impacts
  • Work with Spectrum-X telemetry and advanced QoS policies
  • Establish operational workflows: detection → alerting → remediation
  • Maintain fabric stability after PS team exits
  • Support integration troubleshooting across compute, storage, network
What we offer:
  • Excellent benefits and compensation packages

HPC Solution Architect

The Software Engineering team delivers next-generation software application enha...
Location:
United States, Austin; Hopkinton
Salary:
210000.00 - 265000.00 USD / Year
Dell
Expiration Date:
Until further notice
Requirements:
  • 8+ years engineering large‑scale HPC and distributed infrastructure, with strong knowledge of cluster architecture, schedulers, and provisioning workflows
  • Deep experience with RHEL/Rocky/Ubuntu
  • Hands‑on cluster deployments using open‑source toolchains, Omnia, and OpenCHAMI (composable provisioning, cloud‑init, microservices)
  • Production experience with Slurm and/or Kubernetes
  • Proficient with Docker/Podman, OpenTelemetry pipelines, and telemetry instrumentation
  • Solid L2/L3 fundamentals, PXE/iPXE, DHCP/TFTP
  • Experience with InfiniBand/RoCE/Omni‑Path fabrics and event streaming with Kafka
  • Strong skills in Ansible, Python, Bash
  • Expertise with Prometheus and Grafana dashboards
  • Proven communication skills for escalations and simplifying complex HPC concepts
Job Responsibility:
  • Lead customer architecture & design, translating HPC/AI workload requirements into scalable cluster architectures (compute, schedulers, storage, interconnects)
  • Deploy and operationalize clusters using Omnia or similar automation, including provisioning, scheduler bring‑up, telemetry, authentication, and repo management
  • Build and maintain provisioning workflows (OpenCHAMI‑based or equivalent) covering PXE/iPXE boot, cloud‑init, security, and identity/cert operations
  • Serve as Tier‑3 engineering escalation, troubleshooting complex provisioning, scheduling, GPU, networking, and performance issues
  • Perform RCAs and drive permanent fixes
  • Contribute to open source and customer enablement through code contributions, documentation, workshops, runbooks, templates, and field readiness materials
What we offer:
  • Comprehensive Healthcare Programs
  • Award Winning Financial Wellness Tools and Resources
  • Generous Leave of Absence for New Parents and Caregivers
  • Industry Leading Wellness Platform
  • Employee Assistance Program

Senior Supercomputing Operations Engineer

Microsoft Azure’s Artificial Intelligence and High‑Performance Computing (AI/HPC...
Location:
United States, Multiple Locations
Salary:
119800.00 - 234700.00 USD / Year
Microsoft Corporation
Expiration Date:
Until further notice
Requirements:
  • Bachelor's Degree in Computer Science or related technical field AND 4+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python, OR equivalent experience
  • Ability to meet Microsoft, customer and/or government security screening requirements
  • Microsoft Cloud Background Check upon hire/transfer and every two years thereafter
  • Master's Degree in Computer Science or related technical field AND 8+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR Bachelor's Degree in Computer Science or related technical field AND 12+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
  • 4+ years of experience operating high performance computing (HPC), artificial intelligence (AI), or large-scale distributed systems in production environments
  • Hands-on experience operating interconnect fabrics for HPC, AI, or large-scale distributed systems in production
  • Strong Linux systems knowledge with demonstrated experience debugging low-level infrastructure issues
  • Demonstrated ability to reason across hardware, firmware, drivers, and software stacks to diagnose and resolve production issues
  • Familiarity with InfiniBand Subnet Manager behavior, including routing, congestion control, and fabric telemetry
Job Responsibility:
  • Act as DRI for InfiniBand and GPU interconnect fabric operations, ensuring GPU availability and AI training stability
  • Lead incident triage, mitigation, recovery, and root cause analysis for fabric-related production issues
  • Perform deep multi-layer debugging across InfiniBand, Subnet Manager, GPU interconnect, PCIe, and GPU interactions
  • Drive operational excellence and prevention by identifying systemic failure patterns and authoring TSGs, playbooks, and escalation guides
  • Build and leverage automation, telemetry, and tooling to improve detection, debuggability, and mean time to mitigation
  • Full-time

Sr Kubernetes Engineer for HPC

Our client, located in Milpitas, CA, is currently in need of a Sr Kubernetes Eng...
Location:
United States, Milpitas
Salary:
70.00 - 90.00 USD / Hour
ClearBridge Technology Group
Expiration Date:
Until further notice
Requirements:
  • Ability to work onsite
  • Deep understanding of Kubernetes; ability to support, troubleshoot and optimize a Kubernetes-driven HPC system
  • Containers: Kubernetes, Docker, etc.
  • Automation Tools: Grafana, Ansible
  • Cluster Management
  • Open-source data tools: Kafka
  • Cloud Databases: AWS Databases
  • Linux
  • HPC related tools
Job Responsibility:
  • Help the customer gain long-term confidence in network stability by detecting marginal ports, packet drops and flapping links
  • Comprehensive fabric health monitoring, and integration with existing observability tools
  • Perform post-deployment integration testing beyond PS baseline tests
  • Validate end-to-end system behavior in real customer environments
  • Validate sustained performance targets
  • Diagnose performance issues across layers
  • Perform root-cause analysis
What we offer:
  • excellent benefits and compensation packages

Kubernetes Platform Engineer

Kubernetes Platform Engineer. This role has been designed as ‘Hybrid’ with an ex...
Location:
United States, Bloomington
Salary:
111500.00 - 211500.00 USD / Year
Hewlett Packard Enterprise
Expiration Date:
Until further notice
Requirements:
  • Cloud Architectures
  • Cross Domain Knowledge
  • Design Thinking
  • Development Fundamentals
  • DevOps
  • Distributed Computing
  • Microservices Fluency
  • Full Stack Development
  • Security-First Mindset
  • Solutions Design
Job Responsibility:
  • Lead Kubernetes‑native, RDMA‑class networking for distributed AI inference platforms on HPC clusters
  • Own the end‑to‑end technical design that allows Kubernetes‑orchestrated inference workloads (NVIDIA NIMs, vLLM, TensorRT‑LLM) to transparently consume high‑speed fabrics (e.g., HPE Slingshot/CXI) using Operators, DRA, CDI, Multus/secondary CNI, and Kubernetes networking abstractions—without container rebuilds, privileged pods, or manual tuning
  • Make HPC fabric capabilities consumable from standard containers
  • Design the mechanisms to expose RDMA‑capable NIC resources and required runtime components without baking the fabric into images, including mounting/injecting host user‑space libraries (e.g., libcxi + libfabric) in a controlled, supportable way
  • Define and implement the reference design for Kubernetes‑native RDMA enablement across Dynamic Resource Allocation (DRA), Container Device Interface (CDI), Multus + secondary CNIs, and Operator‑driven lifecycle management
  • Own API and CRD design (ResourceClaims, DeviceClasses, custom CRDs) with long‑term compatibility guarantees
  • Make and defend architectural tradeoffs between Device plugins vs DRA, CDI vs runtime hooks vs admission webhooks, Shared vs exclusive NIC models, and Performance vs operability vs isolation
  • Define how distributed inference patterns (KV‑cache movement, prefill/decode separation) map onto Kubernetes primitives
  • Ensure out-of-the-box compatibility with NVIDIA NIMs and the NIM Operator, KServe ServingRuntime / InferenceService, and GPU Operator (CDI mode)
  • Publish deployment patterns and validated manifests for inference workloads using RDMA fast paths
What we offer:
  • Health & Wellbeing
  • Personal & Professional Development
  • Unconditional Inclusion
  • Full-time

Senior Director, CTIO Engineering Technologists

From applied research to advanced engineering, the Engineering Technologist team...
Location:
United States, Austin; Santa Clara
Salary:
277000.00 - 358000.00 USD / Year
Dell
Expiration Date:
Until further notice
Requirements:
  • 18 years of overall experience, including 5 years leading and directing the strategic and operational objectives of an organization related to HPC (high-performance compute) clusters, AI compute, AI datacenter, AI storage, etc.
  • Demonstrated experience delivering AI Solutions
  • Experience developing long-term technology strategies based on the technical and business information
  • Drives for internal and external alignment of the strategy
  • Identifies and develops differentiation opportunities, provides technical information, and makes recommendations to marketing, procurement, engineering, customers, and business executives
  • Participates as required in strategic initiatives for improvements in process, quality, and cost
Job Responsibility:
  • Lead a team of highly skilled SMEs in the development of next-generation, large-scale AI systems, including accelerated compute, AI fabrics, AI-optimized storage and the AI software stack
  • Responsibilities include the assimilation and understanding of the industry and competitive environment for a given technology or product line, and the derivation of a technology/product strategy from this information.
  • Leads technology investigations, performs a strategic analysis of the industry capabilities, and develops recommendations, which influence the technical product strategy and/or definition of products for a given product line, including evaluation of potential acquisitions and vendor partner opportunities.
  • Engages design teams, systems engineering, marketing teams, suppliers, and business unit leaders and executives to ensure the strategy or product architecture meets Dell’s requirement of product leadership for the given technology area or product line.
What we offer:
  • Comprehensive Healthcare Programs
  • Award Winning Financial Wellness Tools and Resources
  • Generous Leave of Absence for New Parents and Caregivers
  • Industry Leading Wellness Platform
  • Employee Assistance Program
  • Full-time