CrawlJobs Logo

HPC Fabrics Engineer

https://www.hpe.com/ Logo

Hewlett Packard Enterprise

Location Icon

Location:
India , Bangalore

Category Icon

Job Type Icon

Contract Type:
Not provided

Salary Icon

Salary:

Not provided

Job Description:

High Performance Computing, AI and Labs is a critical element of HPE. We are focused on delivering innovative solutions that accelerate our customers’ digital transformation, enabling them to tackle their complex, and data-intensive workloads. Combining deep expertise and the development of the world’s most cutting-edge, high-performance supercomputers, is defining the next era of computing delivering valuable insight & innovation. Join us and redefine what’s next for you.

Job Responsibility:

  • Develop, test and release Firmware and Driver components for HPC Option cards (InfiniBand, High Speed Ethernet adapters) on Linux, Windows and VMware OS
  • Qualify HPC Option card components on HPE Server platforms
  • Handle Level-4 support for HPC Fabrics components. Collaborate with Level3 support, account teams and customers as needed and provide technical assistance on escalated issues
  • Work closely with partner to ensure product quality requirements and release timelines are met
  • Collaborate with Engineering teams (Platform, Thermal, Factory, Benchmarking and test teams) on HPC Option card firmware, driver component related issues
  • Create and contribute to Trainings, Advisories and knowledge base articles on HPC Fabrics Technology

Requirements:

  • Bachelor's or Master's degree in Computer Science, Information Systems, or equivalent
  • Typically 2-4 years experience
  • Experience with Linux, Windows and VMware ESXi platforms
  • Scripting knowledge (Shell, Perl, Python)
  • Working knowledge of virtualization environment and Hypervisors
  • Knowledge on Server hardware architecture, PCIe, NVLink speeds and concepts like NUMA
  • Hands-on experience with InfiniBand and Ethernet (RoCE) networks
  • Experience in configuring, troubleshooting and tuning Infiniband and Ethernet networks
  • Hands-on Experience with network performance testing (RDMA Perftest, iPerf, netperf) and application level testing (NCCL, RCCL)
  • Knowledge of High Performance Computing (HPC) stack components, infrastructure and scale-out deployments
  • Ability to work with multiple internal teams and interact with customers/partners
  • Ability to prioritize and handle multiple tasks simultaneously
  • Excellent communication, collaboration and interpersonal skills
  • Ability to work well in team environment, take on challenges, comfortable and effective working on new areas that require experimentation and rapid problem solving

Nice to have:

  • Experience with GPUs, Compute accelerators, SSD drives and Storage controllers would be a plus
  • Cloud Architectures, Cross Domain Knowledge, Design Thinking, Development Fundamentals, DevOps, Distributed Computing, Microservices Fluency, Full Stack Development, Security-First Mindset, Solutions Design, Testing & Automation, User Experience (UX)
What we offer:
  • Health & Wellbeing
  • Personal & Professional Development
  • Unconditional Inclusion

Additional Information:

Job Posted:
March 21, 2026

Work Type:
Hybrid work
Job Link Share:

Looking for more opportunities? Search for other job offers that match your skills and interests.

Briefcase Icon

Similar Jobs for HPC Fabrics Engineer

Customer Support Engineer

As a Customer Support Engineer at a pioneering AI company, you'll be the first l...
Location
Location
India
Salary
Salary:
Not provided
together.ai Logo
Together AI
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • 5+ years of experience in a customer-facing technical role with at least 1 year in a support function in AI
  • Strong technical background, with knowledge of AI, ML, GPU technologies and their integration into high-performance computing (HPC) environments
  • Familiarity with infrastructure services (e.g., Kubernetes, SLURM), infrastructure as code solutions (e.g., Ansible) high-performance network fabrics, NFS-based storage management, container infrastructure, and scripting and programming languages
  • Familiarity with operating storage systems in HPC environments such as Vast and Weka
  • Familiarity with inspecting and resolving network-related errors
  • Strong knowledge of Python, TypeScript, and/or JavaScript with testing/debugging experience using curl and Postman-like tools
  • Foundational understanding in the installation, configuration, administration, troubleshooting, and securing of compute clusters
  • Complex technical problem solving and troubleshooting, with a proactive approach to issue resolution
  • Ability to work cross-functionally with teams such as Sales, Engineering, Support, Product and Research to drive customer success
  • Strong sense of ownership and willingness to learn new skills to ensure both team and customer success
Job Responsibility
Job Responsibility
  • Engage directly with customers to tackle and resolve complex technical challenges involving our cutting-edge GPU clusters and our inference and fine-tuning services
  • ensure swift and effective solutions every time
  • Become a product expert in all of our Gen AI solutions, serving as the last line of technical defense before issues are escalated to Engineering and Product teams
  • Collaborate seamlessly across Engineering, Research, and Product teams to address customer concerns
  • collaborate with senior leaders both internally and externally to ensure the highest levels of customer satisfaction
  • Transform customer insights into action by identifying patterns in support cases and working with Engineering and Go-To-Market teams to drive Together’s roadmap (e.g., future models to support)
  • Maintain detailed documentation of system configurations, procedures, troubleshooting guides, and FAQs to facilitate knowledge sharing with team and customers
  • Be flexible in providing support coverage during holidays, nights and weekends as required by business needs to ensure consistent and reliable service for our customers
What we offer
What we offer
  • competitive compensation
  • startup equity
  • health insurance
  • flexibility in terms of remote work for the respective hiring region
Read More
Arrow Right

Customer Support Engineer

As a Customer Support Engineer at a pioneering AI company, you'll be the first l...
Location
Location
United States , San Francisco
Salary
Salary:
180000.00 - 260000.00 USD / Year
together.ai Logo
Together AI
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • 5+ years of experience in a customer-facing technical role with at least 1 year in a support function in AI
  • Strong technical background, with knowledge of AI, ML, GPU technologies and their integration into high-performance computing (HPC) environments
  • Familiarity with infrastructure services (e.g., Kubernetes, SLURM), infrastructure as code solutions (e.g., Ansible) high-performance network fabrics, NFS-based storage management, container infrastructure, and scripting and programming languages
  • Familiarity with operating storage systems in HPC environments such as Vast and Weka
  • Familiarity with inspecting and resolving network-related errors
  • Strong knowledge of Python, TypeScript, and/or JavaScript with testing/debugging experience using curl and Postman-like tools
  • Foundational understanding in the installation, configuration, administration, troubleshooting, and securing of compute clusters
  • Complex technical problem solving and troubleshooting, with a proactive approach to issue resolution
  • Ability to work cross-functionally with teams such as Sales, Engineering, Support, Product and Research to drive customer success
  • Strong sense of ownership and willingness to learn new skills to ensure both team and customer success
Job Responsibility
Job Responsibility
  • Engage directly with customers to tackle and resolve complex technical challenges involving our cutting-edge GPU clusters and our inference and fine-tuning services
  • ensure swift and effective solutions every time
  • Become a product expert in all of our Gen AI solutions, serving as the last line of technical defense before issues are escalated to Engineering and Product teams
  • Collaborate seamlessly across Engineering, Research, and Product teams to address customer concerns
  • collaborate with senior leaders both internally and externally to ensure the highest levels of customer satisfaction
  • Transform customer insights into action by identifying patterns in support cases and working with Engineering and Go-To-Market teams to drive Together’s roadmap (e.g., future models to support)
  • Maintain detailed documentation of system configurations, procedures, troubleshooting guides, and FAQs to facilitate knowledge sharing with team and customers
  • Be flexible in providing support coverage during holidays, nights and weekends as required by business needs to ensure consistent and reliable service for our customers
What we offer
What we offer
  • competitive compensation
  • startup equity
  • health insurance
  • flexibility in terms of remote work
  • Fulltime
Read More
Arrow Right

Senior Network Fabric Engineer

Our client, located in Milpitas, CA, is currently in need of a Sr Network Engine...
Location
Location
United States , Milpitas
Salary
Salary:
70.00 - 90.00 USD / Hour
clearbridgetech.com Logo
ClearBridge Technology Group
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • Ability to work 100% onsite
  • Experience supporting leaf-spine style architecture
  • Experience supporting low-speed control / management networks and high-speed data acquisition paths
  • Experience working within HPC networks
  • Experience working within Linux container environments automated with Grafana and Ansible
Job Responsibility
Job Responsibility
  • Build post-handoff network observability
  • Implement and enhance telemetry, advanced monitoring, health dashboards
  • Detect and alert on fiber/optic degradation, port failures, link state changes, environmental impacts
  • Work with Spectrum-X telemetry and advanced QoS policies
  • Establish operational workflows: detection → alerting → remediation
  • Maintain fabric stability after PS team exits
  • Support integration troubleshooting across compute, storage, network
What we offer
What we offer
  • Excellent benefits and compensation packages
Read More
Arrow Right

Senior Supercomputing Operations Engineer

Microsoft Azure’s Artificial Intelligence and High‑Performance Computing (AI/HPC...
Location
Location
United States , Multiple Locations
Salary
Salary:
119800.00 - 234700.00 USD / Year
https://www.microsoft.com/ Logo
Microsoft Corporation
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • Bachelor's Degree in Computer Science or related technical field AND 4+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, OR Java, JavaScript, or Python
  • OR equivalent experience
  • Ability to meet Microsoft, customer and/or government security screening requirements
  • Microsoft Cloud Background Check upon hire/transfer and every two years thereafter
  • Master's Degree in Computer Science or related technical field AND 8+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR Bachelor's Degree in Computer Science or related technical field AND 12+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
  • 4+ years of experience operating high performance computing (HPC), artificial intelligence (AI), or largescale distributed systems in production environments
  • Handson experience operating interconnect fabrics for HPC, AI, or largescale distributed systems in production
  • Strong Linux systems knowledge with demonstrated experience debugging lowlevel infrastructure issues
  • Demonstrated ability to reason across hardware, firmware, drivers, and software stacks to diagnose and resolve production issues
  • Familiarity with InfiniBand Subnet Manager behavior, including routing, congestion control, and fabric telemetry
Job Responsibility
Job Responsibility
  • Act as DRI for InfiniBand and GPU interconnect fabric operations, ensuring GPU availability and AI training stability
  • Lead incident triage, mitigation, recovery, and root cause analysis for fabric-related production issues
  • Perform deep multi-layer debugging across InfiniBand, Subnet Manager, GPU interconnect, PCIe, and GPU interactions
  • Drive operational excellence and prevention by identifying systemic failure patterns and authoring TSGs, playbooks, and escalation guides
  • Build and leverage automation, telemetry, and tooling to improve detection, debuggability, and mean time to mitigation
  • Fulltime
Read More
Arrow Right

Sr Kubernetes Engineer for HPC

Our client, located in Milpitas, CA, is currently in need of a Sr Kubernetes Eng...
Location
Location
United States , Milpitas
Salary
Salary:
70.00 - 90.00 USD / Hour
clearbridgetech.com Logo
ClearBridge Technology Group
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • Ability to work onsite
  • Understand Kubernetes deeply, support troubleshoot and optimize a Kubernetes driven HPC system
  • Containers: Kubernetes, Docker, etc.
  • Automation Tools: Grafana, Ansible
  • Cluster Management
  • Open-source data tools: Kafka
  • Cloud Databases: AWS Databases
  • Linux
  • HPC related tools
Job Responsibility
Job Responsibility
  • Help the customer gain long term confidence in the network stability, and detect marginal ports, packet drops and flapping links
  • Comprehensive fabric health monitoring, and integration with existing observability tools
  • Perform post-deployment integration testing beyond PS baseline tests
  • Validate end to end system behavior in real customer environments
  • Validate sustained performance targets
  • Diagnose performance issues across layers
  • Perform root-cause analysis
What we offer
What we offer
  • excellent benefits and compensation packages
Read More
Arrow Right

Principal Supercomputing Operations Engineering Manager

Microsoft Azure’s Artificial Intelligence and High Performance Computing (AI/HPC...
Location
Location
United States , Multiple Locations
Salary
Salary:
139900.00 - 274800.00 USD / Year
https://www.microsoft.com/ Logo
Microsoft Corporation
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • Bachelor's Degree in Computer Science, or related technical discipline AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python
  • OR equivalent experience.
  • Ability to meet Microsoft, customer and/or government security screening requirements are required for this role. These requirements include, but are not limited to the following specialized security screenings: Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud Background Check upon hire/transfer and every two years thereafter.
Job Responsibility
Job Responsibility
  • Own and drive the end to end operational strategy for InfiniBand and GPU interconnect fabric reliability across large scale AI supercomputing environments, ensuring sustained GPU availability, training stability, and SLA compliance
  • Lead, manage, and grow a team of senior and principal engineers responsible for fabric operations, setting clear expectations, developing talent, and holding the organization accountable for outcomes
  • Provide senior technical leadership and executive decision making during high severity fabric incidents, guiding investigation strategy, escalation paths, and risk trade offs while ensuring effective execution through the team
  • Ensure consistent, high quality incident response, root cause analysis, and post incident follow through across the organization, with a strong emphasis on systemic prevention over one off fixes
  • Drive operational excellence by defining reliability models, failure domains, and long term corrective strategies, and ensuring adoption of authoritative TSGs, playbooks, and escalation frameworks
  • Partner deeply with platform, hardware, firmware, and service teams to align roadmaps, influence design decisions, and close systemic reliability gaps impacting interconnect fabrics at scale
  • Sponsor and prioritize automation, telemetry, diagnostics, and tooling investments that improve detection, observability, debuggability, and time to mitigation across the fleet
  • Fulltime
Read More
Arrow Right

Senior Linux System Administrator - Support Engineer

We are seeking an experienced Senior Linux System Administrator/System Support E...
Location
Location
Australia , Canberra
Salary
Salary:
Not provided
https://www.hpe.com/ Logo
Hewlett Packard Enterprise
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • Bachelor’s degree in Computer Science, Information Technology, or related field, or equivalent work experience
  • At least 5 years of hands-on experience managing Linux systems in production environments, including HPC systems
  • Expertise in Linux/Unix operating systems, parallel file systems (Lustre, GPFS), and networking technologies
  • Proficiency in scripting/programming languages (Bash, Python, Perl, C++)
  • Experience with automation/configuration management tools (Ansible, Puppet, Chef, Terraform)
  • Strong understanding of networking concepts (TCP/IP, DNS, DHCP, firewalls, VPNs)
  • Familiarity with monitoring/logging tools (Nagios, Grafana, ELK Stack)
  • Experience with containerization technologies (Docker, Kubernetes)
  • Excellent problem-solving, analytical, and communication skills
  • able to diagnose complex technical problems to root cause
Job Responsibility
Job Responsibility
  • Deploy, configure, maintain, and troubleshoot Linux servers and HPC clusters systems (Red Hat, CentOS, Ubuntu, or others) across physical (primarily), virtual, and cloud environments
  • Support, maintain, and optimize HPC systems, including cluster manager, operating system and network fabric installation, servicing, and advanced technical troubleshooting of hardware/software and parallel file systems (e.g., Lustre, GPFS)
  • Monitor system performance, availability, and security using industry-standard tools and practices
  • ensure compliance with organizational policies and external regulations
  • Plan and execute upgrades, patches, enhancements, and migrations to ensure systems are current, secure, and optimized
  • Automate system administration tasks using scripting languages (Bash, Python, Perl, etc.) and configuration management tools (Ansible, Puppet, Chef, Terraform)
  • Implement and maintain backup/recovery strategies, disaster recovery plans, and system documentation
  • Collaborate with development, network, and security teams to support application deployments and troubleshoot issues, particularly in multi-technology HPC environments
  • Provide technical consulting, mentoring, and guidance to junior team members and contribute to internal knowledge sharing
  • Ensure compliance with strict security protocols in sensitive environments (e.g., government, research)
What we offer
What we offer
  • Health & Wellbeing: comprehensive suite of benefits that supports physical, financial and emotional wellbeing
  • Personal & Professional Development: specific programs catered to helping you reach any career goals
  • Unconditional Inclusion: unconditionally inclusive in the way we work and celebrate individual uniqueness
Read More
Arrow Right

Senior Solution Architect AI & HPC

AI is a high-growth market for HPE, and we believe we are uniquely suited to bri...
Location
Location
Salary
Salary:
Not provided
https://www.hpe.com/ Logo
Hewlett Packard Enterprise
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • Bachelor's or Master's degree in Engineering, Computer Science, or similar quantitative focus preferred
  • Ability to quickly prototype functionality into scripts for demos, integrations, troubleshooting, etc.
  • Expertise in cloud architectures, specifically with public cloud platforms such as AWS, Azure, or Google Cloud
  • Strong understanding of AI technologies, including machine learning, deep learning, and neural networks
  • Experience participating in solution configurations and the creation of PoCs to meet customer requirements
  • Solid knowledge of infrastructure components, including servers, storage, networking, and virtualization
  • Experience with high-performance computing (HPC) and GPU-accelerated systems is advantageous
  • Demonstrates expert technical skills in assigned area of specialization
  • Expert knowledge of the company offerings, strategic initiatives, current trends, competitor products and strategies within area of responsibility
  • Expert level written and verbal communication skills and mastery over English and local language
Job Responsibility
Job Responsibility
  • Collaborate with sales teams to understand customer requirements and develop tailored solutions for their AI infrastructure needs
  • Engage in pre-sales activities, including technical presentations, demonstrations, and proof-of-concepts
  • Act as a trusted advisor to customers, addressing their questions, concerns, and technical challenges effectively
  • Stay up-to-date with the latest advancements in AI technologies, cloud architectures, and infrastructure trends
  • Lead Proof-of-Concepts (PoC) for HPE customers expanding into Deep Learning or Machine Learning use cases
  • Architect reusable end-to-end AI solutions for HPE customers and prospects
  • Lead technical discussions with customers and partners to propose HPE and partner Integrated solutions
  • Identify solutions, define action plans, and help coordinate and deliver optimal solutions and enhancements
  • Recommend configurations and settings for different types of hardware and interconnect fabrics
  • Assist in any product or technical issue towards an initial sale or renewal of a customer
What we offer
What we offer
  • Health & Wellbeing benefits
  • Personal & Professional Development programs
  • Unconditional Inclusion environment
  • Comprehensive suite of benefits that supports physical, financial and emotional wellbeing
  • Fulltime
Read More
Arrow Right