AI/HPC System Performance Engineer

184000.00 - 257000.00 USD / Year · Job Posted March 20, 2026

Job Description

Meta's AI Training and Inference Infrastructure is growing exponentially to support ever-increasing use cases of AI. This creates a dramatic scaling challenge that our engineers deal with daily. We need to build and evolve the network infrastructure that connects myriad training accelerators, such as GPUs. In addition, we need to ensure that the network runs smoothly and meets the stringent performance and availability requirements of RDMA workloads, which expect a lossless fabric interconnect with minimal latency. To improve the performance of these systems, we constantly look for opportunities across the stack: network fabric, host networking, communication libraries, and scheduling infrastructure.

Job Responsibility

  • Lead multi-disciplinary teams to develop solutions for large-scale training systems. Assess trade-offs of various solutions and make pragmatic decisions
  • Ensure timely milestone delivery through teamwork and close collaboration
  • Own the overall performance of the communication system, including performance benchmarking, monitoring, and troubleshooting production issues
  • Define technical strategy and drive a multi-year roadmap to make progress toward the related objectives
  • Work with cross-functional teams and provide guidance on the AI network architecture, including topologies, transport, and congestion control techniques

Requirements

  • Bachelor's degree in Computer Science, Computer Engineering, relevant technical field, or equivalent practical experience
  • Experience with developing, evaluating and debugging host networking protocols such as RDMA
  • 10+ years of experience in designing, deploying and operating networks
  • Experience with triaging performance issues in complex scale-out distributed applications
  • Understanding of AI training workloads and the demands they place on networks

Nice to have

  • Experience with developing communication libraries, such as Message Passing Interface, NCCL, and UCX
  • Understanding of RDMA congestion control mechanisms on InfiniBand and RoCE Networks
  • Understanding of the latest artificial intelligence (AI) technologies
  • Experience with machine learning frameworks such as PyTorch and TensorFlow
  • Experience in developing systems software in languages like C++

What we offer

  • bonus
  • equity
  • benefits

Similar Jobs for AI/HPC System Performance Engineer

8 matching positions

AI/HPC System Performance Engineer

Meta is building some of the world's largest AI and high-performance computing i...
Location: United States, Menlo Park
Salary: 154000.00 - 217000.00 USD / Year
Meta
Expiration Date: Until further notice
Requirements
  • Experience profiling and optimizing distributed AI or HPC workloads, including familiarity with GPU interconnects, RDMA networking, and collective communication frameworks such as NCCL or MPI
  • Experience debugging complex, non-reproducible performance issues across multi-layer systems including network fabric, operating system, and application layers
  • Experience designing and implementing performance monitoring systems, including instrumentation, telemetry pipelines, and alerting for large-scale infrastructure
  • Experience driving cross-functional technical projects from requirements definition through production deployment, including communicating performance findings and trade-offs to diverse stakeholders
  • Bachelor's degree in Computer Science, Computer Engineering, relevant technical field, or equivalent practical experience
  • 6+ years of experience in system performance engineering, network infrastructure engineering, or a related field within large-scale distributed computing or HPC environments
Job Responsibility
  • Profile and benchmark AI training and inference workloads across large-scale HPC clusters to identify network, compute, and memory bottlenecks
  • Develop and maintain performance analysis frameworks and dashboards to track system-level metrics including GPU utilization, network bandwidth, latency, and collective communication efficiency
  • Investigate and resolve performance regressions in distributed AI training environments, including issues related to RDMA fabrics, collective communication libraries, and job scheduling
  • Collaborate with network infrastructure, hardware, and AI research teams to define performance requirements and validate new HPC cluster configurations
  • Design and execute capacity and scalability experiments to inform network topology decisions for AI supercomputing infrastructure
  • Build tooling and automation to continuously monitor HPC system health, detect anomalies, and reduce mean time to mitigation during performance incidents
  • Establish service level objectives for AI cluster network performance and drive cross-functional alignment on reliability and efficiency targets
  • Lead technical design reviews for network and system architecture changes affecting AI workload performance, communicating trade-offs clearly to engineering and product stakeholders
  • Mentor other engineers on HPC performance methodologies, debugging techniques, and instrumentation best practices
  • Leverage AI-assisted workflows to accelerate root cause analysis, automate routine performance reporting, and expand coverage across the HPC stack
What we offer
  • bonus + equity + benefits
  • Full-time

AI/HPC System Performance Engineer

Meta's AI Training and Inference Infrastructure is growing exponentially to supp...
Location: United States, Austin
Salary: 219000.00 - 301000.00 USD / Year
Meta
Expiration Date: Until further notice
Requirements
  • Bachelor's degree in Computer Science, Computer Engineering, relevant technical field, or equivalent practical experience
  • Experience with developing, evaluating and debugging host networking protocols such as RDMA
  • 10+ years of experience in designing, deploying and operating networks
  • Experience with triaging performance issues in complex scale-out distributed applications
Job Responsibility
  • Lead multi-disciplinary teams to develop solutions for large-scale training systems. Assess trade-offs of various solutions and make pragmatic decisions
  • Ensure timely milestone delivery through teamwork and close collaboration
  • Own the overall performance of the communication system, including performance benchmarking, monitoring, and troubleshooting production issues
  • Define technical vision and drive a multi-year roadmap to make progress toward the related objectives
  • Work with cross-functional teams and provide guidance on the AI network architecture, including topologies, transport, and congestion control techniques
What we offer
  • bonus
  • equity
  • benefits

AI/HPC System Performance Engineer, PhD

Meta's AI Training and Inference Infrastructure is growing exponentially to supp...
Location: United States, Menlo Park
Salary: 122000.00 - 181000.00 USD / Year
Meta
Expiration Date: Until further notice
Requirements
  • Currently has, or is in the process of obtaining a Bachelor's degree in Computer Science, Computer Engineering, relevant technical field, or equivalent practical experience. Degree must be completed prior to joining Meta
  • Bachelor's degree in Computer Science, Computer Engineering, relevant technical field, or equivalent practical experience
  • BS/MS/PhD in relevant fields (EE, CS), with 2+ years work experience
  • Experience with using communication libraries, such as MPI, NCCL, and UCX
  • Experience with developing, evaluating and debugging host networking protocols such as RDMA
  • Experience with triaging performance issues in complex scale-out distributed applications
  • Must obtain work authorization in country of employment at the time of hire and maintain ongoing work authorization during employment
Job Responsibility
  • Active member of a multi-disciplinary team to develop solutions for large scale training systems
  • Responsible for the overall performance of the communication system, including performance benchmarking, monitoring and troubleshooting production issues
  • Identify potential performance issues across the stack: comms lib, RDMA transport, host networking, scheduling and network fabric. Develop and deploy innovative solutions to address the performance issues
What we offer
  • bonus
  • equity
  • benefits

AI/HPC Systems Performance Engineer

Meta's AI Training and Inference Infrastructure is growing exponentially to supp...
Location: United States, Menlo Park
Salary: 122000.00 - 181000.00 USD / Year
Meta
Expiration Date: Until further notice
Requirements
  • Currently has, or is in the process of obtaining a Bachelor's degree in Computer Science, Computer Engineering, relevant technical field, or equivalent practical experience. Degree must be completed prior to joining Meta
  • Bachelor's degree in Computer Science, Computer Engineering, or other relevant technical field, with 2+ years work experience
  • Experience with using communication libraries, such as MPI, NCCL, and UCX
  • Experience with developing, evaluating and debugging host networking protocols such as RDMA
  • Experience with triaging performance issues in complex scale-out distributed applications
Job Responsibility
  • Collaborate with hardware and software teams to optimize end-to-end communication pathways for large-scale distributed training workloads, ensuring seamless integration between compute, storage, and networking components
  • Design, implement, and validate new collective communication algorithms tailored for AI/HPC workloads, leveraging RDMA and advanced networking technologies to maximize throughput and minimize latency
  • Develop and maintain automated performance testing frameworks for continuous benchmarking of communication libraries and RDMA transport layers, enabling rapid identification of regressions and bottlenecks
  • Analyze and profile communication patterns in real-world training jobs, using telemetry and tracing tools to uncover inefficiencies and recommend architectural improvements
  • Drive adoption of best practices for scalable, fault-tolerant communication in production environments, including tuning RDMA parameters, optimizing network fabric configurations, and ensuring robust error handling
  • Work closely with vendors and internal teams to evaluate and integrate new hardware features (e.g., NICs, switches, accelerators) that can enhance communication performance for AI/HPC clusters
  • Contribute to documentation and knowledge sharing by authoring technical guides, performance reports, and internal wiki pages to educate peers and stakeholders on communication system optimizations
  • Participate in code reviews and design discussions to ensure high-quality, maintainable solutions that meet the evolving needs of large-scale AI/HPC infrastructure
What we offer
  • bonus
  • equity
  • benefits

Sr AI/HPC Applications and Performance Engineer

Sr AI/HPC Applications and Performance Engineer role at Hewlett Packard Enterpri...
Location: United States
Salary: 161500.00 - 370500.00 USD / Year
Hewlett Packard Enterprise
Expiration Date: Until further notice
Requirements
  • 15+ years' experience
  • Deep expertise in AI and HPC applications and performance engineering including simulation, modeling and emulation capabilities
  • Expertise in large-scale AI and HPC systems
  • Experience architecting, designing, and developing innovative software system design tools and languages
  • Excellent analytical and problem-solving skills
  • Experience in leading overall architecture of software systems for products and solutions
  • Designing and integrating efficient and scalable software systems running on multiple platform types into overall architecture
  • Evaluating and selecting forms and processes for software systems testing and methodology
  • History of innovation with multiple patents or deployed solutions in the field of software design
  • Excellent written and verbal communication skills
Job Responsibility
  • Develops organization-wide architectures, strategies, and methodologies for software systems design and development across multiple platforms and organizations
  • Identifies and makes informed recommendations regarding new technologies, innovations, and outsourced development partner relationships
  • Reviews, evaluates, and influences designs and project activities for compliance with development guidelines and standards
  • Provides tangible solutions that improve product quality and mitigate failure risk
  • Contributes domain expertise, business acumen, and experience to influence decisions of executive business leadership
  • Brings creativity and innovation to the organization
  • Provides guidance and mentoring to less-experienced team members
  • Acts as an internal authority on software systems design
  • Contributes to the external technical community through whitepapers, patents, or other significant innovations
What we offer
  • Health & Wellbeing benefits
  • Personal & Professional Development programs
  • Unconditional Inclusion environment
  • Comprehensive benefits suite supporting physical, financial and emotional wellbeing
  • Full-time

Software Engineer - AI/HPC Specialist

We are looking for software engineers to help scale and improve the efficiency o...
Location: Norway, Oslo
Salary: Not provided
Meta
Expiration Date: Until further notice
Requirements
  • 3+ years of experience developing in C++/C and Python
  • Experience with High Performance Computing/Networking or AI systems applications frameworks
  • Bachelor's degree in Computer Science, Computer Engineering, relevant technical field, or equivalent practical experience
  • Specialized experience in one or more of the following machine learning/deep learning domains: Hardware accelerators, AI Infrastructure, or high performance networking
  • Solid experience in debugging of distributed systems, revision control systems, testing, and CI pipelines
Job Responsibility
  • Work on collective communications stacks to optimise networking operations, leading to improved AI inference and training model performance
  • Drive implementation of latency and bandwidth critical networking operations, as well as out-of-band signalling
  • Debug custom and third party multi-host, accelerator enabled AI platforms
  • Software development using C++/C and Python
  • Work closely with other teams to deliver impact
  • Develop and improve features and innovations
  • Extend and optimize large scale learning collective operations

AI/HPC Cluster Thermal Design Engineer

We are seeking a Cluster Thermal Engineer to help architect and deliver scalable...
Location: United States, Austin
Salary: 143120.00 - 214680.00 USD / Year
AMD
Expiration Date: Until further notice
Requirements
  • Strong understanding of fundamentals: thermodynamics, fluid dynamics, and heat transfer
  • Familiarity with electronics cooling concepts (heat sinks, cold plates, TIMs, heat exchangers, pumps, valves, fans)
  • Exposure to data center or cluster thermal concepts such as: rack/row layout considerations, CDU/coolant distribution, RDHx/AHU/fan-wall concepts, chilled water interfaces and heat rejection
  • Exposure to one or more thermal/CFD simulation tools (ANSYS, COMSOL, FloTHERM, OpenFOAM, or similar)
  • Familiarity with measurement and validation practices (instrumentation, uncertainty, sensor placement, data analysis)
  • Comfort working in cross-functional engineering environments and communicating technical ideas clearly
  • Bonus: understanding of PUE/WUE drivers, economizers/free cooling, or waste-heat reuse concepts
  • Bonus: coursework in heat transfer, thermodynamics, two-phase flow and heat transfer, refrigeration
  • Bonus: projects, internships, or research related to HPC/AI infrastructure, data centers, or high-power electronics cooling
  • Bachelor’s, Master’s or Ph.D. degrees in Mechanical Engineering with expertise in thermal management of electronics
Job Responsibility
  • Support the thermal design of AI/HPC cluster solutions, including compute racks, cooling loops, and facility interfaces
  • Assist in evaluating cooling architectures (air cooling, direct liquid cooling, hybrid approaches) and identifying trade-offs in performance, cost, complexity, and reliability
  • Build and refine thermal and airflow models for system/cluster/data center concepts using industry tools (e.g., OpenFOAM, ANSYS, FloTHERM, or similar)
  • Contribute to flow-network modeling for liquid cooling and coolant distribution analyses to ensure adequate flow, pressure, and temperature margins
  • Help define and execute test plans to validate thermal performance at component, system, and rack/cluster levels
  • Support integration of cooling solutions and thermal telemetry at the cluster level in collaboration with power, networking, platform, firmware, and controls teams
  • Participate in design reviews with internal stakeholders and external partners/customers; summarize findings and track action items
  • Assist with experimental setup, instrumentation, data collection, and analysis in partnership with validation and lab teams
  • Create and maintain technical documentation, including requirements, design notes, modeling assumptions, test reports, and user/customer-facing summaries
  • Full-time

PCAI and AI Factory Expert

We are seeking a Subject Matter Expert (SME) – Admin, Operate & Manage (HPE PCAI...
Location: India, Bengaluru
Salary: Not provided
Hewlett Packard Enterprise
Expiration Date: Until further notice
Requirements
  • Bachelor’s / Master’s Degree in Computer Science, IT, or equivalent field
  • 8+ years of IT infrastructure administration experience, including 3+ years in AI/HPC or GPU-based environments
  • Proven experience in platform operations, monitoring, and lifecycle management of enterprise-grade AI and HPC environments
  • Hands-on experience in automation and orchestration across bare metal and containerized infrastructure
Job Responsibility
  • Administer and maintain HPE PCAI and AI Factory environments, ensuring optimal uptime and performance
  • Manage compute nodes (HPE DL380a, DL325, Cray XD670), GPU clusters (NVIDIA L40S/H100/H200), and InfiniBand NDR networks
  • Administer virtualization and container platforms such as vSphere, RHEL/RHOS, Ezmeral Runtime Enterprise, Kubernetes, and Rancher Harvester
  • Perform configuration, patching, version upgrades, and firmware updates across hardware and software layers
  • Proactively monitor system health using DCGM, NetQ, Grafana, and Exivity dashboards
  • Handle alerts, performance anomalies, and incidents across GPU, network, and storage layers
  • Lead root cause analysis (RCA) and corrective action plans to prevent recurring issues
  • Maintain operational documentation, runbooks, and incident logs
  • Manage cluster lifecycle through Ansible, AWX, HPE Performance Cluster Manager (HPCM), and SLURM
  • Oversee automation for provisioning, scaling, and patch management of Compute and Containerized workloads
What we offer
  • Health & Wellbeing
  • Personal & Professional Development
  • Unconditional Inclusion
  • Full-time