AI/HPC System Performance Engineer Job at Meta (Menlo Park)

AI/HPC System Performance Engineer

Meta's AI Training and Inference Infrastructure is growing exponentially to supp...

Location

United States, Menlo Park

Salary:

184000.00 - 257000.00 USD / Year

Meta

Expiration Date

Until further notice

Requirements

Bachelor's degree in Computer Science, Computer Engineering, relevant technical field, or equivalent practical experience
Experience with developing, evaluating and debugging host networking protocols such as RDMA
10+ years of experience in designing, deploying and operating networks
Experience with triaging performance issues in complex scale-out distributed applications
Understanding of AI training workloads and demands they exert on networks

Job Responsibility

Lead multi-disciplinary teams to develop solutions for large scale training systems. Assess trade-offs of various solutions and make pragmatic decisions
Ensure timely milestone delivery with teamwork and close collaboration
Responsible for the overall performance of the communication system, including performance benchmarking, monitoring and troubleshooting production issues
Define technical strategy and drive a multi-year roadmap to make progress towards the related objectives
Work with cross-functional teams and provide guidance on the AI network architecture, including topologies, transport, and congestion control techniques

What we offer

bonus
equity
benefits

AI/HPC System Performance Engineer

Meta's AI Training and Inference Infrastructure is growing exponentially to supp...

Location

United States, Austin

Salary:

219000.00 - 301000.00 USD / Year

Meta

Expiration Date

Until further notice

Requirements

Bachelor's degree in Computer Science, Computer Engineering, relevant technical field, or equivalent practical experience
Experience with developing, evaluating and debugging host networking protocols such as RDMA
10+ years of experience in designing, deploying and operating networks
Experience with triaging performance issues in complex scale-out distributed applications

Job Responsibility

Lead multi-disciplinary teams to develop solutions for large scale training systems. Assess trade-offs of various solutions and make pragmatic decisions
Ensure timely milestone delivery with teamwork and close collaboration
Responsible for the overall performance of the communication system, including performance benchmarking, monitoring and troubleshooting production issues
Define technical vision and drive a multi-year roadmap to make progress towards the related objectives
Work with cross-functional teams and provide guidance on the AI network architecture, including topologies, transport, and congestion control techniques

What we offer

bonus
equity
benefits

AI/HPC System Performance Engineer, PhD

Meta's AI Training and Inference Infrastructure is growing exponentially to supp...

Location

United States, Menlo Park

Salary:

122000.00 - 181000.00 USD / Year

Meta

Expiration Date

Until further notice

Requirements

Currently has, or is in the process of obtaining a Bachelor's degree in Computer Science, Computer Engineering, relevant technical field, or equivalent practical experience. Degree must be completed prior to joining Meta
Bachelor's degree in Computer Science, Computer Engineering, relevant technical field, or equivalent practical experience
BS/MS/PhD in relevant fields (EE, CS), with 2+ years work experience
Experience with using communication libraries, such as MPI, NCCL, and UCX
Experience with developing, evaluating and debugging host networking protocols such as RDMA
Experience with triaging performance issues in complex scale-out distributed applications
Must obtain work authorization in country of employment at the time of hire and maintain ongoing work authorization during employment

Job Responsibility

Active member of a multi-disciplinary team to develop solutions for large scale training systems
Responsible for the overall performance of the communication system, including performance benchmarking, monitoring and troubleshooting production issues
Identify potential performance issues across the stack: comms lib, RDMA transport, host networking, scheduling and network fabric. Develop and deploy innovative solutions to address the performance issues

What we offer

bonus
equity
benefits

AI/HPC Systems Performance Engineer

Meta's AI Training and Inference Infrastructure is growing exponentially to supp...

Location

United States, Menlo Park

Salary:

122000.00 - 181000.00 USD / Year

Meta

Expiration Date

Until further notice

Requirements

Currently has, or is in the process of obtaining a Bachelor's degree in Computer Science, Computer Engineering, relevant technical field, or equivalent practical experience. Degree must be completed prior to joining Meta
Bachelor's degree in Computer Science, Computer Engineering, or other relevant technical field, with 2+ years work experience
Experience with using communication libraries, such as MPI, NCCL, and UCX
Experience with developing, evaluating and debugging host networking protocols such as RDMA
Experience with triaging performance issues in complex scale-out distributed applications

Job Responsibility

Collaborate with hardware and software teams to optimize end-to-end communication pathways for large-scale distributed training workloads, ensuring seamless integration between compute, storage, and networking components
Design, implement, and validate new collective communication algorithms tailored for AI/HPC workloads, leveraging RDMA and advanced networking technologies to maximize throughput and minimize latency
Develop and maintain automated performance testing frameworks for continuous benchmarking of communication libraries and RDMA transport layers, enabling rapid identification of regressions and bottlenecks
Analyze and profile communication patterns in real-world training jobs, using telemetry and tracing tools to uncover inefficiencies and recommend architectural improvements
Drive adoption of best practices for scalable, fault-tolerant communication in production environments, including tuning RDMA parameters, optimizing network fabric configurations, and ensuring robust error handling
Work closely with vendors and internal teams to evaluate and integrate new hardware features (e.g., NICs, switches, accelerators) that can enhance communication performance for AI/HPC clusters
Contribute to documentation and knowledge sharing by authoring technical guides, performance reports, and internal wiki pages to educate peers and stakeholders on communication system optimizations
Participate in code reviews and design discussions to ensure high-quality, maintainable solutions that meet the evolving needs of large-scale AI/HPC infrastructure

What we offer

bonus
equity
benefits

Software Engineer - AI/HPC Specialist

We are looking for software engineers to help scale and improve the efficiency o...

Location

Norway, Oslo

Salary:

Not provided

Meta

Expiration Date

Until further notice

Requirements

3+ years of experience developing in C++/C and Python
Experience with High Performance Computing/Networking or AI systems applications frameworks
Bachelor's degree in Computer Science, Computer Engineering, relevant technical field, or equivalent practical experience
Specialized experience in one or more of the following machine learning/deep learning domains: Hardware accelerators, AI Infrastructure, or high performance networking
Solid experience in debugging of distributed systems, revision control systems, testing, and CI pipelines

Job Responsibility

Work on collective communications stacks to optimise networking operations, leading to improved AI inference and training model performance
Drive implementation of latency and bandwidth critical networking operations, as well as out-of-band signalling
Debug custom and third party multi-host, accelerator enabled AI platforms
Software development using C++/C and Python
Work closely with other teams to deliver impact
Develop and improve features and innovations
Extend and optimize large scale learning collective operations

AI/HPC Cluster Thermal Design Engineer

We are seeking a Cluster Thermal Engineer to help architect and deliver scalable...

Location

United States, Austin

Salary:

143120.00 - 214680.00 USD / Year

AMD

Expiration Date

Until further notice

Requirements

Strong understanding of fundamentals: thermodynamics, fluid dynamics, and heat transfer
Familiarity with electronics cooling concepts (heat sinks, cold plates, TIMs, heat exchangers, pumps, valves, fans)
Exposure to data center or cluster thermal concepts such as: rack/row layout considerations, CDU/coolant distribution, RDHx/AHU/fan-wall concepts, chilled water interfaces and heat rejection
Exposure to one or more thermal/CFD simulation tools (ANSYS, COMSOL, FloTHERM, OpenFOAM, or similar)
Familiarity with measurement and validation practices (instrumentation, uncertainty, sensor placement, data analysis)
Comfort working in cross-functional engineering environments and communicating technical ideas clearly
Bonus: understanding of PUE/WUE drivers, economizers/free cooling, or waste-heat reuse concepts
Bonus: coursework in heat transfer, thermodynamics, two-phase flow and heat transfer, refrigeration
Bonus: projects, internships, or research related to HPC/AI infrastructure, data centers, or high-power electronics cooling
Bachelor’s, Master’s or Ph.D. degrees in Mechanical Engineering with expertise in thermal management of electronics

Job Responsibility

Support the thermal design of AI/HPC cluster solutions, including compute racks, cooling loops, and facility interfaces
Assist in evaluating cooling architectures (air cooling, direct liquid cooling, hybrid approaches) and identifying trade-offs in performance, cost, complexity, and reliability
Build and refine thermal and airflow models for system/cluster/data center concepts using industry tools (e.g., OpenFOAM, ANSYS, FloTHERM, or similar)
Contribute to flow-network modeling for liquid cooling and coolant distribution analyses to ensure adequate flow, pressure, and temperature margins
Help define and execute test plans to validate thermal performance at component, system, and rack/cluster levels
Support integration of cooling solutions and thermal telemetry at the cluster level in collaboration with power, networking, platform, firmware, and controls teams
Participate in design reviews with internal stakeholders and external partners/customers; summarize findings and track action items
Assist with experimental setup, instrumentation, data collection, and analysis in partnership with validation and lab teams
Create and maintain technical documentation, including requirements, design notes, modeling assumptions, test reports, and user/customer-facing summaries

Full-time

PCAI and AI Factory Expert

We are seeking a Subject Matter Expert (SME) – Admin, Operate & Manage (HPE PCAI...

Location

India, Bengaluru

Salary:

Not provided

Hewlett Packard Enterprise

Expiration Date

Until further notice

Requirements

Bachelor’s / Master’s Degree in Computer Science, IT, or equivalent field
8+ years of IT infrastructure administration experience, including 3+ years in AI/HPC or GPU-based environments
Proven experience in platform operations, monitoring, and lifecycle management of enterprise-grade AI and HPC environments
Hands-on experience in automation and orchestration across bare metal and containerized infrastructure

Job Responsibility

Administer and maintain HPE PCAI and AI Factory environments, ensuring optimal uptime and performance
Manage compute nodes (HPE DL380a, DL325, Cray XD670), GPU clusters (NVIDIA L40S/H100/H200), and InfiniBand NDR networks
Administer virtualization and container platforms such as vSphere, RHEL/RHOS, Ezmeral Runtime Enterprise, Kubernetes, and Rancher Harvester
Perform configuration, patching, version upgrades, and firmware updates across hardware and software layers
Proactively monitor system health using DCGM, NetQ, Grafana, and Exivity dashboards
Handle alerts, performance anomalies, and incidents across GPU, network, and storage layers
Lead root cause analysis (RCA) and corrective action plans to prevent recurring issues
Maintain operational documentation, runbooks, and incident logs
Manage cluster lifecycle through Ansible, AWX, HPE Performance Cluster Manager (HPCM), and SLURM
Oversee automation for provisioning, scaling, and patch management of compute and containerized workloads

What we offer

Health & Wellbeing
Personal & Professional Development
Unconditional Inclusion

Full-time

AI Cluster & Data Center Design Engineer

We are seeking a highly skilled systems engineer to architect and design scalabl...

Location

United States, Austin

Salary:

139440.00 - 209160.00 USD / Year

AMD

Expiration Date

Until further notice

Requirements

Experience in HPC, AI infrastructure, or data center systems engineering
Strong understanding of rack and data center power delivery
Knowledge of GPU/CPU architectures, PCIe, UALink, InfiniBand, and Ethernet networking
Familiarity with AI/ML frameworks and workload characteristics
Excellent problem-solving, communication, and documentation skills
Bachelor's or Master's degree in Electrical Engineering, Computer Engineering, Computer Science or related field

Job Responsibility

Design scalable AI/HPC clusters including compute, storage, and networking with specific focus on power delivery
Evaluate and select CPUs, GPUs, accelerators, interconnects, and memory configurations for optimal cluster performance
Design leading-edge power delivery solutions for high-density AI/GPU deployments
Define power budgets, redundancy schemes, and fault tolerance mechanisms
Design network topologies to maximize overall cluster performance
Understand the network performance needs of different types of workloads
Understand advantages and performance trade-offs of network topologies for AI/HPC clusters
Design and optimize storage solutions to maximize AI/HPC cluster performance
Understand advantages and performance trade-offs of cluster storage solutions, e.g. Lustre, Ceph, etc.
Work across multiple organizations with subject matter experts from hardware, software, network, data center, and operations teams to deliver scalable, efficient, and reliable compute infrastructure
