AI/HPC System Performance Engineer Job at Meta

AI/HPC System Performance Engineer

Meta is building some of the world's largest AI and high-performance computing i...

Location

United States, Menlo Park

Salary:

154000.00 - 217000.00 USD / Year

Meta

Expiration Date

Until further notice

Requirements

Experience profiling and optimizing distributed AI or HPC workloads, including familiarity with GPU interconnects, RDMA networking, and collective communication frameworks such as NCCL or MPI
Experience debugging complex, non-reproducible performance issues across multi-layer systems including network fabric, operating system, and application layers
Experience designing and implementing performance monitoring systems, including instrumentation, telemetry pipelines, and alerting for large-scale infrastructure
Experience driving cross-functional technical projects from requirements definition through production deployment, including communicating performance findings and trade-offs to diverse stakeholders
Bachelor's degree in Computer Science, Computer Engineering, relevant technical field, or equivalent practical experience
6+ years of experience in system performance engineering, network infrastructure engineering, or a related field within large-scale distributed computing or HPC environments

Job Responsibility

Profile and benchmark AI training and inference workloads across large-scale HPC clusters to identify network, compute, and memory bottlenecks
Develop and maintain performance analysis frameworks and dashboards to track system-level metrics including GPU utilization, network bandwidth, latency, and collective communication efficiency
Investigate and resolve performance regressions in distributed AI training environments, including issues related to RDMA fabrics, collective communication libraries, and job scheduling
Collaborate with network infrastructure, hardware, and AI research teams to define performance requirements and validate new HPC cluster configurations
Design and execute capacity and scalability experiments to inform network topology decisions for AI supercomputing infrastructure
Build tooling and automation to continuously monitor HPC system health, detect anomalies, and reduce mean time to mitigation during performance incidents
Establish service level objectives for AI cluster network performance and drive cross-functional alignment on reliability and efficiency targets
Lead technical design reviews for network and system architecture changes affecting AI workload performance, communicating trade-offs clearly to engineering and product stakeholders
Mentor other engineers on HPC performance methodologies, debugging techniques, and instrumentation best practices
Leverage AI-assisted workflows to accelerate root cause analysis, automate routine performance reporting, and expand coverage across the HPC stack

What we offer

bonus + equity + benefits

Full-time

AI/HPC System Performance Engineer

Meta's AI Training and Inference Infrastructure is growing exponentially to supp...

Location

United States, Austin

Salary:

219000.00 - 301000.00 USD / Year

Meta

Expiration Date

Until further notice

Requirements

Bachelor's degree in Computer Science, Computer Engineering, relevant technical field, or equivalent practical experience
Experience with developing, evaluating and debugging host networking protocols such as RDMA
10+ years of experience in designing, deploying and operating networks
Experience with triaging performance issues in complex scale-out distributed applications

Job Responsibility

Lead multi-disciplinary teams to develop solutions for large scale training systems. Assess trade-offs of various solutions and make pragmatic decisions
Ensure timely milestone delivery with teamwork and close collaboration
Responsible for the overall performance of the communication system, including performance benchmarking, monitoring and troubleshooting production issues
Define the technical vision and drive a multi-year roadmap toward the related objectives
Work with cross functional teams and provide guidance on the AI network architecture including topologies, transport, congestion control techniques

What we offer

bonus
equity
benefits

AI/HPC System Performance Engineer, PhD

Meta's AI Training and Inference Infrastructure is growing exponentially to supp...

Location

United States, Menlo Park

Salary:

122000.00 - 181000.00 USD / Year

Meta

Expiration Date

Until further notice

Requirements

Currently has, or is in the process of obtaining a Bachelor's degree in Computer Science, Computer Engineering, relevant technical field, or equivalent practical experience. Degree must be completed prior to joining Meta
Bachelor's degree in Computer Science, Computer Engineering, relevant technical field, or equivalent practical experience
BS/MS/PhD in relevant fields (EE, CS), with 2+ years work experience
Experience with using communication libraries, such as MPI, NCCL, and UCX
Experience with developing, evaluating and debugging host networking protocols such as RDMA
Experience with triaging performance issues in complex scale-out distributed applications
Must obtain work authorization in country of employment at the time of hire and maintain ongoing work authorization during employment

Job Responsibility

Serve as an active member of a multi-disciplinary team developing solutions for large scale training systems
Responsible for the overall performance of the communication system, including performance benchmarking, monitoring and troubleshooting production issues
Identify potential performance issues across the stack: comms lib, RDMA transport, host networking, scheduling and network fabric. Develop and deploy innovative solutions to address the performance issues

What we offer

bonus
equity
benefits

AI/HPC Systems Performance Engineer

Meta's AI Training and Inference Infrastructure is growing exponentially to supp...

Location

United States, Menlo Park

Salary:

122000.00 - 181000.00 USD / Year

Meta

Expiration Date

Until further notice

Requirements

Currently has, or is in the process of obtaining a Bachelor's degree in Computer Science, Computer Engineering, relevant technical field, or equivalent practical experience. Degree must be completed prior to joining Meta
Bachelor's degree in Computer Science, Computer Engineering, or other relevant technical field, with 2+ years work experience
Experience with using communication libraries, such as MPI, NCCL, and UCX
Experience with developing, evaluating and debugging host networking protocols such as RDMA
Experience with triaging performance issues in complex scale-out distributed applications

Job Responsibility

Collaborate with hardware and software teams to optimize end-to-end communication pathways for large-scale distributed training workloads, ensuring seamless integration between compute, storage, and networking components
Design, implement, and validate new collective communication algorithms tailored for AI/HPC workloads, leveraging RDMA and advanced networking technologies to maximize throughput and minimize latency
Develop and maintain automated performance testing frameworks for continuous benchmarking of communication libraries and RDMA transport layers, enabling rapid identification of regressions and bottlenecks
Analyze and profile communication patterns in real-world training jobs, using telemetry and tracing tools to uncover inefficiencies and recommend architectural improvements
Drive adoption of best practices for scalable, fault-tolerant communication in production environments, including tuning RDMA parameters, optimizing network fabric configurations, and ensuring robust error handling
Work closely with vendors and internal teams to evaluate and integrate new hardware features (e.g., NICs, switches, accelerators) that can enhance communication performance for AI/HPC clusters
Contribute to documentation and knowledge sharing by authoring technical guides, performance reports, and internal wiki pages to educate peers and stakeholders on communication system optimizations
Participate in code reviews and design discussions to ensure high-quality, maintainable solutions that meet the evolving needs of large-scale AI/HPC infrastructure

What we offer

bonus
equity
benefits

Sr AI/HPC Applications and Performance Engineer

Sr AI/HPC Applications and Performance Engineer role at Hewlett Packard Enterpri...

Location

United States

Salary:

161500.00 - 370500.00 USD / Year

Hewlett Packard Enterprise

Expiration Date

Until further notice

Requirements

15+ years' experience
Deep expertise in AI and HPC applications and performance engineering including simulation, modeling and emulation capabilities
Expertise in large-scale AI and HPC systems
Experience architecting, designing, and developing innovative software system design tools and languages
Excellent analytical and problem-solving skills
Experience in leading overall architecture of software systems for products and solutions
Experience designing and integrating efficient and scalable software systems running on multiple platform types into overall architecture
Experience evaluating and selecting forms and processes for software systems testing and methodology
History of innovation with multiple patents or deployed solutions in the field of software design
Excellent written and verbal communication skills

Job Responsibility

Develops organization-wide architectures, strategies, and methodologies for software systems design and development across multiple platforms and organizations
Identifies and makes informed recommendations regarding new technologies, innovations, and outsourced development partner relationships
Reviews, evaluates, and influences designs and project activities for compliance with development guidelines and standards
Provides tangible solutions that improve product quality and mitigate failure risk
Contributes domain expertise, business acumen, and experience to influence decisions of executive business leadership
Brings creativity and innovation to the organization
Provides guidance and mentoring to less-experienced team members
Acts as an internal authority on software systems design
Contributes to the external technical community through whitepapers, patents, or other significant innovations

What we offer

Health & Wellbeing benefits
Personal & Professional Development programs
Unconditional Inclusion environment
Comprehensive benefits suite supporting physical, financial and emotional wellbeing

Full-time

Software Engineer - AI/HPC Specialist

We are looking for software engineers to help scale and improve the efficiency o...

Location

Norway, Oslo

Salary:

Not provided

Meta

Expiration Date

Until further notice

Requirements

3+ years of experience developing in C++/C and Python
Experience with High Performance Computing/Networking or AI systems applications frameworks
Bachelor's degree in Computer Science, Computer Engineering, relevant technical field, or equivalent practical experience
Specialized experience in one or more of the following machine learning/deep learning domains: Hardware accelerators, AI Infrastructure, or high performance networking
Solid experience in debugging of distributed systems, revision control systems, testing, and CI pipelines

Job Responsibility

Work on collective communications stacks to optimise networking operations, leading to improved AI inference and training model performance
Drive implementation of latency and bandwidth critical networking operations, as well as out-of-band signalling
Debug custom and third party multi-host, accelerator enabled AI platforms
Software development using C++/C and Python
Work closely with other teams to deliver impact
Develop and improve features and innovations
Extend and optimize large scale learning collective operations

AI/HPC Cluster Thermal Design Engineer

We are seeking a Cluster Thermal Engineer to help architect and deliver scalable...

Location

United States, Austin

Salary:

143120.00 - 214680.00 USD / Year

AMD

Expiration Date

Until further notice

Requirements

Strong understanding of fundamentals: thermodynamics, fluid dynamics, and heat transfer
Familiarity with electronics cooling concepts (heat sinks, cold plates, TIMs, heat exchangers, pumps, valves, fans)
Exposure to data center or cluster thermal concepts such as: rack/row layout considerations, CDU/coolant distribution, RDHx/AHU/fan-wall concepts, chilled water interfaces and heat rejection
Exposure to one or more thermal/CFD simulation tools (ANSYS, COMSOL, FloTHERM, OpenFOAM, or similar)
Familiarity with measurement and validation practices (instrumentation, uncertainty, sensor placement, data analysis)
Comfort working in cross-functional engineering environments and communicating technical ideas clearly
Bonus: understanding of PUE/WUE drivers, economizers/free cooling, or waste-heat reuse concepts
Bonus: coursework in heat transfer, thermodynamics, two-phase flow and heat transfer, refrigeration
Bonus: projects, internships, or research related to HPC/AI infrastructure, data centers, or high-power electronics cooling
Bachelor’s, Master’s, or Ph.D. degree in Mechanical Engineering with expertise in thermal management of electronics

Job Responsibility

Support the thermal design of AI/HPC cluster solutions, including compute racks, cooling loops, and facility interfaces
Assist in evaluating cooling architectures (air cooling, direct liquid cooling, hybrid approaches) and identifying trade-offs in performance, cost, complexity, and reliability
Build and refine thermal and airflow models for system/cluster/data center concepts using industry tools (e.g., OpenFOAM, ANSYS, FloTHERM, or similar)
Contribute to flow-network modeling for liquid cooling and coolant distribution analyses to ensure adequate flow, pressure, and temperature margins
Help define and execute test plans to validate thermal performance at component, system, and rack/cluster levels
Support integration of cooling solutions and thermal telemetry at the cluster level in collaboration with power, networking, platform, firmware, and controls teams
Participate in design reviews with internal stakeholders and external partners/customers; summarize findings and track action items
Assist with experimental setup, instrumentation, data collection, and analysis in partnership with validation and lab teams
Create and maintain technical documentation, including requirements, design notes, modeling assumptions, test reports, and user/customer-facing summaries

Full-time

PCAI and AI Factory Expert

We are seeking a Subject Matter Expert (SME) – Admin, Operate & Manage (HPE PCAI...

Location

India, Bengaluru

Salary:

Not provided

Hewlett Packard Enterprise

Expiration Date

Until further notice

Requirements

Bachelor’s / Master’s Degree in Computer Science, IT, or equivalent field
8+ years of IT infrastructure administration experience, including 3+ years in AI/HPC or GPU-based environments
Proven experience in platform operations, monitoring, and lifecycle management of enterprise-grade AI and HPC environments
Hands-on experience in automation and orchestration across bare metal and containerized infrastructure

Job Responsibility

Administer and maintain HPE PCAI and AI Factory environments, ensuring optimal uptime and performance
Manage compute nodes (HPE DL380a, DL325, Cray XD670), GPU clusters (NVIDIA L40S/H100/H200), and InfiniBand NDR networks
Administer virtualization and container platforms such as vSphere, RHEL/RHOS, Ezmeral Runtime Enterprise, Kubernetes, and Rancher Harvester
Perform configuration, patching, version upgrades, and firmware updates across hardware and software layers
Proactively monitor system health using DCGM, NetQ, Grafana, and Exivity dashboards
Handle alerts, performance anomalies, and incidents across GPU, network, and storage layers
Lead root cause analysis (RCA) and corrective action plans to prevent recurring issues
Maintain operational documentation, runbooks, and incident logs
Manage cluster lifecycle through Ansible, AWX, HPE Performance Cluster Manager (HPCM), and SLURM
Oversee automation for provisioning, scaling, and patch management of Compute and Containerized workloads

What we offer

Health & Wellbeing
Personal & Professional Development
Unconditional Inclusion

Full-time
