Meta is building some of the world's largest AI and high-performance computing infrastructure to power next-generation AI research and products. As an AI/HPC System Performance Engineer on the Network Infrastructure Engineering team, you will drive end-to-end performance characterization, bottleneck analysis, and optimization of large-scale AI training and inference clusters. In this role, you will work at the intersection of network fabric design, distributed computing, and AI workload behavior to ensure Meta's HPC systems deliver maximum throughput and efficiency for frontier model development.
Job Responsibilities:
Profile and benchmark AI training and inference workloads across large-scale HPC clusters to identify network, compute, and memory bottlenecks
Develop and maintain performance analysis frameworks and dashboards to track system-level metrics including GPU utilization, network bandwidth, latency, and collective communication efficiency
Investigate and resolve performance regressions in distributed AI training environments, including issues related to RDMA fabrics, collective communication libraries, and job scheduling
Collaborate with network infrastructure, hardware, and AI research teams to define performance requirements and validate new HPC cluster configurations
Design and execute capacity and scalability experiments to inform network topology decisions for AI supercomputing infrastructure
Build tooling and automation to continuously monitor HPC system health, detect anomalies, and reduce mean time to mitigation during performance incidents
Establish service level objectives for AI cluster network performance and drive cross-functional alignment on reliability and efficiency targets
Lead technical design reviews for network and system architecture changes affecting AI workload performance, communicating trade-offs clearly to engineering and product stakeholders
Mentor other engineers on HPC performance methodologies, debugging techniques, and instrumentation best practices
Leverage AI-assisted workflows to accelerate root cause analysis, automate routine performance reporting, and expand coverage across the HPC stack
Requirements:
Experience profiling and optimizing distributed AI or HPC workloads, including familiarity with GPU interconnects, RDMA networking, and collective communication frameworks such as NCCL or MPI
Experience debugging complex, non-reproducible performance issues across multi-layer systems including network fabric, operating system, and application layers
Experience designing and implementing performance monitoring systems, including instrumentation, telemetry pipelines, and alerting for large-scale infrastructure
Experience driving cross-functional technical projects from requirements definition through production deployment, including communicating performance findings and trade-offs to diverse stakeholders
Bachelor's degree in Computer Science, Computer Engineering, or a relevant technical field, or equivalent practical experience
6+ years of experience in system performance engineering, network infrastructure engineering, or a related field within large-scale distributed computing or HPC environments
Nice to have:
Experience developing systems software in languages such as C++
Experience with machine learning frameworks such as PyTorch and TensorFlow
Understanding of RDMA congestion control mechanisms on InfiniBand (IB) and RoCE networks
Understanding of the latest artificial intelligence (AI) technologies
Understanding of AI training workloads and the demands they place on networks
Demonstrated ability to integrate AI tools to optimize/redesign workflows and drive measurable impact (e.g., efficiency gains, quality improvements)
Experience adhering to and implementing responsible, ethical AI practices (e.g., risk assessment, bias mitigation, quality and accuracy reviews)
Demonstrated ongoing AI skill development (e.g., prompt/context engineering, agent orchestration) and staying current with emerging AI technologies