AI/HPC Systems Performance Engineer

Meta

Location:
United States, Menlo Park

Contract Type:
Not provided

Salary:

122000.00 - 181000.00 USD / Year

Job Description:

Meta's AI Training and Inference Infrastructure is growing exponentially to support ever-increasing use cases of AI. This results in a dramatic scaling challenge that our engineers have to deal with on a daily basis. We need to build and evolve the network infrastructure that connects myriads of training accelerators, such as GPUs. In addition, we need to ensure that the network runs smoothly and meets the stringent performance and availability requirements of RDMA workloads, which expect a lossless fabric interconnect. To improve the performance of these systems, we constantly look for opportunities across the stack: network fabric and host networking, comms lib, and scheduling infrastructure.

Job Responsibility:

  • Collaborate with hardware and software teams to optimize end-to-end communication pathways for large-scale distributed training workloads, ensuring seamless integration between compute, storage, and networking components
  • Design, implement, and validate new collective communication algorithms tailored for AI/HPC workloads, leveraging RDMA and advanced networking technologies to maximize throughput and minimize latency
  • Develop and maintain automated performance testing frameworks for continuous benchmarking of communication libraries and RDMA transport layers, enabling rapid identification of regressions and bottlenecks
  • Analyze and profile communication patterns in real-world training jobs, using telemetry and tracing tools to uncover inefficiencies and recommend architectural improvements
  • Drive adoption of best practices for scalable, fault-tolerant communication in production environments, including tuning RDMA parameters, optimizing network fabric configurations, and ensuring robust error handling
  • Work closely with vendors and internal teams to evaluate and integrate new hardware features (e.g., NICs, switches, accelerators) that can enhance communication performance for AI/HPC clusters
  • Contribute to documentation and knowledge sharing by authoring technical guides, performance reports, and internal wiki pages to educate peers and stakeholders on communication system optimizations
  • Participate in code reviews and design discussions to ensure high-quality, maintainable solutions that meet the evolving needs of large-scale AI/HPC infrastructure

Requirements:

  • Currently has, or is in the process of obtaining a Bachelor's degree in Computer Science, Computer Engineering, relevant technical field, or equivalent practical experience. Degree must be completed prior to joining Meta
  • Bachelor's degree in Computer Science, Computer Engineering, or other relevant technical field, with 2+ years work experience
  • Experience with using communication libraries, such as MPI, NCCL, and UCX
  • Experience with developing, evaluating and debugging host networking protocols such as RDMA
  • Experience with triaging performance issues in complex scale-out distributed applications

Nice to have:

  • Understanding of AI training workloads and demands they exert on networks
  • Understanding of RDMA congestion control mechanisms on IB and RoCE Networks
  • Understanding of the latest artificial intelligence (AI) technologies
  • Experience with machine learning frameworks such as PyTorch and TensorFlow
  • Experience in developing systems software in languages like C++

What we offer:

  • bonus
  • equity
  • benefits

Additional Information:

Job Posted:
February 21, 2026

Similar Jobs for AI/HPC Systems Performance Engineer

Sr AI/HPC Applications and Performance Engineer

Sr AI/HPC Applications and Performance Engineer role at Hewlett Packard Enterpri...
Location:
United States
Salary:
161500.00 - 370500.00 USD / Year
Hewlett Packard Enterprise
Expiration Date:
Until further notice
Requirements:
  • 15+ years' experience
  • Deep expertise in AI and HPC applications and performance engineering including simulation, modeling and emulation capabilities
  • Expertise in large-scale AI and HPC systems
  • Experience architecting, designing, and developing innovative software system design tools and languages
  • Excellent analytical and problem-solving skills
  • Experience in leading overall architecture of software systems for products and solutions
  • Designing and integrating efficient and scalable software systems running on multiple platform types into overall architecture
  • Evaluating and selecting forms and processes for software systems testing and methodology
  • History of innovation with multiple patents or deployed solutions in the field of software design
  • Excellent written and verbal communication skills
Job Responsibility:
  • Develops organization-wide architectures, strategies, and methodologies for software systems design and development across multiple platforms and organizations
  • Identifies and makes informed recommendations regarding new technologies, innovations, and outsourced development partner relationships
  • Reviews, evaluates, and influences designs and project activities for compliance with development guidelines and standards
  • Provides tangible solutions that improve product quality and mitigate failure risk
  • Contributes to domain expertise, business acumen, and experience to influence decisions of executive business leadership
  • Brings creativity and innovation to the organization
  • Provides guidance and mentoring to less-experienced team members
  • Acts as an internal authority on software systems design
  • Contributes to the external technical community through whitepapers, patents, or other significant innovations
What we offer:
  • Health & Wellbeing benefits
  • Personal & Professional Development programs
  • Unconditional Inclusion environment
  • Comprehensive benefits suite supporting physical, financial and emotional wellbeing
  • Fulltime

AI/HPC System Performance Engineer, PhD

Meta's AI Training and Inference Infrastructure is growing exponentially to supp...
Location:
United States, Menlo Park
Salary:
122000.00 - 181000.00 USD / Year
Meta
Expiration Date:
Until further notice
Requirements:
  • Currently has, or is in the process of obtaining a Bachelor's degree in Computer Science, Computer Engineering, relevant technical field, or equivalent practical experience. Degree must be completed prior to joining Meta
  • Bachelor's degree in Computer Science, Computer Engineering, relevant technical field, or equivalent practical experience
  • BS/MS/PhD in relevant fields (EE, CS), with 2+ years work experience
  • Experience with using communication libraries, such as MPI, NCCL, and UCX
  • Experience with developing, evaluating and debugging host networking protocols such as RDMA
  • Experience with triaging performance issues in complex scale-out distributed applications
  • Must obtain work authorization in country of employment at the time of hire and maintain ongoing work authorization during employment
Job Responsibility:
  • Active member of a multi-disciplinary team to develop solutions for large scale training systems
  • Responsible for the overall performance of the communication system, including performance benchmarking, monitoring and troubleshooting production issues
  • Identify potential performance issues across the stack: comms lib, RDMA transport, host networking, scheduling and network fabric. Develop and deploy innovative solutions to address the performance issues
What we offer:
  • bonus
  • equity
  • benefits

AI/HPC System Performance Engineer

Meta's AI Training and Inference Infrastructure is growing exponentially to supp...
Location:
United States, Austin
Salary:
219000.00 - 301000.00 USD / Year
Meta
Expiration Date:
Until further notice
Requirements:
  • Bachelor's degree in Computer Science, Computer Engineering, relevant technical field, or equivalent practical experience
  • Experience with developing, evaluating and debugging host networking protocols such as RDMA
  • 10+ years of experience in designing, deploying and operating networks
  • Experience with triaging performance issues in complex scale-out distributed applications
Job Responsibility:
  • Lead multi-disciplinary teams to develop solutions for large scale training systems. Assess trade-offs of various solutions and make pragmatic decisions
  • Ensure timely milestone delivery with teamwork and close collaboration
  • Responsible for the overall performance of the communication system, including performance benchmarking, monitoring and troubleshooting production issues
  • Define the technical vision and drive a multi-year roadmap to make progress towards the related objectives
  • Work with cross-functional teams and provide guidance on the AI network architecture, including topologies, transport, and congestion control techniques
What we offer:
  • bonus
  • equity
  • benefits

Software Engineer - AI/HPC Specialist

We are looking for software engineers to help scale and improve the efficiency o...
Location:
Norway, Oslo
Salary:
Not provided
Meta
Expiration Date:
Until further notice
Requirements:
  • 3+ years of experience developing in C++/C and Python
  • Experience with High Performance Computing/Networking or AI systems applications frameworks
  • Bachelor's degree in Computer Science, Computer Engineering, relevant technical field, or equivalent practical experience
  • Specialized experience in one or more of the following machine learning/deep learning domains: Hardware accelerators, AI Infrastructure, or high performance networking
  • Solid experience in debugging of distributed systems, revision control systems, testing, and CI pipelines
Job Responsibility:
  • Work on collective communications stacks to optimise networking operations, leading to improved AI inference and training model performance
  • Drive implementation of latency and bandwidth critical networking operations, as well as out-of-band signalling
  • Debug custom and third party multi-host, accelerator enabled AI platforms
  • Software development using C++/C and Python
  • Work closely with other teams to deliver impact
  • Develop and improve features and innovations
  • Extend and optimize large scale learning collective operations

Senior Software Engineer

Microsoft Azure High Performance Computing & AI Engineering (HPC & AI Eng) team ...
Location:
United States, Multiple Locations
Salary:
119800.00 - 234700.00 USD / Year
Microsoft Corporation
Expiration Date:
Until further notice
Requirements:
  • Bachelor's Degree in Computer Science or related technical field AND 4+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
  • 3+ years of experience in operating AI/HPC systems, developing and running AI/HPC applications on clusters, or operating Cloud Infrastructure
  • 2+ years of specialized experience with one of AI/HPC system management OR High-Speed Networks OR HPC Storage OR managing Cloud Infrastructure
  • Ability to meet Microsoft, customer and/or government security screening requirements is required for this role
  • Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud Background Check upon hire/transfer and every two years thereafter
Job Responsibility:
  • Collaborates with appropriate stakeholders to determine user requirements for a scenario
  • Drives identification of dependencies and the development of design documents for a product, application, service, or platform
  • Creates, implements, optimizes, debugs, refactors, and reuses code to establish and improve performance and maintainability, effectiveness, and return on investment (ROI)
  • Leverages subject-matter expertise of product features and partners with appropriate stakeholders (e.g., project managers) to drive a workgroup's project plans, release plans, and work items
  • Acts as a Designated Responsible Individual (DRI) and guides other engineers by developing and following the playbook, working on call to monitor system/product/service for degradation, downtime, or interruptions, alerting stakeholders about status, and initiating actions to restore system/product/service for simple and complex problems when appropriate
  • Proactively seeks new knowledge and adapts to new trends, technical solutions, and patterns that will improve the availability, reliability, efficiency, observability, and performance of products while also driving consistency in monitoring and operations at scale
  • Fulltime

Systems Software Engineer

The Crusoe Cloud Software Development team is seeking a passionate and experienc...
Location:
United States, San Francisco
Salary:
137000.00 - 161000.00 USD / Year
Crusoe
Expiration Date:
Until further notice
Requirements:
  • Linux Systems Familiarity: Experience building applications on Linux kernels, specifically pertaining to virtualization, device drivers, memory management, and process scheduling
  • Hardware Integration: Solid understanding of hardware devices such as GPUs, CPUs, InfiniBand and Ethernet NICs, Ephemeral Disks, and PCI Express
  • Systems Design: Strong grasp of distributed applications and highly scalable systems design. Specific focus around communications protocols (gRPC, REST, TCP/IP, etc.), databases (Postgres, Redis), and systems design applications (Pub/Sub, Kafka)
  • Software Architecture: Strong experience building software applications, both at the higher (Golang, Java, Python) and lower (C, C++, Rust) levels. Keen eye for clean, maintainable code, and a unit-test driven mindset
  • Excellent Communication Skills: Ability to collaborate with teams across an organization, blocking out noise, and focusing on what needs to get done to get a project across the line
  • Rapid and Agile Learner: Capable of adapting quickly, eager to research new technology and not get overwhelmed by unfamiliar tech stacks
  • Virtualization Concepts: General knowledge of hypervisors, virtual machine lifecycles, and Linux KVM tooling
  • CI/CD and Validation: Understanding of how to build GitLab or GitHub CI/CD pipelines that deliver bug-free code across a multitude of compute platforms
Job Responsibility:
  • Compute Application Development & Scaleout: Design highly reliable and performant Linux applications used to manage our virtualization stack across thousands of AI compute servers in multiple global datacenters
  • AI Hardware Platform Integration: Integrate Crusoe applications with a wide variety of hardware and software AI chip-vendor stacks. Build solutions to optimize and monitor virtualized hardware (GPUs, InfiniBand/RoCE NICs, Ephemeral Storage, etc.) in cutting-edge AI/HPC environments
  • Kernel & Hypervisor Integration: Work side by side with our Linux Kernel and Hypervisor teams to ensure our Crusoe applications are seamlessly integrated with a variety of kernels and hypervisors
  • Performance Analysis & Tuning: Analyze and enhance the performance of the entire virtualization stack, from the hypervisor to the virtualized guest OS, with a specific focus on optimizing AI/ML workloads. This includes profiling, bottleneck identification, and implementing low-level optimizations
  • System-Level Troubleshooting: Diagnose and resolve complex system issues across our virtualization stack (drivers, kernel, hypervisor, guest OS, and Crusoe applications). Work closely with kernel and hypervisor teams to debug and resolve integration challenges
  • Code Review and Quality Assurance: Conduct thorough code reviews to ensure the highest level of software quality, reliability, and security within compute applications and virtualization stack
  • Cross-Functional Collaboration: Collaborate with other engineering teams, including hardware design, OS development, and AI/ML application teams, to ensure cohesive and integrated product development
  • Technical Leadership: Provide technical guidance and mentorship to junior engineers, fostering a culture of technical excellence and collaborative problem-solving within the compute applications team
What we offer:
  • Restricted Stock Units in a fast growing, well-funded technology company
  • Health insurance package options that include HDHP and PPO, vision, and dental for you and your dependents
  • Employer contributions to HSA accounts
  • Paid Parental Leave
  • Paid life insurance, short-term and long-term disability
  • Teladoc
  • 401(k) with a 100% match up to 4% of salary
  • Generous paid time off and holiday schedule
  • Cell phone reimbursement
  • Tuition reimbursement
  • Fulltime

Principal Software Engineer

Microsoft Azure High Performance Computing & AI Engineering (HPC & AI Eng) team ...
Location:
United States, Multiple Locations
Salary:
139900.00 - 274800.00 USD / Year
Microsoft Corporation
Expiration Date:
Until further notice
Requirements:
  • Bachelor's Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python - OR equivalent experience
  • 5+ years hands-on experience designing and developing high volume low latency pipelines using products such as AzPubSub, Event Hubs, Azure Stream Analytics, Kafka, Grafana, Prometheus or equivalent products
  • 3+ years of experience with one of AI/HPC system management OR High-Speed Networks OR HPC Storage OR managing Cloud Infrastructure
  • Ability to meet Microsoft, customer and/or government security screening requirements is required for this role
  • Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud Background Check upon hire/transfer and every two years thereafter
Job Responsibility:
  • Architect, design and develop high volume low latency end to end event pipelines that can provide first-to-know-insights on events causing job interrupts and job reliability
  • Conduct analysis of existing event pipelines to evaluate fidelity, granularity and latency of critical events
  • Contribute to improving key metrics such as Job Mean Time to Interrupt, Nodes in Service, Mean Time to Resolve on flagship supercomputers by enabling data scientists and domain experts to use the telemetry to identify events & issues at the intersection of datacenter and hardware, develop hypotheses, conduct A/B tests and synthesize results
  • Partner with cross-organizational teams to evaluate available telemetry and latency, and drive architecture, design, development and deployment of end-to-end solutions to manage core infrastructure including current & next generation datacenter, IT hardware, power & cooling technologies
  • Drive engineering and operational excellence based on issues and learnings from strategic customers on their usage scenarios to improve product features and capabilities
  • Partner with teams on continuous learning and continuous improvement programs by leading the resolution of complex incidents, driving root cause analyses and championing initiatives to minimize future customer impact
  • Fulltime

AI Research Lab Research Associate

We are currently seeking highly qualified interns to accelerate research towards...
Location:
United States, Milpitas
Salary:
43.27 - 93.15 USD / Hour
Hewlett Packard Enterprise
Expiration Date:
May 26, 2026
Requirements:
  • Pursuing PhD degree (or other degree with significant research and innovation experience) in a relevant discipline (e.g. machine learning, computer science, electrical engineering, math, statistics, etc.)
  • Track record of world-class innovative contributions and ideas in machine learning
  • Experience with innovative solution development, such as developing proofs-of-concept, first-of-a-kind solutions, and/or technology transfer
  • Experience in deep learning research
  • Experience in developing deep learning software with high proficiency in data structures and algorithms
  • Strong programming skills and experience with Python, C/C++, and preferably Java
  • Software development experience in Deep Learning, GPU acceleration, and Model Optimization
  • Experience in Deep Learning and Machine Learning frameworks and models like TensorFlow, PyTorch
  • Experience in Transformer Neural Network architectures for Generative AI and natural language processing
  • Experience with Agentic AI and Generative AI workflows - desired
Job Responsibility:
  • Conduct research and come up with solutions with a fast turnaround time
  • Build the software and applications for Neural Networks and Machine Learning
  • Work with system programming, Deep Learning frameworks and models, GPU acceleration, Model optimization, real-time streaming data, distributed computing, and deployment
  • Provide thought leadership and technical influence both internally and externally to HPE
  • Collaborate with HPE Labs research teams as well as external partners
  • Work in alignment with HPE's broader innovation community
What we offer:
  • Health & Wellbeing benefits including physical, financial and emotional wellbeing support
  • Personal and professional development programs
  • Unconditional inclusion and flexibility to manage work and personal needs
  • Fulltime