Product Manager - AI Data Center Infrastructure. We are seeking a Product Line Manager (PLM) for AI Data Center Infrastructure to define and deliver next-generation data center networking platforms for large-scale GPU clusters. This role is ideal for a visionary, hands-on leader who understands how AI workloads stress networks at scale and can translate that insight into clear product requirements and roadmaps.
Job Responsibilities:
AI Data Center & Fabric Architecture: Define product requirements for AI data center network architectures supporting thousands of GPUs
Develop requirements for low-latency Ethernet fabrics using Juniper QFX platforms and Apstra-based automation
Enable high-bandwidth GPU and NIC interconnects optimized for large-scale distributed training and inference workloads
GPU, NIC & Interconnect Strategy: Lead requirements definition for next-generation GPUs, NICs, and interconnect technologies, staying ahead of industry roadmaps
Drive alignment with NVIDIA and AMD ecosystems
Ensure interoperability across DAC, AEC, ACC, and optical transceivers between switches and NIC endpoints
Define scale-up paths using PCIe, NVLink, and NVSwitch, ensuring GPU-to-GPU symmetry, consistency, and bandwidth determinism
Switching, Routing & Telemetry: Specify and optimize L2/L3 architectures, including EVPN-VXLAN, Class-E IPv4, and AI-optimized buffer tuning
Leverage hardware telemetry, streaming sensors, and analytics for proactive performance assurance
Drive automation using Python, Ansible, Apstra, Terraform, and related tools to enforce configuration consistency and compliance
Performance Optimization & Troubleshooting: Analyze GPU job performance to identify network hotspots, congestion, packet loss, and microbursts
Tune ECN, RDMA/RoCEv2, PFC, and traffic-engineering policies for AI workloads
Optimize server-to-switch interactions, including BIOS and firmware alignment, NIC queue and link-training parameters, and cable selection and management (AEC/ACC/optics)
Cross-Functional & Ecosystem Collaboration: Partner closely with AI platform teams, GPU system architects, data center operations, and strategic vendors (NVIDIA, AMD, Juniper)
Lead and participate in root-cause analysis of link flaps and training failures, FEC and PCS errors, and thermal or power-related performance degradation
Drive lab validation, scale testing, and certification of new optics, NIC firmware, and switch software releases
Requirements:
5–10+ years of experience in data center networking, AI infrastructure, or HPC environments
Strong hands-on experience with Juniper QFX platforms and Junos OS
Deep understanding of GPU architectures: NVIDIA (H100/H200, GB200/GB300, NVLink/NVSwitch) and AMD (MI300/MI400, Pollara NICs, Infinity Fabric)
Proven expertise in scale-up GPU interconnects and scale-out Ethernet fabrics
Strong knowledge of RDMA/RoCEv2, ECN, PFC, and buffer management
Familiarity with distributed AI workloads and collective operations (NCCL, RCCL)
Hands-on troubleshooting experience with high-speed optics, AEC cables, link training, and NIC firmware
Proficiency in automation and scripting (Python, Ansible, Bash, Terraform)