The HPC/AI (High Performance Computing and Artificial Intelligence) team is on a mission to build the next-generation distributed AI supercomputer, enabling breakthroughs in artificial intelligence by delivering unmatched computational power, scalability and reliability. We design and develop cutting-edge infrastructure that supports high-performance AI model training at scale, laying the foundation for innovations that redefine what AI can achieve. We are seeking passionate and innovative engineers to design, build and manage the networking infrastructure that powers large-scale AI training. This role focuses on developing next-generation networking capabilities to ensure high performance, low latency, and minimal jitter for distributed AI workloads. You will play a critical role in enabling state-of-the-art AI systems to achieve their full potential.
Job Responsibilities:
Demonstrates some knowledge of data — knows what data is needed, knows how to find new or missing data, and can describe defects and their relevance to product and service targets. Identifies patterns and trends in data and interprets them to inform decisions related to products and/or services
Collaborates with teams across the organization to support and manage safe and secure network deployments
Works with machine-readable definitions to manage deployments
Supports the management of incidents by applying technical knowledge to diagnose and triage issues with a commitment to maintaining the quality of products and services. Takes notes during incidents and participates in postmortem and root cause analysis processes
Performs testing and validation of network devices, firmware, and configurations. Defines and implements test cases with existing automation tools, and exposes test coverage gaps
Triages, troubleshoots, and repairs live site issues by applying an understanding of network components and features (e.g., device operating systems) as well as problem management tools (e.g., root cause analysis, trend analysis, postmortems), to discover and drive solutions with minimal or no disruption to customers. Actively participates in on-call/DRI duties to troubleshoot and may actively resolve incidents in production
Monitors network telemetry and performs analyses to identify patterns that reveal errors and unexpected problems. Makes suggestions on improvements to monitoring based on observations and experience
Provides instructions to datacenter or network site staff/technicians on how to securely repair, replace, and maintain physical network hardware and components deployed in production. Identifies gaps and inefficiencies in processes related to securely installing and deploying new hardware and components and provides instructions to address gaps
Requirements:
Master's Degree in Electrical Engineering, Optical Engineering, Computer Science, Information Technology, or related field AND 1+ year(s) technical experience in network design, development, and automation
Bachelor's Degree in Electrical Engineering, Optical Engineering, Computer Science, Information Technology, or related field AND 2+ years technical experience in network design, development, and automation
OR equivalent experience
Ability to meet Microsoft, customer and/or government security screening requirements
Microsoft Cloud Background Check
Nice to have:
Doctorate Degree in Electrical Engineering, Optical Engineering, Computer Science, Information Technology, or related field
Master's Degree in Electrical Engineering, Optical Engineering, Computer Science, Information Technology, or related field AND 3+ years technical experience in network design, development, and automation
Bachelor's Degree in Electrical Engineering, Optical Engineering, Computer Science, Information Technology, or related field AND 5+ years technical experience in network design, development, and automation