The HPC/AI (High-Performance Computing and Artificial Intelligence) organization is on a mission to build the next generation of distributed AI supercomputers: systems that deliver unprecedented computational power, scalability, and reliability to accelerate breakthroughs in artificial intelligence. Our teams design and develop world-class AI infrastructure that enables large-scale model training and inference, forming the backbone of Microsoft’s AI innovation.
As a Principal Software Engineering Manager, you will lead a team building foundational components of Azure’s AI networking infrastructure, which powers some of the largest and most complex distributed training systems in the world. This is a rare opportunity to work at the intersection of AI, cloud infrastructure, and high-performance networking, driving innovation across hardware and software boundaries. With the explosive growth of generative AI and the demand for low-latency, high-bandwidth systems, your work will directly impact the scale, performance, and reliability of Microsoft’s AI platforms.
You will lead the design, development, and deployment of high-performance, scalable, and observable networking systems that connect AI accelerators at massive scale. The role requires deep technical acumen, strategic thinking, and a passion for engineering excellence. You’ll collaborate across Microsoft teams to define architecture, deliver solutions to complex infrastructure challenges, and ensure our systems meet the evolving needs of AI workloads.
If you’re passionate about building large-scale distributed systems, pushing the boundaries of AI infrastructure, and leading teams that shape the future of supercomputing, we invite you to join us on this journey to define the next era of AI at Microsoft.
Job Responsibilities:
Hire, manage, and grow a high-performing team of software engineers, fostering a culture of excellence, inclusion, and innovation
Lead the design and development of large-scale distributed systems and services that power Azure’s AI infrastructure
Drive engineering planning and execution while ensuring alignment with organizational OKRs and long-term strategy
Establish lean, scalable, and efficient processes that promote innovation and engineering rigor
Deliver best-in-class engineering by ensuring services and components are modular, secure, reliable, diagnosable, observable, and reusable
Improve test coverage, automation, and integration testing to proactively identify and resolve reliability gaps
Ensure live-site reliability and service health through robust monitoring, telemetry, and automation
Collaborate across Microsoft and partner organizations to deliver cohesive, end-to-end infrastructure solutions
Apply data-driven insights to optimize performance, scalability, and customer satisfaction
Champion Microsoft’s culture by modeling, coaching, and caring—nurturing diversity, inclusion, and continuous growth for your team and peers
Requirements:
Bachelor's Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
Ability to meet Microsoft, customer, and/or government security screening requirements is required for this role
Master's Degree in Computer Science or related technical field AND 8+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR Bachelor's Degree in Computer Science or related technical field AND 12+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
4+ years people management experience
10+ years of professional software design and development experience in large-scale distributed systems
Experience building and operating networking infrastructure for hyperscale datacenters or AI clusters
Hands-on experience with networking technologies in AI-specific hardware (e.g., InfiniBand, RoCE, MRC, NVLink)
In-depth understanding of networking protocols (e.g., Ethernet, TCP/IP, RDMA, gRPC) and distributed systems
Familiarity with network virtualization, software-defined networking (SDN), or network performance tuning
Familiarity with AI accelerators such as GPUs (NVIDIA, AMD) or TPUs, and how they interact with networking infrastructure
Experience with telemetry and observability tools for network monitoring at scale
Background in building scalable and fault-tolerant systems in large, distributed environments