CoreAI is at the forefront of Microsoft’s mission to redefine how software is built and experienced. We are responsible for building the foundational platforms, services, programming models, and developer experiences that power the next generation of applications using Generative AI. Our work enables developers and enterprises to harness the full potential of AI to create intelligent, adaptive, and transformative software. The AI Core Infrastructure team, part of the AI Platform team in the CoreAI organization, is responsible for the large-scale, highly reliable, and efficient GPU management infrastructure and the inference and training platforms that power all of Microsoft’s AI workloads, such as M365 Copilot, GitHub Copilot, Microsoft Copilot, AI Foundry’s Inference and Fine-Tuning offering of OAI and OSS models, and many more. As a Principal Software Engineer on the AI Core Infrastructure team, you will work on cutting-edge infrastructure and tools to design, build, and support large-scale training and inference platforms built on the latest generation of NVIDIA and AMD GPUs in Azure and Microsoft partner clouds, on some of the world’s largest AI supercomputers.
Job Responsibilities:
Architect, design, and develop core AI infrastructure services written in Go, Rust, Python, C++, and C# and deployed on large-scale Kubernetes clusters to support pre-training and post-training of state-of-the-art LLMs, SLMs, multimodal, and code-specific models
Design, build, and manage compute, storage, and networking subsystems on large-scale GPU clusters to support LLM training, customization, and inference workloads
Enhance systems and applications to deliver high stability, low latency, strong security, and maintainability in large-scale complex training environments in Azure and in partner clouds
Provide operational support, technical leadership, and vision while contributing to the deployment, monitoring, and continuous improvement of engineering systems and practices
Support development and troubleshooting from the frontline, resolving complex issues impacting large-scale services
Collaborate closely with engineers and data scientists within the team, internal Microsoft Research teams, and external enterprises to build better solutions
Provide vision, expertise, and technical leadership to other team members
Help to grow talent in these areas
Requirements:
Bachelor’s or master’s degree in computer science or a related field
10+ years designing, developing, and shipping high-quality software
4+ years of experience with distributed systems and cloud-based infrastructure
2+ years of experience with DevOps practices (CI/CD, automated testing, deployment, etc.)
Passionate and self-motivated
Strong ability to self-learn, enter new domains, and manage through uncertainty in an innovative team environment
10+ years of software development experience in C#, C++, Python, or similar languages
6+ years of experience with containerization tools (e.g., Docker, Kubernetes)
Knowledge of and hands-on experience with production ML systems, large-scale training infrastructure, NCCL, and CUDA libraries and tools
Ability to meet Microsoft, customer, and/or government security screening requirements is required for this role
This position will be required to pass the Microsoft Cloud Background Check upon hire/transfer and every two years thereafter