M365 Copilot Inference is a high-impact engineering team advancing applied AI and large-scale machine learning across Microsoft. The team designs and operates the platform powering Microsoft 365 Copilot experiences, running at massive GPU scale across multiple regions and SKUs in global datacenters. It builds core LLM API, routing, and capacity control plane services to deliver low-latency, highly available Copilot experiences.
We’re hiring a Principal Software Engineering Manager to lead a team focused on control plane automations for capacity buildout. This is a hands-on technical leadership role centered on how Copilot capacity is requested, planned, deployed, and operated. The manager will contribute to capacity planning and custom model deployment automation, partnering closely with peer managers and adjacent areas to shape how the broader control plane evolves. The space spans intake, planning, deployment, fleet health, and unified control plane surfaces.
This role is based out of Redmond, WA and employees are expected to work from a designated Microsoft office at least three days a week.
Microsoft’s mission is to empower every person and every organization on the planet to achieve more. As employees we come together with a growth mindset, innovate to empower others, and collaborate to realize our shared goals. Each day we build on our values of respect, integrity, and accountability to create a culture of inclusion where everyone can thrive at work and beyond.
Job Responsibility
Lead and grow a team of software engineers building control plane services and automations across the capacity buildout area
Drive technical design and execution for capacity automation — intake, planning, deployment, fleet health, and control plane components — prioritizing the highest-impact work for Copilot capacity
Replace manual, ticket-driven capacity workflows with automated, data-driven systems; reduce time from capacity request to production traffic for priority workloads
Own live-site, reliability, and operational excellence for the services your team builds; establish SLAs, metrics, and on-call practices
Partner with peer engineering managers on adjacent capacity areas, and with partner teams across M365 Core, AI Core, Azure, and Microsoft Research to align on dependencies and unblock execution
Coach and grow senior and mid-level engineers; raise the engineering bar; recruit strong platform talent into the team
Help shape how the capacity automation area is sliced and scoped over time as the platform and the org evolve
Requirements
Bachelor's Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
Ability to meet Microsoft, customer, and/or government security screening requirements is required for this role. These requirements include, but are not limited to, the following specialized security screening: Microsoft Cloud Background Check. This position will be required to pass the Microsoft Cloud background check upon hire/transfer and every two years thereafter.
Nice to have
Master's Degree in Computer Science or related technical field AND 8+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR Bachelor's Degree in Computer Science or related technical field AND 12+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
4+ years people management experience
Experience as an engineering manager leading IC teams building distributed systems, platform services, or cloud infrastructure at scale
Technical depth — able to participate in design reviews, debug live-site issues, and raise the engineering bar through code and design feedback
Track record shipping production services with live-site and on-call ownership
Experience building automation and tooling that replaces manual operational work
Ability to work across team and org boundaries to align on dependencies, surface trade-offs, and drive execution
Hiring, coaching, and people-development track record
Ability to take an ambiguous charter and turn it into a focused roadmap with clear priorities
Experience with AI/ML infrastructure, GPU fleets, or large-scale inference or training systems
Experience with capacity planning, fleet management, or supply/demand optimization at scale
Familiarity with Azure, M365, or AI workload cost models (COGS, utilization, throughput)
Background building control planes, orchestration platforms, or automation systems from 0→1
Experience hiring and growing IC teams in a high-growth platform org