Microsoft 365 Copilot is transforming productivity by integrating large language models with user data, Microsoft Graph, and the web. At the core of this innovation is the Substrate Intelligence Platform (DSX) team, which powers personalized, secure, and scalable Copilot experiences across Microsoft 365, including Teams, Word, Excel, PowerPoint, and OneNote.
Our team is pioneering the infrastructure for tenant‑isolated fine‑tuning, a foundational platform capability that enables customers to safely personalize Copilot agents using their own data. This includes support for leading OpenAI models (e.g., GPT‑5, O4 Mini) and open‑source models such as Qwen, Mistral, and GPT‑OSS.
We own the end‑to‑end fine‑tuning platform via Heron, spanning data extraction and isolation; secure training and evaluation workflows; and model deployment, migration, and lifecycle management. Our systems operate at massive scale in multi‑tenant environments, enforce strict security and compliance boundaries, manage shared GPU resources effectively, and enable seamless onboarding of new models and customers.
Job Responsibility:
Architect and lead the design of large‑scale, distributed services that power tenant‑isolated fine‑tuning and evaluation workflows
Drive end‑to‑end technical ownership of critical platform areas, from data ingestion and training orchestration to deployment, rollback, and monitoring
Define and evolve secure data movement patterns across tenant boundaries, ensuring compliance with Microsoft security, privacy, and governance requirements
Establish long‑term technical vision and roadmap for the Heron fine‑tuning platform, balancing scalability, reliability, cost, and developer velocity
Lead cross‑team technical reviews, influencing designs and driving alignment across multiple organizations
Build frameworks and abstractions that improve operational excellence, including observability, quota management, failure recovery, and developer ergonomics
Act as a technical mentor for senior and junior engineers, raising the bar on design quality, code health, and engineering rigor
Partner with engineering managers and product leaders to translate business goals into executable technical strategies
Proactively identify and resolve systemic production issues, driving durable fixes rather than tactical mitigations
Requirements:
Bachelor's Degree in Computer Science or related technical field AND hands‑on technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
Proven experience designing and operating large‑scale distributed systems in production
Demonstrated ability to lead technical decisions across multiple teams or services
The ability to meet Microsoft, customer, and/or government security screening requirements is required for this role
These requirements include, but are not limited to, the following specialized security screenings: Microsoft Cloud Background Check. This position will be required to pass the Microsoft Cloud background check upon hire/transfer and every two years thereafter
Nice to have:
Master's Degree in Computer Science or related technical field AND 8+ years of technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR Bachelor's Degree in Computer Science or related technical field AND extensive technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
Experience building platform or infrastructure services in cloud environments (Azure preferred)
Deep understanding of multi‑tenant architectures, security boundaries, and privacy‑compliant system design
Hands‑on experience with Azure Machine Learning, Kubernetes, GPU‑backed workloads, or large‑scale data pipelines
Track record of driving architecture simplification, reliability improvements, and cost efficiency at scale
Ability to operate effectively in high ambiguity, influencing without authority and earning trust across org boundaries