As a Principal Software Engineering Manager - AI Frameworks on the team, you will lead and grow a group of engineers working across multiple layers of the AI software serving stack, including fundamental abstractions, runtimes, libraries, and application programming interfaces (APIs). You will be responsible for setting technical direction, prioritizing investments, and ensuring the team delivers high-impact performance improvements that enable large-scale model training and inference. In this role, you will guide the team’s work on benchmarking OpenAI and other large language models (LLMs) across GPUs and Microsoft hardware, driving performance optimization, monitoring regressions, and accelerating time-to-deployment. You will partner closely with researchers, product teams, and platform owners to translate performance insights into production-ready improvements that reduce hardware footprint and support Microsoft Azure’s capex efficiency goals.
Job Responsibilities:
Lead and develop a team of engineers working across multiple layers of the AI software stack to enable large-scale training and inference
Set technical vision and execution strategy for model performance benchmarking, optimization, and deployment across GPUs and Microsoft hardware
Drive performance outcomes by prioritizing and overseeing efforts to benchmark, profile, debug, and optimize training and inference workloads
Own performance health by establishing mechanisms to monitor regressions, measure impact, and continuously improve time-to-deploy and hardware efficiency
Partner cross-functionally with research, product, infrastructure, and hardware teams to deliver scalable, production-ready AI performance improvements
Balance short-term delivery and long-term investments, ensuring the team’s work aligns with organizational goals, platform roadmaps, and Azure capex objectives
Build a strong engineering culture through coaching, feedback, hiring, and career development, enabling the team to operate with increasing autonomy and impact
Requirements:
Bachelor’s Degree in Computer Science or related technical field AND 6+ years of technical engineering experience coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python, OR equivalent experience
Master’s Degree in Computer Science or related technical field AND 10+ years of software engineering experience, including 6+ years in engineering management, OR Bachelor’s Degree in Computer Science or related technical field AND 12+ years of software engineering experience, including 6+ years in engineering management, OR equivalent experience
Strong technical foundation in software engineering principles, computer architecture, GPU architecture, and hardware acceleration for neural networks, with the ability to guide teams working in these areas
Experience leading teams responsible for end-to-end performance analysis and optimization of LLMs, AI systems, or HPC workloads, including use of GPU profiling and performance analysis tools
Demonstrated ability to lead cross-team initiatives, align stakeholders, and translate research or platform capabilities into scalable, production-ready solutions
Proven people leadership skills, including hiring, coaching, performance management, and career development, with a track record of building high-performing, inclusive teams
Exposure to AI / ML infrastructure, including DNN or LLM training and/or inference systems, and experience with at least one modern deep learning framework (e.g., PyTorch, TensorFlow, ONNX Runtime)
Familiarity with GPU software stacks and acceleration technologies such as CUDA, ROCm, Triton, or equivalent, sufficient to guide technical direction and evaluate tradeoffs