Online Advertising is one of the fastest‑growing businesses on the Internet. Microsoft Ads powers large‑scale deep learning workloads across Search, Recommendations, Click Prediction, and Relevance. Deep learning sits at the core of how Ads drives business performance and delivers high‑quality user experiences. We are building a unified, high‑performance inference platform to serve Ads deep learning models at extreme scale. This platform serves billions of requests daily, with strict requirements on latency, throughput, reliability, and cost.

We are seeking a Principal Software Engineer with solid expertise in high‑performance C++ systems and large‑scale distributed serving, with preferred experience in GPU inference and acceleration technologies. You will be a senior technical leader driving the architecture, performance, and reliability of the next‑generation serving stack for Ads.

Microsoft’s mission is to empower every person and every organization on the planet to achieve more. As employees we come together with a growth mindset, innovate to empower others, and collaborate to realize our shared goals. Each day we build on our values of respect, integrity, and accountability to create a culture of inclusion where everyone can thrive at work and beyond.

Starting January 26, 2026, Microsoft AI (MAI) employees who live within a 50-mile commute of a designated Microsoft office in the U.S. or a 25-mile commute of a non‑U.S., country-specific location are expected to work from the office at least four days per week. This expectation is subject to local law and may vary by jurisdiction.
Job Responsibilities:
Design and build a unified inference platform for Ads, ensuring scalability, reliability, and efficiency
Optimize model inference via batching, quantization, scheduling, memory management, runtime optimization, and other performance improvements
Develop, optimize, and maintain performance‑critical components for high‑throughput, low‑latency production inference, including GPU‑accelerated paths when applicable
Collaborate with algorithm/model teams to co‑design serving‑aware model architectures and optimizations
Profile and improve end‑to‑end system performance: concurrency, memory footprint, throughput, and latency
Provide senior technical leadership across teams; elevate engineering best practices and influence long‑term technical strategy
Requirements:
Bachelor's Degree in Computer Science or a related technical field AND 4+ years of technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python, OR equivalent experience
4+ years of experience building high‑performance, large‑scale distributed systems or ML infrastructure
Expert‑level proficiency in C++, with strong understanding of data structures, algorithms, and system design
Experience building and optimizing performance‑critical production systems
Experience working in Ads, Search, Recommendation systems, or other large‑scale online serving systems
Nice to have:
8+ years developing high‑performance distributed systems in C++
Experience with GPU inference runtimes such as TensorRT, ONNX Runtime, Triton, TRT‑LLM, or vLLM
Expertise in CUDA kernel development and GPU performance engineering