Microsoft Ads powers experiences at global scale through large-scale machine learning systems that operate under strict latency, reliability, freshness, and cost constraints. As Ads expands its use of advanced ML and LLM-based systems, inference has become a core production challenge across low-latency online serving, near-real-time decisioning, and large-scale batch workflows.

We are looking for a Senior Applied Scientist / Machine Learning Engineer to optimize end-to-end inference workflows for large-scale Ads models. This role is ideal for someone who is deeply technical, hands-on, and excited to work at the intersection of ML and systems.

In this role, you will partner closely with applied scientists and engineers to translate model innovation into efficient, reliable, and cost-effective production systems. You will work across the inference stack, including runtime optimization, batching, scheduling, routing, caching, parallelism, observability, and resource management, with the goal of improving production impact across Ads scenarios. The role also includes supporting emerging agentic workloads that rely on multi-turn reasoning, tool use, structured generation, and long-context inference.
Job Responsibilities:
Design and optimize end-to-end ML/LLM inference workflows across online low-latency serving, near-real-time inference, and large-scale batch inference scenarios
Build scalable serving and execution systems for large-scale models, including scheduling, batching, routing, admission control, and resource-aware execution
Improve inference performance and efficiency across compute, memory, storage, network, and concurrency dimensions, with a strong focus on latency, throughput, reliability, and cost
Develop and apply modern serving techniques such as continuous or dynamic batching, prefix caching, KV-cache optimization, request shaping, tail-latency reduction, and runtime-level performance tuning
Optimize systems for key generative inference metrics such as time to first token, inter-token latency, throughput, accelerator utilization, and cost per request (a measurement sketch follows this list)
Work on runtime and serving optimizations for modern inference stacks such as vLLM, TensorRT-LLM, SGLang, Triton, ONNX Runtime, and PyTorch-based serving systems
Partner with applied scientists to productionize new models and inference patterns, including agentic workflows with tool use, structured outputs, and long-context workloads, and evaluate quality-latency-cost tradeoffs in real production scenarios
Design and improve scheduling and resource management for heterogeneous and multi-tenant inference workloads, including GPU-aware placement, admission control, burst handling, and workload isolation
Build strong observability and diagnostics for inference services, including bottleneck analysis, performance regression detection, and end-to-end latency and cost measurement
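
To make the metrics above concrete, here is a minimal Python sketch of how time to first token (TTFT), inter-token latency (ITL), and decode throughput can be measured against any streaming generation endpoint. The token_stream iterable and its source are hypothetical stand-ins, not a specific serving API; a production harness would add warmup, percentile aggregation, and load control.

```python
import time
from typing import Iterable

def measure_streaming_latency(token_stream: Iterable[str]) -> dict:
    """Measure TTFT, mean ITL, and decode throughput for one streamed request.

    `token_stream` is any iterable that yields decoded tokens as the model
    produces them (hypothetical; wire it to the serving client of your choice).
    """
    start = time.perf_counter()
    ttft = None
    gaps = []            # time between consecutive tokens
    last = start
    n_tokens = 0
    for _ in token_stream:
        now = time.perf_counter()
        if ttft is None:
            ttft = now - start   # latency until the first token arrives
        else:
            gaps.append(now - last)
        last = now
        n_tokens += 1
    total = last - start
    return {
        "ttft_s": ttft,
        "mean_itl_s": sum(gaps) / len(gaps) if gaps else 0.0,
        "throughput_tok_per_s": n_tokens / total if total > 0 else 0.0,
    }
```

Tracked as p50/p95/p99 over time, these same numbers feed the bottleneck analysis and performance-regression detection described above.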
Requirements:
Bachelor’s or Master’s degree in Computer Science, Mathematics, Software Engineering, Computer Engineering, or related technical field, and 5+ years of related experience in machine learning systems, distributed systems, inference infrastructure, or software engineering
OR Doctorate in Computer Science, Mathematics, Software Engineering, Computer Engineering, or related technical field, and 2+ years of related experience
Strong programming skills in Python, C++, or C#
Hands-on experience in one or more of the following areas: large-scale ML/LLM inference serving in production; ML systems (MLSys) work on model deployment, serving, or runtime optimization
Experience building or optimizing systems for online inference, batch inference, or near-real-time inference
Strong understanding of inference bottlenecks such as batching, queuing, tail latency, KV-cache pressure, memory bandwidth limits, caching, and heterogeneous resource utilization
Experience with one or more modern inference stacks or runtimes such as vLLM, TensorRT-LLM, SGLang, Triton, ONNX Runtime, DeepSpeed, or PyTorch inference tooling (a minimal vLLM sketch follows this list)
Experience with modern LLM inference and serving techniques, including areas such as KV-cache management, prefix caching, speculative decoding, quantization, prefill/decode disaggregation, or MoE inference optimization
Experience with production-scale model serving platforms and distributed inference systems, including multi-node or multi-tenant deployments, resource-aware scheduling, and optimization across heterogeneous workloads
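
As one concrete illustration of the stacks and techniques listed above, the sketch below uses vLLM's offline API with prefix caching enabled, so requests that share a long prompt prefix can reuse previously computed KV-cache blocks. The model name and prompts are placeholders, and the exact engine flags depend on the vLLM version; treat this as a sketch under those assumptions, not a reference deployment.

```python
from vllm import LLM, SamplingParams

# Placeholder model. enable_prefix_caching asks the engine to reuse KV-cache
# blocks across requests that share an identical prompt prefix.
llm = LLM(model="facebook/opt-125m", enable_prefix_caching=True)
params = SamplingParams(temperature=0.0, max_tokens=64)

# Ads-style requests often share a long instruction prefix; only the query
# differs, so the prefix's prefill work is paid once and amortized across requests.
shared_prefix = "You are an ads relevance assistant. Rate the query below.\n"
prompts = [shared_prefix + q for q in ("running shoes", "cloud gpu pricing")]

outputs = llm.generate(prompts, params)
for out in outputs:
    print(out.outputs[0].text.strip())
```

The same idea carries over to online serving, where the cache hit rate on shared prefixes directly improves time to first token and cost per request.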