The CoreAI Workloads team builds the foundational inference engines and APIs that power large-scale AI inference across Azure - from cutting-edge startups to Fortune 500 enterprises and Microsoft Copilots and agents. Our mission is to deliver secure, reliable, and highly efficient GPU inference that enables multitenant AI systems at global scale while maximizing utilization, performance, and developer productivity. We own inference serving and performance for OpenAI and other state-of-the-art large language models (LLMs) and work directly with OpenAI, serving some of the largest workloads on the planet with trillions of inferences per day. Our converged AI fabric and engines deliver inference capabilities for all LLMs in the Microsoft catalog, including OpenAI, Anthropic, Mistral, Cohere, Llama, and more.
Job Responsibilities:
Optimize inference engines for OpenAI and open-source models by implementing and shipping performance/efficiency improvements across runtime, scheduling, and serving paths (latency, throughput, utilization, availability, and cost)
Run experiments end-to-end: formulate hypotheses, implement engine changes (including Python/PyTorch integration points where relevant), analyze results, and ship improvements behind guardrails
Build and use experimentation capabilities for large-scale AI inference (experiment lifecycle, tracking, metric modeling, comparability standards, automated analysis) so the team can iterate quickly and safely
Own serving availability and efficiency for Azure OpenAI Service workloads through tiered experimentation, lean segmentation, and multi-modal utilization across heterogeneous fleets—turning findings into shipped engine improvements
Design and evolve inference serving architectures to improve utilization and latency using techniques such as disaggregated serving, multi-token prediction, KV offload/retrieval, and quantization—validated via staged rollouts and production guardrails
Extend AI infrastructure abstractions to support elastic, heterogeneous inference engines reliably at scale (e.g., dynamic scaling across model families, modalities, and workload classes while maintaining isolation and SLOs)
Tune and scale inference engines across NVIDIA GPU generations (A100, H100, H200) for state-of-the-art OpenAI models, focusing on serving efficiency, utilization, and reliability (not hardware bring-up)
Partner with networking and storage teams to leverage high-performance interconnects (e.g., RDMA fabrics such as InfiniBand or RoCE) for distributed inference, without owning low-level kernel/driver enablement
Drive end-to-end features from design through production: observability, diagnostics, performance regression detection, and operational excellence for inference serving
Influence platform architecture and technical direction across teams through design reviews, clear metrics, and technical leadership focused on experimentation velocity and production reliability
Work across multiple layers of the AI software stack (abstractions, programming models, engine runtimes, libraries, and APIs) to enable large-scale model inference
Benchmark OpenAI and other LLMs for performance across Azure OpenAI Service workload tiers and segments, and translate results into production improvements (see the benchmarking sketch after this list)
Debug, profile, and optimize production inference performance across the stack (abstractions, runtime, scheduling, and serving pipelines) to improve latency, throughput, and utilization
Monitor performance regressions and drive continuous improvements to reduce time-to-deploy and hardware footprint
Collaborate across engineering teams to deliver scalable, production-ready serving efficiency and availability improvements, using experimentation results to guide prioritization and rollout
Build durable engine interfaces that enable fast experimentation and safe shipping of new strategies for quality of service (QoS), replica load balancing, KV management (including offload/retrieval), quantization, and sampling (e.g., multi-token prediction and constrained sampling)
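The benchmarking and profiling responsibilities above can be pictured with a minimal sketch. The Python harness below measures latency percentiles and rough throughput for a generic generate() call under fixed concurrency; the generate() stub, request counts, and concurrency level are placeholders for illustration, not the actual Azure OpenAI Service interfaces or workloads.

import statistics
import time
from concurrent.futures import ThreadPoolExecutor

def generate(prompt: str) -> str:
    # Placeholder for a real inference call (e.g., an HTTP request to a
    # serving endpoint); stubbed here so the harness runs standalone.
    time.sleep(0.05)
    return prompt[::-1]

def timed_call(prompt: str) -> float:
    start = time.perf_counter()
    generate(prompt)
    return time.perf_counter() - start

def run_benchmark(num_requests: int = 200, concurrency: int = 8) -> None:
    prompts = [f"request {i}" for i in range(num_requests)]
    wall_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(timed_call, prompts))
    wall = time.perf_counter() - wall_start
    # statistics.quantiles with n=100 returns 99 cut points (percentiles 1..99).
    q = statistics.quantiles(latencies, n=100)
    p50, p95, p99 = q[49], q[94], q[98]
    print(f"p50={p50 * 1000:.1f}ms p95={p95 * 1000:.1f}ms p99={p99 * 1000:.1f}ms "
          f"throughput={num_requests / wall:.1f} req/s")

if __name__ == "__main__":
    run_benchmark()

A real harness would sweep batch sizes, sequence lengths, and concurrency, and feed results into regression detection; this sketch only shows the shape of the measurement loop.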
Requirements:
Bachelor's Degree in Computer Science or related technical field and 6+ years of technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, Python, or equivalent experience
Proven ability to design and operate large-scale, production inference services with high reliability and performance requirements, and to ship performance improvements safely via disciplined experimentation
Strong skills in performance analysis: benchmarking, profiling, diagnosing regressions, and turning results into concrete engine/runtime changes
Strong problem-solving skills and the ability to debug complex, cross-layer systems issues
Demonstrated technical leadership, including mentoring engineers, driving cross-team architectural alignment, and leveraging AI tools and AI-assisted workflows to accelerate engineering velocity and quality
Hands-on experience with Kubernetes (building and operating services on k8s), including debugging production issues and designing platform abstractions (e.g., custom resources/controllers) and scheduling-aware deployments (e.g., node affinity, taints/tolerations, resource requests/limits) (see the deployment sketch after this list)
Strong collaboration and communication skills, with the ability to work across organizational boundaries
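As a rough illustration of the scheduling-aware deployment concepts above (node affinity, taints/tolerations, resource requests/limits), the sketch below builds a GPU pod spec with the official kubernetes Python client. The image name, GPU node label (gpu.sku), and SKU value are hypothetical placeholders, not a prescribed configuration.

from kubernetes import client

def gpu_inference_pod_spec() -> client.V1PodSpec:
    container = client.V1Container(
        name="inference-engine",
        image="example.azurecr.io/inference-engine:latest",  # placeholder image
        resources=client.V1ResourceRequirements(
            requests={"cpu": "8", "memory": "64Gi", "nvidia.com/gpu": "1"},
            limits={"nvidia.com/gpu": "1"},
        ),
    )
    return client.V1PodSpec(
        containers=[container],
        # Tolerate the GPU taint so the pod can land on dedicated GPU nodes.
        tolerations=[client.V1Toleration(
            key="nvidia.com/gpu", operator="Exists", effect="NoSchedule",
        )],
        # Require a specific GPU SKU via node affinity ("gpu.sku" is a hypothetical label).
        affinity=client.V1Affinity(
            node_affinity=client.V1NodeAffinity(
                required_during_scheduling_ignored_during_execution=client.V1NodeSelector(
                    node_selector_terms=[client.V1NodeSelectorTerm(
                        match_expressions=[client.V1NodeSelectorRequirement(
                            key="gpu.sku", operator="In", values=["H100"],
                        )],
                    )],
                ),
            ),
        ),
    )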
Nice to have:
Experience optimizing LLM inference in practice (e.g., PyTorch inference, serving runtimes, model execution, or inference orchestration) in production environments
Familiarity with high-performance networking and low-latency communication stacks
Familiarity with GPU-accelerated inference stacks (e.g., CUDA at the application/runtime level, device plugins, or runtime integration)
Experience building or using experimentation systems (A/B, canarying, tiered rollout), including metric definition and comparability for performance and reliability
Familiarity with distributed inference stacks (e.g., NCCL-style collectives, model/tensor parallelism) and performance tradeoffs in large-scale serving
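To make the collective-communication point concrete, here is a minimal PyTorch sketch of an NCCL all-reduce, the primitive used to combine partial results in tensor-parallel inference. It assumes a torchrun launch on NVIDIA GPUs and is an illustrative example, not the team's serving code.

import os
import torch
import torch.distributed as dist

def main():
    # torchrun sets RANK, WORLD_SIZE, and LOCAL_RANK for each process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Each rank holds a partial result; an all-reduce sums the shards,
    # as in reducing tensor-parallel matmul outputs across GPUs.
    shard = torch.ones(4096, device="cuda") * dist.get_rank()
    dist.all_reduce(shard, op=dist.ReduceOp.SUM)

    if dist.get_rank() == 0:
        print("reduced[0] =", shard[0].item())
    dist.destroy_process_group()

if __name__ == "__main__":
    main()

Launched with, e.g., torchrun --nproc_per_node=2 allreduce_demo.py on a multi-GPU node (the filename is a placeholder).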