The CoreAI GPU Infrastructure team builds the foundational accelerated compute platforms that power large-scale AI training and inference across Azure. Our mission is to deliver secure, reliable, and highly efficient GPU infrastructure that enables multi-tenant AI systems at global scale while maximizing utilization, performance, and developer productivity. This role sits at the intersection of cloud infrastructure, systems software, virtualization, and container platforms, working closely with CoreAI, Azure Infrastructure, OS, Networking, and Hardware teams to deliver end-to-end platform capabilities.
Job Responsibilities:
Design and build GPU-accelerated infrastructure for training and inference workloads, spanning bare metal, virtual machines, and containerized environments
Develop systems for GPU device management, scheduling, isolation, and sharing (e.g., partial GPU allocation, multi-tenant usage)
Build and operate advanced orchestration and resource governance scenarios using platforms such as AKS, Dynamic Resource Allocation (DRA), and related Kubernetes ecosystem capabilities to enable fair sharing, isolation, and efficient utilization of accelerated resources
Build and evolve virtualization and container stacks to support modern AI workloads, including secure and confidential compute scenarios
Optimize performance, reliability, and utilization across large GPU fleets, including scale-up and scale-out configurations
Partner with networking and storage teams to enable high-performance interconnects (e.g., RDMA/InfiniBand-class networking) for distributed workloads
Drive end-to-end platform features from design through production, including observability, diagnostics, and operational excellence
Influence platform architecture and technical direction across teams through design reviews and technical leadership
Requirements:
Bachelor's Degree in Computer Science or a related technical field and 6+ years of technical engineering experience coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python; or equivalent experience
Proven ability to design and operate large-scale, production infrastructure with high reliability and performance requirements
Strong problem-solving skills and the ability to debug complex, cross-layer systems issues
Demonstrated technical leadership, including mentoring engineers and driving cross-team architectural alignment