Do you want to be at the forefront of innovating the latest hardware designs to propel Microsoft’s cloud growth? Are you seeking a unique career opportunity that combines technical capabilities and cross-team collaboration with business insight and strategy? Join the Systems Planning and Architecture (SPARC) team within Microsoft’s Azure Hardware Systems and Infrastructure (AHSI) organization, the team behind Microsoft’s expanding cloud infrastructure and responsible for powering Microsoft’s “Intelligent Cloud” mission.
We are seeking a highly skilled Senior AI Software Architect to join our team, focused on model enablement and performance optimization for Maia accelerators. This role is ideal for someone with strong experience in PyTorch-based model development, quantization techniques, and parallelization strategies at the framework level. You will work closely with hardware and software teams to bring up models on Maia and ensure they run efficiently.
Job Responsibility:
Port and optimize large-scale AI models (e.g., foundation models, diffusion models, YOLO) to run efficiently on Maia hardware
Integrate models using frameworks such as PyTorch, ONNX, vLLM, and SGLang
Apply techniques like KV cache quantization (e.g., BF16 → FP8), checkpointing, and re-sharding for efficient inference and training
Experiment with parallelism strategies (TP, PP) and analyze performance impacts across interconnects (NVLink vs PCIe)
Collaborate on improving inference pipelines, including KV caching in SGLang/vLLM and performance tuning at the PyTorch level
Work with Triton kernels for basic operations (e.g., FP8 dequantization) and assist in kernel performance analysis
Partner with hardware architects and kernel developers for co-design discussions
Communicate effectively with multiple stakeholders to align on performance goals and deliverables
Requirements:
Bachelor's Degree in Computer Science or related technical field AND 4+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
Ability to meet Microsoft, customer, and/or government security screening requirements is required for this role
These requirements include but are not limited to the following specialized security screenings: Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud Background Check upon hire/transfer and every two years thereafter
Nice to have:
Bachelor's Degree in Computer Science or Engineering
3+ years of strong hands-on experience with PyTorch and model optimization techniques
Practical knowledge of quantization techniques such as PTQ/QAT, especially for KV cache quantization
Familiarity with parallelization strategies and distributed training concepts (e.g., sharding, allreduce)
2+ years of experience with AI inference stacks like SGLang/vLLM and performance profiling
Excellent problem-solving and communication skills
Ability to work in a collaborative team environment
3+ years of experience with Triton kernels and CUDA programming (a basic understanding is acceptable, but willingness to learn is essential)
Experience with AI accelerator hardware and embedded systems
3+ years of prior work on efficient model checkpointing, resharding scripts, and large-scale model deployments for serving at scale