Senior Software Engineer - AI Infrastructure (Scheduler) - CoreAI

Microsoft Corporation

Location:
United States, Redmond

Contract Type:
Not provided

Salary:

119800.00 - 234700.00 USD / Year

Job Description:

The AI Platform organization builds the end-to-end Azure AI stack, from the infrastructure layer to the PaaS and user experience offerings for AI application builders, researchers, and major partner groups across Microsoft. The platform is core to Azure’s innovation, differentiation and operational efficiency, as well as the AI-related capabilities of all of Microsoft’s flagship products, from M365 and Teams to GitHub Copilot and Bing Copilot. We are the team building the Azure OpenAI service, AI Foundry, Azure ML Studio, Cognitive Services, and the global Azure infrastructure for managing the GPU and NPU capacity running the largest AI workloads on the planet.

One of the major, mature offerings of AI Platform is Azure ML Services. It provides data scientists and developers with a rich experience for defining, training, fine-tuning, deploying, monitoring, and consuming machine learning models. We provide the infrastructure and workload management capabilities powering Azure ML Services, and we engage directly with some of the major internal research and applied ML groups using these services, including Microsoft Research and the Bing WebXT team.

As part of AI Platform, the AI Infra team is looking for a Senior Software Engineer - AI Infrastructure (Scheduler) - CoreAI. The scheduler is the “brains” of the AI Infra control plane. It governs access to the GPU and NPU capacity of the platform according to a complex system of workload preference rules, placement constraints, optimization objectives, and dynamically interacting policies aimed at maximizing hardware utilization and fulfilling the widely varying needs of users and the AI Platform partner services in terms of workload types, prioritization, and capacity targeting flexibility. The scheduler’s set of capabilities is broad and ambitious. It manages quota, capacity reservations, SLA tiers, preemption, auto-scaling, and a wide range of configurable policies.

Global scheduling is a distinctive major feature that overcomes the regional segmentation of the Azure compute fleet by treating the GPU capacity as a single global virtual pool, which greatly increases capacity availability and utilization for major classes of ML workload. We have achieved this capability without introducing a global single point of failure: regional instances of the scheduler service interact via peer-to-peer protocols to share capacity inventory and coordinate the handoff of jobs for scheduling. Our system manages a significant amount of GPU capacity even outside Azure datacenters, through a unified model and operational process and highly generalized, flexible workload scheduling capabilities.

To manage the inherent complexity of the Scheduler subsystem and enable it to meet the stringent expectations of high service reliability, availability, and throughput, we emphasize rigorous engineering, utmost precision and quality, and ownership, from feature design to livesite. Quality mindset, attention to detail, development process rigor, and data-driven design and problem-solving skills are key for success in our mission-critical control plane space.

Job Responsibility:

  • Work on the design and development of the core AI Infrastructure distributed and in-cluster services that support large scale AI training and inferencing
  • Develop, test, and maintain control plane services written in C#, hosted on Service Fabric or Kubernetes (AKS) clusters
  • Enhance systems and applications to ensure high stability, efficiency, and maintainability, low latency, and tight cloud security
  • Provide operational support and DRI (on-call) responsibilities for the service
  • Develop and foster a deep understanding of the machine learning concepts, use cases, and relevant services used by our customers
  • Collaborate closely with service engineers, product managers, and internal applied research and data science teams within Microsoft to build better solutions together
  • Provide vision, expertise, and technical leadership to other team members
  • Help to grow talent in these areas
  • Embody our culture and values

Requirements:

  • Bachelor's Degree in Computer Science or related technical field AND 4+ years technical engineering experience with coding in languages including, but not limited to, C++, C#, Java, Scala, Rust, Go, TypeScript, OR equivalent experience
  • Ability to meet Microsoft, customer and/or government security screening requirements is required for this role
  • These requirements include but are not limited to the following specialized security screenings: Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud background check upon hire/transfer and every two years thereafter

Nice to have:

  • Master's degree in Computer Science or a related technical field
  • OOP proficiency and practical familiarity with common code design patterns
  • 3+ years of experience with large-scale services in a distributed environment, including concurrency management and stateful resource management
  • Hands-on experience with public cloud services at the IaaS level
  • Advanced knowledge of C# and .NET
  • Proficiency with use of complex data structures and algorithms, preferably in the setting of a resource allocator/scheduler, workflow/execution orchestration engine, database engine, or similar
  • Experience with managing the evolution of a large, complex codebase
  • Proficiency and thoroughness in unit testing and testability techniques
  • Knowledge of AI infrastructure, major use cases, and AI workload management
  • Demonstrated major design contributions and technical leadership
  • Excellent technical communication skills, verbal and written; product documentation experience
  • First-hand experience with building large-scale, multi-tenant global services with high availability
  • Experience with building and operating “stateful” and critical control plane services; handling challenges with data size and data partitioning; advanced use of a NoSQL cloud database
  • Experience with mapping complex object models to relational and non-relational datastores
  • Dev-ops experience with microservices architecture in a complex infrastructure and operational environment
  • Service reliability and fundamentals engineering; instrumentation for KPIs or performance analysis; demonstrated service and code quality mindset
  • Performance engineering: work on scalability, profiling; CPU, memory and I/O use optimization techniques
  • Applied cryptography and compliant handling of customer data
  • Network security: endpoint protection, federated authentication, RBAC
  • Applied knowledge of Kubernetes: service model, workload packaging and deployment, programmatic extensibility (CRDs, operators), or equivalent knowledge of Service Fabric; experience with any service mesh
  • Server-side Windows programming and performance engineering
  • Data analytics skills, in particular with Kusto
  • Work in a geo-distributed team

Additional Information:

Job Posted:
April 05, 2026

Employment Type:
Full-time
Work Type:
On-site work