Joining the CoreAI organization at Microsoft means becoming part of the team that builds the end-to-end AI stack powering Azure’s innovation. As a member of the FIT training team within CoreAI, you will help develop the AI infrastructure that accelerates the creation of agentic AI systems across Microsoft. This role is dedicated to advancing scientific methods and scalable infrastructure for training agentic models to achieve frontier-level performance. You will contribute to LLMs, SLMs, and agentic models using both proprietary and open-source frameworks, all aimed at delivering reliable, enterprise-grade agentic workflows. We are seeking a curious, independent, adaptable problem-solver who thrives on continuous learning, embraces changing priorities, and is motivated by creating meaningful impact. Candidates must be able to lead and serve as role models for a driven team, write efficient code, debug complex training jobs, document findings, and demonstrate a track record of continuous improvement. In addition, we value an agile, startup-style mindset: someone who can iterate quickly, pivot when needed, and collaborate effectively in fast-paced, dynamic environments.
Job Responsibilities
Collaborate with engineers and researchers to build and optimize training infrastructure and tools for LLMs, SLMs, multimodal, and code-specific models
Design, build and improve services with high scalability and reliability
Design and implement services that serve production traffic and meet security and privacy requirements
Participate in efforts to deliver and improve engineering systems and practices to ensure service quality in complex cloud environments
Contribute to the deployment and monitoring of services in production environments
Requirements
Bachelor's Degree in Computer Science or related technical field and 4+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, Python, or equivalent experience
5+ years of software engineering experience, with significant ownership of production services, cloud platforms, distributed systems, or developer infrastructure
Strong experience building and operating containerized platforms using Kubernetes or similar orchestration systems
Strong coding skills in one or more systems or backend languages such as Python, Go, Rust, C++, C#, or Java
Experience designing reliable production APIs, backend services, or control-plane systems that manage compute, storage, networking, or runtime environments
Solid understanding of cloud infrastructure fundamentals, including identity, networking, storage, observability, capacity planning, security, and safe deployment practices
Experience diagnosing production issues using logs, metrics, traces, dashboards, and incident response processes
Demonstrated ability to lead technical design, drive ambiguous projects to completion, mentor other engineers, and collaborate across teams
Nice to have
Experience with Microsoft Azure, AWS, or Google Cloud, especially managed Kubernetes, container registries, object storage, private networking, identity, secrets, and monitoring services
Experience building multi-tenant platforms where reliability, fairness, quota management, isolation, and security are important
Experience with sandboxed execution environments, remote development environments, hosted notebook/tool environments, evaluation infrastructure, or ephemeral compute platforms
Experience with container image build systems, registry authentication, image caching, package caching, artifact distribution, or startup-latency optimization
Experience with cloud networking concepts such as ingress, DNS, proxies, egress control, private endpoints, service routing, and traffic management
Experience with secure runtime design, including authentication, authorization, workload identity, secret handling, network isolation, and protecting shared infrastructure from untrusted workloads
Experience with AI infrastructure, agent execution, evaluation platforms, GPU workloads, Windows/Linux runtime environments, or VM/container hybrid systems
Experience improving service operability through structured logging, distributed tracing, dashboards, alerting, automated validation, and incident playbooks