Job Description:
Within Microsoft CoreAI, the Azure SRE Agent Platform team builds and operates production AI agents that help engineers detect, diagnose, and mitigate reliability issues across Azure services and customer workloads. Our work turns modern AI systems into reliable, production-grade operational tooling for AI ops scenarios, with a focus on quality, safety, and real-world impact.

We are looking for a Software Engineer II to help build the next generation of agentic systems for cloud operations. This role is for engineers who care deeply about product quality, end-to-end ownership, and the details that separate an exciting prototype from a system people trust during critical moments.

Our work spans the full lifecycle of agentic systems in production. We design and improve the core capabilities that shape agent behavior, including tool design, planning and execution loops, orchestration, evaluation, and safety guardrails. We also build the operational foundations that make those systems dependable in practice, including observability, progressive delivery, reliability engineering, and live-site learning.

Engineers on this team operate with high autonomy in a highly agile environment: short cycles, thin slices, feature flags, progressive delivery, and constant learning. We are looking for people with high agency and a strong bias for action: engineers who take ownership of ambiguous problems, move quickly, learn from production, and continuously raise the quality bar as they ship.