You're joining Core AI, the team at the forefront of redefining how software is built and experienced. We create the foundational platforms, services, and developer experiences that power next-generation applications using Generative AI, enabling developers and enterprises to unlock the full potential of AI to build intelligent, adaptive, and transformative software.

You will be a technical contributor driving the applied science foundation for observability in AI agents and multi-agent systems running at scale. This role focuses on understanding how intelligent agents behave in production: their quality, safety, reliability, cost, and evolution over time. You will develop and apply scientific methods, evaluation frameworks, and measurement systems that help teams understand, benchmark, diagnose, and safely improve agent-based systems with confidence.

AI agents introduce fundamentally new observability challenges: non-deterministic execution, tool- and model-driven decision paths, emergent multi-agent behaviors, and quality signals that go far beyond traditional uptime metrics. We are hiring multiple Senior and Principal Applied Scientists who will operate at the intersection of agent architecture, telemetry, evaluation science, and responsible AI, shaping how Microsoft measures and improves observable AI systems.
Job Responsibilities:
Develop evaluation and measurement frameworks for single-agent and multi-agent systems, spanning quality, safety, reliability, cost, and behavioral consistency
Design methodologies that connect offline evals, online signals, and production telemetry to explain how prompt, tool, model, or orchestration changes affect real-world agent performance
Define scientifically grounded quality signals and benchmarks for agent systems, including task success, tool-use effectiveness, plan quality, failure modes, coordination quality, and user-perceived outcomes
Build models and analysis techniques that help detect regressions, identify root causes, and characterize agent behavior across diverse workflows and environments
Advance observability for AI systems through new approaches to trace analysis, agent health modeling, behavioral clustering, anomaly detection, and multi-agent coordination analysis
Partner with engineering teams to operationalize evaluation and observability methods in production systems, enabling safe iteration through staged rollouts, experimentation, A/B testing, and automated regression detection
Contribute to instrumentation and semantic standards for agent observability, helping make agent execution more explainable, diagnosable, and comparable across systems
Collaborate deeply with product and platform teams across Foundry, Azure Monitor, and agent runtimes to shape end-to-end experiences for evaluation, benchmarking, monitoring, and investigation
Act as a technical leader by setting scientific direction, driving research-informed product decisions, mentoring others, and raising the technical bar across the organization
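One of the responsibilities above, automated regression detection over evaluation results, can be sketched with a minimal, illustrative example. The function and the eval scores below are hypothetical, not part of any Microsoft system: it runs a permutation test asking whether a candidate agent's task-success scores are significantly lower than a baseline's.

```python
import random
import statistics

def detect_regression(baseline, candidate, n_permutations=10_000,
                      alpha=0.05, seed=0):
    """One-sided permutation test: does the candidate score significantly
    below the baseline? Returns (regressed, p_value)."""
    rng = random.Random(seed)
    # Observed drop in mean eval score from baseline to candidate.
    observed = statistics.mean(baseline) - statistics.mean(candidate)
    pooled = list(baseline) + list(candidate)
    n = len(baseline)
    count = 0
    for _ in range(n_permutations):
        # Under the null hypothesis, group labels are exchangeable.
        rng.shuffle(pooled)
        diff = statistics.mean(pooled[:n]) - statistics.mean(pooled[n:])
        if diff >= observed:
            count += 1
    p = count / n_permutations
    return p < alpha, p

# Hypothetical per-run task-success rates for an agent, before and
# after a prompt/model/orchestration change.
baseline = [0.92, 0.90, 0.93, 0.91, 0.94, 0.92, 0.90, 0.93]
candidate = [0.85, 0.83, 0.86, 0.84, 0.87, 0.85, 0.82, 0.86]
regressed, p = detect_regression(baseline, candidate)
```

In practice a check like this would gate a staged rollout: the candidate only advances if no statistically significant drop is detected on the offline eval suite.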
Requirements:
Bachelor's Degree in Statistics, Econometrics, Computer Science, Electrical or Computer Engineering, or related field AND 4+ years related experience (e.g., statistics, predictive analytics, research)
OR Master's Degree in Statistics, Econometrics, Computer Science, Electrical or Computer Engineering, or related field AND 3+ years related experience (e.g., statistics, predictive analytics, research)
OR Doctorate in Statistics, Econometrics, Computer Science, Electrical or Computer Engineering, or related field AND 1+ year(s) related experience (e.g., statistics, predictive analytics, research)
OR equivalent experience
Ability to meet Microsoft, customer, and/or government security screening requirements is required for this role. These requirements include, but are not limited to, the following specialized security screening: Microsoft Cloud Background Check. This position will be required to pass the Microsoft Cloud Background Check upon hire/transfer and every two years thereafter
Nice to have:
Bachelor's Degree in Statistics, Econometrics, Computer Science, Electrical or Computer Engineering, or related field AND 9+ years related experience (e.g., statistics, predictive analytics, research)
OR Master's Degree in Statistics, Econometrics, Computer Science, Electrical or Computer Engineering, or related field AND 6+ years related experience (e.g., statistics, predictive analytics, research)
OR Doctorate in Statistics, Econometrics, Computer Science, Electrical or Computer Engineering, or related field AND 4+ years related experience (e.g., statistics, predictive analytics, research)
OR equivalent experience
Experience designing evaluation methodologies, experiments, or measurement systems for complex intelligent or distributed systems
Experience analyzing large-scale production or experimental data to derive actionable insights and drive product or system improvements
Strong coding and prototyping skills in Python or similar languages, with the ability to work closely with engineering teams on production-facing systems
Demonstrated ability to lead cross-team technical direction through scientific depth, influence, and strong problem framing
Advanced degree in Computer Science, Machine Learning, Statistics, Applied Mathematics, or related field
Experience building or evaluating LLM- or agent-based systems in production
Familiarity with agent frameworks such as LangChain, LangGraph, OpenAI SDK, or equivalent orchestration frameworks
Experience with evaluation frameworks for AI systems, including benchmarking, regression analysis, and human-in-the-loop assessment
Experience with observability systems, telemetry analysis, or distributed tracing data in large-scale environments
Background in AI safety, guardrails, and responsible AI measurement
Experience with experimentation platforms, causal inference, or statistical methods for product and model evaluation
Experience working with cloud-scale monitoring platforms such as Azure Monitor / Application Insights or equivalent