Build and operate core infrastructure services that power Fabric Data Engineering on Spark.
Improve scalability, resiliency, and observability across Spark-based services.
Partner closely with product, client/UX, and runtime teams to ship end-to-end experiences.
Drive engineering excellence through design reviews, testing, incident learnings, and performance tuning.
Intelligent job/session orchestration and scheduling improvements.
Runtime performance optimizations (caching, adaptive execution, cost/perf tuning).
Debuggability & observability (logs/metrics/traces, diagnostics experiences).
Reliability tooling (auto-heal, safe rollouts, incident reduction).
Data engineering developer experience improvements (config, templates, integrations).
Job Responsibilities:
Lead the design and delivery of world-class experiences for a new big data cloud offering, with emphasis on scale, reliability, and performance.
Manage and grow a team building core infrastructure services for data engineering and analytics workloads (compute, runtime services, job/session management, configuration, platform integrations).
Own technical strategy and execution end-to-end: translate product requirements into architecture, milestones, and high-quality production outcomes.
Drive operational excellence by establishing troubleshooting practices (logs, metrics, traces), guiding root-cause analysis, and converting operational learnings into engineering improvements.
Improve platform scalability, resiliency, and observability, including automation to reduce operational toil; ensure best practices are adopted consistently across the team.
Partner cross-functionally with product and engineering leaders to deliver end-to-end features, align priorities, and continuously raise the quality bar.
Coach and mentor engineers, provide technical guidance and performance feedback, and foster a culture of ownership, high standards, and continuous learning.
Requirements:
Bachelor's Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience.
Ability to meet Microsoft, customer, and/or government security screening requirements is required for this role.
This position will be required to pass the Microsoft Cloud background check upon hire/transfer and every two years thereafter.
4+ years people management experience.
Software engineering foundation (data structures, algorithms, testing, debugging, performance) with the ability to guide and review technical decisions.
Demonstrated experience leading teams that build and ship production infrastructure (backend services, distributed systems, platform components) in cloud environments.
Understanding of distributed systems concepts, including fault tolerance, scaling, scheduling, and resource management, and ability to apply them to team-level architecture.
Proficiency in at least one backend/system language (e.g., Java, Scala, C#, C++, Python) and the ability to stay hands-on enough to unblock teams and assess designs.
Proven ability to ramp up quickly in new domains, tools, and codebases; growth mindset and learning agility.
Ability to operate effectively in an AI-powered engineering environment: adopting AI-assisted workflows (copilots/agents), improving team productivity, and elevating quality.
Experience leading large-scale infrastructure efforts for data platforms or compute services (e.g., orchestration, runtime services, cluster/resource management, multi-tenant systems).
Experience establishing and running observability and operations programs (SLOs/SLIs, alerting, incident response, postmortems) and improving reliability over time.
Background in performance and reliability engineering (profiling, optimization, capacity planning, cost/performance tradeoffs).
Familiarity with cloud-native operating models: service ownership, CI/CD, safe deployments, automation, and modern incident management practices.
Nice to have:
Experience with Spark and/or big data systems is a big plus (but not required if you’re eager to learn).