The IDEAS organization’s mission is to unlock the power of data to deliver actionable insights and personalized experiences at scale. Our work supports Microsoft 365, Azure, Windows, and other platforms by enabling reliable, secure, and compliant data services. As part of this team, you will collaborate with partners across the company—including product engineering, data science, and operations—to solve complex problems using modern data platforms, cloud analytics, and AI-assisted tooling. As a Site Reliability Engineer (SRE), you will focus on automation, incident response, and data-driven reliability improvements for services operating in regulated government cloud environments. You will contribute to live site operations, partner closely with engineering teams, and help evolve systems to operate reliably and at scale.
Job Responsibilities:
Participate as a Designated Responsible Individual (DRI) in a 24x7 on-call rotation, monitoring service health, responding to incidents within defined SLAs, and contributing to post-incident reviews and learning
Design, build, and maintain automation for deployment, operations, and incident mitigation to improve reliability and reduce manual effort
Instrument services for observability; collect and analyze telemetry and health signals; and use data to guide reliability and performance improvements
Collaborate with engineering partners and stakeholders to align on goals, share operational insights, and deliver user-focused solutions
Apply engineering best practices for development, scaling, and operational excellence to meet performance and customer requirements
Support compliance with security, privacy, and accessibility requirements throughout service onboarding and ongoing operations
Continuously learn and adopt industry practices and internal tools to improve reliability, performance, and observability
Requirements:
Master's Degree in Computer Science, Information Technology, or related field AND 1+ years technical experience in software engineering, network engineering, or systems administration, OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 2+ years technical experience in software engineering, network engineering, or systems administration, OR equivalent experience
Bachelor's Degree in Computer Science, or related technical discipline with proven experience coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
Experience with automation, live site operations, and incident response in large-scale cloud or distributed systems
Proficiency in at least one programming or scripting language (for example: C#, Java, Python, or PowerShell)
Strong analytical and problem-solving skills, including experience using telemetry and operational data to inform decisions
Effective written and verbal communication skills, and experience collaborating across teams and disciplines
Ability to meet Microsoft, customer, and/or government security screening requirements, including passing the Microsoft Cloud Background Check upon hire and periodically thereafter
The successful candidate must have an active U.S. Government Secret Security Clearance
This position requires verification of U.S. citizenship due to citizenship-based legal restrictions
Nice to have:
Bachelor’s or Master’s degree in Computer Science, Information Technology, or a related field, or equivalent practical experience, with a minimum of 4 years of experience in Site Reliability Engineering or a closely related role
Experience with observability and monitoring systems, including MELT (Metrics, Events, Logs, and Traces) practices
Experience automating aspects of incident diagnosis, root cause analysis, or mitigation
Familiarity with compliance processes and standards in cloud or regulated environments