Site Reliability Engineer II Job at Microsoft Corporation (Dublin)

Job Description

Site Reliability Engineer II - (Microsoft 365 Enterprise + Cloud). We are looking for a Site Reliability Engineer (SRE) with the right mix of systems engineering, data science, software development, AI, and online services experience, and a passion for quality, to envision, design, and deliver Microsoft 365 (M365) Enterprise + Cloud service offerings.

Team Overview: Within the vast framework of M365 Office Engineering Direct (OED), our SRE team is instrumental to the success of Exchange Online. With the service spanning hundreds of components, our goal is clear: ensure unmatched service availability and continually elevate user satisfaction.

What We Do & Our Impact: Our approach is layered and precise. By implementing proactive engineering solutions, we identify and tackle incidents head-on, keeping disruptions limited. Monitoring, both comprehensive and nuanced, remains our cornerstone, capturing anomalies beyond the scope of conventional systems. With swift diagnostics steering our course, we channel our efforts towards automation, efficiently managing the incident lifecycle from detection to resolution. Additionally, with a commitment rooted in understanding our users, we meticulously prioritize and execute Design Change Requests, ensuring Exchange Online's evolution aligns with user expectations.

The Future – Artificial Intelligence (AI) & Machine Learning (ML) in Focus: As we look to the horizon, the fusion of AI and ML with our SRE practices heralds a transformative era for Online Cloud Services in M365. We are in the initial stages of integrating predictive analytics to anticipate issues before they manifest, allowing us to stay a step ahead. Customized ML models are being developed to intelligently sift through vast data lakes, identifying patterns and correlations previously overlooked. Our journey with AI and ML is not just about enhancement; it is about redefining reliability, precision, and the user experience in the M365 suite.

Job Responsibility

Researches and maintains deep knowledge of industry trends as well as advances in large-scale distributed systems and cloud technologies
Identifies opportunities to create, implement, and/or optimally utilize new tools, technologies, and/or processes to solve ambiguous problems and improve product availability, reliability, efficiency, observability, and/or performance
Drives the adoption of innovative solutions across engineering teams working with related products within an organization
Applies advanced statistical and machine learning techniques to analyze large datasets and extract meaningful insights
Experience working with all service aspects of high throughput and multi-tenant services, ability to understand and design workflows carefully, properly handle errors, write clean and well-factored code with good tests and good maintainability
Engages with product engineering teams by partaking in code/design reviews, participating in on-call rotations and incident responses throughout product development and operations cycles
Leverages end-to-end technical expertise on underlying systems/platforms and insights from engagements with product engineering teams and telemetry analyses to propose scalable improvements in code and designs with attention to customer/business objectives and incident prevention
Develops code, scripts, systems, or platforms that automate moderately complex but repetitive operations processes (e.g., monitoring, alerting, deploying products and updates, debugging) at scale
Reviews existing automation code and scripts to evaluate reusability, extensibility, and scalability within an organization
Analyzes data from telemetry pipelines and monitoring tools that detail operations metrics (e.g., availability, reliability, performance, efficiency) of systems, platforms, or products operating at scale
Contributes to the development of new tooling and/or predictive models to identify and test potential improvements in product development and/or operations and monitors the impact of changes on operations metrics (e.g., Time-to-X) within an organization
Responds to incidents during regular on-call rotations by identifying the level of impact, troubleshooting complex issues, and deploying appropriate fixes to resolve root cause(s)
Alerts product teams, owners, and leadership to issues with major customer/business impact and escalates resolution of overly complex, ambiguous, and impactful issues to other engineering teams and/or subject matter experts as needed
Shares details related to incidents and their resolution through post-mortem reports and during regular review meetings
Mentors and coaches less experienced engineers to help them identify and propose relevant solutions

Requirements

Bachelor's Degree in Computer Science or related technical field AND technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
Mid-level software development experience; automation-related experience is most valued
Scripting languages such as Bash, Python, and PowerShell, or compiled languages such as C or C#, are most relevant, but others are acceptable
Awareness of, and ability to reason about, modern software & systems architectures, including load-balancing, queueing, caching, distributed systems failure modes, microservices, and so on
Associated troubleshooting skills, including the ability to follow RPC (Remote Procedure Call) call-chains across arbitrary network steps
Consequent understanding of monitoring in distributed systems
Deep understanding of operating system level concepts such as processes, memory allocation, and the network stack
Understanding of how applications are affected by the above, and ability to debug them
Experience with working in a team, including coordinating large projects, communicating well, and exercising initiative when presented with problems
Practical experience running large scale online systems is always an advantage
Ability to meet Microsoft, customer, and/or government security screening requirements is required for this role
This position will be required to pass the Microsoft Cloud background check upon hire/transfer and every two years thereafter

Nice to have

Master's Degree in Computer Science or related technical field AND technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR Bachelor's Degree in Computer Science or related technical field AND mid-level technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
