Site Reliability Engineer Job at Microsoft Corporation (Redmond)

Job Description

The Silver Edge team brings the power of Azure to the edge for our customers, tackling some of the most complex and mission-critical challenges in cloud and edge computing. Our mission is to provide stellar customer service so that our customers’ missions can succeed. We support the new Azure Local product, which brings cloud computing to local hardware. We’re looking for a new team member who relishes solving complex, ambiguous problems at scale and is passionate about building resilient systems that matter. As a Site Reliability Engineer on the Silver Edge team, you will build out and ensure the dependability of Azure Local services in three different sovereign clouds. You will be required to solve tough technical problems and thrive in dynamic, sometimes chaotic environments. In this role, you will accelerate your career, deepen your expertise in sovereign cloud solutions, and help implement the future of Azure edge solutions. We offer flexible work arrangements, including partial remote options, to support your best work.

Job Responsibilities

Support customer deployments and use of Azure Local and Azure Local disconnected operations
Maintain Azure Service reliability including deployment, availability, security, performance and customer satisfaction for sovereign environments
Leverages technical expertise in cloud technologies and specific products, as well as objective insights drawn from analyses of production telemetry data, to suggest changes or add-ons to product features or automation that improve the availability, security, quality, observability, reliability, efficiency, and performance of product components or features supported by their team
Engages with product engineering teams by participating in code/design reviews, regular meetings, on-call rotations, and incident responses throughout product development and operations cycles
Utilizes technical knowledge of systems/platforms and insights drawn from product engineering teams, security best practices, artificial intelligence (AI)/machine learning (ML), and telemetry analyses to suggest potential improvements in code base and designs across components and features of one or more products
Leverages technical expertise and telemetry analysis alongside advanced artificial intelligence (AI) and machine learning (ML) algorithms across a range of components and/or features to identify patterns and opportunities to implement configuration and data changes for one or more platforms, systems, or products in production using code, tooling, and automation
Independently writes code or scripts that automate the performance of scalable operations processes (e.g., monitoring, alerting, deploying products and updates) across components and features of products operating at scale
Shares insights and best practices via documented artifacts that can be applied to improve development and operations of system, platform, or product components and features by participating in code/design reviews, incident drills and debriefs, and regular meetings, as well as interactions with more experienced SREs and members of product engineering teams
Develops alerts and instrumentation across components and features to monitor product capacity, related security risk, and resource demands and analyze telemetry data using existing capacity planning models
Draws insights from analyses of capacity and resource data to optimize component and feature code to manage resources and capacity across a limited range of use conditions and system parameters
Independently uses existing tools and/or models to troubleshoot problems or flaws affecting the availability, security, reliability, performance, and/or efficiency of components and features, leveraging artificial intelligence (AI) and machine learning (ML) capabilities
Proposes solutions that will resolve and prevent recurring issues and brings them to the attention of their Site Reliability Engineering (SRE) and/or product engineering teams
Utilizes insights from performance and resource monitoring tools to identify whether there is a need to optimize the efficiency of component and feature code, or if changes to compute resources are required
Models the predicted effect of changes to code and/or compute resources across components or features to document the efficacy of proposed solutions
Proposes changes and drives implementation of solutions to identified performance and resource challenges
Identifies opportunities to leverage existing tools and automation, including the safe deployment process (SDP), to enable product engineering teams to increase the velocity at which they can reliably and safely implement changes in production
Monitors the effects of changes across multiple components or features within a single platform or system
Responds to incidents during regular on-call rotations by identifying the level of impact, troubleshooting issues, taking appropriate action to mitigate impact, and deploying appropriate fixes to resolve root cause(s)
Notifies product teams and owners of major customer-impacting issues and escalates resolution of highly impactful issues affecting multiple components or features to other engineers or engineering teams as needed
Communicates details and resolutions through post-mortem reports and review meetings
Designs, develops, and maintains telemetry pipelines and monitoring tools that detail operations metrics (e.g., availability, reliability, performance, efficiency) of product components and features operating at scale
Independently performs analyses using existing tools and/or models to identify insights and shares them with product engineering teams to directly contribute to improvements in product development and/or operations
Monitors the impact of changes on operations metrics (e.g., Time-to-X)
Demonstrates expertise in distributed systems design, interactions between cloud technology layers and components, common dependencies at scale, and the code that defines infrastructures
Can identify and recommend optimal configurations of cloud technology solutions and modify the code base that defines systems or cloud technologies to improve the security, quality, reliability, and operability of supported products with minimal guidance from other engineers
Researches and maintains an awareness of industry trends, advances in cloud technologies, new tools, and/or processes for maintaining and improving product availability, security, quality, observability, reliability, efficiency, and/or performance
Contributes to the implementation of new solutions within their team by identifying ways they can be applied to solve persistent problems
Develops technical expertise in the code, features, and operations of specific products as required to identify opportunities to improve product availability, security, quality, observability, reliability, efficiency, and/or performance
Actively participates in on-boarding, code/design reviews, and regular meetings with engineering teams that develop and/or manage those products

Requirements

Master's Degree in Computer Science, Information Technology, or related field AND 1+ year(s) technical experience in software engineering, network engineering, or systems administration OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 2+ years technical experience in software engineering, network engineering, or systems administration OR equivalent experience
Security Clearance Requirements: Candidates must be able to meet Microsoft, customer, and/or government security screening requirements for this role
The successful candidate must have an active U.S. Government Top Secret Clearance with access to Sensitive Compartmented Information (SCI) based on a Single Scope Background Investigation (SSBI) with Polygraph
Ability to meet Microsoft, customer, and/or government security screening requirements is required pre-offer and post-hire for this role
This position requires successful verification of the stated security clearance to meet federal government customer requirements
This position will be required to pass the Microsoft Cloud background check upon hire/transfer and every two years thereafter
This position requires verification of U.S. citizenship due to citizenship-based legal restrictions

Nice to have

Master's Degree in Computer Science, Information Technology, or related field AND 3+ years technical experience in software engineering, network engineering, or systems administration OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 5+ years technical experience in software engineering, network engineering, or systems administration OR equivalent experience
2+ years technical experience working with large-scale cloud or distributed systems
