We are part of the Microsoft Specialized Cloud organization, delivering Azure to customers on their premises. Our team is responsible for building innovative product and service ecosystems that bring Azure Edge computing to locations where customers are running their business.
As a Senior Software Reliability Engineer in the Microsoft Specialized Cloud team, you will leverage end-to-end technical expertise in large-scale distributed systems' infrastructure, code, inter- and intra-service dependencies, and operations to proactively and continuously improve the reliability, performance, efficiency, latency, and scalability of Edge services and products operating at scale. You will partner with software engineering product teams by suggesting scalable ways to optimize code, sharing expertise and insights drawn from working across related services or products, and participating in incident response throughout development and operations lifecycles.
You will develop code, scripts, systems, and/or tools that reduce operational burden by automating complex and repetitive tasks, enable product engineering teams to increase the velocity at which they can safely deploy changes to production, and monitor the effects of changes across systems, services, and/or products. You will analyze telemetry data to identify patterns and trends that drive continuous improvement, and highlight opportunities to improve the quality and reliability of our products and services. You will participate in on-call rotations to resolve live site incidents, minimize customer impact, and document solutions and insights that inform ongoing improvements to infrastructure, code, tools, and/or processes that prevent the recurrence of similar issues.
Job Responsibilities:
Acts as a Designated Responsible Individual (DRI) working on call to monitor service for degradation, downtime, or interruptions
Contributes, with little oversight, to efforts to collect, classify, and analyze data on a range of metrics
Contributes to the development of automation within production and deployment of a complex product feature
Maintains communication with key partners across the Microsoft ecosystem of engineers
Maintains operations of live service as issues arise on a rotational, on-call basis
Acts as a Designated Responsible Individual (DRI) and guides other engineers by developing and following the playbook, working on call to monitor system/product/service for degradation, downtime, or interruptions
Requirements:
8+ years of experience in software development or software reliability engineering
Bachelor's or master's degree, or equivalent, in computer science or a related field required
A strong computer science background with solid programming skills in C#, Java, C/C++ (mostly scripting and automation)
Debugging skills are highly desired
Experience with AI/ML and LLMs is highly preferred
Knowledge of Microsoft Azure, AWS or similar cloud computing platforms is preferred
Strong skills in networking, storage, and virtualization
Prior experience working with hyperconverged infrastructure
Prior experience working with Fortune 500 customers
Nice to have:
Debugging skills
Experience with AI/ML and LLMs
Knowledge of Microsoft Azure, AWS or similar cloud computing platforms