The Microsoft Azure High Performance Computing & AI Engineering (HPC & AI Eng) team is responsible for managing the core platform and fleet of AI and High Performance Computing products that customers use to run their most performant and demanding workloads. The AI Customer Experience (AICE) engineering team within HPC & AI Eng is on the frontlines, managing the flagship supercomputers and infrastructure used by top-tier AI customers. These systems enable breakthroughs such as ChatGPT and are highlighted in the Top500, MLPerf and Graph500 rankings. We run lean, obsess about customer experience and take an evidence-based approach to decision making. We have a live-site-first, metrics-driven culture that keeps us from accumulating debt and from having to put out fires on a daily basis. You will be in a position that carries a great deal of responsibility and provides opportunities to directly impact customer satisfaction.
As a Supercomputing Software Engineer on the AICE team, you will design and develop the capabilities needed to monitor and efficiently operate the infrastructure and fleet of supercomputers at scale. So that the team is the first to know of critical incidents impacting customer capacity, you will create end-to-end data pipelines that process and synthesize large volumes of telemetry, log files and other data sources into actionable alerts.
Microsoft’s mission is to empower every person and every organization on the planet to achieve more. As employees we come together with a growth mindset, innovate to empower others, and collaborate to realize our shared goals. Each day we build on our values of respect, integrity, and accountability to create a culture of inclusion where everyone can thrive at work and beyond.
Job Responsibilities:
Contributes to improving key metrics such as Job Mean Time to Interrupt, Nodes in Service, and Mean Time to Resolve on flagship supercomputers
Manages operations of supercomputers by responding quickly to mitigate issues
Implements systemic solutions and mitigations to more complex issues impacting performance or functionality of supercomputers
Reviews and writes incident postmortems and presents insights that drive changes to reduce or eliminate incidents
Independently improves troubleshooting guides (TSGs), wikis, tests, and telemetry, adding comprehensive observability and monitoring capabilities
Proactively seeks new knowledge and adapts to new trends, technical solutions, and patterns that will improve the availability, reliability, efficiency, observability, and performance of supercomputers while also driving consistency in monitoring and operations at scale
Requirements:
Bachelor's Degree in Computer Science or related technical field AND 2+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python
OR equivalent experience
The ability to meet Microsoft, customer and/or government security screening requirements is required for this role
These requirements include, but are not limited to, the following specialized security screening: Microsoft Cloud Background Check. This position will be required to pass the Microsoft Cloud Background Check upon hire/transfer and every two years thereafter
Nice to have:
Bachelor's Degree in Computer Science OR related technical field AND 4+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, OR Python
OR Master's Degree in Computer Science or related technical field AND 2+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python