The Microsoft Azure Artificial Intelligence/High Performance Computing (AI/HPC) team is looking for software engineers to help customers deploy, monitor, profile, and debug their applications on hyperscale cloud infrastructure. Azure is enabling the largest supercomputing deployments to tackle complex computational problems in the public cloud, as is evident from the various HPC products that have already made their mark on the Top500, MLPerf, and Graph500 rankings. At this supercomputing scale, we need specialized tools and techniques to maintain the reliability, runtime performance, and health of the system and its running jobs while continuing to meet customers' Service Level Agreements (SLAs). Your job would be to build and use state-of-the-art cloud applications and services to find operational gaps and instrument features that keep cloud-native supercomputers running and managed smoothly. As a Senior Supercomputing Engineer, you would also establish best practices, drive architectural changes, and influence the roadmap of relevant software and hardware components. Your work will directly impact the business goals of a wide range of users and facilitate the next wave of growth and innovation in AI, and in HPC in the cloud in general.
Job Responsibilities:
Collaborate with appropriate stakeholders to determine user requirements for a scenario
Drive identification of dependencies and the development of design documents for a product, application, service, or platform
Independently use appropriate artificial intelligence tools and practices across the software development lifecycle to create, implement, optimize, debug, refactor, and reuse code to establish and improve performance, maintainability, effectiveness, and return on investment (ROI)
Leverage subject-matter expertise in product features and partner with appropriate stakeholders (e.g., project managers) to drive a workgroup's project plans, release plans, and work items
Act as a Designated Responsible Individual (DRI) and guide other engineers by developing and following the playbook, working on call to monitor the system/product/service for degradation, downtime, or interruptions, alerting stakeholders about status, and initiating actions to restore the system/product/service for simple and complex problems when appropriate
Proactively seek new knowledge and adapt to new trends, technical solutions, and patterns that will improve the availability, reliability, efficiency, observability, and performance of products while also driving consistency in monitoring and operations at scale
Requirements:
Bachelor's Degree in Computer Science or related technical field AND 4+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python
OR equivalent experience
Ability to meet Microsoft, customer and/or government security screening requirements
Microsoft Cloud Background Check upon hire/transfer and every two years thereafter
Nice to have:
Master's Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience OR Bachelor's Degree in Computer Science or related technical field AND 8+ years technical engineering experience
1+ years of previous experience running and troubleshooting machine learning workloads on GPU-based HPC systems
1+ years of experience with cloud computing, virtualization, and container technologies
Familiarity with AI/HPC workloads, GPU-based systems, AI-assisted software development, and secure software design practices