This Incident Manager role is critical for upholding service reliability and customer trust, directly impacting company success by minimizing downtime and resolving critical issues. You will spearhead the management of high-visibility incidents and customer escalations, ensuring rapid and effective responses to complex technical challenges. Beyond immediate resolution, we want to sharpen our incident management practices so that customers have a superior experience during "storms" and robust preventative measures follow afterward. You will leverage data analytics to drive greater resiliency and reliability, ensuring that every incident translates into a stronger product and process.
Job Responsibilities:
Lead incident responses for high-visibility issues, ensuring minimal disruption to customer operations
Utilize data analytics to identify trends in incidents, translating these insights into actionable strategies for greater system resiliency and reliability
Develop robust incident response strategies and designs
Conduct deep post-incident reviews to ensure root causes are addressed and recurrences are eliminated
Diagnose and resolve complex technical issues related to InfiniBand, containerization, and distributed training
Guide and assist customers in implementing and optimizing their HPC infrastructure to achieve maximum performance and efficiency
Develop and deliver training materials, including internal training sessions, documentation, and knowledge base articles
Work closely with internal engineering and product teams to provide valuable customer feedback
Act as a key technical resource, helping our Customer Support Engineers (CSEs) and Customer Success Managers (CSMs) understand and resolve complex product issues
Requirements:
Strong technical experience with Linux, virtualization, and Kubernetes, and with handling customer incidents
NVIDIA, Linux, and Kubernetes certifications are strongly preferred
Solid understanding of the TCP/IP stack and Infrastructure-as-Code (IaC) practices
Proficiency in one or more programming languages
4-5 years of customer-facing experience
3-5+ years of experience in a team leadership role, acting as a liaison with external and internal customers
A proven track record in crisis management
A proven problem-solving mindset with the ability to diagnose and resolve complex technical issues
Excellent communication skills, both written and verbal