As an Incident Manager at Crusoe, you will be the frontline defender of our service reliability and customer trust. This role is pivotal to our mission, directly impacting the company’s success by minimizing downtime and orchestrating rapid resolutions to critical technical challenges. You will spearhead the management of high-visibility incidents and customer escalations, ensuring that our innovative climate-aligned computing platform remains robust and dependable.
Job Responsibilities:
Incident Response Leadership: Lead the end-to-end management of high-visibility technical incidents and customer escalations, ensuring rapid restoration of services and effective communication throughout the lifecycle
Complex Troubleshooting: Diagnose and resolve sophisticated technical issues involving InfiniBand, containerization, and distributed training to maintain peak operational efficiency for our customers
Infrastructure Optimization: Guide and assist customers in implementing and fine-tuning their HPC infrastructure, directly contributing to their performance goals and technical success
Strategic Collaboration: Act as a critical bridge between customers and internal engineering/product teams, translating frontline feedback into actionable product enhancements and quality improvements
Knowledge Empowerment: Develop and deliver high-impact training materials, internal documentation, and knowledge base articles to empower both teammates and customers to navigate our solutions effectively
Process Innovation: Design and implement robust incident response strategies and self-serve support processes to scale our ability to handle complex technical challenges
Risk Mitigation: Participate in and manage on-call rotations, providing a reliable safety net for our infrastructure and ensuring 24/7 readiness for critical service interruptions
Requirements:
Technical Linux & Virtualization Expertise: Demonstrate deep technical experience with Linux, virtualization, and Kubernetes to effectively manage and resolve infrastructure incidents
Network Fundamentals: Apply a solid understanding of the TCP/IP stack to troubleshoot connectivity and performance issues across distributed systems
Infrastructure-as-Code (IaC) Knowledge: Utilize your understanding of IaC practices to navigate and support modern automated environments
Proven Customer Leadership: Bring 4-5 years of customer-facing experience, including 3-5+ years in a leadership role acting as a primary liaison between technical teams and stakeholders
Exceptional Communication: Leverage elite written and verbal communication skills to translate complex technical concepts into clear, actionable updates for diverse audiences
Analytical Problem-Solving: Apply a rigorous problem-solving mindset to diagnose, isolate, and resolve multifaceted technical issues under pressure
Nice to have:
Programming Proficiency: Experience writing or debugging code in one or more programming languages
HPC Familiarity: Prior experience working with High-Performance Computing environments or large-scale distributed systems
Advanced Certifications: Industry-recognized certifications in Linux administration, Kubernetes (CKA), or Incident Management frameworks
Scalability Mindset: Experience scaling support or incident functions within a high-growth technology startup