At Crusoe, our Site Reliability Engineering (SRE) team plays a pivotal role in ensuring the reliability and performance of our infrastructure. SRE at Crusoe is dedicated to detecting, analyzing, and preventing issues so that we meet our Service Level Agreements (SLAs), which we track through Service Level Indicators (SLIs) and Service Level Objectives (SLOs). Through automation and proactive remediation, our SREs resolve common errors automatically and advise engineering teams on how to build resilient code. We anticipate and resolve issues before they impact our customers, conduct thorough post-mortems, and drive continuous improvement. Our customer-centric approach ensures that clients always have access to the virtual machines they depend on. This role is crucial for maintaining the 'gold standard' reliability and performance of Crusoe's AI platform.
Job Responsibilities:
Automate routine processes and build Crusoe’s internal infrastructure platform
Collaborate with the team in morning stand-up meetings to discuss ongoing projects, recent incidents, and priorities
Collaborate on action plans for deploying new data centers or retrofitting existing ones
Work closely with software engineers, advising on best practices for resilient code and reviewing changes before deployment
Review overnight alerts and system performance metrics
Analyze system logs and develop tools to enhance our monitoring capabilities
Engage in incident response drills, post-mortems, and root cause analysis sessions
Resolve common errors through automation and proactive remediation
Keep SLIs within their SLO targets
Document work, share insights with the team, and plan for the next day's challenges
Requirements:
1-3 years of professional SRE experience
Exposure to server-class hardware & provisioning
Understanding of distributed system architecture
Basic understanding of infrastructure design
Proficiency with at least one programming language (Python, Go, or similar)
Familiarity with infrastructure tools such as Docker, Kubernetes, Ansible, CloudFormation, and Terraform
Appreciation of CI/CD practices: familiarity with tools such as Jenkins, GitLab workflows, CircleCI, and GitHub Actions
Exposure to observability tooling and philosophy: logging, monitoring, and alerting tools
Experience with Unix/Linux environments
Understanding of network fundamentals: basics of TCP/IP and network programming
Awareness of basic information security best practices
Bachelor's degree in Computer Science or a related field, or equivalent self-taught knowledge of computer science fundamentals