Senior Site Reliability Engineer, Storage

Crusoe

Location:
United States, San Francisco, Sunnyvale

Contract Type:
Not provided

Salary:
166,000 - 201,000 USD / Year

Job Description:

At Crusoe Energy Systems, our Site Reliability Engineering (SRE) team plays a mission-critical role in maintaining the performance and reliability of our AI-optimized cloud infrastructure. The Storage-focused SRE role is responsible for ensuring the availability, performance, and scalability of Crusoe’s cloud storage products and services, which power compute-intensive, latency-sensitive workloads for AI and HPC use cases. This role directly supports our vertically integrated, sustainable cloud platform by building and optimizing distributed, fault-tolerant storage systems at scale.

Job Responsibilities:

  • Build automation and self-healing tools to monitor and maintain Crusoe’s distributed cloud storage infrastructure
  • Drive reliability initiatives focused on data replication, encryption, backup and restore strategies, and robust failover mechanisms
  • Help implement and maintain high-performance NVMe- and SSD-backed volumes that support large-scale AI compute clusters
  • Support user-facing storage services with a focus on availability, performance tuning, and adherence to error budgets
  • Investigate and resolve storage-related incidents using deep telemetry, logs, and performance profiling
  • Partner with hardware and kernel teams to diagnose low-level I/O issues and optimize I/O paths, cache policies, and file systems
  • Contribute to the architecture of fault-tolerant, scalable storage backends tailored for AI-first cloud environments

Requirements:

  • 5+ years of professional experience in SRE, systems, or storage engineering
  • Hands-on experience with distributed storage systems (e.g., Ceph, GlusterFS, OpenEBS) and deep understanding of object, block, and file storage paradigms
  • Proficiency in a programming language such as Python, Go, Java, or C
  • Experience with Infrastructure as Code and deployment tooling such as Terraform, Ansible, or Puppet
  • Deep knowledge of Linux internals with a focus on I/O subsystems, memory management, and storage scheduling
  • Familiarity with storage protocols like NFS, SMB, iSCSI, or NVMe-oF
  • Strong experience working with containerized workloads and orchestration platforms (e.g., Kubernetes, Docker)
  • Excellent incident response, troubleshooting, and documentation practices
  • Experience building and operating managed services at scale, such as object, file, and block storage (AWS, GCP, Azure)
  • Excellent communication skills
  • Must be able to pass a background check
  • Embody the Company values

Nice to have:

  • Contributions to open-source storage projects or the Linux storage stack
  • Experience with hybrid storage models across on-prem and cloud environments
  • Familiarity with high-throughput network topologies for storage backplanes (e.g., RoCE, RDMA, InfiniBand)

What we offer:

  • Restricted Stock Units in a fast-growing, well-funded technology company
  • Health insurance package options that include HDHP and PPO, vision, and dental for you and your dependents
  • Employer contributions to HSA accounts
  • Paid Parental Leave
  • Paid life insurance, short-term and long-term disability
  • Teladoc
  • 401(k) with a 100% match up to 4% of salary
  • Generous paid time off and holiday schedule
  • Cell phone reimbursement
  • Tuition reimbursement
  • Subscription to the Calm app
  • MetLife Legal
  • Company-paid commuter benefit ($300 per month)

Additional Information:

Job Posted:
February 21, 2026

Employment Type:
Full-time

Work Type:
On-site work