Senior Site Reliability Engineer, Storage

United States, San Francisco, Sunnyvale · 166000.00 - 201000.00 USD / Year · Job Posted February 21, 2026

Job Description

At Crusoe Energy Systems, our Site Reliability Engineering (SRE) team plays a mission-critical role in maintaining the performance and reliability of our AI-optimized cloud infrastructure. The Storage-focused SRE role is responsible for ensuring the availability, performance, and scalability of Crusoe’s cloud storage products and services, which power compute-intensive, latency-sensitive workloads for AI and HPC use cases. This role directly supports our vertically integrated, sustainable cloud platform by building and optimizing distributed, fault-tolerant storage systems at scale.

Job Responsibility

  • Build automation and self-healing tools to monitor and maintain Crusoe’s distributed cloud storage infrastructure
  • Drive reliability initiatives focused on data replication, encryption, backup and restore strategies, and robust failover mechanisms
  • Help implement and maintain high-performance NVMe- and SSD-backed volumes that support large-scale AI compute clusters
  • Support user-facing storage services with a focus on availability, performance tuning, and adherence to error budgets
  • Investigate and resolve storage-related incidents using deep telemetry, logs, and performance profiling
  • Partner with hardware and kernel teams to diagnose low-level I/O issues and optimize I/O paths, cache policies, and file systems
  • Contribute to the architecture of fault-tolerant, scalable storage backends tailored for AI-first cloud environments

Requirements

  • 5+ years of professional experience in SRE, systems, or storage engineering
  • Hands-on experience with distributed storage systems (e.g., Ceph, GlusterFS, OpenEBS) and deep understanding of object, block, and file storage paradigms
  • Proficiency in a programming language such as Python, Go, Java, or C
  • Experience with Infrastructure as Code and deployment tooling such as Terraform, Ansible, or Puppet
  • Deep knowledge of Linux internals with a focus on I/O subsystems, memory management, and storage scheduling
  • Familiarity with storage protocols like NFS, SMB, iSCSI, or NVMe-oF
  • Strong experience working with containerized workloads and orchestration platforms (e.g., Kubernetes, Docker)
  • Excellent incident response, troubleshooting, and documentation practices
  • Experience building and operating managed services at scale, such as object, file, and block storage (AWS, GCP, Azure)
  • Excellent communication skills
  • Must be able to pass a background check
  • Embody the Company values

Nice to have

  • Contributions to open-source storage projects or the Linux storage stack
  • Experience with hybrid storage models across on-prem and cloud environments
  • Familiarity with high-throughput network topologies for storage backplanes (e.g., RoCE, RDMA, InfiniBand)

What we offer

  • Restricted Stock Units in a fast-growing, well-funded technology company
  • Health insurance package options that include HDHP and PPO, vision, and dental for you and your dependents
  • Employer contributions to HSA accounts
  • Paid Parental Leave
  • Paid life insurance, short-term and long-term disability
  • Teladoc
  • 401(k) with a 100% match up to 4% of salary
  • Generous paid time off and holiday schedule
  • Cell phone reimbursement
  • Tuition reimbursement
  • Subscription to the Calm app
  • MetLife Legal
  • Company-paid commuter benefit; $300 per month

Similar Jobs for Senior Site Reliability Engineer, Storage

8 matching positions

Senior Site Reliability Engineer

We are seeking a Senior Site Reliability Engineer with deep expertise in Kuberne...
Location: Denmark, Copenhagen
Salary: Not provided
Keepit
Expiration Date: Until further notice
Requirements
  • 5+ years in a Site Reliability, Platform, or DevOps Engineering role
  • Hands-on Kubernetes experience, including storage (Rook-Ceph or equivalent)
  • Solid Linux fundamentals
  • Proactive mindset
  • Clear communicator
Job Responsibility
  • Participate in the daily operation of our existing stack
  • Evolve and take part in designing our next generation infrastructure setup
  • Define and enforce reliability standards, runbooks, and operational best practices across the platform
  • Collaborate with Development and Operations teams to identify and resolve bottlenecks before they become incidents
  • Champion automation: if something is done twice, it should be scripted the third time
What we offer
  • Competitive salary
  • Pension scheme
  • A modern, energetic global work environment
  • Flexible work-life balance supported by a hybrid working model
  • Regular team-building activities
  • Opportunities for professional development and career advancement
  • Compensation is based on experience and skill set
  • Full-time

Senior Site Reliability Engineer

Senior Site Reliability Engineer (SRE). This role has been designed as ‘Onsite’...
Salary: Not provided
Hewlett Packard Enterprise
Expiration Date: Until further notice
Requirements
  • Bachelor’s or Master’s degree in Computer Science, Information Systems, or a related field
  • 6–10+ years of experience in DevOps, Site Reliability Engineering, or cloud infrastructure roles
  • Strong hands-on experience with cloud platforms (AWS or GCP) including services like EC2/GCE, IAM, and object storage (S3/GCS)
  • Experience with containerization and orchestration technologies, especially Docker and Kubernetes
  • Experience building and managing CI/CD pipelines using tools such as Jenkins, GitHub Actions, or GitLab
  • Experience with monitoring and observability tools such as Prometheus, CloudWatch, or Stackdriver
  • Strong understanding of Linux systems administration and configuration management tools like Ansible
  • Experience managing distributed systems and streaming platforms such as Kafka, Cassandra, Elasticsearch, Spark, Flink, or Storm
  • Strong automation and scripting skills using Python, Go, Rust, or Shell scripting
  • Experience with Infrastructure as Code (IaC) tools like Terraform or CloudFormation
Job Responsibility
  • Ensure high availability, reliability, and performance of large-scale cloud infrastructure across AWS and GCP environments
  • Operate and support infrastructure components and distributed data platforms such as Kubernetes, Kafka, Flink, Storm, and Spark
  • Manage and maintain databases including Cassandra, Elasticsearch, Redis, Postgres, and ArangoDB
  • Monitor systems, troubleshoot issues, and resolve production incidents across microservices and distributed systems
  • Collaborate closely with software engineering teams to debug and resolve complex production problems
  • Participate in 24x7 on-call rotation supporting multi-cloud production environments
  • Monitor system metrics, application performance, and infrastructure health using observability tools
  • Own the incident management lifecycle, including detection, mitigation, Root Cause Analysis (RCA), and post-incident reviews
  • Develop and maintain runbooks, automation, and operational processes to improve reliability and efficiency
  • Perform capacity planning using system usage and performance data
What we offer
  • Health & Wellbeing
  • Personal & Professional Development
  • Unconditional Inclusion

Senior Site Reliability Engineer

We are seeking a highly skilled and passionate Senior Site Reliability Engineer ...
Location: Spain; Portugal; United Kingdom
Salary: Not provided
Parser Limited
Expiration Date: Until further notice
Requirements
  • Deep SRE Expertise: Proven experience as a Senior Site Reliability Engineer or a similar role, with a strong understanding of SRE principles (error budgets, SLOs/SLIs, toil reduction)
  • Azure Cloud Proficiency: Extensive hands-on experience designing, deploying, and operating highly available and scalable applications on Microsoft Azure
  • Azure Kubernetes Service (AKS) Expertise: Mandatory extensive hands-on experience with AKS for container orchestration, including deployment, scaling, monitoring, and troubleshooting
  • Java Ecosystem Mastery: Expert-level proficiency with Java, including experience with modern frameworks (ideally Micronaut, Spring Boot, or similar) and JVM performance tuning
  • Distributed Systems Knowledge: Solid understanding and practical experience with distributed systems, microservices architecture, and associated challenges (e.g., consistency, fault tolerance)
  • Messaging & Database Expertise: Hands-on experience with an event streaming platform (ideally Kafka) and NoSQL data storage (ideally Couchbase), including operational best practices
  • Automation First Mindset: Strong scripting skills (e.g., Python, Bash) and experience with Infrastructure as Code tools (e.g., Terraform, ARM templates) and CI/CD pipelines (e.g., Azure DevOps, Jenkins)
  • Observability Tools: Experience with monitoring, logging, and alerting tools (e.g., Azure Monitor, Prometheus, Grafana, ELK Stack, Splunk)
  • Problem-Solving Acumen: Exceptional analytical and troubleshooting skills, with a methodical approach to diagnosing and resolving complex production issues
  • Communication & Collaboration: Excellent communication skills, with the ability to articulate complex technical concepts to diverse audiences and collaborate effectively with cross-functional teams
Job Responsibility
  • Architect and Implement Reliability: Design, build, and maintain highly scalable, resilient, and performant systems on Azure, focusing on our Java, Kafka, and Couchbase stack
  • Drive Modernisation: Work hands-on as part of the team spearheading the adoption of Micronaut, standardising application templates, and transitioning to managed cloud services
  • Enhance Operational Excellence: Develop and implement strategies for improving system observability (standardised logging, metrics, tracing), alerting, and on-call practices
  • Automate Everything: Champion automation across the software development lifecycle (SDLC), from CI/CD pipelines to infrastructure provisioning, focusing on accelerating delivery and de-risking deployments
  • Incident Management & Learning: Contribute to our mature, blameless post-incident review process, identifying root causes and implementing preventative measures to reduce incident hours
  • Tooling & Standards: Develop, maintain, and drive the adoption of shared, standardised SRE tooling and best practices across engineering teams, including containerisation (e.g., Docker, Kubernetes on Azure), infrastructure as code (e.g., Terraform), and configuration management
  • Mentorship & Collaboration: Provide technical leadership and mentorship to junior engineers, fostering a culture of SRE principles and operational excellence across the wider engineering organisation
  • Strategic Input: Contribute to the overall technical strategy and roadmap for our SRE and platform initiatives, ensuring alignment with business objectives
What we offer
  • The chance to join an organization with triple-digit growth that is changing the paradigm on how software products are built
  • The opportunity to form part of an amazing, multicultural community of tech experts
  • A highly competitive compensation package
  • Medical insurance
  • English lessons
  • Full-time

Senior Site Reliability Engineer - GM Motorsports

We are hiring a Senior Site Reliability Engineer (SRE) to join the GM Motorsport...
Location: United States, Austin; Concord
Salary: Not provided
General Motors
Expiration Date: Until further notice
Requirements
  • Proven experience in Site Reliability Engineering (SRE), DevOps, or Platform Engineering supporting large-scale distributed systems
  • Strong experience with Linux systems administration and cloud-native infrastructure
  • Experience operating high-throughput data platforms or streaming systems (Kafka, Flink, Spark, etc.)
  • Hands-on experience with Infrastructure as Code tools such as Terraform or similar frameworks
  • Experience implementing observability stacks (Prometheus, Grafana, OpenTelemetry, Datadog, etc.)
  • Strong debugging and troubleshooting skills across distributed systems
  • Ability to break down complex reliability challenges into clear, implementation-ready initiatives
  • A growth mindset and commitment to continuous learning in a fast-paced engineering environment
Job Responsibility
  • Design and implement reliability practices across the motorsports data platform, including Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets for streaming and analytics workloads
  • Ensure reliability and performance of high-throughput streaming and batch data pipelines supporting telemetry ingestion, analytics processing, and simulation workloads using technologies such as Kafka, Flink, and Databricks
  • Build and maintain comprehensive observability frameworks including metrics, logs, and tracing across the platform. Develop dashboards, alerts, and automated responses that detect system degradation before it impacts engineering workflows
  • Drive the automation of platform infrastructure using Infrastructure as Code (IaC) and platform engineering best practices to enable consistent, reproducible environments across development, testing, and production
  • Identify operational friction and eliminate manual processes by implementing self-healing infrastructure, automation frameworks, and developer self-service capabilities
  • Own the reliability of data ingestion, transformation, and storage layers, ensuring stable and performant integration across distributed data systems
  • Continuously evaluate platform performance and scalability, ensuring the data platform can support high-frequency telemetry ingestion, real-time analytics, and large-scale historical analysis
  • Provide mentorship and peer review to engineers across the platform team, promoting strong operational discipline, resilient system design, and high-quality engineering practices
What we offer
  • May be eligible for relocation benefits
  • Full-time

Senior+ Site Reliability Engineer

Crusoe is building the most reliable, energy-efficient, AI-optimized cloud platf...
Location: United States, San Francisco
Salary: 172000.00 - 209000.00 USD / Year
Crusoe
Expiration Date: Until further notice
Requirements
  • 5+ years of experience in cloud operations, SRE, or related roles
  • Background working with GPU workloads, high-performance computing, or latency/throughput-sensitive systems
  • Strong knowledge of Unix/Linux systems (kernel/user space) and networking including debugging complex issues in live systems
  • Understanding of cloud platforms and infrastructure fundamentals (Kubernetes, AWS/GCP, virtualization, distributed systems)
  • Familiarity with incident management practices and operational frameworks (SRE/ITIL/etc.)
  • Experience with monitoring and alerting tools (Prometheus, Grafana) or a strong willingness to learn
  • Familiarity with infrastructure-as-code and configuration management tools such as Terraform and Ansible
  • Basic scripting and automation experience (Go, Python, C, C++, or similar)
  • Strong communication skills, with the ability to clearly articulate technical issues to diverse stakeholders
  • Ability to stay calm, focused, and effective in fast-moving or high-pressure situations
Job Responsibility
  • Collaborate with cross-functional teams to define and refine availability metrics for Crusoe’s cloud infrastructure, including establishing, tracking, and improving SLIs and SLOs
  • Assist in incident response by identifying, diagnosing, and resolving service disruptions, and support post-incident processes through RCA documentation and participation in post-incident reviews
  • Build, operate, and monitor infrastructure health using Crusoe’s observability stack (Prometheus, Grafana, Alertmanager, OpenTelemetry)
  • Identify and communicate reliability risks, performance bottlenecks, and early indicators of potential incidents that could impact service availability
  • Develop automation and tooling to reduce operational toil, minimize manual intervention, and enhance service recovery and self-healing capabilities
  • Partner with compute, network, storage, and platform teams to improve service resilience and strengthen disaster recovery readiness
  • Contribute to knowledge sharing, process improvements, and the development of operational best practices across the organization
  • Participate in ongoing training, mentorship, and professional development to grow into advanced SRE responsibilities
What we offer
  • Industry competitive pay
  • Restricted Stock Units in a fast-growing, well-funded technology company
  • Health insurance package options that include HDHP and PPO, vision, and dental for you and your dependents
  • Employer contributions to HSA accounts
  • Paid Parental Leave
  • Paid life insurance, short-term and long-term disability
  • Teladoc
  • 401(k) with a 100% match up to 4% of salary
  • Generous paid time off and holiday schedule
  • Cell phone reimbursement
  • Full-time

Site Reliability Engineer / Observability Engineer

Rackspace is building up its Professional Services Center of Excellence on Appli...
Location: Egypt, Giza
Salary: Not provided
Rackspace
Expiration Date: Until further notice
Requirements
  • Bachelor’s degree in engineering/computer science or equivalent
  • Senior-level experience with Site Reliability Engineering; DevOps; code-level application support and troubleshooting; AWS infrastructure design, implementation, and optimization; and automation for deployment, scaling, and reliability
  • Experience with observability tools such as Splunk, Datadog, SignalFx, etc.
  • Experience deploying, maintaining and supporting software applications/services in the AWS ecosystem
  • Proactive approach to identifying problems and solutions
  • Experience writing code with one or more interpreted languages such as Python, PHP, Perl, Ruby, Linux Shell
  • Experience with Terraform or CloudFormation scripting
  • Experience with configuration management tools like Ansible, Chef or Puppet
  • Experience with standard software development best practices and tools such as code repositories (Git preferred)
  • Experience executing in an agile software development environment
Job Responsibility
  • Work with customers and implement Observability solutions
  • Build and maintain scalable systems and robust automation that supports engineering goals
  • Develop and maintain monitoring tools, alerts, and dashboards to provide visibility into system health and performance
  • Proactively gather and analyze both metric and log data from systems and applications to perform anomaly detection, performance tuning, capacity planning and fault isolation
  • Collaborate with development teams to implement and deploy new features and enhancements, ensuring they meet reliability, security and performance standards
  • Collaborate with team members to document and share solutions
  • Maintain a deep understanding of the customer’s business as well as their technical environment
  • Identify performance bottlenecks and anomalous system behavior, and resolve the root cause of service issues
  • Full-time

Principal Site Reliability Engineer

Arcadia’s customers rely on us to securely process and deliver high-value health...
Salary: Not provided
The Muse
Expiration Date: Until further notice
Requirements
  • 8+ years of experience in SRE, platform engineering, systems engineering, or related roles operating production services at scale
  • Demonstrated principal-level impact: leading cross-team initiatives, influencing architecture decisions, and driving sustained improvements in reliability and operations
  • Expertise in Kubernetes operations and troubleshooting, including safe rollout/rollback patterns, workload debugging, and operational guardrails
  • Strong GitOps experience with Argo CD; experience building delivery workflows and automation using Argo Workflows
  • Strong infrastructure orchestration and provisioning experience with Crossplane and Terraform; ability to define reusable platform patterns and controls
  • Deep AWS experience (IAM, networking/VPC, compute, storage, managed services, observability) and strong understanding of reliability and failure modes in cloud systems
  • Proficiency in Python for building automation, tooling, and reliability improvements
  • Strong incident management and on-call leadership experience, including measurable improvements (availability, MTTR, alert quality, cost, or operational maturity)
Job Responsibility
  • Act as the technical leader for reliability for one or more domains; set direction and standards while remaining hands-on where it matters most
  • Drive reliability strategy across critical services: define SLOs/SLIs, error budgets, and reliability KPIs aligned to customer journeys and outcomes
  • Own incident response maturity: lead complex incidents, improve incident command practices, and ensure high-quality RCAs with prioritized, tracked remediation
  • Architect and implement automation to reduce toil and risk: runbook automation, self-service tools, and safe operational workflows (Python + Argo Workflows)
  • Advance GitOps delivery practices using Argo CD: promotion strategies, progressive delivery/canaries, and guardrails that reduce deploy risk
  • Scale infrastructure management with Crossplane and Terraform: reusable patterns, policy controls, and paved roads for teams
  • Lead operational readiness and reliability reviews for new features/architectural changes; reinforce non-functional requirements (availability, latency, security, cost)
  • Improve performance and cost efficiency through capacity planning, load testing, right-sizing, and architecture recommendations across AWS services
What we offer
  • Pet Insurance
  • Health Insurance
  • Dental Insurance
  • Vision Insurance
  • FSA
  • HSA
  • HSA With Employer Contribution
  • Life Insurance
  • Short-Term Disability
  • Long-Term Disability

Software Engineer 2 / Senior Software Engineer - Azure Data

Microsoft's Azure Data engineering team is leading the transformation of analyti...
Location: India, Bangalore
Salary: Not provided
Microsoft Corporation
Expiration Date: Until further notice
Requirements
  • Bachelor's Degree in Computer Science or related technical field AND 3+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python; OR equivalent experience
  • Experience with the Azure stack including Storage, Compute, Networking, Fabric, Purview, Synapse, AKS, DevOps, Data Factory, or Power BI
  • Experience with big data technologies such as Spark, Kafka, Hadoop, or HBase
  • Experience building data lake or data engineering products, tools, or pipelines
  • Familiarity with container-based architectures (Docker, Kubernetes)
  • Ability to debug complex distributed systems on Linux and/or Windows platforms
Job Responsibility
  • Write extensible, maintainable code in C#, Java, Scala, or Python for Fabric Materialized Lake View services and HDInsight components
  • Use AI tools and coding best practices across the development lifecycle
  • Design data refresh, scheduling, and query optimisation features with minimal supervision
  • Review code from teammates for correctness, test coverage, security risks, and adherence to team standards
  • Coach junior engineers through code reviews
  • Debug complex issues in distributed systems running on Azure, Linux, and Windows
  • Run live site operations on a rotational, on-call basis
  • Integrate logging and instrumentation to gather telemetry on system health, performance, reliability, and security
  • Work with product managers, technical leads, and partners across geographies to define customer requirements for Materialized Lake View features
  • Full-time