Staff Site Reliability Engineer, Storage

Crusoe

Location:
United States, San Francisco, Sunnyvale

Contract Type:
Not provided

Salary:

204000.00 - 247000.00 USD / Year

Job Description:

At Crusoe Energy Systems, our SRE team plays a mission-critical role in maintaining the performance and reliability of our AI-optimized cloud infrastructure. The Storage-focused Site Reliability Engineer role is responsible for ensuring the availability, performance, and scalability of Crusoe’s cloud storage products and services, which power compute-intensive, latency-sensitive workloads for AI and HPC use cases. This role directly supports our vertically integrated, sustainable cloud platform by building and optimizing distributed, fault-tolerant storage systems at scale.

Job Responsibility:

  • Build automation and self-healing tools to monitor and maintain Crusoe’s distributed cloud storage infrastructure, which includes block, file, and object storage systems
  • Drive reliability initiatives focused on data replication, encryption, backup and restore strategies, and robust failover mechanisms
  • Help implement and maintain high-performance NVMe- and SSD-backed volumes that support large-scale AI compute clusters
  • Support user-facing storage services with a focus on availability, performance tuning, and adherence to error budgets
  • Investigate and resolve storage-related incidents using deep telemetry, logs, and performance profiling
  • Partner with hardware and kernel teams to diagnose low-level I/O issues and optimize I/O paths, cache policies, and file systems
  • Contribute to the architecture of fault-tolerant, scalable storage backends tailored for AI-first cloud environments

Requirements:

  • 8+ years of professional experience in Storage SRE, systems engineering, storage engineering, or similar roles
  • Hands-on experience with distributed storage systems (e.g., Ceph, GlusterFS, OpenEBS) and deep understanding of object, block, and file storage paradigms
  • Proficiency in a programming language such as Go, Python, Java, or C
  • Experience with Infrastructure as Code and deployment tooling such as Terraform, Ansible, or Puppet
  • Deep knowledge of Linux internals with a focus on I/O subsystems, memory management, and storage scheduling
  • Familiarity with storage protocols like NFS, SMB, iSCSI, or NVMe-oF
  • Strong experience working with containerized workloads and orchestration platforms (e.g., Kubernetes, Docker)
  • Excellent incident response, troubleshooting, and documentation practices
  • Experience building and operating managed services at scale, such as object, file, and block storage (AWS, GCP, Azure)
  • Excellent communication skills
  • Must be able to pass a background check
  • Embody the Company values

What we offer:

  • Restricted Stock Units in a fast-growing, well-funded technology company
  • Health insurance package options that include HDHP and PPO, vision, and dental for you and your dependents
  • Employer contributions to HSA accounts
  • Paid Parental Leave
  • Paid life insurance, short-term and long-term disability
  • Teladoc
  • 401(k) with a 100% match up to 4% of salary
  • Generous paid time off and holiday schedule
  • Cell phone reimbursement
  • Tuition reimbursement
  • Subscription to the Calm app
  • MetLife Legal
  • Company-paid commuter benefit: $300 per month

Additional Information:

Job Posted:
February 21, 2026

Employment Type:
Fulltime
Work Type:
On-site work
Similar Jobs for Staff Site Reliability Engineer, Storage

Member of Technical Staff, Site Reliability Engineer (HPC)

As Microsoft continues to push the boundaries of AI, we are on the lookout for p...
Location:
United States, Mountain View
Salary:
139900.00 - 274800.00 USD / Year
Microsoft Corporation
Expiration Date:
Until further notice
Requirements:
  • Master's Degree in Computer Science, Information Technology, or related field AND 2+ years technical experience in Site Reliability Engineering, DevOps, or Infrastructure Engineering
  • OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 4+ years technical experience in Site Reliability Engineering, DevOps, or Infrastructure Engineering
  • OR equivalent experience
  • Strong proficiency in Kubernetes, Docker, and container orchestration
  • Knowledge of CI/CD pipelines for Inference and ML model deployment
  • Hands-on experience with public cloud platforms like Azure/AWS/GCP and infrastructure-as-code
  • Expertise in monitoring & observability tools (Grafana, Datadog, OpenTelemetry, etc.)
  • Strong programming/scripting skills in Python, Go, or Bash
  • Solid knowledge of distributed systems, networking, and storage
  • Experience running large-scale GPU clusters for ML/AI workloads (preferred)
Job Responsibility:
  • Reliability & Availability: Ensure uptime, resiliency, and fault tolerance of HPC clusters powering MAI model training and inference
  • Observability: Design and maintain monitoring, alerting, and logging systems to provide real-time visibility into all aspects of HPC systems including GPU, clusters, storage and networking
  • Automation & Tooling: Build automation for deployments, incident response, scaling, and failover in CPU+GPU environments
  • Incident Management: Lead on-call rotations, troubleshoot production issues, conduct blameless postmortems, and drive continuous improvements
  • Security & Compliance: Ensure data privacy, compliance, and secure operations across model training and serving environments
  • Collaboration: Partner with ML engineers and platform teams to improve developer experience and accelerate research-to-production workflows
What we offer:
  • Competitive compensation, equity options, and comprehensive benefits
Employment Type:
Fulltime

FLEX Senior Solutions Architect

Accountable for the research, analysis, design, creation and implementation of P...
Location:
United States, Bethesda
Salary:
83.17 - 101.11 USD / Hour
Marriott Bonvoy
Expiration Date:
Until further notice
Requirements:
  • 8+ years in an IT operational role supporting mission critical solutions or applications with 5+ years leading an infrastructure organization
  • Bachelor's Degree in IT-related field with five (5)+ years of equivalent combination of education and experience and training
  • 3+ years of experience providing operations and sustainment support for cloud infrastructure service on Amazon or Azure or Ali cloud
  • 5+ years’ experience in any of the following: Public Clouds/Virtual Deployment using ESXi, Amazon Web Services (AWS) / EC2/EKS, Microsoft Azure, Oracle Cloud, Ali cloud, SaaS
  • Graduate degree in technical discipline
  • Strong diagnostic skills with regards to identification and classification of malicious BOT traffic
  • SAFe agile delivery framework
  • Experience supporting modern operating models (Site Reliability Engineering)
  • Experience in System Engineering of servers, storage, network, etc.
  • Familiarity with large scale cloud infrastructure, including network architectures, routing, DNS, TCP/IP protocols, and SSL/TLS ciphers
Job Responsibility:
  • Provides leadership, oversight, governance, and strategic direction related to Infrastructure services to enable the delivery of IT services
  • Defines the Marriott infrastructure architecture and governance model
  • Provides technical leadership, oversight, standardization, and validation of the effectiveness for the Enterprise Infrastructure environment
  • Researches, designs, and implements high-performing software components that are standards-based, highly available, and secured, delivering the required business functionality
  • Educates internal and external users of the technologies to continually improve the knowledge and skill-base of the organization on how best to operate and support the infrastructure services
  • Develops documents with a focus on how services will be leveraged in the solution architecture
  • Participates in the evaluation and selection of Infrastructure based products
  • Works closely with the EA team to facilitate alignment of plans with what is being delivered
  • Institutes governance based on best practices and ensures proper alignment to projects and major initiatives
  • Leads the analysis of the current environment to detect critical deficiencies and recommends solutions for improvement
What we offer:
  • bonus program
  • comprehensive health care benefits
  • 401(k) plan with up to 5% company match
  • employee stock purchase plan at 15% discount
  • accrued paid time off
  • life insurance
  • group disability insurance
  • travel discounts
  • adoption assistance
  • paid parental leave
Employment Type:
Fulltime

FX Applications Support Senior Analyst

As an FX Application Support Analyst, you will play a key role in running and ma...
Location:
Australia, Sydney
Salary:
Not provided
Citi
Expiration Date:
Until further notice
Requirements:
  • 5-8 years’ experience in an Application Support role
  • experience installing, configuring or supporting business applications
  • experience with some programming languages and willingness/ability to learn
  • advanced execution capabilities and ability to adjust quickly to changes and re-prioritization
  • effective written and verbal communications including ability to explain technical issues in simple terms that non-IT staff can understand
  • demonstrated analytical skills
  • issue tracking and reporting using tools
  • knowledge/experience of problem management tools
  • good all-round technical skills
  • ability to effectively share information with other support team members and with other technology teams
Job Responsibility:
  • provides technical and business support for users of Citi Applications
  • maintains application systems that have completed development stage and are running in daily operations
  • manages, maintains and supports applications and their operating environments, focusing on stability, quality and functionality
  • start of day checks, continuous monitoring, and regional handover
  • perform same day risk reconciliations
  • develop and maintain technical support documentation
  • identifies ways to maximize potential of applications used
  • assess risk and impact of production issues and escalate to business and technology management
  • ensures storage and archiving procedures are in place and functioning correctly
  • formulates and defines scope and objectives for complex application enhancements and problem resolution
What we offer:
  • rewarding work in a supportive environment
  • clear opportunities for progression
  • exciting company benefits
  • diverse team of professionals
  • global network of people, data and relationships
Employment Type:
Fulltime

FX Applications Support Senior Analyst

This hybrid role involves working as part of the FX Applications Support team to...
Location:
Australia, Sydney
Salary:
Not provided
Citi
Expiration Date:
Until further notice
Requirements:
  • 5-8 years’ experience in an Application Support role
  • experience installing, configuring or supporting business applications
  • experience with some programming languages and willingness/ability to learn
  • advanced execution capabilities and ability to adjust quickly to changes and re-prioritization
  • effective written and verbal communications including ability to explain technical issues in simple terms that non-IT staff can understand
  • demonstrated analytical skills
  • issue tracking and reporting using tools
  • knowledge/experience of problem management tools
  • good all-round technical skills
  • ability to effectively share information with other support team members and with other technology teams
Job Responsibility:
  • provides technical and business support for users of Citi applications
  • maintains application systems running in daily operations
  • manages, maintains and supports applications and their environments
  • performs start-of-day checks, continuous monitoring, and regional handovers
  • performs same day risk reconciliations
  • develops and maintains technical support documentation
  • assesses risk and impact and escalates in a timely manner
  • ensures storage and archiving procedures are functioning correctly
  • participates in application releases, from development to post-implementation analysis
  • identifies risks, vulnerabilities and security issues
What we offer:
  • rewarding work
  • supportive environment
  • clear opportunities for progression
  • exciting company benefits
Employment Type:
Fulltime

Staff Systems Infrastructure Engineer

You will be an integral part of our engineering team, collaborating closely with...
Location:
United States, Palo Alto
Salary:
120000.00 - 200000.00 USD / Year
Solomon Page
Expiration Date:
Until further notice
Requirements:
  • 7+ years of experience in DevOps, Site Reliability Engineering, or Infrastructure Engineering roles
  • Deep expertise in cloud platforms, with significant experience in Google Cloud Platform (GCP) services (e.g., Kubernetes (GKE), Cloud Run, Cloud SQL, AlloyDB, Pub/Sub, Cloud Storage, Compute Engine)
  • Strong proficiency with Infrastructure as Code (IaC) concepts and tools
  • Extensive experience with CI/CD pipeline development and management, specifically with GitHub Actions
  • Solid understanding of containerization technologies, especially Docker and Kubernetes
  • Proficiency in scripting languages (e.g., Python, Bash) for automation and system management
  • Experience with monitoring, logging, and alerting tools, with a focus on OpenTelemetry
  • Demonstrated knowledge of database administration and optimization, particularly PostgreSQL, AlloyDB, and Cloud SQL
  • A strong commitment to information security and privacy, with experience in implementing and maintaining systems in compliance with frameworks like HIPAA and SOC 2
  • Excellent problem-solving skills and the ability to troubleshoot complex infrastructure issues
Job Responsibility:
  • Design, implement, and maintain highly available, scalable, and secure cloud infrastructure on Google Cloud Platform (GCP) to support our Clinical Data Intelligence Platform and SMART on FHIR applications
  • Develop and implement Infrastructure as Code (IaC) solutions to automate provisioning, configuration, and management of our environments
  • Build and optimize CI/CD pipelines using tools like GitHub Actions to enable rapid and reliable deployment of our applications and services
  • Implement and manage monitoring, alerting, and logging solutions with a focus on OpenTelemetry to ensure system health, identify performance bottlenecks, and proactively address issues
  • Collaborate with engineering teams to optimize application performance, reliability, and cost efficiency
  • Ensure strict adherence to security best practices and compliance requirements (e.g., HIPAA, SOC 2) across all infrastructure components and processes
  • Manage and improve database infrastructure (e.g., PostgreSQL, AlloyDB, Cloud SQL) for performance and scalability
  • Take part in rotating on-call duties to maintain the stability and availability of our production systems
What we offer:
  • 0.05% – 0.4% and Benefits
Employment Type:
Fulltime

Principal Site Reliability Engineer

Arcadia’s customers rely on us to securely process and deliver high-value health...
Salary:
Not provided
The Muse
Expiration Date:
Until further notice
Requirements:
  • 8+ years of experience in SRE, platform engineering, systems engineering, or related roles operating production services at scale
  • Demonstrated principal-level impact: leading cross-team initiatives, influencing architecture decisions, and driving sustained improvements in reliability and operations
  • Expertise in Kubernetes operations and troubleshooting, including safe rollout/rollback patterns, workload debugging, and operational guardrails
  • Strong GitOps experience with Argo CD; experience building delivery workflows and automation using Argo Workflows
  • Strong infrastructure orchestration and provisioning experience with Crossplane and Terraform; ability to define reusable platform patterns and controls
  • Deep AWS experience (IAM, networking/VPC, compute, storage, managed services, observability) and strong understanding of reliability and failure modes in cloud systems
  • Proficiency in Python for building automation, tooling, and reliability improvements
  • Strong incident management and on-call leadership experience, including measurable improvements (availability, MTTR, alert quality, cost, or operational maturity)
Job Responsibility:
  • Act as the technical leader for reliability for one or more domains; set direction and standards while remaining hands-on where it matters most
  • Drive reliability strategy across critical services: define SLOs/SLIs, error budgets, and reliability KPIs aligned to customer journeys and outcomes
  • Own incident response maturity: lead complex incidents, improve incident command practices, and ensure high-quality RCAs with prioritized, tracked remediation
  • Architect and implement automation to reduce toil and risk: runbook automation, self-service tools, and safe operational workflows (Python + Argo Workflows)
  • Advance GitOps delivery practices using Argo CD: promotion strategies, progressive delivery/canaries, and guardrails that reduce deploy risk
  • Scale infrastructure management with Crossplane and Terraform: reusable patterns, policy controls, and paved roads for teams
  • Lead operational readiness and reliability reviews for new features/architectural changes; reinforce non-functional requirements (availability, latency, security, cost)
  • Improve performance and cost efficiency through capacity planning, load testing, right-sizing, and architecture recommendations across AWS services
What we offer:
  • Pet Insurance
  • Health Insurance
  • Dental Insurance
  • Vision Insurance
  • FSA
  • HSA
  • HSA With Employer Contribution
  • Life Insurance
  • Short-Term Disability
  • Long-Term Disability

Senior Software Engineer

Wells Fargo is seeking a Senior Software Engineer. We are looking for an experie...
Location:
India, Bengaluru
Salary:
Not provided
Wells Fargo
Expiration Date:
April 18, 2026
Requirements:
  • 4+ years of Software Engineering experience, or equivalent demonstrated through one or a combination of the following: work experience, training, military experience, education
  • 4+ years of software engineering experience
  • 4+ years of application production support experience
  • Education BS/BA degree or higher
  • An industry-standard technology certification
  • Strong verbal, written, and interpersonal communication skills
  • 3+ years of experience with Cloud technologies
  • Knowledge and understanding of Site Reliability Engineering (SRE) concepts
  • 3+ years of Agile experience
  • Advanced scripting skills specifically around automation, log rotation, data collection, error collection and alerting
Job Responsibility:
  • Lead moderately complex initiatives and deliverables within technical domain environments
  • Contribute to large scale planning of strategies
  • Design, code, test, debug, and document for projects and programs associated with technology domain, including upgrades and deployments
  • Review moderately complex technical challenges that require an in-depth evaluation of technologies and procedures
  • Resolve moderately complex issues and lead a team to meet existing clients' needs or potential new clients' needs while leveraging a solid understanding of the function, policies, procedures, or compliance requirements
  • Collaborate and consult with peers, colleagues, and mid-level managers to resolve technical challenges and achieve goals
  • Lead projects and act as an escalation point, provide guidance and direction to less experienced staff
  • Maintain system operational knowledge (functional and technical)
  • Understand and monitor system operation, ensure optimal availability, functional health, and performance (driven by SLO/SLA)
  • Triage alerts, respond to incidents, perform root cause analysis (troubleshooting)
Employment Type:
Fulltime

Product Application Engineer - Data Center Deployment

This highly technical role supports large-scale datacenter graphics hardware and...
Location:
United States, Santa Clara; Austin; Secaucus
Salary:
160960.00 - 241440.00 USD / Year
AMD
Expiration Date:
Until further notice
Requirements:
  • Datacenter customer support in virtualization-focused environments
  • Virtual environments (VMWare, Citrix, KVM, Microsoft, and others) and virtual machine configuration/management
  • Data storage, protection, deduplication, and storage-related network optimization especially with Weka, DDN, and VAST products
  • Working in or closely with a deployment services organization utilizing tools like Salesforce, JIRA and Confluence
  • Linux installation, configuration, debugging, and performance tuning
  • Debugging, root-cause analysis, and system-level problem solving
  • Site reliability engineering concepts and best practices
  • Server architecture, remote management, network topologies, and compute subsystem operations
  • Datacenter GPU software stacks such as ROCm™ or CUDA
  • High-performance networks for HPC and AI (RDMA/RoCE, InfiniBand)
Job Responsibility:
  • Design, optimize, and troubleshoot virtualization solutions for high-performance datacenter GPU, CPU, and related platforms
  • Support customers, partners, and internal teams on virtualization topics related to AI and Machine Learning workloads
  • Build and configure datacenter networking environments for customer testing, validation, and deployment
  • Qualify and assess new virtualization capabilities to ensure alignment with customer and product requirements
  • Provide mentorship and technical guidance to junior engineering staff
  • Partner with development teams to identify and resolve hardware/software issues from early bring-up through end-of-life
  • Document and escalate technical issues following established procedures
  • Collaborate with program managers to maintain schedules, track action items, and ensure deliverables are met
  • Provide clear project status updates to internal leadership and customer stakeholders
  • Build a deep understanding of customer goals to ensure impactful technical guidance and solution delivery
Employment Type:
Fulltime