Senior Site Reliability Engineer, Storage Job at Crusoe (San Francisco, Sunnyvale)

Senior Site Reliability Engineer

We are seeking a Senior Site Reliability Engineer with deep expertise in Kuberne...

Location

Denmark, Copenhagen

Salary:

Not provided

Keepit

Expiration Date

Until further notice

Requirements

5+ years in a Site Reliability, Platform, or DevOps Engineering role
Hands-on Kubernetes experience, including storage (Rook-Ceph or equivalent)
Solid Linux fundamentals
Proactive mindset
Clear communicator

Job Responsibility

Participate in the daily operation of our existing stack
Evolve and help design our next-generation infrastructure setup
Define and enforce reliability standards, runbooks, and operational best practices across the platform
Collaborate with Development and Operations teams to identify and resolve bottlenecks before they become incidents
Champion automation: if something is done twice, it should be scripted the third time

What we offer

Competitive salary
Pension scheme
A modern, energetic global work environment
Flexible work-life balance supported by a hybrid working model
Regular team-building activities
Opportunities for professional development and career advancement
Compensation is based on experience and skill set

Fulltime

Senior Site Reliability Engineer

Senior Site Reliability Engineer (SRE). This role has been designed as ‘’Onsite’...

Location

Salary:

Not provided

Hewlett Packard Enterprise

Expiration Date

Until further notice

Requirements

Bachelor’s or Master’s degree in Computer Science, Information Systems, or a related field
6–10+ years of experience in DevOps, Site Reliability Engineering, or cloud infrastructure roles
Strong hands-on experience with cloud platforms (AWS or GCP) including services like EC2/GCE, IAM, and object storage (S3/GCS)
Experience with containerization and orchestration technologies, especially Docker and Kubernetes
Experience building and managing CI/CD pipelines using tools such as Jenkins, GitHub Actions, or GitLab
Experience with monitoring and observability tools such as Prometheus, CloudWatch, or Stackdriver
Strong understanding of Linux systems administration and configuration management tools like Ansible
Experience managing distributed systems and streaming platforms such as Kafka, Cassandra, Elasticsearch, Spark, Flink, or Storm
Strong automation and scripting skills using Python, Go, Rust, or Shell scripting
Experience with Infrastructure as Code (IaC) tools like Terraform or CloudFormation

Job Responsibility

Ensure high availability, reliability, and performance of large-scale cloud infrastructure across AWS and GCP environments
Operate and support infrastructure components and distributed data platforms such as Kubernetes, Kafka, Flink, Storm, and Spark
Manage and maintain databases including Cassandra, Elasticsearch, Redis, Postgres, and ArangoDB
Monitor systems, troubleshoot issues, and resolve production incidents across microservices and distributed systems
Collaborate closely with software engineering teams to debug and resolve complex production problems
Participate in 24x7 on-call rotation supporting multi-cloud production environments
Monitor system metrics, application performance, and infrastructure health using observability tools
Own the incident management lifecycle, including detection, mitigation, Root Cause Analysis (RCA), and post-incident reviews
Develop and maintain runbooks, automation, and operational processes to improve reliability and efficiency
Perform capacity planning using system usage and performance data

What we offer

Health & Wellbeing
Personal & Professional Development
Unconditional Inclusion

Senior Site Reliability Engineer

We are seeking a highly skilled and passionate Senior Site Reliability Engineer ...

Location

Spain; Portugal; United Kingdom

Salary:

Not provided

Parser Limited

Expiration Date

Until further notice

Requirements

Deep SRE Expertise: Proven experience as a Senior Site Reliability Engineer or a similar role, with a strong understanding of SRE principles (error budgets, SLOs/SLIs, toil reduction)
Azure Cloud Proficiency: Extensive hands-on experience designing, deploying, and operating highly available and scalable applications on Microsoft Azure
Azure Kubernetes Service (AKS) Expertise: Mandatory extensive hands-on experience with AKS for container orchestration, including deployment, scaling, monitoring, and troubleshooting
Java Ecosystem Mastery: Expert-level proficiency with Java, including experience with modern frameworks (ideally Micronaut, Spring Boot, or similar) and JVM performance tuning
Distributed Systems Knowledge: Solid understanding and practical experience with distributed systems, microservices architecture, and associated challenges (e.g., consistency, fault tolerance)
Messaging & Database Expertise: Hands-on experience with an event streaming platform (ideally Kafka) and NoSQL data storage (ideally Couchbase), including operational best practices
Automation First Mindset: Strong scripting skills (e.g., Python, Bash) and experience with Infrastructure as Code tools (e.g., Terraform, ARM templates) and CI/CD pipelines (e.g., Azure DevOps, Jenkins)
Observability Tools: Experience with monitoring, logging, and alerting tools (e.g., Azure Monitor, Prometheus, Grafana, ELK Stack, Splunk)
Problem-Solving Acumen: Exceptional analytical and troubleshooting skills, with a methodical approach to diagnosing and resolving complex production issues
Communication & Collaboration: Excellent communication skills, with the ability to articulate complex technical concepts to diverse audiences and collaborate effectively with cross-functional teams

Job Responsibility

Architect and Implement Reliability: Design, build, and maintain highly scalable, resilient, and performant systems on Azure, focusing on our Java, Kafka, and Couchbase stack
Drive Modernisation: Work hands-on as part of the team spearheading the adoption of Micronaut, standardising application templates, and transitioning to managed cloud services
Enhance Operational Excellence: Develop and implement strategies for improving system observability (standardised logging, metrics, tracing), alerting, and on-call practices
Automate Everything: Champion automation across the software development lifecycle (SDLC), from CI/CD pipelines to infrastructure provisioning, focusing on accelerating delivery and de-risking deployments
Incident Management & Learning: Contribute to our mature, blameless post-incident review process, identifying root causes and implementing preventative measures to reduce incident hours
Tooling & Standards: Develop, maintain, and drive the adoption of shared, standardised SRE tooling and best practices across engineering teams, including containerisation (e.g., Docker, Kubernetes on Azure), infrastructure as code (e.g., Terraform), and configuration management
Mentorship & Collaboration: Provide technical leadership and mentorship to junior engineers, fostering a culture of SRE principles and operational excellence across the wider engineering organisation
Strategic Input: Contribute to the overall technical strategy and roadmap for our SRE and platform initiatives, ensuring alignment with business objectives

What we offer

The chance to join an organization with triple-digit growth that is changing the paradigm on how software products are built
The opportunity to form part of an amazing, multicultural community of tech experts
A highly competitive compensation package
Medical insurance
English lessons

Fulltime

Senior Site Reliability Engineer - GM Motorsports

We are hiring a Senior Site Reliability Engineer (SRE) to join the GM Motorsport...

Location

United States, Austin; Concord

Salary:

Not provided

General Motors

Expiration Date

Until further notice

Requirements

Proven experience in Site Reliability Engineering (SRE), DevOps, or Platform Engineering supporting large-scale distributed systems
Strong experience with Linux systems administration and cloud-native infrastructure
Experience operating high-throughput data platforms or streaming systems (Kafka, Flink, Spark, etc.)
Hands-on experience with Infrastructure as Code tools such as Terraform or similar frameworks
Experience implementing observability stacks (Prometheus, Grafana, OpenTelemetry, Datadog, etc.)
Strong debugging and troubleshooting skills across distributed systems
Ability to break down complex reliability challenges into clear, implementation-ready initiatives
A growth mindset and commitment to continuous learning in a fast-paced engineering environment

Job Responsibility

Design and implement reliability practices across the motorsports data platform, including Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets for streaming and analytics workloads
Ensure reliability and performance of high-throughput streaming and batch data pipelines supporting telemetry ingestion, analytics processing, and simulation workloads using technologies such as Kafka, Flink, and Databricks
Build and maintain comprehensive observability frameworks including metrics, logs, and tracing across the platform. Develop dashboards, alerts, and automated responses that detect system degradation before it impacts engineering workflows
Drive the automation of platform infrastructure using Infrastructure as Code (IaC) and platform engineering best practices to enable consistent, reproducible environments across development, testing, and production
Identify operational friction and eliminate manual processes by implementing self-healing infrastructure, automation frameworks, and developer self-service capabilities
Own the reliability of data ingestion, transformation, and storage layers, ensuring stable and performant integration across distributed data systems
Continuously evaluate platform performance and scalability, ensuring the data platform can support high-frequency telemetry ingestion, real-time analytics, and large-scale historical analysis
Provide mentorship and peer review to engineers across the platform team, promoting strong operational discipline, resilient system design, and high-quality engineering practices

What we offer

This role may be eligible for relocation benefits

Fulltime

Senior+ Site Reliability Engineer

Crusoe is building the most reliable, energy-efficient, AI-optimized cloud platf...

Location

United States, San Francisco

Salary:

172,000.00 - 209,000.00 USD / Year

Crusoe

Expiration Date

Until further notice

Requirements

5+ years of experience in cloud operations, SRE, or related roles
Background working with GPU workloads, high-performance computing, or latency/throughput-sensitive systems
Strong knowledge of Unix/Linux systems (kernel/user space) and networking, including debugging complex issues in live systems
Understanding of cloud platforms and infrastructure fundamentals (Kubernetes, AWS/GCP, virtualization, distributed systems)
Familiarity with incident management practices and operational frameworks (SRE/ITIL/etc.)
Experience with monitoring and alerting tools (Prometheus, Grafana) or a strong willingness to learn
Familiarity with infrastructure-as-code and configuration management tools such as Terraform and Ansible
Basic scripting and automation experience (Go, Python, C, C++, or similar)
Strong communication skills, with the ability to clearly articulate technical issues to diverse stakeholders
Ability to stay calm, focused, and effective in fast-moving or high-pressure situations

Job Responsibility

Collaborate with cross-functional teams to define and refine availability metrics for Crusoe’s cloud infrastructure, including establishing, tracking, and improving SLIs and SLOs
Assist in incident response by identifying, diagnosing, and resolving service disruptions, and support post-incident processes through RCA documentation and participation in post-incident reviews
Build, operate, and monitor infrastructure health using Crusoe’s observability stack (Prometheus, Grafana, Alertmanager, OpenTelemetry)
Identify and communicate reliability risks, performance bottlenecks, and early indicators of potential incidents that could impact service availability
Develop automation and tooling to reduce operational toil, minimize manual intervention, and enhance service recovery and self-healing capabilities
Partner with compute, network, storage, and platform teams to improve service resilience and strengthen disaster recovery readiness
Contribute to knowledge sharing, process improvements, and the development of operational best practices across the organization
Participate in ongoing training, mentorship, and professional development to grow into advanced SRE responsibilities

What we offer

Industry-competitive pay
Restricted Stock Units in a fast-growing, well-funded technology company
Health insurance package options that include HDHP and PPO, vision, and dental for you and your dependents
Employer contributions to HSA accounts
Paid Parental Leave
Paid life insurance, short-term and long-term disability
Teladoc
401(k) with a 100% match up to 4% of salary
Generous paid time off and holiday schedule
Cell phone reimbursement

Fulltime

Site Reliability Engineer / Observability Engineer

Rackspace is building up its Professional Services Center of Excellence on Appli...

Location

Egypt, Giza

Salary:

Not provided

Rackspace

Expiration Date

Until further notice

Requirements

Bachelor’s degree in engineering/computer science or equivalent
Senior-level experience with Site Reliability Engineering; DevOps; code-level application support and troubleshooting; AWS infrastructure design, implementation, and optimization; and automation for deployment, scaling, and reliability
Experience with observability tools such as Splunk, Datadog, SignalFx, etc.
Experience deploying, maintaining and supporting software applications/services in the AWS ecosystem
Proactive approach to identifying problems and solutions
Experience writing code with one or more interpreted languages such as Python, PHP, Perl, Ruby, Linux Shell
Experience with Terraform or CloudFormation scripting
Experience with configuration management tools like Ansible, Chef or Puppet
Experience with standard software development best practices and tools such as code repositories (Git preferred)
Experience executing in an agile software development environment

Job Responsibility

Work with customers and implement Observability solutions
Build and maintain scalable systems and robust automation that supports engineering goals
Develop and maintain monitoring tools, alerts, and dashboards to provide visibility into system health and performance
Proactively gather and analyze both metric and log data from systems and applications to perform anomaly detection, performance tuning, capacity planning and fault isolation
Collaborate with development teams to implement and deploy new features and enhancements, ensuring they meet reliability, security and performance standards
Collaborate with team members to document and share solutions
Maintain a deep understanding of the customer’s business as well as their technical environment
Identify performance bottlenecks and anomalous system behavior, and resolve the root causes of service issues

Fulltime

Principal Site Reliability Engineer

Arcadia’s customers rely on us to securely process and deliver high-value health...

Location

Salary:

Not provided

The Muse

Expiration Date

Until further notice

Requirements

8+ years of experience in SRE, platform engineering, systems engineering, or related roles operating production services at scale
Demonstrated principal-level impact: leading cross-team initiatives, influencing architecture decisions, and driving sustained improvements in reliability and operations
Expertise in Kubernetes operations and troubleshooting, including safe rollout/rollback patterns, workload debugging, and operational guardrails
Strong GitOps experience with Argo CD; experience building delivery workflows and automation using Argo Workflows
Strong infrastructure orchestration and provisioning experience with Crossplane and Terraform; ability to define reusable platform patterns and controls
Deep AWS experience (IAM, networking/VPC, compute, storage, managed services, observability) and strong understanding of reliability and failure modes in cloud systems
Proficiency in Python for building automation, tooling, and reliability improvements
Strong incident management and on-call leadership experience, including measurable improvements (availability, MTTR, alert quality, cost, or operational maturity)

Job Responsibility

Act as the technical leader for reliability for one or more domains; set direction and standards while remaining hands-on where it matters most
Drive reliability strategy across critical services: define SLOs/SLIs, error budgets, and reliability KPIs aligned to customer journeys and outcomes
Own incident response maturity: lead complex incidents, improve incident command practices, and ensure high-quality RCAs with prioritized, tracked remediation
Architect and implement automation to reduce toil and risk: runbook automation, self-service tools, and safe operational workflows (Python + Argo Workflows)
Advance GitOps delivery practices using Argo CD: promotion strategies, progressive delivery/canaries, and guardrails that reduce deploy risk
Scale infrastructure management with Crossplane and Terraform: reusable patterns, policy controls, and paved roads for teams
Lead operational readiness and reliability reviews for new features/architectural changes; reinforce non-functional requirements (availability, latency, security, cost)
Improve performance and cost efficiency through capacity planning, load testing, right-sizing, and architecture recommendations across AWS services

What we offer

Pet Insurance
Health Insurance
Dental Insurance
Vision Insurance
FSA
HSA
HSA With Employer Contribution
Life Insurance
Short-Term Disability
Long-Term Disability

Software Engineer 2 / Senior Software Engineer - Azure Data

Microsoft's Azure Data engineering team is leading the transformation of analyti...

Location

India, Bangalore

Salary:

Not provided

Microsoft Corporation

Expiration Date

Until further notice

Requirements

Bachelor's Degree in Computer Science or related technical field AND 3+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python, OR equivalent experience
Experience with the Azure stack including Storage, Compute, Networking, Fabric, Purview, Synapse, AKS, DevOps, Data Factory, or Power BI
Experience with big data technologies such as Spark, Kafka, Hadoop, or HBase
Experience building data lake or data engineering products, tools, or pipelines
Familiarity with container-based architectures (Docker, Kubernetes)
Ability to debug complex distributed systems on Linux and/or Windows platforms

Job Responsibility

Write extensible, maintainable code in C#, Java, Scala, or Python for Fabric Materialized Lake View services and HDInsight components
Use AI tools and coding best practices across the development lifecycle
Design data refresh, scheduling, and query optimisation features with minimal supervision
Review code from teammates for correctness, test coverage, security risks, and adherence to team standards
Coach junior engineers through code reviews
Debug complex issues in distributed systems running on Azure, Linux, and Windows
Run live site operations on a rotational, on-call basis
Integrate logging and instrumentation to gather telemetry on system health, performance, reliability, and security
Work with product managers, technical leads, and partners across geographies to define customer requirements for Materialized Lake View features

Fulltime
