Senior+ Site Reliability Engineer

United States, San Francisco · 172000.00 - 209000.00 USD / Year · Job Posted February 21, 2026

Job Description

Crusoe is building the most reliable, energy-efficient, AI-optimized cloud platform — and operational excellence is at the heart of that mission. As a Site Reliability Engineer focused on Operational Excellence, you will help ensure the stability, resilience, and performance of Crusoe’s GPU cloud. This role is ideal for engineers who thrive in fast-paced environments, enjoy solving operational problems, and want to grow their technical career while supporting incident response, reliability, and continuous improvement across a large-scale distributed platform. You’ll partner closely with senior SREs, infrastructure engineers, and platform teams to improve reliability, reduce operational toil, and strengthen Crusoe’s incident management practices.

Job Responsibility

  • Collaborate with cross-functional teams to define and refine availability metrics for Crusoe’s cloud infrastructure, including establishing, tracking, and improving SLIs and SLOs
  • Assist in incident response by identifying, diagnosing, and resolving service disruptions, and support post-incident processes through RCA documentation and participation in post-incident reviews
  • Build, operate, and monitor infrastructure health using Crusoe’s observability stack (Prometheus, Grafana, Alertmanager, OpenTelemetry)
  • Identify and communicate reliability risks, performance bottlenecks, and early indicators of potential incidents that could impact service availability
  • Develop automation and tooling to reduce operational toil, minimize manual intervention, and enhance service recovery and self-healing capabilities
  • Partner with compute, network, storage, and platform teams to improve service resilience and strengthen disaster recovery readiness
  • Contribute to knowledge sharing, process improvements, and the development of operational best practices across the organization
  • Participate in ongoing training, mentorship, and professional development to grow into advanced SRE responsibilities

Requirements

  • 5+ years of experience in cloud operations, SRE, or related roles
  • Background working with GPU workloads, high-performance computing, or latency/throughput-sensitive systems
  • Strong knowledge of Unix/Linux systems (kernel/user space) and networking including debugging complex issues in live systems
  • Understanding of cloud platforms and infrastructure fundamentals (Kubernetes, AWS/GCP, virtualization, distributed systems)
  • Familiarity with incident management practices and operational frameworks (SRE/ITIL/etc.)
  • Experience with monitoring and alerting tools (Prometheus, Grafana) or a strong willingness to learn
  • Familiarity with infrastructure-as-code and configuration management tools such as Terraform and Ansible
  • Basic scripting and automation experience (Go, Python, C, C++, or similar)
  • Strong communication skills, with the ability to clearly articulate technical issues to diverse stakeholders
  • Ability to stay calm, focused, and effective in fast-moving or high-pressure situations
  • A growth mindset with enthusiasm for operational excellence, reliability engineering, and continuous improvement

Nice to have

  • Experience with Kubernetes, container orchestration, or large-scale distributed systems
  • Exposure to change management, operational readiness reviews, or structured RCAs
  • Familiarity with self-healing systems, automated remediation, or event-driven operations
  • Interest in scaling AI/HPC infrastructure and solving reliability challenges in GPU-heavy environments
  • Passion for learning, mentorship, and developing deeper SRE capabilities over time

What we offer

  • Industry-competitive pay
  • Restricted Stock Units in a fast-growing, well-funded technology company
  • Health insurance package options that include HDHP and PPO, vision, and dental for you and your dependents
  • Employer contributions to HSA accounts
  • Paid Parental Leave
  • Paid life insurance, short-term and long-term disability
  • Teladoc
  • 401(k) with a 100% match up to 4% of salary
  • Generous paid time off and holiday schedule
  • Cell phone reimbursement
  • Tuition reimbursement
  • Subscription to the Calm app
  • MetLife Legal
  • Company-paid commuter benefit: $300 per month

Similar Jobs for Senior+ Site Reliability Engineer

Senior Site Reliability Engineer - Fleet Reliability

Lambda, The Superintelligence Cloud, is a leader in AI cloud infrastructure serv...
Location
United States, San Francisco
Salary
230000.00 - 345000.00 USD / Year
Lambda
Expiration Date
Until further notice
Requirements
  • 7+ years of experience in Site Reliability Engineering, DevOps, or a similar role
  • Strong understanding of modern AI infrastructure, from GPU architectures to hardware performance optimization
  • Strong understanding of Linux-based systems in a distributed environment
  • Solid understanding of Python and Go, with experience working with SWE teams to improve internal tooling
  • Experience with monitoring and alerting tools (e.g., Prometheus, Grafana, SumoLogic)
  • Proficiency in automation and configuration management tools (e.g., Ansible, Terraform)
  • Familiarity with cloud platforms (e.g., OCI, AWS, GCP, Azure)
  • Excellent problem-solving and troubleshooting skills
  • Strong communication and collaboration skills
  • Passion for continuous improvement and innovation
Job Responsibility
  • Define Fleet Health metrics and indicators to objectively measure and improve system availability
  • Collaborate with the observability team on comprehensive monitoring and alerting systems to proactively predict, detect and respond to issues or anomalies
  • Create runbooks and automated remediations for common failure scenarios
  • Build in automation and auditing to ensure compliance and improve efficiency and productivity
  • Participate in on-call rotations and provide support for incident response and resolution
  • Implement and integrate logging and metrics across platforms such as Datadog, Prometheus, OpenTelemetry, Grafana, and SumoLogic
What we offer
  • Generous cash & equity compensation
  • Health, dental, and vision coverage for you and your dependents
  • Wellness and commuter stipends for select roles
  • 401k Plan with 2% company match (USA employees)
  • Flexible paid time off plan
Employment type: Full-time

Senior Site Reliability Engineer

We are seeking a Senior Site Reliability Engineer with deep expertise in Kuberne...
Location
Denmark, Copenhagen
Salary
Not provided
Keepit
Expiration Date
Until further notice
Requirements
  • 5+ years in a Site Reliability, Platform, or DevOps Engineering role
  • Hands-on Kubernetes experience, including storage (Rook-Ceph or equivalent)
  • Solid Linux fundamentals
  • Proactive mindset
  • Clear communicator
Job Responsibility
  • Participate in the daily operation of our existing stack
  • Evolve and take part in designing our next generation infrastructure setup
  • Define and enforce reliability standards, runbooks, and operational best practices across the platform
  • Collaborate with Development and Operations teams to identify and resolve bottlenecks before they become incidents
  • Champion automation: if something is done twice, it should be scripted the third time
What we offer
  • Competitive salary
  • Pension scheme
  • A modern, energetic global work environment
  • Flexible work-life balance supported by a hybrid working model
  • Regular team-building activities
  • Opportunities for professional development and career advancement
  • Compensation is based on experience and skill set
Employment type: Full-time

Senior Site Reliability Engineer

The Business Operations team is seeking a highly motivated and experienced Senio...
Location
Norway, Oslo
Salary
Not provided
Mastercard
Expiration Date
Until further notice
Requirements
  • Observability
  • Programming and Scripting
  • Systems and Network Administration
  • Cloud Computing and Infrastructure
  • Reliability and Scalability
  • DevOps Practices
  • Troubleshooting
  • Capacity Planning and Performance Optimization
  • IT Service Management
  • Proactive Monitoring and Improvement (SRE Applications)
Job Responsibility
  • Independently execute key elements of projects/processes within the Site Reliability Engineering area by applying in-depth knowledge of their discipline and area best practices to effectively resolve problems and roadblocks as they occur
  • Assist in evaluating operational requirements and developing technical solutions within existing frameworks
  • Support automation and scripting efforts to improve operational workflows and incident response processes
  • Troubleshoot and resolve routine and some complex system issues, escalating when necessary to maintain system health
  • Contribute to documentation, knowledge sharing, and best practices to enhance team operational procedures
  • Collaborate with development teams and stakeholders to ensure reliability solutions align with technical and business needs
  • Participate in reviews and quality assurance activities to uphold system stability standards
  • May contribute to solution development for new products/services and/or manage smaller project/initiatives as an experienced individual contributor with specialized knowledge within the Site Reliability Engineering area
Employment type: Full-time

Senior Site Reliability Engineer

The Senior Site Reliability Engineer establishes and maintains the infrastructur...
Location
United Kingdom; United States; Canada
Salary
Not provided
Mozilla
Expiration Date
Until further notice
Requirements
  • 7+ years of experience in infrastructure, platform engineering, or site reliability roles, including hands-on production Kubernetes experience in workload operations, troubleshooting, and cluster management
  • Hands-on experience with infrastructure-as-code on AWS using Terraform, OpenTofu, or Pulumi
  • Security awareness in day-to-day infrastructure work: identity, least privilege, secrets hygiene, and network controls
  • Demonstrated ownership mindset with the ability to proactively identify issues, drive work to completion, and communicate risks early
  • Excellent async written communication skills; comfortable working with a geographically distributed team
  • Ability to collaborate effectively with software engineers and non-engineering stakeholders to improve platform reliability and operational efficiency
  • Ability to learn, evaluate, and responsibly use emerging technologies, including AI-enabled tools, to improve work processes
Job Responsibility
  • Operate and evolve our EKS-based Kubernetes platform, supporting service migrations, platform improvements, and reliability initiatives
  • Design and develop CI/CD systems supporting websites, services, and Thunderbird desktop releases, contributing to pipeline reliability and OIDC-based authentication across GitHub Actions workflows
  • Write and maintain infrastructure in Pulumi and/or Terraform/OpenTofu across multiple AWS accounts
  • Operate and evolve our observability stack (VictoriaMetrics, VictoriaLogs, Grafana, Vector) and partner with engineering teams to incorporate instrumentation and monitoring into service design
  • Apply security-conscious infrastructure practices, including least-privilege IAM, secrets management via AWS Secrets Manager and External Secrets Operator, and network segmentation
  • Diagnose and debug production incidents; drive root-cause analysis and post-incident improvements to prevent recurring problems
  • Participate in on-call rotation and collaborate with SDEs and fellow SREs to ship, maintain, and monitor new builds and support service onboarding
  • Contribute to runbooks, architecture documentation, and team processes
What we offer
  • Fully remote work & schedule flexibility
  • Company-provided laptop
  • Annual bonus program
  • Monthly remote work stipend
  • Annual professional development stipend
  • Industry conferences
  • Company all-hands and team gatherings
  • 24 days PTO per year (prorated)
  • Birthday
  • Year-end company shutdown
Employment type: Full-time

Senior Site Reliability Engineer, Wikimedia Enterprise

The Wikimedia Foundation is looking for a Senior Site Reliability Engineer to jo...
Location
United States
Salary
116633.00 - 181243.00 USD / Year
Wikimedia Foundation
Expiration Date
Until further notice
Requirements
  • Automation & Configuration Management: Experience with Infrastructure as Code and automation tools (e.g., Terraform, Ansible) and proficiency in at least one programming language (e.g., Python, Go, or similar)
  • Cloud Infrastructure: Experience designing, operating, and optimizing cloud-based systems across platforms such as AWS, Azure, or GCP, including scalability, reliability, and cost efficiency
  • CI/CD & Deployment Practices: Experience building and maintaining CI/CD pipelines and GitOps workflows (e.g., GitLab or similar, ArgoCD), with familiarity in progressive delivery approaches such as canary and blue-green deployments
  • Incident Management & Reliability Operations: Experience with incident response, on-call practices, and leading postmortems, with a focus on continuous improvement and operational excellence
  • SRE Principles & Observability: Strong understanding of SRE best practices, including SLOs, SLIs, and error budgets, along with experience in observability (metrics, logging, and distributed tracing, e.g., Prometheus, OpenTelemetry)
  • Collaboration & Communication: Ability to work effectively in a distributed, cross-functional environment, with strong documentation and communication skills
  • Proven experience operating highly available, large-scale distributed systems, with a deep understanding of reliability, scalability, and failure modes
  • Ownership mindset: Takes end-to-end responsibility for system reliability, proactively identifying and addressing risks before they impact users
  • Bias for automation: Continuously seeks to reduce operational toil through automation and scalable solutions
  • Continuous improvement mindset: Actively learns from incidents and drives improvements through blameless postmortems and iterative enhancements
Job Responsibility
  • Define, track, and improve Service Level Objectives (SLOs), SLIs, and error budgets to ensure reliability targets are met
  • Build and enhance observability systems (metrics, logs, and distributed tracing) to enable proactive detection and faster troubleshooting
  • Drive reliability engineering practices, including capacity planning, load testing, and resilience validation (e.g., chaos testing)
  • Improve developer experience (DevEx) by enabling self-service infrastructure and streamlining deployment workflows
  • Partner with engineering team members to embed reliability best practices early in the development lifecycle
  • Design, implement, and optimize CI/CD and GitOps workflows using tools such as GitLab (or similar) and ArgoCD (or similar), enabling automated, reliable deployments with support for progressive delivery strategies like canary and blue-green releases
  • Implement secure-by-default infrastructure and enforce best practices (e.g., IAM, secrets management, encryption)
  • Continuously optimize infrastructure cost and efficiency using FinOps principles while maintaining performance and availability
  • Establish and track operational metrics such as MTTR, MTTD, and incident frequency to drive continuous improvement
  • Reduce operational toil by identifying repetitive work and implementing automation-first solutions
Employment type: Full-time

Senior Site Reliability Engineer

The Wikimedia Foundation is looking for a Senior Site Reliability Engineer to su...
Location
United States
Salary
113082.00 - 175725.00 USD / Year
Wikimedia Foundation
Expiration Date
Until further notice
Requirements
  • 6+ years of experience in an SRE/Operations/DevOps role as part of a team
  • Experience with shell and any scripting language used in an SRE context (Python, Go, Bash, Ruby; we primarily use Python) and configuration management tools (Puppet, Ansible; we use Puppet)
  • Experience with distributed caching systems: including their underlying algorithms and how to optimize their performance
  • Experience with package management on Linux systems (we use Debian)
  • Strong Linux system-level troubleshooting skills
  • History of automating tasks and processes, identifying process gaps, and finding automation opportunities
  • Strong English language skills (verbal and written) and ability to work independently, as an effective part of a globally distributed team working across multiple time zones
  • Experience leading and participating in incident response and post-incident review rituals, with the goal of conducting root cause analysis and implementing preventive measures
Job Responsibility
  • Performing day-to-day operational/DevOps tasks on Wikimedia’s public facing infrastructure (deployment, maintenance, configuration, troubleshooting)
  • Implementing and utilizing configuration management and deployment tools (Puppet, Kubernetes)
  • Leading continuous improvement, by automating the installation, configuration and maintenance of services on our platform
  • Working closely with product teams helping them bring scalable functionality to our users by assisting in the architectural design of new services and making them operate at scale
  • Participating in a 24/7 on-call rotation shared across the broader SRE team. This includes taking part in incident response, diagnosis and follow-up on system outages or alerts across Wikimedia’s production infrastructure.
  • Collaborating with a global, cross-functional team in an asynchronous communication environment
  • Mentoring peers in your areas of technical and operational strength
Employment type: Full-time

Senior Site Reliability Engineer

At bsport, the Senior Site Reliability Engineer is a role for someone who doesn’...
Location
Spain; France, Barcelona; Paris
Salary
Not provided
Bsport
Expiration Date
Until further notice
Requirements
  • 5+ years of experience in SRE, Platform Engineering, Infrastructure or Backend Engineering
  • Strong experience with cloud infrastructure (AWS preferred), Kubernetes and CI/CD
  • Experience building or maintaining high-availability, scalable systems
  • Solid Python experience (bonus points for Django)
  • Experience working with SQL databases, ideally PostgreSQL
  • A proactive mindset: you enjoy taking ownership and solving complex technical challenges
  • Strong communication skills and fluency in English
Job Responsibility
  • Scale infrastructure and design resilient systems supporting international growth
  • Improve deployment speed, CI/CD pipelines and developer experience
  • Shape platform architecture through modularisation and scalable deployment strategies
  • Enhance observability, reliability and incident response capabilities
  • Influence engineering practices and collaborate across teams to improve how we build and ship
What we offer
  • Competitive salary packages based on your experience and role
  • Hybrid model with 3 days in the office per week
  • Work from anywhere: up to 15 days of remote work from abroad each year
  • Exclusive fitness perks: discounted access to Wellhub for Spain and HelloCSE membership for France
  • Private health insurance and flexible remuneration for Spain
  • Diverse, fun-loving team: multicultural colleagues, after-work events, team-building & more
Employment type: Full-time

Senior Site Reliability Engineer (SRE)

The Senior SRE is responsible for deployment, updates, and operational support f...
Location
India, Chennai
Salary
Not provided
Dalet
Expiration Date
Until further notice
Requirements
  • Cloud platforms: AWS, Azure
  • Containerisation & Orchestration: Kubernetes
  • Infrastructure as Code: Terraform
  • Configuration Management: Ansible
  • Packaging & Deployment: Helm
  • Databases: MariaDB, MongoDB
  • Monitoring, observability, networking, and cloud security.
Job Responsibility
  • Act as a senior technical authority for APAC Site Reliability Engineering activities
  • Drive best practices in reliability, operations, and engineering standards
  • Promote technical excellence, collaboration, and accountability across stakeholders
  • Make infrastructure complexity transparent to both internal teams and customers, ensuring a consistently excellent client experience
  • Implement, track, and evolve service performance measures such as SLAs, SLOs, and SLIs
  • Anticipate risks related to service availability, capacity, performance regressions, and security vulnerabilities
  • Drive continuous improvement, including leading and facilitating Root Cause Analysis (RCA) activities
  • Ensure timely execution of deployments, upgrades, maintenance activities, and change requests
  • Anticipate workload, plan deliverables, and ensure qualification/validation of upcoming tasks
  • Collaborate closely with engineering to improve platform components, automation, and operational processes
What we offer
  • Great career opportunities around the world
  • Truly collaborative environment with supportive leadership
  • Cutting edge technologies (AI, Cloud, Cybersecurity...)
  • Talented and passionate team members
  • Fun working environment
Employment type: Full-time