Senior Site Reliability Engineer, Managed AI Job at Crusoe (San Francisco, Sunnyvale)

Senior Site Reliability Engineer - Fleet Reliability

Lambda, The Superintelligence Cloud, is a leader in AI cloud infrastructure serv...

Location

United States, San Francisco

Salary:

230000.00 - 345000.00 USD / Year

Lambda

Expiration Date

Until further notice

Requirements

7+ years of experience in Site Reliability Engineering, DevOps, or a similar role
Strong understanding of modern AI infrastructure, from GPU architectures to hardware performance optimization
Strong understanding of Linux-based systems in a distributed environment
Solid understanding of Python and Go, with experience working with SWE teams to improve internal tooling
Experience with monitoring and alerting tools (e.g., Prometheus, Grafana, SumoLogic)
Proficiency in automation and configuration management tools (e.g., Ansible, Terraform)
Familiarity with cloud platforms (e.g., OCI, AWS, GCP, Azure)
Excellent problem-solving and troubleshooting skills
Strong communication and collaboration skills
Passion for continuous improvement and innovation

Job Responsibility

Define Fleet Health metrics and indicators to objectively measure and improve system availability
Collaborate with the observability team on comprehensive monitoring and alerting systems to proactively predict, detect and respond to issues or anomalies
Create runbooks and automated remediations for common failure scenarios
Build in automation and auditing to ensure compliance and improve efficiency and productivity
Participate in on-call rotations and provide support for incident response and resolution
Implement and integrate logging and metrics across platforms such as Datadog, Prometheus, OpenTelemetry, Grafana, and SumoLogic

What we offer

Generous cash & equity compensation
Health, dental, and vision coverage for you and your dependents
Wellness and commuter stipends for select roles
401k Plan with 2% company match (USA employees)
Flexible paid time off plan

Fulltime

Senior Site Reliability Engineer

We are looking for an experienced Site Reliability Engineer to join our growing ...

Location

Italy, Milan

Salary:

50000.00 - 70000.00 EUR / Year

iGenius

Expiration Date

Until further notice

Requirements

Bachelor's or Master's degree in Computer Science, Computer Engineering, Electrical Engineering, or a related field
At least 6 years of experience as a Site Reliability Engineer or in similar roles
Strong experience with observability and monitoring systems such as Prometheus, Thanos, Grafana, and OpenTelemetry
Experience with low-level system instrumentation and performance visibility using technologies such as eBPF
Experience with security monitoring and threat detection tools such as Zeek, Wazuh, or equivalent SIEM / security observability platforms
Strong experience with containerized and cloud-native environments, particularly Kubernetes
Strong software development skills, particularly in Python, with the ability to build automation, integrations, and custom tooling
Experience integrating heterogeneous infrastructure systems across multiple vendors, APIs, and evolving tool ecosystems
Familiarity with modern infrastructure automation and emerging agent-based frameworks such as MCP / A2A (or equivalent technologies)
Exposure to digital twin technologies and simulation platforms such as NVIDIA Omniverse or equivalent

Job Responsibility

Design and implement observability and control mechanisms that extract operational data from infrastructure and feed it into automated systems to enable continuous optimization, including key system budgets such as power, cooling, and service-level and security-level objectives
Actively guard and maintain these operational budgets as part of day-to-day system reliability and performance management
Contribute to operational excellence through blameless post-mortem analysis and structured incident learning, ensuring continuous improvement of system behavior and resilience
Work closely with Platform Engineering in a shared cybersecurity model, where SRE focuses on detection and monitoring, while Platform Engineering ensures the secure design and operation of the underlying infrastructure

What we offer

Learning Friday
Training budget for books, online courses or other training materials
Smart Working (remote work opportunities)
Opportunity to receive company equity
Stock options

Fulltime

Senior Site Reliability Engineer

The Senior Site Reliability Engineer establishes and maintains the infrastructur...

Location

United Kingdom; United States; Canada

Salary:

Not provided

Mozilla

Expiration Date

Until further notice

Requirements

7+ years of experience in infrastructure, platform engineering, or site reliability roles, including hands-on production Kubernetes experience in workload operations, troubleshooting, and cluster management
Hands-on experience with infrastructure-as-code on AWS using Terraform, OpenTofu, or Pulumi
Security awareness in day-to-day infrastructure work: identity, least privilege, secrets hygiene, and network controls
Demonstrated ownership mindset with the ability to proactively identify issues, drive work to completion, and communicate risks early
Excellent async written communication skills; comfortable working with a geographically distributed team
Ability to collaborate effectively with software engineers and non-engineering stakeholders to improve platform reliability and operational efficiency
Ability to learn, evaluate, and responsibly use emerging technologies, including AI-enabled tools, to improve work processes

Job Responsibility

Operate and evolve our EKS-based Kubernetes platform, supporting service migrations, platform improvements, and reliability initiatives
Design and develop CI/CD systems supporting websites, services, and Thunderbird desktop releases, contributing to pipeline reliability and OIDC-based authentication across GitHub Actions workflows
Write and maintain infrastructure in Pulumi and/or Terraform/OpenTofu across multiple AWS accounts
Operate and evolve our observability stack (VictoriaMetrics, VictoriaLogs, Grafana, Vector) and partner with engineering teams to incorporate instrumentation and monitoring into service design
Apply security-conscious infrastructure practices, including least-privilege IAM, secrets management via AWS Secrets Manager and External Secrets Operator, and network segmentation
Diagnose and debug production incidents; drive root-cause analysis and post-incident improvements to prevent recurring problems
Participate in on-call rotation and collaborate with SDEs and fellow SREs to ship, maintain, and monitor new builds and support service onboarding
Contribute to runbooks, architecture documentation, and team processes

What we offer

Fully remote work & schedule flexibility
Company-provided laptop
Annual bonus program
Monthly remote work stipend
Annual professional development stipend
Industry conferences
Company all-hands and team gatherings
24 days PTO per year (prorated)
Birthday
Year-end company shutdown

Fulltime

Senior Site Reliability Engineer

We're hiring a Senior Site Reliability Engineer to lead reliability strategy and...

Location

India, Chennai

Salary:

Not provided

Zuora

Expiration Date

Until further notice

Requirements

8+ years of hands-on experience in Site Reliability Engineering, DevOps, or large-scale production operations.
Advanced expertise in AWS, including architecture design across services such as EC2, EKS, VPC, IAM, RDS, S3, and CloudWatch.
Deep experience with Infrastructure-as-Code using Terraform, including complex modules, state management, and governance.
Strong programming and automation skills using Python and Shell; experience building production-grade automation systems.
Expert-level Linux systems knowledge, including performance tuning, security hardening, and deep troubleshooting.
Proven experience operating distributed systems and data streaming platforms such as Kafka in high-throughput environments.
Demonstrated ability to work independently on complex, ambiguous problems with broad organizational impact.
Proven technical leadership experience driving large, cross-team reliability or infrastructure initiatives, including setting technical direction, influencing design decisions, and mentoring engineers to deliver measurable outcomes at scale.
Practical experience designing or implementing AI/ML-driven automation in operations, reliability, or platform engineering.

Job Responsibility

Define and evolve SLOs, SLIs, and resilience patterns
Build AI-driven automation for detection, remediation, and forecasting
Lead cloud infrastructure and Kubernetes platforms
Drive incident response and operational excellence
Mentor engineers and influence org-wide reliability practices

What we offer

Competitive compensation, variable bonus and performance-based reward opportunities, and retirement programs
Medical, dental, and vision insurance
Generous, flexible time off, plus paid holidays, wellness days, and a company-wide year-end break
Paid parental leave (including fully paid leave for eligible ZEOs, subject to local policy)
Learning & development stipend to support ongoing growth
Opportunities to volunteer and give back, including charitable donation matching where available
Mental wellbeing resources and support

Fulltime

Senior Site Reliability Engineer

The Sr Site Reliability Engineer will architect, develop, and maintain cloud env...

Location

United States

Salary:

55.00 - 68.00 USD / Hour

Intertech (Minnesota)

Expiration Date

Until further notice

Requirements

4+ years of experience working in an SRE or cloud platform role
Experience leveraging AI tools in the software development (or product) lifecycle in order to improve quality and efficiency
Expert knowledge of a cloud service provider
Expert knowledge of and hands-on production experience with Kubernetes (bare metal or managed) cluster setup and management required
Experience with infrastructure as code (IaC) tools such as Terraform or Pulumi
Experience with Kubernetes deployment tools such as Helm, ArgoCD, or Flux
Strong awareness of networking and internet protocols
Understanding of identity and access management (IAM)
Experience supporting infrastructure in production cloud environments
Knowledge of encryption and Public Key Infrastructure (PKI); understanding of OWASP

Job Responsibility

Build, maintain, and operate IaaS and PaaS infrastructure in Azure commercial and government clouds
Work closely with dev teams to identify and measure SLOs, SLAs and SLIs
Act as a strong contributor to the development of platform services, including architecture, provisioning, configuration, deployment, and support
Perform integrations with central logging, metrics dashboards, instrumentation, incident monitoring and management
Build/integrate/administer systems and tools that enable engineering teams to observe their applications in production with autonomy (Dashboards, APMs)
Support software and/or cloud infrastructure on an on-call rotation basis
Assist with identification and remediation of technical problems at the root cause by continuously implementing automation, self-healing, and real-time monitoring for production systems
Maintain and improve operational tooling and frameworks; build frameworks that test the performance and resiliency of our platform services and tools
Automate alerts for metrics on performance, cost, vulnerabilities, risk, compliance violations
Improve processes and champion automation of any manual items around support

Senior Site Reliability Engineer

Plaud is building the world's most trusted AI work companion for professionals t...

Location

Singapore, Singapore

Salary:

Not provided

Plaud

Expiration Date

Until further notice

Requirements

8+ years in SRE, Infra, or Platform Engineering roles
Strong experience with cloud platforms (AWS/GCP/Azure)
Hands-on with Kubernetes and distributed systems
Experience in on-call rotation and incident management
Proficient in at least one programming language (Go, Python, Java)
Strong written and verbal communication

Job Responsibility

Ensure reliability and performance of Plaud.ai’s AI products at scale
Lead incident response and continuous reliability improvement
Build automation to reduce toil and improve system resilience
Partner with product and engineering teams on reliability design
Improve observability and operational maturity

What we offer

Meaningful Ownership: An Employee Stock Ownership Plan (ESOP)
High-Impact Environment
Cutting-Edge AI Tools for Productivity
Best-in-Class Equipment
Team & Culture: Annual company offsites, team events
Medical & Insurance Coverage: Comprehensive benefits
Medical insurance and WICA coverage for all full-time employees

Fulltime

Senior Site Reliability Engineer, Storage

At Crusoe Energy Systems, our Site Reliability Engineering (SRE) team plays a mi...

Location

United States, San Francisco, Sunnyvale

Salary:

166000.00 - 201000.00 USD / Year

Crusoe

Expiration Date

Until further notice

Requirements

5+ years of professional experience in SRE, systems, or storage engineering
Hands-on experience with distributed storage systems (e.g., Ceph, GlusterFS, OpenEBS) and deep understanding of object, block, and file storage paradigms
Proficiency in a programming language such as Python, Go, Java, or C
Experience with Infrastructure as Code and deployment tooling such as Terraform, Ansible, or Puppet
Deep knowledge of Linux internals with a focus on I/O subsystems, memory management, and storage scheduling
Familiarity with storage protocols like NFS, SMB, iSCSI, or NVMe-oF
Strong experience working with containerized workloads and orchestration platforms (e.g., Kubernetes, Docker)
Excellent incident response, troubleshooting, and documentation practices
Experience with building and operating managed services at scale such as object, file and block storage (AWS, GCP, Azure)
Excellent communication skills

Job Responsibility

Build automation and self-healing tools to monitor and maintain Crusoe’s distributed cloud storage infrastructure
Drive reliability initiatives focused on data replication, encryption, backup and restore strategies, and robust failover mechanisms
Help implement and maintain high-performance NVMe- and SSD-backed volumes that support large-scale AI compute clusters
Support user-facing storage services with a focus on availability, performance tuning, and adherence to error budgets
Investigate and resolve storage-related incidents using deep telemetry, logs, and performance profiling
Partner with hardware and kernel teams to diagnose low-level I/O issues and optimize I/O paths, cache policies, and file systems
Contribute to the architecture of fault-tolerant, scalable storage backends tailored for AI-first cloud environments

What we offer

Restricted Stock Units in a fast-growing, well-funded technology company
Health insurance package options that include HDHP and PPO, vision, and dental for you and your dependents
Employer contributions to HSA accounts
Paid Parental Leave
Paid life insurance, short-term and long-term disability
Teladoc
401(k) with a 100% match up to 4% of salary
Generous paid time off and holiday schedule
Cell phone reimbursement
Tuition reimbursement

Fulltime

Senior Site Reliability Engineer

Zuora’s Cloud Engineering organization owns the reliability, scalability, and op...

Location

India, Chennai

Salary:

Not provided

Zuora

Expiration Date

Until further notice

Requirements

8+ years of hands-on experience in Site Reliability Engineering, DevOps, or large-scale production operations
Advanced expertise in AWS, including architecture design across services such as EC2, EKS, VPC, IAM, RDS, S3, and CloudWatch
Deep experience with Infrastructure-as-Code using Terraform, including complex modules, state management, and governance
Strong programming and automation skills using Python and Shell; experience building production-grade automation systems
Expert-level Linux systems knowledge, including performance tuning, security hardening, and deep troubleshooting
Proven experience operating distributed systems and data streaming platforms such as Kafka in high-throughput environments
Demonstrated ability to work independently on complex, ambiguous problems with broad organizational impact
Proven technical leadership experience driving large, cross-team reliability or infrastructure initiatives, including setting technical direction, influencing design decisions, and mentoring engineers to deliver measurable outcomes at scale
Practical experience designing or implementing AI/ML-driven automation in operations, reliability, or platform engineering

Job Responsibility

Reliability Architecture & Platform Strategy: Own and evolve the reliability architecture of large-scale, distributed SaaS systems by defining SLOs, SLIs, error budgets, and resilience patterns aligned with business objectives
AI-Driven Automation & Intelligent Operations: Design, build, and operationalize AI-powered automation to reduce operational toil and improve system stability
Advanced Cloud & Infrastructure Engineering: Lead the design and operation of complex AWS-based infrastructure and Kubernetes platforms, optimizing for availability, security, and cost efficiency
Incident Leadership & Operational Excellence: Act as a technical leader during high-severity production incidents, driving structured response, decision-making, and recovery
Technical Leadership & Cross-Functional Influence: Influence reliability outcomes beyond the SRE team by partnering closely with Engineering, Product, and Security stakeholders

What we offer

Competitive compensation, variable bonus and performance reward opportunities, and retirement programs
Medical Insurance
Generous, flexible time off
Paid holidays, “wellness” days, and a company-wide end-of-year break
6 months fully paid parental leave
Learning & Development stipend
Opportunities to volunteer and give back, including charitable donation match
Free resources and support for your mental wellbeing

Fulltime
