Staff Site Reliability Engineer, Managed AI

Crusoe

Location:
United States, San Francisco

Contract Type:
Not provided

Salary:

204000.00 - 247000.00 USD / Year

Job Description:

At Crusoe, our Site Reliability Engineering team ensures the reliability and scalability of Crusoe’s AI-optimized cloud platform. We’re looking for a Staff Site Reliability Engineer with a strong background in distributed systems and hands-on experience with large language models to help us build and operate managed AI services at scale. This role is central to delivering highly available, performant, and cost-efficient AI infrastructure that powers compute-intensive, latency-sensitive workloads for our customers.

Job Responsibility:

  • Design and operate reliable managed AI services with a focus on serving and scaling LLM workloads
  • Build automation and reliability tooling to support distributed AI pipelines and inference services
  • Define, measure, and improve SLIs/SLOs across AI workloads to ensure performance and reliability targets are met
  • Collaborate with AI, platform, and infrastructure teams to optimize large-scale training and inference clusters
  • Automate observability by building telemetry and performance tuning strategies for latency-sensitive AI services
  • Investigate and resolve reliability issues in distributed AI systems using telemetry, logs, and profiling
  • Contribute to the architecture of next-generation distributed systems purpose-built for AI-first environments

Requirements:

  • Strong software engineering background — experience building production-grade systems beyond scripting or Bash
  • Demonstrated experience in distributed systems design and implementation
  • Hands-on work with large language models (LLMs) or AI/ML infrastructure
  • SRE mindset and experience (whether or not under the SRE title), including:
      • Defining and measuring SLIs/SLOs
      • Building monitoring and observability systems
      • Driving performance and reliability improvements
      • Designing fault-tolerant systems and automated testing strategies
  • Proficiency in at least one modern programming language (Python, Go, Java, C++)
  • Familiarity with Kubernetes or container orchestration platforms
  • Strong collaboration and communication skills
  • Ability to thrive in a fast-paced, mission-driven environment

Nice to have:

  • Experience scaling inference or training workloads for LLMs

What we offer:
  • Restricted Stock Units in a fast-growing, well-funded technology company
  • Health insurance package options that include HDHP and PPO, vision, and dental for you and your dependents
  • Employer contributions to HSA accounts
  • Paid Parental Leave
  • Paid life insurance, short-term and long-term disability
  • Teladoc
  • 401(k) with a 100% match up to 4% of salary
  • Generous paid time off and holiday schedule
  • Cell phone reimbursement
  • Tuition reimbursement
  • Subscription to the Calm app
  • MetLife Legal
  • Company-paid commuter benefit: $300 per month

Additional Information:

Job Posted:
February 21, 2026

Employment Type:
Full-time
Work Type:
On-site work

Similar Jobs for Staff Site Reliability Engineer, Managed AI

Staff Platform Engineer

Join our dynamic team as a Compute Platform Engineer and play a pivotal role in ...
Location:
United States, Mountain View, California
Salary:
180000.00 - 280000.00 USD / Year
Inworld AI
Expiration Date:
Until further notice
Requirements:
  • 7 years of experience in software engineering
  • 5 years of experience with infrastructure-as-code
  • Proficiency in managing Kubernetes clusters and applications, including creating Kustomize manifests/Helm charts for new applications
  • Experience in creating and maintaining CI/CD pipelines for both applications and infrastructure deployments (using tools like Terraform/Terragrunt, ArgoCD, GitHub Actions, Ansible, etc.)
  • Deep knowledge of at least one major cloud provider (Google Cloud Platform, Microsoft Azure, Oracle Cloud)
  • Proficient in at least one backend programming/scripting language, such as Golang, Python, or Bash
  • Candidates must be based in the SF Bay Area or willing to relocate (you will be working on-site in our South Bay office a few days a week)
Job Responsibility:
  • Work closely with backend and ML engineering teams to design, deploy, and maintain reliable, high-performance, and secure cloud infrastructure for our AI engine and Studio
  • Facilitate a "you build it, you run it" culture by providing the necessary tools and processes for monitoring the reliability, availability, and performance of services
  • Manage CI/CD pipelines to ensure smooth and efficient code integration and deployment
  • Identify and implement opportunities to enhance engineering speed and efficiency
  • Conduct root cause analysis to identify critical issues and develop automated solutions to prevent recurrence
  • Develop and share best practices to improve automation and efficiency across our engineering teams
What we offer:
  • Equity and benefits
Employment Type:
Full-time

Staff Software Engineer

As a Staff Forward Deployed Engineer (FDE) at Invisible, you'll lead high-impact...
Location:
United States, Austin; New York; San Francisco Bay Area; Washington DC–Baltimore
Salary:
213000.00 - 300000.00 USD / Year
Invisible Technologies
Expiration Date:
Until further notice
Requirements:
  • 8+ years of software engineering experience, including significant time spent building data, ML, or backend systems
  • Deep proficiency in Python with hands-on experience using Hugging Face, LangChain, OpenAI, Pinecone, and related ecosystems
  • Skilled in full-stack and API-based deployment patterns, including Docker, FastAPI, Kubernetes, and cloud environments (GCP, AWS)
  • Experienced with workflow orchestration libraries, pub/sub systems (Kafka), and schema governance
  • Expertise in data governance and operations, including Unity Catalog and policy management, cluster/job orchestration, data contracts and quality enforcement, Delta/ETL pipelines, and replay processes
  • Strong product and system design instincts — you understand business needs and how to translate them into technical architecture
  • Experience building usable systems from messy data and ambiguous requirements
  • Excellent communication and client-facing skills; you’ve led conversations with technical and non-technical stakeholders alike
  • Proven experience owning projects from scoping through deployment in ambiguous, high-stakes environments
Job Responsibility:
  • Partner with delivery and executive stakeholders to scope, design, and lead implementation of AI-driven solutions
  • Identify transformational opportunities in messy, ambiguous workflows and turn them into repeatable systems
  • Lead architecture design and trade-off discussions across performance, scalability, cost, and reliability
  • Own projects from first discovery call through full deployment — including client-facing delivery, internal coordination, and post-launch iteration
  • Build shared infrastructure, reusable components, and internal playbooks to level up the team
  • Coach and mentor mid-level engineers and help shape the culture of forward-deployed AI engineering at Invisible
What we offer:
  • Bonus
  • Equity
  • Benefits
Employment Type:
Full-time

Staff Software Engineer – Forward Deployed

We are seeking a skilled Software Engineer who will design, build, and maintain ...
Location:
China, Shanghai; Dalian; Wuhan
Salary:
Not provided
Pfizer
Expiration Date:
Until further notice
Requirements:
  • Bachelor's degree in Computer Science, Engineering, or related field with 8-12 years of relevant experience
  • AI-Augmented Development: optimize AI tool usage, train engineers on AI-augmented workflows, evaluate new AI development tools, establish practices that balance AI speed with verification rigor
  • Business Immersion: rapidly acquire domain expertise, translate between business and engineering, mentor engineers on immersion
  • Data Integration: navigate complex enterprise data landscapes, build relationships to gain data access, handle undocumented schemas, build robust integration solutions, mentor engineers on data integration
  • Full-Stack Development: build complete applications rapidly across any technology stack, select the right tools, balance technical debt with delivery speed, mentor engineers on full-stack development
  • Multi-Audience Communication: influence through communication at all levels, handle difficult conversations skillfully, train engineers on effective communication, represent teams across the function
  • Problem Discovery: seek out undefined problems, embed with users to discover latent needs, coach engineers on problem discovery techniques, turn ambiguity into clear problem statements
  • Rapid Prototyping & Validation: lead rapid delivery initiatives, coach on prototype-first approaches, establish trust through consistent fast delivery, define clear criteria for prototype-to-production transitions
  • Site Reliability Engineering: define reliability standards, drive post-incident improvements systematically, design capacity planning processes, mentor engineers on SRE practices
  • Stakeholder Management: influence senior stakeholders, manage complex stakeholder landscapes with competing agendas, build trust rapidly with new stakeholders, shield teams from organizational friction
Job Responsibility:
  • Delivery: Lead technical delivery of complex projects across multiple teams, unblock others through hands-on contributions, ensure engineering quality
  • AI: Design AI-augmented engineering workflows for your area, evaluate new AI tools, train engineers on effective AI usage, balance speed with verification
  • People: Coach multiple engineers on career growth, lead hiring for technical roles across your area, shape team technical culture
  • Business: Drive business outcomes through technical solutions across your area, influence product roadmaps, partner effectively with business stakeholders
  • Process: Drive process efficiency within your team, coordinate cross-functional technical work, lead retrospectives
  • Documentation: Design documentation strategies for your projects, ensure knowledge persists beyond individuals, write specifications that enable effective collaboration
Employment Type:
Full-time

Staff Site Reliability Engineer

Our Site Reliability Engineering team is growing, and we are looking for a highl...
Location:
United States
Salary:
150000.00 - 225000.00 USD / Year
AlphaSense
Expiration Date:
Until further notice
Requirements:
  • 8+ years of experience in Site Reliability Engineering, DevOps, or a similar role
  • At least 3 years in a Senior+ SRE position
  • Strong background in running production SaaS systems at scale
  • Proficiency in at least one programming/scripting language (Python, Go, or similar)
  • Hands-on expertise with cloud platforms (AWS, GCP, or Azure) and Kubernetes
  • Deep understanding of networking fundamentals (TCP/IP, DNS, HTTP/S, load balancing)
  • Experience with monitoring & alerting (Prometheus, Grafana, Datadog, ELK)
  • Familiarity with advanced observability (OTEL, continuous profiling)
  • Proven incident management experience, including leading high-severity incidents and postmortems
  • Strong troubleshooting skills across the full stack
Job Responsibility:
  • Architect Reliability Paved Paths: Build frameworks and self-service tooling that let teams own the reliability of their services
  • Lead AI-Driven Reliability: Drive our AIOps strategy — automating diagnostics, remediation, and proactive failure prevention
  • Champion Reliability Culture: Embed SRE practices across engineering via design reviews, production readiness, and operational standards
  • Incident Leadership: Act as Incident Commander during critical events, modeling operational excellence, and ensuring blameless postmortems lead to lasting improvements
  • Advance Observability: Deliver end-to-end monitoring, tracing, and profiling (Prometheus, Grafana, OTEL, Continuous Profiling) to optimize performance proactively
  • Mentor & Multiply: Elevate engineers across SRE and product teams through mentorship, technical guidance, and knowledge sharing
What we offer:
  • Equity
  • Generous benefits program
Employment Type:
Full-time

Staff Site Reliability Engineer

Our Site Reliability Engineering team is growing, and we are looking for a highl...
Location:
Finland, Helsinki
Salary:
Not provided
AlphaSense
Expiration Date:
Until further notice
Requirements:
  • 8+ years of experience in Site Reliability Engineering, DevOps, or a similar role
  • At least 3 of those years operating in a Senior+ SRE position
  • Strong background in running production SaaS systems at scale
  • Proficiency in at least one programming/scripting language (Python, Go, or similar)
  • Hands-on expertise with cloud platforms (AWS, GCP, or Azure) and Kubernetes
  • Deep understanding of networking fundamentals (TCP/IP, DNS, HTTP/S, load balancing)
  • Experience with monitoring & alerting (Prometheus, Grafana, Datadog, ELK)
  • Familiarity with advanced observability (OTEL, continuous profiling)
  • Proven incident management experience, including leading high-severity incidents and postmortems
  • Strong troubleshooting skills across the full stack
Job Responsibility:
  • Architect Reliability Paved Paths: Build frameworks and self-service tooling that let teams own the reliability of their services
  • Lead AI-Driven Reliability: Drive our AIOps strategy — automating diagnostics, remediation, and proactive failure prevention
  • Champion Reliability Culture: Embed SRE practices across engineering via design reviews, production readiness, and operational standards
  • Incident Leadership: Act as Incident Commander during critical events, modeling operational excellence, and ensuring blameless postmortems lead to lasting improvements
  • Advance Observability: Deliver end-to-end monitoring, tracing, and profiling (Prometheus, Grafana, OTEL, Continuous Profiling) to optimize performance proactively
  • Mentor & Multiply: Elevate engineers across SRE and product teams through mentorship, technical guidance, and knowledge sharing

Staff Site Reliability Engineer

Our Site Reliability Engineering team is growing, and we are looking for a highl...
Location:
India, Bengaluru
Salary:
Not provided
AlphaSense
Expiration Date:
Until further notice
Requirements:
  • 8+ years of experience in Site Reliability Engineering, DevOps, or a similar role
  • At least 3 of those years operating in a Senior+ SRE position
  • Strong background in running production SaaS systems at scale
  • Proficiency in at least one programming/scripting language (Python, Go, or similar)
  • Hands-on expertise with cloud platforms (AWS, GCP, or Azure) and Kubernetes
  • Deep understanding of networking fundamentals (TCP/IP, DNS, HTTP/S, load balancing)
  • Experience with monitoring & alerting (Prometheus, Grafana, Datadog, ELK)
  • Familiarity with advanced observability (OTEL, continuous profiling)
  • Proven incident management experience, including leading high-severity incidents and postmortems
  • Strong troubleshooting skills across the full stack
Job Responsibility:
  • Architect Reliability Paved Paths: Build frameworks and self-service tooling that let teams own the reliability of their services in a “You Build It, You Run It” culture
  • Lead AI-Driven Reliability: Drive our AIOps strategy — automating diagnostics, remediation, and proactive failure prevention
  • Champion Reliability Culture: Embed SRE practices across engineering via design reviews, production readiness, and operational standards
  • Incident Leadership: Act as Incident Commander during critical events, modeling operational excellence, and ensuring blameless postmortems lead to lasting improvements
  • Advance Observability: Deliver end-to-end monitoring, tracing, and profiling (Prometheus, Grafana, OTEL, Continuous Profiling) to optimize performance proactively
  • Mentor & Multiply: Elevate engineers across SRE and product teams through mentorship, technical guidance, and knowledge sharing

Staff Site Reliability Engineer

Our Site Reliability Engineering team is growing, and we are looking for a highl...
Location:
India, Delhi
Salary:
Not provided
AlphaSense
Expiration Date:
Until further notice
Requirements:
  • 8+ years of experience in Site Reliability Engineering, DevOps, or a similar role
  • At least 3 of those years operating in a Senior+ SRE position
  • Strong background in running production SaaS systems at scale
  • Proficiency in at least one programming/scripting language (Python, Go, or similar)
  • Hands-on expertise with cloud platforms (AWS, GCP, or Azure) and Kubernetes
  • Deep understanding of networking fundamentals (TCP/IP, DNS, HTTP/S, load balancing)
  • Experience with monitoring & alerting (Prometheus, Grafana, Datadog, ELK)
  • Familiarity with advanced observability (OTEL, continuous profiling)
  • Proven incident management experience, including leading high-severity incidents and postmortems
  • Strong troubleshooting skills across the full stack
Job Responsibility:
  • Architect Reliability Paved Paths: Build frameworks and self-service tooling that let teams own the reliability of their services in a “You Build It, You Run It” culture
  • Lead AI-Driven Reliability: Drive our AIOps strategy — automating diagnostics, remediation, and proactive failure prevention
  • Champion Reliability Culture: Embed SRE practices across engineering via design reviews, production readiness, and operational standards
  • Incident Leadership: Act as Incident Commander during critical events, modeling operational excellence, and ensuring blameless postmortems lead to lasting improvements
  • Advance Observability: Deliver end-to-end monitoring, tracing, and profiling (Prometheus, Grafana, OTEL, Continuous Profiling) to optimize performance proactively
  • Mentor & Multiply: Elevate engineers across SRE and product teams through mentorship, technical guidance, and knowledge sharing

Staff Site Reliability Engineer

Our Site Reliability Engineering team is growing, and we are looking for a highl...
Location:
India, Pune
Salary:
Not provided
AlphaSense
Expiration Date:
Until further notice
Requirements:
  • 8+ years of experience in Site Reliability Engineering, DevOps, or a similar role
  • At least 3 of those years operating in a Senior+ SRE position
  • Strong background in running production SaaS systems at scale
  • Proficiency in at least one programming/scripting language (Python, Go, or similar)
  • Hands-on expertise with cloud platforms (AWS, GCP, or Azure) and Kubernetes
  • Deep understanding of networking fundamentals (TCP/IP, DNS, HTTP/S, load balancing)
  • Experience with monitoring & alerting (Prometheus, Grafana, Datadog, ELK)
  • Familiarity with advanced observability (OTEL, continuous profiling)
  • Proven incident management experience, including leading high-severity incidents and postmortems
  • Strong troubleshooting skills across the full stack
Job Responsibility:
  • Architect Reliability Paved Paths: Build frameworks and self-service tooling that let teams own the reliability of their services in a “You Build It, You Run It” culture
  • Lead AI-Driven Reliability: Drive our AIOps strategy — automating diagnostics, remediation, and proactive failure prevention
  • Champion Reliability Culture: Embed SRE practices across engineering via design reviews, production readiness, and operational standards
  • Incident Leadership: Act as Incident Commander during critical events, modeling operational excellence, and ensuring blameless postmortems lead to lasting improvements
  • Advance Observability: Deliver end-to-end monitoring, tracing, and profiling (Prometheus, Grafana, OTEL, Continuous Profiling) to optimize performance proactively
  • Mentor & Multiply: Elevate engineers across SRE and product teams through mentorship, technical guidance, and knowledge sharing