Senior Site Reliability Engineer, Managed AI

Crusoe

Location:
United States, San Francisco, Sunnyvale

Contract Type:
Not provided

Salary:

172000.00 - 209000.00 USD / Year

Job Description:

At Crusoe, our Site Reliability Engineering team ensures the reliability and scalability of Crusoe’s AI-optimized cloud platform. We’re looking for a Senior Site Reliability Engineer with a strong background in distributed systems and hands-on experience with large language models to help us build and operate managed AI services at scale. This role is central to delivering highly available, performant, and cost-efficient AI infrastructure that powers compute-intensive, latency-sensitive workloads for our customers.

Job Responsibility:

  • Design and operate reliable managed AI services with a focus on serving and scaling LLM workloads
  • Build automation and reliability tooling to support distributed AI pipelines and inference services
  • Define, measure, and improve SLIs/SLOs across AI workloads to ensure performance and reliability targets are met
  • Collaborate with AI, platform, and infrastructure teams to optimize large-scale training and inference clusters
  • Automate observability by building telemetry and performance tuning strategies for latency-sensitive AI services
  • Investigate and resolve reliability issues in distributed AI systems using telemetry, logs, and profiling
  • Contribute to the architecture of next-generation distributed systems purpose-built for AI-first environments

Requirements:

  • Strong software engineering background — experience building production-grade systems beyond scripting or Bash
  • Demonstrated experience in distributed systems design and implementation
  • Hands-on work with large language models (LLMs) or AI/ML infrastructure
  • SRE mindset and experience (whether or not under the SRE title)
  • Proficiency in at least one modern programming language (Python, Go, Java, C++)
  • Familiarity with Kubernetes or container orchestration platforms
  • Strong collaboration and communication skills
  • Ability to thrive in a fast-paced, mission-driven environment

Nice to have:

  • Experience scaling inference or training workloads for LLMs

What we offer:
  • Restricted Stock Units in a fast-growing, well-funded technology company
  • Health insurance package options that include HDHP and PPO, vision, and dental for you and your dependents
  • Employer contributions to HSA accounts
  • Paid Parental Leave
  • Paid life insurance, short-term and long-term disability
  • Teladoc
  • 401(k) with a 100% match up to 4% of salary
  • Generous paid time off and holiday schedule
  • Cell phone reimbursement
  • Tuition reimbursement
  • Subscription to the Calm app
  • MetLife Legal
  • Company-paid commuter benefit: $300 per month

Additional Information:

Job Posted:
February 21, 2026

Employment Type:
Fulltime
Work Type:
On-site work

Similar Jobs for Senior Site Reliability Engineer, Managed AI

Senior AI Engineer

We are seeking an innovative AI Engineer to join a brand new team focused on pro...
Location:
United Kingdom, London
Salary:
Not provided
Citi
Expiration Date:
Until further notice
Requirements:
  • Proven experience as an AI Engineer with a significant delivery history
  • Strong expertise in multiple programming languages & frameworks
  • Proven experience using quantitative testing practices in AI/ML to make actionable Go/No-Go decisions on delivering software to production
  • Demonstrated expertise in developing on a range of architectures, ideally up to and including container-based micro-services with focus on scalability, reliability, maintainability, and high performance
  • Good understanding of SQL and NoSQL databases
  • Excellent communication and collaboration skills
  • A growth mindset and willingness to learn and adapt in a fast-paced environment
  • Passion for site reliability engineering and its impact on product development
  • Staying current with the latest technologies, such as Generative AI, and keen to put them into practice.
Job Responsibility:
  • Understand the landscape, tooling and procedures used by developers at Citi and look for opportunities to reduce toil and aid simplification using Gen AI based solutions
  • Apply classic AI and novel Gen AI evaluation methodology to raise the quality and reliability bar for the software that you will deliver, as well as to manage and mitigate risks that are specific/inherent to this field
  • Advise on evaluation metrics, devise and implement Quantitative Testing Plans, and help evolve the existing approaches to AI evaluation
  • Work with a wide variety of Citi technology teams and help them drive towards everything-as-code and a codified controls environment
  • Collaborate with product and engineering teams to design, build and maintain scalable and reliable web applications and services
  • Be hands-on with coding and software design to ensure adherence to high quality standards and best practices
  • Mentor and nurture other engineers to help them grow their skills and expertise
  • Support and drive cultural change, including instigating critical thinking about controls and processes and encouraging a culture of continuous improvement.
What we offer:
  • 27 days annual leave (plus bank holidays)
  • A discretionary annual performance-related bonus
  • Private Medical Care & Life Insurance
  • Employee Assistance Program
  • Pension Plan
  • Paid Parental Leave
  • Special discounts for employees, family, and friends
  • Access to an array of learning and development resources.
Employment Type:
Fulltime

Senior Site Reliability Engineer

We are seeking an experienced Senior Site Reliability Engineer (L3) to join our ...
Location:
India, Chennai
Salary:
Not provided
Arcadia
Expiration Date:
Until further notice
Requirements:
  • Bachelor’s degree in Computer Science, Engineering, or equivalent practical experience
  • 8–10+ years of experience in SRE/DevOps/Cloud Engineering, with deep hands-on exposure to AWS and Kubernetes
  • Strong hands-on experience with: Terraform & Infrastructure as Code
  • AWS core services (EKS, IAM, RDS, EC2, VPC, CloudWatch, CloudTrail, GuardDuty)
  • Jenkins + Groovy, GitHub Actions, ArgoCD, FluxCD
  • Kubernetes troubleshooting and operations
  • Prometheus/Grafana/Datadog observability stacks
  • Proven ability to operate in high-scale, high-uptime, multi-environment production systems
  • Experience building automation via Python/Bash and reducing operational toil
  • Strong understanding of incident management, root cause analysis, and reliability engineering principles
Job Responsibility:
  • Design, build, and maintain AWS infrastructure (EKS, VPC, RDS, IAM, CloudWatch, CloudTrail, GuardDuty, Load Balancers, S3, CloudFront) using Terraform and CloudFormation
  • Lead all aspects of Kubernetes operations including cluster upgrades, performance tuning, CNI troubleshooting, workload scaling, Helm chart packaging, and GitOps deployments
  • Own and evolve our CI/CD ecosystem across Jenkins (Groovy scripting), GitHub Actions, AWS CodePipeline, ArgoCD, and FluxCD
  • Improve platform reliability by reducing operational toil through automation, scripting (Python/Bash), and proactive system hardening
  • Implement and enhance observability across Prometheus, Grafana, Loki, Tempo, Datadog, and CloudWatch—ensuring actionable alerting, dashboards, and metrics alignment with SLO/SLIs
  • Drive FinOps initiatives, identifying cost inefficiencies and working with engineering teams to implement best practices, tagging standards, budgeting, and resource right-sizing
  • Manage database operations across MySQL and PostgreSQL including backups, performance tuning, replication, and operational runbooks
  • Maintain and improve secret management using Vault, AWS Secrets Manager, and Parameter Store
  • Strengthen cloud security posture with IAM least privilege, CSPM reviews, audit readiness, GuardDuty/CloudTrail monitoring, and environment hardening
  • Troubleshoot complex production issues across networking, Kubernetes, compute, databases, and CI/CD systems
What we offer:
  • Competitive compensation and employee stock options
  • Hybrid/remote-first working model (India-based role, with global collaboration)
  • Flexible leave policy
  • Comprehensive medical insurance (self + family members)
  • Annual performance cycle + quarterly recognition awards
  • A supportive, diverse engineering culture grounded in empathy, teamwork, and innovation
Employment Type:
Fulltime

Senior Software Engineer, Site Reliability

Babylist is looking for a Senior Software Engineer, Site Reliability to join our...
Location:
United States; Canada
Salary:
186818.00 - 224183.00 USD; CAD / Year
Babylist
Expiration Date:
Until further notice
Requirements:
  • 8+ years of experience as a Site Reliability Engineer or similar role
  • Experience supporting high-traffic consumer-facing websites
  • Proficiency with Terraform
  • Strong experience working with AWS cloud-based infrastructure and services
  • Proficiency with Docker and Kubernetes
  • Solid understanding of cloud-native systems design
  • Troubleshooting and debugging skills
  • Experience designing and supporting CI systems
  • Familiar with monitoring and alerting best practices
  • Proven experience in on-call management best practices
Job Responsibility:
  • Manage and build our AWS infrastructure using Infrastructure as Code (IaC) tools like Terraform
  • Improve the speed and reliability of our Continuous Integration (CI) systems
  • Provide support to developers in troubleshooting issues
  • Establish, communicate, and support best practices for monitoring and alerting
What we offer:
  • Company-paid medical, dental, and vision insurance
  • Retirement savings plan with company matching and flexible spending accounts
  • Generous paid parental leave and PTO
  • Remote work stipend
  • Perks for physical, mental, and emotional health, parenting, childcare, and financial planning
Employment Type:
Fulltime

Senior Software Engineer, Backend

As a Senior Software Engineer, Backend specializing in database architecture and...
Location:
United States, San Francisco
Salary:
150000.00 - 240000.00 USD / Year
Chef Robotics
Expiration Date:
Until further notice
Requirements:
  • Bachelor's degree in Computer Science, Engineering, or equivalent practical experience
  • 7+ years of professional experience in backend development roles with demonstrated leadership experience
  • Expert knowledge of relational databases (MySQL, PostgreSQL) including schema design, optimization, and administration
  • Strong proficiency with Python and JavaScript/TypeScript with advanced software engineering skills
  • Extensive experience leading projects with at least two web frameworks: Flask, FastAPI, Django, Node.js, or Next.js
  • Proven experience designing and implementing RESTful and GraphQL APIs at scale
  • Advanced understanding of containerization (Docker) and orchestration (Kubernetes) technologies
  • Experience with cloud infrastructure and deployment (AWS, GCP, or Azure) in production environments
  • Proven experience leading complex backend projects and mentoring junior engineers
  • Understanding of data requirements for robotics or automation systems
Job Responsibility:
  • Lead the design, implementation, and optimization of database schemas to support robot operations, telemetry, recipe management, and system analytics
  • Develop robust data migration strategies and version control for database schema evolution
  • Implement efficient query optimization and indexing strategies to support high-throughput robot operations
  • Establish data integrity protocols and backup systems to ensure operational continuity across customer deployments
  • Create scalable data access layers that balance security, performance, and maintainability
  • Mentor team members on database design patterns and optimization techniques
  • Lead the development and maintenance of scalable APIs to serve robot control systems, dashboards, and monitoring tools
  • Design and implement secure authentication and authorization mechanisms across backend services
  • Develop robust middleware for processing and validating data between robotics subsystems
  • Create service interfaces that enable efficient communication between robotics components and cloud services
What we offer:
  • Medical, dental, and vision insurance
  • Commuter benefits
  • Flexible paid time off (PTO)
  • Catered lunch
  • 401(k) matching
  • Early-stage equity
Employment Type:
Fulltime

Senior Manager, Software Engineer - Forward Deployed Team

We are seeking a skilled Software Engineer who will design, build, and maintain ...
Location:
China, Shanghai
Salary:
Not provided
Pfizer
Expiration Date:
Until further notice
Requirements:
  • Bachelor's degree in Computer Science, Engineering, or related field with 12-15 years of relevant experience
  • Expert-level skills in Business Immersion, Data Integration, Full-Stack Development, Multi-Audience Communication, Problem Discovery, Rapid Prototyping & Validation, Stakeholder Management, Team Collaboration
  • Practitioner-level skills in AI Evaluation & Verification, AI Literacy, AI-Augmented Development, Architecture & Design, Code Quality & Review, Developer Experience, Knowledge Management, Pattern Generalization, Service Management, Site Reliability Engineering, Technical Writing
  • Working-level skills in Cloud Platforms, Data Modeling, DevOps & CI/CD, Lean Thinking & Flow, Technical Debt Management, Time Management & Deep Work
Job Responsibility:
  • Drive delivery of the most critical technical initiatives
  • Establish engineering delivery practices across the business unit
  • Be the technical authority on high-stakes projects
  • Develop technical leaders
  • Shape engineering talent strategy across the business unit
  • Build high-performing engineering teams
  • Shape technology-driven business strategy
  • Represent technical perspective at executive level
  • Be recognized as a bridge between engineering and business
  • Design AI-augmented engineering workflows for your area
Employment Type:
Fulltime

Sr. AI Site Reliability Engineer

At Schwab, you will build a rewarding career while making a difference in the li...
Location:
United States, San Francisco
Salary:
190000.00 - 270000.00 USD / Year
Charles Schwab
Expiration Date:
February 24, 2026
Requirements:
  • 8+ years of software development or reliability engineering experience
  • 4+ years as a hands-on senior engineer in startups and/or large organizations
  • Bachelor’s degree in Computer Science or related field
  • 5+ years of experience building and operating complex products from scratch and running them in production
  • 3+ years of experience supporting applications that use Artificial Intelligence (AI) models to deliver real business impact
  • 3+ years of experience building and maintaining data pipelines and infrastructure for large datasets
  • 3+ years of experience with containers and cloud-native applications, and the ability to operationalize them in the public cloud with infrastructure as code
  • Experience implementing monitoring, alerting, and incident response for large-scale distributed systems
  • Proven track record in driving reliability, scalability, and performance improvements for production AI systems
Job Responsibility:
  • Design, implement, and manage the reliability and operational excellence of GenAI applications and platforms
  • Work closely with architects, engineers, and business leaders to align reliability practices with Schwab’s enterprise strategy
  • Mentor and coach junior engineers
  • Help to build strong operational practices and foster a culture of continuous improvement
  • Lead by example in solving complex reliability challenges
  • Advance SRE standards
  • Drive rapid iteration from concept to production
What we offer:
  • 401(k) with company match and Employee stock purchase plan
  • Paid time for vacation, volunteering, and 28-day sabbatical after every 5 years of service for eligible positions
  • Paid parental leave and family building benefits
  • Tuition reimbursement
  • Health, dental, and vision insurance
  • Bonus or incentive opportunities
Employment Type:
Fulltime

Senior Engineering Manager, Brazil Site Lead

At Airbnb, we are expanding our global engineering presence by building a new te...
Location:
Brazil, São Paulo
Salary:
50000.00 - 52500.00 BRL / Month
Airbnb
Expiration Date:
Until further notice
Requirements:
  • 5+ years of engineering management experience
  • 10+ years of overall engineering experience
  • Track record of leading technical teams in high-growth or startup environments
  • Proven experience building or scaling new engineering offices or distributed teams
  • Strong technical foundation in infrastructure, distributed systems, or large-scale backend engineering
  • Experience building highly reliable systems and, ideally, tooling that improves reliability and quality
  • Exceptional leadership skills with the ability to inspire, develop, and grow engineering talent
  • Strong operational and organizational acumen, capable of establishing site-level processes and rhythms from the ground up
  • Experience navigating the nuances of working in matrixed teams across multiple geographies, balancing cultural context with global alignment
  • Excellent communication and relationship-building skills
Job Responsibility:
  • Establish, grow, and lead the new engineering hub from the ground up
  • Shape the site’s culture, attract and develop top local talent, and ensure alignment with Airbnb’s technical and organizational vision
  • Build foundational infrastructure for the hub’s long-term success by developing engineering capabilities, establishing operating rhythms, and serving as a connector between Brazil-based teams and global stakeholders
  • Be an important part of the Reliability Engineering leadership team, helping to shape and execute roadmaps for Airbnb efforts around Reliability, Observability and Quality Engineering
  • Drive automated testing through tooling and AI
  • Partner with global Infra and Product Engineering leaders to define the vision, strategy, and roadmap for the Brazil hub
  • Recruit and develop high-performing engineers and leaders, setting up strong hiring and onboarding practices
  • Drive collaboration and alignment between Brazil-based engineers and global teams
  • Establish local engineering operations including communication cadences, cultural rituals, and partnership models
  • Represent Airbnb locally, acting as a bridge between the company’s mission and the broader technology community in Brazil
What we offer:
  • Bonus
  • Equity
  • Benefits
  • Employee Travel Credits
Employment Type:
Fulltime

Senior Site Reliability Engineer - Fleet Reliability

Lambda, The Superintelligence Cloud, is a leader in AI cloud infrastructure serv...
Location:
United States, San Francisco
Salary:
230000.00 - 345000.00 USD / Year
Lambda
Expiration Date:
Until further notice
Requirements:
  • 7+ years of experience in Site Reliability Engineering, DevOps, or a similar role
  • Strong understanding of modern AI infrastructure, from GPU architectures to hardware performance optimization
  • Strong understanding of Linux-based systems in a distributed environment
  • Solid understanding of Python and Go, with experience working with SWE teams to improve internal tooling
  • Experience with monitoring and alerting tools (e.g., Prometheus, Grafana, SumoLogic)
  • Proficiency in automation and configuration management tools (e.g., Ansible, Terraform)
  • Familiarity with cloud platforms (e.g., OCI, AWS, GCP, Azure)
  • Excellent problem-solving and troubleshooting skills
  • Strong communication and collaboration skills
  • Passion for continuous improvement and innovation
Job Responsibility:
  • Define Fleet Health metrics and indicators to objectively measure and improve system availability
  • Collaborate with the observability team on comprehensive monitoring and alerting systems to proactively predict, detect and respond to issues or anomalies
  • Create runbooks and automated remediations for common failure scenarios
  • Build in automation and auditing to ensure compliance and improve efficiency and productivity
  • Participate in on-call rotations and provide support for incident response and resolution
  • Implement and integrate logging and metrics across platforms such as Datadog, Prometheus, OpenTelemetry, Grafana, and SumoLogic
What we offer:
  • Generous cash & equity compensation
  • Health, dental, and vision coverage for you and your dependents
  • Wellness and commuter stipends for select roles
  • 401k Plan with 2% company match (USA employees)
  • Flexible paid time off plan
Employment Type:
Fulltime