Senior Site Reliability Engineer, Managed AI

Crusoe

Location:
United States, San Francisco, Sunnyvale

Contract Type:
Not provided

Salary:

172000.00 - 209000.00 USD / Year

Job Description:

At Crusoe, our Site Reliability Engineering team ensures the reliability and scalability of Crusoe’s AI-optimized cloud platform. We’re looking for a Senior Site Reliability Engineer with a strong background in distributed systems and hands-on experience with large language models to help us build and operate managed AI services at scale. This role is central to delivering highly available, performant, and cost-efficient AI infrastructure that powers compute-intensive, latency-sensitive workloads for our customers.

Job Responsibility:

  • Design and operate reliable managed AI services with a focus on serving and scaling LLM workloads
  • Build automation and reliability tooling to support distributed AI pipelines and inference services
  • Define, measure, and improve SLIs/SLOs across AI workloads to ensure performance and reliability targets are met
  • Collaborate with AI, platform, and infrastructure teams to optimize large-scale training and inference clusters
  • Automate observability by building telemetry and performance tuning strategies for latency-sensitive AI services
  • Investigate and resolve reliability issues in distributed AI systems using telemetry, logs, and profiling
  • Contribute to the architecture of next-generation distributed systems purpose-built for AI-first environments

Requirements:

  • Strong software engineering background — experience building production-grade systems beyond scripting or Bash
  • Demonstrated experience in distributed systems design and implementation
  • Hands-on work with large language models (LLMs) or AI/ML infrastructure
  • SRE mindset and experience (whether or not under the SRE title)
  • Proficiency in at least one modern programming language (Python, Go, Java, C++)
  • Familiarity with Kubernetes or container orchestration platforms
  • Strong collaboration and communication skills
  • Ability to thrive in a fast-paced, mission-driven environment

Nice to have:

  • Experience scaling inference or training workloads for LLMs

What we offer:
  • Restricted Stock Units in a fast-growing, well-funded technology company
  • Health insurance package options that include HDHP and PPO, vision, and dental for you and your dependents
  • Employer contributions to HSA accounts
  • Paid Parental Leave
  • Paid life insurance, short-term and long-term disability
  • Teladoc
  • 401(k) with a 100% match up to 4% of salary
  • Generous paid time off and holiday schedule
  • Cell phone reimbursement
  • Tuition reimbursement
  • Subscription to the Calm app
  • MetLife Legal
  • Company-paid commuter benefit: $300 per month

Additional Information:

Job Posted:
February 21, 2026

Employment Type:
Full-time
Work Type:
On-site work

Similar Jobs for Senior Site Reliability Engineer, Managed AI

Senior AI Engineer

We are seeking an innovative AI Engineer to join a brand new team focused on pro...
Location:
United Kingdom, London
Salary:
Not provided
Citi
Expiration Date
Until further notice
Requirements
  • Proven experience as an AI Engineer with a significant delivery history
  • Strong expertise in multiple programming languages & frameworks
  • Proven experience using quantitative testing practices in AI/ML to make actionable Go/No-Go decisions on delivering software to production
  • Demonstrated expertise in developing on a range of architectures, ideally up to and including container-based microservices, with a focus on scalability, reliability, maintainability, and high performance
  • Good understanding of SQL and NoSQL databases
  • Excellent communication and collaboration skills
  • A growth mindset and willingness to learn and adapt in a fast-paced environment
  • Passion for site reliability engineering and its impact on product development
  • Staying current with the latest technologies, such as Generative AI, and keen to put them into practice.
Job Responsibility
  • Understand the landscape, tooling and procedures used by developers at Citi and look for opportunities to reduce toil and aid simplification using Gen AI based solutions
  • Apply classic AI and novel Gen AI evaluation methodology to raise the quality and reliability bar for the software that you will deliver, as well as to manage and mitigate risks that are specific/inherent to this field
  • Advise on evaluation metrics, devise and implement quantitative testing plans, and help evolve the existing approaches to AI evaluation
  • Work with a wide variety of Citi technology teams and help them drive towards everything-as-code and a codified controls environment
  • Collaborate with product and engineering teams to design, build and maintain scalable and reliable web applications and services
  • Be hands-on with coding and software design to ensure adherence to high quality standards and best practices
  • Mentor and nurture other engineers to help them grow their skills and expertise
  • Support and drive cultural change, including instigating critical thinking about controls and processes and encouraging a culture of continuous improvement.
What we offer
  • 27 days annual leave (plus bank holidays)
  • A discretionary annual performance-related bonus
  • Private Medical Care & Life Insurance
  • Employee Assistance Program
  • Pension Plan
  • Paid Parental Leave
  • Special discounts for employees, family, and friends
  • Access to an array of learning and development resources.
  • Full-time

Senior Site Reliability Engineer

We are seeking an experienced Senior Site Reliability Engineer (L3) to join our ...
Location:
India, Chennai
Salary:
Not provided
Arcadia
Expiration Date
Until further notice
Requirements
  • Bachelor’s degree in Computer Science, Engineering, or equivalent practical experience
  • 8–10+ years of experience in SRE/DevOps/Cloud Engineering, with deep hands-on exposure to AWS and Kubernetes
  • Strong hands-on experience with: Terraform and Infrastructure as Code; AWS core services (EKS, IAM, RDS, EC2, VPC, CloudWatch, CloudTrail, GuardDuty); Jenkins + Groovy, GitHub Actions, ArgoCD, FluxCD; Kubernetes troubleshooting and operations; Prometheus/Grafana/Datadog observability stacks
  • Proven ability to operate in high-scale, high-uptime, multi-environment production systems
  • Experience building automation via Python/Bash and reducing operational toil
  • Strong understanding of incident management, root cause analysis, and reliability engineering principles
Job Responsibility
  • Design, build, and maintain AWS infrastructure (EKS, VPC, RDS, IAM, CloudWatch, CloudTrail, GuardDuty, Load Balancers, S3, CloudFront) using Terraform and CloudFormation
  • Lead all aspects of Kubernetes operations including cluster upgrades, performance tuning, CNI troubleshooting, workload scaling, Helm chart packaging, and GitOps deployments
  • Own and evolve our CI/CD ecosystem across Jenkins (Groovy scripting), GitHub Actions, AWS CodePipeline, ArgoCD, and FluxCD
  • Improve platform reliability by reducing operational toil through automation, scripting (Python/Bash), and proactive system hardening
  • Implement and enhance observability across Prometheus, Grafana, Loki, Tempo, Datadog, and CloudWatch, ensuring actionable alerting, dashboards, and metrics aligned with SLOs/SLIs
  • Drive FinOps initiatives, identifying cost inefficiencies and working with engineering teams to implement best practices, tagging standards, budgeting, and resource right-sizing
  • Manage database operations across MySQL and PostgreSQL including backups, performance tuning, replication, and operational runbooks
  • Maintain and improve secret management using Vault, AWS Secrets Manager, and Parameter Store
  • Strengthen cloud security posture with IAM least privilege, CSPM reviews, audit readiness, GuardDuty/CloudTrail monitoring, and environment hardening
  • Troubleshoot complex production issues across networking, Kubernetes, compute, databases, and CI/CD systems
What we offer
  • Competitive compensation and employee stock options
  • Hybrid/remote-first working model (India-based role, with global collaboration)
  • Flexible leave policy
  • Comprehensive medical insurance (self + family members)
  • Annual performance cycle + quarterly recognition awards
  • A supportive, diverse engineering culture grounded in empathy, teamwork, and innovation
  • Full-time

Senior Software Engineer, Site Reliability

Babylist is looking for a Senior Software Engineer, Site Reliability to join our...
Location:
United States; Canada
Salary:
186818.00 - 224183.00 USD; CAD / Year
Babylist
Expiration Date
Until further notice
Requirements
  • 8+ years of experience as a Site Reliability Engineer or similar role
  • Experience supporting high-traffic consumer-facing websites
  • Proficiency with Terraform
  • Strong experience working with AWS cloud-based infrastructure and services
  • Proficiency with Docker and Kubernetes
  • Solid understanding of cloud-native systems design
  • Troubleshooting and debugging skills
  • Experience designing and supporting CI systems
  • Familiarity with monitoring and alerting best practices
  • Proven experience in on-call management best practices
Job Responsibility
  • Manage and build our AWS infrastructure using Infrastructure as Code (IaC) tools like Terraform
  • Improve the speed and reliability of our Continuous Integration (CI) systems
  • Provide support to developers in troubleshooting issues
  • Establish, communicate, and support best practices for monitoring and alerting
What we offer
  • Company-paid medical, dental, and vision insurance
  • Retirement savings plan with company matching and flexible spending accounts
  • Generous paid parental leave and PTO
  • Remote work stipend
  • Perks for physical, mental, and emotional health, parenting, childcare, and financial planning
  • Full-time

Senior Software Engineer, Backend

As a Senior Software Engineer, Backend specializing in database architecture and...
Location:
United States, San Francisco
Salary:
150000.00 - 240000.00 USD / Year
Chef Robotics
Expiration Date
Until further notice
Requirements
  • Bachelor's degree in Computer Science, Engineering, or equivalent practical experience
  • 7+ years of professional experience in backend development roles with demonstrated leadership experience
  • Expert knowledge of relational databases (MySQL, PostgreSQL) including schema design, optimization, and administration
  • Strong proficiency with Python and JavaScript/TypeScript with advanced software engineering skills
  • Extensive experience leading projects with at least two web frameworks: Flask, FastAPI, Django, Node.js, or Next.js
  • Proven experience designing and implementing RESTful and GraphQL APIs at scale
  • Advanced understanding of containerization (Docker) and orchestration (Kubernetes) technologies
  • Experience with cloud infrastructure and deployment (AWS, GCP, or Azure) in production environments
  • Proven experience leading complex backend projects and mentoring junior engineers
  • Understanding of data requirements for robotics or automation systems
Job Responsibility
  • Lead the design, implementation, and optimization of database schemas to support robot operations, telemetry, recipe management, and system analytics
  • Develop robust data migration strategies and version control for database schema evolution
  • Implement efficient query optimization and indexing strategies to support high-throughput robot operations
  • Establish data integrity protocols and backup systems to ensure operational continuity across customer deployments
  • Create scalable data access layers that balance security, performance, and maintainability
  • Mentor team members on database design patterns and optimization techniques
  • Lead the development and maintenance of scalable APIs to serve robot control systems, dashboards, and monitoring tools
  • Design and implement secure authentication and authorization mechanisms across backend services
  • Develop robust middleware for processing and validating data between robotics subsystems
  • Create service interfaces that enable efficient communication between robotics components and cloud services
What we offer
  • Medical, dental, and vision insurance
  • Commuter benefits
  • Flexible paid time off (PTO)
  • Catered lunch
  • 401(k) matching
  • Early-stage equity
  • Full-time

Principal Group Engineering Manager

Microsoft Specialized Clouds combines the power of edge platforms, devices, and ...
Location:
India, Bangalore
Salary:
Not provided
Microsoft Corporation
Expiration Date
Until further notice
Requirements
  • 15+ years of professional software engineering experience, including designing, building, and operating distributed, cloud-scale services
  • 5+ years of engineering leadership experience, including managing managers and leading multi-team engineering organizations (M2+)
  • Deep experience with network device platforms — specifically Arista (EOS, eAPI, CloudVision) and/or Cisco (NX-OS, DCNM/NDFC) — including device programming, configuration management, and automation
  • Strong background in device programming and network automation — building systems that programmatically configure, validate, and manage network device state at scale
  • Experience with Azure Resource Provider (RP) engineering — ARM resource modeling, deployment pipelines, control-plane architecture, and resource lifecycle management
  • Solid understanding of L2/L3 networking fundamentals: spine-leaf architecture, VXLAN, overlay/underlay networking, BGP, and data center network design
  • Proven ability to set technical direction and architectural strategy for complex platforms spanning multiple components and partner teams
  • Demonstrated success owning end-to-end delivery of customer-critical services, including design, development, release, and live-site operations
  • Strong experience driving operational excellence, including reliability, incident management, automation, and cost optimization for production services
  • Proven track record of leading organizational transformation — such as quality resets, reliability turnarounds, code yellow resolution, or engineering culture change across an engineering org
Job Responsibility
  • Lead engineering teams through the design, architecture, development, testing, and operations of the Network Fabric platform — the cloud-managed networking layer for Azure Operator Nexus and Azure Local
  • Drive execution excellence across the full software lifecycle: semester planning, feature delivery, release management, and live-site operations
  • Own engineering commitments across multiple workstreams including network device programming, Azure Resource Provider development, fabric orchestration, and network configuration management
  • Ensure services meet Microsoft standards for quality, reliability, security, and operational readiness
  • Establish and enforce engineering best practices — including test-driven development, automated validation, secure development lifecycle (SDL/SFI), and continuous integration
  • Continue and accelerate the ongoing engineering transformation: driving quality resets, improving release predictability, and reducing customer-impacting incidents
  • Own the resolution of code yellow and equivalent quality escalations, driving root cause analysis and systemic remediation across the engineering organization
  • Champion a culture of engineering fundamentals — ensuring that quality, security, and operational maturity are embedded into every sprint, not treated as afterthoughts
  • Drive measurable reduction in support costs through automation, improved test coverage, and process optimization
  • Provide technical leadership across device programming (Arista EOS, Cisco NX-OS), network fabric orchestration, and Azure Resource Provider engineering
  • Full-time

Senior Manager, Software Engineer - Forward Deployed Team

We are seeking a skilled Software Engineer who will design, build, and maintain ...
Location:
China, Shanghai
Salary:
Not provided
Pfizer
Expiration Date
Until further notice
Requirements
  • Bachelor's degree in Computer Science, Engineering, or related field with 12-15 years of relevant experience
  • Expert-level skills in Business Immersion, Data Integration, Full-Stack Development, Multi-Audience Communication, Problem Discovery, Rapid Prototyping & Validation, Stakeholder Management, Team Collaboration
  • Practitioner-level skills in AI Evaluation & Verification, AI Literacy, AI-Augmented Development, Architecture & Design, Code Quality & Review, Developer Experience, Knowledge Management, Pattern Generalization, Service Management, Site Reliability Engineering, Technical Writing
  • Working-level skills in Cloud Platforms, Data Modeling, DevOps & CI/CD, Lean Thinking & Flow, Technical Debt Management, Time Management & Deep Work
Job Responsibility
  • Drive delivery of the most critical technical initiatives
  • Establish engineering delivery practices across the business unit
  • Be the technical authority on high-stakes projects
  • Develop technical leaders
  • Shape engineering talent strategy across the business unit
  • Build high-performing engineering teams
  • Shape technology-driven business strategy
  • Represent technical perspective at executive level
  • Be recognized as a bridge between engineering and business
  • Design AI-augmented engineering workflows for your area
  • Full-time

Senior Site Reliability Engineer - Fleet Reliability

Lambda, The Superintelligence Cloud, is a leader in AI cloud infrastructure serv...
Location:
United States, San Francisco
Salary:
230000.00 - 345000.00 USD / Year
Lambda
Expiration Date
Until further notice
Requirements
  • 7+ years of experience in Site Reliability Engineering, DevOps, or a similar role
  • Strong understanding of modern AI infrastructure, from GPU architectures to hardware performance optimization
  • Strong understanding of Linux-based systems in a distributed environment
  • Solid understanding of Python and Go, with experience working with SWE teams to improve internal tooling
  • Experience with monitoring and alerting tools (e.g., Prometheus, Grafana, SumoLogic)
  • Proficiency in automation and configuration management tools (e.g., Ansible, Terraform)
  • Familiarity with cloud platforms (e.g., OCI, AWS, GCP, Azure)
  • Excellent problem-solving and troubleshooting skills
  • Strong communication and collaboration skills
  • Passion for continuous improvement and innovation
Job Responsibility
  • Define Fleet Health metrics and indicators to objectively measure and improve system availability
  • Collaborate with the observability team on comprehensive monitoring and alerting systems to proactively predict, detect and respond to issues or anomalies
  • Create runbooks and automated remediations for common failure scenarios
  • Build in automation and auditing to ensure compliance and improve efficiency and productivity
  • Participate in on-call rotations and provide support for incident response and resolution
  • Implement and integrate logging and metrics across platforms such as Datadog, Prometheus, OpenTelemetry, Grafana, and SumoLogic
What we offer
  • Generous cash & equity compensation
  • Health, dental, and vision coverage for you and your dependents
  • Wellness and commuter stipends for select roles
  • 401k Plan with 2% company match (USA employees)
  • Flexible paid time off plan
  • Full-time

Staff Software Engineer – Forward Deployed

We are seeking a skilled Software Engineer who will design, build, and maintain ...
Location:
China, Shanghai; Dalian; Wuhan
Salary:
Not provided
Pfizer
Expiration Date
Until further notice
Requirements
  • Bachelor's degree in Computer Science, Engineering, or related field with 8-12 years of relevant experience
  • AI-Augmented Development: optimize AI tool usage, train engineers on AI-augmented workflows, evaluate new AI development tools, establish practices that balance AI speed with verification rigor
  • Business Immersion: rapidly acquire domain expertise, translate between business and engineering, mentor engineers on immersion
  • Data Integration: navigate complex enterprise data landscapes, build relationships to gain data access, handle undocumented schemas, build robust integration solutions, mentor engineers on data integration
  • Full-Stack Development: build complete applications rapidly across any technology stack, select the right tools, balance technical debt with delivery speed, mentor engineers on full-stack development
  • Multi-Audience Communication: influence through communication at all levels, handle difficult conversations skillfully, train engineers on effective communication, represent teams across the function
  • Problem Discovery: seek out undefined problems, embed with users to discover latent needs, coach engineers on problem discovery techniques, turn ambiguity into clear problem statements
  • Rapid Prototyping & Validation: lead rapid delivery initiatives, coach on prototype-first approaches, establish trust through consistent fast delivery, define clear criteria for prototype-to-production transitions
  • Site Reliability Engineering: define reliability standards, drive post-incident improvements systematically, design capacity planning processes, mentor engineers on SRE practices
  • Stakeholder Management: influence senior stakeholders, manage complex stakeholder landscapes with competing agendas, build trust rapidly with new stakeholders, shield teams from organizational friction
Job Responsibility
  • Delivery: Lead technical delivery of complex projects across multiple teams, unblock others through hands-on contributions, ensure engineering quality
  • AI: Design AI-augmented engineering workflows for your area, evaluate new AI tools, train engineers on effective AI usage, balance speed with verification
  • People: Coach multiple engineers on career growth, lead hiring for technical roles across your area, shape team technical culture
  • Business: Drive business outcomes through technical solutions across your area, influence product roadmaps, partner effectively with business stakeholders
  • Process: Drive process efficiency within your team, coordinate cross-functional technical work, lead retrospectives
  • Documentation: Design documentation strategies for your projects, ensure knowledge persists beyond individuals, write specifications that enable effective collaboration
  • Full-time