Senior Site Reliability Engineer Cloud Platform

175000.00 - 225000.00 USD / Year · Job Posted December 14, 2025

Job Description

Zilliz is a fast-growing startup developing the industry’s leading vector database for enterprise-grade AI. Founded by the engineers behind Milvus, the world’s most popular open-source vector database, the company builds next-generation database technologies to help organizations quickly create AI applications. On a mission to democratize AI, Zilliz is committed to simplifying data management for AI applications and making vector databases accessible to every organization.

Job Responsibility

  • Work at the intersection of development and site reliability, creating SRE tools and systems and supporting existing infrastructure and platforms
  • Ensure the reliability, availability, and performance of Zilliz’s distributed database systems
  • Develop and implement strategies for monitoring, incident management, and disaster recovery
  • Automate system operations and maintenance tasks to improve efficiency and reduce manual intervention
  • Design and build tools to manage and monitor infrastructure, ensuring scalability and robustness
  • Collaborate with software engineers to enhance system reliability, scalability, and performance
  • Maintain and improve the CI/CD pipeline to ensure smooth and rapid deployment of changes
  • Actively contribute to the Milvus Vector Database open-source community, focusing on improving reliability and operational efficiency

Requirements

  • 4+ years of experience in site reliability engineering or similar roles with a focus on cloud-native systems
  • Proficiency in programming or scripting languages such as Python, Go, or Java
  • Strong knowledge of container orchestration technologies like Kubernetes and Docker
  • Expertise with cloud platforms such as AWS, GCP, or Azure, and their respective monitoring and management tools
  • Experience with infrastructure as code tools such as Terraform or Ansible
  • Familiarity with CI/CD tools such as Jenkins, GitLab CI, or Argo
  • Proven ability to troubleshoot complex distributed systems and resolve issues promptly
  • Bachelor’s degree or above in computer science, software engineering, or other relevant disciplines
  • Ability to thrive in a fast-paced startup environment and handle multiple projects simultaneously

Nice to have

  • Experience with the open-source Milvus vector database

Similar Jobs for

Senior Site Reliability Engineer Cloud Platform

8 matching positions

Senior Site Reliability Engineer (SRE) – Cloud & Distributed Systems

We are seeking an experienced Senior Site Reliability Engineer (SRE) to design, ...
Location
United States , Austin
Salary:
Not provided
Dutech Systems
Expiration Date
Until further notice
Requirements
  • 8+ years of experience in SRE, DevOps, or Systems Engineering
  • Strong expertise in Linux/Unix systems and system internals
  • Proficiency in at least one programming/scripting language (Python, Go, Java, Bash)
  • Experience designing and operating distributed systems
  • Hands-on experience with cloud platforms (AWS or GCP)
  • Experience with Docker and Kubernetes
  • Strong understanding of monitoring, alerting, and logging concepts
  • Experience managing SLIs, SLOs, and error budgets
  • Experience with incident management and RCA processes
Job Responsibility
  • Design, implement, and manage highly available, distributed systems
  • Maintain and optimize cloud infrastructure (AWS/GCP)
  • Develop automation scripts using Python, Go, Java, or Bash
  • Manage containerized environments using Docker and Kubernetes
  • Define and monitor SLIs, SLOs, and error budgets
  • Implement monitoring, logging, and alerting solutions
  • Lead incident management, root cause analysis (RCA), and postmortems
  • Ensure system security and compliance within operational workflows
  • Improve system reliability through performance tuning and optimization
  • Collaborate with engineering teams to enhance deployment and release processes

Senior Site Reliability Engineer - Automation Platform

Join a team of passionate and hardworking entrepreneurs to transform healthcare!...
Location
France , Paris
Salary:
Not provided
Doctolib
Expiration Date
Until further notice
Requirements
  • At least 5 years of site reliability engineering experience
  • Experience with AWS, Terraform, Kubernetes, and GitHub Actions, supporting the deployment of applications developed on the JVM and/or in TypeScript
  • Proactive, curious, collaborative and eager to learn
  • Proven experience with cloud services such as AWS, Azure or Google Cloud
  • Solid understanding of containerization and orchestration technologies (Docker and Kubernetes)
  • Proficiency in at least one programming language (Go, Java, Ruby, Python etc.) and a deep understanding of infrastructure as code principles
Job Responsibility
  • Collaborating with Feature teams to ensure services align with developer needs
  • Driving improvements by evaluating new technologies and processes
  • Defining best practices (golden paths) for software development and deployment
  • Developing and maintaining tools and services that facilitate implementation of best practices
  • Ensuring reliability, scalability, traceability, and monitoring of services and infrastructure
  • Collaborating on roadmap delivery
What we offer
  • Free Health Insurance for you
  • Up to 14 days of RTT
  • A flexible workplace policy offering both hybrid and office-based modes
  • Flexibility days that allow you to work from EU countries and the UK 10 days per year
  • Wellbeing program with free mental health and coaching through moka.care
  • Special support package for caregivers and workers with disabilities
  • Lunch voucher with Swile card
  • Work Council subsidy for sport club membership or creative activities
  • Bicycle subsidy
  • Public transportation reimbursement
  • Fulltime

Senior Site Reliability Engineer - Automation Platform

Join a team of passionate and hardworking entrepreneurs to transform healthcare....
Location
Germany , Berlin
Salary:
Not provided
Doctolib
Expiration Date
Until further notice
Requirements
  • At least 5 years of site reliability engineering experience
  • Experience with AWS, Terraform, Kubernetes, and GitHub Actions, supporting the deployment of applications developed on the JVM and/or in TypeScript
  • Proactive, curious, collaborative and eager to learn
  • Proven experience with cloud services such as AWS, Azure or Google Cloud
  • Solid understanding of containerization and orchestration technologies (Docker and Kubernetes)
  • Proficiency in at least one programming language (Go, Java, Ruby, Python etc.) and a deep understanding of infrastructure as code principles
Job Responsibility
  • Collaborating with Feature teams to ensure services align with developer needs
  • Driving improvements by evaluating new technologies and processes
  • Defining best practices ("golden paths") for software development and deployment
  • Developing and maintaining tools and services that facilitate best practices
  • Ensuring reliability, scalability, traceability, and monitoring of services and infrastructure
  • Collaborating on roadmap delivery
What we offer
  • Company health insurance through partner Allianz
  • Minimum 28 days of paid leave
  • Parent Care Program: one additional month of leave on top of legal parental leave
  • Free mental health and coaching services through partner Moka.care
  • For caregivers and workers with disabilities, a package including adaptation of remote policy, extra days off for medical reasons, and psychological support
  • Flexible workplace policy offering both hybrid and office-based mode
  • Work from EU countries and the UK for up to 10 days per year
  • Reimbursement of public transportation
  • Fulltime

Senior Staff Engineer Software (Cloud Platform, Production & Reliability – Machine Identity Security)

The Production Engineering team is responsible for building, scaling, and operat...
Location
United States , Santa Clara
Salary:
126000.00 - 203500.00 USD / Year
Palo Alto Networks
Expiration Date
Until further notice
Requirements
  • 5+ years of experience in DevOps, Platform Engineering, or Site Reliability Engineering (SRE)
  • Strong experience designing and operating cloud infrastructure on AWS, Azure, or GCP
  • Deep expertise managing and scaling Kubernetes environments (EKS, AKS, or GKE)
  • Strong experience with Infrastructure as Code tools (Terraform, Ansible, or Pulumi)
  • Proven experience designing and maintaining complex CI/CD systems (Jenkins, GitLab CI, ArgoCD, GitHub Actions)
  • Strong programming/scripting skills (Python, Go, or similar) for automation and tooling
  • Experience operating in high-scale, 24/7 production environments with ownership of incident response and reliability
  • Solid understanding of Linux systems and networking fundamentals (DNS, TCP/IP, load balancing, VPC, mTLS)
  • Strong problem-solving skills and ability to work across teams
Job Responsibility
  • Design, build, and evolve highly available cloud infrastructure platforms with a focus on scalability, resilience, and reliability
  • Lead improvements across production systems, including performance, availability, and incident response
  • Drive and standardize Infrastructure as Code (IaC) practices to improve consistency and reduce operational overhead
  • Design and optimize CI/CD pipelines to support fast, secure, and reliable software delivery at scale
  • Partner with development teams to improve system reliability, observability, and cloud-native design patterns
  • Define and implement monitoring, alerting, and observability strategies across distributed systems
  • Lead incident response efforts, including root cause analysis and long-term remediation strategies
  • Identify and eliminate operational toil through automation and system improvements
  • Mentor engineers and contribute to raising the bar for production engineering practices
What we offer
  • restricted stock units
  • bonus
  • Fulltime

Senior Vice President, Cloud Security Site Reliability Engineer

This role sits within the Cloud Security team which is responsible for Private a...
Location
Singapore , Singapore
Salary:
Not provided
Citi
Expiration Date
Until further notice
Requirements
  • Bachelor’s degree or equivalent work experience
  • 8+ years of relevant work experience
  • Highly motivated self-starter with excellent interpersonal and communication skills. Able to communicate efficiently at multiple levels of seniority
  • Certification or formal training in site reliability engineering concepts and practices
  • Prior experience working towards SLIs, SLOs and observability capabilities at a large scale
  • 5+ years of experience in Python (preferred) or Java on large-scale systems, alongside Linux-based scripting languages
  • Experience working on observability, logging and metrics toolsets
  • Experience with Kubernetes and container technologies such as Docker, OpenShift, and EKS
  • Experience with public cloud technologies such as AWS, GCP or Azure
  • Experience with Secrets products such as HashiCorp Vault or CyberArk
Job Responsibility
  • Working across Container products and Secrets products, across Public and Private Cloud, as well as Cloud native specific products
  • Architecting and building tools and platforms that provide capabilities for SRE
  • Collaboration with multiple stakeholders and partners across Engineering and Operations as well as partner teams within the wider Citi organization
  • Actively owning production-level incidents until resolution
  • Fulltime

Senior Site Reliability Engineer - Fleet Reliability

Lambda, The Superintelligence Cloud, is a leader in AI cloud infrastructure serv...
Location
United States , San Francisco
Salary:
230000.00 - 345000.00 USD / Year
Lambda
Expiration Date
Until further notice
Requirements
  • 7+ years of experience in Site Reliability Engineering, DevOps, or a similar role
  • Strong understanding of modern AI infrastructure, from GPU architectures to hardware performance optimization
  • Strong understanding of Linux-based systems in a distributed environment
  • Solid understanding of Python and Go, with experience working with SWE teams to improve internal tooling
  • Experience with monitoring and alerting tools (e.g., Prometheus, Grafana, SumoLogic)
  • Proficiency in automation and configuration management tools (e.g., Ansible, Terraform)
  • Familiarity with cloud platforms (e.g., OCI, AWS, GCP, Azure)
  • Excellent problem-solving and troubleshooting skills
  • Strong communication and collaboration skills
  • Passion for continuous improvement and innovation
Job Responsibility
  • Define Fleet Health metrics and indicators to objectively measure and improve system availability
  • Collaborate with the observability team on comprehensive monitoring and alerting systems to proactively predict, detect and respond to issues or anomalies
  • Create runbooks and automated remediations for common failure scenarios
  • Build in automation and auditing to ensure compliance and improve efficiency and productivity
  • Participate in on-call rotations and provide support for incident response and resolution
  • Implement and integrate logging and metrics across platforms such as Datadog, Prometheus, OpenTelemetry, Grafana, SumoLogic, etc
What we offer
  • Generous cash & equity compensation
  • Health, dental, and vision coverage for you and your dependents
  • Wellness and commuter stipends for select roles
  • 401k Plan with 2% company match (USA employees)
  • Flexible paid time off plan
  • Fulltime

Senior Cloud Platform Engineer

The Opportunity We are currently partnering with several leading technology con...
Location
Salary:
Not provided
Myn
Expiration Date
Until further notice
Requirements
  • AWS experience
  • Azure experience
  • Terraform
  • Senior cloud experience
  • Cloud landing zones
  • CI/CD pipelines
  • Cloud security
  • SRE principles
  • Regulated environments
Job Responsibility
  • Design, build, and maintain robust, enterprise-scale cloud infrastructure
  • Take ownership of the end-to-end lifecycle of cloud environments
  • Ensure cloud environments remain secure, scalable, and resilient
  • Leverage Infrastructure as Code (IaC) to automate provisioning and configuration
  • Embed Site Reliability Engineering (SRE) principles to drive operational excellence, high availability, and proactive monitoring
  • Act as a key technical leader, collaborating with cross-functional teams and senior stakeholders
  • Ensure architectural governance and secure-by-default solutions

Senior Site Reliability Engineer, Wikimedia Enterprise

The Wikimedia Foundation is looking for a Senior Site Reliability Engineer to jo...
Location
United States
Salary:
116633.00 - 181243.00 USD / Year
Wikimedia Foundation
Expiration Date
Until further notice
Requirements
  • Automation & Configuration Management: Experience with Infrastructure as Code and automation tools (e.g., Terraform, Ansible) and proficiency in at least one programming language (e.g., Python, Go, or similar)
  • Cloud Infrastructure: Experience designing, operating, and optimizing cloud-based systems across platforms such as AWS, Azure, or GCP, including scalability, reliability, and cost efficiency
  • CI/CD & Deployment Practices: Experience building and maintaining CI/CD pipelines and GitOps workflows (e.g., GitLab or similar, ArgoCD), with familiarity in progressive delivery approaches such as canary and blue-green deployments
  • Incident Management & Reliability Operations: Experience with incident response, on-call practices, and leading postmortems, with a focus on continuous improvement and operational excellence
  • SRE Principles & Observability: Strong understanding of SRE best practices, including SLOs, SLIs, and error budgets, along with experience in observability (metrics, logging, and distributed tracing, e.g., Prometheus, OpenTelemetry)
  • Collaboration & Communication: Ability to work effectively in a distributed, cross-functional environment, with strong documentation and communication skills
  • Proven experience operating highly available, large-scale distributed systems, with a deep understanding of reliability, scalability, and failure modes
  • Ownership mindset: Takes end-to-end responsibility for system reliability, proactively identifying and addressing risks before they impact users
  • Bias for automation: Continuously seeks to reduce operational toil through automation and scalable solutions
  • Continuous improvement mindset: Actively learns from incidents and drives improvements through blameless postmortems and iterative enhancements
Job Responsibility
  • Define, track, and improve Service Level Objectives (SLOs), SLIs, and error budgets to ensure reliability targets are met
  • Build and enhance observability systems (metrics, logs, and distributed tracing) to enable proactive detection and faster troubleshooting
  • Drive reliability engineering practices, including capacity planning, load testing, and resilience validation (e.g., chaos testing)
  • Improve developer experience (DevEx) by enabling self-service infrastructure and streamlining deployment workflows
  • Partner with engineering team members to embed reliability best practices early in the development lifecycle
  • Design, implement, and optimize CI/CD and GitOps workflows using tools such as GitLab (or similar) and ArgoCD (or similar), enabling automated, reliable deployments with support for progressive delivery strategies like canary and blue-green releases
  • Implement secure-by-default infrastructure and enforce best practices (e.g., IAM, secrets management, encryption)
  • Continuously optimize infrastructure cost and efficiency using FinOps principles while maintaining performance and availability
  • Establish and track operational metrics such as MTTR, MTTD, and incident frequency to drive continuous improvement
  • Reduce operational toil by identifying repetitive work and implementing automation-first solutions
  • Fulltime