Senior Site Reliability Engineer Cloud Platform Job at Zilliz

Senior Site Reliability Engineer (SRE) – Cloud & Distributed Systems

We are seeking an experienced Senior Site Reliability Engineer (SRE) to design, ...

Location

United States, Austin

Salary:

Not provided

Dutech Systems

Expiration Date

Until further notice

Requirements

8+ years of experience in SRE, DevOps, or Systems Engineering
Strong expertise in Linux/Unix systems and system internals
Proficiency in at least one programming/scripting language (Python, Go, Java, Bash)
Experience designing and operating distributed systems
Hands-on experience with cloud platforms (AWS or GCP)
Experience with Docker and Kubernetes
Strong understanding of monitoring, alerting, and logging concepts
Experience managing SLIs, SLOs, and error budgets
Experience with incident management and RCA processes

Job Responsibility

Design, implement, and manage highly available, distributed systems
Maintain and optimize cloud infrastructure (AWS/GCP)
Develop automation scripts using Python, Go, Java, or Bash
Manage containerized environments using Docker and Kubernetes
Define and monitor SLIs, SLOs, and error budgets
Implement monitoring, logging, and alerting solutions
Lead incident management, root cause analysis (RCA), and postmortems
Ensure system security and compliance within operational workflows
Improve system reliability through performance tuning and optimization
Collaborate with engineering teams to enhance deployment and release processes

Senior Site Reliability Engineer - Automation Platform

Join a team of passionate and hardworking entrepreneurs to transform healthcare!...

Location

France, Paris

Salary:

Not provided

Doctolib

Expiration Date

Until further notice

Requirements

At least 5 years of site reliability engineering experience
Experience with AWS, Terraform, Kubernetes, and GitHub Actions, supporting the deployment of applications developed on the JVM and/or in TypeScript
Proactive, curious, collaborative and eager to learn
Proven experience with cloud services such as AWS, Azure or Google Cloud
Solid understanding of containerization and orchestration technologies (Docker and Kubernetes)
Proficiency in at least one programming language (Go, Java, Ruby, Python, etc.) and a deep understanding of infrastructure-as-code principles

Job Responsibility

Collaborating with Feature teams to ensure services align with developer needs
Driving improvements by evaluating new technologies and processes
Defining best practices (golden paths) for software development and deployment
Developing and maintaining tools and services that facilitate implementation of best practices
Ensuring reliability, scalability, traceability, and monitoring of services and infrastructure
Collaborating on roadmap delivery

What we offer

Free Health Insurance for you
Up to 14 days of RTT
A flexible workplace policy offering both hybrid and office-based modes
Flexibility days allowing you to work from EU countries and the UK for up to 10 days per year
Wellbeing program with free mental health and coaching through moka.care
Special support package for caregivers and workers with disabilities
Lunch voucher with Swile card
Work Council subsidy for sport club membership or creative activities
Bicycle subsidy
Public transportation reimbursement

Fulltime

Senior Site Reliability Engineer - Automation Platform

Join a team of passionate and hardworking entrepreneurs to transform healthcare....

Location

Germany, Berlin

Salary:

Not provided

Doctolib

Expiration Date

Until further notice

Requirements

At least 5 years of site reliability engineering experience
Experience with AWS, Terraform, Kubernetes, and GitHub Actions, supporting the deployment of applications developed on the JVM and/or in TypeScript
Proactive, curious, collaborative and eager to learn
Proven experience with cloud services such as AWS, Azure or Google Cloud
Solid understanding of containerization and orchestration technologies (Docker and Kubernetes)
Proficiency in at least one programming language (Go, Java, Ruby, Python, etc.) and a deep understanding of infrastructure-as-code principles

Job Responsibility

Collaborating with Feature teams to ensure services align with developer needs
Driving improvements by evaluating new technologies and processes
Defining best practices ("golden paths") for software development and deployment
Developing and maintaining tools and services that facilitate best practices
Ensuring reliability, scalability, traceability, and monitoring of services and infrastructure
Collaborating on roadmap delivery

What we offer

Company health insurance through partner Allianz
Minimum 28 days of paid leave
Parent Care Program: one additional month of leave on top of legal parental leave
Free mental health and coaching services through partner Moka.care
For caregivers and workers with disabilities, a package including adaptation of remote policy, extra days off for medical reasons, and psychological support
Flexible workplace policy offering both hybrid and office-based modes
Work from EU countries and the UK for up to 10 days per year
Reimbursement of public transportation

Fulltime

Senior Staff Engineer Software (Cloud Platform, Production & Reliability – Machine Identity Security)

The Production Engineering team is responsible for building, scaling, and operat...

Location

United States, Santa Clara

Salary:

126000.00 - 203500.00 USD / Year

Palo Alto Networks

Expiration Date

Until further notice

Requirements

5+ years of experience in DevOps, Platform Engineering, or Site Reliability Engineering (SRE)
Strong experience designing and operating cloud infrastructure on AWS, Azure, or GCP
Deep expertise managing and scaling Kubernetes environments (EKS, AKS, or GKE)
Strong experience with Infrastructure as Code tools (Terraform, Ansible, or Pulumi)
Proven experience designing and maintaining complex CI/CD systems (Jenkins, GitLab CI, ArgoCD, GitHub Actions)
Strong programming/scripting skills (Python, Go, or similar) for automation and tooling
Experience operating in high-scale, 24/7 production environments with ownership of incident response and reliability
Solid understanding of Linux systems and networking fundamentals (DNS, TCP/IP, load balancing, VPC, mTLS)
Strong problem-solving skills and ability to work across teams

Job Responsibility

Design, build, and evolve highly available cloud infrastructure platforms with a focus on scalability, resilience, and reliability
Lead improvements across production systems, including performance, availability, and incident response
Drive and standardize Infrastructure as Code (IaC) practices to improve consistency and reduce operational overhead
Design and optimize CI/CD pipelines to support fast, secure, and reliable software delivery at scale
Partner with development teams to improve system reliability, observability, and cloud-native design patterns
Define and implement monitoring, alerting, and observability strategies across distributed systems
Lead incident response efforts, including root cause analysis and long-term remediation strategies
Identify and eliminate operational toil through automation and system improvements
Mentor engineers and contribute to raising the bar for production engineering practices

What we offer

Restricted stock units
Bonus

Fulltime

Senior Vice President, Cloud Security Site Reliability Engineer

This role sits within the Cloud Security team which is responsible for Private a...

Location

Singapore, Singapore

Salary:

Not provided

Citi

Expiration Date

Until further notice

Requirements

Bachelor’s degree or equivalent work experience
8+ years of relevant work experience
Highly motivated self-starter with excellent interpersonal and communication skills, able to communicate effectively at multiple levels of seniority
Certification or formal training in site reliability engineering concepts and practices
Prior experience working towards SLIs, SLOs and observability capabilities at a large scale
5+ years of experience in Python (preferred) or Java on large-scale systems, alongside Linux-based scripting languages
Experience working on observability, logging, and metrics toolsets
Experience with Kubernetes and container technologies such as Docker, OpenShift, and EKS
Experience with public cloud technologies such as AWS, GCP or Azure
Experience with Secrets products such as HashiCorp Vault or CyberArk

Job Responsibility

Working across Container products and Secrets products, across Public and Private Cloud, as well as Cloud native specific products
Architecting and building tools and platforms that provide capabilities for SRE
Collaboration with multiple stakeholders and partners across Engineering and Operations as well as partner teams within the wider Citi organization
Actively owning production-level incidents until resolution

Fulltime

Senior Site Reliability Engineer - Fleet Reliability

Lambda, The Superintelligence Cloud, is a leader in AI cloud infrastructure serv...

Location

United States, San Francisco

Salary:

230000.00 - 345000.00 USD / Year

Lambda

Expiration Date

Until further notice

Requirements

7+ years of experience in Site Reliability Engineering, DevOps, or a similar role
Strong understanding of modern AI infrastructure, from GPU architectures to hardware performance optimization
Strong understanding of Linux-based systems in a distributed environment
Solid understanding of Python and Go, with experience working with SWE teams to improve internal tooling
Experience with monitoring and alerting tools (e.g., Prometheus, Grafana, SumoLogic)
Proficiency in automation and configuration management tools (e.g., Ansible, Terraform)
Familiarity with cloud platforms (e.g., OCI, AWS, GCP, Azure)
Excellent problem-solving and troubleshooting skills
Strong communication and collaboration skills
Passion for continuous improvement and innovation

Job Responsibility

Define Fleet Health metrics and indicators to objectively measure and improve system availability
Collaborate with the observability team on comprehensive monitoring and alerting systems to proactively predict, detect and respond to issues or anomalies
Create runbooks and automated remediations for common failure scenarios
Build in automation and auditing to ensure compliance and improve efficiency and productivity
Participate in on-call rotations and provide support for incident response and resolution
Implement and integrate logging and metrics across platforms such as Datadog, Prometheus, OpenTelemetry, Grafana, and SumoLogic

What we offer

Generous cash & equity compensation
Health, dental, and vision coverage for you and your dependents
Wellness and commuter stipends for select roles
401k Plan with 2% company match (USA employees)
Flexible paid time off plan

Fulltime

Senior Cloud Platform Engineer

The Opportunity We are currently partnering with several leading technology con...

Location

Salary:

Not provided

Myn

Expiration Date

Until further notice

Requirements

AWS experience
Azure experience
Terraform
Senior cloud experience
Cloud landing zones
CI/CD pipelines
Cloud security
SRE principles
Regulated environments

Job Responsibility

Design, build, and maintain robust, enterprise-scale cloud infrastructure
Take ownership of the end-to-end lifecycle of cloud environments
Ensure cloud environments remain secure, scalable, and resilient
Leverage Infrastructure as Code (IaC) to automate provisioning and configuration
Embed Site Reliability Engineering (SRE) principles to drive operational excellence, high availability, and proactive monitoring
Act as a key technical leader, collaborating with cross-functional teams and senior stakeholders
Ensure architectural governance and secure-by-default solutions

Senior Site Reliability Engineer, Wikimedia Enterprise

The Wikimedia Foundation is looking for a Senior Site Reliability Engineer to jo...

Location

United States

Salary:

116633.00 - 181243.00 USD / Year

Wikimedia Foundation

Expiration Date

Until further notice

Requirements

Automation & Configuration Management: Experience with Infrastructure as Code and automation tools (e.g., Terraform, Ansible) and proficiency in at least one programming language (e.g., Python, Go, or similar)
Cloud Infrastructure: Experience designing, operating, and optimizing cloud-based systems across platforms such as AWS, Azure, or GCP, including scalability, reliability, and cost efficiency
CI/CD & Deployment Practices: Experience building and maintaining CI/CD pipelines and GitOps workflows (e.g., GitLab or similar, ArgoCD), with familiarity in progressive delivery approaches such as canary and blue-green deployments
Incident Management & Reliability Operations: Experience with incident response, on-call practices, and leading postmortems, with a focus on continuous improvement and operational excellence
SRE Principles & Observability: Strong understanding of SRE best practices, including SLOs, SLIs, and error budgets, along with experience in observability (metrics, logging, and distributed tracing, e.g., Prometheus, OpenTelemetry)
Collaboration & Communication: Ability to work effectively in a distributed, cross-functional environment, with strong documentation and communication skills
Proven experience operating highly available, large-scale distributed systems, with a deep understanding of reliability, scalability, and failure modes
Ownership mindset: Takes end-to-end responsibility for system reliability, proactively identifying and addressing risks before they impact users
Bias for automation: Continuously seeks to reduce operational toil through automation and scalable solutions
Continuous improvement mindset: Actively learns from incidents and drives improvements through blameless postmortems and iterative enhancements

Job Responsibility

Define, track, and improve Service Level Objectives (SLOs), SLIs, and error budgets to ensure reliability targets are met
Build and enhance observability systems (metrics, logs, and distributed tracing) to enable proactive detection and faster troubleshooting
Drive reliability engineering practices, including capacity planning, load testing, and resilience validation (e.g., chaos testing)
Improve developer experience (DevEx) by enabling self-service infrastructure and streamlining deployment workflows
Partner with engineering team members to embed reliability best practices early in the development lifecycle
Design, implement, and optimize CI/CD and GitOps workflows using tools such as GitLab and ArgoCD (or similar), enabling automated, reliable deployments with support for progressive delivery strategies like canary and blue-green releases
Implement secure-by-default infrastructure and enforce best practices (e.g., IAM, secrets management, encryption)
Continuously optimize infrastructure cost and efficiency using FinOps principles while maintaining performance and availability
Establish and track operational metrics such as MTTR, MTTD, and incident frequency to drive continuous improvement
Reduce operational toil by identifying repetitive work and implementing automation-first solutions

Fulltime
