Site Reliability Engineer Job at Zuora

Job Description

Join Zuora’s high-impact Operations team and help power the backbone of our industry-leading SaaS platform. In this role, you’ll be at the center of maintaining and enhancing the reliability, scalability, and performance of Zuora’s core systems — ensuring our customers around the world enjoy a seamless experience every time. We’re looking for an engineer who thrives on solving complex operational challenges, loves building automation-first solutions, and is passionate about driving innovation through AI and modern infrastructure practices.

Job Responsibilities

Design and implement intelligent automation for infrastructure lifecycle management — including self-healing, anomaly detection, and automated remediation using IaC and AI-driven tooling
Apply AI/ML techniques for predictive monitoring and proactive performance optimization to prevent outages before they happen
Lead complex incident response and root cause analysis (RCA) efforts, embedding automation and learning into postmortems
Identify and remove reliability bottlenecks using dynamic scaling, telemetry instrumentation, and automated tuning
Continuously enhance runbooks and playbooks by integrating machine learning insights and automating manual tasks
Stay on the cutting edge of AIOps, distributed systems, and cloud-native reliability practices — and bring those learnings to influence strategic engineering decisions

Requirements

Strong hands-on experience in Linux administration and Python development
Experience working with agentic AI or multi-agent frameworks to amplify operational capabilities
Deep expertise with Docker and Kubernetes, including managing scalable, high-availability environments
Familiarity with Kafka, ActiveMQ, MySQL, Oracle, Redis, and modern caching/messaging systems
Understanding of AI/ML-based anomaly detection and predictive operations
Proven ability in incident management, RCA, and building systems that prevent recurrence
Experience designing and maintaining CI/CD pipelines, with a strong focus on observability and reliability
Proficiency with Prometheus, Grafana, and OpenTelemetry for real-time monitoring and anomaly detection
A continuous learning mindset and a passion for automation, innovation, and operational excellence
1+ years of experience in a SaaS or cloud-native environment

Nice to have

Experience with Jenkins, Terraform, and advanced infrastructure-as-code practices
Red Hat Certified System Administrator (RHCSA)
AWS / Azure / GCP Certifications
Python Institute PCAP (Certified Associate in Python Programming)
Docker Certified Associate (DCA) or Certified Kubernetes Administrator (CKA)
SRE or advanced operations certifications

What we offer

Competitive compensation, bonus opportunities, and retirement programs
Comprehensive medical, dental, and vision coverage
Generous, flexible time off
Paid holidays, wellness days, and a company-wide year-end break
6 months of fully paid parental leave
Learning & development stipend
Opportunities to give back, including volunteer time and donation matching
Mental wellbeing resources and support
