Join Zuora’s high-impact Operations team and help power the backbone of our industry-leading SaaS platform. In this role, you’ll be at the center of maintaining and enhancing the reliability, scalability, and performance of Zuora’s core systems — ensuring our customers around the world enjoy a seamless experience every time. We’re looking for an engineer who thrives on solving complex operational challenges, loves building automation-first solutions, and is passionate about driving innovation through AI and modern infrastructure practices.
Job Responsibilities:
Design and implement intelligent automation for infrastructure lifecycle management — including self-healing, anomaly detection, and automated remediation using IaC and AI-driven tooling
Apply AI/ML techniques for predictive monitoring and proactive performance optimization to prevent outages before they happen
Lead complex incident response and root cause analysis (RCA) efforts, embedding automation and learning into postmortems
Identify and remove reliability bottlenecks using dynamic scaling, telemetry instrumentation, and automated tuning
Continuously enhance runbooks and playbooks by integrating machine learning insights and automating manual tasks
Stay on the cutting edge of AIOps, distributed systems, and cloud-native reliability practices, and apply what you learn to inform strategic engineering decisions
Requirements:
Strong hands-on experience in Linux Administration and Python Development
Experience working with Agentic AI or multi-agent frameworks to amplify operational capabilities
Deep expertise with Docker and Kubernetes, managing scalable, high-availability environments
Familiarity with Kafka, ActiveMQ, MySQL, Oracle, Redis, and modern caching/messaging systems
Understanding of AI/ML-based anomaly detection and predictive operations
Proven ability in incident management, RCA, and building systems that prevent recurrence
Experience designing and maintaining CI/CD pipelines, with a strong focus on observability and reliability
Proficiency with Prometheus, Grafana, and OpenTelemetry for real-time monitoring and anomaly detection
A continuous learning mindset and a passion for automation, innovation, and operational excellence
1+ years of experience in a SaaS or cloud-native environment
Nice to have:
Experience with Jenkins, Terraform, and advanced infrastructure-as-code practices
Red Hat Certified System Administrator (RHCSA)
AWS / Azure / GCP Certifications
Python Institute PCAP (Certified Associate in Python Programming)
Docker Certified Associate (DCA) or Certified Kubernetes Administrator (CKA)
SRE or advanced operations certifications
What we offer:
Competitive compensation, bonus opportunities, and retirement programs
Comprehensive medical, dental, and vision coverage
Generous, flexible time off
Paid holidays, wellness days, and a company-wide year-end break
6 months of fully paid parental leave
Learning & development stipend
Opportunities to give back, including volunteer time and donation matching