Zuora’s Cloud Engineering teams are responsible for cloud infrastructure, performance and uptime monitoring, internal and external shared services, infrastructure services, and more for Zuora’s customer-facing SaaS products and platforms. Our technologists are based in the US, Beijing, India, and Costa Rica, as well as remotely, and use a follow-the-sun model to provide 24x7x365 coverage for critical functions. They partner closely with our Engineering, Customer Support, Security, Global Services, and Sales teams on a daily basis to keep our customers front and center.
We are seeking a highly skilled and motivated Site Reliability Engineer (SRE) to join our infrastructure team. The ideal candidate will focus on maximizing system uptime, efficiency, and reliability while building the tools and automation necessary to scale our services. This role requires a strong balance of operational experience and development skills, with deep expertise in cloud environments and modern CI/CD practices.
This is a location-specific position that requires you to come into the office regularly to be most effective.
Job Responsibilities:
Maintain and improve the reliability, scalability, and performance of our production systems, targeting a high-availability environment
Design, implement, and maintain automation solutions for infrastructure provisioning, deployment, configuration management, and monitoring using Terraform and Jenkins
Administer, manage, and optimize our cloud infrastructure primarily hosted on AWS, focusing on cost efficiency and secure operations
Develop and maintain infrastructure-as-code using Puppet and/or Ansible to ensure consistent and reproducible environments
Participate in on-call rotation, troubleshoot and resolve critical production incidents, and conduct comprehensive post-mortems to prevent recurrence
Apply strong Linux administration skills to manage, patch, and secure operating systems and underlying infrastructure
Manage and optimize distributed messaging systems, specifically Kafka, ensuring high throughput and data integrity
Requirements:
6-8 years of relevant experience in SRE/DevOps
Proven hands-on working experience with core AWS services (e.g., EC2, VPC, S3, RDS, IAM, CloudWatch, EKS/ECS)
Deep expertise in infrastructure-as-code principles using Terraform for provisioning and state management
Expert-level knowledge and practical experience with configuration management tools such as Puppet and/or Ansible
Strong experience setting up, maintaining, and enhancing Continuous Integration/Continuous Deployment pipelines using Jenkins
Proficiency in scripting languages, particularly Python and/or Shell scripting, for developing automation tools and performing system administration tasks
Advanced knowledge of Linux operating systems, including performance tuning, troubleshooting, security, and networking fundamentals
Working knowledge and operational experience with distributed messaging queues, specifically Kafka
Nice to have:
Experience with containerization technologies like Docker and Kubernetes (EKS)
Familiarity with logging and monitoring tools (e.g., Prometheus, Grafana, ELK stack)
Knowledge of networking (TCP/IP, Load Balancing, DNS)
Previous experience in a 24/7 high-availability production environment
What we offer:
Competitive compensation, variable bonus and performance reward opportunities, and retirement programs
Medical Insurance
Generous, flexible time off
Paid holidays, “wellness” days, and a company-wide end-of-year break
Learning & Development stipend
Opportunities to volunteer and give back, including charitable donation match
Free resources and support for your mental wellbeing