The Production Engineering team is responsible for building, scaling, and operating the cloud platform for the CyberArk Machine Identity Management products. Our solutions are trusted by the world's largest organizations to protect and manage TLS machine identities, SSH machine identities, and code signing identities. As a Senior Staff Production Engineer, you will play a key role in designing and evolving the reliability, scalability, and operational excellence of our cloud platform. You will work across infrastructure, services, and engineering teams to ensure systems are resilient, observable, and able to operate at scale. This role is ideal for engineers who combine strong infrastructure expertise with a systems mindset, and who are comfortable driving improvements across production environments, tooling, and engineering practices.
Job Responsibilities
Design, build, and evolve highly available cloud infrastructure platforms with a focus on scalability, resilience, and reliability
Lead improvements across production systems, including performance, availability, and incident response
Drive and standardize Infrastructure as Code (IaC) practices to improve consistency and reduce operational overhead
Design and optimize CI/CD pipelines to support fast, secure, and reliable software delivery at scale
Partner with development teams to improve system reliability, observability, and cloud-native design patterns
Define and implement monitoring, alerting, and observability strategies across distributed systems
Lead incident response efforts, including root cause analysis and long-term remediation strategies
Identify and eliminate operational toil through automation and system improvements
Mentor engineers and contribute to raising the bar for production engineering practices
Requirements
5+ years of experience in DevOps, Platform Engineering, or Site Reliability Engineering (SRE)
Strong experience designing and operating cloud infrastructure on AWS, Azure, or GCP
Deep expertise managing and scaling Kubernetes environments (EKS, AKS, or GKE)
Strong experience with Infrastructure as Code tools (Terraform, Ansible, or Pulumi)
Proven experience designing and maintaining complex CI/CD systems (Jenkins, GitLab CI, ArgoCD, GitHub Actions)
Strong programming/scripting skills (Python, Go, or similar) for automation and tooling
Experience operating in high-scale, 24/7 production environments with ownership of incident response and reliability
Solid understanding of Linux systems and networking fundamentals (DNS, TCP/IP, load balancing, VPC, mTLS)
Strong problem-solving skills and ability to work across teams
Nice to have
Experience implementing DevSecOps practices in cloud environments
Experience building or improving observability platforms and tooling
Professional certifications (CKA/CKAD, AWS Solutions Architect, Azure Administrator)
Experience using AI-assisted development tools to improve operational workflows and automation