Site Reliability Engineer Job at JFrog (Netanya/Tel Aviv)

Job Description

We’re hiring an SRE to help improve the availability, performance, scalability, and operational excellence of our SaaS environments. You’ll work closely with Engineering and Cloud teams to automate operations, scale JFrog’s large, multi-cloud, Kubernetes-based SaaS environments, strengthen observability, and improve incident response using modern SRE practices (SLOs/SLIs, error budgets, postmortems). This role is hands-on, collaborative, and impact-focused. If you’re eager to make a significant impact in a fast-paced, high-growth environment, we encourage you to apply.

Job Responsibilities

Support the reliability, availability, performance, and scalability of JFrog’s large-scale, multi-cloud, Kubernetes-based SaaS environments
Investigate and troubleshoot production issues across distributed systems, infrastructure, Kubernetes, and cloud environments in close collaboration with Engineering teams
Design and develop backend services, internal platforms, and production engineering tools using Python, Go, or similar technologies
Improve reliability, observability, and operational readiness through SRE practices, monitoring and alerting, capacity awareness, postmortems, and safer CI/CD and production change processes
Evaluate and contribute to AI-assisted and agentic automation solutions that improve operational efficiency, troubleshooting, and production workflows
Support resilience initiatives, including disaster recovery validation, service readiness, health checks, and production readiness reviews
Participate in on-call rotations, lead incident response when needed, and drive follow-up actions to prevent recurrence
Continuously learn and evaluate new technologies that can improve reliability, automation, and operational excellence

Requirements

2-4 years of experience in SRE, Production Engineering, DevOps, or a similar role with hands-on production exposure
Strong troubleshooting and analytical skills, with the ability to investigate production issues in a structured and methodical way
Hands-on experience with Kubernetes-based containerized workloads
Experience with at least one public cloud provider: AWS, GCP, or Azure
Experience developing backend services, internal platforms, automation, or production engineering tools using Python, Go, or another programming language
Practical understanding of Linux fundamentals, networking concepts, HTTP, DNS, service connectivity, and production troubleshooting
Familiarity with CI/CD tools such as Jenkins, ArgoCD, GitHub Actions, or similar
Exposure to observability tools covering metrics, logs, and traces, such as Prometheus, Grafana, Coralogix, New Relic, or similar platforms
Understanding of incident management processes, alerting systems, and production support workflows
Ability to learn quickly, take ownership, communicate clearly, and work well in a collaborative production environment

Nice to have

Experience using AI-assisted operational workflows such as log analysis, incident summarization, triage support, or troubleshooting
Familiarity with agentic automation frameworks such as LangGraph, LangChain, CrewAI, or similar
Experience using AI-assisted development tools such as Cursor, Claude Code, GitHub Copilot, ChatGPT, or similar
