We’re hiring a Site Reliability Engineer to help improve the availability, performance, scalability, and operational excellence of our SaaS environments. You’ll work closely with Engineering and Cloud teams to automate operations, strengthen observability, and improve incident response using modern SRE practices (SLOs/SLIs, error budgets, postmortems). This role is hands-on, collaborative, and impact-focused. If you're eager to make a significant impact in a fast-paced, high-growth environment, we encourage you to apply.
Job Responsibilities:
Improve reliability, scalability, performance, and observability for JFrog SaaS services in partnership with engineering teams
Implement SRE practices: define SLOs/SLIs, run failure analysis, support capacity planning, perform service readiness reviews and drive tech-debt reliability improvements
Support day-to-day operations of our multi-cloud, globally distributed, cloud-native, Kubernetes-based SaaS environments to keep services available, performant, cost-efficient, and scalable
Build and enhance internal services and tools to streamline operations and reduce toil through automation
Develop and maintain Python/Go automation to improve deployment safety, incident response and operational visibility
Run PoCs, prototype, and drive implementations for agentic automation using an ADK/agent framework, leveraging AI where it meaningfully improves operational & strategic excellence
Support resilience testing/chaos experiments (as appropriate) and improve disaster recovery readiness
Participate in on-call, lead incidents to resolution, and drive postmortems and follow-up actions that prevent recurrence
Act as a primary contact for SaaS production issues, collaborating closely with Product engineering groups
Evaluate cloud-native technologies and vendor solutions that improve SaaS reliability and lifecycle management
Requirements:
4+ years in SRE, DevOps, or Production Engineering in large-scale production environments
Production experience with Kubernetes (Docker) and at least one cloud provider (AWS, GCP, or Azure)
Working knowledge of SLO/SLI, alerting strategy, incident response, postmortems, and reliability improvements
Proficiency in Python or Go for automation, integrations, and internal tools
Hands-on with metrics/logs/traces using tools like New Relic, Coralogix, Prometheus, Grafana, OpenTelemetry (or equivalents)
Strong incident response and triage skills using PagerDuty/Opsgenie (or equivalent)
Exposure to chaos/resilience testing (e.g., Gremlin) and DR readiness
Practical use of AI-assisted operations (e.g., log/incident summarization, triage helpers)
Familiarity with building simple agents with an ADK/agent framework (e.g., LangGraph, LangChain, CrewAI, or similar)
Working knowledge of microservices delivery using Jenkins, ArgoCD, or equivalent
Strong documentation (runbooks, postmortems) and a collaborative, independent problem-solving mindset