
Senior Staff Site Reliability Engineer

Israel, Tel Aviv · Job Posted April 12, 2026

Job Description

As a Site Reliability Engineer on the SASE Platform team, you will play a critical role in building and operating highly available, secure, and globally distributed services. Your mission is to ensure our cloud-native security and networking platform is reliable, scalable, and performant from day one, protecting the users, applications, and data of the world's largest enterprises as they adopt cloud, remote work, and AI.

Job Responsibility

  • Proactively collaborate with development teams to embed reliability, scalability, and operability into services from the earliest design stages
  • Design, review, and evolve cloud-native architectures to improve availability, performance, cost efficiency, and fault tolerance
  • Build and operate automation for provisioning, deploying, and managing global infrastructure using Infrastructure as Code (IaC)
  • Improve CI/CD pipelines and release processes to enable safe, fast, and repeatable deployments
  • Drive observability best practices, including metrics, logs, traces, and SLIs/SLOs to enable data-driven incident analysis
  • Participate in on-call rotations, reducing mean time to resolution (MTTR) through automation and proactive reliability improvements
  • Challenge existing processes by championing reliability, security, and operational maturity across the organization

Requirements

  • 5+ years of experience working with Unix/Linux systems, including shell, tools, networking, and kernel concepts
  • 2+ years of hands-on experience with microservices architectures running on Kubernetes and container platforms
  • Proven experience operating workloads in public cloud environments (e.g., AWS, GCP, Azure) at scale
  • Proficiency in building automation and tools in at least one scripting or programming language (e.g., Python, Go, Java)
  • Strong experience with Infrastructure as Code (IaC) tools such as Terraform or Ansible
  • Bachelor’s degree in Engineering, Computer Science, or a related technical field, or equivalent practical experience

Nice to have

  • Deep expertise in designing and operating monitoring, alerting, and observability systems (e.g., Prometheus, Grafana, ELK Stack)
  • Advanced networking expertise, including TCP/IP, DNS, BGP, routing, and cloud networking concepts relevant to SASE architectures
  • Prior experience operating or supporting SASE, SD-WAN, Zero Trust, or network security platforms
  • Familiarity with using AI/LLM technologies to improve operational workflows (e.g., incident analysis, automation)


Similar Jobs for Senior Staff Site Reliability Engineer

8 matching positions

Staff Site Reliability Engineer

Our Site Reliability Engineering team is growing, and we are looking for a highl...
Location: Finland, Helsinki
Salary: Not provided
AlphaSense
Expiration Date: Until further notice
Requirements
  • 8+ years of experience in Site Reliability Engineering, DevOps, or a similar role
  • At least 3 of those years operating in a Senior+ SRE position
  • Strong background in running production SaaS systems at scale
  • Proficiency in at least one programming/scripting language (Python, Go, or similar)
  • Hands-on expertise with cloud platforms (AWS, GCP, or Azure) and Kubernetes
  • Deep understanding of networking fundamentals (TCP/IP, DNS, HTTP/S, load balancing)
  • Experience with monitoring & alerting (Prometheus, Grafana, Datadog, ELK)
  • Familiarity with advanced observability (OTEL, continuous profiling)
  • Proven incident management experience, including leading high-severity incidents and postmortems
  • Strong troubleshooting skills across the full stack
Job Responsibility
  • Architect Reliability Paved Paths: Build frameworks and self-service tooling that let teams own the reliability of their services
  • Lead AI-Driven Reliability: Drive our AIOps strategy — automating diagnostics, remediation, and proactive failure prevention
  • Champion Reliability Culture: Embed SRE practices across engineering via design reviews, production readiness, and operational standards
  • Incident Leadership: Act as Incident Commander during critical events, modeling operational excellence, and ensuring blameless postmortems lead to lasting improvements
  • Advance Observability: Deliver end-to-end monitoring, tracing, and profiling (Prometheus, Grafana, OTEL, Continuous Profiling) to optimize performance proactively
  • Mentor & Multiply: Elevate engineers across SRE and product teams through mentorship, technical guidance, and knowledge sharing

Staff Site Reliability Engineer

Our Site Reliability Engineering team is growing, and we are looking for a highl...
Location: India, Bengaluru
Salary: Not provided
AlphaSense
Expiration Date: Until further notice
Requirements
  • 8+ years of experience in Site Reliability Engineering, DevOps, or a similar role
  • At least 3 of those years operating in a Senior+ SRE position
  • Strong background in running production SaaS systems at scale
  • Proficiency in at least one programming/scripting language (Python, Go, or similar)
  • Hands-on expertise with cloud platforms (AWS, GCP, or Azure) and Kubernetes
  • Deep understanding of networking fundamentals (TCP/IP, DNS, HTTP/S, load balancing)
  • Experience with monitoring & alerting (Prometheus, Grafana, Datadog, ELK)
  • Familiarity with advanced observability (OTEL, continuous profiling)
  • Proven incident management experience, including leading high-severity incidents and postmortems
  • Strong troubleshooting skills across the full stack
Job Responsibility
  • Architect Reliability Paved Paths: Build frameworks and self-service tooling that let teams own the reliability of their services in a “You Build It, You Run It” culture
  • Lead AI-Driven Reliability: Drive our AIOps strategy — automating diagnostics, remediation, and proactive failure prevention
  • Champion Reliability Culture: Embed SRE practices across engineering via design reviews, production readiness, and operational standards
  • Incident Leadership: Act as Incident Commander during critical events, modeling operational excellence, and ensuring blameless postmortems lead to lasting improvements
  • Advance Observability: Deliver end-to-end monitoring, tracing, and profiling (Prometheus, Grafana, OTEL, Continuous Profiling) to optimize performance proactively
  • Mentor & Multiply: Elevate engineers across SRE and product teams through mentorship, technical guidance, and knowledge sharing

Staff Site Reliability Engineer

Our Site Reliability Engineering team is growing, and we are looking for a highl...
Location: India, Delhi
Salary: Not provided
AlphaSense
Expiration Date: Until further notice
Requirements
  • 8+ years of experience in Site Reliability Engineering, DevOps, or a similar role
  • At least 3 of those years operating in a Senior+ SRE position
  • Strong background in running production SaaS systems at scale
  • Proficiency in at least one programming/scripting language (Python, Go, or similar)
  • Hands-on expertise with cloud platforms (AWS, GCP, or Azure) and Kubernetes
  • Deep understanding of networking fundamentals (TCP/IP, DNS, HTTP/S, load balancing)
  • Experience with monitoring & alerting (Prometheus, Grafana, Datadog, ELK)
  • Familiarity with advanced observability (OTEL, continuous profiling)
  • Proven incident management experience, including leading high-severity incidents and postmortems
  • Strong troubleshooting skills across the full stack
Job Responsibility
  • Architect Reliability Paved Paths: Build frameworks and self-service tooling that let teams own the reliability of their services in a “You Build It, You Run It” culture
  • Lead AI-Driven Reliability: Drive our AIOps strategy — automating diagnostics, remediation, and proactive failure prevention
  • Champion Reliability Culture: Embed SRE practices across engineering via design reviews, production readiness, and operational standards
  • Incident Leadership: Act as Incident Commander during critical events, modeling operational excellence, and ensuring blameless postmortems lead to lasting improvements
  • Advance Observability: Deliver end-to-end monitoring, tracing, and profiling (Prometheus, Grafana, OTEL, Continuous Profiling) to optimize performance proactively
  • Mentor & Multiply: Elevate engineers across SRE and product teams through mentorship, technical guidance, and knowledge sharing

Staff Site Reliability Engineer

Our Site Reliability Engineering team is growing, and we are looking for a highl...
Location: India, Pune
Salary: Not provided
AlphaSense
Expiration Date: Until further notice
Requirements
  • 8+ years of experience in Site Reliability Engineering, DevOps, or a similar role
  • At least 3 of those years operating in a Senior+ SRE position
  • Strong background in running production SaaS systems at scale
  • Proficiency in at least one programming/scripting language (Python, Go, or similar)
  • Hands-on expertise with cloud platforms (AWS, GCP, or Azure) and Kubernetes
  • Deep understanding of networking fundamentals (TCP/IP, DNS, HTTP/S, load balancing)
  • Experience with monitoring & alerting (Prometheus, Grafana, Datadog, ELK)
  • Familiarity with advanced observability (OTEL, continuous profiling)
  • Proven incident management experience, including leading high-severity incidents and postmortems
  • Strong troubleshooting skills across the full stack
Job Responsibility
  • Architect Reliability Paved Paths: Build frameworks and self-service tooling that let teams own the reliability of their services in a “You Build It, You Run It” culture
  • Lead AI-Driven Reliability: Drive our AIOps strategy — automating diagnostics, remediation, and proactive failure prevention
  • Champion Reliability Culture: Embed SRE practices across engineering via design reviews, production readiness, and operational standards
  • Incident Leadership: Act as Incident Commander during critical events, modeling operational excellence, and ensuring blameless postmortems lead to lasting improvements
  • Advance Observability: Deliver end-to-end monitoring, tracing, and profiling (Prometheus, Grafana, OTEL, Continuous Profiling) to optimize performance proactively
  • Mentor & Multiply: Elevate engineers across SRE and product teams through mentorship, technical guidance, and knowledge sharing

Staff Site Reliability Engineer

As a Staff Site Reliability Engineer, you will be a technical leader and strateg...
Location: Singapore, Singapore; Australia, Melbourne
Salary: Not provided
Airwallex
Expiration Date: Until further notice
Requirements
  • 10+ years of experience in SRE, DevOps, or infrastructure engineering roles, with progressive responsibility
  • Proven ability to lead SRE strategy and execution for large-scale, complex, cross-functional projects
  • Deep expertise with cloud platforms (AWS/GCP), Kubernetes, container orchestration, observability, and incident response frameworks
  • Strong experience supporting production systems with stringent high availability, compliance, and security requirements
  • Demonstrated leadership in mentoring and growing technical teams
  • Excellent collaboration and communication skills, able to influence stakeholders at all levels
  • Degree in Computer Science or related field
Job Responsibility
  • Drive the strategic vision and roadmap for Site Reliability Engineering at Airwallex, aligned with business objectives and product goals
  • Architect and oversee the implementation of highly scalable, secure, and resilient cloud infrastructure for new services and platform-wide initiatives
  • Lead and mentor senior engineers and cross-functional teams in reliability engineering best practices, automation, and incident management
  • Champion and evolve operational excellence through advanced observability, SLO management, runbooks, and proactive risk mitigation
  • Lead incident response for high-severity incidents, facilitating post-mortems and driving continuous improvements
  • Collaborate closely with Product, Engineering, Security, and DevOps leadership to ensure compliance, resilience, and alignment across functions
  • Influence and shape engineering culture around reliability, scalability, and DevOps principles across multiple teams
  • Advocate for innovation in tooling, automation, and infrastructure to improve developer productivity and service uptime
Employment type: Full-time

Senior Staff Site Reliability Engineer

Fivetran is looking for a high-performance, experienced engineer to be a part of...
Location: India, Bengaluru
Salary: Not provided
Fivetran
Expiration Date: Until further notice
Requirements
  • 12+ years of experience working with SaaS products at scale
  • Working knowledge of managed Kubernetes (EKS, AKS and GKE)
  • Knowledge of Cloud Platforms and related tooling: AWS, Azure, Google Cloud (GCP), Terraform, Ansible, Buildkite, Pulumi and ArgoCD
  • Experience with Python/shell scripting and Go; Java is a bonus
  • Experience with Linux operating system internals and administration
  • Experience with cloud networking, such as Site-to-Site VPNs, PrivateLink, and Private Service Connect (GCP)
Job Responsibility
  • Responsible for ongoing reliability and robustness of Fivetran’s production infrastructure by monitoring availability, capacity, and throughput
  • Evolve systems by adding reliability into our product roadmap
  • Coordinate the re-prioritization or fixing of critical bugs for support or sales requirements as needed
  • Recommend improvements to production infrastructure, working with engineering to ensure 100% availability
  • Ensure scalable artifact deployment to all environments through automation scripts
  • Constantly monitor infrastructure vulnerabilities and remedy them by working with the security team
What we offer
  • 100% employer-paid medical insurance
  • Generous paid time-off policy (PTO), plus paid sick time, inclusive parental leave policy, holidays, and volunteer days off
  • RSU stock grants
  • Professional development and training opportunities
  • Company virtual happy hours, free food, and fun team-building activities
  • Monthly cell phone stipend
  • Access to an innovative mental health support platform that offers personalized care and resources in areas such as: therapy, coaching, and self-guided mindfulness exercises for all covered employees and their covered dependents
Employment type: Full-time

Principal Site Reliability Engineer

Arcadia’s customers rely on us to securely process and deliver high-value health...
Salary: Not provided
The Muse
Expiration Date: Until further notice
Requirements
  • 8+ years of experience in SRE, platform engineering, systems engineering, or related roles operating production services at scale
  • Demonstrated principal-level impact: leading cross-team initiatives, influencing architecture decisions, and driving sustained improvements in reliability and operations
  • Expertise in Kubernetes operations and troubleshooting, including safe rollout/rollback patterns, workload debugging, and operational guardrails
  • Strong GitOps experience with Argo CD; experience building delivery workflows and automation using Argo Workflows
  • Strong infrastructure orchestration and provisioning experience with Crossplane and Terraform; ability to define reusable platform patterns and controls
  • Deep AWS experience (IAM, networking/VPC, compute, storage, managed services, observability) and strong understanding of reliability and failure modes in cloud systems
  • Proficiency in Python for building automation, tooling, and reliability improvements
  • Strong incident management and on-call leadership experience, including measurable improvements (availability, MTTR, alert quality, cost, or operational maturity)
Job Responsibility
  • Act as the technical leader for reliability for one or more domains; set direction and standards while remaining hands-on where it matters most
  • Drive reliability strategy across critical services: define SLOs/SLIs, error budgets, and reliability KPIs aligned to customer journeys and outcomes
  • Own incident response maturity: lead complex incidents, improve incident command practices, and ensure high-quality RCAs with prioritized, tracked remediation
  • Architect and implement automation to reduce toil and risk: runbook automation, self-service tools, and safe operational workflows (Python + Argo Workflows)
  • Advance GitOps delivery practices using Argo CD: promotion strategies, progressive delivery/canaries, and guardrails that reduce deploy risk
  • Scale infrastructure management with Crossplane and Terraform: reusable patterns, policy controls, and paved roads for teams
  • Lead operational readiness and reliability reviews for new features/architectural changes; reinforce non-functional requirements (availability, latency, security, cost)
  • Improve performance and cost efficiency through capacity planning, load testing, right-sizing, and architecture recommendations across AWS services
What we offer
  • Pet Insurance
  • Health Insurance
  • Dental Insurance
  • Vision Insurance
  • FSA
  • HSA
  • HSA With Employer Contribution
  • Life Insurance
  • Short-Term Disability
  • Long-Term Disability

Senior Manager, Hybrid Services & Reliability (SRE)

As the Senior Engineering Manager for Hybrid Services & Reliability (HSR) within...
Location: United States, Austin, Texas; Sunnyvale, California
Salary: 201600.00 - 302000.00 USD / Year
General Motors
Expiration Date: Until further notice
Requirements
  • Extensive background in Site Reliability Engineering (SRE) and defining SLO/SLI frameworks for hybrid cloud environments
  • Technical proficiency in managing on-prem Linux utilities (DHCP/PXE/NTP) and core development services
  • Opinionated view on automated observability, incident response, and MTTR reduction
  • Proven leadership experience
Job Responsibility
  • Reliability Engineering: Define, measure, and enforce strict SLOs/SLIs for critical hybrid cloud services, including network connectivity and compute readiness
  • Foundational Utilities: Own and manage core on-prem utilities, such as DHCP, PXE, and CDN, to ensure seamless server auto-provisioning across the global fleet
  • Environment Integrity: Manage the entire data flow path, from initial ingestion at the test bench through the secure cloud network into production staging
  • HIL Readiness: Guarantee the 99%+ availability and stability of remote CI-based Hardware-in-the-Loop (HIL) benches required for AV safety validation
  • Organization Growth: Actively lead the recruitment and technical mentorship of Senior and Staff ICs as part of the team's expansion
What we offer
  • Medical, dental, vision, Health Savings Account, Flexible Spending Accounts, retirement savings plan, sickness and accident benefits, life insurance, paid vacation & holidays, tuition assistance programs, employee assistance program, GM vehicle discounts
  • Relocation benefits
Employment type: Full-time