
Staff Site Reliability Engineer - Cloud

United Kingdom · Job Posted June 09, 2026

Job Description

Elevate Global Operations as our Next Cloud Site Reliability Engineer (OpenTelemetry)! Are you ready to lead an OTel-first strategy and redefine reliability for a global industrial technology leader? Trimble is looking for a visionary Cloud Site Reliability Engineer to manage our massive-scale observability platform, ensuring our digital and physical solutions remain performant and resilient. This is your chance to use cutting-edge automation and OpenTelemetry to make a tangible impact on the world's most critical industries.

Job Responsibility

  • Lead a global "OTel First" strategy, implementing OpenTelemetry at scale across a diverse technological landscape
  • Spearhead the development of automation scripts and Infrastructure as Code using Terraform to ensure seamless, reproducible platform delivery
  • Optimize platform performance and cost-efficiency, ensuring our observability tools scale economically as our data grows
  • Collaborate with engineering teams to embed reliability and security standards into new features from the ground up
  • Drive root cause analysis and problem management to proactively prevent incidents and improve the customer experience

Requirements

  • Hands-on experience with the OpenTelemetry Collector, APIs, and SDKs
  • Extensive experience with observability tools like New Relic, Datadog, or Splunk
  • Strong proficiency in Infrastructure as Code (Terraform, Ansible) and cloud platforms (AWS, GCP, or Azure)
  • Deep understanding of containerization and orchestration using Docker and Kubernetes
  • Advanced coding skills in Python, Go, or Java for building robust automation and monitoring tools
  • Experience leveraging AI coding assistants like GitHub Copilot to accelerate development


Similar Jobs for Staff Site Reliability Engineer - Cloud

8 matching positions

Staff Engineer, Site Reliability Engineer

OnStar is a cornerstone of General Motors' connected services—bringing safety, s...
Location: Ireland, Dublin
Salary: Not provided
General Motors
Expiration Date: Until further notice
Requirements
  • 8+ years in SRE, DevOps, or systems engineering, including experience managing or mentoring high-impact teams
  • Track record of building and maintaining high-scale, cloud-native systems (preferably AWS, GCP, or Azure)
  • Expertise in container orchestration and deployment strategies using Kubernetes and CI/CD pipelines
  • Proficiency in Python, Go, or Java, with strong code review and readability standards
  • Experience leading cross-functional infrastructure projects, configuration strategy, or organizational tooling initiatives
  • Ability to think and act under pressure
  • Strong communication skills
Job Responsibility
  • Lead the design and implementation of scalable, fault-tolerant, and observable infrastructure supporting OnStar mobile and web experiences, in-vehicle services, and the backend platforms and integrations that power them
  • Champion configuration management, infrastructure refactoring, and testing frameworks to strengthen system resilience
  • Partner across SRE, development, and product teams to improve service reliability, deployment safety, and incident response practices
  • Drive internal consultation and strategic planning on reliability standards for new OnStar capabilities, customer-facing releases, and platform initiatives
  • Define and evolve observability strategy using tools such as Prometheus, Grafana, and Datadog, with automated alerting and actionable SLO dashboards
  • Own and improve on-call practices, manage blameless postmortems, and guide root cause analysis to eliminate recurring failures
  • Mentor engineers and help shape a high-performance culture rooted in extreme ownership and operational excellence
  • Support compliance and privacy-driven engineering initiatives across connected services, with potential crossover into areas like data retention and safety certification tooling
Job Type: Full-time

Staff Site Reliability Engineer - Incident Management & Reliability

We’re not just building better tech. We’re rewriting how data moves and what the...
Location: Canada
Salary: 225100.00 - 264500.00 CAD / Year
Confluent
Expiration Date: Until further notice
Requirements
  • 10+ years of relevant experience in SRE, incident management, or reliability engineering
  • Cloud experience with at least one of AWS, GCP, or Azure
  • Experience navigating reliability/incident programs at 500+ engineer organizations
  • Deep expertise with incident management tooling (Rootly, PagerDuty, or similar)
  • Strong understanding of distributed systems and failure modes at scale
  • Deep experience with observability: metrics, logging, tracing
  • Kubernetes and container orchestration experience
  • Understanding of CI/CD pipelines and release processes
  • Strong written communication (design docs, runbooks, post-mortems)
  • Experience driving org-wide process and cultural changes
Job Responsibility
  • Analyze systemic failure patterns and design reliability improvements that prevent incident recurrence
  • Own Rootly configuration, workflows, and integrations with PagerDuty, Jira, Confluence, and Slack
  • Define and maintain SLO/SLA frameworks; use error budgets to guide reliability investments
  • Own standards, practices, and continuous improvement of incident response across engineering
  • Edit and review customer-facing incident documents (CRCAs) to ensure quality and clarity
  • Develop and deliver training programs; coach teams through post-mortems
  • Partner with engineering leaders to elevate reliability practices org-wide
What we offer
  • Remote-First Work
  • Robust Insurance Benefits
  • Flexible Time Away
  • The Best Teammates
  • Experience Ambassadors
  • Open and Honest Culture
  • Well-Being and Growth
  • Offers Equity
Job Type: Full-time

Staff Site Reliability Engineer

Trimble is seeking a Staff Site Reliability Engineer (P4) to join our Corporate ...
Location: India, Chennai
Salary: Not provided
Trimble Inc.
Expiration Date: Until further notice
Requirements
  • Bachelor’s degree or equivalent in Computer Science, Engineering, Information Systems, or a related field, or equivalent practical experience
  • Minimum of 10 years of experience in IT operations, including deep knowledge of networking, computing, and storage
  • Minimum of 5 years of experience with AWS and/or Azure cloud computing environments, with at least 2 years in an architect/design role
  • Windows and Linux deployment experience, including common services for each platform
  • Proficiency in at least one scripting language (preferably Python or PowerShell/.NET) and proficiency in using Git as a source control system
  • Strong background in application operations, including Incident Management, Change Management, and Capacity Management
  • Excellent troubleshooting and problem-solving skills, knowledge of security best practices, a strong desire to learn independently, and exceptional written/verbal communication skills with a customer-service mindset
Job Responsibility
  • Cloud Architecture & Enhancement: Develop new and enhance current shared public cloud services with a strict focus on Availability, Operations, Performance, Capacity, Security, and User Experience
  • Technical Leadership: Provide input and expertise relating to cloud hosting solutions (full infrastructure design and management). Transform business requirements into scalable operational designs
  • Collaboration & Planning: Attend and provide input on product planning sessions with internal development teams. Act as an expert on Business System services to communicate the value of our platform
  • Automation & Documentation: Identify and implement automation solutions. Develop and maintain critical documentation, including architecture diagrams, service descriptions, build/deploy processes, and operations run books
  • Mentorship & Support: Provide technical escalation and mentoring to other team members. Train operations teams to provide Level 1/2 support for shared public cloud services, acting as the ultimate Level 3 escalation point
  • Standards & Governance: Manage AWS/Azure best practice expectations and ensure alignment with corporate standards
  • Global Collaboration: Work effectively within a global team framework. Strike a balance between Indian and US time zones to attend business stakeholder meetings, address production issues, and serve as a reliable escalation point (including off-hours tasks when necessary)
Job Type: Full-time

Senior Staff Site Reliability Engineer

Fivetran is looking for a high-performance, experienced engineer to be a part of...
Location: India, Bengaluru
Salary: Not provided
Fivetran
Expiration Date: Until further notice
Requirements
  • 12+ years of experience working with SaaS products at scale
  • Working knowledge of managed Kubernetes (EKS, AKS and GKE)
  • Knowledge of Cloud Platforms and related tooling: AWS, Azure, Google Cloud (GCP), Terraform, Ansible, Buildkite, Pulumi and ArgoCD
  • Experience in Python/Shell scripting and Go; Java is a bonus
  • Experience with Linux operating systems internals and administration
  • Experience with cloud networking such as Site-to-Site VPNs, PrivateLink, and Private Service Connect (GCP)
Job Responsibility
  • Responsible for ongoing reliability and robustness of Fivetran’s production infrastructure by monitoring availability, capacity, and throughput
  • Evolve systems by adding reliability into our product roadmap
  • Coordinate the reprioritization or fixing of critical bugs for support or sales requirements as needed
  • Recommend improvements to production infrastructure, working with engineering to ensure 100% availability
  • Ensure scalable artifact deployment to all environments through automation scripts
  • Constantly monitor infrastructure vulnerabilities and remedy them by working with the security team
What we offer
  • 100% employer-paid medical insurance
  • Generous paid time-off policy (PTO), plus paid sick time, inclusive parental leave policy, holidays, and volunteer days off
  • RSU stock grants
  • Professional development and training opportunities
  • Company virtual happy hours, free food, and fun team-building activities
  • Monthly cell phone stipend
  • Access to an innovative mental health support platform that offers personalized care and resources in areas such as: therapy, coaching, and self-guided mindfulness exercises for all covered employees and their covered dependents
Job Type: Full-time

Site Reliability Engineer Staff

Site Reliability Engineer Staff. This role has been designed as 'Hybrid' with an...
Location: United States, San Juan
Salary: Not provided
Hewlett Packard Enterprise
Expiration Date: Until further notice
Requirements
  • Minimum of 4 years of hands-on experience in Infra Ops, Dev Ops, or Site Reliability Engineering (SRE)
  • Proficiency with Linux systems, especially Debian-based distributions
  • Strong experience with cloud platforms such as AWS and GCP
  • Expertise in Infrastructure as Code tools like Terraform, Packer, and Ansible
  • Solid programming skills in Python and/or Golang
  • Deep understanding of containerization (Docker, Container) and orchestration tools (AWS EKS, GCP GKE)
  • Experience with GitOps workflows
  • Proven track record in implementing and maintaining CI/CD pipelines
  • Strong background in security and familiarity with security programs
  • Experience with monitoring and logging tools (Prometheus, Grafana, ELK)
Job Responsibility
  • Enhance Infrastructure as Code (IaC) and enforce best practices
  • Optimize cloud infrastructure for scalability, security, and cost-effectiveness
  • Develop internal tools to support and streamline cloud platform operations
  • Improve CI/CD pipelines and deployment workflows using FluxCD and Jenkins
  • Address container image vulnerabilities and standardize remediation processes
  • Build Amazon Machine Images (AMIs) aligned with CIS and STIG benchmarks
  • Strengthen monitoring, alerting, and observability using Prometheus, Grafana, and logging tools
  • Troubleshoot complex production issues to ensure system reliability and customer satisfaction
  • Fine-tune distributed systems such as Apache Kafka and Cassandra
  • Collaborate with development, security, and operations teams to align infrastructure with application needs
What we offer
  • Health & Wellbeing
  • Personal & Professional Development
  • Unconditional Inclusion
Job Type: Full-time

Staff Site Reliability Engineer

Fivetran is building data pipelines to power the modern data stack for thousands...
Location: United States, Oakland
Salary: 196033.00 - 245041.50 USD / Year
Fivetran
Expiration Date: Until further notice
Requirements
  • Expertise in managed Kubernetes (EKS, AKS, and GKE)
  • Expertise in cloud platforms and related tooling: AWS, Azure, GCP, Terraform, Ansible, Buildkite, Pulumi, and ArgoCD
  • Expertise in Python/Shell scripting
  • Expertise with Linux operating systems, internals, and administration
  • Expertise with cloud networking such as VPNs, PrivateLink, and Private Service Connect (GCP)
  • Experience with databases such as PostgreSQL
Job Responsibility
  • Responsible for ongoing reliability and robustness of Fivetran's production infrastructure by monitoring availability, capacity, and throughput
  • Evolve systems by adding reliability into our product roadmap
  • Coordinate the reprioritization or fixing of critical bugs for support or sales requirements as needed
  • Recommend improvements to production infrastructure, working with engineering to ensure 100% availability
  • Ensure scalable artifact deployment to all environments through automation scripts
  • Constantly monitor infrastructure vulnerabilities and remedy them by working with the security team
What we offer
  • 100% employer-paid medical insurance
  • Generous paid time-off policy (PTO), plus paid sick time, inclusive parental leave policy, holidays, and volunteer days off
  • RSU stock grants
  • Professional development and training opportunities
  • Company virtual happy hours, free food, and fun team-building activities
  • Monthly cell phone stipend
  • Access to an innovative mental health support platform that offers personalized care and resources in areas such as: therapy, coaching, and self-guided mindfulness exercises for all covered employees and their covered dependents
Job Type: Full-time

Staff Site Reliability Engineer

Fivetran is looking for a high-performance engineer to join a team of Site Relia...
Location: Serbia, Novi Sad
Salary: Not provided
Fivetran
Expiration Date: Until further notice
Requirements
  • 7+ years of experience working with SaaS platforms at scale
  • Expertise in managed Kubernetes (EKS, AKS, and GKE)
  • Knowledge of Cloud Platforms and related tooling: AWS, Azure, GCP, Terraform, Ansible, Buildkite, Pulumi, and ArgoCD
  • Experience in Python, Shell scripting, and Go
  • Experience with Linux operating systems, internals, and administration
  • Experience with cloud networking such as Managed NAT Gateways, VPNs, PrivateLink, and Private Service Connect (GCP)
  • Experience with databases such as PostgreSQL
Job Responsibility
  • Responsible for the ongoing reliability and robustness of Fivetran’s production infrastructure by monitoring availability, capacity, and throughput
  • Collaborate with engineering teams to integrate reliability best practices into the product roadmap
  • Support the prioritization and resolution of critical bugs identified by support or sales
  • Contribute to maintaining the high reliability and availability of production infrastructure by collaborating with engineering to implement automation for scalable deployments
  • Ensure scalable artifact deployment to all environments through automation scripts
  • Proactively monitor infrastructure vulnerabilities and collaborate with the security team to promptly address them
What we offer
  • 100% employer-paid medical insurance
  • Generous paid time-off policy (PTO), plus paid sick time, inclusive parental leave policy, holidays, and volunteer days off
  • RSU stock grants
  • Professional development and training opportunities
  • Company virtual happy hours, free food, and fun team-building activities
  • Monthly cell phone stipend
  • Access to an innovative mental health support platform that offers personalized care and resources in areas such as: therapy, coaching, and self-guided mindfulness exercises for all covered employees and their covered dependents

Senior Staff Site Reliability Engineer

As a Site Reliability Engineer on the SASE Platform team, you will play a critic...
Location: Israel, Tel Aviv
Salary: Not provided
Palo Alto Networks Italia
Expiration Date: Until further notice
Requirements
  • 5+ years of experience working with Unix/Linux systems, including shell, tools, networking, and kernel concepts
  • 2+ years of hands-on experience with microservices architectures running on Kubernetes and container platforms
  • Proven experience operating workloads in public cloud environments (e.g., AWS, GCP, Azure) at scale
  • Proficiency in building automation and tools in at least one scripting or programming language (e.g., Python, Go, Java)
  • Strong experience with Infrastructure as Code (IaC) tools such as Terraform or Ansible
  • Bachelor’s degree in Engineering, Computer Science, or a related technical field, or equivalent practical experience
Job Responsibility
  • Proactively collaborate with development teams to embed reliability, scalability, and operability into services from the earliest design stages
  • Design, review, and evolve cloud-native architectures to improve availability, performance, cost efficiency, and fault tolerance
  • Build and operate automation for provisioning, deploying, and managing global infrastructure using Infrastructure as Code (IaC)
  • Improve CI/CD pipelines and release processes to enable safe, fast, and repeatable deployments
  • Drive observability best practices, including metrics, logs, traces, and SLIs/SLOs to enable data-driven incident analysis
  • Participate in on-call rotations, reducing mean time to resolution (MTTR) through automation and proactive reliability improvements
  • Challenge existing processes by championing reliability, security, and operational maturity across the organization
Job Type: Full-time