Staff Engineer, Site Reliability

Ireland, Dublin · Job Posted December 09, 2025

Job Description

LearnUpon is looking for a Staff Site Reliability Engineer to join our team in Ireland. This is a flex role, working 1 day per week from LearnUpon's Dublin office. LearnUpon LMS helps organizations train their employees, partners, and customers. Businesses can manage, track, and achieve their unique learning goals — all through a single, powerful solution. As a Staff Engineer in Site Reliability Engineering you will be part of the team responsible for the scale-out of the LearnUpon infrastructure.

Job Responsibility

  • Identifying opportunities to improve and scale our infrastructure for performance, observability, maintainability, and cost, by creating innovative solutions
  • Leading our efforts to build an observability function that incorporates application metrics, application transaction tracking, and event log management
  • Driving the processes to maintain resilient, scalable and cost-effective infrastructure
  • Working with other Engineering teams to provide infrastructure solutions that meet their ongoing requirements
  • Building tools focused on measuring, monitoring and alerting, with an eye towards self-service in order to promote Engineers’ ownership of observability
  • Reacting quickly to changing customer and business needs
  • Participating in the on-call rota
  • Mentoring junior talent

Requirements

  • 7+ years of experience in a software or Ops role
  • 5+ years of cloud engineering experience, with at least 2 years of experience with AWS
  • Experience deploying Microservice environments, using containerisation technologies such as Kubernetes and Docker
  • Experience in designing and implementing Observability tech stacks
  • Have championed the benefits of Observability to Engineering teams
  • Can architect the design of SLO/SLI implementation that balances the needs of different teams
  • Familiar with cost analysis of Observability metrics gathering, Engineering effort, and tooling
  • Experience building and supporting large-scale distributed systems that back a consumer app or website with associated requirements of performance, security and disaster recovery
  • Experience implementing IaC (e.g. CloudFormation, Terraform), automation tooling (e.g. Puppet, Ansible), and CI/CD (e.g. Jenkins, Travis CI, GitLab)
  • Able to effectively communicate technical ideas to and collaborate with both technical and non-technical peers

Nice to have

  • Certification in AWS, any PaaS, and/or related technologies
  • Experience with database scaling would be a strong plus

What we offer

  • Work in a fun and supportive environment with regular team events
  • Excellent career progression
  • Structured learning environment
  • Competitive salary and company ESOP
  • Private health insurance
  • 26 days annual leave

Similar Jobs for Staff Engineer, Site Reliability

8 matching positions

Staff Engineer, Site Reliability Engineer

OnStar is a cornerstone of General Motors' connected services—bringing safety, s...
Location: Ireland, Dublin
Salary: Not provided
General Motors
Expiration Date: Until further notice
Requirements
  • 8+ years in SRE, DevOps, or systems engineering, including experience managing or mentoring high-impact teams
  • Track record of building and maintaining high-scale, cloud-native systems (preferably AWS, GCP, or Azure)
  • Expertise in container orchestration and deployment strategies using Kubernetes and CI/CD pipelines
  • Proficiency in Python, Go, or Java, with strong code review and readability standards
  • Experience leading cross-functional infrastructure projects, configuration strategy, or organizational tooling initiatives
  • Ability to think and act under pressure
  • Strong communication skills
Job Responsibility
  • Lead the design and implementation of scalable, fault-tolerant, and observable infrastructure supporting OnStar mobile and web experiences, in-vehicle services, and the backend platforms and integrations that power them
  • Champion configuration management, infrastructure refactoring, and testing frameworks to strengthen system resilience
  • Partner across SRE, development, and product teams to improve service reliability, deployment safety, and incident response practices
  • Drive internal consultation and strategic planning on reliability standards for new OnStar capabilities, customer-facing releases, and platform initiatives
  • Define and evolve observability strategy using tools such as Prometheus, Grafana, and Datadog, with automated alerting and actionable SLO dashboards
  • Own and improve on-call practices, manage blameless postmortems, and guide root cause analysis to eliminate recurring failures
  • Mentor engineers and help shape a high-performance culture rooted in extreme ownership and operational excellence
  • Support compliance and privacy-driven engineering initiatives across connected services, with potential crossover into areas like data retention and safety certification tooling
Job type: Full-time

Staff Site Reliability Engineer - Incident Management & Reliability

We’re not just building better tech. We’re rewriting how data moves and what the...
Location: Canada
Salary: 225100.00 - 264500.00 CAD / Year
Confluent
Expiration Date: Until further notice
Requirements
  • 10+ years of relevant experience in SRE, incident management, or reliability engineering
  • Cloud experience with at least one of AWS, GCP, or Azure
  • Experience navigating reliability/incident programs at 500+ engineer organizations
  • Deep expertise with incident management tooling (Rootly, PagerDuty, or similar)
  • Strong understanding of distributed systems and failure modes at scale
  • Deep experience with observability: metrics, logging, tracing
  • Kubernetes and container orchestration experience
  • Understanding of CI/CD pipelines and release processes
  • Strong written communication (design docs, runbooks, post-mortems)
  • Experience driving org-wide process and cultural changes
Job Responsibility
  • Analyze systemic failure patterns and design reliability improvements that prevent incident recurrence
  • Own Rootly configuration, workflows, and integrations with PagerDuty, Jira, Confluence, and Slack
  • Define and maintain SLO/SLA frameworks; use error budgets to guide reliability investments
  • Own standards, practices, and continuous improvement of incident response across engineering
  • Edit and review customer-facing incident documents (CRCAs) to ensure quality and clarity
  • Develop and deliver training programs; coach teams through post-mortems
  • Partner with engineering leaders to elevate reliability practices org-wide
What we offer
  • Remote-First Work
  • Robust Insurance Benefits
  • Flexible Time Away
  • The Best Teammates
  • Experience Ambassadors
  • Open and Honest Culture
  • Well-Being and Growth
  • Offers Equity
Job type: Full-time

Staff Site Reliability Engineer

Trimble is seeking a Staff Site Reliability Engineer (P4) to join our Corporate ...
Location: India, Chennai
Salary: Not provided
Trimble Inc.
Expiration Date: Until further notice
Requirements
  • Bachelor’s Degree or equivalent in Computer Science, Engineering, Information Systems, or a related field, or equivalent practical experience
  • Minimum of 10 years of experience in IT operations, including deep knowledge of networking, computing, and storage
  • Minimum of 5 years of experience with AWS and/or Azure cloud computing environments, with at least 2 years in an architect/design role
  • Windows and Linux deployment experience, including common services for each platform
  • Proficiency in at least one scripting language (preferably Python or PowerShell/.NET) and proficiency in using Git as a source control system
  • Strong background in application operations, including Incident Management, Change Management, and Capacity Management
  • Excellent troubleshooting and problem-solving skills, knowledge of security best practices, a strong desire to learn independently, and exceptional written/verbal communication skills with a customer-service mindset
Job Responsibility
  • Cloud Architecture & Enhancement: Develop new and enhance current shared public cloud services with a strict focus on Availability, Operations, Performance, Capacity, Security, and User Experience
  • Technical Leadership: Provide input and expertise relating to cloud hosting solutions (full infrastructure design and management). Transform business requirements into scalable operational designs
  • Collaboration & Planning: Attend and provide input on product planning sessions with internal development teams. Act as an expert on Business System services to communicate the value of our platform
  • Automation & Documentation: Identify and implement automation solutions. Develop and maintain critical documentation, including architecture diagrams, service descriptions, build/deploy processes, and operations runbooks
  • Mentorship & Support: Provide technical escalation and mentoring to other team members. Train operations teams to provide Level 1/2 support for shared public cloud services, acting as the ultimate Level 3 escalation point
  • Standards & Governance: Manage AWS/Azure best practice expectations and ensure alignment with corporate standards
  • Global Collaboration: Work effectively within a global team framework. Strike a balance between Indian and US time zones to attend business stakeholder meetings, address production issues, and serve as a reliable escalation point (including off-hours tasks when necessary)
Job type: Full-time

Staff Site Reliability Engineer - Cloud

Elevate Global Operations as our Next Cloud Site Reliability Engineer (OpenTelem...
Location: United Kingdom
Salary: Not provided
Trimble Inc.
Expiration Date: Until further notice
Requirements
  • Hands-on experience with the OpenTelemetry Collector, APIs, and SDKs
  • Extensive experience with observability tools like New Relic, Datadog, or Splunk
  • Strong proficiency in Infrastructure as Code (Terraform, Ansible) and cloud platforms (AWS, GCP, or Azure)
  • Deep understanding of containerization and orchestration using Docker and Kubernetes
  • Advanced coding skills in Python, Go, or Java for building robust automation and monitoring tools
  • Experience leveraging AI coding assistants like GitHub Copilot to accelerate development
Job Responsibility
  • Lead a global "OTel First" strategy, implementing OpenTelemetry at scale across a diverse technological landscape
  • Spearhead the development of automation scripts and Infrastructure as Code using Terraform to ensure seamless, reproducible platform delivery
  • Optimize platform performance and cost-efficiency, ensuring our observability tools scale economically as our data grows
  • Collaborate with engineering teams to embed reliability and security standards into new features from the ground up
  • Drive root cause analysis and problem management to proactively prevent incidents and improve the customer experience

Senior Staff Site Reliability Engineer

Fivetran is looking for a high-performance, experienced engineer to be a part of...
Location: India, Bengaluru
Salary: Not provided
Fivetran
Expiration Date: Until further notice
Requirements
  • 12+ years of experience working with SaaS products at scale
  • Working knowledge of managed Kubernetes (EKS, AKS and GKE)
  • Knowledge of Cloud Platforms and related tooling: AWS, Azure, Google Cloud (GCP), Terraform, Ansible, Buildkite, Pulumi and ArgoCD
  • Experience in Python/Shell scripting and Go; Java experience is a bonus
  • Experience with Linux operating systems internals and administration
  • Experience with cloud networking such as Site-to-Site VPNs, PrivateLink, and Private Service Connect (GCP)
Job Responsibility
  • Responsible for ongoing reliability and robustness of Fivetran’s production infrastructure by monitoring availability, capacity, and throughput
  • Evolve systems by adding reliability into our product roadmap
  • Coordinate the re-prioritization or fixing of critical bugs for support or sales requirements as needed
  • Make recommendations for production infrastructure by interfacing with engineering to ensure 100% availability
  • Ensure scalable artifact deployment to all environments through automation scripts
  • Constantly monitor infrastructure vulnerabilities and remedy them by working with the security team
What we offer
  • 100% employer-paid medical insurance
  • Generous paid time-off policy (PTO), plus paid sick time, inclusive parental leave policy, holidays, and volunteer days off
  • RSU stock grants
  • Professional development and training opportunities
  • Company virtual happy hours, free food, and fun team-building activities
  • Monthly cell phone stipend
  • Access to an innovative mental health support platform that offers personalized care and resources in areas such as: therapy, coaching, and self-guided mindfulness exercises for all covered employees and their covered dependents
Job type: Full-time

Site Reliability Engineer Staff

Site Reliability Engineer Staff. This role has been designed as 'Hybrid' with an...
Location: United States, San Juan
Salary: Not provided
Hewlett Packard Enterprise
Expiration Date: Until further notice
Requirements
  • Minimum of 4 years of hands-on experience in Infra Ops, Dev Ops, or Site Reliability Engineering (SRE)
  • Proficiency with Linux systems, especially Debian-based distributions
  • Strong experience with cloud platforms such as AWS and GCP
  • Expertise in Infrastructure as Code tools like Terraform, Packer, and Ansible
  • Solid programming skills in Python and/or Golang
  • Deep understanding of containerization (Docker, containerd) and orchestration tools (AWS EKS, GCP GKE)
  • Experience with GitOps workflows
  • Proven track record in implementing and maintaining CI/CD pipelines
  • Strong background in security and familiarity with security programs
  • Experience with monitoring and logging tools (Prometheus, Grafana, ELK)
Job Responsibility
  • Enhance Infrastructure as Code (IaC) and enforce best practices
  • Optimize cloud infrastructure for scalability, security, and cost-effectiveness
  • Develop internal tools to support and streamline cloud platform operations
  • Improve CI/CD pipelines and deployment workflows using FluxCD and Jenkins
  • Address container image vulnerabilities and standardize remediation processes
  • Build Amazon Machine Images (AMIs) aligned with CIS and STIG benchmarks
  • Strengthen monitoring, alerting, and observability using Prometheus, Grafana, and logging tools
  • Troubleshoot complex production issues to ensure system reliability and customer satisfaction
  • Fine-tune distributed systems such as Apache Kafka and Cassandra
  • Collaborate with development, security, and operations teams to align infrastructure with application needs
What we offer
  • Health & Wellbeing
  • Personal & Professional Development
  • Unconditional Inclusion
Job type: Full-time

Staff Site Reliability Engineer

Fivetran is building data pipelines to power the modern data stack for thousands...
Location: United States, Oakland
Salary: 196033.00 - 245041.50 USD / Year
Fivetran
Expiration Date: Until further notice
Requirements
  • Expertise in managed Kubernetes (EKS, AKS, and GKE)
  • Expertise in cloud platforms and related tooling: AWS, Azure, GCP, Terraform, Ansible, Buildkite, Pulumi, and ArgoCD
  • Expertise in Python/Shell scripting
  • Expertise with Linux operating systems, internals, and administration
  • Expertise with cloud networking such as VPNs, PrivateLink, and Private Service Connect (GCP)
  • Experience with databases such as PostgreSQL
Job Responsibility
  • Responsible for ongoing reliability and robustness of Fivetran's production infrastructure by monitoring availability, capacity, and throughput
  • Evolve systems by adding reliability into our product roadmap
  • Coordinate the re-prioritization or fixing of critical bugs for support or sales requirements as needed
  • Make recommendations for production infrastructure by interfacing with engineering to ensure 100% availability
  • Ensure scalable artifact deployment to all environments through automation scripts
  • Constantly monitor infrastructure vulnerabilities and remedy them by working with the security team
What we offer
  • 100% employer-paid medical insurance
  • Generous paid time-off policy (PTO), plus paid sick time, inclusive parental leave policy, holidays, and volunteer days off
  • RSU stock grants
  • Professional development and training opportunities
  • Company virtual happy hours, free food, and fun team-building activities
  • Monthly cell phone stipend
  • Access to an innovative mental health support platform that offers personalized care and resources in areas such as: therapy, coaching, and self-guided mindfulness exercises for all covered employees and their covered dependents
Job type: Full-time

Staff Site Reliability Engineer

Fivetran is looking for a high-performance engineer to join a team of Site Relia...
Location: Serbia, Novi Sad
Salary: Not provided
Fivetran
Expiration Date: Until further notice
Requirements
  • 7+ years of experience working with SaaS platforms at scale
  • Expertise in managed Kubernetes (EKS, AKS, and GKE)
  • Knowledge of Cloud Platforms and related tooling: AWS, Azure, GCP, Terraform, Ansible, Buildkite, Pulumi, and ArgoCD
  • Experience in Python, Shell scripting, and Go
  • Experience with Linux operating systems, internals, and administration
  • Experience with cloud networking such as Managed NAT Gateways, VPNs, PrivateLink, and Private Service Connect (GCP)
  • Experience with databases such as PostgreSQL
Job Responsibility
  • Responsible for the ongoing reliability and robustness of Fivetran’s production infrastructure by monitoring availability, capacity, and throughput
  • Collaborate with engineering teams to integrate reliability best practices into the product roadmap
  • Support the prioritization and resolution of critical bugs identified by support or sales
  • Contribute to maintaining the high reliability and availability of production infrastructure by collaborating with engineering to implement automation for scalable deployments
  • Ensure scalable artifact deployment to all environments through automation scripts
  • Proactively monitor infrastructure vulnerabilities and collaborate with the security team to promptly address them
What we offer
  • 100% employer-paid medical insurance
  • Generous paid time-off policy (PTO), plus paid sick time, inclusive parental leave policy, holidays, and volunteer days off
  • RSU stock grants
  • Professional development and training opportunities
  • Company virtual happy hours, free food, and fun team-building activities
  • Monthly cell phone stipend
  • Access to an innovative mental health support platform that offers personalized care and resources in areas such as: therapy, coaching, and self-guided mindfulness exercises for all covered employees and their covered dependents