
Senior Site Reliability Engineer - Observability

Germany, Berlin · Employment contract · Job Posted May 16, 2026

Job Description

We are looking for a Senior Site Reliability Engineer to join the Core Reliability & Observability team in Platform Engineering. Your mission will be to shape Doctolib's observability strategy and ensure our platform remains reliable, debuggable, and scalable at European scale. You will work in a feature team developing logging, metrics, tracing, and alerting capabilities, directly supporting 400,000 health professionals and 80 million patients in their daily healthcare journey.

Job Responsibility

  • Lead the observability strategy across the platform, with an emphasis on building scalable, developer-friendly logging and tracing capabilities
  • Identify and lead large-scale cross-cutting reliability initiatives, including improvements to our incident detection, response, and postmortem analysis capabilities
  • Take part in the on-call rotation, and actively contribute to improving our on-call experience by refining alerting, reducing noise, and ensuring actionable telemetry

Requirements

  • Have solid hands-on experience (3+ years) on a large-scale production platform
  • Have proven experience with cloud platforms such as AWS, Azure or Google Cloud
  • Have a solid understanding of containerization and orchestration technologies (Docker and Kubernetes)
  • Have a strong understanding of Helm for managing Kubernetes manifests and ArgoCD for GitOps workflows
  • Have deep expertise in observability tooling and architecture, such as logging (Fluent Bit, OpenTelemetry, Loki, Elasticsearch, Logstash, Vector), tracing (OpenTelemetry or proprietary APMs), and metrics (Prometheus, Thanos, Datadog, or equivalent)
  • Have proficiency in at least one programming language (Ruby, Python, Go, Java, etc.) and a deep understanding of infrastructure as code principles
  • Have experience with monitoring and observability tools
  • Like troubleshooting performance issues in complex environments
  • Are fluent in English

Nice to have

  • Have experience contributing to open-source observability projects
  • Have worked in a high-growth tech environment
  • Are passionate about developer experience and platform engineering

What we offer

  • A Deutschlandticket (Germany-wide public transport pass) fully paid for by Doctolib
  • 28 vacation days + 1 additional day for each full calendar year of employment (up to a maximum of 30 days)
  • Work from abroad for up to 10 days per year thanks to our flexibility days policy
  • Company health insurance with great supplementary benefits through our partner Allianz
  • Company pension scheme (bAV) through Allianz with an employer subsidy of 40% (15% within the probationary period)
  • The Doctolib Parent Care program, which includes one month additional parental leave and much more
  • Enrollment in Doctolib's long-term employee value sharing plan called DoctoGrowth
  • Free mental health and coaching services through our partner Moka.care
  • Subsidized sports membership through our partner Urban Sports Club
  • A flexible workplace policy offering both hybrid and office-based modes
  • Alongside healthy snacks and our regular breakfast buffet, we provide a subsidized meal benefit
  • For caregivers and workers with disabilities, a package including an adaptation of the remote policy, extra days off for medical reasons, and psychological support
  • Relocation support in case of international mobility
  • Access to the best AI tools for coding, development and dedicated training

Similar Jobs for Senior Site Reliability Engineer - Observability

8 matching positions

Site Reliability Engineer / Observability Engineer

Rackspace is building up its Professional Services Center of Excellence on Appli...
Location: Egypt, Giza
Salary: Not provided
Company: Rackspace
Expiration Date: Until further notice
Requirements
  • Bachelor’s degree in engineering/computer science or equivalent
  • Senior-level experience with Site Reliability Engineering; DevOps; code-level application support and troubleshooting; AWS infrastructure design, implementation, and optimization; and automation for deployment, scaling, and reliability
  • Experience with observability tools such as Splunk, Datadog, SignalFx, etc.
  • Experience deploying, maintaining and supporting software applications/services in the AWS ecosystem
  • Proactive approach to identifying problems and solutions
  • Experience writing code with one or more interpreted languages such as Python, PHP, Perl, Ruby, Linux Shell
  • Experience with Terraform or CloudFormation scripting
  • Experience with configuration management tools like Ansible, Chef or Puppet
  • Experience with standard software development best practices and tools such as code repositories (Git preferred)
  • Experience executing in an agile software development environment
Job Responsibility
  • Work with customers and implement Observability solutions
  • Build and maintain scalable systems and robust automation that supports engineering goals
  • Develop and maintain monitoring tools, alerts, and dashboards to provide visibility into system health and performance
  • Proactively gather and analyze both metric and log data from systems and applications to perform anomaly detection, performance tuning, capacity planning and fault isolation
  • Collaborate with development teams to implement and deploy new features and enhancements, ensuring they meet reliability, security and performance standards
  • Collaborate with team members to document and share solutions
  • Maintain a deep understanding of the customer’s business as well as their technical environment
  • Identify performance bottlenecks and anomalous system behavior, and resolve the root cause of service issues
Employment type: Full-time

Senior Site Reliability Engineer - Fleet Reliability

Lambda, The Superintelligence Cloud, is a leader in AI cloud infrastructure serv...
Location: United States, San Francisco
Salary: 230000.00 - 345000.00 USD / Year
Company: Lambda
Expiration Date: Until further notice
Requirements
  • 7+ years of experience in Site Reliability Engineering, DevOps, or a similar role
  • Strong understanding of modern AI infrastructure, from GPU architectures to hardware performance optimization
  • Strong understanding of Linux-based systems in a distributed environment
  • Solid understanding of Python and Go, with experience working with SWE teams to improve internal tooling
  • Experience with monitoring and alerting tools (e.g., Prometheus, Grafana, SumoLogic)
  • Proficiency in automation and configuration management tools (e.g., Ansible, Terraform)
  • Familiarity with cloud platforms (e.g., OCI, AWS, GCP, Azure)
  • Excellent problem-solving and troubleshooting skills
  • Strong communication and collaboration skills
  • Passion for continuous improvement and innovation
Job Responsibility
  • Define Fleet Health metrics and indicators to objectively measure and improve system availability
  • Collaborate with the observability team on comprehensive monitoring and alerting systems to proactively predict, detect and respond to issues or anomalies
  • Create runbooks and automated remediations for common failure scenarios
  • Build in automation and auditing to ensure compliance and improve efficiency and productivity
  • Participate in on-call rotations and provide support for incident response and resolution
  • Implement and integrate logging and metrics across platforms such as Datadog, Prometheus, OpenTelemetry, Grafana, SumoLogic, etc.
What we offer
  • Generous cash & equity compensation
  • Health, dental, and vision coverage for you and your dependents
  • Wellness and commuter stipends for select roles
  • 401k Plan with 2% company match (USA employees)
  • Flexible paid time off plan
Employment type: Full-time

Senior Site Reliability Engineer

The Senior Site Reliability Engineer establishes and maintains the infrastructur...
Location: United Kingdom; United States; Canada
Salary: Not provided
Company: Mozilla
Expiration Date: Until further notice
Requirements
  • 7+ years of experience in infrastructure, platform engineering, or site reliability roles, including hands-on production Kubernetes experience in workload operations, troubleshooting, and cluster management
  • Hands-on experience with infrastructure-as-code on AWS using Terraform, OpenTofu, or Pulumi
  • Security awareness in day-to-day infrastructure work: identity, least privilege, secrets hygiene, and network controls
  • Demonstrated ownership mindset with the ability to proactively identify issues, drive work to completion, and communicate risks early
  • Excellent async written communication skills; comfortable working with a geographically distributed team
  • Ability to collaborate effectively with software engineers and non-engineering stakeholders to improve platform reliability and operational efficiency
  • Ability to learn, evaluate, and responsibly use emerging technologies, including AI-enabled tools, to improve work processes
Job Responsibility
  • Operate and evolve our EKS-based Kubernetes platform, supporting service migrations, platform improvements, and reliability initiatives
  • Design and develop CI/CD systems supporting websites, services, and Thunderbird desktop releases, contributing to pipeline reliability and OIDC-based authentication across GitHub Actions workflows
  • Write and maintain infrastructure in Pulumi and/or Terraform/OpenTofu across multiple AWS accounts
  • Operate and evolve our observability stack (VictoriaMetrics, VictoriaLogs, Grafana, Vector) and partner with engineering teams to incorporate instrumentation and monitoring into service design
  • Apply security-conscious infrastructure practices, including least-privilege IAM, secrets management via AWS Secrets Manager and External Secrets Operator, and network segmentation
  • Diagnose and debug production incidents; drive root-cause analysis and post-incident improvements to prevent recurring problems
  • Participate in on-call rotation and collaborate with SDEs and fellow SREs to ship, maintain, and monitor new builds and support service onboarding
  • Contribute to runbooks, architecture documentation, and team processes
What we offer
  • Fully remote work & schedule flexibility
  • Company-provided laptop
  • Annual bonus program
  • Monthly remote work stipend
  • Annual professional development stipend
  • Industry conferences
  • Company all-hands and team gatherings
  • 24 days PTO per year (prorated)
  • Birthday
  • Year-end company shutdown
Employment type: Full-time

Senior Site Reliability Engineer, Wikimedia Enterprise

The Wikimedia Foundation is looking for a Senior Site Reliability Engineer to jo...
Location: United States
Salary: 116633.00 - 181243.00 USD / Year
Company: Wikimedia Foundation
Expiration Date: Until further notice
Requirements
  • Automation & Configuration Management: Experience with Infrastructure as Code and automation tools (e.g., Terraform, Ansible) and proficiency in at least one programming language (e.g., Python, Go, or similar)
  • Cloud Infrastructure: Experience designing, operating, and optimizing cloud-based systems across platforms such as AWS, Azure, or GCP, including scalability, reliability, and cost efficiency
  • CI/CD & Deployment Practices: Experience building and maintaining CI/CD pipelines and GitOps workflows (e.g., GitLab or similar, ArgoCD), with familiarity in progressive delivery approaches such as canary and blue-green deployments
  • Incident Management & Reliability Operations: Experience with incident response, on-call practices, and leading postmortems, with a focus on continuous improvement and operational excellence
  • SRE Principles & Observability: Strong understanding of SRE best practices, including SLOs, SLIs, and error budgets, along with experience in observability (metrics, logging, and distributed tracing, e.g., Prometheus, OpenTelemetry)
  • Collaboration & Communication: Ability to work effectively in a distributed, cross-functional environment, with strong documentation and communication skills
  • Proven experience operating highly available, large-scale distributed systems, with a deep understanding of reliability, scalability, and failure modes
  • Ownership mindset: Takes end-to-end responsibility for system reliability, proactively identifying and addressing risks before they impact users
  • Bias for automation: Continuously seeks to reduce operational toil through automation and scalable solutions
  • Continuous improvement mindset: Actively learns from incidents and drives improvements through blameless postmortems and iterative enhancements
Job Responsibility
  • Define, track, and improve Service Level Objectives (SLOs), SLIs, and error budgets to ensure reliability targets are met
  • Build and enhance observability systems (metrics, logs, and distributed tracing) to enable proactive detection and faster troubleshooting
  • Drive reliability engineering practices, including capacity planning, load testing, and resilience validation (e.g., chaos testing)
  • Improve developer experience (DevEx) by enabling self-service infrastructure and streamlining deployment workflows
  • Partner with engineering team members to embed reliability best practices early in the development lifecycle
  • Design, implement, and optimize CI/CD and GitOps workflows using tools such as GitLab (or similar) and ArgoCD (or similar), enabling automated, reliable deployments with support for progressive delivery strategies like canary and blue-green releases
  • Implement secure-by-default infrastructure and enforce best practices (e.g., IAM, secrets management, encryption)
  • Continuously optimize infrastructure cost and efficiency using FinOps principles while maintaining performance and availability
  • Establish and track operational metrics such as MTTR, MTTD, and incident frequency to drive continuous improvement
  • Reduce operational toil by identifying repetitive work and implementing automation-first solutions
Employment type: Full-time

Senior Site Reliability Engineer

At bsport, the Senior Site Reliability Engineer is a role for someone who doesn’...
Location: Spain; France, Barcelona; Paris
Salary: Not provided
Company: Bsport
Expiration Date: Until further notice
Requirements
  • 5+ years of experience in SRE, Platform Engineering, Infrastructure or Backend Engineering
  • Strong experience with cloud infrastructure (AWS preferred), Kubernetes and CI/CD
  • Experience building or maintaining high-availability, scalable systems
  • Solid Python experience (bonus points for Django)
  • Experience working with SQL databases, ideally PostgreSQL
  • A proactive mindset: you enjoy taking ownership and solving complex technical challenges
  • Strong communication skills and fluency in English
Job Responsibility
  • Scale infrastructure and design resilient systems supporting international growth
  • Improve deployment speed, CI/CD pipelines and developer experience
  • Shape platform architecture through modularisation and scalable deployment strategies
  • Enhance observability, reliability and incident response capabilities
  • Influence engineering practices and collaborate across teams to improve how we build and ship
What we offer
  • Competitive salary packages based on your experience and role
  • Hybrid model with 3 days in the office per week
  • Work from anywhere: up to 15 days of remote work from abroad each year
  • Exclusive fitness perks: discounted access to Wellhub for Spain and HelloCSE membership for France
  • Private health insurance and flexible remuneration for Spain
  • Diverse, fun-loving team: multicultural colleagues, after-work events, team-building & more
Employment type: Full-time

Senior Site Reliability Engineer (SRE)

The Senior SRE is responsible for deployment, updates, and operational support f...
Location: India, Chennai
Salary: Not provided
Company: Dalet
Expiration Date: Until further notice
Requirements
  • Cloud platforms: AWS, Azure
  • Containerisation & Orchestration: Kubernetes
  • Infrastructure as Code: Terraform
  • Configuration Management: Ansible
  • Packaging & Deployment: Helm
  • Databases: MariaDB, MongoDB
  • Monitoring, observability, networking, and cloud security.
Job Responsibility
  • Act as a senior technical authority for APAC Site Reliability Engineering activities
  • Drive best practices in reliability, operations, and engineering standards
  • Promote technical excellence, collaboration, and accountability across stakeholders
  • Make infrastructure complexity transparent to both internal teams and customers, ensuring a consistently excellent client experience
  • Implement, track, and evolve service performance measures such as SLAs, SLOs, and SLIs
  • Anticipate risks related to service availability, capacity, performance regressions, and security vulnerabilities
  • Drive continuous improvement, including leading and facilitating Root Cause Analysis (RCA) activities
  • Ensure timely execution of deployments, upgrades, maintenance activities, and change requests
  • Anticipate workload, plan deliverables, and ensure qualification/validation of upcoming tasks
  • Collaborate closely with engineering to improve platform components, automation, and operational processes
What we offer
  • Great career opportunities around the world
  • Truly collaborative environment with supportive leadership
  • Cutting edge technologies (AI, Cloud, Cybersecurity...)
  • Talented and passionate team members
  • Fun working environment
Employment type: Full-time

Senior Site Reliability Engineer Manager

RemoteStar is looking to hire a Senior Site Reliability Engineering Manager on b...
Location: United Kingdom of Great Britain and Northern Ireland, London
Salary: Not provided
Company: Remotestar
Expiration Date: Until further notice
Requirements
  • Proven experience in a senior or lead SRE role, with a strong track record of building and maintaining highly reliable infrastructure and services.
  • Expertise in incident management, including incident response, resolution, and post-mortem analysis.
  • Proficiency in monitoring, alerting, and observability tools such as Prometheus, Grafana, ELK stack or Datadog.
  • Experience with cloud platforms such as AWS, Azure, or GCP, including infrastructure as code tools like Terraform or CloudFormation.
  • Strong scripting and automation skills, with proficiency in languages such as Python, Bash, or Go.
  • Excellent communication and collaboration skills, with the ability to work effectively with cross-functional teams in a remote environment.
  • Demonstrated leadership capabilities, with a passion for mentoring and developing team members.
Job Responsibility
  • Take full ownership of the production estate from both a technical and process perspective.
  • Provide consistent, smooth operation of live systems and drive all on-call support issues.
  • Design and operate a new incident tracking process to ensure root causes are found and remediated in a timely fashion by the development team.
  • Create and maintain high-end monitoring and automation tooling.
  • Drive automation initiatives to streamline operational workflows and improve efficiency.
  • Develop and maintain tools, scripts, and dashboards to monitor system health, performance, and reliability.
  • Build a first-class SRE team.
  • Through a combination of leading by example, coaching and mentoring, mould the team you would want to have around you.
  • Provide leadership and guidance to the SRE team, fostering a culture of collaboration, innovation, and continuous improvement.
What we offer
  • Dynamic working environment in an extremely fast-growing company
  • Work in an international environment
  • Work in a pleasant environment with very little hierarchy
  • Intellectually challenging work that plays a massive role in the client’s success and scalability
  • Flexible working hours
Employment type: Full-time

Senior Site Reliability Engineer

Embark on a transformative journey as a Senior Site Reliability Engineer - AVP. ...
Location: United States, Whippany
Salary: 120000.00 - 175000.00 USD / Year
Company: Barclays
Expiration Date: Until further notice
Requirements
  • Considerable programming expertise in languages such as Python, Java, and others
  • Practical experience with Infrastructure as Code (IaC) tools, including Ansible, Chef, and Terraform
  • Validated experience with observability and monitoring platforms such as Observe, Elastic, InfluxDB, and Grafana
  • Solid understanding of containerization technologies and Unix/Linux environments
  • Demonstrates a Site Reliability Engineering (SRE) mindset, with good analytical skills, ownership, and a forward-thinking approach to problem-solving
Job Responsibility
Job Responsibility
  • Build and maintain infrastructure platforms and products that support applications and data systems
  • Ensure the reliability, availability, and scalability of the systems, platforms, and technology
  • Development, delivery, and maintenance of high-quality infrastructure solutions
  • Monitoring of IT infrastructure and system performance to measure, identify, address, and resolve any potential issues, vulnerabilities, or outages
  • Development and implementation of automated tasks and processes to improve efficiency and reduce manual intervention
  • Implementation of a secure configuration and measures to protect infrastructure against cyber-attacks, vulnerabilities, and other security threats
  • Cross-functional collaboration with product managers, architects, and other engineers to define IT Infrastructure requirements
  • Stay informed of industry technology trends and innovations
What we offer
  • Medical, dental and vision coverage
  • 401(k)
  • Life insurance
  • Other paid leave for qualifying circumstances
Employment type: Full-time