Senior Site Reliability Engineer - Observability

Doctolib

Location:
Germany, Berlin

Contract Type:
Employment contract

Salary:

Not provided

Job Description:

We are looking for a Senior Site Reliability Engineer to join the Core Reliability & Observability team in Platform Engineering. Your mission will be to shape Doctolib's observability strategy and ensure our platform remains reliable, debuggable, and scalable at European scale. You will work in a feature team developing logging, metrics, tracing, and alerting capabilities, directly supporting 400,000 health professionals and 80 million patients in their daily healthcare journey.

Job Responsibility:

  • Lead the observability strategy across the platform, with an emphasis on building scalable, developer-friendly logging and tracing capabilities
  • Identify and lead large-scale cross-cutting reliability initiatives, including improvements to our incident detection, response, and postmortem analysis capabilities
  • Take part in the on-call rotation, and actively contribute to improving our on-call experience by refining alerting, reducing noise, and ensuring actionable telemetry

Requirements:

  • Have solid hands-on experience (3+ years) on a large-scale production platform
  • Have proven experience with cloud platforms such as AWS, Azure or Google Cloud
  • Have a solid understanding of containerization and orchestration technologies (Docker and Kubernetes)
  • Have a strong understanding of Helm for managing Kubernetes manifests and ArgoCD for GitOps workflows
  • Have deep expertise in observability tooling and architecture, such as: Logging (Fluent Bit, OpenTelemetry, Loki, Elasticsearch, Logstash, Vector); Tracing (OpenTelemetry or proprietary APMs); Metrics (Prometheus, Thanos, Datadog, or equivalent)
  • Have proficiency in at least one programming language (Ruby, Python, Go, Java, etc.) and a deep understanding of infrastructure as code principles
  • Have experience with monitoring and observability tools
  • Like troubleshooting performance issues in complex environments
  • Are fluent in English

Nice to have:

  • Have experience contributing to open-source observability projects
  • Have worked in a high-growth tech environment
  • Are passionate about developer experience and platform engineering

What we offer:

  • A Deutschlandticket (Germany-wide public transport pass) fully paid for by Doctolib
  • 28 vacation days + 1 additional day for each full calendar year of employment (up to a maximum of 30 days)
  • Work from abroad for up to 10 days per year thanks to our flexibility days policy
  • Company health insurance with great supplementary benefits through our partner Allianz
  • Company pension scheme (bAV) through Allianz with an employer subsidy of 40% (15% within the probationary period)
  • The Doctolib Parent Care program, which includes one month additional parental leave and much more
  • Enrollment in Doctolib's long-term employee value sharing plan called DoctoGrowth
  • Free mental health and coaching services through our partner Moka.care
  • Subsidized sports membership through our partner Urban Sports Club
  • A flexible workplace policy offering both hybrid and office-based modes
  • Alongside healthy snacks and our regular breakfast buffet, we provide a subsidized meal benefit
  • For caregivers and workers with disabilities, a package including an adaptation of the remote policy, extra days off for medical reasons, and psychological support
  • Relocation support in case of international mobility
  • Access to the best AI tools for coding, development and dedicated training

Additional Information:

Job Posted:
May 16, 2026

Employment Type:
Fulltime
Work Type:
Hybrid work

Similar Jobs for Senior Site Reliability Engineer - Observability

Senior Site Reliability Engineer

Architect, develop, and troubleshoot large-scale infrastructure, maintain and im...
Location
United States, San Francisco
Salary
180960.00 - 230900.00 USD / Year
Atlassian
Expiration Date
Until further notice
Requirements
  • Bachelor’s degree in Computer Science, Software Engineering, Information Technology or a closely related field
  • four years of experience as a Site Reliability Engineer architecting, developing, and troubleshooting large scale infrastructure utilizing programming languages such as PowerShell, Python, or Bash
  • networking technologies such as TCP/IP or security
  • four years of experience in automation development and infrastructure as code implementation using tools such as Terraform, AWS CloudFormation, Ansible, or Salt
  • knowledge of Linux and Windows systems
  • cloud technologies within AWS, GCP, Azure
  • continuous integration continuous delivery/deployment (CICD) practices and monitoring and observability practices
  • must pass technical interview
Job Responsibility
  • Architect, develop, and troubleshoot large scale infrastructure utilizing programming languages such as PowerShell, Python, or Bash and networking technologies such as TCP/IP or security
  • provide real-time feedback on production systems
  • work with product family and platform developers to maintain and improve services and performance with a strong customer focus
  • utilize a variety of data collection, enrichment, analytics, and visualizations to support our complex systems
  • responsible for automation development and infrastructure-as-code implementation using tools such as Terraform, AWS CloudFormation, Ansible, and/or Salt
  • build solutions to enhance availability, performance, and stability for hundreds of Atlassian enterprise customers in the cloud as well as automate repetitive work
  • help secure the cloud architecture with penetration testing, vulnerability resolution, and compliance audit responses
  • responsible for continuous integration continuous delivery/deployment (CICD) practices and monitoring and observability practices
What we offer
  • Health and wellbeing resources
  • paid volunteer days
Employment Type:
Fulltime

Senior Site Reliability Engineer

As a Senior Site Reliability Engineer on the Platform team, you will identify is...
Location
United States, Denver; San Francisco
Salary
138000.00 - 191000.00 USD / Year
Checkr
Expiration Date
Until further notice
Requirements
  • Degree in Computer Science (or related field)
  • 6+ years of experience in building tools with Python (preferred), GoLang, or Ruby
  • 6+ years of experience in maintaining and observing production customer-facing environments in AWS or Azure
  • 6+ years of experience as a member of an incident response team
  • Deep understanding of the fundamental infrastructure and platform concepts behind a micro-service architecture, REST APIs, and asynchronous queueing models
  • Experience with observability platforms and frameworks like Datadog, Splunk, Grafana, Prometheus, or OpenTelemetry
  • Strong collaboration, documentation, communication, and project management skills
  • Experience with container orchestration using Kubernetes/Docker/Terraform
  • Experience driving platform adoption across engineering teams, guided by a self-service and product-first approach
  • A passion for customer-centricity and building relationships with other teams
Job Responsibility
  • Collaborate, drive, and execute architectural discussions with cross-functional teams
  • Lead cross-team projects and SREs' technical roadmap to enable engineering and help Checkr customers
  • Design, build, ship, and maintain the core observability libraries, tools, and patterns used by all of Checkr’s engineering teams
  • Proactively engage across teams to foster service reliability, efficiency, and scalability
  • Troubleshoot complex production issues across the stack, with respect to performance, availability, and data quality
  • Present detailed technical information and benefits of the Checkr platform to a wide array of customers, including operations, developers, technical architects, and executives
What we offer
  • A fast-paced and collaborative environment
  • Learning and development allowance
  • Competitive cash and equity compensation and opportunities for advancement
  • 100% medical, dental, and vision coverage
  • Up to $25K reimbursement for fertility, adoption, and parental planning services
  • Flexible PTO policy
  • Monthly wellness stipend, home office stipend
  • In-office perks such as lunch four times a week, commuter stipend, and an abundance of snacks and beverages
Employment Type:
Fulltime

Senior Vice President, Cloud Security Site Reliability Engineer

This role sits within the Cloud Security team which is responsible for Private a...
Location
Singapore, Singapore
Salary
Not provided
Citi
Expiration Date
Until further notice
Requirements
  • Bachelor’s degree or equivalent work experience
  • 8+ years of relevant work experience
  • Highly motivated self-starter with excellent interpersonal and communication skills. Able to communicate efficiently at multiple levels of seniority
  • Certification or formal training in site reliability engineering concepts and practices
  • Prior experience working towards SLIs, SLOs and observability capabilities at a large scale
  • 5+ years of experience in Python (preferred) or Java on large-scale systems, alongside Linux-based scripting languages
  • Experience working on observability, logging and metrics toolsets
  • Experience with Kubernetes and container technologies such as Docker, OpenShift, and EKS
  • Experience with public cloud technologies such as AWS, GCP or Azure
  • Experience with Secrets products such as HashiCorp Vault or CyberArk
Job Responsibility
  • Working across Container products and Secrets products, across Public and Private Cloud, as well as Cloud native specific products
  • Architecting and building tools and platforms that provide capabilities for SRE
  • Collaboration with multiple stakeholders and partners across Engineering and Operations as well as partner teams within the wider Citi organization
  • Actively owning production-level incidents until resolution
Employment Type:
Fulltime

Senior Site Reliability Engineer

We are seeking an experienced Senior Site Reliability Engineer (L3) to join our ...
Location
India, Chennai
Salary
Not provided
Arcadia
Expiration Date
Until further notice
Requirements
  • Bachelor’s degree in Computer Science, Engineering, or equivalent practical experience
  • 8–10+ years of experience in SRE/DevOps/Cloud Engineering, with deep hands-on exposure to AWS and Kubernetes
  • Strong hands-on experience with: Terraform and Infrastructure as Code; AWS core services (EKS, IAM, RDS, EC2, VPC, CloudWatch, CloudTrail, GuardDuty); Jenkins + Groovy, GitHub Actions, ArgoCD, FluxCD; Kubernetes troubleshooting and operations; Prometheus/Grafana/Datadog observability stacks
  • Proven ability to operate in high-scale, high-uptime, multi-environment production systems
  • Experience building automation via Python/Bash and reducing operational toil
  • Strong understanding of incident management, root cause analysis, and reliability engineering principles
Job Responsibility
  • Design, build, and maintain AWS infrastructure (EKS, VPC, RDS, IAM, CloudWatch, CloudTrail, GuardDuty, Load Balancers, S3, CloudFront) using Terraform and CloudFormation
  • Lead all aspects of Kubernetes operations including cluster upgrades, performance tuning, CNI troubleshooting, workload scaling, Helm chart packaging, and GitOps deployments
  • Own and evolve our CI/CD ecosystem across Jenkins (Groovy scripting), GitHub Actions, AWS CodePipeline, ArgoCD, and FluxCD
  • Improve platform reliability by reducing operational toil through automation, scripting (Python/Bash), and proactive system hardening
  • Implement and enhance observability across Prometheus, Grafana, Loki, Tempo, Datadog, and CloudWatch—ensuring actionable alerting, dashboards, and metrics alignment with SLO/SLIs
  • Drive FinOps initiatives, identifying cost inefficiencies and working with engineering teams to implement best practices, tagging standards, budgeting, and resource right-sizing
  • Manage database operations across MySQL and PostgreSQL including backups, performance tuning, replication, and operational runbooks
  • Maintain and improve secret management using Vault, AWS Secrets Manager, and Parameter Store
  • Strengthen cloud security posture with IAM least privilege, CSPM reviews, audit readiness, GuardDuty/CloudTrail monitoring, and environment hardening
  • Troubleshoot complex production issues across networking, Kubernetes, compute, databases, and CI/CD systems
What we offer
  • Competitive compensation and employee stock options
  • Hybrid/remote-first working model (India-based role, with global collaboration)
  • Flexible leave policy
  • Comprehensive medical insurance (self + family members)
  • Annual performance cycle + quarterly recognition awards
  • A supportive, diverse engineering culture grounded in empathy, teamwork, and innovation
Employment Type:
Fulltime

Senior Site Reliability Engineer

HiveWatch is seeking a Staff Site Reliability Engineer to join our Platform Team...
Location
United States, El Segundo
Salary
183000.00 - 235000.00 USD / Year
HiveWatch
Expiration Date
Until further notice
Requirements
  • 7+ years of software engineering experience with strong coding skills in production environments
  • 5+ years of SRE, DevOps, or production operations experience
  • Expertise with cloud platforms (AWS preferred) and containerized applications (Docker, Kubernetes)
  • Experience with Infrastructure as Code (Terraform, CloudFormation, or similar)
  • Proficiency in at least one object-oriented programming language in our tech stack (Java, Kotlin, Python)
  • Hands-on experience with relational databases and SQL performance optimization
  • Experience with monitoring and observability tools (Prometheus, Grafana, DataDog, or equivalent)
  • Strong debugging skills across distributed systems and microservices architectures
  • Bachelor's degree in Computer Science, Engineering, or equivalent practical experience
Job Responsibility
  • Own the reliability of mission-critical systems including production monitoring, alerting, and capacity planning
  • Debug and resolve complex production issues across the full stack, from infrastructure to application code
  • Participate in a regular on-call rotation to provide 24/7 coverage for critical systems
  • Perform root cause analysis requiring deep code-level investigation and implement preventive measures
  • Build automation and tooling to reduce operational toil and improve system reliability
  • Maintain CI/CD pipelines, observability infrastructure, and database performance optimization
  • Increase the resiliency, scalability, and maintainability of production environments
  • Establish on-call procedures and disaster recovery processes
  • Provide technical leadership and mentorship to foster engineering excellence and reliability culture
What we offer
  • Comprehensive health coverage: medical, dental, vision, and life insurance
  • Cutting-edge work in an emerging field with huge growth potential
  • Competitive compensation packages designed to reward top talent
  • A modern, newly renovated HQ right on Main Street in El Segundo, CA
  • 401(k) with a 4% company match to help you invest in your future (match launches in 2026)
  • Flexible paid time off so you can recharge when you need it
  • Additional benefits include ClassPass credits and a discount on pet insurance
  • A family-friendly, compassionate culture that values balance and belonging
  • Eligible to participate in HiveWatch Equity Incentive Plan
Employment Type:
Fulltime

Senior Site Reliability Engineer

What will you be doing at Miniclip? Participate in an on-call rotation with the ...
Location
Portugal, Lisbon
Salary
Not provided
Miniclip
Expiration Date
Until further notice
Requirements
  • 5+ years of hands-on experience with AWS in both development and operations contexts
  • Strong Linux system administration skills, including performance tuning and debugging
  • Software development background and strong coding skills in one or more of the following: Go, Python, Ruby
  • Experience with Infrastructure as Code, particularly Terraform
  • Familiarity with CI/CD pipelines and artifact management tools
  • A mindset for resilient systems design, thinking about edge cases, failure modes, and graceful degradation
  • Excellent communication skills in English, both written and spoken
  • Comfortable in a fast-paced environment and adaptable to shifting priorities
Job Responsibility
  • Participate in an on-call rotation with the Cloud Engineering team to respond to production incidents and outages
  • Operate and evolve infrastructure using Infrastructure as Code (Terraform), configuration management tools, and containerized platforms on AWS
  • Build and maintain observability tooling to detect symptoms before they lead to outages
  • Automate repetitive tasks and processes to reduce operational toil
  • Collaborate with Engineering and Product teams to design resilient systems that meet performance and reliability goals
  • Troubleshoot production issues across application, network, and infrastructure layers
  • Document systems, processes, and runbooks to improve team transparency and onboarding

Senior Site Reliability Engineer

As a Site Reliability Engineer, you will focus on ensuring that the Prolific pla...
Location
United Kingdom
Salary
Not provided
Prolific
Expiration Date
Until further notice
Requirements
  • 5+ years with Google Cloud Platform, GKE, and the Kubernetes ecosystem with experience with Terraform and Terragrunt
  • Strong programming skills in Python
  • Strong experience in observability principles and tooling
  • Experience in GitOps flows and platforms for Kubernetes, such as ArgoCD
  • Deep understanding of system architecture and scalability principles
  • Strong collaboration and communication skills to work with cross-functional teams
Job Responsibility
  • Develop and maintain highly available infrastructure using modern infra-as-code techniques, with a focus on Terragrunt and Terraform
  • Manage and optimise Kubernetes clusters and their workloads with a focus on reliability and performance
  • Participate in incident response and remediation, working with relevant product teams and stakeholders to resolve production issues efficiently, including creating and maintaining runbooks
  • Review and optimise other areas of our tooling stack, such as CICD or release strategies
  • Foster a culture of continuous improvement, such as enhancing documentation and upskilling teams in cloud architecture and Kubernetes
  • Improve observability and alerting systems across our application and infrastructure, ensuring proactive detection of system degradation
  • Collaborate with Engineering teams to foster an SRE culture, including helping define SLOs, SLAs, and error budgets
  • Design and implement automation strategies to ensure managed services remain up-to-date, secure, and performant
  • Lead and support initiatives that automate processes to improve system efficiency, resilience and reduce toil
  • Organising, supporting and responding to on-call incidents
What we offer
  • competitive salary
  • benefits
  • remote working
  • impactful, mission-driven culture
Employment Type:
Fulltime

Senior Observability Engineer

Coralogix is a modern, full-stack observability platform transforming how busine...
Location
Germany, Berlin
Salary
Not provided
Coralogix
Expiration Date
Until further notice
Requirements
  • 5+ years of experience in Site Reliability, DevOps, or Platform Engineering with a focus on observability
  • Proven expertise with at least one major observability platform (e.g., Prometheus, VictoriaMetrics, OpenSearch)
  • Hands-on experience with Kubernetes, including deep knowledge of controllers, operators, and Helm
  • Experience writing Kubernetes controllers (controller-runtime, KubeBuilder)
  • Strong programming skills in Go or Python (Rust is a plus)
  • Experience designing, scaling, and operating observability systems at enterprise scale
  • Familiarity with at least one major cloud provider (AWS, Azure, or GCP)
  • Strong understanding of distributed systems, telemetry pipelines, and instrumentation standards (e.g., OpenTelemetry)
  • Excellent communication skills with the ability to explain complex topics to diverse stakeholders
Job Responsibility
  • Design, implement, and maintain observability features such as Alerting, SLOs, Reporting, and Synthetic Tests
  • Manage and scale OpenTelemetry Collectors and other observability agents across Kubernetes environments
  • Write and maintain Kubernetes Controllers using frameworks like controller-runtime and KubeBuilder
  • Operate and optimize the internal Coralogix account, ensuring proper usage, cost efficiency, and best practices adoption
  • Define and enforce observability guidelines and standards across the organization
  • Partner with engineering teams to embed observability by default into products and services
  • Control observability-related costs while maximizing performance, visibility, and value
  • Contribute to upstream projects such as OpenTelemetry, helping shape industry standards
  • Explore and implement cutting-edge observability technologies, including eBPF-based approaches
Employment Type:
Fulltime