Senior Site Reliability Engineer

Senior Site Reliability Engineer - Automation Platform

Join a team of passionate and hardworking entrepreneurs to transform healthcare....

Location

Germany, Berlin

Salary:

Not provided

Doctolib

Expiration Date

Until further notice

Requirements

At least 5 years of site reliability engineering experience
Experience with AWS, Terraform, Kubernetes, and GitHub Actions, supporting the deployment of applications developed on the JVM and/or in TypeScript
Proactive, curious, collaborative and eager to learn
Proven experience with cloud services such as AWS, Azure or Google Cloud
Solid understanding of containerization and orchestration technologies (Docker and Kubernetes)
Proficiency in at least one programming language (Go, Java, Ruby, Python etc.) and a deep understanding of infrastructure as code principles

Job Responsibility

Collaborating with Feature teams to ensure services align with developer needs
Driving improvements by evaluating new technologies and processes
Defining best practices ("golden paths") for software development and deployment
Developing and maintaining tools and services that facilitate best practices
Ensuring reliability, scalability, traceability, and monitoring of services and infrastructure
Collaborating on roadmap delivery

What we offer

Company health insurance through partner Allianz
Minimum 28 days of paid leave
Parent Care Program: one additional month of leave on top of legal parental leave
Free mental health and coaching services through partner Moka.care
For caregivers and workers with disabilities, a package including adaptation of remote policy, extra days off for medical reasons, and psychological support
Flexible workplace policy offering both hybrid and office-based modes
Work from EU countries and the UK for up to 10 days per year
Reimbursement of public transportation

Fulltime

Senior Site Reliability Engineer - Fleet Reliability

Lambda, The Superintelligence Cloud, is a leader in AI cloud infrastructure serv...

Location

United States, San Francisco

Salary:

230000.00 - 345000.00 USD / Year

Lambda

Expiration Date

Until further notice

Requirements

7+ years of experience in Site Reliability Engineering, DevOps, or a similar role
Strong understanding of modern AI infrastructure, from GPU architectures to hardware performance optimization
Strong understanding of Linux-based systems in a distributed environment
Solid understanding of Python and Go, with experience working with SWE teams to improve internal tooling
Experience with monitoring and alerting tools (e.g., Prometheus, Grafana, SumoLogic)
Proficiency in automation and configuration management tools (e.g., Ansible, Terraform)
Familiarity with cloud platforms (e.g., OCI, AWS, GCP, Azure)
Excellent problem-solving and troubleshooting skills
Strong communication and collaboration skills
Passion for continuous improvement and innovation

Job Responsibility

Define Fleet Health metrics and indicators to objectively measure and improve system availability
Collaborate with the observability team on comprehensive monitoring and alerting systems to proactively predict, detect and respond to issues or anomalies
Create runbooks and automated remediations for common failure scenarios
Build in automation and auditing to ensure compliance and improve efficiency and productivity
Participate in on-call rotations and provide support for incident response and resolution
Implement and integrate logging and metrics across platforms such as Datadog, Prometheus, OpenTelemetry, Grafana, and SumoLogic

What we offer

Generous cash & equity compensation
Health, dental, and vision coverage for you and your dependents
Wellness and commuter stipends for select roles
401k Plan with 2% company match (USA employees)
Flexible paid time off plan

Fulltime

Senior Site Reliability Engineer

We are seeking a Senior Site Reliability Engineer with deep expertise in Kuberne...

Location

Denmark, Copenhagen

Salary:

Not provided

Keepit

Expiration Date

Until further notice

Requirements

5+ years in a Site Reliability, Platform, or DevOps Engineering role
Hands-on Kubernetes experience, including storage (Rook-Ceph or equivalent)
Solid Linux fundamentals
Proactive mindset
Clear communicator

Job Responsibility

Participate in the daily operation of our existing stack
Evolve and take part in designing our next generation infrastructure setup
Define and enforce reliability standards, runbooks, and operational best practices across the platform
Collaborate with Development and Operations teams to identify and resolve bottlenecks before they become incidents
Champion automation: if something is done twice, it should be scripted the third time

What we offer

Competitive salary
Pension scheme
A modern, energetic global work environment
Flexible work-life balance supported by a hybrid working model
Regular team-building activities
Opportunities for professional development and career advancement
Compensation is based on experience and skill set

Fulltime

Senior Site Reliability Engineer, Wikimedia Enterprise

The Wikimedia Foundation is looking for a Senior Site Reliability Engineer to jo...

Location

United States

Salary:

116633.00 - 181243.00 USD / Year

Wikimedia Foundation

Expiration Date

Until further notice

Requirements

Automation & Configuration Management: Experience with Infrastructure as Code and automation tools (e.g., Terraform, Ansible) and proficiency in at least one programming language (e.g., Python, Go, or similar)
Cloud Infrastructure: Experience designing, operating, and optimizing cloud-based systems across platforms such as AWS, Azure, or GCP, including scalability, reliability, and cost efficiency
CI/CD & Deployment Practices: Experience building and maintaining CI/CD pipelines and GitOps workflows (e.g., GitLab or similar, ArgoCD), with familiarity in progressive delivery approaches such as canary and blue-green deployments
Incident Management & Reliability Operations: Experience with incident response, on-call practices, and leading postmortems, with a focus on continuous improvement and operational excellence
SRE Principles & Observability: Strong understanding of SRE best practices, including SLOs, SLIs, and error budgets, along with experience in observability (metrics, logging, and distributed tracing, e.g., Prometheus, OpenTelemetry)
Collaboration & Communication: Ability to work effectively in a distributed, cross-functional environment, with strong documentation and communication skills
Proven experience operating highly available, large-scale distributed systems, with a deep understanding of reliability, scalability, and failure modes
Ownership mindset: Takes end-to-end responsibility for system reliability, proactively identifying and addressing risks before they impact users
Bias for automation: Continuously seeks to reduce operational toil through automation and scalable solutions
Continuous improvement mindset: Actively learns from incidents and drives improvements through blameless postmortems and iterative enhancements

Job Responsibility

Define, track, and improve Service Level Objectives (SLOs), SLIs, and error budgets to ensure reliability targets are met
Build and enhance observability systems (metrics, logs, and distributed tracing) to enable proactive detection and faster troubleshooting
Drive reliability engineering practices, including capacity planning, load testing, and resilience validation (e.g., chaos testing)
Improve developer experience (DevEx) by enabling self-service infrastructure and streamlining deployment workflows
Partner with engineering team members to embed reliability best practices early in the development lifecycle
Design, implement, and optimize CI/CD and GitOps workflows using tools such as GitLab (or similar) and ArgoCD (or similar), enabling automated, reliable deployments with support for progressive delivery strategies like canary and blue-green releases
Implement secure-by-default infrastructure and enforce best practices (e.g., IAM, secrets management, encryption)
Continuously optimize infrastructure cost and efficiency using FinOps principles while maintaining performance and availability
Establish and track operational metrics such as MTTR, MTTD, and incident frequency to drive continuous improvement
Reduce operational toil by identifying repetitive work and implementing automation-first solutions

Fulltime

Senior Site Reliability Engineer

The Wikimedia Foundation is looking for a Senior Site Reliability Engineer to su...

Location

United States

Salary:

113082.00 - 175725.00 USD / Year

Wikimedia Foundation

Expiration Date

Until further notice

Requirements

6+ years of experience in an SRE/Operations/DevOps role as part of a team
Experience with shell and any scripting language used in an SRE context (Python, Go, Bash, Ruby; we primarily use Python) and configuration management tools (Puppet, Ansible; we use Puppet)
Experience with distributed caching systems, including their underlying algorithms and how to optimize their performance
Experience with package management on Linux systems (we use Debian)
Strong Linux system-level troubleshooting skills
History of automating tasks and processes, identifying process gaps, and finding automation opportunities
Strong English language skills (verbal and written) and ability to work independently, as an effective part of a globally distributed team working across multiple time zones
Experience leading and participating in incident response and post-incident review rituals, with the goal of conducting root cause analysis and implementing preventive measures

Job Responsibility

Performing day-to-day operational/DevOps tasks on Wikimedia’s public-facing infrastructure (deployment, maintenance, configuration, troubleshooting)
Implementing and utilizing configuration management and deployment tools (Puppet, Kubernetes)
Leading continuous improvement by automating the installation, configuration and maintenance of services on our platform
Working closely with product teams to help them bring scalable functionality to our users by assisting in the architectural design of new services and making them operate at scale
Participating in a 24/7 on-call rotation shared across the broader SRE team. This includes taking part in incident response, diagnosis and follow-up on system outages or alerts across Wikimedia’s production infrastructure.
Collaborating with a global, cross-functional team in an asynchronous communication environment
Mentoring peers in your areas of technical and operational strength

Fulltime

Senior Site Reliability Engineer (SRE)

The Senior SRE is responsible for deployment, updates, and operational support f...

Location

India, Chennai

Salary:

Not provided

Dalet

Expiration Date

Until further notice

Requirements

Cloud platforms: AWS, Azure
Containerisation & Orchestration: Kubernetes
Infrastructure as Code: Terraform
Configuration Management: Ansible
Packaging & Deployment: Helm
Databases: MariaDB, MongoDB
Monitoring, observability, networking, and cloud security

Job Responsibility

Act as a senior technical authority for APAC Site Reliability Engineering activities
Drive best practices in reliability, operations, and engineering standards
Promote technical excellence, collaboration, and accountability across stakeholders
Make infrastructure complexity transparent to both internal teams and customers, ensuring a consistently excellent client experience
Implement, track, and evolve service performance measures such as SLAs, SLOs, and SLIs
Anticipate risks related to service availability, capacity, performance regressions, and security vulnerabilities
Drive continuous improvement, including leading and facilitating Root Cause Analysis (RCA) activities
Ensure timely execution of deployments, upgrades, maintenance activities, and change requests
Anticipate workload, plan deliverables, and ensure qualification/validation of upcoming tasks
Collaborate closely with engineering to improve platform components, automation, and operational processes

What we offer

Great career opportunities around the world
Truly collaborative environment with supportive leadership
Cutting edge technologies (AI, Cloud, Cybersecurity...)
Talented and passionate team members
Fun working environment

Fulltime

Senior Site Reliability Engineer Manager

RemoteStar is looking to hire a Senior Site Reliability Engineering Manager on b...

Location

United Kingdom of Great Britain and Northern Ireland, London

Salary:

Not provided

Remotestar

Expiration Date

Until further notice

Requirements

Proven experience in a senior or lead SRE role, with a strong track record of building and maintaining highly reliable infrastructure and services.
Expertise in incident management, including incident response, resolution, and post-mortem analysis.
Proficiency in monitoring, alerting, and observability tools such as Prometheus, Grafana, ELK stack or Datadog.
Experience with cloud platforms such as AWS, Azure, or GCP, including infrastructure as code tools like Terraform or CloudFormation.
Strong scripting and automation skills, with proficiency in languages such as Python, Bash, or Go.
Excellent communication and collaboration skills, with the ability to work effectively with cross-functional teams in a remote environment.
Demonstrated leadership capabilities, with a passion for mentoring and developing team members.

Job Responsibility

Take full ownership of the production estate from both a technical and process perspective.
Ensure consistent, smooth operation of live systems and drive all on-call support issues.
Design and operate a new incident tracking process to ensure root causes are found and remediated in a timely fashion by the development team.
Create and maintain high-end monitoring and automation tooling.
Drive automation initiatives to streamline operational workflows and improve efficiency.
Develop and maintain tools, scripts, and dashboards to monitor system health, performance, and reliability.
Build a first-class SRE team.
Through a combination of leading by example, coaching and mentoring, mould the team you would want to have around you.
Provide leadership and guidance to the SRE team, fostering a culture of collaboration, innovation, and continuous improvement.

What we offer

Dynamic working environment in an extremely fast-growing company
Work in an international environment
Work in a pleasant environment with very little hierarchy
Intellectually challenging work with a massive role in the client’s success and scalability
Flexible working hours

Fulltime

Senior Site Reliability Engineer

Embark on a transformative journey as a Senior Site Reliability Engineer - AVP. ...

Location

United States, Whippany

Salary:

120000.00 - 175000.00 USD / Year

Barclays

Expiration Date

Until further notice

Requirements

Considerable programming expertise in languages such as Python, Java, and others
Practical experience with Infrastructure as Code (IaC) tools, including Ansible, Chef, and Terraform
Validated experience with observability and monitoring platforms such as Observe, Elastic, InfluxDB, and Grafana
Solid understanding of containerization technologies and Unix/Linux environments
Demonstrates a Site Reliability Engineering (SRE) mindset, with good analytical skills, ownership, and a forward-thinking approach to problem-solving

Job Responsibility

Build and maintain infrastructure platforms and products that support applications and data systems
Ensure the reliability, availability, and scalability of the systems, platforms, and technology
Development, delivery, and maintenance of high-quality infrastructure solutions
Monitoring of IT infrastructure and system performance to measure, identify, address, and resolve any potential issues, vulnerabilities, or outages
Development and implementation of automated tasks and processes to improve efficiency and reduce manual intervention
Implementation of a secure configuration and measures to protect infrastructure against cyber-attacks, vulnerabilities, and other security threats
Cross-functional collaboration with product managers, architects, and other engineers to define IT Infrastructure requirements
Stay informed of industry technology trends and innovations

What we offer

Medical, dental and vision coverage
401(k)
Life insurance
Other paid leave for qualifying circumstances

Fulltime
