The Senior Site Reliability Engineer establishes and maintains the infrastructure and operational systems that Thunderbird users and teams depend on every day. You'll design and develop CI/CD systems for MZLA websites, services, and release workflows, diagnose and debug production incidents, and implement improvements to enhance system reliability.
We believe that good infrastructure work is invisible when it's going well and invaluable when it isn't. This role is for someone who treats production as something to be understood, not just kept running. You write things down, flag problems before they become fires, and leave documentation better than you found it. You bring production instincts, infrastructure-as-code fluency, and security awareness that's baked in, not bolted on.
You'll work closely with Software Development Engineers, team members, and community contributors, reporting to the Sr Manager, Platform Infrastructure. This is a great opportunity for someone who thrives in ambiguity, makes good decisions without a complete picture, and cares about Thunderbird's mission: open-source software used by millions who choose privacy and ownership over convenience.
This role requires consistent, regular overlap with Pacific Time working hours to enable effective collaboration and context sharing with Pacific Time colleagues.
Job Responsibilities
Operate and evolve our EKS-based Kubernetes platform, supporting service migrations, platform improvements, and reliability initiatives
Design and develop CI/CD systems supporting websites, services, and Thunderbird desktop releases, contributing to pipeline reliability and OIDC-based authentication across GitHub Actions workflows
Write and maintain infrastructure in Pulumi and/or Terraform/OpenTofu across multiple AWS accounts
Operate and evolve our observability stack (VictoriaMetrics, VictoriaLogs, Grafana, Vector) and partner with engineering teams to incorporate instrumentation and monitoring into service design
Apply security-conscious infrastructure practices, including least-privilege IAM, secrets management via AWS Secrets Manager and External Secrets Operator, and network segmentation
Diagnose and debug production incidents; drive root-cause analysis and post-incident improvements to prevent recurring problems
Participate in the on-call rotation and collaborate with SDEs and fellow SREs to ship, maintain, and monitor new builds and support service onboarding
Contribute to runbooks, architecture documentation, and team processes
Requirements
7+ years of experience in infrastructure, platform engineering, or site reliability roles, including hands-on production Kubernetes experience in workload operations, troubleshooting, and cluster management
Hands-on experience with infrastructure-as-code on AWS using Terraform, OpenTofu, or Pulumi
Security awareness in day-to-day infrastructure work: identity, least privilege, secrets hygiene, and network controls
Demonstrated ownership mindset with the ability to proactively identify issues, drive work to completion, and communicate risks early
Excellent async written communication skills; comfortable working with a geographically distributed team
Ability to collaborate effectively with software engineers and non-engineering stakeholders to improve platform reliability and operational efficiency
Ability to learn, evaluate, and responsibly use emerging technologies, including AI-enabled tools, to improve work processes
Nice to have
Experience with GitOps workflows (ArgoCD or Flux)
Familiarity with Keycloak or similar identity platforms (OIDC, SAML, federation)