As a Platform Engineer at Speak, you’ll be the driving force behind the reliability and resilience of the systems that power our global language learning experience. You’ll lead efforts to scale our infrastructure, harden our platform, and ensure that our services are fast, available, and reliable for millions of users around the world. You’ll work across our stack, from Kubernetes on GCP to our Node.js APIs, Postgres, and Redis, building robust infrastructure and operational tooling. You’ll own incident response, observability, and SLOs while embedding a culture of reliability throughout the engineering org. Speak is growing rapidly, and we’re pushing our systems harder every day. This is a unique opportunity to shape the future of our platform as we scale to the next 10x of users.
Job Responsibilities:
Own the reliability of Speak’s infrastructure across GCP, Kubernetes, and our Node.js/Postgres stack
Lead response for P0/P1 incidents, drive postmortems, and ensure we’re learning from every outage
Improve observability, alerting, and on-call processes so we catch issues before users do
Define and drive adoption of SLOs/SLAs for core systems and services
Build tools and frameworks to make reliability easier for product engineers—think safer deploys and infrastructure automation
Collaborate cross-functionally with Product, Engineering, and ML teams to ensure reliability is baked into everything we build
Set short-term and long-term roadmaps to ensure stability for our growing user base
Be a thought leader and coach around platform and infrastructure engineering principles—blameless culture, operational maturity, and continuous improvement
Requirements:
7+ years of experience in SRE, DevOps, Platform, and/or infrastructure-focused engineering roles, ideally with experience leading or mentoring others
Strong experience with GCP, Kubernetes, Terraform, Node.js, Python, PostgreSQL, Redis, and observability tooling like Prometheus and Sentry
Proven track record of improving reliability, scaling systems, and reducing incident frequency and severity in high-traffic systems
Strong incident management and root cause analysis skills
Experience building and maintaining CI/CD pipelines and deployment safety tooling
Strong systems thinking, with the ability to identify failure points and proactively harden services
Deep sense of ownership and a desire to make infrastructure a force multiplier for the rest of the org
Nice to have:
Familiarity with cost optimization strategies in cloud-native environments
Background in security, chaos engineering, or disaster recovery planning
Contributions to internal tooling, automation, or developer productivity initiatives
What we offer:
Equity
Join a fantastic, tight-knit team at the right time
Do your life's work with people you’ll love working with