Gamma's infrastructure needs to be rock-solid for millions of daily users while enabling our engineering teams to ship fast. You'll own the operational health of our full backend platform, building automation and tooling that improves reliability and partnering with engineering to design systems that are observable, resilient, and easy to operate. Your work directly impacts every Gamma user's experience. This is a high-impact role where you'll balance reliability with velocity, knowing when to move fast and when to prioritize stability. You'll lead incident response, drive systemic improvements, and help shape how Gamma scales to serve its next 100 million users.
Job Responsibilities:
Own reliability, availability, and performance of Gamma's production systems, which run primarily on AWS infrastructure
Build observability infrastructure with metrics, logging, tracing, and alerting that provide deep visibility into system health
Design automation to reduce toil, improve deployment safety, and accelerate incident resolution
Lead incident response, conduct blameless post-mortems, and drive systemic improvements to prevent recurring issues
Partner with engineering teams on architecture reviews, SLOs/SLIs, and reliability best practices
Manage and optimize our infrastructure, including compute, networking, databases, and managed services
Requirements:
5+ years in Site Reliability Engineering, DevOps, or systems engineering roles with deep AWS expertise
Strong programming skills (Python, Go, or TypeScript/Node.js) for building tools and automation
Experience with infrastructure-as-code (Terraform, CloudFormation) and comprehensive observability solutions
Track record of improving system reliability through automation, monitoring, and architectural improvements
Solid understanding of networking, distributed systems, containerization (Docker, Kubernetes), and database performance
Strong incident management and debugging skills for complex production issues
Nice to have:
Experience scaling SaaS applications to millions of users
Background with real-time collaborative systems, Kafka, chaos engineering, or service mesh technologies
AWS certifications or experience with security/compliance requirements (SOC 2, ISO 27001)