We're looking for software engineers to join our Infrastructure Reliability Engineering team. In this role, you will build the distributed systems, services, and frameworks that make reliability a built-in property of the Whatnot platform. As scale, traffic, and complexity continue to grow, you will ensure our systems stay ahead of that curve.
Whatnot's backend is built primarily in Python and Elixir, with Go used for performance-sensitive infrastructure components. You'll work across these languages depending on the problem, and we value engineers who are comfortable learning new stacks over those who are deep in only one.
This is systems engineering work. You'll design and build three kinds of systems: components that sit in the critical path of Whatnot's traffic, test infrastructure that validates system behavior before and during peak events, and developer-facing frameworks that raise the reliability floor across the entire platform. You will partner closely with product, platform, and infrastructure teams to embed reliability into system design, development workflows, and runtime behavior.
Job Responsibilities:
Designing and building distributed systems that support reliability, resiliency, and safe operation at scale
Designing and operating traffic control mechanisms: circuit breakers, rate limiting, admission control, backpressure, and graceful degradation
Building and evolving load testing frameworks that validate system behavior under sustained, burst, and peak event traffic patterns
Building chaos and resilience testing infrastructure to proactively surface failure modes and validate recovery behavior
Building systems that enable teams to define and implement SLOs, SLIs, and error budgets to guide reliability tradeoffs
Developing tooling that improves incident detection, response, and automated mitigation
Reviewing service architectures with a focus on failure modes, scalability limits, and operational safety
Participating in incident response and driving systemic fixes that reduce repeated failure patterns
Requirements:
5+ years of experience designing and building large-scale distributed systems
Experience in Python, Elixir, or Go is preferred
Strong fundamentals in designing, building, and operating shared production services and frameworks
Experience with one or more of: traffic control mechanisms such as circuit breakers and rate limiting; building or operating load testing and chaos testing frameworks; hands-on observability, monitoring, and debugging of production systems; SLOs, error budgets, and incident response processes
Comfortable in cloud-native environments such as AWS or GCP with Kubernetes and infrastructure as code
Strong collaborator with clear written and verbal communication skills
Nice to have:
Experience with high-traffic, real-time, or event-driven systems
Experience building developer-facing tools, frameworks, or platform libraries consumed by other engineering teams