This is not a traditional SRE or DevOps role. Whatnot's Reliability Engineering team is a software engineering team that builds the distributed systems, frameworks, and developer-facing tools that make reliability a built-in property of the platform. Think of it as platform engineering for reliability: the team designs and ships software that other engineers use every day to build, test, and operate their services with confidence.
As a senior leader in our Infrastructure organization, you will own the technical direction and execution of the Reliability Engineering team. The team's mandate spans SLOs, observability, load testing, resilience testing, incident response, and traffic control mechanisms. You will partner closely with engineering teams across the company to define reliability standards, accelerate detection and mitigation of issues, and ensure Whatnot's systems remain reliable, scalable, and performant as we grow.
This role carries significant leadership scope. Depending on the candidate, this person may also take on responsibilities as the Infrastructure engineering lead for Poland and a broader site leadership role for Whatnot Poland, either immediately or as the Poland presence scales.
Job Responsibilities:
Lead and mentor a team of highly skilled software engineers, supporting their technical growth, execution, and long-term career development
Set technical direction and quality standards for the team while empowering senior ICs to own design and architecture decisions
Develop and execute the strategic roadmap for reliability engineering at Whatnot
Build and operationalize best practices that empower product and platform teams to design and run reliable systems
Own the strategic roadmap for reliability tooling, including incident response systems, SLO measurement platforms, and developer-facing reliability libraries
Lead the team in designing and building traffic control systems as reusable platform components
Lead the design and execution of load testing at scale
Drive continuous improvement in incident detection and mitigation
Collaborate with cross-functional teams to influence product and architectural decisions that improve overall reliability and customer impact
Partner with Infrastructure and Engineering leadership to shape reliability strategy and investment priorities across the organization
Build a culture of learning and continuous improvement through blameless incident analysis, proactive reliability investment, and systematic reduction of repeated failure patterns
Scale the team through hiring, mentorship, leadership development, and thoughtful organizational design
Requirements:
10+ years of experience in infrastructure or platform engineering
5+ years managing engineering teams
Experience leading managers or multiple teams is a plus
Proven track record building and operating large-scale distributed systems with strong reliability, observability, and incident response practices
Deep technical grounding in one or more of: SLO design, monitoring/alerting, incident tooling, traffic control mechanisms, load and chaos testing, or platform engineering
Experience leading teams that ship developer-facing platforms, frameworks, or internal tools
Strong software engineering fundamentals
Demonstrated ability to guide teams through complex system challenges, large-scale migrations, and longer-term reliability initiatives
Exceptional communication and leadership skills
A passion for enabling teams to build fast while building safely through well-designed tooling and proactive detection mechanisms
Nice to have:
Experience leading multiple teams, managing managers, or serving as a site lead is a plus