This role owns the engineering response to high-priority customer-impacting issues, operational defects, and technical debt, ensuring issues are resolved correctly and permanently, not just patched. This is a hands-on engineering leadership role that sits at the intersection of Product Engineering, QA, SRE, and Technical Support. The Sustaining Engineering Manager is accountable for reducing customer pain, improving release confidence, and protecting roadmap velocity by preventing recurring issues.
Job Responsibilities:
Own the sustaining engineering response for P0–P1 customer-impacting issues
Lead root cause analysis (RCA) and ensure durable fixes are delivered
Partner with SRE and Support during incidents to restore service quickly and safely
Drive post-incident reviews and ensure action items are completed
Identify recurring issues and systemic weaknesses in the platform
Work with Product and Engineering to prioritize fixes that reduce customer pain and support load
Champion improvements in test coverage, observability, performance, and operational readiness
Ensure fixes meet engineering standards for quality, performance, and security
Build, manage, and mentor a distributed Sustaining Engineering team
Establish clear ownership, on-call practices, and escalation paths
Balance reactive work with proactive investments in stability and platform health
Foster a culture of accountability, ownership, and continuous improvement
Define and maintain clear operating boundaries between Technical Support, Sustaining Engineering, and Product Engineering
Partner with Product to influence roadmap decisions based on production learnings
Collaborate with QA on regression prevention and release readiness
Communicate clearly with stakeholders during incidents and escalations
Define and track KPIs such as P0/P1 incident frequency and TTR, defect recurrence rate, support ticket volume tied to product defects, and post-release defect rates
Use data to drive prioritization and justify proactive investment
Requirements:
10+ years of professional software engineering experience
5+ years of engineering management experience
Strong background in production systems, incident management, and debugging complex distributed systems
Experience running on-call rotations and incident response processes
Proven ability to lead teams through high-pressure situations
Excellent written and verbal communication skills
Nice to have:
Familiarity with OLAP concepts and technologies such as SSIS and SSAS
Familiarity with business intelligence tools (e.g., Tableau, Power BI, Excel)
Experience with technologies such as Data Analysis Expressions (DAX) and Multidimensional Data Expressions (MDX)
What we offer:
Competitive compensation, including equity
Flexible, remote-friendly work environment with a strong culture of ownership and trust
Unlimited PTO and competitive benefits
The opportunity to directly shape AtScale’s growth by building the team that powers our next phase