Job Description:
The Site Reliability Product Owner leads end-to-end release engineering and operationalization for a growing, multi-application software portfolio spanning multiple missions and effectivities. The role owns release coordination, the bug/fix lifecycle, customer and multi-level leadership approvals, incident command, and post-incident reporting.

This hands-on, development-focused role requires strong AWS infrastructure and Python automation skills, along with practical knowledge of signal-processing algorithm behavior to interpret anomalous system results. It also includes ownership of on-call scheduling, with an expectation of ~80% availability while assigned.

The Product Owner defines and implements environment-wide monitoring and observability. This includes building comprehensive monitoring strategies (real-time system health, anomaly detection, and alerting to pre-empt resource exhaustion and performance degradation) and developing environment monitoring dashboards and application monitoring using APM tools with proactive thresholds.

Responsible for CI/CD and release quality, the role validates release candidates through operational and enterprise testing, compiles and coordinates release packages, facilitates the transition of development work into operational environments, and enforces release control (scheduling, versioning, change control) while tracking and verifying fixes.

The position drives continuous improvement by standardizing runbooks, automating deployment and recovery workflows, instrumenting DORA-style KPIs (deployment frequency, lead time, change success rate, MTTR), and partnering with engineering, suppliers, and the customer to reduce downtime, accelerate delivery cadence, and enable future capability growth and proposal support.