Data Center Incident Program Manager Job at OpenAI

Job Description

The Data Center Incident Program Manager is responsible for designing, operating, and continuously improving the end-to-end incident management lifecycle across mission-critical data center environments. This role owns the “before, during, and after” mechanics of incidents: establishing standards and playbooks in steady state, serving as (or designating) Incident Commander during active events, and driving structured post-incident review and corrective action to closure.

Job Responsibilities

Define and maintain incident severity levels (SEV definitions), classification criteria, and escalation thresholds
Establish end-to-end incident response standards: protocols, lifecycle stages (declare → stabilize → mitigate → recover → close), and operating cadence
Build and maintain governance artifacts: runbooks, war room formats, reporting templates, and decision/communication standards
Create and operationalize notification trees, stakeholder comms templates (initial, periodic updates, recovery/closure), and executive escalation criteria
Define clear RACI across Facilities, Hardware Ops, Network, Security, and vendor/partner teams, including handoffs and accountability paths
Set and manage SLAs/OLAs for acknowledgment, escalation, containment, mitigation, and reporting
Implement and run incident management tooling (ticketing, paging, logging) and ensure integrations with monitoring and workflow systems
Establish dashboards and program health metrics to track incident performance and readiness
Lead readiness activities: tabletop exercises, cross-functional simulations, IC/Deputy training, and a rotating on-call IC bench with certification standards
Serve as Incident Commander as needed: declare severity, stand up the war room, assign functional leads, and drive structured execution under pressure
Maintain real-time documentation (decisions, timelines, impact scope) and ensure clear restoration objectives and scope control during active events
Run post-incident reviews (PIRs), validate timelines, drive structured RCA (e.g., 5 Whys, Fault Tree), and separate root cause from contributing factors
Define corrective/preventative actions (CAPAs), assign accountable owners, track to verified closure, and escalate overdue actions
Publish trend reporting (incident taxonomy, counts by severity, MTTA/MTTR, repeat failure domains) and feed systemic gaps back into design and operations teams

Requirements

7+ years in mission-critical infrastructure, data center operations, or reliability engineering
Direct experience leading major incidents (P1/P0 equivalent)
Strong familiarity with facilities systems, hardware operations, or network infrastructure
Demonstrated experience running war rooms and executive updates
Experience conducting root cause analysis and corrective action tracking
Ability to remain calm and decisive under high-pressure conditions

Nice to have

Experience in hyperscale or high-density AI compute environments
Background in facilities commissioning, facility operations, hardware operations, or network reliability
Familiarity with ISO-based quality systems or structured operational documentation frameworks
Experience implementing incident tooling (PagerDuty, ServiceNow, Jira, etc.)

What we offer

Medical, dental, and vision insurance for you and your family, with employer contributions to Health Savings Accounts
Pre-tax accounts for Health FSA, Dependent Care FSA, and commuter expenses (parking and transit)
401(k) retirement plan with employer match
Paid parental leave (up to 24 weeks for birth parents and 20 weeks for non-birthing parents), plus paid medical and caregiver leave (up to 8 weeks)
Paid time off: flexible PTO for exempt employees and up to 15 days annually for non-exempt employees
13+ paid company holidays and multiple paid, coordinated company office closures throughout the year for focus and recharge, plus paid sick or safe time (1 hour per 30 hours worked, or more as required by applicable state or local law)
Mental health and wellness support
Employer-paid basic life and disability coverage
Annual learning and development stipend to fuel your professional growth
Daily meals in our offices, and meal delivery credits as eligible
Relocation support for eligible employees
Additional taxable fringe benefits, such as charitable donation matching and wellness stipends, may also be provided
Generous equity
Performance-related bonus(es) for eligible employees
