Data Center Incident Program Manager

OpenAI

Location:
United States

Contract Type:
Not provided

Salary:
125600.00 - 228000.00 USD / Year

Job Description:

The Data Center Incident Program Manager is responsible for designing, operating, and continuously improving the end-to-end incident management lifecycle across mission-critical data center environments. This role owns the “before, during, and after” mechanics of incidents: establishing standards and playbooks in steady state, serving as (or designating) Incident Commander during active events, and driving structured post-incident review and corrective action to closure.

Job Responsibility:

  • Define and maintain incident severity levels (SEV definitions), classification criteria, and escalation thresholds
  • Establish end-to-end incident response standards: protocols, lifecycle stages (declare → stabilize → mitigate → recover → close), and operating cadence
  • Build and maintain governance artifacts: runbooks, war room formats, reporting templates, and decision/communication standards
  • Create and operationalize notification trees, stakeholder comms templates (initial, periodic updates, recovery/closure), and executive escalation criteria
  • Define clear RACI across Facilities, Hardware Ops, Network, Security, and vendor/partner teams, including handoffs and accountability paths
  • Set and manage SLAs/OLAs for acknowledgment, escalation, containment, mitigation, and reporting
  • Implement and run incident management tooling (ticketing, paging, logging) and ensure integrations with monitoring and workflow systems
  • Establish dashboards and program health metrics to track incident performance and readiness
  • Lead readiness activities: tabletop exercises, cross-functional simulations, IC/Deputy training, and a rotating on-call IC bench with certification standards
  • Serve as Incident Commander as needed: declare severity, stand up the war room, assign functional leads, and drive structured execution under pressure
  • Maintain real-time documentation (decisions, timelines, impact scope) and ensure clear restoration objectives and scope control during active events
  • Run post-incident reviews (PIRs), validate timelines, drive structured RCA (e.g., 5 Whys, Fault Tree), and separate root cause vs contributing factors
  • Define corrective/preventative actions (CAPAs), assign accountable owners, track to verified closure, and escalate overdue actions
  • Publish trend reporting (incident taxonomy, counts by severity, MTTA/MTTR, repeat failure domains) and feed systemic gaps back into design and operations teams

Requirements:

  • 7+ years in mission-critical infrastructure, data center operations, or reliability engineering
  • Direct experience leading major incidents (P1/P0 equivalent)
  • Strong familiarity with facilities systems, hardware operations, or network infrastructure
  • Demonstrated experience running war rooms and executive updates
  • Experience conducting root cause analysis and corrective action tracking
  • Ability to remain calm and decisive under high-pressure conditions

Nice to have:

  • Experience in hyperscale or high-density AI compute environments
  • Background in facilities commissioning, facility operations, hardware operations, or network reliability
  • Familiarity with ISO-based quality systems or structured operational documentation frameworks
  • Experience implementing incident tooling (PagerDuty, ServiceNow, Jira, etc.)

What we offer:

  • Medical, dental, and vision insurance for you and your family, with employer contributions to Health Savings Accounts
  • Pre-tax accounts for Health FSA, Dependent Care FSA, and commuter expenses (parking and transit)
  • 401(k) retirement plan with employer match
  • Paid parental leave (up to 24 weeks for birth parents and 20 weeks for non-birthing parents), plus paid medical and caregiver leave (up to 8 weeks)
  • Paid time off: flexible PTO for exempt employees and up to 15 days annually for non-exempt employees
  • 13+ paid company holidays, and multiple paid coordinated company office closures throughout the year for focus and recharge, plus paid sick or safe time (1 hour per 30 hours worked, or more, as required by applicable state or local law)
  • Mental health and wellness support
  • Employer-paid basic life and disability coverage
  • Annual learning and development stipend to fuel your professional growth
  • Daily meals in our offices, and meal delivery credits as eligible
  • Relocation support for eligible employees
  • Additional taxable fringe benefits, such as charitable donation matching and wellness stipends, may also be provided
  • Generous equity
  • Performance-related bonus(es) for eligible employees

Additional Information:

Job Posted:
February 21, 2026

Employment Type:
Full-time

Work Type:
Remote work

Similar Jobs for Data Center Incident Program Manager

Facility Manager Italy

Facility manager’s main mission will be to manage and leverage the maximum power...
DATA4 Group
Location:
Italy, Milan
Salary:
Not provided
Expiration Date:
Until further notice
Requirements:
  • Electrical engineering degree preferred or proven experience and knowledge of electrotechnical issues applied to the components of the entire electrical line-up of data centers: Transformers, UPSs, Generators, common symbols for components like resistors, capacitors, transformers, switches, and circuit breakers
  • Technical knowledge to pilot the various trades involved
  • Able to communicate in written and oral English
  • Deep understanding of Electrical Schematics: mastery in reading and interpreting electrical schematics and diagrams, focusing on power distribution, UPS systems, and cooling infrastructure relevant to data centers
  • Reading and Interpreting Schematics: Understanding Schematic Flow, how to follow the flow of circuits and identify the relationship between different components within a schematic. Component Identification and Function: Detailed study of the function of each component within a schematic and its role in the overall system
  • Advanced circuit analysis techniques to understand complex electrical systems within data centers
  • Practical Application of Schematics in Troubleshooting: Utilizing electrical schematics for effective troubleshooting and fault identification in data center infrastructure
  • Electrical System Design and Maintenance: knowledge of designing electrical systems for data centers, including redundancy configurations and energy efficiency improvements
  • Preventive and predictive maintenance practices for electrical equipment, based on manufacturer guidelines and industry best practices
  • Incident management, including root cause analysis and corrective action planning, with a focus on electrical system failures
Job Responsibility:
  • Manage and leverage the maximum power from our suppliers, mainly with the FM supplier
  • Lead the day-to-day relationship in terms of planning, preparation and field execution according to the internal procedures
  • Monitor the infrastructure management activities of Datacenters leading technically all our suppliers
  • Work together with other FMs and other teams
  • Cover all technical areas in the Datacenter, with very strong needs on: Electricity, Low Current (access control, CCTV…) and H&S management, among others
  • Be one of the main players in the quality of service rendered by our suppliers, especially in terms of continuity
  • Work very closely with the different suppliers, mainly the FM company, with a focus on: planning maintenance; preparing corrective works (review and validation of changes and MOPs when required); attending dry runs; BMS follow-up of the installation to ensure good performance; and supporting other teams that may require on-the-field support for projects (Construction, Service Delivery, Energy Management, Quality and Compliance, among others)
  • Work closely with the Critical Environment Manager supporting, from the technical perspective, all Incident management
  • Be part of the on-call team for Data4
  • Support the CEM and the CSM team in terms of customer communication providing them info related to maintenance operations, preventive planning, incident management, RCA analysis, action plans tracking, correctives, small projects follow up, BMS monitoring, reporting
Employment Type:
Full-time

Head of Operations Poland

Head of Operations role for Poland, responsible for managing data center operati...
DATA4 Group
Location:
Poland, Warsaw
Salary:
Not provided
Expiration Date:
Until further notice
Requirements:
  • Significant experience in the data center industry or in critical environments
  • Technical knowledge to pilot the various fields involved
  • Fluent in English
  • Knowledge of electricity, air conditioning, computer networks, and security
  • Ability to manage a long-term relationship with suppliers/providers
  • Ability to coordinate and manage root cause search in the event of an incident
  • Ability to listen to customer needs and constraints
  • Strong People Management skills
  • Good Strategic decision-making skills
  • Good understanding of safety and fire safety standards and procedures
Job Responsibility:
  • Managing operations in Poland
  • Act as the main guarantor of the quality of service rendered to Customers in Poland, especially in terms of continuity of service
  • Responsible for the technical and security management of the Data Center area
  • Monitoring the operating budget of non-construction sites: responsible for the financial management of operating costs (COGS and maintenance Capex) in accordance with the budget and the company’s internal rules
  • Monitoring and execution of the company’s H&S and environment policy and monitoring its proper application with stakeholders (suppliers, subcontractors, visitors, customers…)
  • Coordinating the various internal and outsourced activities and services of the sites
  • Ensure that regulatory controls are in place and monitored in accordance with applicable regulations including ICPE
  • Management of on-call schedules
  • In coordination with the Senior Management, arbitration and reallocation of resources according to the company’s objectives and strategy
  • Assists the Sales team in the negotiation phase when necessary
Employment Type:
Full-time

Data Center & Critical Environments Programs & Governance Lead

This leadership position is responsible for ensuring the delivery of Data Center...
JLL
Location:
United States, New York
Salary:
135000.00 - 155000.00 USD / Year
Expiration Date:
Until further notice
Requirements:
  • Ability to interact with leadership, both internally and externally
  • Ability to interact with engineers and technicians at all levels
  • Develop business continuity and disaster recovery plans
  • Conduct regular operational audits, programmatic audits, safety audits and risk assessments
  • Ability to lead incident response and root cause analysis for system failures
  • Position requires up to 30% travel
  • Bachelor’s degree in engineering (Electrical, Mechanical, Civil, or related field) preferred or related experience
  • Minimum of 7-15 years data center operations experience related to FM service delivery
  • Thorough understanding of data center infrastructure, programs and systems related to operational delivery
  • Prior experience with development of mission critical programs, procedures, and DCIM/MCIM/CMMS systems
Job Responsibility:
  • Provide data center operational direction and leadership at the account level to advance operational delivery quality for data center program management for both gray space and white space infrastructures
  • Build and maintain strong working relationships with key client representatives, acting as subject matter expert in data center operations and reliability
  • Implement standard business operating mechanisms including regular team meetings, strategic review forums, and analysis/reporting processes to ensure team alignment
  • Deployment and continuous alignment of the Critical Environments Playbook and related programs across existing and new accounts
  • Lead large teams through education and influence to optimize operational outcomes for data center/CEM global programs
  • Drive operational excellence within critical operational performance and compliance environments
  • Support a culture of 100% data center uptime according to JLL contractual parameters, exceeding customer expectations
  • Lead data center site assessments, JCAP site assessments, and other on/off-site deliverables for clients and JLL
  • Provide account-specific data center solutions and transition support for complex engagements and transformation programs in high-risk environments
  • Monitor market trends and changes to ensure JLL provides industry best practice in DC FM Operations delivery
What we offer:
  • 401(k) plan with matching company contributions
  • Comprehensive Medical, Dental & Vision Care
  • Paid parental leave at 100% of salary
  • Paid Time Off and Company Holidays
  • Early access to earned wages through Daily Pay
Employment Type:
Full-time

Data Center Security Operations Manager

Cloud Operations + Innovation (CO+I) is the engine that powers Microsoft’s cloud...
Microsoft Corporation
Location:
Austria, Vienna
Salary:
90000.00 EUR / Year
Expiration Date:
Until further notice
Requirements:
  • Demonstrated Capability To: Oversee delivery of physical security services to Microsoft data center security operations, including oversight of contract guard operations, alarm investigation and incident reporting and coordination with regional security disciplinary specialists on projects, expansions and other security-related efforts
  • Evaluate and drive continuous improvement of contract guard operations through the use of key performance indicators and collaborative improvement plans
  • Close coordination with security vendor management to ensure continuous improvement of security team skills through targeted training, practical exercises and the documentation and application of lessons-learned
  • Coordination with local emergency services in an effort to develop, maintain and practice/test cross-functional emergency response procedures for the datacenters
  • Assess and communicate risk and mitigation strategies to non-security audiences, supporting operational needs and maintaining security compliance
  • Travel not expected to exceed 10-15% of the time
  • Bachelor’s degree in a security or management related discipline, or equivalent experience
  • Experience applicable to target role/level, including 3+ years managing people
Job Responsibility:
  • Oversee the implementation of physical security policies and procedures, ensuring Microsoft’s physical security vendor has the resources and information to deliver physical security services that exceed Microsoft and customer requirements to protect people, information and critical infrastructure
  • Partner with datacenter operations, security systems and other Microsoft stakeholders to ensure secure and continuous operations while maintaining a One Team, One Microsoft environment
  • Continuously improve the efficiency and maturity of the overall physical security program at Microsoft datacenters, seeking data and recommending strategies and ideas to reduce churn, optimize resources, implement creative solutions to problems, scale, automate and simplify process whenever possible
  • Demonstrate and promote a Microsoft culture within the workplace that supports the ability to attract, develop and retain talent; deliver results through teamwork; role model our Microsoft values with a passion for diversity and inclusion
  • Partner with vendor guard force management at site to drive a training objective of providing enhanced industry leading and ‘certified’ dedicated Datacenter Security Protection Professionals (ex: Corporate/ASIS/DCPRO certifications)
  • Function as a physical security subject matter expert who can operate on their own and represent the overall (multi-disciplinary) regional physical security team
  • Partner and collaborate closely with regional peer leaders and stakeholders, focused on maintaining a One Team, One Microsoft environment
  • As the on-site COF representative, ensure the operations team and all related security vendors successfully represent Microsoft during internal, external and customer audits for all COF teams (EH&S, EGRC, etc)
Employment Type:
Full-time

Senior Technical Program Manager – AI Infrastructure, Site Operations

Cerebras Systems builds the world's largest AI chip, 56 times larger than GPUs. ...
Cerebras Systems
Location:
United States, Sunnyvale
Salary:
Not provided
Expiration Date:
Until further notice
Requirements:
  • 8+ years in Technical Program Management, Infrastructure Ops, or Data Center Ops
  • Experience leading large, cross-functional infrastructure programs
  • Strong understanding of: data center power and cooling fundamentals; network and storage basics; hardware-centric platforms
  • Proven ability to define and operationalize metrics
  • Strong written and executive-level communication skills
Job Responsibility:
  • Own end-to-end technical programs for data center and site operations
  • Act as single-threaded owner across: Hardware & Systems Engineering; AI Cloud Infrastructure & Operations; Network & Storage Engineering; Facilities, power, cooling, and colo partners
  • Drive site readiness for Cerebras Wafer-Scale Engine systems
  • Partner on installation, commissioning, change management, and break/fix workflows
  • Lead incident reviews and postmortems; ensure corrective actions are closed
  • Define and own operational metrics and KPIs, including: Availability and reliability
What we offer:
  • Build a breakthrough AI platform beyond the constraints of the GPU
  • Publish and open source their cutting-edge AI research
  • Work on one of the fastest AI supercomputers in the world
  • Enjoy job stability with startup vitality
  • Our simple, non-corporate work culture that respects individual beliefs

Critical Environment Business Program Manager

Microsoft’s Cloud Operations & Innovation (CO+I) is the engine that powers our c...
Microsoft Corporation
Location:
United States, Atlanta
Salary:
81400.00 - 161800.00 USD / Year
Expiration Date:
Until further notice
Requirements:
  • High School Qualification or equivalent AND 2+ years experience supporting IT equipment or related technology or delivering server and network deployment projects in large-scale environments
  • 1+ year(s) of experience with maintenance planning and execution
  • Ability to meet Microsoft, customer and/or government security screening requirements
  • This position will be required to pass the Microsoft Cloud background check upon hire/transfer and every two years thereafter
Job Responsibility:
  • Follows and adheres to processes and policy(ies) provided by security and safety governing partners
  • Reports immediately any safety or security issues or concerns
  • Participates in safety and security related Root Cause Analysis (RCA) processes as appropriate
  • Makes recommendations for improvements to safety and security processes or procedures
  • Fosters and exhibits a culture of safety
  • Ensures no unauthorized or unescorted personnel access in secured production environments
  • Manages and regularly audits physical access lists for personnel accessing secured production environments and related systems
  • Understands strategic vision as communicated by leaders
  • Identifies potential improvements aligned with this vision
  • Demonstrates conscientiousness on cost and adheres to budget requirements
Employment Type:
Full-time

Technical Program Manager, AI Infrastructure

Be part of the team that builds and operates the world's fastest AI infrastructu...
Cerebras Systems
Location:
United States, Sunnyvale
Salary:
Not provided
Expiration Date:
Until further notice
Requirements:
  • Experience leading large, cross-functional infrastructure programs
  • Experience with AI/ML, HPC, or accelerator-based infrastructure
  • Strong understanding of data center power and cooling fundamentals
  • Experience installing and managing network, storage, and compute devices
  • Proven ability to define and operationalize metrics
  • Strong written and executive-level communication skills
  • Experience working with colocation providers and facilities teams
  • Background in incident management, reliability, or service operations
Job Responsibility:
  • Own end-to-end technical programs for multiple data center buildouts, coordinating with partners, contractors, and internal teams
  • Drive facility site readiness for power and cooling for Cerebras Wafer-Scale Engine systems
  • Coordinate equipment delivery and manage vendor accountability for schedules and quality related to rack integration and inter-rack cabling
  • Act as the single-threaded owner across internal partners: Hardware & Systems Engineering, Network & Storage Engineering, AI Cloud Infrastructure & Operations
  • Enforce handover criteria between site completion, equipment deployment, and operations
  • Own overall schedule tracking, risk identification, and mitigation, creating clear visibility for leadership
  • Establish program governance, risk tracking, and RACI clarity
  • Present program status, metrics, and operational risks to senior leadership
  • Drive partner accountability on contractual milestones and commercial commitments
  • Document repeatable processes and implement them to scale across future data centers
What we offer:
  • Build a breakthrough AI platform beyond the constraints of the GPU
  • Publish and open source their cutting-edge AI research
  • Work on one of the fastest AI supercomputers in the world
  • Enjoy job stability with startup vitality
  • Our simple, non-corporate work culture that respects individual beliefs

Quality Assurance Engineering Manager

Aruba, a Hewlett Packard Enterprise company, is seeking a Quality Assurance Engi...
Hewlett Packard Enterprise
Location:
Puerto Rico, Aguadilla
Salary:
Not provided
Expiration Date:
Until further notice
Requirements:
  • Bachelor’s degree in Electrical Engineering, Computer Science, or a related field (advanced degree preferred)
  • 10+ years of relevant work experience, including 5+ years in a people management role
  • Proven experience in quality assurance, hardware/software testing, and customer escalation management, preferably in the networking or data center industry
  • Strong knowledge of hardware testing and validation processes, including optical interconnect standards, thermal profile characterization, and high-speed fabric interconnects for AI-enabled data centers
  • Proficiency in software testing for device drivers, BIOS, firmware, and hardware/software integration
  • Familiarity with CPU performance characterization, memory tuning, and platform optimization techniques
  • Experience with automated testing frameworks, tools, and methodologies
  • Advanced leadership capabilities, including team building, coaching, conflict resolution, and strategic workforce planning
  • Experience managing globally distributed teams and fostering cross-functional collaboration
  • Strong project management skills, including resource prioritization, risk management, and budget oversight
Job Responsibility:
  • Lead and manage the Platform Validation & Customer Escalation Team, including individual contributors and subordinate managers, to ensure product quality and customer satisfaction
  • Act as the key point of escalation for complex technical issues, working cross-functionally with internal teams to resolve customer challenges effectively
  • Foster a culture of continuous improvement, innovation, and collaboration within the team
  • Oversee quality assurance processes for hardware modules (e.g., chassis, line cards, ASICs, transceivers, memory, power controllers) and software components (e.g., device drivers, BIOS, firmware)
  • Develop and implement rigorous testing frameworks to ensure platform performance, scalability, and reliability
  • Collaborate with engineering teams to validate designs and ensure hardware/software compatibility
  • Manage and resolve high-priority customer escalations by identifying root causes and implementing long-term solutions
  • Collaborate with technical leaders, program managers, and support teams to deliver timely, effective resolutions to customer-reported issues
  • Utilize insights from escalations to drive product and process improvements, reducing future customer-impacting incidents
  • Manage headcount, deliverables, schedules, and budgets for quality assurance activities and customer escalations
What we offer:
  • Health & Wellbeing
  • Personal & Professional Development
  • Unconditional Inclusion
Employment Type:
Full-time