Data Center Incident Program Manager

OpenAI

Location:
United States

Contract Type:
Not provided

Salary:

125600.00 - 228000.00 USD / Year

Job Description:

The Data Center Incident Program Manager is responsible for designing, operating, and continuously improving the end-to-end incident management lifecycle across mission-critical data center environments. This role owns the “before, during, and after” mechanics of incidents: establishing standards and playbooks in steady state, serving as (or designating) Incident Commander during active events, and driving structured post-incident review and corrective action to closure.

Job Responsibility:

  • Define and maintain incident severity levels (SEV definitions), classification criteria, and escalation thresholds
  • Establish end-to-end incident response standards: protocols, lifecycle stages (declare → stabilize → mitigate → recover → close), and operating cadence
  • Build and maintain governance artifacts: runbooks, war room formats, reporting templates, and decision/communication standards
  • Create and operationalize notification trees, stakeholder comms templates (initial, periodic updates, recovery/closure), and executive escalation criteria
  • Define clear RACI across Facilities, Hardware Ops, Network, Security, and vendor/partner teams, including handoffs and accountability paths
  • Set and manage SLAs/OLAs for acknowledgment, escalation, containment, mitigation, and reporting
  • Implement and run incident management tooling (ticketing, paging, logging) and ensure integrations with monitoring and workflow systems
  • Establish dashboards and program health metrics to track incident performance and readiness
  • Lead readiness activities: tabletop exercises, cross-functional simulations, IC/Deputy training, and a rotating on-call IC bench with certification standards
  • Serve as Incident Commander as needed: declare severity, stand up the war room, assign functional leads, and drive structured execution under pressure
  • Maintain real-time documentation (decisions, timelines, impact scope) and ensure clear restoration objectives and scope control during active events
  • Run post-incident reviews (PIRs), validate timelines, drive structured RCA (e.g., 5 Whys, Fault Tree), and separate root cause vs contributing factors
  • Define corrective/preventative actions (CAPAs), assign accountable owners, track to verified closure, and escalate overdue actions
  • Publish trend reporting (incident taxonomy, counts by severity, MTTA/MTTR, repeat failure domains) and feed systemic gaps back into design and operations teams

Requirements:

  • 7+ years in mission-critical infrastructure, data center operations, or reliability engineering
  • Direct experience leading major incidents (P1/P0 equivalent)
  • Strong familiarity with facilities systems, hardware operations, or network infrastructure
  • Demonstrated experience running war rooms and executive updates
  • Experience conducting root cause analysis and corrective action tracking
  • Ability to remain calm and decisive under high-pressure conditions

Nice to have:

  • Experience in hyperscale or high-density AI compute environments
  • Background in facilities commissioning, facility operations, hardware operations, or network reliability
  • Familiarity with ISO-based quality systems or structured operational documentation frameworks
  • Experience implementing incident tooling (PagerDuty, ServiceNow, Jira, etc.)

What we offer:

  • Medical, dental, and vision insurance for you and your family, with employer contributions to Health Savings Accounts
  • Pre-tax accounts for Health FSA, Dependent Care FSA, and commuter expenses (parking and transit)
  • 401(k) retirement plan with employer match
  • Paid parental leave (up to 24 weeks for birth parents and 20 weeks for non-birthing parents), plus paid medical and caregiver leave (up to 8 weeks)
  • Paid time off: flexible PTO for exempt employees and up to 15 days annually for non-exempt employees
  • 13+ paid company holidays, and multiple paid coordinated company office closures throughout the year for focus and recharge, plus paid sick or safe time (1 hour per 30 hours worked, or more, as required by applicable state or local law)
  • Mental health and wellness support
  • Employer-paid basic life and disability coverage
  • Annual learning and development stipend to fuel your professional growth
  • Daily meals in our offices, and meal delivery credits as eligible
  • Relocation support for eligible employees
  • Additional taxable fringe benefits, such as charitable donation matching and wellness stipends, may also be provided
  • Generous equity
  • Performance-related bonus(es) for eligible employees

Additional Information:

Job Posted:
February 21, 2026

Employment Type:
Full-time
Work Type:
Remote work

Similar Jobs for Data Center Incident Program Manager

Facility Manager Italy

Facility manager’s main mission will be to manage and leverage the maximum power...
Location:
Italy, Milan
Salary:
Not provided
DATA4 Group
Expiration Date:
Until further notice
Requirements:
  • Electrical engineering degree preferred or proven experience and knowledge of electrotechnical issues applied to the components of the entire electrical line-up of data centers: Transformers, UPSs, Generators, common symbols for components like resistors, capacitors, transformers, switches, and circuit breakers
  • Technical knowledge to pilot the various trades involved
  • Able to communicate in written and oral English
  • Deep understanding of Electrical Schematics: mastery in reading and interpreting electrical schematics and diagrams, focusing on power distribution, UPS systems, and cooling infrastructure relevant to data centers
  • Reading and Interpreting Schematics: Understanding Schematic Flow, how to follow the flow of circuits and identify the relationship between different components within a schematic. Component Identification and Function: Detailed study of the function of each component within a schematic and its role in the overall system
  • Advanced circuit analysis techniques to understand complex electrical systems within data centers
  • Practical Application of Schematics in Troubleshooting: Utilizing electrical schematics for effective troubleshooting and fault identification in data center infrastructure
  • Electrical System Design and Maintenance: knowledge of designing electrical systems for data centers, including redundancy configurations and energy efficiency improvements
  • Preventive and predictive maintenance practices for electrical equipment, based on manufacturer guidelines and industry best practices
  • Incident management, including root cause analysis and corrective action planning, with a focus on electrical system failures
Job Responsibility:
  • Manage and leverage the maximum power from our suppliers, mainly with the FM supplier
  • Lead the day-to-day relationship in terms of planning, preparation and field execution according to the internal procedures
  • Monitor the infrastructure management activities of Datacenters leading technically all our suppliers
  • Work together with other FMs and other teams
  • Cover all technical areas in the Datacenter, with very strong needs on: Electricity, Low Current (access control, CCTV…) and H&S management, among others
  • Be one of the main players in the quality of service rendered by our suppliers, especially in terms of continuity
  • Work very closely with the different suppliers, mainly the FM company, with a focus on: planning maintenance; preparing corrective works (review and validation of changes and MOPs when required); attending dry runs; BMS follow-up of the installation to ensure good performance; and supporting other teams that may require on-the-field support for projects (Construction, Service Delivery, Energy Management, quality and compliance team, among others)
  • Work closely with the Critical Environment Manager supporting, from the technical perspective, all Incident management
  • Be part of the on-call team for Data4
  • Support the CEM and the CSM team in terms of customer communication providing them info related to maintenance operations, preventive planning, incident management, RCA analysis, action plans tracking, correctives, small projects follow up, BMS monitoring, reporting
Employment Type:
Full-time

Head of Operations Poland

Head of Operations role for Poland, responsible for managing data center operati...
Location:
Poland, Warsaw
Salary:
Not provided
DATA4 Group
Expiration Date:
Until further notice
Requirements:
  • Significant experience in areas in data center industry or in critical environment
  • Technical knowledge to pilot the various fields involved
  • Fluent in English
  • Knowledge of electricity, air conditioning, computer networks, and security
  • Ability to manage a long-term relationship with suppliers/providers
  • Ability to coordinate and manage root cause search in the event of an incident
  • Ability to listen to customer needs and constraints
  • Strong People Management skills
  • Good Strategic decision-making skills
  • Good understanding of safety and fire safety standards and procedures
Job Responsibility:
  • Managing operations in Poland
  • The main warrantor of the quality of service rendered to Customers in Poland, especially in terms of continuity of service
  • Responsible for the technical and security management of the Data Center area
  • Monitoring the operating budget of non-construction sites: responsible for the financial management of operating costs (COGS and Capex of Maintenance) in accordance with the budget and internal rules of the company in budget terms
  • Monitoring and execution of the company’s H&S and environment policy and monitoring its proper application with stakeholders (suppliers, subcontractors, visitors, customers…)
  • Coordinating the various internal and outsourced activities and services of the sites
  • Ensure that regulatory controls are in place and monitored in accordance with applicable regulations including ICPE
  • Management of on-call schedules
  • In coordination with the Senior Management, arbitration and reallocation of resources according to the company’s objectives and strategy
  • Assists the Sales team in the negotiation phase when necessary
Employment Type:
Full-time

Global Security Area Manager

Our Global Security organization is composed of multiple teams that work togethe...
Location:
United States, Prineville
Salary:
111000.00 - 161000.00 USD / Year
Meta
Expiration Date:
Until further notice
Requirements:
  • 7+ years of site/campus security management experience
  • Planning, organizational, and motivational experience
  • Experience drafting in technical and non-technical formats
  • Experience presenting both extemporaneously and in formal settings
  • Experience in root cause analysis, industry benchmarking, survey evaluation and data interpretation
  • Experience in the areas of emergency/disaster management, physical security, critical incident stress management, risk management and business resiliency
  • Experience with emergency procedure protocols and regulatory interfaces
  • Knowledge in physical security strategies, principles, standards, policies, and procedures
  • Experience with security technologies including Video Surveillance, Access Control, and Incident Management Systems, Security Operations Centers
Job Responsibility:
  • Accountable for leading and providing management oversight of the onsite Data Center Physical Security program (DCPS) at the data center in support of all designated security activities
  • Responsible for the development and oversight of any additional Global Security FTEs assigned to the site
  • Act as a trusted advisor to the business and participate with the other organizational leads to make strategic decisions which drive site-based operations and resourcing for all teams
  • Provide security leadership direction to support all site-based events, ensuring that the site security team maintains a strategic plan to support all ongoing operations, projects and construction milestones
  • Accountable for ensuring the DCPS team adheres to all Global Security Policies, Protocols, SOPs, and Post orders
  • Responsible for the execution of and assisting with Global Physical Security strategies to include enforcement of business conduct and integrity standards, employee safety and security, investigations, crisis response, business continuity and interaction with the security industry and government partners
  • Regularly engages with and communicates updates to the site Circle of Leadership (COL) and other key partners, advising on incidents, emerging risks, and other issues that may impact Meta’s data center operations, employees, or vendors
  • Works closely with the Global Security Operations Center and the Global Security Investigations and Intelligence Team to anticipate, identify, and evaluate risks to the Meta data center
  • Accountable to ensure that the Data Center staffing requirement is aligned with the Global manning model and help to ensure the security vendor is recruiting, hiring, training, developing, and retaining highly qualified team members in accordance with the statement of services
  • Responsible for ensuring security operations meet expectations of team and company audit programs
What we offer:
  • bonus
  • equity
  • benefits
Employment Type:
Full-time

Data Center & Critical Environments Programs & Governance Lead

This leadership position is responsible for ensuring the delivery of Data Center...
Location:
United States, New York
Salary:
135000.00 - 155000.00 USD / Year
JLL
Expiration Date:
Until further notice
Requirements:
  • Ability to interact with leadership, both internally and externally
  • Ability to interact with engineers and technicians at all levels
  • Develop business continuity and disaster recovery plans
  • Conduct regular operational audits, programmatic audits, safety audits and risk assessments
  • Ability to lead incident response and root cause analysis for system failures
  • Position requires up to 30% travel
  • Bachelor’s degree in engineering (Electrical, Mechanical, Civil, or related field) preferred or related experience
  • Minimum of 7-15 years data center operations experience related to FM service delivery
  • Thorough understanding of data center infrastructure, programs and systems related to operational delivery
  • Prior experience with development of mission critical programs, procedures, and DCIM/MCIM/CMMS systems
Job Responsibility:
  • Provide data center operational direction and leadership at the account level to advance operational delivery quality for data center program management for both gray space and white space infrastructures
  • Build and maintain strong working relationships with key client representatives, acting as subject matter expert in data center operations and reliability
  • Implement standard business operating mechanisms including regular team meetings, strategic review forums, and analysis/reporting processes to ensure team alignment
  • Deployment and continuous alignment of the Critical Environments Playbook and related programs across existing and new accounts
  • Lead large teams through education and influence to optimize operational outcomes for data center/CEM global programs
  • Drive operational excellence within critical operational performance and compliance environments
  • Support a culture of 100% data center uptime according to JLL contractual parameters, exceeding customer expectations
  • Lead data center site assessments, JCAP site assessments, and other on/off-site deliverables for clients and JLL
  • Provide account-specific data center solutions and transition support for complex engagements and transformation programs in high-risk environments
  • Monitor market trends and changes to ensure JLL provides industry best practice in DC FM Operations delivery
What we offer:
  • 401(k) plan with matching company contributions
  • Comprehensive Medical, Dental & Vision Care
  • Paid parental leave at 100% of salary
  • Paid Time Off and Company Holidays
  • Early access to earned wages through Daily Pay
Employment Type:
Full-time

Data Center Program Manager

As a Microsoft Data Center Project Manager (DCPM), you will perform troubleshoot...
Location:
United States, Boydton
Salary:
81400.00 - 161800.00 USD / Year
Microsoft Corporation
Expiration Date:
Until further notice
Requirements:
  • High School Qualification or equivalent AND 2+ years experience supporting IT equipment or related technology or delivering server and network deployment projects in large-scale environments OR equivalent experience
  • Bachelor's or Technical College Degree in Computer Science, Math, Telecommunications, Electrical/Mechanical Engineering, Supply Chain Management or related field AND 5+ years experience in critical environment infrastructures (e.g., UPS, Generator, AHU), or working in physical IT infrastructures (e.g., Servers, SANs, Networking, Capacity, DC Rack/Enclosures, structured cabling) OR High School Qualification or equivalent AND 7+ years experience in critical environment infrastructures (e.g., UPS, Generator, AHU), or working in physical IT infrastructures (e.g., Servers, SANs, Networking, Capacity, DC Rack/Enclosures, structured cabling) OR equivalent experience
  • Applicable certifications: APICS/Inventory Control, CompTIA, Microsoft, Network Certifications, PMP, ITIL, CDCP
  • Ability to meet Microsoft, customer and/or government security screening requirements is required for this role. These requirements include, but are not limited to, the following specialized security screenings: Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud background check upon hire/transfer and every two years thereafter
  • Citizenship Verification: This position requires verification of US Citizenship to meet federal government security requirements
  • Criminal Justice Information Services: This position may require passing a background check conducted through the CJIS criminal justice information system by authorized local, state, and/or federal agencies.
Job Responsibility:
  • Follows and adheres to processes and policy(ies) provided by security and safety governing partners
  • Reports immediately any safety or security issues or concerns
  • Participates in safety and security related Root Cause Analysis (RCA) processes as appropriate
  • Makes recommendations for improvements to safety and security processes or procedures
  • Fosters and exhibits a culture of safety
  • Ensures no unauthorized or unescorted personnel access in secured production environments, ensuring alignment with security practices and standards
  • Manages and regularly audits physical access lists for personnel accessing secured production environments and related systems
  • Conducts security risk assessments of data center operations and assesses the design, build, and delivery of technology, tools, data, and processes to meet high security standards with minimal guidance
  • Documents and tracks security Key Performance Indicators (KPIs) and identifies and escalates action items
  • Demonstrates conscientiousness on cost and adheres to budget requirements
Employment Type:
Full-time

Data Center Security Operations Manager

Cloud Operations + Innovation (CO+I) is the engine that powers Microsoft’s cloud...
Location:
Austria, Vienna
Salary:
90000.00 EUR / Year
Microsoft Corporation
Expiration Date:
Until further notice
Requirements:
  • Demonstrated capability to: Oversee delivery of physical security services to Microsoft data center security operations, including oversight of contract guard operations, alarm investigation and incident reporting, and coordination with regional security disciplinary specialists on projects, expansions and other security-related efforts
  • Evaluate and drive continuous improvement of contract guard operations through the use of key performance indicators and collaborative improvement plans
  • Close coordination with security vendor management to ensure continuous improvement of security team skills through targeted training, practical exercises and the documentation and application of lessons-learned
  • Coordination with local emergency services in an effort to develop, maintain and practice/test cross-functional emergency response procedures for the datacenters
  • Assess and communicate risk and mitigation strategies to non-security audiences, supporting operational needs and maintaining security compliance
  • Travel not expected to exceed 10-15% of the time
  • Bachelor’s degree in a security or management related discipline, or equivalent experience
  • Experience applicable to target role/level, including 3+ years managing people
Job Responsibility:
  • Oversee the implementation of physical security policies and procedures, ensuring Microsoft’s physical security vendor has the resources and information to deliver physical security services that exceed Microsoft and customer requirements to protect people, information and critical infrastructure
  • Partner with datacenter operations, security systems and other Microsoft stakeholders to ensure secure and continuous operations while maintaining a One Team, One Microsoft environment
  • Continuously improve the efficiency and maturity of the overall physical security program at Microsoft datacenters, seeking data and recommending strategies and ideas to reduce churn, optimize resources, implement creative solutions to problems, scale, automate and simplify process whenever possible
  • Demonstrate and promote a Microsoft culture within the workplace that supports the ability to attract, develop and retain talent; deliver results through teamwork; role model our Microsoft values with a passion for diversity and inclusion
  • Partner with vendor guard force management at site to drive a training objective of providing enhanced industry leading and ‘certified’ dedicated Datacenter Security Protection Professionals (ex: Corporate/ASIS/DCPRO certifications)
  • Function as a physical security subject matter expert who can operate on their own and represent the overall (multi-disciplinary) regional physical security team
  • Partner and collaborate closely with regional peer leaders and stakeholders, focused on maintaining a One Team, One Microsoft environment
  • As the on-site COF representative, ensure the operations team and all related security vendors successfully represent Microsoft during internal, external and customer audits for all COF teams (EH&S, EGRC, etc)
Employment Type:
Full-time

Critical Environment Operations Manager

Microsoft’s Cloud Infrastructure and Operations (MCIO) is the engine that powers...
Location:
United States, Quincy
Salary:
127600.00 - 229200.00 USD / Year
Microsoft Corporation
Expiration Date:
Until further notice
Requirements:
  • High School Qualification or equivalent AND 6+ years experience of mission-critical service management (e.g., providing IT services, manufacturing, warehouse, retail, military, or managing physical operations in an IT and/or critical environment infrastructure) OR equivalent experience
  • 1+ year(s) people management experience
  • Ability to meet Microsoft, customer and/or government security screening requirements is required for this role. These requirements include, but are not limited to, the following specialized security screenings: Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud background check upon hire/transfer and every two years thereafter.
  • 6+ years enterprise-level experience managing large scale and complex projects/programs AND 6+ years experience in Critical Environment infrastructures (e.g., UPS, Generator, AHU) AND 6+ years experience in physical IT infrastructures (e.g., Servers, SANs, Networking, Capacity, DC Rack/Enclosures, structured cabling) AND Experience managing budget $1M+
  • Bachelor's Degree in Computer Science, Math, Telecommunications, Electrical/Mechanical Engineering, Supply Chain Management or related field
  • 5+ years experience leading diverse, technical workforce or managing global and virtual teams
  • Applicable certifications: APICS/Inventory Control, CompTIA, Microsoft, Network Certifications, CCNA Certifications, ITIL v3 Foundation, Microsoft Operations Framework (MOF) Certifications, Leadership Development Certifications, PMP, CDCP
  • Bachelor’s Degree or Technical College certification in mechanical or electrical engineering and/or services
  • Experience working on large scale CE projects
  • Experience with the operation of IT infrastructure (Servers, SANs, Networking, etc.)
Job Responsibility:
  • Clarifies and refines strategic vision to communicate to their teams and drives alignment of operational strategies with a security-first culture
  • Brings suggestions including security enhancement to local management team and/or global leaders or partners
  • Reviews Objectives and Key Results (OKRs) and Key Performance indicators (KPIs) and provides input to improve or make KPIs and OKRs more efficient/effective
  • Understands key business drivers and where Data Center Operations (DCOPS) feeds into business success
  • Understands connections between different organizations and their related initiatives
  • Suggests or recommends risk-based decisions or escalates based on what is best for business and strategy, and without all the data
  • Frames DCOPs challenges in context of business expectations and drives business context
  • Understands key financial metrics, understands how to grow an organization, and exhibits a One Microsoft mentality
  • May change course and strategically delegates tasks
  • Influences guidelines to ensure they are consistent with contractual service agreements and security standards
Employment Type:
Full-time

Senior Technical Program Manager – AI Infrastructure, Site Operations

Cerebras Systems builds the world's largest AI chip, 56 times larger than GPUs. ...
Location:
United States, Sunnyvale
Salary:
Not provided
Cerebras Systems
Expiration Date:
Until further notice
Requirements:
  • 8+ years in Technical Program Management, Infrastructure Ops, or Data Center Ops
  • Experience leading large, cross-functional infrastructure programs
  • Strong understanding of: data center power and cooling fundamentals; network and storage basics; hardware-centric platforms
  • Proven ability to define and operationalize metrics
  • Strong written and executive-level communication skills
Job Responsibility:
  • Own end-to-end technical programs for data center and site operations
  • Act as single-threaded owner across: Hardware & Systems Engineering; AI Cloud Infrastructure & Operations; Network & Storage Engineering; Facilities, power, cooling, and colo partners
  • Drive site readiness for Cerebras Wafer-Scale Engine systems
  • Partner on installation, commissioning, change management, and break/fix workflows
  • Lead incident reviews and postmortems; ensure corrective actions are closed
  • Define and own operational metrics and KPIs, including: Availability and reliability
What we offer:
  • Build a breakthrough AI platform beyond the constraints of the GPU
  • Publish and open source their cutting-edge AI research
  • Work on one of the fastest AI supercomputers in the world
  • Enjoy job stability with startup vitality
  • Our simple, non-corporate work culture that respects individual beliefs