The Senior Incident Optimization Specialist serves as a critical bridge between the Technology Incident Optimization Program and the core Compute, Virtualization, Cloud Services, and Storage technology domains. This role demands deep technical expertise combined with strategic thinking to drive tactical incident reduction while architecting the future state of intelligent event management and automation. You will be responsible for building automated incident remediation workflows and achieving measurable incident reduction within your domain through event optimization, correlation, and automation while ensuring comprehensive observability is maintained and enhanced. This position offers the unique opportunity to shape the future of enterprise event management.
Job Responsibilities:
Conduct comprehensive analysis of alert and incident patterns to identify top sources of operational noise, determine root causes, and develop data-driven strategies for reduction
Design, implement, and optimize rules for event correlation, de-duplication, and suppression on AIOps and event management platforms
Develop domain-specific correlation logic leveraging configuration management data and infrastructure topology
Architect and develop automation playbooks for incident data enrichment and create self-healing capabilities for common and recurring infrastructure incident scenarios
Assess the current observability footprint across all infrastructure domains to identify gaps and propose enhancements that align with enterprise event management standards
Partner closely with infrastructure operations, engineering, and platform teams to understand incident drivers, validate correlation logic, and provide expert guidance on event management best practices
Continuously validate the effectiveness of implemented rules and automation to ensure no business-impacting alerts are missed
Monitor and report on alert quality metrics and lead iterative improvements
Requirements:
Bachelor's degree in Computer Science, Information Technology, Computer Engineering, or a related technical field
A minimum of 8 years of hands-on experience in IT operations, infrastructure engineering, or system architecture within large-scale enterprise environments
Demonstrated success in leading event management and incident reduction initiatives with quantifiable results
Direct, hands-on experience with modern AIOps and event management platforms
Deep understanding of enterprise infrastructure including virtualization architectures, container orchestration, microservices, and various storage architectures (block, file, object)
Expertise with a broad range of domain-specific monitoring tools for compute, virtualization, storage, and cloud platforms
Hands-on experience developing robust automation solutions using scripting languages and modern automation frameworks
Proficiency in log analysis, pattern recognition, and using query languages for data analysis on log aggregation platforms
Excellent analytical abilities with a systematic approach to troubleshooting complex issues and a holistic view of technology systems
Exceptional communication skills with the ability to influence and collaborate effectively across diverse, cross-functional teams and present technical concepts to various audiences
Nice to have:
An advanced degree (Master's) in a relevant technical field
Relevant industry certifications (e.g., Cloud, Virtualization, Automation, ITIL)
Experience with AIOps, machine learning for IT operations, and Site Reliability Engineering (SRE) practices
Knowledge of ITSM platforms, CMDB management, and infrastructure-as-code (IaC) principles
Familiarity with financial services regulatory requirements
What we offer:
medical, dental & vision coverage
401(k)
life, accident, and disability insurance
wellness programs
paid time off packages, including planned time off (vacation), unplanned time off (sick leave), and paid holidays
discretionary and formulaic incentive and retention awards