The Senior Incident Optimization & Reliability Specialist serves as a critical bridge between our Technology Incident Optimization Program and the core End-User Technology domains, including cloud desktop infrastructure, Microsoft productivity tools, content management, and conference/video platforms. This role demands deep technical expertise combined with a strategic, data-driven mindset to drive tactical incident reduction while architecting the future state of intelligent event management and automation for end-user services.
Job Responsibilities:
Conduct comprehensive analysis of alert and incident patterns to identify top sources of operational noise, determine root causes, and develop data-driven strategies for reduction
Design, implement, and optimize rules for event correlation, de-duplication, and suppression on AIOps and event management platforms
Architect and develop automation playbooks for incident data enrichment and create self-healing capabilities to reduce manual intervention (toil)
Assess the current observability footprint across all end-user technology domains
Champion and apply core SRE practices to systematically improve service reliability
Partner closely with end-user services, engineering, and platform teams to understand incident drivers, validate correlation logic, and provide expert guidance
Continuously validate the effectiveness of implemented rules and automation to ensure no business-impacting alerts are missed
Requirements:
Bachelor's degree in Computer Science, Information Technology, Computer Engineering, or a related technical field
A minimum of 8 years of hands-on experience in IT operations, end-user computing, or a related field, with proven experience in incident reduction and operational excellence
Demonstrated success in leading event management and incident reduction initiatives with quantifiable results
Direct, hands-on experience with modern AIOps and enterprise event management platforms (e.g., BigPanda)
Deep understanding of end-user technology ecosystems, including VMware-hosted cloud desktop infrastructure, the Microsoft 365 suite (Teams, Outlook, Office), SharePoint, and collaboration platforms
Expertise with a broad range of domain-specific monitoring and observability tools
Hands-on experience developing robust automation solutions using scripting languages (e.g., Python, PowerShell) and modern automation frameworks
Proficiency in log analysis, pattern recognition, and using query languages for data analysis on log aggregation platforms
Excellent analytical abilities with a systematic approach to troubleshooting complex issues
Exceptional communication skills with the ability to influence and collaborate effectively across diverse, cross-functional teams
Nice to have:
An advanced degree (Master's) in a relevant technical field
Relevant industry certifications (e.g., Microsoft 365, VMware, ITIL)
Experience with Site Reliability Engineering (SRE) practices and applying them in an enterprise context
Knowledge of ITSM platforms, CMDB management, and infrastructure-as-code (IaC) principles
Familiarity with financial services regulatory requirements