We are looking for a Principal Engineer to set the enterprise technical direction for Site Reliability Engineering and AIOps within Enterprise Functions Technology (EFT). This is a hands-on architecture and engineering leadership role responsible for defining the reliability strategy, reference architectures, and engineering standards across a large application portfolio and multiple lines of business. You will drive cross-organization adoption of SLOs/error budgets, full-stack observability, incident and problem management rigor, and automation-first operations—ensuring reliability is designed into the software delivery lifecycle and operating model. Success is defined by measurable outcomes at scale: improved availability and resiliency of critical journeys, fewer customer-impacting incidents, reduced operational toil, and faster, safer recovery—delivered through modern engineering practices, data-driven decisioning, and platform capabilities.
Job Responsibilities:
Act as an advisor to leadership to develop or influence applications, network, information security, database, operating systems, or web technologies for highly complex business and technical needs across multiple groups
Lead the strategy and resolution of highly complex and unique challenges requiring in-depth evaluation across multiple areas or the enterprise, delivering solutions that are long-term, large-scale and require vision, creativity, innovation, advanced analytical and inductive thinking
Translate advanced technology experience, an in-depth knowledge of the organization's tactical and strategic business objectives, the enterprise technological environment, the organization structure, and strategic technological opportunities and requirements into technical engineering solutions
Provide vision, direction and expertise to leadership on implementing innovative and significant business solutions
Maintain knowledge of industry best practices and new technologies and recommend innovations that enhance operations or provide a competitive advantage to the organization
Strategically engage with all levels of professionals and managers across the enterprise and serve as an expert advisor to leadership
Set and evangelize the SRE and AIOps technical strategy for EFT, establishing reference architectures, standards, and guardrails (service tiering, onboarding criteria, SLO/error budget governance) and holding teams accountable through transparent executive-level reporting
Act as a principal-level technical advisor and multiplier: mentor senior engineers, contribute to hiring and technical bar-raising, and define reliability patterns and guardrails across applications, networks, databases, operating systems, and web technologies
Own the reliability and observability architecture across hybrid/multi-cloud, driving standardization of monitoring, logging, tracing, synthetics, and resilience/chaos testing; define platform patterns that teams can adopt with minimal friction
Design and implement AIOps and automation platforms (event correlation, anomaly detection, runbook automation, self-healing) with strong engineering discipline (testability, auditability, change safety) and prioritize initiatives that materially reduce incident volume, toil, and MTTR
Define the reliability measurement system (SLIs/SLOs, error budgets, customer impact, MTTR/MTBF, change failure rate) and build reusable dashboards and alerts that drive consistent prioritization, investment decisions, and engineering behavior across teams
Provide technical leadership during major incidents for critical services, driving rapid triage, clear stakeholder communications, and cross-domain coordination; institutionalize blameless post-incident reviews and engineering mechanisms that eliminate systemic causes
Partner with application, platform, and architecture leaders to embed reliability into planning and delivery (design and architecture reviews, operational readiness gates, non-functional requirements, capacity/performance engineering), influencing roadmaps based on quantified risk and customer impact
Lead multi-quarter, cross-organization reliability transformations (e.g., platform modernization, resilience programs, observability convergence), delivering reusable capabilities and operating mechanisms that improve reliability posture and reduce operational risk at scale
Ability to travel up to 10%, as needed for stakeholder engagement, program delivery, and operational reviews
Hybrid work expectation: work from the office three days per week at one of the listed locations, aligned to team and business needs
Requirements:
7+ years of Engineering experience, or equivalent demonstrated through one or a combination of the following: work experience, training, military experience, education
7+ years of engineering experience, including principal-level technical leadership on large-scale reliability, production operations, or platform programs across complex environments
7+ years of software engineering experience (e.g., Java, C#, Python) with demonstrated expertise in system design and distributed systems; track record of delivering reusable automation and platform capabilities adopted by multiple teams
5+ years operating Linux/Unix and Windows platforms in production, including performance tuning, capacity planning, and reliability hardening for mission-critical services
5+ years designing and operating cloud solutions (public and/or private cloud), including reliability and security architecture, infrastructure-as-code, and cost-aware engineering at scale
5+ years leading reliability and operations practices for enterprise-scale, highly available services, including major incident leadership, problem management, and establishing operational readiness mechanisms
5+ years architecting and scaling full-stack observability solutions, including instrumentation standards, alert strategy, service dashboards, and governance that improves signal quality and reduces noise
5+ years with automation and observability toolsets (e.g., Ansible, Grafana, Elastic, Splunk, Prometheus) and experience building reusable components, templates, and paved paths integrated with CI/CD
Exceptional communication and influence skills, including the ability to align senior stakeholders, drive technical decisions across organizations, and clearly articulate risk, tradeoffs, and recommended paths forward
Deep experience applying advanced analytics and/or AI/ML to production operations (AIOps), including model monitoring, drift/quality controls, explainability, and risk/compliance alignment
Experience leading complex, cross-team delivery using Agile/Scrum and/or Kanban, including establishing operating mechanisms (reviews, readiness gates, metrics cadences) that scale
Experience defining reliability and resiliency architecture for regulated environments, including alignment with information security, audit, and risk management partners
Demonstrated technical thought leadership (e.g., internal engineering community leadership, reference implementations, publications/patents, or speaking engagements)