Lead complex technology initiatives including those that are companywide with broad impact
Act as a key participant in developing standards and companywide best practices for engineering complex, large-scale technology solutions across technology engineering disciplines
Design, code, test, debug, and document software for projects and programs
Review and analyze complex, large-scale technology solutions for tactical and strategic business objectives, enterprise technological environment, and technical challenges that require in-depth evaluation of multiple factors, including intangibles or unprecedented technical factors
Make decisions in developing standards and companywide best practices for engineering and technology solutions that require an understanding of industry best practices and new technologies, and influence and lead the technology team to meet deliverables and drive new initiatives
Collaborate and consult with key technical experts, the senior technology team, and external industry groups to resolve complex technical issues and achieve goals
Lead projects or teams, or serve as a peer mentor
Own and improve availability, performance, scalability, and resilience of production systems
Define, monitor, and manage SLIs/SLOs and error budgets to guide reliability investments
Lead capacity planning, performance testing, failover readiness, and disaster‑recovery design
Design and operate a comprehensive observability stack using Prometheus, Grafana, and Splunk
Build and maintain golden dashboards and actionable alerts aligned to business impact
Reduce alert fatigue through signal‑based monitoring and correlation of metrics, logs, and traces
Partner with application teams to define instrumentation standards for metrics and logging
Use observability data to improve MTTD, MTTR, and reliability outcomes
Develop Python‑based automation for monitoring, alert remediation, deployments, scaling, and recovery
Build self‑healing workflows integrated with Prometheus alerts and Splunk signals
Create reusable automation frameworks and internal SRE tooling
Embed automation into CI/CD pipelines to improve deployment safety and reliability
Apply AI/ML techniques to observability and operations use cases
Partner with data and platform teams to operationalize ML models in production
Evaluate and integrate AIOps capabilities into the observability ecosystem
Serve as incident commander and senior escalation point for P1/P2 incidents
Lead blameless post‑incident reviews (PIRs) backed by Grafana metrics and Splunk evidence
Drive corrective and preventive actions to completion
Collaborate with platform, application, cloud, and SRE teams to embed reliability and observability by design
Influence architectural decisions to ensure systems are observable, scalable, and operable
Provide SRE guidance during major releases, migrations, and modernization initiatives
Ensure observability and automation comply with enterprise security and audit requirements
Support resilience validation, failover drills, and business continuity testing
Mentor and guide SRE and software engineers
Define standards for observability, automation, reliability, and incident response
Act as the technical authority for complex production and platform issues
Requirements
5+ years of Software Engineering experience, or equivalent demonstrated through one or a combination of the following: work experience, training, military experience, education
Experience in Software Engineering, SRE, DevOps, or Platform Engineering
Strong proficiency in Python for automation and tooling
Hands‑on experience with Grafana, Prometheus, and Splunk in production environments
Solid understanding of SLIs, SLOs, dashboards, alerting, and observability best practices
Experience applying AI/ML concepts to monitoring, alerting, or operational analytics
Strong knowledge of Linux, networking, and distributed systems
Experience with cloud platforms and Kubernetes/OpenShift
Proven experience leading incidents, RCAs, and reliability initiatives
Experience building custom Prometheus exporters or advanced Grafana dashboards