The PCW (Pharmacy & Consumer Wellness) Edge SRE team is seeking a Staff Site Reliability Engineer (SRE) with a primary focus on observability. This role will lead the design, implementation, and optimization of observability systems to ensure the reliability, performance, and scalability of our environment, with an emphasis on edge deployments. You will collaborate with cross-functional teams to build robust monitoring, alerting, and telemetry solutions, enabling proactive issue detection and resolution across distributed systems. As a senior member of the SRE team, you will drive best practices, mentor others, and shape the strategic evolution of our observability ecosystem in a complex, edge-centric architecture.
Job Responsibilities:
Design and implement comprehensive observability solutions tailored for edge computing environments
Define and maintain Service Level Indicators (SLIs), Service Level Objectives (SLOs), and business KPIs
Build and optimize dashboards, visualizations, and alerting systems
Implement distributed tracing and log aggregation systems
Collaborate with engineering teams to ensure applications and infrastructure at edge locations are designed with observability in mind
Drive proactive identification of issues in edge facilities
Lead incident postmortems
Develop and maintain tools, scripts, and automation to enhance observability pipelines
Evaluate and integrate industry-standard observability tools
Optimize observability data storage, retention, and querying
Mentor and guide junior SREs and engineers
Partner with solution, engineering, and business teams
Lead cross-functional initiatives to improve observability
Stay current with emerging observability trends, tools, and methodologies
Contribute to the development of observability standards, runbooks, and documentation
Drive cost optimization for observability infrastructure
Requirements:
7+ years of experience in Site Reliability Engineering, Observability Engineering, or a related field
5+ years of experience with observability tools and platforms such as Prometheus, Grafana, Splunk, ELK, OpenTelemetry, or similar
3+ years of experience with microservices, containerized environments (e.g., Kubernetes, Docker), and distributed systems, particularly in edge deployments
Bachelor's degree, or equivalent experience (HS diploma + 4 years relevant experience)
Nice to have:
Experience with implementation of AIOps
Demonstrated ability to handle observability challenges in environments with intermittent connectivity, high latency, or geographically dispersed infrastructure
Strong proficiency in programming/scripting languages (e.g., Python, Java) for automation and tooling in distributed environments
Expertise working in edge computing environments with a large number of remote facilities
Experience with OpenTelemetry or other open-source observability frameworks optimized for edge computing
Familiarity with chaos engineering principles to validate observability systems in edge environments
Certifications in cloud platforms (e.g., Google Cloud Professional certification) or Kubernetes
Strong problem-solving skills with a proactive, analytical mindset
Excellent communication and collaboration skills
Ability to mentor and lead technical initiatives
Comfortable working in a fast-paced, dynamic environment
Knowledge of incident management processes and tools (e.g., ServiceNow, xMatters, Opsgenie)
Deep understanding of monitoring, logging, and tracing concepts
Familiarity with cloud infrastructure, CI/CD pipelines, and edge-specific deployment patterns