We are seeking an Apigee X Site Reliability Engineer (SRE) with strong production experience operating APIs at scale. This role is focused on ensuring the reliability, performance, and resilience of Apigee X–backed services that support critical customer journeys.
Job Responsibilities:
Define, implement, and maintain Service Level Indicators (SLIs) and Service Level Objectives (SLOs) for Apigee-backed services
Establish SLO targets, manage error budgets, and own reliability reporting cadence
Design, implement, and continuously tune alerting strategies across the API platform
Classify and route alerts by severity (P1/P2/P3) based on customer impact and SLO burn rates
Implement alert correlation patterns, including authentication failures, quota spikes, and backend target failures
Own and enhance operational dashboards covering Golden Signals and dependency health using Datadog
Build and maintain dashboards for traffic, latency, error rates, backend dependencies, DNS health, certificate expiry, and authentication providers
Create SLO burn-rate views and identify top impacted API proxies
Proactively identify anomalies and performance degradation trends
Analyze seasonality patterns and establish intelligent baseline thresholds
Produce weekly and monthly reliability reports covering SLO performance, major incidents, recurring root causes, change failure rate, and MTTR
Implement and maintain synthetic monitoring and user journey checks for critical API flows
Participate in 24x7 on-call rotations and lead incident response and problem management activities
Requirements:
Degree in Computer Science or Computer Engineering
Strong hands-on expertise in the Apigee platform, particularly Apigee X
Proficiency in custom reporting and advanced debugging within Apigee environments
Experience with APM and observability tools, including creating dashboards, alerts, and monitors (Datadog preferred)
Knowledge of modern cloud technologies and distributed systems
Familiarity with Agile ways of working
What we offer:
The opportunity to work on large-scale, business-critical API platforms
Exposure to advanced reliability engineering practices
Collaboration with diverse, cross-functional teams
A role with clear ownership, influence, and measurable outcomes