You will join an SRE-aligned operations team responsible for keeping a mission-critical, global cloud platform reliable, performant, and secure. The project focuses on 24/7 cloud operations, proactive monitoring, incident response, and continuous improvement of observability coverage across multi-region GCP environments. You will work closely with SRE, Cloud Engineering, and development teams to maintain high availability, support business continuity, and drive operational excellence.
Job Responsibilities:
Monitor GCP cloud infrastructure across multiple regions using advanced observability tooling; identify monitoring gaps and implement improvements to increase coverage
Respond to alerts and incidents in real time; gather supporting data for root cause analysis and escalate when required
Investigate logs, APM traces, dashboards, and monitors to assess broader/tangential impact and provide incident forensics
Troubleshoot issues related to cloud networking, containers, storage, APIs, and service reliability
Create, maintain, and improve troubleshooting guides (TSGs), incident response procedures, runbooks, and operational documentation
Provide leadership and mentorship to NOC engineers; help set operational standards and best practices
Collaborate with SRE, Cloud Engineering, and development teams to resolve complex infrastructure and reliability issues
Perform routine health checks across the cloud environment and ensure readiness for high availability
Monitor observability platform spend and recommend optimization actions where appropriate
Evaluate private beta/beta releases of observability tooling; summarize findings and advise on adoption
Perform routine patching and upgrades of observability agents across the platform
Contribute changes to a source repository (e.g., runbooks, configs, automation, monitoring-as-code)
Ensure compliance with SLAs, security policies, and operational standards
Participate in a 24/7 on-call rotation and support disaster recovery and business continuity activities
Analyze performance metrics and recommend opportunities for optimization and automation
Requirements:
Bachelor’s degree in Computer Science, Information Technology, or related field (or equivalent practical experience)
3–5 years of experience in a NOC, operations, or cloud infrastructure support role
Strong understanding of cloud platforms and core services (AWS / Azure / GCP), with hands-on exposure to production operations
Familiarity with container orchestration (Kubernetes / GKE) and CI/CD pipelines
Experience with monitoring and logging tools such as Datadog, Dynatrace, Prometheus, Grafana, ELK, CloudWatch, Splunk, Sumo Logic, New Relic (or equivalents)