Transform traditional operations into a modern SRE model: building reliability by design, improving SLIs/SLOs, automating toil, defining and enabling critical monitoring, templatizing observability for business-critical applications and defining their critical user journeys, and maturing incident and problem management. You’ll be hands-on while also mentoring Ops/Dev teams to adopt SRE practices at scale.
Job Responsibilities:
Lead or participate in managing all installed systems and infrastructure within the Systems Operations functional area
Contribute to increasing system efficiency and reducing human intervention time on related tasks
Review and analyze moderately complex operational support systems, application software, and system management tools to ensure the highest levels of systems and infrastructure availability
Work with vendors and other technical personnel for problem resolution
Lead the team to meet technical deliverables while leveraging a solid understanding of technical process controls or standards
Collaborate with vendors and other technical personnel to resolve technical issues and achieve the highest levels of systems and infrastructure availability
Define and implement SLIs/SLOs and error budgets for critical services; drive SLO adoption across teams
Build and tune observability (metrics/logs/traces) with golden signals (latency, traffic, errors, saturation)
Partner with Performance Engineering to run load/stress/soak tests and remove performance bottlenecks
Platform & Automation: Eliminate toil; generate AI-based observability assessments and maturity scorecards for all applications
Create self-service reliability tooling (runbooks, bots, reliability checks, golden paths)
Incident, Problem & Change: Lead high-severity incidents (Major/SEV1), facilitate blameless postmortems, and track corrective actions
Culture & Enablement: Coach product and ops teams on SRE principles; define maturity models and track adoption
Build documentation and keep it current: runbooks, dashboards, readiness checklists, and reliability reviews
Requirements:
4+ years of Systems Engineering, Technology Architecture experience, or equivalent demonstrated through one or a combination of the following: work experience, training, military experience, education
Strong experience in large-scale distributed systems
5+ years of hands-on SRE/DevOps/Platform Engineering experience
Cloud: One or more of AWS, Azure, or GCP (certifications a plus)