Principal Site Reliability Engineer (AI-first SRE) Jobs (Remote work)

2 Job Offers

Principal Site Reliability Engineer (AI-first SRE)

Lead the AI-driven reliability transformation at Groupon as a Principal SRE. Architect self-healing systems using AI/ML, GCP/AWS, and Kubernetes to ensure 99.9%+ availability. Leverage your 10+ years of experience to build predictive, intelligent platforms in a transformative, remote-friendly env...

Location

Peru

Salary

Not provided

Groupon

Expiration Date

Until further notice

Principal Site Reliability Engineer (AI-first SRE)

Lead the AI-driven reliability transformation at Groupon as a Principal SRE. You will architect self-healing systems using AI/ML, GCP/AWS, and Kubernetes to ensure 99.9%+ availability. This role requires 10+ years of experience, expertise in AIOps, and offers a chance to shape scalable, predictiv...

Location

Salary

Not provided

Groupon

Expiration Date

Until further notice

Explore Principal Site Reliability Engineer (AI-first SRE) jobs and step into the forefront of modern infrastructure engineering. This senior-level role represents the evolution of traditional SRE, merging deep software engineering expertise with artificial intelligence and machine learning to build inherently reliable, predictive, and self-regulating systems. Professionals in this field move beyond reactive incident management to architect platforms that anticipate and autonomously mitigate issues, ensuring exceptional user experiences and business continuity.

The core mission of a Principal AI-first SRE is to institutionalize reliability through intelligent automation. Common responsibilities include architecting and maintaining large-scale, distributed systems with stringent availability targets. They design and implement self-healing systems that automatically detect and remediate failures. A significant part of the role involves leveraging AI/ML for predictive analytics, using historical and real-time telemetry to forecast potential failures before they impact users. They build sophisticated AIOps observability pipelines, automate infrastructure governance, and develop adaptive Service Level Indicators (SLIs) and Objectives (SLOs) that evolve from system behavior. Leading chaos engineering and resilience testing programs, mentoring engineering teams, and driving a culture of reliability and data-driven decision-making across the organization are also typical duties. They often participate in and refine on-call processes, focusing on eliminating repetitive operational work through automation.

Typical skills and requirements for these high-impact jobs are extensive. Candidates generally possess 8-10+ years of experience in software/systems engineering, with a substantial portion dedicated to SRE or platform engineering. Expertise in cloud platforms like AWS, GCP, or Azure, container orchestration with Kubernetes, and infrastructure-as-code tools like Terraform is standard. Strong programming proficiency in languages such as Python or Go for developing automation and tooling is essential. A deep understanding of observability stacks, including metrics, logging, and tracing (e.g., Prometheus, Grafana, OpenTelemetry), and of service mesh technologies is required. Crucially, hands-on experience with AI/ML concepts applied to operations, such as anomaly detection, pattern recognition, and predictive modeling, differentiates this role. Strong leadership, communication, and the ability to influence cross-functional teams with data are paramount.

For those seeking to define the future of resilient systems, Principal AI-first SRE jobs offer a challenging and rewarding career path at the intersection of cutting-edge AI and foundational infrastructure reliability.
