Principal Site Reliability Engineer (AI-first SRE) Jobs

2 Job Offers

Principal Site Reliability Engineer (AI-first SRE)
Lead the AI-driven reliability transformation at Groupon as a Principal SRE. Architect self-healing systems using AI/ML, GCP/AWS, and Kubernetes to ensure 99.9%+ availability. Leverage your 10+ years of experience to build predictive, intelligent platforms in a transformative, remote-friendly env...
Location
Peru
Salary
Not provided
Groupon
Expiration Date
Until further notice
Principal Site Reliability Engineer (AI-first SRE)
Lead the AI-driven reliability transformation at Groupon as a Principal SRE. You will architect self-healing systems using AI/ML, GCP/AWS, and Kubernetes to ensure 99.9%+ availability. This role requires 10+ years of experience, expertise in AIOps, and offers a chance to shape scalable, predictiv...
Location
Salary
Not provided
Groupon
Expiration Date
Until further notice
Explore Principal Site Reliability Engineer (AI-first SRE) jobs and step into the forefront of modern infrastructure engineering. This senior-level role represents the evolution of traditional SRE, merging deep software engineering expertise with artificial intelligence and machine learning to build inherently reliable, predictive, and self-regulating systems. Professionals in this field move beyond reactive incident management to architect platforms that anticipate and autonomously mitigate issues, ensuring exceptional user experiences and business continuity.

The core mission of a Principal AI-first SRE is to institutionalize reliability through intelligent automation. Common responsibilities include architecting and maintaining large-scale, distributed systems with stringent availability targets, and designing and implementing self-healing systems that automatically detect and remediate failures. A significant part of the role involves leveraging AI/ML for predictive analytics, using historical and real-time telemetry to forecast potential failures before they impact users. Principal AI-first SREs build sophisticated AIOps observability pipelines, automate infrastructure governance, and develop adaptive Service Level Indicators (SLIs) and Service Level Objectives (SLOs) that evolve with system behavior. Typical duties also include leading chaos engineering and resilience testing programs, mentoring engineering teams, and driving a culture of reliability and data-driven decision-making across the organization. They often participate in and refine on-call processes, focusing on eliminating repetitive operational work through automation.

The skills and requirements for these high-impact jobs are extensive. Candidates generally have 8-10+ years of experience in software or systems engineering, with a substantial portion dedicated to SRE or platform engineering. Expertise in cloud platforms such as AWS, GCP, or Azure, container orchestration with Kubernetes, and infrastructure-as-code tools such as Terraform is standard. Strong programming proficiency in languages such as Python or Go for developing automation and tooling is essential. A deep understanding of observability stacks, including metrics, logging, and tracing (e.g., Prometheus, Grafana, OpenTelemetry), and of service mesh technologies is required. Crucially, hands-on experience with AI/ML concepts applied to operations, such as anomaly detection, pattern recognition, and predictive modeling, differentiates this role. Strong leadership, communication, and the ability to influence cross-functional teams with data are paramount.

For those seeking to define the future of resilient systems, Principal AI-first SRE jobs offer a challenging and rewarding career path at the intersection of cutting-edge AI and foundational infrastructure reliability.
