Senior Site Reliability Engineer Job at Arcadia (Chennai)

Location:
Chennai, India

Category:
IT - Software Development

Contract Type:
Not provided

Salary:
Not provided

Job Description:

We are seeking an experienced Senior Site Reliability Engineer (L3) to join our SRE/Platform Engineering team in India. This role will focus on building, scaling, and maintaining our AWS- and Kubernetes-based platform, ensuring high reliability, cost efficiency, and secure operations across multiple environments. The successful candidate will work closely with Engineering, Security, DevOps, and Product teams to drive automation, improve infrastructure resilience, and elevate observability across mission-critical systems.

The ideal candidate is a hands-on self-starter who can dive deep into complex distributed systems, automate away manual processes, and proactively identify reliability gaps. They should have a proven track record of managing production-grade AWS infrastructure, Kubernetes clusters, CI/CD pipelines, and cloud security. They will collaborate daily with US-based engineering teams and cross-functional partners to ensure our platform remains scalable, secure, and cost-optimized as we continue to grow.

Job Responsibilities:

Design, build, and maintain AWS infrastructure (EKS, VPC, RDS, IAM, CloudWatch, CloudTrail, GuardDuty, Load Balancers, S3, CloudFront) using Terraform and CloudFormation
Lead all aspects of Kubernetes operations including cluster upgrades, performance tuning, CNI troubleshooting, workload scaling, Helm chart packaging, and GitOps deployments
Own and evolve our CI/CD ecosystem across Jenkins (Groovy scripting), GitHub Actions, AWS CodePipeline, ArgoCD, and FluxCD
Improve platform reliability by reducing operational toil through automation, scripting (Python/Bash), and proactive system hardening
Implement and enhance observability across Prometheus, Grafana, Loki, Tempo, Datadog, and CloudWatch, ensuring actionable alerting, dashboards, and metrics aligned with SLOs/SLIs
Drive FinOps initiatives, identifying cost inefficiencies and working with engineering teams to implement best practices, tagging standards, budgeting, and resource right-sizing
Manage database operations across MySQL and PostgreSQL including backups, performance tuning, replication, and operational runbooks
Maintain and improve secret management using Vault, AWS Secrets Manager, and Parameter Store
Strengthen cloud security posture with IAM least privilege, CSPM reviews, audit readiness, GuardDuty/CloudTrail monitoring, and environment hardening
Troubleshoot complex production issues across networking, Kubernetes, compute, databases, and CI/CD systems
Collaborate daily with US-based teams for incident reviews, migrations, roadmap work, and platform enhancements
Contribute to the development and adoption of AI-enabled tooling such as automation, debugging assistants, MCP, and RAG pipelines (good to have, not mandatory)
Document runbooks, architecture diagrams, SOPs, troubleshooting guides, and operational best practices
Participate in on-call rotations (if required) and drive post-incident analysis and long-term fixes

Requirements:

Bachelor’s degree in Computer Science, Engineering, or equivalent practical experience
8–10+ years of experience in SRE/DevOps/Cloud Engineering, with deep hands-on exposure to AWS and Kubernetes
Strong hands-on experience with:
  Terraform and Infrastructure as Code
  AWS core services (EKS, IAM, RDS, EC2, VPC, CloudWatch, CloudTrail, GuardDuty)
  Jenkins with Groovy, GitHub Actions, ArgoCD, FluxCD
  Kubernetes troubleshooting and operations
  Prometheus/Grafana/Datadog observability stacks
Proven ability to operate in high-scale, high-uptime, multi-environment production systems
Experience building automation via Python/Bash and reducing operational toil
Strong understanding of incident management, root cause analysis, and reliability engineering principles
Experience working with globally distributed teams across multiple time zones
Excellent communication skills (the role requires daily interaction with US teams)
Ability to work independently with minimal supervision, take ownership, and drive initiatives end-to-end
A growth mindset, strong troubleshooting ability, and comfort with complex cloud-native environments

Nice to have:

Experience with self-hosted n8n and workflow automation platforms
Exposure to LLMs, RAG, vector databases, and MCP concepts
Experience with cloud security/DevSecOps tools (Trivy, Inspector, OPA, Kyverno)
Hands-on experience with FinOps platforms and cloud cost governance
Certifications in related fields (AWS, Kubernetes, Terraform, etc.)

What we offer:

Competitive compensation and employee stock options
Hybrid/remote-first working model (India-based role, with global collaboration)
Flexible leave policy
Comprehensive medical insurance (self + family members)
Annual performance cycle + quarterly recognition awards
A supportive, diverse engineering culture grounded in empathy, teamwork, and innovation

Additional Information:

Job Posted:
December 06, 2025

Employment Type:
Full-time

Work Type:
Hybrid work
