CrawlJobs Logo

Senior Site Reliability Engineer

United States 55.00 - 68.00 USD / Hour · Job Posted April 12, 2026
Apply Position
Job Link Share

Job Description

The Sr Site Reliability Engineer will architect, develop, and maintain cloud environment in both the commercial and government cloud. The role will work closely with software engineers, architects, and DevOps engineers to architect and maintain a secure, resilient and high performance cloud infrastructure.

Job Responsibility

  • Build, maintain, and operate IaaS and PaaS infrastructure in Azure commercial and government clouds
  • Work closely with dev teams to identify and measure SLOs, SLAs and SLIs
  • Act a strong contributor to development of platform services including architecture, provisioning, configuration, deployment, and support
  • Perform integrations with central logging, metrics dashboards, instrumentation, incident monitoring and management
  • Build/integrate/administer systems and tools that enable engineering teams to observe their applications in production with autonomy (Dashboards, APMs)
  • Support software and/or cloud-infrastructure in an on-call rotation basis
  • Assist with identification and remediation of technical problems at the root cause by continuously implementing automation, self-healing, and real-time monitoring to production systems
  • Maintain and improve operational tooling, frameworks, build frameworks that test the performance and resiliency of our platform services/tools
  • Automate alerts for metrics on performance, cost, vulnerabilities, risk, compliance violations
  • Improve processes and champion automation of any manual items around support

Requirements

  • 4 + years of experience working within a SRE engineer/cloud platform role
  • Experience leveraging AI tools in the software development (or product) lifecycle in order to improve quality and efficiency
  • Expert knowledge of a cloud service provider
  • Expert knowledge and hands on production experience in Kubernetes (bare metal or managed) cluster setup and management required
  • Experience with infrastructure as code (IaC) tools like Terraform, Pulumi
  • Experience with Kubernetes deployment tools like Helm, ArgoCD, Flux
  • Strong awareness of networking and internet protocols
  • Understanding of identity and access management (IAM)
  • Experience supporting infrastructure in production cloud environments
  • Knowledge of Encryption, Public Key Infrastructure (PKI), understanding of OWASP
  • Experience working with RESTful services
  • Some experience with monitoring tools (Azure Monitor, Splunk, Dynatrace, Graphana, Prometheus)
  • Familiarity with IDEs and Source Control tools like Visual Studio Code and Git

Nice to have

  • Bachelor’s Degree in Computer Science, Information Technology, Software Engineering, Math, Physics
  • Master’s Degree with coursework focused on advanced algorithms, mathematics in computing, data structures or related field
  • Expert knowledge of Azure
  • Demonstrate passion about infrastructure automation
  • Ability to prioritize work in a fast-paced environment

Looking for more opportunities?

Search for other job offers that match your skills and interests.

Similar Jobs for

Senior Site Reliability Engineer

8 matching positions

Senior Site Reliability Engineer

AutoRABIT is the leader in DevSecOps for SaaS platforms such as Salesforce. Its ...
Location
Location
India , Hyderabad
Salary
Salary:
Not provided
autorabit.com Logo
AutoRABIT
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • 6+ years of experience in SRE, DevOps, or related roles
  • Solid hands-on experience with AWS services (EKS, ECS, EC2, RDS, S3, Redis, etc.)
  • Proficient in writing Terraform infrastructure scripts
  • Strong scripting skills in Python using Boto3
  • Deep understanding of monitoring/logging tools (ELK, CloudWatch, TrendMicro)
  • Experience building and managing CI/CD pipelines (CodeBuild, CodePipeline)
  • Knowledge of infrastructure security and incident response practices
  • Willing to work in rotational shifts and rotational week-offs
  • Bachelor’s in computers or any related field
  • AWS certifications is preferred
Job Responsibility
Job Responsibility
  • Provision and manage AWS infrastructure using Terraform
  • Write AWS Lambda functions (Python3 + Boto3) to automate operational tasks
  • Set up monitoring, logging, and alerting with ELK, TrendMicro, and AWS CloudWatch
  • Configure alerts for performance and security anomalies
  • Develop and maintain CI/CD pipelines using AWS CodeBuild and CodePipeline
  • Troubleshoot production issues and contribute to blameless postmortems
  • Contribute to system hardening and security compliance efforts
  • Responsibility to adhere to set internal controls
  • Fulltime
Read More
Arrow Right

Senior Site Reliability Engineer

As a Senior Site Reliability Engineer on the Platform team, you will identify is...
Location
Location
United States , Denver; San Francisco
Salary
Salary:
138000.00 - 191000.00 USD / Year
https://checkr.com Logo
Checkr
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • Degree in Computer Science (or related field)
  • 6+ years of experience in building tools with Python (preferred), GoLang, or Ruby
  • 6+ years of experience in maintaining and observing production customer-facing environments in AWS or Azure
  • 6+ years of experience as a member of an incident response team
  • Deep understanding of the fundamental infrastructure and platform concepts behind a micro-service architecture, REST APIs, and asynchronous queueing models
  • Experience with observability platforms and frameworks like Datadog, Splunk, Grafana, Prometheus, or OpenTelemetry
  • Strong collaboration, documentation, communication, and project management skills
  • Experience with container orchestration using Kubernetes/Docker/Terraform
  • Experience driving platform adoption across engineering teams, guided by a self-service and product-first approach
  • A passion for customer-centricity and building relationships with other teams
Job Responsibility
Job Responsibility
  • Collaborate, drive, and execute architectural discussions with cross-functional teams
  • Lead cross-team projects and SREs' technical roadmap to enable engineering and help Checkr customers
  • Design, build, ship, and maintain the core observability libraries, tools, and patterns used by all of Checkr’s engineering teams
  • Proactively engage across teams to foster service reliability, efficiency, and scalability
  • Troubleshoot complex production issues across the stack, with respect to performance, availability, and data quality
  • Present detailed technical information and benefits of the Checkr platform to a wide array of customers, including operations, developers, technical architects, and executives
What we offer
What we offer
  • A fast-paced and collaborative environment
  • Learning and development allowance
  • Competitive cash and equity compensation and opportunities for advancement
  • 100% medical, dental, and vision coverage
  • Up to $25K reimbursement for fertility, adoption, and parental planning services
  • Flexible PTO policy
  • Monthly wellness stipend, home office stipend
  • In-office perks such as lunch four times a week, commuter stipend, and an abundance of snacks and beverages
  • Fulltime
Read More
Arrow Right

Senior Vice President, Cloud Security Site Reliability Engineer

This role sits within the Cloud Security team which is responsible for Private a...
Location
Location
Singapore , Singapore
Salary
Salary:
Not provided
https://www.citi.com/ Logo
Citi
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • Bachelor’s degree or equivalent work experience
  • 8+ years of relevant work experience
  • Highly motivated self-starter with excellent interpersonal and communication skills. Able to communicate efficiently at multiple levels of seniority
  • Certification or formal training in site reliability engineering concepts and practices
  • Prior experience working towards SLIs, SLOs and observability capabilities at a large scale
  • 5+ years experience in Python (preferable) or Java, on large scale systems alongside Linux based scripting languages
  • Experience working on observability, logging and metrics toolsets
  • Experience of k8s and container technologies such as Docker, Openshift and EKS.
  • Experience with public cloud technologies such as AWS, GCP or Azure
  • Experience with Secrets products such as HashiCorp Vault or CyberArk
Job Responsibility
Job Responsibility
  • Working across Container products and Secrets products, across Public and Private Cloud, as well as Cloud native specific products
  • Architecting and building tools and platforms that provide capabilities for SRE
  • Collaboration with multiple stakeholders and partners across Engineering and Operations as well as partner teams within the wider Citi organization
  • Actively owning production level incidents till resolution.
  • Fulltime
Read More
Arrow Right

Senior Site Reliability Engineer

We are seeking an experienced Senior Site Reliability Engineer (L3) to join our ...
Location
Location
India , Chennai
Salary
Salary:
Not provided
arcadia.com Logo
Arcadia
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • Bachelor’s degree in Computer Science, Engineering, or equivalent practical experience
  • 8–10+ years of experience in SRE/DevOps/Cloud Engineering, with deep hands-on exposure to AWS and Kubernetes
  • Strong hands-on experience with: Terraform & Infrastructure as Code
  • AWS core services (EKS, IAM, RDS, EC2, VPC, CloudWatch, CloudTrail, GuardDuty)
  • Jenkins + Groovy, GitHub Actions, ArgoCD, FluxCD
  • Kubernetes troubleshooting and operations
  • Prometheus/Grafana/Datadog observability stacks
  • Proven ability to operate in high-scale, high-uptime, multi-environment production systems
  • Experience building automation via Python/Bash and reducing operational toil
  • Strong understanding of incident management, root cause analysis, and reliability engineering principles
Job Responsibility
Job Responsibility
  • Design, build, and maintain AWS infrastructure (EKS, VPC, RDS, IAM, CloudWatch, CloudTrail, GuardDuty, Load Balancers, S3, CloudFront) using Terraform and CloudFormation
  • Lead all aspects of Kubernetes operations including cluster upgrades, performance tuning, CNI troubleshooting, workload scaling, Helm chart packaging, and GitOps deployments
  • Own and evolve our CI/CD ecosystem across Jenkins (Groovy scripting), GitHub Actions, AWS CodePipeline, ArgoCD, and FluxCD
  • Improve platform reliability by reducing operational toil through automation, scripting (Python/Bash), and proactive system hardening
  • Implement and enhance observability across Prometheus, Grafana, Loki, Tempo, Datadog, and CloudWatch—ensuring actionable alerting, dashboards, and metrics alignment with SLO/SLIs
  • Drive FinOps initiatives, identifying cost inefficiencies and working with engineering teams to implement best practices, tagging standards, budgeting, and resource right-sizing
  • Manage database operations across MySQL and PostgreSQL including backups, performance tuning, replication, and operational runbooks
  • Maintain and improve secret management using Vault, AWS Secrets Manager, and Parameter Store
  • Strengthen cloud security posture with IAM least privilege, CSPM reviews, audit readiness, GuardDuty/CloudTrail monitoring, and environment hardening
  • Troubleshoot complex production issues across networking, Kubernetes, compute, databases, and CI/CD systems
What we offer
What we offer
  • Competitive compensation and employee stock options
  • Hybrid/remote-first working model (India-based role, with global collaboration)
  • Flexible leave policy
  • Comprehensive medical insurance (self + family members)
  • Annual performance cycle + quarterly recognition awards
  • A supportive, diverse engineering culture grounded in empathy, teamwork, and innovation
  • Fulltime
Read More
Arrow Right

Senior Software Engineer, Site Reliability

Babylist is looking for a Senior Software Engineer, Site Reliability to join our...
Location
Location
United States; Canada
Salary
Salary:
186818.00 - 224183.00 USD; CAD / Year
babylist.com Logo
Babylist
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • 8+ years of experience as a Site Reliability Engineer or similar role
  • Experience supporting high-traffic consumer-facing websites
  • Proficiency with Terraform
  • Strong experience working with AWS cloud-based infrastructure and services
  • Proficiency with Docker and Kubernetes
  • Solid understanding of cloud-native systems design
  • Troubleshooting and debugging skills
  • Experience designing and supporting CI systems
  • Familiar with monitoring and alerting best practices
  • Proven experience in on-call management best practices
Job Responsibility
Job Responsibility
  • Manage and build our AWS infrastructure using Infrastructure as Code (IaC) tools like Terraform
  • Improve the speed and reliability of our Continuous Integration (CI) systems
  • Provide support to developers in troubleshooting issues
  • Establish, communicate, and support best practices for monitoring and alerting
What we offer
What we offer
  • Company-paid medical, dental, and vision insurance
  • Retirement savings plan with company matching and flexible spending accounts
  • Generous paid parental leave and PTO
  • Remote work stipend
  • Perks for physical, mental, and emotional health, parenting, childcare, and financial planning
  • Fulltime
Read More
Arrow Right

Senior Site Reliability Engineer

HiveWatch is seeking a Staff Site Reliability Engineer to join our Platform Team...
Location
Location
United States , El Segundo
Salary
Salary:
183000.00 - 235000.00 USD / Year
hivewatch.com Logo
HiveWatch
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • 7+ years of software engineering experience with strong coding skills in production environments
  • 5+ years of SRE, DevOps, or production operations experience
  • Expertise with cloud platforms (AWS preferred) and containerized applications (Docker, Kubernetes)
  • Experience with Infrastructure as Code (Terraform, CloudFormation, or similar)
  • Proficiency in at least one object oriented programming language in our tech stack (Java, Kotlin, Python)
  • Hands-on experience with relational databases and SQL performance optimization
  • Experience with monitoring and observability tools (Prometheus, Grafana, DataDog, or equivalent)
  • Strong debugging skills across distributed systems and microservices architectures
  • Bachelor's degree in Computer Science, Engineering, or equivalent practical experience
Job Responsibility
Job Responsibility
  • Own the reliability of mission-critical systems including production monitoring, alerting, and capacity planning
  • Debug and resolve complex production issues across the full stack, from infrastructure to application code
  • Participate in a regular on-call rotation to provide 24/7 coverage for critical systems
  • Perform root cause analysis requiring deep code-level investigation and implement preventive measures
  • Build automation and tooling to reduce operational toil and improve system reliability
  • Maintain CI/CD pipelines, observability infrastructure, and database performance optimization
  • Increase the resiliency, scalability, and maintainability of production environments
  • Establish on-call procedures and disaster recovery processes
  • Provide technical leadership and mentorship to foster engineering excellence and reliability culture
What we offer
What we offer
  • Comprehensive health coverage: medical, dental, vision, and life insurance
  • Cutting-edge work in an emerging field with huge growth potential
  • Competitive compensation packages designed to reward top talent
  • A modern, newly renovated HQ right on Main Street in El Segundo, CA
  • 401(k) with a 4% company match to help you invest in your future (match launches in 2026)
  • Flexible paid time off so you can recharge when you need it
  • Additional benefits include ClassPass credits and a discount on pet insurance
  • A family-friendly, compassionate culture that values balance and belonging
  • Eligible to participate in HiveWatch Equity Incentive Plan
  • Fulltime
Read More
Arrow Right

Senior Site Reliability Engineer

What will you be doing at Miniclip? Participate in an on-call rotation with the ...
Location
Location
Portugal , Lisbon
Salary
Salary:
Not provided
miniclip.com Logo
Miniclip
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • 5+ years of hands-on experience with AWS in both development and operations contexts
  • Strong Linux system administration skills, including performance tuning and debugging
  • Software development background and strong coding skills in one or more of the following: Go, Python, Ruby
  • Experience with Infrastructure as Code, particularly Terraform
  • Familiarity with CI/CD pipelines and artifact management tools
  • A mindset for resilient systems design, thinking about edge cases, failure modes, and graceful degradation
  • Excellent communication skills in English, both written and spoken
  • Comfortable in a fast-paced environment and adaptable to shifting priorities
Job Responsibility
Job Responsibility
  • Participate in an on-call rotation with the Cloud Engineering team to respond to production incidents and outages
  • Operate and evolve infrastructure using Infrastructure as Code (Terraform), configuration management tools, and containerized platforms on AWS
  • Build and maintain observability tooling to detect symptoms before they lead to outages
  • Automate repetitive tasks and processes to reduce operational toil
  • Collaborate with Engineering and Product teams to design resilient systems that meet performance and reliability goals
  • Troubleshoot production issues across application, network, and infrastructure layers
  • Document systems, processes, and runbooks to improve team transparency and onboarding
Read More
Arrow Right

Senior Site Reliability Engineer

As a Site Reliability Engineer, you will focus on ensuring that the Prolific pla...
Location
Location
United Kingdom
Salary
Salary:
Not provided
prolific.com Logo
Prolific
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • 5+ years with Google Cloud Platform, GKE, and the Kubernetes ecosystem with experience with Terraform and Terragrunt
  • Strong programming skills in Python
  • Strong experience in observability principles and tooling
  • Experience in GitOps flows and platforms for Kubernetes, such as ArgoCD
  • Deep understanding of system architecture and scalability principles
  • Strong collaboration and communication skills to work with cross-functional teams
Job Responsibility
Job Responsibility
  • Develop and maintain highly available infrastructure using modern infra-as-code techniques, with a focus on terragrunt and terraform
  • Manage and optimise Kubernetes clusters and their workloads with a focus on reliability and performance
  • Participate in incident response and remediation, working with relevant product teams and stakeholders to resolve production issues efficiently, including creating and maintaining runbooks
  • Review and optimise other areas of our tooling stack, such as CICD or release strategies
  • Foster a culture of continuous improvement, such as enhancing documentation and upskilling teams in cloud architecture and kubernetes
  • Improve observability and alerting systems across our application and infrastructure, ensuring proactive detection of system degradation
  • Collaborate with Engineering teams to foster an SRE culture, including contributing defining SLO’s, SLA’s and error budgets
  • Design and implement automation strategies to ensure managed services remain up-to-date, secure, and performant
  • Lead and support initiatives that automate processes to improve system efficiency, resilience and reduce toil
  • Organising, supporting and responding to on-call incidents
What we offer
What we offer
  • competitive salary
  • benefits
  • remote working
  • impactful, mission-driven culture
  • Fulltime
Read More
Arrow Right