SRE Developer

India, Bangalore South · Job Posted March 26, 2026

Job Description

We are looking for a proactive SRE Developer with 3–5 years of experience to manage Business‑As‑Usual (BAU) SRE operations while driving automation, reliability, and operational excellence. The role focuses on incident management, CI/CD operations, observability, and leveraging AI‑assisted tools to reduce manual effort and improve system reliability across cloud‑native environments.

Job Responsibility

  • Handle SRE BAU operations including incident management, root cause analysis, problem resolution, and service restoration
  • Manage and maintain CI/CD pipelines and deployment automation across environments
  • Improve system reliability, scalability, and performance through automation and proactive monitoring
  • Implement and manage observability solutions including logging, metrics, alerting, and dashboards
  • Utilize AI tools (CursorAI, Generative AI, automation copilots) for faster troubleshooting, documentation, code generation, and incident analysis
  • Collaborate with engineering, product, and security teams to ensure smooth releases and secure infrastructure
  • Reduce manual operational effort through AI-assisted automation and scripting
  • Drive DevOps best practices and continuous improvement initiatives

Requirements

  • Strong hands-on experience in SRE or DevOps operations
  • Expertise in CI/CD tools such as GitHub Actions, GitLab CI, Jenkins, Azure DevOps
  • Experience with monitoring and observability tools (Grafana, Prometheus, ELK, Splunk, Datadog, New Relic, etc.)
  • Good understanding of cloud platforms (AWS, Azure, or GCP)
  • Practical experience using AI tools in daily engineering workflows (CursorAI, ChatGPT, GenAI tools, automation assistants)
  • Ability to identify repetitive operational tasks and automate using AI or scripts
  • Familiarity with AI-driven troubleshooting and documentation
  • Proficiency in Python, Bash, PowerShell, or similar scripting languages
  • Exposure to Infrastructure as Code (IaC) tools such as Terraform, CloudFormation, ARM, or Ansible

Nice to have

  • Experience supporting production environments with on-call rotations
  • Knowledge of containerization and orchestration (Docker, Kubernetes)
  • Understanding of performance tuning and capacity planning
  • Experience integrating AI into operational workflows or automation pipelines
  • Strong ownership mindset, adaptability, and continuous improvement attitude
  • Excellent communication and cross‑team collaboration skills

Similar Jobs for SRE Developer

8 matching positions

SRE (Developer Relations)

Location: Japan, Tokyo (23 wards)
Salary: 7,000,000 - 10,000,000 JPY / Year
Company: Randstad
Expiration Date: Until further notice
Requirements
  • Fluent in English
  • Minimum 4 years of experience as an SRE engineer or Infrastructure Engineer
  • Experience in consulting or forward deployed engineering (FDE)
  • Experience with Kubernetes
  • Experience with debugging, problem solving, and resolving incidents
  • Experience with application development
  • Experience in multiple widely-used programming languages
  • Experience in AWS, GitHub, JIRA/Confluence, Slack, Linux (bash, CLI)
What we offer
  • Health insurance
  • Employees' pension insurance
  • Employment insurance
  • Full-time

SRE Ansible developer

Location: Canada, Toronto
Salary: 155,000 USD / Year
Company: Realign
Expiration Date: Until further notice
Requirements
  • Design and implement automation scripts using Ansible for infrastructure provisioning and configuration management
  • Develop and maintain monitoring solutions leveraging Dynatrace for application and system performance
  • Configure and optimize ITRS monitoring tools to ensure proactive alerting and incident management
  • Collaborate with development and operations teams to improve system reliability and scalability
  • Automate deployment pipelines and integrate with CI/CD processes for faster releases
  • Troubleshoot performance issues and implement solutions to enhance system resilience
  • Ensure compliance with security and operational standards across environments
  • Document automation workflows, monitoring configurations, and best practices for knowledge sharing
  • Total Experience: 6-8 years
  • Full-time

Python Developer - Site Reliability Engineering (SRE)

We are seeking a skilled Python Developer with experience in the Site Reliabilit...
Location: Canada, Montreal
Salary: Not provided
Company: NTT DATA
Expiration Date: Until further notice
Requirements
  • 3+ years of experience with Python development
  • 6 years of experience working with Infrastructure as Code (Terraform and Ansible)
  • Experience with CI/CD pipelines, preferably GitHub Actions and Jenkins
  • Strong understanding of object-oriented design and development principles
  • Proficiency in Linux/Unix environments
  • Experience working with database technologies (preferably NoSQL), including data modeling, testing, and performance tuning
  • Ability to write reusable, optimized, maintainable, and well‑documented code following industry best practices
  • Experience implementing open-source monitoring and observability tools such as Prometheus, Grafana, Splunk, or OpenTelemetry
  • Strong problem‑solving skills and ability to take ownership of tasks and drive them independently to closure
  • Understanding of networking concepts (TCP/IP, DNS, Load Balancing)
Job Responsibility
  • Develop quality software working with public cloud service provider (CSP) infrastructure across different Public Cloud areas
  • Develop, enhance, and integrate automation workflows for Public Cloud Service Providers (CSP), initially focused on Azure, and integrate with in-house tooling
  • Integrate automation workflows into CI/CD pipelines using GitHub Actions and Jenkins
  • Build proof-of-concept solutions in new areas of cloud and automation development
  • Provide technical support and debugging for application failures in both on-premises and cloud environments
  • Participate in all phases of the Software Development Life Cycle (SDLC), including analysis, design, coding, testing, and deployment
  • Evaluate, onboard, and implement emerging DevOps and automation tools to improve efficiency
  • Build and integrate observability into cloud platforms and solutions using open-source tools (Prometheus, Grafana, OpenTelemetry)
  • Identify, highlight, and reduce operational toil through automation, architectural improvements, and process optimization
  • Collaborate with global teams to understand requirements, develop high‑quality code, and deliver cloud-focused projects

Principal Site Reliability Engineer

Palo Alto Networks runs a large hybrid infrastructure and is one of the largest ...
Location: United States, Santa Clara
Salary: 151,600 - 245,300 USD / Year
Company: Palo Alto Networks
Expiration Date: Until further notice
Requirements
  • BS or MS in Computer Science, a related field, or equivalent professional experience or equivalent military experience
  • Expertise in configuration management with frameworks such as Ansible, Terraform, Helm, or Kubernetes
  • Proficient in Python and/or Go
  • Expertise in managing applications in the Kubernetes cluster with autoscaling enabled
  • Experience in Production Engineering, DevOps, or Site Reliability
  • Expertise in the public cloud (GCP or AWS), especially in GCP
  • Strong Linux administration, internals, and network troubleshooting
  • Proficiency with programming languages like Python, Golang, and shell scripting to automate tasks
  • Experience with CI/CD pipelines, GitLab, and GitHub preferred
  • Ability to diagnose and troubleshoot complex distributed systems handling high-volume transactions
Job Responsibility
  • Contribute to the success of SRE and DevOps
  • Develop expertise in new technologies
  • Work with developers, researchers, data scientists, and security experts
  • Design, build, and operate reliable, secure Cloud infrastructure
  • Ensure that applications are production-ready, scalable, and reliable
  • Develop tools and automation frameworks
  • Automate the deployment of robust services
  • Orchestrate end-to-end monitoring and alerting
  • Participate with SRE and Dev teams in the on-call rotation
  • Lead root cause analysis of critical business and production issues
  • Full-time

Lead Database Reliability Engineer

As our Lead Database Reliability Engineer, you'll support our products, Timely a...
Location: New Zealand
Salary: Not provided
Company: EverCommerce
Expiration Date: Until further notice
Requirements
  • Strong ability to work autonomously, prioritise effectively, and make sound technical decisions while knowing when to seek support or collaboration
  • Deep expertise in database reliability, performance, and management, with the ability to mentor and coach others
  • Strong experience with relational databases, ideally SQL Server and Azure SQL, with exposure to Oracle environments
  • Advanced T-SQL skills and a passion for solving complex data challenges
  • Experience with, or willingness to learn, scripting, object-oriented programming, and infrastructure-as-code technologies such as Python, PowerShell, and Terraform
  • Focus on delivering scalable, reliable solutions with strong practices across monitoring, alerting, automation, documentation, and knowledge sharing
  • Experience working in agile environments using SCRUM and/or Kanban methodologies
  • Confident communicator who enjoys collaborating, contributing ideas, and engaging in healthy discussions around data practices and engineering improvements
Job Responsibility
  • Drive data strategy and data practices for how our databases work, scale and are used
  • Capacity planning and performance tuning of the database platforms
  • Carry out database-related project work (e.g., writing migration scripts, stored procedures, DDL, etc.)
  • Manage high risk data deployments and post-deploy monitoring
  • Assist development teams in a consultancy role for identifying risks and proposing solutions around database reliability
  • Work with the data team and product teams to constantly improve data operational engineering practices
  • Maintain awareness of trends and emerging technologies in relevant fields and propose them to Wellness Solutions where they fit
  • Advocate for and apply DevOps and SRE principles across Wellness engineering teams
  • Grow and mentor other Data professionals
What we offer
  • Work-life balance
  • Additional annual leave
  • Flexibility to work from home or the office
  • High-spec home office setup
  • Professional development budget
  • Annual wellness allowance
  • Full-time

Senior Software Engineer - Kubernetes & ServiceMesh

Join us in building Roku’s next-generation cloud-agnostic platform that powers K...
Location: India, Bengaluru
Salary: Not provided
Company: Roku
Expiration Date: Until further notice
Requirements
  • Strong hands-on experience with cloud technologies (AWS preferred; GCP or Azure is a plus), specifically in architecting and managing performant, large-scale systems handling significant traffic/data
  • Deep knowledge of Kubernetes (EKS, GKE, AKS, or similar) and service mesh technologies
  • Proficiency in Go or another programming language, and in Python or another scripting language
  • Experience designing infrastructure and building automation tools, while collaborating with internal team members and external stakeholders
  • Experience building CI/CD pipelines and following modern deployment practices
  • Familiarity with observability tools (Prometheus, Thanos, Loki, Grafana, etc.)
  • Ability to work independently and communicate effectively with technical and non-technical stakeholders
  • Passion for learning and solving complex infrastructure challenges
  • Experience integrating AI tools to improve processes and reduce operational toil (a plus)
Job Responsibility
  • Architect, design, and deploy Roku’s next-generation cloud platform and service mesh
  • Build and own solutions to Roku's compute problems using Docker, Kubernetes, Istio/Envoy, Terraform and scripting to evolve our tech stack and deployments
  • Proactively drive the research and implementation of new technologies to enhance scalability, reliability, and developer experience
  • Integrate security best practices into infrastructure design and automation
  • Build tooling to visualize inefficiencies and optimize costs across shared-tenancy clusters, including network traffic insights, cross-cluster communication efficiency, and cost attribution
  • Collaborate with internal teams to migrate workloads to Kubernetes + Istio, leveraging open-source observability tools
  • Work closely with the Observability team to scale monitoring and logging solutions for a holistic view of the platform
  • Leverage SRE principles to maintain high availability and streamline onboarding workflows
  • Mentor team members and help define best practices for infrastructure and automation
What we offer
  • global access to mental health and financial wellness support and resources
  • healthcare (medical, dental, and vision)
  • life insurance
  • accident insurance
  • disability insurance
  • commuter benefits
  • retirement options (401(k)/pension)
  • time off
  • Full-time

Senior Software Engineer - SRE

Roku is changing how the world watches TV. Roku is the #1 TV streaming platform ...
Location: India, Bengaluru
Salary: Not provided
Company: Roku
Expiration Date: Until further notice
Requirements
  • Preferably 8+ years of experience in DevOps/SRE roles, with demonstrated expertise in implementing SRE principles, SLO/SLI frameworks, and error budget policies in production environments
  • Deep experience with observability and monitoring platforms such as Prometheus, Grafana, Datadog, New Relic, or equivalent, including experience building custom dashboards, alerts, and SLO-based monitoring
  • Strong background in incident management, including experience as an Incident Commander, conducting blameless postmortems, and implementing systematic reliability improvements based on incident learnings
  • Strong understanding of distributed systems and reliability engineering, including failure modes, fault tolerance patterns, circuit breakers, bulkheads, rate limiting, and graceful degradation strategies
  • Experience with a number of the following: Kubernetes, Docker, Service Mesh such as Istio, Envoy, Linkerd, Solo & ECS
  • Experience in cloud-focused software development, preferably in Go, Python, or other object-oriented programming languages
  • Experience with Infrastructure as Code (IaC) tools such as Terraform, Ansible, or CloudFormation
  • Experience with CI/CD automation, including GitLab pipelines and other related tools
  • Strong hands-on experience with cloud platforms such as AWS, GCP or Azure
  • Proven track record of implementing scalable, high-performance infrastructure solutions in fast-paced, dynamic environments
Job Responsibility
  • Design & Infrastructure
  • Contribute to postmortem culture by facilitating comprehensive, blameless post-incident reviews that identify root causes, contributing factors, and actionable remediation items. Track incident trends to identify systemic issues and prioritize reliability improvements
  • Implement chaos engineering practices to proactively identify failure modes, validate system resilience, and build confidence in recovery procedures. Conduct game days and disaster recovery exercises
  • SRE Process & Principles Implementation
  • Deploy and evolve SRE practices across the organization by establishing core SRE principles, frameworks, and methodologies. Define and implement service reliability practices, including Service Level Objectives (SLOs), Service Level Indicators (SLIs), and Error Budgets, to balance innovation velocity with system reliability
  • Manage Error Budgets as a mechanism for making data-driven decisions about feature velocity vs. reliability. Track, report, and enforce error budget policies, facilitating conversations between engineering and product teams about risk tolerance and release decisions
  • Reliability Engineering & Infrastructure
  • Reduce toil through automation by identifying repetitive operational work and systematically eliminating it through infrastructure-as-code, automation frameworks, and intelligent tooling. Measure and track toil reduction efforts, aiming to keep toil below 50% of team time
  • Implement capacity planning processes that ensure systems have adequate headroom to meet SLOs during peak traffic, unexpected load spikes, and degraded states. Develop predictive models and automated scaling mechanisms
  • Observability, Monitoring & Reporting
What we offer
  • global access to mental health and financial wellness support and resources
  • healthcare (medical, dental, and vision)
  • life, accident, disability, commuter, and retirement options (401(k)/pension)
  • time off in accordance with local leave policies
  • Full-time

Applications Support Tech Lead Analyst - Vice President

The Apps Sup Tech Lead Analyst is a strategic professional who stays abreast of ...
Location: India (Chennai, Tamil Nadu; Pune, Maharashtra)
Salary: Not provided
Company: Citi
Expiration Date: Until further notice
Requirements
  • 10+ years of proven, practical experience in a production support role managing enterprise-level applications, demonstrating strong problem-solving and strategic thinking skills
  • Demonstrated experience with SRE practices, including advanced monitoring, alerting, incident response, post-mortems, and driving automation for operational efficiency and system reliability
  • Operating Systems & Scripting: Deep expertise in Unix/Linux environments and advanced Shell scripting
  • Monitoring & Logging Tools: Proficiency with enterprise monitoring tools (e.g., ITRS Geneos, AppDynamics) and log aggregation platforms (e.g., Splunk, ELK)
  • Practical experience with containerization platforms (OpenShift, Kubernetes)
  • Hands-on experience with relational (Oracle, MSSQL) and NoSQL (MongoDB) databases
  • Strong knowledge of messaging solutions (e.g., Tibco EMS, MQ, Kafka)
  • Experience working with REST APIs
  • Infrastructure Fundamentals: Solid understanding of distributed application architecture, including networks, load balancers, storage, and authentication (AD/LDAP)
  • Working knowledge and practical application of Object-Oriented Programming (OOP) concepts and principles
Job Responsibility
  • Partner with multiple technology teams to ensure appropriate integration of functions to meet goals
  • Identify and define necessary system enhancements; analyze existing system logic, identify problems, and recommend and implement solutions
  • Provides expertise in area and an advanced level of understanding of the principles of apps support
  • Formulates and defines systems scope and objectives for complex, high impact application enhancements and problem resolution through in-depth analysis and evaluation of complex business processes, systems and industry standards; documents requirements
  • Partners with multiple technology areas and management teams to ensure appropriate integration of functions to meet goals
  • Works closely with Product Owners, Business Analysts and Systems Analysts to determine and document Systems impacts and support requirements
  • Considers the implications of the application of technology to the current environment
  • Full-time