Cloud Reliability Engineer

India, Bangalore · Job Posted August 14, 2025

Job Description

This role has been designed as ‘Hybrid’ with an expectation that you will work on average 2 days per week from an HPE office.

Hewlett Packard Enterprise is the global edge-to-cloud company advancing the way people live and work. We help companies connect, protect, analyze, and act on their data and applications wherever they live, from edge to cloud, so they can turn insights into outcomes at the speed required to thrive in today’s complex world. Our culture thrives on finding new and better ways to accelerate what’s next. We know varied backgrounds are valued and succeed here. We have the flexibility to manage our work and personal needs. We make bold moves, together, and are a force for good. If you are looking to stretch and grow your career, our culture will embrace you. Open up opportunities with HPE.

Job Responsibility

  • Work on tools and technologies that involve monitoring, automating, and improving systems through software engineering principles applied to IT operations
  • Harness the power of data and metrics to make evidence-based improvements that enhance the way we operate COM
  • Build and maintain comprehensive metrics collection systems
  • Partner with feature development teams on best practices to ensure global team visibility of our application's health, SLIs, and SLOs
  • Use data to gain insight into the COM stack for the purpose of improving performance, reliability, and cost effectiveness
  • Build out robust documentation and runbook standards that our teams use to improve our incident response effectiveness
  • Implement and maintain security controls and practices to protect systems from unauthorized access and attacks

Requirements

  • Bachelor's or Master's degree in Computer Science, Information Systems, or equivalent
  • Typically 2-8 years’ experience
  • Development experience with Python, Go or Java (or C#, C++, C) or similar programming languages
  • Good understanding of REST APIs and the fundamentals of successful design and testing of a REST API
  • Have an enthusiastic, go-for-it attitude
  • Have an urge to collaborate and communicate asynchronously
  • Good understanding of distributed systems, event driven programming paradigms and designing for scale and performance
  • Ability to troubleshoot complex issues with curiosity, flexibility, creativity and a sense of ownership and accountability
  • Strong communication skills and ability to work in a distributed team
  • Highly desirable: one or more of Grafana, Prometheus, AWS, Kubernetes, Terraform

Nice to have

Cloud Architectures, Cross Domain Knowledge, Design Thinking, Development Fundamentals, DevOps, Distributed Computing, Microservices Fluency, Full Stack Development, Release Management, Security-First Mindset, User Experience (UX)

What we offer

  • Health & Wellbeing
  • Personal & Professional Development
  • Unconditional Inclusion

Similar Jobs for Cloud Reliability Engineer

8 matching positions

Cloud Engineer / Site Reliability Engineer (SRE)

Location: United States, Orlando
Salary: 75.00 USD / Hour
Company: Beacon Hill
Expiration Date: Until further notice
Requirements
  • Strong hands-on AWS experience with solid understanding of core AWS services
  • Experience supporting and troubleshooting AWS and Azure cloud environments
  • Terraform experience for Infrastructure as Code
  • Docker/containerization experience
  • Strong troubleshooting and problem-solving skills
  • Ability to translate requirements into technical execution
  • Experience performing cloud architecture and diagramming
  • Experience supporting deployments, environments, and site standups
  • Strong communication and collaboration skills
Job Responsibility
  • Support cloud infrastructure and deployments across AWS and Azure
  • Troubleshoot infrastructure and application-related cloud issues
  • Build and maintain Terraform-based infrastructure
  • Support Docker/containerized environments
  • Create architecture diagrams and technical documentation
  • Work closely with engineering and project teams to execute cloud initiatives
  • Assist with automation and operational improvement efforts
Employment type: Full-time

Senior Site Reliability Engineer Cloud Platform

Zilliz is a fast-growing startup developing the industry’s leading vector databa...
Salary: 175000.00 - 225000.00 USD / Year
Company: Zilliz
Expiration Date: Until further notice
Requirements
  • 4+ years of experience in site reliability engineering or similar roles with a focus on cloud-native systems
  • Proficiency in scripting languages such as Python, Go, or Java
  • Strong knowledge of container orchestration technologies like Kubernetes and Docker
  • Expertise with cloud platforms such as AWS, GCP, or Azure, and their respective monitoring and management tools
  • Experience with infrastructure as code tools such as Terraform or Ansible
  • Familiarity with CI/CD tools such as Jenkins, GitLab CI, or Argo
  • Proven ability to troubleshoot complex distributed systems and resolve issues promptly
  • Bachelor’s degree or above in computer science, software engineering, or other relevant disciplines
  • Ability to thrive in a fast-paced, startup environment and handle multiple projects simultaneously
Job Responsibility
  • Work at the intersection of development and site reliability, creating SRE tools and systems and supporting existing infrastructure and platforms
  • Ensure the reliability, availability, and performance of Zilliz’s distributed database systems
  • Develop and implement strategies for monitoring, incident management, and disaster recovery
  • Automate system operations and maintenance tasks to improve efficiency and reduce manual intervention
  • Design and build tools to manage and monitor infrastructure, ensuring scalability and robustness
  • Collaborate with software engineers to enhance system reliability, scalability, and performance
  • Maintain and improve the CI/CD pipeline to ensure smooth and rapid deployment of changes
  • Actively contribute to the Milvus Vector Database open-source community, focusing on improving reliability and operational efficiency
Employment type: Full-time

Principal Site Reliability Engineer (Sovereign Cloud)

Location: Bulgaria, Sofia
Salary: Not provided
Company: Palo Alto Networks
Expiration Date: Until further notice
Requirements
  • 6+ years as a DevOps engineer with a passion for technology, strong motivation and responsibility
  • Proficiency in DevOps and Platform Engineering with expertise in AWS, GCP, Terraform, ArgoCD, Kubernetes, and related tools
  • Experience in developing and maintaining CI/CD pipelines for continuous delivery in agile environments
  • Skilled in managing cloud infrastructure, particularly with AWS and GCP, and adept in infrastructure as code practices using Terraform/Terragrunt
  • Demonstrated capability in supporting high-scale SaaS applications, focusing on scalability, reliability, and performance
  • Strong communication, strategic thinking, and problem-solving skills
  • Self-disciplined, self-managed, self-motivated, strong sense of ownership, urgency, and drive
  • Ready to understand and dissect new technology stacks quickly
Job Responsibility
  • Implement and optimize CI/CD pipelines and cloud infrastructure using our technology stack, ensuring efficient and reliable deployment to production
  • Participate in the deployment of monitoring and alerting systems to maintain high system performance and reliability
  • Collaborate with software development and other cross-functional teams to streamline and enhance processes, aiming for efficiency and alignment with business goals
  • Contribute to the management of the cloud infrastructure, utilizing Infrastructure as Code principles
  • Participate in on-call rotations to support critical business and production systems
Employment type: Full-time

Sr Principal Site Reliability Engineer (Sovereign Cloud)

The Prisma Access team is seeking a seasoned Principal Site Reliability Engineer...
Location: Bulgaria, Sofia
Salary: Not provided
Company: Palo Alto Networks
Expiration Date: Until further notice
Requirements
  • 10+ years of experience in Infrastructure, SRE, or DevOps roles
  • BS or MS in Computer Science, a related field, or equivalent professional experience
  • 7+ years of experience with GCP, with expertise in its architecture, services, and PKI concepts for cloud security
  • Expert troubleshooting skills to resolve cloud infrastructure and service issues, effectively identifying root cause and devising effective solutions
  • Proficiency in automation using Python and shell scripting
  • Expertise in Infrastructure as Code (IaC) with Terraform and Helm, leveraging AI tools for development
  • Solid experience with Kubernetes, container networking, and container workloads
  • Strong Linux administration skills
  • Proficiency with CI/CD pipelines, GitOps principles, and tooling like GitLab and Jenkins
  • Excellent written and verbal communication skills, with the ability to collaborate effectively to drive outcomes
Job Responsibility
  • Design, build, and operate reliable, secure Cloud infrastructure across multi-cloud environments for our sovereign customers
  • Lead cross-functional initiatives to ensure applications are production-ready, scalable, secure, and resilient
  • Develop expertise in new technologies, embracing continuous learning and the adoption of AI tools
  • Develop tools and automation frameworks, championing Infrastructure as Code (IaC) and Monitoring as Code (MaC) principles
  • Automate robust deployments and orchestrate end-to-end monitoring and alerting solutions
  • Participate in on-call rotations to support critical business and production systems
  • Lead root cause analysis of critical issues, driving improvements and preventing recurrence
  • Champion the success of SRE and DevOps initiatives, aligning technical decisions with business goals
Employment type: Full-time

Sr Principal Site Reliability Engineer (Sovereign Cloud)

Palo Alto Networks runs a large infrastructure and is one of the largest GCP cus...
Location: Bulgaria, Sofia
Salary: Not provided
Company: Palo Alto Networks
Expiration Date: Until further notice
Requirements
  • 10+ years as an engineer in Infrastructure, Operations, DevOps, or System Engineering
  • 7+ years building high availability, scalable cloud-native applications on AWS and GCP
  • BS or MS in Computer Science, a related field, or equivalent professional experience required
  • Expertise in configuration management with a framework such as Ansible, Terraform, Helm
  • Passion for infrastructure and monitoring as code
  • Solid experience in container workloads and Kubernetes
  • Familiarity with PKI and networking concepts
  • In-depth knowledge of different security controls (app-id, user-id, security profile, url category, content, ssl decryption, firewall MFA, etc.)
  • Linux administration, internals, and network troubleshooting
  • Proficiency with programming languages like Golang or Python along with shell scripting to automate tasks
Job Responsibility
  • Contribute to the success of SRE and DevOps
  • Develop expertise in new technologies
  • Work with developers, researchers, data scientists, and security experts
  • Design, build and operate reliable, secure Cloud infrastructure
  • Ensure that applications are production-ready, scalable, and reliable
  • Develop tools and automation frameworks
  • Automate the deployment of robust services
  • Orchestrate end-to-end monitoring and alerting
  • Participate in on-call rotations to support critical business and production systems
  • Lead root cause analysis of critical business and production issues
Employment type: Full-time

Principal Site Reliability Engineer (Sovereign Cloud)

As a Principal Site Reliability Engineer, you will serve as the technical author...
Location: Bulgaria, Sofia
Salary: Not provided
Company: Palo Alto Networks
Expiration Date: Until further notice
Requirements
  • 7+ years of experience in Infrastructure, SRE, or DevOps roles
  • BS or MS in Computer Science, a related field, or equivalent professional experience
  • Kubernetes Mastery: Expert-level experience (6+ years) managing production K8s workloads (preferably within GKE, but will also consider EKS)
  • Deep understanding of Networking, Storage, and RBAC
  • CI/CD & GitOps: Hands-on expertise with ArgoCD and modern pipeline runners (GitHub Actions, GitLab CI, or Jenkins)
  • Programming: Proficient in Python for systems programming and automation
  • Security Mindset: Proven experience integrating security scanning and compliance checks within a containerized environment
  • Modern Workflow: Experience (or strong desire) using AI-pair programming tools like Cursor and Claude to multiply personal and team productivity
  • Excellent written and verbal communication, able to collaborate and rally support
  • Self-disciplined, self-managed, self-motivated, strong sense of ownership, urgency, and drive
Job Responsibility
  • Infrastructure Leadership: Architect and oversee large-scale Kubernetes clusters in GKE, ensuring high availability, performance tuning, and cost optimization
  • GitOps & Orchestration: Design and refine complex CI/CD lifecycles using ArgoCD, moving toward a fully declarative infrastructure-as-code model
  • Security Engineering: Implement and manage security scanning tools (e.g., Prisma Cloud, Snyk, or GKE native security) to ensure container integrity and shift-left security compliance
  • Automation & Tooling: Develop sophisticated automation scripts and internal tools using Python to eliminate manual toil and improve system observability
  • AI-Driven Development: Lean into the future of engineering by utilizing Cursor and Claude to accelerate coding, debugging, and documentation tasks
  • Incident Management: Act as a final escalation point for complex infrastructure outages, conducting blameless post-mortems to drive systemic improvements
  • Participate in on-call rotations to support critical business and production systems
Employment type: Full-time

Principal Site Reliability Engineer (Sovereign Cloud)

We are looking for a Principal Engineer to join our SDWAN engineering team. You ...
Location: Bulgaria, Sofia
Salary: Not provided
Company: Palo Alto Networks
Expiration Date: Until further notice
Requirements
  • 6+ years as a DevOps engineer with a passion for technology, strong motivation and responsibility
  • Proficiency in DevOps and Platform Engineering with expertise in AWS, GCP, Terraform, ArgoCD, Kubernetes, and related tools
  • Experience in developing and maintaining CI/CD pipelines for continuous delivery in agile environments
  • Skilled in managing cloud infrastructure, particularly with AWS and GCP, and adept in infrastructure as code practices using Terraform/Terragrunt
  • Demonstrated capability in supporting high-scale SaaS applications, focusing on scalability, reliability, and performance
  • Excellent written and verbal communication, able to collaborate and rally support
  • Self-disciplined, self-managed, self-motivated, strong sense of ownership, urgency, and drive
  • Passion for infrastructure and monitoring as code
  • Ready to understand and dissect new technology stacks quickly
Job Responsibility
  • Implement and optimize CI/CD pipelines and cloud infrastructure using our technology stack, ensuring efficient and reliable deployment to production
  • Participate in the deployment of monitoring and alerting systems to maintain high system performance and reliability
  • Collaborate with software development and other cross-functional teams to streamline and enhance processes, aiming for efficiency and alignment with business goals
  • Contribute to the management of the cloud infrastructure, utilizing Infrastructure as Code principles
  • Participate in on-call rotations to support critical business and production systems
Employment type: Full-time

Senior Site Reliability Engineer (SRE) – Cloud & Distributed Systems

We are seeking an experienced Senior Site Reliability Engineer (SRE) to design, ...
Location: United States, Austin
Salary: Not provided
Company: Dutech Systems
Expiration Date: Until further notice
Requirements
  • 8+ years of experience in SRE, DevOps, or Systems Engineering
  • Strong expertise in Linux/Unix systems and system internals
  • Proficiency in at least one programming/scripting language (Python, Go, Java, Bash)
  • Experience designing and operating distributed systems
  • Hands-on experience with cloud platforms (AWS or GCP)
  • Experience with Docker and Kubernetes
  • Strong understanding of monitoring, alerting, and logging concepts
  • Experience managing SLIs, SLOs, and error budgets
  • Experience with incident management and RCA processes
Job Responsibility
  • Design, implement, and manage highly available, distributed systems
  • Maintain and optimize cloud infrastructure (AWS/GCP)
  • Develop automation scripts using Python, Go, Java, or Bash
  • Manage containerized environments using Docker and Kubernetes
  • Define and monitor SLIs, SLOs, and error budgets
  • Implement monitoring, logging, and alerting solutions
  • Lead incident management, root cause analysis (RCA), and postmortems
  • Ensure system security and compliance within operational workflows
  • Improve system reliability through performance tuning and optimization
  • Collaborate with engineering teams to enhance deployment and release processes