Senior Site Reliability Engineer (SRE)
This role has been designed as ‘Onsite’, with an expectation that you will primarily work from an HPE office. Aruba is an HPE company and a leading provider of next-generation network access solutions for the mobile enterprise. Helping some of the largest companies in the world modernize their networks to meet the demands of a digital future, Aruba is redefining the “Intelligent Edge” and creating new customer experiences across intelligent spaces and digital workspaces. Join us and redefine what’s next for you.
Job Responsibilities:
Ensure high availability, reliability, and performance of large-scale cloud infrastructure across AWS and GCP environments
Operate and support infrastructure components and distributed data platforms such as Kubernetes, Kafka, Flink, Storm, and Spark
Manage and maintain databases including Cassandra, Elasticsearch, Redis, Postgres, and ArangoDB
Monitor systems, troubleshoot issues, and resolve production incidents across microservices and distributed systems
Collaborate closely with software engineering teams to debug and resolve complex production problems
Participate in 24x7 on-call rotation supporting multi-cloud production environments
Monitor system metrics, application performance, and infrastructure health using observability tools
Own the incident management lifecycle, including detection, mitigation, Root Cause Analysis (RCA), and post-incident reviews
Develop and maintain runbooks, automation, and operational processes to improve reliability and efficiency
Perform capacity planning using system usage and performance data
Drive SRE best practices, operational standards, and continuous improvement initiatives
Requirements:
Bachelor’s or Master’s degree in Computer Science, Information Systems, or a related field
6–10+ years of experience in DevOps, Site Reliability Engineering, or cloud infrastructure roles
Strong hands-on experience with cloud platforms (AWS or GCP) including services like EC2/GCE, IAM, and object storage (S3/GCS)
Experience with containerization and orchestration technologies, especially Docker and Kubernetes
Experience building and managing CI/CD pipelines using tools such as Jenkins, GitHub Actions, or GitLab
Experience with monitoring and observability tools such as Prometheus, CloudWatch, or Stackdriver
Strong understanding of Linux systems administration and configuration management tools like Ansible
Experience managing distributed systems and streaming platforms such as Kafka, Cassandra, Elasticsearch, Spark, Flink, or Storm
Strong automation and scripting skills in Python, Go, Rust, or shell
Experience with Infrastructure as Code (IaC) tools like Terraform or CloudFormation
Excellent analytical, troubleshooting, and problem-solving skills
Strong communication and collaboration skills with the ability to work with cross-functional teams