This list contains only the countries for which job offers have been published in the selected language (e.g., in the French version, only job offers written in French are displayed, and in the English version, only those in English).
Site Reliability Engineer Staff. This role has been designed as 'Hybrid' with an expectation that you will work on average 2 days per week from an HPE office. HPE is the global edge-to-cloud company. As a Staff Software Engineer, you will play a key role in designing, building, and optimizing cloud infrastructure and deployment systems.
Job Responsibility:
Enhance Infrastructure as Code (IAC) and enforce best practices
Optimize cloud infrastructure for scalability, security, and cost-effectiveness
Develop internal tools to support and streamline cloud platform operations
Improve CI/CD pipelines and deployment workflows using FluxCD and Jenkins
Address container image vulnerabilities and standardize remediation processes
Build Amazon Machine Images (AMIs) aligned with CIS and STIG benchmarks
Strengthen monitoring, alerting, and observability using Prometheus, Grafana, and logging tools
Troubleshoot complex production issues to ensure system reliability and customer satisfaction
Fine-tune distributed systems such as Apache Kafka and Cassandra
Collaborate with development, security, and operations teams to align infrastructure with application needs
Requirements:
Minimum of 4 years of hands-on experience in Infra Ops, Dev Ops, or Site Reliability Engineering (SRE)
Proficiency with Linux systems, especially Debian-based distributions
Strong experience with cloud platforms such as AWS and GCP
Expertise in Infrastructure as Code tools like Terraform, Packer, and Ansible
Solid programming skills in Python and/or Golang
Deep understanding of containerization (Docker, Container) and orchestration tools (AWS EKS, GCP GKE)
Experience with GitOps workflows
Proven track record in implementing and maintaining CI/CD pipelines
Strong background in security and familiarity with security programs
Experience with monitoring and logging tools (Prometheus, Grafana, ELK)
Knowledge of both relational (SQL) and non-relational databases
Excellent problem-solving and debugging skills with a strong sense of ownership
Experience managing distributed systems like Apache Kafka and Cassandra
Effective communicator and collaborative team player
Nice to have:
Experience contributing to open-source projects
Background in security engineering or related disciplines