Site Reliability Engineer / Observability Engineer Job at Rackspace (Giza)

Site Reliability Engineering Manager

Hewlett Packard Enterprise (HPE) is looking for a Site Reliability Engineering M...

Location

India, Bangalore

Salary:

Not provided

Hewlett Packard Enterprise

Expiration Date

Until further notice

Requirements

7–10 years of experience in Site Reliability Engineering, DevOps, or Cloud Infrastructure roles
Minimum 2 years of experience managing or leading cloud operations teams
Deep understanding of cloud platforms (AWS, GCP, or Azure) and cloud-native architectures
Hands-on experience with Kubernetes, containers, infrastructure as code (e.g., Terraform), and configuration management tools
Strong foundation in observability (monitoring, logging, tracing), automation using Python, and incident response
Familiarity with modern CI/CD automation and tools
Excellent communication, stakeholder management, and team-building skills
Experience scaling SRE practices in high-growth or large-scale environments
Ability to balance long-term reliability initiatives with short-term delivery needs.

Job Responsibility

Lead and mentor a team of Site Reliability Engineers, supporting their growth, performance, and well-being
Own the reliability strategy for SASE cloud infrastructure systems, including incident management, SLIs/SLOs, and capacity planning
Partner with Engineering, Product, and Security teams to design and deliver highly available, scalable, and resilient cloud-native services
Guide the team in building automation, improving observability, and improving the operational efficiency of our cloud infrastructure
Drive adoption of best practices in monitoring, alerting, on-call operations, and runbook development
Build and maintain a strong engineering culture based on ownership, collaboration, and continuous learning
Define and track key reliability metrics, and report on team performance and system health to leadership
Contribute to hiring, onboarding, and career development for SREs.

What we offer

Health & Wellbeing benefits for physical, financial, and emotional wellbeing
Personal & Professional Development programs
Unconditional inclusion in the workplace.

Fulltime

Cloud Security Site Reliability Engineer

This role sits within the Cloud Security team responsible for Private and Public...

Location

Singapore, Singapore

Salary:

Not provided

Citi

Expiration Date

Until further notice

Requirements

Bachelor’s degree or equivalent work experience
6+ years of relevant work experience
Highly motivated self-starter with excellent interpersonal and communication skills
Certification or formal training in site reliability engineering concepts and practices
Prior experience working towards SLIs, SLOs and observability capabilities at a large scale
4+ years' experience in Python (preferred) or Java on large-scale systems, alongside Linux-based scripting languages
Experience working on observability, logging and metrics toolsets
Experience with Kubernetes (k8s) and container technologies such as Docker, OpenShift and EKS
Experience with public cloud technologies such as AWS, GCP or Azure
Experience with Secrets products such as HashiCorp Vault or CyberArk

Job Responsibility

Working across Container products and Secrets products, across Public and Private Cloud, as well as Cloud native specific products
Architecting and building tools and platforms that provide capabilities for SRE
Collaboration with multiple stakeholders and partners across Engineering and Operations as well as partner teams within the wider Citi organisation
Actively owning production level incidents till resolution.

What we offer

Equal opportunity employer
Accessibility support for persons with disabilities.

Fulltime

Site Reliability Engineer

As a highly skilled Site Reliability Engineer (SRE), you will contribute to buil...

Location

United States, New York City; San Francisco

Salary:

160000.00 - 300000.00 USD / Year

Hebbia

Expiration Date

Until further notice

Requirements

Bachelor’s degree in Computer Science, Engineering, or a related field, or equivalent practical experience
5+ years software development experience at a venture-backed startup or top technology firm
Proven experience as a Site Reliability Engineer, DevOps Engineer, or similar role
Strong expertise in managing CI/CD pipelines and deployment automation
Proficiency in cloud platforms such as AWS, Azure, or Google Cloud (we are an AWS shop)
Solid understanding of containerization and orchestration technologies such as Docker and Kubernetes
Experience with monitoring and observability tools such as Datadog, Prometheus, Grafana, or similar
Knowledge of infrastructure-as-code (IaC) tools such as Terraform or CloudFormation
Familiarity with security best practices and tools for infrastructure and application security
Excellent problem-solving skills and the ability to troubleshoot complex issues

Job Responsibility

Assist in managing deployment pipelines to facilitate smooth and efficient software releases
Help implement and maintain observability solutions for monitoring system performance and reliability
Support local development environments to optimize developer workflows
Work with development teams to ensure infrastructure aligns with project requirements
Contribute to improving the security of our infrastructure by assisting with proactive measures and audits
Assist in developing and maintaining automation scripts and tools to enhance operational efficiency
Help troubleshoot and resolve infrastructure and application issues to minimize downtime and maintain smooth operations
Participate in evaluating and integrating new technologies to enhance the scalability, reliability, and security of our infrastructure

What we offer

PTO: Unlimited
Insurance: Medical + Dental + Vision + 401K
Eats: Catered lunch daily + DoorDash dinner credit if you ever need to stay late
Parental leave policy: 3 months non-birthing parent, 4 months for birthing parent
Fertility benefits: $15k lifetime benefit
New hire equity grant: competitive equity package with unmatched upside potential

Fulltime

Staff Engineer, Site Reliability

LearnUpon is looking for a Staff Site Reliability Engineer to join our team in I...

Location

Ireland, Dublin

Salary:

Not provided

LearnUpon

Expiration Date

Until further notice

Requirements

7+ years of experience in a software or Ops role
5+ years of cloud engineering experience, with at least 2 years' experience with AWS
Experience deploying Microservice environments, using containerisation technologies such as Kubernetes and Docker
Experience in designing and implementing Observability tech stacks
Have championed the benefits of Observability to Engineering teams
Can architect the design of SLO/SLI implementation that balances the needs of different teams
Familiar with cost analysis of Observability metrics gathering, Engineering effort, and tooling
Experience building and supporting large-scale distributed systems that back a consumer app or website with associated requirements of performance, security and disaster recovery
Experience implementing IaC (e.g. CloudFormation, Terraform), automation tooling (e.g. Puppet, Ansible), and CI/CD (e.g. Jenkins, Travis CI, GitLab)
Able to effectively communicate technical ideas to and collaborate with both technical and non-technical peers

Job Responsibility

Identifying opportunities to improve and scale our infrastructure for performance, observability, maintainability, and cost, by creating innovative solutions
Leading our efforts to build an observability function that incorporates application metrics, application transaction tracking, and event log management
Driving the processes to maintain resilient, scalable and cost-effective infrastructure
Working with other Engineering teams to provide infrastructure solutions that meet their ongoing requirements
Building tools focused on measuring, monitoring and alerting, with an eye towards self-service in order to promote Engineers’ ownership of observability
Reacting quickly to changing customer and business needs
Participating in the on-call rota
Mentoring junior talent

What we offer

Work in a fun and supportive environment with regular team events
Excellent career progression
Structured learning environment
Competitive salary and company ESOP
Private health insurance
26 days annual leave

Fulltime

Cloud Security Site Reliability Engineer

This role sits within the Cloud Security team which is responsible for Private a...

Location

Singapore, Singapore

Salary:

Not provided

Citi

Expiration Date

Until further notice

Requirements

Bachelor’s degree or equivalent work experience
3+ years of relevant work experience
Highly motivated self-starter with good interpersonal and communication skills
Certification or formal training in site reliability engineering concepts and practices would be beneficial
Prior experience working towards SLIs, SLOs and observability capabilities
2+ years' experience in Python alongside Linux-based scripting languages
Experience working on observability, logging and metrics toolsets
Experience with Kubernetes (k8s) and container technologies such as Docker, OpenShift and EKS
Experience with Secrets products such as HashiCorp Vault or CyberArk (beneficial but not essential)
Experience with CI/CD tools such as Terraform, Jenkins, and Ansible.

Job Responsibility

Working across Container products and Secrets products, across Public and Private Cloud, as well as Cloud native specific products
Architecting and building tools and platforms that provide capabilities for SRE
Collaboration with multiple stakeholders and partners across Engineering and Operations as well as partner teams within the wider Citi organisation
Actively owning production level incidents till resolution.

Fulltime

Site Reliability Engineer

Corporate Tools is looking for a Site Reliability Engineer. You will be a tradit...

Location

United States

Salary:

175000.00 USD / Year

Corporate Tools

Expiration Date

Until further notice

Requirements

Bachelor's degree in Computer Science, Software Engineering, or equivalent practical experience
5+ years of experience in software engineering
2+ years of experience in site reliability engineering, DevOps, or infrastructure engineering roles
Deep experience with cloud platforms (AWS, Azure, or GCP) and infrastructure as code tools such as Terraform, CloudFormation, or Pulumi
Strong proficiency with Kubernetes, Docker, and container orchestration in production environments
Hands-on experience with observability and monitoring tools like Prometheus, Grafana, OpenTelemetry, Sentry, or New Relic
Proven ability to design and implement highly available, fault-tolerant systems and lead proactive incident response efforts
Experience with performance tuning, database optimization, and caching strategies (e.g., PostgreSQL, Redis, Memcached)
Demonstrated ability to drive reliability improvements, reduce operational toil, and foster a culture of resilience and continuous improvement
Experience leading reliability-focused initiatives such as post-incident reviews, capacity planning, and root cause analysis

Job Responsibility

Stop problems before they start
Fix issues quickly and learn from them
Help keep systems steady, secure, and running
Work closely with DevOps engineers to build out tools and automation
Take ownership

What we offer

100% employer-paid medical, dental and vision for employees
Annual review with raise option
22 days Paid Time Off accrued annually, and 4 holidays
After 3 years, PTO increases to 29 days
Employees transition to flexible time off after 5 years with the company—not accrued, not capped, take time off when you want
Paid Parental Leave
Up to 6% company matching 401(k) with no vesting period
Quarterly allowance
Open concept office with friendly coworkers
Creative environment where you can make a difference

Fulltime

Site Reliability Engineer

We are seeking a skilled Site Reliability Engineer (SRE) with experience in AWS,...

Location

Spain, Barcelona

Salary:

Not provided

Yokoy

Expiration Date

Until further notice

Requirements

Experience with AWS services such as ECS, S3, RDS, Lambda, CloudFront, etc.
Experience with monitoring tools like Datadog, CloudWatch, and Grafana
Experience with Docker, ECS, Kubernetes or similar containerisation technologies
Knowledge of languages such as Bash, Python, NodeJS
Experience with IaC tools such as Terraform, Pulumi, and so on

Job Responsibility

Design, build, and maintain scalable and reliable cloud infrastructure in AWS
Monitor and manage the performance, reliability, and security of our systems
Implement and improve monitoring tools to ensure system health and availability
Work with development teams to build and maintain scalable, resilient, and secure applications
Participate in our on-call rotation and resolve production issues
Continuously improve automation, monitoring and deployment processes

What we offer

Competitive compensation, including equity in the company
Generous vacation days so you can rest and recharge
Health perks such as private healthcare
Fitness perks such as an onsite gym & fitness app subsidy
Flexible compensation plan to help you diversify and increase the net salary
Unforgettable Perk events, including travel to one of our hubs
Spring Health - Get access to 12x therapy & 12x coaching sessions per year
Exponential growth opportunities
VolunteerPerk - We offer 16 paid hours per year that you can use to give back to society by volunteering for a charity of your choice
Work-from-anywhere allowance of 20 working days per year

Site Reliability Engineer

We’re looking for a passionate Site Reliability Engineer to pioneer our SRE stra...

Location

Spain, Barcelona

Salary:

45000.00 - 59000.00 EUR / Year

Edpuzzle

Expiration Date

Until further notice

Requirements

At least 3 years of experience in Site Reliability Engineering, DevOps Engineering, System Administration or Cloud Infrastructure Engineering for a web-based product with a focus on observability and reliability
Good knowledge of Amazon Web Services (AWS), CloudWatch and Datadog
Experience with software release management and deployment pipelines (Git, CI/CD)
Experience with Infrastructure as Code using AWS CDK
Experience writing JavaScript, TypeScript or Node.js code
Pragmatic with technologies: you understand that tech is a tool to solve a product problem, never the end goal
Excellent ability to communicate your ideas, regardless of the audience
Product-oriented: You make all your technology decisions with the final user in mind
You are naturally drawn towards understanding the bigger picture and recognize when there's a need for improvement, applying your intentional and rational thought process to address complex issues
You are able to work independently, plan and exercise conscious control of time spent on specific goals to reach deadlines effectively, and you don’t hesitate to pursue a goal despite the difficulties, all while maintaining a flexible mindset

Job Responsibility

Work with the Product, Infrastructure and Engineering teams to find the best technical solutions by participating in discussions and sharing your opinions
Take ownership of the problems that are being worked on, understanding why they are needed by the users, carrying out your own research, making your own proposals and working on the implementation while relying on your teammates for help when needed
Communicate effectively in a team in order to maximize productivity, ownership, and focus to help projects reach the finish line with the best possible outcome and by the project deadline
Design a cloud infrastructure that is secure, scalable, and highly available on AWS
Engage in proactive monitoring and observability with comprehensive tools and practices that not only detect and warn, but also predict potential system issues before they affect our users
Lead the charge in root cause analysis for production and infrastructure issues, transforming challenges into learning opportunities
Provision, configure and maintain cloud infrastructure as code
Take part in the on-call rotation, ensuring reliability and uptime for our users
Write technical documentation, contributing to our technical knowledge base and empowering your peers
Perform other exciting duties as opportunities and needs arise.

What we offer

On-call compensation
24 days’ paid holidays plus December 24th and 31st
Flexible working hours and reduced working time on Fridays to support work-life balance
€2000 annual allowance for meals with Cobee
Private health insurance policy with AXA
Access to Wellhub to support physical and emotional well-being
Flexible remuneration for childcare
Flexible remuneration for public transport
Flexible remuneration for health insurance of immediate family members (spouse and/or children)
Training and development (CodelyTV, Cloud Academy, etc.)

Fulltime
