Site Reliability Engineer Job at Citi (Pune)

Site Reliability Engineer

As a Site Reliability Engineer, you are passionate about experience innovation a...

Location

India, Bengaluru

Salary:

Not provided

Valtech

Expiration Date

Until further notice

Requirements

Bachelor's or Master's degree in Computer Science, Engineering, or a related technical field
2+ years in DevOps, SRE, or Support Engineering roles
Experience with incident management in high-traffic, public-facing platforms
Strong scripting skills (Python, Bash, or PowerShell)
Familiarity with CI/CD tools: GitHub Actions, Azure DevOps, GitLab, Jenkins
Experience with monitoring/APM tools: Datadog, New Relic, Dynatrace, Prometheus, Grafana
Basic knowledge of serverless services in AWS, Azure, or GCP
Proficiency with Docker and containerized environments
Excellent English communication skills (B2+ level)
Experience working in international, cross-cultural teams

Job Responsibility

Maintain and improve observability systems (monitoring, logging, alerting)
Define, adjust, and maintain Service Level Objectives (SLOs)
Participate in incident resolution and on-call rotations (max 1 week/month)
Drive proactive reliability improvements across platforms
Collaborate with teams to analyze failure scenarios and implement mitigations
Create and maintain runbooks for incident response and prevention
Eliminate non-value-adding tasks through automation and process optimization

What we offer

Flexibility, with hybrid work options (country-dependent)
Learning and development, with access to cutting-edge tools, training and industry experts

Fulltime

Site Reliability Engineer

NetApp is looking for a Senior TechOps Engineer - Cassandra to join our growing ...

Location

India, Bengaluru

Salary:

Not provided

NetApp

Expiration Date

Until further notice

Requirements

Strong experience in Apache Cassandra administration and architecture, with a desire to continuously learn and develop to an expert level
Experience in diagnosing and recommending mitigation strategies for Cassandra-related issues, including performance degradation due to resource bottlenecks, suboptimal data modeling leading to hot partitions, excessive tombstones, and inefficiencies caused by range slices and poorly constructed queries
Hands-on experience with Cassandra architecture and core administrative tasks, including compactions, repairs, backup and recovery, schema disagreement resolution, and configuration management
Experience handling Cassandra maintenance activities, including upgrades and migrations
Ability to investigate and research Cassandra issues by reviewing the Apache Cassandra codebase
Strong knowledge and experience with Linux, with the ability to work comfortably from the command line
Exceptional ability to communicate clearly and professionally in written and verbal English
Experience working with at least one public cloud platform, preferably AWS
Prior IT customer service or support experience within an ITIL-based environment
Strong fundamental computer science and software engineering skills, particularly in operating system internals, memory management, and networking

Job Responsibility

Your work will ensure the security, reliability, and performance of world-class systems and databases
You will collaborate with the technical teams of our customers, who are globally recognized companies in the gaming, banking, and logistics industries, ranging from large multinationals to emerging start-ups

What we offer

Volunteer time off
Well-being
Time away

Fulltime

Site Reliability Engineer

As Site Reliability Engineer you will contribute to the overarching implementati...

Location

Romania, Bucuresti

Salary:

Not provided

NTT DATA

Expiration Date

Until further notice

Requirements

Bachelor's degree in Computer Science, Engineering, or related field
Minimum 5 years of proven work experience as a Reliability Engineer or in a similar role
Expert knowledge and hands-on experience with applications hosted on cloud platforms such as Google Cloud Platform as well as with Docker / Kubernetes in combination with Google Kubernetes Engine (GKE), Terraform or similar technology
Experience in resilient software development in Python/Java and the use of modern CI/CD pipelines, e.g., GitHub, GitHub Actions, Bitbucket, Helm
Strong experience in setting up observability, monitoring and self-healing solutions, for instance with New Relic, Splunk, Google Cloud Operations, Lightstep and Ansible
Very good knowledge of security standards (e.g., TLS, OAuth2, KMS, Vault, Admission Controllers, Let's Encrypt), microservice architectures and experience with API Management with Apigee or WSO2
Proactive attitude and collaborative team-player mindset paired with self-confidence
Ability to stay calm and keep an eye for detail even in stressful, time-critical situations
A creative approach to solving technical problems
Excellent communication skills in English

Job Responsibility

Define Service Level Objectives (SLOs) and enable an end-to-end view of customer satisfaction, based on best practices for setting up Service Level Indicators (SLIs), to create effective strategies for maintaining and improving system performance and availability
Collaborate with Business Functional Analysts and Solution Architects to identify solution design improvements that increase the resilience of technical solutions early on
Consult and guide the squad on the prioritization of reliability improvements and actively deliver them as part of the sprint
Hands-on experience in implementing reliability and resilience patterns such as auto-scaling, circuit breakers, bulkheads, rate limiters, and retry mechanisms
Actively work on service request fulfilment, incident management and problem management to identify and reduce toil and MTTR with engineering best practices
Align on and contribute to state-of-the-art SRE best practices (e.g., distributed tracing, OpenTelemetry and chaos engineering) with the SRE chapter function
Act as a knowledge and skill multiplier for your profession by leading the Site Reliability Engineer population
Increase the seniority of the overall Site Reliability Engineer chapter by establishing events and procedures, and foster a culture of high standards
Lead the engineers in your profession and help them improve every day

What we offer

Smooth integration and a supportive mentor
Pick your working style: choose from Remote, Hybrid or Office work opportunities
Our projects have different working hours to suit your needs
Sponsored certifications, training and top e-learning platforms
Private Health Insurance – custom-made for you
Individual coaching sessions or accredited Coaching School
Epic parties or themed events – lovingly designed for our people and their families

Fulltime

Site Reliability Engineer

Build the tools and systems that make M365 sovereign cloud operations faster, sm...

Location

United States, Multiple Locations

Salary:

102100.00 - 219200.00 USD / Year

Microsoft Corporation

Expiration Date

Until further notice

Requirements

Bachelor's Degree in Computer Science or related technical field AND 2+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
Passionate about distributed systems and working with highly scalable services
Enjoys new technological challenges and is motivated to solve them
Excited about making better software and continuously improving the development, integration, and deployment processes
Self-starter who thrives in a bottom-up, fast-paced, highly technical environment
Effective collaborator, experienced in creating technical partnerships across teams
Committed to ensuring exceptional customer satisfaction through technical excellence
Candidates must be able to meet the Microsoft, customer, and/or government security screening requirements for this role
The successful candidate must have an active U.S. Government Top Secret Clearance with access to Sensitive Compartmented Information (SCI) based on a Single Scope Background Investigation (SSBI)

Job Responsibility

Creates and implements code for a product, service, or feature, reusing code as applicable with minimal supervision
Acts as a designated responsible individual (DRI), working on-call to monitor a system/product feature/service for degradation, downtime, or interruptions
Maintains operations of the live site service on a rotational, on-call basis, responding quickly to mitigate issues that arise while following security best practices and using the minimum required permissions
Contributes to identifying dependencies, and incorporates them into the development of design documents for a product area with little oversight
Contributes to the identification of requirements for, and development of automation within production and deployment of a complex product feature, targeting zero-touch deployment when possible
Works with appropriate internal stakeholders to understand and determine customer/user requirements for a set of features
Remains current in skills by investing time and effort into being informed of current developments that will improve the availability, reliability, efficiency, observability, and performance of products while also driving consistency in monitoring and operations at scale

What we offer

Certain roles may be eligible for benefits and other compensation

Fulltime

Site Reliability Engineer

We are looking for a Site Reliability Engineer to support the stability, perform...

Location

United States, New York

Salary:

Not provided

Robert Half

Expiration Date

Until further notice

Requirements

Bachelor’s degree in Computer Science, Information Technology, Engineering, or a related discipline, or equivalent practical experience in infrastructure or operations
Working knowledge of Linux and/or Windows server administration fundamentals
Understanding of core networking principles such as TCP/IP, DNS, VLANs, routing, and firewall concepts
Experience with at least one scripting or automation language such as Python, Bash, or PowerShell
Familiarity with cloud infrastructure concepts in at least one major platform, such as Azure or AWS
Exposure to automation and configuration tools such as Terraform or Ansible
Strong analytical thinking, troubleshooting ability, and a willingness to learn in a fast-moving technical environment
Clear written and verbal communication skills with the ability to document operational procedures effectively

Job Responsibility

Oversee the health of production platforms through monitoring tools, assist with incident response, and help refine alerts, dashboards, and issue tracking processes
Support day-to-day operations for infrastructure spanning on-premises facilities and cloud environments, including servers, storage, network components, and middleware services
Contribute to the administration of multi-cloud resources across platforms such as Azure and Amazon EC2, with involvement in compute, networking, storage, and identity-related tasks
Build and enhance automation solutions using Infrastructure as Code practices to streamline repeatable work and improve platform consistency
Participate in DevSecOps and GitOps processes by assisting with CI/CD workflows, configuration management, and policy adherence
Help strengthen cloud security by identifying configuration gaps, assisting with remediation efforts, and supporting vulnerability reduction initiatives
Join the on-call rotation, respond to operational events, and contribute to post-incident reviews focused on continuous improvement
Create and maintain runbooks, technical procedures, and system documentation to improve operational readiness and knowledge sharing
Assist with containerized and orchestrated environments, including platforms that use Kubernetes, to support scalable application operations

What we offer

medical
vision
dental
life and disability insurance
company 401(k) plan

Fulltime

Site Reliability Engineer

We are looking for a Site Reliability Engineer (SRE) to support reliable, high-p...

Location

United States, Novi

Salary:

Not provided

Robert Half

Expiration Date

Until further notice

Requirements

Bachelor’s degree in Information Technology, Computer Science, Computer Engineering, or comparable practical experience
At least 5 years of experience supporting production environments in a corporate, startup, or similarly fast-paced technical setting
Hands-on expertise with infrastructure as code, including Terraform, along with experience in cloud platforms and related services
Working knowledge of container technologies such as Docker and orchestration platforms like Kubernetes
Experience supporting live systems, participating in on-call rotations, and contributing to incident reviews and corrective actions
Proficiency with automation and scripting using Bash and Python to reduce manual operational effort
Strong communication skills with the ability to explain technical decisions and tradeoffs to cross-functional or non-technical stakeholders
Willingness and ability to travel to customer or plant locations as business needs require

Job Responsibility

Maintain dependable and secure production environments across plant-edge and cloud-based systems, with a focus on uptime, responsiveness, and operational stability
Design, refine, and support monitoring dashboards, alerting frameworks, and operational runbooks using tools such as Prometheus, Grafana, and modern telemetry solutions
Build and manage infrastructure through code using Terraform, applying version control standards, peer reviews, and controlled deployment processes
Create automation scripts and lightweight tools in Bash and Python to streamline routine operations, recovery procedures, backup workflows, and environment setup
Take part in incident response and on-call coverage, troubleshoot service disruptions, coordinate initial communication, and document follow-up actions through blameless reviews
Establish and measure service reliability indicators and objectives, helping stakeholders balance system dependability with release speed and operational risk
Support secure connectivity between factory networks and cloud resources by configuring and maintaining VPNs, routing, private networking, and access controls
Administer and optimize relational or time-series databases, including backup planning, replication, performance tuning, and long-term storage health
Contribute to CI/CD delivery practices by improving deployment pipelines, supporting controlled release strategies, and preparing rollback procedures when needed
Partner with controls, software, and data teams to enable reliable data flow from industrial systems and ensure safe deployment to edge infrastructure

What we offer

medical, vision, dental, and life and disability insurance
401(k) plan

Site Reliability Engineer

Barclays is seeking a Site Reliability Engineer to join its Securitized Products...

Location

United States, Whippany

Salary:

170000.00 - 230000.00 USD / Year

Barclays

Expiration Date

Until further notice

Requirements

Programming or scripting experience (Python, Go, PowerShell, Bash, or similar) and SQL
Linux/Unix/Windows systems and systems engineering fundamentals
Knowledge of client-server architecture and scalability, including monitoring high traffic and distributing load across multiple backend servers
Performance monitoring and reducing latency in request-response cycles
Containers and orchestration (Docker, Kubernetes)
Networking (TCP/IP, DNS, HTTP, SFTP) and relational databases
Monitoring and observability tools (Geneos ITRS, Prometheus, Grafana, APM, Observe)

Job Responsibility

Development and delivery of high-quality software solutions by using industry aligned programming languages, frameworks, and tools
Cross-functional collaboration with product managers, designers, and other engineers to define software requirements, devise solution strategies, and ensure seamless integration and alignment with business objectives
Collaboration with peers, participation in code reviews, and promotion of a culture of code quality and knowledge sharing
Stay informed of industry technology trends and innovations and actively contribute to the organization’s technology communities to foster a culture of technical excellence and growth
Adherence to secure coding practices to mitigate vulnerabilities, protect sensitive data, and ensure secure software solutions
Implementation of effective unit testing practices to ensure proper code design, readability, and reliability

What we offer

medical, dental and vision coverage
401(k)
life insurance
other paid leave
incentive award
Competitive holiday allowance
Life assurance
Private medical care
Pension contribution

Fulltime

Site Reliability Engineer

Take full ownership of smartclip’s internal utility and platform tooling. Focus ...

Location

Germany, Berlin

Salary:

Not provided

Smartclip Europe GmbH

Expiration Date

Until further notice

Requirements

Apply an Observability Mindset: Implement a clear strategy for metrics, logs, and traces. Transform 'noisy alerts' into 'actionable insights.'
Embrace Ownership: Live the 'you build it, you run it' philosophy. Stop the ticket ping-pong and end the excuses.

Job Responsibility

Take full ownership of smartclip’s internal utility and platform tooling.
Focus your energy on the intersection of observability, automation, and developer infrastructure.
Evolve existing systems, research cutting-edge open-source alternatives, and implement them.
Invest in deep in-house expertise.
Understand our systems end-to-end, maintain total flexibility, and contribute back to the open-source ecosystem.
Build & Evolve: Operate and advance our observability stack (including Prometheus, Grafana, and Forgejo).
Go Open Source First: Replace 'buy' decisions with robust 'build & maintain' strategies.
Engineer the Platform: Design observability as a platform capability. Define SLOs and create actionable alerting to stop incidents before they start.
Secure the Stack: Embed security engineering into the delivery process. Find vulnerabilities before the pen tests do.
Master the Infrastructure: Navigate Linux systems and distributed tooling. Balance bold exploration with production stability.

What we offer

30 days of vacation + Dec 24 & 31 off
Smart Fridays (4-day week possible)
mobility (Germany ticket & JobRad)
sports & health offerings
mental health support
corporate benefits
RTL+ access

Fulltime
