Site Reliability Engineer Job at NTT DATA (Westlake)

Site Reliability Engineer

RED Global is currently supporting one of our international clients in their sea...

Location

Netherlands, Utrecht

Salary:

Not provided

RED Commerce - The Global SAP Solutions Provider

Expiration Date

Until further notice

Requirements

Strong experience as a Site Reliability Engineer
Experience supporting and maintaining reliable, scalable production environments
Strong troubleshooting and incident management capabilities
Experience working within complex enterprise environments
Strong communication and stakeholder management skills

Site Reliability Engineer

Qargo is a cloud-based (SaaS) Transport Management Platform. We are a scale-up b...

Location

Belgium, Ghent

Salary:

Not provided

Qargo

Expiration Date

Until further notice

Requirements

Experience as a Software Engineer, with an interest in infrastructure, scalability, and reliability
Strong programming skills (preferably Python or similar backend languages)
Experience working with cloud platforms, container orchestrators, and serverless services (preferably Google Cloud)
Familiarity with distributed systems and scalability challenges
Experience with CI/CD pipelines and automation
Solid understanding of databases and performance tuning (SQL and/or NoSQL)
Familiarity with monitoring and observability tools
A problem-solving mindset and the ability to think in systems
Strong collaboration skills and a proactive approach to improving systems

Job Responsibility

Build and maintain systems and tooling that improve the reliability, scalability, and performance of our platform
Improve the software delivery cycle, focusing on automation and developer experience
Develop internal tools and services to reduce manual operational work
Improve observability by implementing monitoring, logging, and alerting across systems
Optimize system performance, including databases such as PostgreSQL and Firestore
Collaborate with backend engineers and other engineering teams to design reliable and scalable system architectures
Troubleshoot complex production issues and implement long-term fixes
Continuously improve infrastructure (Infrastructure as Code, automation, etc.)

What we offer

A fast-growing SaaS company with a strong mission and an impact-driven team
A flexible work environment with flexible hours and hybrid working
A green office with a great atmosphere and lots of initiatives
A role with a lot of responsibility, ownership, and tangible impact
The opportunity to grow with us and shape both your career and our platform

Fulltime

Site Reliability Engineer

We are looking for a Site Reliability Engineer (SRE) to support reliable, high-p...

Location

United States, Novi

Salary:

Not provided

Robert Half

Expiration Date

Until further notice

Requirements

Bachelor’s degree in Information Technology, Computer Science, Computer Engineering, or comparable practical experience
At least 5 years of experience supporting production environments in a corporate, startup, or similarly fast-paced technical setting
Hands-on expertise with infrastructure as code, including Terraform, along with experience in cloud platforms and related services
Working knowledge of container technologies such as Docker and orchestration platforms like Kubernetes
Experience supporting live systems, participating in on-call rotations, and contributing to incident reviews and corrective actions
Proficiency with automation and scripting using Bash and Python to reduce manual operational effort
Strong communication skills with the ability to explain technical decisions and tradeoffs to cross-functional or non-technical stakeholders
Willingness and ability to travel to customer or plant locations as business needs require

Job Responsibility

Maintain dependable and secure production environments across plant-edge and cloud-based systems, with a focus on uptime, responsiveness, and operational stability
Design, refine, and support monitoring dashboards, alerting frameworks, and operational runbooks using tools such as Prometheus, Grafana, and modern telemetry solutions
Build and manage infrastructure through code using Terraform, applying version control standards, peer reviews, and controlled deployment processes
Create automation scripts and lightweight tools in Bash and Python to streamline routine operations, recovery procedures, backup workflows, and environment setup
Take part in incident response and on-call coverage, troubleshoot service disruptions, coordinate initial communication, and document follow-up actions through blameless reviews
Establish and measure service reliability indicators and objectives, helping stakeholders balance system dependability with release speed and operational risk
Support secure connectivity between factory networks and cloud resources by configuring and maintaining VPNs, routing, private networking, and access controls
Administer and optimize relational or time-series databases, including backup planning, replication, performance tuning, and long-term storage health
Contribute to CI/CD delivery practices by improving deployment pipelines, supporting controlled release strategies, and preparing rollback procedures when needed
Partner with controls, software, and data teams to enable reliable data flow from industrial systems and ensure safe deployment to edge infrastructure

What we offer

medical, vision, dental, and life and disability insurance
401(k) plan

Site Reliability Engineer

As a Site Reliability Engineer, you are passionate about experience innovation a...

Location

India, Bengaluru

Salary:

Not provided

Valtech

Expiration Date

Until further notice

Requirements

Bachelor's or Master's degree in Computer Science, Engineering, or a related technical field
2+ years in DevOps, SRE, or Support Engineering roles
Experience with incident management in high-traffic, public-facing platforms
Strong scripting skills (Python, Bash, or PowerShell)
Familiarity with CI/CD tools: GitHub Actions, Azure DevOps, GitLab, Jenkins
Experience with monitoring/APM tools: Datadog, New Relic, Dynatrace, Prometheus, Grafana
Basic knowledge of serverless services in AWS, Azure, or GCP
Proficiency with Docker and containerized environments
Excellent English communication skills (B2+ level)
Experience working in international, cross-cultural teams

Job Responsibility

Maintain and improve observability systems (monitoring, logging, alerting)
Define, adjust, and maintain Service Level Objectives (SLOs)
Participate in incident resolution and on-call rotations (max 1 week/month)
Drive proactive reliability improvements across platforms
Collaborate with teams to analyze failure scenarios and implement mitigations
Create and maintain runbooks for incident response and prevention
Eliminate non-value-adding tasks through automation and process optimization

What we offer

Flexibility, with hybrid work options (country-dependent)
Learning and development, with access to cutting-edge tools, training and industry experts

Fulltime

Site Reliability Engineer

NetApp is looking for a Senior TechOps Engineer - Cassandra to join our growing ...

Location

India, Bengaluru

Salary:

Not provided

NetApp

Expiration Date

Until further notice

Requirements

Strong experience in Apache Cassandra administration and architecture, with a desire to continuously learn and develop to an expert level
Experience in diagnosing and recommending mitigation strategies for Cassandra-related issues, including performance degradation due to resource bottlenecks, suboptimal data modeling leading to hot partitions, excessive tombstones, and inefficiencies caused by range slices and poorly constructed queries
Hands-on experience with Cassandra architecture and core administrative tasks, including compactions, repairs, backup and recovery, schema disagreement resolution, and configuration management
Experience handling Cassandra maintenance activities, including upgrades and migrations
Ability to investigate and research Cassandra issues by reviewing the Apache Cassandra codebase
Strong knowledge and experience with Linux, with the ability to work comfortably from the command line
Exceptional ability to communicate clearly and professionally in written and verbal English
Experience working with at least one public cloud platform, preferably AWS
Prior IT customer service or support experience within an ITIL-based environment
Strong fundamental computer science and software engineering skills, particularly in operating system internals, memory management, and networking

Job Responsibility

Your work will ensure the security, reliability, and performance of world-class systems and databases
You will collaborate with the technical teams of our customers, who are globally recognized companies in the gaming, banking, and logistics industries, ranging from large multinationals to emerging start-ups

What we offer

Volunteer time off
Well-being
Time away

Fulltime

Site Reliability Engineer

As Site Reliability Engineer you will contribute to the overarching implementati...

Location

Romania, Bucuresti

Salary:

Not provided

NTT DATA

Expiration Date

Until further notice

Requirements

Bachelor's degree in Computer Science, Engineering, or related field
Minimum 5 years of proven work experience as a Reliability Engineer or in a similar role
Expert knowledge and hands-on experience with applications hosted on cloud platforms such as Google Cloud Platform as well as with Docker / Kubernetes in combination with Google Kubernetes Engine (GKE), Terraform or similar technology
Experience in resilient software development in Python/Java and the use of modern CI/CD pipelines, e.g. GitHub, GitHub Actions, Bitbucket, Helm
Strong experience in the setup of observability, monitoring and self-healing solutions for instance with New Relic, Splunk, Google Cloud Operations, Lightstep and Ansible
Very good knowledge of security standards (e.g. TLS, OAuth2, KMS, Vault, admission controllers, Let's Encrypt), microservice architectures, and experience with API management with Apigee or WSO2
Proactive attitude and collaborative team-player mindset paired with self-confidence
Ability to keep your cool and your eye for detail even in stressful, time-critical situations
A creative approach to solving technical problems
Excellent communication skills in English

Job Responsibility

Define Service Level Objectives (SLOs), and enable an end-to-end view on customer satisfaction based on best practices for setting up Service Level Indicators (SLIs) to create effective strategies for maintaining and improving system performance and availability
Collaborate with Business Functional Analysts and Solution Architects to find improvements in the solution design to improve the resilience of technical solutions early on
Consult and guide the squad on the prioritization of reliability improvements and actively deliver them as part of the sprint
Hands-on experience in implementing reliability and resilience patterns such as auto-scaling, circuit breakers, bulkheads, rate limiters, and retry mechanisms
Actively work on service request fulfilment and on incident and problem management to identify and reduce toil and MTTR with engineering best practices
Align with the SRE chapter function on, and contribute to, state-of-the-art SRE best practices such as distributed tracing, OpenTelemetry, and chaos engineering
Be a knowledge and skill multiplier for your profession by being a lead of the Site Reliability Engineer population
Increase the seniority of the overall Site Reliability Engineer chapter by establishing events and procedures, and foster a culture of high standards
Lead the people in your engineering profession and help them become better each day

What we offer

Smooth integration and a supportive mentor
Pick your working style: choose from Remote, Hybrid or Office work opportunities
Our projects have different working hours to suit your needs
Sponsored certifications, trainings and top e-learning platforms
Private Health Insurance – custom-made for you
Individual coaching sessions or accredited Coaching School
Epic parties or themed events – lovingly designed for our people and their families

Fulltime

Site Reliability Engineer

Build the tools and systems that make M365 sovereign cloud operations faster, sm...

Location

United States, Multiple Locations

Salary:

102,100.00 - 219,200.00 USD / Year

Microsoft Corporation

Expiration Date

Until further notice

Requirements

Bachelor's Degree in Computer Science or related technical field AND 2+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
Passionate about distributed systems and working with highly scalable services
Enjoys new technological challenges and is motivated to solve them
Excited about making better software and continuously improving the development, integration, and deployment processes
Self-starter who thrives in a bottom-up, fast-paced, highly technical environment
Effective collaborator, experienced in creating technical partnerships across teams
Committed to ensuring exceptional customer satisfaction through technical excellence
Candidates must be able to meet the Microsoft, customer, and/or government security screening requirements for this role
The successful candidate must have an active U.S. Government Top Secret Clearance with access to Sensitive Compartmented Information (SCI) based on a Single Scope Background Investigation (SSBI)

Job Responsibility

Creates and implements code for a product, service, or feature, reusing code as applicable with minimal supervision
Acts as a designated responsible individual (DRI), working on-call to monitor a system/product feature/service for degradation, downtime, or interruptions
Maintains live site service operations on a rotational, on-call basis, following security best practices and using the minimum required permissions when responding quickly to mitigate issues that arise
Contributes to identifying dependencies, and incorporates them into the development of design documents for a product area with little oversight
Contributes to the identification of requirements for, and development of automation within production and deployment of a complex product feature, targeting zero-touch deployment when possible
Works with appropriate internal stakeholders to understand and determine customer/user requirements for a set of features
Remains current in skills by investing time and effort into being informed of current developments that will improve the availability, reliability, efficiency, observability, and performance of products while also driving consistency in monitoring and operations at scale

What we offer

Certain roles may be eligible for benefits and other compensation

Fulltime

Site Reliability Engineer

We are looking for a Site Reliability Engineer to support the stability, perform...

Location

United States, New York

Salary:

Not provided

Robert Half

Expiration Date

Until further notice

Requirements

Bachelor’s degree in Computer Science, Information Technology, Engineering, or a related discipline, or equivalent practical experience in infrastructure or operations
Working knowledge of Linux and/or Windows server administration fundamentals
Understanding of core networking principles such as TCP/IP, DNS, VLANs, routing, and firewall concepts
Experience with at least one scripting or automation language such as Python, Bash, or PowerShell
Familiarity with cloud infrastructure concepts in at least one major platform, such as Azure or AWS
Exposure to automation and configuration tools such as Terraform or Ansible
Strong analytical thinking, troubleshooting ability, and a willingness to learn in a fast-moving technical environment
Clear written and verbal communication skills with the ability to document operational procedures effectively

Job Responsibility

Oversee the health of production platforms through monitoring tools, assist with incident response, and help refine alerts, dashboards, and issue tracking processes
Support day-to-day operations for infrastructure spanning on-premises facilities and cloud environments, including servers, storage, network components, and middleware services
Contribute to the administration of multi-cloud resources across platforms such as Azure and Amazon EC2, with involvement in compute, networking, storage, and identity-related tasks
Build and enhance automation solutions using Infrastructure as Code practices to streamline repeatable work and improve platform consistency
Participate in DevSecOps and GitOps processes by assisting with CI/CD workflows, configuration management, and policy adherence
Help strengthen cloud security by identifying configuration gaps, assisting with remediation efforts, and supporting vulnerability reduction initiatives
Join the on-call rotation, respond to operational events, and contribute to post-incident reviews focused on continuous improvement
Create and maintain runbooks, technical procedures, and system documentation to improve operational readiness and knowledge sharing
Assist with containerized and orchestrated environments, including platforms that use Kubernetes, to support scalable application operations

What we offer

medical
vision
dental
life and disability insurance
company 401(k) plan

Fulltime
