Site Reliability Engineer

Site Reliability Engineer - Data Platform Operation

Join our Data & AI Platform team as a Site Reliability Engineer (SRE) – Platform...

Location

Brazil, Sao Paulo

Salary:

Not provided

Amaris Consulting

Expiration Date

Until further notice

Requirements

Academic background: Bachelor’s or Master’s degree in Computer Science, Information Technology, or related field (minimum 3 years of experience)
Experience: 5+ years of hands-on experience with cloud platforms (Azure, AWS, GCP), programming and scripting (Bash, PowerShell, Terraform, Python, Java), and Infrastructure as Code (IaC)
English language: Professional working proficiency in English and the local language
Tools / software: Deep expertise in Azure, Databricks, Unity Catalog, Kubernetes, Helm, Docker, Power BI, Datadog, Grafana, GitHub, Azure DevOps, ArgoCD, Airflow, SSIS, Power Query, and relational/NoSQL databases
AI experience: Experience supporting enterprise Data & AI platforms
Soft skills: Analytical problem-solving, effective communication and active listening, team player with respect for others
Strong troubleshooting and platform monitoring skills
Automation (Python, PowerShell, CLI, KQL, Terraform)

Job Responsibility

Support, manage, and maintain Azure resources: Azure SQL, Synapse, Data Factory, Databricks, Unity Catalog
Monitor Azure workloads and troubleshoot incidents, alerts, and performance bottlenecks
Implement and manage RBAC, identity & access policies, and compliance controls
Optimize Azure cost and performance using Azure Monitor, Datadog, and Cost Management tools
Automate tasks using PowerShell, Azure CLI, Terraform, and Python
Utilize Git, GitHub Actions, and Airflow for workflow automation
Provide L2/L3 support for data pipelines, reporting, and cloud services
Conduct incident response, root cause analysis (RCA), and proactive issue resolution
Collaborate with Cloud Engineering, Data Engineers, BI Developers, and Cloud Architects
Follow ITSM processes: Incident, Change, and Problem Management

What we offer

An international community bringing together 110+ different nationalities
An environment where trust has a central place: 70% of our key leaders started their careers at the first level of responsibility
A robust training system with our internal Academy and 250+ available modules
A vibrant workplace that frequently gathers for internal events (afterworks, team buildings, etc.)
Strong commitments to CSR, notably through participation in our WeCare Together program

Staff Engineer, Site Reliability Engineer

OnStar is a cornerstone of General Motors' connected services—bringing safety, s...

Location

Ireland, Dublin

Salary:

Not provided

General Motors

Expiration Date

Until further notice

Requirements

8+ years in SRE, DevOps, or systems engineering, including experience managing or mentoring high-impact teams
Track record of building and maintaining high-scale, cloud-native systems (preferably AWS, GCP, or Azure)
Expertise in container orchestration and deployment strategies using Kubernetes and CI/CD pipelines
Proficiency in Python, Go, or Java, with strong code review and readability standards
Experience leading cross-functional infrastructure projects, configuration strategy, or organizational tooling initiatives
Ability to think and act under pressure
Strong communication skills

Job Responsibility

Lead the design and implementation of scalable, fault-tolerant, and observable infrastructure supporting OnStar mobile and web experiences, in-vehicle services, and the backend platforms and integrations that power them
Champion configuration management, infrastructure refactoring, and testing frameworks to strengthen system resilience
Partner across SRE, development, and product teams to improve service reliability, deployment safety, and incident response practices
Drive internal consultation and strategic planning on reliability standards for new OnStar capabilities, customer-facing releases, and platform initiatives
Define and evolve observability strategy using tools such as Prometheus, Grafana, and Datadog, with automated alerting and actionable SLO dashboards
Own and improve on-call practices, manage blameless postmortems, and guide root cause analysis to eliminate recurring failures
Mentor engineers and help shape a high-performance culture rooted in extreme ownership and operational excellence
Support compliance and privacy-driven engineering initiatives across connected services, with potential crossover into areas like data retention and safety certification tooling

Fulltime

Site Reliability Engineer Platform Engineer

Join a mission-driven, national financial services organization at the heart of ...

Location

United States, Reston

Salary:

Not provided

Tier4 Group

Expiration Date

Until further notice

Requirements

5+ years of hands-on experience operating and managing Kubernetes and OpenShift clusters
Strong experience with Microsoft Azure (compute, networking, storage, and data services)
Proven skills in automation and Infrastructure-as-Code (Terraform, Ansible, GitOps)
Proficiency with observability tooling (Datadog, Prometheus, Grafana)
Scripting/coding ability in Bash, Python, or Go

Job Responsibility

Operate, tune, and optimize OpenShift/Kubernetes clusters (scheduling, ingress, upgrades, quotas, policies)
Stand up and/or refine observability (Datadog, Prometheus, Grafana): dashboards, alerts, SLOs, runbooks
Map current hybrid topology and critical delivery pipelines; identify toil and prioritize automation (Terraform/Ansible)
Begin supporting Azure environments (compute, networking, storage, data services) used by analytics teams
Drive GitOps-first workflows; harden CI/CD with ArgoCD/Jenkins/GitHub Actions and policy-as-code guardrails
Implement or enhance platform services (Vault, Kafka/AMQ, ingress, service mesh) for dev and data teams
Lead incident response and postmortems; institutionalize RCA, blameless learning, and continuous improvement

Fulltime

Intermediate Site Reliability Engineer SRE – AI Reliability & Automation

At PointClickCare our mission is simple: to help providers deliver exceptional c...

Location

Canada, Mississauga

Salary:

115000.00 - 128000.00 CAD / Year

PointClickCare

Expiration Date

Until further notice

Requirements

5+ years' experience in software engineering
Experience with SRE principles
Experience with AI/ML in production environments
A passion for automation, intelligent systems, and operational excellence
Strong debugging, problem-solving, and system design skills
Languages: Python, Java, Bash, Terraform
Platforms: Azure, Kubernetes, Docker
Tools: Datadog, Prometheus, AppDynamics, ELK, GitHub Actions
ML/AI: MCP framework, AI agents, Vector store, Agent orchestration (LangChain), RAG
CI/CD: Jenkins, ArgoCD, Spinnaker

Job Responsibility

Build ML-based anomaly detection and pattern recognition systems
Enhance telemetry with smart tagging and metadata for better AI insights
Develop event-driven workflows and self-healing systems using AI triggers
Automate incident response with generative AI and custom AI agent orchestration
Use time-series forecasting and predictive modelling to anticipate failures
Optimise infrastructure with AI-powered autoscaling and cost-aware resource allocation
Build scalable, fault-tolerant systems in a cloud-native environment
Participate in on-call rotations and lead incident response for critical systems
Integrate APIs for streamlined data exchange and system connectivity
Run internal AIOps workshops and help teams adopt AI maturity models

What we offer

Benefits starting from Day 1!
Retirement Plan Matching
Flexible Paid Time Off
Wellness Support Programs and Resources
Parental & Caregiver Leaves
Fertility & Adoption Support
Continuous Development Support Program
Employee Assistance Program
Allyship and Inclusion Communities
Employee Recognition … and more!

Fulltime

Lead Site Reliability Engineer

We're building a Site Reliability Engineering center in Mexico City, and we're h...

Location

Mexico, Mexico City

Salary:

Not provided

Capital One

Expiration Date

Until further notice

Requirements

Professional English fluency
Bachelor's degree
At least 6 years of experience in SRE, production operations, or reliability engineering
Experience in DevOps Engineering (internship experience does not apply)
5+ years of experience in at least one of the following: Java, Python, Go
At least 4 years of experience with Cloud Native technologies (Amazon Web Services, Microsoft Azure, Google Cloud Platform)
3+ years of experience with container orchestration services including Docker or Kubernetes
Experience with Shell or Bash scripting
At least 3 years of Unix or Linux system administration experience

Job Responsibility

Own reliability for batch settlement systems - ensure cycle completion windows are met, data integrity is maintained, and failures are detected before they reach downstream consumers
Build and improve observability for settlement pipelines - dashboards, alerts, and anomaly detection that make system health legible and reduce reliance on tribal knowledge
Drive automation of operational toil - certificate rotation, environment provisioning, compliance artifact generation, and manual validation steps that currently require human intervention
Partner with UK-based settlement engineers - acquire domain expertise on Durbin compliance windows, cross-border DCI routing, and acquirer/issuer SLA adherence
Participate in incident management - respond to settlement failures, drive root cause analysis, and implement durable fixes that prevent recurrence
Contribute to regulatory readiness - ensure SRE practices produce audit-ready artifacts for SOX and PCI-DSS exams without manual toil

What we offer

Healthy Body, Healthy Mind
Save Money, Make Money
Time, Family and Advice

Fulltime

Staff Site Reliability Engineer - Cloud

Elevate Global Operations as our Next Cloud Site Reliability Engineer (OpenTelem...

Location

United Kingdom

Salary:

Not provided

Trimble Inc.

Expiration Date

Until further notice

Requirements

Hands-on experience with the OpenTelemetry Collector, APIs, and SDKs
Extensive experience with observability tools like New Relic, Datadog, or Splunk
Strong proficiency in Infrastructure as Code (Terraform, Ansible) and cloud platforms (AWS, GCP, or Azure)
Deep understanding of containerization and orchestration using Docker and Kubernetes
Advanced coding skills in Python, Go, or Java for building robust automation and monitoring tools
Experience leveraging AI coding assistants like GitHub Copilot to accelerate development

Job Responsibility

Lead a global "OTel First" strategy, implementing OpenTelemetry at scale across a diverse technological landscape
Spearhead the development of automation scripts and Infrastructure as Code using Terraform to ensure seamless, reproducible platform delivery
Optimize platform performance and cost-efficiency, ensuring our observability tools scale economically as our data grows
Collaborate with engineering teams to embed reliability and security standards into new features from the ground up
Drive root cause analysis and problem management to proactively prevent incidents and improve the customer experience

Site Reliability Engineer

We are currently seeking a Site Reliability Engineer to join our team in Guadala...

Location

Mexico, Guadalajara

Salary:

Not provided

NTT DATA

Expiration Date

Until further notice

Requirements

Understand the Microsoft Azure cloud - ideally Azure Fundamentals certified or holding a Computer Science/Information Systems Management degree
Familiar with PaaS and IaaS - VMs, Storage, Event Hubs, Service Fabric Cluster (SFC), Azure Kubernetes Service (AKS), Cosmos DB, SQL Server, IoT Hub, Databricks, Key Vault, Data Lake
Understand the concept of Internet of Things (IoT) - telemetry, ingestion, processing, data storage, reporting
Understand the concepts behind tools such as Octopus, Bamboo, Terraform, Azure DevOps, Jenkins, GitHub, Ansible
Understand the concept of container orchestration platforms (e.g. Kubernetes)
Understand scripting concepts: PowerShell, Python
Understand the difference between NoSQL and SQL databases, and how to maintain them

Job Responsibility

Perform L1.5 activities such as monitoring, deployment, rollback
Monitor the efficiency of the Azure cloud systems to prevent outages and initiate an Incident Management bridge in case of an outage
Troubleshoot Azure resources, escalate to Level 3 (Software Development Team)

Fulltime

Senior Site Reliability Engineer, Wikimedia Enterprise

The Wikimedia Foundation is looking for a Senior Site Reliability Engineer to jo...

Location

United States

Salary:

116633.00 - 181243.00 USD / Year

Wikimedia Foundation

Expiration Date

Until further notice

Requirements

Automation & Configuration Management: Experience with Infrastructure as Code and automation tools (e.g., Terraform, Ansible) and proficiency in at least one programming language (e.g., Python, Go, or similar)
Cloud Infrastructure: Experience designing, operating, and optimizing cloud-based systems across platforms such as AWS, Azure, or GCP, including scalability, reliability, and cost efficiency
CI/CD & Deployment Practices: Experience building and maintaining CI/CD pipelines and GitOps workflows (e.g., GitLab or similar, ArgoCD), with familiarity in progressive delivery approaches such as canary and blue-green deployments
Incident Management & Reliability Operations: Experience with incident response, on-call practices, and leading postmortems, with a focus on continuous improvement and operational excellence
SRE Principles & Observability: Strong understanding of SRE best practices, including SLOs, SLIs, and error budgets, along with experience in observability (metrics, logging, and distributed tracing, e.g., Prometheus, OpenTelemetry)
Collaboration & Communication: Ability to work effectively in a distributed, cross-functional environment, with strong documentation and communication skills
Proven experience operating highly available, large-scale distributed systems, with a deep understanding of reliability, scalability, and failure modes
Ownership mindset: Takes end-to-end responsibility for system reliability, proactively identifying and addressing risks before they impact users
Bias for automation: Continuously seeks to reduce operational toil through automation and scalable solutions
Continuous improvement mindset: Actively learns from incidents and drives improvements through blameless postmortems and iterative enhancements

Job Responsibility

Define, track, and improve Service Level Objectives (SLOs), SLIs, and error budgets to ensure reliability targets are met
Build and enhance observability systems (metrics, logs, and distributed tracing) to enable proactive detection and faster troubleshooting
Drive reliability engineering practices, including capacity planning, load testing, and resilience validation (e.g., chaos testing)
Improve developer experience (DevEx) by enabling self-service infrastructure and streamlining deployment workflows
Partner with engineering team members to embed reliability best practices early in the development lifecycle
Design, implement, and optimize CI/CD and GitOps workflows using tools such as GitLab (or similar) and ArgoCD (or similar), enabling automated, reliable deployments with support for progressive delivery strategies like canary and blue-green releases
Implement secure-by-default infrastructure and enforce best practices (e.g., IAM, secrets management, encryption)
Continuously optimize infrastructure cost and efficiency using FinOps principles while maintaining performance and availability
Establish and track operational metrics such as MTTR, MTTD, and incident frequency to drive continuous improvement
Reduce operational toil by identifying repetitive work and implementing automation-first solutions

Fulltime
