Site Reliability Engineering (SRE) / Lead Engineer Job at NTT DATA (Guadalajara)

Site Reliability Engineering (SRE) Team Lead

We are looking for a highly skilled and experienced Site Reliability Engineering...

Location

United States, Irving

Salary:

Not provided

OneMain Financial

Expiration Date

Until further notice

Requirements

BA/BS in Computer Science, Engineering, related field, or equivalent experience
7+ years of experience in site reliability engineering, systems engineering, or related roles, with at least 2 years in a leadership position
Proven experience leading and scaling high-performing engineering teams
Deep expertise in cloud platforms (AWS, GCP, Azure) and container orchestration (Kubernetes, Docker)
Strong skills in infrastructure as code tools (Terraform, Ansible, CloudFormation) and CI/CD pipelines
Proficiency with monitoring and alerting systems (Prometheus, Grafana, ELK, Datadog)
Solid programming and scripting skills (Python, Go, Bash, or similar)
Strong understanding of distributed systems, networking, security, and databases
Excellent leadership, communication, and collaboration skills
Experience managing incident response and on-call rotations

Job Responsibility

Lead, mentor, and grow a team of site reliability engineers, promoting a culture of reliability, automation, and continuous improvement
Drive the design, implementation, and maintenance of scalable and fault-tolerant infrastructure to support high-availability services
Oversee incident management processes, including triage, root cause analysis, and postmortems to improve system reliability and prevent recurrence
Collaborate cross-functionally with software engineering, product, and operations teams to integrate reliability best practices into the software development lifecycle
Define and implement operational metrics, SLIs/SLOs, and dashboards to monitor system health and drive proactive improvements
Manage and assess the observability of critical environments, proactively addressing gaps as they arise
Oversee the release management processes, artifacts, and tools that drive a repeatable software delivery lifecycle
Champion automation efforts to reduce manual intervention, improve deployment pipelines, and optimize infrastructure management
Lead capacity planning, disaster recovery, and performance tuning efforts
Ensure security and compliance standards are upheld across infrastructure and operations

What we offer

Health and wellbeing options including medical, prescription, dental, vision, hearing, accident, hospital indemnity, and life insurances
Up to 4% matching 401(k)
Employee Stock Purchase Plan (10% share discount)
Tuition reimbursement
Paid time off (15 days’ vacation per year, plus 2 personal days, prorated based on start date)
Paid sick leave as determined by state or local ordinance, prorated based on start date
Paid holidays (7 days per year, based on start date)
Paid volunteer time (3 days per year, prorated based on start date)
Access to Talkspace and Hinge for on-demand physical therapy via an app
Family back-up care

Fulltime

Site Reliability Engineering (SRE) / Observability Technical Lead

Join a dynamic team as a Site Reliability Engineer, leading observability and re...

Location

United Kingdom, London

Salary:

Not provided

NTT DATA

Expiration Date

Until further notice

Requirements

5+ years of experience in SRE, Observability, or DevOps roles, with leadership responsibilities
Proven expertise with Application Performance Monitoring (APM) tools such as New Relic, Datadog, AppDynamics, or Dynatrace
Hands-on experience with OpenTelemetry (OTel) for distributed tracing and observability instrumentation
Strong proficiency in Infrastructure as Code (IaC) using Terraform
Solid understanding of cloud platforms including AWS, GCP, or Azure
Experience with automation/configuration management tools like Ansible, Chef, or Puppet
Deep knowledge of CI/CD pipelines and tools such as GitHub Actions, Jenkins, or Azure DevOps
Experience managing Kubernetes and containerized environments (Docker, Helm)
Familiarity with log aggregation and analysis platforms like ELK Stack or Splunk
Excellent leadership, communication, and collaboration skills

Job Responsibility

Lead the strategic development and management of observability and reliability frameworks across the organization, ensuring alignment with business goals and technical requirements
Design and implement monitoring and observability solutions, collaborating with engineering teams to define standards and best practices
Manage Infrastructure as Code (IaC) initiatives using Terraform, coordinating with cloud and infrastructure teams to ensure scalable and secure deployments
Drive automation strategies for monitoring, alerting, and logging pipelines, focusing on process improvements and operational efficiency
Develop and maintain comprehensive observability roadmaps, including distributed tracing, logging, and metrics collection strategies
Collaborate with product management, sales, and pre-sales teams to provide technical expertise and support during solution design and customer engagements
Lead cross-functional teams to enhance CI/CD pipelines and deployment reliability, ensuring smooth integration of observability tools and practices
Engage with vendors and strategic partners to evaluate, select, and integrate observability and monitoring solutions, ensuring alignment with organizational needs and fostering strong collaborative relationships
Mentor and develop junior engineers and analysts, fostering a culture of reliability, observability, and operational excellence

What we offer

Tailored benefits that support your physical, emotional, and financial wellbeing
Continuous growth and development opportunities
Flexible work options

Fulltime

Intermediate Site Reliability Engineer (SRE) – AI Reliability & Automation

At PointClickCare our mission is simple: to help providers deliver exceptional c...

Location

Canada, Mississauga

Salary:

115000.00 - 128000.00 CAD / Year

PointClickCare

Expiration Date

Until further notice

Requirements

5+ years' experience in software engineering
Experience with SRE principles
Experience with AI/ML in production environments
A passion for automation, intelligent systems, and operational excellence
Strong debugging, problem-solving, and system design skills
Languages: Python, Java, Bash, Terraform
Platforms: Azure, Kubernetes, Docker
Tools: Datadog, Prometheus, AppDynamics, ELK, GitHub Actions
ML/AI: MCP framework, AI agents, Vector store, Agent orchestration (LangChain), RAG
CI/CD: Jenkins, ArgoCD, Spinnaker

Job Responsibility

Build ML-based anomaly detection and pattern recognition systems
Enhance telemetry with smart tagging and metadata for better AI insights
Develop event-driven workflows and self-healing systems using AI triggers
Automate incident response with generative AI and custom AI agent orchestration
Use time-series forecasting and predictive modelling to anticipate failures
Optimise infrastructure with AI-powered autoscaling and cost-aware resource allocation
Build scalable, fault-tolerant systems in a cloud-native environment
Participate in on-call rotations and lead incident response for critical systems
Integrate APIs for streamlined data exchange and system connectivity
Run internal AIOps workshops and help teams adopt AI maturity models

What we offer

Benefits starting from Day 1!
Retirement Plan Matching
Flexible Paid Time Off
Wellness Support Programs and Resources
Parental & Caregiver Leaves
Fertility & Adoption Support
Continuous Development Support Program
Employee Assistance Program
Allyship and Inclusion Communities
Employee Recognition … and more!

Fulltime

Senior Site Reliability Engineer (SRE)

The Senior SRE is responsible for deployment, updates, and operational support f...

Location

India, Chennai

Salary:

Not provided

Dalet

Expiration Date

Until further notice

Requirements

Cloud platforms: AWS, Azure
Containerisation & Orchestration: Kubernetes
Infrastructure as Code: Terraform
Configuration Management: Ansible
Packaging & Deployment: Helm
Databases: MariaDB, MongoDB
Monitoring, observability, networking, and cloud security

Job Responsibility

Act as a senior technical authority for APAC Site Reliability Engineering activities
Drive best practices in reliability, operations, and engineering standards
Promote technical excellence, collaboration, and accountability across stakeholders
Make infrastructure complexity transparent to both internal teams and customers, ensuring a consistently excellent client experience
Implement, track, and evolve service performance measures such as SLAs, SLOs, and SLIs
Anticipate risks related to service availability, capacity, performance regressions, and security vulnerabilities
Drive continuous improvement, including leading and facilitating Root Cause Analysis (RCA) activities
Ensure timely execution of deployments, upgrades, maintenance activities, and change requests
Anticipate workload, plan deliverables, and ensure qualification/validation of upcoming tasks
Collaborate closely with engineering to improve platform components, automation, and operational processes

What we offer

Great career opportunities around the world
Truly collaborative environment with supportive leadership
Cutting edge technologies (AI, Cloud, Cybersecurity...)
Talented and passionate team members
Fun working environment

Fulltime

Manager of Site Reliability Engineering (SRE)

The Manager of Site Reliability Engineering leads and develops a team of SRE pra...

Location

United States, Birmingham

Salary:

Not provided

Genuine Parts Company

Expiration Date

Until further notice

Requirements

Typically requires a bachelor's degree and 7 years of experience in a technology and/or software engineering role or an equivalent combination
Proven experience working in large, complex enterprise environments (Fortune 500 or equivalent)
Strong understanding and demonstrated implementation of Site Reliability Engineering (SRE) principles at scale
Hands-on experience with infrastructure-as-code (IaC) tools such as Terraform and ArgoCD
In-depth knowledge and practical experience with CI/CD pipelines and automation of software delivery
Experience championing DevOps practices and embedding reliability early in the SDLC
Significant hands-on experience in Site Reliability Engineering or related roles focused on cloud infrastructure reliability
Strong software engineering background with proficiency in infrastructure-as-code tools (e.g., Terraform, ArgoCD) and CI/CD automation
Deep knowledge of cloud platforms, specifically Google Cloud Platform (GCP), Kubernetes, container orchestration, and cloud-native architecture
Familiarity with monitoring and observability tools such as Dynatrace, Datadog, or equivalents

Job Responsibility

Lead, mentor, and grow a high-performing team of Site Reliability Engineers, fostering a culture of ownership, continuous improvement, and operational excellence
Implement and champion Site Reliability Engineering principles and DevOps best practices within the team to ensure service reliability, availability, and performance
Define and track key SRE metrics such as service uptime, incident response and resolution times
Drive automation efforts including CI/CD pipeline enhancements, infrastructure-as-code practices, and self-service infrastructure provisioning to increase deployment velocity while reducing manual toil
Own and continuously improve observability practices including system monitoring, logging, alerting, and diagnostics to ensure rapid issue detection and resolution
Participate in incident response processes including incident management, root cause analysis, post-mortems, and continuous improvement to enhance system resilience
Partner closely with software engineering, product management, architecture, and security teams to embed reliability and security early in the software development lifecycle (SDLC)
Oversee the management and scalability of cloud infrastructure environments, primarily on Google Cloud Platform (GCP), with a focus on Kubernetes, container orchestration, and hybrid cloud integrations
Advocate for and apply best practices in performance tuning, capacity planning, and system design for high availability
Develop and execute a long-term roadmap for our hybrid cloud platform, aligning with evolving business objectives and technology trends

What we offer

Comprehensive benefit plans and programs designed to support your health and wellness, provide income protection, and build financial security for your retirement

Fulltime

Site Reliability Engineer (Lead)

10Pearls is an award-winning end-to-end digital innovation company that helps bu...

Location

Pakistan, Islamabad

Salary:

Not provided

10Pearls

Expiration Date

Until further notice

Requirements

Bachelor's degree in computer science or related field
5–8 years in SRE or production-engineering roles running distributed systems at scale
Deep Kubernetes expertise — operators, RBAC, network policy, storage, upgrades
Hands-on with Keycloak / Vault / MinIO / Harbor / Kong or equivalent identity/secrets/storage/registry/gateway stacks
Strong Linux fundamentals and at least one systems language (Go, Rust) or shell/Python for tooling
Proven SLO/SLI authorship and error-budget-driven decision-making
Experience with observability stacks (Prometheus, Grafana, OpenTelemetry, Loki, Tempo)
Calm, clear communication during incidents; strong post-mortem writing
Hands-on with infra-as-code — Helm, Kustomize, Terraform

Job Responsibility

Substrate operation: own the Kubernetes cluster plus Keycloak (identity), Vault (secrets), MinIO (object storage), Harbor (registry), Kong (gateway), from bootstrap to day-2 operations
SLO framework: define, publish, and defend SLOs for every tier-1 service; own error budgets and burn-rate alerting
Incident response: build the on-call rotation, paging, runbook library, and post-mortem culture; lead incident command during P1/P2 events
Release operations: co-own the blue-green / canary release model with L6 Delivery; sign off production-bound releases
Air-gap operations: ensure every operational runbook works in a fully offline environment, with no assumption of external dependencies
Lead the Platform squad: technically lead 1 Infrastructure Engineer, 1 Observability Engineer, 2 DevOps Engineers; set standards for infra-as-code and automation

Fulltime

Site Reliability Engineering Lead

Engineer the future of global finance. At Citi, our Tech team doesn’t just suppo...

Location

Canada, Mississauga

Salary:

120800.00 - 170800.00 USD / Year

Citi

Expiration Date

Until further notice

Requirements

6–10 years of relevant experience in a hands‑on technical role
Experience contributing to architecture discussions and ensuring solutions align with enterprise standards and long‑term maintainability
Experience working with senior stakeholders or technology partners
Demonstrated experience supporting IT service improvements or platform stability initiatives
Strong communication and presentation skills, with the ability to convey technical concepts clearly
Experience supporting or contributing to technical roadmaps or operational workstreams
Experience participating in resilience‑related activities such as incident simulations, disaster recovery exercises, or stability testing
Ability to collaborate with cross‑functional support teams and technology groups
Strong organizational and workload‑planning skills
Consistently demonstrates clear and concise written and verbal communication skills

Job Responsibility

Demonstrates a strong understanding of how application support contributes to the overall technology function and organizational objectives
Assist with vendor relationship management, including coordination with offshore managed services
Support efforts to improve service levels for end users by enhancing operational efficiencies and strengthening incident management, problem management, and knowledge‑sharing practices
Partner with development teams to guide improvements in application stability and supportability
Contribute to frameworks for managing capacity, throughput, and latency
Assist in defining and implementing application onboarding guidelines and standards
Support team members by fostering a collaborative environment and encouraging skill development
Participate in cost‑reduction efforts through Root Cause Analysis reviews, knowledge management, performance tuning, and user training
Participate in business review meetings to help align technology tools and strategies with business requirements
Ensure adherence to support processes and tool standards, and assist in enhancing processes to promote consistency and quality across the support program

Fulltime

Senior Site Reliability Engineer (SRE) – Cloud & Distributed Systems

We are seeking an experienced Senior Site Reliability Engineer (SRE) to design, ...

Location

United States, Austin

Salary:

Not provided

Dutech Systems

Expiration Date

Until further notice

Requirements

8+ years of experience in SRE, DevOps, or Systems Engineering
Strong expertise in Linux/Unix systems and system internals
Proficiency in at least one programming/scripting language (Python, Go, Java, Bash)
Experience designing and operating distributed systems
Hands-on experience with cloud platforms (AWS or GCP)
Experience with Docker and Kubernetes
Strong understanding of monitoring, alerting, and logging concepts
Experience managing SLIs, SLOs, and error budgets
Experience with incident management and RCA processes

Job Responsibility

Design, implement, and manage highly available, distributed systems
Maintain and optimize cloud infrastructure (AWS/GCP)
Develop automation scripts using Python, Go, Java, or Bash
Manage containerized environments using Docker and Kubernetes
Define and monitor SLIs, SLOs, and error budgets
Implement monitoring, logging, and alerting solutions
Lead incident management, root cause analysis (RCA), and postmortems
Ensure system security and compliance within operational workflows
Improve system reliability through performance tuning and optimization
Collaborate with engineering teams to enhance deployment and release processes
