Site Reliability Engineering (SRE) / Observability Technical Lead

NTT DATA

Location:
United Kingdom, London

Contract Type:
Not provided

Salary:

Not provided

Job Description:

Join a dynamic team as a Site Reliability Engineer, leading observability and reliability projects. Leverage your expertise in APM, IaC, and automation to enhance system performance and scalability. Collaborate with cross-functional teams and mentor junior engineers to foster a culture of operational excellence.

Job Responsibility:

  • Lead the strategic development and management of observability and reliability frameworks across the organization, ensuring alignment with business goals and technical requirements
  • Design and implement monitoring and observability solutions, collaborating with engineering teams to define standards and best practices
  • Manage Infrastructure as Code (IaC) initiatives using Terraform, coordinating with cloud and infrastructure teams to ensure scalable and secure deployments
  • Drive automation strategies for monitoring, alerting, and logging pipelines, focusing on process improvements and operational efficiency
  • Develop and maintain comprehensive observability roadmaps, including distributed tracing, logging, and metrics collection strategies
  • Collaborate with product management, sales, and pre-sales teams to provide technical expertise and support during solution design and customer engagements
  • Lead cross-functional teams to enhance CI/CD pipelines and deployment reliability, ensuring smooth integration of observability tools and practices
  • Engage with vendors and strategic partners to evaluate, select, and integrate observability and monitoring solutions, ensuring alignment with organizational needs and fostering strong collaborative relationships
  • Mentor and develop junior engineers and analysts, fostering a culture of reliability, observability, and operational excellence

Requirements:

  • 5+ years of experience in SRE, Observability, or DevOps roles, with leadership responsibilities
  • Proven expertise with Application Performance Monitoring (APM) tools such as New Relic, Datadog, AppDynamics, or Dynatrace
  • Hands-on experience with OpenTelemetry (OTel) for distributed tracing and observability instrumentation
  • Strong proficiency in Infrastructure as Code (IaC) using Terraform
  • Solid understanding of cloud platforms including AWS, GCP, or Azure
  • Experience with automation/configuration management tools like Ansible, Chef, or Puppet
  • Deep knowledge of CI/CD pipelines and tools such as GitHub Actions, Jenkins, or Azure DevOps
  • Experience managing Kubernetes and containerized environments (Docker, Helm)
  • Familiarity with log aggregation and analysis platforms like ELK Stack or Splunk
  • Excellent leadership, communication, and collaboration skills

What we offer:

  • Tailored benefits that support your physical, emotional, and financial wellbeing
  • Continuous growth and development opportunities
  • Flexible work options

Additional Information:

Job Posted:
January 26, 2026

Employment Type:
Full-time

Similar Jobs for Site Reliability Engineering (SRE) / Observability Technical Lead

Software Engineer, Site Reliability

As a Site Reliability Engineer (SRE) at Fireworks AI, you will play a critical r...
Fireworks AI
Location:
United States, San Mateo
Salary:
Not provided
Expiration Date:
Until further notice
Requirements:
  • Bachelor's degree in Computer Science, related technical field, or equivalent practical experience
  • 5+ years of experience in Site Reliability Engineering, DevOps, or a similar role focused on large-scale production systems
  • Deep expertise in SRE principles and practices, including SLOs, SLIs, operational automation, incident management, and post-mortems
  • Extensive hands-on experience with public cloud platforms (AWS, GCP, Azure), including compute, networking, storage, and database services
  • Strong experience with containerization technologies (Docker) and orchestration platforms (Kubernetes)
  • Proficiency in designing and implementing robust monitoring, logging, and alerting systems using tools like Prometheus, Grafana, ELK stack, and distributed tracing
  • Solid programming/scripting skills in at least one language (e.g., Python, Go) for automation and tool development
  • In-depth knowledge of Linux operating systems, networking fundamentals, and system debugging
  • Proven ability to troubleshoot complex issues across the entire stack
  • Excellent communication, collaboration, and problem-solving skills
Job Responsibility:
  • Ensuring System Reliability: Ensure systems are designed and implemented with high availability, scalability, and performance. Focus on fault tolerance, disaster recovery, identifying and removing scaling bottlenecks, and performance optimization across our multi-cloud infrastructure
  • Incident Management & Response: Lead efforts in incident detection, response, and resolution for critical production issues. Drive post-mortems to identify root causes and implement preventative measures to improve system reliability
  • Observability & Monitoring: Develop, implement, and maintain comprehensive monitoring, alerting, logging, and tracing solutions to provide deep insights into system health and performance
  • Automation & Toil Reduction: Identify and automate repetitive operational tasks to reduce toil and improve operational efficiency. Develop tools and scripts to streamline deployments, scaling, and system management
  • Capacity Planning & Performance Tuning: Work proactively on capacity planning to ensure our infrastructure can gracefully handle growth and peak loads. Optimize system performance and resource utilization
  • Reliability Best Practices: Collaborate with software engineers to embed reliability principles (e.g., SLOs, SLIs, error budgets) into the development lifecycle, promoting a culture of operational excellence
  • On-call Rotation: Participate in a periodic on-call rotation to support our production environment and respond to critical alerts
Employment Type:
Full-time

Site Reliability Engineering Support Lead

Site Reliability Engineering Support Lead role focused on application support, d...
Citi
Location:
Ireland, Dublin
Salary:
Not provided
Expiration Date:
Until further notice
Requirements:
  • Solid SRE process experience
  • 5+ years leading a high-performance, 24x7 DevOps or SysOps team
  • Proficiency in Windows administration, Office 365, Exchange, SharePoint, Active Directory, Backup, Networking and Infrastructure
  • Experience with Microsoft Windows and Windows Server
  • Experience tracking tickets and resolving them on time
  • Hands-on experience with ticketing tools (ServiceNow)
  • Excellent verbal, written, presentation and interpersonal communication skills
  • Ability to make complex technical matters easy to understand for non-technical audiences
Job Responsibility:
  • Taking end-to-end ownership of application support for resolving production system issues
  • Implementing, monitoring, and maintaining CI/CD frameworks
  • Developing new capabilities and coordinating implementation across a large number of teams, including infrastructure, developer tools, and information security
  • Influencing a culture of Site Reliability Engineering, and engaging in training and mentoring to help other engineers develop an SRE mindset
  • Providing first-line post-deployment technical support at L1 and L2 level for applications and/or associated production systems, diagnostics, and network health monitoring
  • Coordinating and/or deploying hands-on fixes, patches, and software updates at the application level and, as appropriate, at the network level
  • Managing a team of technical support engineers who provide technical support to users
  • Escalating complex problems to the L3 level of expertise within the organization, along with observations from investigative and diagnostic assessments
  • Coordinating the investigation of recurring technical issues affecting user systems and seeing them through to resolution
  • Escalating and resolving production incidents, guiding the team, and tracking incidents to closure
What we offer:
  • Competitive base salary (which is annually reviewed)
  • Hybrid working model (up to 2 days working at home per week)
  • Additional benefits to support you and your family to be well, live well and save well.
Employment Type:
Full-time

Staff Observability Operations Engineer

We are currently seeking several experienced and highly skilled Staff Observabil...
CVS Health
Location:
United States, Hartford
Salary:
130295.00 - 260590.00 USD / Year
Expiration Date:
Until further notice
Requirements:
  • 7+ years of experience in IT operations, with significant responsibilities in system monitoring, performance tuning, and troubleshooting enterprise applications
  • 5+ years in a Site Reliability Engineering (SRE) role deploying and managing modern observability solutions
  • 5+ years managing and implementing observability and event management platforms (e.g., AppDynamics, Splunk, Prometheus, Grafana)
  • Experience developing and administering ServiceNow ITOM event management solutions
  • Experience deploying and managing service reliability platforms (e.g., xMatters, OpsGenie, PagerDuty)
  • Experience with and deep knowledge of cloud environments, cloud monitoring platforms, and container orchestration tools (e.g., AWS/CloudTrail, Azure/Monitor, GCP/GCM, Kubernetes, OpenShift)
  • Proficiency in Python and other scripting and automation tools such as Ansible, PowerShell, and Bash for automation and configuration
  • Hands-on experience deploying, managing, and administering observability platforms
  • Hands-on experience leading, coordinating, and performing migration of application, platform, and infrastructure observability solutions
  • Proven ability to troubleshoot and resolve complex technical issues
Job Responsibility:
  • Deploy and implement modern observability solutions
  • Manage and administer observability and event management platforms
  • Coordinate and manage release cycles for observability platforms
  • Troubleshoot and resolve incidents related to observability platforms
  • Continuously monitor and enhance platform performance
  • Collaborate with cross-functional stakeholders
  • Provide training and mentoring to junior engineers
  • Ensure compliance and security of observability platforms
  • Maintain documentation of observability platform configurations
  • Generate and analyze reports on platform performance and capacity
What we offer:
  • Affordable medical plan options
  • A 401(k) plan (including matching company contributions)
  • An employee stock purchase plan
  • No-cost programs for all colleagues, including wellness screenings, tobacco cessation, and weight management programs
  • Confidential counseling and financial coaching
  • Paid time off
  • Flexible work schedules
  • Family leave
  • Dependent care resources
  • Colleague assistance programs
Employment Type:
Full-time

Senior Site Reliability Manager

RUCKUS Networks is seeking an experienced Site Reliability Engineering (SRE) Man...
CommScope
Location:
United States, Sunnyvale
Salary:
135600.00 - 200000.00 USD / Year
Expiration Date:
Until further notice
Requirements:
  • 12+ years in Site Reliability Engineering (SRE), with 6+ years leading SRE, DevOps, or infrastructure teams
  • Proven experience mentoring engineering managers and developing leadership talent
  • Track record of transforming traditional operations or NOC teams into modern SRE organizations
  • Strong project management skills with Agile/Kanban experience and JIRA proficiency
  • Excellent communication skills, including executive-level presentations
  • Deep SRE expertise: incident management, on-call systems, monitoring, and reliability engineering
  • Infrastructure automation experience with Terraform, Kubernetes, Docker, and CI/CD pipelines
  • Cloud platform proficiency (GCP/AWS), including networking, security, and cost optimization
  • Monitoring and observability experience with Prometheus, Grafana, APM tools, and log aggregation
  • 24/7 operations experience with global team coordination and escalation management
Job Responsibility:
  • Lead and develop engineering managers and technical operations engineers across India and APAC time zones
  • Build a collaborative team culture that emphasizes knowledge sharing, automation, and operational excellence
  • Mentor engineering managers to strengthen leadership capabilities and technical expertise
  • Set clear performance expectations and provide ongoing coaching for growth
  • Partner cross-functionally with Product, Security, Development, and global operations teams
  • Own 24/7 operational stability for India/APAC, including incident response, escalation, and post-incident reviews
  • Drive comprehensive incident management: alert handling, outage response, and root cause analysis (RCA/CAR)
  • Transform traditional operations into modern SRE practices using SLOs, error budgets, and reliability engineering
  • Implement robust monitoring and alerting with APM tools, dashboards, and automation frameworks
  • Lead technical project delivery with clear timelines, resource planning, and stakeholder communication
What we offer:
  • Medical, dental, and vision plans
  • Life and accidental death insurance
  • A 401(k) plan
  • Participation in the Company’s Incentive Plan
  • Eleven paid holidays in a full calendar year
  • Two weeks of paid vacation (prorated based on start date)
  • Other leave options
Employment Type:
Full-time

Staff Site Reliability Engineer

Our Site Reliability Engineering team is growing, and we are looking for a highl...
AlphaSense
Location:
United States
Salary:
150000.00 - 225000.00 USD / Year
Expiration Date:
Until further notice
Requirements:
  • 8+ years of experience in Site Reliability Engineering, DevOps, or a similar role
  • At least 3 years in a Senior+ SRE position
  • Strong background in running production SaaS systems at scale
  • Proficiency in at least one programming/scripting language (Python, Go, or similar)
  • Hands-on expertise with cloud platforms (AWS, GCP, or Azure) and Kubernetes
  • Deep understanding of networking fundamentals (TCP/IP, DNS, HTTP/S, load balancing)
  • Experience with monitoring & alerting (Prometheus, Grafana, Datadog, ELK)
  • Familiarity with advanced observability (OTEL, continuous profiling)
  • Proven incident management experience, including leading high-severity incidents and postmortems
  • Strong troubleshooting skills across the full stack
Job Responsibility:
  • Architect Reliability Paved Paths: Build frameworks and self-service tooling that let teams own the reliability of their services
  • Lead AI-Driven Reliability: Drive our AIOps strategy — automating diagnostics, remediation, and proactive failure prevention
  • Champion Reliability Culture: Embed SRE practices across engineering via design reviews, production readiness, and operational standards
  • Incident Leadership: Act as Incident Commander during critical events, modeling operational excellence, and ensuring blameless postmortems lead to lasting improvements
  • Advance Observability: Deliver end-to-end monitoring, tracing, and profiling (Prometheus, Grafana, OTEL, Continuous Profiling) to optimize performance proactively
  • Mentor & Multiply: Elevate engineers across SRE and product teams through mentorship, technical guidance, and knowledge sharing
What we offer:
  • Equity
  • Generous benefits program
Employment Type:
Full-time

Staff Site Reliability Engineer

Our Site Reliability Engineering team is growing, and we are looking for a highl...
AlphaSense
Location:
Finland, Helsinki
Salary:
Not provided
Expiration Date:
Until further notice
Requirements:
  • 8+ years of experience in Site Reliability Engineering, DevOps, or a similar role
  • At least 3 of those years operating in a Senior+ SRE position
  • Strong background in running production SaaS systems at scale
  • Proficiency in at least one programming/scripting language (Python, Go, or similar)
  • Hands-on expertise with cloud platforms (AWS, GCP, or Azure) and Kubernetes
  • Deep understanding of networking fundamentals (TCP/IP, DNS, HTTP/S, load balancing)
  • Experience with monitoring & alerting (Prometheus, Grafana, Datadog, ELK)
  • Familiarity with advanced observability (OTEL, continuous profiling)
  • Proven incident management experience, including leading high-severity incidents and postmortems
  • Strong troubleshooting skills across the full stack
Job Responsibility:
  • Architect Reliability Paved Paths: Build frameworks and self-service tooling that let teams own the reliability of their services
  • Lead AI-Driven Reliability: Drive our AIOps strategy — automating diagnostics, remediation, and proactive failure prevention
  • Champion Reliability Culture: Embed SRE practices across engineering via design reviews, production readiness, and operational standards
  • Incident Leadership: Act as Incident Commander during critical events, modeling operational excellence, and ensuring blameless postmortems lead to lasting improvements
  • Advance Observability: Deliver end-to-end monitoring, tracing, and profiling (Prometheus, Grafana, OTEL, Continuous Profiling) to optimize performance proactively
  • Mentor & Multiply: Elevate engineers across SRE and product teams through mentorship, technical guidance, and knowledge sharing

Staff Site Reliability Engineer

Our Site Reliability Engineering team is growing, and we are looking for a highl...
AlphaSense
Location:
India, Bengaluru
Salary:
Not provided
Expiration Date:
Until further notice
Requirements:
  • 8+ years of experience in Site Reliability Engineering, DevOps, or a similar role
  • At least 3 of those years operating in a Senior+ SRE position
  • Strong background in running production SaaS systems at scale
  • Proficiency in at least one programming/scripting language (Python, Go, or similar)
  • Hands-on expertise with cloud platforms (AWS, GCP, or Azure) and Kubernetes
  • Deep understanding of networking fundamentals (TCP/IP, DNS, HTTP/S, load balancing)
  • Experience with monitoring & alerting (Prometheus, Grafana, Datadog, ELK)
  • Familiarity with advanced observability (OTEL, continuous profiling)
  • Proven incident management experience, including leading high-severity incidents and postmortems
  • Strong troubleshooting skills across the full stack
Job Responsibility:
  • Architect Reliability Paved Paths: Build frameworks and self-service tooling that let teams own the reliability of their services in a “You Build It, You Run It” culture
  • Lead AI-Driven Reliability: Drive our AIOps strategy — automating diagnostics, remediation, and proactive failure prevention
  • Champion Reliability Culture: Embed SRE practices across engineering via design reviews, production readiness, and operational standards
  • Incident Leadership: Act as Incident Commander during critical events, modeling operational excellence, and ensuring blameless postmortems lead to lasting improvements
  • Advance Observability: Deliver end-to-end monitoring, tracing, and profiling (Prometheus, Grafana, OTEL, Continuous Profiling) to optimize performance proactively
  • Mentor & Multiply: Elevate engineers across SRE and product teams through mentorship, technical guidance, and knowledge sharing

Staff Site Reliability Engineer

Our Site Reliability Engineering team is growing, and we are looking for a highl...
AlphaSense
Location:
India, Delhi
Salary:
Not provided
Expiration Date:
Until further notice
Requirements:
  • 8+ years of experience in Site Reliability Engineering, DevOps, or a similar role
  • At least 3 of those years operating in a Senior+ SRE position
  • Strong background in running production SaaS systems at scale
  • Proficiency in at least one programming/scripting language (Python, Go, or similar)
  • Hands-on expertise with cloud platforms (AWS, GCP, or Azure) and Kubernetes
  • Deep understanding of networking fundamentals (TCP/IP, DNS, HTTP/S, load balancing)
  • Experience with monitoring & alerting (Prometheus, Grafana, Datadog, ELK)
  • Familiarity with advanced observability (OTEL, continuous profiling)
  • Proven incident management experience, including leading high-severity incidents and postmortems
  • Strong troubleshooting skills across the full stack
Job Responsibility:
  • Architect Reliability Paved Paths: Build frameworks and self-service tooling that let teams own the reliability of their services in a “You Build It, You Run It” culture
  • Lead AI-Driven Reliability: Drive our AIOps strategy — automating diagnostics, remediation, and proactive failure prevention
  • Champion Reliability Culture: Embed SRE practices across engineering via design reviews, production readiness, and operational standards
  • Incident Leadership: Act as Incident Commander during critical events, modeling operational excellence, and ensuring blameless postmortems lead to lasting improvements
  • Advance Observability: Deliver end-to-end monitoring, tracing, and profiling (Prometheus, Grafana, OTEL, Continuous Profiling) to optimize performance proactively
  • Mentor & Multiply: Elevate engineers across SRE and product teams through mentorship, technical guidance, and knowledge sharing