Site Reliability Engineering Specialist

India, Bengaluru · Job Posted March 19, 2026

Job Description

The Site Reliability Engineering Specialist independently executes activities that help ensure BT is in the best position to deliver the service performance, reliability and availability that internal and external customers expect, by enabling cross-team engineering discussions to achieve scalable, measurable, fault-tolerant and cost-effective cloud services.

Job Responsibility

  • Executes the implementation of new software development life cycle automation tools, frameworks, and code pipelines
  • Coordinates a diverse team and creates the initial test schedule
  • Executes the implementation of automation technologies
  • Proactively identifies and manages risk
  • Leads scale testing to measure, tune and optimise system performance
  • Executes metric/monitoring analysis
  • Designs, analyses, develops and troubleshoots highly distributed large-scale production systems
  • Executes approaches that scale systems sustainably
  • Writes and delivers infrastructure as code software
  • Implements robust monitoring and alerting systems and performs root cause analysis
  • Inspects queue and support processing
  • Executes retrospective and preventive actions after each high severity production incident
  • Analyses complex systems from a reliability and resilience perspective
  • Champions, continuously develops and shares knowledge of emerging trends with the team
  • Mentors other site reliability engineers
  • Uses the network of site reliability engineers, removing BT's organisational boundaries

Requirements

  • A degree in IT, Maths or Science
  • A deep understanding of full stack monitoring solutions such as Dynatrace
  • Strong proficiency in one or more programming languages (e.g. Java, Python)
  • Experience with cloud platforms (AWS, Azure, or GCP)
  • Solid understanding of software architecture, design patterns, and microservices
  • Familiarity with CI/CD tools and DevOps practices
  • High-quality presentation and reporting skills
  • Resilience to ensure support teams are engaged 24x7x365
  • Ability to adapt to latest industry trends
  • CI/CD/CT Pipeline management
  • Micro-Service functionality
  • Business Process Improvement
  • Growth mindset
  • AI driven Observability & AIOps
  • Incident Response with AI
  • ML Ops for Reliability
  • AI enhanced Automation & CI/CD
  • AI + Chaos Engineering (Resilience)
  • Platform & Tool Literacy (AI ready)
  • Governance, Safety & Measurement

Similar Jobs for Site Reliability Engineering Specialist

8 matching positions

Site Reliability Engineering Specialist

Professional Services was formed as a progressive development towards the conver...
Location
United Kingdom, Snowhill, Birmingham; Ipswich; London
Salary
Not provided
Plusnet
Expiration Date
Until further notice
Requirements
  • A strong understanding of multi-vendor IP/MPLS networks (Nokia, Cisco, Juniper etc)
  • A strong understanding of network routing protocols such as IS-IS, LDP, RSVP, segment routing, OSPF, eBGP, iBGP, MP-BGP
  • A strong understanding of fundamental protocols such as DNS, DHCP & NTP
  • A strong understanding of network change & incident management best practice
  • A good understanding of Linux operating systems
  • An intermediate level of proficiency in at least one programming language, preferably Python
  • You will be confident and professional in communicating with all stakeholders, both locally and with members of the Senior Management Team.
  • You will have the ability to work in a high-pressure environment.
Job Responsibility
  • Builds network engineering change processes for complex end-to-end technology introduction in the live network, utilising automation & CI/CD pipelines.
  • Leads on major incident resolution acting as a final technical escalation point within BT.
  • Leads blameless post‑incident reviews to uncover systemic root causes and convert learnings into concrete reliability, automation, and process improvements.
  • Champions a reliability‑first change culture, promoting safe deployment patterns, blameless learning, and continuous improvement across engineering teams.
  • Collaborates with design & platform teams to support the implementation of flawless change into the live network.
  • Acts as a subject matter expert within the network engineering domain, applying this expertise to troubleshoot faults on our infrastructure crossing multiple platform domains.
  • Embeds secure by design principles when building new change processes and solutions.
  • Champions and builds effective working relationships, both internally and externally, to deliver business outcomes.
  • Champions the adoption of Site Reliability Engineering practices within Professional Services, driving cultural change towards automation, observability, and reduced operational toil.
What we offer
  • Tailored training and development opportunities to continue to build your career
  • 10% on target bonus
  • 25 days’ annual leave (not including bank holidays), increasing with service
  • Life Assurance
  • Pension scheme - If you pay in a minimum of 5% of your pensionable salary every month we will pay in 10%
  • Direct Share scheme
  • Option to join the Healthcare Cash Plan or other benefits such as dental insurance, gym memberships etc.
  • 50% off EE mobile pay monthly or SIM only plans
  • Exclusive colleague discounts on our latest and greatest BT broadband packages and BT TV, including TNT Sports and NOW entertainment
  • Shared Parental leave - maximum amount of leave you can share with your partner is 50 weeks
  • Full-time

Site Reliability Engineering Specialist

BTI Professionals provide expert third-line reliability and operational support ...
Location
Hungary, Budapest
Salary
Not provided
Plusnet
Expiration Date
Until further notice
Requirements
  • Experience supporting large-scale, high-availability services in an ISP / NaaS / network-centric environment
  • Strong Linux troubleshooting and systems knowledge
  • Hands-on Kubernetes experience operating applications in production
  • Experience delivering changes using GitOps and CI/CD pipelines (including release validation and rollback awareness)
  • Working knowledge of incident/problem management in ServiceNow and delivery tracking in Jira (Scrum / PI planning)
  • Experience with observability tooling: Dynatrace, Prometheus, Elasticsearch, plus event/messaging platforms such as Kafka
  • Solid networking fundamentals to support effective troubleshooting
  • Automation experience with Ansible and at least one of Python / Go / Bash
  • Experience integrating or operating services with LDAP (authentication/authorisation, troubleshooting access issues)
Job Responsibility
  • Provide SRE ownership for the Global Fabric NaaS service, ensuring availability, performance, and resilience
  • Support safe, automated change into production using CI/CD, GitOps, and automated testing
  • Operate and improve monitoring and observability using Dynatrace, Prometheus, and Elasticsearch
  • Troubleshoot incidents across Kubernetes-hosted applications, Linux systems, networking, and service integrations
  • Act as a third-line escalation point, participating in a 24x7 on-call rota
  • Manage incidents via ServiceNow and track defects and improvements in Jira
  • Contribute to Scrum ceremonies and PI planning, supporting Agile delivery
  • Drive automation using Ansible and scripting to reduce operational toil
  • Mentor and support L2 engineers, improving runbooks, troubleshooting practices, and operational readiness
What we offer
  • Cafeteria package - HUF 600,000/ year
  • Performance-based bonus
  • Comprehensive private health care package for all the employees, which can be extended to family members
  • Nursery support for mothers returning from maternity
  • Extended paternity leave: 10+10 fully paid days
  • Commuting allowance
  • Home office allowance
  • Employee discount opportunities
  • Highly affordable mobile packages for the family as well
  • Car allowance
  • Full-time

Site Reliability Engineering Specialist

This role will specialise in system administration and server management with a ...
Location
United Kingdom, Birmingham
Salary
Not provided
Plusnet
Expiration Date
Until further notice
Requirements
  • Experience in an ISP Environment: Proven experience in a fast-paced ISP setting, managing and troubleshooting large-scale networks
  • Sysadmin/Server Management: Strong skills in system administration, server management, and compute resources with experience in deploying and managing containerised applications using orchestration tools such as Kubernetes
  • Technical Proficiency: Strong understanding of network architecture, design, and implementation
  • Monitoring and Logging Solutions: Familiarity with monitoring and logging solutions such as Elasticsearch, Apache Kafka, and Prometheus
  • Programming Proficiency: Proficiency in at least one programming language, such as Python, Ansible or Go
  • Growth Mindset: Self-driven attitude towards learning new skills and aiding the development of others
Job Responsibility
  • Network Delivery: Support the implementation of flawless change into the live network, utilising automation and CI/CD pipelines
  • Network Monitoring: Configure, maintain, and monitor systems and network infrastructure to ensure optimal health, performance, and reliability
  • Automation Tools: Utilise tools such as Ansible to provision and manage infrastructure resources in a scalable and efficient manner
  • Technical Acumen: Apply your understanding of network principles to troubleshoot network faults within our systems and look at how you can optimise performance and enhance security across our infrastructure
  • Incident Management and Resolution: Be prepared to support a 24/7/365 callout, providing third line technical resolution covering an extensive range of technologies
  • Customer Focus: Be a technical expert who understands the end-to-end journey of our customers
  • Growth and Development: As a technically talented expert you should enhance the brand of the team and support those around you to be accountable and perform at their best
What we offer
  • Competitive salary
  • 10% on target bonus
  • BT Pension scheme, minimum 5% Employee contribution, BT contribution 10%
  • 25 days annual leave (not including bank holidays), increasing with service
  • Huge range of flexible benefits including cycle to work, healthcare, season ticket loan
  • World-class training and development opportunities
  • Option to join BT Shares Saving schemes
  • Discounted broadband, mobile and TV packages
  • Access to hundreds of retail discounts including the BT shop
  • On call allowances and overtime
  • Full-time

AI Platform Site Reliability Engineering Specialist

The AI Platform Site Reliability Engineering Specialist will operate and maintai...
Location
India, Bengaluru
Salary
Not provided
NTT DATA
Expiration Date
Until further notice
Requirements
  • Bachelor's or Master's degree in Computer Science or related field, or equivalent job experience
  • 5 years of production experience in SRE / Infrastructure / ops for large-scale systems
  • Strong programming/scripting skills (Python, Go, Java, or equivalent)
  • Deep experience with containerization (Docker), orchestration (Kubernetes, etc.)
  • Infrastructure-as-code (Terraform, Helm, CloudFormation, Ansible, etc.)
  • Familiarity with GPU / AI compute clusters, high-performance data storage, and distributed architectures
  • Experience with monitoring / observability / logging / alerting tools (Prometheus, Grafana, ELK / EFK, Datadog, etc.)
  • Networking and systems engineering knowledge (TCP/IP, DNS, routing, load balancing, distributed storage)
  • Solid experience in capacity planning, performance tuning, scaling, and incident response
  • Demonstrated ability to lead RCAs, deploy fixes, and drive reliability improvements
Job Responsibility
  • Operate, monitor, and maintain the infrastructure supporting GenAI applications (training, inference, feature store, data ingestion, model serving)
  • Design and build automation for core platform capabilities, reducing manual toil
  • Develop and maintain infrastructure-as-code (IaC) for provisioning and managing compute, storage, network, GPU clusters, Kubernetes / container orchestration, etc.
  • Establish, monitor and enforce SLOs/SLIs/SLAs, error budgets, alerting, and dashboards
  • Lead incident response, root cause analysis (RCA), postmortems, and systemic remediation
  • Perform capacity planning, scaling strategies, workload scheduling and resource forecasting
  • Optimize cost vs. performance trade-offs in large-scale compute environments
  • Harden systems for security, compliance, auditability, and data governance
  • Collaborate across teams (cloud engineers, data engineers, infrastructure, security) to ensure safe deployment, rollout, rollback, and integration of new systems
  • Define disaster recovery (DR) strategies, backup/restore practices, and fault tolerance mechanisms

Sr Engineering Specialist

The Automations and Electrical Controls Engineer has responsibility for overall ...
Location
United States, Aiken
Salary
Not provided
Owens Corning
Expiration Date
Until further notice
Requirements
  • Bachelor's Degree in Engineering
  • Five years of automation and control engineering experience in a manufacturing environment
  • Strong safety awareness, commitment and safety leadership
  • Experience working with 480 V
  • Experience leading projects (Capital, Focused Improvement)
  • Strong knowledge of PLC based controls, HMI applications, and programming (Siemens)
  • Availability to work nights, weekends, and holidays as required by operational support needs
Job Responsibility
  • Lead Safety for an injury free work environment
  • Educates team members on safe maintenance work processes and procedures
  • Adheres to, and continuously improves, all Plant and position-specific safety policies, procedures, and standards
  • Ensures a safe, clean and environmentally compliant work environment and builds a culture where safety is a first priority
  • Effectively communicates Owens Corning’s stance on safety to external parties and ensures that they work according to our safety standards
  • Good knowledge of NEC NFPA 70 and 70E, including Arc Flash
  • Developing Talent
  • Develops and executes training plans for maintenance personnel and creates a continuous learning environment for employees
  • Co-leads and coaches the primary maintenance workforce and drives their engagement
  • Promotes a work environment characterized by mutual trust and respect, open and honest communications, teamwork and a passion for winning
  • Full-time

Asset Health and Reliability Specialist

Reporting to the Mobile Maintenance Manager, the Asset Health and Reliability Sp...
Location
Australia, Pilbara
Salary
Not provided
PLS
Expiration Date
Until further notice
Requirements
  • Relevant nationally recognized trade qualification
  • Current Driver’s Licence (C Class minimum)
  • Minimum 5 years of experience in a mining or heavy industry environment, with a focus on HME reliability or maintenance
  • Strong knowledge of heavy mobile equipment systems and components (e.g., engines, hydraulics, powertrain, electrical, undercarriage)
  • Experience conducting root cause analysis and implementing reliability improvements
  • Excellent communication and interpersonal skills for cross-functional collaboration
  • Strong commitment to safety and continuous improvement
  • Highly developed attention to detail with the ability to analyse condition monitoring information and provide accurately reported recommendations
  • Data exploration skills and capability to review datasets across systems, inclusive of time-series VIMS, KOMTRAX and equivalent data
Job Responsibility
  • Monitor and analyse the reliability performance of Heavy Mobile Equipment (HME) fleet
  • Develop and maintain equipment health strategies using reliability tools such as RCM, FMEA, and condition monitoring techniques
  • Collaborate with maintenance, operations, and OEMs to drive continuous improvement in equipment performance and availability
  • Identify and implement initiatives to improve mean time between failures (MTBF) and reduce mean time to repair (MTTR)
  • Review and optimise preventative and predictive maintenance strategies and schedules
  • Prepare and present reliability reports, KPIs, and improvement plans to senior management
  • Support the implementation and usage of reliability software systems (e.g., Pronto, AMT or similar CMMS tools)
  • Ensure all activities comply with site safety standards, environmental policies, and legislative requirements
What we offer
  • Quarterly short-term incentive bonus recognising individual and business performance
  • PLS employee share scheme
  • Access to newly refurbished facilities at Pilgangoora, including gym, tennis, pickleball and volleyball courts, sports oval and scenic walking tracks
  • 18 weeks parental leave for primary carers and 4 weeks for secondary carers
  • Health and wellbeing allowance
  • Novated leasing through salary sacrifice
  • Paid community leave
  • Monthly employee recognition awards
  • Access to PLS’ KidsCo School Holiday Program
  • Access to our Employee Assistance Program and Company Chaplains
  • Full-time

Sr Reliability Specialist

The Reliability Specialist is responsible for providing technical guidance on op...
Location
United States, Tyler
Salary
Not provided
Delek US
Expiration Date
Until further notice
Requirements
  • 2 year / Associate Degree (Required)
  • Six (6) or more years Oil & Gas or related experience (Required)
  • No Licensure or Certification Required
  • Reliability Management
  • Asset Management
  • Fixed Equipment
  • Rotating Equipment
  • Oil & Gas Refining
  • Pipeline Knowledge (DKL)
  • Pressure Control Devices
Job Responsibility
  • Provide technical guidance on operating units and equipment that maintains and improves the safety, environmental standards, overall reliability, and operating cost
  • Demonstrate accountability for increasing equipment reliability by improving time between failures of industrial equipment while reducing equipment downtime and manufacturing costs
  • Work collaboratively with the engineering functions as well as other departments to develop, implement and maintain standard mechanical and/or electrical processes incorporating industry best practices
  • Create maintenance technical standards and standardized work practices in collaboration with subject matter experts
  • Develop strategies to manage assets at peak performance, optimize lifetime return on investment, mitigate reliability risk, and support capital improvements in support of long-term sustainable performance
What we offer
  • Up to a 10% match on 401K on your hire start, with a vesting timeline of only one year
  • Medical benefits that start on day one with a 30% premium rebate annually
  • Access to the Calm app for FREE
  • Performance management program to earn additional annual incentives
  • Full-time

Cloud Engineering Specialist - VMware

Network Cloud is responsible for delivering BT’s strategic private cloud platfor...
Location
United Kingdom, Manchester; Ipswich; London
Salary
Not provided
Plusnet
Expiration Date
Until further notice
Requirements
  • VMware vSphere and vCenter
  • VMware Cloud Foundation
  • Compute and virtualisation platform design and operation
  • Storage technologies including vSAN or equivalent
  • Experience producing High Level and Low Level Designs
  • Strong troubleshooting and diagnostic skills
  • Solid understanding of IP networking fundamentals
  • Ability to work across both design and operational activities
  • 3 years or more in a hands-on infrastructure role, including 2nd and 3rd line support, AND 3 years+ Design Experience
  • Solid understanding of IP networking fundamentals and data centre infrastructure
Job Responsibility
  • Design, build and support VMware-based private cloud infrastructure
  • Contribute to High Level and Low Level Designs aligned to BT architecture and engineering standards
  • Support the deployment and lifecycle management of VMware Cloud Foundation environments
  • Operate and optimise compute and storage platforms, including performance, availability and capacity planning
  • Troubleshoot and resolve complex, high-severity infrastructure issues
  • Work closely with network and security teams to integrate compute platforms with NSX and underlay networks
  • Contribute to automation and infrastructure-as-code initiatives to improve consistency and efficiency
  • Produce and maintain technical documentation, diagrams, runbooks and operational procedures
  • Drive standardisation across platforms to ensure operational consistency and reliability
  • Provide technical guidance and support to other engineering teams and stakeholders
What we offer
  • 10% on target bonus
  • BT Pension scheme, minimum 5% Employee contribution, BT contribution 10%
  • From January 2025, equal family leave: receive 18 weeks at full pay, 8 weeks at half pay and 26 weeks at the statutory rate. It’s for all parents, no matter how your family is made up
  • Enhanced women’s health support: including help with menopause symptoms, cancer screenings, period care and more
  • 25 days annual leave (not including bank holidays), increasing with service
  • 24/7 private virtual GP appointments for UK colleagues
  • 2 weeks carer’s leave
  • World-class training and development opportunities
  • Option to join BT Shares Saving schemes
  • Full-time