The Site Reliability Engineering Specialist independently executes activities that help ensure BT is in the best position to deliver the service performance, reliability and availability that internal and external customers expect, by enabling cross-team engineering discussions to achieve scalable, measurable, fault-tolerant, and cost-effective cloud services.
Job Responsibilities:
Executes the implementation of new software development life cycle automation tools, frameworks, and code pipelines
Coordinates a diverse team and creates the initial test schedule
Executes the implementation of automation technologies
Proactively identifies and manages risk
Leads scale testing to measure, tune and optimise system performance
Executes metric/monitoring analysis
Designs, analyses, develops and troubleshoots highly distributed large-scale production systems
Executes approaches that scale systems sustainably
Writes and delivers infrastructure as code software
Implements robust monitoring and alerting systems and performs root cause analysis
Inspects queue and support processing
Executes retrospective and preventive actions after each high severity production incident
Analyses complex systems from a reliability and resilience perspective
Champions emerging trends, continuously developing knowledge of them and sharing it with the team
Mentors other site reliability engineers
Uses the network of site reliability engineers, removing BT's organisational boundaries
Requirements:
A degree in IT, Maths or Science
A deep understanding of full stack monitoring solutions such as Dynatrace
Strong proficiency in one or more programming languages (e.g. Java, Python)
Experience with cloud platforms (AWS, Azure, or GCP)
Solid understanding of software architecture, design patterns, and microservices
Familiarity with CI/CD tools and DevOps practices
High-quality presentation and reporting skills
Resilience to ensure support teams are engaged 24x7x365