Site Reliability Engineer - Container Platform

India, Pune · Job Posted January 07, 2026

Job Description

Join Barclays as a Site Reliability Engineer - Container Platform, reporting into the Application Platforms Engineering Lead. You will play a key role in building the products, services, software, APIs, and infrastructure that will be central to this new strategy, ensuring we have a world-class product set that is simplified and provides long-term sustainable business value. At Barclays, we don't just anticipate the future - we're creating it.

Job Responsibility

  • Availability, performance, and scalability of systems and services through proactive monitoring, maintenance, and capacity planning
  • Resolution and analysis of, and response to, system outages and disruptions, and implementation of measures to prevent similar incidents from recurring
  • Development of tools and scripts to automate operational processes, reducing manual workload, increasing efficiency, and improving system resilience
  • Monitoring and optimisation of system performance and resource usage, identification and resolution of bottlenecks, and implementation of best practices for performance tuning
  • Collaboration with development teams to integrate best practices for reliability, scalability, and performance into the software development lifecycle, and close work with other teams to ensure smooth and efficient operations
  • Staying informed of industry technology trends and innovations, and actively contributing to the organization's technology communities to foster a culture of technical excellence and growth

Requirements

  • Minimum qualification: bachelor's degree
  • Experience configuring, using or maintaining Kubernetes (OpenShift, EKS, AKS, or Argo CD)
  • Experience in developing and coding software using Python or Golang
  • Experience with Docker, Containers and Cloud-Native utilities and software

Nice to have

  • Experience in writing Ansible Playbooks or Chef Cookbooks
  • Foundational understanding of Cloud technologies within AWS or Azure
  • Familiarity with complex system integrations, REST APIs, Observability, Telemetry and microservice-based architectures

What we offer

  • Competitive holiday allowance
  • Life assurance
  • Private medical care
  • Pension contribution

Similar Jobs for Site Reliability Engineer - Container Platform

8 matching positions

Site Reliability Engineer Platform Engineer

Join a mission-driven, national financial services organization at the heart of ...
Location: United States, Reston
Salary: Not provided
Company: Tier4 Group
Expiration Date: Until further notice
Requirements
  • 5+ years hands-on operating and managing Kubernetes and OpenShift clusters
  • Strong experience with Microsoft Azure (compute, networking, storage, and data services)
  • Proven skills in automation and Infrastructure-as-Code (Terraform, Ansible, GitOps)
  • Proficiency with observability tooling (Datadog, Prometheus, Grafana)
  • Scripting/coding ability in Bash, Python, or Go
Job Responsibility
  • Operate, tune, and optimize OpenShift/Kubernetes clusters (scheduling, ingress, upgrades, quotas, policies)
  • Stand up and/or refine observability (Datadog, Prometheus, Grafana): dashboards, alerts, SLOs, runbooks
  • Map current hybrid topology and critical delivery pipelines; identify toil and prioritize automation (Terraform/Ansible)
  • Begin supporting Azure environments (compute, networking, storage, data services) used by analytics teams
  • Drive GitOps-first workflows; harden CI/CD with ArgoCD/Jenkins/GitHub Actions and policy-as-code guardrails
  • Implement or enhance platform services (Vault, Kafka/AMQ, ingress, service mesh) for dev and data teams
  • Lead incident response and postmortems; institutionalize RCA, blameless learning, and continuous improvement
Employment Type: Full-time

Senior Site Reliability Engineer Cloud Platform

Zilliz is a fast-growing startup developing the industry’s leading vector databa...
Salary: 175,000 - 225,000 USD / Year
Company: Zilliz
Expiration Date: Until further notice
Requirements
  • 4+ years of experience in site reliability engineering or similar roles with a focus on cloud-native systems
  • Proficiency in programming or scripting languages such as Python, Go, or Java
  • Strong knowledge of container orchestration technologies like Kubernetes and Docker
  • Expertise with cloud platforms such as AWS, GCP, or Azure, and their respective monitoring and management tools
  • Experience with infrastructure as code tools such as Terraform or Ansible
  • Familiarity with CI/CD tools such as Jenkins, GitLab CI, or Argo
  • Proven ability to troubleshoot complex distributed systems and resolve issues promptly
  • Bachelor’s degree or above in computer science, software engineering, or other relevant disciplines
  • Ability to thrive in a fast-paced, startup environment and handle multiple projects simultaneously
Job Responsibility
  • Work at the intersection of development and site reliability, creating SRE tools and systems and supporting existing infrastructure and platforms
  • Ensure the reliability, availability, and performance of Zilliz’s distributed database systems
  • Develop and implement strategies for monitoring, incident management, and disaster recovery
  • Automate system operations and maintenance tasks to improve efficiency and reduce manual intervention
  • Design and build tools to manage and monitor infrastructure, ensuring scalability and robustness
  • Collaborate with software engineers to enhance system reliability, scalability, and performance
  • Maintain and improve the CI/CD pipeline to ensure smooth and rapid deployment of changes
  • Actively contribute to the Milvus Vector Database open-source community, focusing on improving reliability and operational efficiency
Employment Type: Full-time

Staff Engineer, Site Reliability Engineer

OnStar is a cornerstone of General Motors' connected services—bringing safety, s...
Location: Ireland, Dublin
Salary: Not provided
Company: General Motors
Expiration Date: Until further notice
Requirements
  • 8+ years in SRE, DevOps, or systems engineering, including experience managing or mentoring high-impact teams
  • Track record of building and maintaining high-scale, cloud-native systems (preferably AWS, GCP, or Azure)
  • Expertise in container orchestration and deployment strategies using Kubernetes and CI/CD pipelines
  • Proficiency in Python, Go, or Java, with strong code review and readability standards
  • Experience leading cross-functional infrastructure projects, configuration strategy, or organizational tooling initiatives
  • Ability to think and act under pressure
  • Strong communication skills
Job Responsibility
  • Lead the design and implementation of scalable, fault-tolerant, and observable infrastructure supporting OnStar mobile and web experiences, in-vehicle services, and the backend platforms and integrations that power them
  • Champion configuration management, infrastructure refactoring, and testing frameworks to strengthen system resilience
  • Partner across SRE, development, and product teams to improve service reliability, deployment safety, and incident response practices
  • Drive internal consultation and strategic planning on reliability standards for new OnStar capabilities, customer-facing releases, and platform initiatives
  • Define and evolve observability strategy using tools such as Prometheus, Grafana, and Datadog, with automated alerting and actionable SLO dashboards
  • Own and improve on-call practices, manage blameless postmortems, and guide root cause analysis to eliminate recurring failures
  • Mentor engineers and help shape a high-performance culture rooted in extreme ownership and operational excellence
  • Support compliance and privacy-driven engineering initiatives across connected services, with potential crossover into areas like data retention and safety certification tooling
Employment Type: Full-time

Site Reliability Engineer

We are currently seeking a Site Reliability Engineer to join our team in Guadala...
Location: Mexico, Guadalajara
Salary: Not provided
Company: NTT DATA
Expiration Date: Until further notice
Requirements
  • Perform L1.5 activities such as monitoring, deployment, rollback
  • Monitor the efficiency of the Azure cloud systems to prevent outages and initiate an Incident Management bridge in case of an outage
  • Troubleshoot Azure resources, escalate to Level 3 (Software Development Team)
  • Understand the Microsoft Azure Cloud - ideally Azure Fundamentals certified OR Computer Science/Information Systems Management degree
  • Familiar with PaaS and IaaS - VMs, Storage, EventHub, Service Fabric Cluster (SFC), Azure Kubernetes Service (AKS), CosmosDB, SQL Server, IoT Hub, Databricks, KeyVault, Datalake
  • Understand the concept of Internet of Things (IoT) - telemetry, ingestion, processing, data storage, reporting
  • Understand the concepts behind tools such as Octopus, Bamboo, Terraform, Azure DevOps, Jenkins, GitHub, Ansible
  • Understand the concept of container orchestration platforms (e.g. Kubernetes)
  • Understand scripting concepts: PowerShell, Python
  • Understand the difference between NoSQL and SQL databases, and how to maintain them
Job Responsibility
  • Perform L1.5 activities such as monitoring, deployment, rollback
  • Monitor the efficiency of the Azure cloud systems to prevent outages and initiate an Incident Management bridge in case of an outage
  • Troubleshoot Azure resources, escalate to Level 3 (Software Development Team)
Employment Type: Full-time

Site Reliability Engineer

Location: South Africa, Johannesburg
Salary: Not provided
Company: Nintex
Expiration Date: Until further notice
Requirements
  • You provide guidance on infrastructure architecture and contribute to high-quality and successful product releases.
  • You contribute to your team and domain through successfully leading and consistently delivering on projects of ambiguous scope, high complexity, and critical business impact.
  • You contribute to relevant guilds, practice forums and other initiatives to improve Nintex’s DevOps and SRE discipline.
  • You have an in-depth understanding of distributed systems architecture, as well as monitoring and observability practices and tools.
  • You quickly resolve priority infrastructure issues and help other technical team members or Product Managers understand how to avoid them in the future.
  • You provide detailed estimates for work items you propose or are assigned.
  • You assist in decision-making around tooling, automation practices, and testing solutions.
  • You stay up-to-date with technology trends and use this knowledge to help your team and the broader Engineering practice.
  • You run Nintex infrastructure with IaC tools (such as Terraform) and GitHub Actions for automation, containerize our environments (Kubernetes) and leverage cloud technologies to meet our goals.
  • You build monitoring that alerts on symptoms rather than outages, using tools like Prometheus, Grafana, Alertmanager and PagerDuty.
Job Responsibility
  • You are highly skilled and sufficiently experienced in Nintex DevOps tools and processes to own a long-term program or technology such as Kubernetes.
  • You write scripts, tools and utilities that support and integrate with delivery pipelines and you integrate telemetry where appropriate.
  • You are called into incidents and bring trusted knowledge in your platform domain.
  • You debug and fix infrastructure issues on production environments quickly using the relevant tools and guidelines to prevent recurrence.
  • You build, promote and support infrastructure patterns and practices within Nintex.
  • You provide coaching/mentoring to other Engineers on the team.
  • You lead or contribute to post-mortems for incidents, including root cause analysis and identification of preventative and remedial actions.
  • You continuously monitor our platform performance and take immediate action to improve it.
  • You review and advise on appropriate design patterns to solve automation and infrastructure problems without creating technical debt.
  • You design and build complex infrastructure components for distributed systems such as Kubernetes.
What we offer
  • Global Gratitude and Recharge Days
  • Flexible, paid time off policy
  • Employee wellness programs and counseling resources
  • Meaningful peer recognition and awards
  • Paid parental leave
  • Invention/patenting assistance
  • Community impact, paid volunteer time, and opportunities
  • Intercultural learning and celebration
  • Multiple tools through which to learn and grow, and an incredible global community

Site Reliability Engineer

As a Staff Software Engineer, you will play a key role in designing, building, a...
Location: United States, San Jose
Salary: 120,500 - 243,000 USD / Year
Company: Hewlett Packard Enterprise
Expiration Date: Until further notice
Requirements
  • Minimum of 5 years of hands-on experience in Infra Ops, Dev Ops, or Site Reliability Engineering (SRE)
  • Proficiency with Linux systems, especially Debian-based distributions
  • Strong experience with cloud platforms such as AWS and GCP
  • Expertise in Infrastructure as Code tools like Terraform, Packer, and Ansible
  • Solid programming skills in Python and/or Golang
  • Deep understanding of containerization (Docker, Container) and orchestration tools (AWS EKS, GCP GKE)
  • Experience with GitOps workflows
  • Proven track record in implementing and maintaining CI/CD pipelines
  • Strong background in security and familiarity with security programs
  • Experience with monitoring and logging tools (Prometheus, Grafana, ELK)
Job Responsibility
  • Enhance Infrastructure as Code (IaC) and enforce best practices
  • Optimize cloud infrastructure for scalability, security, and cost-effectiveness
  • Develop internal tools to support and streamline cloud platform operations
  • Improve CI/CD pipelines and deployment workflows using FluxCD and Jenkins
  • Address container image vulnerabilities and standardize remediation processes
  • Build Amazon Machine Images (AMIs) aligned with CIS and STIG benchmarks
  • Strengthen monitoring, alerting, and observability using Prometheus, Grafana, and logging tools
  • Troubleshoot complex production issues to ensure system reliability and customer satisfaction
  • Fine-tune distributed systems such as Apache Kafka and Cassandra
  • Collaborate with development, security, and operations teams to align infrastructure with application needs
What we offer
  • Health & Wellbeing
  • Personal & Professional Development
  • Unconditional Inclusion
Employment Type: Full-time

Trade Floor Site Reliability Engineer

Join us at Barclays as a Trade Floor Site Reliability Engineer, providing real‑t...
Location: United Kingdom, London
Salary: Not provided
Company: Barclays
Expiration Date: Until further notice
Requirements
  • Experience in systems engineering, including Linux and Windows, networking, Kubernetes and cloud infrastructure
  • Proficiency in automation tools
  • Proficiency in implementing monitoring, alerting and observability for critical trading platforms
  • The ability to manage incidents effectively, troubleshoot issues swiftly, proactively communicate and perform root cause analysis to prevent future incidents
  • Prior experience supporting Credit or other IB asset classes such as Rates, Equities, or FX
  • Experience working with PaaS products, including some experience of virtualization, containerization, or orchestration of compute/network/storage
Job Responsibility
  • Providing real‑time support to Credit EMEA traders and sales teams to keep critical trading platforms stable and performant
  • Ensuring seamless client service as electronic and algo trading rapidly expand
  • Provision of technical support for the service management function to resolve more complex issues
  • Execution of preventative maintenance tasks on hardware and software and utilisation of monitoring tools/metrics
  • Maintenance of a knowledge base containing detailed documentation
  • Analysis of system logs, error messages and user reports to identify root causes
  • Automation, monitoring enhancements, capacity management, resiliency, business continuity management, front office specific support and stakeholder management
  • Identification and remediation of potential service impacting risks and issues
  • Proactive assessment of support activities, implementing automations where appropriate
What we offer
  • Competitive holiday allowance
  • Life assurance
  • Private medical care
  • Pension contribution
Employment Type: Full-time

Site Reliability Engineer

Qargo is a cloud-based (SaaS) Transport Management Platform. We are a scale-up b...
Location: Belgium, Ghent
Salary: Not provided
Company: Qargo
Expiration Date: Until further notice
Requirements
  • Experience as a Software Engineer, with an interest in infrastructure, scalability, reliability
  • Strong programming skills (preferably Python or similar backend languages)
  • Experience working with cloud platforms, container orchestrators, serverless (preferably Google Cloud)
  • Familiarity with distributed systems and scalability challenges
  • Experience with CI/CD pipelines and automation
  • Solid understanding of databases and performance tuning (SQL and/or NoSQL)
  • Familiarity with monitoring and observability tools
  • A problem-solving mindset and the ability to think in systems
  • Strong collaboration skills and a proactive approach to improving systems
Job Responsibility
  • Build and maintain systems and tooling that improve the reliability, scalability, and performance of our platform
  • Improve the software delivery cycle, focusing on automation and developer experience
  • Develop internal tools and services to reduce manual operational work
  • Improve observability by implementing monitoring, logging, and alerting across systems
  • Optimize system performance, including databases such as PostgreSQL and Firestore
  • Collaborate with backend engineers and other engineering teams to design reliable and scalable system architectures
  • Troubleshoot complex production issues and implement long-term fixes
  • Continuously improve infrastructure (Infrastructure as Code, automation, etc.)
What we offer
  • A fast-growing SaaS company with a strong mission and an impact-driven team
  • A flexible work environment with flexible hours and hybrid working
  • A green office with a great atmosphere and lots of initiatives
  • A role with a lot of responsibility, ownership, and tangible impact
  • The opportunity to grow with us and shape both your career and our platform
Employment Type: Full-time