Development Platform Site Reliability Engineer Job at Barclays (Pune)

Site Reliability Engineer Platform Engineer

Join a mission-driven, national financial services organization at the heart of ...

Location

United States, Reston

Salary:

Not provided

Tier4 Group

Expiration Date

Until further notice

Requirements

5+ years hands-on operating and managing Kubernetes and OpenShift clusters
Strong experience with Microsoft Azure (compute, networking, storage, and data services)
Proven skills in automation and Infrastructure-as-Code (Terraform, Ansible, GitOps)
Proficiency with observability tooling (Datadog, Prometheus, Grafana)
Scripting/coding ability in Bash, Python, or Go

Job Responsibility

Operate, tune, and optimize OpenShift/Kubernetes clusters (scheduling, ingress, upgrades, quotas, policies)
Stand up and/or refine observability (Datadog, Prometheus, Grafana)—dashboards, alerts, SLOs, runbooks
Map current hybrid topology and critical delivery pipelines; identify toil and prioritize automation (Terraform/Ansible)
Begin supporting Azure environments (compute, networking, storage, data services) used by analytics teams
Drive GitOps-first workflows; harden CI/CD with ArgoCD/Jenkins/GitHub Actions and policy-as-code guardrails
Implement or enhance platform services (Vault, Kafka/AMQ, ingress, service mesh) for dev and data teams
Lead incident response and postmortems; institutionalize RCA, blameless learning, and continuous improvement

Fulltime

Big Data/Data Platform Site Reliability Engineer

About PulsePoint: PulsePoint is a fast-growing healthcare technology company (wi...

Location

United Kingdom

Salary:

Not provided

PulsePoint

Expiration Date

Until further notice

Requirements

Strong hands-on experience operating large-scale Linux infrastructure in production (Rocky Linux or equivalent)
Deep practical knowledge of Apache Hadoop-based data platforms, including HDFS architecture and failure modes, Kerberos-based security models, and the operational lifecycle (upgrades, scaling, recovery)
Experience running Apache Kafka clusters in production, including KRaft-based setups
Proven ability to debug complex distributed system issues across storage, compute, and networking layers
Experience designing or improving automation, deployment, or GitOps-style workflows
Proficiency in scripting or automation (Python, Shell, etc.)
Solid understanding of networking fundamentals (TCP/IP, DNS, load balancing, basic network security concepts)
Comfortable taking technical ownership, driving reliability improvements, and participating in on-call / incident processes
Willing and able to work East Coast U.S. hours (9am–6pm EST)

Job Responsibility

Deploying, configuring, monitoring and maintaining multiple big data stores across multiple datacenters, with a strong focus on reliability, scalability, and operational excellence
Performing planning, configuration, deployment, and lifecycle management of critical data infrastructure
Managing large-scale Linux infrastructure to ensure maximum uptime and predictable performance
Developing and documenting system configuration standards, operational procedures, and best practices
Performance and reliability testing, including reviewing configuration, software choices, versions, and hardware specifications
Participating in incident response, root cause analysis, and driving long-term reliability improvements
Advancing our technology stack with innovative ideas and pragmatic solutions

Cloud Platform Engineer (Site Reliability)

We have an exciting opportunity for a Cloud Platform Engineer (Site Reliability)...

Location

United States, Houston

Salary:

Not provided

Amentum

Expiration Date

Until further notice

Requirements

Typically requires a bachelor’s degree or equivalent certification in a related area and 10 years of experience in the field or in a related area
Strong experience with Kubernetes in production
Ability to manage and use GitLab (preferably very proficient)
Hands-on experience with CI/CD pipeline tools
Experience with observability and monitoring tools such as Grafana and Superset
Proficiency with Infrastructure-as-Code utilizing Terraform for infrastructure automation and/or open source alternatives (OpenTofu)
Extensive Linux experience (familiarity with Windows also preferred, but not required)
Expert in at least one programming language (Go or Python preferred)
Experience with Python and SQL (R is a plus)
Working understanding of Machine Learning Model Lifecycle management (preferred)

Job Responsibility

Developing new cloud-native platform services spanning all three major cloud environments
Developing best practices for cloud-native application development and promoting them within the organization
Administering NASA cloud networks and managing requests for deployment of COTS and Cloud Native applications into cloud environments
Writing quality code, providing quality and engaged code reviews for peers
Working with Managed Kubernetes offerings across all three major cloud providers
Integrating cloud managed AI and data services with other bespoke and open-source Kubernetes applications
Identifying opportunities to abstract Prospective Project requirements and develop Enterprise-grade, multi-tenant Platform Services
Collaborating with NASA security and compliance teams to ensure teams are adhering to industry best practices and regulatory requirements
Working directly with NASA human spaceflight missions like Orion, Lunar Gateway, Artemis

What we offer

Excellent personal and professional career growth
9/80 work schedule (every other Friday off), when applicable
Onsite cafeteria (breakfast & lunch)
Health, dental, and vision insurance
Paid time off and holidays
Retirement benefits (including 401(k) matching)
Educational reimbursement
Parental leave
Employee stock purchase plan
Tax-saving options

Fulltime

Senior Site Reliability Engineer Cloud Platform

Zilliz is a fast-growing startup developing the industry’s leading vector databa...

Location

Salary:

175,000 - 225,000 USD / Year

Zilliz

Expiration Date

Until further notice

Requirements

4+ years of experience in site reliability engineering or similar roles with a focus on cloud-native systems
Proficiency in scripting languages such as Python, Go, or Java
Strong knowledge of container orchestration technologies like Kubernetes and Docker
Expertise with cloud platforms such as AWS, GCP, or Azure, and their respective monitoring and management tools
Experience with infrastructure as code tools such as Terraform or Ansible
Familiarity with CI/CD tools such as Jenkins, GitLab CI, or Argo
Proven ability to troubleshoot complex distributed systems and resolve issues promptly
Bachelor’s degree or above in computer science, software engineering, or other relevant disciplines
Ability to thrive in a fast-paced, startup environment and handle multiple projects simultaneously

Job Responsibility

Work at the intersection of development and site reliability, creating SRE tools and systems and supporting existing infrastructure and platforms
Ensure the reliability, availability, and performance of Zilliz’s distributed database systems
Develop and implement strategies for monitoring, incident management, and disaster recovery
Automate system operations and maintenance tasks to improve efficiency and reduce manual intervention
Design and build tools to manage and monitor infrastructure, ensuring scalability and robustness
Collaborate with software engineers to enhance system reliability, scalability, and performance
Maintain and improve the CI/CD pipeline to ensure smooth and rapid deployment of changes
Actively contribute to the Milvus Vector Database open-source community, focusing on improving reliability and operational efficiency

Fulltime

Site Reliability Engineer - Container Platform

Join Barclays as a Site Reliability Engineer - Container Platform role, where yo...

Location

India, Pune; Chennai

Salary:

Not provided

Barclays

Expiration Date

Until further notice

Requirements

Minimum Qualification – bachelor’s degree
Experience configuring, using, or maintaining Kubernetes (OpenShift, EKS, AKS, or ArgoCD)
Experience in developing and coding software using Python or Golang
Experience with Docker, Containers and Cloud-Native utilities and software

Job Responsibility

Ensure the availability, performance, and scalability of systems and services through proactive monitoring, maintenance, and capacity planning
Resolve, analyse, and respond to system outages and disruptions, and implement measures to prevent similar incidents from recurring
Develop tools and scripts to automate operational processes, reducing manual workload, increasing efficiency, and improving system resilience
Monitor and optimise system performance and resource usage, identify and address bottlenecks, and implement best practices for performance tuning
Collaborate with development teams to integrate best practices for reliability, scalability, and performance into the software development lifecycle, and work closely with other teams to ensure smooth and efficient operations
Stay informed of industry technology trends and innovations, and actively contribute to the organization's technology communities to foster a culture of technical excellence and growth

What we offer

Competitive holiday allowance
Life assurance
Private medical care
Pension contribution

Fulltime

Site Reliability Engineer - Data Platform Operation

Join our Data & AI Platform team as a Site Reliability Engineer (SRE) – Platform...

Location

Brazil, Sao Paulo

Salary:

Not provided

Amaris Consulting

Expiration Date

Until further notice

Requirements

Academic background: Bachelor’s or Master’s degree in Computer Science, Information Technology, or related field (minimum 3 years of experience)
Experience: 5+ years hands-on with cloud platforms (Azure, AWS, GCP), programming (Bash, PowerShell, Terraform, Python, Java), and Infrastructure as Code (IaC)
English language: Professional working proficiency in English and the local language
Tools / software: Deep expertise in Azure, Databricks, Unity Catalog, Kubernetes, Helm, Docker, Power BI, Datadog, Grafana, GitHub, Azure DevOps, ArgoCD, Airflow, SSIS, Power Query, and relational/NoSQL databases
AI experience: Experience supporting enterprise Data & AI platforms
Soft skills: Analytical problem-solving; effective communication and active listening; team player with respect for others; strong troubleshooting and platform monitoring skills; automation (Python, PowerShell, CLI, KQL, Terraform)

Job Responsibility

Support, manage, and maintain Azure resources: Azure SQL, Synapse, Data Factory, Databricks, Unity Catalog
Monitor Azure workloads, troubleshoot incidents, alerts, and performance bottlenecks
Implement and manage RBAC, identity & access policies, and compliance controls
Optimize Azure cost and performance using Azure Monitor, Datadog, and Cost Management tools
Automate tasks using PowerShell, Azure CLI, Terraform, and Python
Utilize Git, GitHub Actions, and Airflow for workflow automation
Provide L2/L3 support for data pipelines, reporting, and cloud services
Conduct incident response, root cause analysis (RCA), and proactive issue resolution
Collaborate with Cloud Engineering, Data Engineers, BI Developers, and Cloud Architects
Follow ITSM processes: Incident, Change, and Problem Management

What we offer

An international community bringing together 110+ different nationalities
An environment where trust has a central place: 70% of our key leaders started their careers at the first level of responsibility
A robust training system with our internal Academy and 250+ available modules
A vibrant workplace that frequently gathers for internal events (afterworks, team buildings, etc.)
Strong commitments to CSR, notably through participation in our WeCare Together program

Senior Site Reliability Engineer - Automation Platform

Join a team of passionate and hardworking entrepreneurs to transform healthcare!...

Location

France, Paris

Salary:

Not provided

Doctolib

Expiration Date

Until further notice

Requirements

At least 5 years of site reliability engineering experience
Experience with AWS, Terraform, Kubernetes, and GitHub Actions, supporting the deployment of applications developed on the JVM and/or in TypeScript
Proactive, curious, collaborative and eager to learn
Proven experience with cloud services such as AWS, Azure or Google Cloud
Solid understanding of containerization and orchestration technologies (Docker and Kubernetes)
Proficiency in at least one programming language (Go, Java, Ruby, Python etc.) and a deep understanding of infrastructure as code principles

Job Responsibility

Collaborating with Feature teams to ensure services align with developer needs
Driving improvements by evaluating new technologies and processes
Defining best practices (golden paths) for software development and deployment
Developing and maintaining tools and services that facilitate implementation of best practices
Ensuring reliability, scalability, traceability, and monitoring of services and infrastructure
Collaborating on roadmap delivery

What we offer

Free Health Insurance for you
Up to 14 days of RTT
A flexible workplace policy offering both hybrid and office-based modes
Flexibility days allowing you to work from EU countries and the UK 10 days per year
Wellbeing program with free mental health and coaching through moka.care
Special support package for caregivers and workers with disabilities
Lunch voucher with Swile card
Work Council subsidy for sport club membership or creative activities
Bicycle subsidy
Public transportation reimbursement

Fulltime

Senior Site Reliability Engineer - Automation Platform

Join a team of passionate and hardworking entrepreneurs to transform healthcare....

Location

Germany, Berlin

Salary:

Not provided

Doctolib

Expiration Date

Until further notice

Requirements

At least 5 years of site reliability engineering experience
Experience with AWS, Terraform, Kubernetes, and GitHub Actions, supporting the deployment of applications developed on the JVM and/or in TypeScript
Proactive, curious, collaborative and eager to learn
Proven experience with cloud services such as AWS, Azure or Google Cloud
Solid understanding of containerization and orchestration technologies (Docker and Kubernetes)
Proficiency in at least one programming language (Go, Java, Ruby, Python etc.) and a deep understanding of infrastructure as code principles

Job Responsibility

Collaborating with Feature teams to ensure services align with developer needs
Driving improvements by evaluating new technologies and processes
Defining best practices ("golden paths") for software development and deployment
Developing and maintaining tools and services that facilitate best practices
Ensuring reliability, scalability, traceability, and monitoring of services and infrastructure
Collaborating on roadmap delivery

What we offer

Company health insurance through partner Allianz
Minimum 28 days of paid leave
Parent Care Program: one additional month of leave on top of legal parental leave
Free mental health and coaching services through partner Moka.care
For caregivers and workers with disabilities, a package including adaptation of remote policy, extra days off for medical reasons, and psychological support
Flexible workplace policy offering both hybrid and office-based modes
Work from EU countries and the UK for up to 10 days per year
Reimbursement of public transportation

Fulltime
