Data – Site Reliability Engineer Job at Optiver (Sydney)

Data Site Reliability Engineer

We are hiring Data Site Reliability Engineers (Data SREs) to join our Global Dat...

Location

China, Shanghai

Salary:

Not provided

Optiver

Expiration Date

Until further notice

Requirements

Experience operating or supporting production data systems, such as data pipelines, ingestion frameworks, or analytical platforms
Strong proficiency in Python, with experience using libraries like Pandas, Arrow, and Spark
Solid understanding of data modeling, normalization, and API development in support of large-scale analytical or trading systems
Experience working with lakehouse architectures (e.g., Delta Lake, Databricks, AWS) to manage large-scale, high-quality analytical datasets is preferred
An operations mindset with a strong sense of ownership, reliability, and continuous improvement
Ability to operate with a high degree of autonomy, making sound engineering and operational decisions for production systems
Strong communication skills and the ability to work effectively across teams and time zones
Experience with external data ingestion is a plus
Exposure to PCAP-based data workflows or market data capture pipelines is a plus
Willingness to contribute to setting best practices and mentoring engineers on operational excellence and production reliability

Job Responsibility

Configure, launch, and maintain robust data pipelines (ETL/ELT) for ingesting and transforming datasets critical to research and trading
Own the end-to-end reliability of critical production data pipelines, from ingestion through downstream consumption
Ensure data quality and consistency through validation, monitoring, and robust engineering practices
Act as the first point of contact for data incidents, investigating failures, data quality issues, and pipeline regressions, and driving them through to resolution
Participate in incident response, root cause analysis, and post-incident reviews, with a focus on preventing recurrence
Manage daily releases, backfills, and ad-hoc data runs, with a strong focus on safety, traceability, and environment segregation
Design and improve monitoring, alerting, data quality checks, and operational runbooks, ensuring issues are detected early and alerts are actionable
Build automation and tooling to reduce manual operational work and enable the platform to scale safely
Partner closely with data engineering, platform, and trading-facing teams to ensure data systems are reliable, well-understood, and fit for purpose

What we offer

A performance-based bonus structure unmatched anywhere in the industry
The chance to work alongside diverse and intelligent peers in a rewarding environment
Training, mentorship and personal development opportunities
Daily breakfast, lunch and snacks
Gym membership, sports and leisure activities, plus weekly in-house chair massages
Regular social events, clubs and Friday afternoon drinks

Site Reliability Engineer - Kubernetes - Data Platforms

You will be building the rails of a self-service data platform inside Adyen, cre...

Location

Netherlands, Amsterdam

Salary:

Not provided

Adyen

Expiration Date

Until further notice

Requirements

Experienced Platform/SRE Professional
Technical Expertise
Tooling & Ecosystems
Observability Mindset
Good to have: A background in Software Engineering, specialized networking, or GPU management. Familiarity with data ecosystem tools like Airflow and HDFS is highly appreciated.
Ambitious & Collaborative

Job Responsibility

Design & Build On-Premises (Kubernetes) Infrastructure
Cluster Provisioning & Reliability
Mixed Workload Balancing
Advanced Scheduling & Hardware Management
Storage & Network Optimization
FinOps & Security
Automation & Operations

Fulltime

Staff Engineer, Site Reliability Engineer

OnStar is a cornerstone of General Motors' connected services—bringing safety, s...

Location

Ireland, Dublin

Salary:

Not provided

General Motors

Expiration Date

Until further notice

Requirements

8+ years in SRE, DevOps, or systems engineering, including experience managing or mentoring high-impact teams
Track record of building and maintaining high-scale, cloud-native systems (preferably AWS, GCP, or Azure)
Expertise in container orchestration and deployment strategies using Kubernetes and CI/CD pipelines
Proficiency in Python, Go, or Java, with strong code review and readability standards
Experience leading cross-functional infrastructure projects, configuration strategy, or organizational tooling initiatives
Ability to think and act under pressure
Strong communication skills

Job Responsibility

Lead the design and implementation of scalable, fault-tolerant, and observable infrastructure supporting OnStar mobile and web experiences, in-vehicle services, and the backend platforms and integrations that power them
Champion configuration management, infrastructure refactoring, and testing frameworks to strengthen system resilience
Partner across SRE, development, and product teams to improve service reliability, deployment safety, and incident response practices
Drive internal consultation and strategic planning on reliability standards for new OnStar capabilities, customer-facing releases, and platform initiatives
Define and evolve observability strategy using tools such as Prometheus, Grafana, and Datadog, with automated alerting and actionable SLO dashboards
Own and improve on-call practices, manage blameless postmortems, and guide root cause analysis to eliminate recurring failures
Mentor engineers and help shape a high-performance culture rooted in extreme ownership and operational excellence
Support compliance and privacy-driven engineering initiatives across connected services, with potential crossover into areas like data retention and safety certification tooling

Fulltime

Site Reliability Engineer Platform Engineer

Join a mission-driven, national financial services organization at the heart of ...

Location

United States, Reston

Salary:

Not provided

Tier4 Group

Expiration Date

Until further notice

Requirements

5+ years hands-on operating and managing Kubernetes and OpenShift clusters
Strong experience with Microsoft Azure (compute, networking, storage, and data services)
Proven skills in automation and Infrastructure-as-Code (Terraform, Ansible, GitOps)
Proficiency with observability tooling (Datadog, Prometheus, Grafana)
Scripting/coding ability in Bash, Python, or Go

Job Responsibility

Operate, tune, and optimize OpenShift/Kubernetes clusters (scheduling, ingress, upgrades, quotas, policies)
Stand up and/or refine observability (Datadog, Prometheus, Grafana)—dashboards, alerts, SLOs, runbooks
Map current hybrid topology and critical delivery pipelines; identify toil and prioritize automation (Terraform/Ansible)
Begin supporting Azure environments (compute, networking, storage, data services) used by analytics teams
Drive GitOps-first workflows; harden CI/CD with ArgoCD/Jenkins/GitHub Actions and policy-as-code guardrails
Implement or enhance platform services (Vault, Kafka/AMQ, ingress, service mesh) for dev and data teams
Lead incident response and postmortems; institutionalize RCA, blameless learning, and continuous improvement

Fulltime

Intermediate Site Reliability Engineer SRE – AI Reliability & Automation

At PointClickCare our mission is simple: to help providers deliver exceptional c...

Location

Canada, Mississauga

Salary:

115000.00 - 128000.00 CAD / Year

PointClickCare

Expiration Date

Until further notice

Requirements

5+ years' experience in software engineering
Experience with SRE principles
Experience with AI/ML in production environments
A passion for automation, intelligent systems, and operational excellence
Strong debugging, problem-solving, and system design skills
Languages: Python, Java, Bash, Terraform
Platforms: Azure, Kubernetes, Docker
Tools: Datadog, Prometheus, AppDynamics, ELK, GitHub Actions
ML/AI: MCP framework, AI agents, Vector store, Agent orchestration (LangChain), RAG
CI/CD: Jenkins, ArgoCD, Spinnaker

Job Responsibility

Build ML-based anomaly detection and pattern recognition systems
Enhance telemetry with smart tagging and metadata for better AI insights
Develop event-driven workflows and self-healing systems using AI triggers
Automate incident response with generative AI and custom AI agent orchestration
Use time-series forecasting and predictive modelling to anticipate failures
Optimise infrastructure with AI-powered autoscaling and cost-aware resource allocation
Build scalable, fault-tolerant systems in a cloud-native environment
Participate in on-call rotations and lead incident response for critical systems
Integrate APIs for streamlined data exchange and system connectivity
Run internal AIOps workshops and help teams adopt AI maturity models

What we offer

Benefits starting from Day 1!
Retirement Plan Matching
Flexible Paid Time Off
Wellness Support Programs and Resources
Parental & Caregiver Leaves
Fertility & Adoption Support
Continuous Development Support Program
Employee Assistance Program
Allyship and Inclusion Communities
Employee Recognition … and more!

Fulltime

Senior Site Reliability Engineer

We are looking for an experienced Site Reliability Engineer to join our growing ...

Location

Italy, Milan

Salary:

50000.00 - 70000.00 EUR / Year

iGenius

Expiration Date

Until further notice

Requirements

Bachelor's or Master's degree in Computer Science, Computer Engineering, Electrical Engineering, or a related field
At least 6 years of experience as a Site Reliability Engineer or in similar roles
Strong experience with observability and monitoring systems such as Prometheus, Thanos, Grafana, and OpenTelemetry
Experience with low-level system instrumentation and performance visibility using technologies such as eBPF
Experience with security monitoring and threat detection tools such as Zeek, Wazuh, or equivalent SIEM / security observability platforms
Strong experience with containerized and cloud-native environments, particularly Kubernetes
Strong software development skills, particularly in Python, with the ability to build automation, integrations, and custom tooling
Experience integrating heterogeneous infrastructure systems across multiple vendors, APIs, and evolving tool ecosystems
Familiarity with modern infrastructure automation and emerging agent-based frameworks such as MCP / A2A (or equivalent technologies)
Exposure to digital twin technologies and simulation platforms such as NVIDIA Omniverse or equivalent

Job Responsibility

Design and implement observability and control mechanisms that extract operational data from infrastructure and feed it into automated systems to enable continuous optimization, including key system budgets such as power, cooling, and service-level and security-level objectives
Actively guard and maintain these operational budgets as part of day-to-day system reliability and performance management
Contribute to operational excellence through blameless post-mortem analysis and structured incident learning, ensuring continuous improvement of system behavior and resilience
Work closely with Platform Engineering in a shared cybersecurity model, where SRE focuses on detection and monitoring, while Platform Engineering ensures the secure design and operation of the underlying infrastructure

What we offer

Learning Friday
Training budget for books, online courses or other training materials
Smart Working (remote work opportunities)
Opportunity to receive company equity
Stock options

Fulltime

Site Reliability Engineer

Engineer the future of global finance. At Citi, our Tech team doesn’t just suppo...

Location

Canada, Mississauga

Salary:

94300.00 - 141500.00 USD / Year

Citi

Expiration Date

Until further notice

Requirements

5–8 years of relevant experience in technical support, platform operations, or engineering
Exposure to architecture concepts with the ability to contribute to technical discussions and understand design decisions
Experience working with business partners, engineering teams, or technology stakeholders
Demonstrated experience supporting IT services, platform operations, or infrastructure components
Strong verbal and written communication skills, with the ability to document technical issues clearly
Experience supporting operational workstreams or participating in platform improvement initiatives
Participation in resilience‑related or stability‑focused activities preferred
Ability to collaborate effectively with cross‑functional teams
Strong organizational skills and ability to manage daily workload and task priorities
Working knowledge of Generative AI concepts preferred

Job Responsibility

Understand how application support functions within the broader technology organization and contributes to business objectives
Assist with vendor coordination and day‑to‑day interactions with offshore managed services
Support efforts to improve service levels, including participating in incident management, problem management, and knowledge‑sharing initiatives
Partner with development and engineering teams to support application stability and operational readiness
Assist in collecting capacity, performance, and latency data to support platform planning efforts
Support application onboarding activities using established guidelines and standards
Contribute to fostering a collaborative and supportive team environment that encourages skill development
Participate in cost‑efficiency initiatives such as Root Cause Analysis reviews, knowledge management, and performance tuning
Assist in preparing materials for business review meetings and help align technology activities with business needs
Follow established support processes and tool standards and provide input on improvement opportunities

Fulltime

Site Reliability Engineer (SRE) - Identity Access Management IAM

Join us as a Site Reliability Engineer (SRE) - Identity Access Management. You w...

Location

India, Pune

Salary:

Not provided

Barclays

Expiration Date

Until further notice

Requirements

Experience in designing, implementing, deploying, and running highly available, fault-tolerant, auto-scaling and auto-healing systems
Strong expertise in AWS (essential; Azure and GCP (Google Cloud Platform) are a plus), including Kubernetes (ECS is essential; Fargate and GCE are a plus) and serverless architectures
Strong experience in running disaster recovery and zero-downtime solutions, and in designing and implementing continuous delivery across large-scale, distributed, cloud-based microservice and API service solutions with 99.9%+ uptime
Hands-on experience coding in Python, Bash and JSON/YAML (Configuration as Code)
The ability to drive reliability best practices across engineering teams, embed SRE principles into the DevSecOps lifecycle, and partner with engineering, security and product teams to balance reliability and feature velocity
Experience in hands-on configuration, deployment and operation of ForgeRock COTS-based IAM (Identity Access Management) solutions (PingGateway, PingAM, PingIDM, PingDS) with embedded security gates, HTTP header signing, access token and data-at-rest encryption, PKI-based self-sovereign identity, or open source

Job Responsibility

Apply software engineering techniques, automation, and incident-response best practices to ensure the reliability, availability, and scalability of systems, platforms, and technology
Ensure the availability, performance, and scalability of systems and services through proactive monitoring, maintenance, and capacity planning
Resolve, analyse, and respond to system outages and disruptions, and implement measures to prevent similar incidents from recurring
Develop tools and scripts to automate operational processes, reducing manual workload, increasing efficiency, and improving system resilience
Monitor and optimise system performance and resource usage, identify and address bottlenecks, and implement best practices for performance tuning
Collaborate with development teams to integrate best practices for reliability, scalability, and performance into the software development lifecycle, and work closely with other teams to ensure smooth and efficient operations
Stay informed of industry technology trends and innovations, and actively contribute to the organization's technology communities to foster a culture of technical excellence and growth

What we offer

Competitive holiday allowance
Life assurance
Private medical care
Pension contribution

Fulltime
