Data Site Reliability Engineer

China, Shanghai · Job Posted January 29, 2026

Job Description

We are hiring Data Site Reliability Engineers (Data SREs) to join our Global Data Engineering organization. This is an operations-focused role, responsible for ensuring the reliability, correctness, and availability of Optiver’s most critical data pipelines and platforms, many of which directly support trading and research workflows. You will be part of a globally distributed team, operating and monitoring data systems across regions.

In practice, this role combines deep technical ownership with operational responsibility. Data SREs work closely with data engineers, platform teams, and trading-facing stakeholders to ensure production data systems meet strict reliability, freshness, and correctness expectations. Beyond keeping systems running, they help evolve how we operate data platforms at scale by improving automation, observability, and operational standards across the firm.

The systems supported by Data SREs include large-scale batch and streaming data pipelines, data quality and freshness monitoring, and globally operated platforms supported by follow-the-sun operational models. These systems are often on the critical path for trading and research, where data delays or inaccuracies can have immediate business impact.

Job Responsibility

  • Configure, launch, and maintain robust data pipelines (ETL/ELT) for ingesting and transforming datasets critical to research and trading
  • Own the end-to-end reliability of critical production data pipelines, from ingestion through downstream consumption
  • Ensure data quality and consistency through validation, monitoring, and robust engineering practices
  • Act as the first point of contact for data incidents, investigating failures, data quality issues, and pipeline regressions, and driving them through to resolution
  • Participate in incident response, root cause analysis, and post-incident reviews, with a focus on preventing recurrence
  • Manage daily releases, backfills, and ad-hoc data runs, with a strong focus on safety, traceability, and environment segregation
  • Design and improve monitoring, alerting, data quality checks, and operational runbooks, ensuring issues are detected early and alerts are actionable
  • Build automation and tooling to reduce manual operational work and enable the platform to scale safely
  • Partner closely with data engineering, platform, and trading-facing teams to ensure data systems are reliable, well-understood, and fit for purpose

Requirements

  • Experience operating or supporting production data systems, such as data pipelines, ingestion frameworks, or analytical platforms
  • Strong proficiency in Python, with experience using libraries like Pandas, Arrow, and Spark
  • Solid understanding of data modeling, normalization, and API development in support of large-scale analytical or trading systems
  • Experience working with lakehouse architectures (e.g., Delta Lake, Databricks, AWS) to manage large-scale, high-quality analytical datasets is preferred
  • An operations mindset with a strong sense of ownership, reliability, and continuous improvement
  • Ability to operate with a high degree of autonomy, making sound engineering and operational decisions for production systems
  • Strong communication skills and the ability to work effectively across teams and time zones
  • Willingness to contribute to setting best practices and mentoring engineers on operational excellence and production reliability

Nice to have

  • Experience with external data ingestion
  • Exposure to PCAP-based data workflows or market data capture pipelines

What we offer

  • A performance-based bonus structure unmatched anywhere in the industry
  • The chance to work alongside diverse and intelligent peers in a rewarding environment
  • Training, mentorship and personal development opportunities
  • Daily breakfast, lunch and snacks
  • Gym membership, sports and leisure activities, plus weekly in-house chair massages
  • Regular social events, clubs and Friday afternoon drinks

Similar Jobs for Data Site Reliability Engineer

8 matching positions

Senior Site Reliability Engineer - Data Pipeline

Bloomreach is building the world’s premier agentic platform for personalization....
Location: Czechia
Salary: Not provided
Company: Bloomreach
Expiration Date: Until further notice
Requirements
  • You can articulate how your contributions have transformed the way engineers work and think by fostering a strong DevOps/SRE culture
  • You can demonstrate how impactful your work as an SRE or DevOps Engineer can be in connection to business success
  • You understand the importance of the "you build it, you run it" principle and you love the feeling of ownership
  • You are mindful of the costs associated with running our service, which translates into effective vertical and horizontal pod autoscaling and detailed telemetry insights
  • You believe infrastructure as code is the only thing that can bring stability into chaos
  • Terraform is your daily bread, and Helm deployments are your second-best friend
  • You use telemetry data and metrics to provide feedback to engineers on how the application and services behave
  • You can find your way around complex service architectures using distributed debugging
  • You have experience with Python and a solid grasp of engineering practices
  • You don’t hesitate to participate in the 24/7 on-call support rotation
Job Responsibility
  • Your task is to build and maintain an ecosystem where engineers can safely and efficiently develop, debug, and operate their services running on GCP and Kubernetes, using Dataflow, Dataproc, Python, and Go
  • You make sure the services have a high level of observability, enabling us to provide quality service for our customers
  • You ensure services can scale vertically and horizontally based on current load and on operational and telemetry data (OTEL, Prometheus, VictoriaMetrics)
  • You ensure the team has enough insight into the health of our services (Grafana, alerting, PagerDuty)
  • You help the team fulfill the security requirements of ISO and SOC 2 audits by enforcing security principles such as key distribution, key rotation, service-level authorisation and authentication, data encryption in transit, data isolation, resource limitations, quality of service, and audit logs (mainly via Envoy proxies)
  • You contribute to our tooling, so we have tools in place for debugging, troubleshooting, and performance testing
  • You automate manual and semi-manual deployment and instance setup steps
  • You are hands-on with L3 support and incident resolution
  • You ensure CI pipelines have linters, security scans, and code smell detection, enabling engineers to produce quality MRs
What we offer
  • A great deal of freedom and trust
  • We have defined our 5 values and the 10 underlying key behaviors that we strongly believe in
  • We believe in flexible working hours to accommodate your working style
  • We work virtual-first with several Bloomreach Hubs available across three continents
  • We organize company events to experience the global spirit of the company and get excited about what's ahead
  • We encourage and support our employees to engage in volunteering activities - every Bloomreacher can take 5 paid days off to volunteer
  • We have a People Development Program, with personal development workshops on various topics run by experts from inside the company
  • Our resident communication coach Ivo Večeřa is available to help navigate work-related communications & decision-making challenges
  • Our managers are strongly encouraged to participate in the Leader Development Program
  • Bloomreachers utilize the $1,500 professional education budget on an annual basis to purchase education products (books, courses, certifications, etc.)
  • Fulltime

Senior Site Reliability Engineer - Data Pipeline

The Data Pipeline team is a backend-focused engineering team that is built on st...
Location: Slovakia
Salary: 3500.00 EUR / Month
Company: Bloomreach
Expiration Date: Until further notice
Requirements
  • You can articulate how your contributions have transformed the way engineers work and think by fostering a strong DevOps/SRE culture
  • You can demonstrate how impactful your work as an SRE or DevOps Engineer can be in connection to business success
  • You understand the importance of the "you build it, you run it" principle and you love the feeling of ownership
  • You are mindful of the costs associated with running our service, which translates into effective vertical and horizontal pod autoscaling and detailed telemetry insights
  • You believe infrastructure as code is the only thing that can bring stability into chaos
  • Terraform is your daily bread, and Helm deployments are your second-best friend
  • You use telemetry data and metrics to provide feedback to engineers on how the application and services behave
  • You can find your way around complex service architectures using distributed debugging
  • You have experience with Python and a solid grasp of engineering practices
  • Experience with Go or with ETL pipelines is a big advantage
Job Responsibility
  • Build and maintain an ecosystem where engineers can safely and efficiently develop, debug, and operate their services running on GCP and Kubernetes, using Dataflow, Dataproc, Python, and Go
  • Make sure the services have a high level of observability, enabling us to provide quality service for our customers
  • Ensure services can scale vertically and horizontally based on current load and on operational and telemetry data (OTEL, Prometheus, VictoriaMetrics)
  • Ensure the team has enough insight into the health of our services (Grafana, alerting, PagerDuty)
  • Help the team fulfill the security requirements of ISO and SOC 2 audits by enforcing security principles such as key distribution, key rotation, service-level authorisation and authentication, data encryption in transit, data isolation, resource limitations, quality of service, and audit logs (mainly via Envoy proxies)
  • Contribute to our tooling, so we have tools in place for debugging, troubleshooting, and performance testing
  • Automate manual and semi-manual deployment and instance setup steps
  • Be hands-on with L3 support and incident resolution
  • Ensure CI pipelines have linters, security scans, and code smell detection, enabling engineers to produce quality MRs
What we offer
  • A great deal of freedom and trust
  • Flexible working hours
  • Work virtual-first with several Bloomreach Hubs available across three continents
  • Company events
  • 5 paid days off to volunteer
  • People Development Program
  • Communication coach available
  • Leader Development Program
  • $1,500 professional education budget annually
  • Employee Assistance Program with counselors
  • Fulltime

Site Reliability Engineer - Data Platform Operation

Join our Data & AI Platform team as a Site Reliability Engineer (SRE) – Platform...
Location: Brazil, Sao Paulo
Salary: Not provided
Company: Amaris Consulting
Expiration Date: Until further notice
Requirements
  • Academic background: Bachelor’s or Master’s degree in Computer Science, Information Technology, or related field (minimum 3 years of experience)
  • Experience: 5+ years hands-on with cloud platforms (Azure, AWS, GCP), programming (Bash, PowerShell, Terraform, Python, Java), and Infrastructure as Code (IaC)
  • English language: Professional working proficiency in English and the local language
  • Tools / software: Deep expertise in Azure, Databricks, Unity Catalog, Kubernetes, Helm, Docker, Power BI, Datadog, Grafana, GitHub, Azure DevOps, ArgoCD, Airflow, SSIS, Power Query, and relational/NoSQL databases
  • AI experience: Experience supporting enterprise Data & AI platforms
  • Soft skills: Analytical problem-solving
  • Effective communication and active listening
  • Team player with respect for others
  • Strong troubleshooting and platform monitoring skills
  • Automation (Python, PowerShell, CLI, KQL, Terraform)
Job Responsibility
  • Support, manage, and maintain Azure resources: Azure SQL, Synapse, Data Factory, Databricks, Unity Catalog
  • Monitor Azure workloads, troubleshoot incidents, alerts, and performance bottlenecks
  • Implement and manage RBAC, identity & access policies, and compliance controls
  • Optimize Azure cost and performance using Azure Monitor, DataDog, and Cost Management tools
  • Automate tasks using PowerShell, Azure CLI, Terraform, and Python
  • Utilize Git, GitHub Actions, and Airflow for workflow automation
  • Provide L2/L3 support for data pipelines, reporting, and cloud services
  • Conduct incident response, root cause analysis (RCA), and proactive issue resolution
  • Collaborate with Cloud Engineering, Data Engineers, BI Developers, and Cloud Architects
  • Follow ITSM processes: Incident, Change, and Problem Management
What we offer
  • An international community bringing together 110+ different nationalities
  • An environment where trust has a central place: 70% of our key leaders started their careers at the first level of responsibility
  • A robust training system with our internal Academy and 250+ available modules
  • A vibrant workplace that frequently gathers for internal events (afterworks, team buildings, etc.)
  • Strong commitments to CSR, notably through participation in our WeCare Together program

Data – Site Reliability Engineer

We’re building out our Data Reliability & Quality Engineering function in Sydney...
Location: Australia, Sydney
Salary: Not provided
Company: Optiver
Expiration Date: Until further notice
Requirements
  • 2+ years of experience in data engineering, site reliability engineering (SRE) or data operations
  • Proficient in SQL, ETL processes and at least one programming language (Python, Scala or C++)
  • Strong focus on automation, system resilience and continuous improvement
  • Passionate about data quality, observability and performance optimization
  • Structured and methodical in problem-solving, with strong debugging skills
  • Excellent communication skills and a collaborative, proactive mindset
Job Responsibility
  • Monitor and optimize the performance and stability of data pipelines to meet SLAs
  • Design and maintain data quality tests for accuracy, consistency and completeness
  • Investigate data incidents, perform root-cause analysis and implement preventive measures
  • Apply SRE principles to enhance reliability, alerting and recovery mechanisms
  • Develop automation and tooling to reduce manual intervention and improve scalability
  • Collaborate with data engineers, researchers and market data teams to align reliability goals with business priorities
What we offer
  • A performance-based bonus structure unmatched anywhere in the industry
  • The chance to work alongside diverse and intelligent peers in a rewarding environment
  • Training, mentorship and personal development opportunities
  • Daily breakfast, lunch and an in-house barista
  • Gym membership plus weekly in-house chair massages
  • Regular social events, including a company trip every two years
  • Guided relocation, a competitive relocation package and visa sponsorship where necessary

Site Reliability Engineer Platform Engineer

Join a mission-driven, national financial services organization at the heart of ...
Location: United States, Reston
Salary: Not provided
Company: Tier4 Group
Expiration Date: Until further notice
Requirements
  • 5+ years hands-on operating and managing Kubernetes and OpenShift clusters
  • Strong experience with Microsoft Azure (compute, networking, storage, and data services)
  • Proven skills in automation and Infrastructure-as-Code (Terraform, Ansible, GitOps)
  • Proficiency with observability tooling (Datadog, Prometheus, Grafana)
  • Scripting/coding ability in Bash, Python, or Go
Job Responsibility
  • Operate, tune, and optimize OpenShift/Kubernetes clusters (scheduling, ingress, upgrades, quotas, policies)
  • Stand up and/or refine observability (Datadog, Prometheus, Grafana)—dashboards, alerts, SLOs, runbooks
  • Map current hybrid topology and critical delivery pipelines; identify toil and prioritize automation (Terraform/Ansible)
  • Begin supporting Azure environments (compute, networking, storage, data services) used by analytics teams
  • Drive GitOps-first workflows; harden CI/CD with ArgoCD/Jenkins/GitHub Actions and policy-as-code guardrails
  • Implement or enhance platform services (Vault, Kafka/AMQ, ingress, service mesh) for dev and data teams
  • Lead incident response and postmortems; institutionalize RCA, blameless learning, and continuous improvement
  • Fulltime

Site Reliability Engineer / Observability Engineer

Rackspace is building up its Professional Services Center of Excellence on Appli...
Location: Egypt, Giza
Salary: Not provided
Company: Rackspace
Expiration Date: Until further notice
Requirements
  • Bachelor’s degree in engineering/computer science or equivalent
  • Senior-level experience with Site Reliability Engineering, DevOps, Code level application support and troubleshooting, AWS Infrastructure design, implementation and optimization, Automation for deployment, scaling and reliability
  • Experience with observability tools like Splunk, Datadog, SignalFx, etc.
  • Experience deploying, maintaining and supporting software applications/services in the AWS ecosystem
  • Proactive approach to identifying problems and solutions
  • Experience writing code with one or more interpreted languages such as Python, PHP, Perl, Ruby, Linux Shell
  • Experience with Terraform or Cloud Formation scripting
  • Experience with configuration management tools like Ansible, Chef or Puppet
  • Experience with standard software development best practices and tools such as code repositories (Git preferred)
  • Experience executing in an agile software development environment
Job Responsibility
  • Work with customers and implement Observability solutions
  • Build and maintain scalable systems and robust automation that supports engineering goals
  • Develop and maintain monitoring tools, alerts, and dashboards to provide visibility into system health and performance
  • Proactively gather and analyze both metric and log data from systems and applications to perform anomaly detection, performance tuning, capacity planning and fault isolation
  • Collaborate with development teams to implement and deploy new features and enhancements, ensuring they meet reliability, security and performance standards
  • Collaborate with team members to document and share solutions
  • Maintain a deep understanding of the customer’s business as well as their technical environment
  • Identify performance bottlenecks and anomalous system behavior, and resolve the root cause of service issues
  • Fulltime

Intermediate Site Reliability Engineer SRE – AI Reliability & Automation

At PointClickCare our mission is simple: to help providers deliver exceptional c...
Location: Canada, Mississauga
Salary: 115000.00 - 128000.00 CAD / Year
Company: PointClickCare
Expiration Date: Until further notice
Requirements
  • 5+ years' experience in software engineering
  • Experience with SRE principles
  • Experience with AI/ML in production environments
  • A passion for automation, intelligent systems, and operational excellence
  • Strong debugging, problem-solving, and system design skills
  • Languages: Python, Java, Bash, Terraform
  • Platforms: Azure, Kubernetes, Docker
  • Tools: Datadog, Prometheus, AppDynamics, ELK, GitHub Actions
  • ML/AI: MCP framework, AI agents, Vector store, Agent orchestration (LangChain), RAG
  • CI/CD: Jenkins, ArgoCD, Spinnaker
Job Responsibility
  • Build ML-based anomaly detection and pattern recognition systems
  • Enhance telemetry with smart tagging and metadata for better AI insights
  • Develop event-driven workflows and self-healing systems using AI triggers
  • Automate incident response with generative AI and custom AI agent orchestration
  • Use time-series forecasting and predictive modelling to anticipate failures
  • Optimise infrastructure with AI-powered autoscaling and cost-aware resource allocation
  • Build scalable, fault-tolerant systems in a cloud-native environment
  • Participate in on-call rotations and lead incident response for critical systems
  • Skilled in API integration for streamlined data exchange and system connectivity
  • Run internal AIOps workshops and help teams adopt AI maturity models
What we offer
  • Benefits starting from Day 1!
  • Retirement Plan Matching
  • Flexible Paid Time Off
  • Wellness Support Programs and Resources
  • Parental & Caregiver Leaves
  • Fertility & Adoption Support
  • Continuous Development Support Program
  • Employee Assistance Program
  • Allyship and Inclusion Communities
  • Employee Recognition … and more!
  • Fulltime

Site Reliability Engineer

We are currently seeking a Site Reliability Engineer to join our team in Guadala...
Location: Mexico, Guadalajara
Salary: Not provided
Company: NTT DATA
Expiration Date: Until further notice
Requirements
  • Understand the Microsoft Azure Cloud - ideally Azure Fundamentals certified OR Computer Science/Information Systems Management degree
  • Familiar with PaaS and IaaS - VMs, Storage, EventHub, Service Fabric Cluster (SFC), Azure Kubernetes Service (AKS), CosmosDB, SQL Server, IoT Hub, Databricks, KeyVault, Datalake
  • Understand the concept of Internet of Things (IoT) - telemetry, ingestion, processing, data storage, reporting
  • Understand tooling such as Octopus, Bamboo, Terraform, Azure DevOps, Jenkins, GitHub, Ansible
  • Understand the concept of container orchestration platforms (e.g. Kubernetes)
  • Understand scripting with PowerShell and Python
  • Understand the difference between NoSQL and SQL databases, and how to maintain them
Job Responsibility
  • Perform L1.5 activities such as monitoring, deployment, rollback
  • Monitor the efficiency of the Azure cloud systems to prevent outages and initiate an Incident Management bridge in case of an outage
  • Troubleshoot Azure resources, escalate to Level 3 (Software Development Team)
  • Fulltime