Data Reliability Engineer

Singapore · Job Posted January 09, 2026

Job Description

We are looking for a hands-on Data Reliability Engineer to join our team. This role is focused on maintaining and optimizing the data infrastructure that powers Rimes’ global platforms. We seek strong technical engineers first—those with experience profiling, tuning, and scaling large-scale data and ETL systems. Over time, there is potential to grow into a team lead role.

Job Responsibility

  • Own and maintain the core data infrastructure supporting global data delivery
  • Profile, debug, and optimize large-scale ETL and data processing pipelines for performance and reliability
  • Write and maintain Python and SQL code to support data workflows, modules, and archiving processes
  • Use Datadog and other observability tools to proactively monitor, detect, and resolve system bottlenecks
  • Collaborate with Data Developers, SRE, and Infrastructure teams to ensure system scalability and disaster recovery readiness
  • Contribute to projects involving big data platforms, SMB clusters, and Palantir Foundry
  • Continuously improve automation, processes, and system resilience

Requirements

  • Hands-on engineer with strong proficiency in Python and SQL
  • Experienced in profiling and optimizing ETL systems and large-scale data pipelines
  • Familiar with IT infrastructure; working experience in a datacentre or on any cloud platform (AWS, Azure, etc.) is a plus
  • Comfortable working with distributed systems / big data projects (SMB clusters, Palantir Foundry, Databricks, etc. are a plus)
  • Familiar with monitoring and observability tools (Datadog or similar)
  • Strong problem-solver with the ability to dive deep into complex technical issues
  • Experience in financial services or exposure to data-intensive domains is preferred
  • Degree in Computer Science, Engineering, or equivalent practical experience

Similar Jobs for Data Reliability Engineer

8 matching positions

Data Reliability Engineer

We’re looking for a Data Reliability Engineer to help keep our trading and data ...
Location: Ireland, Dublin
Salary: Not provided
Susquehanna International Group
Expiration Date: Until further notice
Requirements
  • Degree in a technical or business discipline or equivalent industry experience of 1+ years
  • Demonstrated experience with Python or equivalent language
  • Excellent analytical & troubleshooting skills, self-motivated and curious
  • Willing to work shift hours, to cover early and late responsibilities (alternating)
  • Experience with Change Management, Incident Management Procedures
  • Experience of technical documentation & support cases
Job Responsibility
  • Ensure Platform Reliability - Monitor and maintain trading-critical Airflow DAGs and Python-based pipelines, ensuring jobs run on time and within SLAs
  • Incident Response & Recovery - Triage, troubleshoot, and resolve failures quickly; validate downstream impacts and maintain tested rollback/recovery procedures
  • Change & Release Management - Act as a release gatekeeper—review code/config changes, enforce safe deployment standards, and coordinate risk-aware releases via Git(lab) and Octopus Deploy
  • Collaboration & Communication - Partner with quants and engineers to assess change impacts, document runbooks, and communicate operational updates and risks
  • Continuous Improvement - Enhance monitoring, alerting, and automation; track KPIs and drive initiatives that strengthen platform resilience and reduce incident recurrence

Senior Data Reliability Engineer

Your Mission Call of Duty is one of the most iconic and successful video game f...
Location: Canada, Vancouver
Salary: Not provided
Activision
Expiration Date: Until further notice
Requirements
  • 10+ years of programming experience
  • Extensive experience working in Python; familiarity with Go
  • Strong experience with data technologies such as SQL, Spark, and Airflow
  • Hands-on experience building observability systems using tools like OpenTelemetry, Prometheus, Loki, and Grafana
  • Experience with dashboarding and alerting for production systems
  • Secure automation of testing and deployments using GitHub Actions / Workflows (GitOps)
  • Experience with Linux system administration in production environments
  • Cloud-native deployment experience using Kubernetes, Helm, and ArgoCD
  • Experience supporting model deployments (batch and online APIs)
Job Responsibility
  • Create the ML Data pipeline used for our models including building the ML templates that are used, the observability of our models, the metrics and KPIs used to monitor their efficacy, and the automated retraining required as the data drifts
  • Design and operate large-scale, highly-available data pipelines and platforms for high-volume game telemetry
  • Ensure the integrity, trustworthiness, and quality of Anti-Cheat data
  • Partner closely with Machine Learning teams to support batch, streaming, online inference workflows, automated testing of ML artifacts, and observability and maintenance of automated deployment pipelines
  • Define and maintain GitOps workflows for secure, automated testing, integration, and deployment
  • Build comprehensive observability (metrics, logs, dashboards, alerts) into data pipelines and services
  • Own operational excellence, including incident response, root-cause analysis, and post-mortems
  • Contribute to deployment and release strategies such as canary, blue/green, and shadow deployments
What we offer
  • Medical, dental, vision, health savings account or health reimbursement account, healthcare spending accounts, dependent care spending accounts, life and AD&D insurance, disability insurance
  • 401(k) with Company match, tuition reimbursement, charitable donation matching
  • Paid holidays and vacation, paid sick time, floating holidays, compassion and bereavement leaves, parental leave
  • Mental health & wellbeing programs, fitness programs, free and discounted games, and a variety of other voluntary benefit programs like supplemental life & disability, legal service, ID protection, rental insurance, and others
  • If the Company requires that you move geographic locations for the job, then you may also be eligible for relocation assistance
  • Employment type: Full-time

Senior Platform Engineer - Data Reliability

The Feedzai Platform Data Reliability play a pivotal role in managing core data ...
Location: Portugal
Salary: Not provided
Feedzai
Expiration Date: Until further notice
Requirements
  • A bachelor's degree in Computer Science, Information Systems, or the equivalent combination of education, experience, and training
  • 4+ years of experience in data reliability, platform engineering, or operating data services at scale
  • Proficiency in programming languages such as Go, Java, or similar
  • Hands-on experience with Container Technologies and Orchestration (e.g., Docker, Kubernetes)
  • Valuable experience with data streaming and messaging platforms, like Kafka, Elasticsearch, RabbitMQ
  • Familiarity with CI/CD pipelines and tools such as Jenkins, Gitlab, or similar
  • Demonstrated commitment to staying updated with industry trends and emerging technologies, showcasing a proactive approach to continuous learning
  • Demonstrated knowledge of best practices in security, ensuring the implementation of secure coding standards
  • Experience working with Cloud Providers, with a preference for AWS Cloud
  • Expertise in utilizing monitoring and observability stacks like Grafana and Prometheus
Job Responsibility
  • Build and maintain Kubernetes Operators, including deployment, monitoring, operations, and analytics tools developed by the team
  • Engage in development tasks using Go, Java, or similar languages
  • Operate services such as Kafka, Elasticsearch, RabbitMQ, Redis, Relational databases and Couchbase at an enterprise scale
  • Contribute to the self-healing capabilities of applications in our enterprise environments
  • Develop playbooks associated with actionable alerts to streamline response procedures
  • Work with AI-assisted development tools (e.g. Cursor) as part of your daily workflow to ship faster and iterate effectively
  • Maintain and enhance our Infrastructure as Code (IaC) to efficiently manage end-to-end lifecycle operations (monitoring, alerting, security, cost optimization, configuration, backup, etc.) in production environments
  • Utilize your experience and problem solving skills to help prevent and investigate production issues

Senior Site Reliability Engineer - Data Pipeline

Bloomreach is building the world’s premier agentic platform for personalization....
Location: Czechia
Salary: Not provided
Bloomreach
Expiration Date: Until further notice
Requirements
  • You can articulate how your contributions have transformed the way engineers work and think by fostering a strong DevOps/SRE culture
  • You can demonstrate how impactful your work as an SRE or DevOps Engineer can be in connection to business success
  • You understand the importance of the "you build it, you run it" principle and you love the feeling of ownership
  • You are mindful of the costs associated with running our service, which translates into effective vertical and horizontal pod autoscaling and detailed telemetry insights
  • You believe infrastructure as code is the only thing that can bring stability into chaos
  • Terraform is your daily bread, and Helm deployments are your second-best friend
  • You use telemetry data and metrics to provide feedback to engineers on how the application and services behave
  • You can navigate complex service architectures using distributed debugging
  • You have experience with Python and a solid grasp of engineering practices
  • You don’t hesitate to participate in the on-call rotation for 24/7 support
Job Responsibility
  • Your task is to build and maintain an ecosystem where engineers can safely and efficiently develop, debug and operate their services running on GCP and Kubernetes using Dataflow, Dataproc, Python and Go
  • You make sure the services have a high level of observability, enabling us to provide quality service for our customers
  • Services can scale vertically and horizontally based on current load and on operational and telemetry data (OTEL, Prometheus, VictoriaMetrics)
  • The team has enough insight into the health of our services (Grafana, alerting, PagerDuty)
  • You help the team fulfill security requirements for ISO and SOC 2 audits by enforcing security principles such as key distribution, key rotation, service-level authorisation and authentication, data encryption in transit, data isolation, resource limits, quality of service, and audit logs (mainly via Envoy proxies)
  • You contribute to our tooling, so we have tools in place for debugging, troubleshooting and performance testing
  • You automate manual and semi-manual steps in deployment and instance setup
  • You are hands-on with L3 support and incident resolution
  • CI pipelines have linters, security scans, and code smell detection, enabling engineers to produce quality MRs
What we offer
  • A great deal of freedom and trust
  • We have defined our 5 values and the 10 underlying key behaviors that we strongly believe in
  • We believe in flexible working hours to accommodate your working style
  • We work virtual-first with several Bloomreach Hubs available across three continents
  • We organize company events to experience the global spirit of the company and get excited about what's ahead
  • We encourage and support our employees to engage in volunteering activities - every Bloomreacher can take 5 paid days off to volunteer
  • We have a People Development Program, with personal development workshops on various topics run by experts from inside the company
  • Our resident communication coach Ivo Večeřa is available to help navigate work-related communications & decision-making challenges
  • Our managers are strongly encouraged to participate in the Leader Development Program
  • Bloomreachers utilize the $1,500 professional education budget on an annual basis to purchase education products (books, courses, certifications, etc.)
  • Employment type: Full-time

Senior Site Reliability Engineer - Data Pipeline

The Data Pipeline team is a backend-focused engineering team that is built on st...
Location: Slovakia
Salary: 3500.00 EUR / Month
Bloomreach
Expiration Date: Until further notice
Requirements
  • You can articulate how your contributions have transformed the way engineers work and think by fostering a strong DevOps/SRE culture
  • You can demonstrate how impactful your work as an SRE or DevOps Engineer can be in connection to business success
  • You understand the importance of the "you build it, you run it" principle and you love the feeling of ownership
  • You are mindful of the costs associated with running our service, which translates into effective vertical and horizontal pod autoscaling and detailed telemetry insights
  • You believe infrastructure as code is the only thing that can bring stability into chaos
  • Terraform is your daily bread, and Helm deployments are your second-best friend
  • You use telemetry data and metrics to provide feedback to engineers on how the application and services behave
  • You can navigate complex service architectures using distributed debugging
  • You have experience with Python and a solid grasp of engineering practices
  • Experience with Go or with ETL pipelines is a big advantage
Job Responsibility
  • Build and maintain an ecosystem where engineers can safely and efficiently develop, debug and operate their services running on GCP and Kubernetes using Dataflow, Dataproc, Python and Go
  • Make sure the services have a high level of observability, enabling us to provide quality service for our customers
  • Ensure services can scale vertically and horizontally based on current load and on operational and telemetry data (OTEL, Prometheus, VictoriaMetrics)
  • Ensure the team has enough insight into the health of our services (Grafana, alerting, PagerDuty)
  • Help the team fulfill security requirements for ISO and SOC 2 audits by enforcing security principles such as key distribution, key rotation, service-level authorisation and authentication, data encryption in transit, data isolation, resource limits, quality of service, and audit logs (mainly via Envoy proxies)
  • Contribute to our tooling, so we have tools in place for debugging, troubleshooting and performance testing
  • Automate manual and semi-manual steps in deployment and instance setup
  • Be hands-on with L3 support and incident resolution
  • Ensure CI pipelines have linters, security scans, and code smell detection, enabling engineers to produce quality MRs
What we offer
  • A great deal of freedom and trust
  • Flexible working hours
  • Work virtual-first with several Bloomreach Hubs available across three continents
  • Company events
  • 5 paid days off to volunteer
  • People Development Program
  • Communication coach available
  • Leader Development Program
  • $1,500 professional education budget annually
  • Employee Assistance Program with counselors
  • Employment type: Full-time

Site Reliability Engineer - Data Platform Operation

Join our Data & AI Platform team as a Site Reliability Engineer (SRE) – Platform...
Location: Brazil, Sao Paulo
Salary: Not provided
Amaris Consulting
Expiration Date: Until further notice
Requirements
  • Academic background: Bachelor’s or Master’s degree in Computer Science, Information Technology, or related field (minimum 3 years of experience)
  • Experience: 5+ years hands-on with cloud platforms (Azure, AWS, GCP), programming (Bash, PowerShell, Terraform, Python, Java), and Infrastructure as Code (IaC)
  • English language: Professional working proficiency in English and the local language
  • Tools / software: Deep expertise in Azure, Databricks, Unity Catalog, Kubernetes, Helm, Docker, Power BI, Datadog, Grafana, GitHub, Azure DevOps, ArgoCD, Airflow, SSIS, Power Query, and relational/NoSQL databases
  • AI experience: Experience supporting enterprise Data & AI platforms
  • Soft skills: Analytical problem-solving
  • Effective communication and active listening
  • Team player with respect for others
  • Strong troubleshooting and platform monitoring skills
  • Automation (Python, PowerShell, CLI, KQL, Terraform)
Job Responsibility
  • Support, manage, and maintain Azure resources: Azure SQL, Synapse, Data Factory, Databricks, Unity Catalog
  • Monitor Azure workloads, troubleshoot incidents, alerts, and performance bottlenecks
  • Implement and manage RBAC, identity & access policies, and compliance controls
  • Optimize Azure cost and performance using Azure Monitor, Datadog, and Cost Management tools
  • Automate tasks using PowerShell, Azure CLI, Terraform, and Python
  • Utilize Git, GitHub Actions, and Airflow for workflow automation
  • Provide L2/L3 support for data pipelines, reporting, and cloud services
  • Conduct incident response, root cause analysis (RCA), and proactive issue resolution
  • Collaborate with Cloud Engineering, Data Engineers, BI Developers, and Cloud Architects
  • Follow ITSM processes: Incident, Change, and Problem Management
What we offer
  • An international community bringing together 110+ different nationalities
  • An environment where trust has a central place: 70% of our key leaders started their careers at the first level of responsibility
  • A robust training system with our internal Academy and 250+ available modules
  • A vibrant workplace that frequently gathers for internal events (afterworks, team buildings, etc.)
  • Strong commitments to CSR, notably through participation in our WeCare Together program

Data Reliability / Operations Engineer

We are looking for a Data Reliability/Operations Engineer who will contribute si...
Location: USA, Indianapolis
Salary: 65000.00 - 66000.00 USD / Year
Beacon Hill
Expiration Date: Until further notice
Requirements
  • Must have a degree in Computer Science, Engineering, Data Science, or related field
  • Must have 4+ years’ experience in data and AI operations or related field
  • Experience working in a data reliability engineering role is required
  • Must be proficient in AWS (or Azure/GCP), Snowflake, RDS, SQL, Python, Terraform, Airflow/Dagster, Data pipelines, and data monitoring tools
  • Experience with MLOps or some other type of AI experience is highly desired
  • Experience with building data pipelines, data stores, and working with data analytics is required
  • Must be able to implement comprehensive monitoring and alerting for data pipelines that support operations and analytics
  • Must be able to design and deploy automated data quality checks that ensure accuracy of data
  • Must be able to respond to and resolve data incidents, conducting root cause analysis and implementing preventive measures
  • Collaborate with data engineers to improve pipeline reliability, performance, and operational efficiency
Job Responsibility
  • Independently managing data quality monitoring and incident response and troubleshooting
  • Maintaining data quality and insights that drive business forward
  • Mentoring junior team members
  • Contributing to the mission of delivering outstanding value through data-driven insights and innovative solutions
  • Employment type: Full-time

Data – Site Reliability Engineer

We’re building out our Data Reliability & Quality Engineering function in Sydney...
Location: Australia, Sydney
Salary: Not provided
Optiver
Expiration Date: Until further notice
Requirements
  • 2+ years of experience in data engineering, site reliability engineering (SRE) or data operations
  • Proficient in SQL, ETL processes and at least one programming language (Python, Scala or C++)
  • Strong focus on automation, system resilience and continuous improvement
  • Passionate about data quality, observability and performance optimization
  • Structured and methodical in problem-solving, with strong debugging skills
  • Excellent communication skills and a collaborative, proactive mindset
Job Responsibility
  • Monitor and optimize the performance and stability of data pipelines to meet SLAs
  • Design and maintain data quality tests for accuracy, consistency and completeness
  • Investigate data incidents, perform root-cause analysis and implement preventive measures
  • Apply SRE principles to enhance reliability, alerting and recovery mechanisms
  • Develop automation and tooling to reduce manual intervention and improve scalability
  • Collaborate with data engineers, researchers and market data teams to align reliability goals with business priorities
What we offer
  • A performance-based bonus structure unmatched anywhere in the industry
  • The chance to work alongside diverse and intelligent peers in a rewarding environment
  • Training, mentorship and personal development opportunities
  • Daily breakfast, lunch and an in-house barista
  • Gym membership plus weekly in-house chair massages
  • Regular social events, including a company trip every two years
  • Guided relocation, a competitive relocation package and visa sponsorship where necessary