Data Reliability Engineer Job at Rimes Technologies

Data Reliability Engineer

We’re looking for a Data Reliability Engineer to help keep our trading and data ...

Location

Ireland, Dublin

Salary:

Not provided

Susquehanna International Group

Expiration Date

Until further notice

Requirements

Degree in a technical or business discipline or equivalent industry experience of 1+ years
Demonstrated experience with Python or equivalent language
Excellent analytical & troubleshooting skills, self-motivated and curious
Willing to work shift hours, to cover early and late responsibilities (alternating)
Experience with Change Management, Incident Management Procedures
Experience of technical documentation & support cases

Job Responsibility

Ensure Platform Reliability - Monitor and maintain trading-critical Airflow DAGs and Python-based pipelines, ensuring jobs run on time and within SLAs
Incident Response & Recovery - Triage, troubleshoot, and resolve failures quickly; validate downstream impacts and maintain tested rollback/recovery procedures
Change & Release Management - Act as a release gatekeeper: review code/config changes, enforce safe deployment standards, and coordinate risk-aware releases via Git(lab) and Octopus Deploy
Collaboration & Communication - Partner with quants and engineers to assess change impacts, document runbooks, and communicate operational updates and risks
Continuous Improvement - Enhance monitoring, alerting, and automation; track KPIs and drive initiatives that strengthen platform resilience and reduce incident recurrence

Senior Data Reliability Engineer

Your Mission Call of Duty is one of the most iconic and successful video game f...

Location

Canada, Vancouver

Salary:

Not provided

Activision

Expiration Date

Until further notice

Requirements

10+ years of programming experience
Extensive experience working in Python; familiarity with Go
Strong experience with data technologies such as SQL, Spark, and Airflow
Hands-on experience building observability systems using tools like OpenTelemetry, Prometheus, Loki, and Grafana
Experience with dashboarding and alerting for production systems
Secure automation of testing and deployments using GitHub Actions / Workflows (GitOps)
Experience with Linux system administration in production environments
Cloud-native deployment experience using Kubernetes, Helm, and ArgoCD
Experience supporting model deployments (batch and online APIs)

Job Responsibility

Create the ML Data pipeline used for our models including building the ML templates that are used, the observability of our models, the metrics and KPIs used to monitor their efficacy, and the automated retraining required as the data drifts
Design and operate large-scale, highly-available data pipelines and platforms for high-volume game telemetry
Ensure the integrity, trustworthiness, and quality of Anti-Cheat data
Partner closely with Machine Learning teams to support batch, streaming, online inference workflows, automated testing of ML artifacts, and observability and maintenance of automated deployment pipelines
Define and maintain GitOps workflows for secure, automated testing, integration, and deployment
Build comprehensive observability (metrics, logs, dashboards, alerts) into data pipelines and services
Own operational excellence, including incident response, root-cause analysis, and post-mortems
Contribute to deployment and release strategies such as canary, blue/green, and shadow deployments

What we offer

Medical, dental, vision, health savings account or health reimbursement account, healthcare spending accounts, dependent care spending accounts, life and AD&D insurance, disability insurance
401(k) with Company match, tuition reimbursement, charitable donation matching
Paid holidays and vacation, paid sick time, floating holidays, compassion and bereavement leaves, parental leave
Mental health & wellbeing programs, fitness programs, free and discounted games, and a variety of other voluntary benefit programs like supplemental life & disability, legal service, ID protection, rental insurance, and others
If the Company requires that you move geographic locations for the job, then you may also be eligible for relocation assistance

Fulltime

Senior Platform Engineer - Data Reliability

The Feedzai Platform Data Reliability play a pivotal role in managing core data ...

Location

Portugal

Salary:

Not provided

Feedzai

Expiration Date

Until further notice

Requirements

A bachelor's degree in Computer Science, Information Systems, or the equivalent combination of education, experience, and training
4+ years of experience in data reliability, platform engineering, or operating data services at scale
Proficiency in programming languages such as Go, Java, or similar
Hands-on experience with Container Technologies and Orchestration (e.g., Docker, Kubernetes)
Valuable experience with data streaming and messaging platforms, like Kafka, Elasticsearch, RabbitMQ
Familiarity with CI/CD pipelines and tools such as Jenkins, GitLab, or similar
Demonstrated commitment to staying updated with industry trends and emerging technologies, showcasing a proactive approach to continuous learning
Demonstrated knowledge of best practices in security, ensuring the implementation of secure coding standards
Experience working with Cloud Providers, with a preference for AWS Cloud
Expertise in utilizing monitoring and observability stacks like Grafana and Prometheus

Job Responsibility

Build and maintain Kubernetes Operators, including deployment, monitoring, operations, and analytics tools developed by the team
Engage in development tasks using Go, Java, or similar languages
Operate services such as Kafka, Elasticsearch, RabbitMQ, Redis, Relational databases and Couchbase at an enterprise scale
Contribute to the self-healing capabilities of applications in our enterprise environments
Develop playbooks associated with actionable alerts to streamline response procedures
Work with AI-assisted development tools (e.g. Cursor) as part of your daily workflow to ship faster and iterate effectively
Maintain and enhance our Infrastructure as Code (IaC) to efficiently manage end-to-end lifecycle operations (monitoring, alerting, security, cost optimization, configuration, backup, etc.) in production environments
Utilize your experience and problem solving skills to help prevent and investigate production issues

Senior Site Reliability Engineer - Data Pipeline

Bloomreach is building the world’s premier agentic platform for personalization....

Location

Czechia

Salary:

Not provided

Bloomreach

Expiration Date

Until further notice

Requirements

You can articulate how your contributions have transformed the way engineers work and think by fostering a strong DevOps/SRE culture
You can demonstrate how impactful your work as an SRE or DevOps Engineer can be in connection to business success
You understand the importance of the "you build it, you run it" principle and you love the feeling of ownership
You are mindful of the costs associated with running our service, which translates into effective vertical and horizontal pod autoscaling and detailed telemetry insights
You believe infrastructure as code is the only thing that can bring stability to chaos
Terraform is your daily bread, and Helm deployments are your second-best friend
You use telemetry data and metrics to provide feedback to engineers on how the application and services behave
You can find your way around a complex service architecture using distributed debugging
You have experience with Python and a solid grasp of engineering practices
You don’t hesitate to participate in a 24/7 on-call support rotation

Job Responsibility

Your task is to build and maintain an ecosystem where engineers can safely and efficiently develop, debug, and operate their services running in GCP and Kubernetes using Dataflow, Dataproc, Python, and Go
You make sure the services have a high level of observability, enabling us to provide quality service for our customers
Services can scale vertically and horizontally based on current load and on operational and telemetry data (OTEL, Prometheus, VictoriaMetrics)
The team has enough insight into the health of our services (Grafana, alerting, PagerDuty)
You help the team fulfill the security requirements of ISO and SOC 2 audits by enforcing security principles such as key distribution, key rotation, service-level authorisation and authentication, data encryption in transit, data isolation, resource limits, quality of service, and audit logs (mainly via Envoy proxies)
You contribute to our tooling, so we have tools in place for debugging, troubleshooting, and performance testing
You automate manual and semi-manual steps in deployment and instance setup
You are hands-on with L3 support and incident resolution
CI pipelines have linters, security scans, and code smell detection, enabling engineers to produce quality MRs

What we offer

A great deal of freedom and trust
We have defined our 5 values and the 10 underlying key behaviors that we strongly believe in
We believe in flexible working hours to accommodate your working style
We work virtual-first with several Bloomreach Hubs available across three continents
We organize company events to experience the global spirit of the company and get excited about what's ahead
We encourage and support our employees to engage in volunteering activities - every Bloomreacher can take 5 paid days off to volunteer
We have a People Development Program -- participating in personal development workshops on various topics run by experts from inside the company
Our resident communication coach Ivo Večeřa is available to help navigate work-related communications & decision-making challenges
Our managers are strongly encouraged to participate in the Leader Development Program
Bloomreachers utilize the $1,500 professional education budget on an annual basis to purchase education products (books, courses, certifications, etc.)

Fulltime

Senior Site Reliability Engineer - Data Pipeline

The Data Pipeline team is a backend-focused engineering team that is built on st...

Location

Slovakia

Salary:

3500.00 EUR / Month

Bloomreach

Expiration Date

Until further notice

Requirements

You can articulate how your contributions have transformed the way engineers work and think by fostering a strong DevOps/SRE culture
You can demonstrate how impactful your work as an SRE or DevOps Engineer can be in connection to business success
You understand the importance of the "you build it, you run it" principle and you love the feeling of ownership
You are mindful of the costs associated with running our service, which translates into effective vertical and horizontal pod autoscaling and detailed telemetry insights
You believe infrastructure as code is the only thing that can bring stability to chaos
Terraform is your daily bread, and Helm deployments are your second-best friend
You use telemetry data and metrics to provide feedback to engineers on how the application and services behave
You can find your way around a complex service architecture using distributed debugging
You have experience with Python and a solid grasp of engineering practices
Experience with Go or with ETL pipelines is a big advantage

Job Responsibility

Build and maintain an ecosystem where engineers can safely and efficiently develop, debug, and operate their services running in GCP and Kubernetes using Dataflow, Dataproc, Python, and Go
Make sure the services have a high level of observability, enabling us to provide quality service for our customers
Ensure services can scale vertically and horizontally based on current load and on operational and telemetry data (OTEL, Prometheus, VictoriaMetrics)
Ensure the team has enough insight into the health of our services (Grafana, alerting, PagerDuty)
Help the team fulfill the security requirements of ISO and SOC 2 audits by enforcing security principles such as key distribution, key rotation, service-level authorisation and authentication, data encryption in transit, data isolation, resource limits, quality of service, and audit logs (mainly via Envoy proxies)
Contribute to our tooling, so we have tools in place for debugging, troubleshooting, and performance testing
Automate manual and semi-manual steps in deployment and instance setup
Be hands-on with L3 support and incident resolution
Ensure CI pipelines have linters, security scans, and code smell detection, enabling engineers to produce quality MRs

What we offer

A great deal of freedom and trust
Flexible working hours
Work virtual-first with several Bloomreach Hubs available across three continents
Company events
5 paid days off to volunteer
People Development Program
Communication coach available
Leader Development Program
$1,500 professional education budget annually
Employee Assistance Program with counselors

Fulltime

Site Reliability Engineer - Data Platform Operation

Join our Data & AI Platform team as a Site Reliability Engineer (SRE) – Platform...

Location

Brazil, Sao Paulo

Salary:

Not provided

Amaris Consulting

Expiration Date

Until further notice

Requirements

Academic background: Bachelor’s or Master’s degree in Computer Science, Information Technology, or related field (minimum 3 years of experience)
Experience: 5+ years hands-on with cloud platforms (Azure, AWS, GCP), programming (Bash, PowerShell, Terraform, Python, Java), and Infrastructure as Code (IaC)
English language: Professional working proficiency in English and the local language
Tools / software: Deep expertise in Azure, Databricks, Unity Catalog, Kubernetes, Helm, Docker, Power BI, Datadog, Grafana, GitHub, Azure DevOps, ArgoCD, Airflow, SSIS, Power Query, and relational/NoSQL databases
AI experience: Experience supporting enterprise Data & AI platforms
Soft skills: Analytical problem-solving; effective communication and active listening; team player with respect for others; strong troubleshooting and platform monitoring skills; automation (Python, PowerShell, CLI, KQL, Terraform)

Job Responsibility

Support, manage, and maintain Azure resources: Azure SQL, Synapse, Data Factory, Databricks, Unity Catalog
Monitor Azure workloads, troubleshoot incidents, alerts, and performance bottlenecks
Implement and manage RBAC, identity & access policies, and compliance controls
Optimize Azure cost and performance using Azure Monitor, Datadog, and Cost Management tools
Automate tasks using PowerShell, Azure CLI, Terraform, and Python
Utilize Git, GitHub Actions, and Airflow for workflow automation
Provide L2/L3 support for data pipelines, reporting, and cloud services
Conduct incident response, root cause analysis (RCA), and proactive issue resolution
Collaborate with Cloud Engineering, Data Engineers, BI Developers, and Cloud Architects
Follow ITSM processes: Incident, Change, and Problem Management

What we offer

An international community bringing together 110+ different nationalities
An environment where trust has a central place: 70% of our key leaders started their careers at the first level of responsibility
A robust training system with our internal Academy and 250+ available modules
A vibrant workplace that frequently gathers for internal events (afterworks, team buildings, etc.)
Strong commitments to CSR, notably through participation in our WeCare Together program

Data Reliability / Operations Engineer

We are looking for a Data Reliability/Operations Engineer who will contribute si...

Location

USA, Indianapolis

Salary:

65000.00 - 66000.00 USD / Year

Beacon Hill

Expiration Date

Until further notice

Requirements

Must have a degree in Computer Science, Engineering, Data Science, or related field
Must have 4+ years’ experience in data and AI operations or related field
Experience working in a data reliability engineering role is required
Must be proficient in AWS (or Azure/GCP), Snowflake, RDS, SQL, Python, Terraform, Airflow/Dagster, Data pipelines, and data monitoring tools
Experience with MLOps, or some other type of AI experience, is highly desired
Experience with building data pipelines, data stores, and working with data analytics is required
Must be able to implement comprehensive monitoring and alerting for data pipelines that support operations and analytics
Must be able to design and deploy automated data quality checks that ensure accuracy of data
Must be able to respond to and resolve data incidents, conducting root cause analysis and implementing preventive measures
Collaborate with data engineers to improve pipeline reliability, performance, and operational efficiency

Job Responsibility

Independently managing data quality monitoring and incident response and troubleshooting
Maintaining data quality and insights that drive business forward
Mentoring junior team members
Contributing to the mission of delivering outstanding value through data-driven insights and innovative solutions

Fulltime

Data – Site Reliability Engineer

We’re building out our Data Reliability & Quality Engineering function in Sydney...

Location

Australia, Sydney

Salary:

Not provided

Optiver

Expiration Date

Until further notice

Requirements

2+ years of experience in data engineering, site reliability engineering (SRE) or data operations
Proficient in SQL, ETL processes and at least one programming language (Python, Scala or C++)
Strong focus on automation, system resilience and continuous improvement
Passionate about data quality, observability and performance optimization
Structured and methodical in problem-solving, with strong debugging skills
Excellent communication skills and a collaborative, proactive mindset

Job Responsibility

Monitor and optimize the performance and stability of data pipelines to meet SLAs
Design and maintain data quality tests for accuracy, consistency and completeness
Investigate data incidents, perform root-cause analysis and implement preventive measures
Apply SRE principles to enhance reliability, alerting and recovery mechanisms
Develop automation and tooling to reduce manual intervention and improve scalability
Collaborate with data engineers, researchers and market data teams to align reliability goals with business priorities

What we offer

A performance-based bonus structure unmatched anywhere in the industry
The chance to work alongside diverse and intelligent peers in a rewarding environment
Training, mentorship and personal development opportunities
Daily breakfast, lunch and an in-house barista
Gym membership plus weekly in-house chair massages
Regular social events, including a company trip every two years
Guided relocation, a competitive relocation package and visa sponsorship where necessary
