
Site Reliability Engineer - Data Platform Operation

Brazil, Sao Paulo · Job Posted January 09, 2026

Job Description

Join our Data & AI Platform team as a Site Reliability Engineer (SRE) – Platform Operation. You will support and maintain scalable, resilient, and efficient infrastructure for our Data & AI Platform, ensuring reliable availability and improving business-as-usual (BAU) operations. You will collaborate closely with Platform Engineers, Architects, Data Engineers, DevOps, and Security teams to maintain and optimize our platforms.

Job Responsibility

  • Support, manage, and maintain Azure resources: Azure SQL, Synapse, Data Factory, Databricks, Unity Catalog
  • Monitor Azure workloads, troubleshoot incidents, alerts, and performance bottlenecks
  • Implement and manage RBAC, identity & access policies, and compliance controls
  • Optimize Azure cost and performance using Azure Monitor, Datadog, and Cost Management tools
  • Automate tasks using PowerShell, Azure CLI, Terraform, and Python
  • Utilize Git, GitHub Actions, and Airflow for workflow automation
  • Provide L2/L3 support for data pipelines, reporting, and cloud services
  • Conduct incident response, root cause analysis (RCA), and proactive issue resolution
  • Collaborate with Cloud Engineering, Data Engineers, BI Developers, and Cloud Architects
  • Follow ITSM processes: Incident, Change, and Problem Management
  • Ensure platform security and compliance with frameworks like MICS

Requirements

  • Academic background: Bachelor’s or Master’s degree in Computer Science, Information Technology, or related field (minimum 3 years of experience)
  • Experience: 5+ years hands-on with cloud platforms (Azure, AWS, GCP), programming (Bash, PowerShell, Terraform, Python, Java), and Infrastructure as Code (IaC)
  • English language: Professional working proficiency in English and the local language
  • Tools / software: Deep expertise in Azure, Databricks, Unity Catalog, Kubernetes, Helm, Docker, Power BI, Datadog, Grafana, GitHub, Azure DevOps, ArgoCD, Airflow, SSIS, Power Query, and relational/NoSQL databases
  • AI experience: Experience supporting enterprise Data & AI platforms
  • Soft skills: Analytical problem-solving
  • Effective communication and active listening
  • Team player with respect for others
  • Strong troubleshooting and platform monitoring skills
  • Automation (Python, PowerShell, CLI, KQL, Terraform)
  • ITIL-based workflow experience

What we offer

  • An international community bringing together 110+ different nationalities
  • An environment where trust has a central place: 70% of our key leaders started their careers at the first level of responsibility
  • A robust training system with our internal Academy and 250+ available modules
  • A vibrant workplace that frequently gathers for internal events (afterworks, team buildings, etc.)
  • Strong commitments to CSR, notably through participation in our WeCare Together program


Similar Jobs for Site Reliability Engineer - Data Platform Operation

8 matching positions

Site Reliability Engineer Platform Engineer

Join a mission-driven, national financial services organization at the heart of ...
Location: United States, Reston
Salary: Not provided
Company: Tier4 Group
Expiration Date: Until further notice
Requirements
  • 5+ years hands-on operating and managing Kubernetes and OpenShift clusters
  • Strong experience with Microsoft Azure (compute, networking, storage, and data services)
  • Proven skills in automation and Infrastructure-as-Code (Terraform, Ansible, GitOps)
  • Proficiency with observability tooling (Datadog, Prometheus, Grafana)
  • Scripting/coding ability in Bash, Python, or Go
Job Responsibility
  • Operate, tune, and optimize OpenShift/Kubernetes clusters (scheduling, ingress, upgrades, quotas, policies)
  • Stand up and/or refine observability (Datadog, Prometheus, Grafana)—dashboards, alerts, SLOs, runbooks
  • Map current hybrid topology and critical delivery pipelines; identify toil and prioritize automation (Terraform/Ansible)
  • Begin supporting Azure environments (compute, networking, storage, data services) used by analytics teams
  • Drive GitOps-first workflows; harden CI/CD with ArgoCD/Jenkins/GitHub Actions and policy-as-code guardrails
  • Implement or enhance platform services (Vault, Kafka/AMQ, ingress, service mesh) for dev and data teams
  • Lead incident response and postmortems; institutionalize RCA, blameless learning, and continuous improvement
Employment type: Full-time

Big Data/Data Platform Site Reliability Engineer

About PulsePoint: PulsePoint is a fast-growing healthcare technology company (wi...
Location: United Kingdom
Salary: Not provided
Company: PulsePoint
Expiration Date: Until further notice
Requirements
  • Strong hands-on experience operating large-scale Linux infrastructure in production (Rocky Linux or equivalent)
  • Deep practical knowledge of Apache Hadoop-based data platforms, including HDFS architecture and failure modes, Kerberos-based security models, and operational lifecycle (upgrades, scaling, recovery)
  • Experience running Apache Kafka clusters in production, including KRaft-based setups
  • Proven ability to debug complex distributed system issues across storage, compute, and networking layers
  • Experience designing or improving automation, deployment, or GitOps-style workflows
  • Proficiency in scripting or automation (Python, Shell, etc.)
  • Solid understanding of networking fundamentals (TCP/IP, DNS, load balancing, basic network security concepts)
  • Comfortable taking technical ownership, driving reliability improvements, and participating in on-call / incident processes
  • Willing and able to work East Coast U.S. hours (9am–6pm EST)
Job Responsibility
  • Deploy, configure, monitor, and maintain multiple big data stores across multiple datacenters, with a strong focus on reliability, scalability, and operational excellence
  • Plan, configure, deploy, and manage the lifecycle of critical data infrastructure
  • Manage large-scale Linux infrastructure to ensure maximum uptime and predictable performance
  • Develop and document system configuration standards, operational procedures, and best practices
  • Conduct performance and reliability testing, including reviewing configuration, software choices, versions, and hardware specifications
  • Participate in incident response and root cause analysis, and drive long-term reliability improvements
  • Advance our technology stack with innovative ideas and pragmatic solutions

Data Site Reliability Engineer

We are hiring Data Site Reliability Engineers (Data SREs) to join our Global Dat...
Location: China, Shanghai
Salary: Not provided
Company: Optiver
Expiration Date: Until further notice
Requirements
  • Experience operating or supporting production data systems, such as data pipelines, ingestion frameworks, or analytical platforms
  • Strong proficiency in Python, with experience using libraries like Pandas, Arrow, and Spark
  • Solid understanding of data modeling, normalization, and API development in support of large-scale analytical or trading systems
  • Experience working with lakehouse architectures (e.g., Delta Lake, Databricks, AWS) to manage large-scale, high-quality analytical datasets is preferred
  • An operations mindset with a strong sense of ownership, reliability, and continuous improvement
  • Ability to operate with a high degree of autonomy, making sound engineering and operational decisions for production systems
  • Strong communication skills and the ability to work effectively across teams and time zones
  • Experience with external data ingestion is a plus
  • Exposure to PCAP-based data workflows or market data capture pipelines is a plus
  • Willingness to contribute to setting best practices and mentoring engineers on operational excellence and production reliability
Job Responsibility
  • Configure, launch, and maintain robust data pipelines (ETL/ELT) for ingesting and transforming datasets critical to research and trading
  • Own the end-to-end reliability of critical production data pipelines, from ingestion through downstream consumption
  • Ensure data quality and consistency through validation, monitoring, and robust engineering practices
  • Act as the first point of contact for data incidents, investigating failures, data quality issues, and pipeline regressions, and driving them through to resolution
  • Participate in incident response, root cause analysis, and post-incident reviews, with a focus on preventing recurrence
  • Manage daily releases, backfills, and ad-hoc data runs, with a strong focus on safety, traceability, and environment segregation
  • Design and improve monitoring, alerting, data quality checks, and operational runbooks, ensuring issues are detected early and alerts are actionable
  • Build automation and tooling to reduce manual operational work and enable the platform to scale safely
  • Partner closely with data engineering, platform, and trading-facing teams to ensure data systems are reliable, well-understood, and fit for purpose
What we offer
  • A performance-based bonus structure unmatched anywhere in the industry
  • The chance to work alongside diverse and intelligent peers in a rewarding environment
  • Training, mentorship and personal development opportunities
  • Daily breakfast, lunch and snacks
  • Gym membership, sports and leisure activities, plus weekly in-house chair massages
  • Regular social events, clubs and Friday afternoon drinks

Senior Data Engineer - Data Platform

We are looking for a Senior Data Engineer - Data Platform to join our Data & AI ...
Location: France, Paris
Salary: Not provided
Company: Doctolib
Expiration Date: Until further notice
Requirements
  • More than 7 years of experience as Site Reliability Engineer, Data Ops, Data Platform Engineer or in a similar role, with a proven track record of building and maintaining complex data infrastructures
  • Strong proficiency in data engineering and infrastructure tools and technologies, such as stream and events processing (Kafka, PubSub, Firehose) and Kubernetes
  • Expertise in programming languages like Python
  • Familiarity with cloud infrastructure and services (preferably AWS, Azure, or GCP) and experience with infrastructure-as-code tools such as Terraform
  • Excellent problem-solving skills with a focus on identifying and resolving data infrastructure bottlenecks and performance issues
Job Responsibility
  • Design and implement a scalable and reliable data infrastructure that supports the collection, processing, storage, and analysis of large-scale datasets while pushing security and privacy best practices
  • Build and maintain data pipelines that efficiently extract, transform, and load data from various sources into our data warehouse
  • Implement automation and orchestration tools to streamline infrastructure provisioning and data workflows, reduce manual effort, and improve operational efficiency
  • Monitor data platform for performance and reliability, identify and troubleshoot issues, and implement proactive solutions to ensure data quality and availability
  • Streamline and monitor platform costs, identify optimizations and saving opportunities while collaborating with data engineers, data scientists, and other stakeholders
What we offer
  • Free comprehensive health insurance for you and your children
  • Parent Care Program: receive one additional month of leave on top of the legal parental leave
  • Free mental health and coaching services through our partner Moka.care
  • For caregivers and workers with disabilities, a package including an adaptation of the remote policy, extra days off for medical reasons, and psychological support
  • Work from EU countries and the UK for up to 10 days per year, thanks to our flexibility days policy
  • Up to 14 days of RTT
  • A subsidy from the work council to refund part of the membership to a sport club or a creative class
  • Lunch voucher with Swile card
Employment type: Full-time

Intermediate Site Reliability Engineer SRE – AI Reliability & Automation

At PointClickCare our mission is simple: to help providers deliver exceptional c...
Location: Canada, Mississauga
Salary: 115000.00 - 128000.00 CAD / Year
Company: PointClickCare
Expiration Date: Until further notice
Requirements
  • 5+ years' experience in software engineering
  • Experience with SRE principles
  • Experience with AI/ML in production environments
  • A passion for automation, intelligent systems, and operational excellence
  • Strong debugging, problem-solving, and system design skills
  • Languages: Python, Java, Bash, Terraform
  • Platforms: Azure, Kubernetes, Docker
  • Tools: Datadog, Prometheus, AppDynamics, ELK, GitHub Actions
  • ML/AI: MCP framework, AI agents, Vector store, Agent orchestration (LangChain), RAG
  • CI/CD: Jenkins, ArgoCD, Spinnaker
Job Responsibility
  • Build ML-based anomaly detection and pattern recognition systems
  • Enhance telemetry with smart tagging and metadata for better AI insights
  • Develop event-driven workflows and self-healing systems using AI triggers
  • Automate incident response with generative AI and custom AI agent orchestration
  • Use time-series forecasting and predictive modelling to anticipate failures
  • Optimise infrastructure with AI-powered autoscaling and cost-aware resource allocation
  • Build scalable, fault-tolerant systems in a cloud-native environment
  • Participate in on-call rotations and lead incident response for critical systems
  • Skilled in API integration for streamlined data exchange and system connectivity
  • Run internal AIOps workshops and help teams adopt AI maturity models
What we offer
  • Benefits starting from Day 1!
  • Retirement Plan Matching
  • Flexible Paid Time Off
  • Wellness Support Programs and Resources
  • Parental & Caregiver Leaves
  • Fertility & Adoption Support
  • Continuous Development Support Program
  • Employee Assistance Program
  • Allyship and Inclusion Communities
  • Employee Recognition … and more!
Employment type: Full-time

Site Reliability Engineer

We are currently seeking a Site Reliability Engineer to join our team in Guadala...
Location: Mexico, Guadalajara
Salary: Not provided
Company: NTT DATA
Expiration Date: Until further notice
Requirements
  • Understand the Microsoft Azure Cloud - ideally Azure Fundamentals certified OR Computer Science/Information Systems Management degree
  • Familiar with PaaS and IaaS - VMs, Storage, Event Hubs, Service Fabric Cluster (SFC), Azure Kubernetes Service (AKS), Cosmos DB, SQL Server, IoT Hub, Databricks, Key Vault, Data Lake
  • Understand the concept of Internet of Things (IoT) - telemetry, ingestion, processing, data storage, reporting
  • Understand the concept of tools such as Octopus, Bamboo, Terraform, Azure DevOps, Jenkins, GitHub, Ansible
  • Understand the concept of container orchestration platforms (e.g. Kubernetes)
  • Understand the concept of scripting: PowerShell, Python
  • Understand the difference between NoSQL and SQL databases, and how to maintain them
Job Responsibility
  • Perform L1.5 activities such as monitoring, deployment, rollback
  • Monitor the efficiency of the Azure cloud systems to prevent outages and initiate an Incident Management bridge in case of an outage
  • Troubleshoot Azure resources, escalate to Level 3 (Software Development Team)
Employment type: Full-time

Senior Site Reliability Engineer, Wikimedia Enterprise

The Wikimedia Foundation is looking for a Senior Site Reliability Engineer to jo...
Location: United States
Salary: 116633.00 - 181243.00 USD / Year
Company: Wikimedia Foundation
Expiration Date: Until further notice
Requirements
  • Automation & Configuration Management: Experience with Infrastructure as Code and automation tools (e.g., Terraform, Ansible) and proficiency in at least one programming language (e.g., Python, Go, or similar)
  • Cloud Infrastructure: Experience designing, operating, and optimizing cloud-based systems across platforms such as AWS, Azure, or GCP, including scalability, reliability, and cost efficiency
  • CI/CD & Deployment Practices: Experience building and maintaining CI/CD pipelines and GitOps workflows (e.g., GitLab or similar, ArgoCD), with familiarity in progressive delivery approaches such as canary and blue-green deployments
  • Incident Management & Reliability Operations: Experience with incident response, on-call practices, and leading postmortems, with a focus on continuous improvement and operational excellence
  • SRE Principles & Observability: Strong understanding of SRE best practices, including SLOs, SLIs, and error budgets, along with experience in observability (metrics, logging, and distributed tracing e.g., Prometheus, OpenTelemetry)
  • Collaboration & Communication: Ability to work effectively in a distributed, cross-functional environment, with strong documentation and communication skills
  • Proven experience operating highly available, large-scale distributed systems, with a deep understanding of reliability, scalability, and failure modes
  • Ownership mindset: Takes end-to-end responsibility for system reliability, proactively identifying and addressing risks before they impact users
  • Bias for automation: Continuously seeks to reduce operational toil through automation and scalable solutions
  • Continuous improvement mindset: Actively learns from incidents and drives improvements through blameless postmortems and iterative enhancements
Job Responsibility
  • Define, track, and improve Service Level Objectives (SLOs), SLIs, and error budgets to ensure reliability targets are met
  • Build and enhance observability systems (metrics, logs, and distributed tracing) to enable proactive detection and faster troubleshooting
  • Drive reliability engineering practices, including capacity planning, load testing, and resilience validation (e.g., chaos testing)
  • Improve developer experience (DevEx) by enabling self-service infrastructure and streamlining deployment workflows
  • Partner with engineering team members to embed reliability best practices early in the development lifecycle
  • Design, implement, and optimize CI/CD and GitOps workflows using tools such as GitLab (or similar) and ArgoCD (or similar), enabling automated, reliable deployments with support for progressive delivery strategies like canary and blue-green releases
  • Implement secure-by-default infrastructure and enforce best practices (e.g., IAM, secrets management, encryption)
  • Continuously optimize infrastructure cost and efficiency using FinOps principles while maintaining performance and availability
  • Establish and track operational metrics such as MTTR, MTTD, and incident frequency to drive continuous improvement
  • Reduce operational toil by identifying repetitive work and implementing automation-first solutions
Employment type: Full-time

Site Reliability Engineer

Shape the Future of Intelligent Operations as a Site Reliability Engineer (AI Op...
Location: India, Chennai
Salary: Not provided
Company: Trimble Inc.
Expiration Date: Until further notice
Requirements
  • 1 to 2 years of professional experience in a DevOps, MLOps, or systems engineering environment
  • Bachelor's degree in Computer Science, Engineering, Information Technology, or a closely related technical field
  • Direct experience with the Microsoft Azure cloud platform and its specialized ecosystem services (such as Azure ML and Azure DevOps)
  • Proficiency with Python or other scripting languages (Shell / Bash / PowerShell) for rapid system integration and task automation
  • Foundational understanding of containerization (Docker), basic orchestration concepts (Kubernetes fundamentals), and version control system workflows (Git)
  • Solid baseline knowledge of fundamental DevOps principles (CI/CD, system administration) and a basic understanding of the end-to-end machine learning model lifecycle
Job Responsibility
  • Assist in the deployment and maintenance of machine learning models in production under direct supervision, building skills in containerization and orchestration architectures
  • Support the development of robust continuous integration and deployment pipelines for ML workflows, including model versioning, automated testing, and release processes
  • Monitor production ML model performance, detect data drift, and track system health by implementing foundational logging, alerting, and metrics solutions
  • Contribute to infrastructure automation and configuration management for machine learning workloads, learning to treat infrastructure as software
  • Partner closely with ML engineers and data scientists to operationalize complex models, ensuring reliability, scale, and strict adherence to established operational patterns
What we offer
  • Structured environment to accelerate technical skills
  • Direct guidance from experienced engineering professionals
  • Projects that improve productivity, quality, safety, transparency and sustainability
  • Collaborative and supportive team
  • Entrepreneurial spirit empowering proactive doers
  • Flexible work arrangements
Employment type: Full-time