The Site Reliability Engineer / Data Platform Analyst at BLK will support the stability, reliability, and operational effectiveness of the platforms that enable Aladdin Studio, EDP, and internal developer applications. This role is part of the AMRS Data Platform Operations team and contributes directly to the operational resilience of critical data and compute environments. The position requires strong technical foundations, attention to detail, and the ability to collaborate across global teams.
Job Responsibility:
Operational Reliability: Monitor and support applications, workflows, and platform components across Aladdin Studio and related environments
Participate in incident response, troubleshooting, root cause analysis, and recovery activities
Ensure compliance with operational standards, SLAs, and internal procedures
Infrastructure & Cloud Native Distributed Containerized Microservice Orchestration: Support operational tasks across AKS, TKGI, HAL, and EWD containerized clusters
Assist with upgrades, migrations, and platform decommissioning efforts
Validate deployments, service configurations, networking behavior, and storage (PV/PVC)
Observability & Monitoring: Maintain and enhance dashboards, alerts, and telemetry using visualization and analytics platforms for metrics and logs, and alerting toolkits for cloud-native environments
Develop SLIs/SLOs and contribute to observability improvements across distributed systems
Support network layer visibility and service mesh observability (Istio)
Automation & Continuous Improvement: Develop automation scripts in Python to reduce manual tasks and increase operational consistency
Write SQL queries for data validation, incident investigation, and operational analysis
Improve tooling workflows, integrations, and documentation across operational platforms
Collaboration: Work closely with engineering, product, and global operations teams in AMRS, EMEA, and APAC
Support key programs such as platform modernization, developer experience improvements, and operational readiness for new environments
Communicate clearly with stakeholders during incidents, escalations, and operational reviews
Requirements:
Bachelor’s degree in Computer Science, Data Science, Software Engineering, Systems Engineering, or a related field
2+ years of experience in Site Reliability Engineering, DevOps, Platform Operations, or Data Operations
Proficiency with Python (automation, monitoring, data validation, log analysis)
Proficiency with SQL (queries for diagnostics, validation, and analysis)
Strong understanding of Linux, containers, cloud-native microservices concepts, and YAML-based configurations
Experience with observability and data visualization tools
Strong analytical and troubleshooting skills across distributed and cloud native environments
Excellent communication skills in English and Spanish