We are looking for a technically sharp and detail-oriented Data Engineer to join the HPEFS (Hewlett Packard Enterprise Financial Services) Advanced Analytics & BI team in Bangalore. This role is the data backbone that powers our AI capabilities, working in close partnership with the AI Engineers to ensure that the data flowing into AI models, dashboards, and business workflows is clean, governed, and well-structured. It is a hands-on role that owns the backend data lifecycle: ingesting raw data from diverse sources, transforming it into reliable, analysis-ready datasets, enforcing data quality standards, and publishing governed data products via Microsoft Fabric and Databricks. You will also support reporting needs through Power BI and contribute to Collibra-based data governance initiatives. A working familiarity with Microsoft Copilot and AI-assisted data tooling is expected.
Job Responsibilities
Design, build, and maintain scalable ETL/ELT pipelines using Azure Data Factory, Databricks (PySpark / Delta Live Tables), and Microsoft Fabric Data Factory
Transform raw, multi-source data into clean, conformed, and analytics-ready datasets following Medallion Architecture principles (Bronze → Silver → Gold)
Develop and optimize SQL and PySpark-based transformation logic for structured, semi-structured, and unstructured data
Implement incremental load patterns, merge/upsert logic, and slowly changing dimension (SCD) strategies to support historical data tracking
Collaborate with the AI Engineers to prepare high-quality feature datasets for ML and LLM use cases
Define, implement, and monitor data quality rules including completeness, accuracy, consistency, timeliness, and uniqueness checks
Administer and extend the Collibra data governance platform — including business glossary management, data lineage documentation, and stewardship workflows
Build automated data quality validation frameworks using tools such as Great Expectations, dbt tests, or Unity Catalog data quality constraints in Databricks
Triage and resolve data quality incidents, root-cause data anomalies, and communicate impact to stakeholders proactively
Maintain metadata catalogues and ensure all critical datasets have documented ownership, lineage, and classification
Build and manage Lakehouses, Warehouses, and Dataflows Gen2 within the Microsoft Fabric ecosystem
Configure OneLake, shortcuts, and mirroring to unify data across sources without unnecessary duplication
Leverage Fabric Notebooks (PySpark / Python) and Spark job definitions for large-scale data processing
Support the semantic model layer in Fabric to ensure Power BI datasets are optimized and governed
Develop and maintain Power BI semantic models (star schema design, DAX measures, row-level security)
Build production-grade dashboards and reports for business stakeholders, and ensure refresh reliability and performance
Apply Copilot-assisted authoring in Power BI and Fabric where applicable to accelerate report generation
Support self-service analytics adoption by publishing governed datasets to the Power BI service
Partner closely with the AI Engineers, peer data scientists, and analytics team members to supply clean, structured data for RAG pipelines, model training, and agentic workflows
Contribute to the design of shared data contracts and API schemas between data engineering and AI engineering layers
Assist with AI-assisted data tasks using Microsoft Copilot (in Fabric, Power BI, and Azure environments)
Requirements
Bachelor's or Master's degree in Computer Science, Information Systems, Data Engineering, Mathematics, or a related discipline
4 to 5 years of hands-on experience in data engineering, ETL development, or analytics engineering roles
Demonstrable experience with Databricks and/or Microsoft Fabric in a production environment
Proficiency in Power BI report and semantic model development
Exposure to Collibra or equivalent data governance / cataloguing platforms is strongly preferred
Strong SQL and Python skills; PySpark experience is required
Familiarity with Azure cloud services and DevOps practices for data pipeline deployment