The Data Engineer - Streaming role involves designing and implementing PySpark Structured Streaming pipelines for data ingestion into Apache Iceberg tables. Candidates should have a strong background in Apache Kafka and PySpark, with at least 4 years of experience. Responsibilities include ensuring compliance with technical constraints and writing comprehensive tests for the streaming application. This position offers an opportunity to work in a dynamic environment with innovative data solutions.
Job Responsibilities:
Design and implement a PySpark Structured Streaming application that reads from Confluent Kafka topics, parses JSON and Avro payloads, applies schema mappings, and writes atomically to Iceberg tables using the Iceberg Spark runtime and the foreachBatch micro-batch pattern (a minimal sketch of this pattern follows this list)
Ensure all functionality relies exclusively on public Apache-supported APIs — Apache Spark, Apache Kafka, and Apache Iceberg — with no unsupported Confluent connectors or proprietary sinks
Configure Kafka source parameters: bootstrap servers, consumer group IDs, offset management (startingOffsets, failOnDataLoss), checkpoint paths, and trigger intervals
Implement PII detection and Protegrity tokenization hooks within the ingestion pipeline before data lands in the Iceberg Bronze layer
Write comprehensive unit and integration tests: row count validation, schema conformance checks, Kafka offset commit verification, and data comparison against the source topic
Support PNC UAT: walk PNC engineers through the code, demonstrate that no unsupported connectors are used, and address review findings
Own the two streaming ingestion workstreams of the PNC Bank Hadoop-to-Iceberg POC
Design and deliver production-grade PySpark Structured Streaming pipelines that ingest data into Apache Iceberg tables while operating under the POC's technical constraints (public Apache-supported APIs only, no unsupported connectors)
Work closely with GitHub Copilot to scaffold, iterate, test, and document the streaming application code, acting as the technical reviewer and subject matter expert who ensures AI-generated pipelines are production-ready, PNC-compliant, and correctly integrated with the Iceberg catalog and Protegrity tokenization layer
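For orientation only, here is a minimal sketch of the ingestion pattern described above. It assumes a JSON payload, a Hadoop-type Iceberg catalog over S3-compatible storage, and a pre-existing Bronze table; the broker address, topic, schema, checkpoint path, and table names are hypothetical placeholders, and the real schema mappings, Avro handling, and Protegrity hooks would be layered on top of this skeleton.

```python
# Minimal sketch of the Kafka-to-Iceberg ingestion pattern described above.
# All names (broker, topic, catalog, table, paths, schema) are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

spark = (
    SparkSession.builder
    .appName("kafka-to-iceberg-bronze")
    # Iceberg catalog wired to an S3-compatible warehouse (assumed Hadoop catalog).
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "hadoop")
    .config("spark.sql.catalog.lake.warehouse", "s3a://example-bucket/warehouse")
    .getOrCreate()
)

# Example JSON payload schema; the real schema mapping comes from the source topic.
payload_schema = StructType([
    StructField("event_id", StringType()),
    StructField("event_ts", TimestampType()),
    StructField("body", StringType()),
])

# Kafka source configuration: bootstrap servers, starting offsets, data-loss policy.
raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")
    .option("subscribe", "example.events")
    .option("startingOffsets", "earliest")
    .option("failOnDataLoss", "false")
    .load()
)

parsed = raw.select(
    from_json(col("value").cast("string"), payload_schema).alias("e")
).select("e.*")

def write_batch(batch_df, batch_id):
    # Each micro-batch is appended as a single atomic Iceberg commit.
    # Assumes the Bronze table already exists in the catalog.
    batch_df.writeTo("lake.bronze.example_events").append()

query = (
    parsed.writeStream
    .foreachBatch(write_batch)
    .option("checkpointLocation", "s3a://example-bucket/checkpoints/example_events")
    .trigger(processingTime="1 minute")
    .start()
)
query.awaitTermination()
```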
Requirements:
Apache Kafka – Producer & Consumer
4+ years of hands-on experience with Apache Kafka, including both producer and consumer development in PySpark, Java, or Scala (see the producer/consumer sketch after this list)
Deep understanding of Kafka internals: topics, partitions, consumer groups, offsets, rebalancing, and exactly-once delivery semantics
Experience with Confluent Kafka: schema registry, Avro/JSON serialization, and Confluent Cloud or on-prem cluster configuration
Proven ability to build ingestion pipelines without relying on unsupported or third-party sink connectors — using only native Kafka consumer APIs and Spark integration
Familiarity with Kafka Connect architecture to evaluate trade-offs and articulate why application-level ingestion is preferred in constrained environments
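For producer and consumer development against the native client APIs, a minimal sketch using the confluent-kafka Python client is shown below; the broker address, topic, and consumer group id are hypothetical placeholders, and offsets are committed explicitly to illustrate manual offset management.

```python
# Minimal producer/consumer sketch using the confluent-kafka Python client.
# Broker address, topic, and consumer group id are hypothetical placeholders.
from confluent_kafka import Consumer, Producer

producer = Producer({"bootstrap.servers": "broker1:9092"})

def on_delivery(err, msg):
    # Delivery callback: surfaces broker-side failures per message.
    if err is not None:
        print(f"delivery failed: {err}")

producer.produce("example.events", key=b"42", value=b'{"event_id": "42"}',
                 callback=on_delivery)
producer.flush()

consumer = Consumer({
    "bootstrap.servers": "broker1:9092",
    "group.id": "example-consumer-group",
    "auto.offset.reset": "earliest",  # where to start when no committed offset exists
    "enable.auto.commit": False,      # commit offsets explicitly after processing
})
consumer.subscribe(["example.events"])

try:
    for _ in range(10):  # bounded loop for illustration only
        msg = consumer.poll(1.0)
        if msg is None:
            continue
        if msg.error():
            print(f"consumer error: {msg.error()}")
            continue
        print(msg.topic(), msg.partition(), msg.offset(), msg.value())
        consumer.commit(message=msg, asynchronous=False)  # synchronous offset commit
finally:
    consumer.close()
```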
PySpark Structured Streaming
Strong practical experience with PySpark Structured Streaming: Kafka source, file source, foreachBatch, output modes (append/update/complete), and checkpoint management
Experience tuning streaming micro-batch trigger intervals, watermarking, and late data handling for production workloads
Hands-on experience writing streaming data directly to Apache Iceberg tables using the Iceberg Spark runtime
Ability to implement robust error handling: dead-letter queues, parse error isolation, and recovery from checkpoint failures (see the error-handling sketch after this list)
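A sketch of the parse-error-isolation pattern inside foreachBatch follows, assuming the same hypothetical catalog and table names as the earlier ingestion sketch; malformed payloads are diverted to a dead-letter path, and restarting the query with the same checkpoint location resumes from the last committed offsets. Watermarking and late-data handling would additionally apply to any stateful aggregations, which are omitted here.

```python
# Sketch of parse-error isolation with a dead-letter path inside foreachBatch.
# Broker, topic, table, and path names are hypothetical; the Iceberg catalog is
# assumed to be configured on the session as in the earlier ingestion sketch.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("streaming-error-handling").getOrCreate()

schema = StructType([
    StructField("event_id", StringType()),
    StructField("event_ts", TimestampType()),
])

raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")
    .option("subscribe", "example.events")
    .load()
    .selectExpr("CAST(value AS STRING) AS raw_value")
    .withColumn("parsed", from_json(col("raw_value"), schema))
)

def write_batch(batch_df, batch_id):
    batch_df.persist()
    # Rows whose required key failed to parse are diverted to a dead-letter path
    # for later inspection and replay, instead of failing the whole batch.
    (batch_df.filter(col("parsed.event_id").isNull())
             .select("raw_value")
             .write.mode("append")
             .json("s3a://example-bucket/dlq/example_events"))
    # Well-formed rows are appended to the Bronze Iceberg table.
    (batch_df.filter(col("parsed.event_id").isNotNull())
             .select("parsed.*")
             .writeTo("lake.bronze.example_events")
             .append())
    batch_df.unpersist()

# Restarting with the same checkpointLocation resumes from the committed offsets.
(raw.writeStream
    .foreachBatch(write_batch)
    .option("checkpointLocation", "s3a://example-bucket/checkpoints/dlq_example")
    .trigger(processingTime="30 seconds")
    .start()
    .awaitTermination())
```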
Data Engineering & Iceberg
Working knowledge of Apache Iceberg: catalog configuration, schema definition, append writes, and partition strategy for event and log data (see the catalog and table sketch after this list)
Familiarity with S3-compatible object storage as an Iceberg warehouse destination
Understanding of medallion architecture — ability to correctly land streaming data in the Bronze layer with appropriate schema governance
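As a point of reference, here is a minimal sketch of Iceberg catalog configuration over S3-compatible object storage and a day-partitioned Bronze table definition, with hypothetical catalog, bucket, namespace, and table names; the actual catalog type (Hive, REST, or Hadoop) and partition strategy depend on the target environment, and the iceberg-spark-runtime and S3 filesystem jars must be on the Spark classpath.

```python
# Sketch of Iceberg catalog configuration over S3-compatible storage and a
# day-partitioned Bronze table. Catalog, bucket, namespace, and table names
# are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("iceberg-bronze-ddl")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "hadoop")  # Hadoop catalog over object storage
    .config("spark.sql.catalog.lake.warehouse", "s3a://example-bucket/warehouse")
    .getOrCreate()
)

spark.sql("CREATE NAMESPACE IF NOT EXISTS lake.bronze")

# Bronze table for raw events, partitioned by day of the event timestamp so that
# streaming appends and downstream reads can prune by date.
spark.sql("""
    CREATE TABLE IF NOT EXISTS lake.bronze.example_events (
        event_id  STRING,
        event_ts  TIMESTAMP,
        body      STRING
    )
    USING iceberg
    PARTITIONED BY (days(event_ts))
""")
```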