We are actively seeking a highly skilled and dedicated Big Data Engineer specializing in PySpark and Apache Airflow to enhance our data platform capabilities. This critical role involves designing, developing, and orchestrating complex data pipelines that underpin our advanced analytics and machine learning initiatives.
Job Responsibilities:
Design, develop, and maintain robust, scalable, and efficient big data pipelines primarily using PySpark for data ingestion, transformation, and processing
Implement and manage data workflows using Apache Airflow, including designing DAGs (Directed Acyclic Graphs), configuring operators, and optimizing task dependencies for reliable and scheduled data pipeline execution (a minimal DAG sketch follows this list)
Optimize PySpark jobs and data workflows for performance, cost-efficiency, and resource utilization across distributed computing environments
Collaborate closely with data scientists, AI/ML engineers, and other stakeholders to translate analytical and machine learning requirements into highly performant and automated data solutions
Develop and implement data quality checks, validation rules, and monitoring mechanisms within PySpark jobs and Airflow DAGs to ensure data integrity and consistency
Troubleshoot, debug, and resolve issues in PySpark code and Airflow pipeline failures, ensuring high availability and reliability of data assets
Contribute to the architecture and evolution of our data platform, advocating for best practices in data engineering, automation, and operational excellence
Ensure data security, privacy, and compliance throughout the data lifecycle within the pipelines
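For illustration, here is a minimal sketch of the orchestration pattern described above, assuming Airflow 2.4+ with the apache-airflow-providers-apache-spark package installed; the DAG id, script paths, and connection id are hypothetical placeholders rather than details of our actual platform:

from datetime import datetime

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

# Hypothetical daily pipeline: submit a PySpark transformation job, then a
# separate PySpark validation job, in that order.
with DAG(
    dag_id="daily_events_pipeline",                   # hypothetical DAG name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    transform_events = SparkSubmitOperator(
        task_id="transform_events",
        application="/opt/jobs/transform_events.py",  # hypothetical PySpark script
        conn_id="spark_default",
    )

    validate_events = SparkSubmitOperator(
        task_id="validate_events",
        application="/opt/jobs/validate_events.py",   # hypothetical PySpark script
        conn_id="spark_default",
    )

    # Task dependency: the data quality validation runs only after the
    # transformation succeeds (default trigger rule).
    transform_events >> validate_events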
Requirements:
7+ years of experience and expert-level proficiency in PySpark for building and optimizing large-scale data processing applications (a brief PySpark sketch follows this list)
Strong hands-on experience with Apache Airflow, including DAG development, custom operators/sensors, connections, and deployment strategies
Proven experience in designing, building, and operating production-grade distributed data pipelines
Solid understanding of big data architectures, distributed computing principles, and data warehousing concepts
Proficiency in data modeling, schema design, and various data storage formats (e.g., Parquet, ORC, Delta Lake)
Experience with cloud platforms such as AWS, Azure, or Google Cloud Platform (GCP), specifically their big data services (e.g., EMR, Databricks, HDInsight, Dataflow) and object storage (S3, ADLS, GCS)
Demonstrated experience with version control systems, particularly Git
Excellent problem-solving, analytical, and debugging skills
Ability to work effectively both independently and as part of a collaborative, agile team
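As a rough sketch of the kind of PySpark job implied by the requirements above (the bucket paths, column names, and quality rules are hypothetical assumptions for illustration only):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Hypothetical job: ingest raw event data, apply basic quality rules, and
# persist the curated result as date-partitioned Parquet.
spark = SparkSession.builder.appName("transform_events").getOrCreate()

raw = spark.read.json("s3://example-bucket/raw/events/")       # hypothetical source path

curated = (
    raw.dropDuplicates(["event_id"])                           # remove duplicate events
       .filter(F.col("event_ts").isNotNull())                  # basic validation rule
       .withColumn("event_date", F.to_date("event_ts"))        # derive the partition column
)

(
    curated.write
           .mode("overwrite")
           .partitionBy("event_date")
           .parquet("s3://example-bucket/curated/events/")     # hypothetical target path
)

spark.stop()

Partitioning by date enables partition pruning for downstream reads; the same pattern applies to ORC or Delta Lake outputs with a different writer format.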
Nice to have:
Experience with containerization technologies (e.g., Docker, Kubernetes) for deploying PySpark applications or Airflow
Familiarity with CI/CD practices for data pipelines
Understanding of machine learning concepts and experience with data preparation for AI/ML models
Knowledge of other orchestration tools or workflow managers