
airflow-spark-101's Introduction

Pyspark-example

Prerequisites

Project Structure

root/
 |-- dags/
 |   |-- project.py
 |-- src/
 |   |-- project/
 |   |   |-- common/
 |   |   |   |-- spark.py
 |   |   |-- jobs/
 |   |   |-- transformations/
 |   |-- app.py
 |-- tests/
 |   |-- common/
 |   |   |-- spark.py
 |-- Dockerfile
 |-- setup.py

The main Python module, app.py, contains the ETL job. By default, app.py accepts the following arguments:

  • --date the execution date
  • --env the environment we are executing in
  • --jobs one or more jobs that need to be executed
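As a rough illustration, the arguments listed above could be declared with argparse along these lines. This is a minimal sketch and not the actual contents of app.py: the function name build_parser, the ISO date format, the required flags, and the job name "taxi" are assumptions; the short flags -d and -e are taken from the spark-submit command shown later in this README.

```python
import argparse
from datetime import date


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the three arguments described above (hypothetical layout)."""
    parser = argparse.ArgumentParser(description="Run one or more ETL jobs.")
    # Assumption: the execution date is given as YYYY-MM-DD.
    parser.add_argument(
        "-d", "--date", type=date.fromisoformat, required=True,
        help="the execution date (YYYY-MM-DD)",
    )
    parser.add_argument(
        "-e", "--env", required=True,
        help="the environment we are executing in",
    )
    # nargs="+" accepts one or more job names after --jobs.
    parser.add_argument(
        "--jobs", nargs="+", default=[],
        help="one or more jobs that need to be executed",
    )
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(["-d", "2023-01-01", "-e", "dev", "--jobs", "taxi"])
print(args.date, args.env, args.jobs)
```

Passing type=date.fromisoformat lets argparse reject malformed dates with a usage error before any job starts.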

Concepts

Pin your Python dependencies

When building your Python application and its dependencies for production, you want your builds to be predictable and deterministic. Therefore, always pin your dependencies. You can read more in the article: Better package management

When using pip-tools to manage dependencies, you define your dependencies in the requirements.in file. This file can then be compiled into the requirements.txt file by running the command pip-compile requirements.in from your shell.

This compilation step makes sure every dependency gets pinned in the requirements.txt file, ensuring that the project won't break because transitive dependencies were silently updated. When a dependency does need to be updated, you can update the requirements.in file and re-compile it. With this method, package updates always happen as a conscious decision by the developer.

The pip-compile command should be run from the same virtual environment as your project so conditional dependencies that require a specific Python version, or other environment markers, resolve relative to your project's environment.
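To make the idea of "every dependency gets pinned" concrete, the sketch below scans the text of a requirements file and reports any entry that is not pinned with ==. It is not part of this project or of pip-tools: the helper name find_unpinned and the sample package versions are hypothetical, and the regular expression only covers the common name==version form.

```python
import re

# A pinned entry looks like "name==1.2.3", optionally with extras such as
# "name[extra]==1.2.3". Markers, hashes, and comments may follow the version.
_PINNED = re.compile(r"^[A-Za-z0-9][A-Za-z0-9._-]*(\[[^\]]+\])?==[^\s;]+")


def find_unpinned(requirements_text: str) -> list[str]:
    """Return the requirement lines that are not pinned to an exact version."""
    unpinned = []
    for raw_line in requirements_text.splitlines():
        # Drop trailing comments such as the "# via ..." notes pip-compile writes.
        line = raw_line.split("#", 1)[0].strip()
        # Skip blank lines and option lines such as "-r other.txt" or "--hash=...".
        if not line or line.startswith("-"):
            continue
        if not _PINNED.match(line):
            unpinned.append(line)
    return unpinned


# Illustrative input only; the versions are placeholders, not this project's pins.
sample = """\
pyspark==3.3.1
    # via -r requirements.in
py4j==0.10.9.5
requests>=2.0
"""
print(find_unpinned(sample))
```

A check like this could run in CI to catch a hand-edited requirements.txt that no longer matches what pip-compile would produce.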

Adding another job to the spark application

If you want to run another job in your Spark application, create a file like app.py. You should:

  • Use argparse (or something similar) to parse the arguments passed to your job
  • Have a main function that can be called
  • Make sure you have the if __name__ == "__main__" construct in your file, like below
  • Use your job file in the DAG

The following Python snippet makes sure that the main() function is executed when you call this module from the command line:

if __name__ == "__main__":
    main()
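Putting the points above together, a new job file might be laid out roughly as follows. This is a hedged skeleton rather than the project's real job code: the run function, its message, and the default value for --date are hypothetical placeholders, and the actual Spark logic is left out.

```python
import argparse
from datetime import date


def run(execution_date: str) -> str:
    """Hypothetical placeholder for the job's Spark logic."""
    return f"Running job for {execution_date}"


def main(argv=None) -> int:
    """Parse the arguments and run the job; argv=None means use sys.argv."""
    parser = argparse.ArgumentParser(description="Hypothetical example job.")
    # Assumption: default to today's date so the module also runs without flags.
    parser.add_argument(
        "-d", "--date", default=date.today().isoformat(),
        help="the execution date (YYYY-MM-DD)",
    )
    args = parser.parse_args(argv)
    print(run(args.date))
    return 0


# Only run the job when the file is executed directly, not when it is imported.
if __name__ == "__main__":
    main()
```

Accepting an optional argv in main() keeps the function callable from tests or other modules without touching sys.argv.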

Commands

Set up a virtual environment:

  • pyenv local to select the correct Python version
  • python -m venv venv to create a virtual environment
  • source ./venv/bin/activate to activate the virtual environment
  • pip install pip-tools to install pip-tools

Tasks:

  • pip install -r requirements.txt to install dependencies
  • pip install -r dev-requirements.txt to install development dependencies
  • pip install -e . to install the project in editable mode
  • python -m pytest --cov=src tests to run all the tests and check coverage
  • python -m black dags src tests --check to check for code formatting (PEP 8) issues
  • python -m black dags src tests to fix code formatting (PEP 8) issues
  • pip-compile requirements.in to regenerate requirements.txt after you add new requirements
  • pip-compile dev-requirements.in to regenerate dev-requirements.txt after you add new development requirements; also do this when you have updated your requirements.in

Running the taxi job locally

spark-submit src/pysparkexample/app.py -d 2023-01-01 -e dev

Contributors

nclaeys
