astronomer / airflow-quickstart

airflow-quickstart's Introduction

Overview

Welcome to this hands-on repository to get started with Apache Airflow! 🚀

This repository contains a simple Airflow pipeline following an ELT pattern that you can run in GitHub Codespaces or locally with the Astro CLI. Your pipeline will ingest climate data from a CSV file and local weather data from an API to create interactive visualizations of temperature changes over time.

Your pipeline will accomplish this using six Airflow DAGs and the following tools:

  • The Astro Python SDK for ELT operations.
  • DuckDB, an in-process relational database, for storing tables of the ingested data as well as the transformed result tables.
  • Streamlit, a Python package for creating interactive apps, for displaying the data as a dashboard. The Streamlit app will retrieve its data from tables in DuckDB.

All tools used are open source, so you will not need to create additional accounts.
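
To make the ELT pattern above concrete, here is a minimal, hedged sketch of loading a CSV into DuckDB with the Astro Python SDK. It is not the project's actual DAG: the file path, table name, and connection ID are placeholders for the real values defined in the project's include/ folder and .env file.

from pendulum import datetime

from airflow.decorators import dag
from astro import sql as aql
from astro.files import File
from astro.sql.table import Table


@dag(start_date=datetime(2023, 1, 1), schedule=None, catchup=False)
def load_climate_csv_sketch():
    # Ingest the CSV into a DuckDB table; the SDK infers the schema.
    aql.load_file(
        input_file=File(path="include/climate_data/global_temperatures.csv"),  # placeholder path
        output_table=Table(name="climate_raw", conn_id="duckdb_default"),  # placeholder conn_id
    )


load_climate_csv_sketch()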

After completing all tasks, the Streamlit app will look similar to the following screenshots:

Finished Streamlit App Part 1 Finished Streamlit App Part 2 Finished Streamlit App Part 3

Part 1: Run a fully functional pipeline

Follow the Part 1 Instructions to get started!

The ready-to-run Airflow pipeline consists of four DAGs and will:

  • Retrieve the current weather for your city from an API.
  • Ingest climate data from a local CSV file.
  • Load the data into DuckDB using the Astro SDK.
  • Run a transformation on the data using the Astro SDK to create a reporting table powering a Streamlit App.

Part 2: Exercises

Follow the Part 2 Instructions to extend the pipeline to show historical weather data for cities of your choice in the Streamlit App. During this process, you will learn about Airflow features like Datasets, dynamic task mapping, and the Astro Python SDK.

Part 3: Play with it!

Use this repository to explore Airflow best practices, experiment with your own DAGs, and use it as a template for your own projects!

This project was created with ❤️ by Astronomer.

If you are looking for an entry-level written tutorial where you build your own DAG from scratch, check out: Get started with Apache Airflow, Part 1: Write and run your first DAG.


How to use this repository

Setting up

Option 1: Use GitHub Codespaces

Run this Airflow project without installing anything locally.

  1. Fork this repository.

  2. Create a new GitHub Codespaces project on your fork. Make sure it uses at least 4 cores!

    Fork repo and create a Codespaces project

  3. Run this command in the Codespaces terminal: bash ./.devcontainer/post_creation_script.sh.

  4. The Astro CLI will automatically start up all necessary Airflow components as well as the Streamlit app. This can take a few minutes.

  5. Once the Airflow project has started, access the Airflow UI by clicking on the Ports tab and opening the forward URL for port 8080.

    Open Airflow UI URL Codespaces

  6. Once the Streamlit app is running, you can access it by clicking on the Ports tab and opening the forward URL for port 8501.

Option 2: Use the Astro CLI

Download the Astro CLI to run Airflow locally in Docker. astro is the only package you will need to install.

  1. Run git clone https://github.com/astronomer/airflow-quickstart.git on your computer to create a local clone of this repository.
  2. Install the Astro CLI by following the steps in the Astro CLI documentation. Docker Desktop/Docker Engine is a prerequisite, but you don't need in-depth Docker knowledge to run Airflow with the Astro CLI.
  3. Run astro dev start in your cloned repository.
  4. After your Astro project has started, view the Airflow UI at localhost:8080.
  5. View the Streamlit app at localhost:8501. NOTE: The Streamlit container can take a few minutes to start up.

Run the project

Part 1 Instructions

All DAGs tagged with part_1 are part of a pre-built, fully functional Airflow pipeline. To run them:

  1. Go to include/global_variables/user_input_variables.py and enter your own info for MY_NAME and MY_CITY.

  2. Trigger the start DAG and unpause all DAGs that are tagged with part_1 by clicking on the toggle on their left-hand side. Once the start DAG is unpaused, it will run once, starting the pipeline. You can also run this DAG manually to trigger further pipeline runs by clicking the play button on the right side of the DAG.

    The DAGs that will run are:

    • start
    • extract_current_weather_data
    • in_climate_data
    • transform_climate_data
  3. Watch the DAGs run according to their dependencies, which have been set using Datasets.

    Dataset and DAG Dependencies

  4. Open the Streamlit app. If you are using Codespaces, go to the Ports tab and open the URL of the forwarded port 8501. If you are running locally go to localhost:8501.

    Open Streamlit URL Codespaces

  5. View the Streamlit app, now showing global climate data and the current weather for your city.

    Streamlit app

Part 2 Instructions (Exercises)

The two DAGs tagged with part_2 are part of a partially built Airflow pipeline that handles historical weather data. You can find example solutions in the solutions_exercises folder.

Before you get started, go to include/global_variables/user_input_variables.py and enter your own info for HOT_DAY and BIRTHYEAR.

Exercise 1 - Datasets

Both the extract_historical_weather_data and transform_historical_weather_data DAGs currently have their schedule set to None.

Use Datasets to make:

  • extract_historical_weather_data run after the start DAG has finished
  • transform_historical_weather_data run after the extract_historical_weather_data DAG has finished

You can find information about how to use the Datasets feature in this guide. See also the documentation on how the Astro Python SDK interacts with Datasets.
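
As a minimal, hedged sketch of Dataset-based scheduling (Airflow 2.4+): a consuming DAG runs whenever the listed Dataset is updated by a producing task. The Dataset URI below is a placeholder; the project's start and extract DAGs define the real ones.

from pendulum import datetime

from airflow.datasets import Dataset
from airflow.decorators import dag, task

start_dataset = Dataset("duckdb://include/dwh/start")  # placeholder URI


@dag(start_date=datetime(2023, 1, 1), schedule=[start_dataset], catchup=False)
def runs_after_start_sketch():
    @task
    def do_work():
        # This DAG is triggered when an upstream task updates start_dataset.
        print("Triggered by a Dataset update.")

    do_work()


runs_after_start_sketch()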

After running the two DAGs in order, view your Streamlit app. You will now see a graph with hot days per year. Additionally, parts of the historical weather table will be printed out.

Streamlit app

Exercise 2 - Dynamic Task Mapping

The tasks in the extract_historical_weather_data DAG currently retrieve historical weather information for only one city. Use dynamic task mapping to retrieve information for three cities.

You can find instructions on how to use dynamic task mapping in this guide. Tip: You only need to modify two lines of code!
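
For orientation, here is a generic, hedged sketch of dynamic task mapping (not the exercise solution): calling .expand() instead of calling the task directly creates one mapped task instance per element of the input list. The task name and return value are placeholders.

from pendulum import datetime

from airflow.decorators import dag, task


@dag(start_date=datetime(2023, 1, 1), schedule=None, catchup=False)
def mapping_sketch():
    @task
    def get_coordinates(city: str) -> dict:
        # placeholder logic; the project calls the Open Meteo geocoding API here
        return {"city": city}

    # one mapped task instance per city
    get_coordinates.expand(city=["Bern", "Luzern", "Zurich"])


mapping_sketch()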

After completing the exercise, rerun both extract_historical_weather_data and transform_historical_weather_data.

In your Streamlit app, you can now select the different cities from the dropdown box to see how many hot days they had per year.

Streamlit app

Exercise 3 - Astro Python SDK

The Astro Python SDK is an open-source package built on top of Airflow to provide you with functions and classes that simplify common ELT and ETL operations such as loading files or using SQL or Pandas to transform data in a database-agnostic way. View the Astro Python SDK documentation for more information.
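
For the SQL side of the SDK, here is a minimal, hedged sketch of an aql.transform task; the table and column names are placeholders, not the project's schema. The returned SQL string runs against the database backing the input table (DuckDB in this project), and {{ in_table }} is templated by the SDK.

from astro import sql as aql
from astro.sql.table import Table


@aql.transform
def yearly_average(in_table: Table):
    # placeholder columns; the SDK substitutes {{ in_table }} with the real table
    return "SELECT year, AVG(temperature) AS avg_temp FROM {{ in_table }} GROUP BY year;"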

The transform_historical_weather_data DAG uses the aql.dataframe decorator to transform data with Pandas. The table returned by the find_hottest_day_birthyear task will be printed at the end of your Streamlit app. By default, the task makes no transformation to the table, so let's change that!

@aql.dataframe(pool="duckdb")
def find_hottest_day_birthyear(in_table: pd.DataFrame, birthyear: int):
    # print ingested df to the logs
    gv.task_log.info(in_table)

    output_df = in_table

    ####### YOUR TRANSFORMATION ##########

    # print result table to the logs
    gv.task_log.info(output_df)

    return output_df

Use Pandas to transform the data in in_table, searching for the hottest day in your birthyear for each city for which you retrieved data.

Tip: Both the in_table dataframe and the output_df dataframe are printed to the logs of the find_hottest_day_birthyear task. The goal is to produce an output like the one in the screenshot below. If your table does not contain information for several cities, make sure you completed Exercise 2 correctly.
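
One possible transformation, as a hedged sketch: the column names ("city", "time", "temperature_2m_max") are assumptions about what extract_historical_weather_data loads, so check the in_table printout in the task logs for the real ones before copying anything.

import pandas as pd


def hottest_day_per_city(in_table: pd.DataFrame, birthyear: int) -> pd.DataFrame:
    df = in_table.copy()
    df["time"] = pd.to_datetime(df["time"])  # assumed date column
    in_year = df[df["time"].dt.year == birthyear]
    # for each city, keep the row with the highest daily maximum temperature
    idx = in_year.groupby("city")["temperature_2m_max"].idxmax()
    return in_year.loc[idx, ["city", "time", "temperature_2m_max"]]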

Streamlit app


How it works

Components and infrastructure

This repository uses a custom codespaces container to install the Astro CLI and forward ports.

Five Docker containers will be created and relevant ports will be forwarded for:

  • The Airflow scheduler
  • The Airflow webserver
  • The Airflow metastore
  • The Airflow triggerer
  • The Streamlit app

Additionally, when using Codespaces, the command to run the Streamlit app is run automatically when the environment starts.

Data sources

The global climate data in the local CSV file was retrieved from the Climate Change: Earth Surface Temperature Data Kaggle dataset by Berkeley Earth and Kristen Sissener, which was uploaded under CC BY-NC-SA 4.0.

The current and historical weather data is queried from the Open Meteo API (CC BY 4.0).

Project Structure

This repository contains the following files and folders:

  • .astro: files necessary for Astro CLI commands.

  • .devcontainer: the GitHub Codespaces configuration.

  • dags: all DAGs in your Airflow environment. Files in this folder will be parsed by the Airflow scheduler when looking for DAGs to add to your environment. You can add your own DAG files to this folder.

    • climate_and_current_weather: folder for DAGs which are used in part 1
      • extract_and_load: DAGs related to data extraction and loading
        • extract_current_weather_data.py
        • in_climate_data.py
      • transform: DAGs transforming data
        • transform_climate_data.py
    • historical_weather: folder for DAGs which are used in part 2
      • extract_and_load: DAGs related to data extraction and loading
        • extract_historical_weather_data.py
      • transform: DAGs transforming data
        • transform_historical_weather.py
  • include: supporting files that will be included in the Airflow environment.

    • climate_data: contains a CSV file with global climate data.
    • global_variables: configuration files.
      • airflow_conf_variables.py: file storing variables needed in several DAGs.
      • constants.py: file storing table names.
      • user_input_variables.py: file with user input variables like MY_NAME and MY_CITY.
    • meterology_utils.py: file containing functions that call the Open Meteo API.
    • streamlit_app.py: file defining the Streamlit app.
  • plugins: folder to place Airflow plugins. Empty.

  • solutions_exercises: folder for part 2 solutions.

    • solution_extract_historical_weather_data.py: solution version of the extract_historical_weather_data DAG.
    • solution_transform_historical_weather.py: solution version of the transform_historical_weather DAG.
  • src: contains images used in this README.

  • tests: folder for pytest tests that run on DAGs in the Airflow instance. Contains default tests.

  • .dockerignore: list of files to ignore for Docker.

  • .env: environment variables. Contains the definition for the DuckDB connection.

  • .gitignore: list of files to ignore for git. NOTE that .env is not ignored in this project.

  • Dockerfile: the Dockerfile using the Astro CLI.

  • packages.txt: system-level packages to be installed in the Airflow environment when the Docker image is built.

  • README.md: this README.

  • requirements.txt: Python packages to be installed for use by DAGs when the Docker image is built.

airflow-quickstart's People

Contributors: kentdanas, merobi-hub, tjanif

airflow-quickstart's Issues

duckdb.IOException: IO Error: Cannot open file "/usr/local/airflow/include/dwh": Permission denied

How to reproduce:

Follow "Option 2: Use the Astro CLI" instructions and run part 1, it'll result in the above issue.

Cause:

The Astro CLI forces Airflow's Docker container to set the astro user's UID to 50000.

Proposed Solution:

The only way I know to get around this is to run sudo chmod -R a=rwx project-folder, recursively giving all users read/write/execute access to the folder Airflow needs.

The default Airflow Docker setup has an environment variable that allows you to set the airflow UID to match your host's UID.

duckdb.ConversionException: Conversion Error: Unimplemented type for cast (STRUCT("0" DOUBLE, "1" DOUBLE, ... , "23011" DOUBLE) -> DOUBLE)

How to reproduce:

In "Part 2 Instructions (Exercises)," when modifying and running a part of extract_historical_weather_data.py as specified in Exercise 2, the above error occurs.

coordinates = get_lat_long_for_city.expand(city=["Bern", "Luzern", "Zurich"])
historical_weather = get_historical_weather.expand(coordinates=coordinates)

Cause:

When you run type(historical_weather) to check the type of the variable historical_weather, it outputs the following:
<class 'airflow.models.xcom.LazyXComAccess'>

This means the code enters the else branch of the if type(historical_weather) == list: statement in the turn_json_into_table function, which causes this issue.

    def turn_json_into_table(
        duckdb_conn_id: str,
        historical_weather_table_name: str,
        historical_weather: dict,
    ):
        """
        Convert the JSON input with info about historical weather into a pandas
        DataFrame and load it into DuckDB.
        Args:
            duckdb_conn_id (str): The connection ID for the DuckDB connection.
            historical_weather_table_name (str): The name of the table to store the historical weather data.
            historical_weather (list): The historical weather data to load into DuckDB.
        """
        from duckdb_provider.hooks.duckdb_hook import DuckDBHook

        if type(historical_weather) == list:
            list_of_df = []

            for item in historical_weather:
                df = pd.DataFrame(item)
                list_of_df.append(df)

            historical_weather_df = pd.concat(list_of_df, ignore_index=True)
        else:
            historical_weather_df = pd.DataFrame(historical_weather)

Proposed Solution:

Commenting out the entire if/else block above and modifying the code to run the following part directly worked successfully. If you have any better suggestions for a fix, please let me know.

            list_of_df = []

            for item in historical_weather:
                df = pd.DataFrame(item)
                list_of_df.append(df)

            historical_weather_df = pd.concat(list_of_df, ignore_index=True)
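
An alternative, hedged sketch of a fix: rather than checking for list exactly, treat only a single dict as the single-city case and iterate over anything else, which also covers the LazyXComAccess object produced by mapped tasks. This is a drop-in replacement for the quoted if/else, assuming pandas is imported as pd as in the surrounding code.

if isinstance(historical_weather, dict):
    historical_weather_df = pd.DataFrame(historical_weather)
else:
    # LazyXComAccess is iterable, yielding one dict per mapped task instance
    historical_weather_df = pd.concat(
        (pd.DataFrame(item) for item in historical_weather), ignore_index=True
    )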

Quickstart fails to start using GitHub Codespaces

When I fork this repo and create a codespace, it starts in recovery mode due to an error. I changed the machine type to 4 cores and restarted, but the same error occurs.

"This codespace is currently running in recovery mode due to a container error."

Here is the startup log with container IDs removed:

2024-07-31 16:13:52.776Z: Host information
2024-07-31 16:13:52.783Z: ----------------
2024-07-31 16:13:52.784Z: OS: Ubuntu 22.04.4 LTS (stable release)
2024-07-31 16:13:52.784Z: Image details: https://github.com/github/codespaces-host-images/blob/main/README.md
2024-07-31 16:13:52.784Z: ----------------

=================================================================================
2024-07-31 16:13:52.784Z: Configuration starting...
2024-07-31 16:13:52.792Z: Cloning...
2024-07-31 16:13:52.834Z: Using image: mcr.microsoft.com/devcontainers/universal

=================================================================================
2024-07-31 16:13:52.854Z: Creating container...
2024-07-31 16:13:52.911Z: $ devcontainer up --id-label Type=codespaces --workspace-folder /var/lib/docker/codespacemount/workspace/airflow-quickstart --mount type=bind,source=/.codespaces/agent/mount/cache,target=/vscode --user-data-folder /var/lib/docker/codespacemount/.persistedshare --container-data-folder .vscode-remote/data/Machine --container-system-data-folder /var/vscode-remote --log-level trace --log-format json --update-remote-user-uid-default never --mount-workspace-git-root false --omit-config-remote-env-from-metadata --skip-non-blocking-commands --skip-post-create --expect-existing-container --config "/var/lib/docker/codespacemount/workspace/airflow-quickstart/.devcontainer/devcontainer.json" --override-config /root/.codespaces/shared/merged_devcontainer.json --default-user-env-probe loginInteractiveShell --container-session-data-folder /workspaces/.codespaces/.persistedshare/devcontainers-cli/cache --secrets-file /root/.codespaces/shared/user-secrets-envs.json
2024-07-31 16:13:53.083Z: @devcontainers/cli 0.56.1. Node.js v18.20.3. linux 6.5.0-1022-azure x64.
2024-07-31 16:13:53.231Z: $ docker start xxxxx
2024-07-31 16:13:53.481Z: xxxxxx

2024-07-31 16:13:53.483Z: Stop: Run: docker start xxx
2024-07-31 16:13:53.553Z: Shell server terminated (code: 126, signal: null)
2024-07-31 16:13:53.553Z: unable to find user codespace: no matching entries in passwd file
2024-07-31 16:13:53.554Z: {"outcome":"error","message":"An error occurred setting up the container.","description":"An error occurred setting up the container.","containerId":"xxxx"}
2024-07-31 16:13:53.555Z: Error: An error occurred setting up the container.
2024-07-31 16:13:53.555Z: at O$ (/.codespaces/agent/bin/node_modules/@devcontainers/cli/dist/spec-node/devContainersSpecCLI.js:464:1253)
2024-07-31 16:13:53.555Z: at iK (/.codespaces/agent/bin/node_modules/@devcontainers/cli/dist/spec-node/devContainersSpecCLI.js:464:997)
2024-07-31 16:13:53.556Z: at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
2024-07-31 16:13:53.556Z: at async gAA (/.codespaces/agent/bin/node_modules/@devcontainers/cli/dist/spec-node/devContainersSpecCLI.js:481:3660)
2024-07-31 16:13:53.556Z: at async BC (/.codespaces/agent/bin/node_modules/@devcontainers/cli/dist/spec-node/devContainersSpecCLI.js:481:4775)
2024-07-31 16:13:53.557Z: at async xeA (/.codespaces/agent/bin/node_modules/@devcontainers/cli/dist/spec-node/devContainersSpecCLI.js:614:11265)
2024-07-31 16:13:53.557Z: at async UeA (/.codespaces/agent/bin/node_modules/@devcontainers/cli/dist/spec-node/devContainersSpecCLI.js:614:11006)
2024-07-31 16:13:53.561Z: devcontainer process exited with exit code 1

====================================== ERROR ====================================
2024-07-31 16:13:53.565Z: Failed to create container.

2024-07-31 16:13:53.565Z: Error: An error occurred setting up the container.
2024-07-31 16:13:53.568Z: Error code: 1302 (UnifiedContainersErrorFatalCreatingContainer)

====================================== ERROR ====================================
2024-07-31 16:13:53.577Z: Container creation failed.

2024-07-31 16:13:53.796Z:

===================================== WARNING ===================================
2024-07-31 16:13:53.797Z: Creating recovery container.
