airflow-demo

What is Airflow?

Airflow is an open-source platform designed for programmatically authoring, scheduling, and monitoring workflows. It provides a flexible framework for defining complex data pipelines as Directed Acyclic Graphs (DAGs) and executing tasks within those pipelines. Airflow enables the orchestration of tasks, tracks their dependencies, handles retries and failures, and offers a web-based user interface for monitoring and managing workflows. It is widely used in the data engineering and data science communities for building and managing scalable, reliable, and maintainable data pipelines.

Airflow Components

Core Components

  • Airflow is an orchestrator, not a processing framework. Process your gigabytes of data outside of Airflow (e.g., you have a Spark cluster, you use an operator to submit a Spark job, and the data is processed in Spark).
  • A DAG is a data pipeline; an Operator is a task.
  • An Executor defines how your tasks are executed, whereas a worker is a process that executes your tasks.
  • The Scheduler schedules your tasks, the web server serves the UI, and the database stores Airflow's metadata (a minimal DAG tying these pieces together follows).
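
To tie these terms together, here is a minimal sketch of a one-task DAG (the dag_id and bash command are arbitrary examples):

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# The DAG is the pipeline; the operator instantiates one task inside it.
with DAG(
    dag_id='hello_airflow',  # hypothetical name
    schedule_interval='@daily',
    start_date=datetime(2023, 1, 1),
    catchup=False
) as dag:
    say_hello = BashOperator(
        task_id='say_hello',
        bash_command='echo "hello from Airflow"'
    )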

One Node Architecture

(diagram: single-node architecture, with the scheduler, web server, metadata database, and executor all running on one machine)

Multi Node Architecture

(diagram: multi-node architecture, with the scheduler, web server, and metadata database on one node and a queue distributing tasks to multiple worker nodes)

How does it work?

(diagram: Steps 01 to 04 of a DAG run: the scheduler parses the DAG and creates a DAG run, task instances are scheduled and queued, the executor hands them to a worker, and the resulting states are recorded in the metadata database and shown in the UI)

DAG

A DAG (Directed Acyclic Graph) is a collection of tasks with directed dependencies and no cycles; in Airflow, one DAG represents one data pipeline.

Operator

An Operator defines a single task. Do keep one unit of work per operator, so a retry re-runs only that step; don't pack several unrelated actions into one operator.

3 Types of Operators

  • Action Operators: execute an action (e.g., PythonOperator, BashOperator)
  • Transfer Operators: move data from a source to a destination
  • Sensors: wait for a condition to be met (a short example of each style follows)
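
A minimal sketch of an action operator chained behind a sensor (the file path and task ids are arbitrary examples; FileSensor ships with core Airflow):

from airflow.operators.bash import BashOperator
from airflow.sensors.filesystem import FileSensor

# Sensor: checks every 30 seconds until the file exists.
wait_for_file = FileSensor(
    task_id='wait_for_file',
    filepath='/tmp/incoming/file.csv',  # hypothetical path
    poke_interval=30
)

# Action operator: runs only once the sensor has succeeded.
process = BashOperator(
    task_id='process',
    bash_command='echo "processing"'
)

wait_for_file >> process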

Hooks


In Apache Airflow, a hook is a way to interact with external systems or services within your workflows. It provides a high-level interface to connect and interact with various systems, such as databases, cloud services, message queues, and more. Hooks abstract the implementation details of interacting with these systems, providing a consistent and simplified interface.
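
For example, a minimal sketch using the PostgresHook from the Postgres provider (the connection id my_postgres and the table are assumptions; the connection must be defined in Airflow):

from airflow.providers.postgres.hooks.postgres import PostgresHook

def fetch_row_count():
    # The hook pulls host and credentials from the 'my_postgres' connection.
    hook = PostgresHook(postgres_conn_id='my_postgres')  # hypothetical conn id
    rows = hook.get_records('SELECT COUNT(*) FROM my_table;')  # hypothetical table
    print(rows)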

DAG Scheduling

  • start_date: the timestamp from which the scheduler will attempt to backfill
  • schedule_interval: how often the DAG runs
  • end_date: the timestamp after which the DAG stops being scheduled


A DAG run is triggered AFTER the start_date (or the last run) + the schedule_interval; in other words, a run starts once the data interval it covers has ended.
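
A sketch of what that means in practice (the dates are examples):

from datetime import datetime

from airflow import DAG

# start_date 2023-01-01 with an @daily schedule: the first run triggers
# at 2023-01-02 00:00, i.e. start_date + schedule_interval, and covers
# the 2023-01-01 data interval.
with DAG(
    dag_id='scheduling_demo',  # hypothetical name
    schedule_interval='@daily',
    start_date=datetime(2023, 1, 1),
    catchup=False
) as dag:
    ...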


Backfilling

With catchup=True, the scheduler creates a DAG run for every schedule interval between the start_date and now that has not yet been run. Past intervals can also be replayed on demand with the airflow dags backfill CLI command.
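
A minimal sketch of the catchup flag, plus the CLI command for a manual backfill (the dag id and dates are examples):

from datetime import datetime

from airflow import DAG

with DAG(
    dag_id='backfill_demo',  # hypothetical name
    schedule_interval='@daily',
    start_date=datetime(2023, 1, 1),
    catchup=False  # set to True to let the scheduler fill in missed intervals
) as dag:
    ...

# To replay a past date range manually, run from a shell:
# airflow dags backfill --start-date 2023-01-01 --end-date 2023-01-07 backfill_demo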

Dataset

In Apache Airflow (2.4+), a Dataset is a logical reference to a piece of data that a task produces or consumes, such as a file or a table.
Datasets enable data-aware scheduling: a producer task declares the datasets it updates through its outlets argument, and a consumer DAG can be scheduled to run whenever those datasets are updated, instead of on a time-based schedule.

  • Two properties
    • URI
      • Unique identifier of your data
      • Path to your data
      • Must be composed of ASCII characters only
      • The URI scheme cannot be airflow
      • Case sensitive
    • Extra: arbitrary metadata attached to the dataset
      from airflow import Dataset

      my_file = Dataset(
          uri='s3://dataset/file.csv',
          extra={'owner': 'nilanjan.deb'}
      )

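How a dataset drives scheduling, as a minimal sketch (the dag ids and update logic are assumptions):

from datetime import datetime

from airflow import DAG, Dataset
from airflow.operators.python import PythonOperator

my_file = Dataset('s3://dataset/file.csv')

def _update():
    ...  # write the file; Airflow only tracks task success, not the data itself

# Producer: declaring the dataset in outlets marks it updated on task success.
with DAG('producer', schedule_interval='@daily',
         start_date=datetime(2023, 1, 1), catchup=False):
    PythonOperator(task_id='update', python_callable=_update, outlets=[my_file])

# Consumer: scheduled on the dataset instead of a time-based schedule.
with DAG('consumer', schedule=[my_file],
         start_date=datetime(2023, 1, 1), catchup=False):
    PythonOperator(task_id='consume', python_callable=lambda: print('updated'))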

Limitations

  • DAGs can only use Datasets in the same Airflow instance. A DAG cannot wait for a Dataset defined in another Airflow instance.
  • Consumer DAGs are triggered every time a task that updates datasets completes successfully. Airflow doesn't check whether the data has been effectively updated.
  • You can't combine different schedule types, e.g., a dataset schedule with a cron expression.
  • If two tasks update the same dataset, as soon as one is done, that triggers the Consumer DAG immediately without waiting for the second task to complete.
  • Airflow monitors datasets only within the context of DAGs and Tasks. If an external tool updates the actual data represented by a Dataset, Airflow has no way of knowing that.

Executor

In Apache Airflow, an executor is a component responsible for executing tasks within workflows. The executor determines how tasks are executed, distributes the workload, and manages the resources required for task execution.
Airflow supports different types of executors, allowing you to choose the one that best suits your needs. The executor you choose affects the parallelism, scalability, and resource allocation of your workflows.
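
To check which executor a deployment is using, the Airflow CLI can read it straight from the config:

airflow config get-value core executor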

Sequential Executor:

This is the default executor in Airflow. It executes tasks sequentially in a single process, one after another, based on their dependencies and priority.

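For reference, the matching airflow.cfg values look like this (the SQLite path is an example; the SequentialExecutor is the only executor that works with SQLite):

executor=SequentialExecutor
sql_alchemy_conn=sqlite:////opt/airflow/airflow.db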

Local Executor:

The LocalExecutor allows for parallel task execution on a single machine. It leverages multiprocessing to execute multiple tasks simultaneously, providing increased parallelism compared to the SequentialExecutor. It requires a metadata database that supports concurrent access, such as PostgreSQL or MySQL, rather than SQLite.


Config

executor=LocalExecutor
sql_alchemy_conn=postgresql+psycopg2://<user>:<password>@<host>/<db>

Celery Executor

The CeleryExecutor utilizes Celery, a distributed task queue system, to parallelize task execution across multiple worker nodes. Tasks are distributed to the Celery workers for execution, allowing for horizontal scaling and improved performance.


Config

executor=CeleryExecutor
sql_alchemy_conn=postgresql+psycopg2://<user>:<password>@<host>/<db>
celery_result_backend=postgresql+psycopg2://<user>:<password>@<host>/<db>
celery_broker_url=redis://:@redis:6379/0

Kubernetes Executor

The KubernetesExecutor runs tasks in separate containers within a Kubernetes cluster. Each task is allocated its own container, providing isolated environments for execution.
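
The config switch is a single line; per-task pods can then be customized through the task's executor_config argument. A minimal sketch, assuming the cncf.kubernetes provider and cluster access are already set up:

executor=KubernetesExecutor

from kubernetes.client import models as k8s
from airflow.operators.bash import BashOperator

# Hypothetical resource override for one task; the container must be named 'base'.
heavy_task = BashOperator(
    task_id='heavy_task',
    bash_command='echo "crunching"',
    executor_config={
        'pod_override': k8s.V1Pod(
            spec=k8s.V1PodSpec(
                containers=[
                    k8s.V1Container(
                        name='base',
                        resources=k8s.V1ResourceRequirements(
                            requests={'cpu': '1', 'memory': '2Gi'}
                        )
                    )
                ]
            )
        )
    }
)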

Flower

Flower (also known as Celery Flower) is a web-based monitoring and administration tool for Celery, the distributed task queue that the CeleryExecutor builds on. It provides a user-friendly interface to monitor and manage the tasks and workers in a Celery cluster.

docker compose command

docker compose down && docker compose --profile flower up -d

(screenshot: the running Docker containers, now including the Flower service)

Celery Queue

Celery workers can subscribe to specific queues, which lets you route resource-hungry tasks to machines sized for them while everything else stays on the default queue.

Code Example

from airflow.operators.bash import BashOperator

transform = BashOperator(
    task_id='transform',
    bash_command='sleep 10',
    queue='high_cpu' # by default queue='default'
)
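
A worker then subscribes to that queue via the Airflow CLI (run this on the machine reserved for CPU-heavy work):

airflow celery worker --queues high_cpu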

Repetitive Patterns

When the same group of tasks repeats across a pipeline, Airflow offers two ways to factor the pattern out: SubDAGs and TaskGroups.

SubDAGs

In Apache Airflow, a SubDAG is a way to encapsulate a group of tasks within a DAG (Directed Acyclic Graph) as a separate unit. It allows you to organize and modularize complex workflows by creating a hierarchy of DAGs.

A SubDAG is defined within a parent DAG and behaves like a normal DAG but with a nested structure. It consists of its own tasks, dependencies, and scheduling properties. The SubDAG itself is treated as a single task within the parent DAG, enabling you to create more modular and manageable workflows. Note that SubDAGs are deprecated in recent Airflow releases; TaskGroups (below) are the recommended way to group tasks.

from airflow import DAG
from airflow.operators.bash import BashOperator
from datetime import datetime
from airflow.operators.subdag import SubDagOperator

def subdag_demo(parent_dag_id, child_dag_id, args):
    with DAG(
        dag_id=f"{parent_dag_id}.{child_dag_id}",
        schedule_interval=args['schedule_interval'],
        start_date=args['start_date'],
        catchup=args['catchup']
    ) as subdag:
        # define tasks here ...
        return subdag

with DAG(
        dag_id='group_dag_by_subdag',
        schedule_interval='@daily',
        start_date=datetime(year=2023, month=1, day=1),
        catchup=False
) as dag:
    args = {
        'schedule_interval': dag.schedule_interval,
        'start_date': dag.start_date,
        'catchup': dag.catchup
    }

    demo_tasks = SubDagOperator(
        task_id='demo',
        subdag=subdag_demo(dag.dag_id, 'demo', args)
    )
    
    other_task = BashOperator(...)
    
    demo_tasks >> other_task

TaskGroups

TaskGroups in Apache Airflow (introduced in Airflow 2.0) are a way to logically group and organize tasks within a DAG (Directed Acyclic Graph). TaskGroups provide a visual and conceptual grouping of tasks, making it easier to understand and manage complex workflows.

from airflow import DAG
from airflow.operators.bash import BashOperator
from datetime import datetime
from airflow.utils.task_group import TaskGroup

def grouped_tasks():
    with TaskGroup(group_id='group_id', tooltip='group tooltip text') as group:
        # define tasks here ...
        return group

with DAG(
    dag_id='group_dag_by_task_group',
    schedule_interval='@daily',
    start_date=datetime(year=2023, month=1, day=1),
    catchup=False
) as dag:
    grouped = grouped_tasks()

    other_task = BashOperator(...)

    grouped >> other_task

XCom

In Apache Airflow, XCom (short for "cross-communication") is a feature that allows tasks within a workflow to exchange messages and small amounts of data; the size limit depends on the metadata database (about 2 GB with SQLite, 1 GB with PostgreSQL, and 64 KB with MySQL). It enables the sharing of information between tasks, facilitating communication and coordination within a DAG.

# push
def _t1(ti):
    ti.xcom_push(key='hello', value='world')

# pull
def _t2(ti):
    t1_value = ti.xcom_pull(task_ids='t1', key='hello')
    print(t1_value)
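
Wired into a complete DAG, a minimal sketch (the dag id is an example):

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def _t1(ti):
    ti.xcom_push(key='hello', value='world')

def _t2(ti):
    print(ti.xcom_pull(task_ids='t1', key='hello'))

with DAG(
    dag_id='xcom_demo',  # hypothetical name
    schedule_interval='@daily',
    start_date=datetime(2023, 1, 1),
    catchup=False
) as dag:
    t1 = PythonOperator(task_id='t1', python_callable=_t1)
    t2 = PythonOperator(task_id='t2', python_callable=_t2)
    t1 >> t2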
