
pipeline-tools / gusty

Making DAG construction easier

Home Page: https://pipeline-tools.github.io/gusty-docs/

License: MIT License

Languages: Python 98.20%, Jupyter Notebook 0.70%, Dockerfile 0.43%, Makefile 0.68%

Topics: airflow, data-etl, data-pipeline

gusty's People

Contributors: chriscardillo, dependabot[bot], dgrtwo, josh-fell, machow, nicklausroach


gusty's Issues

[BUG] gusty does not render templated fields not directly associated with an Operator

Issue

Gusty does not render templated fields that are not directly associated with an Operator.

Some operator fields are kwargs that get passed down from the operator to its underlying hook. This is the case with SQLExecuteQueryOperator and its hook_params field, which is passed to the corresponding SQL hook and allows initializing the hook with specific configurations.

Even though I have implemented a custom SQLExecuteQueryOperator to add hook_params to the list of templated fields, Gusty does not render the template as expected.

While Gusty will correctly render SQLExecuteQueryOperator's standard templated fields like sql, it does not template hook_params. hook_params renders correctly if you use a standard Airflow DAG declaration.

Here is the source code to reproduce:

Folder structure

.
├── dags
│   ├── gusty_dag
│   │   ├── execute_query.yml
│   │   └── METADATA.yml
│   └── gusty_dag.py
├── include
│   ├── custom_operators
│   │   └── custom.py
│   └── utils.py
└── ...

dags/gusty_dag/METADATA.yml

description: "A Gusty DAG"
schedule_interval: null
default_args:
    depends_on_past: false
    start_date: !days_ago 1

dags/gusty_dag/execute_query.yml

operator: include.custom_operators.custom.SQLExecuteQueryOperator
sql: SELECT 'gusty is causing problems';
parameters: !constant hook_params
hook_params: !constant hook_params
conn_id: snowflake_admin

dags/gusty_dag.py

import os
from typing import Any
from gusty import create_dag
from include import utils as c

dag_dir = os.path.join(os.environ["AIRFLOW_HOME"], "dags", "gusty_dag")

def constant(x: str) -> Any:
    return getattr(c, x)

macro_dict = {
    "constant": constant
}

my_dag = create_dag(dag_dir, latest_only=False, user_defined_macros=macro_dict)

include/custom_operators/custom.py

from typing import Sequence
from airflow.providers.common.sql.operators import sql

class SQLExecuteQueryOperator(sql.SQLExecuteQueryOperator):
    template_fields: Sequence[str] = ("sql", "parameters", "hook_params")

include/utils.py

hook_params = {
    "session_parameters": {
        "query_tag": (
            "{"
            "'dag_id': '{{ dag.dag_id }}', "
            "'task_id': '{{ task.task_id }}', "
            "'run_id': '{{ run_id }}', "
            "'logical_date': '{{ logical_date }}', "
            "'started': '{{ ti.start_date }}', "
            "'operator': '{{ ti.operator }}'"
            "}"
        )
    }
}

Expected Behavior

Here is a standard DAG that uses the same custom operator and correctly renders hook_params:

dags/working_dag.py

from airflow.decorators import dag
from airflow.utils.dates import days_ago

from include.custom_operators.custom import (
    SQLExecuteQueryOperator,
)
from include.utils import hook_params

snowflake_conn_id = "snowflake_admin"

@dag(schedule=None, start_date=days_ago(1))
def test_hook_params_standard_airflow():
    sql_execute_query_operator = SQLExecuteQueryOperator(
        task_id="sql_execute_query_operator_task",
        conn_id=snowflake_conn_id,
        sql="SELECT '1';",
        hook_params=hook_params,
    )

test_hook_params_standard_airflow()

Allow gusty to ignore some specific folders

While creating a task from a single .py file is just amazing, I think that for more complex tasks it would be very useful to be able to structure the code across multiple files.

For example, inside my gusty dag folder I would like to have:

  • scripts
    • utils.py
  • task_1.py -- importing from scripts.utils
  • task_2.sql
  • task_3.yml

Right now, with a layout like this, a new task named utils will be created. Is there already an option to skip a file, or could the parser implement one, for example based on some specific header? A sketch of what that might look like follows.
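Purely as a sketch (the marker and helper below are hypothetical, not existing gusty behavior):

IGNORE_MARKER = "# gusty: ignore"  # hypothetical opt-out marker

def should_skip(path: str) -> bool:
    """Skip task creation for .py files whose first line is the marker."""
    with open(path) as f:
        return f.readline().strip() == IGNORE_MARKER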

Use importlib.import_module, rather than manual approximation

See: https://docs.python.org/3/library/importlib.html#importlib.import_module

gusty does something similar to parts of this: https://docs.python.org/3/library/importlib.html#approximating-importlib-import-module

I think the main benefit is that this will make gusty's behavior identical to when a user uses something like import some_module. The one constraint might be that file names need to be valid module names (but I think the same constraint applies to the current approach).

The biggest potential issue this would fix is preventing loading and executing the same .py file multiple times (for example, if another file imported it).
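For illustration, here is roughly the difference (paths and module names are placeholders):

import importlib
import importlib.util

# Manual approximation: executes the file directly from its path. Unless the
# module is also registered in sys.modules, a later `import task_1` elsewhere
# will execute the same file a second time.
spec = importlib.util.spec_from_file_location("task_1", "/dags/gusty_dag/task_1.py")
module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(module)

# importlib.import_module goes through the normal import machinery and caches
# the result in sys.modules, so repeated imports return the same module object.
same_module = importlib.import_module("dags.gusty_dag.task_1")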

running backfill without disabling pickling runs .py tasks without __builtins__

I still need to create a small reproducible example, but I noticed this while backfilling. Running a backfill from the CLI like...

airflow backfill some_dag -s 2021-04-16T00:00:00+00:00 --reset_dagruns -i -t some_task

...seems not to include __builtins__ when running the task, so the tasks I was running failed with...

NameError: name 'print' is not defined

When I don't pickle, the problem goes away, so I'm guessing it has to do with the way the .py file is imported using spec_from_file_location? For reference, here is one way you could trigger this error:

# The inner exec runs with an explicit __builtins__ that only contains exec,
# so looking up the name `print` fails with a NameError.
exec(
    """exec('print("hey")')""",
    {
        '__builtins__': {'exec': exec}
    },
    {}
)

It looks like there may be bugs in the operators

Taking a quick look over the new code, it seems like some of these operators will throw errors when executed.

For example, the csv-to-postgres operator references an undefined variable (it should be self.postgres_conn_id):

https://github.com/chriscardillo/gusty/blob/master/gusty/operators/csv_to_postgres_operator.py#L61

A similar case in the same file (csv_files is never defined):

https://github.com/chriscardillo/gusty/blob/master/gusty/operators/csv_to_postgres_operator.py#L32

I wonder if it would be a good idea to set up some simple tests using pytest? Even a smoke test, as sketched below, would have caught these.
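This is only a sketch; the class name and constructor arguments are guesses at what lives in gusty/operators/csv_to_postgres_operator.py:

# Minimal pytest-style smoke test. CSVToPostgresOperator and its arguments
# are assumptions about the actual operator module.
from gusty.operators.csv_to_postgres_operator import CSVToPostgresOperator

def test_csv_to_postgres_instantiates():
    op = CSVToPostgresOperator(task_id="load_csv", postgres_conn_id="my_conn")
    assert op.postgres_conn_id == "my_conn"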

option in yaml header for adding documentation to task instance

Currently, gusty uses a custom field called dependencies to connect task instances together. Another area that might be helpful to handle for task instances is their documentation.

Note that currently task documentation seems to happen in a funky way:

[screenshot: current task documentation rendering]

I can't figure out how to set this using yaml topmatter in gusty, so I wonder if it would need to be set another way. It seems like it'd be really handy if documentation could be set in topmatter and then appear on this screen!:

[screenshot: task documentation on the task instance page]
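For what it's worth, Airflow's BaseOperator already accepts doc_md (along with doc, doc_rst, doc_json, and doc_yaml), so one possible shape, assuming gusty forwarded a doc_md key from the topmatter to the operator, might be:

operator: airflow.operators.bash.BashOperator
bash_command: echo hello
doc_md: |
  ## What this task does
  Echoes a greeting; this text would then appear on the task instance page.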

raise error if task dependency does not exist

Currently, if a task dependency does not exist, AFAICT gusty does not raise an error but instead creates the task without the unmatched dependencies. This has bitten us a couple of times when we rename tasks or accidentally fat-finger the wrong name. It might be handy for gusty to raise an error if it can't create a dependency!

Overall, gusty-style dependencies have been super handy! Happy to dig into this if useful.
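A minimal sketch of the kind of check this could be (names here are made up, not gusty internals):

def validate_dependencies(task_ids, dependencies):
    """Raise instead of silently dropping unmatched dependencies."""
    unknown = set(dependencies) - set(task_ids)
    if unknown:
        raise ValueError(f"Unknown task dependencies: {sorted(unknown)}")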

Allow tasks to have dependencies on task groups and vice versa

Currently, setting a task to depend on a task group doesn't work, and neither does setting a task group to depend on a task. It's not clear why either case should be problematic; in fact, enjoy_breakfast.yml in the provided examples suggests there shouldn't be any issue.

Can anyone confirm whether there is some hidden complexity I'm missing, or whether this should generally work as expected?
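For concreteness, this is the shape of topmatter that doesn't currently take effect (names are placeholders, and I'm assuming the dependencies field is meant to accept a task group's name):

operator: airflow.operators.bash.BashOperator
bash_command: echo done
dependencies:
  - breakfast_tasks   # a task group (subfolder) name rather than a task name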

Documenting strategy for custom operators that use args and kwargs

From what I can tell, when gusty goes to instantiate an operator, it has two pieces in hand:

  • the parameters specified in the task file (e.g. yaml header)
  • the imported operator callable (e.g. a class)

It then does the following:

  • get the named parameters off operator.__init__
  • instantiate the operator, by passing only the parameters explicitly named in operator.__init__

AFAICT the reason for this is so additional information can be specified in yaml headers, even if it is not used to instantiate an operator. However, in practice, this also means that the common practice of wrapping/overloading a signature like __init__(self, *args, **kwargs) will not work.

This behavior makes sense to me, and is mentioned in the README. One area I wonder about, though, is operators that use *args or **kwargs. With gusty's current behavior, it might be worth mentioning that users will need to forward the __signature__ attribute when wrapping or subclassing an operator (see this SO post).
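A minimal sketch of that forwarding, assuming a pass-through subclass (BashOperator is just an example):

import inspect

from airflow.operators.bash import BashOperator

class WrappedBashOperator(BashOperator):
    def __init__(self, *args, **kwargs):
        # custom behavior would go here
        super().__init__(*args, **kwargs)

# Without this, inspecting WrappedBashOperator.__init__ only reveals
# *args/**kwargs, so filtering by named parameters finds nothing to pass.
WrappedBashOperator.__init__.__signature__ = inspect.signature(BashOperator.__init__)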

give clear error message when creating dag from directory that does not exist

Currently, I think gusty may hit an index error when this happens...

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-1-7c3553383b77> in <module>
----> 1 import dags.dags

/opt/airflow/gcs/dags/dags.py in <module>
     40         latest_only=False,
     41         user_defined_macros=user_defined_macros,
---> 42         user_defined_filters=user_defined_filters,
     43     )

~/.local/lib/python3.6/site-packages/gusty/__init__.py in create_dag(dag_dir, task_group_defaults, wait_for_defaults, latest_only, **kwargs)
     30         wait_for_defaults=wait_for_defaults,
     31         latest_only=latest_only,
---> 32         **kwargs
     33     )
     34     [setup.parse_metadata(level) for level in setup.levels]

~/.local/lib/python3.6/site-packages/gusty/building.py in __init__(self, dag_dir, **kwargs)
    217         # Solely for Airflow v2 and beyond
    218         self.levels = [level_id for level_id in self.schematic.keys()]
--> 219         self.levels = [self.levels[0]] if airflow_version < 2 else self.levels
    220 
    221         # For tasks gusty creates outside of specs provided by the directory

IndexError: list index out of range
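In the meantime, a guard in the DAG file gives a clearer failure (safe_create_dag is a local helper, not a gusty API):

import os

from gusty import create_dag

def safe_create_dag(dag_dir, **kwargs):
    # Fail fast with a readable message instead of an IndexError deep in gusty.
    if not os.path.isdir(dag_dir):
        raise FileNotFoundError(f"gusty DAG directory does not exist: {dag_dir}")
    return create_dag(dag_dir, **kwargs)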

Add dag_id to ExternalTaskSensor dependencies

Issue:
It is common to have the same task_id names in different DAGs.

Workaround:
Add my own ExternalTaskSensor.

Update:
Name ExternalTaskSensor wait tasks "dag_id.task_id" to ensure uniqueness.
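For illustration, a wait task could carry both IDs in its name (the import path is Airflow 2.x; names are placeholders):

from airflow.sensors.external_task import ExternalTaskSensor

wait_for_some_task = ExternalTaskSensor(
    task_id="wait_for_other_dag.some_task",  # "dag_id.task_id" for uniqueness
    external_dag_id="other_dag",
    external_task_id="some_task",
)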

For .py task, show python code as attribute on task instance

Currently, when a user provides a .py file, gusty automatically creates a handy task from it.

One added feature that would make it easy to browse these from the UI would be to attach the python code it ends up running as an attribute. For example, here is a task generated from a .sql file in https://github.com/cal-itp/data-infra that shows the sql code on the task instance page.

[screenshot: sql code rendered on the task instance page]

If the PythonOperator produced had an attribute like python, containing the full python code used to create the python_callable, it seems like it would enable this kind of behavior!
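One rough shape for this, relying on Airflow showing template_fields on the rendered view (the subclass and its python field are hypothetical):

from airflow.operators.python import PythonOperator

class SourcePythonOperator(PythonOperator):
    # Expose the task's source as a template field so the UI renders it,
    # the way "sql" is shown for SQL operators. Caveat: Jinja will process
    # the string, so code containing "{{" would need escaping.
    template_fields = tuple(PythonOperator.template_fields) + ("python",)
    template_fields_renderers = {"python": "py"}

    def __init__(self, *, python: str = "", **kwargs):
        super().__init__(**kwargs)
        self.python = python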
