pipeline-tools / gusty
Making DAG construction easier
Home Page: https://pipeline-tools.github.io/gusty-docs/
License: MIT License
Gusty does not render templated fields that are not directly associated with an Operator.
Some Operator fields are kwargs passed down from the Operator to its underlying hook. This is the case with SQLExecuteQueryOperator and its hook_params field, which is passed to the corresponding SQL hook and allows initializing the hook with specific configurations.
Even though I have implemented a custom SQLExecuteQueryOperator that adds hook_params to the list of templated fields, Gusty does not render the template as expected. While Gusty correctly renders SQLExecuteQueryOperator's standard templated fields like sql, it does not template hook_params. hook_params renders correctly if you use a standard Airflow DAG declaration.
Here is the source code to reproduce:
.
├── dags
│ ├── gusty_dag
│ │ ├── execute_query.yml
│ │ └── METADATA.yml
│ └── gusty_dag.py
├── include
│ ├── custom_operators
│ │ └── custom.py
│ └── utils.py
└── ...
# dags/gusty_dag/METADATA.yml
description: "A Gusty DAG"
schedule_interval: null
default_args:
  depends_on_past: false
  start_date: !days_ago 1

# dags/gusty_dag/execute_query.yml
operator: include.custom_operators.custom.SQLExecuteQueryOperator
sql: SELECT 'gusty is causing problems';
parameters: !constant hook_params
hook_params: !constant hook_params
conn_id: snowflake_admin
# dags/gusty_dag.py
import os
from typing import Any

from gusty import create_dag

from include import utils as c

dag_dir = os.path.join(os.environ["AIRFLOW_HOME"], "dags", "gusty_dag")

def constant(x: str) -> Any:
    return getattr(c, x)

macro_dict = {"constant": constant}

my_dag = create_dag(dag_dir, latest_only=False, user_defined_macros=macro_dict)
# include/custom_operators/custom.py
from typing import Sequence

from airflow.providers.common.sql.operators import sql

class SQLExecuteQueryOperator(sql.SQLExecuteQueryOperator):
    template_fields: Sequence[str] = ("sql", "parameters", "hook_params")

# include/utils.py
hook_params = {
    "session_parameters": {
        "query_tag": (
            "{"
            "'dag_id': '{{ dag.dag_id }}', "
            "'task_id': '{{ task.task_id }}', "
            "'run_id': '{{ run_id }}', "
            "'logical_date': '{{ logical_date }}', "
            "'started': '{{ ti.start_date }}', "
            "'operator': '{{ ti.operator }}'"
            "}"
        )
    }
}
Here is a standard DAG using the same custom Operator that correctly renders hook_params:
from airflow.decorators import dag
from airflow.utils.dates import days_ago

from include.custom_operators.custom import (
    SQLExecuteQueryOperator,
)
from include.utils import hook_params

snowflake_conn_id = "snowflake_admin"

@dag(schedule=None, start_date=days_ago(1))
def test_hook_params_standard_airflow():
    sql_execute_query_operator = SQLExecuteQueryOperator(
        task_id="sql_execute_query_operator_task",
        conn_id=snowflake_conn_id,
        sql="SELECT '1';",
        hook_params=hook_params,
    )

test_hook_params_standard_airflow()
The project looks great!
I have been searching for something just like this, and I also came across a couple of slightly older projects that do similar things:
dag-factory - https://github.com/ajbosco/dag-factory
boundary-layer - https://github.com/etsy/boundary-layer
Is there a comparison between the three of them? I am looking for something light and feature-rich, but it should also be extensible in the future.
While creating a task from a single .py file is just amazing, I think that for more complex tasks it would be very useful to split your code across multiple files.
For example, inside my gusty DAG folder I would like to have:
Right now, if I do something like this, a new task named utils will be created. Is there already an option, or could the parser implement one, to simply skip a file based on some specific header?
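One way such an option could work, as a rough sketch: skip any file whose first line carries a marker comment. The `# gusty: ignore` header below is an invented convention, not an existing gusty feature.

```python
# Hypothetical parser hook: skip a .py file when its first line is a marker
# comment, so helper modules inside a DAG folder never become tasks.
# The "# gusty: ignore" string is an assumption, not real gusty behavior.
def should_skip(path):
    with open(path) as f:
        first_line = f.readline().strip()
    return first_line == "# gusty: ignore"
```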
See: https://docs.python.org/3/library/importlib.html#importlib.import_module
gusty does something similar to parts of this: https://docs.python.org/3/library/importlib.html#approximating-importlib-import-module
I think the main benefit is that this would make gusty's behavior identical to when a user writes something like import some_module. The one constraint might be that file names need to be valid module names (but I think the same constraint applies to the current approach).
The biggest potential issue this would fix is preventing loading and executing the same .py file multiple times (for example, if another file imported it).
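A rough sketch of the idea (the function name and the dotted-name derivation are simplified assumptions):

```python
import importlib
import sys

def import_task_module(dotted_name):
    # importlib.import_module consults sys.modules first, so a module that was
    # already imported (e.g. by another task file) is returned from the cache
    # instead of being executed a second time -- the same behavior a user gets
    # from a plain `import some_module` statement.
    if dotted_name in sys.modules:
        return sys.modules[dotted_name]
    return importlib.import_module(dotted_name)
```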
I need to create a small reproducible example, but I noticed this while backfilling. Running a backfill from the CLI like...
airflow backfill some_dag -s 2021-04-16T00:00:00+00:00 --reset_dagruns -i -t some_task
...seems to not include __builtins__ when running the task. This means that tasks I was running failed with...
NameError: name 'print' is not defined
When I don't pickle, the problem goes away, so I'm guessing it has to do with the way the .py file is imported using spec_from_file_location? For reference, here is one way you could trigger this error:
exec(
"""exec('print("hey")')""",
{
'__builtins__': {'exec': exec}
},
{}
)
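For comparison, a plain spec_from_file_location load by itself does give the module normal builtins; a minimal sketch (my assumption is that the failure above only appears once pickling is involved, not from this load path alone):

```python
import importlib.util
import os
import tempfile

# Write a tiny module to disk, then load it via spec_from_file_location.
# The loaded module sees a normal __builtins__, so print resolves fine here.
with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
    f.write("message = str(print)\n")
    path = f.name

spec = importlib.util.spec_from_file_location("demo_task", path)
module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(module)
os.unlink(path)
```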
Taking a quick look over the new code, it seems like some of these operators will throw errors when executed.
For example, the csv to postgres operator references an undefined variable (it should be self.postgres_conn_id):
https://github.com/chriscardillo/gusty/blob/master/gusty/operators/csv_to_postgres_operator.py#L61
A similar case appears in the same file (csv_files is never defined):
https://github.com/chriscardillo/gusty/blob/master/gusty/operators/csv_to_postgres_operator.py#L32
I wonder if it would be a good idea to set up some simple tests using pytest?
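A minimal pytest-style test along these lines could catch undefined-name bugs at test time rather than at execution. The class below is a toy stand-in I made up, not gusty's real operator; an actual test would import the real class and exercise its execute path with mocked connections.

```python
# Hypothetical tests/test_operators.py. FakeCsvToPostgresOperator is a
# stand-in for an operator like csv_to_postgres_operator.
class FakeCsvToPostgresOperator:
    def __init__(self, postgres_conn_id="postgres_default"):
        self.postgres_conn_id = postgres_conn_id

    def conn_label(self):
        # referencing self.postgres_conn_id here is exactly what the linked
        # line should do instead of using an undefined bare name
        return self.postgres_conn_id

def test_operator_attributes_are_defined():
    op = FakeCsvToPostgresOperator()
    assert op.conn_label() == "postgres_default"
```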
Currently, gusty uses a custom field called dependencies to connect task instances together. Another area that might be helpful to handle for task instances is their documentation.
Note that task documentation currently seems to happen in a funky way:
I can't figure out how to set this using YAML topmatter in gusty, so I wonder if it would somehow need to be set another way? It seems like it'd be really handy if documentation could be set in topmatter and then appear on this screen!
Currently, if a task dependency does not exist, AFAICT gusty does not raise an error but instead creates the task without the unmatched dependencies. This has bitten us a couple of times when we rename tasks or accidentally fat-finger the wrong name. It might be handy for gusty to raise an error if it can't create a dependency!
Overall, gusty-style dependencies have been super handy! Happy to dig into this if useful
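A sketch of the kind of pre-flight check gusty could run before wiring tasks (function name and error type are invented here):

```python
# Hypothetical validation step: fail loudly when a declared dependency does
# not match any known task id, instead of silently dropping it.
def validate_dependencies(known_task_ids, declared_dependencies):
    unknown = sorted(set(declared_dependencies) - set(known_task_ids))
    if unknown:
        raise ValueError(f"Unknown task dependencies: {unknown}")
```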
It looks like gusty currently executes strings to import modules. This seems like a job for the importlib module.
Currently, when we set a task to depend on a task group, it doesn't work. Likewise, setting a task group to depend on a task doesn't work. It's not clear why either of these cases should be problematic. In fact, enjoy_breakfast.yml in the provided examples suggests there shouldn't be any issue.
Can anyone confirm whether there is some hidden complexity I'm missing, or whether this should generally work as expected?
From what I can tell, when gusty goes to instantiate an operator, it has two pieces in hand: the operator class itself and the fields parsed from the task's YAML. It then inspects the signature of operator.__init__ and passes along only the fields that match parameters of operator.__init__. AFAICT the reason for this is so additional information can be specified in YAML headers, even if it is not used to instantiate an operator. However, in practice, this also means that the common practice of wrapping/overloading a signature like __init__(self, *args, **kwargs) will not work.
This behavior makes sense to me, and is mentioned in the README. One area I wonder about, though, is when the operator uses *args or **kwargs. With gusty's current behavior, it might be worth mentioning that users will need to forward the __signature__ attribute when wrapping / subclassing an operator (see this SO post).
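A minimal illustration of the workaround (ToyOperator is a made-up class standing in for an Airflow operator, not the real thing):

```python
import inspect

# Toy base class standing in for an Airflow operator.
class ToyOperator:
    def __init__(self, task_id, retries=0):
        self.task_id = task_id
        self.retries = retries

# A wrapper whose __init__ hides the real parameters behind *args/**kwargs,
# which defeats signature-based kwarg filtering like gusty's.
class WrappedOperator(ToyOperator):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

# Forward the parent's signature so inspect.signature on the wrapper reports
# task_id and retries instead of just (*args, **kwargs).
WrappedOperator.__init__.__signature__ = inspect.signature(ToyOperator.__init__)
```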
Currently, I think gusty may hit an index error when this happens...
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-1-7c3553383b77> in <module>
----> 1 import dags.dags
/opt/airflow/gcs/dags/dags.py in <module>
40 latest_only=False,
41 user_defined_macros=user_defined_macros,
---> 42 user_defined_filters=user_defined_filters,
43 )
~/.local/lib/python3.6/site-packages/gusty/__init__.py in create_dag(dag_dir, task_group_defaults, wait_for_defaults, latest_only, **kwargs)
30 wait_for_defaults=wait_for_defaults,
31 latest_only=latest_only,
---> 32 **kwargs
33 )
34 [setup.parse_metadata(level) for level in setup.levels]
~/.local/lib/python3.6/site-packages/gusty/building.py in __init__(self, dag_dir, **kwargs)
217 # Solely for Airflow v2 and beyond
218 self.levels = [level_id for level_id in self.schematic.keys()]
--> 219 self.levels = [self.levels[0]] if airflow_version < 2 else self.levels
220
221 # For tasks gusty creates outside of specs provided by the directory
IndexError: list index out of range
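A small guard would avoid the crash: slicing an empty list yields an empty list instead of raising. This is a sketch of the idea against the line in the traceback, not a tested patch.

```python
# Sketch of a fix for building.py line 219: use slicing rather than indexing
# so an empty levels list stays empty on Airflow 1.x instead of raising
# IndexError. Values below are an assumed reproduction of the traceback.
airflow_version = 1  # assumed: Airflow 1.x branch from the traceback
levels = []          # assumed: a schematic that produced no levels
levels = levels[:1] if airflow_version < 2 else levels
```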
Issue: It is common to have the same task_id names in different DAGs.
Workaround: Add my own ExternalTaskSensor.
Update: Update ExternalTaskSensor tasks to be "dag_id.task_id" to ensure uniqueness.
This seems like a tricky one, but for the ExternalTaskSensor to succeed here, it would need the right execution_date_fn parameter. For monitoring a DAG that runs once, I think it'd be the DAG's start date (or one day earlier?).
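A sketch of what that execution_date_fn could look like for a once-only upstream DAG (the function name and the date are made up for illustration):

```python
from datetime import datetime

# Hypothetical execution_date_fn for an ExternalTaskSensor watching a DAG
# that only ever runs once: ignore the sensor's own logical date and always
# point at the upstream DAG's single run date (an invented example date).
UPSTREAM_RUN_DATE = datetime(2021, 4, 16)

def once_only_execution_date_fn(logical_date):
    return UPSTREAM_RUN_DATE
```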
Currently, when a user creates a task using a .py file, gusty automatically creates a handy task from it.
One added feature that would make it easy to browse these tasks from the UI would be to attach the Python code that ends up running as an attribute. For example, here is a task generated from a .sql file in https://github.com/cal-itp/data-infra that shows the SQL code on the task instance page.
If the python_callable produced had an attribute like python, containing the full Python code used to create it, it seems like that would enable this kind of behavior!
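A sketch of how the attribute could be attached, assuming the callable is built by exec'ing the file's source text; the attribute name python follows the suggestion above, everything else is assumed rather than gusty's actual implementation:

```python
# Sketch: build a task's callable from its source text, then keep that text
# on the function so a UI plugin could render it, mirroring how .sql tasks
# can expose their SQL on the task instance page.
source = (
    "def generated_callable():\n"
    "    return 'hello from a .py task'\n"
)
namespace = {}
exec(source, namespace)
python_callable = namespace["generated_callable"]
python_callable.python = source  # hypothetical attribute holding the code
```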