getindata / kedro-airflow-k8s
Kedro Plugin to support running pipelines on Kubernetes using Airflow.
Home Page: https://kedro-airflow-k8s.readthedocs.io
License: Apache License 2.0
The template has grown big recently and entangles quite a bit of logic. Extracting operators would make the code more modular and allow the operators to be reused.
Similar to vanilla kedro's -p option, which selects the pipeline to operate on, the airflow-k8s plugin could allow selecting a specific pipeline for DAG generation; a hypothetical invocation is sketched below. That functionality could be useful in bigger projects where pipelines are complex and can be separated, but share common nodes.
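A hypothetical invocation (the flag name mirrors kedro run's --pipeline option; it is a proposal, not an existing flag):

```
kedro airflow-k8s compile --pipeline training
kedro airflow-k8s upload-pipeline --pipeline training
```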
Hello!
Is there a way to access Airflow config parameters during a DAG run? For example, as described in this Stack Overflow answer: Accessing configuration parameters passed to Airflow through CLI.
If this is not possible, what code should I look at to allow for this change, so I can create a PR?
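For reference, plain Airflow already exposes trigger-time parameters through the task context; a minimal sketch of how a plugin operator could read them (the operator class and the "lr" parameter are illustrative, not part of the plugin):

```python
# Sketch: reading parameters passed at trigger time, e.g. via
# `airflow dags trigger -c '{"lr": 0.01}' <dag_id>`.
from airflow.models import BaseOperator


class ReadRunConfOperator(BaseOperator):
    def execute(self, context):
        # dag_run.conf holds the JSON dict passed with --conf or the REST API
        conf = context["dag_run"].conf or {}
        lr = conf.get("lr", 0.01)  # hypothetical parameter with a default
        self.log.info("Effective lr: %s", lr)
        return lr
```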
Line 154 says
kedor airflow-k8s init --with-github-actions --output ${AIRFLOW_DAG_FOLDER} https://airflow.url
It should say:
kedro airflow-k8s init --with-github-actions --output ${AIRFLOW_DAG_FOLDER} https://airflow.url
Right now it's not possible to override parameters when submitting a new run. That makes it harder to do operations like Airflow-oriented hyperparameter tuning, which would involve launching many runs with modified params.
Update plugin dependencies to recent versions:
kedro - 0.18.*
The command should upload the DAG to the specified DAG folder with the proper schedule. Check the kedro-kubeflow plugin for reference.
Update plugin dependencies to recent versions:
kfp - 2.*
In Spark jobs, the init script already sets the PROJECT_HOME variable. Line:
project_path = os.getenv('PROJECT_HOME','/opt/{{ project_name }}')
In some cases I need to use a different path for the project. Thanks to that change, I could use spark.yarn.appMasterEnv.PROJECT_HOME in the operator factory to change the default /opt path.
The command should upload the DAG to the specified DAG folder with schedule=None (for manual trigger only).
It would be great to have similar functionality in kedro_airflow_k8s/airflow_spark_task_template.j2, probably after the line with the project_path variable.
Operators execute silently, so debugging and maintenance may be difficult. Operators should report at least basic data to standard output.
Currently we use the kedro session store to properly tag the Airflow DAG.
This causes a build failure when git is not initialized in the project.
My suggestion is to obtain the git SHA the following way (see the sketch below):
IF the KEDRO_CONFIG_COMMIT_ID env variable is set, use it;
ELSE IF the session store has git info available, use it;
ELSE set the git SHA to "UNKNOWN".
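A minimal sketch of that fallback chain (the function name and the session store layout are assumptions):

```python
import os


def resolve_commit_id(session_store_data: dict) -> str:
    # 1. Explicit override via environment variable
    commit_id = os.getenv("KEDRO_CONFIG_COMMIT_ID")
    if commit_id:
        return commit_id
    # 2. Git info recorded by the kedro session store, when the repo exists
    commit_sha = session_store_data.get("git", {}).get("commit_sha")
    if commit_sha:
        return commit_sha
    # 3. Safe default when git is not initialized
    return "UNKNOWN"
```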
Steps to reproduce:
kedro airflow-k8s compile
bash-3.2$ kedro airflow-k8s -e pipelines compile
Traceback (most recent call last):
File "/Users/mwiewior/job/projects/PKOBP/git/kedro-airflow-k8s/venv/kedro/bin/kedro", line 8, in <module>
sys.exit(main())
File "/Users/mwiewior/job/projects/PKOBP/git/kedro-airflow-k8s/venv/kedro/lib/python3.8/site-packages/kedro/framework/cli/cli.py", line 268, in main
cli_collection = KedroCLI(project_path=Path.cwd())
File "/Users/mwiewior/job/projects/PKOBP/git/kedro-airflow-k8s/venv/kedro/lib/python3.8/site-packages/kedro/framework/cli/cli.py", line 181, in __init__
self._metadata = bootstrap_project(project_path)
File "/Users/mwiewior/job/projects/PKOBP/git/kedro-airflow-k8s/venv/kedro/lib/python3.8/site-packages/kedro/framework/startup.py", line 181, in bootstrap_project
configure_project(metadata.package_name)
File "/Users/mwiewior/job/projects/PKOBP/git/kedro-airflow-k8s/venv/kedro/lib/python3.8/site-packages/kedro/framework/project/__init__.py", line 219, in configure_project
settings.configure(settings_module)
File "/Users/mwiewior/job/projects/PKOBP/git/kedro-airflow-k8s/venv/kedro/lib/python3.8/site-packages/dynaconf/base.py", line 182, in configure
self._wrapped = Settings(settings_module=settings_module, **kwargs)
File "/Users/mwiewior/job/projects/PKOBP/git/kedro-airflow-k8s/venv/kedro/lib/python3.8/site-packages/dynaconf/base.py", line 235, in __init__
self.validators.validate(
File "/Users/mwiewior/job/projects/PKOBP/git/kedro-airflow-k8s/venv/kedro/lib/python3.8/site-packages/dynaconf/validator.py", line 417, in validate
validator.validate(self.settings, only=only, exclude=exclude)
File "/Users/mwiewior/job/projects/PKOBP/git/kedro-airflow-k8s/venv/kedro/lib/python3.8/site-packages/dynaconf/validator.py", line 197, in validate
self._validate_items(
TypeError: _validate_items() got an unexpected keyword argument 'only'
kedro --version
kedro, version 0.17.5
The command should generate a config file with reasonable defaults taken from the project configuration.
Run once with wait-for-completion enabled checks only the state of the DAG run. However, with the all_done task trigger rule, the pipeline can be successful even though some of the intermediate tasks fail. This situation should be detected and handled according to a policy specified by the user; see the sketch below.
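A sketch of how the check could work against the Airflow 2.x stable REST API (the policy handling around it is up for discussion):

```python
import requests


def failed_task_ids(airflow_url: str, dag_id: str, run_id: str, auth) -> list:
    # List all task instances of the finished DAG run and collect failures,
    # even when the all_done trigger rule let the run itself succeed.
    resp = requests.get(
        f"{airflow_url}/api/v1/dags/{dag_id}/dagRuns/{run_id}/taskInstances",
        auth=auth,
    )
    resp.raise_for_status()
    return [
        ti["task_id"]
        for ti in resp.json()["task_instances"]
        if ti["state"] == "failed"
    ]
```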
There are some configuration options required by commands that are tied to the environment. To make working with the plugin easier and allow storing some of the configuration in VCS, we could move them to a config file (airflow-k8s.yml).
To make the configuration more robust and dynamic we could reuse the TemplatedConfigLoader hook. Check how we do it in kedro-kubeflow-example for reference.
References from kedro-kubeflow:
Before implementing we may have a short discussion about which options we move to the config file and which we allow to override with a console call.
Currently, the pipeline expects the experiment to be created. The code that requires mlflow should execute only if kedro-mlflow is properly configured and its dependencies are in place.
Move the mlflow requirements to extras in setup.py, like here.
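A sketch of the proposed setup.py change (the version pin is an assumption):

```python
from setuptools import setup

setup(
    # ...existing arguments...
    extras_require={
        # installed via: pip install kedro-airflow-k8s[mlflow]
        "mlflow": ["kedro-mlflow>=0.4"],
    },
)
```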
The plugin should also support an AWS S3 location as the DAG definitions location. At the moment only the local FS and Google Storage are supported via setup extras. The extras should be extended with an S3 profile, following the fsspec standard.
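Since fsspec resolves the backend from the URL scheme, adding an s3 extra (s3fs) should let the existing upload path handle s3:// locations; a sketch (the function name is illustrative):

```python
import fsspec


def write_dag(dag_source: str, dag_path: str) -> None:
    # fsspec picks the filesystem from the scheme: file://, gs://, s3://;
    # s3:// only works if the s3fs package (the proposed extra) is installed.
    with fsspec.open(dag_path, "w") as f:
        f.write(dag_source)
```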
It's currently set to the constant days_ago(2).
The command should use the Airflow API to list the currently deployed pipelines in the environment given in the config. Given that not all pipelines in Airflow may be produced by the plugin, consider using a tag to distinguish the ML ones and list only those.
Instead of launching a pod per node, could I launch a pod for the entire DAG run with the LocalExecutor? If so, I was curious if anyone had tried this.
I have tasks that are too quick to justify spinning up pods; I was hoping to have a pod per DAG run.
The command should upload the DAG to the specified DAG folder and use the Airflow API to trigger it immediately. There may be a need to implement additional logic to wait for the DAG to be loaded.
Support for the config flag wait_for_completion is optional.
Check the kedro-kubeflow plugin for reference.
Put startup_timeout_seconds in run_config.
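A sketch of the proposed shape in conf/base/airflow-k8s.yaml (the value is an example):

```yaml
run_config:
  # timeout for the pod to reach the running state
  startup_timeout_seconds: 600
```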
Set up Dependabot and CodeQL scans to enable automated dependency updates and security alerts.
For upload_pipeline, compile, run_once.
A project with many pipelines may have dependencies among them. It should be possible to indicate such dependencies on upload to Airflow; one possible mechanism is sketched below.
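One possible mechanism, sketched with Airflow's ExternalTaskSensor; how the plugin would surface this in its config is an open design question (the DAG ids are illustrative):

```python
from datetime import datetime

from airflow import DAG
from airflow.sensors.external_task import ExternalTaskSensor

with DAG("ml_pipeline", start_date=datetime(2021, 1, 1), schedule_interval=None) as dag:
    # Block this pipeline until the whole upstream DAG run has finished.
    wait_for_features = ExternalTaskSensor(
        task_id="wait_for_feature_pipeline",
        external_dag_id="feature_pipeline",
        external_task_id=None,  # wait for the entire upstream DAG run
        mode="reschedule",
    )
```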
trigger_rule='all_done' is missing in DeletePipelineStorageOperator.
The documentation reads:
Every kedro node is transformed into an Airflow DAG task.
As software engineers, we may want multiple kedro nodes to correspond to a single DAG task when logical separation of code into multiple nodes is preferred (e.g. for readability or reuse). This would also allow datasets of type "MemoryDataset", which would need grouping by default. For this, kedro-airflow-k8s could detect groups of nodes, similar to Spark grouping; a rough sketch follows.
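A rough sketch of the detection idea, using union-find to merge nodes connected through in-memory datasets (the plain-tuple node representation is a simplification of kedro's classes):

```python
from collections import defaultdict


def group_nodes(nodes, memory_datasets):
    """nodes: list of (name, inputs, outputs) tuples;
    memory_datasets: set of dataset names kept in memory."""
    parent = {name: name for name, _, _ in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    # Map each dataset to the node that produces it.
    producers = {ds: name for name, _, outputs in nodes for ds in outputs}

    # Merge consumer and producer whenever they share an in-memory dataset.
    for name, inputs, _ in nodes:
        for ds in inputs:
            if ds in memory_datasets and ds in producers:
                union(name, producers[ds])

    groups = defaultdict(list)
    for name, _, _ in nodes:
        groups[find(name)].append(name)
    return list(groups.values())
```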
The run-once command should support an optional wait-for-completion with a configurable timeout.
When the pipeline is executed, every finishing node sets the status in MLflow to "SUCCEEDED". Therefore, the status in MLflow doesn't reflect the status of the whole pipeline run, only the status of the last executed node.
One of the solutions would be to disable the hook that sets the status and set it once the kedro TaskGroup finishes, based on its status; see the sketch below.
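A sketch of that final step (the status mapping is an assumption; set_terminated is the standard MLflow client API):

```python
from mlflow.tracking import MlflowClient


def finalize_mlflow_run(run_id: str, dag_run_succeeded: bool) -> None:
    # Set the terminal status once, after the whole kedro TaskGroup finishes,
    # instead of letting every node overwrite it.
    status = "FINISHED" if dag_run_succeeded else "FAILED"
    MlflowClient().set_terminated(run_id, status=status)
```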
It looks like one of the comments in the sample config has a line break.
Steps to reproduce (with kedro-airflow-k8s==0.6.3):
kedro airflow-k8s init airflow-url
When the generated conf/base/airflow-k8s.yaml is loaded, the result is:
(<unknown>): found character that cannot start any token while scanning for the next token at line 68 column 13
In case all temporary processing data is stored outside of k8s or does not have to be persisted, usage of a PV is not needed. Turn this into a plugin option.
Hello
With:
kedro 0.17.4
kedro-airflow-k8s 0.7.3
python 3.8.12
I have a templated catalog:
training_data:
  type: spark.SparkDataSet
  filepath: data/${folders.intermediate}/training_data
  file_format: parquet
  save_args:
    mode: 'overwrite'
  layer: intermediate
with the parameter set in my globals.yml:
folders:
  intermediate: 02_intermediate
And when I run:
kedro airflow-k8s compile
I get the following error:
Traceback (most recent call last):
File "/Users/user/miniconda3/envs/kedro/bin/kedro", line 8, in <module>
sys.exit(main())
File "/Users/user/miniconda3/envs/kedro/lib/python3.8/site-packages/kedro/framework/cli/cli.py", line 265, in main
cli_collection()
File "/Users/user/miniconda3/envs/kedro/lib/python3.8/site-packages/click/core.py", line 829, in __call__
return self.main(*args, **kwargs)
File "/Users/user/miniconda3/envs/kedro/lib/python3.8/site-packages/kedro/framework/cli/cli.py", line 210, in main
super().main(
File "/Users/user/miniconda3/envs/kedro/lib/python3.8/site-packages/click/core.py", line 782, in main
rv = self.invoke(ctx)
File "/Users/user/miniconda3/envs/kedro/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/Users/user/miniconda3/envs/kedro/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/Users/user/miniconda3/envs/kedro/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/Users/user/miniconda3/envs/kedro/lib/python3.8/site-packages/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
File "/Users/user/miniconda3/envs/kedro/lib/python3.8/site-packages/click/decorators.py", line 21, in new_func
return f(get_current_context(), *args, **kwargs)
File "/Users/user/miniconda3/envs/kedro/lib/python3.8/site-packages/kedro_airflow_k8s/cli.py", line 64, in compile
) = get_dag_filename_and_template_stream(
File "/Users/user/miniconda3/envs/kedro/lib/python3.8/site-packages/kedro_airflow_k8s/template.py", line 170, in get_dag_filename_and_template_stream
template_stream = _create_template_stream(
File "/Users/user/miniconda3/envs/kedro/lib/python3.8/site-packages/kedro_airflow_k8s/template.py", line 92, in _create_template_stream
pipeline_grouped=context_helper.pipeline_grouped,
File "/Users/user/miniconda3/envs/kedro/lib/python3.8/site-packages/kedro_airflow_k8s/context_helper.py", line 46, in pipeline_grouped
return TaskGroupFactory().create(self.pipeline, self.context.catalog)
File "/Users/user/miniconda3/envs/kedro/lib/python3.8/site-packages/kedro/framework/context/context.py", line 329, in catalog
return self._get_catalog()
File "/Users/user/miniconda3/envs/kedro/lib/python3.8/site-packages/kedro/framework/context/context.py", line 365, in _get_catalog
conf_catalog = self.config_loader.get("catalog*", "catalog*/**", "**/catalog*")
File "/Users/user/miniconda3/envs/kedro/lib/python3.8/site-packages/kedro/config/templated_config.py", line 191, in get
return _format_object(config_raw, self._arg_dict)
File "/Users/user/miniconda3/envs/kedro/lib/python3.8/site-packages/kedro/config/templated_config.py", line 264, in _format_object
new_dict[key] = _format_object(value, format_dict)
File "/Users/user/miniconda3/envs/kedro/lib/python3.8/site-packages/kedro/config/templated_config.py", line 264, in _format_object
new_dict[key] = _format_object(value, format_dict)
File "/Users/user/miniconda3/envs/kedro/lib/python3.8/site-packages/kedro/config/templated_config.py", line 279, in _format_object
return IDENTIFIER_PATTERN.sub(lambda m: str(_format_string(m)), val)
File "/Users/user/miniconda3/envs/kedro/lib/python3.8/site-packages/kedro/config/templated_config.py", line 279, in <lambda>
return IDENTIFIER_PATTERN.sub(lambda m: str(_format_string(m)), val)
File "/Users/user/miniconda3/envs/kedro/lib/python3.8/site-packages/kedro/config/templated_config.py", line 242, in _format_string
raise ValueError(
ValueError: Failed to format pattern '${folders.intermediate}': no config value found, no default provided
With this conf/base/airflow-k8s.yaml:
host: https://airflow.url
output: dags
run_config:
  image: spark_image
  image_pull_policy: Always
  startup_timeout: 600
  namespace: namespace
  experiment_name: experiment
  run_name: experiment
  cron_expression: "@daily"
  description: "experiment Pipeline"
  service_account_name: namespace-vault
  volume:
    disabled: True
  macro_params: [ds, prev_ds]
  variables_params: []
I should add that kedro run works.
Do you have any hint?
Make sure the plugin works with Python versions 3.8, 3.9, and 3.10. Modify the test process to verify all three versions work as expected (matrix builds).
Create MLFLOW_TRACKING_TOKEN from AuthHandler.
bash-3.2$ echo $KEDRO_ENV
pipelines
bash-3.2$ kedro airflow-k8s compile
No files found in ['/XXX/git/spaceflights-in-airflow/conf/base', '/XXX/git/spaceflights-in-airflow/conf/local'] matching the glob pattern(s): ['airflow-k8s*']
bash-3.2$ kedro airflow-k8s -e pipelines compile
### works as expected
Currently there are no versions on PyPI that fit the condition gcsfs>=0.6.3,<0.7.0:
https://pypi.org/project/gcsfs/#history
At the moment tasks are scheduled with the default k8s resource policy, and node assignment follows the cluster policy. It's desirable to optionally indicate which node a specific task should be assigned to and how many resources it may consume.
This should be possible via plugin configuration, where a set of node configurations is available. For every configuration, the label of the k8s node pool should be indicated, as well as the requested memory and CPU resources. It's up to the kedro node to indicate which configuration to use. If a node does not specify a configuration, the default one is used; if no default is specified in the plugin configuration, the pod is scheduled with the k8s cluster defaults.
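A sketch of what such a configuration could look like (all key names are proposals, not the plugin's current schema):

```yaml
run_config:
  resources:
    __default__:
      node_selectors:
        pool: default-pool
      requests:
        cpu: "1"
        memory: 1Gi
    gpu_training:  # a node opts in, e.g. via a tag like resources:gpu_training
      node_selectors:
        pool: gpu-pool
      requests:
        cpu: "4"
        memory: 16Gi
```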
The command should open the Airflow UI in a new browser tab; the URL is to be specified in the config file.