getindata / kedro-airflow-k8s
Kedro Plugin to support running pipelines on Kubernetes using Airflow.
Home Page: https://kedro-airflow-k8s.readthedocs.io
License: Apache License 2.0
The template has grown big recently and entangles quite a bit of logic. Extracting operators would make the code more modular and allow the operators to be reused.
Similar to vanilla kedro's -p option, which selects the pipeline to operate on, the airflow-k8s plugin could allow selecting a specific pipeline for DAG generation; a hypothetical invocation is sketched below. That functionality could be useful in bigger projects where pipelines are complex and can be separated, but share common nodes.
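A hypothetical invocation (the flag name mirrors kedro run's --pipeline option; it is a proposal, not an existing flag):

```
kedro airflow-k8s compile --pipeline training
kedro airflow-k8s upload-pipeline --pipeline training
```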
Hello!
Is there a way to access Airflow config parameters during a DAG run? For example, as described in this Stack Overflow answer: Accessing configuration parameters passed to Airflow through CLI.
If this is not possible, what code should I look at to allow for this change, so I can create a PR?
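For reference, plain Airflow already exposes trigger-time parameters through the task context; a minimal sketch of how a plugin operator could read them (the operator class and the "lr" parameter are illustrative, not part of the plugin):

```python
# Sketch: reading parameters passed at trigger time, e.g. via
# `airflow dags trigger -c '{"lr": 0.01}' <dag_id>`.
from airflow.models import BaseOperator


class ReadRunConfOperator(BaseOperator):
    def execute(self, context):
        # dag_run.conf holds the JSON dict passed with --conf or the REST API
        conf = context["dag_run"].conf or {}
        lr = conf.get("lr", 0.01)  # hypothetical parameter with a default
        self.log.info("Effective lr: %s", lr)
        return lr
```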
Line 154 says
kedor airflow-k8s init --with-github-actions --output ${AIRFLOW_DAG_FOLDER} https://airflow.url
It should say:
kedro airflow-k8s init --with-github-actions --output ${AIRFLOW_DAG_FOLDER} https://airflow.url
Right now it's not possible to override parameters when submitting a new run. That makes it harder to do operations like Airflow-oriented hyperparameter tuning, which would involve launching many runs with modified params.
Update plugin dependencies to recent versions:
kedro - 0.18.*
The command should upload the DAG to the specified DAG folder with the proper schedule. Check the kedro-kubeflow plugin for reference.
Update plugin dependencies to recent versions:
kfp - 2.*
In Spark jobs, the init script already sets the PROJECT_HOME variable. Line:
project_path = os.getenv('PROJECT_HOME','/opt/{{ project_name }}')
In some cases I need to use a different path for the project. Thanks to that change, I could use spark.yarn.appMasterEnv.PROJECT_HOME in the operator factory to change the default /opt path.
The command should upload the DAG to the specified DAG folder with schedule=None (for manual trigger only).
It would be great to have similar functionality in kedro_airflow_k8s/airflow_spark_task_template.j2, probably after the line with the project_path variable.
Operators execute silently, so debugging and maintenance may be difficult. Operators should report at least basic data to standard output.
Currently we use the kedro session store to properly tag the Airflow DAG.
This causes a build failure when git is not initialized in the project.
My suggestion is to obtain the git SHA the following way (see the sketch below):
IF the KEDRO_CONFIG_COMMIT_ID env variable is set, use it;
ELSE IF the session store has git info available, use it;
ELSE set the git SHA to "UNKNOWN".
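A minimal sketch of that fallback chain (the function name and the session store layout are assumptions):

```python
import os


def resolve_commit_id(session_store_data: dict) -> str:
    # 1. Explicit override via environment variable
    commit_id = os.getenv("KEDRO_CONFIG_COMMIT_ID")
    if commit_id:
        return commit_id
    # 2. Git info recorded by the kedro session store, when the repo exists
    commit_sha = session_store_data.get("git", {}).get("commit_sha")
    if commit_sha:
        return commit_sha
    # 3. Safe default when git is not initialized
    return "UNKNOWN"
```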
Steps to reproduce:
kedro airflow-k8s compile
bash-3.2$ kedro airflow-k8s -e pipelines compile
Traceback (most recent call last):
File "/Users/mwiewior/job/projects/PKOBP/git/kedro-airflow-k8s/venv/kedro/bin/kedro", line 8, in <module>
sys.exit(main())
File "/Users/mwiewior/job/projects/PKOBP/git/kedro-airflow-k8s/venv/kedro/lib/python3.8/site-packages/kedro/framework/cli/cli.py", line 268, in main
cli_collection = KedroCLI(project_path=Path.cwd())
File "/Users/mwiewior/job/projects/PKOBP/git/kedro-airflow-k8s/venv/kedro/lib/python3.8/site-packages/kedro/framework/cli/cli.py", line 181, in __init__
self._metadata = bootstrap_project(project_path)
File "/Users/mwiewior/job/projects/PKOBP/git/kedro-airflow-k8s/venv/kedro/lib/python3.8/site-packages/kedro/framework/startup.py", line 181, in bootstrap_project
configure_project(metadata.package_name)
File "/Users/mwiewior/job/projects/PKOBP/git/kedro-airflow-k8s/venv/kedro/lib/python3.8/site-packages/kedro/framework/project/__init__.py", line 219, in configure_project
settings.configure(settings_module)
File "/Users/mwiewior/job/projects/PKOBP/git/kedro-airflow-k8s/venv/kedro/lib/python3.8/site-packages/dynaconf/base.py", line 182, in configure
self._wrapped = Settings(settings_module=settings_module, **kwargs)
File "/Users/mwiewior/job/projects/PKOBP/git/kedro-airflow-k8s/venv/kedro/lib/python3.8/site-packages/dynaconf/base.py", line 235, in __init__
self.validators.validate(
File "/Users/mwiewior/job/projects/PKOBP/git/kedro-airflow-k8s/venv/kedro/lib/python3.8/site-packages/dynaconf/validator.py", line 417, in validate
validator.validate(self.settings, only=only, exclude=exclude)
File "/Users/mwiewior/job/projects/PKOBP/git/kedro-airflow-k8s/venv/kedro/lib/python3.8/site-packages/dynaconf/validator.py", line 197, in validate
self._validate_items(
TypeError: _validate_items() got an unexpected keyword argument 'only'
kedro --version
kedro, version 0.17.5
The command should generate a config file with reasonable defaults taken from the project configuration.
Run once with wait-for-completion enabled checks only the state of the DAG run. However, with the all_done task trigger rule, the pipeline can be successful even though some of the intermediate tasks fail. This situation should be detected and handled according to a policy specified by the user; see the sketch below.
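A sketch of how the check could work against the Airflow 2.x stable REST API (the policy handling around it is up for discussion):

```python
import requests


def failed_task_ids(airflow_url: str, dag_id: str, run_id: str, auth) -> list:
    # List all task instances of the finished DAG run and collect failures,
    # even when the all_done trigger rule let the run itself succeed.
    resp = requests.get(
        f"{airflow_url}/api/v1/dags/{dag_id}/dagRuns/{run_id}/taskInstances",
        auth=auth,
    )
    resp.raise_for_status()
    return [
        ti["task_id"]
        for ti in resp.json()["task_instances"]
        if ti["state"] == "failed"
    ]
```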
There are some configuration options required by commands that are tied to the environment. To make working with the plugin easier and allow storing some of the configuration in VCS, we could move them to a config file (airflow-k8s.yml).
To make the configuration more robust and dynamic we could reuse the TemplatedConfigLoader hook. Check how we do it in kedro-kubeflow-example for reference.
References from kedro-kubeflow:
Before implementing we may have a short discussion about which options we move to the config file and which we allow to override with a console call.
Currently, the pipeline expects the experiment to be created. The code that requires mlflow should execute only if kedro-mlflow is properly configured and its dependencies are in place.
Move the mlflow requirements to extras in setup.py, like here.
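A sketch of the proposed setup.py change (the version pin is an assumption):

```python
from setuptools import setup

setup(
    # ...existing arguments...
    extras_require={
        # installed via: pip install kedro-airflow-k8s[mlflow]
        "mlflow": ["kedro-mlflow>=0.4"],
    },
)
```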
The plugin should also support an AWS S3 location as the DAG definitions location. At the moment only the local FS and Google Storage are supported via setup extras. The extras should be extended with an S3 profile, following the fsspec standard.
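Since fsspec resolves the backend from the URL scheme, adding an s3 extra (s3fs) should let the existing upload path handle s3:// locations; a sketch (the function name is illustrative):

```python
import fsspec


def write_dag(dag_source: str, dag_path: str) -> None:
    # fsspec picks the filesystem from the scheme: file://, gs://, s3://;
    # s3:// only works if the s3fs package (the proposed extra) is installed.
    with fsspec.open(dag_path, "w") as f:
        f.write(dag_source)
```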
It's currently set to the constant days_ago(2).
The command should use the Airflow API to list the currently deployed pipelines in the environment given in the config. Given that not all pipelines in Airflow may be produced by the plugin, consider using a tag to distinguish the ML ones and list only those.
Instead of launching a pod per node, could I launch a pod for the entire DAG run with the LocalExecutor? If so, I was curious if anyone had tried this.
I have tasks that are too quick to justify spinning up pods; I was hoping to have a pod per DAG run.
The command should upload the DAG to the specified DAG folder and use the Airflow API to trigger it immediately. There may be a need to implement additional logic to wait for the DAG to be loaded.
Support for the config flag wait_for_completion is optional.
Check the kedro-kubeflow plugin for reference.
Put startup_timeout_seconds in run_config.
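A sketch of the proposed shape in conf/base/airflow-k8s.yaml (the value is an example):

```yaml
run_config:
  # timeout for the pod to reach the running state
  startup_timeout_seconds: 600
```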
Set up Dependabot and CodeQL scans to enable automated dependency updates and security alerts.
For upload_pipeline, compile, run_once.
A project with many pipelines may have dependencies among them. It should be possible to indicate such dependencies on upload to Airflow; one possible mechanism is sketched below.
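One possible mechanism, sketched with Airflow's ExternalTaskSensor; how the plugin would surface this in its config is an open design question (the DAG ids are illustrative):

```python
from datetime import datetime

from airflow import DAG
from airflow.sensors.external_task import ExternalTaskSensor

with DAG("ml_pipeline", start_date=datetime(2021, 1, 1), schedule_interval=None) as dag:
    # Block this pipeline until the whole upstream DAG run has finished.
    wait_for_features = ExternalTaskSensor(
        task_id="wait_for_feature_pipeline",
        external_dag_id="feature_pipeline",
        external_task_id=None,  # wait for the entire upstream DAG run
        mode="reschedule",
    )
```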
trigger_rule='all_done' is missing in DeletePipelineStorageOperator.
The documentation reads:
Every kedro node is transformed into an Airflow DAG task.
As software engineers, we may want multiple kedro nodes to correspond to a single DAG task when logical separation of code into multiple nodes is preferred (e.g. for readability or reuse). This would also allow datasets of type "MemoryDataset", which would need grouping by default. For this, kedro-airflow-k8s could detect groups of nodes, similar to Spark grouping; a rough sketch follows.
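A rough sketch of the detection idea, using union-find to merge nodes connected through in-memory datasets (the plain-tuple node representation is a simplification of kedro's classes):

```python
from collections import defaultdict


def group_nodes(nodes, memory_datasets):
    """nodes: list of (name, inputs, outputs) tuples;
    memory_datasets: set of dataset names kept in memory."""
    parent = {name: name for name, _, _ in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    # Map each dataset to the node that produces it.
    producers = {ds: name for name, _, outputs in nodes for ds in outputs}

    # Merge consumer and producer whenever they share an in-memory dataset.
    for name, inputs, _ in nodes:
        for ds in inputs:
            if ds in memory_datasets and ds in producers:
                union(name, producers[ds])

    groups = defaultdict(list)
    for name, _, _ in nodes:
        groups[find(name)].append(name)
    return list(groups.values())
```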
The run-once command should support an optional wait-for-completion with a configurable timeout.
When the pipeline is executed, every finishing node sets the status in MLflow to "SUCCEEDED". Therefore, the status in MLflow doesn't reflect the status of the whole pipeline run, only the status of the last executed node.
One of the solutions would be to disable the hook that sets the status and set it once the kedro TaskGroup finishes, based on its status; see the sketch below.
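A sketch of that final step (the status mapping is an assumption; set_terminated is the standard MLflow client API):

```python
from mlflow.tracking import MlflowClient


def finalize_mlflow_run(run_id: str, dag_run_succeeded: bool) -> None:
    # Set the terminal status once, after the whole kedro TaskGroup finishes,
    # instead of letting every node overwrite it.
    status = "FINISHED" if dag_run_succeeded else "FAILED"
    MlflowClient().set_terminated(run_id, status=status)
```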
It looks like one of the comments in the sample config has a line break.
Steps to reproduce (with kedro-airflow-k8s==0.6.3):
kedro airflow-k8s init airflow-url
When the generated conf/base/airflow-k8s.yaml is loaded, the result is:
(<unknown>): found character that cannot start any token while scanning for the next token at line 68 column 13
In case all temporary processing data is stored outside of k8s or does not have to be persisted, usage of a PV is not needed. Turn this into a plugin option.
Hello
With:
kedro 0.17.4
kedro-airflow-k8s 0.7.3
python 3.8.12
I have a templated catalog:
training_data:
  type: spark.SparkDataSet
  filepath: data/${folders.intermediate}/training_data
  file_format: parquet
  save_args:
    mode: 'overwrite'
  layer: intermediate
with the parameter set in my globals.yml:
folders:
  intermediate: 02_intermediate
And when I run:
kedro airflow-k8s compile
I get the following error:
Traceback (most recent call last):
File "/Users/user/miniconda3/envs/kedro/bin/kedro", line 8, in <module>
sys.exit(main())
File "/Users/user/miniconda3/envs/kedro/lib/python3.8/site-packages/kedro/framework/cli/cli.py", line 265, in main
cli_collection()
File "/Users/user/miniconda3/envs/kedro/lib/python3.8/site-packages/click/core.py", line 829, in __call__
return self.main(*args, **kwargs)
File "/Users/user/miniconda3/envs/kedro/lib/python3.8/site-packages/kedro/framework/cli/cli.py", line 210, in main
super().main(
File "/Users/user/miniconda3/envs/kedro/lib/python3.8/site-packages/click/core.py", line 782, in main
rv = self.invoke(ctx)
File "/Users/user/miniconda3/envs/kedro/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/Users/user/miniconda3/envs/kedro/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/Users/user/miniconda3/envs/kedro/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/Users/user/miniconda3/envs/kedro/lib/python3.8/site-packages/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
File "/Users/user/miniconda3/envs/kedro/lib/python3.8/site-packages/click/decorators.py", line 21, in new_func
return f(get_current_context(), *args, **kwargs)
File "/Users/user/miniconda3/envs/kedro/lib/python3.8/site-packages/kedro_airflow_k8s/cli.py", line 64, in compile
) = get_dag_filename_and_template_stream(
File "/Users/user/miniconda3/envs/kedro/lib/python3.8/site-packages/kedro_airflow_k8s/template.py", line 170, in get_dag_filename_and_template_stream
template_stream = _create_template_stream(
File "/Users/user/miniconda3/envs/kedro/lib/python3.8/site-packages/kedro_airflow_k8s/template.py", line 92, in _create_template_stream
pipeline_grouped=context_helper.pipeline_grouped,
File "/Users/user/miniconda3/envs/kedro/lib/python3.8/site-packages/kedro_airflow_k8s/context_helper.py", line 46, in pipeline_grouped
return TaskGroupFactory().create(self.pipeline, self.context.catalog)
File "/Users/user/miniconda3/envs/kedro/lib/python3.8/site-packages/kedro/framework/context/context.py", line 329, in catalog
return self._get_catalog()
File "/Users/user/miniconda3/envs/kedro/lib/python3.8/site-packages/kedro/framework/context/context.py", line 365, in _get_catalog
conf_catalog = self.config_loader.get("catalog*", "catalog*/**", "**/catalog*")
File "/Users/user/miniconda3/envs/kedro/lib/python3.8/site-packages/kedro/config/templated_config.py", line 191, in get
return _format_object(config_raw, self._arg_dict)
File "/Users/user/miniconda3/envs/kedro/lib/python3.8/site-packages/kedro/config/templated_config.py", line 264, in _format_object
new_dict[key] = _format_object(value, format_dict)
File "/Users/user/miniconda3/envs/kedro/lib/python3.8/site-packages/kedro/config/templated_config.py", line 264, in _format_object
new_dict[key] = _format_object(value, format_dict)
File "/Users/user/miniconda3/envs/kedro/lib/python3.8/site-packages/kedro/config/templated_config.py", line 279, in _format_object
return IDENTIFIER_PATTERN.sub(lambda m: str(_format_string(m)), val)
File "/Users/user/miniconda3/envs/kedro/lib/python3.8/site-packages/kedro/config/templated_config.py", line 279, in <lambda>
return IDENTIFIER_PATTERN.sub(lambda m: str(_format_string(m)), val)
File "/Users/user/miniconda3/envs/kedro/lib/python3.8/site-packages/kedro/config/templated_config.py", line 242, in _format_string
raise ValueError(
ValueError: Failed to format pattern '${folders.intermediate}': no config value found, no default provided
With this conf/base/airflow-k8s.yaml:
host: https://airflow.url
output: dags
run_config:
  image: spark_image
  image_pull_policy: Always
  startup_timeout: 600
  namespace: namespace
  experiment_name: experiment
  run_name: experiment
  cron_expression: "@daily"
  description: "experiment Pipeline"
  service_account_name: namespace-vault
  volume:
    disabled: True
  macro_params: [ds, prev_ds]
  variables_params: []
I should add that kedro run works.
Do you have any hint?
Make sure the plugin works with Python versions 3.8, 3.9, and 3.10. Modify the test process to verify all three versions work as expected (matrix builds).
Create MLFLOW_TRACKING_TOKEN from AuthHandler.
bash-3.2$ echo $KEDRO_ENV
pipelines
bash-3.2$ kedro airflow-k8s compile
No files found in ['/XXX/git/spaceflights-in-airflow/conf/base', '/XXX/git/spaceflights-in-airflow/conf/local'] matching the glob pattern(s): ['airflow-k8s*']
bash-3.2$ kedro airflow-k8s -e pipelines compile
### works as expected
Currently there are no versions on PyPI that fit the condition gcsfs>=0.6.3,<0.7.0:
https://pypi.org/project/gcsfs/#history
At the moment tasks are scheduled with the default k8s resource policy, and node assignment follows the cluster policy. It's desirable to optionally indicate which node a specific task should be assigned to and how many resources it may consume.
This should be possible via plugin configuration, where a set of node configurations is available. For every configuration, the label of the k8s node pool should be indicated, as well as the requested memory and CPU resources. It's up to the kedro node to indicate which configuration to use. If a node does not specify a configuration, the default one is used; if no default is specified in the plugin configuration, the pod is scheduled with the k8s cluster defaults.
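A sketch of what such a configuration could look like (all key names are proposals, not the plugin's current schema):

```yaml
run_config:
  resources:
    __default__:
      node_selectors:
        pool: default-pool
      requests:
        cpu: "1"
        memory: 1Gi
    gpu_training:  # a node opts in, e.g. via a tag like resources:gpu_training
      node_selectors:
        pool: gpu-pool
      requests:
        cpu: "4"
        memory: 16Gi
```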
The command should open the Airflow UI in a new browser tab; the URL is to be specified in the config file.