
getindata / kedro-airflow-k8s


Kedro Plugin to support running pipelines on Kubernetes using Airflow.

Home Page: https://kedro-airflow-k8s.readthedocs.io

License: Apache License 2.0

Python 89.55% Jinja 10.14% Shell 0.28% Smarty 0.04%
machinelearning airflow mlops kedro kedro-airflow k8s kuberentes kedro-plugin


kedro-airflow-k8s's Issues

Dependency on other Airflow task

Projects with many pipelines may have dependencies among them. It should be possible to indicate such dependencies when uploading to Airflow.
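In Airflow itself, such a dependency is usually expressed with an ExternalTaskSensor; a minimal sketch of what the generated DAG could contain (task and DAG ids are illustrative, and this is not the plugin's current behaviour):

from airflow.sensors.external_task import ExternalTaskSensor

# Inside the generated DAG definition: wait for another (upstream) DAG run
# to finish before the first task of this pipeline starts.
wait_for_upstream = ExternalTaskSensor(
    task_id="wait-for-upstream-pipeline",    # hypothetical task id
    external_dag_id="upstream-pipeline",     # hypothetical upstream DAG id
    external_task_id=None,                   # wait for the whole DAG run
    timeout=60 * 60,
)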

Fix typo in docs/03_getting_started/01_quickstart.md

Line 154 says

kedor airflow-k8s init --with-github-actions --output ${AIRFLOW_DAG_FOLDER} https://airflow.url

It should say:

kedro airflow-k8s init --with-github-actions --output ${AIRFLOW_DAG_FOLDER} https://airflow.url

feature request: Group kedro nodes in same DAG task

The documentation reads:

Every kedro node is transformed into Airflow DAG task.

As software engineers, we may want multiple kedro nodes to correspond to a single DAG task when logical separation of the code into multiple nodes is preferred (e.g. for readability or reuse). This would also make it possible to use datasets of type "MemoryDataset", which would require such grouping by default (an in-memory dataset cannot be shared across pods).

For this, kedro-airflow-k8s could detect groups of nodes, similar to Spark grouping.
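One possible convention (a sketch, not something the plugin implements) would be to mark the group on the kedro nodes themselves, e.g. via tags, and merge nodes sharing the same group tag into one Airflow task:

from kedro.pipeline import Pipeline, node

def clean(raw_df): ...
def featurize(clean_df): ...

pipeline = Pipeline([
    # Both nodes carry the same (hypothetical) "group:" tag, so the plugin
    # could emit a single Airflow task running them in sequence.
    node(clean, inputs="raw_data", outputs="clean_data", tags=["group:preprocessing"]),
    node(featurize, inputs="clean_data", outputs="features", tags=["group:preprocessing"]),
])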

Add list-pipelines command to the plugin

The command should use the Airflow API to list the pipelines currently deployed in the environment given in the config. Since not all pipelines in Airflow may be produced by the plugin, consider using a tag to distinguish the ML ones and list only those.
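A sketch of the listing part, assuming the Airflow 2.x stable REST API and a hypothetical "ml-pipeline" tag added by the plugin to the DAGs it generates:

import requests

def list_pipelines(airflow_url, tag="ml-pipeline"):
    # GET /api/v1/dags filtered by tag; authentication is omitted here.
    response = requests.get(f"{airflow_url}/api/v1/dags", params={"tags": tag})
    response.raise_for_status()
    return [dag["dag_id"] for dag in response.json()["dags"]]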

Replace hardcoded '/opt' in project_path in spark job with PROJECT_HOME env

In Spark jobs, the init script already sets the PROJECT_HOME variable.

The line:

project_path = "/opt/{{ project_name }}"

should be replaced with something like this:

project_path = os.getenv('PROJECT_HOME', '/opt/{{ project_name }}')

In some cases I need to use a different path for the project. With that change, I could set spark.yarn.appMasterEnv.PROJECT_HOME in the operator factory to change the default /opt path.

Add run-once command to the plugin

The command should upload the DAG to the specified DAG folder and use the Airflow API to trigger it immediately. Additional logic may be needed to wait for the DAG to be loaded.

Support for a wait_for_completion config flag is optional.

Check the kedro-kubeflow plugin for reference.
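A sketch of the triggering step, assuming the Airflow 2.x stable REST API (uploading the DAG and waiting for it to be parsed are left out):

import requests

def trigger_dag(airflow_url, dag_id):
    # POST /api/v1/dags/{dag_id}/dagRuns creates and starts a new run.
    response = requests.post(
        f"{airflow_url}/api/v1/dags/{dag_id}/dagRuns",
        json={"conf": {}},
    )
    response.raise_for_status()
    return response.json()["dag_run_id"]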

Compile step broken with kedro==0.18.0

Steps to reproduce:

  1. Install kedro==0.18.0
  2. Run kedro airflow-k8s compile:
bash-3.2$ kedro airflow-k8s -e pipelines compile
Traceback (most recent call last):
  File "/Users/mwiewior/job/projects/PKOBP/git/kedro-airflow-k8s/venv/kedro/bin/kedro", line 8, in <module>
    sys.exit(main())
  File "/Users/mwiewior/job/projects/PKOBP/git/kedro-airflow-k8s/venv/kedro/lib/python3.8/site-packages/kedro/framework/cli/cli.py", line 268, in main
    cli_collection = KedroCLI(project_path=Path.cwd())
  File "/Users/mwiewior/job/projects/PKOBP/git/kedro-airflow-k8s/venv/kedro/lib/python3.8/site-packages/kedro/framework/cli/cli.py", line 181, in __init__
    self._metadata = bootstrap_project(project_path)
  File "/Users/mwiewior/job/projects/PKOBP/git/kedro-airflow-k8s/venv/kedro/lib/python3.8/site-packages/kedro/framework/startup.py", line 181, in bootstrap_project
    configure_project(metadata.package_name)
  File "/Users/mwiewior/job/projects/PKOBP/git/kedro-airflow-k8s/venv/kedro/lib/python3.8/site-packages/kedro/framework/project/__init__.py", line 219, in configure_project
    settings.configure(settings_module)
  File "/Users/mwiewior/job/projects/PKOBP/git/kedro-airflow-k8s/venv/kedro/lib/python3.8/site-packages/dynaconf/base.py", line 182, in configure
    self._wrapped = Settings(settings_module=settings_module, **kwargs)
  File "/Users/mwiewior/job/projects/PKOBP/git/kedro-airflow-k8s/venv/kedro/lib/python3.8/site-packages/dynaconf/base.py", line 235, in __init__
    self.validators.validate(
  File "/Users/mwiewior/job/projects/PKOBP/git/kedro-airflow-k8s/venv/kedro/lib/python3.8/site-packages/dynaconf/validator.py", line 417, in validate
    validator.validate(self.settings, only=only, exclude=exclude)
  File "/Users/mwiewior/job/projects/PKOBP/git/kedro-airflow-k8s/venv/kedro/lib/python3.8/site-packages/dynaconf/validator.py", line 197, in validate
    self._validate_items(
TypeError: _validate_items() got an unexpected keyword argument 'only'

Downgrading to kedro 0.17.x resolves the issue:
kedro --version
kedro, version 0.17.5

Extract operators from DAG template

The template has grown big recently and entangles quite a bit of logic. Extracting operators would make the code more modular and allow reuse of the operators.

Make usage of PV optional

If all temporary processing data is stored outside of k8s or does not have to be persisted, a PV is not needed. Turn its usage into a plugin option.

Make mlflow an optional dependency

The code that requires mlflow should execute only if kedro-mlflow is properly configured and its dependencies are in place.

Move the mlflow requirements to extras in setup.py, like here.
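A sketch of what that could look like (package and dependency names are illustrative, this is not the project's actual setup.py):

from setuptools import find_packages, setup

setup(
    name="kedro-airflow-k8s",
    packages=find_packages(),
    install_requires=["kedro", "click"],            # core dependencies only
    extras_require={"mlflow": ["kedro-mlflow"]},    # pip install kedro-airflow-k8s[mlflow]
)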

DagRun pod for pipeline (QUESTION)

Instead of launching a pod per node, could I launch a pod for the entire DAG run with the LocalExecutor? If so, has anyone tried this?

I have tasks that are too quick to justify spinning up pods, so I was hoping to have a pod per DAG / DAG run.

Support for S3 as DAG location

The plugin should also support an AWS S3 location as the DAG definitions location. At the moment only the local FS or Google Storage is supported via setup extras.
The plugin should get a new extra supporting S3, following the fsspec standard.
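A sketch of what the upload could look like with an fsspec-compatible S3 backend (bucket and paths are illustrative; the extra would pull in s3fs, analogously to how the Google Storage extra works today):

import fsspec

def upload_dag(local_path, target="s3://example-bucket/dags/pipeline_dag.py"):
    # fsspec resolves the s3:// protocol to the s3fs filesystem if installed.
    with open(local_path, "rb") as src, fsspec.open(target, "wb") as dst:
        dst.write(src.read())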

ValueError: Failed to format pattern '${xxx}': no config value found, no default provided

Hello

With:
kedro 0.17.4
kedro-airflow-k8s 0.7.3
python 3.8.12

I have a templated catalog:

training_data:
  type: spark.SparkDataSet
  filepath: data/${folders.intermediate}/training_data
  file_format: parquet
  save_args:
    mode: 'overwrite'
  layer: intermediate

with the parameter set in my globals.yml

folders:
    intermediate: 02_intermediate

And when I run:
kedro airflow-k8s compile

I get the following error

Traceback (most recent call last):
  File "/Users/user/miniconda3/envs/kedro/bin/kedro", line 8, in <module>
    sys.exit(main())
  File "/Users/user/miniconda3/envs/kedro/lib/python3.8/site-packages/kedro/framework/cli/cli.py", line 265, in main
    cli_collection()
  File "/Users/user/miniconda3/envs/kedro/lib/python3.8/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/Users/user/miniconda3/envs/kedro/lib/python3.8/site-packages/kedro/framework/cli/cli.py", line 210, in main
    super().main(
  File "/Users/user/miniconda3/envs/kedro/lib/python3.8/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/Users/user/miniconda3/envs/kedro/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/user/miniconda3/envs/kedro/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/user/miniconda3/envs/kedro/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/user/miniconda3/envs/kedro/lib/python3.8/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/Users/user/miniconda3/envs/kedro/lib/python3.8/site-packages/click/decorators.py", line 21, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/Users/user/miniconda3/envs/kedro/lib/python3.8/site-packages/kedro_airflow_k8s/cli.py", line 64, in compile
    ) = get_dag_filename_and_template_stream(
  File "/Users/user/miniconda3/envs/kedro/lib/python3.8/site-packages/kedro_airflow_k8s/template.py", line 170, in get_dag_filename_and_template_stream
    template_stream = _create_template_stream(
  File "/Users/user/miniconda3/envs/kedro/lib/python3.8/site-packages/kedro_airflow_k8s/template.py", line 92, in _create_template_stream
    pipeline_grouped=context_helper.pipeline_grouped,
  File "/Users/user/miniconda3/envs/kedro/lib/python3.8/site-packages/kedro_airflow_k8s/context_helper.py", line 46, in pipeline_grouped
    return TaskGroupFactory().create(self.pipeline, self.context.catalog)
  File "/Users/user/miniconda3/envs/kedro/lib/python3.8/site-packages/kedro/framework/context/context.py", line 329, in catalog
    return self._get_catalog()
  File "/Users/user/miniconda3/envs/kedro/lib/python3.8/site-packages/kedro/framework/context/context.py", line 365, in _get_catalog
    conf_catalog = self.config_loader.get("catalog*", "catalog*/**", "**/catalog*")
  File "/Users/user/miniconda3/envs/kedro/lib/python3.8/site-packages/kedro/config/templated_config.py", line 191, in get
    return _format_object(config_raw, self._arg_dict)
  File "/Users/user/miniconda3/envs/kedro/lib/python3.8/site-packages/kedro/config/templated_config.py", line 264, in _format_object
    new_dict[key] = _format_object(value, format_dict)
  File "/Users/user/miniconda3/envs/kedro/lib/python3.8/site-packages/kedro/config/templated_config.py", line 264, in _format_object
    new_dict[key] = _format_object(value, format_dict)
  File "/Users/user/miniconda3/envs/kedro/lib/python3.8/site-packages/kedro/config/templated_config.py", line 279, in _format_object
    return IDENTIFIER_PATTERN.sub(lambda m: str(_format_string(m)), val)
  File "/Users/user/miniconda3/envs/kedro/lib/python3.8/site-packages/kedro/config/templated_config.py", line 279, in <lambda>
    return IDENTIFIER_PATTERN.sub(lambda m: str(_format_string(m)), val)
  File "/Users/user/miniconda3/envs/kedro/lib/python3.8/site-packages/kedro/config/templated_config.py", line 242, in _format_string
    raise ValueError(
ValueError: Failed to format pattern '${folders.intermediate}': no config value found, no default provided

With this conf/base/airflow-k8s.yaml:

host: https://airflow.url

output: dags

run_config:
  image: spark_image
  image_pull_policy: Always
  startup_timeout: 600
  namespace: namespace
  experiment_name: experiment
  run_name: experiment
  cron_expression: "@daily"
  description: "experiment Pipeline"
  service_account_name: namespace-vault
  volume:
      disabled: True
  macro_params: [ds, prev_ds]
  variables_params: []
I would add that kedro run works fine.

Do you have any hint?

Detect failed tasks on run-once with wait for completion.

Run-once with wait for completion enabled checks only the state of the DAG run. However, with the task trigger rule all_done, the pipeline can be successful even though some of the intermediate tasks fail. This situation should be detected and handled according to a policy specified by the user.
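A sketch of the detection part, assuming the Airflow 2.x stable REST API: even when the DAG run finishes successfully, the individual task instances can be inspected for failures.

import requests

def failed_tasks(airflow_url, dag_id, dag_run_id):
    # List all task instances of the run and keep the failed ones.
    response = requests.get(
        f"{airflow_url}/api/v1/dags/{dag_id}/dagRuns/{dag_run_id}/taskInstances"
    )
    response.raise_for_status()
    return [
        ti["task_id"]
        for ti in response.json()["task_instances"]
        if ti["state"] == "failed"
    ]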

Add support for python 3.9 and 3.10

Make sure the plugin works with Python versions 3.8, 3.9 and 3.10. Modify the test process to verify that all three versions work as expected (matrix builds).

Move some of the command line options to a config file and add dynamic config loader support

There are some configuration options required by commands that are tied to the environment. To make working with the plugin easier, and to allow storing some of the configuration in VCS, we could move them to a config file (airflow-k8s.yml).

To make the configuration more robust and dynamic, we could reuse the TemplatedConfigLoader hook (see the sketch after the references below). Check how we do it in kedro-kubeflow-example for reference.

References from kedro-kubeflow:

  • config handler -> here
  • dynamic config loader hook -> here
  • config loader hook registration -> here

Before implementing, we may have a short discussion about which options to move to the config file and which of them we allow to be overridden from the command line.
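A sketch of such a config loader hook, assuming the kedro 0.17.x hook API (the actual implementation lives in the referenced kedro-kubeflow code):

from kedro.config import TemplatedConfigLoader
from kedro.framework.hooks import hook_impl

class ConfigLoaderHooks:
    @hook_impl
    def register_config_loader(self, conf_paths):
        # Resolve ${...} placeholders in config files from *globals.yml.
        return TemplatedConfigLoader(conf_paths, globals_pattern="*globals.yml")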

Invalid configuration generated by `kedro airflow-k8s init`

It looks like one of the comments in the sample config has a line break.

Steps to reproduce:

  1. Install kedro-airflow-k8s==0.6.3
  2. Run kedro airflow-k8s init airflow-url
  3. Run YAML linter on conf/base/airflow-k8s.yaml, the result is:
(<unknown>): found character that cannot start any token while scanning for the next token at line 68 column 13

Select pipeline for DAG generation

Similar to the vanilla kedro -p option, which selects the pipeline to operate on, the airflow-k8s plugin could allow selecting a specific pipeline for DAG generation.
That functionality could be useful in bigger projects, where pipelines are complex and can be separated, but share common nodes.

Ensure pipeline works when git not initialized

Currently we use the kedro session store to properly tag the Airflow DAG:

git_info=context_helper.session.store["git"],

This causes a build failure when git is not initialized in a project.

My suggestion is to obtain the git SHA the following way (see the sketch below):

  1. If the KEDRO_CONFIG_COMMIT_ID env variable is set, use it,
  2. else, if the session store has git info, use it,
  3. else, set the git SHA to "UNKNOWN".
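A minimal sketch of that fallback (KEDRO_CONFIG_COMMIT_ID is the variable proposed above; the session store lookup mirrors the line quoted earlier):

import os

def get_git_sha(context_helper):
    # 1. Explicit override via environment variable.
    commit_id = os.environ.get("KEDRO_CONFIG_COMMIT_ID")
    if commit_id:
        return commit_id
    # 2. Fall back to the kedro session store, if git info is available.
    git_info = context_helper.session.store.get("git")
    if git_info:
        return git_info
    # 3. Last resort when git is not initialized at all.
    return "UNKNOWN"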

Invalid status in Mlflow during pipeline execution

When the pipeline is executed, every finishing node sets the status in Mlflow to "SUCEEDED". Therefore, the status in Mlflow doesn't reflect the status of the whole pipeline, only the status of the last executed node.

One of the solutions would be to disable the hook that sets the status and set it once after the kedro TaskGroup finishes, based on its status.
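A sketch of that final step using the standard MlflowClient API (how the run id is obtained and where this is called from are left open):

from mlflow.tracking import MlflowClient

def close_run(run_id, pipeline_succeeded):
    # Set the terminal status once, for the whole pipeline run.
    status = "FINISHED" if pipeline_succeeded else "FAILED"
    MlflowClient().set_terminated(run_id, status=status)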

kedro-airflow-k8s does not support KEDRO_ENV

bash-3.2$ echo $KEDRO_ENV
pipelines
bash-3.2$  kedro airflow-k8s compile
 No files found in ['/XXX/git/spaceflights-in-airflow/conf/base', '/XXX/git/spaceflights-in-airflow/conf/local'] matching the glob pattern(s): ['airflow-k8s*']
bash-3.2$ kedro airflow-k8s -e pipelines compile
### works as expected
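A sketch of the expected behaviour (the helper is illustrative, not the plugin's actual code): when no -e option is given, fall back to the KEDRO_ENV environment variable.

import os
from typing import Optional

def resolve_env(cli_env: Optional[str]) -> Optional[str]:
    # Prefer the explicit -e/--env option, otherwise use KEDRO_ENV.
    return cli_env or os.environ.get("KEDRO_ENV")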

Configuration for pod scaling

At the moment tasks are scheduled with the default k8s resource policy, and node assignment follows the cluster policy. It's desired to optionally indicate which node to assign and how many resources a specific task may consume.

This should be possible via the plugin configuration, where a set of node configurations is available. For every configuration, the label of the k8s node pool should be indicated, as well as the requested memory and CPU resources. It's up to the node to indicate which configuration to use. If a node does not specify a configuration, the default configuration is used; if no default is specified in the plugin configuration, the pod is scheduled with the k8s cluster defaults.
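A sketch of what one named configuration could translate to on the operator side (configuration keys are illustrative; V1ResourceRequirements and node selectors are the standard Kubernetes client concepts such a pod would be scheduled with):

from kubernetes.client import V1ResourceRequirements

# Hypothetical mapping from a configuration name to node pool and resources.
node_pool_configs = {
    "cpu_intensive": {
        "node_selector": {"pool": "high-cpu"},           # k8s node pool label
        "resources": V1ResourceRequirements(
            requests={"cpu": "4", "memory": "8Gi"},
            limits={"cpu": "4", "memory": "8Gi"},
        ),
    },
}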
