kedro-org / kedro-starters
Templates for your Kedro projects.
License: Apache License 2.0
GitHub Actions workflow .github/workflows/nightly-build.yml #37 failed.
Following up from #89 - the Windows test runs do not pass due to a problem with pyspark-iris. The setup is a bit tough to dig through, and this issue deserves its own dedicated time.
Hello, all kedro commands return the following error when run inside a project created from kedro new --starter=pyspark-iris (I haven't checked pyspark, but I assume it has the same issue) inside a fresh virtualenv that has kedro installed (besides pip):
(mm-venv) bash-4.2$ kedro install
Traceback (most recent call last):
File "/you/shall/not/path/ds-workspace/mm-venv/bin/kedro", line 8, in <module>
sys.exit(main())
File "/you/shall/not/path/ds-workspace/mm-venv/lib/python3.7/site-packages/kedro/framework/cli/cli.py", line 268, in main
cli_collection = KedroCLI(project_path=Path.cwd())
File "/you/shall/not/path/ds-workspace/mm-venv/lib/python3.7/site-packages/kedro/framework/cli/cli.py", line 181, in __init__
self._metadata = bootstrap_project(project_path)
File "/you/shall/not/path/ds-workspace/mm-venv/lib/python3.7/site-packages/kedro/framework/startup.py", line 181, in bootstrap_project
configure_project(metadata.package_name)
File "/you/shall/not/path/ds-workspace/mm-venv/lib/python3.7/site-packages/kedro/framework/project/__init__.py", line 218, in configure_project
_validate_module(settings_module)
File "/you/shall/not/path/ds-workspace/mm-venv/lib/python3.7/site-packages/kedro/framework/project/__init__.py", line 210, in _validate_module
importlib.import_module(settings_module)
File "/usr/local/opt/[email protected]/Frameworks/Python.framework/Versions/3.7/lib/python3.7/importlib/__init__.py", line 127, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
File "<frozen importlib._bootstrap>", line 983, in _find_and_load
File "<frozen importlib._bootstrap>", line 967, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 677, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 728, in exec_module
File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
File "/private/tmp/new-kedro-project/src/new_kedro_project/settings.py", line 30, in <module>
from new_kedro_project.context import ProjectContext
File "/private/tmp/new-kedro-project/src/new_kedro_project/context.py", line 34, in <module>
from pyspark import SparkConf
As the stack trace shows, the reason is that kedro tries to validate the project before kedro install has had a chance to run, so every kedro command fails because the project context tries to import packages that are not yet installed.
There are some inelegant workarounds, but I wonder: is this indeed the issue, or have I misunderstood something?
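One commonly mentioned workaround (a sketch, not the starter's actual code) is to defer heavyweight imports into the method that needs them, so that merely importing the project's settings or context module succeeds before dependencies are installed. A minimal stdlib illustration, using an invented module name in place of pyspark:

```python
# Minimal illustration of the lazy-import workaround: the class below
# imports fine even though `some_missing_dep` does not exist; the
# failure is deferred until the method actually runs.
class ProjectContext:
    def init_spark_session(self):
        import some_missing_dep  # resolved only when this method is called
        return some_missing_dep


ctx = ProjectContext()  # constructing the context works without the dependency
try:
    ctx.init_spark_session()
except ModuleNotFoundError as exc:
    print(exc.name)  # some_missing_dep
```

With the top-level `from pyspark import SparkConf` moved inside a method like this, bootstrapping the project no longer requires pyspark to already be installed.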
This framework issue will update the tutorial docs for spaceflights; we should update the starter to match the final output of the tutorial.
Other changes/improvements/updates to the tutorial that affect code should also always be pushed through to the starter so the two stay in sync.
There's separate discussion of creating additional tutorials (based on Spaceflights) to cover optional/advanced aspects such as reporting with Plotly, experiment tracking and modular pipelines/namespacing.
I installed requirements.txt and get this immediately after kedro ipython:
kedro.io.core.DatasetError: Class 'spark.SparkDataset' not found, is this a typo?
Is there any additional config needed for starter=spaceflights-pyspark-viz?
I tried this on Windows with 0.19.1 and 0.19.2, both yielding the same issue. kedro-datasets is 1.5.1.
Now that the pandas-iris starter has been simplified and the bug in the pyspark starters has been fixed, we can simplify pyspark-iris in a similar way.
Related to: #79
The .gitignore file is missing when I tried to create a project using starters.
Kedro version used (pip show kedro or kedro -V): 0.17.5
Python version used (python -V): Python 3.8.12
This one is a quick, nice-to-have feature. Let's add this badge to all Kedro starter README.md files:
Here's an example: https://github.com/kedro-org/kedro-starters/tree/main/spaceflights-pandas-viz/%7B%7B%20cookiecutter.repo_name%20%7D%7D
Right now, the starters show a FutureWarning because they use the ConfigLoader, which is deprecated.
[08/17/23 11:53:39] INFO Kedro project test-graph session.py:364
WARNING /Users/juan_cano/.micromamba/envs/kedro310/lib/python3.10/site warnings.py:109
-packages/kedro/framework/session/session.py:266:
FutureWarning: ConfigLoader will be deprecated in Kedro 0.19.
Please use the OmegaConfigLoader instead. To consult the
documentation for OmegaConfigLoader, see here:
https://docs.kedro.org/en/stable/configuration/advanced_config
uration.html#omegaconfigloader
warnings.warn(
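The fix the warning points at is a one-line change in each starter's settings.py. A sketch, assuming the standard Kedro settings hook name:

```python
# settings.py (fragment): opt in to OmegaConfigLoader, the replacement
# the FutureWarning recommends. CONFIG_LOADER_CLASS is Kedro's documented
# settings hook for choosing a config loader implementation.
from kedro.config import OmegaConfigLoader

CONFIG_LOADER_CLASS = OmegaConfigLoader
```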
Following the starters work in #159, we switched to using the develop branch in test_requirements.txt so starters could use the new add-ons flow from kedro and unblock development. This also meant we had to update kedro-viz to work with develop, so we had to point all requirements.txt files in the starters to the kedro-viz main repo instead of PyPI.
Once Kedro 0.19.0 and the next release of Kedro-Viz are out, we can revert these back to the default settings:
- Revert kedro to using the main branch in test_requirements.txt in framework.
- Revert the kedro-viz pin in requirements.txt to point to the release version (PyPI).
Upon running kedro new --starter=spaceflights-pandas, running python -m pytest will produce two layers of errors, across several versions of Python.
How has this bug affected you? What were you trying to accomplish?
(This occurs across Python 3.10, 3.11, and 3.12)
kedro new --starter=spaceflights-pandas
cd spaceflights-pandas
pip install -r requirements.txt
python -m pytest
All tests that come with the starter should pass without error.
There are two levels of errors:
The first is caused by the tests directory being at project root. The Kedro documentation's Automated Testing page instructs users to run pip install -e .; however, the starter's README makes no mention of this. Thus, upon seeing tests and running python -m pytest, users see this error message:
$ python -m pytest
==================================================================================== test session starts ====================================================================================
platform darwin -- Python 3.11.5, pytest-7.4.4, pluggy-1.3.0
rootdir: /Users/MyUserName/Downloads/spaceflights-pandas
configfile: pyproject.toml
plugins: mock-1.13.0, anyio-3.7.1, cov-3.0.0
collected 1 item / 1 error
/Users/MyUserName/Downloads/spaceflights-pandas/venv/lib/python3.11/site-packages/coverage/control.py:888: CoverageWarning: No data was collected. (no-data-collected)
self._warn("No data was collected.", slug="no-data-collected")
========================================================================================== ERRORS ===========================================================================================
______________________________________________________________ ERROR collecting tests/pipelines/data_science/test_pipeline.py _______________________________________________________________
ImportError while importing test module '/Users/MyUserName/Downloads/spaceflights-pandas/tests/pipelines/data_science/test_pipeline.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
/opt/homebrew/Cellar/[email protected]/3.11.5/Frameworks/Python.framework/Versions/3.11/lib/python3.11/importlib/__init__.py:126: in import_module
return _bootstrap._gcd_import(name[level:], package, level)
tests/pipelines/data_science/test_pipeline.py:6: in <module>
from spaceflights_pandas.pipelines.data_science import create_pipeline as create_ds_pipeline
E ModuleNotFoundError: No module named 'spaceflights_pandas'
===================================================================================== warnings summary ======================================================================================
venv/lib/python3.11/site-packages/pytest_cov/plugin.py:256
/Users/MyUserName/Downloads/spaceflights-pandas/venv/lib/python3.11/site-packages/pytest_cov/plugin.py:256: PytestDeprecationWarning: The hookimpl CovPlugin.pytest_configure_node uses old-
style configuration options (marks or attributes).
Please use the pytest.hookimpl(optionalhook=True) decorator instead
to configure the hooks.
See https://docs.pytest.org/en/latest/deprecations.html#configuring-hook-specs-impls-using-markers
def pytest_configure_node(self, node):
venv/lib/python3.11/site-packages/pytest_cov/plugin.py:265
/Users/MyUserName/Downloads/spaceflights-pandas/venv/lib/python3.11/site-packages/pytest_cov/plugin.py:265: PytestDeprecationWarning: The hookimpl CovPlugin.pytest_testnodedown uses old-st
yle configuration options (marks or attributes).
Please use the pytest.hookimpl(optionalhook=True) decorator instead
to configure the hooks.
See https://docs.pytest.org/en/latest/deprecations.html#configuring-hook-specs-impls-using-markers
def pytest_testnodedown(self, node, error):
-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
---------- coverage: platform darwin, python 3.11.5-final-0 ----------
Name Stmts Miss Cover Missing
---------------------------------------------------------------------------------------------
src/spaceflights_pandas/__init__.py 1 1 0% 4
src/spaceflights_pandas/__main__.py 30 30 0% 4-47
src/spaceflights_pandas/pipeline_registry.py 7 7 0% 2-16
src/spaceflights_pandas/pipelines/__init__.py 0 0 100%
src/spaceflights_pandas/pipelines/data_processing/__init__.py 1 1 0% 3
src/spaceflights_pandas/pipelines/data_processing/nodes.py 26 26 0% 1-68
src/spaceflights_pandas/pipelines/data_processing/pipeline.py 4 4 0% 1-7
src/spaceflights_pandas/pipelines/data_science/__init__.py 1 1 0% 3
src/spaceflights_pandas/pipelines/data_science/nodes.py 20 20 0% 1-55
src/spaceflights_pandas/pipelines/data_science/pipeline.py 4 4 0% 1-7
src/spaceflights_pandas/settings.py 3 3 0% 27-31
---------------------------------------------------------------------------------------------
TOTAL 97 97 0%
================================================================================== short test summary info ==================================================================================
ERROR tests/pipelines/data_science/test_pipeline.py
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! Interrupted: 1 error during collection !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
=============================================================================== 2 warnings, 1 error in 11.80s ===============================================================================
Upon either running pip install -e . or moving the tests directory within src, and then running python -m pytest again, users see a second error:
$ python -m pytest
==================================================================================== test session starts ====================================================================================
platform darwin -- Python 3.11.5, pytest-7.4.4, pluggy-1.3.0
rootdir: /Users/MyUserName/Downloads/spaceflights-pandas
configfile: pyproject.toml
plugins: mock-1.13.0, anyio-3.7.1, cov-3.0.0
collected 4 items
tests/test_run.py E [ 25%]
tests/pipelines/data_science/test_pipeline.py ... [100%]
========================================================================================== ERRORS ===========================================================================================
__________________________________________________________________ ERROR at setup of TestProjectContext.test_project_path ___________________________________________________________________
config_loader = OmegaConfigLoader(conf_source=/Users/MyUserName/Downloads/spaceflights-pandas, env=None, config_patterns={'catalog': ['ca... '**/parameters*'], 'credentials': ['credentials*',
'credentials*/**', '**/credentials*'], 'globals': ['globals.yml']})
@pytest.fixture
def project_context(config_loader):
> return KedroContext(
package_name="spaceflights_pandas",
project_path=Path.cwd(),
config_loader=config_loader,
hook_manager=_create_hook_manager(),
)
E TypeError: KedroContext.__init__() missing 1 required positional argument: 'env'
tests/test_run.py:23: TypeError
===================================================================================== warnings summary ======================================================================================
venv/lib/python3.11/site-packages/pytest_cov/plugin.py:256
/Users/MyUserName/Downloads/spaceflights-pandas/venv/lib/python3.11/site-packages/pytest_cov/plugin.py:256: PytestDeprecationWarning: The hookimpl CovPlugin.pytest_configure_node uses old-
style configuration options (marks or attributes).
Please use the pytest.hookimpl(optionalhook=True) decorator instead
to configure the hooks.
See https://docs.pytest.org/en/latest/deprecations.html#configuring-hook-specs-impls-using-markers
def pytest_configure_node(self, node):
venv/lib/python3.11/site-packages/pytest_cov/plugin.py:265
/Users/MyUserName/Downloads/spaceflights-pandas/venv/lib/python3.11/site-packages/pytest_cov/plugin.py:265: PytestDeprecationWarning: The hookimpl CovPlugin.pytest_testnodedown uses old-st
yle configuration options (marks or attributes).
Please use the pytest.hookimpl(optionalhook=True) decorator instead
to configure the hooks.
See https://docs.pytest.org/en/latest/deprecations.html#configuring-hook-specs-impls-using-markers
def pytest_testnodedown(self, node, error):
tests/pipelines/data_science/test_pipeline.py::test_data_science_pipeline
/Users/MyUserName/Downloads/spaceflights-pandas/venv/lib/python3.11/site-packages/sklearn/metrics/_regression.py:1187: UndefinedMetricWarning: R^2 score is not well-defined with less than
two samples.
warnings.warn(msg, UndefinedMetricWarning)
-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
---------- coverage: platform darwin, python 3.11.5-final-0 ----------
Name Stmts Miss Cover Missing
---------------------------------------------------------------------------------------------
src/spaceflights_pandas/__init__.py 1 0 100%
src/spaceflights_pandas/__main__.py 30 30 0% 4-47
src/spaceflights_pandas/pipeline_registry.py 7 7 0% 2-16
src/spaceflights_pandas/pipelines/__init__.py 0 0 100%
src/spaceflights_pandas/pipelines/data_processing/__init__.py 1 1 0% 3
src/spaceflights_pandas/pipelines/data_processing/nodes.py 26 26 0% 1-68
src/spaceflights_pandas/pipelines/data_processing/pipeline.py 4 4 0% 1-7
src/spaceflights_pandas/pipelines/data_science/__init__.py 1 0 100%
src/spaceflights_pandas/pipelines/data_science/nodes.py 20 0 100%
src/spaceflights_pandas/pipelines/data_science/pipeline.py 4 0 100%
src/spaceflights_pandas/settings.py 3 3 0% 27-31
---------------------------------------------------------------------------------------------
TOTAL 97 71 27%
================================================================================== short test summary info ==================================================================================
ERROR tests/test_run.py::TestProjectContext::test_project_path - TypeError: KedroContext.__init__() missing 1 required positional argument: 'env'
========================================================================== 3 passed, 3 warnings, 1 error in 14.05s ==========================================================================
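The TypeError above is ordinary Python behavior for a parameter declared without a default: whatever the documentation says, a default of "local" cannot apply unless it appears in the signature. A minimal stdlib reproduction with an invented class:

```python
# A required positional parameter with no default raises TypeError at
# call time, regardless of what the docs claim the default should be.
class FakeContext:
    def __init__(self, package_name, project_path, config_loader, hook_manager, env):
        self.env = env


try:
    FakeContext("pkg", "/tmp/proj", None, None)  # env omitted, as in the fixture
except TypeError as exc:
    message = str(exc)

print("env" in message)  # True: the error names the missing argument
```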
The KedroContext documentation states that env should default to "local", but that default seems not to be getting picked up here. Manually adding env="local" resolves the error:
$ python -m pytest
==================================================================================== test session starts ====================================================================================
platform darwin -- Python 3.11.5, pytest-7.4.4, pluggy-1.3.0
rootdir: /Users/MyUserName/Downloads/spaceflights-pandas/spaceflights-pandas
configfile: pyproject.toml
plugins: mock-1.13.0, anyio-3.7.1, cov-3.0.0
collected 4 items
tests/test_run.py . [ 25%]
tests/pipelines/data_science/test_pipeline.py ... [100%]
===================================================================================== warnings summary ======================================================================================
../venv/lib/python3.11/site-packages/pytest_cov/plugin.py:256
/Users/MyUserName/Downloads/spaceflights-pandas/venv/lib/python3.11/site-packages/pytest_cov/plugin.py:256: PytestDeprecationWarning: The hookimpl CovPlugin.pytest_configure_node uses old-st
yle configuration options (marks or attributes).
Please use the pytest.hookimpl(optionalhook=True) decorator instead
to configure the hooks.
See https://docs.pytest.org/en/latest/deprecations.html#configuring-hook-specs-impls-using-markers
def pytest_configure_node(self, node):
../venv/lib/python3.11/site-packages/pytest_cov/plugin.py:265
/Users/MyUserName/Downloads/spaceflights-pandas/venv/lib/python3.11/site-packages/pytest_cov/plugin.py:265: PytestDeprecationWarning: The hookimpl CovPlugin.pytest_testnodedown uses old-styl
e configuration options (marks or attributes).
Please use the pytest.hookimpl(optionalhook=True) decorator instead
to configure the hooks.
See https://docs.pytest.org/en/latest/deprecations.html#configuring-hook-specs-impls-using-markers
def pytest_testnodedown(self, node, error):
tests/pipelines/data_science/test_pipeline.py::test_data_science_pipeline
/Users/MyUserName/Downloads/spaceflights-pandas/venv/lib/python3.11/site-packages/sklearn/metrics/_regression.py:1187: UndefinedMetricWarning: R^2 score is not well-defined with less than tw
o samples.
warnings.warn(msg, UndefinedMetricWarning)
-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
---------- coverage: platform darwin, python 3.11.5-final-0 ----------
Name Stmts Miss Cover Missing
---------------------------------------------------------------------------------------------
src/spaceflights_pandas/__init__.py 1 0 100%
src/spaceflights_pandas/__main__.py 30 30 0% 4-47
src/spaceflights_pandas/pipeline_registry.py 7 7 0% 2-16
src/spaceflights_pandas/pipelines/__init__.py 0 0 100%
src/spaceflights_pandas/pipelines/data_processing/__init__.py 1 1 0% 3
src/spaceflights_pandas/pipelines/data_processing/nodes.py 26 26 0% 1-68
src/spaceflights_pandas/pipelines/data_processing/pipeline.py 4 4 0% 1-7
src/spaceflights_pandas/pipelines/data_science/__init__.py 1 0 100%
src/spaceflights_pandas/pipelines/data_science/nodes.py 20 0 100%
src/spaceflights_pandas/pipelines/data_science/pipeline.py 4 0 100%
src/spaceflights_pandas/settings.py 3 3 0% 27-31
---------------------------------------------------------------------------------------------
TOTAL 97 71 27%
=============================================================================== 4 passed, 3 warnings in 2.13s ===============================================================================
Include as many relevant details about the environment in which you experienced the bug:
Kedro version (pip show kedro or kedro -V): 0.19.5 (though I believe that this happens with older versions as well, ever since the tests directory was moved to the project root)
Python version (python -V): 3.10, 3.11, 3.12
Proposed fixes:
- Document the need to run pip install -e ., or else move tests to be within src.
- Add env="local" to KedroContext in the example test file above.
- In pyproject.toml's [tool.pytest.ini_options] section, add filterwarnings = ["ignore::DeprecationWarning:.*pytest_cov*"] to suppress the pytest-cov warnings above.
I would be happy to contribute a PR implementing the above, but thought to ask first: would those changes be welcome? Or, specifically as with the KedroContext error above, is it possible that part of this points to something that needs to be updated in kedro itself?
I used the kedro pyspark starter with kedro version 0.18.3. When trying to execute kedro run, it produces the following error:
ConstructorError: while constructing a mapping
in ".../conf/base/logging.yml", line 43, column 3
found unacceptable key (unhashable type: 'dict')
With the logging.yml file removed, kedro run completes without error.
How has this bug affected you? What were you trying to accomplish?
We use alloy, and this error is causing execution errors.
Steps to reproduce:
- Use alloy to create a new kedro project with the kedro pyspark starter version 0.18.3.
- Run kedro run within the created kedro project.
kedro run should execute without error.
Tell us what happens instead.
ConstructorError: while constructing a mapping
in ".../conf/base/logging.yml", line 43, column 3
found unacceptable key (unhashable type: 'dict')
Include as many relevant details about the environment in which you experienced the bug:
Kedro version (pip show kedro or kedro -V): 0.18.3
Python version (python -V): 3.9
Missing tag for Kedro 0.18.2
This issue is the same as kedro-org/kedro#3110, whose resolution is proposed in PR kedro-org/kedro#3119.
- When merging rated_shuttles with companies, both dataframes have an id column, and this creates id_x and id_y, which could be avoided by selectively dropping id before merging.
- companies has duplicate rows, so when merging rated_shuttles with companies, some rows are repeated and this might distort the result, which could be avoided by doing a .drop_duplicates() on companies.
[...]
rated_shuttles = shuttles.merge(reviews, left_on="id", right_on="shuttle_id")
rated_shuttles = rated_shuttles.drop("id", axis=1)
companies = companies.drop_duplicates()
model_input_table = rated_shuttles.merge(
    companies, left_on="company_id", right_on="id"
)
[...]
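Both problems and the suggested fix can be demonstrated on toy data. A sketch: the column names follow the issue, the rows are invented, and this assumes pandas is installed.

```python
import pandas as pd

# Toy frames mirroring shuttles/reviews/companies from the spaceflights
# tutorial (column names from the issue; the data itself is made up).
shuttles = pd.DataFrame({"id": [1, 2], "company_id": [10, 20]})
reviews = pd.DataFrame({"shuttle_id": [1, 2], "review_score": [4.5, 3.0]})
companies = pd.DataFrame({"id": [10, 10, 20], "name": ["A", "A", "B"]})  # duplicate row

# Naive merge: both sides carry an `id` column, so pandas emits id_x/id_y,
# and the duplicate company row inflates the result from 2 rows to 3.
naive = shuttles.merge(reviews, left_on="id", right_on="shuttle_id").merge(
    companies, left_on="company_id", right_on="id"
)
print(sorted(c for c in naive.columns if c.startswith("id")))  # ['id_x', 'id_y']
print(len(naive))  # 3

# Fix from the issue: drop `id` before merging and de-duplicate companies.
rated_shuttles = shuttles.merge(reviews, left_on="id", right_on="shuttle_id")
rated_shuttles = rated_shuttles.drop("id", axis=1)
fixed = rated_shuttles.merge(
    companies.drop_duplicates(), left_on="company_id", right_on="id"
)
print(len(fixed))  # 2
```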
In this repository, this should be added in two files:
(Optional) Describe any alternative solutions or features you've considered.
Annoyingly I think we need to do the same thing in the starters repo 🙈
https://github.com/kedro-org/kedro-starters/blob/1559a2a4750caed6d98c7e880bec70b5c71b43d4/astro-airflow-iris/%7B%7B%20cookiecutter.repo_name%20%7D%7D/src/requirements.txt
Originally posted by @datajoely in kedro-org/kedro#1322 (comment)
When I run the command kedro new -s standalone-datacatalog
I get an error saying that the starter is not found.
Just for clarification, the following command works normally as an alternative: kedro new -s https://github.com/kedro-org/kedro-starters/ --checkout 0.18.0 --directory standalone-datacatalog
I was trying to create a minimal project to test how the plugins work. This has an easy workaround, so it does not affect me.
kedro new -s standalone-datacatalog
or kedro new -s standalone-datacatalog --checkout 0.18.0
Kedro should move on to ask the normal questions such as project and repo name.
An error is displayed that says the starter is not found.
kedro.framework.cli.utils.KedroCliError: Kedro project template not found at standalone-datacatalog . Specified tag 0.18.0. The following tags are available: . The aliases for the official Kedro starters are:
- astro-airflow-iris
- mini-kedro
- pandas-iris
- pyspark
- pyspark-iris
- spaceflights
Run with --verbose to see the full exception
Error: Kedro project template not found at standalone-datacatalog . Specified tag 0.18.0. The following tags are available: . The aliases for the official Kedro starters are:
- astro-airflow-iris
- mini-kedro
- pandas-iris
- pyspark
- pyspark-iris
- spaceflights
Include as many relevant details about the environment in which you experienced the bug:
pip show kedro
or kedro -V
): kedro, version 0.18.0
python -V
): Python 3.8.5
To be merged after 0.18.0 has been released.
Removes the hooks.py files containing registration hooks from our starters.
pyspark 3.4.0 was released on the 13th of April and has broken our pyspark-iris starter, and potentially the other pyspark starters. This was initially discovered because the kedro-docker e2e tests that use the pyspark-iris starter were failing.
It's failing on all builds, so Python versions 3.7, 3.8, 3.9, and 3.10. Python 3.7 is expected to fail, as its support is deprecated in Spark 3.4.
The problem seems to occur when the dataset is loaded, but it's not yet clear why.
See the logs here: https://app.circleci.com/pipelines/github/kedro-org/kedro-starters/564/workflows/62a766f3-5e4c-4ad2-8e49-5907ec66a426/jobs/5820
I've used a project created with the pyspark-iris
starter to reproduce the error.
Through logging statements I've been able to trace the error to the point where the dataset is loaded, specifically to line: https://github.com/kedro-org/kedro/blob/main/kedro/extras/datasets/spark/spark_dataset.py#L386
I've used this branch to get to that point: https://github.com/kedro-org/kedro/tree/debug-spark-issue
Note that I've used the SparkDataSet from inside the Kedro package for ease of debugging.
Loading the dataset directly in python code works fine:
from kedro.io import DataCatalog
from kedro.extras.datasets.spark import SparkDataSet

spark_ds = SparkDataSet(
    filepath="/Users/merel_theisen/Projects/Testing/spark-issue/data/01_raw/iris.csv",
    file_format="csv",
    load_args={"header": True, "inferSchema": True},
    save_args={"header": True},
)
catalog = DataCatalog({"iris": spark_ds})
df = catalog.load("iris")
This line https://github.com/kedro-org/kedro/blob/main/kedro/extras/datasets/spark/spark_dataset.py#L386 results in:
The main change that seems to be new for 3.4.0 in the methods we use is:
.. versionchanged:: 3.4.0
Supports Spark Connect.
With the changes to the directory structure applied in kedro-org/kedro#3731 now merged, @ankatiyar proposed updating the structure of folders created by the Kedro starters to match them.
The current structure has the pipeline tests created by starters placed in <project root>/tests/pipelines
, while the tests created from the kedro pipeline create
command go into <project root>/tests/pipelines/<pipeline name>
. What is being proposed is to put the pipeline tests from starters into their own directory as well.
For example:
In the image above, we can see test_data_science.py, which was created by using the spaceflights-pandas starter and is located directly in the tests directory. The test file for my_pipeline, which was created with kedro pipeline create, is in its own directory.
The initial idea would be to put the pipeline tests generated from starters in their own directory. In the aforementioned example, the structure would change from <project root>/tests/pipelines/test_data_science.py to <project root>/tests/pipelines/data_science/test_pipeline.py.
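The proposed per-pipeline layout can be sketched with pathlib (pipeline names follow the example above; the scratch root is only for illustration):

```python
# Build the proposed test layout in a scratch directory: each pipeline's
# tests get their own folder under tests/pipelines/.
import tempfile
from pathlib import Path

root = Path(tempfile.mkdtemp())
for pipeline in ["data_science", "my_pipeline"]:
    test_file = root / "tests" / "pipelines" / pipeline / "test_pipeline.py"
    test_file.parent.mkdir(parents=True)
    test_file.touch()

layout = sorted(p.relative_to(root).as_posix() for p in root.rglob("*.py"))
print(layout)
# ['tests/pipelines/data_science/test_pipeline.py',
#  'tests/pipelines/my_pipeline/test_pipeline.py']
```

This mirrors what kedro pipeline create already does, so starter-generated and command-generated tests would end up side by side.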
As per the title: the pyproject.toml with the Kedro metadata is missing, which leads to some subcommands not being available, for example kedro jupyter:
I was trying to use the standalone-datacatalog starter from Jupyter to have a minimal Kedro setup from a notebook, but found that the kedro.ipython extension was not loading.
Steps to reproduce:
kedro new -s standalone-datacatalog
kedro jupyter notebook
Expected: kedro jupyter works for all starters.
juan_cano@M-PH9T4K3P3C /t/test-kedro-ipython-mini> kedro jupyter notebook (kpolars310)
Usage: kedro [OPTIONS] COMMAND [ARGS]...
Try 'kedro -h' for help.
Error: No such command 'jupyter'.
juan_cano@M-PH9T4K3P3C /t/test-kedro-ipython-mini [2]> kedro (kpolars310)
Usage: kedro [OPTIONS] COMMAND [ARGS]...
Kedro is a CLI for creating and using Kedro projects. For more information,
type ``kedro info``.
Options:
-V, --version Show version and exit
-h, --help Show this message and exit.
Global commands from Kedro
Commands:
docs See the kedro API docs and introductory tutorial.
info Get more information about kedro.
new Create a new kedro project.
starter Commands for working with project starters.
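For context, the metadata Kedro looks for lives in a [tool.kedro] table in pyproject.toml; without it, Kedro does not treat the directory as a project, so project-only commands like kedro jupyter are not registered. A sketch of what a generated project would contain (field names as of Kedro 0.18; values illustrative):

```toml
# pyproject.toml (fragment): the Kedro metadata table whose absence
# makes project-only subcommands such as `kedro jupyter` unavailable.
[tool.kedro]
package_name = "standalone_datacatalog"
project_name = "standalone-datacatalog"
kedro_init_version = "0.18.4"
```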
Kedro version (pip show kedro or kedro -V): 0.18.4
Python version (python -V): 3.10.9
We found that some links in pyspark-iris are broken when reviewing the databricks-iris starter.
It would be good to go through all the links in the starters (across docstrings / READMEs), check whether they are broken, and if so, fix them.
(transferred from Jira, created by @ignacioparicio)
kedro-starters CI will fail whenever we update kedro's requirements.txt. This is because:
- CI installs test_requirements.txt (https://github.com/kedro-org/kedro-starters/blob/master/features/environment.py#L70). This file refers to the main branch of kedro (https://github.com/kedro-org/kedro-starters/blob/master/test_requirements.txt#L8), which is ahead of the latest release.
- The starters pin _kedro[pandas.CSVDataSet]=={{ cookiecutter.kedro_version }}_ (https://github.com/kedro-org/kedro-starters/blob/master/pandas-iris/%7B%7B%20cookiecutter.repo_name%20%7D%7D/src/requirements.txt for pandas-iris). Unlike before, this points to the latest kedro release.
- When requirements.txt files are updated, this leads to a conflict.
For now we avoid doing pip compile in order to deal with non-breaking changes in requirements.txt (see #36). A better solution would make CI point to kedro's main branch, not the latest release.
A possible implementation of this solution would be to patch requirements.txt in the kedro-starters CI to change any kedro[*]=={{ cookiecutter.kedro_version }} (which points to the latest kedro release) to point to main instead. This would then enable a pure kedro install command to be run during CI (without the --no-build-reqs flag).
If a solution is found, we might consider reverting the changes made in https://github.com/kedro-org/kedro-starters/pull/36/files, as they shouldn't be required anymore.
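A sketch of what such a CI patch step could look like: the file layout, pin format, and replacement URL are assumptions for illustration, not the actual CI config.

```shell
# Rewrite a starter's kedro pin so pip installs kedro from the main
# branch (a PEP 508 direct reference) instead of the pinned release.
# Demonstrated on a scratch file rather than a real starter.
tmp=$(mktemp -d)
printf 'kedro[pandas.CSVDataSet]==0.17.0\npandas~=1.3\n' > "$tmp/requirements.txt"

# Turn `kedro[extras]==X.Y.Z` into a direct reference to the main branch.
sed -i.bak -E \
  's|^kedro(\[[^]]*\])?==.*$|kedro\1 @ git+https://github.com/kedro-org/kedro.git@main|' \
  "$tmp/requirements.txt"

cat "$tmp/requirements.txt"
```

In CI this would run over every starter's requirements.txt before the install step, leaving user-facing pins untouched.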
Currently the starters can break if a user has pandas 2.0 installed. Update all starters so they can run fine with pandas 2.0 as well as older versions. This means updating the pin for kedro-datasets to ~=1.0 instead of ~=1.0.0.
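The difference between the two pins is PEP 440's "compatible release" operator: ~=1.0.0 means >=1.0.0,<1.1, while ~=1.0 means >=1.0,<2.0, so only the looser pin lets pip pick up new 1.x minor releases with pandas 2 support. A stdlib sketch of that rule for plain numeric versions (real resolvers handle much more, e.g. pre-releases):

```python
# Simplified PEP 440 "compatible release" check: `version ~= floor`
# means version >= floor AND version < floor with its last segment
# dropped and the one before it bumped (1.0.0 -> <1.1, 1.0 -> <2).
def parse(version: str) -> tuple:
    return tuple(int(part) for part in version.split("."))


def satisfies_compatible(version: str, floor: str) -> bool:
    v, f = parse(version), parse(floor)
    prefix = f[:-1]
    upper = prefix[:-1] + (prefix[-1] + 1,)
    return f <= v and v[: len(upper)] < upper


print(satisfies_compatible("1.5.1", "1.0.0"))  # False: ~=1.0.0 caps at <1.1
print(satisfies_compatible("1.5.1", "1.0"))    # True: ~=1.0 allows any 1.x
```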
For example in spaceflights:
This should not be a problem if the user follows the normal workflow, but if they install pandas 2 separately, things break:
> pip install kedro pandas scikit-learn openpyxl pyarrow # problems incoming
> kedro new --starter=spaceflights
> cd spaceflights
> kedro run # uh oh
[05/05/23 15:25:57] INFO Kedro project spaceflights session.py:360
[05/05/23 15:25:59] WARNING /Users/juan_cano/.micromamba/envs/_test310/lib/python3.10/importlib/__init__.py:126: DeprecationWarning: `kedro.extras.datasets` is deprecated and will be removed in warnings.py:109
Kedro 0.19, install `kedro-datasets` instead by running `pip install kedro-datasets`.
return _bootstrap._gcd_import(name[level:], package, level)
[05/05/23 15:26:00] INFO Loading data from 'companies' (CSVDataSet)... data_catalog.py:343
INFO Running node: preprocess_companies_node: preprocess_companies([companies]) -> [preprocessed_companies] node.py:329
INFO Saving data to 'preprocessed_companies' (ParquetDataSet)... data_catalog.py:382
INFO Completed 1 out of 6 tasks sequential_runner.py:85
INFO Loading data from 'shuttles' (ExcelDataSet)... data_catalog.py:343
[05/05/23 15:26:04] INFO Running node: preprocess_shuttles_node: preprocess_shuttles([shuttles]) -> [preprocessed_shuttles] node.py:329
ERROR Node 'preprocess_shuttles_node: preprocess_shuttles([shuttles]) -> [preprocessed_shuttles]' failed with error: node.py:354
could not convert string to float: '$1325.0'
WARNING There are 5 nodes that have not run. runner.py:205
You can resume the pipeline run from the nearest nodes with persisted inputs by adding the following argument to your previous command:
--from-nodes "preprocess_shuttles_node,create_model_input_table_node"
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /Users/juan_cano/.micromamba/envs/_test310/bin/kedro:8 in <module> │
│ │
│ /Users/juan_cano/.micromamba/envs/_test310/lib/python3.10/site-packages/kedro/framework/cli/cli. │
│ py:211 in main │
│ │
│ 208 │ """ │
│ 209 │ _init_plugins() │
│ 210 │ cli_collection = KedroCLI(project_path=Path.cwd()) │
│ ❱ 211 │ cli_collection() │
│ 212 │
│ │
│ /Users/juan_cano/.micromamba/envs/_test310/lib/python3.10/site-packages/click/core.py:1130 in │
│ __call__ │
│ │
│ /Users/juan_cano/.micromamba/envs/_test310/lib/python3.10/site-packages/kedro/framework/cli/cli. │
│ py:139 in main │
│ │
│ 136 │ │ ) │
│ 137 │ │ │
│ 138 │ │ try: │
│ ❱ 139 │ │ │ super().main( │
│ 140 │ │ │ │ args=args, │
│ 141 │ │ │ │ prog_name=prog_name, │
│ 142 │ │ │ │ complete_var=complete_var, │
│ │
│ /Users/juan_cano/.micromamba/envs/_test310/lib/python3.10/site-packages/click/core.py:1055 in │
│ main │
│ │
│ /Users/juan_cano/.micromamba/envs/_test310/lib/python3.10/site-packages/click/core.py:1657 in │
│ invoke │
│ │
│ /Users/juan_cano/.micromamba/envs/_test310/lib/python3.10/site-packages/click/core.py:1404 in │
│ invoke │
│ │
│ /Users/juan_cano/.micromamba/envs/_test310/lib/python3.10/site-packages/click/core.py:760 in │
│ invoke │
│ │
│ /Users/juan_cano/.micromamba/envs/_test310/lib/python3.10/site-packages/kedro/framework/cli/proj │
│ ect.py:472 in run │
│ │
│ 469 │ with KedroSession.create( │
│ 470 │ │ env=env, conf_source=conf_source, extra_params=params │
│ 471 │ ) as session: │
│ ❱ 472 │ │ session.run( │
│ 473 │ │ │ tags=tag, │
│ 474 │ │ │ runner=runner(is_async=is_async), │
│ 475 │ │ │ node_names=node_names, │
│ │
│ /Users/juan_cano/.micromamba/envs/_test310/lib/python3.10/site-packages/kedro/framework/session/ │
│ session.py:426 in run │
│ │
│ 423 │ │ ) │
│ 424 │ │ │
│ 425 │ │ try: │
│ ❱ 426 │ │ │ run_result = runner.run( │
│ 427 │ │ │ │ filtered_pipeline, catalog, hook_manager, session_id │
│ 428 │ │ │ ) │
│ 429 │ │ │ self._run_called = True │
│ │
│ /Users/juan_cano/.micromamba/envs/_test310/lib/python3.10/site-packages/kedro/runner/runner.py:9 │
│ 1 in run │
│ │
│ 88 │ │ │ self._logger.info( │
│ 89 │ │ │ │ "Asynchronous mode is enabled for loading and saving data" │
│ 90 │ │ │ ) │
│ ❱ 91 │ │ self._run(pipeline, catalog, hook_manager, session_id) │
│ 92 │ │ │
│ 93 │ │ self._logger.info("Pipeline execution completed successfully.") │
│ 94 │
│ │
│ /Users/juan_cano/.micromamba/envs/_test310/lib/python3.10/site-packages/kedro/runner/sequential_ │
│ runner.py:70 in _run │
│ │
│ 67 │ │ │
│ 68 │ │ for exec_index, node in enumerate(nodes): │
│ 69 │ │ │ try: │
│ ❱ 70 │ │ │ │ run_node(node, catalog, hook_manager, self._is_async, session_id) │
│ 71 │ │ │ │ done_nodes.add(node) │
│ 72 │ │ │ except Exception: │
│ 73 │ │ │ │ self._suggest_resume_scenario(pipeline, done_nodes, catalog) │
│ │
│ /Users/juan_cano/.micromamba/envs/_test310/lib/python3.10/site-packages/kedro/runner/runner.py:3 │
│ 19 in run_node │
│ │
│ 316 │ if is_async: │
│ 317 │ │ node = _run_node_async(node, catalog, hook_manager, session_id) │
│ 318 │ else: │
│ ❱ 319 │ │ node = _run_node_sequential(node, catalog, hook_manager, session_id) │
│ 320 │ │
│ 321 │ for name in node.confirms: │
│ 322 │ │ catalog.confirm(name) │
│ │
│ /Users/juan_cano/.micromamba/envs/_test310/lib/python3.10/site-packages/kedro/runner/runner.py:4 │
│ 15 in _run_node_sequential │
│ │
│ 412 │ ) │
│ 413 │ inputs.update(additional_inputs) │
│ 414 │ │
│ ❱ 415 │ outputs = _call_node_run( │
│ 416 │ │ node, catalog, inputs, is_async, hook_manager, session_id=session_id │
│ 417 │ ) │
│ 418 │
│ │
│ /Users/juan_cano/.micromamba/envs/_test310/lib/python3.10/site-packages/kedro/runner/runner.py:3 │
│ 81 in _call_node_run │
│ │
│ 378 │ │ │ is_async=is_async, │
│ 379 │ │ │ session_id=session_id, │
│ 380 │ │ ) │
│ ❱ 381 │ │ raise exc │
│ 382 │ hook_manager.hook.after_node_run( │
│ 383 │ │ node=node, │
│ 384 │ │ catalog=catalog, │
│ │
│ /Users/juan_cano/.micromamba/envs/_test310/lib/python3.10/site-packages/kedro/runner/runner.py:3 │
│ 71 in _call_node_run │
│ │
│ 368 ) -> Dict[str, Any]: │
│ 369 │ # pylint: disable=too-many-arguments │
│ 370 │ try: │
│ ❱ 371 │ │ outputs = node.run(inputs) │
│ 372 │ except Exception as exc: │
│ 373 │ │ hook_manager.hook.on_node_error( │
│ 374 │ │ │ error=exc, │
│ │
│ /Users/juan_cano/.micromamba/envs/_test310/lib/python3.10/site-packages/kedro/pipeline/node.py:3 │
│ 55 in run │
│ │
│ 352 │ │ # purposely catch all exceptions │
│ 353 │ │ except Exception as exc: │
│ 354 │ │ │ self._logger.error("Node '%s' failed with error: \n%s", str(self), str(exc)) │
│ ❱ 355 │ │ │ raise exc │
│ 356 │ │
│ 357 │ def _run_with_no_inputs(self, inputs: Dict[str, Any]): │
│ 358 │ │ if inputs: │
│ │
│ /Users/juan_cano/.micromamba/envs/_test310/lib/python3.10/site-packages/kedro/pipeline/node.py:3 │
│ 44 in run │
│ │
│ 341 │ │ │ if not self._inputs: │
│ 342 │ │ │ │ outputs = self._run_with_no_inputs(inputs) │
│ 343 │ │ │ elif isinstance(self._inputs, str): │
│ ❱ 344 │ │ │ │ outputs = self._run_with_one_input(inputs, self._inputs) │
│ 345 │ │ │ elif isinstance(self._inputs, list): │
│ 346 │ │ │ │ outputs = self._run_with_list(inputs, self._inputs) │
│ 347 │ │ │ elif isinstance(self._inputs, dict): │
│ │
│ /Users/juan_cano/.micromamba/envs/_test310/lib/python3.10/site-packages/kedro/pipeline/node.py:3 │
│ 75 in _run_with_one_input │
│ │
│ 372 │ │ │ │ f"{sorted(inputs.keys())}." │
│ 373 │ │ │ ) │
│ 374 │ │ │
│ ❱ 375 │ │ return self._func(inputs[node_input]) │
│ 376 │ │
│ 377 │ def _run_with_list(self, inputs: Dict[str, Any], node_inputs: List[str]): │
│ 378 │ │ # Node inputs and provided run inputs should completely overlap │
│ │
│ /private/tmp/spaceflights/src/spaceflights/pipelines/data_processing/nodes.py:45 in │
│ preprocess_shuttles │
│ │
│ 42 │ """ │
│ 43 │ shuttles["d_check_complete"] = _is_true(shuttles["d_check_complete"]) │
│ 44 │ shuttles["moon_clearance_complete"] = _is_true(shuttles["moon_clearance_complete"]) │
│ ❱ 45 │ shuttles["price"] = _parse_money(shuttles["price"]) │
│ 46 │ return shuttles │
│ 47 │
│ 48 │
│ │
│ /private/tmp/spaceflights/src/spaceflights/pipelines/data_processing/nodes.py:16 in _parse_money │
│ │
│ 13 │
│ 14 def _parse_money(x: pd.Series) -> pd.Series: │
│ 15 │ x = x.str.replace("$", "", regex=True).str.replace(",", "") │
│ ❱ 16 │ x = x.astype(float) │
│ 17 │ return x │
│ 18 │
│ 19 │
│ │
│ /Users/juan_cano/.micromamba/envs/_test310/lib/python3.10/site-packages/pandas/core/generic.py:6 │
│ 324 in astype │
│ │
│ 6321 │ │ │
│ 6322 │ │ else: │
│ 6323 │ │ │ # else, only a single dtype is given │
│ ❱ 6324 │ │ │ new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors) │
│ 6325 │ │ │ return self._constructor(new_data).__finalize__(self, method="astype") │
│ 6326 │ │ │
│ 6327 │ │ # GH 33113: handle empty frame or series │
│ │
│ /Users/juan_cano/.micromamba/envs/_test310/lib/python3.10/site-packages/pandas/core/internals/ma │
│ nagers.py:451 in astype │
│ │
│ 448 │ │ elif using_copy_on_write(): │
│ 449 │ │ │ copy = False │
│ 450 │ │ │
│ ❱ 451 │ │ return self.apply( │
│ 452 │ │ │ "astype", │
│ 453 │ │ │ dtype=dtype, │
│ 454 │ │ │ copy=copy, │
│ │
│ /Users/juan_cano/.micromamba/envs/_test310/lib/python3.10/site-packages/pandas/core/internals/ma │
│ nagers.py:352 in apply │
│ │
│ 349 │ │ │ if callable(f): │
│ 350 │ │ │ │ applied = b.apply(f, **kwargs) │
│ 351 │ │ │ else: │
│ ❱ 352 │ │ │ │ applied = getattr(b, f)(**kwargs) │
│ 353 │ │ │ result_blocks = extend_blocks(applied, result_blocks) │
│ 354 │ │ │
│ 355 │ │ out = type(self).from_blocks(result_blocks, self.axes) │
│ │
│ /Users/juan_cano/.micromamba/envs/_test310/lib/python3.10/site-packages/pandas/core/internals/bl │
│ ocks.py:511 in astype │
│ │
│ 508 │ │ """ │
│ 509 │ │ values = self.values │
│ 510 │ │ │
│ ❱ 511 │ │ new_values = astype_array_safe(values, dtype, copy=copy, errors=errors) │
│ 512 │ │ │
│ 513 │ │ new_values = maybe_coerce_values(new_values) │
│ 514 │
│ │
│ /Users/juan_cano/.micromamba/envs/_test310/lib/python3.10/site-packages/pandas/core/dtypes/astyp │
│ e.py:242 in astype_array_safe │
│ │
│ 239 │ │ dtype = dtype.numpy_dtype │
│ 240 │ │
│ 241 │ try: │
│ ❱ 242 │ │ new_values = astype_array(values, dtype, copy=copy) │
│ 243 │ except (ValueError, TypeError): │
│ 244 │ │ # e.g. _astype_nansafe can fail on object-dtype of strings │
│ 245 │ │ # trying to convert to float │
│ │
│ /Users/juan_cano/.micromamba/envs/_test310/lib/python3.10/site-packages/pandas/core/dtypes/astyp │
│ e.py:187 in astype_array │
│ │
│ 184 │ │ values = values.astype(dtype, copy=copy) │
│ 185 │ │
│ 186 │ else: │
│ ❱ 187 │ │ values = _astype_nansafe(values, dtype, copy=copy) │
│ 188 │ │
│ 189 │ # in pandas we don't store numpy str dtypes, so convert to object │
│ 190 │ if isinstance(dtype, np.dtype) and issubclass(values.dtype.type, str): │
│ │
│ /Users/juan_cano/.micromamba/envs/_test310/lib/python3.10/site-packages/pandas/core/dtypes/astyp │
│ e.py:138 in _astype_nansafe │
│ │
│ 135 │ │
│ 136 │ if copy or is_object_dtype(arr.dtype) or is_object_dtype(dtype): │
│ 137 │ │ # Explicit copy, or required since NumPy can't view from / to object. │
│ ❱ 138 │ │ return arr.astype(dtype, copy=True) │
│ 139 │ │
│ 140 │ return arr.astype(dtype, copy=copy) │
│ 141 │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
ValueError: could not convert string to float: '$1325.0'
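The traceback points at `_parse_money`, which calls `x.str.replace("$", "", regex=True)`. With regex semantics, `$` is a zero-width end-of-string anchor, so the replacement never removes the dollar sign and `astype(float)` then fails on `'$1325.0'`. The failure is reproducible with stdlib string handling alone; the suggested `regex=False` fix for `_parse_money` is an assumption, not a confirmed patch:

```python
import re

price = "$1325.0"

# With regex semantics, "$" matches the end of the string, so substituting
# it with "" is effectively a no-op and the dollar sign survives:
assert re.sub("$", "", price) == "$1325.0"

# Escaping the anchor (or using a plain literal replace) strips it as intended:
assert re.sub(r"\$", "", price) == "1325.0"
assert price.replace("$", "") == "1325.0"

# The equivalent fix inside _parse_money would presumably be either of:
#   x.str.replace("$", "", regex=False)
#   x.str.replace(r"\$", "", regex=True)
print(float(price.replace("$", "").replace(",", "")))
```

This also explains why the bug surfaces with newer pandas: the default and warning behaviour around `regex` in `Series.str.replace` changed over pandas releases, so the same starter code can pass or fail depending on the installed version.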
I was about to do a quick demonstration of the spaceflights pipeline, and instead of following the normal process, I installed the dependencies "by hand".
Environment details:
- Kedro version used (`pip show kedro` or `kedro -V`): 0.18.8
- Python version used (`python -V`): 3.10.10
Hi!
I want to create a `poetry` starter with some of my personal setup.
What do I need to set up for `kedro new --starter=poetry` to recognize it?
Do I need to make a PR here, or is it possible to supply the starter in another way? I couldn't find anything about it in the documentation.
Is my only option to use it with `cookiecutter ...`?
(transfer from Jira, created by @lorenabalan)
Follow-up on #24 (comment)
Getting behave to work on Windows proved fiddly, and it felt like that deserved its own dedicated time and energy. We should change `.circleci/config.yml`, as well as some of the behave setup (like `subprocess`, `bin_dir`, etc.), to make it work.
`kedro-plugins`. For the starters we'll just need to add 3.11 builds to ensure they run properly on Python 3.11.
#136 made a one-character change in the README, but there were some test failures.
Failing scenarios:
features/run.feature:26 Run a Kedro project created from pyspark-iris
0 features passed, 1 failed, 0 skipped
4 scenarios passed, 1 failed, 0 skipped
24 steps passed, 1 failed, 0 skipped, 0 undefined
Took 3m56.385s
Exited with code exit status 1
CircleCI received exit code 1
This is the error:
DatasetError: An exception occurred when parsing config for dataset
'example_classifier':
Dataset type 'kedro.io.memory_dataset.MemoryDataSet' is invalid: all data set
types must extend 'AbstractDataSet'.
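One plausible cause of this error (an assumption, not confirmed from Kedro's source) is a mixed-version environment: `issubclass` checks class identity, so a `MemoryDataSet` that extends the `AbstractDataSet` from one import path does not satisfy a validation check against a second copy of `AbstractDataSet` loaded from another path. A stdlib-only sketch of the mechanism:

```python
# Illustrative only -- not Kedro's actual code. Two stand-in base classes
# simulate AbstractDataSet being loaded from two different module paths
# (e.g. after a partial upgrade mixing kedro and kedro-datasets versions).
class AbstractDataSetV1: ...   # stand-in for the old import path
class AbstractDataSetV2: ...   # stand-in for the new import path

class MemoryDataSet(AbstractDataSetV1): ...

# The subclass relationship holds against the base it actually extends...
assert issubclass(MemoryDataSet, AbstractDataSetV1)
# ...but fails against the re-defined copy, which is the shape of the
# "all data set types must extend 'AbstractDataSet'" rejection above.
assert not issubclass(MemoryDataSet, AbstractDataSetV2)
```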
Example Notebook.ipynb in standalone-datacatalog has wrong ConfigLoader arguments.
I wanted to try the Example Notebook.ipynb in the standalone-datacatalog starter.
Steps: open `standalone-datacatalog/Example Notebook.ipynb` with Jupyter and run it.
Expected: it should print the head of the dataframe.
Actual: it fails.
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/tmp/ipykernel_33865/3764053128.py in <cell line: 8>()
6
7 # Load the data catalog configuration from catalog.yml
----> 8 conf_catalog = conf_loader.get("catalog.yml")
9
10 # Create the DataCatalog instance from the configuration
~/.local/share/virtualenvs/kedro-9vwvPpXf/lib/python3.10/site-packages/kedro/config/config.py in get(self, *patterns)
97
98 def get(self, *patterns: str) -> Dict[str, Any]:
---> 99 return _get_config_from_patterns(
100 conf_paths=self.conf_paths, patterns=list(patterns)
101 )
~/.local/share/virtualenvs/kedro-9vwvPpXf/lib/python3.10/site-packages/kedro/config/common.py in _get_config_from_patterns(conf_paths, patterns, ac_template)
93 for conf_path in conf_paths:
94 if not Path(conf_path).is_dir():
---> 95 raise ValueError(
96 f"Given configuration path either does not exist "
97 f"or is not a valid directory: {conf_path}"
ValueError: Given configuration path either does not exist or is not a valid directory: conf/base/base
I am new to kedro, so I don't really understand the ConfigLoader yet.
If I change `conf_loader = ConfigLoader("conf/base")` to `conf_loader = ConfigLoader("conf")`, it works though. Is this just on my machine?
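The error message `conf/base/base` suggests why (a hypothetical sketch of the mechanism, not Kedro's actual implementation): the loader appends the environment directory names (`base`, `local`) to whatever `conf_source` you pass, so passing `conf/base` doubles the `base` segment while passing `conf` resolves as intended:

```python
from pathlib import Path

# Hypothetical sketch of how a config loader might derive its search paths:
# each environment name is appended to the conf_source you pass in.
def conf_paths(conf_source: str, envs=("base", "local")) -> list:
    return [str(Path(conf_source) / env) for env in envs]

# Passing "conf/base" doubles the env segment, matching the error above...
assert conf_paths("conf/base")[0] == str(Path("conf/base/base"))
# ...while passing "conf" resolves to the directories that actually exist.
assert conf_paths("conf")[0] == str(Path("conf/base"))
```

So `ConfigLoader("conf")` is the correct call, and the notebook's `ConfigLoader("conf/base")` looks like a genuine bug rather than a machine-specific quirk.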
- Kedro version used (`pip show kedro` or `kedro -V`): kedro, version 0.18.0
- Python version used (`python -V`): Python 3.10.4
Remove all linting dependencies and configuration:
- `setup.cfg` completely
- `pyproject.toml`
- `requirements.txt`: flake8, black, isort

Remove test setup:
- `src/test` directory completely
- `pytest` dependencies from `requirements.txt`

Update e2e test setup to still lint the starters by calling the `flake8`, `isort` and `black` commands directly instead of `kedro lint`.
Follow up on: kedro-org/kedro#1849