
☀️🦶 A lightweight framework for collaborative, open-source feature engineering

Home Page: https://ballet.github.io

License: MIT License


ballet's Introduction


ballet

A lightweight framework for collaborative, open-source data science projects through feature engineering.

Overview

While the open-source model for software development has led to successful, large-scale collaborations in building software applications, chess engines, and scientific analyses, data science has not benefited from this development paradigm. In part, this is due to the divide between the development processes used by software engineers and those used by data scientists.

Ballet tries to address this disparity. It is a lightweight software framework that supports collaborative data science development by composing a data science pipeline from a collection of modular patches that can be written in parallel. Ballet provides the underlying functionality to support interactive development, test and merge high-quality contributions, and compose the accepted contributions into a single product.

We have deployed Ballet for feature engineering collaborations on tabular survey datasets of public interest. For example, predict-census-income is a large real-world collaborative project to engineer features from raw individual survey responses to the U.S. Census American Community Survey (ACS) and predict personal income. The resulting project is one of the largest data science collaborations on GitHub, and outperforms state-of-the-art tabular AutoML systems and independent data science experts.

The Ballet framework

Ballet includes several different pieces for enabling collaborative data science.

  • The Ballet framework core is developed in this repository and includes:
    • the feature definition abstraction, a tuple of input variables and transformer steps (ballet.feature)
    • the feature engineering pipeline abstraction, a data flow graph over feature functions (ballet.pipeline)
    • the transformer step abstraction and a library of transformer steps that can be used in feature engineering (ballet.transformer, ballet.eng)
    • a comprehensive feature validation library that includes test suites and statistical methods for validating the machine learning performance and software quality of proposed feature definitions (ballet.validation)
    • functionality for programmatically collecting submitted feature definitions from file systems (ballet.contrib)
    • a project template for individual Ballet projects that can be automatically updated with upstream template improvements (ballet/templates/project_template, ballet.update)
    • a command line tool for maintaining and developing Ballet projects (ballet.cli)
    • an interface to interact with Ballet projects following the project template (ballet.project)
    • an interactive client for users during development (ballet.client)
  • Assemblé: A development environment for Ballet collaborations on top of Jupyter Lab
  • Ballet Bot: A bot to help manage Ballet projects on GitHub

Next steps

Learn more about Ballet

Are you a data owner or project maintainer who wants to organize a collaboration?

👉 Check out the Ballet Maintainer Guide

Are you a data scientist or enthusiast who wants to join a collaboration?

👉 Check out the Ballet Contributor Guide

Do you want to learn about how Ballet enables Better Feature Engineering™️?

👉 Check out the Feature Engineering Guide

You can also read our research paper about the Ballet framework and our case study analysis, which appeared at ACM CSCW 2021:

👉 Enabling Collaborative Data Science Development with the Ballet Framework

Join a Ballet collaboration

The Ballet GitHub organization hosts several ongoing Ballet collaborations:

Citing Ballet

If you use Ballet in your work, please consider citing the following paper:

@article{smith2021enabling,
    author = {Smith, Micah J. and Cito, J{\"u}rgen and Lu, Kelvin and Veeramachaneni, Kalyan},
    title = "Enabling Collaborative Data Science Development with the {Ballet} Framework",
    year = "2021",
    month = "October",
    volume = "5",
    pages = "1--39",
    doi = "10.1145/3479575",
    journal = "Proceedings of the {ACM} on Human-Computer Interaction",
    publisher = "{ACM}",
    language = "en",
    number = "CSCW2"
}

ballet's People

Contributors

kelvin-lu, micahjsmith

ballet's Issues

Project engineer-features command fails with no features

  • ballet version: 0.5.3-dev (981521d)
  • Python version: 3.7
  • Operating System: macOS 10.14.3

Description

In a ballet project with no features ("ames"), I ran ames-engineer-features right after initialization, before any features had been added. The command fails, but it should exit gracefully instead.

What I Did

$ ames-engineer-features data/test data/test
/usr/local/miniconda3/envs/ballet-ames-demo/lib/python3.7/site-packages/ballet/util/io.py:130: FutureWarning: read_table is deprecated, use read_csv instead.
  return pd.read_table(path, **kwargs)
[2019-03-12 15:43:59,320] {ames.features: log.py:116} INFO - Building features and target...
/Users/micahsmith/workspace/ballet-ames-demo/ames/load_data.py:29: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support sep=None with delim_whitespace=False; you can avoid this warning by specifying engine='python'.                                                
  df = pd.read_csv(source, sep=None)
[2019-03-12 15:44:00,201] {ames.features: log.py:116} INFO - Building features and target...DONE
Traceback (most recent call last):
  File "/usr/local/miniconda3/envs/ballet-ames-demo/bin/ames-engineer-features", line 11, in <module>
    load_entry_point('ballet-ames-demo', 'console_scripts', 'ames-engineer-features')()
  File "/usr/local/miniconda3/envs/ballet-ames-demo/lib/python3.7/site-packages/click/core.py", line 764, in __call__                                 
    return self.main(*args, **kwargs)
  File "/usr/local/miniconda3/envs/ballet-ames-demo/lib/python3.7/site-packages/click/core.py", line 717, in main                                     
    rv = self.invoke(ctx)
  File "/usr/local/miniconda3/envs/ballet-ames-demo/lib/python3.7/site-packages/click/core.py", line 956, in invoke                                   
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/miniconda3/envs/ballet-ames-demo/lib/python3.7/site-packages/click/core.py", line 555, in invoke                                   
    return callback(*args, **kwargs)
  File "/Users/micahsmith/workspace/ballet-ames-demo/ames/features/__init__.py", line 94, in main
    X_ft = mapper_X.transform(X_df)
  File "/usr/local/miniconda3/envs/ballet-ames-demo/lib/python3.7/site-packages/sklearn_pandas/dataframe_mapper.py", line 377, in transform           
    return self._transform(X)
  File "/usr/local/miniconda3/envs/ballet-ames-demo/lib/python3.7/site-packages/sklearn_pandas/dataframe_mapper.py", line 292, in _transform          
    for columns, transformers, options in self.built_features:
TypeError: 'NoneType' object is not iterable
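A minimal sketch of the graceful exit requested above. The function name engineer_features and the representation of features as plain callables are hypothetical stand-ins for the project's generated main(); the only point is the early return when the feature list is empty.

```python
import logging

logger = logging.getLogger(__name__)


def engineer_features(features, rows):
    """Apply every feature callable to every row.

    Hypothetical stand-in for the generated main(); not Ballet's actual code.
    """
    if not features:
        # Nothing has been contributed yet: report it and return an empty
        # result instead of calling transform on a mapper with no features.
        logger.warning("No features found; skipping feature engineering.")
        return []
    return [[feature(row) for feature in features] for row in rows]
```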

Describe and enforce naming for feature modules

Accepted features are Python files added under the contrib/ directory. Let's describe and enforce a naming scheme so that the directory stays organized and name collisions are impossible.

  • each user that submits a feature has an associated directory prefixed with user_, for example features/contrib/user_bob
  • within each user directory, every file is prefixed feature_ and then includes a short name of the feature
  • if a proposed feature does not satisfy these checks, validation fails
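The three rules above could be checked along these lines. This is a sketch under assumptions: the function name, the allowed character set after each prefix, and the default contrib root are illustrative choices, and package files such as __init__.py are deliberately not handled.

```python
import re
from pathlib import PurePosixPath

# Assumed patterns: a fixed prefix followed by letters, digits, or underscores.
USER_DIR_RE = re.compile(r"^user_[A-Za-z0-9_]+$")
FEATURE_FILE_RE = re.compile(r"^feature_[A-Za-z0-9_]+\.py$")


def is_valid_feature_path(path, contrib_root="features/contrib"):
    """Return True if path looks like <contrib_root>/user_<name>/feature_<name>.py."""
    try:
        relative = PurePosixPath(path).relative_to(contrib_root)
    except ValueError:
        # The file is not under the contrib directory at all.
        return False
    if len(relative.parts) != 2:
        # Expect exactly one user directory and one file inside it.
        return False
    user_dir, filename = relative.parts
    return bool(USER_DIR_RE.match(user_dir)) and bool(FEATURE_FILE_RE.match(filename))
```

A validator built on this would fail the proposed feature whenever the function returns False for any added file.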

Create user-facing notebook client

Desired behavior in some project:

can auto-create client instance

# auto-create client instance from cwd
from ballet.client import b

# api methods
b.validate_feature_api(feature)
b.validate_feature_api(feature, X_df, y_df)
b.validate_feature_acceptance(feature)
b.validate_feature_acceptance(feature, X_df, y_df)
b.load_data()
b.build()

can manually create client instance using package, path, or cwd

import ballet_my_project
from ballet.client import Client
b = Client(ballet_my_project)

Warn contributor to update if pinned version is greater than installed version

Motivation

  • Maintainer updates from project template and pushes
  • Contributor pulls from upstream
  • Contributor has version x installed but template relates to version y, y>x.
  • Contributor will experience errors on running commands.

Instead, contributor-facing ballet commands should detect the pinned version from the project config and fail if the installed version does not match, printing a directive on how to resolve the issue (run make install).
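A sketch of such a check, assuming versions of the form major.minor.patch with an optional suffix (for example 0.5.3-dev). The exception class and function names are hypothetical, and the comparison follows the issue title: it only fails when the pinned version is greater than the installed one.

```python
import re


class VersionMismatchError(Exception):
    """Raised when the installed ballet is older than the project requires."""


def _release_tuple(version):
    # Keep only the leading major.minor.patch; suffixes like "-dev" are ignored.
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version!r}")
    return tuple(int(group) for group in match.groups())


def check_installed_version(installed, pinned):
    """Fail with a directive if the pinned version is newer than the installed one."""
    if _release_tuple(pinned) > _release_tuple(installed):
        raise VersionMismatchError(
            f"This project is pinned to ballet {pinned}, but {installed} is "
            "installed. Run `make install` to update."
        )
```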

Feature Equality

Implement a version of feature equality with the following working spec:

Features are considered equivalent if they are made with identical code

Requirements:

  • Must work for functional features (SimpleFunctionTransformer, etc.)
  • Must not require users to define their own __eq__ functions

Currently, our subroutines use a feature's src field or id as a source of equality and these work well, but are not strong enough.
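One possible reading of "identical code", offered only as a sketch and not as the approach Ballet adopted, is to compare the compiled bytecode of two functions rather than their source text. The helper names are hypothetical. The comparison ignores line numbers and function names, and it does not look at closures or default arguments; constants are compared with ==, so 1 and 1.0 are treated as equal.

```python
import types


def code_signature(code):
    """Summarize a code object by bytecode, constants, and referenced names."""
    consts = tuple(
        # Nested functions and lambdas appear as code objects among the constants.
        code_signature(const) if isinstance(const, types.CodeType) else const
        for const in code.co_consts
    )
    return (code.co_code, consts, code.co_names, code.co_varnames)


def same_code(f, g):
    """Return True if two plain Python functions compile to the same code."""
    return code_signature(f.__code__) == code_signature(g.__code__)
```

Because this works on any Python function, it needs no user-defined __eq__ and applies to function-based transformers.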

update-project-template merges spurious updates to cookiecutter.json _template value

The cookiecutter.json file stores the path to the template that was used to generate the file. First of all, this is not helpful information to keep around and we can consider removing it. Second of all, the cookiecutter.json file is probably not relevant to have on the master branch, we can see if we can isolate it to the project-template branch only. Ultimately, the current update command causes spurious changes like the following to be merged: micahjsmith/ballet-ames-demo@efa5667

Redesign validation to remove duplicated code and logic

  • remove duplicated code and logic (i.e. ballet.validation.feature_api.validators, ballet.validation.main.check_feature_api, ballet.cli.feature_api, and a hypothetical ballet.client.check_feature_api)
  • figure out a cleaner delineation of inputs, relying less on internal ChangeCollector instances

Add command to merge upstream updates to project template

Add a command that, when run in a ballet project repo, renders the most up-to-date project template from upstream ballet, and merges it into the current project.

Example usage:

$ ballet update-template
  • saves context on initial project creation into .cookiecutter.json (or .cookiecutter_replay.json)
  • add .cookiecutter_replay.json to .gitignore
  • renders project_template into temporary directory
  • merges external subtree

GFSSF acceptance validator fails with IndexError

  • ballet version: 0.5.3-dev (b413e30)
  • Python version: 3.6.3
  • Operating System: Ubuntu 14.04.5

Description

GFSSF validator fails judging example feature

What I Did

[2019-03-20 13:03:26,350] {ballet: gfssf_validator.py:60} INFO - Judging Feature using GFSSF: lambda_1=0.011564066054355004, lambda_2=0.011564066054355004
[2019-03-20 13:03:26,350] {ballet: gfssf_validator.py:65} DEBUG - Testing with omitted feature: None
[2019-03-20 13:03:26,374] {ballet: log.py:115} INFO - Ballet Validation: evaluating feature performance...FAILURE
Traceback (most recent call last):
  File "./validate.py", line 11, in <module>
    ballet.validation.main(ames)
  File "/home/travis/virtualenv/python3.6.3/lib/python3.6/site-packages/ballet/validation/__init__.py", line 154, in main
    evaluate_feature_performance(project)
  File "/home/travis/virtualenv/python3.6.3/lib/python3.6/site-packages/funcy/decorators.py", line 38, in wrapper
    return deco(call, *dargs, **dkwargs)
  File "/home/travis/virtualenv/python3.6.3/lib/python3.6/site-packages/ballet/validation/__init__.py", line 25, in validation_stage
    return call()
  File "/home/travis/virtualenv/python3.6.3/lib/python3.6/site-packages/funcy/flow.py", line 38, in wrapper
    return func(*args, **kwargs)
  File "/home/travis/virtualenv/python3.6.3/lib/python3.6/site-packages/ballet/util/log.py", line 145, in wrapper
    return func(*args, **kwargs)
  File "/home/travis/virtualenv/python3.6.3/lib/python3.6/site-packages/funcy/decorators.py", line 55, in __call__
    return self._func(*self._args, **self._kwargs)
  File "/home/travis/virtualenv/python3.6.3/lib/python3.6/site-packages/ballet/validation/__init__.py", line 122, in evaluate_feature_performance
    accepted = evaluator.judge(proposed_feature)
  File "/home/travis/virtualenv/python3.6.3/lib/python3.6/site-packages/ballet/validation/gfssf_validator.py", line 67, in judge
    cmi = estimate_conditional_information(feature_df, self.y, z)
  File "/home/travis/virtualenv/python3.6.3/lib/python3.6/site-packages/ballet/validation/entropy.py", line 270, in estimate_conditional_information
    h_xyz = estimate_entropy(xyz, epsilon)
  File "/home/travis/virtualenv/python3.6.3/lib/python3.6/site-packages/ballet/validation/entropy.py", line 186, in estimate_entropy
    selected_cont_samples = cont_features[unique_mask.ravel(), :]
IndexError: boolean index did not match indexed array along dimension 0; dimension is 2930 but corresponding boolean dimension is 5860
The command "./validate.py" exited with 1.

See https://travis-ci.org/micahjsmith/ballet-ames-demo/jobs/508915068

Validation fails on initial commit with empty parent

  • ballet version: 0.13.1
  • Python version:
  • Operating System:

Description

Running validation on initial git commit fails, as a special case

What I Did

Traceback (most recent call last):
  File "/home/travis/virtualenv/python3.8.7/bin/ballet", line 8, in <module>
    sys.exit(cli())
  File "/home/travis/virtualenv/python3.8.7/lib/python3.8/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/travis/virtualenv/python3.8.7/lib/python3.8/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/travis/virtualenv/python3.8.7/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/travis/virtualenv/python3.8.7/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/travis/virtualenv/python3.8.7/lib/python3.8/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/travis/virtualenv/python3.8.7/lib/python3.8/site-packages/stacklog/__init__.py", line 172, in wrapper
    return func(*args, **kwargs)
  File "/home/travis/virtualenv/python3.8.7/lib/python3.8/site-packages/ballet/cli.py", line 110, in validate
    ballet.validation.main.validate(
  File "/home/travis/virtualenv/python3.8.7/lib/python3.8/site-packages/ballet/validation/main.py", line 179, in validate
    _prune_existing_features(project)
  File "/home/travis/virtualenv/python3.8.7/lib/python3.8/site-packages/funcy/decorators.py", line 39, in wrapper
    return deco(call, *dargs, **dkwargs)
  File "/home/travis/virtualenv/python3.8.7/lib/python3.8/site-packages/ballet/validation/main.py", line 27, in validation_stage
    return call()
  File "/home/travis/virtualenv/python3.8.7/lib/python3.8/site-packages/funcy/flow.py", line 42, in wrapper
    return func(*args, **kwargs)
  File "/home/travis/virtualenv/python3.8.7/lib/python3.8/site-packages/stacklog/__init__.py", line 172, in wrapper
    return func(*args, **kwargs)
  File "/home/travis/virtualenv/python3.8.7/lib/python3.8/site-packages/funcy/decorators.py", line 60, in __call__
    return self._func(*self._args, **self._kwargs)
  File "/home/travis/virtualenv/python3.8.7/lib/python3.8/site-packages/ballet/validation/main.py", line 145, in _prune_existing_features
    proposed_feature = get_proposed_feature(project)
  File "/home/travis/virtualenv/python3.8.7/lib/python3.8/site-packages/ballet/validation/common.py", line 41, in get_proposed_feature
    collected_changes = change_collector.collect_changes()
  File "/home/travis/virtualenv/python3.8.7/lib/python3.8/site-packages/ballet/validation/common.py", line 158, in collect_changes
    file_diffs = self._collect_file_diffs()
  File "/home/travis/virtualenv/python3.8.7/lib/python3.8/site-packages/funcy/decorators.py", line 39, in wrapper
    return deco(call, *dargs, **dkwargs)
  File "/home/travis/virtualenv/python3.8.7/lib/python3.8/site-packages/funcy/flow.py", line 197, in post_processing
    return func(call())
  File "/home/travis/virtualenv/python3.8.7/lib/python3.8/site-packages/funcy/decorators.py", line 60, in __call__
    return self._func(*self._args, **self._kwargs)
  File "/home/travis/virtualenv/python3.8.7/lib/python3.8/site-packages/stacklog/__init__.py", line 172, in wrapper
    return func(*args, **kwargs)
  File "/home/travis/virtualenv/python3.8.7/lib/python3.8/site-packages/ballet/validation/common.py", line 170, in _collect_file_diffs
    file_diffs = self.differ.diff()
  File "/home/travis/virtualenv/python3.8.7/lib/python3.8/site-packages/ballet/util/git.py", line 32, in diff
    a, b = self._get_diff_endpoints()
  File "/home/travis/virtualenv/python3.8.7/lib/python3.8/site-packages/ballet/util/ci.py", line 139, in _get_diff_endpoints
    return get_diff_endpoints_from_commit_range(self.repo, commit_range)
  File "/home/travis/virtualenv/python3.8.7/lib/python3.8/site-packages/ballet/util/git.py", line 177, in get_diff_endpoints_from_commit_range
    raise ValueError('commit_range cannot be empty')
ValueError: commit_range cannot be empty
The command "ballet validate" exited with 1.
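One way to treat the initial commit as a special case is to diff against Git's well-known empty tree object when no commit range is available, so that every file in the commit counts as newly added. The sketch below is an assumption about how that could look, not Ballet's fix; the function name and the three-dot range format are hypothetical.

```python
# SHA-1 of Git's empty tree object (repositories using the SHA-1 object format).
EMPTY_TREE_SHA = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"


def diff_endpoints(commit_range, head="HEAD"):
    """Return the pair of refs to diff, falling back to the empty tree."""
    if not commit_range or not commit_range.strip():
        # No parent to compare against: compare the empty tree with head.
        return EMPTY_TREE_SHA, head
    a, sep, b = commit_range.strip().partition("...")
    if not sep or not a or not b:
        raise ValueError(f"Unrecognized commit range: {commit_range!r}")
    return a, b
```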

test_update_push fails with ConfigurationError

  • ballet version: 9befc56
  • Python version: 3.7.2

Description

test_update_push fails with ConfigurationError due to test order conflict :(

Suggest taking this opportunity to simplify config file detection

What I Did

python -m pytest -k test_update_push
# collects 1 test, passes
python -m pytest -k test_update
# collects 7 tests, fails
==================================================================================================== FAILURES ====================================================================================================
________________________________________________________________________________________________ test_update_push ________________________________________________________________________________________________
                                                                                   
args = (PosixPath('/private/var/folders/mp/7s96qjnn7tl6nyjk729y16jw0000gn/T/tmpq6atwq2i/foo/foo/conf.py'),), kwargs = {}
key = (PosixPath('/private/var/folders/mp/7s96qjnn7tl6nyjk729y16jw0000gn/T/tmpq6atwq2i/foo/foo/conf.py'),)                                                                                                        
                                
    @wraps(func)                                                
    def wrapper(*args, **kwargs):                 
        # NOTE: we inline this here but not in @cache,
        #       since @memoize also targets microoptimizations.
        key = key_func(*args, **kwargs) if key_func else \
              args + tuple(sorted(kwargs.items())) if kwargs else args
        try:
>           return memory[key]
E           KeyError: (PosixPath('/private/var/folders/mp/7s96qjnn7tl6nyjk729y16jw0000gn/T/tmpq6atwq2i/foo/foo/conf.py'),)

/usr/local/miniconda3/envs/ballet/lib/python3.7/site-packages/funcy/calc.py:41: KeyError

During handling of the above exception, another exception occurred:

quickstart = Quickstart(tempdir=PosixPath('/var/folders/mp/7s96qjnn7tl6nyjk729y16jw0000gn/T/tmplhfo_t0l'), project_slug='foo', repo=<git.Repo "/var/folders/mp/7s96qjnn7tl6nyjk729y16jw0000gn/T/tmplhfo_t0l/foo/.git">)
project_template_copy = PosixPath('/var/folders/mp/7s96qjnn7tl6nyjk729y16jw0000gn/T/tmplhfo_t0l/templates/project_template')

    @pytest.mark.slow
    def test_update_push(quickstart, project_template_copy):
        # TODO(mjs)
        # make this unit tests instead
        # update this test to test more behaviors of the push
        # test failure by using non-existent remote locally
        # test success by mocking bare repo locally
        tempdir = quickstart.tempdir
        project_slug = quickstart.project_slug
    
        # update the project template so the update command runs to completion
        new_content = 'foo: bar'
        template_dir = project_template_copy
        p = template_dir.joinpath(
            '{{cookiecutter.project_slug}}', DEFAULT_CONFIG_NAME)
        with p.open('a') as f:
            f.write('\n')
            f.write(new_content)
            f.write('\n')
    
        with patch('ballet.update._call_remote_push') as mock_call_push:
>           _run_ballet_update_template(tempdir, project_slug, push=True)

/Users/micahsmith/workspace/ballet/tests/end_to_end/test_update_end_to_end.py:321:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
/Users/micahsmith/workspace/ballet/tests/end_to_end/test_update_end_to_end.py:20: in _run_ballet_update_template
    ballet.update.update_project_template(**kwargs)
/Users/micahsmith/workspace/ballet/ballet/update.py:185: in update_project_template
    _push(project)
/Users/micahsmith/workspace/ballet/ballet/util/log.py:148: in wrapper
    return func(*args, **kwargs)
/Users/micahsmith/workspace/ballet/ballet/update.py:93: in _push
    remote_name = project.get('project', 'remote')
/Users/micahsmith/workspace/ballet/ballet/project.py:102: in config_get
    config_info = find_configs(package_root)
/usr/local/miniconda3/envs/ballet/lib/python3.7/site-packages/funcy/calc.py:44: in wrapper
    value = memory[key] = func(*args, **kwargs)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

package_root = PosixPath('/private/var/folders/mp/7s96qjnn7tl6nyjk729y16jw0000gn/T/tmpq6atwq2i/foo/foo/conf.py')

    @memoize
    def find_configs(package_root):
        """Find valid ballet project config files
    
        See if any of the candidates returned by get_config_paths are valid.
    
        Args:
            package_root (path-like): Directory of the ballet project root
                directory, the one usually containing the ``ballet.yml`` file.
    
        Returns:
            list[tuple]: List of (dict, str) representing config
                information and the path that information was loaded from
    
        Raises:
            ConfigurationError: No valid config files were found.
        """
        configs = []
        for candidate in get_config_paths(package_root):
            config = load_config_at_path(candidate)
            if config is not None:
                configs.append((config, candidate))
    
        if configs:
            return configs
        else:
>           raise ConfigurationError("Couldn't find any ballet.yml config files.")
E           ballet.exc.ConfigurationError: Couldn't find any ballet.yml config files.

/Users/micahsmith/workspace/ballet/ballet/project.py:83: ConfigurationError

Estimate entropy fails on continuous column with m>>k identical values

  • ballet version: 0.7.11

Description

Suppose we have a sample x. If x is continuous but has m>>k identical values, as is the case if it had many missing values that were imputed to a constant, then estimate_entropy "fails": it runs for a very long time (I gave up after 30 minutes without it terminating).

What I Did

Doesn't work

x = np.random.permutation(np.vstack((np.random.rand(100, 1), np.full((1000, 1), 0.0))))
estimate_entropy(x)

Produces a result reasonably quickly

x = np.random.permutation(np.vstack((np.random.rand(100, 1), np.full((1000, 1), 0.0))))
y = x + np.random.uniform(low=-1e-3, high=1e-3, size=x.shape)
estimate_entropy(y)

This is due to the following hack. When k is small relative to m, the number of identical values, all of the k-th neighbor distances for the duplicated points are 0. The loop then increases k one step at a time until it is very large, and a full nearest neighbor search has to be run at each step.

# if the kth neighbor is at distance 0, then we are in trouble
# but we can try the trick of increasing k if we don't use the old
# value of k sometime later
while not np.all(distances) and k < n:
    distances, _ = nn.kneighbors(n_neighbors=k)
    distances = distances[:, -1]  # distances to k-nearest neighbor
    k += 1
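Because the loop raises k by one per iteration, it needs on the order of m nearest neighbor searches before the k-th neighbor of a duplicated point is at a nonzero distance. As a sketch of an alternative, assuming exact duplicates and a neighbor search that excludes the query point itself (as kneighbors does when called without X), the smallest workable k can be computed directly from the largest duplicate count. The helper name is hypothetical and this is not the fix adopted in Ballet.

```python
from collections import Counter


def min_k_for_nonzero_distance(samples, k):
    """Return the smallest k' >= k whose k'-th neighbor distance is nonzero everywhere.

    samples is a sequence of rows, each a sequence of floats. A value that
    occurs c times has c - 1 other neighbors at distance 0, so its k-th
    neighbor is at a nonzero distance only when k >= c.
    """
    counts = Counter(tuple(row) for row in samples)
    largest_group = max(counts.values())
    return max(k, largest_group)
```

If the returned value is not smaller than the number of samples, every sample is identical and no choice of k helps, which matches the k < n guard in the loop above.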

Create quickstart command

  • prompts for details or reads from yaml file
  • creates new templated repository with required structure
  • displays post-creation steps to take

Validation fails on PR on demo project with gitdb.exc.BadName

Description

See https://travis-ci.org/micahjsmith/ballet-ames-demo/jobs/503796033

The TravisPullRequestBuildDiffer should successfully identify the comparison ref

What I Did

$ ./validate.py
[2019-03-08 21:14:06,839] {ballet: log.py:116} INFO - Ballet Validation: checking project structure...
[2019-03-08 21:14:06,849] {ballet: log.py:116} INFO - Collecting file changes...
[2019-03-08 21:14:06,856] {ballet: log.py:116} INFO - Collecting file changes...FAILURE
[2019-03-08 21:14:06,856] {ballet: log.py:116} INFO - Ballet Validation: checking project structure...FAILURE
Traceback (most recent call last):
  File "./validate.py", line 12, in <module>
    ballet.validation.main(ames)
  File "/home/travis/virtualenv/python3.6.3/lib/python3.6/site-packages/ballet/validation/__init__.py", line 151, in main
    check_project_structure(project)
  File "/home/travis/virtualenv/python3.6.3/lib/python3.6/site-packages/funcy/decorators.py", line 38, in wrapper
    return deco(call, *dargs, **dkwargs)
  File "/home/travis/virtualenv/python3.6.3/lib/python3.6/site-packages/ballet/validation/__init__.py", line 26, in log_validation_stage
    return call()
  File "/home/travis/virtualenv/python3.6.3/lib/python3.6/site-packages/funcy/flow.py", line 38, in wrapper
    return func(*args, **kwargs)
  File "/home/travis/virtualenv/python3.6.3/lib/python3.6/site-packages/ballet/util/log.py", line 146, in wrapper
    return func(*args, **kwargs)
  File "/home/travis/virtualenv/python3.6.3/lib/python3.6/site-packages/funcy/decorators.py", line 55, in __call__
    return self._func(*self._args, **self._kwargs)
  File "/home/travis/virtualenv/python3.6.3/lib/python3.6/site-packages/ballet/validation/__init__.py", line 96, in check_project_structure
    result = validator.validate()
  File "/home/travis/virtualenv/python3.6.3/lib/python3.6/site-packages/ballet/validation/project_structure.py", line 154, in validate
    collected_changes = self.change_collector.collect_changes()
  File "/home/travis/virtualenv/python3.6.3/lib/python3.6/site-packages/ballet/validation/project_structure.py", line 62, in collect_changes
    file_diffs = self._collect_file_diffs()
  File "/home/travis/virtualenv/python3.6.3/lib/python3.6/site-packages/funcy/decorators.py", line 38, in wrapper
    return deco(call, *dargs, **dkwargs)
  File "/home/travis/virtualenv/python3.6.3/lib/python3.6/site-packages/funcy/flow.py", line 165, in post_processing
    return func(call())
  File "/home/travis/virtualenv/python3.6.3/lib/python3.6/site-packages/funcy/decorators.py", line 55, in __call__
    return self._func(*self._args, **self._kwargs)
  File "/home/travis/virtualenv/python3.6.3/lib/python3.6/site-packages/ballet/util/log.py", line 146, in wrapper
    return func(*args, **kwargs)
  File "/home/travis/virtualenv/python3.6.3/lib/python3.6/site-packages/ballet/validation/project_structure.py", line 74, in _collect_file_diffs
    file_diffs = self.differ.diff()
  File "/home/travis/virtualenv/python3.6.3/lib/python3.6/site-packages/ballet/util/git.py", line 26, in diff
    diffs = get_diffs_by_diff_str(self.repo, diff_str)
  File "/home/travis/virtualenv/python3.6.3/lib/python3.6/site-packages/ballet/util/git.py", line 92, in get_diffs_by_diff_str
    b_obj = repo.rev_parse(b)
  File "/home/travis/virtualenv/python3.6.3/lib/python3.6/site-packages/git/repo/fun.py", line 334, in rev_parse
    obj = name_to_object(repo, rev)
  File "/home/travis/virtualenv/python3.6.3/lib/python3.6/site-packages/git/repo/fun.py", line 147, in name_to_object
    raise BadName(name)
gitdb.exc.BadName: Ref '.59bc5069f7045071ee153e3280319676f85c71f9' did not resolve to an object
The command "./validate.py" exited with 1.
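The leading dot in the unresolved ref suggests one plausible cause, stated here as an assumption rather than a confirmed diagnosis: a three-dot commit range such as a...b being split on two dots. The hypothetical helpers below illustrate the difference.

```python
def split_commit_range_naive(commit_range):
    """Split on '..' only; mishandles three-dot ranges."""
    a, b = commit_range.split("..", 1)
    return a, b


def split_commit_range(commit_range):
    """Split a commit range, trying the three-dot form before the two-dot form."""
    for separator in ("...", ".."):
        if separator in commit_range:
            a, b = commit_range.split(separator, 1)
            return a, b
    raise ValueError(f"Not a commit range: {commit_range!r}")
```

With the naive version, the second endpoint of a three-dot range keeps a stray leading dot, which cannot resolve to an object.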

Migrate validation to GH actions

Description

Travis CI is no longer free for open-source projects (usage is limited to a certain number of minutes per organization). Instead, use GH Actions for validation.

Implementation

Refactor

  • move differs from ballet.util.{git,ci} to ballet.validation.differs

Updating differ interface

  • add ABC to Differ
  • add can_use_differ class method
  • deprecate PullRequestBuildDiffer
  • make repo: Optional[git.Repo] a required param
  • look into best way to configure CustomDiffer

Allow configuration for CI provider to use

  • implement ballet.validation.providers.CIProvider data class
    • name attribute
    • available function
    • differ_class attribute
  • implement ballet.validation.providers.TravisCIProvider
  • add github.ci_provider as a configuration option and set it to the above

Implementing validation routines

  • load CIProvider per config
  • check for provider.available in ChangeCollector and load provider.differ_class if present
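The provider pieces listed above might fit together roughly as follows. This is a sketch, not the implemented design: the registry dict, the load_provider function, and the config layout are assumptions, and TravisPullRequestBuildDiffer is an empty placeholder for the real differ class. The one external fact relied on is that Travis CI sets TRAVIS=true in its build environment.

```python
import os
from dataclasses import dataclass
from typing import Callable, Dict


class TravisPullRequestBuildDiffer:
    """Placeholder for the real differ class; contents omitted."""


@dataclass(frozen=True)
class CIProvider:
    name: str                      # identifier used in the project config
    available: Callable[[], bool]  # is this process running under the provider?
    differ_class: type             # differ to use when the provider is available


def _on_travis() -> bool:
    return os.environ.get("TRAVIS") == "true"


# Hypothetical registry keyed by the value of github.ci_provider.
PROVIDERS: Dict[str, CIProvider] = {
    "travis": CIProvider("travis", _on_travis, TravisPullRequestBuildDiffer),
}


def load_provider(config: dict) -> CIProvider:
    """Look up the provider named by the github.ci_provider config option."""
    name = config["github"]["ci_provider"]
    try:
        return PROVIDERS[name]
    except KeyError:
        raise ValueError(f"Unknown CI provider: {name!r}") from None
```

A ChangeCollector could then call provider.available() and, if it returns True, instantiate provider.differ_class; a GitHub Actions provider would be one more entry in the registry.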

Implement GHA provider

  • implement ballet.util.ci.GithubActionsPullRequestBuildDiffer
  • implement ballet.validation.providers.GithubActionsProvider

Set GHA provider as default

  • replace .travis-ci.yml with .github/workflows/main.yml in project template
  • set github.ci_provider to the new provider

Implement changes in ballet-bot

Update to sphinx 3.x

Updating to sphinx 3.x results in errors due to both m2r and sphinx-click extensions. Resolve those errors and update to 3.x

Relevant stack trace lost in robust transformers

If a simple error in one component of the pipeline leads to immediate failure, the sequence of conversion approaches is still followed until the last one fails. Only then does the overall operation fail and the resulting stack trace get displayed to the user.

For example, in the following degenerate case, there should be a way to figure out how to clearly fault the ErrorTransformer in the log:

import logging
logging.basicConfig()

import pandas as pd
from sklearn.preprocessing import StandardScaler

from fhub_core import Feature
from fhub_core.util.logutil import LoggingContext
from fhub_transformers import SimpleFunctionTransformer

def error(*args, **kwargs):
    raise ValueError
ErrorTransformer = SimpleFunctionTransformer(error)

df = pd.DataFrame({
    'a': [1,2,3,4],
    'b': [9,8,7,6],
})
input = 'a'
transformer = [
    StandardScaler(),
    ErrorTransformer,
]
feature = Feature(input=input, transformer=transformer)
mapper = feature.as_dataframe_mapper()
with LoggingContext(logging.getLogger('fhub_core'), level=logging.CRITICAL):
    mapper.fit(df, None)

with LoggingContext(logging.getLogger('fhub_core'), level=logging.DEBUG):
    mapper.transform(df)

Instead, we get the following inscrutable log from mapper.transform:

DEBUG:fhub_core.feature:Converting using approach 'identity'
DEBUG:fhub_core.feature:Converting using approach 'identity'
DEBUG:fhub_core.feature:Application subsequently failed with exception 'ValueError'

        Traceback (most recent call last):
          File "/usr/local/anaconda3/envs/dengue_prediction/lib/python3.6/site-packages/fhub_core-0.3.3-py3.6.egg/fhub_core/feature.py", line 50, in wrapped
            return func(convert(X), **kwargs)
          File "/usr/local/anaconda3/envs/dengue_prediction/lib/python3.6/site-packages/scikit_learn-0.19.1-py3.6-macosx-10.7-x86_64.egg/sklearn/pipeline.py", line 426, in _transform
            Xt = transform.transform(Xt)
          File "/usr/local/anaconda3/envs/dengue_prediction/lib/python3.6/site-packages/scikit_learn-0.19.1-py3.6-macosx-10.7-x86_64.egg/sklearn/preprocessing/data.py", line 681, in transform
            estimator=self, dtype=FLOAT_DTYPES)
          File "/usr/local/anaconda3/envs/dengue_prediction/lib/python3.6/site-packages/scikit_learn-0.19.1-py3.6-macosx-10.7-x86_64.egg/sklearn/utils/validation.py", line 441, in check_array
            "if it contains a single sample.".format(array))
        ValueError: Expected 2D array, got 1D array instead:
        array=[1. 2. 3. 4.].
        Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
        
DEBUG:fhub_core.feature:Converting using approach 'Series'
DEBUG:fhub_core.feature:Application subsequently failed with exception 'ValueError'

        Traceback (most recent call last):
          File "/usr/local/anaconda3/envs/dengue_prediction/lib/python3.6/site-packages/fhub_core-0.3.3-py3.6.egg/fhub_core/feature.py", line 50, in wrapped
            return func(convert(X), **kwargs)
          File "/usr/local/anaconda3/envs/dengue_prediction/lib/python3.6/site-packages/scikit_learn-0.19.1-py3.6-macosx-10.7-x86_64.egg/sklearn/pipeline.py", line 426, in _transform
            Xt = transform.transform(Xt)
          File "/usr/local/anaconda3/envs/dengue_prediction/lib/python3.6/site-packages/scikit_learn-0.19.1-py3.6-macosx-10.7-x86_64.egg/sklearn/preprocessing/data.py", line 681, in transform
            estimator=self, dtype=FLOAT_DTYPES)
          File "/usr/local/anaconda3/envs/dengue_prediction/lib/python3.6/site-packages/scikit_learn-0.19.1-py3.6-macosx-10.7-x86_64.egg/sklearn/utils/validation.py", line 441, in check_array
            "if it contains a single sample.".format(array))
        ValueError: Expected 2D array, got 1D array instead:
        array=[1. 2. 3. 4.].
        Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
        
DEBUG:fhub_core.feature:Converting using approach 'DataFrame'
DEBUG:fhub_core.feature:Application subsequently failed with exception 'ValueError'

        Traceback (most recent call last):
          File "/usr/local/anaconda3/envs/dengue_prediction/lib/python3.6/site-packages/fhub_core-0.3.3-py3.6.egg/fhub_core/feature.py", line 50, in wrapped
            return func(convert(X), **kwargs)
          File "/usr/local/anaconda3/envs/dengue_prediction/lib/python3.6/site-packages/scikit_learn-0.19.1-py3.6-macosx-10.7-x86_64.egg/sklearn/pipeline.py", line 426, in _transform
            Xt = transform.transform(Xt)
          File "/usr/local/anaconda3/envs/dengue_prediction/lib/python3.6/site-packages/fhub_transformers-0.2.4-py3.6.egg/fhub_transformers/base.py", line 27, in transform
            return self.func_call(X)
          File "/usr/local/anaconda3/envs/dengue_prediction/lib/python3.6/site-packages/funcy-1.10.1-py3.6.egg/funcy/funcs.py", line 37, in <lambda>
            return lambda *a: func(*(a + args))
          File "<ipython-input-14-70b3694491b7>", line 12, in error
            raise ValueError
        ValueError
        
DEBUG:fhub_core.feature:Converting using approach 'asarray'
DEBUG:fhub_core.feature:Application subsequently failed with exception 'ValueError'

        Traceback (most recent call last):
          File "/usr/local/anaconda3/envs/dengue_prediction/lib/python3.6/site-packages/fhub_core-0.3.3-py3.6.egg/fhub_core/feature.py", line 50, in wrapped
            return func(convert(X), **kwargs)
          File "/usr/local/anaconda3/envs/dengue_prediction/lib/python3.6/site-packages/scikit_learn-0.19.1-py3.6-macosx-10.7-x86_64.egg/sklearn/pipeline.py", line 426, in _transform
            Xt = transform.transform(Xt)
          File "/usr/local/anaconda3/envs/dengue_prediction/lib/python3.6/site-packages/scikit_learn-0.19.1-py3.6-macosx-10.7-x86_64.egg/sklearn/preprocessing/data.py", line 681, in transform
            estimator=self, dtype=FLOAT_DTYPES)
          File "/usr/local/anaconda3/envs/dengue_prediction/lib/python3.6/site-packages/scikit_learn-0.19.1-py3.6-macosx-10.7-x86_64.egg/sklearn/utils/validation.py", line 441, in check_array
            "if it contains a single sample.".format(array))
        ValueError: Expected 2D array, got 1D array instead:
        array=[1. 2. 3. 4.].
        Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
        
DEBUG:fhub_core.feature:Converting using approach 'asarray2d'
/usr/local/anaconda3/envs/dengue_prediction/lib/python3.6/site-packages/scikit_learn-0.19.1-py3.6-macosx-10.7-x86_64.egg/sklearn/utils/validation.py:475: DataConversionWarning: Data with input dtype int64 was converted to float64 by StandardScaler.
  warnings.warn(msg, DataConversionWarning)
DEBUG:fhub_core.feature:Application subsequently failed with exception 'ValueError'

        Traceback (most recent call last):
          File "/usr/local/anaconda3/envs/dengue_prediction/lib/python3.6/site-packages/fhub_core-0.3.3-py3.6.egg/fhub_core/feature.py", line 50, in wrapped
            return func(convert(X), **kwargs)
          File "/usr/local/anaconda3/envs/dengue_prediction/lib/python3.6/site-packages/fhub_core-0.3.3-py3.6.egg/fhub_core/feature.py", line 24, in transform
            return _transform(X, **transform_kwargs)
          File "/usr/local/anaconda3/envs/dengue_prediction/lib/python3.6/site-packages/fhub_core-0.3.3-py3.6.egg/fhub_core/feature.py", line 50, in wrapped
            return func(convert(X), **kwargs)
          File "/usr/local/anaconda3/envs/dengue_prediction/lib/python3.6/site-packages/scikit_learn-0.19.1-py3.6-macosx-10.7-x86_64.egg/sklearn/pipeline.py", line 426, in _transform
            Xt = transform.transform(Xt)
          File "/usr/local/anaconda3/envs/dengue_prediction/lib/python3.6/site-packages/fhub_transformers-0.2.4-py3.6.egg/fhub_transformers/base.py", line 27, in transform
            return self.func_call(X)
          File "/usr/local/anaconda3/envs/dengue_prediction/lib/python3.6/site-packages/funcy-1.10.1-py3.6.egg/funcy/funcs.py", line 37, in <lambda>
            return lambda *a: func(*(a + args))
          File "<ipython-input-14-70b3694491b7>", line 12, in error
            raise ValueError
        ValueError
        
DEBUG:fhub_core.feature:Converting using approach 'Series'
DEBUG:fhub_core.feature:Converting using approach 'identity'
DEBUG:fhub_core.feature:Application subsequently failed with exception 'ValueError'

        Traceback (most recent call last):
          File "/usr/local/anaconda3/envs/dengue_prediction/lib/python3.6/site-packages/fhub_core-0.3.3-py3.6.egg/fhub_core/feature.py", line 50, in wrapped
            return func(convert(X), **kwargs)
          File "/usr/local/anaconda3/envs/dengue_prediction/lib/python3.6/site-packages/scikit_learn-0.19.1-py3.6-macosx-10.7-x86_64.egg/sklearn/pipeline.py", line 426, in _transform
            Xt = transform.transform(Xt)
          File "/usr/local/anaconda3/envs/dengue_prediction/lib/python3.6/site-packages/scikit_learn-0.19.1-py3.6-macosx-10.7-x86_64.egg/sklearn/preprocessing/data.py", line 681, in transform
            estimator=self, dtype=FLOAT_DTYPES)
          File "/usr/local/anaconda3/envs/dengue_prediction/lib/python3.6/site-packages/scikit_learn-0.19.1-py3.6-macosx-10.7-x86_64.egg/sklearn/utils/validation.py", line 441, in check_array
            "if it contains a single sample.".format(array))
        ValueError: Expected 2D array, got 1D array instead:
        array=[1. 2. 3. 4.].
        Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
        
DEBUG:fhub_core.feature:Converting using approach 'Series'
DEBUG:fhub_core.feature:Application subsequently failed with exception 'ValueError'

        Traceback (most recent call last):
          File "/usr/local/anaconda3/envs/dengue_prediction/lib/python3.6/site-packages/fhub_core-0.3.3-py3.6.egg/fhub_core/feature.py", line 50, in wrapped
            return func(convert(X), **kwargs)
          File "/usr/local/anaconda3/envs/dengue_prediction/lib/python3.6/site-packages/scikit_learn-0.19.1-py3.6-macosx-10.7-x86_64.egg/sklearn/pipeline.py", line 426, in _transform
            Xt = transform.transform(Xt)
          File "/usr/local/anaconda3/envs/dengue_prediction/lib/python3.6/site-packages/scikit_learn-0.19.1-py3.6-macosx-10.7-x86_64.egg/sklearn/preprocessing/data.py", line 681, in transform
            estimator=self, dtype=FLOAT_DTYPES)
          File "/usr/local/anaconda3/envs/dengue_prediction/lib/python3.6/site-packages/scikit_learn-0.19.1-py3.6-macosx-10.7-x86_64.egg/sklearn/utils/validation.py", line 441, in check_array
            "if it contains a single sample.".format(array))
        ValueError: Expected 2D array, got 1D array instead:
        array=[1. 2. 3. 4.].
        Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
        
DEBUG:fhub_core.feature:Converting using approach 'DataFrame'
DEBUG:fhub_core.feature:Application subsequently failed with exception 'ValueError'

        Traceback (most recent call last):
          File "/usr/local/anaconda3/envs/dengue_prediction/lib/python3.6/site-packages/fhub_core-0.3.3-py3.6.egg/fhub_core/feature.py", line 50, in wrapped
            return func(convert(X), **kwargs)
          File "/usr/local/anaconda3/envs/dengue_prediction/lib/python3.6/site-packages/scikit_learn-0.19.1-py3.6-macosx-10.7-x86_64.egg/sklearn/pipeline.py", line 426, in _transform
            Xt = transform.transform(Xt)
          File "/usr/local/anaconda3/envs/dengue_prediction/lib/python3.6/site-packages/fhub_transformers-0.2.4-py3.6.egg/fhub_transformers/base.py", line 27, in transform
            return self.func_call(X)
          File "/usr/local/anaconda3/envs/dengue_prediction/lib/python3.6/site-packages/funcy-1.10.1-py3.6.egg/funcy/funcs.py", line 37, in <lambda>
            return lambda *a: func(*(a + args))
          File "<ipython-input-14-70b3694491b7>", line 12, in error
            raise ValueError
        ValueError
        
DEBUG:fhub_core.feature:Converting using approach 'asarray'
DEBUG:fhub_core.feature:Application subsequently failed with exception 'ValueError'

        Traceback (most recent call last):
          File "/usr/local/anaconda3/envs/dengue_prediction/lib/python3.6/site-packages/fhub_core-0.3.3-py3.6.egg/fhub_core/feature.py", line 50, in wrapped
            return func(convert(X), **kwargs)
          File "/usr/local/anaconda3/envs/dengue_prediction/lib/python3.6/site-packages/scikit_learn-0.19.1-py3.6-macosx-10.7-x86_64.egg/sklearn/pipeline.py", line 426, in _transform
            Xt = transform.transform(Xt)
          File "/usr/local/anaconda3/envs/dengue_prediction/lib/python3.6/site-packages/scikit_learn-0.19.1-py3.6-macosx-10.7-x86_64.egg/sklearn/preprocessing/data.py", line 681, in transform
            estimator=self, dtype=FLOAT_DTYPES)
          File "/usr/local/anaconda3/envs/dengue_prediction/lib/python3.6/site-packages/scikit_learn-0.19.1-py3.6-macosx-10.7-x86_64.egg/sklearn/utils/validation.py", line 441, in check_array
            "if it contains a single sample.".format(array))
        ValueError: Expected 2D array, got 1D array instead:
        array=[1. 2. 3. 4.].
        Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
        
DEBUG:fhub_core.feature:Converting using approach 'asarray2d'
DEBUG:fhub_core.feature:Application subsequently failed with exception 'ValueError'

        Traceback (most recent call last):
          File "/usr/local/anaconda3/envs/dengue_prediction/lib/python3.6/site-packages/fhub_core-0.3.3-py3.6.egg/fhub_core/feature.py", line 50, in wrapped
            return func(convert(X), **kwargs)
          File "/usr/local/anaconda3/envs/dengue_prediction/lib/python3.6/site-packages/fhub_core-0.3.3-py3.6.egg/fhub_core/feature.py", line 24, in transform
            return _transform(X, **transform_kwargs)
          File "/usr/local/anaconda3/envs/dengue_prediction/lib/python3.6/site-packages/fhub_core-0.3.3-py3.6.egg/fhub_core/feature.py", line 50, in wrapped
            return func(convert(X), **kwargs)
          File "/usr/local/anaconda3/envs/dengue_prediction/lib/python3.6/site-packages/scikit_learn-0.19.1-py3.6-macosx-10.7-x86_64.egg/sklearn/pipeline.py", line 426, in _transform
            Xt = transform.transform(Xt)
          File "/usr/local/anaconda3/envs/dengue_prediction/lib/python3.6/site-packages/fhub_transformers-0.2.4-py3.6.egg/fhub_transformers/base.py", line 27, in transform
            return self.func_call(X)
          File "/usr/local/anaconda3/envs/dengue_prediction/lib/python3.6/site-packages/funcy-1.10.1-py3.6.egg/funcy/funcs.py", line 37, in <lambda>
            return lambda *a: func(*(a + args))
          File "<ipython-input-14-70b3694491b7>", line 12, in error
            raise ValueError
        ValueError
        
DEBUG:fhub_core.feature:Converting using approach 'DataFrame'
DEBUG:fhub_core.feature:Converting using approach 'identity'
DEBUG:fhub_core.feature:Application subsequently failed with exception 'ValueError'

        Traceback (most recent call last):
          File "/usr/local/anaconda3/envs/dengue_prediction/lib/python3.6/site-packages/fhub_core-0.3.3-py3.6.egg/fhub_core/feature.py", line 50, in wrapped
            return func(convert(X), **kwargs)
          File "/usr/local/anaconda3/envs/dengue_prediction/lib/python3.6/site-packages/scikit_learn-0.19.1-py3.6-macosx-10.7-x86_64.egg/sklearn/pipeline.py", line 426, in _transform
            Xt = transform.transform(Xt)
          File "/usr/local/anaconda3/envs/dengue_prediction/lib/python3.6/site-packages/fhub_transformers-0.2.4-py3.6.egg/fhub_transformers/base.py", line 27, in transform
            return self.func_call(X)
          File "/usr/local/anaconda3/envs/dengue_prediction/lib/python3.6/site-packages/funcy-1.10.1-py3.6.egg/funcy/funcs.py", line 37, in <lambda>
            return lambda *a: func(*(a + args))
          File "<ipython-input-14-70b3694491b7>", line 12, in error
            raise ValueError
        ValueError
        
DEBUG:fhub_core.feature:Converting using approach 'Series'
DEBUG:fhub_core.feature:Application subsequently failed with exception 'ValueError'

        Traceback (most recent call last):
          File "/usr/local/anaconda3/envs/dengue_prediction/lib/python3.6/site-packages/fhub_core-0.3.3-py3.6.egg/fhub_core/feature.py", line 50, in wrapped
            return func(convert(X), **kwargs)
          File "/usr/local/anaconda3/envs/dengue_prediction/lib/python3.6/site-packages/scikit_learn-0.19.1-py3.6-macosx-10.7-x86_64.egg/sklearn/pipeline.py", line 426, in _transform
            Xt = transform.transform(Xt)
          File "/usr/local/anaconda3/envs/dengue_prediction/lib/python3.6/site-packages/scikit_learn-0.19.1-py3.6-macosx-10.7-x86_64.egg/sklearn/preprocessing/data.py", line 681, in transform
            estimator=self, dtype=FLOAT_DTYPES)
          File "/usr/local/anaconda3/envs/dengue_prediction/lib/python3.6/site-packages/scikit_learn-0.19.1-py3.6-macosx-10.7-x86_64.egg/sklearn/utils/validation.py", line 433, in check_array
            array = np.array(array, dtype=dtype, order=order, copy=copy)
        ValueError: setting an array element with a sequence.
        
DEBUG:fhub_core.feature:Converting using approach 'DataFrame'
DEBUG:fhub_core.feature:Application subsequently failed with exception 'ValueError'

        Traceback (most recent call last):
          File "/usr/local/anaconda3/envs/dengue_prediction/lib/python3.6/site-packages/fhub_core-0.3.3-py3.6.egg/fhub_core/feature.py", line 50, in wrapped
            return func(convert(X), **kwargs)
          File "/usr/local/anaconda3/envs/dengue_prediction/lib/python3.6/site-packages/scikit_learn-0.19.1-py3.6-macosx-10.7-x86_64.egg/sklearn/pipeline.py", line 426, in _transform
            Xt = transform.transform(Xt)
          File "/usr/local/anaconda3/envs/dengue_prediction/lib/python3.6/site-packages/fhub_transformers-0.2.4-py3.6.egg/fhub_transformers/base.py", line 27, in transform
            return self.func_call(X)
          File "/usr/local/anaconda3/envs/dengue_prediction/lib/python3.6/site-packages/funcy-1.10.1-py3.6.egg/funcy/funcs.py", line 37, in <lambda>
            return lambda *a: func(*(a + args))
          File "<ipython-input-14-70b3694491b7>", line 12, in error
            raise ValueError
        ValueError
        
DEBUG:fhub_core.feature:Converting using approach 'asarray'
DEBUG:fhub_core.feature:Application subsequently failed with exception 'ValueError'

        Traceback (most recent call last):
          File "/usr/local/anaconda3/envs/dengue_prediction/lib/python3.6/site-packages/fhub_core-0.3.3-py3.6.egg/fhub_core/feature.py", line 50, in wrapped
            return func(convert(X), **kwargs)
          File "/usr/local/anaconda3/envs/dengue_prediction/lib/python3.6/site-packages/scikit_learn-0.19.1-py3.6-macosx-10.7-x86_64.egg/sklearn/pipeline.py", line 426, in _transform
            Xt = transform.transform(Xt)
          File "/usr/local/anaconda3/envs/dengue_prediction/lib/python3.6/site-packages/fhub_transformers-0.2.4-py3.6.egg/fhub_transformers/base.py", line 27, in transform
            return self.func_call(X)
          File "/usr/local/anaconda3/envs/dengue_prediction/lib/python3.6/site-packages/funcy-1.10.1-py3.6.egg/funcy/funcs.py", line 37, in <lambda>
            return lambda *a: func(*(a + args))
          File "<ipython-input-14-70b3694491b7>", line 12, in error
            raise ValueError
        ValueError
        
DEBUG:fhub_core.feature:Converting using approach 'asarray2d'
DEBUG:fhub_core.feature:Application subsequently failed with exception 'ValueError'

        Traceback (most recent call last):
          File "/usr/local/anaconda3/envs/dengue_prediction/lib/python3.6/site-packages/fhub_core-0.3.3-py3.6.egg/fhub_core/feature.py", line 50, in wrapped
            return func(convert(X), **kwargs)
          File "/usr/local/anaconda3/envs/dengue_prediction/lib/python3.6/site-packages/fhub_core-0.3.3-py3.6.egg/fhub_core/feature.py", line 24, in transform
            return _transform(X, **transform_kwargs)
          File "/usr/local/anaconda3/envs/dengue_prediction/lib/python3.6/site-packages/fhub_core-0.3.3-py3.6.egg/fhub_core/feature.py", line 50, in wrapped
            return func(convert(X), **kwargs)
          File "/usr/local/anaconda3/envs/dengue_prediction/lib/python3.6/site-packages/scikit_learn-0.19.1-py3.6-macosx-10.7-x86_64.egg/sklearn/pipeline.py", line 426, in _transform
            Xt = transform.transform(Xt)
          File "/usr/local/anaconda3/envs/dengue_prediction/lib/python3.6/site-packages/fhub_transformers-0.2.4-py3.6.egg/fhub_transformers/base.py", line 27, in transform
            return self.func_call(X)
          File "/usr/local/anaconda3/envs/dengue_prediction/lib/python3.6/site-packages/funcy-1.10.1-py3.6.egg/funcy/funcs.py", line 37, in <lambda>
            return lambda *a: func(*(a + args))
          File "<ipython-input-14-70b3694491b7>", line 12, in error
            raise ValueError
        ValueError
        
DEBUG:fhub_core.feature:Converting using approach 'asarray'
DEBUG:fhub_core.feature:Converting using approach 'identity'
DEBUG:fhub_core.feature:Application subsequently failed with exception 'ValueError'

        Traceback (most recent call last):
          File "/usr/local/anaconda3/envs/dengue_prediction/lib/python3.6/site-packages/fhub_core-0.3.3-py3.6.egg/fhub_core/feature.py", line 50, in wrapped
            return func(convert(X), **kwargs)
          File "/usr/local/anaconda3/envs/dengue_prediction/lib/python3.6/site-packages/scikit_learn-0.19.1-py3.6-macosx-10.7-x86_64.egg/sklearn/pipeline.py", line 426, in _transform
            Xt = transform.transform(Xt)
          File "/usr/local/anaconda3/envs/dengue_prediction/lib/python3.6/site-packages/scikit_learn-0.19.1-py3.6-macosx-10.7-x86_64.egg/sklearn/preprocessing/data.py", line 681, in transform
            estimator=self, dtype=FLOAT_DTYPES)
          File "/usr/local/anaconda3/envs/dengue_prediction/lib/python3.6/site-packages/scikit_learn-0.19.1-py3.6-macosx-10.7-x86_64.egg/sklearn/utils/validation.py", line 441, in check_array
            "if it contains a single sample.".format(array))
        ValueError: Expected 2D array, got 1D array instead:
        array=[1. 2. 3. 4.].
        Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
        
DEBUG:fhub_core.feature:Converting using approach 'Series'
DEBUG:fhub_core.feature:Application subsequently failed with exception 'ValueError'

        Traceback (most recent call last):
          File "/usr/local/anaconda3/envs/dengue_prediction/lib/python3.6/site-packages/fhub_core-0.3.3-py3.6.egg/fhub_core/feature.py", line 50, in wrapped
            return func(convert(X), **kwargs)
          File "/usr/local/anaconda3/envs/dengue_prediction/lib/python3.6/site-packages/scikit_learn-0.19.1-py3.6-macosx-10.7-x86_64.egg/sklearn/pipeline.py", line 426, in _transform
            Xt = transform.transform(Xt)
          File "/usr/local/anaconda3/envs/dengue_prediction/lib/python3.6/site-packages/scikit_learn-0.19.1-py3.6-macosx-10.7-x86_64.egg/sklearn/preprocessing/data.py", line 681, in transform
            estimator=self, dtype=FLOAT_DTYPES)
          File "/usr/local/anaconda3/envs/dengue_prediction/lib/python3.6/site-packages/scikit_learn-0.19.1-py3.6-macosx-10.7-x86_64.egg/sklearn/utils/validation.py", line 441, in check_array
            "if it contains a single sample.".format(array))
        ValueError: Expected 2D array, got 1D array instead:
        array=[1. 2. 3. 4.].
        Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
        
DEBUG:fhub_core.feature:Converting using approach 'DataFrame'
DEBUG:fhub_core.feature:Application subsequently failed with exception 'ValueError'

        Traceback (most recent call last):
          File "/usr/local/anaconda3/envs/dengue_prediction/lib/python3.6/site-packages/fhub_core-0.3.3-py3.6.egg/fhub_core/feature.py", line 50, in wrapped
            return func(convert(X), **kwargs)
          File "/usr/local/anaconda3/envs/dengue_prediction/lib/python3.6/site-packages/scikit_learn-0.19.1-py3.6-macosx-10.7-x86_64.egg/sklearn/pipeline.py", line 426, in _transform
            Xt = transform.transform(Xt)
          File "/usr/local/anaconda3/envs/dengue_prediction/lib/python3.6/site-packages/fhub_transformers-0.2.4-py3.6.egg/fhub_transformers/base.py", line 27, in transform
            return self.func_call(X)
          File "/usr/local/anaconda3/envs/dengue_prediction/lib/python3.6/site-packages/funcy-1.10.1-py3.6.egg/funcy/funcs.py", line 37, in <lambda>
            return lambda *a: func(*(a + args))
          File "<ipython-input-14-70b3694491b7>", line 12, in error
            raise ValueError
        ValueError
        
DEBUG:fhub_core.feature:Converting using approach 'asarray'
DEBUG:fhub_core.feature:Application subsequently failed with exception 'ValueError'

        Traceback (most recent call last):
          File "/usr/local/anaconda3/envs/dengue_prediction/lib/python3.6/site-packages/fhub_core-0.3.3-py3.6.egg/fhub_core/feature.py", line 50, in wrapped
            return func(convert(X), **kwargs)
          File "/usr/local/anaconda3/envs/dengue_prediction/lib/python3.6/site-packages/scikit_learn-0.19.1-py3.6-macosx-10.7-x86_64.egg/sklearn/pipeline.py", line 426, in _transform
            Xt = transform.transform(Xt)
          File "/usr/local/anaconda3/envs/dengue_prediction/lib/python3.6/site-packages/scikit_learn-0.19.1-py3.6-macosx-10.7-x86_64.egg/sklearn/preprocessing/data.py", line 681, in transform
            estimator=self, dtype=FLOAT_DTYPES)
          File "/usr/local/anaconda3/envs/dengue_prediction/lib/python3.6/site-packages/scikit_learn-0.19.1-py3.6-macosx-10.7-x86_64.egg/sklearn/utils/validation.py", line 441, in check_array
            "if it contains a single sample.".format(array))
        ValueError: Expected 2D array, got 1D array instead:
        array=[1. 2. 3. 4.].
        Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
        
DEBUG:fhub_core.feature:Converting using approach 'asarray2d'
DEBUG:fhub_core.feature:Application subsequently failed with exception 'ValueError'

        Traceback (most recent call last):
          File "/usr/local/anaconda3/envs/dengue_prediction/lib/python3.6/site-packages/fhub_core-0.3.3-py3.6.egg/fhub_core/feature.py", line 50, in wrapped
            return func(convert(X), **kwargs)
          File "/usr/local/anaconda3/envs/dengue_prediction/lib/python3.6/site-packages/fhub_core-0.3.3-py3.6.egg/fhub_core/feature.py", line 24, in transform
            return _transform(X, **transform_kwargs)
          File "/usr/local/anaconda3/envs/dengue_prediction/lib/python3.6/site-packages/fhub_core-0.3.3-py3.6.egg/fhub_core/feature.py", line 50, in wrapped
            return func(convert(X), **kwargs)
          File "/usr/local/anaconda3/envs/dengue_prediction/lib/python3.6/site-packages/scikit_learn-0.19.1-py3.6-macosx-10.7-x86_64.egg/sklearn/pipeline.py", line 426, in _transform
            Xt = transform.transform(Xt)
          File "/usr/local/anaconda3/envs/dengue_prediction/lib/python3.6/site-packages/fhub_transformers-0.2.4-py3.6.egg/fhub_transformers/base.py", line 27, in transform
            return self.func_call(X)
          File "/usr/local/anaconda3/envs/dengue_prediction/lib/python3.6/site-packages/funcy-1.10.1-py3.6.egg/funcy/funcs.py", line 37, in <lambda>
            return lambda *a: func(*(a + args))
          File "<ipython-input-14-70b3694491b7>", line 12, in error
            raise ValueError
        ValueError
        
DEBUG:fhub_core.feature:Converting using approach 'asarray2d'
DEBUG:fhub_core.feature:Converting using approach 'identity'
DEBUG:fhub_core.feature:Application subsequently failed with exception 'ValueError'

        Traceback (most recent call last):
          File "/usr/local/anaconda3/envs/dengue_prediction/lib/python3.6/site-packages/fhub_core-0.3.3-py3.6.egg/fhub_core/feature.py", line 50, in wrapped
            return func(convert(X), **kwargs)
          File "/usr/local/anaconda3/envs/dengue_prediction/lib/python3.6/site-packages/scikit_learn-0.19.1-py3.6-macosx-10.7-x86_64.egg/sklearn/pipeline.py", line 426, in _transform
            Xt = transform.transform(Xt)
          File "/usr/local/anaconda3/envs/dengue_prediction/lib/python3.6/site-packages/fhub_transformers-0.2.4-py3.6.egg/fhub_transformers/base.py", line 27, in transform
            return self.func_call(X)
          File "/usr/local/anaconda3/envs/dengue_prediction/lib/python3.6/site-packages/funcy-1.10.1-py3.6.egg/funcy/funcs.py", line 37, in <lambda>
            return lambda *a: func(*(a + args))
          File "<ipython-input-14-70b3694491b7>", line 12, in error
            raise ValueError
        ValueError
        
DEBUG:fhub_core.feature:Converting using approach 'Series'

---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
<ipython-input-14-70b3694491b7> in <module>()
     28 
     29 with LoggingContext(logging.getLogger('fhub_core'), level=logging.DEBUG):
---> 30     mapper.transform(df)

/usr/local/anaconda3/envs/dengue_prediction/lib/python3.6/site-packages/sklearn_pandas-1.6.0-py3.6.egg/sklearn_pandas/dataframe_mapper.py in transform(self, X)
    277             if transformers is not None:
    278                 with add_column_names_to_exception(columns):
--> 279                     Xt = transformers.transform(Xt)
    280             extracted.append(_handle_feature(Xt))
    281 

/usr/local/anaconda3/envs/dengue_prediction/lib/python3.6/site-packages/fhub_core-0.3.3-py3.6.egg/fhub_core/feature.py in wrapped(X, y, **kwargs)
     48                     return func(convert(X), y=convert(y), **kwargs)
     49                 else:
---> 50                     return func(convert(X), **kwargs)
     51             except catch as e:
     52                 formatted_exc = indent(traceback.format_exc(), n=8)

/usr/local/anaconda3/envs/dengue_prediction/lib/python3.6/site-packages/fhub_core-0.3.3-py3.6.egg/fhub_core/feature.py in transform(self, X, **transform_kwargs)
     22     def transform(self, X, **transform_kwargs):
     23         _transform = make_robust_to_tabular_types(super().transform)
---> 24         return _transform(X, **transform_kwargs)
     25 
     26 

/usr/local/anaconda3/envs/dengue_prediction/lib/python3.6/site-packages/fhub_core-0.3.3-py3.6.egg/fhub_core/feature.py in wrapped(X, y, **kwargs)
     48                     return func(convert(X), y=convert(y), **kwargs)
     49                 else:
---> 50                     return func(convert(X), **kwargs)
     51             except catch as e:
     52                 formatted_exc = indent(traceback.format_exc(), n=8)

/usr/local/anaconda3/envs/dengue_prediction/lib/python3.6/site-packages/pandas-0.22.0-py3.6-macosx-10.7-x86_64.egg/pandas/core/series.py in __init__(self, data, index, dtype, name, copy, fastpath)
    262             else:
    263                 data = _sanitize_array(data, index, dtype, copy,
--> 264                                        raise_cast_failure=True)
    265 
    266                 data = SingleBlockManager(data, index, fastpath=True)

/usr/local/anaconda3/envs/dengue_prediction/lib/python3.6/site-packages/pandas-0.22.0-py3.6-macosx-10.7-x86_64.egg/pandas/core/series.py in _sanitize_array(data, index, dtype, copy, raise_cast_failure)
   3273     elif subarr.ndim > 1:
   3274         if isinstance(data, np.ndarray):
-> 3275             raise Exception('Data must be 1-dimensional')
   3276         else:
   3277             subarr = _asarray_tuplesafe(data, dtype=dtype)

Exception: a: Data must be 1-dimensional

See also micahjsmith/dengue_prediction#7 and micahjsmith/dengue_prediction#8.
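To make the failure mode concrete, here is a minimal sketch of the fallback pattern this issue describes, together with one possible way to keep every approach's error instead of surfacing only the last one. It is not the fhub_core implementation: `make_robust`, `AllApproachesFailed`, and the two toy conversion approaches are hypothetical names chosen for illustration.

```python
import logging

logger = logging.getLogger(__name__)


# Hypothetical conversion approaches, tried in order.
def identity(x):
    return x


def as_list(x):
    return list(x)


APPROACHES = [("identity", identity), ("as_list", as_list)]


class AllApproachesFailed(Exception):
    """Raised when every approach fails; keeps each underlying error."""

    def __init__(self, failures):
        self.failures = failures  # list of (approach name, exception)
        summary = "; ".join(f"{name}: {exc!r}" for name, exc in failures)
        super().__init__(f"all conversion approaches failed ({summary})")


def make_robust(func, approaches=APPROACHES):
    """Wrap func so it is retried with each conversion of its input."""

    def wrapped(x):
        failures = []
        for name, convert in approaches:
            try:
                return func(convert(x))
            except Exception as exc:
                logger.debug("approach %r failed with %r", name, exc)
                failures.append((name, exc))
        # Report every failure, not just the one from the last approach.
        raise AllApproachesFailed(failures)

    return wrapped
```

With the failures collected per approach, a caller can see that the same `ValueError` from the user's transformer recurs across approaches, which is the signal that the transformer rather than the input type is at fault.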

Be Able to Specify a function as the input to nullfiller

Motivation

Frequently, you might want to fill nulls with some function of the data, such as the mode or median, rather than with a fixed value. It might be useful for nullfiller itself to accept a function that computes the replacement from the data.
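A rough sketch of what this could look like follows. It is not the existing nullfiller: the class name `CallableNullFiller`, the `replacement` parameter, and the `is_null` helper are assumptions, and the sketch works on plain Python sequences to stay dependency-free rather than on DataFrames.

```python
import math
import statistics


def is_null(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


class CallableNullFiller:
    """Hypothetical null filler whose replacement is a value or a function."""

    def __init__(self, replacement=0.0):
        self.replacement = replacement

    def fit(self, X, y=None):
        if callable(self.replacement):
            # Compute the fill value once, from the non-null training values.
            observed = [v for v in X if not is_null(v)]
            self.replacement_ = self.replacement(observed)
        else:
            self.replacement_ = self.replacement
        return self

    def transform(self, X):
        return [self.replacement_ if is_null(v) else v for v in X]


# Usage: fill missing values with the median of the training data.
filler = CallableNullFiller(replacement=statistics.median).fit([1.0, None, 3.0, 8.0])
filled = filler.transform([None, 2.0, float("nan")])
```

Computing the replacement in `fit` rather than in `transform` means the value learned from the training data is reused when transforming new data.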

Workflow for Documentation

If users want to collaborate on a project with groups that they may never meet in person, ballet should enforce some level of documentation for contributed features.

The level of documentation is to be determined, as ballet should still be a lightweight framework. It would be some combination of what the features are intended to represent, the shape of the features (specifically the number of columns), and the actual column names (for a pandas DataFrame).

Ideas for Implementation:

  • Have each feature.py be in its own folder, alongside a README that follows a set format
    • Ex. ./features/kelvin_feature/ has a feature.py and a ReadMe.md that describes the feature
  • Enforce a Python docstring in each feature.py, which Travis CI checks for
  • Enforce some level of documentation in the pull request
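The docstring idea could be checked statically, without importing the feature. The following is a minimal sketch using only the standard library; the function name has_module_docstring is an assumption for illustration, and how the check would be wired into CI is left open.

```python
import ast


def has_module_docstring(source: str) -> bool:
    """Return True if the module source starts with a non-empty docstring."""
    tree = ast.parse(source)
    docstring = ast.get_docstring(tree)  # None if there is no docstring
    return bool(docstring and docstring.strip())


documented = '"""Mean rainfall over the previous four weeks."""\nfeature = None\n'
undocumented = "feature = None\n"
```

A CI step could read each proposed feature.py, call this function on its contents, and fail the build when it returns False.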

Track desired extras in project template

Description

Current situation

  • ballet quickstart causes ballet==x.y.z to be installed in the project
  • if the project maintainer wants to install any feature engineering extras, such as [all], then they must manually edit ballet.yml and setup.py
  • if they make these changes, every future invocation of ballet update-project-template will cause a merge conflict

Possible fixes:

  • prompt for extras in cookiecutter => a bit excessive
  • something else

Implement repo file change github app

Some files used for validation should not be changed by a contributor or they could mess up the validation process:

  • /.travis.yml
  • /ballet.yml
  • /validate.py
  • anything that is imported by validate.py, such as /project_name/conf.py

This is an app that receives a hook on PRs. It looks for its own configuration on the master branch, not in the source of the PR. Based on that configuration, it ensures that specified files have not been modified, and "fails" the PR otherwise.

Possible names for app:

  • repolockr
  • repoguard

Implementation ideas:

  • configuration file on master branch called .repoguard or similar
  • configuration file uses existing include/exclude format from some well-maintained library for this task that allows recursive include etc. Alternatively, a more lightweight solution would be to require each line to specify the relative path to exactly one file to "guard".
  • on PR, hook gets sent to serverless endpoint. project is cloned, configuration is read from master branch. PR is checked out, diff is computed against master branch. if any inadmissible changes have been made, fail the PR.
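The core check can be sketched independently of the webhook plumbing. The sketch below assumes the lightweight one-pattern-per-line configuration format and assumes the list of changed files has already been computed (for example from a diff of the PR against master); parse_guard_config and find_violations are hypothetical names.

```python
from fnmatch import fnmatchcase


def parse_guard_config(text):
    """Parse one guarded path or pattern per line; skip blanks and # comments."""
    patterns = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            # Paths are treated as relative to the repository root.
            patterns.append(line.lstrip("/"))
    return patterns


def find_violations(changed_files, patterns):
    """Return the changed files that match any guarded pattern, sorted."""
    return sorted(
        path
        for path in changed_files
        if any(fnmatchcase(path.lstrip("/"), pattern) for pattern in patterns)
    )


config = "# guarded files\n/.travis.yml\n/ballet.yml\n/validate.py\n/myproject/conf.py\n"
changed = ["myproject/features/contrib/user_a/feature_x.py", "ballet.yml"]
violations = find_violations(changed, parse_guard_config(config))
```

The app would fail the PR whenever the returned list is non-empty. Note that with fnmatch-style matching a `*` also matches path separators, which may or may not be the desired semantics for a real configuration format.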

Automate additional steps in creating new project

Currently, ballet quickstart renders a new project to a local directory. But to follow the steps in the maintainer guide, there are manual steps to (1) push to github, (2) enable CI, and (3) enable bots. These could also be automated through the quickstart command:

  1. add --push/--no-push option to push this project to github. the remote is known from the rendering step. token will be obtained in order from (1) the usual $GITHUB_TOKEN envvar (2) provided at the command line with --github-token.
  2. add --enable-ci/--no-enable-ci option to enable the CI service. if the CI service is GH Actions, this is a noop. since we are moving there anyway, we may pass on implementing this.
  3. add --enable-bots/--no-enable-bots option to enable the Ballet bot and the repolockr bot.
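The option surface and the token lookup order could be sketched as follows. This uses argparse from the standard library (Python 3.9+ for BooleanOptionalAction) purely to stay self-contained, whereas the real CLI is click-based; the defaults (everything off) and the helper resolve_github_token are assumptions, and the --enable-ci flag is omitted since it may not be implemented.

```python
import argparse
import os


def build_parser():
    """Build a parser with the proposed paired on/off flags."""
    parser = argparse.ArgumentParser(prog="ballet quickstart")
    # BooleanOptionalAction generates both --push and --no-push.
    parser.add_argument("--push", action=argparse.BooleanOptionalAction, default=False)
    parser.add_argument("--enable-bots", action=argparse.BooleanOptionalAction, default=False)
    parser.add_argument("--github-token", default=None)
    return parser


def resolve_github_token(cli_token, environ=None):
    """Look up the token in the order given above: $GITHUB_TOKEN, then the CLI."""
    environ = os.environ if environ is None else environ
    return environ.get("GITHUB_TOKEN") or cli_token


args = build_parser().parse_args(["--push", "--no-enable-bots", "--github-token", "abc"])
token = resolve_github_token(args.github_token, environ={})
```

If neither source provides a token, the helper returns None, at which point the command could skip the push or report an error.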

Potential relevance/redundancy implementations in ballet

group-SOALA

group-SOALA is a potential alternative to alpha-investing, as it is potentially faster and does not rely on state for its computations. Saved state is still useful, but the algorithm can be run from only the features currently accepted to a Ballet repository.

Brief sketch of implementation (starting from the current alpha-investing implementation, where we pull out candidate_feature and accepted_features):

Acceptance/Reject check on Travis:

Acceptance in group-SOALA is already broken down into one relevance subroutine and one redundancy subroutine.

Relevance subroutine

Transform only candidate_feature and calculate the mutual information between each of its columns and the target y using scikit-learn.
If any column has non-zero mutual information with the target, we move on. Otherwise, we immediately reject (the feature is irrelevant).

Redundancy subroutine

  • Transform all of the accepted_features and calculate each column's mutual information score w.r.t. the target column. For each accepted column accepted_col and each candidate column candidate_col, do:
    • Compare their information scores. If accepted_col has a higher score, calculate the mutual information between accepted_col and candidate_col. If that mutual information is higher than the mutual information between candidate_col and the target column, mark candidate_col as redundant.
  • We finish once either all candidate_col's have been marked redundant or all non-redundant columns have been checked against all accepted_col's. If all candidate columns are redundant, we reject. Otherwise, we accept the feature.
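The two subroutines above can be sketched on already-transformed, discrete columns. To stay dependency-free, this sketch swaps in a plain empirical mutual information estimate instead of scikit-learn's estimators, and uses a small tolerance in place of an exact non-zero test; all function names are hypothetical and this is an illustration of the described steps, not ballet's implementation.

```python
from collections import Counter
from math import log


def mutual_information(xs, ys):
    """Empirical mutual information (in nats) of two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * log(c * n / (px[x] * py[y])) for (x, y), c in pxy.items())


def is_relevant(candidate_cols, y, eps=1e-12):
    """Relevance: some candidate column carries information about y."""
    return any(mutual_information(col, y) > eps for col in candidate_cols)


def non_redundant_columns(candidate_cols, accepted_cols, y):
    """Redundancy: keep candidate columns not dominated by an accepted column."""
    accepted_scores = [mutual_information(a, y) for a in accepted_cols]
    kept = []
    for cand in candidate_cols:
        cand_score = mutual_information(cand, y)
        redundant = any(
            a_score > cand_score and mutual_information(a, cand) > cand_score
            for a, a_score in zip(accepted_cols, accepted_scores)
        )
        if not redundant:
            kept.append(cand)
    return kept


def accept(candidate_cols, accepted_cols, y):
    """Accept if the feature is relevant and has a non-redundant column."""
    if not is_relevant(candidate_cols, y):
        return False
    return bool(non_redundant_columns(candidate_cols, accepted_cols, y))


y = [0, 0, 0, 0, 1, 1, 1, 1]
accepted = [[0, 0, 0, 1, 1, 1, 1, 1]]   # agrees with y except in one row
noisier = [0, 0, 1, 1, 1, 1, 1, 1]      # agrees with y except in two rows
unrelated = [0, 1, 0, 1, 0, 1, 0, 1]    # independent of y
```

Because the comparison of scores is strict, a candidate column whose score exactly ties an accepted column is not marked redundant by this sketch; whether ties should count is a design decision the description leaves open.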

Redundancy Check on Travis.

On merge, we begin a pruning subroutine on Travis CI. This entails running the redundancy subroutine on each feature, deleting the feature entirely if it is found to be redundant.

Relax assumption that validation is "on PR"

  • ballet version: 0.10.0

Description

When creating a feature on a local feature development branch, validation cannot be run, even though it would make sense to do so. This is because validation is skipped unless it is run "on a PR", which means the ref is named like pull/1. Relax the assumption to just distinguish being on the master branch from not being on the master branch. This allows easier validation from local repos.

Solution

  • rename on_master to on_default_branch
  • deprecate on_pr and use not project.on_default_branch
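The shape of this change could look like the sketch below. The Project class here is a hypothetical stand-in with the branch name passed in directly; how the real project object detects its current branch is not shown.

```python
import warnings


class Project:
    """Hypothetical stand-in for the project object."""

    def __init__(self, branch, default_branch="master"):
        self.branch = branch
        self.default_branch = default_branch

    @property
    def on_default_branch(self):
        # Replaces on_master: compares against the configured default branch.
        return self.branch == self.default_branch

    @property
    def on_pr(self):
        # Deprecated: no longer inspects the ref name (e.g. pull/1).
        warnings.warn(
            "on_pr is deprecated; use `not project.on_default_branch`",
            DeprecationWarning,
            stacklevel=2,
        )
        return not self.on_default_branch


project = Project(branch="my-feature")
should_validate = not project.on_default_branch
```

With this definition, validation would run on any local feature branch, not only on refs that look like pull requests.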

Project CLI does not detect program name correctly

  • ballet version: 0.13.1

Description

  • Install a new project following maintainer guide
  • Try to get help for CLI: python -m myproject --help
  • Displays __main__.py as the program name
$ python -m myproject --help
Usage: __main__.py [OPTIONS] COMMAND [ARGS]...

Options:
  --help  Show this message and exit.

Commands:
  engineer-features  Engineer features

Should be able to provide the name parameter: https://click.palletsprojects.com/en/7.x/api/#click.BaseCommand.name
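As far as I can tell, click takes the program name in the usage line from the prog_name argument when the command is invoked, falling back to a value derived from sys.argv[0] when none is given, so one possible fix is to pass prog_name explicitly from the project's __main__.py. The sketch below uses a hypothetical stand-in for the generated CLI group; only click.group, the command decorator, and the prog_name argument are real click API.

```python
import click

PROG_NAME = "myproject"  # assumption: the generated package name


@click.group()
def cli():
    """Hypothetical stand-in for the generated project CLI."""


@cli.command("engineer-features")
def engineer_features():
    """Engineer features"""


def main():
    # An explicit prog_name overrides the name inferred from sys.argv[0],
    # which is what yields "__main__.py" under `python -m myproject`.
    cli(prog_name=PROG_NAME)
```

In the project template, __main__.py would then call main() so that the help output starts with the package name instead of __main__.py.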

Create `validate` command group to ease development and debugging

The ./validate.py script (ballet.validation.main) within ballet projects mostly functions as an end-to-end tool for use within a CI process. It is difficult to use for development/debugging unless all the conditions the tool expects are met (e.g. being on a branch that appears to be a PR, the feature is committed, the commit that includes the feature does not introduce unacceptable project changes, etc.).

Desired functionality:

  • given any commit range, run check_project_structure
  • given any feature, identified as a module or as a path to a file, import the feature and run validate_feature_api on that feature
  • given any feature, identified as a module or as a path to a file, import the feature and run evaluate_feature_performance where that feature is the proposed feature and the complement is the set of accepted features
  • run prune_existing_features on the full feature set
  • be able to drop into a debugger on error

This functionality should be available both from the ballet CLI tool as well as through import in an interactive python session.
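The "identified as a module or as a path to a file" part could be handled by one small loader shared between the CLI and interactive use. This is a minimal sketch using only importlib; the name load_feature_module is hypothetical, and locating the feature object inside the loaded module and running the validation routines on it are left out.

```python
import importlib
import importlib.util
from pathlib import Path


def load_feature_module(target: str):
    """Import a module given a dotted module name or a path to a .py file."""
    path = Path(target)
    if path.suffix == ".py":
        if not path.is_file():
            raise FileNotFoundError(target)
        # Load directly from the file without requiring it to be on sys.path.
        spec = importlib.util.spec_from_file_location(path.stem, path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
        return module
    # Otherwise treat the target as a dotted module name.
    return importlib.import_module(target)
```

Dropping into a debugger on error could then be layered on top by wrapping the validation call in a try/except that calls pdb.post_mortem() when a debug flag is set.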

Fix template's load_data with input_dir when called from installed package

  • ballet version: 0.7.9

Description

If a ballet project is rendered from the default project template, and then installed using pip, the user is not working from the source directory. Thus the project cannot rely on from ballet.project import config

What I Did

Basically:

pip install -e ballet-predict-census-income
cd /tmp/work
python -c "from predict_census_income.api import api; api.load_data('/path/to/input/data/dir')"
Traceback (most recent call last):
  File "main.py", line 60, in <module>
    main()
  File "main.py", line 55, in main
    create_automl_training_data(b, datadir)
  File "main.py", line 31, in create_automl_training_data
    X_df, y_df = b.api.load_data(path / 'train')
  File "/work/.venv/lib/python3.8/site-packages/ballet/project.py", line 327, in load_data
    return self._load_data(*args, **kwargs)
  File "/work/.venv/lib/python3.8/site-packages/funcy/calc.py", line 121, in wrapper
    result = func(*args, **kwargs)
  File "/work/.venv/lib/python3.8/site-packages/predict_census_income/load_data.py", line 13, in load_data
    entities_config = some(where(tables, name=entities_table_name))
  File "/work/.venv/lib/python3.8/site-packages/funcy/colls.py", line 322, in where
    return filter(match, mappings)
TypeError: 'NoneType' object is not iterable
