danielchang1985 / hyperimpute

This project forked from vanderschaarlab/hyperimpute

A framework for prototyping and benchmarking imputation methods

Home Page: https://www.vanderschaar-lab.com/

License: MIT License

Python 61.29% Jupyter Notebook 38.71%

hyperimpute's Introduction

HyperImpute - A library for NaNs and nulls.

License: MIT | Python 3.7+

HyperImpute simplifies choosing a data imputation algorithm for your ML pipelines. It includes several novel algorithms for missing data and is compatible with sklearn.

HyperImpute features

  • 🚀 Fast and extensible dataset imputation algorithms, compatible with sklearn.
  • 🔑 New iterative imputation method: HyperImpute.
  • 🌀 Classic methods: MICE, MissForest, GAIN, MIRACLE, MIWAE, Sinkhorn, SoftImpute, etc.
  • 🔥 Pluggable architecture.

🚀 Installation

The library can be installed from PyPI using

$ pip install hyperimpute

or from source, using

$ pip install .

💥 Sample Usage

List available imputers

from hyperimpute.plugins.imputers import Imputers

imputers = Imputers()

imputers.list()

Impute a dataset using one of the available methods

import pandas as pd
import numpy as np
from hyperimpute.plugins.imputers import Imputers

X = pd.DataFrame([[1, 1, 1, 1], [4, 5, np.nan, np.nan], [3, 3, 9, 9], [2, 2, 2, 2]])

method = "gain"

plugin = Imputers().get(method)
out = plugin.fit_transform(X.copy())

print(method, out)

Specify the baseline models for HyperImpute

import pandas as pd
import numpy as np
from hyperimpute.plugins.imputers import Imputers

X = pd.DataFrame([[1, 1, 1, 1], [4, 5, np.nan, np.nan], [3, 3, 9, 9], [2, 2, 2, 2]])

plugin = Imputers().get(
    "hyperimpute",
    optimizer="hyperband",
    classifier_seed=["logistic_regression"],
    regression_seed=["linear_regression"],
)

out = plugin.fit_transform(X.copy())
print(out)
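
The `classifier_seed` and `regression_seed` arguments above suggest that categorical and continuous columns are handled by different base learners. As a rough illustration only, the sketch below shows one way a column could be routed to a classifier or a regressor. The `column_kind` helper and its `max_categories` threshold are hypothetical and are not part of HyperImpute's API; the rule the library actually applies may differ.

```python
import math


def column_kind(values, max_categories=5):
    """Hypothetical heuristic: label a column 'categorical' or 'continuous'.

    Missing entries (None or NaN) are ignored. A column with at most
    `max_categories` distinct observed values is treated as categorical.
    """
    observed = [
        v for v in values
        if v is not None and not (isinstance(v, float) and math.isnan(v))
    ]
    return "categorical" if len(set(observed)) <= max_categories else "continuous"


# Route each column to the kind of base learner it would need.
columns = {
    "smoker": [0, 1, None, 1, 0, 0],
    "age": [23.0, 41.5, 37.2, float("nan"), 58.1, 19.9, 30.3],
}
learner_for = {
    name: "classifier" if column_kind(vals, max_categories=3) == "categorical" else "regressor"
    for name, vals in columns.items()
}
print(learner_for)
```

Under this assumption, a binary column such as `smoker` would be imputed by a model seeded from `classifier_seed`, and a real-valued column such as `age` by one seeded from `regression_seed`.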

Use an imputer with a SKLearn pipeline

import pandas as pd
import numpy as np

from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor

from hyperimpute.plugins.imputers import Imputers

X = pd.DataFrame([[1, 1, 1, 1], [4, 5, np.nan, np.nan], [3, 3, 9, 9], [2, 2, 2, 2]])
y = pd.Series([1, 2, 1, 2])

imputer = Imputers().get("hyperimpute")

estimator = Pipeline(
    [
        ("imputer", imputer),
        ("forest", RandomForestRegressor(random_state=0, n_estimators=100)),
    ]
)

estimator.fit(X, y)

Write a new imputation plugin

from sklearn.impute import KNNImputer
from hyperimpute.plugins.imputers import Imputers, ImputerPlugin

imputers = Imputers()

knn_imputer = "custom_knn"

class KNN(ImputerPlugin):
    def __init__(self) -> None:
        super().__init__()
        self._model = KNNImputer(n_neighbors=2, weights="uniform")

    @staticmethod
    def name():
        return knn_imputer

    @staticmethod
    def hyperparameter_space():
        return []

    def _fit(self, *args, **kwargs):
        self._model.fit(*args, **kwargs)
        return self

    def _transform(self, *args, **kwargs):
        return self._model.transform(*args, **kwargs)

imputers.add(knn_imputer, KNN)

assert imputers.get(knn_imputer) is not None
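
The example above registers a class under a name and later retrieves an instance by that name. A minimal sketch of that registry pattern in plain Python may help show what a pluggable architecture involves. The `Registry` and `ZeroFill` classes are hypothetical stand-ins that mirror the `add`, `get` and `list` calls used above; they are not HyperImpute's implementation.

```python
class Registry:
    """Hypothetical name-to-class registry (not HyperImpute's code)."""

    def __init__(self):
        self._plugins = {}

    def add(self, name, cls):
        if name in self._plugins:
            raise ValueError(f"plugin {name!r} is already registered")
        self._plugins[name] = cls
        return self

    def get(self, name, **kwargs):
        if name not in self._plugins:
            raise KeyError(f"unknown plugin {name!r}")
        # Instantiate on lookup so each caller gets a fresh object.
        return self._plugins[name](**kwargs)

    def list(self):
        return sorted(self._plugins)


class ZeroFill:
    """Toy imputer: replaces None with a constant."""

    def __init__(self, fill=0):
        self.fill = fill

    def fit_transform(self, rows):
        return [[self.fill if v is None else v for v in row] for row in rows]


registry = Registry()
registry.add("zero_fill", ZeroFill)
filled = registry.get("zero_fill", fill=-1).fit_transform([[1, None], [None, 4]])
print(registry.list(), filled)
```

Storing classes rather than instances lets keyword arguments passed to `get` configure each imputer separately, much as `Imputers().get("hyperimpute", optimizer="hyperband", ...)` does in the earlier example.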

Benchmark imputation models on a dataset

from sklearn.datasets import load_iris
from hyperimpute.plugins.imputers import Imputers
from hyperimpute.utils.benchmarks import compare_models

X, y = load_iris(as_frame=True, return_X_y=True)

imputer = Imputers().get("hyperimpute")

compare_models(
    name="example",
    evaluated_model=imputer,
    X_raw=X,
    ref_methods=["ice", "missforest"],
    scenarios=["MAR"],
    miss_pct=[0.1, 0.3],
    n_iter=2,
)
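
The call above evaluates imputers on data with simulated missingness. The underlying idea is to hide values that are actually known, impute them, and score the imputations against the hidden truth. The sketch below illustrates that loop with the standard library only. It assumes a simpler mechanism than the `MAR` scenario used above (each cell is hidden independently, i.e. MCAR), a column-mean imputer, and RMSE as the score. All helper names are hypothetical, and the metrics that `compare_models` reports may differ.

```python
import math
import random


def mask_mcar(rows, miss_pct, rng):
    """Hide each cell independently with probability miss_pct (MCAR)."""
    return [[None if rng.random() < miss_pct else v for v in row] for row in rows]


def impute_column_means(rows):
    """Fill each None with the mean of the observed values in its column."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] is not None]
        # Fall back to 0.0 if a column ends up fully masked.
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return [[means[j] if v is None else v for j, v in enumerate(row)] for row in rows]


def rmse_on_masked(truth, masked, imputed):
    """RMSE over the cells that were hidden; 0.0 if nothing was hidden."""
    errors = [
        (t - p) ** 2
        for t_row, m_row, p_row in zip(truth, masked, imputed)
        for t, m, p in zip(t_row, m_row, p_row)
        if m is None
    ]
    return math.sqrt(sum(errors) / len(errors)) if errors else 0.0


rng = random.Random(0)
truth = [[float(i), float(2 * i + 1)] for i in range(50)]
for miss_pct in (0.1, 0.3):
    masked = mask_mcar(truth, miss_pct, rng)
    imputed = impute_column_means(masked)
    print(miss_pct, round(rmse_on_masked(truth, masked, imputed), 3))
```

Scoring only the hidden cells keeps the observed values, which every imputer returns unchanged, from diluting the error.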

📓 Tutorials

⚡ Imputation methods

The following table contains the default imputation plugins:

| Strategy | Description | Code |
| --- | --- | --- |
| HyperImpute | Iterative imputer using both regression and classification methods based on linear models, trees, XGBoost, CatBoost and neural nets | plugin_hyperimpute.py |
| Mean | Replace the missing values using the mean along each column with SimpleImputer | plugin_mean.py |
| Median | Replace the missing values using the median along each column with SimpleImputer | plugin_median.py |
| Most-frequent | Replace the missing values using the most frequent value along each column with SimpleImputer | plugin_most_freq.py |
| MissForest | Iterative imputation method based on Random Forests using IterativeImputer and ExtraTreesRegressor | plugin_missforest.py |
| ICE | Iterative imputation method based on regularized linear regression using IterativeImputer and BayesianRidge | plugin_ice.py |
| MICE | Multiple imputations based on ICE using IterativeImputer and BayesianRidge | plugin_mice.py |
| SoftImpute | Low-rank matrix approximation via nuclear-norm regularization | plugin_softimpute.py |
| EM | Iterative procedure which uses other variables to impute a value (Expectation), then checks whether that is the value most likely (Maximization) - EM imputation algorithm | plugin_em.py |
| Sinkhorn | Missing Data Imputation using Optimal Transport | plugin_sinkhorn.py |
| GAIN | GAIN: Missing Data Imputation using Generative Adversarial Nets | plugin_gain.py |
| MIRACLE | MIRACLE: Causally-Aware Imputation via Learning Missing Data Mechanisms | plugin_miracle.py |
| MIWAE | MIWAE: Deep Generative Modelling and Imputation of Incomplete Data | plugin_miwae.py |
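
The Mean, Median and Most-frequent strategies in the table are column-wise fills. According to the table, the plugins delegate to sklearn's SimpleImputer. The sketch below re-implements the idea with the standard library only, to show what each strategy computes for a single column. The `fill_column` helper is hypothetical and is not part of HyperImpute.

```python
import statistics


def fill_column(values, strategy):
    """Replace None entries in one column using the given strategy."""
    observed = [v for v in values if v is not None]
    if strategy == "mean":
        fill = statistics.mean(observed)
    elif strategy == "median":
        fill = statistics.median(observed)
    elif strategy == "most_frequent":
        fill = statistics.mode(observed)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]


column = [1, 2, 2, None, 10]
for strategy in ("mean", "median", "most_frequent"):
    print(strategy, fill_column(column, strategy))
```

With the outlier `10` in the column, the mean fill (3.75) is pulled upward while the median and most-frequent fills both stay at 2, which is one reason to compare strategies on your own data.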

🔨 Tests

Install the testing dependencies using

pip install .[testing]

The tests can be executed using

pytest -vsx

Citing

If you use this code, please cite the associated paper:

@article{Jarrett2022HyperImpute,
  doi = {10.48550/ARXIV.2206.07769},
  url = {https://arxiv.org/abs/2206.07769},
  author = {Jarrett, Daniel and Cebere, Bogdan and Liu, Tennison and Curth, Alicia and van der Schaar, Mihaela},
  keywords = {Machine Learning (stat.ML), Machine Learning (cs.LG), FOS: Computer and information sciences},
  title = {HyperImpute: Generalized Iterative Imputation with Automatic Model Selection},
  year = {2022},
  booktitle={39th International Conference on Machine Learning},
}

hyperimpute's People

Contributors

bcebere
