HyperImpute - A library for NaNs and nulls.

HyperImpute simplifies choosing a data imputation algorithm for your ML pipelines. It includes several novel algorithms for missing data and is compatible with sklearn.

HyperImpute features

🚀 Fast and extensible dataset imputation algorithms, compatible with sklearn.
🔑 New iterative imputation method: HyperImpute.
🌀 Classic methods: MICE, MissForest, GAIN, MIRACLE, MIWAE, Sinkhorn, SoftImpute, etc.
🔥 Pluggable architecture.

🚀 Installation

The library can be installed from PyPI using

$ pip install hyperimpute

or from source, using

$ pip install .

💥 Sample Usage

List available imputers

from hyperimpute.plugins.imputers import Imputers

imputers = Imputers()

imputers.list()

Impute a dataset using one of the available methods

import pandas as pd
import numpy as np
from hyperimpute.plugins.imputers import Imputers

X = pd.DataFrame([[1, 1, 1, 1], [4, 5, np.nan, np.nan], [3, 3, 9, 9], [2, 2, 2, 2]])

method = "gain"

plugin = Imputers().get(method)
out = plugin.fit_transform(X.copy())

print(method, out)

Specify the baseline models for HyperImpute

import pandas as pd
import numpy as np
from hyperimpute.plugins.imputers import Imputers

X = pd.DataFrame([[1, 1, 1, 1], [4, 5, np.nan, np.nan], [3, 3, 9, 9], [2, 2, 2, 2]])

plugin = Imputers().get(
    "hyperimpute",
    optimizer="hyperband",
    classifier_seed=["logistic_regression"],
    regression_seed=["linear_regression"],
)

out = plugin.fit_transform(X.copy())
print(out)

Use an imputer with an sklearn pipeline

import pandas as pd
import numpy as np

from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor

from hyperimpute.plugins.imputers import Imputers

X = pd.DataFrame([[1, 1, 1, 1], [4, 5, np.nan, np.nan], [3, 3, 9, 9], [2, 2, 2, 2]])
y = pd.Series([1, 2, 1, 2])

imputer = Imputers().get("hyperimpute")

estimator = Pipeline(
    [
        ("imputer", imputer),
        ("forest", RandomForestRegressor(random_state=0, n_estimators=100)),
    ]
)

estimator.fit(X, y)

Write a new imputation plugin

from sklearn.impute import KNNImputer
from hyperimpute.plugins.imputers import Imputers, ImputerPlugin

imputers = Imputers()

knn_imputer = "custom_knn"

class KNN(ImputerPlugin):
    def __init__(self) -> None:
        super().__init__()
        self._model = KNNImputer(n_neighbors=2, weights="uniform")

    @staticmethod
    def name():
        return knn_imputer

    @staticmethod
    def hyperparameter_space():
        return []

    def _fit(self, *args, **kwargs):
        self._model.fit(*args, **kwargs)
        return self

    def _transform(self, *args, **kwargs):
        return self._model.transform(*args, **kwargs)

imputers.add(knn_imputer, KNN)

assert imputers.get(knn_imputer) is not None

Benchmark imputation models on a dataset

from sklearn.datasets import load_iris
from hyperimpute.plugins.imputers import Imputers
from hyperimpute.utils.benchmarks import compare_models

X, y = load_iris(as_frame=True, return_X_y=True)

imputer = Imputers().get("hyperimpute")

compare_models(
    name="example",
    evaluated_model=imputer,
    X_raw=X,
    ref_methods=["ice", "missforest"],
    scenarios=["MAR"],
    miss_pct=[0.1, 0.3],
    n_iter=2,
)
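The general idea behind benchmarking an imputer can be sketched with sklearn and NumPy alone: hide a fraction of known values, impute them, and measure the error on the hidden entries. This is only an illustration, not how `compare_models` is implemented. The helper name `masked_rmse` is hypothetical, the entries are hidden completely at random (simpler than the MAR scenario used above), and sklearn's `SimpleImputer` stands in for a HyperImpute plugin.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.impute import SimpleImputer


def masked_rmse(X_true, imputer, miss_pct, seed=0):
    """Hypothetical helper: RMSE of an imputer on artificially hidden entries."""
    rng = np.random.default_rng(seed)
    # Hide roughly miss_pct of the entries, completely at random (an assumption).
    mask = rng.random(X_true.shape) < miss_pct
    X_miss = X_true.copy()
    X_miss[mask] = np.nan
    # Impute, then compare only the entries that were hidden.
    X_imp = imputer.fit_transform(X_miss)
    return float(np.sqrt(np.mean((X_imp[mask] - X_true[mask]) ** 2)))


X_true = load_iris(return_X_y=True)[0]
for pct in (0.1, 0.3):
    print(pct, masked_rmse(X_true, SimpleImputer(strategy="mean"), pct))
```

Any object with a `fit_transform` method could be passed as `imputer`, which is why a mean imputer is enough to show the evaluation loop.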

📓 Tutorials

⚡ Imputation methods

The following table contains the default imputation plugins:

Strategy	Description	Code
HyperImpute	Iterative imputer using both regression and classification methods based on linear models, trees, XGBoost, CatBoost and neural nets	`plugin_hyperimpute.py`
Mean	Replace the missing values using the mean along each column with `SimpleImputer`	`plugin_mean.py`
Median	Replace the missing values using the median along each column with `SimpleImputer`	`plugin_median.py`
Most-frequent	Replace the missing values using the most frequent value along each column with `SimpleImputer`	`plugin_most_freq.py`
MissForest	Iterative imputation method based on Random Forests using `IterativeImputer` and `ExtraTreesRegressor`	`plugin_missforest.py`
ICE	Iterative imputation method based on regularized linear regression using `IterativeImputer` and `BayesianRidge`	`plugin_ice.py`
MICE	Multiple imputations based on ICE using `IterativeImputer` and `BayesianRidge`	`plugin_mice.py`
SoftImpute	`Low-rank matrix approximation via nuclear-norm regularization`	`plugin_softimpute.py`
EM	Iterative procedure that uses the other variables to impute a value (Expectation), then checks whether that value is the most likely one (Maximization) - `EM imputation algorithm`	`plugin_em.py`
Sinkhorn	`Missing Data Imputation using Optimal Transport`	`plugin_sinkhorn.py`
GAIN	`GAIN: Missing Data Imputation using Generative Adversarial Nets`	`plugin_gain.py`
MIRACLE	`MIRACLE: Causally-Aware Imputation via Learning Missing Data Mechanisms`	`plugin_miracle.py`
MIWAE	`MIWAE: Deep Generative Modelling and Imputation of Incomplete Data`	`plugin_miwae.py`
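Several rows of the table name sklearn building blocks (`SimpleImputer`, `IterativeImputer`, `BayesianRidge`, `ExtraTreesRegressor`). The sketch below shows those building blocks used directly in sklearn, to make the descriptions concrete. It is an assumption-laden illustration rather than the plugins' actual code: the variable names and the settings (`max_iter`, `n_estimators`, `random_state`) are chosen here for the example and may differ from what the plugins use.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401, enables IterativeImputer
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.impute import IterativeImputer, SimpleImputer
from sklearn.linear_model import BayesianRidge

X = np.array(
    [
        [1.0, 1.0, 1.0, 1.0],
        [4.0, 5.0, np.nan, np.nan],
        [3.0, 3.0, 9.0, 9.0],
        [2.0, 2.0, 2.0, 2.0],
    ]
)

# Mean-style imputation: fill each column with its observed mean.
mean_out = SimpleImputer(strategy="mean").fit_transform(X)

# ICE-style imputation: iterative regression with a regularized linear model.
ice_like = IterativeImputer(estimator=BayesianRidge(), max_iter=10, random_state=0)
ice_out = ice_like.fit_transform(X)

# MissForest-style imputation: iterative regression with tree ensembles.
forest_like = IterativeImputer(
    estimator=ExtraTreesRegressor(n_estimators=10, random_state=0),
    max_iter=5,
    random_state=0,
)
forest_out = forest_like.fit_transform(X)

print(mean_out)
print(ice_out)
print(forest_out)
```

On a dataset this small the iterative imputers may emit convergence warnings; the point is only to show how the estimator passed to `IterativeImputer` changes the strategy.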

🔨 Tests

Install the testing dependencies using

$ pip install ".[testing]"

The tests can be executed using

$ pytest -vsx

Citing

If you use this code, please cite the associated paper:

@article{Jarrett2022HyperImpute,
  doi = {10.48550/ARXIV.2206.07769},
  url = {https://arxiv.org/abs/2206.07769},
  author = {Jarrett, Daniel and Cebere, Bogdan and Liu, Tennison and Curth, Alicia and van der Schaar, Mihaela},
  keywords = {Machine Learning (stat.ML), Machine Learning (cs.LG), FOS: Computer and information sciences},
  title = {HyperImpute: Generalized Iterative Imputation with Automatic Model Selection},
  year = {2022},
  booktitle = {39th International Conference on Machine Learning},
}
