awslabs / datawig

Imputation of missing values in tables.

License: Apache License 2.0

imputation missing-value-handling

datawig's Introduction

DataWig - Imputation for Tables

DataWig trains machine learning models to impute missing values in tables.

See our user-guide and extended documentation here.

Installation

CPU

pip3 install datawig

GPU

If you want to run DataWig on a GPU you need to make sure your version of Apache MXNet Incubating contains the GPU bindings. Depending on your version of CUDA, you can do this by running the following:

wget https://raw.githubusercontent.com/awslabs/datawig/master/requirements/requirements.gpu-cu${CUDA_VERSION}.txt
pip install datawig --no-deps -r requirements.gpu-cu${CUDA_VERSION}.txt
rm requirements.gpu-cu${CUDA_VERSION}.txt

where ${CUDA_VERSION} can be 75 (7.5), 80 (8.0), 90 (9.0), or 91 (9.1).
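
To make the mapping from CUDA version to file name explicit, here is a small sketch that builds the URL used in the commands above. The helper name `gpu_requirements_url` is hypothetical and not part of DataWig; the supported versions are the four listed above.

```python
# Hypothetical helper (not part of DataWig): builds the URL of the GPU
# requirements file for a CUDA version string such as "9.0".
SUPPORTED_CUDA = {"7.5": "75", "8.0": "80", "9.0": "90", "9.1": "91"}

BASE_URL = "https://raw.githubusercontent.com/awslabs/datawig/master/requirements"


def gpu_requirements_url(cuda_version: str) -> str:
    """Return the requirements file URL for a supported CUDA version."""
    if cuda_version not in SUPPORTED_CUDA:
        raise ValueError(
            "Unsupported CUDA version {!r}; expected one of {}".format(
                cuda_version, sorted(SUPPORTED_CUDA)
            )
        )
    return "{}/requirements.gpu-cu{}.txt".format(BASE_URL, SUPPORTED_CUDA[cuda_version])


print(gpu_requirements_url("9.0"))
```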

Running DataWig

The DataWig API expects your data as a pandas DataFrame. Here is an example of how the dataframe might look:

Product Type	Description	Size	Color
Shoe	Ideal for Running	12UK	Black
SDCards	Best SDCard ever ...	8GB	Blue
Dress	This yellow dress	M	?

Quickstart Example

For most use cases, the SimpleImputer class is the best starting point. For convenience there is the function SimpleImputer.complete that takes a DataFrame and fits an imputation model for each column with missing values, with all other columns as inputs:

import datawig, numpy

# generate some data with simple nonlinear dependency
df = datawig.utils.generate_df_numeric() 
# mask 10% of the values
df_with_missing = df.mask(numpy.random.rand(*df.shape) > .9)

# impute missing values
df_with_missing_imputed = datawig.SimpleImputer.complete(df_with_missing)

You can also impute values in specific columns only (called output_column below) using values in other columns (called input_columns below). DataWig currently supports imputation of categorical columns and numeric columns.

Imputation of categorical columns

import datawig

df = datawig.utils.generate_df_string( num_samples=200, 
                                       data_column_name='sentences', 
                                       label_column_name='label')

df_train, df_test = datawig.utils.random_split(df)

#Initialize a SimpleImputer model
imputer = datawig.SimpleImputer(
    input_columns=['sentences'], # column(s) containing information about the column we want to impute
    output_column='label', # the column we'd like to impute values for
    output_path = 'imputer_model' # stores model data and metrics
    )

#Fit an imputer model on the train data
imputer.fit(train_df=df_train)

#Impute missing values and return original dataframe with predictions
imputed = imputer.predict(df_test)

Imputation of numerical columns

import datawig

df = datawig.utils.generate_df_numeric( num_samples=200, 
                                        data_column_name='x', 
                                        label_column_name='y')         
df_train, df_test = datawig.utils.random_split(df)

#Initialize a SimpleImputer model
imputer = datawig.SimpleImputer(
    input_columns=['x'], # column(s) containing information about the column we want to impute
    output_column='y', # the column we'd like to impute values for
    output_path = 'imputer_model' # stores model data and metrics
    )

#Fit an imputer model on the train data
imputer.fit(train_df=df_train, num_epochs=50)

#Impute missing values and return original dataframe with predictions
imputed = imputer.predict(df_test)

For more control over model types and preprocessing, the Imputer class lets you specify all relevant model features and parameters directly.

For details on usage, refer to the provided examples.

Acknowledgments

Thanks to David Greenberg for the package name.

Building documentation

git clone git@github.com:awslabs/datawig.git
cd datawig/docs
make html
open _build/html/index.html

Executing Tests

Clone the repository from git and set up virtualenv in the root dir of the package:

python3 -m venv venv

Install the package from local sources:

./venv/bin/pip install -e .

Run tests:

./venv/bin/pip install -r requirements/requirements.dev.txt
./venv/bin/python -m pytest

Updating PyPi distribution

Before updating, increment the version in setup.py.

git clone git@github.com:awslabs/datawig.git
cd datawig
# build local distribution for current version
python setup.py sdist
# upload to PyPi
twine upload --skip-existing dist/*

datawig's Issues

Problems during installation and execution of example code

Hi,

I installed datawig today using pip3 in an extra virtualenv and ran into some problems that I'd like to point out here.

During the installation, I encountered the following warning:
mxnet 1.3.0b20180820 has requirement numpy<1.15.0,>=1.8.2, but you'll have numpy 1.15.0 which is incompatible.

After importing numpy in jupyter I got this warning:

/home/.../datawig/lib/python3.5/importlib/_bootstrap.py:222: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
  return f(*args, **kwds)

Next I tried the simple imputer example, but the fit_hpo method took ages to run on the small training dataset of 5k records. I stopped the HPO after 20 minutes or so.

When I tried to evaluate the predictions of datawig using the proposed code
f1 = f1_score(predictions['finish'], predictions['finish_imputed'])

I received the following error message:

ValueError: Target is multiclass but average='binary'. Please choose another average setting.
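
This error comes from scikit-learn rather than DataWig: `f1_score` defaults to `average='binary'`, which only works for two classes. For a multiclass column such as `finish`, passing `average='macro'`, `'micro'` or `'weighted'` avoids it. The sketch below computes a macro-averaged F1 in plain Python to show what that setting means; the function name `macro_f1` and the sample labels are made up for illustration.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # A class that is never predicted and never present scores 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Made-up labels standing in for predictions['finish'] and
# predictions['finish_imputed'].
y_true = ["matte", "gloss", "satin", "gloss"]
y_pred = ["matte", "gloss", "gloss", "gloss"]
print(macro_f1(y_true, y_pred))
```

With scikit-learn itself, the equivalent call would be `f1_score(y_true, y_pred, average='macro')`.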

Move logic behind image based imputation into separate (experimental) branch

It appears that the logic for image-based imputation is not as well tested and polished as that for text-based imputation. As things stand, customers may wrongly assume that both are of the same quality (e.g. that using text + image is better than just text).

I'd like to move the code from master to a feature branch, where development on image-based imputation will continue until it is of comparable quality to text-based imputation.

Any thoughts?

Datawig: NotADirectoryError: [WinError 267] The directory name is invalid: '.\\Level 0: Cat Group'

Hi, thanks for the amazing library.

I am having an issue while running the complete() method.

import datawig
import pandas as pd

df = pd.read_csv('SpendCubeCleanNAN.csv', low_memory=False)

df_imp= datawig.SimpleImputer.complete(df)

df_imp.to_csv('SpendCubeCleanImpNN.csv')

Level 0: Cat Group is a column name in my dataset.

This is the output with the error traceback:


C:\Users\Shadow\miniconda3\lib\site-packages\sklearn\utils\extmath.py:765: RuntimeWarning: invalid value encountered in true_divide
  updated_mean = (last_sum + new_sum) / updated_sample_count
C:\Users\Shadow\miniconda3\lib\site-packages\sklearn\utils\extmath.py:706: RuntimeWarning: Degrees of freedom <= 0 for slice.
  result = op(x, *args, **kwargs)
[10:30:26] c:\jenkins\workspace\mxnet-tag\mxnet\src\operator\../common/utils.h:450: 
Storage type fallback detected:
operator = Concat
input storage types = [csr, default, ]
output storage types = [default, ]
params = {"num_args" : 2, "dim" : 1, }
context.dev_mask = cpu
The operator with default storage type will be dispatched for execution. You're seeing this warning message because the operator above is unable to process the given ndarrays with specified storage types, context and parameter. Temporary dense ndarrays are generated in order to execute the operator. This does not affect the correctness of the programme. You can set environment variable MXNET_STORAGE_FALLBACK_LOG_VERBOSE to 0 to suppress this warning.
Traceback (most recent call last):
  File "C:/Users/Shadow/PycharmProjects/Datawig/SimpleImpSpendCube.py", line 8, in <module>
    df_imp= datawig.SimpleImputer.complete(df)
  File "C:\Users\Shadow\miniconda3\lib\site-packages\datawig\simple_imputer.py", line 527, in complete
    calibrate=False)
  File "C:\Users\Shadow\miniconda3\lib\site-packages\datawig\simple_imputer.py", line 382, in fit
    output_path=self.output_path)
  File "C:\Users\Shadow\miniconda3\lib\site-packages\datawig\imputer.py", line 150, in __init__
    os.makedirs(self.output_path)
  File "C:\Users\Shadow\miniconda3\lib\os.py", line 221, in makedirs
    mkdir(name, mode)
NotADirectoryError: [WinError 267] The directory name is invalid: '.\\Level 0: Cat Group'

Can you please help me with the issue?
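
Judging from the traceback, a plausible cause is that the column name is used as a directory name under the output path, and `:` is not allowed in Windows file names. A possible workaround is to rename columns to file-system-safe names before calling `complete` and to map them back afterwards. The helper `safe_name` below is a hypothetical sketch, not part of DataWig, and it does not handle two columns collapsing to the same safe name.

```python
import re

# Characters that Windows does not allow in file or directory names.
_INVALID = re.compile(r'[<>:"/\\|?*]')


def safe_name(column: str) -> str:
    """Replace characters that are invalid in Windows directory names."""
    cleaned = _INVALID.sub("_", column).strip(" .")
    return cleaned or "_"


# "Supplier" and "Amount (EUR)" are made-up column names for illustration.
columns = ["Level 0: Cat Group", "Supplier", "Amount (EUR)"]
rename_map = {col: safe_name(col) for col in columns}
restore_map = {safe: col for col, safe in rename_map.items()}
print(rename_map)
```

With pandas this would be used as `df.rename(columns=rename_map)` before imputation and `df_imp.rename(columns=restore_map)` afterwards.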

Improved logging levels

It would be nice to have a logging level that is less verbose than info but still provides some basic information. Info prints a line for every training batch. A useful logger could emit a statement every epoch, or every n epochs, and include:

log-likelihood (train/test)
number of epochs passed
time passed
time that the encoder took (which can be the bottleneck).
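
Until such a level exists, one way to approximate it is a `logging.Filter` that drops per-batch records and keeps per-epoch ones. The sketch below assumes batch messages contain the text `Batch [`, as in MXNet-style lines like `Epoch[49] Batch [0-34] ...`. The logger name `demo` is a placeholder; which logger DataWig actually writes to would need checking.

```python
import logging


class EpochOnlyFilter(logging.Filter):
    """Drop per-batch records, keep everything else."""

    def filter(self, record: logging.LogRecord) -> bool:
        return "Batch [" not in record.getMessage()


class ListHandler(logging.Handler):
    """Collects messages in memory so the effect of the filter is visible."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


logger = logging.getLogger("demo")  # placeholder logger name
logger.setLevel(logging.INFO)
logger.propagate = False
handler = ListHandler()
handler.addFilter(EpochOnlyFilter())
logger.addHandler(handler)

logger.info("Epoch[49] Batch [0-34] Speed: 1651.71 samples/sec")
logger.info("Epoch[49] Train-cross-entropy=0.667427")
print(handler.messages)
```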

No module named 'datawig.utils'; 'datawig' is not a package

I tried to run the SimpleImputer in PyCharm, but it reported an error; surprisingly, the same code worked fine on the command line.

Traceback (most recent call last):
  File "F:/pyfile/missing_data/datawig.py", line 9, in <module>
    import datawig.utils
  File "F:\pyfile\missing_data\datawig.py", line 9, in <module>
    import datawig.utils
ModuleNotFoundError: No module named 'datawig.utils'; 'datawig' is not a package
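
The traceback shows that the script itself is called `datawig.py`, so `import datawig` most likely resolves to the script instead of the installed package. Renaming the script usually resolves this kind of error. The check below is a small sketch of that diagnosis; `shadows_package` is a hypothetical helper.

```python
from pathlib import PureWindowsPath


def shadows_package(script_path: str, package: str) -> bool:
    """True if the script's file name (without .py) equals the package name."""
    # PureWindowsPath accepts both "/" and "\" as separators.
    return PureWindowsPath(script_path).stem.lower() == package.lower()


print(shadows_package("F:/pyfile/missing_data/datawig.py", "datawig"))
# "impute_demo.py" is a made-up name for a renamed script.
print(shadows_package("F:/pyfile/missing_data/impute_demo.py", "datawig"))
```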

Order of tests should not matter

Certain tests share state by using the same pseudo-random number generator. Changing the order of these tests, adding new tests, or removing existing tests breaks other tests.
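
One common way to decouple such tests is to give each test its own seeded generator instead of drawing from a shared module-level one. The sketch below uses the standard library `random` module for brevity; the same idea applies to NumPy or MXNet seeds. The function names are made up and stand in for test bodies.

```python
import random


def make_rng(seed: int) -> random.Random:
    """Create an independent generator; no global state is touched."""
    return random.Random(seed)


def draw_a():
    rng = make_rng(0)
    return [rng.randint(0, 9) for _ in range(3)]


def draw_b():
    rng = make_rng(1)
    return [rng.randint(0, 9) for _ in range(3)]


# Run the two stand-in tests in both orders; the results do not depend on order.
first = (draw_a(), draw_b())
second_b, second_a = draw_b(), draw_a()
print(first == (second_a, second_b))
```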

Question: getting imputation weights

Thanks for your nice package.

I have one question.

I am imputing a large matrix (90,000 by 7,000).

The matrix contains many NA values (over 80%).

It includes both numerical values and binary (zero or one) categorical values.

Below is my code (run after loading the whole dataframe to impute):
import datawig

with tf.device(d):
    df = datawig.SimpleImputer.complete(df, inplace=True, num_epochs=max_epoch, verbose=1, output_path=result_dir+ str(num_seed)+'seed_imputer_model')
    with open(result_dir+str(num_seed)+"seed_Imputed_merged_cid.pickle", 'wb') as handle:
        pickle.dump(merged_cid, handle, protocol=pickle.HIGHEST_PROTOCOL)

    pd.DataFrame(df).to_csv(result_dir+ str(num_seed) + 'seed_Imputed_merged_cid.csv', index=None)

I use datawig.SimpleImputer.complete for simplicity,

but is there any way to get the neural network weights that were used for imputation?
Also, how does datawig.SimpleImputer.complete handle training and validation?

I am asking because the accuracy does not change:

2020-10-27 11:14:22,355 [INFO] Epoch[49] Batch [0-34] Speed: 1651.71 samples/sec cross-entropy=0.515578 C0040436-accuracy=0.000000
2020-10-27 11:14:22,675 [INFO] Epoch[49] Train-cross-entropy=0.667427
2020-10-27 11:14:22,675 [INFO] Epoch[49] Train-C0040436-accuracy=0.000000
2020-10-27 11:14:22,676 [INFO] Epoch[49] Time cost=0.657
2020-10-27 11:14:22,688 [INFO] Saved checkpoint to "result/dtip/impute/datawig/1000seed_imputer_model/C0040436/model-0049.params"
2020-10-27 11:14:22,723 [INFO] Epoch[49] Validation-cross-entropy=0.492388
2020-10-27 11:14:22,723 [INFO] Epoch[49] Validation-C0040436-accuracy=0.000000

Thanks

Hyojin

mxnet 1.4.0 requirement cannot be satisfied in newer Python

Hello,

I am trying to install datawig, however, I can only install later versions of mxnet. Is it possible to use newer versions of mxnet?

This is the error I am getting while installing from pip:

ERROR: Could not find a version that satisfies the requirement mxnet==1.4.0 (from versions: 1.6.0, 1.7.0.post1)
ERROR: No matching distribution found for mxnet==1.4.0

Unable to remove imputer.log file until I restart my kernel

I went through the code to find the cause. During training, a log file is created and attached to a Python logging handler, but the handler is never closed. As a result, the imputer folder cannot be removed until the kernel is restarted and the file handle is released.

To solve this, I added a function that retrieves all open handlers and closes them. With this change, any file generated during training can be removed once training of the datawig model has finished.

Below are the only changes I made to make it work, all in the "imputer.py" file:

def __close_filehandlers(self) -> None:
    """Function to close connection with log file.
    author: Carlos Moral Rubio."""

    handlers = logger.handlers[:]
    for handler in handlers:
        handler.close()
        logger.removeHandler(handler)

def fit(self,
        train_df: pd.DataFrame,
        test_df: pd.DataFrame = None,
        ctx: mx.context = get_context(),
        learning_rate: float = 1e-3,
        num_epochs: int = 100,
        patience: int = 3,
        test_split: float = .1,
        weight_decay: float = 0.,
        batch_size: int = 16,
        final_fc_hidden_units: List[int] = None,
        calibrate: bool = True):
    """
    Trains and stores imputer model

    :param train_df: training data as dataframe
    :param test_df: test data as dataframe; if not provided, [test_split] % of the training
                    data are used as test data
    :param ctx: List of mxnet contexts (if no gpu's available, defaults to [mx.cpu()])
                User can also pass in a list gpus to be used, ex. [mx.gpu(0), mx.gpu(2), mx.gpu(4)]
    :param learning_rate: learning rate for stochastic gradient descent (default 1e-3)
    :param num_epochs: maximal number of training epochs (default 100)
    :param patience: used for early stopping; after [patience] epochs with no improvement,
                    training is stopped. (default 3)
    :param test_split: if no test_df is provided this is the ratio of test data to be held
                    separate for determining model convergence
    :param weight_decay: regularizer (default 0)
    :param batch_size: default 16
    :param final_fc_hidden_units: list of dimensions for the final fully connected layer.
    :param calibrate: whether to calibrate predictions
    :return: trained imputer model
    """
    if final_fc_hidden_units is None:
        final_fc_hidden_units = []

    # make sure the output directory is writable
    assert os.access(self.output_path, os.W_OK), "Cannot write to directory {}".format(
        self.output_path)

    self.batch_size = batch_size
    self.final_fc_hidden_units = final_fc_hidden_units

    self.ctx = ctx
    logger.debug('Using [{}] as the context for training'.format(ctx))

    if (train_df is None) or (not isinstance(train_df, pd.core.frame.DataFrame)):
        raise ValueError("Need a non-empty DataFrame for fitting Imputer model")

    if test_df is None:
        train_df, test_df = random_split(train_df, [1.0 - test_split, test_split])

    iter_train, iter_test = self.__build_iterators(train_df, test_df, test_split)

    self.__check_data(test_df)

    # to make consecutive calls to .fit() continue where the previous call finished
    if self.module is None:
        self.module = self.__build_module(iter_train)

    self.__fit_module(iter_train, iter_test, learning_rate, num_epochs, patience, weight_decay)

    # Check whether calibration is needed, if so compute and set internal parameter
    # for temperature scaling that is supplied to self.__predict_mxnet_iter()
    if calibrate is True:
        self.calibrate(iter_test)

    _, metrics = self.__transform_and_compute_metrics_mxnet_iter(iter_test,
                                                                 metrics_path=self.metrics_path)

    for att, att_metric in metrics.items():
        if isinstance(att_metric, dict) and ('precision_recall_curves' in att_metric):
            self.precision_recall_curves[att] = att_metric['precision_recall_curves']

    self.__prune_models()
    self.save()

    if self.is_explainable:
        self.__persist_class_prototypes(iter_train, train_df)

    self.__close_filehandlers()

    return self
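
The effect of this change can be reproduced outside DataWig with the standard `logging` module. The sketch below attaches a `FileHandler`, closes and detaches all handlers in the same way as `__close_filehandlers`, and then removes the directory. The logger name and the temporary directory are placeholders. On Windows, the removal step would fail while a handler still holds the file open.

```python
import logging
import os
import shutil
import tempfile


def close_filehandlers(logger: logging.Logger) -> None:
    """Close and detach every handler so that log files can be deleted."""
    for handler in logger.handlers[:]:
        handler.close()
        logger.removeHandler(handler)


# Temporary directory standing in for the imputer's output path.
output_path = tempfile.mkdtemp()
log_file = os.path.join(output_path, "imputer.log")

logger = logging.getLogger("imputer_demo")  # placeholder logger name
logger.setLevel(logging.INFO)
logger.propagate = False
logger.addHandler(logging.FileHandler(log_file))
logger.info("training finished")

close_filehandlers(logger)
shutil.rmtree(output_path)
print(os.path.exists(output_path))
```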

datawig.SimpleImputer.complete not imputing some columns

I am working on some missing values problem with datawig (I am new to it), where from a total of 19 features in a pandas dataframe with missing data, only 4 of them are not fully imputed.

I do:

import datawig

# impute missing values
dataframe = datawig.SimpleImputer.complete(dataframe)

and I get the following error message:

/home/user/.local/lib/python3.8/site-packages/sklearn/metrics/_classification.py:1272: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))

What's happening and how could I impute the rest of the features?

Migrate Imputer from Symbolic API to Gluon API

The symbolic API for the imputer makes the code difficult to understand.

The Gluon API is much cleaner and allows for easier extensions.

Given the refactoring discussions I feel we would profit a lot from moving to Gluon.

Make SimpleImputer explainable

SimpleImputer.complete fails when output_path is not the default, i.e. not "."

shutil.rmtree(output_col) in SimpleImputer.complete fails because the directories are generated within the output_path. The code should have been shutil.rmtree(os.path.join(output_path, output_col))
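
A minimal sketch of the proposed fix, using a temporary directory in place of a real model directory. The column name `color` is a placeholder.

```python
import os
import shutil
import tempfile

# Temporary directory standing in for a non-default output_path.
output_path = tempfile.mkdtemp()
output_col = "color"  # placeholder column name
os.makedirs(os.path.join(output_path, output_col))

# The per-column directory lives inside output_path, so the path passed to
# rmtree has to be joined with it; the bare column name would be resolved
# relative to the current working directory instead.
model_dir = os.path.join(output_path, output_col)
shutil.rmtree(model_dir)
print(os.path.exists(model_dir))

# Clean up the temporary directory itself.
shutil.rmtree(output_path)
```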

Issue with explain method

I trained an Imputer model with a mix of categorical, numerical and bow encoder (and associated featurizers), but when I run the explain method on it I get this error:

~/opt/anaconda3/envs/extract/lib/python3.6/site-packages/datawig/imputer.py in explain(self, label, k, label_column)
    390         # for each data encoder extract (token_idx, token_idx_correlation_with_label), extract and apply idx2token map.
    391         feature_dict = dict(explained_label = label)
--> 392         for encoder, pattern in self.__class_patterns:
    393             # extract idx2token mappings
    394             if isinstance(encoder, CategoricalEncoder):

TypeError: 'NoneType' object is not iterable

I tried a dummy setup and explain works there, so I would like to know whether you have any idea what exactly causes this in my more complex model.

Save learning curve in metrics

I propose writing the learning curve as part of the metrics output, i.e. log likelihood and test and train accuracy, all as a function of the epoch. This would facilitate convergence diagnostics, and the values are already computed.
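
Until this is part of the metrics output, the curve can be recovered from the training log. The sketch below parses MXNet-style lines such as `Epoch[49] Train-cross-entropy=0.667427` into a per-epoch dictionary and serialises it as JSON. The regular expression and the output layout are assumptions for illustration, not DataWig's metrics format.

```python
import json
import re
from collections import defaultdict

# Matches e.g. "Epoch[49] Train-cross-entropy=0.667427".
LINE = re.compile(r"Epoch\[(\d+)\] (Train|Validation)-([\w-]+)=([\d.]+)")


def learning_curve(log_lines):
    """Collect train/validation metrics per epoch from log lines."""
    curve = defaultdict(dict)
    for line in log_lines:
        match = LINE.search(line)
        if match:
            epoch, split, metric, value = match.groups()
            key = "{}-{}".format(split.lower(), metric)
            curve[int(epoch)][key] = float(value)
    return dict(curve)


log = [
    "2020-10-27 11:14:22,675 [INFO] Epoch[49] Train-cross-entropy=0.667427",
    "2020-10-27 11:14:22,676 [INFO] Epoch[49] Time cost=0.657",
    "2020-10-27 11:14:22,723 [INFO] Epoch[49] Validation-cross-entropy=0.492388",
]
curve = learning_curve(log)
print(json.dumps(curve, indent=2, sort_keys=True))
```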

Dependencies clashing

numpy==1.18.0
scikit-learn[alldeps]==0.22.1
typing==3.6.6
pandas==0.25.0
mxnet==1.4.0
These are the requirements. However, mxnet 1.4.0 depends on numpy 1.14.6,
and the latest version, mxnet 1.6.0, depends on numpy 1.16.6.
As a result, I am unable to install this package.

about application on categorical and numerical data

Hi there, I am trying to run the example to apply datawig to both categorical and numerical data. The categorical data has integer values, while the numerical data is a positive real number. From the documentation, it seems that datawig takes multiple columns as input and imputes a specific column, instead of imputing all missing values across all columns. Am I correct? I have a dataset with 4 columns: A, B, C, and Y. Y is the conclusion (label), while A, B, and C are predictors. All columns contain missing values. Here is what I am trying to do with datawig, if I understand it correctly:

take A, B, C as input for imputation onto Y, to get Y_imputed, then replace Y with Y_imputed
take B, C, Y as input for imputation on to A, to get A_imputed, then replace A with A_imputed
take A, C, Y as input for imputation on to B, to get B_imputed, then replace B with B_imputed
take A, B, Y as input for imputation on to C, to get C_imputed, then replace C with C_imputed

This seems quite tedious if there are 1000 columns. How can I run all those iterations?
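
The per-column procedure can be written as a loop. Note that `SimpleImputer.complete`, described in the README, already fits one model per column with missing values, using all other columns as inputs. The sketch below only builds the (inputs, output) plan in plain Python; each pair could then be passed as `input_columns` and `output_column` to `datawig.SimpleImputer`. The function name `imputation_plan` is hypothetical.

```python
def imputation_plan(columns):
    """For each column, pair it with all remaining columns as inputs."""
    return [
        ([col for col in columns if col != target], target)
        for target in columns
    ]


for inputs, output in imputation_plan(["A", "B", "C", "Y"]):
    print(inputs, "->", output)
```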

My second question is about handling categorical data. The quick example section shows how to handle text, but what happens if the categorical data are integers in a given range while other columns are real numbers? How can I specify the data type? I am trying the following example:


df = pd.DataFrame([[np.nan, 2,     np.nan, 0],
                   [0,      3,     np.nan, 1],
                   [np.nan, np.nan,  np.nan, 1],
                   [1,      4,     3,      0],
                   [3,      1,     0,      np.nan],
                   [2,      2,     1,      1],
                   [0,      3,     3,      1],
                   [2,      2,     1,      1],
                   [0,      3,     3,      1],
                   [2,      2,     1,      1],
                   [0,      3,     3,      1],
                   [2,      2,     1,      1],
                   [0,      3,     3,      1],
                   [2,      2,     1,      1],
                   [0,      3,     3,      1],
                   [2,      2,     1,      1],
                   [0,      3,     3,      1],
                   [2,      2,     1,      1],
                   [0,      3,     3,      1],
                   [2,      2,     1,      1],
                   [0,      3,     3,      1],
                   [2,      2,     1,      1],
                   [0,      3,     3,      1],
                   [2,      2,     1,      1],
                   [0,      3,     3,      1],
                   [2,      2,     1,      1],
                   [0,      3,     3,      1],
                   [2,      2,     1,      1],
                   [0,      3,     3,      1],
                   [2,      2,     1,      1],
                   [0,      3,     3,      1],
                   [2,      2,     1,      1],
                   [0,      3,     3,      1],
                   [2,      2,     1,      1],
                   [0,      3,     3,      1],
                   [2,      2,     1,      1],
                   [0,      3,     3,      1],
                   [2,      2,     1,      1],
                   [0,      3,     3,      1],
                   [1,      3,     0,      np.nan],
                   [np.nan, np.nan,  0,     np.nan],
                   ],
                  columns = list('ABCY'))

df_train, df_test = datawig.utils.random_split(df)
categorial_encoder_cols = [CategoricalEncoder('A')]
label_encoder_cols = 'Y'
print(df)
imputer = datawig.SimpleImputer(
    label_encoders=label_encoder_cols,
    data_encoders=categorial_encoder_cols,
    output_path = 'imputer_model' # stores model data and metrics
    )
dout = imputer.fit(train_df=df_train)

but it fails with the error "TypeError: __init__() got an unexpected keyword argument 'label_encoders'"

OSError: [WinError 126] The specified module could not be found, Datawig 0.1.12

Hi All,

I installed the latest (0.1.12) version of Datawig module. I've considered all the package requirements:

scikit-learn[alldeps]==0.22.1
typing==3.6.6
pandas==0.25.3
mxnet==1.4.0

But when I run the command "import datawig", I get the following error:

Traceback (most recent call last):
  File "C:/Users/PC/Desktop/python_ummd/venv/Lib/imputation.py", line 7, in <module>
    import datawig
  File "C:\Users\PC\Desktop\untitled1\venv\bin\lib\site-packages\datawig\__init__.py", line 2, in <module>
    from .column_encoders import CategoricalEncoder, BowEncoder, NumericalEncoder, SequentialEncoder
  File "C:\Users\PC\Desktop\untitled1\venv\bin\lib\site-packages\datawig\column_encoders.py", line 26, in <module>
    import mxnet as mx
  File "C:\Users\PC\Desktop\untitled1\venv\bin\lib\site-packages\mxnet\__init__.py", line 24, in <module>
    from .context import Context, current_context, cpu, gpu, cpu_pinned
  File "C:\Users\PC\Desktop\untitled1\venv\bin\lib\site-packages\mxnet\context.py", line 24, in <module>
    from .base import classproperty, with_metaclass, _MXClassPropertyMetaClass
  File "C:\Users\PC\Desktop\untitled1\venv\bin\lib\site-packages\mxnet\base.py", line 213, in <module>
    _LIB = _load_lib()
  File "C:\Users\PC\Desktop\untitled1\venv\bin\lib\site-packages\mxnet\base.py", line 204, in _load_lib
    lib = ctypes.CDLL(lib_path[0], ctypes.RTLD_LOCAL)
  File "C:\Users\PC\AppData\Local\Programs\Python\Python37\lib\ctypes\__init__.py", line 364, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: [WinError 126] The specified module could not be found

I want to use SimpleImputer. I use Windows [Version 10.0.18363.836], Python 3.7, Pycharm 2020.1.1.

Could someone give me a hint on this issue?

Best regards,
Anastasiia

imputer.predict should not modify dataframe in place.

As the title says: when calling imputer.predict(df), columns are appended to df, so a second call of the command throws an error.
This makes it unnecessarily complicated to apply different imputers to the same dataset, e.g. for comparing predictions.
Changing this may break backwards compatibility though.
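
Until the behaviour changes, a workaround is to hand `predict` a copy of the data frame. The sketch below uses a stand-in imputer and a plain dict instead of a real model and DataFrame, so it runs without DataWig; `FakeImputer` and `predict_on_copy` are hypothetical names.

```python
class FakeImputer:
    """Stand-in that mimics the in-place behaviour described above."""

    def __init__(self, output_column):
        self.output_column = output_column

    def predict(self, table):
        key = self.output_column + "_imputed"
        if key in table:
            raise ValueError("column {} already exists".format(key))
        # Appends a column to the object it was given.
        table[key] = ["?"] * len(table[self.output_column])
        return table


def predict_on_copy(imputer, table):
    """Call predict on a copy so the caller's table stays untouched."""
    return imputer.predict(table.copy())


data = {"label": ["a", None, "b"]}
first = predict_on_copy(FakeImputer("label"), data)
second = predict_on_copy(FakeImputer("label"), data)  # no error on the second call
print(sorted(data))
```

With a real DataFrame, the equivalent call would be `imputer.predict(df.copy())`.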

Fix broken build

Why has the failing build not been fixed? Is the project still alive? It hasn't seen any updates lately. Or is it abandoned?

Feature request: Progress indicator

Howdy,

I'm playing around with datawig and it's really neat. I am using it for a fairly large file and it completed after almost 8 hours last night - it would be awesome if there was some indicator of progress.

Right now I'm checking my computer here and there to see if the CPU is maxed out and if the job is running. Tensorflow has a great progress bar that might be nice to help people estimate timing for large jobs.

ValueError: cannot convert float NaN to integer

I confirmed that I have the required dependency versions.
scikit-learn[alldeps]==0.22.1
typing==3.6.6
pandas==0.25.3
mxnet==1.4.0

I successfully installed Datawig and the quickstart example works just fine.

import pandas as pd
import datawig
import numpy as np

df1 = datawig.utils.generate_df_numeric()

df1_with_missing = df1.mask(np.random.rand(*df1.shape) > .9)

df1_with_missing_imputed = datawig.SimpleImputer.complete(df1_with_missing)

However, when I try to apply it to my dataset, I get the following error:
df = pd.read_csv(abundance_file_path)
df_with_missing = df
df_with_missing_imputed = datawig.SimpleImputer.complete(df_with_missing)

/Applications/anaconda3/envs/Python_3-7/lib/python3.7/site-packages/sklearn/utils/extmath.py:765: RuntimeWarning: invalid value encountered in true_divide
  updated_mean = (last_sum + new_sum) / updated_sample_count
/Applications/anaconda3/envs/Python_3-7/lib/python3.7/site-packages/sklearn/utils/extmath.py:706: RuntimeWarning: Degrees of freedom <= 0 for slice.
  result = op(x, *args, **kwargs)
Traceback (most recent call last):
  File "", line 2, in
    df_with_missing_imputed = datawig.SimpleImputer.complete(df_with_missing)
  File "/Applications/anaconda3/envs/Python_3-7/lib/python3.7/site-packages/datawig/simple_imputer.py", line 527, in complete
    calibrate=False)
  File "/Applications/anaconda3/envs/Python_3-7/lib/python3.7/site-packages/datawig/simple_imputer.py", line 390, in fit
    calibrate=calibrate)
  File "/Applications/anaconda3/envs/Python_3-7/lib/python3.7/site-packages/datawig/imputer.py", line 263, in fit
    iter_train, iter_test = self.__build_iterators(train_df, test_df, test_split)
  File "/Applications/anaconda3/envs/Python_3-7/lib/python3.7/site-packages/datawig/imputer.py", line 604, in __build_iterators
    batch_size=self.batch_size
  File "/Applications/anaconda3/envs/Python_3-7/lib/python3.7/site-packages/datawig/iterators.py", line 231, in __init__
    self.start_padding_idx = int(data_frame.index.max() + 1)
ValueError: cannot convert float NaN to integer
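
The last frame of the traceback converts `data_frame.index.max() + 1` to `int`. That conversion fails with exactly this message when the maximum is NaN, which can happen, for example, when the frame passed to the iterator is empty. This is an assumption about the cause, not something confirmed in this thread. A pure-Python reproduction of the failing conversion, with `start_padding_idx` as a hypothetical stand-in for the library code:

```python
import math


def start_padding_idx(index_values):
    """Mimics int(index.max() + 1); an empty index is modelled as NaN."""
    index_max = max(index_values) if index_values else math.nan
    return int(index_max + 1)


print(start_padding_idx([0, 1, 2]))

try:
    start_padding_idx([])
except ValueError as err:
    message = str(err)
    print(message)
```

If this is the cause, checking whether any rows remain for the column being imputed (for instance, whether a column is entirely NaN) would be a reasonable first diagnostic step.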

Unify precision filter

There appear to be two places in the code where precision filtering for categorical predictions is done.

in imputer.predict where below threshold values are replaced by empty strings; here the resulting data frame has the same number of rows as the data frame that was the argument to predict
in imputer.__filter_predictions where the below threshold values are discarded; the result list now can have a lower number of rows and there will be an error in imputer.predict

We should make sure filtering is done consistently and preferably without changing the size of the input data frame
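
A sketch of a filter that keeps the number of rows constant: predictions whose score falls below the threshold for their label are replaced by `None` instead of being dropped. The function name, the `(label, score)` representation and the per-label threshold dictionary are assumptions for illustration, not DataWig's internals.

```python
def filter_predictions(predictions, thresholds, default_threshold=0.0):
    """Return one entry per input row; low-confidence labels become None."""
    filtered = []
    for label, score in predictions:
        threshold = thresholds.get(label, default_threshold)
        filtered.append(label if score >= threshold else None)
    return filtered


# Made-up predictions and per-label thresholds.
predictions = [("black", 0.95), ("blue", 0.40), ("black", 0.55)]
thresholds = {"black": 0.6, "blue": 0.5}
result = filter_predictions(predictions, thresholds)
print(result)
```

Because the output has the same length as the input, it can be assigned back as a column without realigning rows.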

Serialisation behaviour

There has been a discussion about serialisation behaviour. In particular, it can lead to crashes because of permission issues. A second problem is disk space when training many models, e.g. for HPO.

From my perspective it would be nice to make serialisation optional. But I understand there is already a consensus not to(?)

ValueError: fill value must be in categories

Hi,

I am trying to impute numeric values from one specific column (it's called 'Comercializadora_encoded', and it is now a numeric column because I previously encoded the original object-type column with LabelEncoder() from sklearn).

These are the columns I would like to use as input:

--> Provincia 166203 non-null float64
--> Consumo 166203 non-null float64
--> Potencia max 166203 non-null float64

And this is the column to impute:

--> Comercializadora_encoded 163937 non-null object

This is my code:

df_train, df_test = datawig.utils.random_split(df_copy)

imputer = datawig.SimpleImputer(
    input_columns=['Provincia', 'Consumo', 'Potencia max'],
    output_column= 'Comercializadora_encoded', 
    output_path = 'imputer_model' 
    )

imputer.fit(train_df=df_train, num_epochs=50)

imputed = imputer.predict(df_test)

And this is the error message I am getting:

2020-11-30 09:57:37,860 [INFO]  CategoricalEncoder for column Comercializadora_encoded                                found only 47 occurrences of value 16.0
2020-11-30 09:57:37,860 [INFO]  CategoricalEncoder for column Comercializadora_encoded                                found only 40 occurrences of value 7.0
2020-11-30 09:57:37,860 [INFO]  CategoricalEncoder for column Comercializadora_encoded                                found only 27 occurrences of value 44.0
2020-11-30 09:57:37,865 [INFO]  CategoricalEncoder for column Comercializadora_encoded                                found only 23 occurrences of value 66.0
2020-11-30 09:57:37,866 [INFO]  CategoricalEncoder for column Comercializadora_encoded                                found only 19 occurrences of value 29.0
2020-11-30 09:57:37,868 [INFO]  CategoricalEncoder for column Comercializadora_encoded                                found only 18 occurrences of value 28.0
2020-11-30 09:57:37,869 [INFO]  CategoricalEncoder for column Comercializadora_encoded                                found only 17 occurrences of value 56.0
2020-11-30 09:57:37,870 [INFO]  CategoricalEncoder for column Comercializadora_encoded                                found only 17 occurrences of value 21.0
2020-11-30 09:57:37,871 [INFO]  CategoricalEncoder for column Comercializadora_encoded                                found only 16 occurrences of value 81.0
2020-11-30 09:57:37,872 [INFO]  CategoricalEncoder for column Comercializadora_encoded                                found only 16 occurrences of value 34.0
2020-11-30 09:57:37,873 [INFO]  CategoricalEncoder for column Comercializadora_encoded                                found only 16 occurrences of value 74.0
2020-11-30 09:57:37,874 [INFO]  CategoricalEncoder for column Comercializadora_encoded                                found only 13 occurrences of value 43.0
2020-11-30 09:57:37,875 [INFO]  CategoricalEncoder for column Comercializadora_encoded                                found only 12 occurrences of value 1.0
2020-11-30 09:57:37,876 [INFO]  CategoricalEncoder for column Comercializadora_encoded                                found only 9 occurrences of value 52.0
2020-11-30 09:57:37,877 [INFO]  CategoricalEncoder for column Comercializadora_encoded                                found only 9 occurrences of value 38.0
2020-11-30 09:57:37,878 [INFO]  CategoricalEncoder for column Comercializadora_encoded                                found only 9 occurrences of value 9.0
2020-11-30 09:57:37,880 [INFO]  CategoricalEncoder for column Comercializadora_encoded                                found only 8 occurrences of value 12.0
2020-11-30 09:57:37,881 [INFO]  CategoricalEncoder for column Comercializadora_encoded                                found only 8 occurrences of value 25.0
2020-11-30 09:57:37,882 [INFO]  CategoricalEncoder for column Comercializadora_encoded                                found only 7 occurrences of value 69.0
2020-11-30 09:57:37,884 [INFO]  CategoricalEncoder for column Comercializadora_encoded                                found only 7 occurrences of value 79.0
2020-11-30 09:57:37,885 [INFO]  CategoricalEncoder for column Comercializadora_encoded                                found only 7 occurrences of value 63.0
2020-11-30 09:57:37,886 [INFO]  CategoricalEncoder for column Comercializadora_encoded                                found only 7 occurrences of value 6.0
2020-11-30 09:57:37,887 [INFO]  CategoricalEncoder for column Comercializadora_encoded                                found only 7 occurrences of value 76.0
2020-11-30 09:57:37,888 [INFO]  CategoricalEncoder for column Comercializadora_encoded                                found only 6 occurrences of value 67.0
2020-11-30 09:57:37,888 [INFO]  CategoricalEncoder for column Comercializadora_encoded                                found only 6 occurrences of value 54.0
2020-11-30 09:57:37,889 [INFO]  CategoricalEncoder for column Comercializadora_encoded                                found only 5 occurrences of value 26.0
2020-11-30 09:57:37,890 [INFO]  CategoricalEncoder for column Comercializadora_encoded                                found only 5 occurrences of value 20.0
2020-11-30 09:57:37,890 [INFO]  CategoricalEncoder for column Comercializadora_encoded                                found only 5 occurrences of value 48.0
2020-11-30 09:57:37,891 [INFO]  CategoricalEncoder for column Comercializadora_encoded                                found only 5 occurrences of value 49.0
2020-11-30 09:57:37,892 [INFO]  CategoricalEncoder for column Comercializadora_encoded                                found only 5 occurrences of value 10.0
2020-11-30 09:57:37,893 [INFO]  CategoricalEncoder for column Comercializadora_encoded                                found only 4 occurrences of value 23.0
2020-11-30 09:57:37,894 [INFO]  CategoricalEncoder for column Comercializadora_encoded                                found only 4 occurrences of value 53.0
2020-11-30 09:57:37,896 [INFO]  CategoricalEncoder for column Comercializadora_encoded                                found only 4 occurrences of value 5.0
2020-11-30 09:57:37,897 [INFO]  CategoricalEncoder for column Comercializadora_encoded                                found only 4 occurrences of value 36.0
2020-11-30 09:57:37,899 [INFO]  CategoricalEncoder for column Comercializadora_encoded                                found only 3 occurrences of value 57.0
2020-11-30 09:57:37,900 [INFO]  CategoricalEncoder for column Comercializadora_encoded                                found only 3 occurrences of value 27.0
2020-11-30 09:57:37,902 [INFO]  CategoricalEncoder for column Comercializadora_encoded                                found only 3 occurrences of value 0.0
2020-11-30 09:57:37,903 [INFO]  CategoricalEncoder for column Comercializadora_encoded                                found only 3 occurrences of value 17.0
2020-11-30 09:57:37,904 [INFO]  CategoricalEncoder for column Comercializadora_encoded                                found only 3 occurrences of value 2.0
2020-11-30 09:57:37,906 [INFO]  CategoricalEncoder for column Comercializadora_encoded                                found only 2 occurrences of value 45.0
2020-11-30 09:57:37,907 [INFO]  CategoricalEncoder for column Comercializadora_encoded                                found only 2 occurrences of value 71.0
2020-11-30 09:57:37,908 [INFO]  CategoricalEncoder for column Comercializadora_encoded                                found only 2 occurrences of value 46.0
2020-11-30 09:57:37,909 [INFO]  CategoricalEncoder for column Comercializadora_encoded                                found only 2 occurrences of value 4.0
2020-11-30 09:57:37,910 [INFO]  CategoricalEncoder for column Comercializadora_encoded                                found only 2 occurrences of value 50.0
2020-11-30 09:57:37,911 [INFO]  CategoricalEncoder for column Comercializadora_encoded                                found only 2 occurrences of value 14.0
2020-11-30 09:57:37,912 [INFO]  CategoricalEncoder for column Comercializadora_encoded                                found only 2 occurrences of value 68.0
2020-11-30 09:57:37,913 [INFO]  CategoricalEncoder for column Comercializadora_encoded                                found only 2 occurrences of value 22.0
2020-11-30 09:57:37,914 [INFO]  CategoricalEncoder for column Comercializadora_encoded                                found only 1 occurrences of value 59.0
2020-11-30 09:57:37,916 [INFO]  CategoricalEncoder for column Comercializadora_encoded                                found only 1 occurrences of value 65.0
2020-11-30 09:57:37,917 [INFO]  CategoricalEncoder for column Comercializadora_encoded                                found only 1 occurrences of value 42.0
2020-11-30 09:57:37,919 [INFO]  CategoricalEncoder for column Comercializadora_encoded                                found only 1 occurrences of value 72.0
2020-11-30 09:57:37,920 [INFO]  CategoricalEncoder for column Comercializadora_encoded                                found only 1 occurrences of value 77.0
2020-11-30 09:57:37,921 [INFO]  CategoricalEncoder for column Comercializadora_encoded                                found only 1 occurrences of value 60.0
2020-11-30 09:57:37,922 [INFO]  CategoricalEncoder for column Comercializadora_encoded                                found only 1 occurrences of value 8.0
2020-11-30 09:57:37,923 [INFO]  CategoricalEncoder for column Comercializadora_encoded                                found only 1 occurrences of value 3.0
2020-11-30 09:57:37,924 [INFO]  CategoricalEncoder for column Comercializadora_encoded                                found only 1 occurrences of value 82.0
2020-11-30 09:57:37,925 [INFO]  CategoricalEncoder for column Comercializadora_encoded                                found only 1 occurrences of value 13.0
2020-11-30 09:57:37,926 [INFO]  CategoricalEncoder for column Comercializadora_encoded                                found only 1 occurrences of value 33.0
2020-11-30 09:57:37,927 [INFO]  CategoricalEncoder for column Comercializadora_encoded                                found only 1 occurrences of value 15.0
2020-11-30 09:57:37,928 [INFO]  CategoricalEncoder for column Comercializadora_encoded                                found only 1 occurrences of value 37.0
2020-11-30 09:57:37,930 [INFO]  CategoricalEncoder for column Comercializadora_encoded                                found only 1 occurrences of value 62.0
2020-11-30 09:57:37,931 [INFO]  CategoricalEncoder for column Comercializadora_encoded                                found only 1 occurrences of value 75.0
2020-11-30 09:57:37,932 [INFO]  CategoricalEncoder for column Comercializadora_encoded                                found only 1 occurrences of value 40.0
2020-11-30 09:57:37,933 [INFO]  CategoricalEncoder for column Comercializadora_encoded                                found only 1 occurrences of value 41.0
2020-11-30 09:57:37,934 [INFO]  CategoricalEncoder for column Comercializadora_encoded                                found only 1 occurrences of value 30.0
2020-11-30 09:57:37,935 [INFO]  CategoricalEncoder for column Comercializadora_encoded                                found only 1 occurrences of value 39.0
C:\Users\rcruz\Anaconda3\lib\site-packages\pandas\core\frame.py:3509: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self[k1] = value[k2]
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-55-55b90ff782c9> in <module>
     10 
     11 ## Fit an imputer model on the train data
---> 12 imputer.fit(train_df=df_train, num_epochs=50)
     13 
     14 ## Impute missing values and return original dataframe with predictions

~\AppData\Roaming\Python\Python38\site-packages\datawig\simple_imputer.py in fit(self, train_df, test_df, ctx, learning_rate, num_epochs, patience, test_split, weight_decay, batch_size, final_fc_hidden_units, calibrate, class_weights, instance_weights)
    384         self.output_path = self.imputer.output_path
    385 
--> 386         self.imputer = self.imputer.fit(train_df, test_df, ctx, learning_rate, num_epochs, patience,
    387                                         test_split,
    388                                         weight_decay, batch_size,

~\AppData\Roaming\Python\Python38\site-packages\datawig\imputer.py in fit(self, train_df, test_df, ctx, learning_rate, num_epochs, patience, test_split, weight_decay, batch_size, final_fc_hidden_units, calibrate)
    261             train_df, test_df = random_split(train_df, [1.0 - test_split, test_split])
    262 
--> 263         iter_train, iter_test = self.__build_iterators(train_df, test_df, test_split)
    264 
    265         self.__check_data(test_df)

~\AppData\Roaming\Python\Python38\site-packages\datawig\imputer.py in __build_iterators(self, train_df, test_df, test_split)
    590 
    591         logger.debug("Building Train Iterator with {} elements".format(len(train_df)))
--> 592         iter_train = ImputerIterDf(
    593             data_frame=train_df,
    594             data_columns=self.data_encoders,

~\AppData\Roaming\Python\Python38\site-packages\datawig\iterators.py in __init__(self, data_frame, data_columns, label_columns, batch_size)
    221         numerical_columns = [c for c in data_frame.columns if is_numeric_dtype(data_frame[c])]
    222         string_columns = list(set(data_frame.columns) - set(numerical_columns))
--> 223         data_frame = data_frame.fillna(value={x: "" for x in string_columns})
    224         data_frame = data_frame.fillna(value={x: np.nan for x in numerical_columns})
    225 

~\Anaconda3\lib\site-packages\pandas\core\frame.py in fillna(self, value, method, axis, inplace, limit, downcast, **kwargs)
   4250         **kwargs
   4251     ):
-> 4252         return super().fillna(
   4253             value=value,
   4254             method=method,

~\Anaconda3\lib\site-packages\pandas\core\generic.py in fillna(self, value, method, axis, inplace, limit, downcast)
   6272                         continue
   6273                     obj = result[k]
-> 6274                     obj.fillna(v, limit=limit, inplace=True, downcast=downcast)
   6275                 return result if not inplace else None
   6276 

~\Anaconda3\lib\site-packages\pandas\core\series.py in fillna(self, value, method, axis, inplace, limit, downcast, **kwargs)
   4339         **kwargs
   4340     ):
-> 4341         return super().fillna(
   4342             value=value,
   4343             method=method,

~\Anaconda3\lib\site-packages\pandas\core\generic.py in fillna(self, value, method, axis, inplace, limit, downcast)
   6255                     )
   6256 
-> 6257                 new_data = self._data.fillna(
   6258                     value=value, limit=limit, inplace=inplace, downcast=downcast
   6259                 )

~\Anaconda3\lib\site-packages\pandas\core\internals\managers.py in fillna(self, **kwargs)
    573 
    574     def fillna(self, **kwargs):
--> 575         return self.apply("fillna", **kwargs)
    576 
    577     def downcast(self, **kwargs):

~\Anaconda3\lib\site-packages\pandas\core\internals\managers.py in apply(self, f, axes, filter, do_integrity_check, consolidate, **kwargs)
    436                     kwargs[k] = obj.reindex(b_items, axis=axis, copy=align_copy)
    437 
--> 438             applied = getattr(b, f)(**kwargs)
    439             result_blocks = _extend_blocks(applied, result_blocks)
    440 

~\Anaconda3\lib\site-packages\pandas\core\internals\blocks.py in fillna(self, value, limit, inplace, downcast)
   1950     def fillna(self, value, limit=None, inplace=False, downcast=None):
   1951         values = self.values if inplace else self.values.copy()
-> 1952         values = values.fillna(value=value, limit=limit)
   1953         return [
   1954             self.make_block_same_class(

~\Anaconda3\lib\site-packages\pandas\util\_decorators.py in wrapper(*args, **kwargs)
    206                 else:
    207                     kwargs[new_arg_name] = new_arg_value
--> 208             return func(*args, **kwargs)
    209 
    210         return wrapper

~\Anaconda3\lib\site-packages\pandas\core\arrays\categorical.py in fillna(self, value, method, limit)
   1871             elif is_hashable(value):
   1872                 if not isna(value) and value not in self.categories:
-> 1873                     raise ValueError("fill value must be in categories")
   1874 
   1875                 mask = codes == -1

ValueError: fill value must be in categories

I've also tried to use categorical columns as input columns, and to convert the output column into a category.
Am I missing something?

Thank you very much.
Regards,
Rubén.
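
The last frame of the traceback is raised by pandas itself: a categorical column rejects a fill value that is not one of its categories. Below is a minimal sketch of one possible workaround, assuming the affected columns have the pandas `category` dtype (the post mentions converting columns into a category). The helper name `decategorize` is hypothetical and not part of DataWig.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype


def decategorize(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df where every category-dtype column is plain object dtype."""
    out = df.copy()
    for col in out.columns:
        if isinstance(out[col].dtype, CategoricalDtype):
            out[col] = out[col].astype(object)
    return out


# Illustrative data only; column names are made up for this sketch.
df = pd.DataFrame({
    "city": pd.Categorical(["paris", None, "rome"]),
    "size": [1.0, 2.0, None],
})

# Filling a categorical column with a value outside its categories is rejected
# (ValueError on older pandas releases, TypeError on newer ones).
try:
    df["city"].fillna("")
    rejected = False
except (ValueError, TypeError):
    rejected = True

# After converting to object dtype, the same fill goes through.
filled = decategorize(df).fillna({"city": ""})
print(rejected, filled["city"].tolist())
```

Passing `decategorize(df_train)` to `imputer.fit` instead of `df_train` would be the corresponding change in the code above, under the same assumption.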

The dataset you're doing the examples on, doesn't have missing values

I might be missing something, but I couldn't find any missing values in the dataset used in the example.

AttributeError: module 'mxnet' has no attribute 'random'

Build 1.4 today
Ubuntu 18.04
gcc-6, g++-6
Cuda 10
TensorRT
Tensorflow 1.13
Make option:
USE_OPENCV=1
USE_BLAS=openblas
USE_CUDA=1
USE_CUDA_PATH=/usr/local/cuda
USE_CUDNN=1
USE_NCCL=1
Compiled with only a warning about lapack
Copied the example into a python script and ran with python3

Error reported:
Traceback (most recent call last):
File "test.py", line 8, in
mx.random.seed(1234)
AttributeError: module 'mxnet' has no attribute 'random'

test.zip

Update dataset links in user guide

The user guide at https://datawig.readthedocs.io/en/latest/source/userguide.html#step-by-step-examples contains a reference to a dropbox link that does not exist anymore.

Return training metrics

At the moment, all metrics we evaluate automatically are measured with respect to the test data (except for the log likelihood). To analyse model performance, in particular bias/variance, training accuracy is crucial.
We should add this.

The whole metrics computation, however, is cluttered and complicated and I wonder whether we should revisit it more fundamentally.

AttributeError: module 'mxnet' has no attribute 'random'

Thanks for sharing this amazing project.
I tried to run the first example provided in the README, but calling `import datawig` just produces errors.

AttributeError Traceback (most recent call last)
in
----> 1 import datawig
2
3 df = datawig.utils.generate_df_string(num_samples=200, data_column_name='sentences', label_column_name='label')
4 df_train, df_test = datawig.utils.random_split(df)
5

~/Development/repos/AWS/MBA/datawig/datawig/__init__.py in
1 # makes the column encoders available as e.g. from datawig import CategoricalEncoder
----> 2 from .column_encoders import CategoricalEncoder, BowEncoder, NumericalEncoder, SequentialEncoder
3 from .mxnet_input_symbols import BowFeaturizer, LSTMFeaturizer, NumericalFeaturizer, EmbeddingFeaturizer
4 from .simple_imputer import SimpleImputer
5 from .imputer import Imputer

~/Development/repos/AWS/MBA/datawig/datawig/column_encoders.py in
30 from sklearn.preprocessing import StandardScaler
31
---> 32 from .utils import logger
33
34 random.seed(0)

~/Development/repos/AWS/MBA/datawig/datawig/utils.py in
32 import pandas as pd
33
---> 34 mx.random.seed(1)
35 random.seed(1)
36 np.random.seed(42)

AttributeError: module 'mxnet' has no attribute 'random'

Should I set up a conda env for a specific version of mxnet?
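
One way to narrow this down is to check which `mxnet` Python actually imports. This is a sketch, assuming (not confirmed in this thread) that the error comes from an incomplete install or from a directory named `mxnet` that shadows the real package. In that situation the reported file location is typically missing or points somewhere unexpected. `inspect_attribute` is a hypothetical helper, shown here on a standard-library module so that it runs anywhere.

```python
import importlib


def inspect_attribute(module_name: str, attribute: str) -> dict:
    """Import a module and report where it was loaded from and whether it has `attribute`."""
    module = importlib.import_module(module_name)
    return {
        "file": getattr(module, "__file__", None),
        "has_attribute": hasattr(module, attribute),
    }


# Demonstrated on a standard-library module; in the failing environment one
# would call inspect_attribute("mxnet", "random") instead.
report = inspect_attribute("json", "dumps")
print(report)
```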

Make use of categorical encoding in SimpleImputer if applicable

replace instead of creating new column

Hi There,

I'm wondering if it's possible to have datawig replace an existing column instead of creating a new column?

For example, when I run df = imputer.predict(df), it takes the column I wanted to predict and adds copies with the suffixes _imputed and _imputed_proba.

How can I avoid that and just replace the df's column with the new value?
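
A possible post-processing step in plain pandas, sketched under the assumption that the extra columns are named exactly `<column>_imputed` and `<column>_imputed_proba` as described above. `fold_imputed` is a hypothetical helper, not a DataWig function, and the DataFrame below only mimics the shape of a prediction result.

```python
import pandas as pd


def fold_imputed(df: pd.DataFrame, column: str, overwrite: bool = False) -> pd.DataFrame:
    """Write imputed values back into `column` and drop the helper columns.

    With overwrite=False only missing entries are filled; with overwrite=True
    every entry is replaced by the imputed value.
    """
    out = df.copy()
    imputed = f"{column}_imputed"
    if overwrite:
        out[column] = out[imputed]
    else:
        out[column] = out[column].fillna(out[imputed])
    extras = [c for c in (imputed, f"{imputed}_proba") if c in out.columns]
    return out.drop(columns=extras)


# Stand-in for the output of imputer.predict(df); values are made up.
predicted = pd.DataFrame({
    "color": ["red", None, "blue"],
    "color_imputed": ["red", "green", "black"],
    "color_imputed_proba": [0.9, 0.7, 0.4],
})
result = fold_imputed(predicted, "color")
print(result)
```

Filling only the missing entries keeps observed values untouched, which is usually what is wanted for imputation.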

ValueError: cannot convert float NaN to integer

Hello,
I get this error that I can't solve, using google colaboratory.
I'm not sure if it's due to a wrong install or conflicting versions, my apologies if it is.

/usr/local/lib/python3.6/dist-packages/datawig/iterators.py in __init__(self, data_frame, data_columns, label_columns, batch_size)
229 # custom padding for having to discard the last batch in mxnet for sparse data
230 padding_n_rows = self._n_rows_padding(data_frame)
--> 231 self.start_padding_idx = int(data_frame.index.max() + 1)
232 for idx in range(self.start_padding_idx, self.start_padding_idx + padding_n_rows):
233 data_frame.loc[idx, :] = data_frame.loc[self.start_padding_idx - 1, :]

ValueError: cannot convert float NaN to integer

My code :

!pip install datawig

import datawig, numpy

import pandas as pd

import sys
from io import StringIO

data="""epiU	epiPV	dsU	dsPG	ifrU	ifrPG
874	1125	40	57		
815	1081	48	95		
712	937	39	53		
606	773	45	80		
576	721	38	52		
401	547	28	44	1040	1202
362	479	31	46	986	1139
295	361	29	42	909	1043
253	314	30	57	757	892
292	364	92	150	844	1018
253	311	18	43	765	921
214	263	14	24	681	808
198	248	16	26	645	752
161	199	10	24	562	654
"""

df = pd.read_csv(StringIO(data), sep="\t")
df = df[['epiU', 'dsU', 'ifrU']]

print(df.dtypes)
print(df)

df_imputed = datawig.SimpleImputer.complete(df)

EDIT:
An important note: the basic example works fine.

# generate some data with simple nonlinear dependency
df = datawig.utils.generate_df_numeric() 
# mask 10% of the values
df_with_missing = df.mask(numpy.random.rand(*df.shape) > .9)

# impute missing values
df_with_missing_imputed = datawig.SimpleImputer.complete(df_with_missing)

EDIT2

I think the issue is datawig requires pandas 0.25.3

!pip install datawig 
ERROR: google-colab 1.0.0 has requirement pandas~=1.0.0; python_version >= "3.0", but you'll have pandas 0.25.3 which is incompatible.

Problems when encoding a numerical column using categorical featurizers

Hello,
I am working on a synthetic categorical dataset that contains information about people and their location, with a format similar to

name, city, zip
john, paris, 1234
frank, rome, 718

In this situation the zip codes are integers, but I want to treat them as categorical data because I do not want to infer information based on their numerical properties.

In my code, I implemented the imputer both as

data_encoder_cols = [datawig.BowEncoder('zip')]
label_encoder_cols = [datawig.CategoricalEncoder('city')]
data_featurizer_cols = [datawig.BowFeaturizer('zip', max_tokens=df_train['zip'].nunique())]

and as

data_encoder_cols = [datawig.CategoricalEncoder('zip')]
label_encoder_cols = [datawig.CategoricalEncoder('city')]
data_featurizer_cols = [datawig.EmbeddingFeaturizer('zip', max_tokens=df_train['zip'].nunique())]

In the first case, imputer.fit(df_train) failed because (I'm assuming) the zip column was automatically cast back to integer (even though I had previously set it to object). The exception I'm getting is AttributeError: 'float' object has no attribute 'lower'
In the second case, the training failed in a weird way I can't explain, with the exception IndexError: index 716 is out of bounds for axis 0 with size 715

I tried df_dirty['zip'] = df_dirty['zip'].apply(lambda x: 'i' + str(x)) as a workaround, so that the zip codes are forced to be seen as strings. In this case, the code with CategoricalEncoder still failed with a similar error (719 instead of 716), while the BowEncoder version runs (slowly).

What's the correct way of treating numerical columns as categorical ones?
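
The message `'float' object has no attribute 'lower'` is what Python raises when a string method is called on a float such as NaN, so one plausible cause (an assumption, not verified against DataWig) is that missing or float-typed zip values reach a text encoder. Note also that `'i' + str(x)` turns a missing value into the literal string `'inan'`. Here is a sketch of an explicit conversion in plain pandas. `to_token` is a hypothetical helper, and the sketch does not address the IndexError.

```python
import pandas as pd


def to_token(value) -> str:
    """Render a zip code as a string token; missing values become an empty string."""
    if pd.isna(value):
        return ""
    if isinstance(value, float) and value.is_integer():
        # A column containing NaN is stored as float, so 718 arrives as 718.0.
        return str(int(value))
    return str(value)


# Illustrative data in the same layout as the example above.
df = pd.DataFrame({"name": ["john", "frank", "anna"], "zip": [1234, 718, None]})
df["zip"] = df["zip"].map(to_token)
print(df["zip"].tolist())
```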

PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'SDNN\\imputer.log'

Hello,

Thank you for this code.

I am currently using Datawig for data imputation of numerical dataset.

However, I get an error when calling the SimpleImputer.complete function:

PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'SDNN\imputer.log'

SDNN is one of the column names in my array.

This is my code:

import scipy.io

mat = scipy.io.loadmat('A_only_HRV.mat')
training_array_A = mat['A_only_HRV']
mat = scipy.io.loadmat('dataset_all_A_fixed_missing.mat')
test_array_A = mat['dataset_all_A_fixed_missing']

import pandas as pd
c = pd.DataFrame(data=test_array_A,columns=['AVNN','SDNN','RMSSD','pNN','SEM','BETA','HF_NORM','HF_Peak','HF_Power','LF_Norm','LF_Peak','LF_Power','LF/HF','Total_Power','VLF_Norm','VLF_Power','SD1','SD2','Alpha1','Alpha2','SE','PIP','IALS','PSS','PAS'])
c2 = c.astype('str')

c_train = pd.DataFrame(data=training_array_A,columns=['AVNN','SDNN','RMSSD','pNN','SEM','BETA','HF_NORM','HF_Peak','HF_Power','LF_Norm','LF_Peak','LF_Power','LF/HF','Total_Power','VLF_Norm','VLF_Power','SD1','SD2','Alpha1','Alpha2','SE','PIP','IALS','PSS','PAS'])
c_train2 = c_train.astype('str')

import datawig, numpy
from datawig import SimpleImputer

df_with_missing_imputed = datawig.SimpleImputer.complete(c)

I am attaching the needed files, I obtain them as .mat files.

files.zip

Could you please let me know why I am facing this error?
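
`[WinError 32]` means Windows refused to touch a file that some process still has open. One possible explanation (an assumption, since the thread does not confirm it) is that a `logging.FileHandler` is still holding `imputer.log` when the output directory is cleaned up. The sketch below shows, with the standard library only, how such handlers can be closed and detached. The logger name and paths are placeholders, and `close_file_handlers` is a hypothetical helper.

```python
import logging
import os
import tempfile


def close_file_handlers(logger: logging.Logger) -> list:
    """Close and detach every FileHandler on `logger`; return the released file paths."""
    closed = []
    for handler in list(logger.handlers):
        if isinstance(handler, logging.FileHandler):
            handler.close()
            logger.removeHandler(handler)
            closed.append(handler.baseFilename)
    return closed


# Demonstration with a throwaway logger and log file (illustrative names only).
workdir = tempfile.mkdtemp()
log_path = os.path.join(workdir, "imputer.log")
demo_logger = logging.getLogger("demo_imputer")
demo_logger.addHandler(logging.FileHandler(log_path))
demo_logger.warning("training finished")

released = close_file_handlers(demo_logger)
os.remove(log_path)  # works once no handler holds the file open
os.rmdir(workdir)
```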

Interpreting and loading outputs of datawig which are in module

I imputed a column using datawig, but the value it returns in the column is an object: <datawig.simple_imputer.SimpleImputer object at 0x7f5e05b40630>. How can we process this further and determine the value of this categorical data point? It is also not saved in a pickle file, so it has to be run again every time the kernel is restarted.

Leverage GPU when doing predictions instead of just training?

Hi There,

I could be wrong but it appears that the GPU is mainly used when training. When I train my model, I see the GPU speed up but when I'm doing predictions it uses a single CPU core. For my large dataset, I'm noticing it's spending more time here than the training.

Is there anything I can do to leverage the GPU for predictions as well? In the tutorial, you can recreate this by using a large dataset (1M rows x 200 columns) and running imputed = imputer.predict(df_test)

Model optimization

If I want to add attention mechanisms to optimize the model, where should I make the changes?

Onboard dev team to readthedocs

The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

When predicting missing values, I found that datawig can't predict multiple output columns at once. For example,

from datawig import Imputer, NumericalEncoder, NumericalFeaturizer

data_encoder_cols = [NumericalEncoder('a'), NumericalEncoder('c'),
                     NumericalEncoder('e'),NumericalEncoder('g'),NumericalEncoder('h')]
label_encoder_cols = [NumericalEncoder('b'),NumericalEncoder('d'),NumericalEncoder('f')]
data_featurizer_cols = [NumericalFeaturizer('a'), NumericalFeaturizer('c'), NumericalFeaturizer('e'),
                         NumericalFeaturizer('g'), NumericalFeaturizer('h')]

imputer = Imputer(
    data_featurizers=data_featurizer_cols,
    label_encoders=label_encoder_cols,
    data_encoders=data_encoder_cols,
    output_path='imputer_model1'
)

This is my code. I want to impute 'b', 'd', and 'f', but I get an error:

Traceback (most recent call last):

  File "<ipython-input-42-15a8b8acfb65>", line 1, in <module>
    runfile('E:/Python/datawig-master/1.py', wdir='E:/Python/datawig-master')

  File "D:\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 704, in runfile
    execfile(filename, namespace)

  File "D:\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 108, in execfile
    exec(compile(f.read(), filename, 'exec'), namespace)

  File "E:/Python/datawig-master/1.py", line 32, in <module>
    imputer.fit(train_df=df_train,num_epochs=10)

  File "E:\Python\datawig-master\datawig\imputer.py", line 257, in fit
    iter_train, iter_test = self.__build_iterators(train_df, test_df, test_split)

  File "E:\Python\datawig-master\datawig\imputer.py", line 564, in __build_iterators
    train_df = self.__drop_missing_labels(train_df, how='all')

  File "E:\Python\datawig-master\datawig\imputer.py", line 935, in __drop_missing_labels
    if missing_idx == -1:

  File "D:\Anaconda3\lib\site-packages\pandas\core\generic.py", line 1469, in __nonzero__
    .format(self.__class__.__name__))

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

I don't know how to solve it. I would appreciate some help.
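
The ValueError itself is a general pandas behaviour: comparing a Series with a scalar yields a Series of booleans, and using that in an `if` is ambiguous. Presumably (an assumption based on the traceback) `missing_idx` becomes a Series when several label columns are given. The sketch below reproduces the message with made-up values and shows the explicit reductions pandas asks for.

```python
import pandas as pd

# Made-up values, one entry per label column, standing in for missing_idx.
missing_idx = pd.Series({"b": -1, "d": 4, "f": -1})

# Using the element-wise comparison directly in an `if` raises the error above.
try:
    if missing_idx == -1:
        pass
    message = ""
except ValueError as err:
    message = str(err)

# An explicit reduction states what is meant.
all_equal = bool((missing_idx == -1).all())  # every entry equals -1
any_equal = bool((missing_idx == -1).any())  # at least one entry equals -1
print(message)
print(all_equal, any_equal)
```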

Add guide for contributing

Write a markdown doc detailing how to contribute.

datawig ignore manually set logging levels for mxnet

This used to work in previous versions but has no effect now

from datawig.utils import logger
logger.setLevel("ERROR")
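
In Python's logging module a record is filtered first by the level of the logger it is emitted on and then by the level of each handler, so changing a single logger does not necessarily silence output emitted through other loggers. Below is a sketch using only the standard library. Which logger names MXNet and DataWig actually use is not stated here, so `"demo_library"` is a placeholder, and `quiet` is a hypothetical helper.

```python
import logging


def quiet(logger_names, level=logging.ERROR):
    """Raise the threshold on each named logger and on the handlers attached to it."""
    for name in logger_names:
        logger = logging.getLogger(name)
        logger.setLevel(level)
        for handler in logger.handlers:
            handler.setLevel(level)


# "" selects the root logger; "demo_library" is a placeholder logger name.
quiet(["", "demo_library"])
```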

Error running simpleimputer_intro.py in the example

When I ran simpleimputer_intro.py from the examples, the following error occurred:
Traceback (most recent call last):
  File "/Users/chen/PycharmProjects/test2/datamissing/examples/simpleimputer_intro.py", line 41, in <module>
    predictions = imputer.predict(df_test)
  File "/usr/local/lib/python3.7/site-packages/datawig/simple_imputer.py", line 420, in predict
    score_suffix, inplace=inplace)
  File "/usr/local/lib/python3.7/site-packages/datawig/imputer.py", line 822, in predict
    if data_frame.columns.contains(imputation_col):
AttributeError: 'Index' object has no attribute 'contains'
It could be a data processing error in the predict function.
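
`Index.contains` was deprecated in pandas 0.25 and removed in pandas 1.0, so this error suggests a DataWig release written against older pandas running with a newer pandas. The replacement is the `in` operator. Here is a small sketch with made-up column names that reuses the variable names from the traceback.

```python
import pandas as pd

# Illustrative frame; only the column names matter here.
data_frame = pd.DataFrame({"label": ["a", "b"], "label_imputed": ["a", "c"]})
imputation_col = "label_imputed"

# Membership test that works on both old and new pandas versions.
has_column = imputation_col in data_frame.columns
has_other = "not_there" in data_frame.columns
print(has_column, has_other)
```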

Question: When assigning a numeric variable

Why do we get cross-entropy and accuracy logs when we impute a numeric variable?
I got a continuous value as the imputed value, so I'm wondering why these metrics appear.

2021-02-19 19:20:06,854 [INFO]  NumExpr defaulting to 8 threads.
2021-02-19 19:24:50,957 [INFO]  
========== start: fit model
2021-02-19 19:24:50,957 [WARNING]  Already bound, ignoring bind()
2021-02-19 19:25:18,406 [INFO]  Epoch[0] Batch [0-6639]	Speed: 3870.89 samples/sec	cross-entropy=2.836561	total_votes-accuracy=0.000000
2021-02-19 19:25:45,889 [INFO]  Epoch[0] Train-cross-entropy=2.407569
2021-02-19 19:25:45,890 [INFO]  Epoch[0] Train-total_votes-accuracy=0.000000
2021-02-19 19:25:45,891 [INFO]  Epoch[0] Time cost=54.932
2021-02-19 19:25:45,893 [INFO]  Saved checkpoint to "imputer_model/model-0000.params"
2021-02-19 19:25:50,622 [INFO]  Epoch[0] Validation-cross-entropy=2.897565
2021-02-19 19:25:50,623 [INFO]  Epoch[0] Validation-total_votes-accuracy=0.000000
2021-02-19 19:26:19,453 [INFO]  Epoch[1] Batch [0-6639]	Speed: 3685.39 samples/sec	cross-entropy=2.033162	total_votes-accuracy=0.000000
2021-02-19 19:26:46,887 [INFO]  Epoch[1] Train-cross-entropy=1.915317
2021-02-19 19:26:46,888 [INFO]  Epoch[1] Train-total_votes-accuracy=0.000000
2021-02-19 19:26:46,888 [INFO]  Epoch[1] Time cost=56.265
2021-02-19 19:26:46,890 [INFO]  Saved checkpoint to "imputer_model/model-0001.params"
2021-02-19 19:26:51,626 [INFO]  Epoch[1] Validation-cross-entropy=2.187781
2021-02-19 19:26:51,626 [INFO]  Epoch[1] Validation-total_votes-accuracy=0.000000
2021-02-19 19:27:19,355 [INFO]  Epoch[2] Batch [0-6639]	Speed: 3831.58 samples/sec	cross-entropy=1.926377	total_votes-accuracy=0.000000
2021-02-19 19:27:46,809 [INFO]  Epoch[2] Train-cross-entropy=1.839549
2021-02-19 19:27:46,810 [INFO]  Epoch[2] Train-total_votes-accuracy=0.000000
2021-02-19 19:27:46,810 [INFO]  Epoch[2] Time cost=55.183
2021-02-19 19:27:46,813 [INFO]  Saved checkpoint to "imputer_model/model-0002.params"
2021-02-19 19:27:51,539 [INFO]  Epoch[2] Validation-cross-entropy=2.001703
2021-02-19 19:27:51,540 [INFO]  Epoch[2] Validation-total_votes-accuracy=0.000000
2021-02-19 19:28:19,027 [INFO]  Epoch[3] Batch [0-6639]	Speed: 3865.17 samples/sec	cross-entropy=1.891239	total_votes-accuracy=0.000000
2021-02-19 19:28:46,633 [INFO]  Epoch[3] Train-cross-entropy=1.813997
2021-02-19 19:28:46,634 [INFO]  Epoch[3] Train-total_votes-accuracy=0.000000
2021-02-19 19:28:46,634 [INFO]  Epoch[3] Time cost=55.094
2021-02-19 19:28:46,637 [INFO]  Saved checkpoint to "imputer_model/model-0003.params"
2021-02-19 19:28:51,367 [INFO]  Epoch[3] Validation-cross-entropy=1.956922
2021-02-19 19:28:51,368 [INFO]  Epoch[3] Validation-total_votes-accuracy=0.000000
2021-02-19 19:29:18,846 [INFO]  Epoch[4] Batch [0-6639]	Speed: 3866.56 samples/sec	cross-entropy=1.873258	total_votes-accuracy=0.000000
2021-02-19 19:29:46,276 [INFO]  Epoch[4] Train-cross-entropy=1.806516
2021-02-19 19:29:46,277 [INFO]  Epoch[4] Train-total_votes-accuracy=0.000000
2021-02-19 19:29:46,277 [INFO]  Epoch[4] Time cost=54.909
2021-02-19 19:29:46,279 [INFO]  Saved checkpoint to "imputer_model/model-0004.params"
2021-02-19 19:29:51,008 [INFO]  Epoch[4] Validation-cross-entropy=1.971730
2021-02-19 19:29:51,009 [INFO]  Epoch[4] Validation-total_votes-accuracy=0.000000
2021-02-19 19:30:18,524 [INFO]  Epoch[5] Batch [0-6639]	Speed: 3861.23 samples/sec	cross-entropy=1.869224	total_votes-accuracy=0.000000
2021-02-19 19:30:46,013 [INFO]  Epoch[5] Train-cross-entropy=1.799461
2021-02-19 19:30:46,014 [INFO]  Epoch[5] Train-total_votes-accuracy=0.000000
2021-02-19 19:30:46,014 [INFO]  Epoch[5] Time cost=55.005
2021-02-19 19:30:46,016 [INFO]  Saved checkpoint to "imputer_model/model-0005.params"
2021-02-19 19:30:50,742 [INFO]  Epoch[5] Validation-cross-entropy=1.914169
2021-02-19 19:30:50,743 [INFO]  Epoch[5] Validation-total_votes-accuracy=0.000000
2021-02-19 19:31:18,287 [INFO]  Epoch[6] Batch [0-6639]	Speed: 3857.19 samples/sec	cross-entropy=1.848747	total_votes-accuracy=0.000000
2021-02-19 19:31:45,981 [INFO]  Epoch[6] Train-cross-entropy=1.785161
2021-02-19 19:31:45,981 [INFO]  Epoch[6] Train-total_votes-accuracy=0.000000
2021-02-19 19:31:45,982 [INFO]  Epoch[6] Time cost=55.238
2021-02-19 19:31:45,984 [INFO]  Saved checkpoint to "imputer_model/model-0006.params"
2021-02-19 19:31:50,726 [INFO]  Epoch[6] Validation-cross-entropy=1.864254
2021-02-19 19:31:50,727 [INFO]  Epoch[6] Validation-total_votes-accuracy=0.000000
2021-02-19 19:32:18,360 [INFO]  Epoch[7] Batch [0-6639]	Speed: 3844.81 samples/sec	cross-entropy=1.842331	total_votes-accuracy=0.000000
2021-02-19 19:32:45,956 [INFO]  Epoch[7] Train-cross-entropy=1.781625
2021-02-19 19:32:45,957 [INFO]  Epoch[7] Train-total_votes-accuracy=0.000000
2021-02-19 19:32:45,957 [INFO]  Epoch[7] Time cost=55.230
2021-02-19 19:32:45,961 [INFO]  Saved checkpoint to "imputer_model/model-0007.params"
2021-02-19 19:32:50,694 [INFO]  Epoch[7] Validation-cross-entropy=1.862272
2021-02-19 19:32:50,695 [INFO]  Epoch[7] Validation-total_votes-accuracy=0.000000
2021-02-19 19:33:18,318 [INFO]  Epoch[8] Batch [0-6639]	Speed: 3846.10 samples/sec	cross-entropy=1.836069	total_votes-accuracy=0.000000
2021-02-19 19:33:45,916 [INFO]  Epoch[8] Train-cross-entropy=1.777847
2021-02-19 19:33:45,917 [INFO]  Epoch[8] Train-total_votes-accuracy=0.000000
2021-02-19 19:33:45,917 [INFO]  Epoch[8] Time cost=55.222
2021-02-19 19:33:45,919 [INFO]  Saved checkpoint to "imputer_model/model-0008.params"
2021-02-19 19:33:50,644 [INFO]  Epoch[8] Validation-cross-entropy=1.833026
2021-02-19 19:33:50,645 [INFO]  Epoch[8] Validation-total_votes-accuracy=0.000000
2021-02-19 19:34:18,208 [INFO]  Epoch[9] Batch [0-6639]	Speed: 3854.56 samples/sec	cross-entropy=1.833520	total_votes-accuracy=0.000000
2021-02-19 19:34:45,896 [INFO]  Epoch[9] Train-cross-entropy=1.776226
2021-02-19 19:34:45,897 [INFO]  Epoch[9] Train-total_votes-accuracy=0.000000
2021-02-19 19:34:45,897 [INFO]  Epoch[9] Time cost=55.252
2021-02-19 19:34:45,900 [INFO]  Saved checkpoint to "imputer_model/model-0009.params"
2021-02-19 19:34:50,627 [INFO]  Epoch[9] Validation-cross-entropy=1.813570
2021-02-19 19:34:50,628 [INFO]  Epoch[9] Validation-total_votes-accuracy=0.000000
2021-02-19 19:35:18,287 [INFO]  Epoch[10] Batch [0-6639]	Speed: 3841.24 samples/sec	cross-entropy=1.830642	total_votes-accuracy=0.000000
2021-02-19 19:35:45,914 [INFO]  Epoch[10] Train-cross-entropy=1.778353
2021-02-19 19:35:45,915 [INFO]  Epoch[10] Train-total_votes-accuracy=0.000000
2021-02-19 19:35:45,915 [INFO]  Epoch[10] Time cost=55.287
2021-02-19 19:35:45,917 [INFO]  Saved checkpoint to "imputer_model/model-0010.params"
2021-02-19 19:35:50,653 [INFO]  Epoch[10] Validation-cross-entropy=1.804272
2021-02-19 19:35:50,654 [INFO]  Epoch[10] Validation-total_votes-accuracy=0.000000
2021-02-19 19:36:18,289 [INFO]  Epoch[11] Batch [0-6639]	Speed: 3844.46 samples/sec	cross-entropy=1.830434	total_votes-accuracy=0.000000
2021-02-19 19:36:45,936 [INFO]  Epoch[11] Train-cross-entropy=1.775856
2021-02-19 19:36:45,937 [INFO]  Epoch[11] Train-total_votes-accuracy=0.000000
2021-02-19 19:36:45,937 [INFO]  Epoch[11] Time cost=55.283
2021-02-19 19:36:45,940 [INFO]  Saved checkpoint to "imputer_model/model-0011.params"
2021-02-19 19:36:50,669 [INFO]  Epoch[11] Validation-cross-entropy=1.835253
2021-02-19 19:36:50,670 [INFO]  Epoch[11] Validation-total_votes-accuracy=0.000000
2021-02-19 19:37:18,301 [INFO]  Epoch[12] Batch [0-6639]	Speed: 3845.06 samples/sec	cross-entropy=1.821207	total_votes-accuracy=0.000000
2021-02-19 19:37:47,084 [INFO]  Epoch[12] Train-cross-entropy=1.769923
2021-02-19 19:37:47,085 [INFO]  Epoch[12] Train-total_votes-accuracy=0.000000
2021-02-19 19:37:47,085 [INFO]  Epoch[12] Time cost=56.415
2021-02-19 19:37:47,087 [INFO]  Saved checkpoint to "imputer_model/model-0012.params"
2021-02-19 19:37:51,819 [INFO]  No improvement detected for 3 epochs compared to 1.8135704350806672 last error obtained: 1.8220642169074315, stopping here
2021-02-19 19:37:51,820 [INFO]  
========== done (780.864077091217 s) fit model
/home/ec2-user/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/datawig/calibration.py:92: RuntimeWarning: invalid value encountered in log
  return np.log(probas)
/home/ec2-user/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/datawig/calibration.py:59: RuntimeWarning: invalid value encountered in greater_equal
  bin_mask = (top_probas >= bin_lower) & (top_probas < bin_upper)
/home/ec2-user/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/datawig/calibration.py:59: RuntimeWarning: invalid value encountered in less
  bin_mask = (top_probas >= bin_lower) & (top_probas < bin_upper)
```

Segment 11 warning

Unable to run DataWig code in Google Colab

Add explainability

To introspect the learned parameters of the imputer:

add a method for explaining a given class (def explain('class'))
add a method for explaining an instance (def explain_instance(sample))

For now, a simple univariate measure of feature-label covariance should suffice (link).
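The proposal above could be prototyped roughly as follows. This is a minimal standalone sketch, not DataWig's API: the helper `token_label_covariances` and the whitespace tokenization are assumptions made for illustration, and `explain` / `explain_instance` only mirror the method names suggested in the issue. For binary indicators, the covariance between a token and a label is E[xy] - E[x]E[y].

```python
from collections import Counter


def token_label_covariances(texts, labels):
    """Covariance between each token's presence and each label (hypothetical helper)."""
    n = len(texts)
    # One set of lowercase whitespace tokens per row (assumed tokenization).
    token_sets = [set(text.lower().split()) for text in texts]
    token_count = Counter(tok for toks in token_sets for tok in toks)
    label_count = Counter(labels)
    joint = Counter(
        (tok, lab) for toks, lab in zip(token_sets, labels) for tok in toks
    )
    cov = {}
    for lab in label_count:
        cov[lab] = {}
        for tok in token_count:
            # cov(x, y) = E[xy] - E[x] * E[y] for 0/1 indicator variables
            cov[lab][tok] = (
                joint[(tok, lab)] / n
                - (token_count[tok] / n) * (label_count[lab] / n)
            )
    return cov


def explain(cov, label, k=3):
    """Top-k tokens most positively associated with a class."""
    ranked = sorted(cov[label].items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:k]


def explain_instance(cov, text, label):
    """Rank the tokens of one sample by their covariance with a class."""
    tokens = set(text.lower().split())
    scored = [(tok, cov[label].get(tok, 0.0)) for tok in tokens]
    return sorted(scored, key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    texts = ["running shoe", "leather shoe", "summer dress", "yellow dress"]
    labels = ["Shoe", "Shoe", "Dress", "Dress"]
    cov = token_label_covariances(texts, labels)
    print(explain(cov, "Shoe", k=2))
    print(explain_instance(cov, "running shoe", "Shoe"))
```

A univariate measure like this ignores interactions between features, which matches the "for now" scope of the request.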

WinError32 similar to #127

Hi, I am using Jupyter Notebook (through Anaconda) and SimpleImputer.complete. I am running Anaconda as an administrator.

The error arises at shutil.rmtree(output_col), and the stack trace eventually calls os.unlink(fullname) in _rmtree_unsafe.

PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'SOME_PATH\\imputer.log'

I cannot delete imputer.log manually until I restart the notebook; after the restart, deleting it works.
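One possible workaround, sketched under the assumption that the lock comes from a Python `logging.FileHandler` that still holds `imputer.log` open in the notebook process (the issue does not confirm this): close any such handler before removing the directory. The helper name `close_file_handlers` and the logger name `demo_imputer` are hypothetical; only standard-library calls are used.

```python
import logging
import os
import shutil
import tempfile


def close_file_handlers(filename_suffix="imputer.log"):
    """Close and detach every FileHandler writing to a matching file.

    Hypothetical helper: returns the number of handlers that were closed.
    """
    loggers = [logging.getLogger()] + [
        lg for lg in logging.Logger.manager.loggerDict.values()
        if isinstance(lg, logging.Logger)  # skip PlaceHolder entries
    ]
    closed = 0
    for logger in loggers:
        for handler in list(logger.handlers):
            if (isinstance(handler, logging.FileHandler)
                    and handler.baseFilename.endswith(filename_suffix)):
                handler.close()  # releases the OS file handle
                logger.removeHandler(handler)
                closed += 1
    return closed


# Demo: reproduce an open log file inside a directory, then clean up.
workdir = tempfile.mkdtemp()
log_path = os.path.join(workdir, "imputer.log")
demo_logger = logging.getLogger("demo_imputer")
demo_logger.addHandler(logging.FileHandler(log_path))
demo_logger.warning("training finished")

closed = close_file_handlers("imputer.log")
shutil.rmtree(workdir)  # on Windows this fails while the handle is open
```

If the file is held by something other than a logging handler, this helper closes nothing and the error would persist.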


datawig's Issues

Thanks for sharing this amazing project. I tried to run the first example provided in the README, but calling `import datawig` just raises an error:

AttributeError: module 'mxnet' has no attribute 'random'
