midasverse / midaspy

Python package for missing-data imputation with deep learning

License: Apache License 2.0

Python 100.00%
deep-learning imputation-methods neural-network python tensorflow

midaspy's People

Contributors

david-woroniuk, edvinskis, jackewiebohne, oracen, ranjitlall, tsrobinson

midaspy's Issues

Results are not perfectly consistent

I tried running the Python example notebook and noticed that the final loss changes slightly from run to run (e.g., from 73446.1 to 73355.3) despite setting the same seed. Is this due to unaccounted-for randomness in the algorithm, or just to rounding?

Another question: does it generally make a difference to scale the continuous data before passing it to the algorithm? I assumed the answer is no because scaling is done internally anyway; however, I noticed that the data was explicitly scaled in the R example but not in the Python example.
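For reference, explicit scaling of continuous columns before imputation could look like the sketch below. It assumes pandas and NumPy only; the helper name `minmax_scale_columns` and the toy columns are made up for illustration, and it says nothing about what MIDASpy does internally.

```python
import numpy as np
import pandas as pd


def minmax_scale_columns(df, columns):
    """Rescale the given columns to [0, 1]; missing cells stay NaN."""
    scaled = df.copy()
    col_min = df[columns].min()  # NaN values are skipped by default
    span = (df[columns].max() - col_min).replace(0, 1.0)  # guard constant columns
    scaled[columns] = (df[columns] - col_min) / span
    return scaled, col_min, span


# Toy data, not taken from the example notebook.
data = pd.DataFrame({"age": [20.0, np.nan, 60.0], "income": [10.0, 30.0, 50.0]})
scaled, col_min, span = minmax_scale_columns(data, ["age", "income"])
```

Keeping `col_min` and `span` makes it possible to map imputed values back to the original units afterwards with `value * span + col_min`.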

Deprecation warnings to fix

Getting the following warning during the training cycle:

FutureWarning: Passing a dict as an indexer is deprecated and will raise in a future version. Use a list instead.
  data_1 = data[subset]

We should update to future-proof asap.
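A minimal sketch of the change the warning asks for, assuming `subset` is a dict whose keys are column names (the toy frame and the dict contents here are hypothetical, not MIDASpy's actual data structures):

```python
import pandas as pd

data = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})
subset = {"a": "numeric", "c": "numeric"}  # hypothetical mapping: column name -> role

# Index with an explicit list of the keys rather than with the dict itself.
data_1 = data[list(subset)]
```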

Minimum and maximum value arguments (constraints)

I'm working with Dirichlet distributions and the compositional data simplex, and am really enjoying MIDASpy's flexibility when dealing with this data (related to K-L divergence in the decoder). However, there is a tendency to produce negative values in the numerical feature data I have been using.

In the case of compositional data, there is a constraint of zero as a minimum value. Other imputation approaches allow setting maximum and minimum value arguments (e.g., Scikit-Learn) and importantly these can be set per feature (autoimpute). Is this an argument which could be added to the package? It would be a major help to people working in several disciplines.
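Until such an argument exists, one possible workaround is to clip the imputed values after the fact. The sketch below is a post-processing step written with pandas, not a MIDASpy feature; `clip_imputed`, the `bounds` dict, and the toy columns are assumptions for illustration.

```python
import numpy as np
import pandas as pd


def clip_imputed(original, imputed, bounds):
    """Clip imputed cells to per-feature (min, max) bounds.

    Only cells that were missing in `original` are touched; observed
    values are left exactly as they were. Use None for an open bound.
    """
    result = imputed.copy()
    for column, (low, high) in bounds.items():
        missing = original[column].isna()
        result.loc[missing, column] = imputed.loc[missing, column].clip(lower=low, upper=high)
    return result


# Toy data standing in for one completed dataset.
original = pd.DataFrame({"sand": [0.6, np.nan], "clay": [np.nan, 0.3]})
imputed = pd.DataFrame({"sand": [0.6, -0.05], "clay": [1.2, 0.3]})
clipped = clip_imputed(original, imputed, {"sand": (0.0, None), "clay": (0.0, 1.0)})
```

Note that clipping changes the distribution of the imputed values, so for compositional data the rows may also need to be rescaled to their total afterwards.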

Error with multiple GPUs: Do not use tf.reset_default_graph() to clear nested graphs

I am trying to utilize two GPUs with MIDASpy. However, I get the following error during set-up:

from sklearn.preprocessing import MinMaxScaler
import numpy as np
import pandas as pd
import tensorflow as tf
import MIDASpy as md

data_0 = pd.read_csv('/home/comp/Documents/file.txt', sep = "\t")
data_0.columns = data_0.columns.str.strip()

data_0 = data_0.set_index('Unnamed: 0')
data_0.index.names = [None]

np.random.seed(441)

na_loc = data_0.isnull()
data_0[na_loc] = np.nan

imputer = md.Midas(layer_structure= [256, 256, 256],
                   learn_rate= 1e-4,
                   input_drop= 0.9,
                   train_batch = 50,
                   savepath= '/home/comp/Documents/save',
                   seed= 89)

strategy = tf.distribute.MirroredStrategy()

with strategy.scope():
    imputer.build_model(data_0)

AssertionError: Do not use tf.reset_default_graph() to clear nested graphs. If you need a cleared graph, exit the nesting and create a new graph.

How to reverse One hot encoding

Hello,

How can I get the data back in its original form (i.e., reverse the dummy coding)? We receive the imputed dataset in one-hot encoded form, but how do we convert it back into the original dataset with its categorical columns?
Thank you
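One generic way to do this with pandas is to pick, for each row, the dummy column with the highest value. The sketch below assumes the dummy columns are named `<variable>_<level>` (as `pd.get_dummies` produces); the helper `reverse_one_hot` and the toy data are illustrative and not part of MIDASpy.

```python
import pandas as pd


def reverse_one_hot(df, prefix, sep="_"):
    """Collapse columns named '<prefix><sep><level>' into one categorical column."""
    dummy_cols = [c for c in df.columns if c.startswith(prefix + sep)]
    # idxmax returns the name of the column holding the largest value in each row.
    labels = df[dummy_cols].idxmax(axis=1).str[len(prefix + sep):]
    return df.drop(columns=dummy_cols).assign(**{prefix: labels})


# Toy imputed output: predicted class probabilities for a "colour" variable.
imputed = pd.DataFrame({
    "age": [30, 41],
    "colour_red": [0.9, 0.2],
    "colour_blue": [0.1, 0.8],
})
restored = reverse_one_hot(imputed, "colour")
```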

Pyplot rewrite

It would be useful to simplify the pyplot usage in the overimpute function to remove interactive plotting -- the ideal behaviour is a single plot at the end of imputation, using "agg" if possible.
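A rough sketch of that behaviour, assuming matplotlib is installed: select the non-interactive Agg backend and write a single figure to disk once training has finished. The function name `save_final_plot` and the numbers are placeholders, not the package's actual plotting code.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # select the non-interactive backend before importing pyplot
import matplotlib.pyplot as plt


def save_final_plot(epochs, errors, path):
    """Draw one summary figure after training and write it to disk."""
    fig, ax = plt.subplots()
    ax.plot(epochs, errors, marker="o", label="overimputation error")
    ax.set_xlabel("Training epochs")
    ax.set_ylabel("Error")
    ax.legend()
    fig.savefig(path)
    plt.close(fig)  # release the figure so repeated calls do not accumulate memory
    return path


# Illustrative numbers only, not real MIDASpy output.
out_path = save_final_plot([10, 20, 30], [0.9, 0.6, 0.5],
                           os.path.join(tempfile.mkdtemp(), "overimpute.png"))
```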

Improve TensorFlow 2.X compatibility

Current behaviour allows MIDASpy to be loaded when using TF 2.X, but returns a logging error to inform users that imputation is only possible in TF 1.X.

Looks like all TF1 components can be updated to TF 2.X -- just requires additional tensorflow-addons package dependency for the AdamW optimiser.

Heuristics on choosing a model structure

Hi,

I was wondering if there were any heuristics for choosing a model structure for different types / sizes of datasets. For instance, if I had a standard corporate dataset with 20,000 rows and 15 columns, are there any sure-fire methods / parameters I should be using? Are there any clear dos or don'ts in certain situations?

values not imputed

I'm essentially running the demo code, but with my own input data (all numeric data), and the data frames generated by imputer.generate_samples(m=10).output_list still have the same missing values as in the input.

Example input table:

Feature        feat1  feat2  feat3  ...  feat30  feat31  feat32
ERS2551628      65.0    0.0  101.0  ...   105.0   230.0    27.0
SRS143466       43.0    NaN   34.0  ...    98.0     0.0    26.0
SRS023715        0.0   54.0    0.0  ...    33.0    55.0     NaN
SRS580227        0.0    0.0   10.0  ...    67.0    22.0     0.0
DRS091214   327457.0    0.0    NaN  ...     NaN     0.0    24.0
...              ...    ...    ...  ...     ...     ...     ...
ERS2551594      74.0   15.0   21.0  ...    93.0    40.0     0.0
ERS634957        0.0   12.0    0.0  ...     0.0    45.0     0.0
DRS087574        0.0   80.0   43.0  ...   209.0     NaN    12.0
ERS634952       33.0   56.0   11.0  ...     NaN  1032.0     0.0
SRS1820544      49.0  102.0   12.0  ...    13.0    27.0    49.0

...and the output:

Feature        feat1  feat2  feat3  ...  feat30  feat31  feat32
ERS2551628      65.0    0.0  101.0  ...   105.0   230.0    27.0
SRS143466       43.0    NaN   34.0  ...    98.0     0.0    26.0
SRS023715        0.0   54.0    0.0  ...    33.0    55.0     NaN
SRS580227        0.0    0.0   10.0  ...    67.0    22.0     0.0
DRS091214   327457.0    0.0    NaN  ...     NaN     0.0    24.0
...              ...    ...    ...  ...     ...     ...     ...
ERS2551594      74.0   15.0   21.0  ...    93.0    40.0     0.0
ERS634957        0.0   12.0    0.0  ...     0.0    45.0     0.0
DRS087574        0.0   80.0   43.0  ...   209.0     NaN    12.0
ERS634952       33.0   56.0   11.0  ...     NaN  1032.0     0.0
SRS1820544      49.0  102.0   12.0  ...    13.0    27.0    49.0

Any idea on why the missing values are not imputed?

conda env

# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                 conda_forge    conda-forge
_openmp_mutex             4.5                       1_gnu    conda-forge
_tflow_select             2.3.0                       mkl
absl-py                   0.15.0                   pypi_0    pypi
aiohttp                   3.8.1            py39h3811e60_0    conda-forge
aiosignal                 1.2.0              pyhd8ed1ab_0    conda-forge
astor                     0.8.1              pyh9f0ad1d_0    conda-forge
astunparse                1.6.3              pyhd8ed1ab_0    conda-forge
async-timeout             4.0.2              pyhd8ed1ab_0    conda-forge
attrs                     21.4.0             pyhd8ed1ab_0    conda-forge
blas                      1.1                    openblas    conda-forge
blinker                   1.4                        py_1    conda-forge
brotlipy                  0.7.0           py39h3811e60_1003    conda-forge
bzip2                     1.0.8                h7f98852_4    conda-forge
c-ares                    1.18.1               h7f98852_0    conda-forge
ca-certificates           2021.10.26           h06a4308_2
cachetools                4.2.4              pyhd8ed1ab_0    conda-forge
certifi                   2021.10.8        py39hf3d152e_1    conda-forge
cffi                      1.15.0           py39h4bc2ebd_0    conda-forge
charset-normalizer        2.0.9              pyhd8ed1ab_0    conda-forge
click                     8.0.3            py39hf3d152e_1    conda-forge
cryptography              36.0.0           py39h9ce1e76_0
cycler                    0.11.0             pyhd8ed1ab_0    conda-forge
dataclasses               0.8                pyhc8e2a94_3    conda-forge
flatbuffers               1.12                     pypi_0    pypi
freetype                  2.11.0               h70c0345_0
frozenlist                1.2.0            py39h3811e60_1    conda-forge
gast                      0.3.3                    pypi_0    pypi
google-auth               1.35.0                   pypi_0    pypi
google-auth-oauthlib      0.4.1                      py_2    conda-forge
google-pasta              0.2.0              pyh8c360ce_0    conda-forge
grpcio                    1.32.0                   pypi_0    pypi
h5py                      2.10.0          nompi_py39h98ba4bc_106    conda-forge
hdf5                      1.10.6          nompi_h3c11f04_101    conda-forge
idna                      3.3                pyhd3eb1b0_0
importlib-metadata        4.10.0           py39hf3d152e_0    conda-forge
jbig                      2.1               h7f98852_2003    conda-forge
joblib                    1.1.0                    pypi_0    pypi
jpeg                      9d                   h516909a_0    conda-forge
keras-preprocessing       1.1.2              pyhd8ed1ab_0    conda-forge
kiwisolver                1.3.2            py39h1a9c180_1    conda-forge
lcms2                     2.12                 hddcbb42_0    conda-forge
ld_impl_linux-64          2.36.1               hea4e1c9_2    conda-forge
lerc                      3.0                  h9c3ff4c_0    conda-forge
libblas                   3.9.0           1_h6e990d7_netlib    conda-forge
libcblas                  3.9.0           3_h893e4fe_netlib    conda-forge
libdeflate                1.8                  h7f98852_0    conda-forge
libffi                    3.4.2                h7f98852_5    conda-forge
libgcc-ng                 11.2.0              h1d223b6_11    conda-forge
libgfortran-ng            7.5.0               h14aa051_19    conda-forge
libgfortran4              7.5.0               h14aa051_19    conda-forge
libgomp                   11.2.0              h1d223b6_11    conda-forge
liblapack                 3.9.0           3_h893e4fe_netlib    conda-forge
libnsl                    2.0.0                h7f98852_0    conda-forge
libopenblas               0.3.13               h4367d64_0
libpng                    1.6.37               hed695b0_2    conda-forge
libprotobuf               3.19.2               h780b84a_0    conda-forge
libstdcxx-ng              11.2.0              he4da1e4_11    conda-forge
libtiff                   4.3.0                h6f004c6_2    conda-forge
libuuid                   2.32.1            h14c3975_1000    conda-forge
libwebp-base              1.2.1                h7f98852_0    conda-forge
libzlib                   1.2.11            h36c2ea0_1013    conda-forge
lz4-c                     1.9.3                h9c3ff4c_1    conda-forge
markdown                  3.3.6              pyhd8ed1ab_0    conda-forge
matplotlib                3.3.2                         0    conda-forge
matplotlib-base           3.3.2            py39h98787fa_1    conda-forge
midaspy                   1.2.1                    pypi_0    pypi
multidict                 5.2.0            py39h3811e60_1    conda-forge
ncurses                   6.2                  h58526e2_4    conda-forge
numpy                     1.19.5                   pypi_0    pypi
oauthlib                  3.1.1              pyhd8ed1ab_0    conda-forge
olefile                   0.46               pyh9f0ad1d_1    conda-forge
openblas                  0.3.4             h9ac9557_1000    conda-forge
openjpeg                  2.4.0                hb52868f_1    conda-forge
openssl                   3.0.0                h7f98852_2    conda-forge
opt_einsum                3.3.0              pyhd8ed1ab_1    conda-forge
pandas                    1.3.5            py39hde0f152_0    conda-forge
patsy                     0.5.2              pyhd8ed1ab_0    conda-forge
pillow                    8.4.0            py39ha612740_0    conda-forge
pip                       21.3.1             pyhd8ed1ab_0    conda-forge
protobuf                  3.19.2           py39he80948d_0    conda-forge
pyasn1                    0.4.8                      py_0    conda-forge
pyasn1-modules            0.2.8                      py_0
pycparser                 2.21               pyhd8ed1ab_0    conda-forge
pyjwt                     2.3.0              pyhd8ed1ab_1    conda-forge
pyopenssl                 21.0.0             pyhd8ed1ab_0    conda-forge
pyparsing                 3.0.6              pyhd8ed1ab_0    conda-forge
pysocks                   1.7.1            py39hf3d152e_4    conda-forge
python                    3.9.9           h543edf9_0_cpython    conda-forge
python-dateutil           2.8.2              pyhd8ed1ab_0    conda-forge
python_abi                3.9                      2_cp39    conda-forge
pytz                      2021.3             pyhd8ed1ab_0    conda-forge
pyu2f                     0.1.5              pyhd8ed1ab_0    conda-forge
readline                  8.1                  h46c0cb4_0    conda-forge
requests                  2.27.0             pyhd8ed1ab_0    conda-forge
requests-oauthlib         1.3.0              pyh9f0ad1d_0    conda-forge
rsa                       4.8                pyhd8ed1ab_0    conda-forge
scikit-learn              1.0.2                    pypi_0    pypi
scipy                     1.7.1            py39hc65b3f8_2
setuptools                60.2.0           py39hf3d152e_0    conda-forge
six                       1.15.0                   pypi_0    pypi
sqlite                    3.37.0               h9cd32fc_0    conda-forge
statsmodels               0.13.1           py39hce5d2b2_0    conda-forge
tensorboard               2.6.0                      py_0
tensorboard-data-server   0.6.1                    pypi_0    pypi
tensorboard-plugin-wit    1.8.1              pyhd8ed1ab_0    conda-forge
tensorflow                2.4.1           mkl_py39h4683426_0
tensorflow-addons         0.15.0                   pypi_0    pypi
tensorflow-base           2.4.1           mkl_py39h43e0292_0
tensorflow-estimator      2.4.0                    pypi_0    pypi
termcolor                 1.1.0                      py_2    conda-forge
threadpoolctl             3.0.0                    pypi_0    pypi
tk                        8.6.11               h27826a3_1    conda-forge
tornado                   6.1              py39h3811e60_2    conda-forge
typeguard                 2.13.3                   pypi_0    pypi
typing-extensions         3.7.4.3                  pypi_0    pypi
tzdata                    2021e                he74cb21_0    conda-forge
urllib3                   1.26.7             pyhd8ed1ab_0    conda-forge
werkzeug                  2.0.2              pyhd3eb1b0_0
wheel                     0.37.1             pyhd8ed1ab_0    conda-forge
wrapt                     1.12.1                   pypi_0    pypi
xz                        5.2.5                h516909a_1    conda-forge
yarl                      1.7.2            py39h3811e60_1    conda-forge
zipp                      3.6.0              pyhd8ed1ab_0    conda-forge
zlib                      1.2.11            h36c2ea0_1013    conda-forge
zstd                      1.5.1                ha95c52a_0    conda-forge

Optimizing MIDAS on very large/complex datasets

On very large datasets (~30,000 samples x 1,000,000 features) with complex relationships (e.g. cancer omics data), MIDAS can take a very long time to run (days?), even on a single GPU. However, I would like to take advantage of the 'overimpute' feature for hyperparameter tuning. The runtime makes this prohibitive, since this very useful feature runs the algorithm multiple times to evaluate various settings.

Would random downsampling of samples (columns) and/or features (rows) generalize the optimal hyperparameters to the larger dataset? For instance, a random subset of 500-1,000 samples with 5,000-10,000 features. This would be to specifically determine the optimal number of: nodes, layers, learning rate, and training epochs. I would think batch size (which can speed up training) is a function of the dataset size, so this would not generalize.
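For concreteness, drawing such a random subset could be sketched as follows with pandas and NumPy. The helper `random_subset` and the toy frame sizes are assumptions; the sketch only shows how to draw the subset and does not address whether the tuned hyperparameters transfer to the full data.

```python
import numpy as np
import pandas as pd


def random_subset(df, n_rows, n_cols, seed=0):
    """Return a reproducible random subset of rows and columns."""
    rng = np.random.default_rng(seed)
    rows = rng.choice(df.shape[0], size=min(n_rows, df.shape[0]), replace=False)
    cols = rng.choice(df.shape[1], size=min(n_cols, df.shape[1]), replace=False)
    # Sort the positions so the subset keeps the original row/column order.
    return df.iloc[np.sort(rows), np.sort(cols)]


# Small stand-in for the full matrix.
full = pd.DataFrame(np.arange(200.0).reshape(20, 10),
                    columns=[f"feat{i}" for i in range(10)])
small = random_subset(full, n_rows=5, n_cols=4, seed=42)
```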

Any help would be great

Impute new data using trained model.

Looking at the codebase, I could not find a function that uses the trained model to impute new data after training. There seem to be a couple of functions that could be used to do this indirectly, but I am surprised that it is not included as a separate function.

MIDASpy sometimes raises an error on import

As recommended, I have installed all the packages, but I sometimes get an error message. Interestingly, when I ran exactly the same code on another Google Colab account, I got no errors.
!pip install numpy pandas matplotlib statsmodels scipy
!pip install tensorflow==2.11
!pip install "tensorflow-addons<0.20"
!pip install MIDASpy

import numpy as np
import pandas as pd
import tensorflow as tf
from sklearn.preprocessing import MinMaxScaler
import sys
import MIDASpy as md

/usr/local/lib/python3.10/dist-packages/tensorflow_addons/utils/ensure_tf_install.py:53: UserWarning: Tensorflow Addons supports using Python ops for all Tensorflow versions above or equal to 2.9.0 and strictly below 2.12.0 (nightly versions are not supported).
The versions of TensorFlow you are currently using is 2.13.0 and is not supported.
Some things might work, some things might not.
If you were to encounter a bug, do not file an issue.
If you want to make sure you're using a tested and supported configuration, either change the TensorFlow version or the TensorFlow Addons's version.
You can find the compatibility matrix in TensorFlow Addon's readme:
https://github.com/tensorflow/addons
warnings.warn(

ImportError Traceback (most recent call last)
in <cell line: 6>()
4 from sklearn.preprocessing import MinMaxScaler
5 import sys
----> 6 import MIDASpy as md

14 frames
/usr/local/lib/python3.10/dist-packages/keras/engine/base_layer_utils.py in
22
23 from keras import backend
---> 24 from keras.dtensor import dtensor_api as dtensor
25 from keras.utils import control_flow_util
26 from keras.utils import tf_inspect

ImportError: cannot import name 'dtensor_api' from 'keras.dtensor' (/usr/local/lib/python3.10/dist-packages/keras/dtensor/__init__.py)

Overimpute legend

Related to #7, shift legend to below plotting area.

Need to account for clipping of legend when saving, and for varying numbers of items dependent on input data.
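One way to handle both points with matplotlib is to anchor the legend below the axes and save with a tight bounding box that includes the legend. This is a sketch with a hypothetical helper `plot_with_legend_below` and made-up series, not the package's plotting code.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt


def plot_with_legend_below(series, path, max_columns=3):
    """Plot each named series and place the legend under the plotting area."""
    fig, ax = plt.subplots()
    for name, values in series.items():
        ax.plot(range(len(values)), values, label=name)
    # Wrap the legend entries into rows so it copes with any number of items.
    legend = ax.legend(loc="upper center", bbox_to_anchor=(0.5, -0.12),
                       ncol=min(len(series), max_columns))
    # Passing the legend as an extra artist keeps it from being clipped on save.
    fig.savefig(path, bbox_extra_artists=(legend,), bbox_inches="tight")
    plt.close(fig)
    return legend


# Made-up series purely for illustration.
series = {f"variable {i}": [1.0 / (i + k) for k in range(1, 6)] for i in range(4)}
legend_path = os.path.join(tempfile.mkdtemp(), "legend_below.png")
legend = plot_with_legend_below(series, legend_path)
```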

VAE deprecation warning from tf.distributions

Running MIDAS using VAE leads to deprecation warning re. tf.compat.v1.distributions.

E.g.

>>> tf.compat.v1.distributions.Normal()
WARNING:tensorflow:From <stdin>:1: Normal.__init__ (from tensorflow.python.ops.distributions.normal) is deprecated and will be removed after 2019-01-01.
Instructions for updating:
The TensorFlow Distributions library has moved to TensorFlow Probability (https://github.com/tensorflow/probability). You should update all references to use `tfp.distributions` instead of `tf.distributions`.

Migrating the affected code to tfp.distributions is not straightforward, as it is not designed for the TF1 graph-oriented model. We should investigate solutions to safeguard the codebase in the medium term.

Compatibility with compositional data

Sometimes we know that a set of variables should add up to a given total. Measurements involving proportions, percentages, probabilities, concentrations are compositional data. These data occur often in household and business surveys, nutritional information for food, population surveys, biological and genetic data, etc.
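As a small illustration of the sum constraint, the standard "closure" operation rescales a vector of non-negative parts so that it adds up to a fixed total. The function below is a plain-Python sketch written for this example and is unrelated to MIDASpy's internals.

```python
def closure(parts, total=1.0):
    """Rescale non-negative parts so that they sum to `total`."""
    if any(p < 0 for p in parts):
        raise ValueError("Compositional parts must be non-negative.")
    part_sum = sum(parts)
    if part_sum == 0:
        raise ValueError("At least one part must be positive.")
    return [total * p / part_sum for p in parts]


# Raw measurements in arbitrary units, closed to proportions and to percentages.
proportions = closure([2.0, 3.0, 5.0])
percentages = closure([2.0, 3.0, 5.0], total=100.0)
```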

The complication with compositional data is that the features are inherently mathematically related, leading to spurious correlation coefficients when conventional statistical or ML approaches are applied (e.g., calculating Euclidean distance metrics). However, using K-L distance is potentially a way to avoid this issue, and so MIDAS might offer a nice deep learning solution to imputation problems involving compositional data.

However, in some preliminary experiments on classic compositional data imputation datasets, MIDASpy hasn't performed as well as I might have expected, and I was wondering if you'd be able to comment?

For example, I imposed 30% missingness at random on the 'Kola soil horizon' geochemical dataset, and compared the known vs imputed samples against each other. You can see a marked linear trend to the imputed values.

If you are interested to take a look, here is a recent paper which references the Kola datasets, along with a copy of the data:
Paper and two datasets

UnboundLocalError: local variable 'train_rng' referenced before assignment

If no seed is given when initialising the Midas object, then no seed is passed to Midas.train_model(), so the variable train_rng is left unassigned (line 748). This causes an error on line 759, where a value for train_rng is expected.

I suspect this same issue will arise in other areas where if self.seed is not None: is used without a corresponding else statement (e.g. line 1184 in Midas.over_impute()).

I suspect this can be fixed by simply adding an else statement which generates a random seed and uses this to assign a value to train_rng
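The suggested fix could look roughly like the sketch below. It uses only the standard library, and `make_train_rng` is a hypothetical helper illustrating the if/else pattern; the generator type MIDASpy actually uses for train_rng is not shown in this report.

```python
import random


def make_train_rng(seed=None):
    """Return (seed, rng), drawing a fresh seed when none was supplied."""
    if seed is not None:
        used_seed = seed
    else:
        # Fall back to an OS-provided random seed so the variable is always assigned.
        used_seed = random.SystemRandom().randrange(2**32)
    return used_seed, random.Random(used_seed)


seed, train_rng = make_train_rng()        # no seed given: still works
fixed_seed, fixed_rng = make_train_rng(89)  # explicit seed: reproducible
```

Returning the seed that was actually used also lets it be logged, so an unseeded run can be reproduced later.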

Interpreter settings:
Python 3.9

numpy~=1.22.1
pandas~=1.3.5

scipy==1.8.0
matplotlib~=3.5.1
scikit-learn~=1.0.1
tensorflow==2.8.0
keras~=2.6.0
graphviz~=0.19
MIDASpy~=1.2.1
statsmodels~=0.13.2

Train data

When I try to train on the "adult" data, this message shows up:
Error in py_call_impl(callable, dots$args, dots$keywords) :
ValueError: Imputation target contains no missing values. Please ensure missing values are encoded as type np.nan
I tried replacing the missing values with np.nan, but the same message appeared.
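If the file marks missing entries with a placeholder string such as "?" (as the commonly distributed UCI Adult files do), those cells have to be converted before the check can see them. A Python sketch with pandas follows; the toy frame is made up, and the report above comes from an R session, so the equivalent step would need to happen on the R side before the data is passed in.

```python
import numpy as np
import pandas as pd

# Toy rows imitating the placeholder style; note the leading space before "?".
raw = pd.DataFrame({
    "workclass": ["Private", " ?", "State-gov"],
    "age": [39, 50, 38],
})

# Strip stray whitespace from string columns, then turn "?" into a real NaN.
cleaned = raw.apply(
    lambda col: col.str.strip() if pd.api.types.is_string_dtype(col) else col
)
cleaned = cleaned.replace("?", np.nan)

n_missing = int(cleaned.isna().sum().sum())
```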

Use of isinstance instead of type

Firstly, a great package.

I noticed that the package uses if type(var) == float:, and thought it may be useful to modify the behaviour to be more Pythonic.

To summarise, isinstance caters for inheritance (where an instance of a derived class is an instance of a base class), while checking for equality of type does not. This instead demands identity of types and rejects instances of subclasses.

Typical Python code should support inheritance, so isinstance is preferable to comparing types directly. However, "duck typing" would be the preferred approach: attempt the operation inside a try/except block and catch the exceptions associated with an incorrect type (TypeError).
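The difference is easy to see with a subclass. In this sketch, `LayerList` is a made-up list subclass standing in for whatever a caller might pass:

```python
class LayerList(list):
    """A hypothetical list subclass a caller might pass in."""


layers = LayerList([256, 256, 256])

exact_match = type(layers) == list      # False: exact comparison rejects the subclass
instance_match = isinstance(layers, list)  # True: isinstance accepts subclasses


def total_units(layer_structure):
    """Duck-typed alternative: just try the operation and report a clear error."""
    try:
        return sum(layer_structure)
    except TypeError as exc:
        raise TypeError("Layer structure must be an iterable of numbers.") from exc


units = total_units(layers)
```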

I refer to lines 142-153, whereby the list type is evaluated:

    if type(layer_structure) == list:
      self.layer_structure = layer_structure
    else:
      raise ValueError("Layer structure must be specified within a list")

which could be achieved more elegantly using:

if not isinstance(layer_structure, list):
    raise TypeError("Layer structure must be specified within a list.")

181-187:

    if weight_decay == 'default':
      self.weight_decay = 'default'
    elif type(weight_decay) == float:
      self.weight_decay = weight_decay
    else:
      raise ValueError("Weight decay argument accepts either 'standard' (string) "\
                       "or floating point")

whereby the type (or types) could be hinted to the user within the init dunder method, and can be evaluated through:

if isinstance(weight_decay, str):
    if weight_decay != 'default':
        raise ValueError("weight_decay must be 'default' or a float")
    self.weight_decay = weight_decay
elif isinstance(weight_decay, float):
    self.weight_decay = weight_decay
else:
    raise TypeError("weight_decay must be 'default' or a float")

Depending on the python versions supported, I would also recommend using typehints, and using the below:

from typing import List

abc_var: List[int]
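Applied to the two arguments discussed above, hints and runtime checks might be combined as in this sketch. `check_settings` is a hypothetical standalone function, not the package's constructor, and the hints themselves are not enforced at runtime, which is why the isinstance checks remain.

```python
from typing import List, Union


def check_settings(layer_structure: List[int],
                   weight_decay: Union[str, float] = "default") -> None:
    """Hypothetical validator mirroring the checks quoted above."""
    if not isinstance(layer_structure, list):
        raise TypeError("Layer structure must be specified within a list.")
    if isinstance(weight_decay, str):
        if weight_decay != "default":
            raise ValueError("weight_decay must be 'default' or a float.")
    elif not isinstance(weight_decay, float):
        raise TypeError("weight_decay must be 'default' or a float.")


check_settings([256, 256, 256])        # passes: list and 'default'
check_settings([256, 256], 0.01)       # passes: list and float
```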

More than happy to submit a PR with the proposed changes.

Torch/TF2 version

MIDASpy is currently implemented using logic of TF1 and compatibility layers. As TF2 matures and more graph-based features become deprecated (see e.g. #21), we will need to plan for larger scale update of codebase.

We could try to rebuild natively in TF2 or, alternatively, pivot to a PyTorch implementation, which has a more "pythonic" feel.
