midasverse / midaspy
Python package for missing-data imputation with deep learning
License: Apache License 2.0
I see the post from two years ago enquiring about support for Python 3.9, but how about 3.10?
Have tried running the Python example notebook and noticed that the final loss changes slightly from run to run (e.g., from 73446.1 to 73355.3) despite setting the same seed. Does this have to do with unaccounted-for randomness in the algorithm, or is it just due to rounding?
Another question: does it generally make a difference to scale the continuous data before inputting it to the algorithm? I assumed the answer is no because scaling is done internally anyway; however, I noticed that in the R example the data was explicitly scaled, but not in the Python example.
We have a legacy build file in the build directory which should be removed.
Getting the following warning as part of the training cycle:
FutureWarning: Passing a dict as an indexer is deprecated and will raise in a future version. Use a list instead.
data_1 = data[subset]
We should update this to future-proof the codebase as soon as possible.
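A minimal sketch of the likely fix, assuming data is a DataFrame and subset is a dict whose keys are the column names being selected:

# Hypothetical fix: pandas deprecates dict-based indexing, so pass the
# keys as a list instead.
data_1 = data[list(subset)]  # list(subset) yields the dict's keys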
I'm working with Dirichlet distributions and the compositional-data simplex, and am really enjoying MIDASpy's flexibility when dealing with this data (related to the K-L divergence in the decoder). However, there is a tendency to produce negative values in the numerical feature data I have been using.
In the case of compositional data, there is a constraint of zero as a minimum value. Other imputation approaches allow setting minimum and maximum value arguments (e.g., scikit-learn), and importantly these can be set per feature (autoimpute). Is this an argument that could be added to the package? It would be a major help to people working in several disciplines.
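In the meantime, a minimal post-hoc workaround sketch, assuming imputed is one of the completed DataFrames returned by generate_samples() and that the per-feature bounds (the feature names here are hypothetical) are supplied by the user:

# Hypothetical per-feature bounds: (min, max), with None meaning unbounded.
bounds = {"feat_a": (0.0, None), "feat_b": (0.0, 1.0)}

for col, (lo, hi) in bounds.items():
    # Clip imputed draws into the valid range for each feature.
    imputed[col] = imputed[col].clip(lower=lo, upper=hi)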
I am trying to utilize two GPUs with MIDASpy. However, I get the following error during set-up:
from sklearn.preprocessing import MinMaxScaler
import numpy as np
import pandas as pd
import tensorflow as tf
import MIDASpy as md
data_0 = pd.read_csv('/home/comp/Documents/file.txt', sep = "\t")
data_0.columns = data_0.columns.str.strip()
data_0 = data_0.set_index('Unnamed: 0')
data_0.index.names = [None]
np.random.seed(441)
na_loc = data_0.isnull()
data_0[na_loc] = np.nan
imputer = md.Midas(layer_structure = [256, 256, 256],
                   learn_rate = 1e-4,
                   input_drop = 0.9,
                   train_batch = 50,
                   savepath = '/home/comp/Documents/save',
                   seed = 89)
strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    imputer.build_model(data_0)
AssertionError: Do not use tf.reset_default_graph() to clear nested graphs. If you need a cleared graph, exit the nesting and create a new graph.
Hello,
How can I get the data back in its original form (i.e., reverse the dummies)? We receive the imputed dataset in one-hot encoded form, but how do we convert it back to the original dataset with the categorical variables?
Thank you
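For what it's worth, a minimal sketch of one way to reverse the encoding with plain pandas, assuming the one-hot columns for each categorical variable share a common prefix and separator (the function and argument names here are hypothetical):

import pandas as pd

def reverse_dummies(df, prefix, sep="_"):
    # Collect the one-hot columns for this variable and take the column
    # with the highest value in each row as the imputed category.
    cols = [c for c in df.columns if c.startswith(prefix + sep)]
    categories = df[cols].idxmax(axis=1).str[len(prefix + sep):]
    out = df.drop(columns=cols)
    out[prefix] = categories
    return out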
Would be useful to simplify pyplot usage in the overimpute function to remove interactive plotting -- the ideal behaviour is a single plot at the end of imputation, using the "agg" backend if possible.
It's usual for imputation or data pre-processing packages in Python to support the scikit-learn interface, which allows a library to be used in data pipelines and existing scikit-learn infrastructure. Is there any plan to explicitly support it in the library? (A rough wrapper sketch follows the references below.)
References:
https://scikit-learn.org/stable/modules/impute.html
https://scikit-learn.org/stable/developers/develop.html
https://scikit-learn.org/stable/data_transforms.html
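As a rough illustration of what such support might look like, a minimal sketch of a scikit-learn-style wrapper; the build_model/train_model/generate_samples calls follow the package's documented usage, while the wrapper class and its parameters are hypothetical:

import MIDASpy as md
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class MidasImputer(BaseEstimator, TransformerMixin):
    """Hypothetical sklearn-style wrapper around MIDASpy."""

    def __init__(self, layer_structure=(256, 256), training_epochs=20):
        self.layer_structure = layer_structure
        self.training_epochs = training_epochs

    def fit(self, X, y=None):
        self.imputer_ = md.Midas(layer_structure=list(self.layer_structure))
        self.imputer_.build_model(pd.DataFrame(X))
        self.imputer_.train_model(training_epochs=self.training_epochs)
        return self

    def transform(self, X):
        # Return the first completed dataset; multiple imputation would
        # need a richer interface than transform() allows.
        return self.imputer_.generate_samples(m=1).output_list[0]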
Current behaviour allows MIDASpy to be loaded when using TF 2.X, but returns a logging error informing users that imputation is only possible in TF 1.X.
It looks like all TF1 components can be updated to TF 2.X -- this just requires an additional tensorflow-addons package dependency for the AdamW optimiser.
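For reference, a minimal sketch of the tensorflow-addons optimiser in question (the hyperparameter values are placeholders):

import tensorflow_addons as tfa

# AdamW = Adam with decoupled weight decay, as provided by tensorflow-addons.
optimizer = tfa.optimizers.AdamW(weight_decay=1e-5, learning_rate=1e-4)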
Hi,
I was wondering if there are any heuristics for choosing a model structure for different types/sizes of datasets. For instance, if I had a standard corporate dataset with 20,000 rows and 15 columns, are there any sure-fire methods/parameters I should be using? Are there any clear dos and don'ts in certain situations?
I'm essentially running the demo code, but with my own input data (all numeric data), and the data frames generated by imputer.generate_samples(m=10).output_list still have the same missing values as in the input.
Example input table:
Feature feat1 feat2 feat3 ... feat30 feat31 feat32
ERS2551628 65.0 0.0 101.0 ... 105.0 230.0 27.0
SRS143466 43.0 NaN 34.0 ... 98.0 0.0 26.0
SRS023715 0.0 54.0 0.0 ... 33.0 55.0 NaN
SRS580227 0.0 0.0 10.0 ... 67.0 22.0 0.0
DRS091214 327457.0 0.0 NaN ... NaN 0.0 24.0
... ... ... ... ... ... ... ...
ERS2551594 74.0 15.0 21.0 ... 93.0 40.0 0.0
ERS634957 0.0 12.0 0.0 ... 0.0 45.0 0.0
DRS087574 0.0 80.0 43.0 ... 209.0 NaN 12.0
ERS634952 33.0 56.0 11.0 ... NaN 1032.0 0.0
SRS1820544 49.0 102.0 12.0 ... 13.0 27.0 49.0
...and the output:
Feature feat1 feat2 feat3 ... feat30 feat31 feat32
ERS2551628 65.0 0.0 101.0 ... 105.0 230.0 27.0
SRS143466 43.0 NaN 34.0 ... 98.0 0.0 26.0
SRS023715 0.0 54.0 0.0 ... 33.0 55.0 NaN
SRS580227 0.0 0.0 10.0 ... 67.0 22.0 0.0
DRS091214 327457.0 0.0 NaN ... NaN 0.0 24.0
... ... ... ... ... ... ... ...
ERS2551594 74.0 15.0 21.0 ... 93.0 40.0 0.0
ERS634957 0.0 12.0 0.0 ... 0.0 45.0 0.0
DRS087574 0.0 80.0 43.0 ... 209.0 NaN 12.0
ERS634952 33.0 56.0 11.0 ... NaN 1032.0 0.0
SRS1820544 49.0 102.0 12.0 ... 13.0 27.0 49.0
Any idea on why the missing values are not imputed?
# Name Version Build Channel
_libgcc_mutex 0.1 conda_forge conda-forge
_openmp_mutex 4.5 1_gnu conda-forge
_tflow_select 2.3.0 mkl
absl-py 0.15.0 pypi_0 pypi
aiohttp 3.8.1 py39h3811e60_0 conda-forge
aiosignal 1.2.0 pyhd8ed1ab_0 conda-forge
astor 0.8.1 pyh9f0ad1d_0 conda-forge
astunparse 1.6.3 pyhd8ed1ab_0 conda-forge
async-timeout 4.0.2 pyhd8ed1ab_0 conda-forge
attrs 21.4.0 pyhd8ed1ab_0 conda-forge
blas 1.1 openblas conda-forge
blinker 1.4 py_1 conda-forge
brotlipy 0.7.0 py39h3811e60_1003 conda-forge
bzip2 1.0.8 h7f98852_4 conda-forge
c-ares 1.18.1 h7f98852_0 conda-forge
ca-certificates 2021.10.26 h06a4308_2
cachetools 4.2.4 pyhd8ed1ab_0 conda-forge
certifi 2021.10.8 py39hf3d152e_1 conda-forge
cffi 1.15.0 py39h4bc2ebd_0 conda-forge
charset-normalizer 2.0.9 pyhd8ed1ab_0 conda-forge
click 8.0.3 py39hf3d152e_1 conda-forge
cryptography 36.0.0 py39h9ce1e76_0
cycler 0.11.0 pyhd8ed1ab_0 conda-forge
dataclasses 0.8 pyhc8e2a94_3 conda-forge
flatbuffers 1.12 pypi_0 pypi
freetype 2.11.0 h70c0345_0
frozenlist 1.2.0 py39h3811e60_1 conda-forge
gast 0.3.3 pypi_0 pypi
google-auth 1.35.0 pypi_0 pypi
google-auth-oauthlib 0.4.1 py_2 conda-forge
google-pasta 0.2.0 pyh8c360ce_0 conda-forge
grpcio 1.32.0 pypi_0 pypi
h5py 2.10.0 nompi_py39h98ba4bc_106 conda-forge
hdf5 1.10.6 nompi_h3c11f04_101 conda-forge
idna 3.3 pyhd3eb1b0_0
importlib-metadata 4.10.0 py39hf3d152e_0 conda-forge
jbig 2.1 h7f98852_2003 conda-forge
joblib 1.1.0 pypi_0 pypi
jpeg 9d h516909a_0 conda-forge
keras-preprocessing 1.1.2 pyhd8ed1ab_0 conda-forge
kiwisolver 1.3.2 py39h1a9c180_1 conda-forge
lcms2 2.12 hddcbb42_0 conda-forge
ld_impl_linux-64 2.36.1 hea4e1c9_2 conda-forge
lerc 3.0 h9c3ff4c_0 conda-forge
libblas 3.9.0 1_h6e990d7_netlib conda-forge
libcblas 3.9.0 3_h893e4fe_netlib conda-forge
libdeflate 1.8 h7f98852_0 conda-forge
libffi 3.4.2 h7f98852_5 conda-forge
libgcc-ng 11.2.0 h1d223b6_11 conda-forge
libgfortran-ng 7.5.0 h14aa051_19 conda-forge
libgfortran4 7.5.0 h14aa051_19 conda-forge
libgomp 11.2.0 h1d223b6_11 conda-forge
liblapack 3.9.0 3_h893e4fe_netlib conda-forge
libnsl 2.0.0 h7f98852_0 conda-forge
libopenblas 0.3.13 h4367d64_0
libpng 1.6.37 hed695b0_2 conda-forge
libprotobuf 3.19.2 h780b84a_0 conda-forge
libstdcxx-ng 11.2.0 he4da1e4_11 conda-forge
libtiff 4.3.0 h6f004c6_2 conda-forge
libuuid 2.32.1 h14c3975_1000 conda-forge
libwebp-base 1.2.1 h7f98852_0 conda-forge
libzlib 1.2.11 h36c2ea0_1013 conda-forge
lz4-c 1.9.3 h9c3ff4c_1 conda-forge
markdown 3.3.6 pyhd8ed1ab_0 conda-forge
matplotlib 3.3.2 0 conda-forge
matplotlib-base 3.3.2 py39h98787fa_1 conda-forge
midaspy 1.2.1 pypi_0 pypi
multidict 5.2.0 py39h3811e60_1 conda-forge
ncurses 6.2 h58526e2_4 conda-forge
numpy 1.19.5 pypi_0 pypi
oauthlib 3.1.1 pyhd8ed1ab_0 conda-forge
olefile 0.46 pyh9f0ad1d_1 conda-forge
openblas 0.3.4 h9ac9557_1000 conda-forge
openjpeg 2.4.0 hb52868f_1 conda-forge
openssl 3.0.0 h7f98852_2 conda-forge
opt_einsum 3.3.0 pyhd8ed1ab_1 conda-forge
pandas 1.3.5 py39hde0f152_0 conda-forge
patsy 0.5.2 pyhd8ed1ab_0 conda-forge
pillow 8.4.0 py39ha612740_0 conda-forge
pip 21.3.1 pyhd8ed1ab_0 conda-forge
protobuf 3.19.2 py39he80948d_0 conda-forge
pyasn1 0.4.8 py_0 conda-forge
pyasn1-modules 0.2.8 py_0
pycparser 2.21 pyhd8ed1ab_0 conda-forge
pyjwt 2.3.0 pyhd8ed1ab_1 conda-forge
pyopenssl 21.0.0 pyhd8ed1ab_0 conda-forge
pyparsing 3.0.6 pyhd8ed1ab_0 conda-forge
pysocks 1.7.1 py39hf3d152e_4 conda-forge
python 3.9.9 h543edf9_0_cpython conda-forge
python-dateutil 2.8.2 pyhd8ed1ab_0 conda-forge
python_abi 3.9 2_cp39 conda-forge
pytz 2021.3 pyhd8ed1ab_0 conda-forge
pyu2f 0.1.5 pyhd8ed1ab_0 conda-forge
readline 8.1 h46c0cb4_0 conda-forge
requests 2.27.0 pyhd8ed1ab_0 conda-forge
requests-oauthlib 1.3.0 pyh9f0ad1d_0 conda-forge
rsa 4.8 pyhd8ed1ab_0 conda-forge
scikit-learn 1.0.2 pypi_0 pypi
scipy 1.7.1 py39hc65b3f8_2
setuptools 60.2.0 py39hf3d152e_0 conda-forge
six 1.15.0 pypi_0 pypi
sqlite 3.37.0 h9cd32fc_0 conda-forge
statsmodels 0.13.1 py39hce5d2b2_0 conda-forge
tensorboard 2.6.0 py_0
tensorboard-data-server 0.6.1 pypi_0 pypi
tensorboard-plugin-wit 1.8.1 pyhd8ed1ab_0 conda-forge
tensorflow 2.4.1 mkl_py39h4683426_0
tensorflow-addons 0.15.0 pypi_0 pypi
tensorflow-base 2.4.1 mkl_py39h43e0292_0
tensorflow-estimator 2.4.0 pypi_0 pypi
termcolor 1.1.0 py_2 conda-forge
threadpoolctl 3.0.0 pypi_0 pypi
tk 8.6.11 h27826a3_1 conda-forge
tornado 6.1 py39h3811e60_2 conda-forge
typeguard 2.13.3 pypi_0 pypi
typing-extensions 3.7.4.3 pypi_0 pypi
tzdata 2021e he74cb21_0 conda-forge
urllib3 1.26.7 pyhd8ed1ab_0 conda-forge
werkzeug 2.0.2 pyhd3eb1b0_0
wheel 0.37.1 pyhd8ed1ab_0 conda-forge
wrapt 1.12.1 pypi_0 pypi
xz 5.2.5 h516909a_1 conda-forge
yarl 1.7.2 py39h3811e60_1 conda-forge
zipp 3.6.0 pyhd8ed1ab_0 conda-forge
zlib 1.2.11 h36c2ea0_1013 conda-forge
zstd 1.5.1 ha95c52a_0 conda-forge
In very large datasets (~30,000 samples x 1,000,000 features) with complex relationships (e.g. cancer omics data), MIDAS can take a very long time to run (days?), even on a single GPU. I would like to take advantage of the 'overimpute' feature for hyperparameter tuning, but this is prohibitive since that very useful feature runs the algorithm multiple times to evaluate various settings.
Would random downsampling of samples (columns) and/or features (rows) generalize the optimal hyperparameters to the larger dataset? For instance, a random subset of 500-1,000 samples with 5,000-10,000 features. This would be to specifically determine the optimal number of: nodes, layers, learning rate, and training epochs. I would think batch size (which can speed up training) is a function of the dataset size, so this would not generalize.
Any help would be great.
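A minimal sketch of the kind of downsampling described above, assuming the data sit in a pandas DataFrame with features as rows and samples as columns (as the question implies); the subset sizes are placeholders:

# Randomly subsample for hyperparameter search with overimpute(); tune
# nodes, layers, learning rate, and epochs on the subset, then retrain
# on the full data with the chosen settings.
subset = (data.sample(n=5000, axis=0, random_state=42)   # features (rows)
              .sample(n=500,  axis=1, random_state=42))  # samples (columns)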
Looking at the codebase, I could not locate a function where the trained model can be used to impute new data after training. There seem to be a couple of functions that could be used to do this indirectly, but I am surprised it is not included as a separate function.
As recommended, I have installed all the packages, but I sometimes get an error message. The interesting point is that when I ran exactly this code on another Google Colab account, I got no errors:
!pip install numpy pandas matplotlib statsmodels scipy
!pip install tensorflow==2.11
!pip install "tensorflow-addons<0.20"
!pip install MIDASpy
import numpy as np
import pandas as pd
import tensorflow as tf
from sklearn.preprocessing import MinMaxScaler
import sys
import MIDASpy as md
ImportError Traceback (most recent call last)
in <cell line: 6>()
4 from sklearn.preprocessing import MinMaxScaler
5 import sys
----> 6 import MIDASpy as md
14 frames
/usr/local/lib/python3.10/dist-packages/keras/engine/base_layer_utils.py in
22
23 from keras import backend
---> 24 from keras.dtensor import dtensor_api as dtensor
25 from keras.utils import control_flow_util
26 from keras.utils import tf_inspect
ImportError: cannot import name 'dtensor_api' from 'keras.dtensor' (/usr/local/lib/python3.10/dist-packages/keras/dtensor/__init__.py)
Related to #7: shift the legend to below the plotting area.
Need to account for clipping of the legend when saving, and for varying numbers of legend items depending on the input data.
Running MIDAS using the VAE leads to a deprecation warning re. tf.compat.v1.distributions.
E.g.
>>> tf.compat.v1.distributions.Normal()
WARNING:tensorflow:From <stdin>:1: Normal.__init__ (from tensorflow.python.ops.distributions.normal) is deprecated and will be removed after 2019-01-01.
Instructions for updating:
The TensorFlow Distributions library has moved to TensorFlow Probability (https://github.com/tensorflow/probability). You should update all references to use `tfp.distributions` instead of `tf.distributions`.
Migrating the affected code to tfp.distributions is not straightforward, as tfp is not designed for TF1's graph-oriented model. We should investigate solutions to safeguard the codebase in the medium term.
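For reference, a minimal sketch of the replacement API the deprecation notice points to (the parameters are placeholders):

import tensorflow_probability as tfp

# tfp.distributions.Normal replaces tf.compat.v1.distributions.Normal.
dist = tfp.distributions.Normal(loc=0.0, scale=1.0)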
Are there any plans of supporting python 3.9? I would really appreciate it.
Sometimes we know that a set of variables should add up to a given total. Measurements involving proportions, percentages, probabilities, concentrations are compositional data. These data occur often in household and business surveys, nutritional information for food, population surveys, biological and genetic data, etc.
The complication of compositional data is that the features are inherently mathematically related, leading to spurious correlation coefficients if conventional statistical or ML approaches are applied (e.g., calculating Euclidean distance metrics). However, use of the K-L distance is potentially a way to avoid this issue, and so MIDAS might offer a nice deep-learning solution to imputation problems concerning compositional data.
However, some preliminary experiments using classic compositional-data imputation datasets with MIDASpy haven't performed as well as I might have expected, and I was wondering if you'd be able to comment?
For example, I imposed 30% missingness at random on the 'Kola soil horizon' geochemical dataset, and compared the known vs imputed samples against each other. You can see a marked linear trend to the imputed values.
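A minimal sketch of how such missingness can be imposed, assuming the data are in a pandas DataFrame; the 30% rate matches the experiment described above:

import numpy as np

# Mask 30% of entries completely at random (MCAR) before imputation,
# keeping the originals aside for the known-vs-imputed comparison.
rng = np.random.default_rng(0)
mask = rng.random(data.shape) < 0.30
data_missing = data.mask(mask)  # masked entries become NaN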
If you are interested to take a look, here is a recent paper which references the Kola datasets, along with a copy of the data:
Paper and two datasets
If no seed is given when initialising the Midas object, then no seed is passed to Midas.train_model(), so the variable train_rng is left unassigned (line 748); this causes an error on line 759, where a value for train_rng is expected.
I suspect the same issue will arise in other places where if self.seed is not None: is used without a corresponding else statement (e.g. line 1184 in Midas.over_impute()).
I suspect this can be fixed by simply adding an else statement which generates a random seed and uses it to assign a value to train_rng.
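A minimal sketch of the proposed else branch, assuming a NumPy-based RNG; the structure follows the issue text rather than the actual source:

import numpy as np

# Inside Midas.train_model(), per the suggestion above:
if self.seed is not None:
    train_rng = np.random.default_rng(self.seed)
else:
    # Fall back to a fresh, unseeded RNG so train_rng is always assigned.
    train_rng = np.random.default_rng()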
Interpreter settings:
Python 3.9
numpy~=1.22.1
pandas~=1.3.5
scipy==1.8.0
matplotlib~=3.5.1
scikit-learn~=1.0.1
tensorflow==2.8.0
keras~=2.6.0
graphviz~=0.19
MIDASpy~=1.2.1
statsmodels~=0.13.2
When I try to train on the "adult" dataset, this message shows up:
Error in py_call_impl(callable, dots$args, dots$keywords) :
ValueError: Imputation target contains no missing values. Please ensure missing values are encoded as type np.nan
I tried to replace the missing values with np.nan, but the same message appeared.
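A minimal sketch of the usual remedy, on the assumption that this is the UCI adult dataset, where missing entries are conventionally encoded as '?':

import numpy as np
import pandas as pd

data = pd.read_csv("adult.csv")     # path is a placeholder
data = data.replace("?", np.nan)    # also try " ?" if the values are space-padded
print(data.isnull().sum().sum())    # should now be greater than zero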
Firstly, a great package.
I noticed that the package uses if type(var) == float:, and thought it may be useful to modify this behaviour to be more Pythonic.
To summarise, isinstance caters for inheritance (an instance of a derived class is an instance of its base class), while checking for equality of type does not: it demands identity of types and rejects instances of subclasses. Typical Python code should support inheritance, so isinstance is preferable to comparing types directly. However, "duck typing" would be the preferred approach (try/except, catching the exceptions associated with an incorrect type, i.e. TypeError).
I refer to lines 142-153, where the list type is evaluated:
if type(layer_structure) == list:
    self.layer_structure = layer_structure
else:
    raise ValueError("Layer structure must be specified within a list")
which could be achieved more elegantly using:
if not isinstance(layer_structure, list):
    raise TypeError("Layer structure must be specified within a list.")
And lines 181-187:
if weight_decay == 'default':
    self.weight_decay = 'default'
elif type(weight_decay) == float:
    self.weight_decay = weight_decay
else:
    raise ValueError("Weight decay argument accepts either 'standard' (string) "\
                     "or floating point")
where the accepted type (or types) could be hinted to the user within the __init__ dunder method, and evaluated through:
if isinstance(weight_decay, str):
    if weight_decay != 'default':
        raise ValueError("A warning that the value must be 'default' or a float type")
    self.weight_decay = weight_decay
elif isinstance(weight_decay, float):
    self.weight_decay = weight_decay
Depending on the Python versions supported, I would also recommend using type hints, as below:
from typing import List
abc_var: List[int]
More than happy to submit a PR with the proposed changes.
MIDASpy is currently implemented using TF1 logic and compatibility layers. As TF2 matures and more graph-based features become deprecated (see e.g. #21), we will need to plan a larger-scale update of the codebase.
We could try rebuilding natively in TF2, or alternatively pivot to a PyTorch implementation, which has a more "pythonic" feel.
AttributeError: module 'MIDASpy' has no attribute 'cat_conv'
How can I solve this problem? I'm using Jupyter.