jrzaurin / pytorch-widedeep

A flexible package for multimodal-deep-learning to combine tabular data with text and images using Wide and Deep models in Pytorch

License: Apache License 2.0

Python 97.47% Shell 0.03% JavaScript 1.99% CSS 0.51%
pytorch tabular-data text images multimodal-deep-learning pytorch-tabular-data pytorch-nlp pytorch-cv pytorch-transformers deep-learning

pytorch-widedeep's People

Contributors

5uperpalo, brunobelucci, gradientsky, hyonaldo, jin530, jrzaurin, kd1510, luoxueling


pytorch-widedeep's Issues

Consistent names for model components

At the moment there are two naming inconsistencies of a different nature:

  1. There is one very small inconsistency between the PyPI and the GitHub versions. If you installed the PyPI version, the Transformer encoder blocks of the TabTransformer model are called blks. However, if you install from GitHub those blocks are called tab_transformer_blks, a name more consistent with the other two models (TabMlp and TabResnet). This inconsistency will be fixed in the coming release.
  2. The ResNet blocks in the TabResnet model are named tab_resnet. A more consistent name would be tab_resnet_blks.

All of this needs to be fixed in the next version.

About loss function

Hello everyone.
Can I customize the loss function used to train the model?

Good luck!
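For illustration, a minimal sketch of one way this could be done, assuming your installed version of the library exposes the Trainer's custom_loss_function argument (the argument name and exact behaviour should be checked against the docs for your version); the loss class and its pos_weight value are made up for the example:

import torch
import torch.nn as nn
import torch.nn.functional as F
from pytorch_widedeep import Trainer

class WeightedBCELoss(nn.Module):
    # example custom loss: binary cross-entropy on raw logits with a fixed
    # positive-class weight (illustrative only)
    def __init__(self, pos_weight: float = 2.0):
        super().__init__()
        self.register_buffer("pos_weight", torch.tensor([pos_weight]))

    def forward(self, input: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        return F.binary_cross_entropy_with_logits(input, target, pos_weight=self.pos_weight)

# model, X_wide, X_tab and target are built as in the README examples
trainer = Trainer(model, objective="binary", custom_loss_function=WeightedBCELoss())
trainer.fit(X_wide=X_wide, X_tab=X_tab, target=target, n_epochs=5, batch_size=256)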

Directly extracting embeddings?

Hello, I would like to use the widedeep model together with pytorch-metric-learning, which uses (embedding, label) pairs. Is it possible to extract the embedding from the model in an easy fashion, e.g. with an identity head component?

Thank you!
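For illustration, a minimal sketch of one way to get such embeddings if only the tabular component is needed: the forward pass of the standalone tab model (here TabMlp, built as in the other examples in these issues) returns the representation produced by its last layer, before any prediction head that WideDeep would add on top, so it can be paired with labels for metric learning. Column lists and the dataframe are assumed to exist:

import torch
from pytorch_widedeep.models import TabMlp
from pytorch_widedeep.preprocessing import TabPreprocessor

tab_preprocessor = TabPreprocessor(embed_cols=embed_cols, continuous_cols=cont_cols)
X_tab = tab_preprocessor.fit_transform(df)

tab_model = TabMlp(
    mlp_hidden_dims=[64, 32],
    column_idx=tab_preprocessor.column_idx,
    embed_input=tab_preprocessor.embeddings_input,
    continuous_cols=cont_cols,
)

with torch.no_grad():
    embeddings = tab_model(torch.from_numpy(X_tab))  # shape: (n_rows, 32)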

A more advanced text component

Currently the default text component is simply a stack of LSTMs:

while the image component can be a pretrained ResNet:

In an ideal situation, the text component could also be a pre-trained architecture, for example the AWD-LSTM, replicating the code in the ULMFiT paper and available in the fastai library, or perhaps BERT and company, or the SHA-RNN.

Computing AUC on epoch end [suggestions are welcomed]

Hey, I'm trying to write a custom metric that computes the AUC score at the end of each epoch. I tried using torchmetrics' AUC, but it seems that the num_classes attribute is required and I can't set it on torchmetrics' AUC.
It seems weird, since for Accuracy num_classes is inferred from the shape of y_pred, and I can't set num_classes on it either.

This is where I am so far:

from pytorch_widedeep.metrics import Metric
from sklearn.metrics import roc_auc_score

class CustomAUC(Metric):
    def __init__(self, num_classes):  # here I'm forcing the parameter on init
        super().__init__()
        if num_classes > 2:
            raise ValueError("AUC for multiclass problems is not currently supported.")

        self.num_classes = num_classes
        self._name = "auc"

    def reset(self):
        # called at the start of each epoch; nothing is accumulated across
        # batches in this implementation, so there is nothing to reset
        pass

    def __call__(self, y_pred, y_true):
        # computes the AUC for the current batch only
        yp = y_pred.clone().detach().cpu().numpy()
        yt = y_true.clone().detach().cpu().numpy()
        return roc_auc_score(yt, yp)

Right now it works, but I was wondering if there is (there surely is) a better way to write this piece of code. Suggestions are welcome!
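For what it's worth, a short sketch of how the metric above could then be plugged into the Trainer, mirroring the fit calls used elsewhere in these issues; note that, as written, __call__ returns the AUC of the current batch rather than a value accumulated over the whole epoch:

trainer = Trainer(model, objective="binary", metrics=[CustomAUC(num_classes=2)])
trainer.fit(X_wide=X_wide, X_tab=X_tab, target=target, n_epochs=5, batch_size=256, val_split=0.2)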

Thank you, awesome library

about the wide feature and deep feature

First of all, thanks for your work. I am a little confused about the split between the wide and deep features: embeddings_cols, continuous_cols, standardize_cols, wide_cols, crossed_cols, already_dummies. embeddings_cols are the categorical features and continuous_cols are the continuous features. What is the difference between standardize_cols and continuous_cols? The wide_cols are similar to the categorical features. With crossed_cols, I understand you want to build interaction features; how do I select the columns that make up an interaction feature? Thanks in advance.
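For illustration, a hypothetical column set-up in the style of the adult-census example from the README; crossed_cols takes tuples of column names whose values are combined into a single new categorical feature for the wide component (the specific columns, pairs and embedding sizes below are made up):

wide_cols = ["education", "relationship", "workclass", "occupation", "native_country", "gender"]
crossed_cols = [("education", "occupation"), ("native_country", "occupation")]  # pairs to cross
embeddings_cols = [("education", 16), ("workclass", 16), ("occupation", 16), ("native_country", 32)]
continuous_cols = ["age", "hours_per_week"]
# in the older API, standardize_cols lists which continuous columns should be scaled
# with a StandardScaler; it is typically the same list as continuous_cols
standardize_cols = continuous_cols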

'softmax' activation function

Good day! I was wondering if we could use the 'softmax' activation function for classification problems? I was trying to use TabResnet and FTTransformer for a multi-class classification problem and the last layer is linear. I was hoping I could apply 'softmax', or is it automatically implied?

Thanks.
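As a generic illustration (not specific to this library): the final linear layer outputs raw logits and F.cross_entropy applies a log-softmax internally during training, so adding a softmax layer to the model is not needed; to obtain probabilities at prediction time you can apply softmax to the outputs yourself, and the predicted class is unchanged because softmax preserves the argmax:

import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, -1.0, 0.5, 0.1, -0.3]])  # raw outputs of a 5-class linear layer
probs = F.softmax(logits, dim=1)                       # probabilities summing to 1
print(probs.sum(dim=1), torch.equal(probs.argmax(dim=1), logits.argmax(dim=1)))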

SAINT Model

This new model would be great to implement: https://arxiv.org/abs/2106.01342

They mention that they even found performance improvements in the TabTransformer architecture by embedding continuous variables (a variable-specific linear layer to go from 1 -> n_embeddings). I have made a fork and implemented this as an option in tab_transformer. Okay to make a PR?
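For illustration, a minimal sketch of the idea described above (one feature-specific Linear(1, embed_dim) per continuous column), written as standalone PyTorch rather than the package's actual implementation:

import torch
import torch.nn as nn

class ContinuousEmbeddings(nn.Module):
    # one Linear(1 -> embed_dim) per continuous feature, as described in SAINT
    def __init__(self, n_cont_cols: int, embed_dim: int):
        super().__init__()
        self.layers = nn.ModuleList([nn.Linear(1, embed_dim) for _ in range(n_cont_cols)])

    def forward(self, x_cont: torch.Tensor) -> torch.Tensor:
        # x_cont: (batch, n_cont_cols) -> (batch, n_cont_cols, embed_dim)
        embeds = [layer(x_cont[:, i : i + 1]) for i, layer in enumerate(self.layers)]
        return torch.stack(embeds, dim=1)

print(ContinuousEmbeddings(3, 16)(torch.randn(8, 3)).shape)  # torch.Size([8, 3, 16])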

question about loss

Hello, I want to know why the input needs to be "input = torch.cat([1 - input, input], axis=1)". What is the meaning of 1 - input? The input should be the original input, not the sigmoid result, because you use F.binary_cross_entropy_with_logits instead of F.binary_cross_entropy.

Fix Focal Loss input/output for upcoming version (temporal solution in master)

The issue is related to the use of F.binary_cross_entropy_with_logits instead of F.binary_cross_entropy. The problem is that I use a sigmoid activation in the case of binary classification and no activation for a multiclass problem. Therefore, although the use of F.binary_cross_entropy_with_logits is correct for multiclass, it is not for binary.

Also, the get_weight method applies a sigmoid activation to the entries, which is only correct for multiclass.

All of this needs to be fixed.

See here:

def _activation_fn(self, inp: Tensor) -> Tensor:
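For reference, a small standalone snippet of the distinction at the heart of this issue: F.binary_cross_entropy_with_logits expects raw scores and applies the sigmoid internally, while F.binary_cross_entropy expects probabilities, so applying a sigmoid and then calling the *_with_logits variant effectively applies the sigmoid twice:

import torch
import torch.nn.functional as F

logits = torch.tensor([0.3, -1.2, 2.0])
target = torch.tensor([1.0, 0.0, 1.0])
probs = torch.sigmoid(logits)

loss_a = F.binary_cross_entropy_with_logits(logits, target)     # correct: raw logits
loss_b = F.binary_cross_entropy(probs, target)                  # correct: probabilities
loss_wrong = F.binary_cross_entropy_with_logits(probs, target)  # sigmoid applied twice

print(loss_a.item(), loss_b.item(), loss_wrong.item())  # loss_a == loss_b != loss_wrong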

recall metrics

Hello Javier,
You do a good job, thank you very much. Still, a small suggestion: most people usually deal with extremely imbalanced data, where recall may be a better metric than accuracy, so it would help to add recall and precision metrics.

Thank you again for your contribution!

how to save and load model?

An exception occurred when I save the model with torch.save(model, 'a.model') and then use torch.load('a.model') and preprocess_wide.transform to predict my labels.

The exception looks like this:
This WidePreprocessor instance is not fitted yet. Call 'fit' with appropriate arguments before using this estimator.

Please help me!
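For illustration, a minimal sketch of the usual approach, with made-up file names: save the model's state_dict with torch.save, persist the fitted preprocessors (e.g. with joblib), and at prediction time rebuild the model with the same architecture before loading the weights:

import torch
import joblib

# saving
torch.save(model.state_dict(), "wd_model.pt")
joblib.dump(preprocess_wide, "preprocess_wide.pkl")

# loading (the WideDeep model must be re-built with the same architecture first)
model.load_state_dict(torch.load("wd_model.pt"))
preprocess_wide = joblib.load("preprocess_wide.pkl")
X_wide_new = preprocess_wide.transform(df_new)  # the loaded preprocessor is already fitted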

Add Regression losses

Add losses that show promising results for regression in scenarios with highly imbalanced datasets, e.g. in Lifetime Value prediction.

AttributeError: 'tuple' object has no attribute 'view'

Getting the following error while training.

File "train.py", line 58, in <module>
    val_split=0.1,
  File "/Users/shin/miniconda3/envs/transaction_env/lib/python3.7/site-packages/pytorch_widedeep/training/trainer_utils.py", line 284, in __call__
    return wrapped(*args, **kwargs)
  File "/Users/shin/miniconda3/envs/transaction_env/lib/python3.7/site-packages/pytorch_widedeep/training/trainer_utils.py", line 284, in __call__
    return wrapped(*args, **kwargs)
  File "/Users/shin/miniconda3/envs/transaction_env/lib/python3.7/site-packages/pytorch_widedeep/training/trainer_utils.py", line 284, in __call__
    return wrapped(*args, **kwargs)
  [Previous line repeated 10 more times]
  File "/Users/shin/miniconda3/envs/transaction_env/lib/python3.7/site-packages/pytorch_widedeep/training/trainer.py", line 577, in fit
    train_score, train_loss = self._train_step(data, targett, batch_idx)
  File "/Users/shin/miniconda3/envs/transaction_env/lib/python3.7/site-packages/pytorch_widedeep/training/trainer.py", line 993, in _train_step
    y = target.view(-1, 1).float() if self.method != "multiclass" else target
AttributeError: 'tuple' object has no attribute 'view'

When I run wide and deeptab, I got stuck

Hi pytorch_widedeep team,
First of all, thank you for your contributions to the wide_deep field. Recently I am working on a classification task, similar in essence to the adult salary prediction shown in the example in the project README. All of my input data has been processed into numerical values. The data contains continuous values, and some indicators are discrete values (such as 0, 1, 2, ...). In addition, the data contains missing values. My task is a five-class problem.
Question one:
I used the adult salary prediction code from the README, changed the data input part and changed "binary" to "multiclass" to adapt it to my task. When executing the following code:

wide_preprocessor = WidePreprocessor(wide_cols=wide_cols, crossed_cols=cross_cols)
X_wide = wide_preprocessor.fit_transform(df_train)
wide = Wide(wide_dim=np.unique(X_wide).shape[0], pred_dim=1)

tab_preprocessor = TabPreprocessor(embed_cols=embed_cols, continuous_cols=cont_cols)
X_tab = tab_preprocessor.fit_transform(df_train)
deeptabular = TabMlp(
    mlp_hidden_dims=[64, 32],
    column_idx=tab_preprocessor.column_idx,
    embed_input=tab_preprocessor.embeddings_input,
    continuous_cols=cont_cols,
)

model = WideDeep(wide=wide, deeptabular=deeptabular)

I got the following error: "RuntimeWarning: invalid value encountered in true_divide". I understand that it is caused by 0 being divided by 0, but the error message is not enough for me to locate the problematic segment. How can I solve this problem?
Question two:
When using the function to split a validation set from the training set:

trainer = Trainer(model, objective="multiclass", metrics=[Accuracy])
trainer.fit(
    X_wide=X_wide,
    X_tab=X_tab,
    target=target,
    n_epochs=15,
    batch_size=16,
    val_split=0.1,
)

I got the error "IndexError: index 648 is out of bounds for axis 0 with size 530", i.e. an index beyond the range of the training set, and I have no solution.
The source code is uploaded as an attachment; see the ipynb running results for the detailed error information. Looking forward to your answer!
liver_predict.zip

how to save/load "preprocess_deep" and "preprocess_wide"

When I use joblib to dump preprocess_deep and preprocess_wide:

from joblib import dump, load
...

# deepdense
preprocess_deep = DensePreprocessor(embed_cols=embed_cols, continuous_cols=cont_cols)
X_deep = preprocess_deep.fit_transform(df_train)

# dump my preprocess_deep
dump(preprocess_deep, 'preprocess_deep.pkl')

deepdense = DeepDense(
    hidden_layers=[64, 32],
    deep_column_idx=preprocess_deep.deep_column_idx,
    embed_input=preprocess_deep.embeddings_input,
    continuous_cols=cont_cols,
)

...

preprocess_deep = load("preprocess_deep.pkl")
X_deep_te = preprocess_deep.transform(df_test)

error:

Traceback (most recent call last):
  File "wide_deep_baseline_train.py", line 90, in <module>
    X_deep_te = preprocess_deep.transform(df_test)
  File "/usr/local/lib/python3.7/site-packages/pytorch_widedeep/preprocessing/_preprocessors.py", line 273, in transform
    df_emb = self.label_encoder.transform(df_emb)
  File "/usr/local/lib/python3.7/site-packages/pytorch_widedeep/utils/dense_utils.py", line 98, in transform
    df_inp[k] = df_inp[k].apply(lambda x: v[x])
  File "/usr/local/lib64/python3.7/site-packages/pandas/core/series.py", line 4200, in apply
    mapped = lib.map_infer(values, f, convert=convert_dtype)
  File "pandas/_libs/lib.pyx", line 2402, in pandas._libs.lib.map_infer
  File "/usr/local/lib/python3.7/site-packages/pytorch_widedeep/utils/dense_utils.py", line 98, in <lambda>
    df_inp[k] = df_inp[k].apply(lambda x: v[x])
KeyError: nan

augmentations/transform while predicting?

First, thanks for the package!
I observed a strange behaviour when doing regression; maybe that's expected, but it was not for me.
I tried to use transformations, so before I had:

train_transforms = []
model.compile (method = 'regression', .... transforms = train_transforms) 

and then later with new data

model.predict (X_wide=..., X_deep=......)

This worked fine.

But after changing the transformations to

from torchvision import transforms as tf
train_transforms = [
    tf.ToPILImage(),
    tf.RandomCrop(512, padding=64),
    tf.ToTensor()
]
model.compile (.... transforms = train_transforms) 

Suddenly I got much too high values when predicting. I figured that might stem from the transformations, so I added

transform_val = [
    tf.ToPILImage(),
    tf.ToTensor()
]

and added a 'transforms_val=transform_val' parameter to model.compile and changed _predict in wide_deep.py accordingly:

        if X_test is not None:
            test_set = WideDeepDataset(**X_test, transforms=self.transforms_val)  # type: ignore
        else:
            load_dict = {"X_wide": X_wide, "X_deep": X_deep}
            if X_text is not None:
                load_dict.update({"X_text": X_text})
            if X_img is not None:
                load_dict.update({"X_img": X_img})
            test_set = WideDeepDataset(**load_dict, transforms=self.transforms_val)

This worked. Was that change necessary, or did I do something wrong and could I have achieved the same effect without changes?

ModelCheckpoint save_best_only not working

Good day! I attempted to use the ModelCheckpoint callback in the hope of saving the best weights throughout the training process. I was just testing, so I temporarily set the number of epochs to 2. When I checked the folder (based on the filepath), I didn't find any '.pt' files. I checked the callbacks documentation and it seems that I didn't miss anything. May I ask if someone could enlighten me on this? I would appreciate your help!

(See the 'save_best_model' screenshot in the original issue for reference.)
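For reference, a hedged sketch of how the callback is typically wired up (argument names are from my reading of the callbacks docs and should be double-checked); two common pitfalls are that the directory in filepath must already exist and that monitoring 'val_loss' requires a validation set (e.g. val_split > 0), otherwise there is no value to compare and nothing gets saved:

from pytorch_widedeep.callbacks import ModelCheckpoint

model_checkpoint = ModelCheckpoint(
    filepath="model_weights/wd_out",  # files are written with this prefix plus a suffix
    monitor="val_loss",               # requires a validation set
    save_best_only=True,
)

trainer = Trainer(model, objective="binary", metrics=[Accuracy], callbacks=[model_checkpoint])
trainer.fit(X_wide=X_wide, X_tab=X_tab, target=target, n_epochs=2, batch_size=256, val_split=0.2)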

How to deal with absence of values in Target Label on new Dataset?

Hello, I was going through the sample tutorial notebooks provided by you. In the notebook: Save and Load Model and Artifacts,
when we want to evaluate our saved model on new data, you have kept "Target" in the dataframe, which is in fact our label column. The values from this column aren't available to us in the real world and are something we expect our model to predict. How are we providing this as an input column (along with the others) to the preprocessor, when this is something we should be predicting instead? Is this correct?


Lack of consistency between deeptabular components

Hi,

Sorry, I'm new to this library, but it seems to me that there is a lack of consistency in the transition between the TabPreprocessor and deeptabular steps, or at least it would be nice to have a snippet in the examples.

If I start with the most basic example for the TabPreprocessor:

df = pd.read_csv(...)
embed_cols = [...]
cont_cols = [...]
tab_preprocessor = TabPreprocessor(embed_cols=embed_cols, continuous_cols=cont_cols)
X_tab = tab_preprocessor.fit_transform(df)

Here's my concern:
This works:

tabmlp = TabMlp(
    mlp_hidden_dims=[200, 100],
    column_idx=tab_preprocessor.column_idx,
    embed_input=tab_preprocessor.embeddings_input, 
    continuous_cols=cont_cols,
)
outputs_mlp = tabmlp(torch.from_numpy(X_tab))

And this does not:

tabtransformer = TabTransformer(
    mlp_hidden_dims=[200, 100],
    column_idx=tab_preprocessor.column_idx, 
    embed_input=[(emb[0], emb[2]) for emb in tab_preprocessor.embeddings_input],   # Here i already have to change embeddings format
    continuous_cols=cont_cols
)
outputs_transformer = tabtransformer(torch.from_numpy(X_tab))

The error I get:

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-7-fe7de00feead> in <module>
----> 1 embeddings = tabtransformer(torch.from_numpy(X_tab))

~/anaconda3/envs/matching/lib/python3.8/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
    887             result = self._slow_forward(*input, **kwargs)
    888         else:
--> 889             result = self.forward(*input, **kwargs)
    890         for hook in itertools.chain(
    891                 _global_forward_hooks.values(),

~/anaconda3/envs/matching/lib/python3.8/site-packages/pytorch_widedeep/models/tab_transformer.py in forward(self, X)
    428     def forward(self, X: Tensor) -> Tensor:
    429 
--> 430         embed = [
    431             self.embed_layers["emb_layer_" + col](
    432                 X[:, self.column_idx[col]].long()

~/anaconda3/envs/matching/lib/python3.8/site-packages/pytorch_widedeep/models/tab_transformer.py in <listcomp>(.0)
    429 
    430         embed = [
--> 431             self.embed_layers["emb_layer_" + col](
    432                 X[:, self.column_idx[col]].long()
    433             ).unsqueeze(1)

~/anaconda3/envs/matching/lib/python3.8/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
    887             result = self._slow_forward(*input, **kwargs)
    888         else:
--> 889             result = self.forward(*input, **kwargs)
    890         for hook in itertools.chain(
    891                 _global_forward_hooks.values(),

~/anaconda3/envs/matching/lib/python3.8/site-packages/torch/nn/modules/sparse.py in forward(self, input)
    154 
    155     def forward(self, input: Tensor) -> Tensor:
--> 156         return F.embedding(
    157             input, self.weight, self.padding_idx, self.max_norm,
    158             self.norm_type, self.scale_grad_by_freq, self.sparse)

~/anaconda3/envs/matching/lib/python3.8/site-packages/torch/nn/functional.py in embedding(input, weight, padding_idx, max_norm, norm_type, scale_grad_by_freq, sparse)
   1914         # remove once script supports set_grad_enabled
   1915         _no_grad_embedding_renorm_(weight, input, max_norm, norm_type)
-> 1916     return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
   1917 
   1918 

IndexError: index out of range in self

Thank you in advance for your help !

Bug related to `predict_proba`

Initially, the activation function for multi-class problems was F.softmax(). This is not necessary when using F.cross_entropy, since the latter already applies a log-softmax to the inputs.

I fixed this for the .fit method, but it still needs to be fixed for the predict_proba method.

Note that this will not change the accuracy metric, since softmax does not change the argmax (i.e. it preserves the rank order of the inputs).

Is 'CUDA' available for pytorch-widedeep?

Hi! I have been getting state-of-the-art results from this library. So, kudos to the team!

I was wondering if I could use 'CUDA' or GPU as my device instead of CPU?

Thank you.

Transformers fail without categorical features

The models SAINT, FTTransformer and TabFastFormer contain the following code:

        if self.n_cont and not self.n_cat and not self.embed_continuous:
            raise ValueError(
                "If only continuous features are used 'embed_continuous' must be set to 'True'"
            )

The variable self.embed_continuous is not set anywhere; it was probably used in an older version of the code and later replaced by True. For datasets without categorical features, this code raises an error. Removing it should resolve the issue.

How to process data in a minibatch setting

Hi! This is an amazing package, things are very clearly documented and very easy to use on any of my cases.

One question I have is how I can adapt the code here to a minibatch scenario. Looking into the Trainer class definition, the fit method initiates a custom dataloader on the dataset passed into it, so I should be able to write a custom dataloader to take care of data loading.

But where I need some pointers is how I can transform the data. The current code doesn't look like it can easily be adapted to handle partial fits; can you provide some suggestions?

My thought on how to work around it:
I'm mainly using the TabPreprocessor. In the fit function, it first prepares embeddings for the categorical features. If I create a dataframe that contains the same columns as the real data and put in dummy examples that cover all possible combinations of categorical features, it should work fine.

For the second step of standardizing continuous features, I'm thinking I can just write some dummy values before calling fit and then overwrite the StandardScaler object with one that was already fit to all of my data.

Does this workaround seem like it would work to you? Any immediate concerns you might have? I'm currently working on this and will post any questions here. After this tight deadline, I'll also be happy to implement something more permanent to handle minibatches.
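For illustration, a sketch of the first part of this workaround, assuming the TabPreprocessor exposes separate fit and transform methods (column lists and file names are placeholders); the scaler-overwriting idea would mean poking at the preprocessor's internals, which may change between versions, so it is not shown here:

import pandas as pd
from pytorch_widedeep.preprocessing import TabPreprocessor

# fit once on a frame that covers every categorical level (a representative
# sample, or the small "dummy" frame described above)
tab_preprocessor = TabPreprocessor(embed_cols=embed_cols, continuous_cols=cont_cols)
tab_preprocessor.fit(df_reference)

# then stream the real data in chunks and transform each chunk independently
for chunk in pd.read_csv("big_dataset.csv", chunksize=50_000):
    X_tab_chunk = tab_preprocessor.transform(chunk)
    # ... feed X_tab_chunk to a custom dataloader / training step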

Incomplete example in deep_dense.py docstring

The example in the DeepDense class docstring is missing the continuous_cols param.
Four categorical columns and one continuous column are created, but when creating the network continuous_cols is not specified, so even though you pass a 5-column tensor to the model with model(X_deep), the result does not use the 5th column at all. For example, if after running the example you do X_deep[:, 4] = 0 and then run model(X_deep) again, you get the same result as before.

Need to add the arg continuous_cols=['e'] to fix.

To be continued: When I run Wide and DeepTab, I got stuck

Summary: the original question mentioned that, following the "Predicting Adult Salary Levels" example, I changed the target classification categories, a "RuntimeWarning" appeared, and when splitting a validation set from the training set I ran into an out-of-range index. The original issue is here: #68

Question about a place of the self.bias in _embeddings_layers.py.

Hello, first of all, thank you for releasing a wonderful codebase.
I have a question while looking at the code related to ft_transformer.

Is there any particular reason why the bias is added after the dropout?
Would there be a problem if the dropout were applied after adding the bias instead?

Allow using models individually

Right now there is the assumption that the model needs to comprise at least the wide and the deep components.

The idea would be to allow the use of individual components on their own.

About Memory Issues with Large Dataset

Hello, Javier.
Thank you very much.
You do a great job. It is so easy to use. I love your work.

I tried to use pytorch-widedeep on my own data, which has about 30M unique ids.
I noticed that the wide model first builds one-hot vectors and then feeds them to a dense layer, and found that the one-hot encoder from sklearn costs a lot of memory.
I think it would be equivalent to feed the category indices to an Embedding layer instead, which would decrease the memory footprint. However, I am not sure how it would affect performance.
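For illustration, a small standalone snippet of the equivalence suggested above: an nn.Embedding lookup gives the same result as multiplying a one-hot vector by a dense weight matrix, but without ever materialising the one-hot matrix, which is what saves memory when there are ~30M unique ids:

import torch
import torch.nn as nn
import torch.nn.functional as F

n_ids, dim = 1000, 8            # imagine n_ids ~ 30M in the real case
emb = nn.Embedding(n_ids, dim)
ids = torch.tensor([3, 17, 999])

out_lookup = emb(ids)                                 # no one-hot matrix is built
one_hot = F.one_hot(ids, num_classes=n_ids).float()  # memory-hungry alternative
out_onehot = one_hot @ emb.weight

print(torch.allclose(out_lookup, out_onehot))  # True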

Reset metrics before validation

At the moment the metrics are not reset before evaluation, which means that if you print metrics as well as losses, the metrics will be computed over the entire dataset (this does not affect the losses).

A fix will happen ASAP in the current version (only available on GitHub) and there is a new version coming where it will also be fixed (available on both GitHub and PyPI).

Computation of AUROC at the epoch end

Hey, I was implementing binary classification on the adult census dataset using pytorch-widedeep, in which I am trying to calculate the AUROC metric using torchmetrics' AUROC at the end of each epoch, but I am getting this error:

trainer = Trainer(model, objective="binary", metrics=[AUROC])
trainer.fit( X_wide=X_wide, X_tab=X_tab, target=target, n_epochs=5, batch_size=64, val_split=0.2 )
TypeError: '>' not supported between instances of 'NoneType' and 'int'

Looking for help regarding this.

Thank you, awesome library

Code Structure

At the moment the main class WideDeep is a gigantic class that does everything: it builds the model, compiles it, trains, validates and predicts:

class WideDeep(nn.Module):

It would be useful for me to have someone else take a look at the code and discuss other potential structures. For example, would it be useful to have a separate Warmer class to warm up the different components and a separate Trainer class to train/validate/predict?

Tuning TabNet Hyperparameters K-Fold CV

In scikit-learn a common workflow would be to define your hyperparameter space and the model to be tested, and fit like this:

model_cv = RandomizedSearchCV(model(), hyperparameter_space, cv=cv, n_iter=n_iter)


model_cv.fit(x_train, y_train)

Now in pytorch-widedeep the recommended path for training a single model is:

trainer = Trainer(model, objective="binary", metrics=[Accuracy])


trainer.fit(X_wide=X_wide, X_tab=X_tab, target=target, n_epochs=5, batch_size=256, val_split=0.1)

So, how could I do a random search for the best hyperparameters of a model (for example TabNet or TabTransformer) in a cross-validated way? If I just put the trainer object in the RandomizedSearchCV of the first example and define the hyperparameter space of the model, could I make it work? If that's not possible, what would be another way to do it?
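Since the Trainer is not a scikit-learn estimator, it cannot be dropped into RandomizedSearchCV directly. For illustration, a hedged sketch of a manual alternative: sample hyperparameters and run K-fold splits, rebuilding the model and Trainer for each combination. The parameter space, number of epochs and accuracy-based scoring are made up, and the exact TabMlp/Trainer arguments should be checked against your installed version:

import numpy as np
from sklearn.model_selection import KFold, ParameterSampler

param_space = {
    "mlp_hidden_dims": [[64, 32], [128, 64], [200, 100]],
    "mlp_dropout": [0.1, 0.2, 0.5],
}
kf = KFold(n_splits=5, shuffle=True, random_state=0)
results = []

for params in ParameterSampler(param_space, n_iter=5, random_state=0):
    fold_scores = []
    for train_idx, valid_idx in kf.split(X_tab):
        deeptabular = TabMlp(
            column_idx=tab_preprocessor.column_idx,
            embed_input=tab_preprocessor.embeddings_input,
            continuous_cols=cont_cols,
            mlp_hidden_dims=params["mlp_hidden_dims"],
            mlp_dropout=params["mlp_dropout"],
        )
        model = WideDeep(deeptabular=deeptabular)
        trainer = Trainer(model, objective="binary", metrics=[Accuracy])
        trainer.fit(X_tab=X_tab[train_idx], target=target[train_idx], n_epochs=5, batch_size=256)
        preds = trainer.predict(X_tab=X_tab[valid_idx])
        fold_scores.append((preds == target[valid_idx]).mean())
    results.append((params, float(np.mean(fold_scores))))

best_params, best_score = max(results, key=lambda r: r[1])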
