dreamquark-ai / tabnet

PyTorch implementation of the TabNet paper: https://arxiv.org/pdf/1908.07442.pdf

Home Page: https://dreamquark-ai.github.io/tabnet/

License: MIT License

pytorch deep-neural-networks machine-learning-library tabular-data research-paper pytorch-tabnet tabnet

tabnet's Introduction

README

TabNet: Attentive Interpretable Tabular Learning

This is a PyTorch implementation of TabNet (Arik, S. O., & Pfister, T. (2019). TabNet: Attentive Interpretable Tabular Learning. arXiv preprint arXiv:1908.07442.) https://arxiv.org/pdf/1908.07442.pdf. Please note that some design choices have diverged from the original paper over time in order to improve the library.


Any questions? Want to contribute? Want to talk with us? You can join us on Slack.

Installation

Easy installation

You can install it with pip or conda as follows.

with pip

pip install pytorch-tabnet

with conda

conda install -c conda-forge pytorch-tabnet

Source code

If you want to use it locally within a Docker container:

  • git clone git@github.com:dreamquark-ai/tabnet.git

  • cd tabnet to get inside the repository


CPU only

  • make start to build and get inside the container

GPU

  • make start-gpu to build and get inside the GPU container

Then, inside the container (CPU or GPU):

  • poetry install to install all the dependencies, including jupyter

  • make notebook inside the same terminal. You can then follow the link to a Jupyter notebook with tabnet installed.

What's new?

  • From versions > 4.0, attention is now embedding-aware. This aims to maintain a good attention mechanism even with a large number of embeddings. It is also now possible to specify attention groups (using grouped_features): attention is then applied at the group level rather than the feature level. This is especially useful when a dataset has many columns coming from a single source of data (for example a text column transformed using TF-IDF), as in the sketch below.
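
As an illustration, here is a minimal sketch of declaring an attention group with grouped_features; the column indices are hypothetical and assume columns 5 to 14 all come from the same TF-IDF transform:

from pytorch_tabnet.tab_model import TabNetClassifier

# Hypothetical layout: columns 5 to 14 are TF-IDF features of a single text column,
# so we ask TabNet to attend to them as one group.
tfidf_columns = list(range(5, 15))

clf = TabNetClassifier(
    grouped_features=[tfidf_columns],  # attention is shared within each group
)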

Contributing

When contributing to the TabNet repository, please make sure to first discuss the change you wish to make via a new or already existing issue.

Our commits follow the rules presented here.

What problems does pytorch-tabnet handle?

  • TabNetClassifier : binary classification and multi-class classification problems
  • TabNetRegressor : simple and multi-task regression problems
  • TabNetMultiTaskClassifier: multi-task multi-classification problems

How to use it?

TabNet is now scikit-compatible: training a TabNetClassifier or TabNetRegressor is really easy.

from pytorch_tabnet.tab_model import TabNetClassifier, TabNetRegressor

clf = TabNetClassifier()  #TabNetRegressor()
clf.fit(
  X_train, Y_train,
  eval_set=[(X_valid, y_valid)]
)
preds = clf.predict(X_test)

or, for TabNetMultiTaskClassifier:

from pytorch_tabnet.multitask import TabNetMultiTaskClassifier
clf = TabNetMultiTaskClassifier()
clf.fit(
  X_train, Y_train,
  eval_set=[(X_valid, y_valid)]
)
preds = clf.predict(X_test)

The targets in y_train/y_valid must all share a single type (e.g. they must all be strings or all integers).

Default eval_metric

A few classic evaluation metrics are implemented (see further below for custom ones):

  • binary classification metrics: 'auc', 'accuracy', 'balanced_accuracy', 'logloss'
  • multiclass classification: 'accuracy', 'balanced_accuracy', 'logloss'
  • regression: 'mse', 'mae', 'rmse', 'rmsle'

Important note: 'rmsle' automatically clips negative predictions to 0, because the model can predict negative values. To match the reported scores, you need to clip your predictions with np.clip(clf.predict(X_predict), a_min=0, a_max=None).
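
For instance, here is a minimal sketch of evaluating a regressor with 'rmsle' and clipping the predictions accordingly (the variable names are the usual placeholders):

import numpy as np
from pytorch_tabnet.tab_model import TabNetRegressor

reg = TabNetRegressor()
reg.fit(
    X_train, y_train,  # note: the regressor expects 2D targets, e.g. y_train.reshape(-1, 1)
    eval_set=[(X_valid, y_valid)],
    eval_metric=['rmsle'],
)

# Clip negative outputs to match the reported 'rmsle' scores
preds = np.clip(reg.predict(X_test), a_min=0, a_max=None)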

Custom evaluation metrics

You can create a metric for your specific needs. Here is an example for the Gini score (note that you need to specify whether this metric should be maximized or not):

from pytorch_tabnet.metrics import Metric
from sklearn.metrics import roc_auc_score

class Gini(Metric):
    def __init__(self):
        self._name = "gini"
        self._maximize = True

    def __call__(self, y_true, y_score):
        auc = roc_auc_score(y_true, y_score[:, 1])
        return max(2*auc - 1, 0.)

clf = TabNetClassifier()
clf.fit(
  X_train, Y_train,
  eval_set=[(X_valid, y_valid)],
  eval_metric=[Gini]
)

A specific customization example notebook is available here: https://github.com/dreamquark-ai/tabnet/blob/develop/customizing_example.ipynb

Semi-supervised pre-training

Added after TabNet's original paper, semi-supervised pre-training is now available via the class TabNetPretrainer:

import torch
from pytorch_tabnet.pretraining import TabNetPretrainer
from pytorch_tabnet.tab_model import TabNetClassifier

# TabNetPretrainer
unsupervised_model = TabNetPretrainer(
    optimizer_fn=torch.optim.Adam,
    optimizer_params=dict(lr=2e-2),
    mask_type='entmax'  # "sparsemax"
)

unsupervised_model.fit(
    X_train=X_train,
    eval_set=[X_valid],
    pretraining_ratio=0.8,
)

clf = TabNetClassifier(
    optimizer_fn=torch.optim.Adam,
    optimizer_params=dict(lr=2e-2),
    scheduler_params={"step_size":10, # how to use learning rate scheduler
                      "gamma":0.9},
    scheduler_fn=torch.optim.lr_scheduler.StepLR,
    mask_type='sparsemax' # This will be overwritten if using pretrain model
)

clf.fit(
    X_train=X_train, y_train=y_train,
    eval_set=[(X_train, y_train), (X_valid, y_valid)],
    eval_name=['train', 'valid'],
    eval_metric=['auc'],
    from_unsupervised=unsupervised_model
)

The loss function has been normalized to be independent of pretraining_ratio, batch_size and the number of features in the problem. A self-supervised loss greater than 1 means that your model is reconstructing worse than predicting the mean for each feature; a loss below 1 means the model is doing better than predicting the mean.

A complete example can be found within the notebook pretraining_example.ipynb.

/!\ : the current implementation tries to reconstruct the original inputs, but Batch Normalization applies a random transformation that can't be deduced from a single sample, making the reconstruction harder. Lowering batch_size might make the pretraining easier.

Data augmentation on the fly

It is now possible to apply a custom data augmentation pipeline during training. Templates for ClassificationSMOTE and RegressionSMOTE have been added in pytorch_tabnet/augmentations.py and can be used as is.
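
Below is a minimal sketch of plugging such an augmentation into training; it assumes the fit method accepts an augmentations argument as in recent releases, and the p value is purely illustrative:

from pytorch_tabnet.tab_model import TabNetClassifier
from pytorch_tabnet.augmentations import ClassificationSMOTE

# SMOTE-like mixing applied on the fly to each training batch
# (p is an assumed, illustrative probability of augmenting a batch)
aug = ClassificationSMOTE(p=0.2)

clf = TabNetClassifier()
clf.fit(
    X_train, y_train,
    eval_set=[(X_valid, y_valid)],
    augmentations=aug,
)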

Easy saving and loading

It's really easy to save and re-load a trained model; this makes TabNet production-ready.

# save tabnet model
saving_path_name = "./tabnet_model_test_1"
saved_filepath = clf.save_model(saving_path_name)

# define new model with basic parameters and load state dict weights
loaded_clf = TabNetClassifier()
loaded_clf.load_model(saved_filepath)


Model parameters

  • n_d : int (default=8)

    Width of the decision prediction layer. Bigger values give more capacity to the model, with the risk of overfitting. Values typically range from 8 to 64.

  • n_a: int (default=8)

    Width of the attention embedding for each mask. According to the paper, n_d = n_a is usually a good choice.

  • n_steps : int (default=3)

    Number of steps in the architecture (usually between 3 and 10)

  • gamma : float (default=1.3)

    This is the coefficient for feature reusage in the masks. A value close to 1 will make mask selection less correlated between layers. Values range from 1.0 to 2.0.

  • cat_idxs : list of int (default=[] - mandatory for embeddings)

    List of categorical feature indices.

  • cat_dims : list of int (default=[] - mandatory for embeddings)

    List of numbers of modalities for each categorical feature (number of unique values of a categorical feature). /!\ no new modalities can be predicted

  • cat_emb_dim : list of int (optional)

    List of embedding sizes for each categorical feature (default=1). See the example after this list.

  • n_independent : int (default=2)

    Number of independent Gated Linear Units layers at each step. Usual values range from 1 to 5.

  • n_shared : int (default=2)

    Number of shared Gated Linear Units at each step. Usual values range from 1 to 5.

  • epsilon : float (default 1e-15)

    Should be left untouched.

  • seed : int (default=0)

    Random seed for reproducibility

  • momentum : float

    Momentum for batch normalization, typically ranges from 0.01 to 0.4 (default=0.02)

  • clip_value : float (default None)

    If a float is given this will clip the gradient at clip_value.

  • lambda_sparse : float (default = 1e-3)

    This is the extra sparsity loss coefficient as proposed in the original paper. The bigger this coefficient is, the sparser your model will be in terms of feature selection. Depending on the difficulty of your problem, reducing this value could help.

  • optimizer_fn : torch.optim (default=torch.optim.Adam)

    Pytorch optimizer function

  • optimizer_params: dict (default=dict(lr=2e-2))

    Parameters compatible with optimizer_fn, used to initialize the optimizer. Since Adam is the default optimizer, this is where you define the initial learning rate used for training. As mentioned in the original paper, a large initial learning rate of 0.02 with decay is a good option.

  • scheduler_fn : torch.optim.lr_scheduler (default=None)

    Pytorch Scheduler to change learning rates during training.

  • scheduler_params : dict

    Dictionary of parameters to apply to the scheduler_fn. E.g. {"gamma": 0.95, "step_size": 10}

  • model_name : str (default = 'DreamQuarkTabNet')

    Name of the model used for saving to disk; you can customize this to easily retrieve and reuse your trained models.

  • verbose : int (default=1)

    Verbosity for notebook plots; set to 1 to print progress every epoch, 0 to print nothing.

  • device_name : str (default='auto') 'cpu' for cpu training, 'gpu' for gpu training, 'auto' to automatically detect gpu.

  • mask_type: str (default='sparsemax') Either "sparsemax" or "entmax" : this is the masking function to use for selecting features.

  • grouped_features: list of list of ints (default=None) This allows the model to share its attention across features inside the same group. This can be especially useful when your preprocessing generates correlated or dependent features, for example if you use a TF-IDF or a PCA on a text column. Note that feature importance will be exactly the same for all features in a group. Please also note that the embeddings generated for a categorical variable are always placed in the same group.

  • n_shared_decoder : int (default=1)

    Number of shared GLU blocks in the decoder; this is only useful for TabNetPretrainer.

  • n_indep_decoder : int (default=1)

    Number of independent GLU blocks in the decoder; this is only useful for TabNetPretrainer.
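
For reference, here is a minimal sketch combining several of the parameters above; the column indices, cardinalities and embedding sizes are hypothetical, not recommendations:

import torch
from pytorch_tabnet.tab_model import TabNetClassifier

# Hypothetical dataset: columns 0 and 3 are categorical with 10 and 4 modalities.
clf = TabNetClassifier(
    n_d=16, n_a=16, n_steps=4, gamma=1.5,
    cat_idxs=[0, 3],
    cat_dims=[10, 4],
    cat_emb_dim=[3, 2],  # one embedding size per categorical feature
    lambda_sparse=1e-3,
    optimizer_fn=torch.optim.Adam,
    optimizer_params=dict(lr=2e-2),
    scheduler_fn=torch.optim.lr_scheduler.StepLR,
    scheduler_params={"step_size": 10, "gamma": 0.9},
    mask_type='entmax',
    verbose=1,
)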

Fit parameters

  • X_train : np.array or scipy.sparse.csr_matrix

    Training features

  • y_train : np.array

    Training targets

  • eval_set: list of tuple

    List of eval tuple set (X, y).
    The last one is used for early stopping

  • eval_name: list of str
    List of eval set names.

  • eval_metric : list of str
    List of evaluation metrics.
    The last metric is used for early stopping.

  • max_epochs : int (default = 200)

    Maximum number of epochs for training.

  • patience : int (default = 10)

    Number of consecutive epochs without improvement before performing early stopping.

    If patience is set to 0, then no early stopping will be performed.

    Note that if patience is enabled, then best weights from best epoch will automatically be loaded at the end of fit.

  • weights : int or dict (default=0)

    /!\ Only for TabNetClassifier. Sampling parameter: 0 for no sampling, 1 for automated sampling with inverse class occurrences, or a dict whose keys are classes and whose values are weights for each class (see the example after this list).

  • loss_fn : torch.loss or list of torch.loss

    Loss function for training (defaults to mse for regression and cross entropy for classification). When using TabNetMultiTaskClassifier you can set a list of the same length as the number of tasks; each task will be assigned its own loss function.

  • batch_size : int (default=1024)

    Number of examples per batch. Large batch sizes are recommended.

  • virtual_batch_size : int (default=128)

    Size of the mini batches used for "Ghost Batch Normalization". /!\ virtual_batch_size should divide batch_size

  • num_workers : int (default=0)

    Number of workers used in torch.utils.data.DataLoader

  • drop_last : bool (default=False)

    Whether to drop last batch if not complete during training

  • callbacks : list of callback function
    List of custom callbacks

  • pretraining_ratio : float

    /!\ TabNetPretrainer only: percentage of input features to mask during pretraining.

    Should be between 0 and 1. The bigger it is, the harder the reconstruction task is.
  • warm_start : bool (default=False) In order to match the scikit-learn API, this is set to False. It allows fitting the same model twice, continuing training from a warm start.

  • compute_importance : bool (default=True)

    Whether to compute feature importance
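
Putting a few of these together, here is a minimal sketch of a fit call, reusing a classifier like the one shown earlier; the class weights and epoch budget are illustrative, not recommendations:

# Hypothetical imbalanced binary problem: class 1 is rare, so we up-weight it.
clf.fit(
    X_train, y_train,
    eval_set=[(X_valid, y_valid)],
    eval_name=['valid'],
    eval_metric=['auc'],
    weights={0: 1.0, 1: 5.0},  # TabNetClassifier only
    max_epochs=200,
    patience=20,
    batch_size=1024,
    virtual_batch_size=128,  # must divide batch_size
    num_workers=0,
    drop_last=False,
)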

tabnet's People

Contributors

athewsey, bennyjg, cesarleblanc, discdiver, eduardocarvp, fortin-alex, hartorn, isears, j-abi, jrfiedler, jtilly, mei28, mgrankin, minhlong94, naveenkb, optimox, queraq, renovate-bot, sachinruk, sugatoray


tabnet's Issues

Make TabNet Scikit Compatible

Feature request

Currently, the library can't be used as simply as a scikit model. It would be great to be fully scikit compatible

What is the expected behavior?
We need new classes for TabNetRegressor, TabNetClassifier.
We also need to get scikit-compatible global explanations.

What is motivation or use case for adding/changing the behavior?

How should this be implemented in your opinion?

Are you willing to work on this yourself?
yes

Training is very slow

Hi all,
Thanks for the clean implementation of this model!
I'm comparing tabnet to an MLP and some gradient boosted tree models on a very large (~terabyte) dataset. Tabnet is several orders of magnitude slower than the MLP with a comparable parameter count. It also seems to occupy a lot of memory on the GPU. Is this expected and is there something I can do about this?


Performance of pytorch-tabnet on forest cover type dataset

Running the forest_example out of the box, the results differ significantly from those in the original paper. Specifically, I get the following:

preds = clf.predict_proba(X_test)
y_true = y_test
test_acc = accuracy_score(y_pred=np.argmax(preds, axis=1), y_true=y_true)
print(f"BEST VALID SCORE FOR {dataset_name} : {clf.best_cost}")
BEST VALID SCORE FOR EPIGN : -0.8830427851320214

print(f"FINAL TEST SCORE FOR {dataset_name} : {test_acc}")
FINAL TEST SCORE FOR EPIGN : 0.0499728922661205

Do you get similar results? Many thanks.

Sample weight support for regression problems

Feature request

What is the expected behavior?
It would be very helpful to add sample weight support for regression problems. The idea would be to add a 'sample_weight' parameter to the .fit() call, and give a weighted regression.

What is motivation or use case for adding/changing the behavior?
Many datasets involve different sample weights. This is especially common with sports data (where I work), but is frequently used elsewhere.

How should this be implemented in your opinion?
The usual implementation I've seen has been to multiply the individual residuals by the sample weight, but I am not very familiar with the underlying math here, so don't know how it would work.

Are you willing to work on this yourself?
I am happy to help, but my understanding of the underlying code is lacking at the moment.

Ensuring I have attention right

Hi there! Could you please help me verify that I'm doing attention right? I'm working off the fastai implementation, so it was faster to read up there, but essentially I made a modification to the model so that it can return the masks. It currently looks like this:

learn.model.eval()
for batch_nb, data in enumerate(dl):
  with torch.no_grad():
    out, M_loss, M_explain, masks = learn.model(data[0], data[1], True)
  for key, value in masks.items():
    masks[key] = csc_matrix.dot(value.numpy(), matrix)
  if batch_nb == 0:
    res_explain = csc_matrix.dot(M_explain.numpy(),
                                 matrix)
    res_masks = masks
  else:
    res_explain = np.vstack([res_explain,
                             csc_matrix.dot(M_explain.numpy(),
                                            matrix)])
    for key, value in masks.items():
      res_masks[key] = np.vstack([res_masks[key], value])

From here to plot, I do:

fig, axs = plt.subplots(1, 3, figsize=(20,20))
for i in range(3):
  axs[i].imshow(np.expand_dims(res_masks[0][i], 0))

Now I chose to do the np.expand_dims as it lets us visualize what is going on at an individual item level. Is this the correct way to do this sort of analysis, or should I have done it at a batch level (or does it really not make a difference in the end)?

Thanks!

[Question/Feature Request] Any explainability output or examples available to try out

Feature request

What is the expected behavior?
No behaviour changes. Rather add examples to the docs or the examples section of the repo.

What is motivation or use case for adding/changing the behavior?
Make it easy to allow users to adapt it into their ML workflow. As explainability is an important topic in the current atmosphere.

How should this be implemented in your opinion?
No implementation needed, just docs and examples either as a python code snippet, a Jupyter notebook or a Kaggle kernel will be sufficient.

Are you willing to work on this yourself?
yes

Research : Boosted-TabNet?

Main Remark

The TabNet architecture uses sequential steps in order to mimic some kind of random forest paradigm.
But since boosting algorithms often outperform random forests, shouldn't we try to move towards boosting methods instead of random forests?

Proposed Solutions

One solution I see here would be to predict different things at each step of the tabnet to perform boosting:

  • first step would remain as now
  • second step would try to predict the residuals (i.e the difference between the actual target and the first step predictions)
  • next step would try to predict residuals as well (i.e the difference between the actual target and the sum of previous steps predictions)

This looks like it could work quite easily for regression problems, but I'm not sure how it could work for classification tasks: you can't stay in the classification paradigm and try to predict residuals. If anyone knows about a specific loss function that would make that happen, I think it's worth a try!

If you feel like this is interesting and would like to contribute, please share your ideas in comments or open a PR!

Speed up: stop computing masks explanations during training

Feature request

What is the expected behavior?
During training, masks don't need to be available for users. We could skip some computations as discussed in #102

What is motivation or use case for adding/changing the behavior?
This should speed things up

How should this be implemented in your opinion?
not sure yet

Are you willing to work on this yourself?
yes

RuntimeError: CUDA error: device-side assert triggered

Describe the bug

I get this CUDA error when trying to fit the classifier (with GPU).

I've also tried switching to CPU and got a different error => "RuntimeError: Invalid index in gather at /opt/conda/conda-bld/pytorch_1579022060824/work/aten/src/TH/generic/THTensorEvenMoreMath.cpp:657", where the error seems to be related to an index tensor that has invalid indices, and I'm not sure how to solve this.

What is the current behavior?
This error happens when fitting a classifier with exactly the same parameters as in the "census_examples" notebook but on a different dataset.


Additional context

Here are the details of the error when running fit on CPU:

RuntimeError Traceback (most recent call last)
in
7 batch_size=512, virtual_batch_size=128,
8 num_workers=0,
----> 9 drop_last=False
10 )

/opt/conda/lib/python3.7/site-packages/pytorch_tabnet/tab_model.py in fit(self, X_train, y_train, X_valid, y_valid, loss_fn, weights, max_epochs, patience, batch_size, virtual_batch_size, num_workers, drop_last)
165 self.patience_counter < self.patience):
166 starting_time = time.time()
--> 167 fit_metrics = self.fit_epoch(train_dataloader, valid_dataloader)
168
169 # leaving it here, may be used for callbacks later

/opt/conda/lib/python3.7/site-packages/pytorch_tabnet/tab_model.py in fit_epoch(self, train_dataloader, valid_dataloader)
222 DataLoader with valid set
223 """
--> 224 train_metrics = self.train_epoch(train_dataloader)
225 valid_metrics = self.predict_epoch(valid_dataloader)
226

/opt/conda/lib/python3.7/site-packages/pytorch_tabnet/tab_model.py in train_epoch(self, train_loader)
487
488 for data, targets in train_loader:
--> 489 batch_outs = self.train_batch(data, targets)
490 if self.output_dim == 2:
491 y_preds.append(torch.nn.Softmax(dim=1)(batch_outs["y_preds"])[:, 1]

/opt/conda/lib/python3.7/site-packages/pytorch_tabnet/tab_model.py in train_batch(self, data, targets)
530 self.optimizer.zero_grad()
531
--> 532 output, M_loss = self.network(data)
533
534 loss = self.loss_fn(output, targets)

/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py in call(self, *input, **kwargs)
530 result = self._slow_forward(*input, **kwargs)
531 else:
--> 532 result = self.forward(*input, **kwargs)
533 for hook in self._forward_hooks.values():
534 hook_result = hook(self, input, result)

/opt/conda/lib/python3.7/site-packages/pytorch_tabnet/tab_network.py in forward(self, x)
254 def forward(self, x):
255 x = self.embedder(x)
--> 256 return self.tabnet(x)
257
258 def forward_masks(self, x):

/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py in call(self, *input, **kwargs)
530 result = self._slow_forward(*input, **kwargs)
531 else:
--> 532 result = self.forward(*input, **kwargs)
533 for hook in self._forward_hooks.values():
534 hook_result = hook(self, input, result)

/opt/conda/lib/python3.7/site-packages/pytorch_tabnet/tab_network.py in forward(self, x)
130
131 for step in range(self.n_steps):
--> 132 M = self.att_transformers[step](prior, att)
133 M_loss += torch.mean(torch.sum(torch.mul(M, torch.log(M+self.epsilon)),
134 dim=1))

/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py in call(self, *input, **kwargs)
530 result = self._slow_forward(*input, **kwargs)
531 else:
--> 532 result = self.forward(*input, **kwargs)
533 for hook in self._forward_hooks.values():
534 hook_result = hook(self, input, result)

/opt/conda/lib/python3.7/site-packages/pytorch_tabnet/tab_network.py in forward(self, priors, processed_feat)
290 x = self.bn(x)
291 x = torch.mul(x, priors)
--> 292 x = self.sp_max(x)
293 return x
294

/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py in call(self, *input, **kwargs)
530 result = self._slow_forward(*input, **kwargs)
531 else:
--> 532 result = self.forward(*input, **kwargs)
533 for hook in self._forward_hooks.values():
534 hook_result = hook(self, input, result)

/opt/conda/lib/python3.7/site-packages/pytorch_tabnet/sparsemax.py in forward(self, input)
89
90 def forward(self, input):
---> 91 return sparsemax(input, self.dim)
92
93

/opt/conda/lib/python3.7/site-packages/pytorch_tabnet/sparsemax.py in forward(ctx, input, dim)
41 max_val, _ = input.max(dim=dim, keepdim=True)
42 input -= max_val # same numerical stability trick as for softmax
---> 43 tau, supp_size = SparsemaxFunction._threshold_and_support(input, dim=dim)
44 output = torch.clamp(input - tau, min=0)
45 ctx.save_for_backward(supp_size, output)

/opt/conda/lib/python3.7/site-packages/pytorch_tabnet/sparsemax.py in _threshold_and_support(input, dim)
74
75 support_size = support.sum(dim=dim).unsqueeze(dim)
---> 76 tau = input_cumsum.gather(dim, support_size - 1)
77 tau /= support_size.to(input.dtype)
78 return tau, support_size

RuntimeError: Invalid index in gather at /opt/conda/conda-bld/pytorch_1579022060824/work/aten/src/TH/generic/THTensorEvenMoreMath.cpp:657

Bug with unordered cat_idx

Describe the bug

If the list of cat_idx is unordered, the corresponding cat_dims used for the embeddings will not match.

What is the current behavior?
The bug appears in the forward of EmbeddingGenerator.
A for loop walks through the features and takes the embedding corresponding to each categorical feature from the self.embeddings list, which is built in the same order as cat_idx.

If the current behavior is a bug, please provide the steps to reproduce.

Provide an unordered cat_idx list with corresponding cat_dims.

Solution

Sort the cat_dims and the corresponding emb_dims with respect to cat_idx

        self.embeddings = torch.nn.ModuleList()

        # Sort dims by cat_idx
        sorted_idxs = np.argsort(cat_idxs)
        cat_dims = [cat_dims[i] for i in sorted_idxs]
        self.cat_emb_dims = [self.cat_emb_dims[i] for i in sorted_idxs]

        for cat_dim, emb_dim in zip(cat_dims, self.cat_emb_dims):
            self.embeddings.append(torch.nn.Embedding(cat_dim, emb_dim))

Explain to tensor

Feature request

Currently the output of explain is a torch tensor; it should be a numpy array.

What is the expected behavior?
Should be numpy array

What is motivation or use case for adding/changing the behavior?

Everyone expects numpy arrays
How should this be implemented in your opinion?
.detach().numpy()

Are you willing to work on this yourself?
yes

Adding Callbacks

Main Problem

Currently some things can be changed, like the scheduler or the optimizer, but it's not possible to do things like changing the loss function, the early-stopping metrics, and probably other important things for specific problems.

Proposed Solutions

We should find a simple way of using callbacks in order to customize more the training process.
Something that would resemble one of these:

The easier it is, and the less invasive the solution is for the code, the better.

If you feel like this is interesting and would like to contribute, please share your ideas in comments or open a PR!

Checkpoints

Feature request

Save/load/average checkpoints.

What is the expected behavior?

What is motivation or use case for adding/changing the behavior?
Smarter early stopping and possibly better generalization on predictions.

How should this be implemented in your opinion?
Good source of inspiration here: https://github.com/Qwicen/node/blob/master/lib/trainer.py

Are you willing to work on this yourself?
yes

Refactorize embeddings

Feature request

Creating an external module for embeddings generation would make the code clearer.
Some improvement to skip this part if no embeddings are needed would also make the training faster (see #97).

What is the expected behavior?
Nothing would change, just code optimization

What is motivation or use case for adding/changing the behavior?
Code clearer and faster.

How should this be implemented in your opinion?

Are you willing to work on this yourself?
yes

Research : Embedding Aware Attention

Main Problem

When training with large embedding dimensions, the mask size goes up.

One problem I see is that sparsemax does not know which columns come from the same embedded column, which could make things difficult for the model to learn:

  • create embeddings that make sense
  • mask embeddings without destroying them; in fact, since sparsemax is sparse, it's very unlikely that all the columns from the same embedding are used, so you lose the power of your embedding

Proposed Solutions

It's an open problem but one way I see as promising is to create embedding aware attention.

The idea would be to mask all dimensions from the same embedding in the same way, either by using the mean or the max of the initial mask.

I implemented a first version here : #92

If you feel like this is interesting and would like to contribute, please share your ideas in comments or open a PR!

Using Pytorch-Tabnet as nn.Module blocks or torchvision models

class Roberta(transformers.BertPreTrainedModel):
    def __init__(self, conf):
        super(TweetModel, self).__init__(conf)
        self.roberta = transformers.RobertaModel.from_pretrained(ROBERTA_PATH, config=conf)
        
        self.dropout = nn.Dropout(0.1)
        self.l0 = nn.Linear(768, 2)
  
        torch.nn.init.normal_(self.l0.weight, std=0.02)
        torch.nn.init.normal_(self.l1.weight, std=0.02) 

I want to do something like this with TabNet and have my own custom model, so that I have all the liberties of using a neural net and I don't have to use it like scikit-learn again.

[Question/Feature Request] You mentioned it works with GPU, does Fast-TabNet work with TPUs?

Feature request

What is the expected behavior?
Same as the outcome on CPUs and GPUs

What is motivation or use case for adding/changing the behaviour?
Better training performance

How should this be implemented in your opinion?
Similar to how TensorFlow/PyTorch send the data to the TPU

Are you willing to work on this yourself?
Happy to contribute along with another experienced developer

Bug with 1 shared layer and 2 independent layers

Describe the bug

There is a problem with the way we deal with layer indexing that leads to a bug.

What is the current behavior?

You'll get an error if trying to set n_shared to 1 and n_independent to 2 for example.

Expected behavior

We should be able to put any value without error.
A fairly simple fix should be done

RuntimeError: CUDA error: an illegal memory access was encountered

Describe the bug
I'm having this CUDA error when fitting the classifier. I googled it and found out that this is a common PyTorch error, so I tried to solve it by explicitly setting the GPU device (I have only one GPU, a Tesla T4), but it didn't work. Although, when setting the classifier with the parameter device_name: 'auto', it recognises my GPU device.
I also tried different batch sizes but without success.

It runs nicely with CPUs though and I'm really not sure on how to make it work with GPU. Would appreciate any help if you have encountered this issue already.

Also, I have checked my dataset multiple times to ensure there were no NaN or Inf values in it.


The details of the error:


RuntimeError Traceback (most recent call last)
in
7 batch_size=16384, virtual_batch_size=1024,
8 num_workers=0,
----> 9 drop_last=False
10 )

/opt/conda/lib/python3.7/site-packages/pytorch_tabnet/tab_model.py in fit(self, X_train, y_train, X_valid, y_valid, loss_fn, weights, max_epochs, patience, batch_size, virtual_batch_size, num_workers, drop_last)
133 virtual_batch_size=self.virtual_batch_size,
134 momentum=self.momentum,
--> 135 device_name=self.device_name).to(self.device)
136
137 self.reducing_matrix = create_explain_matrix(self.network.input_dim,

/opt/conda/lib/python3.7/site-packages/pytorch_tabnet/tab_network.py in init(self, input_dim, output_dim, n_d, n_a, n_steps, gamma, cat_idxs, cat_dims, cat_emb_dim, n_independent, n_shared, epsilon, virtual_batch_size, momentum, device_name)
250 device_name = 'cpu'
251 self.device = torch.device(device_name)
--> 252 self.to(self.device)
253
254 def forward(self, x):

/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py in to(self, *args, **kwargs)
423 return t.to(device, dtype if t.is_floating_point() else None, non_blocking)
424
--> 425 return self._apply(convert)
426
427 def register_backward_hook(self, hook):

/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py in _apply(self, fn)
199 def _apply(self, fn):
200 for module in self.children():
--> 201 module._apply(fn)
202
203 def compute_should_use_set_data(tensor, tensor_applied):

/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py in _apply(self, fn)
199 def _apply(self, fn):
200 for module in self.children():
--> 201 module._apply(fn)
202
203 def compute_should_use_set_data(tensor, tensor_applied):

/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py in _apply(self, fn)
221 # with torch.no_grad():
222 with torch.no_grad():
--> 223 param_applied = fn(param)
224 should_use_set_data = compute_should_use_set_data(param, param_applied)
225 if should_use_set_data:

/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py in convert(t)
421
422 def convert(t):
--> 423 return t.to(device, dtype if t.is_floating_point() else None, non_blocking)
424
425 return self._apply(convert)

RuntimeError: CUDA error: an illegal memory access was encountered

RandomizedSearchCV with pytorch-tabnet

It appears that the TabNetClassifier does not have a get_params method for hyperparameter estimation.

Is this reproducible on your end?

Many thanks

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-33-03d6c8d15377> in <module>()
      4 
      5 start = time()
----> 6 randomSearch.fit(X_train, y_train)
      7 
      8 

1 frames
/usr/local/lib/python3.6/dist-packages/sklearn/base.py in clone(estimator, safe)
     65                             "it does not seem to be a scikit-learn estimator "
     66                             "as it does not implement a 'get_params' methods."
---> 67                             % (repr(estimator), type(estimator)))
     68     klass = estimator.__class__
     69     new_object_params = estimator.get_params(deep=False)

TypeError: Cannot clone object 'TabNetClassifier(n_d=32, n_a=32, n_steps=5,
                 lr=0.02, seed=0,
                 gamma=1.5, n_independent=2, n_shared=2,
                 cat_idxs=[],
                 cat_dims=[],
                 cat_emb_dim=1,
                 lambda_sparse=0.0001, momentum=0.3,
                 clip_value=2.0,
                 verbose=1, device_name="auto",
                 model_name="DreamQuarkTabNet", epsilon=1e-15,
                 optimizer_fn=<class 'torch.optim.adam.Adam'>,
                 scheduler_params={'gamma': 0.95, 'step_size': 20},
                 scheduler_fn=<class 'torch.optim.lr_scheduler.StepLR'>, saving_path="./")' (type <class 'pytorch_tabnet.tab_model.TabNetClassifier'>): it does not seem to be a scikit-learn estimator as it does not implement a 'get_params' methods.

device in torch.nn.Module

Hello and thank you for your great work!

What was the idea behind passing the device parameter to the constructor of nn.Module and storing it? I've never seen that pattern in PyTorch before.

Num workers as parameters

Feature request

In order to improve speed, users could change num_workers directly in the model parameters or the fit parameters (probably better as a fit parameter).

What is the expected behavior?

This would make it easier for users to use as many threads as possible via the num_workers of the torch DataLoaders.

What is motivation or use case for adding/changing the behavior?
See #97

How should this be implemented in your opinion?

Are you willing to work on this yourself?
yes

Create a set of benchmark dataset

Feature request

I created some Research issues that would be interesting to work on, but it's hard to tell whether an idea is good without a clear benchmark on different datasets.

So it would be great to have a few notebooks that could run on different datasets in order to monitor the performance uplift of a new implementation.

What is the expected behavior?
The idea would be to run this for each improvement proposal and see whether it helped or not.

How should this be implemented in your opinion?
This issue could be closed little by little by adding new notebooks that each perform a benchmark on one well known dataset.

Or maybe it's a better idea to incorporate TabNet into existing benchmarks like the CatBoost benchmarks: https://github.com/catboost/benchmarks

Are you willing to work on this yourself?
yes of course, but any help would be appreciated!

Models don't accept model_name, saving_path

Describe the bug

Models don't accept model_name, saving_path as initialization arguments.

What is the current behavior?

See above.

If the current behavior is a bug, please provide the steps to reproduce.

clf: TabNetClassifier = TabNetClassifier(saving_path="/home/user123/dev/", device_name="cpu")

Expected behavior

Models should accept model_name, saving_path as initialization arguments as specified in the documentation.

Screenshots

Other relevant information:
poetry version:
python version:
Operating System:
Additional tools:

Additional context

On a related note: how can models be persisted? The mentioned init parameters strongly suggest that it is possible, but I couldn't find any information on this, neither in the documentation nor in the code.

Build doc and make it github pages

Feature request

What is the expected behavior?

What is motivation or use case for adding/changing the behavior?

How should this be implemented in your opinion?

Are you willing to work on this yourself?
yes

Weight initialization different from the original paper

From the experiment section of the TabNet paper:

"Adam optimization algorithm (Kingma & Ba, 2014) and Glorot uniform initialization are used for training of all models."

Also, from the TensorFlow implementation provided by the authors, they used tf.layers.dense which seems to use glorot_uniform by default.

However, in the tab_network.py:

def initialize_non_glu(module, input_dim, output_dim):
    gain_value = np.sqrt((input_dim+output_dim)/np.sqrt(4*input_dim))
    torch.nn.init.xavier_normal_(module.weight, gain=gain_value)
    # torch.nn.init.zeros_(module.bias)
    return


def initialize_glu(module, input_dim, output_dim):
    gain_value = np.sqrt((input_dim+output_dim)/np.sqrt(input_dim))
    torch.nn.init.xavier_normal_(module.weight, gain=gain_value)
    # torch.nn.init.zeros_(module.bias)
    return

So my questions are:

  1. Why use Glorot normal initialization instead of Glorot uniform initialization as described in the paper?

  2. What are the reasons behind the formulas used here to calculate the gain value? Is there any reference for this? The recommended gain value for a linear layer should be the default value 1.

Thanks!


What is the objective function to optimize?

I notice that for every epoch, there are train and valid accuracies.
Is accuracy the metric used for the optimization? I am currently dealing with a binary classification problem, and I would like to use AUC or recall as a metric. Would I be able to do that too?

Thank you very much for your response.

Multi-class output in binary classification

The original TabNet classifier by Google is hard-coded to output predictions in a multi-class format, regardless of whether num_classes is 2.

Would you know if the above means

  • there are two output neurons in the model
  • performance is affected for binary classification problems?

Is your implementation similar in this aspect?

Create network on model instantiation

Feature request

What is the expected behavior?
The network attribute should be created as soon as a model classifier or regressor is instantiated.

What is motivation or use case for adding/changing the behavior?
The network's existence is independent of the fit function and this will help with saving/loading features. None of the network parameters depend on any fit-only information.

How should this be implemented in your opinion?

Are you willing to work on this yourself?
yes

Error with using Pytorch Lr Scheduler

I am trying to use the ReduceLROnPlateau lr scheduler with TabNetRegressor and I am getting the following error:
step() missing 1 required positional argument: 'metrics'

I can't find any argument to pass in the metrics; I even went through the code of TabNet. Help would be appreciated.
Thanks in advance

Can't really set n_independent or n_shared to zero

Describe the bug

What is the current behavior?
It's possible to train with n_independent=0 and n_shared=0 without any error, but looking at the code it seems that zero actually behaves as 1, so the minimal value is 1, and this should not be the case.

If the current behavior is a bug, please provide the steps to reproduce.

Expected behavior

Well I guess 0 and 0 should throw a clear error, but 0 should mean 0.


Research : Binary Mask vs Sparse Mask?

Main Remark

The TabNet architecture uses the sparsemax function in order to perform instance-wise feature selection, and this is one of the important features of TabNet.

One of the interesting properties of sparsemax is that its outputs sum to 1, but do we really want this?
Is it the role of the mask to perform both selection (0s for unused features) and importance (a value between 0 and 1)?
I would say that the feature transformer should be used to create importance (by summing values of the relu outputs as is done in the paper) and the masks should output binary masks that would not sum to 1.

One problem I see with non-binary masks is that they change the values for the next layers: if someone is 50 years old, and the attention layer thinks that age is half of the solution, then the attention for age would be 0.5, and the next layer would see age=25. But how can the next layers differentiate between 75/3, 50/2 and 25? They can't really, so it seems that some information is lost along the way because of the masks. That's why I would be interested to see how binary masks perform!

Proposed Solutions

I'm not quite sure if there are known solutions for this. Would thresholding a softmax work? Would you add this threshold as a parameter, or would it be learnt by the model itself? I'm not even sure that it would.

If you feel like this is interesting and would like to contribute, please share your ideas in comments or open a PR!

Embedding dims does not work for cat_emb_dim > 1

Describe the bug

What is the current behavior?

If you try to set cat_emb_dim to a value bigger than 1, you'll get a DimensionError due to explain and embeddings.
If the current behavior is a bug, please provide the steps to reproduce.

Expected behavior

This should work and return the sum of importances for the embedded dimensions.

model saving produces error on Windows

Hi,

pytorch-tabnet 1.0.4,
on Windows I got this error:
OSError: [Errno 22] Invalid argument: './DreamQuarkTabNet_13-03-2020_12:47:25.pt'

In tab_model.py:
Lines 112-113
model_name is defined with:
dt_string = now.strftime("%d-%m-%Y%H:%M:%S")
self.model_name += dt_string

Once this is run, it produces the above error on Windows:
torch.save(self.network, self.saving_path+f"{self.model_name}.pt")

--> Please change
line 113 to
dt_string = now.strftime("%d-%m-%Y%H_%M_%S")

Not calling set_params is making model crash (no batch size)


Research : Change Attention Transformer Inputs

Main Remark

Currently in the TabNet architecture, part of the output of the Feature Transformer is used for the predictions (n_d) and the rest (n_a) as input for the next Attentive Transformer.

But I see a flaw in this design: the Feature Transformer (let's call it FT_i) sees masked input from the previous Attentive Transformer (AT_{i-1}), so the input features of FT_i don't contain all the initial information. How can this help to select other useful features for the next step?

Proposed Solution

I think that the attentive transformer should take the raw features as input to select the next step's features; using the previous mask as a prior to avoid always selecting the same features at each step would still work.

So an easy way to try this idea would be to use the feature transformer only for predictions. The attentive transformer could be preceded by its own feature transformer if necessary, but the inputs of an attentive block would be the initial data plus the prior from the previous masks.

This could potentially improve the attentive transformer part.

If you find this interesting, don't hesitate to share your ideas in the comment section or open a PR to propose a solution!

Improve verbosity

Feature request

Currently we are plotting scores every #verbose epochs, but we should incorporate callbacks or at least a history object to avoid calling matplotlib each time.

What is the expected behavior?
Something XGBoost like

What is motivation or use case for adding/changing the behavior?
Many

How should this be implemented in your opinion?
Not quite sure yet

Are you willing to work on this yourself?
yes

getting error while fit

Describe the bug

new() received an invalid combination of arguments - got (list, int), but expected one of:

  • (*, torch.device device)
    didn't match because some of the arguments have invalid types: (!list!, !int!)
  • (torch.Storage storage)
  • (Tensor other)
  • (tuple of ints size, *, torch.device device)
  • (object data, *, torch.device device)

