tabsurvey's Introduction

Open Performance Benchmark on Tabular Data

Basis for various experiments on deep learning models for tabular data. See the Deep Neural Networks and Tabular Data: A Survey paper.

Results

Open performance benchmark results based on (stratified) 5-fold cross-validation. We use the same fold splitting strategy for every data set. The top results for each data set are in bold. The mean and standard deviation values are reported for each baseline model. Missing results indicate that the corresponding model could not be applied to the task type (regression or multi-class classification).
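
As a rough illustration of this protocol, the sketch below deals sample indices into five class-balanced folds and formats per-fold scores as mean ± standard deviation. It is not the repository's code: the helper names, the toy labels and the made-up fold scores are assumptions, and the use of the sample standard deviation is a guess (the README does not say which variant is reported).

```python
import random
import statistics
from collections import defaultdict


def stratified_folds(labels, n_splits=5, seed=221):
    """Assign each sample index to one of n_splits folds, class by class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(n_splits)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal the shuffled indices round-robin so every fold keeps the class ratio.
        for pos, idx in enumerate(indices):
            folds[pos % n_splits].append(idx)
    return folds


def summarize(scores):
    """Format per-fold scores as mean ± sample standard deviation."""
    return f"{statistics.mean(scores):.1f}±{statistics.stdev(scores):.1f}"


labels = [0] * 60 + [1] * 40                  # toy binary target
folds = stratified_folds(labels)
fold_scores = [83.1, 83.6, 83.4, 83.9, 83.5]  # made-up accuracies, one per fold
print([len(fold) for fold in folds], summarize(fold_scores))
```

Fixing the seed keeps the split identical across models, which is what makes the per-model numbers comparable.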

| Method | HELOC Acc↑ | HELOC AUC↑ | Adult Acc↑ | Adult AUC↑ | HIGGS Acc↑ | HIGGS AUC↑ | Covertype Acc↑ | Covertype AUC↑ | Cal. Housing MSE↓ |
|---|---|---|---|---|---|---|---|---|---|
| Linear Model | 73.0±0.0 | 80.1±0.1 | 82.5±0.2 | 85.4±0.2 | 64.1±0.0 | 68.4±0.0 | 72.4±0.0 | 92.8±0.0 | 0.528±0.008 |
| KNN | 72.2±0.0 | 79.0±0.1 | 83.2±0.2 | 87.5±0.2 | 62.3±0.1 | 67.1±0.0 | 70.2±0.1 | 90.1±0.2 | 0.421±0.009 |
| Decision Tree | 80.3±0.0 | 89.3±0.1 | 85.3±0.2 | 89.8±0.1 | 71.3±0.0 | 78.7±0.0 | 79.1±0.0 | 95.0±0.0 | 0.404±0.007 |
| Random Forest | 82.1±0.3 | 90.0±0.2 | 86.1±0.2 | 91.7±0.2 | 71.9±0.0 | 79.7±0.0 | 78.1±0.1 | 96.1±0.0 | 0.272±0.006 |
| XGBoost | 83.5±0.2 | 92.2±0.0 | 87.3±0.2 | 92.8±0.1 | 77.6±0.0 | 85.9±0.0 | 97.3±0.0 | 99.9±0.0 | 0.206±0.005 |
| LightGBM | 83.5±0.1 | 92.3±0.0 | 87.4±0.2 | 92.9±0.1 | 77.1±0.0 | 85.5±0.0 | 93.5±0.0 | 99.7±0.0 | 0.195±0.005 |
| CatBoost | 83.6±0.3 | 92.4±0.1 | 87.2±0.2 | 92.8±0.1 | 77.5±0.0 | 85.8±0.0 | 96.4±0.0 | 99.8±0.0 | 0.196±0.004 |
| Model Trees | 82.6±0.2 | 91.5±0.0 | 85.0±0.2 | 90.4±0.1 | 69.8±0.0 | 76.7±0.0 | - | - | 0.385±0.019 |
| MLP | 73.2±0.3 | 80.3±0.1 | 84.8±0.1 | 90.3±0.2 | 77.1±0.0 | 85.6±0.0 | 91.0±0.4 | 76.1±3.0 | 0.263±0.008 |
| VIME | 72.7±0.0 | 79.2±0.0 | 84.8±0.2 | 90.5±0.2 | 76.9±0.2 | 85.5±0.1 | 90.9±0.1 | 82.9±0.7 | 0.275±0.007 |
| DeepFM | 73.6±0.2 | 80.4±0.1 | 86.1±0.2 | 91.7±0.1 | 76.9±0.0 | 83.4±0.0 | - | - | 0.260±0.006 |
| DeepGBM | 78.0±0.4 | 84.1±0.1 | 84.6±0.3 | 90.8±0.1 | 74.5±0.0 | 83.0±0.0 | - | - | 0.856±0.065 |
| NODE | 79.8±0.2 | 87.5±0.2 | 85.6±0.3 | 91.1±0.2 | 76.9±0.1 | 85.4±0.1 | 89.9±0.1 | 98.7±0.0 | 0.276±0.005 |
| NAM | 73.3±0.1 | 80.7±0.3 | 83.4±0.1 | 86.6±0.1 | 53.9±0.6 | 55.0±1.2 | - | - | 0.725±0.022 |
| Net-DNF | 82.6±0.4 | 91.5±0.2 | 85.7±0.2 | 91.3±0.1 | 76.6±0.1 | 85.1±0.1 | 94.2±0.1 | 99.1±0.0 | - |
| TabNet | 81.0±0.1 | 90.0±0.1 | 85.4±0.2 | 91.1±0.1 | 76.5±1.3 | 84.9±1.4 | 93.1±0.2 | 99.4±0.0 | 0.346±0.007 |
| TabTransformer | 73.3±0.1 | 80.1±0.2 | 85.2±0.2 | 90.6±0.2 | 73.8±0.0 | 81.9±0.0 | 76.5±0.3 | 72.9±2.3 | 0.451±0.014 |
| SAINT | 82.1±0.3 | 90.7±0.2 | 86.1±0.3 | 91.6±0.2 | 79.8±0.0 | 88.3±0.0 | 96.3±0.1 | 99.8±0.0 | 0.226±0.004 |
| RLN | 73.2±0.4 | 80.1±0.4 | 81.0±1.6 | 75.9±8.2 | 71.8±0.2 | 79.4±0.2 | 77.2±1.5 | 92.0±0.9 | 0.348±0.013 |
| STG | 73.1±0.1 | 80.0±0.1 | 85.4±0.1 | 90.9±0.1 | 73.9±0.1 | 81.9±0.1 | 81.8±0.3 | 96.2±0.0 | 0.285±0.006 |

How to use

Using the docker container

The code is designed to run inside a docker container. See the Dockerfile. In the docker file, different conda environments are specified for the various requirements of the models. Therefore, building the container for the first time takes a while.

Just build it as usual via docker build -t <image name> <path to Dockerfile>.

Then start the docker container with:

docker run -v ~/output:/opt/notebooks/output -p 3123:3123 --rm -it --gpus all <image name>

  • The -v ~/output:/opt/notebooks/output option is recommended to have access to the outputs of the experiments on your local machine.

  • The docker run command starts a jupyter notebook (to have a nice editor for small changes or experiments). To have access to the notebook from outside the docker container, -p 3123:3123 connects the notebook to your local machine. You can change the port number in the Dockerfile.

  • If you have GPUs available, also add the --gpus all option to have access to them from inside the docker container.

To enter the running docker container from the command line, do the following:

  • Call docker ps to find the ID of the running container.
  • Call docker exec -it <container id> bash to enter the container. Now you can navigate to the right directory with cd opt/notebooks/.

Run a single model on a single dataset

To run a single model on a single dataset call:

python train.py --config config/<config-file of the dataset>.yml --model_name <Name of the Model>

All parameters set in the config file can be overwritten by command line arguments, for example:

  • --optimize_hyperparameters Uses Optuna to run a hyperparameter optimization. If not set, the parameters listed in the best_params.yml file are used.

  • --n_trials <number trials> Number of trials to run for the hyperparameter search

  • --epochs <number epochs> Max number of epochs

  • --use_gpu If set, available GPUs are used (specified by gpu_ids)

  • ... and so on. All possible parameters can be found in the config files or by calling: python train.py -h
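
The override behaviour described above can be sketched with the standard library alone. This is an illustration under assumptions, not the code in train.py: the project installs configargparse, whereas the sketch below uses plain argparse, and the `config_defaults` dictionary is a stand-in for values read from a dataset config file.

```python
import argparse

# Stand-in for values that would be read from a dataset config file.
config_defaults = {
    "epochs": 1000,
    "n_trials": 100,
    "use_gpu": False,
    "optimize_hyperparameters": False,
}


def parse_args(argv):
    parser = argparse.ArgumentParser()
    parser.add_argument("--model_name", required=True)
    parser.add_argument("--epochs", type=int)
    parser.add_argument("--n_trials", type=int)
    parser.add_argument("--use_gpu", action="store_true")
    parser.add_argument("--optimize_hyperparameters", action="store_true")
    # Config values become the defaults; flags given on the command line win.
    parser.set_defaults(**config_defaults)
    return parser.parse_args(argv)


args = parse_args(["--model_name", "XGBoost", "--epochs", "50", "--use_gpu"])
print(args.epochs, args.n_trials, args.use_gpu)
```

Here `--epochs` and `--use_gpu` come from the command line, while `n_trials` keeps the value from the config stand-in.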

If you are using the docker container, first enter the right conda environment using conda activate <env name> to have all required packages. The train.py file is in the opt/notebooks/ directory.


Run multiple models on multiple datasets

To run multiple models on multiple datasets, the bash script testall.sh is provided. In the bash script the models and datasets can be specified. Every model needs to know in which conda environment it has to be executed.

If you run inside our docker container, just comment out all models and datasets you don't want to run and then call:

./testall.sh
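
The idea behind the script, one training call per model and dataset, each in the model's own conda environment, might be sketched as follows. The model-to-environment mapping and the list of configs are illustrative assumptions (the authoritative ones are in testall.sh), and the sketch only prints the commands instead of running them.

```python
# Hypothetical model -> conda environment mapping; the real one lives in testall.sh.
MODEL_ENVS = {"LinearModel": "sklearn", "XGBoost": "gbdt"}
CONFIGS = ["config/adult.yml", "config/covertype.yml"]


def build_commands(model_envs, configs):
    """Return one train.py invocation per (dataset, model) pair."""
    commands = []
    for config in configs:
        for model, env in model_envs.items():
            commands.append(
                f"conda run -n {env} python train.py "
                f"--config {config} --model_name {model}"
            )
    return commands


for command in build_commands(MODEL_ENVS, CONFIGS):
    print(command)
```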


Computing model attributions (currently supported for SAINT, TabTransformer, TabNet)

The framework provides implementations to compute feature attribution explanations for several models. Additionally, the feature attributions can be compared automatically to SHAP values, and a global ablation test, which successively perturbs the most important features, can be run. The same parameters as before can be passed, with some additions:

attribute.py --model_name <Name of the Model> [--globalbenchmark] [--compareshap] [--numruns <int>] [--strategy diag]

  • --globalbenchmark Additionally run the global perturbation benchmark

  • --compareshap Compare attributions to Shapley values

  • --numruns <number run> Number of repetitions for the global benchmark

  • --strategy diag SAINT and TabTransformer support another attribution strategy, where the diagonal of the attention map is used. Pass this argument to use it.
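
To make the global ablation test more concrete, here is a minimal sketch of one way such a test can work. It is not the implementation in attribute.py: ranking features by mean absolute attribution, replacing a perturbed feature with a constant baseline, and the toy model and data are all assumptions made for illustration.

```python
def global_ablation(predict, X, y, attributions, baseline=0.0):
    """Perturb features from most to least important and track accuracy."""
    n_features = len(X[0])
    # Global importance: mean absolute attribution per feature.
    importance = [
        sum(abs(row[j]) for row in attributions) / len(attributions)
        for j in range(n_features)
    ]
    order = sorted(range(n_features), key=lambda j: importance[j], reverse=True)

    X_perturbed = [list(row) for row in X]
    accuracies = []
    for j in order:
        for row in X_perturbed:
            row[j] = baseline  # replace the feature with an uninformative value
        correct = sum(predict(row) == target for row, target in zip(X_perturbed, y))
        accuracies.append(correct / len(y))
    return order, accuracies


def predict(row):
    """Toy model that only looks at feature 0."""
    return int(row[0] > 0.5)


X = [[0.9, 0.1, 0.3], [0.2, 0.8, 0.5], [0.7, 0.4, 0.9], [0.1, 0.6, 0.2]]
y = [1, 0, 1, 0]
attributions = [[0.8, 0.2, 0.0], [0.7, 0.1, 0.1], [0.9, 0.1, 0.1], [0.6, 0.2, 0.0]]
order, accuracies = global_ablation(predict, X, y, attributions)
print(order, accuracies)
```

If the attributions are faithful, accuracy should drop sharply once the top-ranked features are perturbed, as it does here after feature 0 is removed.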


Add new models

Every new model should inherit from the base class BaseModel. Implement the following methods:

  • def __init__(self, params, args): Define your model here.
  • def fit(self, X, y, X_val=None, y_val=None): Implement the training process. (Return the loss and validation history)
  • def predict(self, X): Save and return the predictions on the test data - the regression values or the concrete classes for classification tasks
  • def predict_proba(self, X): Only for classification tasks. Save and return the probability distribution over the classes.
  • def define_trial_parameters(cls, trial, args): Define the hyperparameters that should be optimized.
  • (optional) def save_model: If you want to save your model in a specific manner, override this function too.

Add your <model>.py file to the models directory and do not forget to update the models/__init__.py file.
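
The interface above can be illustrated with a deliberately trivial model. The `BaseModel` class below is only a stand-in so that the sketch runs on its own; the real base class in the repository does more (for example storing predictions), and `MajorityClass` and its `smoothing` parameter are hypothetical names.

```python
from collections import Counter


class BaseModel:
    """Stand-in for the framework's base class, defined here only so the sketch runs."""

    def __init__(self, params, args):
        self.params = params
        self.args = args


class MajorityClass(BaseModel):
    """Toy classifier that always predicts the most frequent training label."""

    def __init__(self, params, args):
        super().__init__(params, args)
        self.smoothing = params.get("smoothing", 1)
        self.counts = Counter()

    def fit(self, X, y, X_val=None, y_val=None):
        self.counts = Counter(y)
        return [], []  # loss history and validation loss history (none here)

    def predict(self, X):
        majority = self.counts.most_common(1)[0][0]
        return [majority for _ in X]

    def predict_proba(self, X):
        classes = sorted(self.counts)
        total = sum(self.counts.values()) + self.smoothing * len(classes)
        probs = [(self.counts[c] + self.smoothing) / total for c in classes]
        return [probs for _ in X]

    @classmethod
    def define_trial_parameters(cls, trial, args):
        # `trial` is expected to behave like an Optuna trial.
        return {"smoothing": trial.suggest_int("smoothing", 1, 10)}


model = MajorityClass(params={"smoothing": 1}, args=None)
model.fit([[0.1], [0.4], [0.7]], [1, 0, 1])
print(model.predict([[0.5]]), model.predict_proba([[0.5]]))
```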


Add new datasets

Every dataset needs a config file specifying its features. Add the config file to the config directory.

The necessary information is:

  • dataset: Name of the dataset
  • objective: Binary, classification or regression task
  • direction: Direction of optimization. In the current implementation, the binary scorer returns the AUC score and should therefore be maximized. The classification scorer uses the log loss and the regression scorer the MSE, so both should be minimized.
  • num_features: Total number of features in the dataset
  • num_classes: Number of classes in classification task. Set to 1 for binary or regression task.
  • cat_idx: List the indices of the categorical features in your dataset (if there are any).

It is recommended to specify the remaining hyperparameters here as well.
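
The constraints listed above can be checked mechanically before a long run. The function below is a hypothetical helper, not part of the framework; it works on a plain dictionary with the same keys as the config file and assumes that categorical indices are zero-based. The example values are those of the Adult dataset.

```python
def check_dataset_config(cfg):
    """Return a list of problems found in a dataset config dictionary."""
    problems = []
    expected_direction = {
        "binary": "maximize",          # AUC
        "classification": "minimize",  # log loss
        "regression": "minimize",      # MSE
    }
    objective = cfg.get("objective")
    if objective not in expected_direction:
        problems.append(f"unknown objective: {objective!r}")
    elif cfg.get("direction") != expected_direction[objective]:
        problems.append(f"direction should be {expected_direction[objective]}")
    if objective in ("binary", "regression") and cfg.get("num_classes") != 1:
        problems.append("num_classes should be 1 for binary or regression tasks")
    for idx in cfg.get("cat_idx") or []:
        if not 0 <= idx < cfg["num_features"]:
            problems.append(f"cat_idx {idx} is outside 0..num_features-1")
    return problems


adult = {
    "dataset": "Adult",
    "objective": "binary",
    "direction": "maximize",
    "num_features": 14,
    "num_classes": 1,
    "cat_idx": [1, 3, 5, 6, 7, 8, 9, 13],
}
print(check_dataset_config(adult))
```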


Citation

If you use this codebase, please cite our work:

@article{borisov2022deep,
 author={Borisov, Vadim and Leemann, Tobias and Seßler, Kathrin and Haug, Johannes and Pawelczyk, Martin and Kasneci, Gjergji},
  journal={IEEE Transactions on Neural Networks and Learning Systems}, 
  title={Deep Neural Networks and Tabular Data: A Survey}, 
  year={2022},
  volume={},
  number={},
  pages={1-21},
  doi={10.1109/TNNLS.2022.3229161}
}

tabsurvey's People

Contributors

kathrinse, tleemann, unnir

tabsurvey's Issues

Did the experiments of SAINT eliminate the influence of the pre-training process?

Hi, thanks for this insightful survey on DL for tabular data. I would like to ask for more details about the experiments on the SAINT models.
Did the experiments with SAINT eliminate the influence of the pre-training process? Or were the reported results obtained from a model fine-tuned after unsupervised pre-training? I could not find a description of this in the article, so I would appreciate an answer. Thanks!

If cat_idxs is non-empty, cat_dims must be defined as a list of same length

Hi Kathrin,

I get the following error for california_housing and covertype for the TabNet model.

python train.py  --config config/covertype.yml --model_name TabNet 
.
Namespace(config='config/covertype.yml', model_name='TabNet', dataset='Covertype', objective='classification', 
use_gpu=True, gpu_ids=[0, 1, 2, 3], data_parallel=True, optimize_hyperparameters=False, n_trials=100, 
direction='minimize', 
num_splits=5, shuffle=True, seed=221, scale=True, target_encode=True, one_hot_encode=False, batch_size=128, 
val_batch_size=256, early_stopping_rounds=20, epochs=1000, logging_period=100, num_features=54, num_classes=7, 
cat_idx=None, cat_dims=None)
.
.
Traceback (most recent call last):
  File "/scratch1/dun280/TabSurvey/train.py", line 154, in <module>
    main_once(arguments)
  File "/scratch1/dun280/TabSurvey/train.py", line 138, in main_once
    sc, time = cross_validation(model, X, y, args)
  File "/scratch1/dun280/TabSurvey/train.py", line 41, in cross_validation
    loss_history, val_loss_history = curr_model.fit(X_train, y_train, X_test, y_test)  # X_val, y_val)
  File "/scratch1/dun280/TabSurvey/models/tabnet.py", line 38, in fit
    self.model.fit(X, y, eval_set=[(X_val, y_val)], eval_name=["eval"], eval_metric=self.metric,
  File "/home/dun280/.local/lib/python3.9/site-packages/pytorch_tabnet/abstract_model.py", line 223, in fit
    self._set_network()
  File "/home/dun280/.local/lib/python3.9/site-packages/pytorch_tabnet/abstract_model.py", line 570, in _set_network
    self.network = tab_network.TabNet(
  File "/home/dun280/.local/lib/python3.9/site-packages/pytorch_tabnet/tab_network.py", line 567, in __init__
    self.embedder = EmbeddingGenerator(input_dim, cat_dims, cat_idxs, cat_emb_dim)
  File "/home/dun280/.local/lib/python3.9/site-packages/pytorch_tabnet/tab_network.py", line 809, in __init__
    raise ValueError(msg)
ValueError: If cat_idxs is non-empty, cat_dims must be defined as a list of same length.

Unfortunately, once again, this is also happening for my data

Bye
R

New version

Hi @kathrinse, I have been working with your repo for a short time and I like it very much. Can I create a modified version based on this repo (not fork)? I will mention your repo in my acknowledgment.

information in attributionsNone.json

Hi Kathrin,

what is the format of the information in attributionsNone.json?

(Pytorch): python attributions.py --model_name TabNet

and then in R

library(rjson)
myData <- fromJSON(file="output/TabNet/Adult/attributionsNone.json")
myData[[1]]
# [1] "TabNet" "TabNet"
myData[[2]]
# [1] "None" "None"
myData[[3]]
#[1] "Adult" "Adult"
> all.equal(myData[[4]][[1]],myData[[4]][[2]])
#[1] TRUE
> myData[[4]][[2]][1]
#[[1]]
 #[1] 1.232716441 0.000000000 0.000000000 0.016803199 0.018363461 0.005803025
 #[7] 0.438206583 0.303667098 0.000000000 0.000000000 0.619914174 0.000000000
#[13] 1.449993968 0.000000000

Bye
R

Optimal hyperparameters?

Would it be possible to publish (e.g., in the README) the optimal hyperparameters that are used to generate the data in Table IV of the paper? This would make reproducing your results much more convenient!

AssertionError: you must pass in 0 values for your categories input

Hi, @kathrinse. When I tried to train TabTransformer, I got this error. It seems that args.cat_dims is None even though I passed cat_dims in my config.

[]   # args.cat_dims
On Device: cuda
Using dim 128 and batch size 10240
On Device: cuda
[W 2022-10-24 22:25:20,624] Trial 10 failed because of the following error: AssertionError('you must pass in 0 values for your categories input')
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/optuna/study/_optimize.py", line 196, in _run_trial
    value_or_values = func(trial)
  File "train.py", line 117, in __call__
    sc, time = cross_validation(model, self.X, self.y, self.args)
  File "train.py", line 60, in cross_validation
    loss_history, val_loss_history = curr_model.fit(X_train, y_train, X_test, y_test)  # X_val, y_val)
  File "/content/drive/MyDrive/Code/TabSurvey-main/models/tabtransformer.py", line 103, in fit
    out = self.model(x_categ, x_cont)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/content/drive/MyDrive/Code/TabSurvey-main/models/tabtransformer.py", line 463, in forward
    assert x_categ.shape[-1] == self.num_categories, f'you must pass in {self.num_categories} ' \
AssertionError: you must pass in 0 values for your categories input

Here is my config

# General parameters
dataset: A
model_name: TabTransformer # LinearModel, KNN, SVM, DecisionTree, RandomForest, XGBoost, CatBoost, LightGBM, ModelTree
                # MLP, TabNet, VIME, TabTransformer, RLN, DNFNet, STG, NAM, DeepFM, SAINT
objective: classification # Don't change
# optimize_hyperparameters: True

# Preprocessing parameters
scale: True
target_encode: True
one_hot_encode: False

# Training parameters
batch_size: 10240
val_batch_size: 2560
early_stopping_rounds: 100
epochs: 2
logging_period: 100

# About the data
num_classes: 4  # for classification
num_features: 18
cat_idxs: [0, 1, 2, 3, 4, 14, 15, 16, 17]
# cat_dims: will be automatically set.
cat_dims: [10, 2, 7, 4, 6, 4, 4, 4, 4]

Hope to hear from you soon. Thanks in advance.

TabTransformer implementation in study 🤖

Dear @kathrinse,

thanks for your excellent study on deep neural networks for tabular data! I came across your repository and paper during the research for my on-going master's thesis.

I noticed two aspects that I wanted to share with you:

Sorry if I have missed something.

Keep up the excellent work 💯

Best,
Markus

KeyError: 'HIGGS'

Hi kathrinse,

I get the following error for the Higgs data set for all the models that I have tried.
The same models work perfectly well on the other included data sets.

> python train.py --config config/higgs.yml --model_name RandomForest --use_gpu

Traceback (most recent call last):
  File "TabSurvey/train.py", line 154, in <module>
    main_once(arguments)
  File "TabSurvey/train.py", line 133, in main_once
    print(args.parameters[args.dataset])
KeyError: 'HIGGS'

The only change to the code I have made is to download HIGGS.csv.gz and access it locally, i.e. in load_data.py

 path = "/scratch1/dun280/TabSurvey/data/HIGGS.csv.gz"

The error is particularly concerning as I get the same error when I have tried to add my own data sets. I have added a config file and
a section in load_data.py but it falls over with a KeyError.

Bye
R

n_trials parameter not obeyed

Hi Kathrin

For one of my data sets I am setting the parameters as follows:

Namespace(config='config/wheat_anthesis.yml', model_name='XGBoost', dataset='wheat_anthesis', objective='regression', 
use_gpu=False, gpu_ids=[0, 1, 2, 3], data_parallel=True, optimize_hyperparameters=True, n_trials=20, direction='minimize', 
num_splits=5, shuffle=True, seed=221, scale=True, target_encode=False, one_hot_encode=False, batch_size=251, 
val_batch_size=251, early_stopping_rounds=20, epochs=500, logging_period=100, num_features=44567, num_classes=1, 
cat_idx=None, cat_dims=None)

This is a very time consuming process so I have n_trials=20. However the process runs out of time and gives this output for
XGBoost

 Trial 40 finished with value: 123.61018808245896 and parameters: {'max_depth': 3, 'alpha': 1.0864626779404396e-06, 'lambda': 
0.022185113413679115, 'eta': 0.08421066233419264}. Best is trial 29 with value: 114.67435838448253.

Why does it get to a "Trial 40" when I set n_trials=20? Do these refer to different things?

bye
R
oops -- title should be "n_trials", not "n_trails"
Sounds like I was thinking of "entrails"
R

Can't build docker image because catboost won't install

Logs:

$ docker build -t tabsurvey .
[+] Building 1.1s (18/49)                                                                                                                                                                    
 => [internal] load build definition from Dockerfile                                                                                                                                    0.0s
 => => transferring dockerfile: 37B                                                                                                                                                     0.0s
 => [internal] load .dockerignore                                                                                                                                                       0.0s
 => => transferring context: 2B                                                                                                                                                         0.0s
 => [internal] load metadata for docker.io/continuumio/miniconda3:latest                                                                                                                0.5s
 => [ 1/46] FROM docker.io/continuumio/miniconda3@sha256:977263e8d1e476972fddab1c75fe050dd3cd17626390e874448bd92721fd659b                                                               0.0s
 => CACHED [ 2/46] RUN /opt/conda/bin/conda install jupyter -y                                                                                                                          0.0s
 => CACHED [ 3/46] RUN mkdir /opt/notebooks                                                                                                                                             0.0s
 => CACHED [ 4/46] RUN opt/conda/bin/jupyter notebook --generate-config                                                                                                                 0.0s
 => CACHED [ 5/46] RUN /opt/conda/bin/conda create -n sklearn -y scikit-learn                                                                                                           0.0s
 => CACHED [ 6/46] RUN /opt/conda/bin/conda install -n sklearn -y -c anaconda ipykernel                                                                                                 0.0s
 => CACHED [ 7/46] RUN /opt/conda/envs/sklearn/bin/python -m ipykernel install --user --name=sklearn                                                                                    0.0s
 => CACHED [ 8/46] RUN /opt/conda/bin/conda install -n sklearn -y -c conda-forge optuna                                                                                                 0.0s
 => CACHED [ 9/46] RUN /opt/conda/bin/conda install -n sklearn -y -c conda-forge configargparse                                                                                         0.0s
 => CACHED [10/46] RUN /opt/conda/bin/conda install -n sklearn -y pandas                                                                                                                0.0s
 => CACHED [11/46] RUN /opt/conda/bin/conda create -n gbdt -y                                                                                                                           0.0s
 => CACHED [12/46] RUN /opt/conda/bin/conda install -n gbdt -y -c anaconda ipykernel                                                                                                    0.0s
 => CACHED [13/46] RUN /opt/conda/envs/gbdt/bin/python -m ipykernel install --user --name=gbdt                                                                                          0.0s
 => CACHED [14/46] RUN /opt/conda/envs/gbdt/bin/python -m pip install xgboost==1.5.0                                                                                                    0.0s
 => ERROR [15/46] RUN /opt/conda/envs/gbdt/bin/python -m pip install catboost==1.0.3                                                                                                    0.6s
------
 > [15/46] RUN /opt/conda/envs/gbdt/bin/python -m pip install catboost==1.0.3:
#18 0.547 ERROR: Could not find a version that satisfies the requirement catboost==1.0.3 (from versions: none)
#18 0.547 ERROR: No matching distribution found for catboost==1.0.3
------
executor failed running [/bin/sh -c /opt/conda/envs/gbdt/bin/python -m pip install catboost==1.0.3]: exit code: 1

Let me know if there's anything else I should add here.

Missing package versions in Dockerfile

I noticed that in the Dockerfile, the version number has not been specified for many packages. This already leads to some ML models not being executable because APIs have changed. For example, with the automatically installed numpy version (currently 1.24.3), the numpy.float data type no longer exists, making the code in models/vime.py not executable. This bug is easily fixed by simply replacing np.float with float.

Another error occurs, for example, for the ML model STG in models/stg_lib/utils.py. In the collections module used, the collections.Sequence attribute no longer exists, making STG unusable.

In the Dockerfile, the package versions should still be added because more errors will occur sooner or later due to changed interfaces. Also, I have not tested all provided ML models, so further errors can not be excluded.
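
For illustration, the two kinds of fixes mentioned above look roughly like this in isolation. The helper `as_float_list` is a made-up example and not code from the repository; the sketch only assumes that the abstract base classes are imported from collections.abc (the old `collections.Sequence` alias was removed in Python 3.10) and that the built-in `float` replaces `np.float`.

```python
from collections.abc import Sequence  # replaces the removed `collections.Sequence`


def as_float_list(values):
    """Convert a sequence of numbers to built-in floats."""
    if not isinstance(values, Sequence):
        raise TypeError("expected a sequence")
    # With NumPy, the analogous change is `dtype=float` instead of `dtype=np.float`.
    return [float(v) for v in values]


print(as_float_list((1, 2, 3)))
```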

Many greetings
Sedir Mohammed

TypeError: default_collate: batch must contain tensors, numpy arrays, numbers, dicts or lists; found object

Hi @kathrinse. When training TabNet on the Adult dataset, I get this error.
Here is the output:

Namespace(batch_size=128, cat_dims=[9, 16, 7, 15, 6, 5, 2, 42], cat_idx=[1, 3, 5, 6, 7, 8, 9, 13], config='config/adult.yml', data_parallel=True, dataset='Adult', direction='maximize', early_stopping_rounds=20, epochs=1000, gpu_ids=[0], logging_period=100, model_name='TabNet', n_trials=10, num_classes=1, num_features=14, num_splits=5, objective='binary', one_hot_encode=False, optimize_hyperparameters=True, scale=True, seed=221, shuffle=True, target_encode=True, use_gpu=True, val_batch_size=256)
Start hyperparameter optimization
Loading dataset Adult...
Dataset loaded!
(32561, 14)
Scaling the data...
[I 2022-10-05 19:09:12,970] A new study created in RDB with name: TabNet_Adult
A new study created in RDB with name: TabNet_Adult
/usr/local/lib/python3.7/dist-packages/pytorch_tabnet/abstract_model.py:75: UserWarning: Device used : cuda
  warnings.warn(f"Device used : {self.device}")
/usr/local/lib/python3.7/dist-packages/pytorch_tabnet/abstract_model.py:75: UserWarning: Device used : cuda
  warnings.warn(f"Device used : {self.device}")
epoch 0  | loss: 0.68711 | eval_logloss: 1.5043  |  0:00:20s
epoch 1  | loss: 0.39683 | eval_logloss: 0.4977  |  0:00:41s
epoch 2  | loss: 0.38015 | eval_logloss: 0.43938 |  0:01:01s
epoch 3  | loss: 0.36482 | eval_logloss: 0.43644 |  0:01:22s
epoch 4  | loss: 0.34721 | eval_logloss: 0.38523 |  0:01:43s
epoch 5  | loss: 0.34573 | eval_logloss: 0.35584 |  0:02:03s
epoch 6  | loss: 0.34037 | eval_logloss: 0.38542 |  0:02:23s
epoch 7  | loss: 0.33787 | eval_logloss: 0.35565 |  0:02:44s
epoch 8  | loss: 0.32982 | eval_logloss: 0.35525 |  0:03:04s
epoch 9  | loss: 0.32862 | eval_logloss: 0.33821 |  0:03:24s
epoch 10 | loss: 0.32244 | eval_logloss: 0.33319 |  0:03:45s
epoch 11 | loss: 0.32608 | eval_logloss: 0.34302 |  0:04:06s
epoch 12 | loss: 0.3276  | eval_logloss: 0.36721 |  0:04:26s
epoch 13 | loss: 0.32269 | eval_logloss: 0.3386  |  0:04:47s
epoch 14 | loss: 0.32002 | eval_logloss: 0.33012 |  0:05:08s
epoch 15 | loss: 0.31808 | eval_logloss: 0.33689 |  0:05:28s
epoch 16 | loss: 0.31916 | eval_logloss: 0.32849 |  0:05:49s
epoch 17 | loss: 0.31616 | eval_logloss: 0.34039 |  0:06:10s
epoch 18 | loss: 0.31717 | eval_logloss: 0.34637 |  0:06:30s
epoch 19 | loss: 0.31554 | eval_logloss: 0.33508 |  0:06:50s
epoch 20 | loss: 0.318   | eval_logloss: 0.43872 |  0:07:11s
epoch 21 | loss: 0.32983 | eval_logloss: 0.49745 |  0:07:31s
epoch 22 | loss: 0.31808 | eval_logloss: 0.33653 |  0:07:52s
epoch 23 | loss: 0.31731 | eval_logloss: 0.32934 |  0:08:12s
epoch 24 | loss: 0.31352 | eval_logloss: 0.33776 |  0:08:32s
epoch 25 | loss: 0.31438 | eval_logloss: 0.34476 |  0:08:53s
epoch 26 | loss: 0.31483 | eval_logloss: 0.3282  |  0:09:13s
epoch 27 | loss: 0.30911 | eval_logloss: 0.32267 |  0:09:33s
epoch 28 | loss: 0.31008 | eval_logloss: 0.34737 |  0:09:53s
epoch 29 | loss: 0.30756 | eval_logloss: 0.32561 |  0:10:13s
epoch 30 | loss: 0.30834 | eval_logloss: 0.32646 |  0:10:33s
epoch 31 | loss: 0.30615 | eval_logloss: 0.32435 |  0:10:53s
epoch 32 | loss: 0.30466 | eval_logloss: 0.33857 |  0:11:13s
epoch 33 | loss: 0.30495 | eval_logloss: 0.33067 |  0:11:33s
epoch 34 | loss: 0.30485 | eval_logloss: 0.33315 |  0:11:53s
epoch 35 | loss: 0.30466 | eval_logloss: 0.33724 |  0:12:13s
epoch 36 | loss: 0.30336 | eval_logloss: 0.33496 |  0:12:33s
epoch 37 | loss: 0.29928 | eval_logloss: 0.35852 |  0:12:53s
epoch 38 | loss: 0.29941 | eval_logloss: 0.33168 |  0:13:13s
epoch 39 | loss: 0.30065 | eval_logloss: 0.34095 |  0:13:33s
epoch 40 | loss: 0.29873 | eval_logloss: 0.35759 |  0:13:53s
epoch 41 | loss: 0.30008 | eval_logloss: 0.35994 |  0:14:13s
epoch 42 | loss: 0.29637 | eval_logloss: 0.33748 |  0:14:33s
epoch 43 | loss: 0.29404 | eval_logloss: 0.33582 |  0:14:54s
epoch 44 | loss: 0.29512 | eval_logloss: 0.33685 |  0:15:13s
epoch 45 | loss: 0.29254 | eval_logloss: 0.34174 |  0:15:33s
epoch 46 | loss: 0.29284 | eval_logloss: 0.35136 |  0:15:53s
epoch 47 | loss: 0.2898  | eval_logloss: 0.35115 |  0:16:13s

Early stopping occurred at epoch 47 with best_epoch = 27 and best_eval_logloss = 0.32267
/usr/local/lib/python3.7/dist-packages/pytorch_tabnet/callbacks.py:172: UserWarning: Best weights from best epoch are automatically used!
  warnings.warn(wrn_msg)
[W 2022-10-05 19:25:28,854] Trial 0 failed because of the following error: TypeError('default_collate: batch must contain tensors, numpy arrays, numbers, dicts or lists; found object')
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/optuna/study/_optimize.py", line 196, in _run_trial
    value_or_values = func(trial)
  File "train.py", line 95, in __call__
    sc, time = cross_validation(model, self.X, self.y, self.args)
  File "train.py", line 41, in cross_validation
    loss_history, val_loss_history = curr_model.fit(X_train, y_train, X_test, y_test)  # X_val, y_val)
  File "/content/drive/MyDrive/Predict Student Results/Code/TabSurvey-main/models/tabnet.py", line 40, in fit
    batch_size=self.args.batch_size)
  File "/usr/local/lib/python3.7/dist-packages/pytorch_tabnet/abstract_model.py", line 260, in fit
    self.feature_importances_ = self._compute_feature_importances(X_train)
  File "/usr/local/lib/python3.7/dist-packages/pytorch_tabnet/abstract_model.py", line 723, in _compute_feature_importances
    M_explain, _ = self.explain(X, normalize=False)
  File "/usr/local/lib/python3.7/dist-packages/pytorch_tabnet/abstract_model.py", line 320, in explain
    for batch_nb, data in enumerate(dataloader):
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py", line 681, in __next__
    data = self._next_data()
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py", line 721, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/fetch.py", line 52, in fetch
    return self.collate_fn(data)
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/collate.py", line 147, in default_collate
    raise TypeError(default_collate_err_msg_format.format(elem.dtype))
TypeError: default_collate: batch must contain tensors, numpy arrays, numbers, dicts or lists; found object
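
The final TypeError says the batch contained an element of dtype object. One possible cause (an assumption, not confirmed anywhere in this thread) is that the feature matrix passed to TabNet is a NumPy array of dtype object, for example because some categorical columns still hold strings. A minimal sketch of how that could be checked and worked around is below; `to_numeric_matrix` and the toy rows are hypothetical and not part of TabSurvey or pytorch_tabnet.

```python
import numpy as np


def to_numeric_matrix(X, cat_idx):
    """Return a float32 copy of X with the columns in cat_idx ordinal-encoded.

    Hypothetical helper for illustration only; not part of TabSurvey.
    """
    X = np.asarray(X, dtype=object).copy()
    for col in cat_idx:
        # np.unique sorts the distinct values; return_inverse gives, for each
        # row, the index of its value in that sorted list (an integer code).
        _, codes = np.unique(X[:, col].astype(str), return_inverse=True)
        X[:, col] = codes
    # Every cell is now a plain number, so the cast to float32 succeeds.
    return X.astype(np.float32)


# Toy rows (made up): column 1 is categorical and still holds strings.
X_raw = np.array(
    [[39, "State-gov", 13],
     [50, "Private", 13],
     [38, "Private", 9]],
    dtype=object,
)

print(X_raw.dtype)   # object: the dtype named in the error message

X_num = to_numeric_matrix(X_raw, cat_idx=[1])
print(X_num.dtype)   # float32
```

Printing `X_train.dtype` just before the `fit` call in `models/tabnet.py` would show whether this assumption applies; if the dtype is already numeric, the cause lies elsewhere.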

Here is my config:

# General parameters
dataset: Adult
model_name: TabNet # LinearModel, KNN, SVM, DecisionTree, RandomForest, XGBoost, CatBoost, LightGBM, ModelTree
                # MLP, TabNet, VIME, TabTransformer, RLN, DNFNet, STG, NAM, DeepFM, SAINT
objective: binary # Don't change
# optimize_hyperparameters: True

# GPU parameters
use_gpu: True
gpu_ids: 0
data_parallel: True

# Optuna parameters - https://optuna.org/
n_trials: 10
direction: maximize

# Cross validation parameters
num_splits: 5
shuffle: True
seed: 221 # Don't change

# Preprocessing parameters
scale: True
target_encode: True
one_hot_encode: False

# Training parameters
batch_size: 128
val_batch_size: 256
early_stopping_rounds: 20
epochs: 1000
logging_period: 100

# About the data
num_classes: 1  # for classification
num_features: 14
cat_idx: [1,3,5,6,7,8,9,13]
# cat_dims: will be automatically set.
cat_dims: [9, 16, 7, 15, 6, 5, 2, 42]

I hope to hear from you soon. Thank you so much.
