ibm / tabformer

295 stars · 10 watching · 77 forks · 471 KB

Code & Data for "Tabular Transformers for Modeling Multivariate Time Series" (ICASSP, 2021)

Home Page: https://arxiv.org/abs/2011.01843

License: Apache License 2.0

Python 100.00%
machine-learning artificial-intelligence credit-card-dataset fraud-detection gpt bert tabular-data prsa-dataset huggingface credit-card-transaction

tabformer's Introduction

Tabular Transformers for Modeling Multivariate Time Series

This repository provides the PyTorch source code and data for Tabular Transformers (TabFormer). Details are described in the paper Tabular Transformers for Modeling Multivariate Time Series, presented at ICASSP 2021.

Summary

  • Modules for hierarchical transformers for tabular data
  • A synthetic credit card transaction dataset
  • Modified Adaptive Softmax for handling masking
  • Modified DataCollatorForLanguageModeling for tabular data
  • The modules are built within transformers from HuggingFace 🤗 (HuggingFace is ❤️)

Requirements

  • Python (3.7)
  • PyTorch (1.6.0)
  • HuggingFace Transformers (3.2.0)
  • scikit-learn (0.23.2)
  • pandas (1.1.2)

(X) denotes the version the code was tested on.

These can be installed from the provided YAML file by running:

conda env create -f setup.yml
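For reference, a minimal sketch of what setup.yml plausibly contains, based on the version list above (the actual file ships with the repository and should be used as-is):

# Sketch only; prefer the repo's own setup.yml
name: tabformer
channels:
  - pytorch
  - conda-forge
dependencies:
  - python=3.7
  - pytorch=1.6.0
  - scikit-learn=0.23.2
  - pandas=1.1.2
  - pip
  - pip:
      - transformers==3.2.0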

Credit Card Transaction Dataset

The synthetic credit card transaction dataset is provided in ./data/credit_card. There are 24M records with 12 fields. You will need git-lfs to access the data. If you are facing issues related to LFS bandwidth, you can use this direct link to access the data; you can then skip the git-lfs files by prefixing GIT_LFS_SKIP_SMUDGE=1 to the git clone command, as shown below.
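For example, assuming the repository's standard GitHub location:

GIT_LFS_SKIP_SMUDGE=1 git clone https://github.com/IBM/TabFormer.git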

[Figure: sample records from the synthetic credit card transaction dataset]


PRSA Dataset

For the PRSA dataset, you have to download the data from Kaggle and place it in the ./data/card directory.


Tabular BERT

To train a tabular BERT model on the credit card transaction or PRSA dataset, run:

$ python main.py --do_train --mlm --field_ce --lm_type bert \
                 --field_hs 64 --data_type [prsa/card] \
                 --output_dir [output_dir]

Tabular GPT2

To train a tabular GPT2 model on credit card transactions for a particular user id, run:


$ python main.py --do_train --lm_type gpt2 --field_ce --flatten --data_type card \
                 --data_root [path_to_data] --user_ids [user-id] \
                 --output_dir [output_dir]
    

Description of some options (more can be found in args.py):

  • --data_type choices are prsa and card for the Beijing PM2.5 dataset and the credit card transaction dataset, respectively.
  • --mlm for masked language modeling; option for the transformer trainer for BERT.
  • --field_hs hidden size for the field-level transformer.
  • --lm_type choices are bert and gpt2.
  • --user_ids option to pick only transactions from particular user ids.

Citation

@inproceedings{padhi2021tabular,
  title={Tabular transformers for modeling multivariate time series},
  author={Padhi, Inkit and Schiff, Yair and Melnyk, Igor and Rigotti, Mattia and Mroueh, Youssef and Dognin, Pierre and Ross, Jerret and Nair, Ravi and Altman, Erik},
  booktitle={ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  pages={3565--3569},
  year={2021},
  organization={IEEE},
  url={https://ieeexplore.ieee.org/document/9414142}
}

tabformer's People

Contributors

ealtman741, ed-winning, ink-pad, matrig, stevemar


tabformer's Issues

Real culprits in the dataset?

In the financial transaction dataset, who is considered to be the main source of transactional fraud?

A breakdown of the fraud transactions indicates that all 2,000 users were victims of fraud, whereas only around 3K of the more than 100K merchants were involved in fraudulent transactions. Does that mean these 3K merchants are considered the main culprits of credit card fraud in this dataset? Or is it assumed that the sources of fraud are external parties that conceal their identities by using the merchants' details?

Also, interestingly, some of those "shady" merchants have hundreds of thousands of transactions, yet only a small fraction of these are marked as fraud.

It would be greatly appreciated if this could be clarified, as the paper does not cover the dataset description in detail.

Number of fields

A (very minor) correction for the README file: the number of fields is 15, not 12 (see here).

TypeError: cannot astype a datetimelike from [datetime64[ns]] to [int32]

Good morning,

I get this error when I run the code. I used this command: python main.py --do_train --mlm --field_ce --lm_type bert --field_hs 64 --data_type 'card' --output_dir 'output_dir_card'. I am using Visual Studio Code to run it.

The following is the complete error:

File "main.py", line 152, in
main(opts)
File "main.py", line 48, in main
skip_user=args.skip_user)
File "C:\Users\Ariel\Documents\HSE\3rd Term\HNCC\TabFormer-main\dataset\card.py", line 66, in init
self.encode_data()
File "C:\Users\Ariel\Documents\HSE\3rd Term\HNCC\TabFormer-main\dataset\card.py", line 309, in encode_data
timestamp = self.timeEncoder(data[['Year', 'Month', 'Day', 'Time']])
File "C:\Users\Ariel\Documents\HSE\3rd Term\HNCC\TabFormer-main\dataset\card.py", line 104, in timeEncoder
int)
File "C:\Users\Ariel\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\core\generic.py", line 5815, in astype
new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors)
File "C:\Users\Ariel\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\core\internals\managers.py", line 418, in astype
return self.apply("astype", dtype=dtype, copy=copy, errors=errors)
File "C:\Users\Ariel\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\core\internals\managers.py", line 327, in apply
applied = getattr(b, f)(**kwargs)
File "C:\Users\Ariel\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\core\internals\blocks.py", line 591, in astype
new_values = astype_array_safe(values, dtype, copy=copy, errors=errors)
File "C:\Users\Ariel\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\core\dtypes\cast.py", line 1309, in astype_array_safe
new_values = astype_array(values, dtype, copy=copy)
File "C:\Users\Ariel\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\core\dtypes\cast.py", line 1242, in astype_array
raise TypeError(msg)
TypeError: cannot astype a datetimelike from [datetime64[ns]] to [int32]

I am using Python 3.7 as you mentioned, along with the library versions you listed. It seems there is a problem around lines 48 and 152 of main.py. I would be really grateful if you could tell me how to solve it.
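A common workaround for this class of error (a sketch, not the repository's code): on Windows, astype(int) resolves to int32, which newer pandas rejects for datetime64[ns] columns, so cast explicitly to int64 instead.

import pandas as pd

ts = pd.to_datetime(pd.Series(["2002-09-01 06:21", "2002-09-01 06:42"]))
# .astype(int) maps to int32 on Windows and raises the TypeError above;
# casting to int64 yields nanoseconds since epoch, then divide down to seconds
seconds = ts.astype("int64") // 10**9
print(seconds)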

LFS Clone of credit card data not working

Good day,

When trying to run git lfs pull for the credit card data, I get the following error:

batch response: This repository is over its data quota. Account responsible for LFS bandwidth should purchase more data packs to restore access.

Could you please help?

Kind regards
Ruan

How to use Tabformer for unsupervised task ( clustering )?

Hello TabFormer developers, I am dealing with a dataset that contains both continuous and categorical features. There are no labels in the dataset, and the task is to cluster the data points.

I was wondering: is it possible to use TabFormer for clustering?

Number of bins for amount

Hi,

Interesting paper and code; I'm trying to do something similar. I just had a question: do you use only the predefined 10 bins for the amount (quantizing into just 10 bins)? That seems like a really small number, or did I misunderstand? Just curious about the parameter setup.

Thanks for the answer :)
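To frame the question, a minimal sketch (an assumption about the preprocessing, not necessarily the paper's exact setup) of quantizing an amount column into 10 quantile bins with pandas:

import pandas as pd

# hypothetical amounts, taken from the sample rows shown elsewhere on this page
amounts = pd.Series([134.09, 38.48, 120.34, 128.95, 104.71,
                     54.00, 59.15, 43.12, 45.13, 12.50])
# 10 quantile bins; each amount becomes an integer bin id usable as a token
bins = pd.qcut(amounts, q=10, labels=False, duplicates="drop")
print(bins.tolist())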

Code example for Fine Tuning

Thank you for sharing this incredible work. I'm relatively new to deep learning and am eager to learn more. Would it be possible for you to share an example of the fine-tuning process? I would greatly appreciate it. Thank you in advance.

How to load the pretrained model in Fraud Detection Tasks

Hi,
It's very nice of you to open-source the code.
I've run your code and got the model checkpoints.
When I tried to reproduce the results on the fraud detection task, I tried many methods but failed to load the pretrained model.
For example,
model = TabFormerHierarchicalLM.from_pretrained("model_path")
but it failed.
Could you tell me how to load the pretrained model? Thanks a lot.
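One possible workaround (a sketch under the assumption that TabFormerHierarchicalLM is a plain torch module without its own from_pretrained): load the checkpoint's state dict directly.

import torch

# `model` stands for a freshly constructed TabFormerHierarchicalLM with the
# same config and vocabulary used for training (construction omitted here);
# the checkpoint path is a placeholder
state_dict = torch.load("model_path/pytorch_model.bin", map_location="cpu")
model.load_state_dict(state_dict, strict=False)  # strict=False tolerates naming mismatches
model.eval()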

What version of gcc is required?

I have a Linux server with no internet access and no sudo privileges, so I can't install gcc with conda install libgcc.
gcc --version on the server reports 4.8.5. What version do I need to install to be able to run the process?
Now I get the following error when trying to execute python main.py:
/lib64/libstdc++.so.6: version 'GLIBCXX_3.4.21' not found

How to reproduce the results on the paper

Hi,

I noticed that there is no code for the downstream tasks in this repo, and I'm wondering if you could upload it. I want to reproduce your results from the ICASSP paper (e.g., Table 1). Based on the paper, you feed the extracted features into an MLP / LSTM. However, the detailed config of the MLP / LSTM is not clear, and I'm not very sure how you do upsampling for the fraud detection task.

It'll be super helpful if you could provide the code for downstream tasks.

Thanks for your help!

[SEP] token not added between transactions in TabGPT

Hello,

From my understanding of the paper, for the TabGPT model, sequences of ten transactions are passed with the [SEP] token between the transactions.

However, after looking at the code, it seems that a [SEP] token is not added between the transactions; self.mlm is False here, since the --mlm flag is not included in the command provided in the README.

Thank you very much for any clarification on this.

Merchant Name is an integer

In the screenshot in the README, the merchant name is a string, but when I load the CSV file using pandas, I get an integer value instead:

          User  Card  Year  Month  Day   Time   Amount           Use Chip        Merchant Name  Merchant City Merchant State      Zip   MCC Errors? Is Fraud?
0            0     0  2002      9    1  06:21  $134.09  Swipe Transaction  3527213246127876953       La Verne             CA  91750.0  5300    None        No
1            0     0  2002      9    1  06:42   $38.48  Swipe Transaction  -727612092139916043  Monterey Park             CA  91754.0  5411    None        No
2            0     0  2002      9    2  06:22  $120.34  Swipe Transaction  -727612092139916043  Monterey Park             CA  91754.0  5411    None        No
3            0     0  2002      9    2  17:45  $128.95  Swipe Transaction  3414527459579106770  Monterey Park             CA  91754.0  5651    None        No
4            0     0  2002      9    3  06:23  $104.71  Swipe Transaction  5817218446178736267       La Verne             CA  91750.0  5912    None        No
...        ...   ...   ...    ...  ...    ...      ...                ...                  ...            ...            ...      ...   ...     ...       ...
24386895  1999     1  2020      2   27  22:23  $-54.00   Chip Transaction -5162038175624867091      Merrimack             NH   3054.0  5541    None        No
24386896  1999     1  2020      2   27  22:24   $54.00   Chip Transaction -5162038175624867091      Merrimack             NH   3054.0  5541    None        No
24386897  1999     1  2020      2   28  07:43   $59.15   Chip Transaction  2500998799892805156      Merrimack             NH   3054.0  4121    None        No
24386898  1999     1  2020      2   28  20:10   $43.12   Chip Transaction  2500998799892805156      Merrimack             NH   3054.0  4121    None        No
24386899  1999     1  2020      2   28  23:10   $45.13   Chip Transaction  4751695835751691036      Merrimack             NH   3054.0  5814    None        No
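If the column should be treated as an opaque identifier rather than a number, one option is to force its dtype at load time; a minimal sketch, with a hypothetical filename for the extracted CSV:

import pandas as pd

# "card_transactions.csv" is a placeholder for the extracted CSV's actual name
df = pd.read_csv("data/credit_card/card_transactions.csv",
                 dtype={"Merchant Name": str})
print(df["Merchant Name"].head())  # ids kept as strings, not parsed as integers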

Missing args.py

Hello!

The code and README both reference an args.py, but it doesn't seem to be committed?

Thanks for publishing your code!

How to reduce model size for fitting gpu memory

Hi!

I have a GPU with 32 GB, but it's not enough for training the model using your source code and the predefined config.
Can you please explain how to reduce the model parameters?
I've already reduced the embedding dim and the number of attention heads, but maybe you know a better way?

Thanks!

Simple question about how to use trained tabgpt model to generate data

Hi there, thanks for making such interesting work. After running
>python main.py --do_train --lm_type gpt2 --field_ce --flatten --data_type card --output_dir ./ --data_root ./data/card/
I got several checkpoints. How can I use these models to generate data that preserves data privacy, and how can I evaluate how the model performs?
Thanks.
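Pending an official example, a minimal sampling sketch using the generic HuggingFace API. This is an illustration only: the tabular GPT-2 uses a custom field-level vocabulary, so generated ids must be decoded back to field values with the repo's vocabulary, and the checkpoint path below is a placeholder.

import torch
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("checkpoint-500")  # placeholder path
model.eval()

prime = torch.tensor([[0]])  # seed token id; in practice, a prefix of real transaction tokens
with torch.no_grad():
    sampled = model.generate(prime, max_length=120, do_sample=True, top_k=50)
print(sampled[0].tolist())  # token ids, to be decoded back to field values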

Test dataset distribution

Hi,

I am trying to replicate the results from the paper.
Is the distribution of the test set the same as that of the training set before upsampling the fraud class (imbalanced), or did you use a balanced test set?
Could you also please comment on how you split the data? Did you split randomly, or did you use the bottom part of the dataset as the test set?

Kind regards,
Rafael

LFS issue: over quota, can't download

I am getting this error while attempting to clone the repo with LFS.

Error downloading object: data/credit_card/transactions.tgz (e9f589a): 
Smudge error: Error downloading data/credit_card/transactions.tgz 
(e9f589a0958f40d60f81b1a2e8428db86e00c05755caf44fb055827976c0efa2): batch response: 
This repository is over its data quota.
Account responsible for LFS bandwidth should purchase more data packs to restore access.

data format for regression task

Hi.
I want to try a regression task similar to PRSA.
For PRSA, I understand the data for training is prepared in dataset/prsa.py.
For another new regression task, where and how can I set the target value and feature data?

Is TABGPT really privacy-preserving?

Hello,
It seems to me that you don't give any guarantee in your paper that someone would not be able to attack the generated dataset (e.g., with membership inference attacks)?

How to use Model to get transaction embeddings?

Hi Team,
Thanks a lot for sharing the code.
I was able to train a BERT model on the card dataset, but I am facing an issue while loading the saved model to generate embeddings. Can you please let me know how to load the model weights and how to generate an embedding for a transaction?

After creating an instance of the TabFormerBertLM class, I am trying to load the weights with the following command.

tab_net.from_pretrained('/content/drive/MyDrive/TabFormer/checkpoint-500/pytorch_model.bin')

After running this, I get the following error:
AttributeError: 'TabFormerBertLM' object has no attribute 'from_pretrained'

It will be very helpful if you can guide me to solve this problem.

Thank you.
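One possible direction (a sketch resting on the assumption that TabFormerBertLM stores the underlying torch module in a .model attribute, which should be checked against the repo's source): load the state dict into the inner module rather than calling from_pretrained on the wrapper.

import torch

# tab_net: the TabFormerBertLM instance created above; assumption: the
# underlying torch module is exposed as tab_net.model
state = torch.load("/content/drive/MyDrive/TabFormer/checkpoint-500/pytorch_model.bin",
                   map_location="cpu")
tab_net.model.load_state_dict(state)
tab_net.model.eval()
# Transaction embeddings can then be read from the encoder's last hidden
# state, e.g. via a forward pass with output_hidden_states=True and pooling.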

predicting using the trained model

@ink-pad
Hi there,
Now that I have trained a PRSA BERT model, how can I run inference on a .csv file, perhaps for classification?
Can you contact me by email? You will find it in my profile; I want to ask some questions.

Error in TabFormerGPT2LMHeadModel class

Hi,

I noticed that input_only is an invalid argument to the get_field_keys method in vocab (line 49):
field_names = self.vocab.get_field_keys(input_only=True, ignore_special=True)

Should it be this instead?
field_names = self.vocab.get_field_keys(remove_target=True, ignore_special=True)

Also, are you able to include a code snippet for the generation of synthetic data using the trained GPT-2?
I would like to recreate the results stated in your paper. Thank you.
