ibm / tabformer

295 stars · 10 watching · 77 forks · 471 KB

Code & Data for "Tabular Transformers for Modeling Multivariate Time Series" (ICASSP, 2021)

Home Page: https://arxiv.org/abs/2011.01843

License: Apache License 2.0

Python 100.00%
machine-learning artificial-intelligence credit-card-dataset fraud-detection gpt bert tabular-data prsa-dataset huggingface credit-card-transaction

tabformer's Introduction

Tabular Transformers for Modeling Multivariate Time Series

This repository provides the PyTorch source code and data for Tabular Transformers (TabFormer). Details are described in the paper Tabular Transformers for Modeling Multivariate Time Series, presented at ICASSP 2021.

Summary

  • Modules for hierarchical transformers for tabular data
  • A synthetic credit card transaction dataset
  • Modified Adaptive Softmax for handling masking
  • Modified DataCollatorForLanguageModeling for tabular data
  • The modules are built within transformers from HuggingFace 🤗 (HuggingFace is ❤️)

Requirements

  • Python (3.7)
  • PyTorch (1.6.0)
  • HuggingFace Transformers (3.2.0)
  • scikit-learn (0.23.2)
  • pandas (1.1.2)

(X) denotes the version the code was tested on.

These can be installed from the provided YAML file by running:

conda env create -f setup.yml
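For reference, a minimal sketch of what setup.yml plausibly contains, based on the version list above (the actual file ships with the repository and should be used as-is):

# Sketch only; prefer the repo's own setup.yml
name: tabformer
channels:
  - pytorch
  - conda-forge
dependencies:
  - python=3.7
  - pytorch=1.6.0
  - scikit-learn=0.23.2
  - pandas=1.1.2
  - pip
  - pip:
      - transformers==3.2.0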

Credit Card Transaction Dataset

The synthetic credit card transaction dataset is provided in ./data/credit_card. There are 24M records with 12 fields. You will need git-lfs to access the data. If you are facing issues related to LFS bandwidth, you can use this direct link to access the data; you can then skip the git-lfs files by prefixing GIT_LFS_SKIP_SMUDGE=1 to the git clone command, as shown below.
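For example, assuming the repository's standard GitHub location:

GIT_LFS_SKIP_SMUDGE=1 git clone https://github.com/IBM/TabFormer.git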

[Figure: sample records from the synthetic credit card transaction dataset]


PRSA Dataset

For the PRSA dataset, you have to download the data from Kaggle and place it in the ./data/card directory.


Tabular BERT

To train a tabular BERT model on the credit card transaction or PRSA dataset, run:

$ python main.py --do_train --mlm --field_ce --lm_type bert \
                 --field_hs 64 --data_type [prsa/card] \
                 --output_dir [output_dir]

Tabular GPT2

To train a tabular GPT2 model on credit card transactions for a particular user id, run:


$ python main.py --do_train --lm_type gpt2 --field_ce --flatten --data_type card \
                 --data_root [path_to_data] --user_ids [user-id] \
                 --output_dir [output_dir]
    

Description of some options (more can be found in args.py):

  • --data_type choices are prsa and card for the Beijing PM2.5 dataset and the credit card transaction dataset, respectively.
  • --mlm for masked language modeling; option for the transformer trainer for BERT.
  • --field_hs hidden size for the field-level transformer.
  • --lm_type choices are bert and gpt2.
  • --user_ids option to pick only transactions from particular user ids.

Citation

@inproceedings{padhi2021tabular,
  title={Tabular transformers for modeling multivariate time series},
  author={Padhi, Inkit and Schiff, Yair and Melnyk, Igor and Rigotti, Mattia and Mroueh, Youssef and Dognin, Pierre and Ross, Jerret and Nair, Ravi and Altman, Erik},
  booktitle={ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  pages={3565--3569},
  year={2021},
  organization={IEEE},
  url={https://ieeexplore.ieee.org/document/9414142}
}

tabformer's People

Contributors

ealtman741, ed-winning, ink-pad, matrig, stevemar


tabformer's Issues

Real culprits in the dataset?

In the financial transaction dataset, who is considered to be the main source of transactional fraud?

A breakdown of the fraud transactions indicates that all 2,000 users were victims of fraud, whereas only around 3K of the more than 100K merchants were involved in fraudulent transactions. Does that mean these 3K merchants are considered the main culprits of credit card fraud in this dataset? Or is it assumed that the sources of fraud are external parties that conceal their identities by using the merchants' details?

Also, interestingly, some of those "shady" merchants have hundreds of thousands of transactions, yet only a small fraction of these are marked as fraud.

It would be greatly appreciated if this could be clarified, as the paper does not cover the dataset description in detail.

Number of fields

A (very minor) correction for the README file: the number of fields is 15, not 12 (see here).

TypeError: cannot astype a datetimelike from [datetime64[ns]] to [int32]

Good morning,

I get this error when I run the code. I used this command: python main.py --do_train --mlm --field_ce --lm_type bert --field_hs 64 --data_type 'card' --output_dir 'output_dir_card'. I am using Visual Studio Code to run it.

The following is the complete error:

File "main.py", line 152, in
main(opts)
File "main.py", line 48, in main
skip_user=args.skip_user)
File "C:\Users\Ariel\Documents\HSE\3rd Term\HNCC\TabFormer-main\dataset\card.py", line 66, in init
self.encode_data()
File "C:\Users\Ariel\Documents\HSE\3rd Term\HNCC\TabFormer-main\dataset\card.py", line 309, in encode_data
timestamp = self.timeEncoder(data[['Year', 'Month', 'Day', 'Time']])
File "C:\Users\Ariel\Documents\HSE\3rd Term\HNCC\TabFormer-main\dataset\card.py", line 104, in timeEncoder
int)
File "C:\Users\Ariel\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\core\generic.py", line 5815, in astype
new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors)
File "C:\Users\Ariel\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\core\internals\managers.py", line 418, in astype
return self.apply("astype", dtype=dtype, copy=copy, errors=errors)
File "C:\Users\Ariel\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\core\internals\managers.py", line 327, in apply
applied = getattr(b, f)(**kwargs)
File "C:\Users\Ariel\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\core\internals\blocks.py", line 591, in astype
new_values = astype_array_safe(values, dtype, copy=copy, errors=errors)
File "C:\Users\Ariel\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\core\dtypes\cast.py", line 1309, in astype_array_safe
new_values = astype_array(values, dtype, copy=copy)
File "C:\Users\Ariel\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\core\dtypes\cast.py", line 1242, in astype_array
raise TypeError(msg)
TypeError: cannot astype a datetimelike from [datetime64[ns]] to [int32]

I am using Python 3.7 as you mentioned, along with the library versions you listed. It seems there is a problem around lines 48 and 152 of main.py. I would be really grateful if you could tell me how to solve it.
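A common workaround for this class of error (a sketch, not the repository's code): on Windows, astype(int) resolves to int32, which newer pandas rejects for datetime64[ns] columns, so cast explicitly to int64 instead.

import pandas as pd

ts = pd.to_datetime(pd.Series(["2002-09-01 06:21", "2002-09-01 06:42"]))
# .astype(int) maps to int32 on Windows and raises the TypeError above;
# casting to int64 yields nanoseconds since epoch, then divide down to seconds
seconds = ts.astype("int64") // 10**9
print(seconds)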

LFS Clone of credit card data not working

Good day,

When trying to run git lfs pull for the credit card data, I get the following error:

batch response: This repository is over its data quota. Account responsible for LFS bandwidth should purchase more data packs to restore access.

Could you please help?

Kind regards
Ruan

How to use Tabformer for unsupervised task ( clustering )?

Hello TabFormer developers, I am dealing with a dataset that contains both continuous and categorical features. There are no labels in the dataset, and the task is to cluster the data points.

I was wondering: is it possible to use TabFormer for clustering?

Number of bins for amount

Hi,

Interesting paper and code; I'm trying to do something similar. I just had a question: do you use only the predefined 10 bins for the amount (quantizing into just 10 bins)? That seems like a really small number, or did I misunderstand? Just curious about the parameter setup.

Thanks for the answer :)
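To frame the question, a minimal sketch (an assumption about the preprocessing, not necessarily the paper's exact setup) of quantizing an amount column into 10 quantile bins with pandas:

import pandas as pd

# hypothetical amounts, taken from the sample rows shown elsewhere on this page
amounts = pd.Series([134.09, 38.48, 120.34, 128.95, 104.71,
                     54.00, 59.15, 43.12, 45.13, 12.50])
# 10 quantile bins; each amount becomes an integer bin id usable as a token
bins = pd.qcut(amounts, q=10, labels=False, duplicates="drop")
print(bins.tolist())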

Code example for Fine Tuning

Thank you for sharing this incredible work. I'm relatively new to deep learning and am eager to learn more. Would it be possible for you to share an example of the fine-tuning process? I would greatly appreciate it. Thank you in advance.

How to load the pretrained model in Fraud Detection Tasks

Hi,
It's very nice of you to open-source the code.
I've run your code and got the model checkpoints.
When I tried to reproduce the results on the fraud detection task, I tried many methods but failed to load the pretrained model.
For example,
model = TabFormerHierarchicalLM.from_pretrained("model_path")
but it failed.
Could you tell me how to load the pretrained model? Thanks a lot.
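One possible workaround (a sketch under the assumption that TabFormerHierarchicalLM is a plain torch module without its own from_pretrained): load the checkpoint's state dict directly.

import torch

# `model` stands for a freshly constructed TabFormerHierarchicalLM with the
# same config and vocabulary used for training (construction omitted here);
# the checkpoint path is a placeholder
state_dict = torch.load("model_path/pytorch_model.bin", map_location="cpu")
model.load_state_dict(state_dict, strict=False)  # strict=False tolerates naming mismatches
model.eval()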

What version of gcc is required?

I have a Linux server with no internet access and no sudo privileges, so I can't install gcc with conda install libgcc.
gcc --version on the server reports 4.8.5. What version do I need to install to be able to run the process?
Now I get the following error when trying to execute python main.py:
/lib64/libstdc++.so.6: version 'GLIBCXX_3.4.21' not found

How to reproduce the results on the paper

Hi,

I noticed that there is no code for the downstream tasks in this repo, and I'm wondering if you could upload it. I want to reproduce your results from the ICASSP paper (e.g., Table 1). Based on the paper, you feed the extracted features into an MLP / LSTM. However, the detailed config of the MLP / LSTM is not clear, and I'm not very sure how you do upsampling for the fraud detection task.

It'll be super helpful if you could provide the code for downstream tasks.

Thanks for your help!

[SEP] token not added between transactions in TabGPT

Hello,

From my understanding of the paper, for the TabGPT model, sequences of ten transactions are passed with the [SEP] token between the transactions.

However, after looking at the code, it seems that a [SEP] token is not added between the transactions; self.mlm is False here, since the --mlm flag is not included in the command provided in the README.

Thank you very much for any clarification on this.

Merchant Name is an integer

In the screenshot in the README, the merchant name is a string, but when I load the CSV file using pandas, I get an integer value instead:

          User  Card  Year  Month  Day   Time   Amount           Use Chip        Merchant Name  Merchant City Merchant State      Zip   MCC Errors? Is Fraud?
0            0     0  2002      9    1  06:21  $134.09  Swipe Transaction  3527213246127876953       La Verne             CA  91750.0  5300    None        No
1            0     0  2002      9    1  06:42   $38.48  Swipe Transaction  -727612092139916043  Monterey Park             CA  91754.0  5411    None        No
2            0     0  2002      9    2  06:22  $120.34  Swipe Transaction  -727612092139916043  Monterey Park             CA  91754.0  5411    None        No
3            0     0  2002      9    2  17:45  $128.95  Swipe Transaction  3414527459579106770  Monterey Park             CA  91754.0  5651    None        No
4            0     0  2002      9    3  06:23  $104.71  Swipe Transaction  5817218446178736267       La Verne             CA  91750.0  5912    None        No
...        ...   ...   ...    ...  ...    ...      ...                ...                  ...            ...            ...      ...   ...     ...       ...
24386895  1999     1  2020      2   27  22:23  $-54.00   Chip Transaction -5162038175624867091      Merrimack             NH   3054.0  5541    None        No
24386896  1999     1  2020      2   27  22:24   $54.00   Chip Transaction -5162038175624867091      Merrimack             NH   3054.0  5541    None        No
24386897  1999     1  2020      2   28  07:43   $59.15   Chip Transaction  2500998799892805156      Merrimack             NH   3054.0  4121    None        No
24386898  1999     1  2020      2   28  20:10   $43.12   Chip Transaction  2500998799892805156      Merrimack             NH   3054.0  4121    None        No
24386899  1999     1  2020      2   28  23:10   $45.13   Chip Transaction  4751695835751691036      Merrimack             NH   3054.0  5814    None        No
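If the column should be treated as an opaque identifier rather than a number, one option is to force its dtype at load time; a minimal sketch, with a hypothetical filename for the extracted CSV:

import pandas as pd

# "card_transactions.csv" is a placeholder for the extracted CSV's actual name
df = pd.read_csv("data/credit_card/card_transactions.csv",
                 dtype={"Merchant Name": str})
print(df["Merchant Name"].head())  # ids kept as strings, not parsed as integers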

Missing args.py

Hello!

The code and README both reference an args.py, but it doesn't seem to be committed?

Thanks for publishing your code!

How to reduce model size for fitting gpu memory

Hi!

I have a GPU with 32 GB, but it's not enough for training the model using your source code and the predefined config.
Can you please explain how to reduce the model parameters?
I've already reduced the embedding dim and the number of attention heads, but maybe you know a better way?

Thanks!

Simple question about how to use trained tabgpt model to generate data

Hi there, thanks for making such interesting work. After running
>python main.py --do_train --lm_type gpt2 --field_ce --flatten --data_type card --output_dir ./ --data_root ./data/card/
I got several checkpoints. How can I use these models to generate data that preserves data privacy, and how can I evaluate how the model performs?
Thanks.
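Pending an official example, a minimal sampling sketch using the generic HuggingFace API. This is an illustration only: the tabular GPT-2 uses a custom field-level vocabulary, so generated ids must be decoded back to field values with the repo's vocabulary, and the checkpoint path below is a placeholder.

import torch
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("checkpoint-500")  # placeholder path
model.eval()

prime = torch.tensor([[0]])  # seed token id; in practice, a prefix of real transaction tokens
with torch.no_grad():
    sampled = model.generate(prime, max_length=120, do_sample=True, top_k=50)
print(sampled[0].tolist())  # token ids, to be decoded back to field values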

Test dataset distribution

Hi,

I am trying to replicate the results from the paper.
Is the distribution of the test set the same as that of the training set before upsampling the fraud class (imbalanced), or did you use a balanced test set?
Could you also please comment on how you split the data? Did you split randomly, or did you use the bottom part of the dataset as the test set?

Kind regards,
Rafael

LFS issue: over quota, can't download

I am getting this error while attempting to clone the repo with LFS.

Error downloading object: data/credit_card/transactions.tgz (e9f589a): 
Smudge error: Error downloading data/credit_card/transactions.tgz 
(e9f589a0958f40d60f81b1a2e8428db86e00c05755caf44fb055827976c0efa2): batch response: 
This repository is over its data quota.
Account responsible for LFS bandwidth should purchase more data packs to restore access.

data format for regression task

Hi.
I want to try a regression task similar to PRSA.
For PRSA, I understand the data for training is prepared in dataset/prsa.py.
For another new regression task, where and how can I set the target value and feature data?

Is TABGPT really privacy-preserving?

Hello,
It seems to me that you don't give any guarantee in your paper that someone would not be able to attack the generated dataset (e.g., with membership inference attacks)?

How to use Model to get transaction embeddings?

Hi Team,
Thanks a lot for sharing the code.
I was able to train a BERT model on the card dataset, but I am facing an issue while loading the saved model to generate embeddings. Can you please let me know how to load the model weights and how to generate an embedding for a transaction?

After creating an instance of the TabFormerBertLM class, I am trying to load the weights with the following command.

tab_net.from_pretrained('/content/drive/MyDrive/TabFormer/checkpoint-500/pytorch_model.bin')

After running this, I get the following error:
AttributeError: 'TabFormerBertLM' object has no attribute 'from_pretrained'

It will be very helpful if you can guide me to solve this problem.

Thank you.
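One possible direction (a sketch resting on the assumption that TabFormerBertLM stores the underlying torch module in a .model attribute, which should be checked against the repo's source): load the state dict into the inner module rather than calling from_pretrained on the wrapper.

import torch

# tab_net: the TabFormerBertLM instance created above; assumption: the
# underlying torch module is exposed as tab_net.model
state = torch.load("/content/drive/MyDrive/TabFormer/checkpoint-500/pytorch_model.bin",
                   map_location="cpu")
tab_net.model.load_state_dict(state)
tab_net.model.eval()
# Transaction embeddings can then be read from the encoder's last hidden
# state, e.g. via a forward pass with output_hidden_states=True and pooling.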

predicting using the trained model

@ink-pad
Hi there,
Now that I have trained a PRSA BERT model, how can I run inference on a .csv file, perhaps for classification?
Can you contact me by email? You will find it in my profile; I want to ask some questions.

Error in TabFormerGPT2LMHeadModel class

Hi,

I noticed that input_only is an invalid argument to the get_field_keys method in vocab (line 49):
field_names = self.vocab.get_field_keys(input_only=True, ignore_special=True)

Should it be this instead?
field_names = self.vocab.get_field_keys(remove_target=True, ignore_special=True)

Also, are you able to include a code snippet for the generation of synthetic data using the trained GPT-2?
I would like to recreate the results stated in your paper. Thank you.
