git-disl / bert4eth

BERT4ETH: A Pre-trained Transformer for Ethereum Fraud Detection (WWW23)

Python 98.64% Shell 1.36%
bert transformer deanonymization blockchain ethereum fraud-detection www2023 phishing-detection pretrained-language-model

bert4eth's Introduction

BERT4ETH

This repository contains the code (TensorFlow version) and datasets used in the paper BERT4ETH: A Pre-trained Transformer for Ethereum Fraud Detection, accepted at the ACM Web Conference (WWW) 2023. Here you can find our slides.

If you find this repository useful, please give us a star : ) Thank you!

Update: I've recently added a section (Section 5.5) discussing the multi-hop modeling capability of BERT4ETH to the paper on arXiv. (10/30)

BERT4ETH-PyTorch: Here you can find the PyTorch implementation: https://github.com/Bayi-Hu/BERT4ETH_PyTorch

Some notes:

Note 1: The master branch hosts the basic BERT4ETH. If you wish to run the basic model, there is no need to download the ERC-20 log dataset. Advanced features such as in/out separation and the ERC-20 log can be found in the old branch, but they are not recommended because they are inefficient in both computation and memory.

Note 2: Although BERT4ETH is a sequential model, it can capture three-hop relationships from a graph perspective. (For more details, please refer to our slides.)

[Figure: multi_hop_modeling.png]

Note 3: The results reported in our paper are the best of five pre-training runs. The outcomes may vary slightly between pre-training runs, checkpoint steps, and training runs of the cascaded MLP classifier. Below are our recent results on the phishing detection task with fixed training:

Getting Started

Requirements

  • Python >= 3.6
  • TensorFlow >= 2

I use Python 3.9 and TensorFlow 2.9.2 with CUDA 11.2 and NumPy 1.19.5.

Preprocess dataset

Step 1: Download the dataset from Google Drive.

Step 2: Unzip the dataset under the directory "BERT4ETH/Data/"

cd BERT4ETH/Data; # Labels are already included
unzip ...;

Step 3: Transaction Sequence Generation

cd Model;
python gen_seq.py --bizdate=bert4eth_exp

Pre-training

Step 1: Pre-training Data Generation from Sequence

python gen_pretrain_data.py --bizdate=bert4eth_exp \
                            --max_seq_length=100 \
                            --dupe_factor=10 \
                            --masked_lm_prob=0.8
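
As a rough illustration of what this step's flags control, here is a minimal sketch. It assumes each account's sequence is a list of counterparty addresses, that every position is masked independently with probability masked_lm_prob, and that dupe_factor is the number of differently masked copies produced per sequence (its meaning in the original BERT data-generation script). The names mask_sequence, make_pretrain_instances, and the "[MASK]" token are hypothetical and are not taken from gen_pretrain_data.py.

```python
import random

MASK_TOKEN = "[MASK]"  # hypothetical placeholder token, not the repo's vocabulary


def mask_sequence(addresses, masked_lm_prob, max_seq_length, rng):
    """Truncate a sequence, then mask each position with probability masked_lm_prob.

    Returns (masked_tokens, labels): labels[i] is the original address at a
    masked position and None at an unmasked position.
    """
    tokens = list(addresses[:max_seq_length])
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < masked_lm_prob:
            masked.append(MASK_TOKEN)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels


def make_pretrain_instances(addresses, dupe_factor, masked_lm_prob,
                            max_seq_length, seed=0):
    """Create dupe_factor independently masked copies of one sequence (assumed meaning)."""
    rng = random.Random(seed)
    return [
        mask_sequence(addresses, masked_lm_prob, max_seq_length, rng)
        for _ in range(dupe_factor)
    ]


if __name__ == "__main__":
    seq = ["0xaaa", "0xbbb", "0xccc", "0xddd", "0xeee"]
    for masked, labels in make_pretrain_instances(
            seq, dupe_factor=3, masked_lm_prob=0.8, max_seq_length=4):
        print(masked, labels)
```

With masked_lm_prob=0.8, most positions in each copy are replaced, so each duplicated copy leaves only a few addresses visible as context.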

Step 2: Pre-train BERT4ETH

python run_pretrain.py --bizdate=bert4eth_exp \
                       --max_seq_length=100 \
                       --epoch=5 \
                       --batch_size=256 \
                       --learning_rate=1e-4 \
                       --num_train_steps=1000000 \
                       --save_checkpoints_steps=8000 \
                       --neg_strategy=zip \
                       --neg_sample_num=5000 \
                       --neg_share=True \
                       --checkpointDir=bert4eth_exp

Parameter               Description
bizdate                 The signature for this experiment run.
max_seq_length          The maximum input sequence length of BERT4ETH.
masked_lm_prob          The probability of masking an address.
epochs                  Number of training epochs, default = 5.
batch_size              Batch size, default = 256.
learning_rate           Learning rate for the optimizer (Adam), default = 1e-4.
num_train_steps         The maximum number of training steps, default = 1000000.
save_checkpoints_steps  The number of steps between saved checkpoints, default = 8000.
neg_strategy            Strategy for negative sampling, default = zip, options: uniform, zip, freq.
neg_share               Whether to enable the in-batch sharing strategy, default = True.
neg_sample_num          The number of negative samples for one batch, default = 5000.
checkpointDir           The directory in which to save the checkpoints.
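
To make the neg_strategy, neg_share, and neg_sample_num options more concrete, here is a minimal sketch of the three sampling strategies. It is an interpretation and not the repository's code: it assumes "uniform" gives every address the same weight, "freq" weights addresses by occurrence count, and "zip" is a Zipfian (log-uniform) distribution over frequency rank. With in-batch sharing, one set of neg_sample_num negatives is assumed to be drawn once and reused for the whole batch. The function names are hypothetical.

```python
import math
import random


def sampling_weights(freqs, strategy):
    """Return one unnormalized sampling weight per address.

    freqs: occurrence counts, assumed sorted in descending order so that
    index 0 is the most frequent address (rank 0).
    """
    n = len(freqs)
    if strategy == "uniform":
        return [1.0] * n
    if strategy == "freq":
        return [float(f) for f in freqs]
    if strategy == "zip":
        # Assumed log-uniform (Zipfian) form over rank:
        # weight(rank) is proportional to log((rank + 2) / (rank + 1)).
        return [math.log((r + 2) / (r + 1)) for r in range(n)]
    raise ValueError(f"unknown strategy: {strategy}")


def sample_shared_negatives(freqs, strategy, neg_sample_num, seed=0):
    """Draw one set of negative address ids to be shared by a whole batch."""
    rng = random.Random(seed)
    weights = sampling_weights(freqs, strategy)
    return rng.choices(range(len(freqs)), weights=weights, k=neg_sample_num)


if __name__ == "__main__":
    counts = [100, 50, 10, 1]  # toy frequency table, most frequent first
    print(sample_shared_negatives(counts, "zip", neg_sample_num=8))
```

Sharing one negative set per batch keeps the cost of the output layer tied to neg_sample_num rather than to the full address vocabulary, which is presumably why it is enabled by default.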

Step 3: Output Representation

python output_embed.py --bizdate=bert4eth_exp \
                       --init_checkpoint=bert4eth_exp/model_104000 \
                       --max_seq_length=100 \
                       --neg_sample_num=5000 \
                       --neg_strategy=zip \
                       --neg_share=True

I have generated a version of the embedding file; you can unzip it under the directory "Model/inter_data/" and test the results.

Testing on output account representation

Phishing Account Detection

python run_phishing_detection.py --init_checkpoint=bert4eth_exp/model_104000 # Random Forest (RF)

python run_phishing_detection_dnn.py --init_checkpoint=bert4eth_exp/model_104000 # DNN, better than RF

De-anonymization (ENS dataset)

python run_dean_ENS.py --metric=euclidean \
                       --init_checkpoint=bert4eth_exp/model_104000

De-anonymization (Tornado Cash)

python run_dean_Tornado.py --metric=euclidean \
                           --init_checkpoint=bert4eth_exp/model_104000
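
The --metric=euclidean flag suggests that accounts are matched by the distance between their output embeddings. A minimal sketch of that idea follows, assuming embeddings are plain lists of floats keyed by account id; the function names and toy data are hypothetical and do not reflect how run_dean_ENS.py or run_dean_Tornado.py are actually written.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def rank_candidates(query, embeddings):
    """Return account ids sorted from nearest to farthest from the query vector."""
    return sorted(embeddings, key=lambda acc: euclidean(query, embeddings[acc]))


if __name__ == "__main__":
    # Toy 2-d embeddings; real account representations are much wider.
    emb = {"acct_a": [0.0, 0.0], "acct_b": [1.0, 0.0], "acct_c": [3.0, 4.0]}
    print(rank_candidates([0.9, 0.0], emb))
```

A de-anonymization hit would then correspond to the true matching account appearing near the top of the ranked list.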

Fine-tuning for phishing account detection

python gen_finetune_phisher_data.py --bizdate=bert4eth_exp \
                                    --max_seq_length=100

python run_finetune_phisher.py --init_checkpoint=bert4eth_exp/model_104000 \
                               --bizdate=bert4eth_exp \
                               --max_seq_length=100 \
                               --checkpointDir=tmp

Citation

@inproceedings{hu2023bert4eth,
  title={BERT4ETH: A Pre-trained Transformer for Ethereum Fraud Detection},
  author={Hu, Sihao and Zhang, Zhen and Luo, Bingqiao and Lu, Shengliang and He, Bingsheng and Liu, Ling},
  booktitle={Proceedings of the ACM Web Conference 2023},
  pages={2189--2197},
  year={2023}
}

Q&A

If you have any questions, you can either open an issue or contact me ([email protected]), and I will reply as soon as I see the issue or email.

bert4eth's Issues

Consultation on experiment

Hello, that's a great job!
Could you tell me the details of the environment used in your experiments, such as the versions of the OS, TensorFlow, CUDA, cuDNN, and TensorRT?
Thank You.

Classification of Sybils

I'm wondering whether we should add a classifier and train it together with one or two layers of BERT4ETH on our labeled dataset, or whether it would make more sense to use de-anonymization, as demonstrated with the ENS example, as a first step, perhaps checking whether it predicts these labels.

Any suggestions would be great as of course the permutations are vast - and time is short

array has an inhomogeneous shape after 1 dimensions.

Hello,

I'm getting that error when running
python gen_pretrain_data.py --bizdate=bert4eth_exp --max_seq_length=100 --dupe_factor=10 --masked_lm_prob=0.8

Traceback (most recent call last):
  File "$PATH\BERT4ETH\Model\gen_pretrain_data.py", line 448, in <module>
    main()
  File "$PATH\BERT4ETH\Model\gen_pretrain_data.py", line 402, in main
    seqs = np.random.permutation(seqs)
  File "mtrand.pyx", line 4703, in numpy.random.mtrand.RandomState.permutation
ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (631617,) + inhomogeneous part.
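
One likely cause, offered as an assumption and not a confirmed diagnosis: seqs is a list of sequences with different lengths, and NumPy 1.24 and later raise this ValueError when asked to build an array from such a ragged list, whereas older releases such as the 1.19.5 listed in the README only issued a deprecation warning. A workaround sketch that shuffles the list without converting it to an array is shown below; shuffle_ragged is a hypothetical helper and is not part of the repository.

```python
import random


def shuffle_ragged(seqs, seed=None):
    """Return a shuffled copy of a list whose items may have different lengths.

    The input list is left unchanged and no NumPy array is created, so
    sequences of unequal length are not a problem.
    """
    rng = random.Random(seed)
    shuffled = list(seqs)
    rng.shuffle(shuffled)
    return shuffled


if __name__ == "__main__":
    ragged = [[1], [2, 3], [4, 5, 6], []]
    print(shuffle_ragged(ragged, seed=1))
```

Replacing the np.random.permutation(seqs) call with something equivalent keeps the sequences as Python lists; whether later code in gen_pretrain_data.py expects a NumPy array has not been checked here. Installing the NumPy version listed in the README may be another option.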

Can't detect phishing accounts

Hello,
I set the parameters as described in the README but could not detect phishing accounts. If you don't mind, could you share some trained models or examples?
Thank you.

Testing files are missing

Hello,
It would be very nice of you if you added the testing files that are mentioned in the README.md.
Thanks.

May I ask if there is a problem with my tf version

I am trying to run the code in my local environment, but I have run into a version problem.

"TypeError: __init__() got multiple values for argument 'activation'"
is raised when I try to run "run_pretrain.py".
I have tried both TensorFlow 1.x and 2.x, and neither works.
