Environment: the Docker image frontlibrary/transformers-pytorch-gpu:4.6.1-pyarrow
~$ docker run --runtime=nvidia -it --rm -v $HOME/SL4DU:/workspace frontlibrary/transformers-pytorch-gpu:4.6.1-pyarrow
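Optionally, sanity-check that the container can see the GPU (this relies only on the torch that the image already bundles):
~$ docker run --runtime=nvidia --rm frontlibrary/transformers-pytorch-gpu:4.6.1-pyarrow python3 -c "import torch; print(torch.cuda.is_available())"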
- Python==3.9 (numpy may have problems on Python 3.10)
- nltk
- numpy (installed automatically as a dependency of scipy)
- scipy (if installation fails, see this solution)
- torch==1.8
- pyarrow
- tqdm
- transformers==4.5.1
- sklearn
- stop_words
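If you are not using the Docker image, the dependencies can be installed with pip roughly as follows (a sketch; note that on PyPI, sklearn and stop_words are published as scikit-learn and stop-words):
~$ pip install nltk scipy pyarrow tqdm torch==1.8.0 transformers==4.5.1 scikit-learn stop-words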
- Initialize directories
SL4DU
├── code
├── data
└── pretrained
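These can be created in one step:
~$ mkdir -p ~/SL4DU/code ~/SL4DU/data ~/SL4DU/pretrained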
- Download code and the Ubuntu data
~/SL4DU/code$ git clone https://github.com/RayXu14/SL4DU.git
~/SL4DU/data$ wget https://www.dropbox.com/s/2fdn26rj6h9bpvl/ubuntu_data.zip
~/SL4DU/data$ unzip ubuntu_data.zip
- Add the bert-base-uncased pretrained model in pretrained, containing:
- config.json
- vocab.txt
- pytorch_model.bin
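One way to obtain these files is to download them from the Hugging Face Hub (any other copy of bert-base-uncased works just as well):
~/SL4DU/pretrained$ mkdir bert-base-uncased && cd bert-base-uncased
~/SL4DU/pretrained/bert-base-uncased$ wget https://huggingface.co/bert-base-uncased/resolve/main/config.json
~/SL4DU/pretrained/bert-base-uncased$ wget https://huggingface.co/bert-base-uncased/resolve/main/vocab.txt
~/SL4DU/pretrained/bert-base-uncased$ wget https://huggingface.co/bert-base-uncased/resolve/main/pytorch_model.bin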
- Preprocess data
~/SL4DU/code/SL4DU$ python3 preprocess.py --task=RS --dataset=Ubuntu --raw_data_path=../../data/ubuntu_data --pkl_data_path=../../data/ubuntu_data --pretrained_model=bert-base-uncased
- Reproduce the BERT result (replace each ? with your own path; the commented-out flag makes validation run directly on test.pkl)
~/SL4DU/code/SL4DU$ python3 -u train.py --save_ckpt --task=RS --dataset=Ubuntu --pkl_data_path=../../data/ubuntu_data --pretrained_model=bert-base-uncased --add_EOT --freeze_layers=0 --train_batch_size=8 --eval_batch_size=100 --log_dir=? # --pkl_valid_file=test.pkl
- Add post-ubuntu-bert-base-uncased in pretrained
- Download Whang et al.'s Ubuntu checkpoint and transform it into our format using deprecated/whangpth2bin.py; compared to bert-base-uncased, you only need to increase the vocab size by 1 in config.json and append the new token [EOS] to the end of vocab.txt (see the sketch after this list)
- Or use our pretrained models (already transformed) instead
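A minimal sketch of those two edits, assuming the stock bert-base-uncased config, whose vocab_size is 30522:
~/SL4DU/pretrained/post-ubuntu-bert-base-uncased$ echo "[EOS]" >> vocab.txt # append [EOS] as the new last token
~/SL4DU/pretrained/post-ubuntu-bert-base-uncased$ sed -i 's/"vocab_size": 30522/"vocab_size": 30523/' config.json # bump vocab size by 1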
- Reproduce the BERT-VFT result
~/SL4DU/code/SL4DU$ python3 -u train.py --save_ckpt --task=RS --dataset=Ubuntu --pkl_data_path=../../data/ubuntu_data --pretrained_model=post-ubuntu-bert-base-uncased --freeze_layers=8 --train_batch_size=16 --eval_batch_size=100 --log_dir=? #--pkl_valid_file=test.pkl
- Reproduce the SL4RS result
~/SL4DU/code/SL4DU$ python3 -u train.py --save_ckpt --task=RS --dataset=Ubuntu --pkl_data_path=../../data/ubuntu_data --pretrained_model=post-ubuntu-bert-base-uncased --freeze_layers=8 --train_batch_size=4 --eval_batch_size=100 --log_dir=? --use_NSP --use_UR --use_ID --use_CD --train_view_every=80 #--pkl_valid_file=test.pkl
- Evaluation
~/SL4DU/code/SL4DU$ python3 -u eval.py --task=RS --dataset=Ubuntu --pkl_data_path=../../data/ubuntu_data --pretrained_model=post-ubuntu-bert-base-uncased --freeze_layers=8 --eval_batch_size=100 --log_dir=? --load_path=?
Using Whang et al.'s repo
Remember to transform the saved model into our format using deprecated/whangpth2bin.py.
For post-training, set the number of epochs to 2 with the data duplicated 10 times, and set the virtual batch size to 384.