Rumor-Detection Documentation

Supervised by Dr. JinHong

GLAN Model
- Env
- CUDA
- Dataset
- Train Models
PLAN Model
PPA Model
Essay
Anaconda Env
Research Logs
Experiment Results
Author

GLAN Model

Env

Install GLAN model dependencies:

pip install -r requirements.txt

CUDA

First go to https://developer.nvidia.com/cuda-11.6.0-download-archive download CUDA Toolkit.

Enter nvcc -V in the terminal to see the CUDA version.

My GLAN environment is cuda11.6.

Edit the ~/.bashrc file
Change the PATH and LD_LIBRARY_PATH environment variables

export PATH=/usr/local/cuda-11.6$PATH

export LD_LIRBRARY_PATH=/usr/local/cuda-11.6

Save the content to exit, enter source ~/.bashrc to make the environment take effect.
Insert nvcc -Vto look up the cuda tooltik.

Need to install the corresponding version of torch+cuda

pip3 install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu116 (cuda116)

pip install torch_scatter-2.0.9-cp39-cp39-linux_x86_64.whl

pip install torch_sparse-0.6.16+pt113cu116-cp39-cp39-linux_x86_64.whl

pip install torch_cluster-1.6.0+pt113cu116-cp39-cp39-linux_x86_64.whl

pip install torch_spline_conv-1.2.1+pt113cu116-cp39-cp39-linux_x86_64.whl

pip install torch-geometric==1.7.2

All installation packages can be found in the folder Env

Check cuda is available

python3 check.py

Check the cuda version is correct or not

Dataset

GLAN/Dataset contains three dataset: twitter15, twitter16 and weibo

.pkl files are the model files.dev.pkl, train.pkl, test.pkl

vocab.pkl used to the preprocess word vectors.

replies.rar file is the replies, can't open that.

graph.txt is the graph file, display different tweets' relationship.

Train Models

[1] Change train dataset

if __name__ == '__main__':
    #task = 'twitter16'
    # task = 'twitter16'
    task = 'twitter16'
    print("task: ", task)

    if task == 'weibo':
        config['num_classes'] = 2
        config['batch_size'] = 64
        config['reg'] = 1e-5
        config['target_names'] = ['NR', 'FR']

    model = GLAN
    train_and_test(model, task)

Change the task name.(twitter15,twitter16,weibo) [2] Run the main file

python3 run_new.py

[3] If you want to retrain the model, open these:

 #train codes:
    nn.fit(X_train_tid, X_train_source, X_train_replies, y_train,
    X_dev_tid, X_dev_source, X_dev_replies, y_dev)
    nn=nn.cuda()

If there is no need for retraining, please close the interface and directly use the pre-trained model.

[4] If you want to change the test dataset, such as using the twitter15 model to run the twitter16 dataset:

    #test dataset
    y_pred = nn.predict(X_test_tid_twitter16, X_test_source_twitter16,X_test_replies_twitter16)
    print(classification_report(y_test_twitter16, y_pred, target_names=config_16['target_names'], digits=3)) #y =Wx+b  b is the bias, models decide the W.

replace different dataset name you want to use.

PS: Because the three dataset contain different number of nodes, the bigger on can test the smaller one.

And the sequence as big to small is weibo(20493) > twitter15(4459) > twitter16(3550)

Therefore, we can use GLAN(weibo trained) to run twitter15 and twitter16 as test datasets. Using twitter15 to run twitter16 as test dataset. Twitter16 only can run by itself, no others.

PLAN Model

Dataset

https://www.dropbox.com/sh/w3bh1crt6estijo/AAD9p5m5DceM0z63JOzFV7fxa?dl=0

glove english word vector: https://drive.google.com/file/d/19t9-sihELSk0Asf-EFAOCb9WfyPgtRMu/view?usp=sharing

word2vec chinese word vector:

the glove and word2vec word vectors files will use to the word vectors preprocess.

Download these files and extract to GLAN/code/dataset folder.

Env

Install PLAN model dependencies:

pip install -r requirements.txt

CUDA

First go to https://developer.nvidia.com/cuda-11.1.1-download-archive download CUDA Toolkit.

Enter nvcc -V in the terminal to see the CUDA version.

My PLAN environment is cuda11.1.

Edit the ~/.bashrc file
Change the PATH and LD_LIBRARY_PATH environment variables

export PATH=/usr/local/cuda-11.6$PATH

export LD_LIRBRARY_PATH=/usr/local/cuda-11.6

Save the content to exit, enter source ~/.bashrc to make the environment take effect.
Insert nvcc -Vto look up the cuda tooltik.

After finish that:

pip install torch==1.8.1+cu111 torchvision==0.9.1+cu111 -f https://download.pytorch.org/whl/cu111/torch_stable.html

All installation packages can be found in the folder Env

Check cuda is available

python3 check.py

Check the cuda version is correct or not

Common error reports

OSError: [E941] Can't find model 'en'. It looks like you're trying to load a model from a shortcut, which is obsolete as of spaCy v3.0. To load the model, use its full name instead:

This is because the direct use of pip install spacy, the spacy version is not match with the author's old version.

Because older versions of spacy are not compatible with many of the latest packages, we use the following solution.

import spacy

nlp = spacy.load('en')

Replace above with

import spacy

nlp = spacy.load('en_core_web_sm')

error: "can't find model 'en'" it says to use its full name 'en_core_web_sm' but then i get the error: "cant find model 'en_core_web_sm' it doesnt seem to be a python package or a valid path to a data directory"

This is an issue where higher versions of Spacy are not compatible with previous older versions

You need to install en_core_web_sm

python -m spacy download en_core_web_sm

ImportError: cannot import name 'NestedField' from 'torchtext.data' (/home/ame/anaconda3/envs/PLAN/lib/python3.9/site-packages/torchtext/data/init.py)

This is an issue where later versions of TorchText are not compatible with older versions

Because older versions of spacy are not compatible with many of the latest packages, we use the following solution.

pip install torch==0.9.1

change torchtext.databecomestorchtext.legacy.data

After that, maybe new error occur:

userwarning: [w108] the rule-based lemmatizer did not find pos annotation for one or more tokens. check that your pipeline includes components that assign token.pos, typically 'tagger'+'attribute_ruler' or 'morphologizer'.

nlp = spacy.load("en_core_web_sm", disable = ["parser", "tagger", "ner","lemmatizer"])

Train Models

[1] Config train settings

Open the PLAM/codes/config_new.py file.

Train settings:

      # Training
        self.num_epoch = 300
        self.batch_size = 4
        self.batch_size_test = 4
        self.num_classes = 4

Word Embedding settings: Decided by your train dataset, English datasets use the glove/6B.300d.txt file, Chinese datasets use the word2vec.txt file.

self.glove_directory = "/home/ame/rumor/PLAN/codes/data/word2vec/"
        self.glove_file = "word2vec.txt"
        self.vector_path = ""
        self.vocab_path = ""
        self.max_vocab = 20000
        self.emb_dim = 300
        self.num_structure_index = 5

Other settings please look the config file, many detail remark have been indicated well.

[2] Config test settings

Open the PLAN/codes/test_new.ipynb

Choose model:

folder = "/home/ame/rumor/PLAN/logs/weibo_full_repair_model/weibo_full_repair/best_model"

Choose test datasets:

test_file_path = "/home/ame/rumor/PLAN/codes/data/twitter15_16/twitter16/split_data/structure_v2/split_4/train_unique_w_structure_v2_modified.json"

Choose word vector preprocess file:

# Getting the vocab vectors
#Chinese
vec = vocab.Vectors(name = word2vec_file, cache = word2vec_directory)
#English
#vec = vocab.Vectors(name =glove_file, cache =glove_directory)

You can use above three steps to run different PLAN models test different datasets.

[3] Add new dataset

You can add new datasets in any language, provided that the json file format of the PLAN model is followed as shown in the following example:

{"id_": 552783745565347840, "label": 0, "tweets": ["Ten killed in shooting at headquarters of French satirical weekly Charlie Hebdo, says French media citing witnesses #c4news", "@Channel4News @GidonShaviv must be that peace loving religion again", "@Channel4News my god what is going on,no doubt as it seems to be satirical mag,members of a certain religion have been offended they kill.", "@Channel4News @theresacfc ffs what's going on??", "@Channel4News I think the majority of people ( including you guys) are sick and tired of Islamic immigration and all the problems it brings", "@Channel4News 1 cleric for each victim, this will stop the hate speech from mosques. the streets will echo with silence by outraged muslims"], "time_delay": [0, 0, 0, 2, 3, 8], "structure": [[4, 2, 2, 2, 2, 2], [3, 4, 2, 2, 2, 2], [3, 3, 4, 2, 2, 2], [3, 3, 3, 4, 2, 2], [3, 3, 3, 3, 4, 2], [3, 3, 3, 3, 3, 4]]}

The json file must contain the id, label, tweets.

time_dealy and structure these two parameters are extra parameters used to improve the accuracy.

[4] Extra python tools package

Extra tools.ipynb this file contain many functions for this PLAN model research, it is very helpful when want to do some files process operations.

This is the list functions for this jupyter notebook file:

1. Consolidate all the information in TreeWeibo into one file weibo_timedelay.txt
2. Integrate original data with comment data weibo_id_text.txt + weibo_timedelay.txt
3. Re-integrate into the original data content and message categories label: weibo.txt
4. Convert a txt file to a json file in PLAN data format.
5. Count the number of rows in the newly generated json file.
6. Counting the number of different elements in a txt file.
7. Counting txt file columns (to prevent data confusion).
8. Allocation of datasets at 70% 15% 15%.
9. Count the largest element value and the largest number of elements contained in time_delay[] in all json data.
10. Change the time_dealy[] in the Weibo dataset to a maximum of 100.
11. Instead of changing data greater than 100 to 100, the data in the corresponding tweet[] is deleted and the data in the corresponding tweet[] is deleted.
12. The label parameter in the statistics weibo data.
13. Delete the elements of time_dealy[] from the Weibo dataset.
14. Splitting json data.
15. Look up json files not recognised by the PLAN model.
16. Merge json data.
17. We divided the Weibo dataset into 5 parts and selected one of them for testing, following the ratio of 70% for the training set, 15% for test set 1, and 15% for test set 2.
18. Create weibo dataset, correct label.
19. Statistic twitter15/16 labels
20. Change the twitter15 label.

PS: Since the PLAN model is based on TensorFlow, the training time is very long, but the testing phase does not take very long.

Trained Models

Label-2-Twitter15 Model:

https://drive.google.com/file/d/1EMwaqLXl4vO8whp5dEjjjWJ5a1QA2j-Z/view?usp=sharing

Label-2-Twitter16 Model:

https://drive.google.com/file/d/1fe4am7CbS3cncivr3a7HDfbTeqz_NtJ9/view?usp=sharing

Label-2-Pheme Model:

https://drive.google.com/file/d/1-2ZNnEe0zX_Do30t7caZs5-Ubw7Ohi-d/view?usp=sharing

Weibo Model:

https://drive.google.com/file/d/1jsCY2BJAMbo37TyVSdZk_vmxfQiuj1Ii/view?usp=sharing

Twitter15 Model:

https://drive.google.com/file/d/1iQr9aCAMomR3fApE-5_zJbiFBw76c58T/view?usp=sharing

Twitter16 Model:

https://drive.google.com/file/d/1CgDYIRhG3ZqZ6ZVyowvhNNu0V4sM22J3/view?usp=sharing

Pheme Model:

PPA Model

Dataset

The data set used by the author uses the Baidu network disk download tool, the download speed is slow, and it is not recommended to download directly.

https://drive.google.com/file/d/1wds1SVweGzKOfx-vaw15ouIbSLsOva_v/view?usp=share_link

Extract this folder into PPA/data/.

Env

Install PPA model dependencies:

pip install -r requirements.txt

CUDA

First go to https://developer.nvidia.com/cuda-11.1.1-download-archive download CUDA Toolkit.

Enter nvcc -V in the terminal to see the CUDA version.

My PPA environment is cuda11.1.

Edit the ~/.bashrc file
Change the PATH and LD_LIBRARY_PATH environment variables

export PATH=/usr/local/cuda-11.1$PATH

export LD_LIRBRARY_PATH=/usr/local/cuda-11.1

Save the content to exit, enter source ~/.bashrc to make the environment take effect.
Insert nvcc -Vto look up the cuda tooltik.

After finish that:

pip install torch==1.10.1+cu111 torchvision==0.11.2+cu111 torchaudio==0.10.1 -f https://download.pytorch.org/whl/cu111/torch_stable.html

All installation packages can be found in the folder Env

Check cuda is available

python3 check.py

Check the cuda version is correct or not

Common error reports

typeerror: vampire.forward: return type none is not a typing.dict[str, torch.tensor].

reason: pytorch version problem
valueerror("incompatible component merge:\n - '*mpich*'\n - 'mpi_mpich_*'")

reason: For anaconda version issues, it is recommended to use conda install -c conda-forge <package_name> one by one
from allennlp.modules.scalar_mix import scalarmix modulenotfounderror: no module named 'allennlp'

reason:

[1] The allennlp.models and allnnlp versions must be the same.

[2] It is recommended to install allennlp.models directly, so that the same version of allnnlp will be installed automatically

[3] Allennlp package has strict version matching restrictions, including the following packages: transformers, torch, torchvison, cach-path, huggingface-hub, etc.

For example, the following error will appear: ERROR: cached-path 1.1.6 has requirement filelock<3.9,>=3.4, but you'll have filelock 3.0.12 which is incompatible. ERROR: cached-path 1.1.6 has requirement huggingface-hub<0.11.0,>=0.8.1, but you'll have huggingface-hub 0.13.1 which is incompatible. Similar to this, you need to choose the right version to match.

Details can be found in the https://github.com/allenai/allennlp-models

Train Model

[1] Config settings: Open the PPA/get_args.py file.

In this file you can set different experiment parameters.

More details have remarked in this file.

[2] Choose Dataset In PPA/get_args.py file:

cur_dataset = "T"
if cur_dataset == "T":
    _args = get_response_twitter_args()
elif cur_dataset == "W":
    _args = get_response_weibo_args()
elif cur_dataset == "P":
    _args = get_response_pheme5_args()

T = Twitter

W = Weibo

P = Pheme

[3] Three main file:

python3 MainResponseWAE4Early.py

Running file for Twitter15 and Twitter16 dataset.

python3 WeiboMainResponseWAE4Early.py

Running file for Weibo dataset.

python3 Pheme5MainResponseWAE4Early.py

Running file for PHEME dataset.

Save Model

The initial codes can't save PPA model, so we can add these codes in the three main files.

# save model path
        torch.save({"model" : model.state_dict()}, "./model_Twitter16/e" + str(epoch))

# choose test model
        model.load_state_dict(torch.load("./model_Twitter/e5")['model'])
        model.eval()

More details I have uploaded the three new main files which I have edited. MainResponseWAE4Early_new.py, Pheme5MainResponseWAE4Early_new.py, WeiboMainResponseWAE4Early_new.py.