salt-nlp / mixtext

MixText: Linguistically-Informed Interpolation of Hidden Space for Semi-Supervised Text Classification

License: MIT License

Python 12.63% Jupyter Notebook 87.37%
mixtext textclassification semisupervised-learning dataaugmentation interpolation computation natural-language-processing textgeneration machine-learning

mixtext's Introduction

MixText

This repo contains the code for the following paper:

Jiaao Chen, Zichao Yang, Diyi Yang: MixText: Linguistically-Informed Interpolation of Hidden Space for Semi-Supervised Text Classification. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL 2020)

If you would like to refer to it, please cite the paper mentioned above.

Getting Started

These instructions will get you up and running with the MixText code.

Requirements

  • Python 3.6 or higher
  • PyTorch >= 1.3.0
  • pytorch_transformers (now known as transformers)
  • Pandas, NumPy, Pickle
  • Fairseq

Code Structure

|__ data/
        |__ yahoo_answers_csv/ --> Datasets for Yahoo Answers
            |__ back_translate.ipynb --> Jupyter notebook for back-translating the dataset
            |__ classes.txt --> Classes for the Yahoo Answers dataset
            |__ train.csv --> Original training dataset
            |__ test.csv --> Original testing dataset
            |__ de_1.pkl --> Back-translated training dataset with German as the intermediate language
            |__ ru_1.pkl --> Back-translated training dataset with Russian as the intermediate language

|__ code/
        |__ transformers/ --> Code copied from huggingface/transformers
        |__ read_data.py --> Code for reading the dataset; forming the labeled training set, unlabeled training set, development set, and testing set; building dataloaders
        |__ normal_bert.py --> Code for the BERT baseline model
        |__ normal_train.py --> Code for training the BERT baseline model
        |__ mixtext.py --> Code for our proposed TMix/MixText model
        |__ train.py --> Code for training/testing TMix/MixText
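
For intuition about what mixtext.py implements, here is a minimal sketch of TMix-style hidden-state interpolation. It is an illustration written against the Hugging Face transformers BertModel layout, not the repository's own code: two batches are encoded up to a chosen layer, their hidden states are mixed with a Beta-distributed weight, and the mixed states continue through the remaining layers. In the repository the mix layer is sampled from --mix-layers-set and the labels are mixed with the same weight; attention masks are omitted here for brevity.

import numpy as np
import torch
from transformers import BertModel

def tmix_hidden_states(model, ids_a, ids_b, mix_layer=7, alpha=16.0):
    # Interpolation weight lambda ~ Beta(alpha, alpha); keeping the larger of
    # lambda and 1 - lambda (a common MixUp convention) biases the mix toward
    # the first example.
    lam = float(np.random.beta(alpha, alpha))
    lam = max(lam, 1.0 - lam)
    h_a = model.embeddings(ids_a)
    h_b = model.embeddings(ids_b)
    for i, layer in enumerate(model.encoder.layer):
        if i == mix_layer:
            h_a = lam * h_a + (1.0 - lam) * h_b   # mix once, in hidden space
        h_a = layer(h_a)[0]
        if i < mix_layer:
            h_b = layer(h_b)[0]                   # second input only runs up to the mix layer
    return h_a, lam

model = BertModel.from_pretrained('bert-base-uncased')
ids_a = torch.randint(1000, 2000, (2, 16))        # two toy batches of token ids
ids_b = torch.randint(1000, 2000, (2, 16))
mixed_states, lam = tmix_hidden_states(model, ids_a, ids_b)

A classifier head on top of mixed_states would then be trained against the mixed label lam * y_a + (1 - lam) * y_b.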

Downloading the data

Please download the datasets and put them in the data folder. You can find Yahoo Answers, AG News, and DBpedia here, and IMDB here.

Pre-processing the data

For Yahoo Answers, we concatenate the question title, question content, and best answer to form the text to be classified. The pre-processed Yahoo Answers dataset can be downloaded here.
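
As an illustration of that concatenation step, here is a small sketch, assuming the raw Yahoo Answers CSV uses the usual headerless four-column layout (class index, question title, question content, best answer); adjust the column names if your copy differs.

# Hypothetical pre-processing sketch: build the text to classify by joining
# title, content, and best answer (assumes a headerless four-column CSV).
import pandas as pd

cols = ["label", "title", "content", "answer"]
df = pd.read_csv("./data/yahoo_answers_csv/train.csv", header=None, names=cols)

text = (df["title"].fillna("") + " " +
        df["content"].fillna("") + " " +
        df["answer"].fillna("")).str.strip()
labels = df["label"]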

Note that for AG News and DBpedia we only use the content (without titles) for classification, and for IMDB we do not perform any pre-processing.

We utilize Fairseq to perform back translation on the training dataset. Please refer to ./data/yahoo_answers_csv/back_translate.ipynb for details.

Here, we have also put two examples of back-translated data, de_1.pkl and ru_1.pkl, in ./data/yahoo_answers_csv/. You can use them directly for Yahoo Answers, or generate your own back-translated data following ./data/yahoo_answers_csv/back_translate.ipynb.
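
If you prefer to generate back translations outside the notebook, a rough sketch using Fairseq's pre-trained WMT'19 models from torch.hub is shown below; the notebook above remains the authoritative recipe, and the exact checkpoints, sampling settings, and output path here are assumptions.

# Rough back-translation sketch (English -> German -> English) with Fairseq's
# WMT'19 torch.hub models; sampling adds diversity to the paraphrases.
import pickle
import torch

en2de = torch.hub.load('pytorch/fairseq', 'transformer.wmt19.en-de.single_model',
                       tokenizer='moses', bpe='fastbpe')
de2en = torch.hub.load('pytorch/fairseq', 'transformer.wmt19.de-en.single_model',
                       tokenizer='moses', bpe='fastbpe')

def back_translate(sentences, topk=10):
    german = en2de.translate(sentences, sampling=True, sampling_topk=topk)
    return de2en.translate(german, sampling=True, sampling_topk=topk)

paraphrases = back_translate(["why doesn't an optical mouse work on a glass table?"])
with open('./data/yahoo_answers_csv/de_custom.pkl', 'wb') as f:   # hypothetical output file
    pickle.dump(paraphrases, f)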

Training models

This section contains instructions for training models on Yahoo Answers using 10 labeled examples per class.

Training BERT baseline model

Please run ./code/normal_train.py to train the BERT baseline model (uses only the labeled training data):

python ./code/normal_train.py --gpu 0,1 --n-labeled 10 --data-path ./data/yahoo_answers_csv/ \
--batch-size 8 --epochs 20 

Training TMix model

Please run ./code/train.py to train the TMix model (uses only the labeled training data):

python ./code/train.py --gpu 0,1 --n-labeled 10 --data-path ./data/yahoo_answers_csv/ \
--batch-size 8 --batch-size-u 1 --epochs 50 --val-iteration 20 \
--lambda-u 0 --T 0.5 --alpha 16 --mix-layers-set 7 9 12 --separate-mix True 

Training MixText model

Please run ./code/train.py to train the MixText model (uses both labeled and unlabeled training data):

python ./code/train.py --gpu 0,1,2,3 --n-labeled 10 \
--data-path ./data/yahoo_answers_csv/ --batch-size 4 --batch-size-u 8 --epochs 20 --val-iteration 1000 \
--lambda-u 1 --T 0.5 --alpha 16 --mix-layers-set 7 9 12 \
--lrmain 0.000005 --lrlast 0.0005

mixtext's People

Contributors

diyiy, jiaaoc

mixtext's Issues

yml file of environment

Hello!

I am trying to reproduce your work, but I am having issues with the dependencies. Can you provide the yml file of the conda environment that you used, or at least specify the version of each package (transformers, torch, ...)?

Thanks,
Taha

About args.val-iterations.

Dear authors, I noticed that if val-iteration is set to 1000 with batch-size 4 and batch-size-u 8, each epoch uses only 4000 labeled and 8000 unlabeled examples. This is fine when num_labeled is 10 per class, but what about when num_labeled is 2500? Is args.val-iteration set to 1000 for every experiment? Thanks a lot.

Sequences too long for backtranslation

I was trying to use the back-translation code with the IMDB polarity dataset, but the sequences are very long and Fairseq gives an error. How do you handle such long sequences? Thanks for the code.

Not using entropy minimization loss in the final model?

Hey @jiaaoc @diyiy,

A quick clarification question -- it looks like the lambda-u-hinge parameter is always set to zero and, following the code, the hinge loss (entropy minimization loss, Sec. 4.4 in the paper) is never added to the final loss in the MixText/TMix configuration of the code, i.e., line 353 is never executed.

Is that correct? Or am I missing something?

tcmalloc: large alloc 44581601280 bytes

Hi @jiaaoc
All my compliments to you and your team for the great results with this model.👍
When I run the following command on the Yahoo Answers dataset you provided, on Colab (RAM=24G, Memory=16G):
!python ./code/train.py --gpu 0 --n-labeled 10 --data-path ./data/yahoo_answers_csv/ --batch-size 4 --batch-size-u 8 --epochs 20 --val-iteration 1000 --lambda-u 1 --T 0.5 --alpha 16 --mix-layers-set 7 9 12 --lrmain 0.000005 --lrlast 0.0005
I get the following error about tcmalloc:

“””
2020-11-24 14:11:01.275980: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
GPU num: 1
Whether mix: True
Mix layers sets: [7, 9, 12]
tcmalloc: large alloc 44581601280 bytes == 0x7f6ef822c000 @ 0x7f7a28081001 0x7f7a25453765 0x7f7a254b7bb0 0x7f7a254badb8 0x7f7a254bb395 0x7f7a2555265d 0x50a4a5 0x50beb4 0x507be4 0x509900 0x50a2fd 0x50cc96 0x5095c8 0x50a2fd 0x50beb4 0x507be4 0x50ad03 0x634e72 0x634f27 0x6386df 0x639281 0x4b0dc0 0x7f7a27c7cbf7 0x5b259a
^C
“””
Does this error mean that more than 40 GB of memory is required to run the above command? If so, it is difficult to find such a machine.😢
Looking forward to your reply😊

A little question about pre-processed Yahoo Answer dataset.

hello @jiaaoc 😊
First of all, all my compliments to you and your team for the great results with this model.
I want to use it on my Chinese datasets, but I see that the first and second columns of the pre-processed Yahoo Answers dataset you provided are both labels.

label label content
5 5 why doesn't an optical mouse work on a glass table? ......
6 6 what is the best off-road motorcycle trail ?......

Do I need to repeat the label twice in my dataset?

The pre-processed Yahoo Answer dataset

Hello,
I am trying to run this project, but the link to the pre-processed Yahoo Answers dataset is invalid. Could you please fix the link to the pre-processed Yahoo Answers dataset?

Thanks,
Rosit

Loss function for supervised loss

I have a question about what exactly the supervised loss is -- the one mentioned in the paper and computed here in the code as Lx. It is not exactly cross-entropy, so I was a bit confused; a bit more explanation of this loss function would be really helpful.

Lx = - torch.mean(torch.sum(F.log_softmax(outputs_x, dim=1) * targets_x, dim=1))
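
For reference, that expression is cross-entropy against soft targets: it reduces to torch.nn.functional.cross_entropy when targets_x is one-hot, and it stays well defined when the targets are mixed or sharpened probability vectors as produced by TMix/MixText. A small self-contained check (assuming outputs_x are raw logits):

# Sanity check (not from the repository): soft-target cross-entropy equals
# F.cross_entropy when the targets are one-hot.
import torch
import torch.nn.functional as F

outputs_x = torch.randn(4, 10)                        # logits: 4 examples, 10 classes
hard = torch.randint(0, 10, (4,))
targets_x = F.one_hot(hard, num_classes=10).float()   # one-hot soft targets

Lx = -torch.mean(torch.sum(F.log_softmax(outputs_x, dim=1) * targets_x, dim=1))
print(torch.allclose(Lx, F.cross_entropy(outputs_x, hard)))  # prints True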

Unlabelled Data Formatting

Hey,

I was trying to test your model on my custom data, which contains both labelled and unlabelled sentences, though I am not sure how I need to format, structure, and store my unlabelled data.

I was able to run it successfully on the pre-processed Yahoo dataset, but as far as I could observe it only contained labelled examples.

Changing the sequence length for BERT & performance on data with multi-class labels.

I've been working on this and got results close to what was reported in the paper for the IMDB dataset.
However, I'm facing a couple of issues.

  1. Changing the sequence length that BERT accepts hasn't been straightforward. I want it changed to 128. I couldn't see it in 'mixtext.py'. 'normal_bert.py' contains 'length=256' in the definition of 'forward', but I'm unsure if this is the one to change, as I don't see 'length' being used anywhere else.
    Please let me know how to change the sequence length for both normal_bert and mix_text.

  2. I tried to run this on sklearn's NewsGroup dataset, which contains 20 labels. I've changed 'read_data.py' appropriately to take in this data and also performed back-translations as described. But I couldn't get good results with 'MixText'. Whatever amount of labeled data I use, I get almost identical, very low accuracies. Training time also remains the same irrespective of the amount of labeled data.
    I'm assuming there is no issue with my data reading & pre-processing because 'normal_bert' runs fine.
    Please do let me know if there is any known issue with using MixText on a multi-class dataset.

Thanks!

About dataset construction

Hello, your work is excellent. I have a small question about the construction of the dataset. After reading your code, I found that the unlabeled data is obtained by stratified sampling from the labeled data, so that the number of samples per class in the training set is balanced. But in real scenarios we cannot know the labels of the unlabeled data, and therefore cannot construct such a balanced training set. Can your method deal with this situation?

KeyError: unlabeled_train_iter.next()

%run /code/train.py --gpu=0 --n-labeled=10 --data-path /yahoo_answers_csv/ --batch-size=4 --batch-size-u=8 --epochs=50 --val-iteration=20 --lambda-u=0 --T=0.5 --alpha=16 --mix-layers-set 7 9 12 --separate-mix=True

train(labeled_trainloader, unlabeled_trainloader, model, optimizer, scheduler, criterion, epoch, n_labels, train_aug)
204 (inputs_u, inputs_u2, inputs_ori), (length_u,
--> 205 length_u2, length_ori) = unlabeled_train_iter.next()
206 except:

/torch/utils/data/dataloader.py in __next__(self)
520 self._reset()
--> 521 data = self._next_data()
522 self._num_yielded += 1

/torch/utils/data/dataloader.py in _next_data(self)
560 index = self._next_index() # may raise StopIteration
--> 561 data = self._dataset_fetcher.fetch(index) # may raise StopIteration
562 if self._pin_memory:

/torch/utils/data/_utils/fetch.py in fetch(self, possibly_batched_index)
43 if self.auto_collation:
---> 44 data = [self.dataset[idx] for idx in possibly_batched_index]
45 else:

/torch/utils/data/_utils/fetch.py in <listcomp>(.0)
43 if self.auto_collation:
---> 44 data = [self.dataset[idx] for idx in possibly_batched_index]
45 else:

/code/read_data.py in __getitem__(self, idx)
209 if self.aug is not None:
--> 210 u, v, ori = self.aug(self.text[idx], self.ids[idx])
211 encode_result_u, length_u = self.get_tokenized(u)

/code/read_data.py in __call__(self, ori, idx)
22 def __call__(self, ori, idx):
---> 23 out1 = self.de[idx]
24 out2 = self.ru[idx]

KeyError: 9226

May I ask what causes this error? Thank you very much.

How to select unlabeled data from the training dataset?

Hi, I am trying to apply your code to Chinese datasets. In the function get_data, your code splits labeled and unlabeled samples with train_unlabeled_idxs.extend(idxs[500: 500 + 5000]), but idxs comes from idxs = np.where(labels == i)[0], so I think all of the idxs point to labeled data. Why do you choose the unlabeled samples from idxs? We would appreciate it if you could upload some sample datasets from your experiments. Thanks.

attention mask

I want to ask why, in mixtext.py, torch.ones is used to define attention_mask2 and torch.zeros to define token_type_ids2, instead of using the real attention mask and token types. Also, after the hidden representations are mixed, attention_mask is used without mixing in attention_mask2; can you explain the reason for that? Thank you!

Details about Attention Mask

Hi authors, great work! I have a question about the attention mask:
Suppose I want to perform TMix over two sequences of different lengths. I can pass their respective attention masks to BertEncoder4Mix, but after I perform the mixup at mix_layer, which attention_mask should I pass along for the rest of the BERT layers? My intuition is that you should use the attention mask of the longer sequence for the mixed hidden states (otherwise some text tokens may be masked out); may I check if you agree with this?

After briefly reading through the code in MixText.py, it seems that you've just ignored the attention mask and didn't mask out the padding tokens. Would this affect model performance compared to adding the masking properly?

Missing Special Inputs for the BERT such as <CLS> or <SEP>

Hi,

I'm currently working with your repository, and it seems that some special tokens for BERT are missing (e.g. <CLS> or <SEP>).

From my understanding, those special tokens are not concatenated to the input text before it is fed to the BERT model.

Is it right? Or am I missing it?

Thank you

TypeError: init_weights() missing 1 required positional argument: 'module'

python ./code/normal_train.py --gpu 0,1 --n-labeled 10 --data-path ./data/yahoo_answers_csv/ \
--batch-size 8 --epochs 20

python ./code/train.py --gpu 0,1 --n-labeled 10 --data-path ./data/yahoo_answers_csv/ \
--batch-size 8 --batch-size-u 1 --epochs 50 --val-iteration 20 \
--lambda-u 0 --T 0.5 --alpha 16 --mix-layers-set 7 9 12 --separate-mix True

Traceback (most recent call last):
File "./code/train.py", line 451, in
main()
File "./code/train.py", line 112, in main
model = MixText(n_labels, args.mix_option).cuda()
File "/xxxxx/MixText-master/code/mixtext.py", line 172, in init
self.bert = BertModel4Mix.from_pretrained('bert-base-uncased')
File "/xxxxx/MixText-master/venv/lib/python3.6/site-packages/pytorch_transformers/modeling_utils.py", line 474, in from_pretrained
model = cls(config, *model_args, **model_kwargs)
File "/xxxxx/MixText-master/code/mixtext.py", line 17, in init
self.init_weights()
TypeError: init_weights() missing 1 required positional argument: 'module'

Few questions about the repository

Hello. I read and enjoyed your paper. I think this work (paper & repository) is a charm.
I have a few questions regarding your repository,

  1. Is the baseline, i.e. UDA (Xie et al., 2019), tested using the implemented code?
    If so, would you provide a way to run the UDA code?

  2. As for the datasets other than Yahoo, do I have to pre-process them as well, or can I just use the raw files from the Google Drive you uploaded?

I think your work is wonderful :)

Thank you.

Pretty low accuracy for yahoo answers mixtext

Hi,
Firstly, I quite liked the paper and enjoyed reading it. I tried to run this code on the Yahoo Answers dataset, but unfortunately the best accuracy does not cross 0.24. Since I have 2 GPUs at my disposal, I set batch-size to 2 and batch-size-u to 4 (I also tried batch-size 3 and batch-size-u 4) and val-iteration to 2000. So I was wondering if you could let me know whether I should change other parameters to make it work? (I understand that reducing the batch size should have some impact, but I am not sure the impact can be this significant, so I thought it might be due to some other factors or weighting.)

The command I used (I downloaded the dataset from the link provided and placed it in the directory):

python ./code/train.py --gpu 0,1 --n-labeled 10 --data-path ./data/yahoo_answers_csv/ --batch-size 2 --batch-size-u 4 --epochs 20 --val-iteration 2000 --lambda-u 1 --T 0.5 --alpha 16 --mix-layers-set 7 9 12 --lrmain 0.000005 --lrlast 0.0005

The second question I had was about the use of args.val_iteration. As I understand it, it is the number of batches processed in an epoch. So I was wondering how it works if my number of labeled examples per class is 10, i.e. 100 labeled examples (for the 10 classes in Yahoo Answers). If I have a batch size of 2, that would be 50 batches for the data loader. So does it repeat instances in a cyclic manner, or does it just randomly pick 2 instances each time?

Thanks

Parameter problem

I recently studied your project and carefully read your ACL paper. However, when I tried to run your code, I didn't find the setting of the epoch parameter, and the settings of the other parameters in the code were vague and not fully specified in the paper, especially the epoch parameter. Therefore, I can't reproduce the results in your paper. Can you provide me with a table of the parameter settings for each dataset?

We look forward to your reply

Access to Sampled Data?

In the experiments you did down-sampling of the datasets (e.g., sampling 10 labeled examples per class). May I check whether the sampled datasets are available for download? I'm asking because if we do the sampling ourselves, differences in the sampled data may make the experimental results not directly comparable.

Reproducing UDA Results

Hi @diyiy @jiaaoc ,

just a quick question: Can I also use the code found in this repo to reproduce your UDA results reported in the paper? If so, how?

Thanks :)

hyper-parameters

I notice that:
# Based on translation qualities, choose different weights here.
# For AG News: German: 1, Russian: 0, ori: 1
# For DBPedia: German: 1, Russian: 1, ori: 1
# For IMDB: German: 0, Russian: 0, ori: 1
# For Yahoo Answers: German: 1, Russian: 0, ori: 1 / German: 0, Russian: 0, ori: 1
You give the setting of these hyper-parameters for the AG News, DBPedia, IMDB, and Yahoo Answers datasets.
Does this mean that German: 1, Russian: 0, ori: 1 and German: 0, Russian: 0, ori: 1 achieve similar results?
In other words, can I use any setting to get the results reported in the paper?

Can you provide all back translation data?

Thanks for your interesting work!
I am trying to do some experiments using your code. I find that it is too expensive to augment the unlabeled data through back translation myself because of limited resources. I can only find the back-translation data for the Yahoo dataset in this codebase. Can you provide all of the back-translation data?
Thanks!

reproducing results for BERT baseline

hello!
I tried to run the BERT baseline model that only uses labeled data with the command you provided:
python ./code/normal_train.py --gpu 0,1 --n-labeled 10 --data-path ./data/yahoo_answers_csv/ \
--batch-size 8 --epochs 20

but I get only 20% accuracy, while the paper reports 56.2%.
Is this exactly the configuration you used for training the BERT model, or is something missing from this command?

Question about the data augmentation.

Hello, I am following your work!
I have a question about Figure 2 in your paper.
Why does the figure show that you only augment the unlabeled data?
Looking forward to your answer!
