presumm's Issues

Regarding Decoding Phase

I am confused about the decoding phase. Suppose S1, S2, S3 are three sentences and each sentence has tokens t1, t2, ..., tn (n denotes the total number of tokens in the sentence).

What I understand is that at the end of the encoding phase we will have the tokens: [CLS] [S1,T1] [S1,T2] ... [SEP] [CLS] [S2,T1] [S2,T2] ... [SEP] [CLS] [S3,T1] [S3,T2] ... [SEP] ...

So during the decoding phase, isn't the decoder attending to all of the tokens, including [CLS], [Si,Tj], [SEP], etc.? Or is it attending only to the [CLS] token of each sentence? I did not find a detailed explanation of the decoding phase in the paper.

Index out of range when testing on CNN Daily Mail dataset

I am facing an index out of range error:
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
in
10
11 predictor = build_predictor(args, tokenizer, symbols, summ)
---> 12 predictor.translate(test_iter, step)

~\Documents\Abstractive_Summarisation\PreSUMM_pytorch\src\models\predictor.py in translate(self, data_iter, step, attn_debug)
147 gold_tgt_len = batch.tgt.size(1)
148 self.min_length = gold_tgt_len + 20
--> 149 self.max_length = gold_tgt_len + 60
150 batch_data = self.translate_batch(batch)
151 translations = self.from_batch(batch_data)

~\Documents\Abstractive_Summarisation\PreSUMM_pytorch\src\models\predictor.py in translate_batch(self, batch, fast)
216 return self._fast_translate_batch(
217 batch,
--> 218 self.max_length,
219 min_length = self.min_length)
220

~\Documents\Abstractive_Summarisation\PreSUMM_pytorch\src\models\predictor.py in _fast_translate_batch(self, batch, max_length, min_length)
233 segs = batch.segs
234 mask_src = batch.mask_src
--> 235
236 src_features = self.model.bert(src, segs, mask_src)
237 dec_states = self.model.decoder.init_decoder_state(src, src_features, with_cache=True)

~\Anaconda3\envs\pytorch\lib\site-packages\torch\nn\modules\module.py in __call__(self, *input, **kwargs)
487 result = self._slow_forward(*input, **kwargs)
488 else:
--> 489 result = self.forward(*input, **kwargs)
490 for hook in self._forward_hooks.values():
491 hook_result = hook(self, input, result)

~\Documents\Abstractive_Summarisation\PreSUMM_pytorch\src\models\model_builder.py in forward(self, x, segs, mask)
125 def forward(self, x, segs, mask):
126 if(self.finetune):
--> 127 top_vec, _ = self.model(x, segs, attention_mask=mask)
128 else:
129 self.eval()

~\Anaconda3\envs\pytorch\lib\site-packages\torch\nn\modules\module.py in __call__(self, *input, **kwargs)
487 result = self._slow_forward(*input, **kwargs)
488 else:
--> 489 result = self.forward(*input, **kwargs)
490 for hook in self._forward_hooks.values():
491 hook_result = hook(self, input, result)

~\Anaconda3\envs\pytorch\lib\site-packages\pytorch_transformers\modeling_bert.py in forward(self, input_ids, token_type_ids, attention_mask, position_ids, head_mask)
710 head_mask = [None] * self.config.num_hidden_layers
711
--> 712 embedding_output = self.embeddings(input_ids, position_ids=position_ids, token_type_ids=token_type_ids)
713 encoder_outputs = self.encoder(embedding_output,
714 extended_attention_mask,

~\Anaconda3\envs\pytorch\lib\site-packages\torch\nn\modules\module.py in __call__(self, *input, **kwargs)
487 result = self._slow_forward(*input, **kwargs)
488 else:
--> 489 result = self.forward(*input, **kwargs)
490 for hook in self._forward_hooks.values():
491 hook_result = hook(self, input, result)

~\Anaconda3\envs\pytorch\lib\site-packages\pytorch_transformers\modeling_bert.py in forward(self, input_ids, token_type_ids, position_ids)
263
264 words_embeddings = self.word_embeddings(input_ids)
--> 265 position_embeddings = self.position_embeddings(position_ids)
266 token_type_embeddings = self.token_type_embeddings(token_type_ids)
267

~\Anaconda3\envs\pytorch\lib\site-packages\torch\nn\modules\module.py in __call__(self, *input, **kwargs)
487 result = self._slow_forward(*input, **kwargs)
488 else:
--> 489 result = self.forward(*input, **kwargs)
490 for hook in self._forward_hooks.values():
491 hook_result = hook(self, input, result)

~\Anaconda3\envs\pytorch\lib\site-packages\torch\nn\modules\sparse.py in forward(self, input)
116 return F.embedding(
117 input, self.weight, self.padding_idx, self.max_norm,
--> 118 self.norm_type, self.scale_grad_by_freq, self.sparse)
119
120 def extra_repr(self):

~\Anaconda3\envs\pytorch\lib\site-packages\torch\nn\functional.py in embedding(input, weight, padding_idx, max_norm, norm_type, scale_grad_by_freq, sparse)
1452 # remove once script supports set_grad_enabled
1453 no_grad_embedding_renorm(weight, input, max_norm, norm_type)
-> 1454 return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
1455
1456

RuntimeError: index out of range at c:\n\pytorch_1559129895673\work\aten\src\th\generic/THTensorEvenMoreMath.cpp:191
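For context, a common trigger for this particular failure is feeding BERT a sequence longer than its 512 position embeddings (the same 512 limit that the -max_pos 512 flag in the commands quoted elsewhere on this page refers to). A hypothetical guard before the encoder call might look like the sketch below; the tensor names follow the traceback above, and this is not code from the repository:

    # Hypothetical guard (sketch, not repository code): clip inputs to BERT's
    # 512 position embeddings before calling self.model.bert(src, segs, mask_src).
    max_pos = 512
    src = src[:, :max_pos]
    segs = segs[:, :max_pos]
    mask_src = mask_src[:, :max_pos]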

unable to preprocess sample json file into pt file

Hi Yang, do you know how I can fix this problem?

This is the command I use to reproduce the error:

python preprocess.py -mode format_to_bert \
-raw_path /root/PreSumm/json_data/ \
-save_path /root/PreSumm/sample_data/  \
-lower \
-n_cpus 1 \
-log_file /root/PreSumm/logs/preprocess.log

I got the following error when trying to convert the sample json file to pt file:

Traceback (most recent call last):
  File "preprocess.py", line 73, in <module>
    eval('data_builder.'+args.mode + '(args)')
  File "<string>", line 1, in <module>
  File "/root/PreSumm/src/prepro/data_builder.py", line 287, in format_to_bert
    for d in pool.imap(_format_to_bert, a_lst):
  File "/root/anaconda3/envs/xsum/lib/python3.6/site-packages/multiprocess/pool.py", line 735, in next
    raise value
KeyError: '[SEP]'

Auto-regressive generation implementation

I have found the loop over batches/transformer layers, but I could not find the loop that generates the outputs at different steps autoregressively in TransformerDecoder during training, and it seems to me there is only one pass when generating the decoder outputs.

Could you please point me to where the auto-regressive generation is implemented?

Thanks a lot!
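For orientation, standard Transformer training computes all target positions in a single forward pass with a causal mask (teacher forcing); token-by-token generation normally happens only at inference time, e.g. in the beam-search code that the predictor.py tracebacks elsewhere on this page refer to (_fast_translate_batch). The sketch below is a generic illustration of the two modes, not the repository's code; decoder is a hypothetical callable returning per-position logits:

    import torch

    def train_step(decoder, memory, tgt):
        # One pass over the whole (shifted) target; a causal mask inside the decoder
        # prevents position t from attending to positions > t, so no explicit loop is needed.
        return decoder(tgt[:, :-1], memory)            # logits for every position at once

    def generate(decoder, memory, bos_id, max_len):
        # Token-by-token loop, used only at inference time (greedy here; beam search in practice).
        ys = torch.full((memory.size(0), 1), bos_id, dtype=torch.long)
        for _ in range(max_len):
            logits = decoder(ys, memory)               # rerun on everything generated so far
            next_tok = logits[:, -1].argmax(dim=-1, keepdim=True)
            ys = torch.cat([ys, next_tok], dim=1)
        return ys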

Impact of number of GPUs

I have only 2 GPUs. I'm trying to train the abstractive model on these 2 GPUs, and I was wondering what the impact is of training on only 2 GPUs instead of the 4 mentioned in the paper.

Since every GPU runs:

train_abs_single(args, device_id)

it seems to me that hyperparameters should be adapted based on the number of GPUs used:

  • Training for 50k steps on 1 GPU: in total it is 50k steps.
  • Training for 50k steps on 3 GPUs: each GPU will run 50k steps, resulting in a training of 150k steps (?)

  1. Did I understand the code right?
  2. What other hyperparameters do I need to change based on the number of GPUs used? Warmup steps? Save-checkpoint steps?
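As a rough framing (an assumption about the multi-GPU loop, not something verified against the code): if gradients are synchronized across processes, train_steps counts shared optimizer updates, and what changes with fewer GPUs is the effective batch per update rather than the number of updates. A back-of-the-envelope sketch with illustrative values:

    # Illustrative values only; batch_size here is the token budget per GPU per forward pass.
    batch_size = 140
    accum_count = 5
    n_gpus = 2
    tokens_per_update = batch_size * accum_count * n_gpus   # 1400 with 2 GPUs vs 2800 with 4
    # Keeping tokens_per_update comparable when dropping from 4 to 2 GPUs would mean
    # doubling accum_count (or batch_size); warmup/save steps stay in optimizer-step units.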

Decoder initial state not initialised?

Hi,
thanks for the generosity in sharing the code. I have a question: I noticed that the decoder is not initialised with either the last hidden states of the BERT representation (the output from BERT) or the encoded states of the source sub-tokens, even though a decoder is usually initialised with something.

def _init_cache(self, memory_bank, num_layers):

In the above line of code, I believe the "memory_bank" variable is usually used for that, but it's not being used in your code.

Is this an intentional design choice? Or am I missing something?

Thanks!
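As general background (standard Transformer behaviour, offered as context rather than a statement about this repository's design): a Transformer decoder consumes the encoder output through cross-attention at every layer, so unlike an RNN decoder it does not need an initial hidden state; a memory_bank-style tensor enters through attention instead. A minimal illustration with PyTorch's stock module:

    import torch
    import torch.nn as nn

    layer = nn.TransformerDecoderLayer(d_model=768, nhead=8)
    tgt = torch.zeros(5, 1, 768)        # (target_len, batch, d_model): decoder inputs
    memory = torch.zeros(128, 1, 768)   # (source_len, batch, d_model): encoder/BERT states
    out = layer(tgt, memory)            # memory is consumed via cross-attention inside the layer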

How is accuracy measured, and is it a percentage?

I'm trying to train a model and after 40K steps I still have an accuracy of 6-6.3. When you were training your model, what was your accuracy around this step? I did make a few changes to your parameters and some of the code, so I'm wondering whether I'm going in the wrong direction.

ValueError: too many values to unpack (expected 5)

Hello, when I set mode to validate or test, I encounter the following problem. Do you have a good solution? Thank you! @nlpyang
[2019-09-16 17:03:34,533 INFO] loading weights file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-pytorch_model.bin from cache at ../../temp/aa1ef1aede4482d0dbcd4d52baad8ae300e60902e88fcb0bebdec09afd232066.36ca03ab34a1a5d5fa7bc3d03d55c4fa650fed07220e2eeebc06ce58d0e9a157
[2019-09-16 17:03:37,976 INFO] Loading test dataset from cnndm/cnndm.test.0.bert.pt, number of examples: 2001
[2019-09-16 17:03:51,366 INFO] loading file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt from cache at ../../temp/26bc1ad6c0ac742e9b52263248f6d0f00068293b33709fae12320c0e35ccfbbb.542ce4285a40d23a559526243235df47c5f75c197f04f37d1a0c124c32c9a084
Traceback (most recent call last):
File "train.py", line 124, in
validate_abs(args, device_id)
File "/home/cai/yym/ddl/final/PreSumm-master/train_abstractive.py", line 155, in validate_abs
test_abs(args, device_id, cp, step)
File "/home/cai/yym/ddl/final/PreSumm-master/train_abstractive.py", line 225, in test_abs
predictor.translate(test_iter, step)
File "/home/cai/yym/ddl/final/PreSumm-master/models/predictor.py", line 153, in translate
translations = self.from_batch(batch_data)
File "/home/cai/yym/ddl/final/PreSumm-master/models/predictor.py", line 106, in from_batch
batch.tgt_str, batch.src))
ValueError: too many values to unpack (expected 5)

Abstractor validation: ValueError (too many values to unpack)

I ran the abstractive validation with the following command:

python train.py -task abs \
-mode validate \
-batch_size 3000 \
-test_batch_size 500 \
-bert_data_path ~/PreSumm/bert_data/cnndm \
-log_file ../logs/val_abs_bert_cnndm \
-model_path ../models/BertSumExtAbs \
-sep_optim true \
-use_interval true \
-visible_gpus 1 \
-max_pos 512 \
-max_length 200 \
-alpha 0.95 \
-min_length 50 \
-result_path ../logs/abs_bert_cnndm \
-test_all

But at test time (after evaluating every checkpoint on the valid dataset), I got this error:

Traceback (most recent call last):
File "train.py", line 124, in
validate_abs(args, device_id)
File "[...]/PreSumm/src/train_abstractive.py", line 140, in validate_abs
test_abs(args, device_id, cp, step)
File "[...]/PreSumm/src/train_abstractive.py", line 225, in test_abs
predictor.translate(test_iter, step)
File "[...]/PreSumm/src/models/predictor.py", line 157, in translate
translations = self.from_batch(batch_data)
File "[...]/PreSumm/src/models/predictor.py", line 110, in from_batch
batch.tgt_str, batch.src))
ValueError: too many values to unpack (expected 5)

Does the commented-out code for max_pos need to be uncommented?

If max_pos is greater than 512, should we uncomment this:

 # if(args.max_pos>512):
        #     my_pos_embeddings = nn.Embedding(args.max_pos, self.bert.model.config.hidden_size)
        #     my_pos_embeddings.weight.data[:512] = self.bert.model.embeddings.position_embeddings.weight.data
        #     my_pos_embeddings.weight.data[512:] = self.bert.model.embeddings.position_embeddings.weight.data[-1][None,:].repeat(args.max_pos-512,1)
        #     self.bert.model.embeddings.position_embeddings = my_pos_embeddings

and also include it in ExtSummarizer to increase the position size?
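For reference, here is the quoted block with the comment markers removed (a sketch of what uncommenting would look like; whether this is the intended way to go beyond 512 positions is exactly the question above):

    if (args.max_pos > 512):
        my_pos_embeddings = nn.Embedding(args.max_pos, self.bert.model.config.hidden_size)
        my_pos_embeddings.weight.data[:512] = self.bert.model.embeddings.position_embeddings.weight.data
        my_pos_embeddings.weight.data[512:] = self.bert.model.embeddings.position_embeddings.weight.data[-1][None, :].repeat(args.max_pos - 512, 1)
        self.bert.model.embeddings.position_embeddings = my_pos_embeddings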

Increase the summary length, or number of sentences

Hello!

How can I tune this model to generate more sentences in the summary? Right now it averages 3 sentences for the ext model. I would like to have 2 or 5 more sentences...

Thank you very much for your time!

All the best.

if the max is 512 for tokenization, do you tokenize in chunks?

If I'm reading your codebase correctly, text that is longer than 512 tokens gets shortened to match the size limit for the BERT tokenizer. What if the article size is greater than 512? How are you chunking the text? Can you point to the location in your code where you split the text once it's larger than 512 for BERT (file name and line number if possible, for review)? Also, once the chunks are encoded, are you concatenating them before they go through the decoder?

Using Model for Inference

Hi, thanks for your work here. How do we use the pretrained models for inference (summarization) on article text? I have downloaded the trained CNN/DM model (.pt), but how do I load it in a Python program? Thanks!

Dependency issue

Hi,

Thanks for sharing this excellent work. Is it possible to create a requirements.txt file so as to make the env setup procedure easier?

Thanks!

the candidate results of all the samples are the same

Hello!
First, thanks for your contribution.
When I try to test BertSumAbs, the command is:

python train.py -task abs -mode test -batch_size 30 -test_batch_size 5 -bert_data_path ../bert_data_cnndm_final/cnndm -log_file ../logs/val_abs_bert_cnndm_eng -model_path ../models/abs_trans_eng/ -sep_optim true -use_interval true -visible_gpus 0 -max_pos 512 -max_length 200 -alpha 0.95 -min_length 50 -result_path ../results/abs_bert_cnndm_eng/ -test_from ../models/abs_trans_eng/model_step_200000.pt

I got a wrong candidate file; for example, PreSumm-master/results/abs_bert_cnndm_eng/.200000.candidate contains:
new : new : : : new york 's the u.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s
(the same line is repeated for every test example)

It seems like all the test samples got the same prediction, and the result is quite bad, but both the ".200000.gold" and ".200000.raw_src" files are correct.
Is there anything wrong?

Is there a method on how to restrict the output size?

Hey,
I want to restrict the output size by words/characters. Are there any methods?
I looked at max_length & max_tgt_len, however they do not seem to be working.
I set:
recall_eval to False
max_length to 20
max_tgt_len to 20
min_length to 10
However, the output remained the same size.

Note: using the inference command below (pre-trained CNN/DM extractive model):
python src/train.py -task ext -mode test -test_from ./models/bertext.pt -batch_size 3000 -test_batch_size 500 -bert_data_path ./bert_path/ -log_file ./logs/val_abs_bert_cnndm -sep_optim true -use_interval true -visible_gpus -1 -max_pos 512 -max_length 20 -max_tgt_len 20 -alpha 0.95 -min_length 10 -result_path ./results/ -report_rouge false -recall_eval False

Step 4. Format to Simpler Json Files

Could you clarify what Step 4 (Format to Simpler Json Files) is supposed to do?
My case: I have my own dataset and I am trying to apply these steps to it. I have performed Step 3 (Sentence Splitting and Tokenization) and generated the JSON files.
For my own dataset, Step 4 did not produce anything. After studying the code related to Step 4, in the function format_to_lines in data_builder.py: this function compares my JSON file names against the mapping files with the same names in the URL directory. I think the issue is in this loop:

for line in open(pjoin(args.map_path, 'mapping_' + corpus_type + '.txt')):
    temp.append(hashhex(line.strip()))
corpus_mapping[corpus_type] = {key.strip(): 1 for key in temp}

The lengths of corpus_mapping[corpus_type] are:

corpus_mapping valid 13368
corpus_mapping test 11490
corpus_mapping train 287227

train_files, valid_files, test_files = [], [], []
print("glob jason", glob.glob(pjoin(args.raw_path, '*.json')))
for f in glob.glob(pjoin(args.raw_path, '*.json')):
    print("f", f)
    real_name = f.split('/')[-1].split('.')[0]
    print("real_name", real_name)
    if (real_name in corpus_mapping['valid']):
        valid_files.append(f)
    elif (real_name in corpus_mapping['test']):
        test_files.append(f)
    elif (real_name in corpus_mapping['train']):
        train_files.append(f)
    # else:
    #     train_files.append(f)
print("len train_files, valid_files, test_files ", len(train_files), len(valid_files), len(test_files))

len train_files, valid_files, test_files 0 0 0
Could you help me?
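One workaround that is visible in the quoted code itself: the commented-out else branch would send files that do not appear in the CNN/DM URL mapping to the training split. A sketch of re-enabling it for a custom dataset (explicitly splitting your own data into train/valid/test is probably cleaner):

    if (real_name in corpus_mapping['valid']):
        valid_files.append(f)
    elif (real_name in corpus_mapping['test']):
        test_files.append(f)
    elif (real_name in corpus_mapping['train']):
        train_files.append(f)
    else:
        # not in any CNN/DM mapping file -> treat as training data (custom corpus)
        train_files.append(f)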

Question on the term "pretrained BertSum"

In your paper, you mention "The encoder is pretrained BertSum", and I'm not entirely sure what "pretrained BertSum" means.

For an input document, you modify BERT as follows:
(1) Add [CLS] tokens at the beginning of each sentence
(2) Use interval segment embeddings

This leads to BertSum.

On what task(s) do you train the model that you then call "pretrained BertSum"?

Just the MLM task? Or both MLM and NSP, as in BERT pretraining?

Thanks in advance

Training condition

Hi there,
Why, in trainer_ext.py, is the condition for training
if self.n_gpu == 0 or (i % self.n_gpu == self.gpu_rank):
Doesn't this mean that when there is one or more GPU, the training step is passed over?

Training speed too slow with multiple GPUs

Hi,

Has anyone tried training with multiple GPUs?
I am using the same settings as in the README for BertAbs (4 GPUs and batch size 140). However, 200k steps are currently estimated to take around 10 days to finish.

(I am using pytorch==1.1)

How to Understand “abs_batch_size_fn” function

I see that train_batch_size is 140, but when I examine the shape of the data I find that it is always [3, 512] or [2, 512] and so on. Eventually, I found the function abs_batch_size_fn in src/models/data_loader.py that controls this value, but I cannot understand why the function is written this way.
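As context (a generic pattern, not a claim about the exact implementation in data_loader.py): sequence-to-sequence training often expresses batch_size as a token budget rather than an example count, which is why the number of examples per batch varies with sequence length. A minimal sketch of the idea:

    def batch_by_tokens(examples, token_budget):
        # Pack examples until (num_examples * longest_target) would exceed the budget.
        batch, max_len = [], 0
        for ex in examples:
            max_len = max(max_len, len(ex['tgt']))
            if batch and (len(batch) + 1) * max_len > token_budget:
                yield batch
                batch, max_len = [], len(ex['tgt'])
            batch.append(ex)
        if batch:
            yield batch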

I failed to replicate the results of the abs task

Thank you for sharing!
When I run the command
"python train.py -task abs -mode validate -test_all -bert_data_path /home/PreSumm/src/bert_data/ -ext_dropout 0.1 -model_path abs -lr 2e-3 -visible_gpus 1,2,3 -report_every 50 -save_checkpoint_steps 1000 -batch_size 500 -train_steps 50000 -accum_count 2 -log_file ../logs/ext_bert_cnndm -use_interval true -warmup_steps 10000 -max_pos 512",
I get the following result:

1 ROUGE-1 Average_R: 0.33535 (95%-conf.int. 0.33294 - 0.33767)
1 ROUGE-1 Average_P: 0.43293 (95%-conf.int. 0.43023 - 0.43579)
1 ROUGE-1 Average_F: 0.36442 (95%-conf.int. 0.36217 - 0.36671)

1 ROUGE-2 Average_R: 0.14269 (95%-conf.int. 0.14078 - 0.14462)
1 ROUGE-2 Average_P: 0.18674 (95%-conf.int. 0.18429 - 0.18927)
1 ROUGE-2 Average_F: 0.15570 (95%-conf.int. 0.15358 - 0.15776)

1 ROUGE-L Average_R: 0.31230 (95%-conf.int. 0.30999 - 0.31456)
1 ROUGE-L Average_P: 0.40353 (95%-conf.int. 0.40070 - 0.40642)
1 ROUGE-L Average_F: 0.33950 (95%-conf.int. 0.33733 - 0.34176)

[2019-09-09 21:34:50,109 INFO] Rouges at step 46000

ROUGE-F(1/2/3/l): 36.44/15.57/33.95
ROUGE-R(1/2/3/l): 33.53/14.27/31.23

Why is this result so much lower than the one given in the paper? The results I got on the ext task matched the paper.

❓ Use of `recall_eval` parameter

I have difficulty understanding the recall_eval parameter.

As far as I understand, it is used to produce a summary of the same length as the gold summary, truncating the number of predicted sentences to match the gold summary.

Did I understand that correctly?


Also, in this code:

if (self.args.recall_eval):
    _pred_str = ''
    gap = 1e3
    for sent in pred_str.split('<q>'):
        can_pred_str = _pred_str + '<q>' + sent.strip()
        can_gap = math.fabs(len(_pred_str.split()) - len(gold_str.split()))
        # if(can_gap>=gap):
        if (len(can_pred_str.split()) >= len(gold_str.split()) + 10):
            pred_str = _pred_str
            break
        else:
            gap = can_gap
            _pred_str = can_pred_str

I don't understand the role of the variables gap / can_gap; they seem unused. Can you explain this part of the code if you have time?

evaluate and test

I downloaded, then processed and trained on 'cnndm_sample.train.0.json' (documented here: https://github.com/antonysama/PreSumm/blob/master/README.md). However, the Evaluate and Test steps give the following errors:

###Evaluate code:
python PreSumm/src/train.py -task abs -mode validate -test_all -batch_size 8 -test_batch_size 2 -bert_data_path ~/o3/PreSumm/bert_data/ -log_file ~/o3/PreSumm/logs/val_abs_bert_cnndm -model_path ~/o3/PreSumm/models -sep_optim true -use_interval true -visible_gpus -1 -max_pos 512 -max_length 10 -alpha 0.95 -min_length 5 -result_path ~/o3/PreSumm/logs/abs_bert_cnndm

##Evaluate Error:
loading weights file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-pytorch_model.bin from cache at ../temp/aa1ef1aede4482d0dbcd4d52baad8ae300e60902e88fcb0bebdec09afd232066.36ca03ab34a1a5d5fa7bc3d03d55c4fa650fed07220e2eeebc06ce58d0e9a157
Traceback (most recent call last):
File "PreSumm/src/train.py", line 124, in <module>
validate_abs(args, device_id)
File "/home/antony/o3/PreSumm/src/train_abstractive.py", line 131, in validate_abs
xent = validate(args, device_id, cp, step)
File "/home/antony/o3/PreSumm/src/train_abstractive.py", line 187, in validate
shuffle=False, is_test=False)
File "/home/antony/o3/PreSumm/src/models/data_loader.py", line 136, in __init__
self.cur_iter = self._next_dataset_iterator(datasets)
File "/home/antony/o3/PreSumm/src/models/data_loader.py", line 156, in _next_dataset_iterator
self.cur_dataset = next(dataset_iter)
File "/home/antony/o3/PreSumm/src/models/data_loader.py", line 94, in load_dataset
yield _lazy_dataset_loader(pt, corpus_type)
File "/home/antony/o3/PreSumm/src/models/data_loader.py", line 78, in _lazy_dataset_loader
dataset = torch.load(pt_file)
File "/home/antony/.local/lib/python3.6/site-packages/torch/serialization.py", line 382, in load
f = open(f, 'rb')
FileNotFoundError: [Errno 2] No such file or directory: '/home/antony/o3/PreSumm/bert_data/.valid.pt'

##Test code:
python PreSumm/src/train.py -task abs -mode test -test_from ~/o3/PreSumm/bert_data/model_step_148000.pt -batch_size 16 -test_batch_size 2 -bert_data_path ~/o3/PreSumm/bert_data/ -log_file ~/o3/PreSumm/logs/val_abs_bert_cnndm -sep_optim true -use_interval true -visible_gpus -1 -max_pos 512 -max_length 10 -alpha 0.95 -min_length 5 -result_path ~/o3/PreSumm/logs/abs_bert_cnndm

##Test error:
Similar to the 'Evaluate' error above, but it ends with 'File not found...test.pt'.
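A possible reading of the missing file name (an observation from the error message, not a verified diagnosis): '.valid.pt' is a bare suffix, which suggests -bert_data_path is used as a file prefix rather than a directory; the other commands quoted on this page pass a prefix such as .../bert_data/cnndm. Illustration of the assumed naming scheme:

    # Assumed naming scheme, inferred from the FileNotFoundError above:
    bert_data_path = '/home/antony/o3/PreSumm/bert_data/cnndm'   # prefix, not a directory
    valid_file = bert_data_path + '.valid.pt'                    # -> .../bert_data/cnndm.valid.pt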

Using Model for Inference

Hi! I am also trying to use the model for inference, but I get an error when I use the above mode with a single checkpoint. I am using this command:
python train.py -task ext -mode train -bert_data_path ../bert_data/cnndm -ext_dropout 0.1 -model_path ../models -lr 2e-3 -visible_gpus 0 -report_every 50 -save_checkpoint_steps 1000 -batch_size 3000 -train_steps 50000 -accum_count 2 -log_file ../logs/ext_bert_cnndm -use_interval true -warmup_steps 10000 -max_pos 512

I get this error during train, validate, or test:
File "D:\code\PreSumm-master\PreSumm-master\src\models\data_loader.py", line 195, in preprocess
tgt = ex['tgt'][:self.args.max_tgt_len][:-1]+[2]
KeyError: 'tgt'

Dimension error while running TransformerAbs baseline

Hi, thanks for open-sourcing the code!

I am trying to run TransformerAbs with this command, and I got an error that seems related to the hidden size of the decoder.

python src/train.py  -task abs -mode train -encoder baseline \
 -dec_dropout 0.2 -sep_optim true -lr_bert 0.002 -lr_dec 0.2 \
 -save_checkpoint_steps 2000 -batch_size 140 -train_steps 200000 -report_every 50 \
 -accum_count 20 -use_interval true -warmup_steps_bert 20000 \
 -warmup_steps_dec 10000 -max_pos 512 -visible_gpus 2 \
 -log_file logs/abs_transformer_baseline_train.log
RuntimeError: Given normalized_shape=[768], expected input with shape [*, 768], but got input of size[1, 78, 512]

In model_builder.py, class AbsSummarizer, changing hidden_size=args.enc_hidden_size to hidden_size=args.dec_hidden_size allowed the model to run.

if (args.encoder == 'baseline'):
    bert_config = BertConfig(self.bert.model.config.vocab_size, hidden_size=args.enc_hidden_size,
                             num_hidden_layers=args.enc_layers, num_attention_heads=8,
                             intermediate_size=args.enc_ff_size,
                             hidden_dropout_prob=args.enc_dropout,
                             attention_probs_dropout_prob=args.enc_dropout)
    self.bert.model = BertModel(bert_config)
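Spelled out, the reporter's workaround would change the quoted call as below (their suggestion, not a confirmed upstream fix):

    # model_builder.py, class AbsSummarizer -- hidden_size switched to the decoder size:
    bert_config = BertConfig(self.bert.model.config.vocab_size, hidden_size=args.dec_hidden_size,
                             num_hidden_layers=args.enc_layers, num_attention_heads=8,
                             intermediate_size=args.enc_ff_size,
                             hidden_dropout_prob=args.enc_dropout,
                             attention_probs_dropout_prob=args.enc_dropout)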

Could you please check this?
Thanks!

License for the trained models

Can you guys add a licensing file for the trained models?

By the way, cool paper. I was looking into summarisation APIs; to test the model, a license for the models would be very helpful.

Training step issue

Hi there, here I found that the step += 1 is inside the for loop over train_iter.
In this case, are only the first 100 samples of the first batch used for training? Is that correct?

No module named 'models'

Hi, I tried to run the code in a Jupyter notebook, but I couldn't get past the imports. It keeps giving me the error mentioned in the title. How do I solve this problem?
Please respond.
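One common cause (an assumption about the setup, since the notebook's working directory isn't shown): the models package lives under PreSumm/src, so the imports only resolve when Python is started there or when src is on sys.path. A minimal sketch for a notebook, with a placeholder path:

    import sys
    sys.path.insert(0, '/path/to/PreSumm/src')   # adjust to your clone location (placeholder)
    from models import data_loader, model_builder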

❓ Confused about `batch_size`

I'm having difficulty wrapping my head around the batch_size parameter.

What exactly is the batch_size parameter ?

It's not the real batch size (i.e. how many samples can be processed at once).
So what is it exactly ?
And how can I choose the real batch size from this argument ?

Preprocessing txt files to JSON

Thank you @nlpyang for the excellent work!


The following can probably be avoided if you alter the ~/.bashrc file as advised in https://stanfordnlp.github.io/CoreNLP/download.html (see step 4, second part). But I just wanted to leave it here in case it's helpful for others.

This comment might be helpful to others who are trying to use this pipeline to process other texts and who are not already familiar with CoreNLP.

If you follow the README.md and get to option 2 of preprocessing, step 3, where the aim is to create tokenized JSONs from the input txt files (i.e. Sentence Splitting and Tokenization), you will be calling preprocess.py, which in turn calls prepro/data_builder.py, from which the Stanford CoreNLP Java code is invoked. I only got it to work after moving the unzipped files I downloaded from Stanford CoreNLP into my PreSumm/src folder (where I'm calling the bash command from the README), and also changing the following line in data_builder.py:

command = ['java', 'edu.stanford.nlp.pipeline.StanfordCoreNLP', '-annotators', 'tokenize,ssplit', '-ssplit.newlineIsSentenceBreak', 'always', '-filelist', 'mapping_for_corenlp.txt', '-outputFormat', 'json', '-outputDirectory', tokenized_stories_dir]
to:
command = ['java', '-cp', '*', 'edu.stanford.nlp.pipeline.StanfordCoreNLP', '-annotators', 'tokenize,ssplit', '-ssplit.newlineIsSentenceBreak', 'always', '-filelist', 'mapping_for_corenlp.txt', '-outputFormat', 'json', '-outputDirectory', tokenized_stories_dir]

I would be happy if a) anyone who did it differently could share their solution, b) my hack helps you if you are stuck on the same problem, and c) @nlpyang, perhaps consider adding this little comment to the README file, or a comment advising to do otherwise.

Thanks again!

How to run this project if I have downloaded your pre-processed data?

python train.py -task ext -mode train -bert_data_path BERT_DATA_PATH -ext_dropout 0.1 -model_path MODEL_PATH -lr 2e-3 -visible_gpus 0,1,2 -report_every 50 -save_checkpoint_steps 1000 -batch_size 3000 -train_steps 50000 -accum_count 2 -log_file ../logs/ext_bert_cnndm -use_interval true -warmup_steps 10000 -max_pos 512
Traceback (most recent call last):
File "train.py", line 10, in
from train_abstractive import validate_abs, train_abs, baseline, test_abs, test_text_abs
File "/home/aditya64/text summerizer/PreSumm-master/src/train_abstractive.py", line 15, in
from pytorch_transformers import BertTokenizer
ModuleNotFoundError: No module named 'pytorch_transformers'

custom data

How would one run this on custom data, such as a couple of news articles that are not from the CNN, DM, or XSum sources used here?

Loading pretrained model from README

Thanks so much for sharing your code and all your hard work, nlpyang!

I have a quick question: I understand I can follow the README instructions to train a local version of the model. However, it seems you also provide links to pretrained models for download. I followed the one for abstractive summarization, i.e. CNN/DM Abstractive, got the .pt file, and tried to load it with PyTorch.

However, PyTorch requires some models.py module to load the .pt file, and I don't believe the source code has a file of that name. Should I rename model_builder.py to models.py? Is there another file I can use as the model definition? Should I just train my own local copy and forget it?

If anyone got the pretrained model to work please let me know!
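A sketch of what direct loading might look like, under two assumptions: the .pt file is a checkpoint dictionary saved by this repository's training loop, and the pickled objects reference the models package under PreSumm/src (which would explain the missing-module error). The file name below is a placeholder:

    import sys
    import torch

    sys.path.insert(0, '/path/to/PreSumm/src')    # so pickled references to models.* resolve
    checkpoint = torch.load('cnndm_abs.pt', map_location='cpu')   # placeholder file name
    print(checkpoint.keys())                      # inspect what the checkpoint actually contains
    # For end-to-end summarization, the commands quoted in other issues here use
    # train.py -mode test -test_from <checkpoint> rather than loading the .pt by hand.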

PyTorch 1.2.0 compatibility issues

The current code is not compatible with PyTorch version 1.2.0 :

File "[...]/PreSumm/src/models/data_loader.py", line 33, in init
mask_src = 1 - (src == 0)
File "[...]/presum/lib/python3.6/site-packages/torch/tensor.py", line 325, in rsub
return _C._VariableFunctions.rsub(self, other)
RuntimeError: Subtraction, the - operator, with a bool tensor is not supported. If you are trying to invert a mask, use the ~ or bitwise_not() operator instead.

This masking issue is discussed here.

I simply went back to PyTorch 1.1.0, but maybe there are other compatibility issues?


You might want to specify the version of PyTorch in the README :)
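For reference, the replacement the error message itself suggests, applied to the quoted line (a sketch for this one call site only; other masks in the code may need the same treatment):

    # data_loader.py, quoted line -- boolean inversion instead of arithmetic on a bool tensor:
    mask_src = ~(src == 0)        # or equivalently: (src != 0)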

Error while using model for inference : RuntimeError: expected device cpu and dtype Byte but got device cpu and dtype Bool

I am using the abstractive model for inference with the following command:

python3 train.py -task abs -mode validate -batch_size 3000 -test_batch_size 500 -bert_data_path ../bert_data/cnndm -log_file ../logs/val_abs_bert_cnndm -model_path ../../model/ -sep_optim true -use_interval true -visible_gpus -1 -max_pos 512 -max_length 200 -alpha 0.95 -min_length 50 -result_path ../logs/abs_bert_cnndm

I am getting the error below.

File "train.py", line 124, in
validate_abs(args, device_id)
File "/home/exa00117/Practice/textSummerization/presumm/PreSumm-master/src/train_abstractive.py", line 154, in validate_abs
validate(args, device_id, cp, step)
File "/home/exa00117/Practice/textSummerization/presumm/PreSumm-master/src/train_abstractive.py", line 196, in validate
stats = trainer.validate(valid_iter, step)
File "/home/exa00117/Practice/textSummerization/presumm/PreSumm-master/src/models/trainer.py", line 197, in validate
outputs, _ = self.model(src, tgt, segs, clss, mask_src, mask_tgt, mask_cls)
File "/home/exa00117/.local/lib/python3.5/site-packages/torch/nn/modules/module.py", line 547, in call
result = self.forward(*input, **kwargs)
File "/home/exa00117/Practice/textSummerization/presumm/PreSumm-master/src/models/model_builder.py", line 243, in forward
decoder_outputs, state = self.decoder(tgt[:, :-1], top_vec, dec_state)
File "/home/exa00117/.local/lib/python3.5/site-packages/torch/nn/modules/module.py", line 547, in call
result = self.forward(*input, **kwargs)
File "/home/exa00117/Practice/textSummerization/presumm/PreSumm-master/src/models/decoder.py", line 202, in forward
step=step)
File "/home/exa00117/.local/lib/python3.5/site-packages/torch/nn/modules/module.py", line 547, in call
result = self.forward(*input, **kwargs)
File "/home/exa00117/Practice/textSummerization/presumm/PreSumm-master/src/models/decoder.py", line 64, in forward
dec_mask = torch.gt(tgt_pad_mask + self.mask[:, :tgt_pad_mask.size(1), :tgt_pad_mask.size(1)], torch.tensor(0))
RuntimeError: expected device cpu and dtype Byte but got device cpu and dtype Bool

Is the configuration the same for BERT-large?

If I were to train using BERT-large, what modifications, if any, should be made? For example, the number of decoder layers, the attention heads for the extractor and decoder, dec_ff_size, enc_ff_size, enc_hidden_size, ext_hidden_size, dec_hidden_size? Can the defaults be used?

🐛`-test_all` does not actually test all checkpoints

When testing checkpoints with the -mode validate and -test_all options, I expect the code to run validation on every checkpoint.

However, it sometimes silently stops after evaluating X checkpoints. I believe it's because of this code:

if (i - max_step > 10):
    break

Is it early stopping? (If there is no improvement after 10 checkpoints, the remaining checkpoints are not evaluated.)

If so, it might be better to let the user choose and expose this as a parameter.

Restart training from an existing checkpoint

Hi, thanks for sharing your code! Since the training takes a very long time, I am wondering if it is possible to stop the training at a particular checkpoint, save the checkpoint, and then restart the training after loading that same saved checkpoint. If yes, can you please let me know how it can be done? Thanks a lot!

❓ Question about predictions cleaning : Why replace '+' ?

I have a question about the post-processing of the model's predictions.

At this line :

pred_str = pred.replace('[unused0]', '').replace('[unused3]', '').replace('[PAD]', '').replace('[unused1]', '').replace(r' +', ' ').replace(' [unused2] ', '<q>').replace('[unused2]', '').strip()

I understand you clean the prediction to match the gold text better (for ROUGE score) :

  • You remove the beginning of sentence / end_of_sentence tokens
  • You replace the sentence_separator token with <q> (because it's the sentence separator used in the gold text)

But I don't understand this one :

.replace(r' +', ' ')

Why do you need to replace the + sign? The BERT tokenizer does not add any + sign during tokenization, so if it appears it must be part of the text.
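One detail worth noting when reading that line (a general Python fact, not a statement about the intent here): str.replace() does not interpret regular expressions, so r' +' is the literal two-character string "space plus", not "one or more spaces". A quick check:

    import re

    text = "hello   world +1"
    print(text.replace(r' +', ' '))     # only the literal " +" before "1" changes
    print(re.sub(r' +', ' ', text))     # what a regex-based collapse of spaces would look like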
