Comments (5)
We directly use the evaluation scripts of Du et al. (2017) to make the comparison fair. We suggest you also follow these evaluation scripts so that your results can be compared with other research works.
We only conducted the necessary post-processing to detokenize the subword units into words. In our model, texts are tokenized into subword units by WordPiece (Wu et al., 2016). For instance, the word "forecasted" is split into "forecast" and "##ed", where "##" indicates that the piece belongs to the preceding word. So before evaluation, we first need to detokenize. In the script qg/eval_on_unilm_tokenized_ref.py, we merge the pieces of each word (for instance, "forecast" and "##ed" are merged back into "forecasted").
Notice that we merge the pieces of each word in the released output to make the generated questions readable, but we do not merge the pieces in the gold tokenized question file (test.q.tok.txt). If you would like to use your own evaluation package (such as nlg-eval), the detokenization step is required. You could use the detokenize function in qg/eval_on_unilm_tokenized_ref.py to conduct the detokenization. Otherwise, there would be a tokenization mismatch, which results in incorrect evaluation scores.
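For illustration, here is a minimal sketch of the idea behind that detokenization step (not the exact implementation in qg/eval_on_unilm_tokenized_ref.py): any token starting with "##" is glued onto the previous token.

def detokenize(tokens):
    # Merge WordPiece subword units back into whole words.
    words = []
    for tok in tokens:
        if tok.startswith("##") and words:
            words[-1] += tok[2:]  # strip the "##" marker and append to the previous piece
        else:
            words.append(tok)
    return words

# Example: ["forecast", "##ed"] is merged back into ["forecasted"]
print(detokenize("the storm was forecast ##ed".split()))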
In addition, if you do not follow the evaluation scripts of Du et al. (2017), the results might differ slightly, because Du et al. (2017) consider multiple references (if any).
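As a rough sketch of why multiple references matter (this is not the Du et al. script, just an illustration with made-up sentences, using nltk):

from nltk.translate.bleu_score import corpus_bleu

# One generated question, scored against all gold questions for the same answer span.
hypotheses = ["what did the storm destroy ?".split()]
references = [[
    "what did the storm destroy ?".split(),
    "what was destroyed by the storm ?".split(),
]]

# With multiple references, clipped n-gram counts are taken against all of them,
# so extra references can only raise (or keep) the score compared to a single reference.
print(corpus_bleu(references, hypotheses))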
You could provide us with your evaluation command if you are still confused.
Thanks for the detailed clarification.
However, if I use the gold test questions (nqg_processed_data/tgt-test.txt) provided by Du et al. (2017) and your released test output (qg.test.output.txt), then there should be no need to detokenize them.
The results using the Du et al. (2017) script are as follows:
python2 eval.py --out_file ../../qg.test.output.txt --src_file ../../nqg_processed_data/src-test.txt --tgt_file ../../nqg_processed_data/tgt-test.txt
scores:
Bleu_1: 0.42213
Bleu_2: 0.26280
Bleu_3: 0.17740
Bleu_4: 0.12514
METEOR: 0.26111
ROUGE_L: 0.38709
The same metrics computed with nlg-eval are as follows:
nlg-eval --references nqg_processed_data/tgt-test.txt --hypothesis qg.test.output.txt
Bleu_1: 0.368868
Bleu_2: 0.222525
Bleu_3: 0.148107
Bleu_4: 0.103617
METEOR: 0.246009
ROUGE_L: 0.362789
CIDEr: 0.948406
Kindly clarify this.
Hi @deepaknlp,
I think the reason is that our model is cased, so the file qg.test.output.txt is also cased. However, the file nqg_processed_data/tgt-test.txt provided by Du et al. (2017) is uncased. If you feed them directly into the evaluation script, there will be many case mismatches. Could you try converting qg.test.output.txt to lower case and then running the evaluation script?
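If it helps, something like the following snippet would do the lower-casing (file names as in this thread; the output file name is just an example):

# Lower-case the released prediction file before running the evaluation script.
with open("qg.test.output.txt", encoding="utf-8") as f:
    lines = [line.lower() for line in f]
with open("qg.test.output.lower.txt", "w", encoding="utf-8") as f:
    f.writelines(lines)

Then pass qg.test.output.lower.txt as --out_file to eval.py.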
Thank you @donglixp for helping me out. I got improved results by lower-casing the prediction file.
Great to know that your issue has been solved.