
Comments (5)

wenhui0924 commented on May 13, 2024

We directly use the same evaluation scripts as Du et al. (2017) to make the comparison fair. We suggest that you also follow these evaluation scripts so that your results can be compared with other research works.

We only conducted the necessary post-processing to detokenize the subword units into words. In our model, text is tokenized into subword units by WordPiece (Wu et al., 2016). For instance, the word "forecasted" is split into "forecast" and "##ed", where "##" indicates that a piece belongs to the same word as the piece before it. So before evaluation, we first need to detokenize. In the script qg/eval_on_unilm_tokenized_ref.py, we merge the pieces of each word (for instance, "forecast" and "##ed" are merged back into "forecasted").
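For reference, the merging step looks roughly like the following. This is only a minimal sketch of the idea; the name detokenize_wordpiece is illustrative and this is not the exact code in qg/eval_on_unilm_tokenized_ref.py:

def detokenize_wordpiece(tokens):
    """Merge WordPiece subword units back into whole words.

    A piece starting with "##" is appended to the previous token,
    e.g. ["forecast", "##ed"] -> ["forecasted"].
    """
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            words[-1] += token[2:]
        else:
            words.append(token)
    return words

print(detokenize_wordpiece("what was forecast ##ed for tuesday ?".split()))
# ['what', 'was', 'forecasted', 'for', 'tuesday', '?']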

Notice that we merge the pieces of each word in the released output to make the generated questions readable, but we do not merge the pieces in the gold tokenized question file (test.q.tok.txt). If you would like to use your own evaluation package (such as nlg-eval), the detokenization step is required; you could use the detokenize function in qg/eval_on_unilm_tokenized_ref.py for this. Otherwise, there would be a tokenization mismatch between the hypotheses and the references, which leads to incorrect evaluation results.
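If you do use nlg-eval, the flow would look roughly like the following. This is only a sketch assuming nlg-eval's file-based compute_metrics API; the output file name test.q.detok.txt is just a placeholder:

from nlgeval import compute_metrics

# Detokenize the gold tokenized reference file so that it matches the
# word-level tokenization of the released output file.
with open("test.q.tok.txt") as f_in, open("test.q.detok.txt", "w") as f_out:
    for line in f_in:
        words = []
        for tok in line.split():
            if tok.startswith("##") and words:
                words[-1] += tok[2:]  # merge the piece into the previous word
            else:
                words.append(tok)
        f_out.write(" ".join(words) + "\n")

# Score the released output against the detokenized references.
metrics = compute_metrics(hypothesis="qg.test.output.txt",
                          references=["test.q.detok.txt"])
print(metrics)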

In addition, if you did not follow the evaluation scripts of Du et al. (2017), the results might differ slightly, since Du et al. (2017) consider multiple references (if any).

If you are still unsure, you could share your evaluation command with us.

deepaknlp commented on May 13, 2024

Thanks for the detailed clarification.
However, if I use the gold test questions (nqg_processed_data/tgt-test.txt) provided by Du et al. (2017) together with your released output (qg.test.output.txt), there should be no need to detokenize them.

The results using the Du et al. (2017) script are as follows:

python2 eval.py --out_file ../../qg.test.output.txt --src_file ../../nqg_processed_data/src-test.txt --tgt_file ../../nqg_processed_data/tgt-test.txt 
scores: 

Bleu_1: 0.42213
Bleu_2: 0.26280
Bleu_3: 0.17740
Bleu_4: 0.12514
METEOR: 0.26111
ROUGE_L: 0.38709

The results with nlg-eval are as follows:

nlg-eval --references nqg_processed_data/tgt-test.txt --hypothesis qg.test.output.txt 
Bleu_1: 0.368868
Bleu_2: 0.222525
Bleu_3: 0.148107
Bleu_4: 0.103617
METEOR: 0.246009
ROUGE_L: 0.362789
CIDEr: 0.948406

Kindly clarify this.

donglixp commented on May 13, 2024

Hi @deepaknlp ,

I think the reason is that our model is a cased model, so the file qg.test.output.txt is also cased, whereas the file nqg_processed_data/tgt-test.txt provided by Du et al. (2017) is uncased. If you feed them into the evaluation script directly, there will be many case mismatches. Could you try converting qg.test.output.txt to lower case and then running the evaluation script again?
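For example, lower-casing the prediction file takes only a couple of lines; the output file name below is just an example:

# Write a lower-cased copy of the released output for evaluation.
with open("qg.test.output.txt") as f_in, open("qg.test.output.lower.txt", "w") as f_out:
    for line in f_in:
        f_out.write(line.lower())

You could then pass qg.test.output.lower.txt as --out_file (or --hypothesis) in the commands above.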

deepaknlp commented on May 13, 2024

Thank you @donglixp for helping me out. I got improved results by lower-casing the prediction file.

donglixp commented on May 13, 2024

Great to know that your issue has been solved.
