Comments (5)
We directly use the evaluation scripts of Du et al. (2017) to make the comparison fair. We suggest you also follow these evaluation scripts so that your results can be compared with other research works.
We only conducted the necessary post-processing to detokenize the subword units into words. In our model, texts are tokenized into subword units by WordPiece (Wu et al., 2016). For instance, the word "forecasted" is split into "forecast" and "##ed", where "##" indicates that the piece belongs to the preceding word. So before evaluation, we first need to detokenize. In the script qg/eval_on_unilm_tokenized_ref.py, we merge the pieces of each word (for instance, "forecast" and "##ed" are merged back into "forecasted").
Notice that we merge the pieces of each word in the released output to make the generated questions readable, but we do not merge the pieces in the gold tokenized question file (test.q.tok.txt). If you would like to use your own evaluation package (such as nlg-eval), the detokenization step is required. You could use the detokenize function in qg/eval_on_unilm_tokenized_ref.py to conduct the detokenization. Otherwise, there would be a tokenization mismatch, which results in incorrect evaluation scores.
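For illustration, here is a minimal sketch of the idea behind that detokenization step (not the exact implementation in qg/eval_on_unilm_tokenized_ref.py): any token starting with "##" is glued onto the previous token.

def detokenize(tokens):
    # Merge WordPiece subword units back into whole words.
    words = []
    for tok in tokens:
        if tok.startswith("##") and words:
            words[-1] += tok[2:]  # strip the "##" marker and append to the previous piece
        else:
            words.append(tok)
    return words

# Example: ["forecast", "##ed"] is merged back into ["forecasted"]
print(detokenize("the storm was forecast ##ed".split()))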
In addition, if you do not follow the evaluation scripts of Du et al. (2017), the results might differ slightly, because Du et al. (2017) consider multiple references (if any).
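As a rough sketch of why multiple references matter (this is not the Du et al. script, just an illustration with made-up sentences, using nltk):

from nltk.translate.bleu_score import corpus_bleu

# One generated question, scored against all gold questions for the same answer span.
hypotheses = ["what did the storm destroy ?".split()]
references = [[
    "what did the storm destroy ?".split(),
    "what was destroyed by the storm ?".split(),
]]

# With multiple references, clipped n-gram counts are taken against all of them,
# so extra references can only raise (or keep) the score compared to a single reference.
print(corpus_bleu(references, hypotheses))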
You could provide us with your evaluation command if you are still confused.
Thanks for the detailed clarification.
However, if I use the gold test questions (nqg_processed_data/tgt-test.txt) provided by Du et al. (2017) and your released test output (qg.test.output.txt), then there should be no need to detokenize them.
The results using the Du et al. (2017) script are as follows:
python2 eval.py --out_file ../../qg.test.output.txt --src_file ../../nqg_processed_data/src-test.txt --tgt_file ../../nqg_processed_data/tgt-test.txt
scores:
Bleu_1: 0.42213
Bleu_2: 0.26280
Bleu_3: 0.17740
Bleu_4: 0.12514
METEOR: 0.26111
ROUGE_L: 0.38709
The same metrics computed with nlg-eval are as follows:
nlg-eval --references nqg_processed_data/tgt-test.txt --hypothesis qg.test.output.txt
Bleu_1: 0.368868
Bleu_2: 0.222525
Bleu_3: 0.148107
Bleu_4: 0.103617
METEOR: 0.246009
ROUGE_L: 0.362789
CIDEr: 0.948406
Kindly clarify this.
Hi @deepaknlp,
I think the reason is that our model is cased, so the file qg.test.output.txt is also cased. However, the file nqg_processed_data/tgt-test.txt provided by Du et al. (2017) is uncased. If you feed them directly into the evaluation script, there will be many case mismatches. Could you try converting qg.test.output.txt to lower case and then running the evaluation script?
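If it helps, something like the following snippet would do the lower-casing (file names as in this thread; the output file name is just an example):

# Lower-case the released prediction file before running the evaluation script.
with open("qg.test.output.txt", encoding="utf-8") as f:
    lines = [line.lower() for line in f]
with open("qg.test.output.lower.txt", "w", encoding="utf-8") as f:
    f.writelines(lines)

Then pass qg.test.output.lower.txt as --out_file to eval.py.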
Thank you @donglixp for helping me out. I got improved results by lower-casing the prediction file.
Great to know that your issue has been solved.