
Comments (12)

XingxingZhang commented on July 28, 2024

It looks like you are using single-reference BLEU evaluation. There are 8 references in the WikiLarge test set (available here: https://github.com/cocoxu/simplification).

The code I released can be used to produce the system output of EncDecA, Dress, and Dress-Ls. Please follow the evaluation protocol described in our paper. More suggestions can be found here.


XingxingZhang commented on July 28, 2024

< Why turn off the ignore-case option? I think uppercase vs. lowercase makes no difference for the words in this dataset.

I don't think it matters :)


Sanqiang commented on July 28, 2024

Yes, I also tried the dataset downloaded from https://github.com/cocoxu/simplification/tree/master/data/turkcorpus
But the 8-reference dataset doesn't have the NER replacement applied (e.g., PEOPLE@1, LOCATION@1, that kind of thing), so I cannot use your code directly (I am supposed to feed it test files preprocessed with an NER tool).
I wonder, do you still do the NER replacement for the 8-reference dataset?

Meanwhile, could you show me the command you run for the 8-reference test set? Since there are 8 references for iBLEU, and Wei Xu's paper doesn't say how to handle 8 references, do you take the mean or the max over the 8 per-reference scores?

I downloaded the dataset from https://github.com/cocoxu/simplification/tree/master/data/turkcorpus/truecased (because truecased text works better with NER tools) and used the Stanford NER tool to do the replacement. (I think I am doing this correctly, because I get exactly the same output when I convert wiki.full.aner.ori.test to wiki.full.aner.test.) But since your code doesn't seem to support the 8-reference dataset, I tried my own TensorFlow encoder-decoder model, which follows a setup similar to yours, and still didn't get 88% BLEU. My model does reach comparable performance on the WikiLarge/WikiSmall test sets (the non-8-reference ones) under mteval-v13a.pl, so perhaps I am just using the wrong script.
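To make the replacement step above concrete, here is a minimal sketch (my own illustration, not code from the dress repo) of turning NER-tagged tokens into indexed placeholders such as PEOPLE@1 and LOCATION@1, while recording the map needed to undo the substitution later. It assumes single-token entities already tagged as (token, tag) pairs, e.g. by Stanford NER, and borrows the tag names from this thread's examples.

```python
# Hypothetical illustration: anonymize NER-tagged tokens into TYPE@k placeholders.
# Input: (token, tag) pairs from an NER tagger; "O" marks non-entity tokens.
def anonymize(tagged_tokens):
    out, mapping, counters = [], {}, {}
    for token, tag in tagged_tokens:
        if tag == "O":
            out.append(token)
            continue
        key = (tag, token)
        if key not in mapping:  # first occurrence of this entity gets a fresh index
            counters[tag] = counters.get(tag, 0) + 1
            mapping[key] = f"{tag}@{counters[tag]}"
        out.append(mapping[key])
    # Invert the map for de-anonymization later: placeholder -> original token.
    restore = {ph: tok for (tag, tok), ph in mapping.items()}
    return out, restore

anon, restore = anonymize([("John", "PEOPLE"), ("lives", "O"),
                           ("in", "O"), ("Paris", "LOCATION")])
print(anon)     # ['PEOPLE@1', 'lives', 'in', 'LOCATION@1']
print(restore)  # {'PEOPLE@1': 'John', 'LOCATION@1': 'Paris'}
```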

In addition, are you still working on this task? So far, it is true that seq2seq without RL prefers to copy the complex text rather than simplify it, and I think you can address that by optimizing the evaluation metric (through RL), but I suspect the major reason is the attention mechanism (I am running experiments now to verify this).


XingxingZhang commented on July 28, 2024

"
Yes, I also tried dataset download from https://github.com/cocoxu/simplification/tree/master/data/turkcorpus
But the 8 references dataset doesn't do the NER replacement (e.g. PEOPLE@1, LOCATION@1 that kind of things), so I cannot directly use your code (I am supposed to be put preprocess with NER tools test files as input).
I wonder if you still do the NER replacement for 8 references dataset?
"
The "wiki.full.aner.map.t7" file in "data-simplification/wikilarge" folder contains all you need for NER anonymization/de-anonymization. Note that in test set, I only did NER anonymization for complex sentences and one of the reference sentences. But it doesn't matter since your system output will be de-anonymized anyway.


XingxingZhang commented on July 28, 2024

"
Meanwhile, I wonder could you show me the command you run for 8 references test set (since there are 8 references for iBLEU, the Xu Wei's paper didn't indicate how to do with 8 references, take the mean or max for 8 reference performances?)?
"
BLEU evaluation by default assumes there are multiple references [1][2]. Please refer to the documentation of Joshua or mteval-v13 for how to evaluate BLEU with multiple references.

[1] Papineni et al., "BLEU: a Method for Automatic Evaluation of Machine Translation", ACL 2002.
[2] https://en.wikipedia.org/wiki/BLEU
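For reference, a minimal sketch of multi-reference corpus BLEU in Python using NLTK (my own illustration; the thread itself uses Joshua / mteval-v13a.pl). The point is that n-gram matches are counted against all references of a sentence jointly, so there is no mean or max over per-reference scores:

```python
from nltk.translate.bleu_score import corpus_bleu

# Toy data; real use would load the hypothesis file plus the 8 turkcorpus
# reference files, one token list per reference per sentence.
hypotheses = [["the", "cat", "sat", "on", "the", "mat"]]
references = [[  # one inner list holding ALL references for this hypothesis
    ["the", "cat", "sat", "on", "the", "mat"],
    ["a", "cat", "was", "sitting", "on", "the", "mat"],
]]
print(corpus_bleu(references, hypotheses))  # 1.0: hypothesis matches reference 1 exactly
```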


Sanqiang commented on July 28, 2024

I think I can reproduce results similar to the paper's. This is what I did:
(1) I use scripts/mteval-v13a.pl to evaluate your output against the single ground truth, which gives a BLEU(I, O) of roughly 60%.
(2) I use scripts/multi-bleu.perl to evaluate your output against the 8 references, which gives a BLEU(I, R) of roughly 90%. The original references are all lowercase, so I use the truecased references to make the casing match.
(3) iBLEU = 0.9 * BLEU(I, R) + 0.1 * BLEU(I, O), as in Xu's paper; the result is similar to your paper's.
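For reference, the published definition of iBLEU (Sun and Zhou, 2012, adopted with alpha = 0.9 in Xu et al.) penalizes overlap with the input using a subtraction rather than an addition. A minimal sketch with the rough numbers from this thread plugged in (illustrative only, not real results):

```python
# iBLEU (Sun & Zhou, 2012): reward closeness to the references,
# penalize closeness to the input sentence.
def ibleu(bleu_out_refs, bleu_out_input, alpha=0.9):
    return alpha * bleu_out_refs - (1 - alpha) * bleu_out_input

print(ibleu(0.90, 0.60))  # 0.75, using the approximate scores quoted above
```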

Am I correct, or is there any deviation from what you did?


XingxingZhang commented on July 28, 2024

< Am I correct, or is there any deviation from what you did?
No. I didn't use iBLEU and didn't mention iBLEU anywhere.


XingxingZhang commented on July 28, 2024

Here are the instructions for 8-reference BLEU evaluation on WikiLarge: https://github.com/XingxingZhang/dress/tree/master/experiments/evaluation/BLEU

Good luck!


Sanqiang commented on July 28, 2024

I got it.
So you use the 8-reference BLEU evaluation from https://github.com/XingxingZhang/dress/tree/master/experiments/evaluation/BLEU

I wonder, do you use only the 8 references, or 9 (the 8 references plus the original single ground truth)? (Based on the script you provide, I think you use only the 8 references, but I just want to double-check.)


XingxingZhang commented on July 28, 2024

Did you get the correct BLEU score?
=> BLEU = 0.8885


Sanqiang commented on July 28, 2024

Yes, I get the correct BLEU score. Thank you.


XingxingZhang commented on July 28, 2024

awesome!

