
jhclark / multeval


Easy Bootstrap Resampling and Approximate Randomization for BLEU, METEOR, and TER using Multiple Optimizer Runs. This implements "Better Hypothesis Testing for Statistical Machine Translation: Controlling for Optimizer Instability" from ACL 2011.

Home Page: http://www.cs.cmu.edu/~jhclark

License: Other

Languages: Shell 0.80%, Perl 6.07%, Python 0.28%, Java 14.88%, Groff 77.97%

multeval's People

Contributors

brendano, jhclark, mjdenkowski


multeval's Issues

Exception when running from outside multeval's folder

The latest version of multeval tries to read a file called "./constants" to determine its version number. When multeval is executed from a location other than its own folder, this file does not exist, leading to an exception and the inability to run multeval.
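A possible workaround, sketched below under the assumption that the constants file ships alongside the jar, is to resolve the path relative to the jar's own location rather than the current working directory. The class and method names here are hypothetical illustrations, not multeval's code:

    import java.io.File;
    import java.net.URISyntaxException;

    // Hypothetical sketch (not multeval's actual code): locate the "constants"
    // file next to the running jar so multeval can be launched from any folder.
    public class ConstantsLocator {
        public static File locate() throws URISyntaxException {
            File jar = new File(ConstantsLocator.class.getProtectionDomain()
                    .getCodeSource().getLocation().toURI());
            return new File(jar.getParentFile(), "constants");
        }
    }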

A suggestion on the JBLEU class

The parameter N in the class JBLEU is fixed to 4.

    private static final int N = 4;

Why don't we take the value of N from the constructor, like this?

    private final int N;  // per-instance instead of static, since it is set in the constructor

    public JBLEU(int n_gram) {
        this.N = n_gram;
    }

    public JBLEU(int n_gram, int verbosity) {
        this.N = n_gram;
        this.verbosity = verbosity;
    }

Then, in the method

    public double score(int[] suffStats, double[] allResults)

the line

    double ngramOrderWeight = 0.25;

should be

    double ngramOrderWeight = 1.0 / N;

since standard BLEU weights each of the N n-gram orders uniformly, so the hard-coded 0.25 is only correct when N = 4.

What do you think?
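For context, here is a generic sketch of how that weight enters the BLEU score: a geometric mean of the N n-gram precisions, each weighted 1/N, times the brevity penalty. The variable names (matches, attempts, refLength, hypLength) are assumptions for illustration; this is not JBLEU's actual implementation.

    // Generic BLEU sketch, not JBLEU's code: geometric mean of n-gram
    // precisions, each weighted 1.0 / N, times the brevity penalty.
    // matches[k] and attempts[k] hold the (k+1)-gram match and candidate counts.
    static double bleu(int[] matches, int[] attempts, int refLength, int hypLength, int N) {
        double logPrecisionSum = 0.0;
        for (int k = 0; k < N; k++) {
            double precision = matches[k] / (double) Math.max(1, attempts[k]);
            logPrecisionSum += (1.0 / N) * Math.log(Math.max(precision, 1e-9));
        }
        double brevityPenalty = Math.min(1.0, Math.exp(1.0 - refLength / (double) hypLength));
        return brevityPenalty * Math.exp(logPrecisionSum);
    }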

Non-parallel inputs detected

When I try to calculate a BLEU score using multeval, it throws an error:

Expected 999 references, but got 500
at multeval.HypothesisManager.loadRefs(HypothesisManager.java:66)
at multeval.HypothesisManager.loadData(HypothesisManager.java:50)
at multeval.MultEvalModule.run(MultEvalModule.java:109)
at multeval.MultEval.main(MultEval.java:115)
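The error suggests that the reference and hypothesis files do not all contain the same number of lines. A quick standalone check (not part of multeval; the class name is made up for illustration):

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    // Standalone diagnostic, not part of multeval: print each file's line count
    // so mismatched reference/hypothesis lengths are easy to spot.
    public class LineCounts {
        public static void main(String[] args) throws IOException {
            for (String path : args) {
                long lines = Files.lines(Paths.get(path)).count();
                System.out.println(path + ": " + lines + " lines");
            }
        }
    }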

Can someone explain the meaning of some parameters?

./multeval.sh eval --refs example/refs.test2010.lc.tok.en.*
--hyps-baseline example/hyps.lc.tok.en.baseline.opt*
--meteor.language en

What are 'example/refs.test2010.lc.tok.en.*' and 'hyps.lc.tok.en.baseline.opt*'?

My understanding is that --refs is followed by the original reference data, and --hyps-baseline is followed by the output of the baseline system.
What I don't understand is why the argument after --hyps-baseline is three files in the example.

Problem with --meteor.language ar

I'm having issues running multeval with METEOR for Arabic (--meteor.language ar).
Running METEOR standalone works fine, but running it inside multeval gives:

RESULT: baseline: METEOR: RESAMPLED_MEAN_AVG: 0.000000
RESULT: baseline: METEOR: RESAMPLED_STDDEV_AVG: 0.000000
RESULT: baseline: METEOR: RESAMPLED_MIN: 0.000000
RESULT: baseline: METEOR: RESAMPLED_MAX: 0.000000

What am I doing wrong?

BTW: I'm running multeval-0.5.1 with meteor-1.4

Some short options are broken

When executing ./multeval.sh eval -R example/refs.test2010.lc.tok.en.0 --hyps-baseline example/hyps.lc.tok.en.baseline.opt0 --metrics ter I get the following error:

Failed to specify required options: [refs]

But when I change -R to --refs, everything works as expected.
Other flags such as -t or -d are also affected: they do not change the behavior of multeval.
I do not know whether every short flag is broken.

different p-values compared to other tools

Hi,

I typically run multeval for BLEU and TER but haven't assessed statistical significance so far. Now that I actually need it, I find it (1) difficult to grasp what exactly multeval computes (I checked issue #8, which clarifies somewhat what is going on) and (2) difficult to run it 'correctly'.

Regarding (1): according to Koehn's paper (https://www.aclweb.org/anthology/W04-3250), I would assume you take different samples from sys1 and sys2, score them w.r.t. the reference, and assess the differences. If in 95% of the cases the scores differ in favour of one of the systems, then the difference is statistically significant. Or am I getting this wrong? Furthermore, I compared multeval to mteval with the same number of samples and shuffles, and the scores are completely different.

Regarding (2): maybe this all comes from me not running multeval correctly. I have one reference and the output of two MT systems. Since multeval doesn't like it when there is only one variant for system 1 and the baseline, I use copies; e.g. for system 1 I use sys1.test.out and sys1.test.out.copy (and they are identical). Is this a good way to invoke multeval?
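For reference, below is a rough sketch of paired approximate randomization, the significance test named in the multeval paper. It is a simplified illustration under assumptions (per-sentence scores that can simply be averaged; the class and method names are made up), not multeval's implementation, which works with sufficient statistics and multiple optimizer runs:

    import java.util.Random;

    // Rough sketch of paired approximate randomization; NOT multeval's code.
    // scoresA[i] and scoresB[i] are per-sentence metric values for the two
    // systems on sentence i (a simplification of shuffling sufficient stats).
    public class ApproxRandSketch {
        static double pValue(double[] scoresA, double[] scoresB, int shuffles, Random rng) {
            double observed = Math.abs(mean(scoresA) - mean(scoresB));
            int atLeastAsExtreme = 0;
            for (int s = 0; s < shuffles; s++) {
                double sumA = 0.0, sumB = 0.0;
                for (int i = 0; i < scoresA.length; i++) {
                    boolean swap = rng.nextBoolean(); // randomly exchange the paired outputs
                    sumA += swap ? scoresB[i] : scoresA[i];
                    sumB += swap ? scoresA[i] : scoresB[i];
                }
                double diff = Math.abs(sumA - sumB) / scoresA.length;
                if (diff >= observed) atLeastAsExtreme++;
            }
            return (atLeastAsExtreme + 1.0) / (shuffles + 1.0); // add-one smoothed p-value
        }

        static double mean(double[] xs) {
            double sum = 0.0;
            for (double x : xs) sum += x;
            return sum / xs.length;
        }
    }

The intuition: if the two systems are effectively interchangeable, randomly swapping their paired outputs will often produce a score gap at least as large as the observed one, yielding a large p-value; a small p-value indicates a difference unlikely to arise by chance.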

Thanks.
Cheers,
Dimtiar

Old release 0.4.2 does not build

Error output resulting from ant when building old release 0.4.2:

[javac]                     if (!metric.isThreadsafe()) {
[javac]                                ^
[javac]   symbol:   method isThreadsafe()
[javac]   location: variable metric of type Metric<?>
[javac] 1 error

Build succeeds after uncommenting the following line in src/multeval/metrics/Metric.java:

//public abstract boolean isThreadsafe();

confusing system naming

The example from the README:
./multeval.sh eval --refs example/refs.test2010.lc.tok.en.* --hyps-baseline example/hyps.lc.tok.en.baseline.opt* --hyps-sys1 example/hyps.lc.tok.en.sys1.opt* --hyps-sys2 example/hyps.lc.tok.en.sys2.opt* --meteor.language en --metrics bleu


Reading Hypotheses for system 1 opt run 1 file /home/den/development/multeval/example/hyps.lc.tok.en.sys2.opt0
Reading Hypotheses for system 1 opt run 2 file /home/den/development/multeval/example/hyps.lc.tok.en.sys2.opt1
Reading Hypotheses for system 1 opt run 3 file /home/den/development/multeval/example/hyps.lc.tok.en.sys2.opt2
Reading Hypotheses for system 2 opt run 1 file /home/den/development/multeval/example/hyps.lc.tok.en.sys1.opt0
Reading Hypotheses for system 2 opt run 2 file /home/den/development/multeval/example/hyps.lc.tok.en.sys1.opt1
Reading Hypotheses for system 2 opt run 3 file /home/den/development/multeval/example/hyps.lc.tok.en.sys1.opt2

Note that sys1 becomes "system 2" and sys2 becomes "system 1".
If one were to overlook this bit of output and naively assume that system 1 means sys1 (as I had), one would be in for a big surprise.

On top of that, the systems are named "system 2" and "system 3" here:
Performing approximate randomization to estimate p-value between baseline system and system 2 (of 3)
Performing approximate randomization to estimate p-value between baseline system and system 3 (of 3)

Might I suggest adding parameters --sys1-name and --sys2-name and using these names everywhere in the output to avoid any confusion?

Comparing two systems with one hypothesis set each

I would expect the following command line to compare two systems based on only one output each, but instead it crashes:

multeval-0.4$ ./multeval.sh eval --refs example/refs.test2010.lc.tok.en.0 --hyps-baseline example/hyps.lc.tok.en.baseline.opt0 --hyps-sys1 example/hyps.lc.tok.en.sys1.opt0 --metrics bleu
Found existing METEOR installation at ./lib/meteor-1.3
Loading metric: bleu
Found library jBLEU at file:/home/denero/third_party/multeval-0.4/multeval-0.4.jar
Using 6 threads
Reading Hypotheses for system baseline opt run 1 file /home/denero/third_party/multeval-0.4/example/hyps.lc.tok.en.baseline.opt0
Reading Hypotheses for system 1 opt run 1 file /home/denero/third_party/multeval-0.4/example/hyps.lc.tok.en.sys1.opt0
Reading non-laced references file /home/denero/third_party/multeval-0.4/example/refs.test2010.lc.tok.en.0
Collecting sufficient statistics for metric: BLEU
java.lang.IndexOutOfBoundsException: Index: 1, Size: 1
at java.util.ArrayList.rangeCheck(ArrayList.java:571)
at java.util.ArrayList.get(ArrayList.java:349)
at multeval.SuffStatManager.saveStats(SuffStatManager.java:50)
at multeval.MultEvalModule$2.doWork(MultEvalModule.java:287)
at multeval.MultEvalModule$2.doWork(MultEvalModule.java:1)
at multeval.parallel.HypothesisLevelMetricWorkerPool$1.run(HypothesisLevelMetricWorkerPool.java:37)
(the same IndexOutOfBoundsException and stack trace repeat several more times)

Thanks for the help!

wrong p-values

Since I cannot re-open #3, I'm creating a new issue because the problem is far from resolved.

Let me re-state the problem.

  1. In the supplied example the hypotheses of the baseline system and system 2 are absolutely identical:

$ md5sum example/hyps.lc.tok.en.{baseline,sys2}*
c6786428423a9623bebe07186c627d85 example/hyps.lc.tok.en.baseline.opt0
2642687137a847c4a73c21a7f894a892 example/hyps.lc.tok.en.baseline.opt1
b7f58d941800959fea30a89a8d4ecc6e example/hyps.lc.tok.en.baseline.opt2
c6786428423a9623bebe07186c627d85 example/hyps.lc.tok.en.sys2.opt0
2642687137a847c4a73c21a7f894a892 example/hyps.lc.tok.en.sys2.opt1
b7f58d941800959fea30a89a8d4ecc6e example/hyps.lc.tok.en.sys2.opt2

  2. The p-value reported for system 2 vs. the baseline system is very low, meaning that system 2 is considered highly different from the baseline (although we know they are identical):

Performing approximate randomization to estimate p-value between baseline system and system 2 (of 3)
RESULT: system 1: BLEU: P_VALUE: 0.000300
Performing approximate randomization to estimate p-value between baseline system and system 3 (of 3)
RESULT: system 2: BLEU: P_VALUE: 0.000100
n=3 BLEU (s_sel/s_opt/p)
baseline 18.5 (0.3/0.1/-)
system 1 18.8 (0.3/0.3/0.00)
system 2 18.5 (0.3/0.1/0.00)

If a test really does find two identical systems highly different, then I'm sorry to say it, but such a test is useless. I do hope, though, that it's simply a bug.
Again, I'm using the example data bundled with this tool, so you should be able to observe the same numbers.

odd p-values reported

Below are the p-values reported on the bundled example data.
The p-values seem oddly low given the small differences in BLEU (especially for system 1, whose scores are all identical to the baseline's even though the hypotheses differ -- an oddity in its own right).


Performing approximate randomization to estimate p-value between baseline system and system 2 (of 3)
GOT RESULT: system 1: BLEU: P_VALUE: 0.000100
GOT RESULT: system 1: METEOR: P_VALUE: 0.000100
GOT RESULT: system 1: TER: P_VALUE: 0.000100
GOT RESULT: system 1: Length: P_VALUE: 0.000100
Performing approximate randomization to estimate p-value between baseline system and system 3 (of 3)
GOT RESULT: system 2: BLEU: P_VALUE: 0.000300
GOT RESULT: system 2: METEOR: P_VALUE: 0.000100
GOT RESULT: system 2: TER: P_VALUE: 0.000100
GOT RESULT: system 2: Length: P_VALUE: 0.093091

NoClassDefFoundError: Lter/TERcost

Trying to run multeval.sh, I get

NoClassDefFoundError: Lter/TERcost
...
Caused by: java.lang.ClassNotFoundException: ter.TERcost

I had to put all TER classes in a package (add "package ter;" to each .java source and recompile) in order to run multeval.
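For illustration, the workaround amounts to adding a package declaration at the top of each TER source file before recompiling (TERcost shown as an example; the class body is just a placeholder here):

    // Workaround sketch: give each TER .java file a package declaration so the
    // compiled classes resolve as ter.* (e.g. ter.TERcost) on multeval's classpath.
    package ter;

    public class TERcost {
        // ... original TER implementation left unchanged ...
    }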
