
jhclark / multeval


Easy Bootstrap Resampling and Approximate Randomization for BLEU, METEOR, and TER using Multiple Optimizer Runs. This implements "Better Hypothesis Testing for Statistical Machine Translation: Controlling for Optimizer Instability" from ACL 2011.

Home Page: http://www.cs.cmu.edu/~jhclark

License: Other

Languages: Shell 0.80%, Perl 6.07%, Python 0.28%, Java 14.88%, Groff 77.97%

multeval's People

Contributors

brendano, jhclark, mjdenkowski


multeval's Issues

Exception when running from outside multeval's folder

The latest version of multeval tries to read a file called "./constants" to determine its version number. When multeval is executed from a location other than its own folder, this file does not exist, leading to an exception and the inability to run multeval.
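A possible workaround, sketched below under the assumption that the constants file ships alongside the jar, is to resolve the path relative to the jar's own location rather than the current working directory. The class and method names here are hypothetical illustrations, not multeval's code:

    import java.io.File;
    import java.net.URISyntaxException;

    // Hypothetical sketch (not multeval's actual code): locate the "constants"
    // file next to the running jar so multeval can be launched from any folder.
    public class ConstantsLocator {
        public static File locate() throws URISyntaxException {
            File jar = new File(ConstantsLocator.class.getProtectionDomain()
                    .getCodeSource().getLocation().toURI());
            return new File(jar.getParentFile(), "constants");
        }
    }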

A suggestion on the JBLEU class

The parameter N in the class JBLEU is fixed to 4.

    private static final int N = 4;

Why don't we take the value of N from the constructor, like this?

    private final int N;  // per-instance instead of static, since it is set in the constructor

    public JBLEU(int n_gram) {
        this.N = n_gram;
    }

    public JBLEU(int n_gram, int verbosity) {
        this.N = n_gram;
        this.verbosity = verbosity;
    }

Then, in the method

    public double score(int[] suffStats, double[] allResults)

the line

    double ngramOrderWeight = 0.25;

should be

    double ngramOrderWeight = 1.0 / N;

since standard BLEU weights each of the N n-gram orders uniformly, so the hard-coded 0.25 is only correct when N = 4.

What do you think?
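For context, here is a generic sketch of how that weight enters the BLEU score: a geometric mean of the N n-gram precisions, each weighted 1/N, times the brevity penalty. The variable names (matches, attempts, refLength, hypLength) are assumptions for illustration; this is not JBLEU's actual implementation.

    // Generic BLEU sketch, not JBLEU's code: geometric mean of n-gram
    // precisions, each weighted 1.0 / N, times the brevity penalty.
    // matches[k] and attempts[k] hold the (k+1)-gram match and candidate counts.
    static double bleu(int[] matches, int[] attempts, int refLength, int hypLength, int N) {
        double logPrecisionSum = 0.0;
        for (int k = 0; k < N; k++) {
            double precision = matches[k] / (double) Math.max(1, attempts[k]);
            logPrecisionSum += (1.0 / N) * Math.log(Math.max(precision, 1e-9));
        }
        double brevityPenalty = Math.min(1.0, Math.exp(1.0 - refLength / (double) hypLength));
        return brevityPenalty * Math.exp(logPrecisionSum);
    }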

Non-parallel inputs detected

When I try to calculate a BLEU score using multeval, it throws an error:

Expected 999 references, but got 500
at multeval.HypothesisManager.loadRefs(HypothesisManager.java:66)
at multeval.HypothesisManager.loadData(HypothesisManager.java:50)
at multeval.MultEvalModule.run(MultEvalModule.java:109)
at multeval.MultEval.main(MultEval.java:115)
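The error suggests that the reference and hypothesis files do not all contain the same number of lines. A quick standalone check (not part of multeval; the class name is made up for illustration):

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    // Standalone diagnostic, not part of multeval: print each file's line count
    // so mismatched reference/hypothesis lengths are easy to spot.
    public class LineCounts {
        public static void main(String[] args) throws IOException {
            for (String path : args) {
                long lines = Files.lines(Paths.get(path)).count();
                System.out.println(path + ": " + lines + " lines");
            }
        }
    }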

Can someone explain the meaning of some parameters?

./multeval.sh eval --refs example/refs.test2010.lc.tok.en.*
--hyps-baseline example/hyps.lc.tok.en.baseline.opt*
--meteor.language en

What are 'example/refs.test2010.lc.tok.en.*' and 'hyps.lc.tok.en.baseline.opt*'?

My understanding is that --refs is followed by the original reference data, and --hyps-baseline is followed by the output of the baseline system.
What I don't understand is why the argument after --hyps-baseline is three files in the example.

Problem with --meteor.language ar

I'm having issues running multeval with METEOR for Arabic (--meteor.language ar).
Running METEOR standalone works fine, but running it inside multeval gives:

RESULT: baseline: METEOR: RESAMPLED_MEAN_AVG: 0.000000
RESULT: baseline: METEOR: RESAMPLED_STDDEV_AVG: 0.000000
RESULT: baseline: METEOR: RESAMPLED_MIN: 0.000000
RESULT: baseline: METEOR: RESAMPLED_MAX: 0.000000

What am I doing wrong?

BTW: I'm running multeval-0.5.1 with meteor-1.4

Some short options are broken

When executing ./multeval.sh eval -R example/refs.test2010.lc.tok.en.0 --hyps-baseline example/hyps.lc.tok.en.baseline.opt0 --metrics ter I get the following error:

Failed to specify required options: [refs]

But when I change -R to --refs, everything works as expected.
Other flags such as -t or -d are also affected: they do not change the behavior of multeval.
I do not know whether every short flag is broken.

different p-values compared to other tools

Hi,

I typically run multeval for BLEU and TER but haven't assessed statistical significance so far. Now that I actually need it, I find it (1) difficult to grasp what exactly multeval computes (I checked issue #8, which clarifies somewhat what is going on) and (2) difficult to run it 'correctly'.

Regarding (1): according to Koehn's paper (https://www.aclweb.org/anthology/W04-3250), I would assume you take different samples from sys1 and sys2, score them w.r.t. the reference, and assess the differences. If in 95% of the cases the scores differ in favour of one of the systems, then the difference is statistically significant. Or am I getting this wrong? Furthermore, I compared multeval to mteval with the same number of samples and shuffles, and the scores are completely different.

Regarding (2): maybe this all comes from me not running multeval correctly. I have one reference and the output of two MT systems. Since multeval doesn't like it when there is only one variant for system 1 and the baseline, I use copies; e.g. for system 1 I use sys1.test.out and sys1.test.out.copy (and they are identical). Is this a good way to invoke multeval?
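For reference, below is a rough sketch of paired approximate randomization, the significance test named in the multeval paper. It is a simplified illustration under assumptions (per-sentence scores that can simply be averaged; the class and method names are made up), not multeval's implementation, which works with sufficient statistics and multiple optimizer runs:

    import java.util.Random;

    // Rough sketch of paired approximate randomization; NOT multeval's code.
    // scoresA[i] and scoresB[i] are per-sentence metric values for the two
    // systems on sentence i (a simplification of shuffling sufficient stats).
    public class ApproxRandSketch {
        static double pValue(double[] scoresA, double[] scoresB, int shuffles, Random rng) {
            double observed = Math.abs(mean(scoresA) - mean(scoresB));
            int atLeastAsExtreme = 0;
            for (int s = 0; s < shuffles; s++) {
                double sumA = 0.0, sumB = 0.0;
                for (int i = 0; i < scoresA.length; i++) {
                    boolean swap = rng.nextBoolean(); // randomly exchange the paired outputs
                    sumA += swap ? scoresB[i] : scoresA[i];
                    sumB += swap ? scoresA[i] : scoresB[i];
                }
                double diff = Math.abs(sumA - sumB) / scoresA.length;
                if (diff >= observed) atLeastAsExtreme++;
            }
            return (atLeastAsExtreme + 1.0) / (shuffles + 1.0); // add-one smoothed p-value
        }

        static double mean(double[] xs) {
            double sum = 0.0;
            for (double x : xs) sum += x;
            return sum / xs.length;
        }
    }

The intuition: if the two systems are effectively interchangeable, randomly swapping their paired outputs will often produce a score gap at least as large as the observed one, yielding a large p-value; a small p-value indicates a difference unlikely to arise by chance.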

Thanks.
Cheers,
Dimtiar

Old release 0.4.2 does not build

Error output resulting from ant when building old release 0.4.2:

[javac]                     if (!metric.isThreadsafe()) {
[javac]                                ^
[javac]   symbol:   method isThreadsafe()
[javac]   location: variable metric of type Metric<?>
[javac] 1 error

Build succeeds after uncommenting the following line in src/multeval/metrics/Metric.java:

//public abstract boolean isThreadsafe();

confusing system naming

The example from the README:
./multeval.sh eval --refs example/refs.test2010.lc.tok.en.* --hyps-baseline example/hyps.lc.tok.en.baseline.opt* --hyps-sys1 example/hyps.lc.tok.en.sys1.opt* --hyps-sys2 example/hyps.lc.tok.en.sys2.opt* --meteor.language en --metrics bleu


Reading Hypotheses for system 1 opt run 1 file /home/den/development/multeval/example/hyps.lc.tok.en.sys2.opt0
Reading Hypotheses for system 1 opt run 2 file /home/den/development/multeval/example/hyps.lc.tok.en.sys2.opt1
Reading Hypotheses for system 1 opt run 3 file /home/den/development/multeval/example/hyps.lc.tok.en.sys2.opt2
Reading Hypotheses for system 2 opt run 1 file /home/den/development/multeval/example/hyps.lc.tok.en.sys1.opt0
Reading Hypotheses for system 2 opt run 2 file /home/den/development/multeval/example/hyps.lc.tok.en.sys1.opt1
Reading Hypotheses for system 2 opt run 3 file /home/den/development/multeval/example/hyps.lc.tok.en.sys1.opt2

Note that sys1 becomes "system 2" and sys2 becomes "system 1".
If one were to overlook this bit of output and naively assume that system 1 means sys1 (as I had), one would be in for a big surprise.

On top of that, the systems are named "system 2" and "system 3" here:
Performing approximate randomization to estimate p-value between baseline system and system 2 (of 3)
Performing approximate randomization to estimate p-value between baseline system and system 3 (of 3)

Might I suggest adding parameters --sys1-name and --sys2-name and using these names everywhere in the output to avoid any confusion?

Comparing two systems with one hypothesis set each

I would expect the following command line to compare two systems based on only one output each, but instead it crashes:

multeval-0.4$ ./multeval.sh eval --refs example/refs.test2010.lc.tok.en.0 --hyps-baseline example/hyps.lc.tok.en.baseline.opt0 --hyps-sys1 example/hyps.lc.tok.en.sys1.opt0 --metrics bleu
Found existing METEOR installation at ./lib/meteor-1.3
Loading metric: bleu
Found library jBLEU at file:/home/denero/third_party/multeval-0.4/multeval-0.4.jar
Using 6 threads
Reading Hypotheses for system baseline opt run 1 file /home/denero/third_party/multeval-0.4/example/hyps.lc.tok.en.baseline.opt0
Reading Hypotheses for system 1 opt run 1 file /home/denero/third_party/multeval-0.4/example/hyps.lc.tok.en.sys1.opt0
Reading non-laced references file /home/denero/third_party/multeval-0.4/example/refs.test2010.lc.tok.en.0
Collecting sufficient statistics for metric: BLEU
java.lang.IndexOutOfBoundsException: Index: 1, Size: 1
at java.util.ArrayList.rangeCheck(ArrayList.java:571)
at java.util.ArrayList.get(ArrayList.java:349)
at multeval.SuffStatManager.saveStats(SuffStatManager.java:50)
at multeval.MultEvalModule$2.doWork(MultEvalModule.java:287)
at multeval.MultEvalModule$2.doWork(MultEvalModule.java:1)
at multeval.parallel.HypothesisLevelMetricWorkerPool$1.run(HypothesisLevelMetricWorkerPool.java:37)
(the same IndexOutOfBoundsException and stack trace repeat several more times)

Thanks for the help!

wrong p-values

Since I cannot re-open #3, I'm creating a new issue because the problem is far from resolved.

Let me re-state the problem.

  1. In the supplied example the hypotheses of the baseline system and system 2 are absolutely identical:

$ md5sum example/hyps.lc.tok.en.{baseline,sys2}*
c6786428423a9623bebe07186c627d85 example/hyps.lc.tok.en.baseline.opt0
2642687137a847c4a73c21a7f894a892 example/hyps.lc.tok.en.baseline.opt1
b7f58d941800959fea30a89a8d4ecc6e example/hyps.lc.tok.en.baseline.opt2
c6786428423a9623bebe07186c627d85 example/hyps.lc.tok.en.sys2.opt0
2642687137a847c4a73c21a7f894a892 example/hyps.lc.tok.en.sys2.opt1
b7f58d941800959fea30a89a8d4ecc6e example/hyps.lc.tok.en.sys2.opt2

  2. The p-value reported for system 2 vs. the baseline system is very low, meaning that system 2 is considered highly different from the baseline (although we know they are identical):

Performing approximate randomization to estimate p-value between baseline system and system 2 (of 3)
RESULT: system 1: BLEU: P_VALUE: 0.000300
Performing approximate randomization to estimate p-value between baseline system and system 3 (of 3)
RESULT: system 2: BLEU: P_VALUE: 0.000100
n=3 BLEU (s_sel/s_opt/p)
baseline 18.5 (0.3/0.1/-)
system 1 18.8 (0.3/0.3/0.00)
system 2 18.5 (0.3/0.1/0.00)

If a test really does find two identical systems highly different, then I'm sorry to say it, but such a test is useless. I do hope, though, that it's simply a bug.
Again, I'm using the example data bundled with this tool, so you should be able to observe the same numbers.

odd p-values reported

Below are the p-values reported on the bundled example data.
The p-values seem oddly low given the small differences in BLEU (especially for system 1, whose scores are all identical to the baseline's even though the hypotheses differ -- an oddity in its own right).


Performing approximate randomization to estimate p-value between baseline system and system 2 (of 3)
GOT RESULT: system 1: BLEU: P_VALUE: 0.000100
GOT RESULT: system 1: METEOR: P_VALUE: 0.000100
GOT RESULT: system 1: TER: P_VALUE: 0.000100
GOT RESULT: system 1: Length: P_VALUE: 0.000100
Performing approximate randomization to estimate p-value between baseline system and system 3 (of 3)
GOT RESULT: system 2: BLEU: P_VALUE: 0.000300
GOT RESULT: system 2: METEOR: P_VALUE: 0.000100
GOT RESULT: system 2: TER: P_VALUE: 0.000100
GOT RESULT: system 2: Length: P_VALUE: 0.093091

NoClassDefFoundError: Lter/TERcost

Trying to run multeval.sh, I get

NoClassDefFoundError: Lter/TERcost
...
Caused by: java.lang.ClassNotFoundException: ter.TERcost

I had to put all TER classes in a package (add "package ter;" to each .java source and recompile) in order to run multeval.
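For illustration, the workaround amounts to adding a package declaration at the top of each TER source file before recompiling (TERcost shown as an example; the class body is just a placeholder here):

    // Workaround sketch: give each TER .java file a package declaration so the
    // compiled classes resolve as ter.* (e.g. ter.TERcost) on multeval's classpath.
    package ter;

    public class TERcost {
        // ... original TER implementation left unchanged ...
    }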
