neulab / compare-mt


A tool for holistic analysis of language generation systems

License: BSD 3-Clause "New" or "Revised" License

Python 99.59% Shell 0.41%

compare-mt's Introduction

compare-mt

by NeuLab @ CMU LTI, and other contributors

compare-mt (for "compare my text") is a program to compare the output of multiple systems for language generation, including machine translation, summarization, dialog response generation, etc. To use it, you need a "correct" reference and the output of two different systems, all in text format. Based on this, compare-mt will run a number of analyses that attempt to pick out salient differences between the systems, making it easier to figure out what one system is doing better than the other.

Basic Usage

First, you need to install the package:

# Requirements
pip install -r requirements.txt
# Install the package
python setup.py install

Then, as an example, you can run this over two included system outputs.

compare-mt --output_directory output/ example/ted.ref.eng example/ted.sys1.eng example/ted.sys2.eng

This will print some statistics to the command line and also write a formatted HTML report to output/. Here, system 1 and system 2 are the baseline phrase-based and neural Slovak-English systems from our EMNLP 2018 paper. The output covers a number of analyses, including:

  • Aggregate Scores: A report on overall BLEU scores and length ratios
  • Word Accuracy Analysis: A report on the F-measure of words by frequency bucket
  • Sentence Bucket Analysis: Bucket sentences by various statistics (e.g. sentence BLEU, length difference with the reference, overall length), and calculate statistics by bucket (e.g. number of sentences, BLEU score per bucket)
  • N-gram Difference Analysis: Calculate which n-grams one system is consistently translating better
  • Sentence Examples: Find sentences where one system is doing better than the other according to sentence BLEU

You can see an example of running this analysis (as well as the more advanced analysis below) either through the generated HTML report or the narrated video linked from the project README.

To summarize the results that immediately stick out from the basic analysis:

  • From the aggregate scores we can see that the BLEU of neural MT is higher, but its sentences are slightly shorter.
  • From the word accuracy analysis we can see that phrase-based MT is better at low-frequency words.
  • From the sentence bucket analysis we can see that neural seems to be better at translating shorter sentences.
  • From the n-gram difference analysis we can see that there are a few words that neural MT is not good at but phrase-based MT gets right (e.g. "phantom"), while there are a few long phrases that neural MT does better with (e.g. "going to show you").

If you run on your own data, you might be able to find more interesting things about your own systems. Try comparing your modified system with your baseline and seeing what you find!

Other Options

There are many options that can be used to do different types of analysis. If you want to find all the different types of analysis supported, the most comprehensive way to do so is by taking a look at the main compare-mt script, which is documented relatively well and gives examples. We highlight a few particularly useful and common types of analysis below:

Significance Tests

The script allows you to perform statistical significance tests for scores based on bootstrap resampling. You can set the number of samples manually. Here is an example using the example data:

compare-mt example/ted.ref.eng example/ted.sys1.eng example/ted.sys2.eng --compare_scores score_type=bleu,bootstrap=1000,prob_thresh=0.05

One important thing to note is that bootstrap resampling as implemented in compare-mt only tests for variance due to data sampling, approximately answering the question "if I ran the same system on a different, similarly sampled dataset, would I be likely to get the same result?". It does not say anything about whether a system will perform better on another dataset in a different domain, and it does not control for training-time factors such as the choice of random seed, so it cannot say whether another training run of the same model would yield the same result.
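For intuition, here is a minimal sketch of a paired bootstrap test over two systems' outputs. This is not compare-mt's implementation; the score_fn callback and the 0.5 sample ratio are assumptions for illustration only.

import random

def paired_bootstrap(ref, sys1, sys2, score_fn, num_samples=1000, sample_ratio=0.5):
  # ref, sys1, sys2: lists of sentences aligned by index
  # score_fn: a corpus-level metric, e.g. a function computing BLEU over (refs, hyps)
  n = len(ref)
  sample_size = int(n * sample_ratio)
  wins1 = wins2 = ties = 0
  for _ in range(num_samples):
    # Draw sentence indices with replacement
    ids = [random.randrange(n) for _ in range(sample_size)]
    r = [ref[i] for i in ids]
    s1 = score_fn(r, [sys1[i] for i in ids])
    s2 = score_fn(r, [sys2[i] for i in ids])
    if s1 > s2: wins1 += 1
    elif s2 > s1: wins2 += 1
    else: ties += 1
  # The fraction of resamples won by each system approximates the significance level
  return wins1 / num_samples, wins2 / num_samples, ties / num_samples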

Using Training Set Frequency

One useful piece of analysis is the "word accuracy by frequency" analysis. By default this frequency is the frequency in the test set, but arguably it is more informative to know accuracy by frequency in the training set as this demonstrates the models' robustness to words they haven't seen much, or at all, in the training data. To change the corpus used to calculate word frequency and use the training set (or some other set), you can set the freq_corpus_file option to the appropriate corpus.

compare-mt example/ted.ref.eng example/ted.sys1.eng example/ted.sys2.eng \
        --compare_word_accuracies bucket_type=freq,freq_corpus_file=example/ted.train.eng

In addition, because training sets may be very big, you can also calculate the counts on the file beforehand,

python scripts/count.py < example/ted.train.eng > example/ted.train.counts

and then use these counts directly to improve efficiency.

compare-mt example/ted.ref.eng example/ted.sys1.eng example/ted.sys2.eng \
        --compare_word_accuracies bucket_type=freq,freq_count_file=example/ted.train.counts
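If scripts/count.py is not handy, a few lines of Python can produce the counts. Note that the exact file format expected by freq_count_file is an assumption here (one word and its count per line, tab-separated).

import sys
from collections import Counter

counts = Counter()
for line in sys.stdin:
  counts.update(line.split())
for word, count in counts.most_common():
  print(f"{word}\t{count}")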

Incorporating Word/Sentence Labels

If you're interested in performing aggregate analysis over labels for each word/sentence instead of the words/sentences themselves, it is possible to do so. As an example, we've included POS tags for each of the example outputs. You can use these in aggregate analysis, or n-gram-based analysis. The following gives an example:

compare-mt example/ted.ref.eng example/ted.sys1.eng example/ted.sys2.eng \
    --compare_word_accuracies bucket_type=label,ref_labels=example/ted.ref.eng.tag,out_labels="example/ted.sys1.eng.tag;example/ted.sys2.eng.tag",label_set=CC+DT+IN+JJ+NN+NNP+NNS+PRP+RB+TO+VB+VBP+VBZ \
    --compare_ngrams compare_type=match,ref_labels=example/ted.ref.eng.tag,out_labels="example/ted.sys1.eng.tag;example/ted.sys2.eng.tag"

This will calculate word accuracies and n-gram matches by POS bucket, allowing you to see, for example, that the phrase-based MT system is better at translating content words such as nouns and verbs, while neural MT does better at translating function words.
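The .tag files appear to be token-aligned label files: one line per sentence, with one label per whitespace-separated token. As a rough sketch (not necessarily how the included example tags were produced), POS tag files like these could be generated with NLTK; the file names in the usage line are hypothetical.

import nltk

nltk.download('averaged_perceptron_tagger', quiet=True)

def write_tags(text_file, tag_file):
  # Write one line of POS tags per sentence, aligned token-by-token with the text
  with open(text_file) as fin, open(tag_file, 'w') as fout:
    for line in fin:
      tokens = line.split()
      tags = [tag for _, tag in nltk.pos_tag(tokens)]
      fout.write(' '.join(tags) + '\n')

write_tags('my_system_output.eng', 'my_system_output.eng.tag')  # hypothetical files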

We also give an example of performing aggregate analysis when multiple labels per word/sentence are allowed, where each group of labels is written as a string joined by '+'s:

compare-mt example/multited.ref.jpn example/multited.sys1.jpn example/multited.sys2.jpn \
    --compare_word_accuracies bucket_type=multilabel,ref_labels=example/multited.ref.jpn.tag,out_labels="example/multited.sys1.jpn.tag;example/multited.sys2.jpn.tag",label_set=lexical+formality+pronouns+ellipsis

It is also possible to create labels that represent numerical values. For example, scripts/relativepositiontag.py calculates the relative position of each word in the sentence, where 0 is the first word, 0.5 is the word in the middle, and 1.0 is the last word. These numerical values can then be bucketed. Here is an example:

compare-mt example/ted.ref.eng example/ted.sys1.eng example/ted.sys2.eng \
    --compare_word_accuracies bucket_type=numlabel,ref_labels=example/ted.ref.eng.rptag,out_labels="example/ted.sys1.eng.rptag;example/ted.sys2.eng.rptag"

From this particular analysis we can discover that NMT does worse than PBMT at the end of the sentence, and of course other varieties of numerical labels could be used to measure different properties of words.
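For reference, here is a minimal sketch of what a relative-position tagger along the lines of scripts/relativepositiontag.py might do; the exact number format used by the included script is an assumption.

import sys

for line in sys.stdin:
  tokens = line.split()
  if len(tokens) <= 1:
    tags = ['0.0'] * len(tokens)
  else:
    # 0.0 for the first token, 1.0 for the last, evenly spaced in between
    tags = [f"{i / (len(tokens) - 1):.2f}" for i in range(len(tokens))]
  print(' '.join(tags))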

You can also perform analysis over labels for sentences. Here is an example:

compare-mt example/ted.ref.eng example/ted.sys1.eng example/ted.sys2.eng \
    --compare_sentence_buckets 'bucket_type=label,out_labels=example/ted.sys1.eng.senttag;example/ted.sys2.eng.senttag,label_set=0+10+20+30+40+50+60+70+80+90+100,statistic_type=score,score_measure=bleu'

Analyzing Source Words

If you have a source corpus that is aligned to the target, you can also analyze accuracies according to features of the source language words, which would allow you to examine whether, for example, infrequent words on the source side are hard to output properly. Here is an example using the example data:

compare-mt example/ted.ref.eng example/ted.sys1.eng example/ted.sys2.eng --src_file example/ted.orig.slk --compare_src_word_accuracies ref_align_file=example/ted.ref.align

Analyzing Word Likelihoods

If you wish to analyze the word log likelihoods produced by two systems on the target corpus, you can use the following:

compare-ll --ref example/ll_test.txt --ll-files example/ll_test.sys1.likelihood example/ll_test.sys2.likelihood --compare-word-likelihoods bucket_type=freq,freq_corpus_file=example/ll_test.txt

You can analyze the word log likelihoods over labels for each word instead of the words themselves:

compare-ll --ref example/ll_test.txt --ll-files example/ll_test.sys1.likelihood example/ll_test.sys2.likelihood --compare-word-likelihoods bucket_type=label,label_corpus=example/ll_test.tag,label_set=CC+DT+IN+JJ+NN+NNP+NNS+PRP+RB+TO+VB+VBP+VBZ

NOTE: You can also use the above to analyze the word likelihoods produced by two language models.

Analyzing Other Language Generation Systems

You can also analyze other language generation systems using the script. Here is an example of comparing two text summarization systems.

compare-mt example/sum.ref.eng example/sum.sys1.eng example/sum.sys2.eng --compare_scores 'score_type=rouge1' 'score_type=rouge2' 'score_type=rougeL'

Evaluating on COMET

It is possible to use COMET as a metric. To do so, you need to install it first by running:

pip install unbabel-comet

Then pass the source file and select the appropriate score type. Here is an example:

compare-mt example/ted.ref.eng example/ted.sys1.eng example/ted.sys2.eng --src_file example/ted.orig.slk \
  --compare_scores score_type=comet \
  --compare_sentence_buckets bucket_type=score,score_measure=sentcomet

Note that COMET runs on top of XLM-R, so it's highly recommended you use a GPU with it.

Citation/References

If you use compare-mt, we'd appreciate it if you cite the paper about it!

@article{DBLP:journals/corr/abs-1903-07926,
  author    = {Graham Neubig and Zi{-}Yi Dou and Junjie Hu and Paul Michel and Danish Pruthi and Xinyi Wang and John Wieting},
  title     = {compare-mt: {A} Tool for Holistic Comparison of Language Generation Systems},
  journal   = {CoRR},
  volume    = {abs/1903.07926},
  year      = {2019},
  url       = {http://arxiv.org/abs/1903.07926},
}

There is an extensive literature review included in the paper above, covering the key papers that compare-mt borrows ideas from.

There is also other good software for automatic comparison or error analysis of MT systems:

  • MT-ComparEval: Very nice for visualization of individual examples, but not as focused on aggregate analysis as compare-mt. Also has more software dependencies and requires using a web browser, while compare-mt can be used as a command-line tool.

compare-mt's People

Contributors

bricksdont, coderpat, danishpruthi, douglas01996, gdemelo, kayoyin, lyuyangh, madaan, neubig, pmichel31415, proyag, sebastinsanty, tbabm, thammegowda, zdou0830


compare-mt's Issues

Implementation of Word Error Rate

Implementation of Word Error Rate would be nice to have for applications like speech. Here is an implementation that I have; it would need to be adapted into a scorer.

import numpy as np

def edits(src, trg, sub_pen=1.0, ins_pen=1.0, del_pen=1.0):
  # Action types:
  # 0 is equal
  # 1 is sub
  # 2 is delete
  # 3 is insert
  sp1 = len(src)+1
  tp1 = len(trg)+1
  scores = np.zeros((sp1, tp1))
  actions = np.zeros((sp1, tp1), np.int32)
  equals = np.expand_dims(np.array(src), axis=1) == np.array(trg)
  # Initialize the first column (all deletions) and first row (all insertions)
  scores[:,0] = np.arange(sp1) * del_pen
  actions[1:,0] = 2
  scores[0,:] = np.arange(tp1) * ins_pen
  actions[0,1:] = 3
  # Forward pass: fill in the edit-distance table
  for i in range(0, len(src)):
    for j in range(0, len(trg)):
      my_action = 0 if equals[i,j] else 1
      my_score = scores[i,j] + my_action * sub_pen
      del_score = scores[i,j+1] + del_pen
      if del_score < my_score:
        my_action = 2
        my_score = del_score
      ins_score = scores[i+1,j] + ins_pen
      if ins_score < my_score:
        my_action = 3
        my_score = ins_score
      actions[i+1,j+1] = my_action
      scores[i+1,j+1] = my_score
  # Backward pass: trace back the sequence of edit actions
  i, j = len(src), len(trg)
  best_actions = []
  while i+j != 0:
    my_action = actions[i,j]
    best_actions.append(my_action)
    if my_action != 2: j -= 1
    if my_action != 3: i -= 1
  best_actions.reverse()
  return scores[-1,-1], best_actions
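A scorer wrapping this could then compute WER by normalizing the edit distance by the reference length, for example as in this sketch (not part of the issue's code; ref_tokens and hyp_tokens are assumed token lists):

dist, _ = edits(hyp_tokens, ref_tokens)
wer = dist / max(len(ref_tokens), 1)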

pip install does not auto-install requirements

For example:

$ pip install compare_mt
Collecting compare_mt
  Downloading https://files.pythonhosted.org/packages/e9/4c/e18ea230d656e273a1dd8a0fcea74154ebb343ea471dce7e95a6cd74ed7d/compare_mt-0.2.8.tar.gz (43kB)
     |████████████████████████████████| 51kB 2.2MB/s 
    ERROR: Command errored out with exit status 1:
     command: /home/gneubig/anaconda3/envs/python3/bin/python -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-sieojamx/compare-mt/setup.py'"'"'; __file__='"'"'/tmp/pip-install-sieojamx/compare-mt/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /tmp/pip-install-sieojamx/compare-mt/pip-egg-info
         cwd: /tmp/pip-install-sieojamx/compare-mt/
    Complete output (9 lines):
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-install-sieojamx/compare-mt/setup.py", line 4, in <module>
        import compare_mt
      File "/tmp/pip-install-sieojamx/compare-mt/compare_mt/__init__.py", line 5, in <module>
        import compare_mt.scorers
      File "/tmp/pip-install-sieojamx/compare-mt/compare_mt/scorers.py", line 1, in <module>
        import nltk
    ModuleNotFoundError: No module named 'nltk'
    ----------------------------------------
ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.

Upload to PyPI

It would be convenient for users to be able to pip install compare-mt. I'm willing to take care of this but I'd rather wait until things settle down a little bit (especially if we are going to change the name).

In any case, I'll leave this issue up as a reminder.

Ability to specify more than 2 systems to analyze

Currently compare_mt.py only works with exactly two systems. It'd be nice if it was possible to specify more than two systems.

For some things this is relatively easy: aggregate score analysis, word accuracy analysis, and sentence bucket analysis are all calculated with respect to single systems.

For other things, like n-gram analysis and sentence difference analysis, this is harder because the analysis is pairwise. In this case, we might just have to specify which systems to compare.

Analysis of reordering errors

It might be nice to have some idea of how many reordering errors are occurring. It's not clear exactly how to do this, but we can discuss.

Word frequency analysis doesn't tell where frequencies came from

When using word frequency bucketing, the reporting doesn't output where the frequencies came from, so there is no indication of the difference between different types of word frequency analysis (e.g. train set vs. test set). It would be nice to print out where the frequencies are from.

Ability to name analyses

Currently the "caption" above tables or figures is based on the parameters input into the model. Unfortunately, not all of the parameters are included in these captions (e.g. freq_corpus_file is not included in the word_accuracies caption), so sometimes it's hard to tell the difference between them. I don't think it's necessarily ideal to put all of the parameters in the caption though, because the parameters can be quite verbose.

One solution to this is to allow "naming" of each type of analysis, where the user can add a name or caption parameter that will then be used as the caption.

Ability to click through to individual examples

Pending #6, it might be nice to have the ability to take the aggregate reports, then click on a link that shows individual examples of the outputs of the two systems, maybe with the salient parts highlighted. For example, in the n-gram analysis, it would be nice to be able to click through to places where one system got the n-gram right but the other did not.

Make number format consistent and configurable

Currently the formatting of numbers is not consistent, and also cannot be changed. It would be nice to adjust the following:

  • Most places use four digits, but some use 2, and others use default string formatting of Python. It would be nice to be able to specify the number of digits.
  • We can't adjust whether we want to have numbers be multiplied by 100 or not (sometimes this is good for BLEU, as it makes numbers easier to read and tables more concise).
  • Integer values should not have decimal places.

This can be worked on once #49 is merged.

Analysis over word likelihoods

One thing that MT systems can do is calculate word-by-word likelihoods over the development/test data. It might be nice to be able to do analysis over these likelihoods as well.

Undefined name 'sys_names' in compare_mt_main.py

flake8 testing of https://github.com/neulab/compare-mt on Python 3.7.1

$ flake8 . --count --select=E901,E999,F821,F822,F823 --show-source --statistics

./compare_mt/compare_mt_main.py:391:60: F821 undefined name 'sys_names'
    raise ValueError(f'len(sys_names) != len(outs) -- {len(sys_names)} != {len(outs)}')
                                                           ^
1     F821 undefined name 'sys_names'
1

E901,E999,F821,F822,F823 are the "showstopper" flake8 issues that can halt the runtime with a SyntaxError, NameError, etc. These 5 are different from most other flake8 issues, which are merely "style violations" -- useful for readability but they do not affect runtime safety.

  • F821: undefined name name
  • F822: undefined name name in __all__
  • F823: local variable name referenced before assignment
  • E901: SyntaxError or IndentationError
  • E999: SyntaxError -- failed to compile a file into an Abstract Syntax Tree

Bug in sentence level BLEU comparison

Description

The report

N sentences where Sys A > Sys B at sentence-level BLEU

will generate wrong output if:

a) Sys A never generates sentences that have a higher BLEU score

b) There are fewer than N sentences in the set of sentences to be analyzed


Files

print(f'--- {report_length} sentences where {sright}>{sleft} at {self.scorer.name()}')

To Reproduce

Use the SysA, SysB and Ref outputs located at https://gist.github.com/madaan/2cec36a7b18dfeea3904ddfff1e19312 and run compare-mt with the default options.

Tasks

I can take a stab at it if you guys think this should be fixed.

Thanks!

Encoding issues in non-ASCII output?

It looks like maybe we have an encoding issue when printing out non-ASCII characters in output. Because I was mainly looking at English I didn't realize, but it's obvious when printing out the Slovak source sentences included in the example. See attached figure.

Case-insensitive option

Do we need to add a case-insensitive option when computing scores and word accuracies?

bootstrap sample size

Hello,

I was checking your notes (http://www.phontron.com/class/mtandseq2seq2018/assets/slides/mt-fall2018.chapter11.pdf) and saw the following, which seems to be applied in this codebase as well:

In Line 4, we sample a subset of the test data, where in practice we usually use exactly half of the sentences in the test data.

If I understand correctly, if we have n sentences in the test set, this means that every bootstrap resample has only 0.5 * n sentences in it. What is the intuition of using half of the sentences here?

Thanks

Formatting table results

Do people think it would be helpful to format the CLI outputs using an external package like tabulate? Please see an example below.

Also, tabulate can take an argument to generate HTML output, which may also be helpful when considering dumping results to HTML.

Before:

x < -20	0	0
-20 <= x < -10	8	12
-10 <= x < -5	66	67
x = -5	41	44
x = -4	73	58
x = -3	115	97
x = -2	173	187
x = -1	263	256
x = 0	394	423
x = 1	330	344
x = 2	296	275
x = 3	219	194
x = 4	127	153
x = 5	107	93
6 <= x < 11	194	182
11 <= x < 21	39	51
21 < x	0	9

After:

--------------  ---  ---
x < -20           0    0
-20 <= x < -10    8   12
-10 <= x < -5    66   67
x = -5           41   44
x = -4           73   58
x = -3          115   97
x = -2          173  187
x = -1          263  256
x = 0           394  423
x = 1           330  344
x = 2           296  275
x = 3           219  194
x = 4           127  153
x = 5           107   93
6 <= x < 11     194  182
11 <= x < 21     39   51
21 < x            0    9
--------------  ---  ---

P.S.: If so, adding a requirements.txt will be in higher demand. (The package already requires nltk, which is not stated in the README.)
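For reference, a minimal sketch of the tabulate calls that would produce the layout above, assuming the bucket rows are already available as a list of tuples (only the first few rows are shown here):

from tabulate import tabulate

rows = [("x < -20", 0, 0), ("-20 <= x < -10", 8, 12), ("-10 <= x < -5", 66, 67)]
# The default "simple" format gives the dashed layout shown above;
# tablefmt="html" would produce an HTML table instead.
print(tabulate(rows))
print(tabulate(rows, tablefmt="html"))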

Ability to set custom bucket_cutoffs

I am not sure if there is an explicit way of setting custom bucket_cutoffs. Right now the optional arguments are taken as a string, which is further split on commas, so if I wish to specify a list of cutoffs, it will also be split on commas and won't work.

stand-alone command line scorer

compare-mt implements many scorers; it might be nice to expose a stand-alone command line scorer for text files that gives access to any of the scorers provided by compare-mt.

Refactoring

Currently the code is not very modular; it would be nice to refactor it to be cleaner.

Errors resulting from command line formatting can be opaque

Errors in formatting of the command line arguments can be hard to understand. For example, if you accidentally put a comma instead of a semicolon between two filenames, you can get something like this:

Traceback (most recent call last):
  File "/Users/neubig/anaconda3/envs/python3/bin/compare-mt", line 11, in <module>
    load_entry_point('compare-mt', 'console_scripts', 'compare-mt')()
  File "/Users/neubig/work/compare-mt/compare_mt/compare_mt_main.py", line 411, in main
    reports.append( (name, [func(ref, outs, **arg_utils.parse_profile(x)) for x in arg]) )
  File "/Users/neubig/work/compare-mt/compare_mt/compare_mt_main.py", line 411, in <listcomp>
    reports.append( (name, [func(ref, outs, **arg_utils.parse_profile(x)) for x in arg]) )
  File "/Users/neubig/work/compare-mt/compare_mt/arg_utils.py", line 4, in parse_profile
    k, v = kv.split('=')
ValueError: not enough values to unpack (expected 2, got 1)

It would be nice to make these errors clearer.

Ability to print LaTeX tables

Pending the refactoring in #18, it would also be nice to be able to write out LaTeX tables to be automatically copy-pasted into papers.

Bug in plotting

Cf #56

Running

python compare_mt/compare_mt_main.py example/ted.ref.eng example/ted.sys1.eng example/ted.sys2.eng --output_directory output

raises the following exception

Traceback (most recent call last):
  File "/home/travis/virtualenv/python3.7-dev/bin/compare-mt", line 11, in <module>
    load_entry_point('compare-mt==0.1', 'console_scripts', 'compare-mt')()
  File "/home/travis/virtualenv/python3.7-dev/lib/python3.7/site-packages/compare_mt-0.1-py3.7.egg/compare_mt/compare_mt_main.py", line 386, in main
    reporters.generate_html_report(reports, args.output_directory)
  File "/home/travis/virtualenv/python3.7-dev/lib/python3.7/site-packages/compare_mt-0.1-py3.7.egg/compare_mt/reporters.py", line 459, in generate_html_report
    content.append(r.html_content(output_directory))
  File "/home/travis/virtualenv/python3.7-dev/lib/python3.7/site-packages/compare_mt-0.1-py3.7.egg/compare_mt/reporters.py", line 205, in html_content
    self.plot(output_directory, self.output_fig_file, ext)
  File "/home/travis/virtualenv/python3.7-dev/lib/python3.7/site-packages/compare_mt-0.1-py3.7.egg/compare_mt/reporters.py", line 196, in plot
    errs=sys_errs, title='Score Comparison', ylabel=self.scorer.name(),
UnboundLocalError: local variable 'sys_errs' referenced before assignment

No longer possible to run `python compare_mt.py`

Probably due to #40 it's no longer possible to run python compare_mt.py (or python compare_mt/compare_mt.py). I get the following error when I try to do so:

Traceback (most recent call last):
  File "/Users/neubig/work/compare-mt/compare_mt/compare_mt.py", line 6, in <module>
    from . import ngram_utils
ImportError: cannot import name 'ngram_utils'

This is a bit annoying, because it's hard to debug. Is there a way around this? I know in xnmt (https://github.com/neulab/xnmt) we used the python setup.py develop command, which allowed us to run things in place while still supporting installing and setup.py.

Faster significance tests

Currently significance tests are pretty slow. This is because currently, if we set bootstrap=1000, the program will do the entire process of calculating the score (e.g. BLEU) 1000 times.

There is a faster way to implement bootstrap resampling. It is actually possible to compute the sufficient statistics for BLEU (e.g. the number of n-gram matches, etc.) just once per sentence and cache them. This would be much faster, but would require that we implement BLEU from scratch. I don't think that this is too hard though.
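As a rough sketch of the idea (not compare-mt's code), the sufficient statistics for BLEU are the clipped n-gram match counts, the hypothesis n-gram totals, and the reference/hypothesis lengths; these can be computed once per sentence and then summed for each bootstrap sample:

import math
from collections import Counter

def ngrams(tokens, n):
  return Counter(tuple(tokens[i:i+n]) for i in range(len(tokens) - n + 1))

def sentence_stats(ref, hyp, max_n=4):
  # Per-order clipped matches and hypothesis n-gram totals,
  # plus reference/hypothesis lengths for the brevity penalty
  stats = []
  for n in range(1, max_n + 1):
    r, h = ngrams(ref, n), ngrams(hyp, n)
    stats.append((sum((r & h).values()), sum(h.values())))
  return stats, len(ref), len(hyp)

def corpus_bleu_from_stats(cached, max_n=4):
  matches, totals = [0] * max_n, [0] * max_n
  ref_len = hyp_len = 0
  for stats, rl, hl in cached:
    for n, (m, t) in enumerate(stats):
      matches[n] += m
      totals[n] += t
    ref_len += rl
    hyp_len += hl
  if min(matches) == 0 or min(totals) == 0:
    return 0.0
  log_prec = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
  bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / max(hyp_len, 1))
  return bp * math.exp(log_prec)

Each bootstrap sample then only needs to index into the cached per-sentence statistics and call corpus_bleu_from_stats, rather than re-counting n-grams every time.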

Some measure of bucketed statistic reliability

When doing bucketing of either words or sentences, some of the buckets may have very few examples included, and thus the estimates of statistics may not be very reliable. It would be nice to have some idea of the reliability of these, either through:

  1. Outputting the number of counts of words/sentences in each bucket
  2. Error bars on the estimates calculated through regular standard deviation, or bootstrapping

Bootstrap resampling without replacement

According to my understanding (and the original paper: https://www.aclweb.org/anthology/W04-3250), bootstrap resampling should be performed with replacement.

However, if I understand the current code, we are sampling without replacement. This is probably not going to change actual results that much, but we should still fix it for the sake of consistency (for instance, one case where sampling without replacement goes wrong is that the more you increase the number of samples, the less accurate the confidence intervals become).

EDIT: I think those are the lines where we are sampling: https://github.com/neulab/compare-mt/blob/master/compare_mt/sign_utils.py#L47-L48

Instead we should do

reduced_ids = np.random.choice(ids, size=int(len(ids)*sample_ratio), replace=True)

Ability to print graphical (HTML?) reports

Currently compare_mt.py is purely a command line tool. However, it would be nice if it were possible to print out HTML reports with graphics that could be used directly in papers.

In order to do this, I think it would be best to change report generation into two steps:

  1. A step of collecting all the statistics necessary in some data structure
  2. A step of printing out the actual report based on these statistics

If the logic is separated appropriately, then only 2. would have to vary based on the type of report being generated.

Consider model variance in bootstrap resampling test

This issue brings up a problem about using so-called "bootstrap resampling test" for evaluating "statistical significance" of machine translation (especially neural MT) methods, and similar generation tasks that are evaluated by MT metrics.

In this criterion, the evaluator randomly chooses a number of generated sentences to simulate the distribution of model outputs, but does not consider the variance of the trained model itself.

Consider that we have a baseline model to beat, and a champion model of the proposed method. A champion model can be produced without the authors recognizing it; e.g., if the model was trained only a few times, nobody may be able to judge whether that model is an outlier in the model distribution or not.

In this situation, the "bootstrap resampling test" may judge the proposed method to be significantly better, but the evaluation was actually performed on only one model variant, which may be a champion, and did not consider any distributional properties of the proposed method.

The "bootstrap resampling test" was introduced in the era of statistical MT, and I guess the method historically produced reasonable judgements for SMT systems because those studies were usually investigating additions to almost-fixed systems such as Moses (I say "almost-fixed" because they also had random tuning of hyperparameters). For neural MT systems this assumption no longer holds, because the systems are randomly trained from scratch, and the "bootstrap resampling test" may no longer produce meaningful results but rather give the model an unwarranted authority.

I have continuously observed that the "bootstrap resampling test" is still used in many papers to claim "statistical significance" of a model, and I am strongly worried that this is misleading for this line of research.

Multiple references

Sometimes we may have multiple references. Would it be nice to support analysis over multiple references?
