
maluuba / nlg-eval


Evaluation code for various unsupervised automated metrics for Natural Language Generation.

Home Page: http://arxiv.org/abs/1706.09799

License: Other

Python 100.00%
natural-language-generation natural-language-processing nlg nlp evaluation bleu bleu-score meteor cider rouge

nlg-eval's Introduction


nlg-eval

Evaluation code for various unsupervised automated metrics for NLG (Natural Language Generation). It takes as input a hypothesis file and one or more reference files, and outputs metric values. Rows across these files should correspond to the same example.

Metrics

  • BLEU
  • METEOR
  • ROUGE
  • CIDEr
  • SPICE
  • SkipThought cosine similarity
  • Embedding Average cosine similarity
  • Vector Extrema cosine similarity
  • Greedy Matching score

Setup

Install Java 1.8.0 (or higher).
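
You can confirm that a suitable Java runtime is on your PATH with:

java -version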

To install the Python dependencies, run:

pip install git+https://github.com/Maluuba/nlg-eval.git@master
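
Alternatively, if you want to work from a local checkout (for example, to run the tests or modify the code), an editable install is a common approach (a sketch; assumes git and pip are available):

git clone https://github.com/Maluuba/nlg-eval.git
cd nlg-eval
pip install -e .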

If you are using macOS High Sierra or higher, then run this to allow multithreading:

export OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES

For the simple setup (downloads required data, e.g. models and embeddings, and external code files), run:

nlg-eval --setup

If you're setting this up from the source code or you're on Windows and not using a Bash terminal, then you might get errors about nlg-eval not being found. You will need to find the nlg-eval script. See here for details.
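
For example, you can list the files that pip installed for the package in order to locate the script (output paths will vary by environment):

pip show -f nlg-eval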

Custom Setup

# If you don't like the default path (~/.cache/nlgeval) for the downloaded data,
# then specify a path where you want the files to be downloaded.
# The value for the data path is stored in ~/.config/nlgeval/rc.json and can be overwritten by
# setting the NLGEVAL_DATA environment variable.
nlg-eval --setup ${data_path}

Validate the Setup (Optional)

(These examples were made with Git Bash on Windows)

All of the data files should have been downloaded; you should see sizes like:

$ ls -l ~/.cache/nlgeval/
total 6003048
-rw-r--r-- 1 ...  289340074 Sep 12  2018 bi_skip.npz
-rw-r--r-- 1 ...        689 Sep 12  2018 bi_skip.npz.pkl
-rw-r--r-- 1 ... 2342138474 Sep 12  2018 btable.npy
-rw-r--r-- 1 ...    7996547 Sep 12  2018 dictionary.txt
-rw-r--r-- 1 ...   21494787 Jan 22  2019 glove.6B.300d.model.bin
-rw-r--r-- 1 ...  480000128 Jan 22  2019 glove.6B.300d.model.bin.vectors.npy
-rw-r--r-- 1 ...  663989216 Sep 12  2018 uni_skip.npz
-rw-r--r-- 1 ...        693 Sep 12  2018 uni_skip.npz.pkl
-rw-r--r-- 1 ... 2342138474 Sep 12  2018 utable.npy

You can also verify some checksums:

$ cd ~/.cache/nlgeval/
$ md5sum *
9a15429d694a0e035f9ee1efcb1406f3 *bi_skip.npz
c9b86840e1dedb05837735d8bf94cee2 *bi_skip.npz.pkl
022b5b15f53a84c785e3153a2c383df6 *btable.npy
26d8a3e6458500013723b380a4b4b55e *dictionary.txt
f561ab0b379e23cbf827a054f0e7c28e *glove.6B.300d.model.bin
be5553e91156471fe35a46f7dcdfc44e *glove.6B.300d.model.bin.vectors.npy
8eb7c6948001740c3111d71a2fa446c1 *uni_skip.npz
e1a0ead377877ff3ea5388bb11cfe8d7 *uni_skip.npz.pkl
5871cc62fc01b79788c79c219b175617 *utable.npy
$ sha256sum *
8ab7965d2db5d146a907956d103badfa723b57e0acffb75e10198ba9f124edb0 *bi_skip.npz
d7e81430fcdcbc60b36b92b3f879200919c75d3015505ee76ae3b206634a0eb6 *bi_skip.npz.pkl
4a4ed9d7560bb87f91f241739a8f80d8f2ba787a871da96e1119e913ccd61c53 *btable.npy
4dc5622978a30cddea8c975c871ea8b6382423efb107d27248ed7b6cfa490c7c *dictionary.txt
10c731626e1874effc4b1a08d156482aa602f7f2ca971ae2a2f2cd5d70998397 *glove.6B.300d.model.bin
20dfb1f44719e2d934bfee5d39a6ffb4f248bae2a00a0d59f953ab7d0a39c879 *glove.6B.300d.model.bin.vectors.npy
7f40ff16ff5c54ce9b02bd1a3eb24db3e6adaf7712a7a714f160af3a158899c8 *uni_skip.npz
d58740d46cba28417cbc026af577f530c603d81ac9de43ffd098f207c7dc4411 *uni_skip.npz.pkl
790951d4b08e843e3bca0563570f4134ffd17b6bd4ab8d237d2e5ae15e4febb3 *utable.npy

To ensure that the setup was successful, you can run the tests:

pip install pytest
pytest

It might take a few minutes and you might see warnings, but the tests should pass.
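
To run only the main API tests (assuming the test module lives at nlgeval/tests/test_nlgeval.py, as referenced below):

pytest nlgeval/tests/test_nlgeval.py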

Usage

Once setup has completed, the metrics can be evaluated with a Python API or in the command line.

Examples of the Python API can be found in test_nlgeval.py.

Standalone

nlg-eval --hypothesis=examples/hyp.txt --references=examples/ref1.txt --references=examples/ref2.txt

where each line in the hypothesis file is a generated sentence and the corresponding lines across the reference files are ground truth reference sentences for the corresponding hypothesis.
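
For illustration, the files might look like this (hypothetical contents, not the bundled examples; line N of the hypothesis file is scored against line N of every reference file):

$ cat my_hyp.txt
the cat sat on a mat
a man rides a horse
$ cat my_ref1.txt
the cat is sitting on the mat
a person is riding a horse
$ cat my_ref2.txt
there is a cat on the mat
someone rides a brown horse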

functional API: for the entire corpus

from nlgeval import compute_metrics
metrics_dict = compute_metrics(hypothesis='examples/hyp.txt',
                               references=['examples/ref1.txt', 'examples/ref2.txt'])

functional API: for only one sentence

from nlgeval import compute_individual_metrics
metrics_dict = compute_individual_metrics(references, hypothesis)

where references is a list of ground truth reference text strings and hypothesis is the hypothesis text string.
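
A minimal sketch (illustrative strings; actual scores depend on your setup and which metrics are enabled):

from nlgeval import compute_individual_metrics

references = ["the cat is sitting on the mat",
              "there is a cat on the mat"]
hypothesis = "the cat sat on a mat"
metrics_dict = compute_individual_metrics(references, hypothesis)
print(metrics_dict)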

object oriented API for repeated calls in a script - single example

from nlgeval import NLGEval
nlgeval = NLGEval()  # loads the models
metrics_dict = nlgeval.compute_individual_metrics(references, hypothesis)

where references is a list of ground truth reference text strings and hypothesis is the hypothesis text string.

object oriented API for repeated calls in a script - multiple examples

from nlgeval import NLGEval
nlgeval = NLGEval()  # loads the models
metrics_dict = nlgeval.compute_metrics(references, hypothesis)

where references is a list of lists of ground truth reference text strings and hypothesis is a list of hypothesis text strings. Each inner list in references is one set of references for the hypothesis (a list of single reference strings for each sentence in hypothesis in the same order).
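
A minimal sketch with two hypotheses and two reference sets (illustrative data; note that each inner list holds one reference per hypothesis, in the same order as the hypotheses):

from nlgeval import NLGEval

nlgeval = NLGEval()  # loads the models once

hypotheses = ["the cat sat on a mat",
              "a man rides a horse"]
references = [["the cat is sitting on the mat",   # first reference for each hypothesis
               "a person is riding a horse"],
              ["there is a cat on the mat",       # second reference for each hypothesis
               "someone rides a brown horse"]]

metrics_dict = nlgeval.compute_metrics(references, hypotheses)
print(metrics_dict)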

Reference

If you use this code as part of any published research, please cite the following paper:

Shikhar Sharma, Layla El Asri, Hannes Schulz, and Jeremie Zumer. "Relevance of Unsupervised Metrics in Task-Oriented Dialogue for Evaluating Natural Language Generation." arXiv preprint arXiv:1706.09799 (2017).

@article{sharma2017nlgeval,
    author  = {Sharma, Shikhar and El Asri, Layla and Schulz, Hannes and Zumer, Jeremie},
    title   = {Relevance of Unsupervised Metrics in Task-Oriented Dialogue for Evaluating Natural Language Generation},
    journal = {CoRR},
    volume  = {abs/1706.09799},
    year    = {2017},
    url     = {http://arxiv.org/abs/1706.09799}
}

Example

Running

nlg-eval --hypothesis=examples/hyp.txt --references=examples/ref1.txt --references=examples/ref2.txt

gives

Bleu_1: 0.550000
Bleu_2: 0.428174
Bleu_3: 0.284043
Bleu_4: 0.201143
METEOR: 0.295797
ROUGE_L: 0.522104
CIDEr: 1.242192
SPICE: 0.312331
SkipThoughtsCosineSimilarity: 0.626149
EmbeddingAverageCosineSimilarity: 0.884690
VectorExtremaCosineSimilarity: 0.568696
GreedyMatchingScore: 0.784205

Troubleshooting

If you have issues with METEOR, you can try lowering the mem variable in meteor.py.
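
A hedged sketch of the kind of change meant here, assuming mem is a Java heap-size string passed to the -Xmx flag (check your copy of nlgeval/pycocoevalcap/meteor/meteor.py for the actual variable name and default):

# in nlgeval/pycocoevalcap/meteor/meteor.py (names assumed; adjust to the actual code)
mem = '1G'  # lowered from a larger default such as '2G'
meteor_cmd = ['java', '-jar', '-Xmx{}'.format(mem), METEOR_JAR,
              '-', '-', '-stdio', '-l', 'en', '-norm']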

Important Note

CIDEr by default (with the idf parameter set to "corpus" mode) computes IDF values using the reference sentences provided. Thus, the CIDEr score for a reference dataset with only one image (or example, for NLG) will be zero. When evaluating using one (or few) images, set idf to "coco-val-df" instead, which uses IDF from the MSCOCO Validation Dataset for more reliable results. This has not been adapted in this code. For this use case, apply the patches from vrama91/coco-caption.

External data directory

To mount an already prepared data directory in a Docker container, or to share it between users, you can set the NLGEVAL_DATA environment variable to tell nlg-eval where to find its models and data, e.g.

NLGEVAL_DATA=~/workspace/nlg-eval/nlgeval/data

This variable overrides the value provided during setup (stored in ~/.config/nlgeval/rc.json).
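
For example, to point a run at a shared data directory (the path is illustrative):

export NLGEVAL_DATA=~/workspace/nlg-eval/nlgeval/data
nlg-eval --hypothesis=examples/hyp.txt --references=examples/ref1.txt --references=examples/ref2.txt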

Microsoft Open Source Code of Conduct

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

License

See LICENSE.md.

nlg-eval's People

Contributors

dependabot[bot], getim, juharris, koustuvsinha, kracwarlock, ringsaturn, temporaer, tgisaturday



nlg-eval's Issues

IOError: [Errno 32] Broken pipe

Traceback (most recent call last):
  File "test_results.py", line 41, in <module>
    metrics_dict = nlgeval.compute_metrics(references, hypotheses)
  File "/home/shaunak/ProjectSSSP/FULL S/caption_generator_resnet/nlg_eval_master/nlgeval/__init__.py", line 292, in compute_metrics
    score, scores = scorer.compute_score(refs, hyps)
  File "/home/shaunak/ProjectSSSP/FULL S/caption_generator_resnet/nlg_eval_master/nlgeval/pycocoevalcap/meteor/meteor.py", line 62, in compute_score
    stat = self._stat(res[i][0], gts[i])
  File "/home/shaunak/ProjectSSSP/FULL S/caption_generator_resnet/nlg_eval_master/nlgeval/pycocoevalcap/meteor/meteor.py", line 84, in _stat
    self.meteor_p.stdin.write(score_line)
IOError: [Errno 32] Broken pipe
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/home/shaunak/anaconda2/lib/python2.7/atexit.py", line 24, in _run_exitfuncs
    func(*targs, **kargs)
  File "/home/shaunak/ProjectSSSP/FULL S/caption_generator_resnet/nlg_eval_master/nlgeval/pycocoevalcap/meteor/meteor.py", line 50, in close
    if atexit is not None and atexit.unregister is not None:
AttributeError: 'module' object has no attribute 'unregister'
Error in sys.exitfunc:
Traceback (most recent call last):
  File "/home/shaunak/anaconda2/lib/python2.7/atexit.py", line 24, in _run_exitfuncs
    func(*targs, **kargs)
  File "/home/shaunak/ProjectSSSP/FULL S/caption_generator_resnet/nlg_eval_master/nlgeval/pycocoevalcap/meteor/meteor.py", line 50, in close
    if atexit is not None and atexit.unregister is not None:
AttributeError: 'module' object has no attribute 'unregister'

The IOError 32 (Broken pipe) occurs at 'self.meteor_p.stdin.write(score_line)' in meteor.py.

_pickle.UnpicklingError: pickle data was truncated

Hi, I have some problems when running nlg-eval --hypothesis=examples/hyp.txt --references=examples/ref1.txt --references=examples/ref2.txt:

At first, it seems normal and outputs:

Bleu_1: 0.550000
Bleu_2: 0.428174
Bleu_3: 0.284043
Bleu_4: 0.201143
METEOR: 0.295797
ROUGE_L: 0.522104
CIDEr: 1.242192

And then I get a warning:

WARNING (theano.tensor.blas): We did not find a dynamic library in the library_dir of the library we use for blas. If you use ATLAS, make sure to compile it with dynamics library.

Finally, I get an abnormal program termination with '_pickle.UnpicklingError':

Traceback (most recent call last):
  File "/home/xjx/anaconda2/envs/anaconda-py3/bin/nlg-eval", line 7, in <module>
    exec(compile(f.read(), __file__, 'exec'))
  File "/home/xjx/nlg-eval/bin/nlg-eval", line 169, in <module>
    compute_metrics()
  File "/home/xjx/anaconda2/envs/anaconda-py3/lib/python3.6/site-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/home/xjx/anaconda2/envs/anaconda-py3/lib/python3.6/site-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/home/xjx/anaconda2/envs/anaconda-py3/lib/python3.6/site-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/xjx/anaconda2/envs/anaconda-py3/lib/python3.6/site-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/home/xjx/nlg-eval/bin/nlg-eval", line 165, in compute_metrics
    nlgeval.compute_metrics(hypothesis, references, no_overlap, no_skipthoughts, no_glove)
  File "/home/xjx/nlg-eval/nlgeval/__init__.py", line 55, in compute_metrics
    model = skipthoughts.load_model()
  File "/home/xjx/nlg-eval/nlgeval/skipthoughts/skipthoughts.py", line 59, in load_model
    utable, btable = load_tables()
  File "/home/xjx/nlg-eval/nlgeval/skipthoughts/skipthoughts.py", line 79, in load_tables
    utable = numpy.load(os.path.join(path_to_tables, 'utable.npy'), encoding='bytes')
  File "/home/xjx/anaconda2/envs/anaconda-py3/lib/python3.6/site-packages/numpy/lib/npyio.py", line 440, in load
    pickle_kwargs=pickle_kwargs)
  File "/home/xjx/anaconda2/envs/anaconda-py3/lib/python3.6/site-packages/numpy/lib/format.py", line 693, in read_array
    array = pickle.load(fp, **pickle_kwargs)
_pickle.UnpicklingError: pickle data was truncated

It would be helpful if someone knows what I should do next.

Thanks and best regards!

bug at setup.sh

Hey,
I got an error:

"PYTHONPATH=pwd python nlgeval/word2vec/generate_w2v_files.py
'PYTHONPATH' is not recognized as an internal or external command,
operable program or batch file."

at the line:
PYTHONPATH=`pwd` python2.7 nlgeval/word2vec/generate_w2v_files.py

I tried this instead:
python nlgeval/word2vec/generate_w2v_files.py

and got a different error:
Traceback (most recent call last):
  File "nlgeval/word2vec/generate_w2v_files.py", line 3, in <module>
    from gensim.models import Word2Vec
  File "C:\Python27\lib\site-packages\gensim\__init__.py", line 5, in <module>
    from gensim import parsing, corpora, matutils, interfaces, models, similarities, summarization, utils  # noqa:F401
  File "C:\Python27\lib\site-packages\gensim\parsing\__init__.py", line 4, in <module>
    from .preprocessing import (remove_stopwords, strip_punctuation, strip_punctuation2,  # noqa:F401
  File "C:\Python27\lib\site-packages\gensim\parsing\preprocessing.py", line 40, in <module>
    from gensim import utils
  File "C:\Python27\lib\site-packages\gensim\utils.py", line 37, in <module>
    import numpy as np
  File "C:\Python27\lib\site-packages\numpy\__init__.py", line 142, in <module>
    from . import add_newdocs
  File "C:\Python27\lib\site-packages\numpy\add_newdocs.py", line 13, in <module>
    from numpy.lib import add_newdoc
  File "C:\Python27\lib\site-packages\numpy\lib\__init__.py", line 8, in <module>
    from .type_check import *
  File "C:\Python27\lib\site-packages\numpy\lib\type_check.py", line 11, in <module>
    import numpy.core.numeric as nx
  File "C:\Python27\lib\site-packages\numpy\core\__init__.py", line 26, in <module>
    raise ImportError(msg)
ImportError:
Importing the multiarray numpy extension module failed. Most
likely you are trying to import a failed build of numpy.
If you're working with a numpy git repo, try git clean -xdf (removes all
files not under version control). Otherwise reinstall numpy.

Original error was: DLL load failed: %1 is not a valid Win32 application.

Infersent

Hi,

I am trying to obtain the semantic similarity between the generated and the ground truth sentence.

I used all these metrics to evaluate the generated sentences (validation dataset):
BLEU 1 | 0.128031
BLEU 2 | 0.056153
BLEU 3 | 0.029837
BLEU 4 | 0.013649
METEOR | 0.305482
ROUGE_L | 0.148652
CIDEr | 0.069519
SkipThought cosine similarity | 0.765784
Embedding Average cosine similarity | 0.973187
Vector Extrema cosine similarity | 0.683888
Greedy Matching score | 0.94496

Some of these metrics indicate that the sentences are quite similar and some show them to be different. Can you please suggest a metric to obtain the semantic similarity between sentences?

How about InferSent and Word Mover's Distance? I think you should consider adding these metrics for the evaluation of text generation. This repository is helpful for evaluating generated text.

bug(encoding): TypeError: __init__() got an unexpected keyword argument 'encoding'

I tried to run the basic functional API for the entire corpus, and have been getting the TypeError for the Meteor metric calculation. I am using Anaconda on Win10 and have all the prerequisites installed (Java 1.8.0, etc.). I am running this on Python 3.5.4.

>>> from nlgeval import compute_metrics
>>> metrics_dict = compute_metrics(hypothesis="data/quora/train_target_1.txt",references=["data/quora/train_source_1.txt"],no_skipthoughts=True,no_glove=True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "c:\users\derp\desktop\nlgeval\nlgeval\__init__.py", line 29, in compute_metrics
    (Meteor(), "METEOR"),
  File "c:\users\derp\desktop\nlgeval\nlgeval\pycocoevalcap\meteor\meteor.py", line 29, in __init__
    **kwargs)
TypeError: __init__() got an unexpected keyword argument 'encoding'

I tried a few approaches to bypass this issue, like commenting out the kwargs in the __init__ function of c:\users\derp\desktop\nlgeval\nlgeval\pycocoevalcap\meteor\meteor.py and explicitly encoding the strings that were being passed to self.meteor_p.stdin.write in the _stat function.

def __init__(self):
        meteor_cmd = ['java', '-jar', '-Xmx2G', METEOR_JAR,
                      '-', '-', '-stdio', '-l', 'en', '-norm']
#        kwargs = dict()
#        if not six.PY2:
#            kwargs['encoding'] = 'utf-8'
        self.meteor_p = subprocess.Popen(meteor_cmd,
                                         cwd=os.path.dirname(os.path.abspath(__file__)),
                                         stdin=subprocess.PIPE,
                                         stdout=subprocess.PIPE,
                                         stderr=subprocess.PIPE)
#                                         ,**kwargs)
        # Used to guarantee thread safety
        self.lock = threading.Lock()
def _stat(self, hypothesis_str, reference_list):
        # SCORE ||| reference 1 words ||| reference n words ||| hypothesis words
        hypothesis_str = hypothesis_str.replace('|||', '').replace('  ', ' ')
        score_line = ' ||| '.join(('SCORE', ' ||| '.join(reference_list), hypothesis_str))
        self.meteor_p.stdin.write(score_line.encode(encoding='utf-8'))
        self.meteor_p.stdin.write('\n'.encode(encoding='utf-8'))
        self.meteor_p.stdin.flush()
        return self.meteor_p.stdout.readline().strip()

But then a different error comes up this time:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "c:\users\derp\desktop\nlgeval\nlgeval\__init__.py", line 34, in compute_metrics
    score, scores = scorer.compute_score(refs, hyps)
  File "c:\users\derp\desktop\nlgeval\nlgeval\pycocoevalcap\meteor\meteor.py", line 42, in compute_score
    stat = self._stat(res[i][0], gts[i])
  File "c:\users\derp\desktop\nlgeval\nlgeval\pycocoevalcap\meteor\meteor.py", line 63, in _stat
    self.meteor_p.stdin.flush()
OSError: [Errno 22] Invalid argument

Any insights to these problems ?

'word2vec' object has no attribute min_alpha_yet_reached

Hi, I have encountered this issue when I call:
metrics_dict = compute_metrics(hypothesis=reply, references=[answer])
The first 8 metrics are no problem. This problem occurred when calculating EmbeddingAverageCosineSimilarity.

Unable to run

I downloaded the entire code from GitHub
and wanted to try NLG using the Anaconda3 PowerShell on Win10.

Here are the steps I ran:
conda create -n nlgeval python=3.7
conda activate nlgeval to activate it.

pip install .
to install via the setup.py inside nlg-eval-master,
then I cd into the bin folder and enter
python bin\nlg-eval --setup

Next I want to test
nlg-eval --hypothesis=examples/hyp.txt --references=examples/ref1.txt --references=examples/ref2.txt
but it fails with: "The 'nlg-eval' program cannot be run: no application is associated with the specified file for this operation."
Where did I go wrong?


Tokenize the sentences

I got the sense from the readme file that the references and hypothesis should all be strings.

But shouldn't they be tokenized before being used as input to the overlap-based scorers, as is done in coco-caption/pycocoevalcap/eval.py? Or has tokenization been done somewhere that I didn't find?

compute_individual_metrics docs don't match function

Your README says:


Within a script: for only one sentence

from nlgeval import compute_individual_metrics
metrics_dict = compute_individual_metrics(references, hypothesis)

where references is a list of ground truth reference text strings and
hypothesis is the hypothesis text string.


However, references can't be a list, as you have this assertion:

assert isinstance(ref, str)

call meteor

Hi, I want to calculate meteor.

windows 10
java 1.8.0_17
python 3.6.4

code like this:

from __future__ import unicode_literals
import os
import nlgeval

root_dir = os.path.join(os.path.dirname(__file__))
hypothesis = os.path.join(root_dir, 'examples/hyp.txt')
references = os.path.join(root_dir, 'examples/ref1.txt'), os.path.join(root_dir, 'examples/ref2.txt')
scores = nlgeval.compute_metrics(hypothesis, references)

File "C:\Users\214A\Desktop\nlg-eval-master\nlgeval\pycocoevalcap\meteor\meteor.py", line 68, in compute_score
scores.append(float(dec(self.meteor_p.stdout.readline().strip())))
ValueError: could not convert string to float:

Thanks!

Incorrect output when setting the metrics_to_omit parameter

Hi,
it seems that there is a bug in the load_scorers method of the NLGEval class. For example, when running the following code,

from nlgeval import NLGEval

nlgeval = NLGEval(no_skipthoughts=True, no_glove=True, metrics_to_omit=['Bleu_1', 'Bleu_2', 'Bleu_3'])  # loads the models
metrics_dict = nlgeval.compute_metrics([references], hypothesis)
print(metrics_dict)

it gives the wrong results (Bleu_4 isn't printed):

{'METEOR': 0.2191196041010623, 'ROUGE_L': 0.46546672221759094, 'CIDEr': 3.10829766113145}

So, is this a real bug or did I miss something?

maybe something wrong with nlg-eval --setup

I ran the command nlg-eval --setup and it turned out like this:

Traceback (most recent call last):
  File "/home/xxx/anaconda3/bin/nlg-eval", line 214, in <module>
    setup()
  File "/home/xxx/anaconda3/lib/python3.7/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/home/xxx/anaconda3/lib/python3.7/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/home/xxx/anaconda3/lib/python3.7/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/xxx/anaconda3/lib/python3.7/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/home/xxx/anaconda3/bin/nlg-eval", line 157, in setup
    with ZipFile(os.path.join(data_path, 'glove.6B.zip')) as z:
  File "/home/xxx/anaconda3/lib/python3.7/zipfile.py", line 1222, in __init__
    self._RealGetContents()
  File "/home/xxx/anaconda3/lib/python3.7/zipfile.py", line 1289, in _RealGetContents
    raise BadZipFile("File is not a zip file")
zipfile.BadZipFile: File is not a zip file

I checked the file glove.6B.zip in .cache and found it is damaged. After replacing it with the file downloaded from the official website, the command works well. So maybe there is something wrong with the download part; I am not sure.

Inconsistent results between corpus and single-sentence evaluation

The README, and manually running the binary file, result in:

Bleu_1: 0.550000
Bleu_2: 0.428174
Bleu_3: 0.284043
Bleu_4: 0.201143
METEOR: 0.295797
ROUGE_L: 0.522104
CIDEr: 1.242192
SkipThoughtsCosineSimilairty: 0.626149
EmbeddingAverageCosineSimilairty: 0.884690
VectorExtremaCosineSimilarity: 0.568696
GreedyMatchingScore: 0.784205

However, running each sentence using the new model preload method gives (for each of the 2 sentences):

{'CIDEr': 0.0, 'GreedyMatchingScore': 0.697944, 'Bleu_4': 4.939382736523921e-09, 'Bleu_3': 1.609148974162434e-06, 'Bleu_2': 0.18257418581578377, 'Bleu_1': 0.2999999999700001, 'ROUGE_L': 0.36454183266932266, 'METEOR': 0.1556568826170604, 'EmbeddingAverageCosineSimilairty': 0.836663, 'VectorExtremaCosineSimilarity': 0.427065, 'SkipThoughtCS': 0.3743917}

{'CIDEr': 0.0, 'GreedyMatchingScore': 0.870466, 'Bleu_4': 0.35494810555850326, 'Bleu_3': 0.4807498567152745, 'Bleu_2': 0.6666666665962965, 'Bleu_1': 0.7999999999200001, 'ROUGE_L': 0.67966573816155995, 'METEOR': 0.39012536521249613, 'EmbeddingAverageCosineSimilairty': 0.932718, 'VectorExtremaCosineSimilarity': 0.710326, 'SkipThoughtCS': 0.87790722}

Or on average:

  • 'Bleu_1': 0.5499999999450002,
  • 'Bleu_2': 0.4246204262060401,
  • 'Bleu_3': 0.24037573293212433,
  • 'Bleu_4': 0.177474055248943,
  • 'METEOR': 0.27289112391477827,
  • 'ROUGE_L': 0.52210378541544133,
  • 'CIDEr': 0.0
  • 'SkipThoughtCS': 0.62614947557449341
  • 'EmbeddingAverageCosineSimilairty': 0.8846905,
  • 'VectorExtremaCosineSimilarity': 0.5686955,
  • 'GreedyMatchingScore': 0.784205,

For CIDEr, I understand it must be 0 because of the note in the README.
For the rest, I don't understand why there are such fluctuating deviations.

I looked at your scorers' code, and I saw that for most checked scorers above, you are using an average like np.mean, but for METEOR and BLEU I couldn't find any behavior like that.

Here is code for reproduction; I think it could be good to use as test code (obviously not with these assertion values, but with values from compute_metrics on the entire corpus):

from nlgeval import NLGEval
from datetime import datetime
from collections import Counter

startTime = datetime.now()


def passedTime():
    return str(datetime.now() - startTime)


data = [
    ("this is the model generated sentence1 which seems good enough", [
        "this is one reference sentence for sentence1",
        "this is one more reference sentence for sentence1"
    ]),
    ("this is sentence2 which has been generated by your model", [
        "this is a reference sentence for sentence2 which was generated by your model",
        "this is the second reference sentence for sentence2"
    ])
]

if __name__ == "__main__":
    print "Start loading NLG-Eval Model", passedTime()
    nlgeval = NLGEval()  # loads the models
    print "End loading NLG-Eval Model", passedTime()

    metrics = []
    for hyp, refs in data:
        print "Start evaluating a single sentence", passedTime()
        metrics_dict = nlgeval.evaluate(refs, hyp)
        print "End evaluating a single sentence", passedTime()
        print metrics_dict
        metrics.append(metrics_dict)

    total = sum(map(Counter, metrics), Counter())
    N = float(len(metrics))
    final_metrics = {k: round(v / len(metrics), 6) for k, v in total.items()}
    print final_metrics

    assert final_metrics["Bleu_1"] == 0.550000
    assert final_metrics["Bleu_2"] == 0.428174
    assert final_metrics["Bleu_3"] == 0.284043
    assert final_metrics["Bleu_4"] == 0.201143
    assert final_metrics["METEOR"] == 0.295797
    assert final_metrics["ROUGE_L"] == 0.522104
    assert final_metrics["SkipThoughtsCosineSimilairty"] == 0.626149
    assert final_metrics["EmbeddingAverageCosineSimilairty"] == 0.884690
    assert final_metrics["VectorExtremaCosineSimilarity"] == 0.568696
    assert final_metrics["GreedyMatchingScore"] == 0.784205

Question: Varying number of references

I have a corpus with a varying number of references for each candidate.
Is there a way to input the evaluation with that kind of data?

If not, a solution could be to create as many files as the maximum number of references and to duplicate references when I don't have enough.

A better solution, because I don't want to save files all the time, is running

compute_individual_metrics(references, hypothesis)

for every tuple of (hyp, refs) and eventually average them. Is that kind of thing valid? Are you somehow optimizing the batch method such that it is way faster than this? (For example, are you loading the data files every time or just once?)

post-setup crashes if nltk not installed

When installing nlg-eval with pip install nlg-eval@git+https://github.com/Maluuba/nlg-eval.git and if nltk is not installed, the following exception is raised by

nlg-eval/setup.py

Lines 19 to 21 in 5a89faf

def _post_setup():
from nltk.downloader import download
download('punkt')

  Traceback (most recent call last):
    File "<string>", line 1, in <module>
    File "/tmp/pip-install-v3k3eqaa/nlg-eval/setup.py", line 56, in <module>
      'install': PostInstallCommand,
    File "/private/home/louismartin/miniconda3/envs/tmp/lib/python3.7/site-packages/setuptools/__init__.py", line 145, in setup
      return distutils.core.setup(**attrs)
    File "/private/home/louismartin/miniconda3/envs/tmp/lib/python3.7/distutils/core.py", line 148, in setup
      dist.run_commands()
    File "/private/home/louismartin/miniconda3/envs/tmp/lib/python3.7/distutils/dist.py", line 966, in run_commands
      self.run_command(cmd)
    File "/private/home/louismartin/miniconda3/envs/tmp/lib/python3.7/distutils/dist.py", line 985, in run_command
      cmd_obj.run()
    File "/private/home/louismartin/miniconda3/envs/tmp/lib/python3.7/site-packages/wheel/bdist_wheel.py", line 228, in run
      self.run_command('install')
    File "/private/home/louismartin/miniconda3/envs/tmp/lib/python3.7/distutils/cmd.py", line 313, in run_command
      self.distribution.run_command(command)
    File "/private/home/louismartin/miniconda3/envs/tmp/lib/python3.7/distutils/dist.py", line 985, in run_command
      cmd_obj.run()
    File "/tmp/pip-install-v3k3eqaa/nlg-eval/setup.py", line 34, in run
      _post_setup()
    File "/tmp/pip-install-v3k3eqaa/nlg-eval/setup.py", line 20, in _post_setup
      from nltk.downloader import download
  ModuleNotFoundError: No module named 'nltk'
  ----------------------------------------
  ERROR: Failed building wheel for nlg-eval

This error prevents the automatic install of nlg-eval as a requirement in https://github.com/facebookresearch/text-simplification-evaluation/blob/21308789a173762af000ae9d8afedb356e0ae548/setup.py#L42

What would be the easiest way to fix it?

bug(unicode): utf-8 encoded strings throw errors

Given the following hyp:

  • uruguay where the leader is raúl fernando sendic rodríguez alfredo zitarrosa died in montevideo montevideo , uruguay montevideo where the leader is daniel martínez ( politician ) uruguay and the language spoken is spanish .

And following refs:

  • Alfredo Zitarrosa died in Montevideo, Uruguay. Daniel Martinez is a political leader in Montevideo, and Raul Fernando Sendic Rodriguez is a leader in Uruguay, where Spanish is spoken.
  • Alfredo Zitarrosa died in Montevideo, the leader of which, is Daniel Martinez. Montevideo is in Uruguay, where Spanish is the language and where Raúl Fernando Sendic Rodríguez is the leader.
  • Raúl Fernando Sendic Rodríguez is the leader of Spanish speaking, Uruguay. Daniel Martinez is the leader of Montevideo, the final resting place of Alfredo Zitarrosa.

corpus check fails.

Meh solution: map each string to ASCII; ignoring the unicode characters shouldn't hurt too much, but it can:

def sentence_unicode(s):
    return str(''.join([i if ord(i) < 128 else 'X' for i in s]))

Type of BLEU

Hi,
Thanks for providing the code.
I want to confirm if the BLEU-4 score is sentence level BLEU or corpus level BLEU?
Thanks again.

METEOR for languages other than English

If I provide a paraphrase file for some target language other than English, is it possible to run METEOR evaluation on my new target language with nlg-eval?
If so, how? By simply replacing the data in METEOR submodule?
Thank you!

ModuleNotFoundError: No module named 'nlgeval.word2vec.glove2word2vec'

There's an error when running nlg-eval --setup

  File "/private/home/louismartin/miniconda3/lib/python3.6/site-packages/nlgeval/word2vec/generate_w2v_files.py", line 11, in <module>
    from nlgeval.word2vec.glove2word2vec import glove2word2vec
ModuleNotFoundError: No module named 'nlgeval.word2vec.glove2word2vec'

Culprit line:

from nlgeval.word2vec.glove2word2vec import glove2word2vec

Python 3 compatibility issues

flake8 testing of https://github.com/Maluuba/nlg-eval on Python 3.6.3

$ flake8 . --count --select=E901,E999,F821,F822,F823 --show-source --statistics

./nlgeval/skipthoughts/dataset_handler.py:26:45: E999 SyntaxError: invalid syntax
    print 'Computing skip-thought vectors...'
                                            ^
./nlgeval/skipthoughts/eval_classification.py:77:24: E999 SyntaxError: invalid syntax
        print scanscores
                       ^
./nlgeval/skipthoughts/eval_msrp.py:20:29: E999 SyntaxError: invalid syntax
    print 'Preparing data...'
                            ^
./nlgeval/skipthoughts/eval_rank.py:226:28: E999 SyntaxError: invalid syntax
    print "Loading model..."
                           ^
./nlgeval/skipthoughts/eval_sick.py:20:29: E999 SyntaxError: invalid syntax
    print 'Preparing data...'
                            ^
./nlgeval/skipthoughts/eval_trec.py:17:29: E999 SyntaxError: invalid syntax
    print 'Preparing data...'
                            ^
./nlgeval/skipthoughts/decoding/model.py:98:30: E999 SyntaxError: invalid syntax
    print 'Building f_init...',
                             ^
./nlgeval/skipthoughts/decoding/search.py:30:15: F821 undefined name 'xrange'
    for ii in xrange(maxlen):
              ^
./nlgeval/skipthoughts/decoding/search.py:75:24: F821 undefined name 'xrange'
            for idx in xrange(len(new_hyp_samples)):
                       ^
./nlgeval/skipthoughts/decoding/search.py:99:24: F821 undefined name 'xrange'
            for idx in xrange(live_k):
                       ^
./nlgeval/skipthoughts/decoding/tools.py:28:33: E999 SyntaxError: invalid syntax
    print 'Loading dictionary...'
                                ^
./nlgeval/skipthoughts/decoding/train.py:74:23: E999 SyntaxError: invalid syntax
    print model_options
                      ^
./nlgeval/skipthoughts/decoding/utils.py:55:13: F821 undefined name 'warnings'
            warnings.warn('%s is not in the archive'%kk)
            ^
./nlgeval/skipthoughts/training/tools.py:35:33: E999 SyntaxError: invalid syntax
    print 'Loading dictionary...'
                                ^
./nlgeval/skipthoughts/training/train.py:64:23: E999 SyntaxError: invalid syntax
    print model_options
                      ^
./nlgeval/skipthoughts/training/utils.py:55:13: F821 undefined name 'warnings'
            warnings.warn('%s is not in the archive'%kk)
            ^
11    E999 SyntaxError: invalid syntax
5     F821 undefined name 'xrange'
16

please add Word Movers Distance with ELMo Embeddings

It would be nice to add this metric because Word Mover's Distance uses information geometry and ELMo uses contextual embeddings. I'm not familiar with all of your metrics, so please correct me if there is an existing metric similar to this.

I try to load the file but get an error. Does anyone know why?

error: /home/yuanlixiong/Downloads/data/uni_skip.npz

# this is my code.
from nlgeval import NLGEval
nlgeval = NLGEval()  # loads the models
metrics_dict = nlgeval.compute_metrics(ref=["this is a test",
                                           "this is also a test"],
                                      hyp="this is a good test")

error:

---------------------------------------------------------------------------
BadZipFile                                Traceback (most recent call last)
<ipython-input-1-9a25fc0d0af4> in <module>
      1 from nlgeval import NLGEval
----> 2 nlgeval = NLGEval()  # loads the models
      3 metrics_dict = nlgeval.compute_metrics(ref=["this is a test",
      4                                            "this is also a test"],
      5                                       hyp="this is a good test")

~/miniconda3/envs/py36/lib/python3.6/site-packages/nlgeval/__init__.py in __init__(self, no_overlap, no_skipthoughts, no_glove, metrics_to_omit)
    190         self.no_skipthoughts = no_skipthoughts or 'SkipThoughtCS' in self.metrics_to_omit
    191         if not self.no_skipthoughts:
--> 192             self.load_skipthought_model()
    193 
    194         self.no_glove = no_glove or len(self.glove_metrics - self.metrics_to_omit) == 0

~/miniconda3/envs/py36/lib/python3.6/site-packages/nlgeval/__init__.py in load_skipthought_model(self)
    224         self.cosine_similarity = cosine_similarity
    225 
--> 226         model = skipthoughts.load_model()
    227         self.skipthought_encoder = skipthoughts.Encoder(model)
    228 

~/miniconda3/envs/py36/lib/python3.6/site-packages/nlgeval/skipthoughts/skipthoughts.py in load_model()
     43     # Load parameters
     44     uparams = init_params(uoptions)
---> 45     uparams = load_params(path_to_umodel, uparams)
     46     utparams = init_tparams(uparams)
     47     bparams = init_params_bi(boptions)

~/miniconda3/envs/py36/lib/python3.6/site-packages/nlgeval/skipthoughts/skipthoughts.py in load_params(path, params)
    251     """
    252     print('error:', path)
--> 253     pp = numpy.load(path)
    254     for kk, vv in six.iteritems(params):
    255         if kk not in pp:

~/miniconda3/envs/py36/lib/python3.6/site-packages/numpy/lib/npyio.py in load(file, mmap_mode, allow_pickle, fix_imports, encoding)
    436             # Transfer file ownership to NpzFile
    437             ret = NpzFile(fid, own_fid=own_fid, allow_pickle=allow_pickle,
--> 438                           pickle_kwargs=pickle_kwargs)
    439             own_fid = False
    440             return ret

~/miniconda3/envs/py36/lib/python3.6/site-packages/numpy/lib/npyio.py in __init__(self, fid, own_fid, allow_pickle, pickle_kwargs)
    191         # Import is postponed to here since zipfile depends on gzip, an
    192         # optional component of the so-called standard library.
--> 193         _zip = zipfile_factory(fid)
    194         self._files = _zip.namelist()
    195         self.files = []

~/miniconda3/envs/py36/lib/python3.6/site-packages/numpy/lib/npyio.py in zipfile_factory(file, *args, **kwargs)
    117     import zipfile
    118     kwargs['allowZip64'] = True
--> 119     return zipfile.ZipFile(file, *args, **kwargs)
    120 
    121 

~/miniconda3/envs/py36/lib/python3.6/zipfile.py in __init__(self, file, mode, compression, allowZip64)
   1129         try:
   1130             if mode == 'r':
-> 1131                 self._RealGetContents()
   1132             elif mode in ('w', 'x'):
   1133                 # set the modified flag so central directory gets written

~/miniconda3/envs/py36/lib/python3.6/zipfile.py in _RealGetContents(self)
   1196             raise BadZipFile("File is not a zip file")
   1197         if not endrec:
-> 1198             raise BadZipFile("File is not a zip file")
   1199         if self.debug > 1:
   1200             print(endrec)

BadZipFile: File is not a zip file

OSError: [Errno 22] Invalid argument

I am running the basic functional API as in README. And I am running into the following error:

>>> from nlgeval import compute_metrics
>>> metrics_dict = compute_metrics(hypothesis='examples/hyp.txt',references=['examples/ref1.txt','examples/ref2.txt'])
Bleu_1: 0.550000
Bleu_2: 0.428174
Bleu_3: 0.284043
Bleu_4: 0.201143
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "D:\Back Up\Desktop\Setiment Analysis\nlg-eval\nlgeval\__init__.py", line 40, in compute_metrics
    score, scores = scorer.compute_score(refs, hyps)
  File "D:\Back Up\Desktop\Setiment Analysis\nlg-eval\nlgeval\pycocoevalcap\meteor\meteor.py", line 62, in compute_score
    stat = self._stat(res[i][0], gts[i])
  File "D:\Back Up\Desktop\Setiment Analysis\nlg-eval\nlgeval\pycocoevalcap\meteor\meteor.py", line 82, in _stat
    self.meteor_p.stdin.flush()
OSError: [Errno 22] Invalid argument

I thought this issue might be related to Issue #8, so I uninstalled scikit-learn using pip and reinstalled it using conda, but the issue still persists. I have all other prerequisites installed (java version "1.8.0_201"). I am using Anaconda on Win10 and Python 3.6.8. I'll post my conda list output just in case.

# Name                    Version                   Build  Channel
absl-py                   0.7.0                     <pip>
astor                     0.7.1                     <pip>
blas                      1.0                         mkl
bleach                    1.5.0                     <pip>
boto                      2.49.0                    <pip>
boto3                     1.9.86                    <pip>
botocore                  1.12.86                   <pip>
bz2file                   0.98                      <pip>
certifi                   2018.11.29               py36_0
chardet                   3.0.4                     <pip>
Click                     7.0                       <pip>
docutils                  0.14                      <pip>
gast                      0.2.2                     <pip>
gensim                    3.7.0                     <pip>
grpcio                    1.18.0                    <pip>
html5lib                  0.9999999                 <pip>
icc_rt                    2019.0.0             h0cc432a_1
idna                      2.8                       <pip>
intel-openmp              2019.1                      144
jmespath                  0.9.3                     <pip>
Markdown                  3.0.1                     <pip>
mkl                       2019.1                      144
mkl_fft                   1.0.10           py36h14836fe_0
mkl_random                1.0.2            py36h343c172_0
nlg-eval                  2.1                       <pip>
nltk                      3.4                       <pip>
numpy                     1.15.4           py36h19fb1c0_0
numpy                     1.16.0                    <pip>
numpy-base                1.15.4           py36hc3f5095_0
pandas                    0.24.0                    <pip>
pip                       18.1                     py36_0
protobuf                  3.6.1                     <pip>
python                    3.6.8                h9f7ef89_0
python-dateutil           2.7.5                     <pip>
pytz                      2018.9                    <pip>
requests                  2.21.0                    <pip>
s3transfer                0.1.13                    <pip>
scikit-learn              0.20.2           py36h343c172_0
scipy                     1.2.0                     <pip>
scipy                     1.2.0            py36h29ff71c_0
setuptools                40.7.0                    <pip>
setuptools                40.6.3                   py36_0
singledispatch            3.4.0.3                   <pip>
six                       1.12.0                    <pip>
smart-open                1.8.0                     <pip>
sqlite                    3.26.0               he774522_0
tensorboard               1.8.0                     <pip>
tensorflow-gpu            1.8.0                     <pip>
termcolor                 1.1.0                     <pip>
Theano                    1.0.4                     <pip>
tqdm                      4.30.0                    <pip>
urllib3                   1.24.1                    <pip>
vc                        14.1                 h0510ff6_4
vs2015_runtime            14.15.26706          h3a45250_0
Werkzeug                  0.14.1                    <pip>
wheel                     0.32.3                   py36_0
wheel                     0.32.3                    <pip>
wincertstore              0.2              py36h7fe50ca_0

Earlier I got it to work it with Python 3.5 ( Issue #21 ). Where am I going wrong ?

Can't get the metric running

Hi, I'm using Java 1.8.0 and Python 2.7.
After I got the setup done, I ran the test code you gave,
but I got the following:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "nlgeval/__init__.py", line 135, in __init__
    self.load_scorers()
  File "nlgeval/__init__.py", line 148, in load_scorers
    (Meteor(), "METEOR"),
  File "nlgeval/pycocoevalcap/meteor/meteor.py", line 24, in __init__
    stderr=subprocess.PIPE)
  File "/usr/lib/python2.7/subprocess.py", line 711, in __init__
    errread, errwrite)
  File "/usr/lib/python2.7/subprocess.py", line 1343, in _execute_child
    raise child_exception
OSError: [Errno 2] No such file or directory

It seems something doesn't exist.
I did interrupt setup.sh once and restarted it.
Maybe that's the problem?

ROUGE 1 / ROUGE 2

How can I run nlg-eval to also produce ROUGE 1 and ROUGE 2 scores?

AttributeError: 'Word2Vec' object has no attribute 'vocabulary'

gensim 3.5.0

nlg-eval --hypothesis=examples/hyp.txt --references=examples/ref1.txt --references=examples/ref2.txt

Bleu_1: 0.550000
Bleu_2: 0.428174
Bleu_3: 0.284043
Bleu_4: 0.201143
METEOR: 0.295797
ROUGE_L: 0.522104
CIDEr: 1.242192
SkipThoughtsCosineSimilairty: 0.626149
Traceback (most recent call last):
  File "/data1/nlg_eval_vm/nlgconda/bin/nlg-eval", line 6, in <module>
    exec(compile(open(__file__).read(), __file__, 'exec'))
  File "/data1/nlg_eval_vm/nlg-eval/bin/nlg-eval", line 169, in <module>
    compute_metrics()
  File "/data1/nlg_eval_vm/nlgconda/lib/python3.6/site-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/data1/nlg_eval_vm/nlgconda/lib/python3.6/site-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/data1/nlg_eval_vm/nlgconda/lib/python3.6/site-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/data1/nlg_eval_vm/nlgconda/lib/python3.6/site-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/data1/nlg_eval_vm/nlg-eval/bin/nlg-eval", line 165, in compute_metrics
    nlgeval.compute_metrics(hypothesis, references, no_overlap, no_skipthoughts, no_glove)
  File "/data1/nlg_eval_vm/nlg-eval/nlgeval/__init__.py", line 73, in compute_metrics
    scores = eval_emb_metrics(glove_hyps, glove_refs)
  File "/data1/nlg_eval_vm/nlg-eval/nlgeval/word2vec/evaluate.py", line 46, in eval_emb_metrics
    emb = Embedding()
  File "/data1/nlg_eval_vm/nlg-eval/nlgeval/word2vec/evaluate.py", line 14, in __init__
    self.m = KeyedVectors.load(os.path.join(path, 'glove.6B.300d.model.bin'), mmap='r')
  File "/data1/nlg_eval_vm/nlgconda/lib/python3.6/site-packages/gensim/models/keyedvectors.py", line 211, in load
    return super(BaseKeyedVectors, cls).load(fname_or_handle, **kwargs)
  File "/data1/nlg_eval_vm/nlgconda/lib/python3.6/site-packages/gensim/utils.py", line 420, in load
    obj._load_specials(fname, mmap, compress, subname)
  File "/data1/nlg_eval_vm/nlgconda/lib/python3.6/site-packages/gensim/utils.py", line 485, in _load_specials
    setattr(self, attrib, None)
  File "/data1/nlg_eval_vm/nlgconda/lib/python3.6/site-packages/gensim/utils.py", line 1419, in new_func1
    return func(*args, **kwargs)
  File "/data1/nlg_eval_vm/nlgconda/lib/python3.6/site-packages/gensim/models/base_any2vec.py", line 745, in cum_table
    self.vocabulary.cum_table = value
AttributeError: 'Word2Vec' object has no attribute 'vocabulary'

I also tried using gensim 3.4.0 but same error.

ValueError when the number of references and hypothesis is large

The code below works fine when the number of sentences in references and hypothesis is small (below about 1000), but raises a ValueError when the size increases.

from nlgeval import NLGEval
nlgeval = NLGEval()
scores = nlgeval.compute_metrics(ref_list=refff, hyp_list=hyps)

Had the error:
ValueError Traceback (most recent call last)
in ()
----> 1 scores = nlgeval.compute_metrics(ref_list=refff, hyp_list=hyps,)

~/project/sentenceGen/nlg-eval/nlgeval/__init__.py in compute_metrics(self, ref_list, hyp_list)
285 if not self.no_overlap:
286 for scorer, method in self.scorers:
--> 287 score, scores = scorer.compute_score(refs, hyps)
288 if isinstance(method, list):
289 for sc, scs, m in zip(score, scores, method):

~/project/sentenceGen/nlg-eval/nlgeval/pycocoevalcap/meteor/meteor.py in compute_score(self, gts, res)
66 self.meteor_p.stdin.flush()
67 for i in range(0, len(imgIds)):
---> 68 scores.append(float(dec(self.meteor_p.stdout.readline().strip())))
69 score = float(dec(self.meteor_p.stdout.readline()).strip())
70
ValueError: could not convert string to float: '11.0 10.0 5.0 5.0 3.0 3.0 2.0 2.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 3.0 5.0 5.0'

nlg-eval command not found

I am running the code on a Windows machine. I have executed "pip install -e ." and all packages were installed successfully. However, when I run the next step, "nlg-eval --setup", it throws a "command not found" error.


UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 6148: character maps to <undefined>

Hi, I've got UnicodeDecodeError when running 'nlg-eval --setup':

Downloading http://nlp.stanford.edu/data/glove.6B.zip to nlgeval/data.
Downloading https://raw.githubusercontent.com/robmsmt/glove-gensim/dea5e55f449794567f12c79dc12b7f75339b18ba/glove2word2vec.py to nlgeval/word2vec.
Downloading http://www.cs.toronto.edu/~rkiros/models/dictionary.txt to nlgeval/data.
Downloading http://www.cs.toronto.edu/~rkiros/models/utable.npy to nlgeval/data.
glove2word2vec.py: 100%|██████████████████████████████████████████████████████████████| 1.00/1.00 [00:00<?, ? chunks/s]
Downloading http://www.cs.toronto.edu/~rkiros/models/btable.npy to nlgeval/data.
dictionary.txt: 550 chunks [00:02, 208 chunks/s] | 0.00/823 [00:00<?, ? chunks/s]
Downloading http://www.cs.toronto.edu/~rkiros/models/uni_skip.npz to nlgeval/data.
glove.6B.zip: 100%|██████████████████████████████████████████████████████████████| 823/823 [01:18<00:00, 10.5 chunks/s]
Downloading http://www.cs.toronto.edu/~rkiros/models/uni_skip.npz.pkl to nlgeval/data.
uni_skip.npz.pkl: 100%|███████████████████████████████████████████████████████████████| 1.00/1.00 [00:00<?, ? chunks/s]
Downloading http://www.cs.toronto.edu/~rkiros/models/bi_skip.npz to nlgeval/data.
bi_skip.npz: 100%|███████████████████████████████████████████████████████████████| 276/276 [07:11<00:00, 1.56s/ chunks]
btable.npy: 17%|██████████▏ | 369/2.23k [08:30<50:15, 1.62s/ chunks]Downloading http://www.cs.toronto.edu/~rkiros/models/bi_skip.npz.pkl to nlgeval/data.
bi_skip.npz.pkl: 100%|████████████████████████████████████████████████████████████████| 1.00/1.00 [00:00<?, ? chunks/s]
Downloading https://raw.githubusercontent.com/moses-smt/mosesdecoder/b199e654df2a26ea58f234cbb642e89d9c1f269d/scripts/generic/multi-bleu.perl to nlgeval/multibleu.
multi-bleu.perl: 100%|█████████████████████████████████████████████████████████| 1.00/1.00 [00:00<00:00, 32.0 chunks/s]
uni_skip.npz: 100%|██████████████████████████████████████████████████████████████| 634/634 [12:14<00:00, 1.16s/ chunks]
btable.npy: 100%|████████████████████████████████████████████████████████████| 2.23k/2.23k [37:41<00:00, 1.16 chunks/s]
utable.npy: 100%|████████████████████████████████████████████████████████████| 2.23k/2.23k [39:10<00:00, 1.05s/ chunks]
C:\Users\XuluY\AppData\Local\Continuum\anaconda3\envs\tf-gpu\lib\site-packages\gensim\utils.py:1197: UserWarning: detected Windows; aliasing chunkize to chunkize_serial
warnings.warn("detected Windows; aliasing chunkize to chunkize_serial")
2019-01-21 18:06:09,764 : MainThread : INFO : 400000 lines with 300 dimensions
Traceback (most recent call last):
  File "nlg-eval.py", line 169, in <module>
    compute_metrics()
  File "C:\Users\XuluY\AppData\Local\Continuum\anaconda3\envs\tf-gpu\lib\site-packages\click\core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "C:\Users\XuluY\AppData\Local\Continuum\anaconda3\envs\tf-gpu\lib\site-packages\click\core.py", line 696, in main
    with self.make_context(prog_name, args, **extra) as ctx:
  File "C:\Users\XuluY\AppData\Local\Continuum\anaconda3\envs\tf-gpu\lib\site-packages\click\core.py", line 621, in make_context
    self.parse_args(ctx, args)
  File "C:\Users\XuluY\AppData\Local\Continuum\anaconda3\envs\tf-gpu\lib\site-packages\click\core.py", line 880, in parse_args
    value, args = param.handle_parse_result(ctx, opts, args)
  File "C:\Users\XuluY\AppData\Local\Continuum\anaconda3\envs\tf-gpu\lib\site-packages\click\core.py", line 1404, in handle_parse_result
    self.callback, ctx, self, value)
  File "C:\Users\XuluY\AppData\Local\Continuum\anaconda3\envs\tf-gpu\lib\site-packages\click\core.py", line 78, in invoke_param_callback
    return callback(ctx, param, value)
  File "nlg-eval.py", line 141, in setup
    generate()
  File "C:\Users\XuluY\nlg-eval-master\nlgeval\word2vec\generate_w2v_files.py", line 26, in generate
    txt2bin(glove2word2vec(glove_vector_file, output_model_file))
  File "C:\Users\XuluY\nlg-eval-master\nlgeval\word2vec\glove2word2vec.py", line 57, in glove2word2vec
    model_file = prepend_line(glove_vector_file, output_model_file, gensim_first_line)
  File "C:\Users\XuluY\nlg-eval-master\nlgeval\word2vec\glove2word2vec.py", line 48, in prepend_line
    for line in old:
  File "C:\Users\XuluY\AppData\Local\Continuum\anaconda3\envs\tf-gpu\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 6148: character maps to <undefined>

Do you know how to fix this issue?

Thanks and best regards!

IOError: [Errno 22] Invalid argument

Confirmed on both Windows and Mac.

Running the following (really simple) code:

from nlgeval import compute_metrics

if __name__ == "__main__":
    compute_metrics("examples/hyp.txt", [
        "examples/ref1.txt",
        "examples/ref2.txt",
        "examples/ref2.txt"
    ])

On your example files, with no change, results in:

Bleu_1: 0.550000
Bleu_2: 0.428174
Bleu_3: 0.284043
Bleu_4: 0.201143
Traceback (most recent call last):
  File "main.py", line 9, in <module>
    abspath("examples/ref2.txt")
  File "nlgeval\nlgeval\__init__.py", line 30, in compute_metrics
    score, scores = scorer.compute_score(refs, hyps)
  File "nlgeval\nlgeval\pycocoevalcap\meteor\meteor.py", line 37, in compute_score
    stat = self._stat(res[i][0], gts[i])
  File "nlgeval\nlgeval\pycocoevalcap\meteor\meteor.py", line 55, in _stat
    self.meteor_p.stdin.write('{}\n'.format(score_line))
IOError: [Errno 22] Invalid argument

Environment:

java version "1.8.0_131"
Java(TM) SE Runtime Environment (build 1.8.0_131-b11)
Java HotSpot(TM) 64-Bit Server VM (build 25.131-b11, mixed mode)

EOFError: Ran out of input

metrics_dict = compute_individual_metrics("I hate you", "I love you")

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/scratchd/home/anand/KVQG/baselines/eval/nlg-eval/nlgeval/__init__.py", line 117, in compute_individual_metrics
    model = skipthoughts.load_model()
  File "/scratchd/home/anand/KVQG/baselines/eval/nlg-eval/nlgeval/skipthoughts/skipthoughts.py", line 59, in load_model
    utable, btable = load_tables()
  File "/scratchd/home/anand/KVQG/baselines/eval/nlg-eval/nlgeval/skipthoughts/skipthoughts.py", line 80, in load_tables
    btable = numpy.load(os.path.join(path_to_tables, 'btable.npy'), encoding='bytes')
  File "/home/anand/.local/lib/python3.5/site-packages/numpy/lib/npyio.py", line 421, in load
    pickle_kwargs=pickle_kwargs)
  File "/home/anand/.local/lib/python3.5/site-packages/numpy/lib/format.py", line 647, in read_array
    array = pickle.load(fp, **pickle_kwargs)
EOFError: Ran out of input

_pickle.UnpicklingError: pickle data was truncated

python3.6

from nlgeval import NLGEval
nlgeval = NLGEval()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/lkk/code/caption_v1/nlgeval/nlgeval/__init__.py", line 192, in __init__
    self.load_skipthought_model()
  File "/home/lkk/code/caption_v1/nlgeval/nlgeval/__init__.py", line 226, in load_skipthought_model
    model = skipthoughts.load_model()
  File "/home/lkk/code/caption_v1/nlgeval/nlgeval/skipthoughts/skipthoughts.py", line 59, in load_model
    utable, btable = load_tables()
  File "/home/lkk/code/caption_v1/nlgeval/nlgeval/skipthoughts/skipthoughts.py", line 79, in load_tables
    utable = numpy.load(os.path.join(path_to_tables, 'utable.npy'), allow_pickle=True, encoding='bytes')
  File "/home/lkk/anaconda3/lib/python3.6/site-packages/numpy/lib/npyio.py", line 433, in load
    pickle_kwargs=pickle_kwargs)
  File "/home/lkk/anaconda3/lib/python3.6/site-packages/numpy/lib/format.py", line 657, in read_array
    array = pickle.load(fp, **pickle_kwargs)
_pickle.UnpicklingError: pickle data was truncated

Corpus Bleu evaluation when number of references and hypothesis are large and code runs out of memory

Hi, thanks for the nice toolkit! I have a question regarding evaluating a list of hypothesis sentences (5,000 in total) against a reference corpus which contains 170,000 sentences. If I try to use compute_metrics, it gives AssertionError: assert len(refs) == len(hyps), which is due to the different number of items in each file. As I was reading through the documentation, would it be reasonable to use compute_individual_metrics for each sentence in the list of hypotheses against the entire reference corpus, retrieve the scores, and average them over all 5,000 hypothesis sentences? If so, what could be an efficient and fast way to do it? Thanks a lot!

OSError: 120000000 requested and 66581472 written

While executing nlg-eval --setup, an OSError occurred.

The error message is as follows:

Traceback (most recent call last):
  File "/home/shinoda/.pyenv/versions/3.6.8/lib/python3.6/site-packages/gensim/utils.py", line 692, in save
    _pickle.dump(self, fname_or_handle, protocol=pickle_protocol)
TypeError: file must have a 'write' attribute

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/shinoda/.pyenv/versions/3.6.8/bin/nlg-eval", line 7, in <module>
    exec(compile(f.read(), __file__, 'exec'))
  File "/home/shinoda/app/nlg-eval/bin/nlg-eval", line 169, in <module>
    compute_metrics()
  File "/home/shinoda/.pyenv/versions/3.6.8/lib/python3.6/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/home/shinoda/.pyenv/versions/3.6.8/lib/python3.6/site-packages/click/core.py", line 716, in main
    with self.make_context(prog_name, args, **extra) as ctx:
  File "/home/shinoda/.pyenv/versions/3.6.8/lib/python3.6/site-packages/click/core.py", line 641, in make_context
    self.parse_args(ctx, args)
  File "/home/shinoda/.pyenv/versions/3.6.8/lib/python3.6/site-packages/click/core.py", line 940, in parse_args
    value, args = param.handle_parse_result(ctx, opts, args)
  File "/home/shinoda/.pyenv/versions/3.6.8/lib/python3.6/site-packages/click/core.py", line 1477, in handle_parse_result
    self.callback, ctx, self, value)
  File "/home/shinoda/.pyenv/versions/3.6.8/lib/python3.6/site-packages/click/core.py", line 96, in invoke_param_callback
    return callback(ctx, param, value)
  File "/home/shinoda/app/nlg-eval/bin/nlg-eval", line 141, in setup
    generate()
  File "/home/shinoda/app/nlg-eval/nlgeval/word2vec/generate_w2v_files.py", line 26, in generate
    txt2bin(glove2word2vec(glove_vector_file, output_model_file))
  File "/home/shinoda/app/nlg-eval/nlgeval/word2vec/generate_w2v_files.py", line 17, in txt2bin
    m.save(filename.replace('txt', 'bin'), separately=None)
  File "/home/shinoda/.pyenv/versions/3.6.8/lib/python3.6/site-packages/gensim/models/keyedvectors.py", line 435, in save
    super(WordEmbeddingsKeyedVectors, self).save(*args, **kwargs)
  File "/home/shinoda/.pyenv/versions/3.6.8/lib/python3.6/site-packages/gensim/models/keyedvectors.py", line 223, in save
    super(BaseKeyedVectors, self).save(fname_or_handle, **kwargs)
  File "/home/shinoda/.pyenv/versions/3.6.8/lib/python3.6/site-packages/gensim/utils.py", line 695, in save
    self._smart_save(fname_or_handle, separately, sep_limit, ignore, pickle_protocol=pickle_protocol)
  File "/home/shinoda/.pyenv/versions/3.6.8/lib/python3.6/site-packages/gensim/utils.py", line 547, in _smart_save
    compress, subname)
  File "/home/shinoda/.pyenv/versions/3.6.8/lib/python3.6/site-packages/gensim/utils.py", line 621, in _save_specials
    np.save(subname(fname, attrib), np.ascontiguousarray(val))
  File "/home/shinoda/.pyenv/versions/3.6.8/lib/python3.6/site-packages/numpy/lib/npyio.py", line 536, in save
    pickle_kwargs=pickle_kwargs)
  File "/home/shinoda/.pyenv/versions/3.6.8/lib/python3.6/site-packages/numpy/lib/format.py", line 644, in write_array
    array.tofile(fp)
OSError: 120000000 requested and 66581472 written

How can I solve this problem?

Thanks!

ValueError in meteor.py when running the example

Hi, I'm getting the following error upon running the CLI example:

$ nlg-eval --hypothesis=examples/hyp.txt --references=examples/ref1.txt --references=examples/ref2.txt
Bleu_1: 0.550000
Bleu_2: 0.428174
Bleu_3: 0.284043
Bleu_4: 0.201143
Traceback (most recent call last):
  File "/home/dan/anaconda3/bin/nlg-eval", line 6, in <module>
    exec(compile(open(__file__).read(), __file__, 'exec'))
  File "/media/dan/hdd/projects/nlg-eval/bin/nlg-eval", line 20, in <module>
    compute_metrics()
  File "/home/dan/anaconda3/lib/python3.6/site-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/home/dan/anaconda3/lib/python3.6/site-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/home/dan/anaconda3/lib/python3.6/site-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/dan/anaconda3/lib/python3.6/site-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/media/dan/hdd/projects/nlg-eval/bin/nlg-eval", line 16, in compute_metrics
    nlgeval.compute_metrics(hypothesis, references, no_overlap, no_skipthoughts, no_glove)
  File "/media/dan/hdd/projects/nlg-eval/nlgeval/__init__.py", line 34, in compute_metrics
    score, scores = scorer.compute_score(refs, hyps)
  File "/media/dan/hdd/projects/nlg-eval/nlgeval/pycocoevalcap/meteor/meteor.py", line 50, in compute_score
    scores.append(float(dec(self.meteor_p.stdout.readline().strip())))
ValueError: could not convert string to float: 

I checked the string in question, and it seems that self.meteor_p.stdout.readline() is returning an empty string.

Can you help me figure out how to address this issue?

I've installed the dependencies and run the setup script. Also, my Java version is 1.8.0_171.

Thanks

Got different results when using the standalone command and the Python API

Hi,

I used this repository for evaluating my result.

However, I received two different results.

First I used the standalone command (I have only one reference for each hypothesis):
nlg-eval --hypothesis=hyp.txt --references=ref.txt

and got these results:
Bleu_1: 0.225156
Bleu_2: 0.124906
Bleu_3: 0.071296
Bleu_4: 0.042867

But when I use the following Python code:

from nlgeval import NLGEval
import numpy as np

# gt_list and label_list hold the reference and hypothesis strings, one pair per sample.
nlgeval = NLGEval(no_glove=True, no_skipthoughts=True, metrics_to_omit={'CIDEr'})
bleu_1_list = []
bleu_2_list = []
bleu_3_list = []
bleu_4_list = []

for gt, pred in zip(gt_list, label_list):
    # Score each sample individually against its single reference.
    nlg_eval_result = nlgeval.compute_individual_metrics([gt], pred)
    bleu_1_list.append(nlg_eval_result['Bleu_1'])
    bleu_2_list.append(nlg_eval_result['Bleu_2'])
    bleu_3_list.append(nlg_eval_result['Bleu_3'])
    bleu_4_list.append(nlg_eval_result['Bleu_4'])

# Average the per-sample scores.
bleu_1_result = np.asarray(bleu_1_list).mean()
bleu_2_result = np.asarray(bleu_2_list).mean()
bleu_3_result = np.asarray(bleu_3_list).mean()
bleu_4_result = np.asarray(bleu_4_list).mean()

print(bleu_1_result, bleu_2_result, bleu_3_result, bleu_4_result)

I get these BLEU results:
Bleu_1=0.187041729287695
Bleu_2=0.08295724762832153
Bleu_3=0.029044826012821708
Bleu_4=0.012461408083621252

Why did I get different results?

The only difference is that in the first case (standalone command) I score all hypotheses and references together, while in the second case (Python API) I compute the result for each sample one by one and then average.
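For what it's worth, a corpus-level score is generally not the same as the mean of per-sentence scores: standard corpus-level BLEU pools the clipped n-gram counts (and computes the brevity penalty) over the whole corpus before dividing, whereas the loop above averages already-divided per-sentence scores. A toy arithmetic sketch with made-up counts (not output from nlg-eval) shows how the two can diverge:

# Made-up unigram counts for two sentences, purely to illustrate the arithmetic.
matches = [9, 1]    # matched (clipped) unigrams per sentence (hypothetical)
totals  = [10, 2]   # hypothesis unigrams per sentence (hypothetical)

per_sentence = [m / t for m, t in zip(matches, totals)]        # [0.9, 0.5]
average_of_sentences = sum(per_sentence) / len(per_sentence)   # 0.7
corpus_level = sum(matches) / sum(totals)                      # 10 / 12 ≈ 0.83
print(average_of_sentences, corpus_level)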

nlg-eval: command not found

I ran ./setup.sh. After that, when I run
nlg-eval --hypothesis=examples/hyp.txt --references=examples/ref1.txt --references=examples/ref2.txt
I get the error nlg-eval: command not found.
Can you please tell me how to resolve this?

How to compute the metrics when the two files are exactly the same?

Hi,

When I use these lines of code and my two files contain the exact same sentences, it produces an assertion error. Why does this code not work for two files that are exactly the same?

from nlgeval import compute_metrics
metrics_dict = compute_metrics(hypothesis='nlg-eval/examples/hyp.txt',
                               references=['nlg-eval/examples/ref1.txt'])

Thanks in advance!
