google-research / bleurt

BLEURT is a metric for Natural Language Generation based on transfer learning.

Home Page: https://arxiv.org/abs/2004.04696

License: Apache License 2.0

Python 100.00%

bleurt's Introduction

BLEURT: a Transfer Learning-Based Metric for Natural Language Generation

BLEURT is an evaluation metric for Natural Language Generation. It takes a pair of sentences as input, a reference and a candidate, and it returns a score that indicates to what extent the candidate is fluent and conveys the meaning of the reference. It is comparable to sentence-BLEU, BERTscore, and COMET.

BLEURT is a trained metric, that is, it is a regression model trained on ratings data. The model is based on BERT and RemBERT. This repository contains all the code necessary to use it and/or fine-tune it for your own applications. BLEURT uses Tensorflow, and it benefits greatly from modern GPUs (it runs on CPU too).

An overview of BLEURT can be found in our blog post. Further details are provided in the ACL paper BLEURT: Learning Robust Metrics for Text Generation and our EMNLP paper.

Installation

BLEURT runs in Python 3. It relies heavily on Tensorflow (>=1.15) and the library tf-slim (>=1.1). You may install it as follows:

pip install --upgrade pip  # ensures that pip is current
git clone https://github.com/google-research/bleurt.git
cd bleurt
pip install .

You may check your install with unit tests:

python -m unittest bleurt.score_test
python -m unittest bleurt.score_not_eager_test
python -m unittest bleurt.finetune_test
python -m unittest bleurt.score_files_test

Using BLEURT - TL;DR Version

The following commands download the recommended checkpoint and run BLEURT:

# Downloads the BLEURT-20 checkpoint.
wget https://storage.googleapis.com/bleurt-oss-21/BLEURT-20.zip .
unzip BLEURT-20.zip

# Runs the scoring.
python -m bleurt.score_files \
  -candidate_file=bleurt/test_data/candidates \
  -reference_file=bleurt/test_data/references \
  -bleurt_checkpoint=BLEURT-20

The files bleurt/test_data/candidates and bleurt/test_data/references contain test sentences and are included by default in the BLEURT distribution. The input format is one sentence per line. You may replace them with your own files. The command outputs one score per sentence pair.

Oct 8th 2021 Update: we upgraded the recommended checkpoint to BLEURT-20, a more accurate, multilingual model 🎉.

Using BLEURT - the Long Version

Command-line tools and APIs

Currently, there are three methods to invoke BLEURT: the command-line interface, the Python API, and the Tensorflow API.

Command-line interface

The simplest way to use BLEURT is through the command line, as shown below.

python -m bleurt.score_files \
  -candidate_file=bleurt/test_data/candidates \
  -reference_file=bleurt/test_data/references \
  -bleurt_checkpoint=bleurt/test_checkpoint \
  -scores_file=scores

The files candidates and references contain one sentence per line (see the folder test_data for the exact format). Invoking the command should produce a file scores which contains one BLEURT score per sentence pair. Alternatively you may use a JSONL file, as follows:

python -m bleurt.score_files \
  -sentence_pairs_file=bleurt/test_data/sentence_pairs.jsonl \
  -bleurt_checkpoint=bleurt/test_checkpoint
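
For illustration, here is a minimal sketch of how such a JSONL file could be generated in Python; the "candidate" and "reference" field names are assumed to mirror the bundled bleurt/test_data/sentence_pairs.jsonl file:

import json

# Hypothetical helper: writes one JSON object per line, with "candidate" and
# "reference" fields (assumed to match bleurt/test_data/sentence_pairs.jsonl).
pairs = [
    {"candidate": "This is the test.", "reference": "This is a test."},
    {"candidate": "Bud Powell was a great pianist.", "reference": "Bud Powell was a legendary pianist."},
]
with open("sentence_pairs.jsonl", "w") as f:
    for pair in pairs:
        f.write(json.dumps(pair) + "\n")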

The flags bleurt_checkpoint and scores_file are optional. If bleurt_checkpoint is not specified, BLEURT will default to a test checkpoint based on BERT-Tiny, which is very light but also very inaccurate (we recommend against using it). If scores_file is not specified, BLEURT will write to standard output.

The following command lists all the other command-line options:

python -m bleurt.score_files -helpshort

Python API

BLEURT may be used as a Python library as follows:

from bleurt import score

checkpoint = "bleurt/test_checkpoint"
references = ["This is a test."]
candidates = ["This is the test."]

scorer = score.BleurtScorer(checkpoint)
scores = scorer.score(references=references, candidates=candidates)
assert isinstance(scores, list) and len(scores) == 1
print(scores)

Here again, BLEURT will default to BERT-Tiny if no checkpoint is specified.

BLEURT works both in eager_mode (default in TF 2.0) and in a tf.Session (TF 1.0), but the latter mode is slower and may be deprecated in the near future.

Tensorflow API

BLEURT may be embedded in a TF computation graph, e.g., to visualize it on the Tensorboard while training a model.

The following piece of code shows an example:

import tensorflow as tf
# Set tf.enable_eager_execution() if using TF 1.x.

from bleurt import score

references = tf.constant(["This is a test."])
candidates = tf.constant(["This is the test."])

bleurt_ops = score.create_bleurt_ops()
bleurt_out = bleurt_ops(references=references, candidates=candidates)

assert bleurt_out["predictions"].shape == (1,)
print(bleurt_out["predictions"])

The crucial part is the call to score.create_bleurt_ops, which creates the TF ops.
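
As an illustrative sketch (not part of the library), the predictions could then be logged to TensorBoard with tf.summary; this assumes TF 2.x eager execution and uses a hypothetical log directory:

import tensorflow as tf
from bleurt import score

bleurt_ops = score.create_bleurt_ops()  # defaults to the test checkpoint
writer = tf.summary.create_file_writer("logs/bleurt")  # hypothetical log directory

references = tf.constant(["This is a test."])
candidates = tf.constant(["This is the test."])

with writer.as_default():
    bleurt_out = bleurt_ops(references=references, candidates=candidates)
    # Log the mean BLEURT prediction for this batch at (hypothetical) training step 0.
    tf.summary.scalar("bleurt/mean_prediction",
                      tf.reduce_mean(bleurt_out["predictions"]), step=0)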

Checkpoints

A BLEURT checkpoint is a self-contained folder that contains a regression model and some information that BLEURT needs to run. BLEURT checkpoints can be downloaded, copy-pasted, and stored anywhere. Furthermore, checkpoints are tunable, which means that they can be fine-tuned on custom ratings data.

BLEURT defaults to the test checkpoint, which is very inaccurate. We recommend using BLEURT-20 for reporting results. You may use it as follows:

wget https://storage.googleapis.com/bleurt-oss-21/BLEURT-20.zip .
unzip BLEURT-20.zip
python -m bleurt.score_files \
  -candidate_file=bleurt/test_data/candidates \
  -reference_file=bleurt/test_data/references \
  -bleurt_checkpoint=BLEURT-20

The checkpoints page provides more information about how these checkpoints were trained, as well as pointers to smaller models. Additionally, you can fine-tune BERT or existing BLEURT checkpoints on your own ratings data. The checkpoints page describes how to do so.

Interpreting BLEURT Scores

Different BLEURT checkpoints yield different scores. The currently recommended checkpoint BLEURT-20 generates scores which are roughly between 0 and 1 (sometimes less than 0, sometimes more than 1), where 0 indicates a random output and 1 a perfect one. As with all automatic metrics, BLEURT scores are noisy. For a robust evaluation of a system's quality, we recommend averaging BLEURT scores across the sentences in a corpus. See the WMT Metrics Shared Task for a comparison of metrics on this aspect.
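
For example, a minimal corpus-level evaluation with the Python API might look like the sketch below (assuming the BLEURT-20 checkpoint has been downloaded and unzipped in the working directory, and using made-up sentences):

from bleurt import score

references = ["Bud Powell was a legendary pianist.", "The weather is nice today."]
candidates = ["Bud Powell was a historical piano player.", "It is sunny outside."]

scorer = score.BleurtScorer("BLEURT-20")
scores = scorer.score(references=references, candidates=candidates)
# Report the mean of the per-sentence scores as the system-level score.
print(sum(scores) / len(scores))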

In principle, BLEURT should measure adequacy: most of its training data was collected by the WMT organizers, who asked annotators "How much do you agree that the system output adequately expresses the meaning of the reference?" (WMT Metrics'18, Graham et al., 2015). In practice, however, the answers tend to be very correlated with fluency ("Is the text fluent English?"), and we added synthetic noise in the training set, which makes the distinction between adequacy and fluency somewhat fuzzy.

Language Coverage

Currently, BLEURT-20 has been tested on 13 languages: Chinese, Czech, English, French, German, Japanese, Korean, Polish, Portuguese, Russian, Spanish, Tamil, and Vietnamese (these are the languages for which we have held-out ratings data). In theory, it should work for the 100+ languages of multilingual C4, on which RemBERT was trained.

If you tried any other language and would like to share your experience, either positive or negative, please send us feedback!

Speeding Up BLEURT

We describe three methods to speed up BLEURT, and how to combine them.

Batch size tuning

You may specify the flag -bleurt_batch_size, which determines the number of sentence pairs processed at once by BLEURT. The default value is 16; you may want to increase or decrease it based on the memory available and the presence of a GPU (we typically use 16 on a laptop without a GPU and 100 on a workstation with a GPU).
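
The Python API exposes the same knob: BleurtScorer.score accepts a batch_size argument. A minimal sketch, assuming the BLEURT-20 checkpoint is available locally and that there is enough GPU memory for batches of 100:

from bleurt import score

references = ["This is a test."] * 200
candidates = ["This is the test."] * 200

scorer = score.BleurtScorer("BLEURT-20")
# Larger batches amortize per-batch overhead on a GPU; 100 mirrors the
# -bleurt_batch_size value suggested above for workstations.
scores = scorer.score(references=references, candidates=candidates, batch_size=100)
print(len(scores))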

Length-based batching

Length-based batching is an optimization which consists of batching examples that have a similar length and cropping the resulting tensor, to avoid wasting computation on padding tokens. This technique often results in spectacular speed-ups (typically ~2-10X). It is described here, and it was successfully used by BERTScore in the field of learned metrics.

You can enable length-based batching by specifying -batch_same_length=True when calling score_files from the command line, or by instantiating a LengthBatchingBleurtScorer instead of BleurtScorer when using the Python API.
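
With the Python API, the switch is a one-line change, as sketched below (LengthBatchingBleurtScorer is the drop-in replacement mentioned above; the BLEURT-20 checkpoint is assumed to be available locally):

from bleurt import score

references = ["This is a test.", "A much longer reference sentence that would otherwise be padded heavily."]
candidates = ["This is the test.", "A much longer candidate sentence of roughly comparable length."]

# Same interface as BleurtScorer, but groups sentence pairs of similar length
# into the same batch to reduce wasted computation on padding.
scorer = score.LengthBatchingBleurtScorer("BLEURT-20")
scores = scorer.score(references=references, candidates=candidates)
print(scores)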

Distilled models

We provide pointers to several compressed checkpoints on the checkpoints page. These models were obtained by distillation, a lossy process, and therefore the outputs cannot be directly compared to those of the original BLEURT model (though they should be strongly correlated).

Putting everything together

The following command illustrates how to combine these three techniques (a larger batch size, length-based batching, and a distilled checkpoint), speeding up BLEURT by an order of magnitude (up to 20X with our configuration) on larger files:

# Downloads the 12-layer distilled model, which is ~3.5X smaller.
wget https://storage.googleapis.com/bleurt-oss-21/BLEURT-20-D12.zip .
unzip BLEURT-20-D12.zip

python -m bleurt.score_files \
  -candidate_file=bleurt/test_data/candidates \
  -reference_file=bleurt/test_data/references \
  -bleurt_batch_size=100 \
  -batch_same_length=True \
  -bleurt_checkpoint=BLEURT-20-D12

Reproducibility

You may find information about how to work with ratings from the WMT Metrics Shared Task, reproduce results from our ACL paper, and a selection of models from our EMNLP paper on this page.

How to Cite

Please cite our ACL paper:

@inproceedings{sellam2020bleurt,
  title = {BLEURT: Learning Robust Metrics for Text Generation},
  author = {Thibault Sellam and Dipanjan Das and Ankur P Parikh},
  year = {2020},
  booktitle = {Proceedings of ACL}
}

The latest model, BLEURT-20, is based on work that led to this follow-up paper:

@inproceedings{pu2021learning,
  title = {Learning compact metrics for MT},
  author = {Pu, Amy and Chung, Hyung Won and Parikh, Ankur P and Gehrmann, Sebastian and Sellam, Thibault},
  booktitle = {Proceedings of EMNLP},
  year = {2021}
}

bleurt's People

Contributors

tsellam


bleurt's Issues

About the range of BLEURT scores.

Hi,
I'm trying to evaluate my models with BLEURT, and I find contradictory descriptions between the current README and previous issues about how to interpret BLEURT scores.

As mentioned in README, "The currently recommended checkpoint BLEURT-20 generates scores which are roughly between 0 and 1 (sometimes less than 0, sometimes more than 1), where 0 indicates a random output and 1 a perfect one."

And as mentioned in this issue, the statistics of the training corpus show that a large portion of samples have negative values.
#1

If I understand correctly, I can bound the BLEURT scores manually if I regard 0 as random and 1 as perfect. For instance, negative values can be set to 0, and values greater than 1 can be truncated to 1.
Am I right?

Installation check error: Expected to be a int64 tensor but is a int32.

Hi,

I have installed BLEURT and am running the test script to verify the installation, but I get the error below. The paths to the directories seem to be correct.

python -m bleurt.score \
-candidate_file=bleurt/test_data/candidates \
-reference_file=bleurt/test_data/references \
-bleurt_checkpoint=bleurt/test_checkpoint \
-scores_file=scores


INFO:tensorflow:BLEURT initialized.
I0630 08:28:46.424506 24396 score.py:151] BLEURT initialized.
INFO:tensorflow:Computing BLEURT scores...
I0630 08:28:46.424506 24396 score.py:305] Computing BLEURT scores...
Traceback (most recent call last):
File "C:\Users\amit.prakash\Anaconda3\envs\tf\lib\runpy.py", line 193, in _run_module_as_main
"main", mod_spec)
File "C:\Users\amit.prakash\Anaconda3\envs\tf\lib\runpy.py", line 85, in _run_code
exec(code, run_globals)
File "C:\Users\amit.prakash\Anaconda3\envs\tf\lib\site-packages\bleurt\score.py", line 344, in
tf.app.run()
File "C:\Users\amit.prakash\Anaconda3\envs\tf\lib\site-packages\tensorflow_core\python\platform\app.py", line 40, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "C:\Users\amit.prakash\Anaconda3\envs\tf\lib\site-packages\absl\app.py", line 299, in run
_run_main(main, args)
File "C:\Users\amit.prakash\Anaconda3\envs\tf\lib\site-packages\absl\app.py", line 250, in _run_main
sys.exit(main(argv))
File "C:\Users\amit.prakash\Anaconda3\envs\tf\lib\site-packages\bleurt\score.py", line 339, in main
FLAGS.bleurt_checkpoint)
File "C:\Users\amit.prakash\Anaconda3\envs\tf\lib\site-packages\bleurt\score.py", line 321, in score_files
_consume_buffer()
File "C:\Users\amit.prakash\Anaconda3\envs\tf\lib\site-packages\bleurt\score.py", line 300, in _consume_buffer
scores = scorer.score(ref_buffer, cand_buffer, FLAGS.bleurt_batch_size)
File "C:\Users\amit.prakash\Anaconda3\envs\tf\lib\site-packages\bleurt\score.py", line 186, in score
predict_out = self.predict_fn(tf_input)
File "C:\Users\amit.prakash\Anaconda3\envs\tf\lib\site-packages\bleurt\score.py", line 70, in _predict_fn
segment_ids=tf.constant(input_dict["segment_ids"])
File "C:\Users\amit.prakash\Anaconda3\envs\tf\lib\site-packages\tensorflow_core\python\eager\function.py", line 1551, in call
return self._call_impl(args, kwargs)
File "C:\Users\amit.prakash\Anaconda3\envs\tf\lib\site-packages\tensorflow_core\python\eager\function.py", line 1591, in _call_impl
return self._call_flat(args, self.captured_inputs, cancellation_manager)
File "C:\Users\amit.prakash\Anaconda3\envs\tf\lib\site-packages\tensorflow_core\python\eager\function.py", line 1692, in _call_flat
ctx, args, cancellation_manager=cancellation_manager))
File "C:\Users\amit.prakash\Anaconda3\envs\tf\lib\site-packages\tensorflow_core\python\eager\function.py", line 545, in call
ctx=ctx)
File "C:\Users\amit.prakash\Anaconda3\envs\tf\lib\site-packages\tensorflow_core\python\eager\execute.py", line 67, in quick_execute
six.raise_from(core._status_to_exception(e.code, message), None)
File "", line 3, in raise_from
tensorflow.python.framework.errors_impl.InvalidArgumentError: cannot compute __inference_pruned_1485 as input #0(zero-based) was expected to be a int64 tensor but is a int32 tensor [Op:__inference_pruned_1485]

Thanks,
Amit

How to load rembert distilled models?

Hi I am trying to load rembert distilled models for some of my downstream tasks. However, I am not able to do so.

AutoTokenizer.from_pretrained(model, **kwargs)

Can you help?

Incompatible dependencies with installing through pip

When installing this repo through pip, it raises the following errors.

ERROR: After October 2020 you may experience errors when installing or updating packages. This is because pip will change the way that it resolves dependency conflicts.

We recommend you use --use-feature=2020-resolver to test your packages with the new resolver before it becomes the default.

tensorflow 2.3.1 requires numpy<1.19.0,>=1.16.0, but you'll have numpy 1.19.2 which is incompatible.

My workaround is to install an older numpy version beforehand by running pip install numpy==1.18.5.

However, when I run the test (python -m unittest bleurt.score_test), it fails with Illegal instruction (core dumped) in bash.

Error in finetuning BLEURT

Thank you for the great work and for open-sourcing it!

I am trying to follow the instructions in https://github.com/google-research/bleurt/blob/master/checkpoints.md#from-an-existing-bleurt-checkpoint to fine-tune the BLEURT-20 model on a customized set of ratings.

However, when I run the suggested command,

python -m bleurt.finetune \
  -train_set=../data/ratings_train.jsonl \
  -dev_set=../data/ratings_dev.jsonl \
  -num_train_steps=500 \
  -model_dir=../models/bleurt-20-fine1 \
  -init_bleurt_checkpoint=../models/BLEURT-20/

I get the following issue:

ValueError: Shape of variable bert/embeddings/LayerNorm/beta:0 ((1152,)) doesn't match with shape of tensor bert/embeddings/LayerNorm/beta ([256]) from checkpoint reader.

I have checked this with both tensorflow 2.7 and 1.15

Any help related to this would be appreciated!

file input from process substitution doesn't work

this works:

 python -m bleurt.score  -candidate_file bleurt/test_data/candidates \
   -reference_file bleurt/test_data/references \
   -bleurt_checkpoint=bleurt/test_checkpoint

But this does not work (on my machine):

python -m bleurt.score  -candidate_file <(head bleurt/test_data/candidates)
   -reference_file <(head bleurt/test_data/references) 
   -bleurt_checkpoint=bleurt/test_checkpoint

Why?

NOTE:
<(head /path/to/file) is meant to demonstrate a simple use case of process substitution

In reality, someone like me would use
<(detokenize.sh < /path/to/file)
or <(cut -f2 /path/to/file.tsv)
and those should work with tf.io.gfile.GFile, right?

WMT17 dataset: Mismatch in candidate-reference pairs counts

Hi, I was trying to download the WMT17 dataset using the wmt/db_builder example shared in Experiments with the WMT Metrics shared task section. However, I found that downloading the WMT17 dataset in this manner results only in 3920 candidate-reference pairs, while the number of candidate-reference pairs in WMT17 is mentioned to be 5344 in the research paper.

Thus, I wished to check what might be causing this discrepancy in the count of candidate-reference pairs.

PS: I also tried setting the average_duplicates flag to False when calling wmt/db_builder, but that resulted in 4132 samples, still lower than 5344.

Can't use predict_fn with TF2

Hi !

When I try to instantiate a PythonPredictor with a predict_fn function, I get this error:

    def __init__(self, predict_fn):
>     tf.logging.info("Creating Python-based predictor.")
E     AttributeError: module 'tensorflow' has no attribute 'logging'

I'm using TF2 so tensorflow.logging isn't available.

cc @tsellam I think this comes from the recent change to 0.0.2

UnrecognizedFlagError: Unknown command line flag 'f'

Hello!

I'm trying to run the code below, following the instructions in the README, and I'm getting an error. Can you help me? The code used and the output follow. The tensorflow version used is 2.2.0.

import os
!git clone https://github.com/google-research/bleurt.git
os.chdir('bleurt')
!pip install .
from bleurt import score
import tensorflow as tf

checkpoint = "bleurt/test_checkpoint"
references = ["This is a test."]
candidates = ["This is the test."]

scorer = score.BleurtScorer(checkpoint)
scores = scorer.score(references, candidates)
assert type(scores) == list and len(scores) == 1
print(scores)


UnrecognizedFlagError Traceback (most recent call last)
in ()
9
10 scorer = score.BleurtScorer(checkpoint)
---> 11 scores = scorer.score(references, candidates)
12 assert type(scores) == list and len(scores) == 1
13 print(scores)

2 frames
/usr/local/lib/python3.6/dist-packages/absl/flags/_flagvalues.py in call(self, argv, known_only)
631 suggestions = _helpers.get_flag_suggestions(name, list(self))
632 raise _exceptions.UnrecognizedFlagError(
--> 633 name, value, suggestions=suggestions)
634
635 self.mark_as_parsed()

UnrecognizedFlagError: Unknown command line flag 'f'

Python API test_checkpoint not found

Hi,

Thank you for releasing the code. Interesting work!

I am trying to use the Python API in my code as:

from bleurt import score
checkpoint = "bleurt/test_checkpoint"
references = ["This is a test."]
candidates = ["This is the test."]
scorer = score.BleurtScorer(checkpoint)

however, the bleurt/test_checkpoint is not found.

>>> scorer = score.BleurtScorer(checkpoint)
INFO:tensorflow:Reading checkpoint bleurt/test_checkpoint.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/conda/envs/metrics/lib/python3.7/site-packages/bleurt/score.py", line 133, in __init__
    config = checkpoint_lib.read_bleurt_config(checkpoint)
  File "/opt/conda/envs/metrics/lib/python3.7/site-packages/bleurt/checkpoint.py", line 78, in read_bleurt_config
    "Could not find BLEURT checkpoint {}".format(path)
AssertionError: Could not find BLEURT checkpoint bleurt/test_checkpoint

Is there a missing download link here?

If I don't provide any checkpoint

scorer = score.BleurtScorer()

Expected: if bleurt_checkpoint is not specified, BLEURT will default to the test checkpoint, based on BERT-Tiny. However, I am getting the assertion error:

>>> scorer = score.BleurtScorer()
INFO:tensorflow:No checkpoint specified, defaulting to BLEURT-tiny.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/conda/envs/metrics/lib/python3.7/site-packages/bleurt/score.py", line 130, in __init__
    checkpoint = _get_default_checkpoint()
  File "/opt/conda/envs/metrics/lib/python3.7/site-packages/bleurt/score.py", line 56, in _get_default_checkpoint
    "Default checkpoint not found! Are you sure the install is complete?"
AssertionError: Default checkpoint not found! Are you sure the install is complete?

Could you please suggest a workaround? Thanks!

How to use the checkpoints of BERT "warmed up" with synthetic ratings?

After I downloaded BLEURT and successfully used the test_checkpoint and the fine-tuned checkpoint, I am thinking of using the "warmed up" version. However, if I directly download the "warmed up" checkpoint and use it, it shows an error:

OSError: SavedModel file does not exist at: bleurt/bert-large-midtrained/bert-large//{saved_model.pbtxt|saved_model.pb}

After looking into the details, I found that the file types under the "warmed up" checkpoint are different from those under the test_checkpoint or fine-tuned checkpoint.
Under the fine-tuned checkpoint:

bert_config.json  bleurt_config.json  saved_model.pb  variables  vocab.txt

Under warmed up checkpoint:

bert_config.json  bert-large.data-00000-of-00001  bert-large.index  bert-large.meta  bleurt_config.json  vocab.txt

So how can we directly use the warmed up checkpoint for evaluation?

BLEURT consumes all available memory on checkpoint load?

Not quite sure what's happening here - running CUDA 11.6 and TensorFlow 2.10.0. No matter what checkpoint I use, all available GPU memory is consumed. Minimum reproducible example here:

bleurtcheckpoints = os.path.join(os.getcwd(), "bleurtcktpts")
from bleurt import score
checkpoint = os.path.join(bleurtcheckpoints, "bleurt-tiny-128/")
scorer = score.BleurtScorer(checkpoint)
INFO:tensorflow:Reading checkpoint /data/visualization/vis-text/datasets/vis-text/bleurtcktpts/bleurt-tiny-128/.
INFO:tensorflow:Config file found, reading.
INFO:tensorflow:Will load checkpoint bert_custom
INFO:tensorflow:Loads full paths and checks that files exists.
INFO:tensorflow:... name:bert_custom
INFO:tensorflow:... vocab_file:vocab.txt
INFO:tensorflow:... bert_config_file:bert_config.json
INFO:tensorflow:... do_lower_case:True
INFO:tensorflow:... max_seq_length:128
INFO:tensorflow:Creating BLEURT scorer.
INFO:tensorflow:Creating WordPiece tokenizer.
INFO:tensorflow:WordPiece tokenizer instantiated.
INFO:tensorflow:Creating Eager Mode predictor.
INFO:tensorflow:Loading model.

2022-10-25 21:53:54.567812: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:980] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-10-25 21:53:54.690050: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:980] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-10-25 21:53:54.691074: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:980] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-10-25 21:53:54.692770: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-10-25 21:53:54.693762: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:980] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-10-25 21:53:54.694475: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:980] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-10-25 21:53:54.695136: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:980] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-10-25 21:53:56.420924: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:980] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-10-25 21:53:56.422112: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:980] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-10-25 21:53:56.423287: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:980] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-10-25 21:53:56.424176: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1616] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 11413 MB memory:  -> device: 0, name: NVIDIA TITAN Xp, pci bus id: 0000:00:05.0, compute capability: 6.1

INFO:tensorflow:BLEURT initialized.

nvidia-smi results after (right before, it's only showing 4 MiB in use):

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA TITAN Xp     On   | 00000000:00:05.0 Off |                  N/A |
| 23%   30C    P2    58W / 250W |  11697MiB / 12288MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     16853      C   /usr/bin/python3.8              11693MiB |
+-----------------------------------------------------------------------------+

Results mismatch using released BLEURT-Large-128

Hi there,

Recently I'm interested in reimplementing and investigating your research. However, when I directly use your released code and BLEURT-Large-128 checkpoint model, I can't get comparable results with what you present here. Here is what I got:

  • de-en: 29.15
  • fi-en: 30.64
  • gu-en: 27.49
  • kk-en: 39.14
  • lt-en: 34.28
  • ru-en: 26.75
  • zh-en: 41.98
  • avg: 32.78

I first followed your command to get the WMT2019 data and the BLEURT-Large-128 checkpoint. After evaluating the whole dataset file, I collected the prediction scores, split the results and the corresponding gold scores by language pair, and computed the Kendall results using scipy.stats.kendalltau, as in your implementation.

So I'm wondering whether I've missed any detail. Could you help me? Thanks!

What tensorflow versions are possible?

I am trying to run bleurt with tensorflow 2.15 and I get

TypeError: Binding inputs to tf.function failed due to `got an unexpected keyword argument 'input_ids'`. Received args: () and kwargs: {'input_ids': <tf.Tensor: shape=(10, 128), dtype=int64, numpy=

Which looks like my tensorflow is too new.

What is the newest tensorflow version supported by bleurt?

How to calculate overall bleurt score?

Hi,

When I run the code on my own dataset, I get the scores file back with a score calculated for each sample. Am I supposed to take the average over all the generated scores?

My dataset consists of a reference file with a single line per sample, and a generated-summary file also with a single line per sample. I am doing data-to-text generation; the reference file contains all the expected output summaries, and the generated-summary file contains all the actual output summaries.

Install fails (ERROR: Could not install packages due to an EnvironmentError)

I am trying to install this on my Linux machine but it fails (error shown below). The install works fine on my Mac, however. I can't find anything useful online. Any suggestions to get this running?

ERROR: Could not install packages due to an EnvironmentError: [('/mnt/batch/tasks/shared/LS_root/mounts/clusters/path_to_bleurt/bleurt/.git/branches',.... (very long error)]

BLEURT for Spanish

I want to use BLEURT for image captioning in Spanish, but I searched for a parameter to do that and didn't find one.
So I wanted to know: does BLEURT detect the language of the captions automatically, or is there some parameter I need to change? How does BLEURT handle this, and what should I do?

How to use multiple references with BLEURT?

It looks like BLEURT calculates the score for one hypothesis and one reference. What if I have multiple references, like 5 or 6? Should I calculate the score against each of them and average, for every single sentence?

db_builder.py add support for WMT20?

Hi. Thanks for your amazing efforts. The WMT data downloader is a really helpful tool.
I wonder whether you are considering adding support for WMT20? That would be great.

Thanks!

Issue with loading fine-tuned BLEURT

I am unable to load the fine-tuned model. It is falling back and loading the base model.

How I am loading the model -
model = evaluate.load("bleurt", module_type="metric", checkpoint="/finetuned_bleurt/export/bleurt_best/1689684085/")

I also tried the following, but same issue -
bleurt_model = evaluate.load("bleurt", module_type="metric", checkpoint="/finetuned_bleurt_base_128/")

What am I doing wrong here?

Answer comparison inaccuracy

I'm wondering if I'm doing something wrong. Setting the candidate and reference with the command:

python -m bleurt.score   -candidate_file=bleurt/test_data/candidates   -reference_file=bleurt/test_data/references   -bleurt_checkpoint=bleurt/test_checkpoint   -scores_file=scores

I'm comparing the candidate and reference of the following:

"A group of tasks that you monitor as a single unit."
"Aggregates a set of tasks and synchronize behaviors on the group."

The result yields: -0.4743865132331848

My experience with NLP is still early, so I apologize if my expectations have been set too high with this.

BLEURT returning scores less than zero

I'm not sure if this is supposed to happen or not, but when testing BLEURT on the test_data ( "Bud Powell was a legendary pianist.", etc...) I'm getting scores that are way below zero:

  1. Here is my output for bleurt/test_checkpoint
0.9129246473312378
0.2755325436592102
-0.34470897912979126
-0.737292468547821
  2. And my output for bleurt-base-128
1.003721833229065
0.5313903093338013
-1.489485502243042
-1.6975871324539185

Though I may be wrong, my understanding is that scores should be between 0 and 1.
Thanks!

Does bleurt support Chinese?

I tried to use your fine-tuned model on Chinese, but the result is awful, with a 0.5 Pearson correlation with sacrebleu. Is it because your model does not support Chinese? If not, how can I use your code on Chinese?

Optimising BLEURT for large dataset

Hi, great paper!
I am trying to use this implementation to compute BLEURT scores for > 60K English sentence pairs and even with the provided GPU optimisations, it takes days to calculate the scores.
Is there any way to configure this to run in a shorter time?

Python API Error (UnrecognizedFlagError: Unknown command line flag 'f')

Hello,

I have installed BLEURT according to the steps mentioned in the README. I have also run all the installation tests, and everything works fine. Then I tried to use the Python API, following the script mentioned there:

from bleurt import score

checkpoint = "bleurt/test_checkpoint"
references = ["This is a test."]
candidates = ["This is the test."]

scorer = score.BleurtScorer(checkpoint)
scores = scorer.score(references, candidates)
assert type(scores) == list and len(scores) == 1
print(scores)

I have checked that my Tensorflow is >=1.15 and tf-slim is >=1.1. I am using Python 3.6 in Google Colab, on a Tesla K80 GPU. However, I got this error:

UnrecognizedFlagError                     Traceback (most recent call last)

<ipython-input-2-8427261da7b2> in <module>()
      6 
      7 scorer = score.BleurtScorer(checkpoint)
----> 8 scores = scorer.score(references, candidates)
      9 assert type(scores) == list and len(scores) == 1
     10 print(scores)

2 frames

/usr/local/lib/python3.6/dist-packages/bleurt/score.py in score(self, references, candidates, batch_size)
    164     """
    165     if not batch_size:
--> 166       batch_size = FLAGS.bleurt_batch_size
    167 
    168     candidates, references = list(candidates), list(references)

/usr/local/lib/python3.6/dist-packages/tensorflow/python/platform/flags.py in __getattr__(self, name)
     83     # a flag.
     84     if not wrapped.is_parsed():
---> 85       wrapped(_sys.argv)
     86     return wrapped.__getattr__(name)
     87 

/usr/local/lib/python3.6/dist-packages/absl/flags/_flagvalues.py in __call__(self, argv, known_only)
    631       suggestions = _helpers.get_flag_suggestions(name, list(self))
    632       raise _exceptions.UnrecognizedFlagError(
--> 633           name, value, suggestions=suggestions)
    634 
    635     self.mark_as_parsed()

UnrecognizedFlagError: Unknown command line flag 'f'

Could you suggest any way around? Thank you!

Install with pip fails

When I try to install the repo as a package with pip, there is no problem on Linux, but on Windows with Python 3.9 it fails while reading README.md. Basically, on Windows Python tries to open the file with the cp1254 encoding by default, which results in a failed installation with the following error message:

(base) C:\Users\devri>pip install git+https://github.com/google-research/bleurt.git --force-reinstall --no-cache-dir     
Collecting git+https://github.com/google-research/bleurt.git
  Cloning https://github.com/google-research/bleurt.git to c:\users\devri\appdata\local\temp\pip-req-build-l293_p23                    
  Running command git clone -q https://github.com/google-research/bleurt.git 'C:\Users\devri\AppData\Local\Temp\pip-req-build-l293_p23'
  Resolved https://github.com/google-research/bleurt.git to commit c6f2375c7c178e1480840cf27cb9e2af851394f9
    ERROR: Command errored out with exit status 1:
     command: 'C:\tools\Anaconda3\envs\jury\python.exe' -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\\Users\\devri\\AppData\\Local\\Temp\\pip-req-build-l293_p23\\setup.py'"'"'; __file__='"'"'C:\\Users\\devri\\AppData\\Local\\Temp\\pip-req-build-l293_p23\\setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.
exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base 'C:\Users\devri\AppData\Local\Temp\pip-pip-egg-info-s1jaf9gz'
         cwd: C:\Users\devri\AppData\Local\Temp\pip-req-build-l293_p23\
    Complete output (7 lines):
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "C:\Users\devri\AppData\Local\Temp\pip-req-build-l293_p23\setup.py", line 23, in <module>
        long_description = fh.read()
      File "C:\tools\Anaconda3\envs\base\lib\encodings\cp1254.py", line 23, in decode
        return codecs.charmap_decode(input,self.errors,decoding_table)[0]
    UnicodeDecodeError: 'charmap' codec can't decode byte 0x8e in position 2560: character maps to <undefined>

JIT compilation failed

I was trying to run the sample code below (Python API) in a Jupyter notebook but encountered this error.

Sample Code:

import bleurt
from bleurt import score

checkpoint = "bleurt/test_checkpoint"
references = ["This is a test."]
candidates = ["This is the test."]

scorer = score.BleurtScorer(checkpoint)
scores = scorer.score(references=references, candidates=candidates)
assert isinstance(scores, list) and len(scores) == 1
print(scores)

Error:

2024-03-30 23:05:27.516741: W tensorflow/core/framework/op_kernel.cc:1827] UNKNOWN: JIT compilation failed.
2024-03-30 23:05:27.516767: W tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: UNKNOWN: JIT compilation failed.
	 [[{{node bert/embeddings/LayerNorm/batchnorm/Rsqrt}}]]

JIT compilation failed. 
[[{{node bert/embeddings/LayerNorm/batchnorm/Rsqrt}}]] [Op:__inference_pruned_2804]. 

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
File ~/anaconda3/envs/nlu/lib/python3.9/site-packages/tensorflow/python/eager/polymorphic_function/function_type_utils.py:442, in bind_function_inputs(args, kwargs, function_type, default_values)
    441 try:
--> 442   bound_arguments = function_type.bind_with_defaults(
    443       args, sanitized_kwargs, default_values
    444   )
    445 except Exception as e:

File ~/anaconda3/envs/nlu/lib/python3.9/site-packages/tensorflow/core/function/polymorphism/function_type.py:264, in FunctionType.bind_with_defaults(self, args, kwargs, default_values)
    263 """Returns BoundArguments with default values filled in."""
--> 264 bound_arguments = self.bind(*args, **kwargs)
    265 bound_arguments.apply_defaults()

File ~/anaconda3/envs/nlu/lib/python3.9/inspect.py:3045, in Signature.bind(self, *args, **kwargs)
   3041 """Get a BoundArguments object, that maps the passed `args`
   3042 and `kwargs` to the function's signature.  Raises `TypeError`
   3043 if the passed arguments can not be bound.
   3044 """
-> 3045 return self._bind(args, kwargs)

File ~/anaconda3/envs/nlu/lib/python3.9/inspect.py:3034, in Signature._bind(self, args, kwargs, partial)
   3033     else:
-> 3034         raise TypeError(
   3035             'got an unexpected keyword argument {arg!r}'.format(
   3036                 arg=next(iter(kwargs))))
   3038 return self._bound_arguments_cls(self, arguments)

TypeError: got an unexpected keyword argument 'input_ids'

The above exception was the direct cause of the following exception:

TypeError                                 Traceback (most recent call last)
File ~/anaconda3/envs/nlu/lib/python3.9/site-packages/tensorflow/python/eager/polymorphic_function/concrete_function.py:1179, in ConcreteFunction._call_impl(self, args, kwargs)
   1178 try:
-> 1179   return self._call_with_structured_signature(args, kwargs)
   1180 except TypeError as structured_err:

File ~/anaconda3/envs/nlu/lib/python3.9/site-packages/tensorflow/python/eager/polymorphic_function/concrete_function.py:1259, in ConcreteFunction._call_with_structured_signature(self, args, kwargs)
   1245 """Executes the wrapped function with the structured signature.
   1246 
   1247 Args:
   (...)
   1256     of this `ConcreteFunction`.
   1257 """
   1258 bound_args = (
-> 1259     function_type_utils.canonicalize_function_inputs(
   1260         args, kwargs, self.function_type)
   1261 )
   1262 filtered_flat_args = self.function_type.unpack_inputs(bound_args)

File ~/anaconda3/envs/nlu/lib/python3.9/site-packages/tensorflow/python/eager/polymorphic_function/function_type_utils.py:422, in canonicalize_function_inputs(args, kwargs, function_type, default_values, is_pure)
    421   args, kwargs = _convert_variables_to_tensors(args, kwargs)
--> 422 bound_arguments = bind_function_inputs(
    423     args, kwargs, function_type, default_values
    424 )
    425 return bound_arguments

File ~/anaconda3/envs/nlu/lib/python3.9/site-packages/tensorflow/python/eager/polymorphic_function/function_type_utils.py:446, in bind_function_inputs(args, kwargs, function_type, default_values)
    445 except Exception as e:
--> 446   raise TypeError(
    447       f"Binding inputs to tf.function failed due to `{e}`. "
    448       f"Received args: {args} and kwargs: {sanitized_kwargs} for signature:"
    449       f" {function_type}."
    450   ) from e
    451 return bound_arguments

TypeError: Binding inputs to tf.function failed due to `got an unexpected keyword argument 'input_ids'`. Received args: () and kwargs: {'input_ids': <tf.Tensor: shape=(1, 512), dtype=int64, numpy=

.......

UnknownError: Graph execution error:
Detected at node bert/embeddings/LayerNorm/batchnorm/Rsqrt defined at (most recent call last)

I followed the README instructions to install, and it worked from the command line. Any idea about this error?

Getting BLEURT to Work in Jupyter Notebook on Windows: Unknown command line flag 'f', 32int error

Raising an unrecognized command line flag error. Installed exactly as described on the front page.

  • tensorflow 2.3 or 2.0
  • python 3.7.5
  • Windows

Arises from either:

references = ['This is a test.', 'This is surely a test']
candidates = ['This is also a text', 'This could be a test']
checkpoint = 'C:/bleurt/bleurt/checkpoints/bleurt-tiny-512'
scorer = score.BleurtScorer(checkpoint)
scorer.score(references, candidates, batch_size = 2)

or

references = tf.constant(["This is a test."])
candidates = tf.constant(["This is the test."])
checkpoint = 'C:/bleurt/bleurt/checkpoints/bleurt-tiny-512'
scorer = score.BleurtScorer(checkpoint)
scorer.score(references, candidates, batch_size = 2)

Error:

UnrecognizedFlagError Traceback (most recent call last)
in
13
14 scorer = score.BleurtScorer(checkpoint)
---> 15 scorer.score(references, candidates, batch_size = 2)
16 # bleurt_out = scorer(references, candidates)
17 # # bleurt_ops = score.create_bleurt_ops()

c:\programdata\anaconda3\envs\context2\lib\site-packages\bleurt\score.py in score(self, references, candidates, batch_size)
178 batch_cand = candidates[i:i + batch_size]
179 input_ids, input_mask, segment_ids = encoding.encode_batch(
--> 180 batch_ref, batch_cand, self.tokenizer, self.max_seq_length)
181 tf_input = {
182 "input_ids": input_ids,

c:\programdata\anaconda3\envs\context2\lib\site-packages\bleurt\encoding.py in encode_batch(references, candidates, tokenizer, max_seq_length)
150 encoded_examples = []
151 for ref, cand in zip(references, candidates):
--> 152 triplet = encode_example(ref, cand, tokenizer, max_seq_length)
153 example = np.stack(triplet)
154 encoded_examples.append(example)

c:\programdata\anaconda3\envs\context2\lib\site-packages\bleurt\encoding.py in encode_example(reference, candidate, tokenizer, max_seq_length)
56 # Tokenizes, truncates and concatenates the sentences, as in:
57 # bert/run_classifier.py
---> 58 tokens_ref = tokenizer.tokenize(reference)
59 tokens_cand = tokenizer.tokenize(candidate)
60

c:\programdata\anaconda3\envs\context2\lib\site-packages\bleurt\lib\tokenization.py in tokenize(self, text)
144 def tokenize(self, text):
145 split_tokens = []
--> 146 for token in self.basic_tokenizer.tokenize(text):
147 if preserve_token(token, self.vocab):
148 split_tokens.append(token)

c:\programdata\anaconda3\envs\context2\lib\site-packages\bleurt\lib\tokenization.py in tokenize(self, text)
189 split_tokens = []
190 for token in orig_tokens:
--> 191 if preserve_token(token, self.vocab):
192 split_tokens.append(token)
193 continue

c:\programdata\anaconda3\envs\context2\lib\site-packages\bleurt\lib\tokenization.py in preserve_token(token, vocab)
43 def preserve_token(token, vocab):
44 """Returns True if the token should forgo tokenization and be preserved."""
---> 45 if not FLAGS.preserve_unused_tokens:
46 return False
47 if token not in vocab:

c:\programdata\anaconda3\envs\context2\lib\site-packages\tensorflow\python\platform\flags.py in getattr(self, name)
83 # a flag.
84 if not wrapped.is_parsed():
---> 85 wrapped(_sys.argv)
86 return wrapped.getattr(name)
87

c:\programdata\anaconda3\envs\context2\lib\site-packages\absl\flags_flagvalues.py in call(self, argv, known_only)
631 suggestions = _helpers.get_flag_suggestions(name, list(self))
632 raise _exceptions.UnrecognizedFlagError(
--> 633 name, value, suggestions=suggestions)
634
635 self.mark_as_parsed()

UnrecognizedFlagError: Unknown command line flag 'f'

Interpreting Bleurt scores

Hi,
sorry for the stupid question; maybe you have already answered this.
You wrote: "In practice however, the answers tend to be very correlated with fluency ("Is the text fluent English?"), and we added synthetic noise in the training set which makes the distinction between adequacy and fluency somewhat fuzzy".
I'm a little bit confused by this.
Which aspects of the text does the metric ultimately evaluate in the translation task?
Thanks, and sorry for my bad English ;)

Is text truncation to 512 tokens handled automatically for both candidate and reference texts?

Hi,

I wanted to clarify the following information. On the checkpoints page here, you mention that

Currently, the following six BLEURT checkpoints are available, fine-tuned on WMT Metrics ratings data from 2015 to 2018. They vary on two aspects: the size of the model, and the size of the input.

Let's say I am using the following model - BLEURT-Base, 512 (max #tokens). In my case, both generated text and reference text are longer than 512 tokens. While computing the BLEURT, will it automatically truncate both generated text and reference text to fit the requirement and then calculate the score between truncated versions of generated text and reference text? Or do I need to cut the length of generated text and reference text manually before calling the function to calculate BLEURT?

Many thanks in advance,
Ruslan

UnrecognizedFlagError: Unknown command line flag 'f'

Getting this issue.

    10 bleurt_ops = score.create_bleurt_ops()
---> 11 bleurt_out = bleurt_ops(references, candidates)
     12 
     13 assert bleurt_out["predictions"].shape == (1,)

11 frames
/usr/local/lib/python3.6/dist-packages/absl/flags/_flagvalues.py in __call__(self, argv, known_only)
    631       suggestions = _helpers.get_flag_suggestions(name, list(self))
    632       raise _exceptions.UnrecognizedFlagError(
--> 633           name, value, suggestions=suggestions)
    634 
    635     self.mark_as_parsed()

UnrecognizedFlagError: Unknown command line flag 'f'

About fine tune

Why is it that after I fine-tuned the model and saved it, no weights are preserved, only some configs?

issue

I followed the commands up to:

wget https://storage.googleapis.com/bleurt-oss/bleurt-base-128.zip .
unzip bleurt-base-128.zip
python -m bleurt.score \
-candidate_file=bleurt/test_data/candidates \
-reference_file=bleurt/test_data/references \
-bleurt_checkpoint=bleurt-base-128

I actually finished that step of your program. But when I run the Python API,
with code like this:

from bleurt import score

checkpoint = "C:\bleurt-master\bert-base-128"
references = ["This is a test."]
candidates = ["This is the test."]

scorer = score.BleurtScorer(checkpoint)
scores = scorer.score(references=references, candidates=candidates)
assert type(scores) == list and len(scores) == 1
print(scores)

It occurred that:AssertionError: Could not find BLEURT checkpoint Cleurt-masteert-base-128

WMT metric shared dataset download error

We cannot access the current download URL for the wmt17, wmt18 datasets.
When I run this command,

python -m bleurt.wmt.db_builder   -target_language="en"   -rating_years="2017"

It gives an error

INFO:tensorflow:Downloading newstest2017-segment-level-human from http://computing.dcu.ie/~ygraham/newstest2017-segment-level-human.tar.gz
I0824 12:51:08.780502 140389933356864 downloaders.py:139] Downloading newstest2017-segment-level-human from http://computing.dcu.ie/~ygraham/newstest2017-segment-level-human.tar.gz
Downloading data from http://computing.dcu.ie/~ygraham/newstest2017-segment-level-human.tar.gz
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/keras/utils/data_utils.py", line 274, in get_file
    urlretrieve(origin, fpath, dl_progress)
  File "/opt/conda/lib/python3.8/site-packages/keras/utils/data_utils.py", line 82, in urlretrieve
    response = urlopen(url, data)
  File "/opt/conda/lib/python3.8/urllib/request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "/opt/conda/lib/python3.8/urllib/request.py", line 531, in open
    response = meth(req, response)
  File "/opt/conda/lib/python3.8/urllib/request.py", line 640, in http_response
    response = self.parent.error(
  File "/opt/conda/lib/python3.8/urllib/request.py", line 569, in error
    return self._call_chain(*args)
  File "/opt/conda/lib/python3.8/urllib/request.py", line 502, in _call_chain
    result = func(*args)
  File "/opt/conda/lib/python3.8/urllib/request.py", line 649, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 503: Service Unavailable

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/data/private/metric_shared/bleurt/bleurt/wmt/db_builder.py", line 273, in <module>
    app.run(main)
  File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/opt/conda/lib/python3.8/site-packages/absl/app.py", line 312, in run
    _run_main(main, args)
  File "/opt/conda/lib/python3.8/site-packages/absl/app.py", line 258, in _run_main
    sys.exit(main(argv))
  File "/data/private/metric_shared/bleurt/bleurt/wmt/db_builder.py", line 262, in main
    create_wmt_dataset(FLAGS.target_file, FLAGS.rating_years,
  File "/data/private/metric_shared/bleurt/bleurt/wmt/db_builder.py", line 100, in create_wmt_dataset
    importer.fetch_files()
  File "/data/private/metric_shared/bleurt/bleurt/wmt/downloaders.py", line 305, in fetch_files
    super(Importer17, self).fetch_files()
  File "/data/private/metric_shared/bleurt/bleurt/wmt/downloaders.py", line 140, in fetch_files
    _ = tf.keras.utils.get_file(
  File "/opt/conda/lib/python3.8/site-packages/keras/utils/data_utils.py", line 276, in get_file
    raise Exception(error_msg.format(origin, e.code, e.msg))
Exception: URL fetch failure on http://computing.dcu.ie/~ygraham/newstest2017-segment-level-human.tar.gz: 503 -- Service Unavailable

We have to change the URL to the HTTPS version, for example,
https://www.computing.dcu.ie/~ygraham/newstest2017-segment-level-human.tar.gz

Code for Section 4 of the paper

Hi,

I was wondering if you have released the code for either generating synthetic sentence pairs or running the pre-training tasks that you mention in Section 4 of the BLEURT paper. Thank you!

Reproducing table 2 results

Hello and thank you very much for your contribution to the field and open-sourcing the code.

I am trying to reproduce the table 2 results from the paper using the code specified here. I had to add a value for -max_seq_length since the command wouldn't run otherwise. I also train for 40k steps instead of 20k steps, as is specified in the paper. Otherwise, I am running the exact same command.

The results I obtain are different from the ones shown in Table 2 in the paper. Here's what I obtain

{"cs-en": {"kendall": 0.45062611806797853, "pearson": 0.6431185796263721, "spearman": 0.6229406024891663, "wmt_da_rr_kendall": -1.0, "sys-kendall": 1.0, "sys-pearson": 0.9755128178401778, "sys-spearman": 1.0}, "de-en": {"kendall": 0.4543906669799155, "pearson": 0.6222954800372593, "spearman": 0.6351588328764061, "wmt_da_rr_kendall": null, "sys-kendall": 0.5636363636363636, "sys-pearson": 0.8306197237216054, "sys-spearman": 0.8000000000000002}, "fi-en": {"kendall": 0.5518527983644262, "pearson": 0.7438350752272211, "spearman": 0.7497050145476959, "wmt_da_rr_kendall": null, "sys-kendall": 0.9999999999999999, "sys-pearson": 0.9914245385710291, "sys-spearman": 1.0}, "lv-en": {"kendall": 0.5359953999488883, "pearson": 0.7415974274659147, "spearman": 0.7285339831167466, "wmt_da_rr_kendall": null, "sys-kendall": 0.9444444444444445, "sys-pearson": 0.9648751710079252, "sys-spearman": 0.9833333333333333}, "ru-en": {"kendall": 0.5051750575006388, "pearson": 0.7192891660812555, "spearman": 0.6896893120559332, "wmt_da_rr_kendall": null, "sys-kendall": 0.8333333333333334, "sys-pearson": 0.9496574815839979, "sys-spearman": 0.9166666666666666}, "tr-en": {"kendall": 0.48177868642984917, "pearson": 0.6692174622720356, "spearman": 0.6613286166637742, "wmt_da_rr_kendall": null, "sys-kendall": 0.7333333333333333, "sys-pearson": 0.8750333198839647, "sys-spearman": 0.8545454545454544}, "zh-en": {"kendall": 0.47126245847176074, "pearson": 0.6771181689880665, "spearman": 0.6452188030847402, "wmt_da_rr_kendall": null, "sys-kendall": 0.7, "sys-pearson": 0.8358359469478294, "sys-spearman": 0.8617647058823529}}

Do you have any guidance as to what I might be doing wrong? Could it be that I'm not using the correct initial BLEURT checkpoint?

Thanks a lot

Checkpoints fine-tuned on WebNLG

Could you release the checkpoints fine-tuned on WebNLG, as mentioned in Section 5.3 of the paper? If they are not allowed to be public, could you give more specific instructions on how to fine-tune on WebNLG ratings? Thanks!
