huggingface / evaluate

🤗 Evaluate: A library for easily evaluating machine learning models and datasets.

Home Page: https://huggingface.co/docs/evaluate

License: Apache License 2.0

evaluate's Introduction

🤗 Evaluate is a library that makes evaluating and comparing models and reporting their performance easier and more standardized.

It currently contains:

implementations of dozens of popular metrics: the existing metrics cover a variety of tasks spanning from NLP to Computer Vision, and include dataset-specific metrics. With a simple command like accuracy = load("accuracy"), you can get any of these metrics ready to use for evaluating an ML model in any framework (NumPy/Pandas/PyTorch/TensorFlow/JAX).
comparisons and measurements: comparisons are used to measure the difference between models and measurements are tools to evaluate datasets.
an easy way of adding new evaluation modules to the 🤗 Hub: you can create new evaluation modules and push them to a dedicated Space in the 🤗 Hub with evaluate-cli create [metric name], which allows you to easily compare different metrics and their outputs for the same sets of references and predictions.

🎓 Documentation

🔎 Find a metric, comparison, measurement on the Hub

🌟 Add a new evaluation module

🤗 Evaluate also has lots of useful features like:

Type checking: the input types are checked to make sure that you are using the right input formats for each metric.
Metric cards: each metric comes with a card that describes its values, limitations and ranges, and provides examples of its usage and usefulness.
Community metrics: Metrics live on the Hugging Face Hub and you can easily add your own metrics for your project or to collaborate with others.

Installation

With pip

🤗 Evaluate can be installed from PyPI and has to be installed in a virtual environment (venv or conda, for instance).

pip install evaluate

Usage

🤗 Evaluate's main methods are:

evaluate.list_evaluation_modules() to list the available metrics, comparisons and measurements
evaluate.load(module_name, **kwargs) to instantiate an evaluation module
results = module.compute(**kwargs) to compute the result of an evaluation module

Adding a new evaluation module

First install the necessary dependencies to create a new metric with the following command:

pip install evaluate[template]

Then you can get started with the following command which will create a new folder for your metric and display the necessary steps:

evaluate-cli create "Awesome Metric"

See this step-by-step guide in the documentation for detailed instructions.

Credits

Thanks to @marella for letting us use the evaluate namespace on PyPI previously used by his library.

evaluate's Issues

Add the CIDEr metric?

Hi,
I find the api in https://huggingface.co/metrics quite useful.
I am playing around with video/image captioning task, where CIDEr is a popular metric.
Do you plan to add this into the HF datasets library?
Thanks.

Issue with Perplexity metric

Describe the bug

For most of the model keys that are listed for perplexity score, the keys are not present.
https://huggingface.co/docs/transformers/main/en/model_doc/auto#transformers.AutoModelForCausalLM

For using keys like bert-base-multilingual-cased or xlm-roberta-large, I get the following error

Steps to reproduce the bug

from datasets import load_metric
metric = load_metric("perplexity")

metric.compute(
    model_id="bert-base-multilingual-cased",
    input_texts=dataset
)

Expected results

The model should have been able to process the request and produce a perplexity score

{'perplexity': XXX}

Actual results

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
/tmp/ipykernel_30901/2773184990.py in <cell line: 1>()
----> 1 metric.compute(
      2     model_id="bert-base-multilingual-cased",
      3     input_texts=flores_true_src
      4 )

~/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/datasets/metric.py in compute(self, predictions, references, **kwargs)
    428             inputs = {input_name: self.data[input_name] for input_name in self.features}
    429             with temp_seed(self.seed):
--> 430                 output = self._compute(**inputs, **compute_kwargs)
    431 
    432             if self.buf_writer is not None:

~/.cache/huggingface/modules/datasets_modules/metrics/perplexity/35bbdc10965bddb5e3a69737699ecb19f4ddc4c115e0b929c94bb7d38897bd57/perplexity.py in _compute(self, input_texts, model_id, stride, device)
    110         special_tokens_masks = encodings["special_tokens_mask"]
    111 
--> 112         max_model_length = model.config.n_positions
    113 
    114         ppls = []

~/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/transformers/configuration_utils.py in __getattribute__(self, key)
    250         if key != "attribute_map" and key in super().__getattribute__("attribute_map"):
    251             key = super().__getattribute__("attribute_map")[key]
--> 252         return super().__getattribute__(key)
    253 
    254     def __init__(self, **kwargs):

AttributeError: 'BertConfig' object has no attribute 'n_positions'

Environment info

datasets version: 2.0.0
Platform: Linux-5.4.0-1066-aws-x86_64-with-glibc2.10
Python version: 3.8.12
PyArrow version: 7.0.0
Pandas version: 1.3.4
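
The traceback shows the metric script reading `n_positions`, which GPT-2 style configs define but `BertConfig` does not; BERT-style configs expose `max_position_embeddings` instead. A minimal sketch of a more tolerant lookup is shown below. The helper name `get_max_length` and the stand-in config objects are hypothetical, and the sketch does not address whether perplexity is meaningful for masked language models in the first place.

```python
from types import SimpleNamespace

# Attribute names that different config classes use for the maximum sequence length.
# `n_positions` is the one in the traceback; `max_position_embeddings` is the BERT-style name.
_LENGTH_ATTRS = ("n_positions", "max_position_embeddings")


def get_max_length(config, default=None):
    """Return the first length attribute found on `config`, else `default`."""
    for name in _LENGTH_ATTRS:
        value = getattr(config, name, None)
        if value is not None:
            return value
    if default is None:
        raise AttributeError(f"config defines none of {_LENGTH_ATTRS}")
    return default


# Stand-ins for real config objects (values chosen for illustration only).
gpt2_like = SimpleNamespace(n_positions=1024)
bert_like = SimpleNamespace(max_position_embeddings=512)

print(get_max_length(gpt2_like))
print(get_max_length(bert_like))
```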

Using datasets.Metric with Trainer()

Hi team, I was quite surprised that the Metric documentation doesn't show how it can be used with Trainer(). That would be the most intuitive use case, instead of having to iterate over the batches, add predictions and references to the metric, and then compute the metric manually. Ideally, any pre-built metric could be passed to the compute_metrics argument of Trainer() and be calculated at the interval specified by TrainingArguments.evaluation_strategy.

Is this option available but just not mentioned in the documentation, or is it not possible at the moment? I notice that in the Transformer | Training and fine-tuning tutorial, you are using custom scripts to calculate the accuracy and P/R/F, which are already available as pre-built metrics.
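
For illustration, here is a rough sketch of what such a callback could look like. It assumes the callback receives a (logits, labels) pair and returns a dictionary of scores; the plain-Python `accuracy` function is a stand-in for a pre-built metric's compute call, so nothing here depends on a particular library version.

```python
def accuracy(predictions, references):
    """Plain-Python stand-in for a pre-built accuracy metric."""
    correct = sum(int(p == r) for p, r in zip(predictions, references))
    return {"accuracy": correct / len(references)}


def compute_metrics(eval_pred):
    """Callback sketch: takes (logits, labels), returns a dict of scores."""
    logits, labels = eval_pred
    # Pick the highest-scoring class for each example (argmax over the row).
    predictions = [max(range(len(row)), key=row.__getitem__) for row in logits]
    return accuracy(predictions=predictions, references=labels)


# Tiny hand-made example: three samples, two classes.
logits = [[0.1, 0.9], [0.8, 0.2], [0.3, 0.7]]
labels = [1, 0, 0]
print(compute_metrics((logits, labels)))
```

With a real metric, the stand-in would be replaced by a call such as metric.compute(predictions=..., references=...), and the function would be handed to the compute_metrics argument mentioned above.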

How to Add New Metrics Guide

Is your feature request related to a problem? Please describe.
Currently there is an absolutely fantastic guide for how to contribute a new dataset to the library. However, there isn't one for adding new metrics.

Describe the solution you'd like
I'd like a guide for adding metrics in a style similar to the dataset guide. I believe much of the content in the dataset guide, such as setup, can be easily copied over with minimal changes. Also, from what I've seen with existing metrics, it shouldn't be as complicated, especially the documentation of the metric, which is mainly just citation and usage. The most complicated part I see would be the automated tests that run the new metrics, but y'all's test suite seems pretty comprehensive, so it might not be that hard.

Describe alternatives you've considered
One alternative would be just not having the metrics be community generated and so would not need a step by step guide. New metrics would just be proposed as issues and the internal team would take care of them. However, I think it makes more sense to have a step by step guide for contributors to follow.

Additional context
I'd be happy to help with creating this guide as I am very interested in adding software engineering metrics to the library 🤓, the part I would need guidance on would be testing.

P.S. Love the library and community y'all have built! 🤗

Broken Link - Step-by-Step Guide

The step-by-step guide link given in the readme is broken

https://huggingface.co/docs/evaluate/creating_and_sharing.html

[request] make load_metric api intuitive

metric = load_metric('glue', 'mrpc', num_process=num_process, process_id=rank)

May I suggest that num_process is confusing as it's singular yet expects a plural value and either

be deprecated in favor of num_processes which is more intuitive since it's plural as its expected value
or even better why not mimic the established dist environment convention for that purpose, which uses world_size.

Same for process_id - why reinvent the naming and needing to explain that this is NOT PID, when we have rank already. That is:

metric = load_metric('glue', 'mrpc', world_size=world_size, rank=rank)

This then fits like a glove into the pytorch DDP and alike envs. and we just need to call:

dist.get_world_size()
dist.get_rank()

So it'd be as simple as:

metric = load_metric('glue', 'mrpc', world_size=dist.get_world_size(), rank=dist.get_rank())

From: https://pytorch.org/docs/stable/distributed.html#torch.distributed.init_process_group

world_size (int, optional) – Number of processes participating in the job. Required if store is specified.
rank (int, optional) – Rank of the current process. Required if store is specified.

And maybe an example would be useful, so that the user doesn't even need to think about where to get dist:

import torch.distributed as dist
if dist.is_initialized():
    metric = load_metric(metric_name, world_size=dist.get_world_size(), rank=dist.get_rank())
else:
    metric = load_metric(metric_name)

I'm aware this is pytorch-centric, but it's better than no examples, IMHO.

Thank you.

Add runtime metrics to `Evaluator`

Currently, the Evaluator class for text-classification computes the model metrics on a given dataset. In addition to model metrics, it would be nice if the Evaluator could also report runtime metrics like eval_runtime (latency) and eval_samples_per_second (throughput).

cc @philschmid
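
As a rough sketch of the idea, runtime metrics could be collected by timing the prediction loop. The function name `timed_evaluation` and its arguments are hypothetical and not part of the Evaluator; only the two reported keys come from the request above.

```python
import time


def timed_evaluation(predict_fn, samples):
    """Run `predict_fn` over `samples` and report latency and throughput."""
    start = time.perf_counter()
    predictions = [predict_fn(sample) for sample in samples]
    runtime = time.perf_counter() - start
    return predictions, {
        "eval_runtime": runtime,
        # Guard against a zero elapsed time on very fast runs.
        "eval_samples_per_second": len(samples) / runtime if runtime > 0 else float("inf"),
    }


# `str.upper` stands in for a model's prediction function.
predictions, runtime_metrics = timed_evaluation(str.upper, ["a", "b", "c"])
print(predictions, runtime_metrics)
```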

How to add tensors to batch?

How to add pytorch tensors to a metric's batch? I tried to declare the features types as Sequence(Value()) and as Array2D() both didn’t work

import logging

import torch
from datasets import Features, Sequence, Value

from evaluate.module import EvaluationModule, EvaluationModuleInfo

logging.basicConfig(level=logging.INFO)


class DummyMetric(EvaluationModule):
    def _info(self):
        return EvaluationModuleInfo(
            description="dummy metric for tests",
            citation="insert citation here",
            features=Features(
                {"tensor1": Sequence(Value("float")), "tensor2": Sequence(Value("int64"))}
            ),
        )

    def _compute(self, tensor1, tensor2):
        return tensor1, tensor2


metric = DummyMetric()
metric.add_batch(tensor1=torch.tensor([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]), tensor2=torch.tensor([1, 2]))

Load metrics from Hub instead of GitHub repo

In order for the metrics on Spaces (added in #14) to be loaded, we need to change the loading mechanism to point to the Hub instead of evaluate's repository. This should not take much more effort than changing the URL from GitHub to the Hub.

cc @lhoestq

Handle zero divs in consistent way

Sklearn has a consistent way of dealing with division-by-zero (either ignore, error out or warn), which we pass on in wrappers (see example).

In our own canonical evaluation modules, we also have potential ZeroDivs that we currently do not handle, e.g.: https://github.com/huggingface/evaluate/blob/main/comparisons/mcnemar/mcnemar.py#L94. We should fix this.
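
One possible shape for a shared helper is sketched below. The function name `safe_divide` and the accepted option values ("warn", "error", or a fallback number) are assumptions made for illustration; they loosely mirror the ignore / error / warn choices described above rather than any existing API.

```python
import warnings


def safe_divide(numerator, denominator, zero_division="warn"):
    """Divide, handling a zero denominator in one configurable place.

    zero_division: "warn" (return 0.0 and warn), "error" (raise),
    or a number that is returned silently.
    """
    if denominator != 0:
        return numerator / denominator
    if zero_division == "error":
        raise ZeroDivisionError("denominator is zero")
    if zero_division == "warn":
        warnings.warn("Division by zero encountered; returning 0.0", RuntimeWarning)
        return 0.0
    return float(zero_division)


print(safe_divide(3, 4))
print(safe_divide(3, 0, zero_division=0))
```

Routing every division in the canonical modules through one helper like this would keep the behaviour consistent across metrics, comparisons and measurements.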

Cache results from `Evaluator`

When running an evaluation with the Evaluator class, it would be great to cache the results (e.g. store them as an Arrow dataset) so that one doesn't have to wait to recompute everything each time.

This would also give evaluate a similar developer experience to datasets and allow users to leverage the intuitions from one lib to the other :)

Feature: enrich `MetricsInfo` with more meta information

Add more meta information to the MetricsInfo. The existing fields are listed here:

evaluate/src/evaluate/metric.py

Lines 76 to 135 in f884495

class MetricInfoMixin:
    """This base class exposes some attributes of MetricInfo
    at the base level of the Metric for easy access.
    """

    def __init__(self, info: MetricInfo):
        self._metric_info = info

    @property
    def info(self):
        """:class:`datasets.MetricInfo` object containing all the metadata in the metric."""
        return self._metric_info

    @property
    def name(self) -> str:
        return self._metric_info.metric_name

    @property
    def experiment_id(self) -> Optional[str]:
        return self._metric_info.experiment_id

    @property
    def description(self) -> str:
        return self._metric_info.description

    @property
    def citation(self) -> str:
        return self._metric_info.citation

    @property
    def features(self) -> Features:
        return self._metric_info.features

    @property
    def inputs_description(self) -> str:
        return self._metric_info.inputs_description

    @property
    def homepage(self) -> Optional[str]:
        return self._metric_info.homepage

    @property
    def license(self) -> str:
        return self._metric_info.license

    @property
    def codebase_urls(self) -> Optional[List[str]]:
        return self._metric_info.codebase_urls

    @property
    def reference_urls(self) -> Optional[List[str]]:
        return self._metric_info.reference_urls

    @property
    def streamable(self) -> bool:
        return self._metric_info.streamable

    @property
    def format(self) -> Optional[str]:
        return self._metric_info.format

Things that could be useful to add:

requirements: we could warn and instruct on what is required if not already installed.
test_cases: a list of input-output pairs to test whether the metric works as expected. This could also be used to populate a metric widget with examples where the user can preselect a few choices before testing a custom input. It could also trigger automatic tests of metrics.
score_range: the possible values for the metric score, e.g. [0, 1] for accuracy or [-1, 1] for cosine similarity.
score_order: whether larger or smaller is better. Could be useful for automatic ordering of several results.
task_id: a list of tasks this metric is applicable to (e.g. "binary-classification")
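
To make the proposal concrete, the extra fields could be grouped roughly as follows. This is only a sketch: the class name `ExtraMetricInfo`, the field types, and the "higher_is_better" / "lower_is_better" strings are assumptions, not part of the existing MetricInfo.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Tuple


@dataclass
class ExtraMetricInfo:
    """Hypothetical container for the extra metadata proposed above."""

    requirements: List[str] = field(default_factory=list)
    test_cases: List[Dict[str, Any]] = field(default_factory=list)
    score_range: Optional[Tuple[float, float]] = None
    score_order: Optional[str] = None  # "higher_is_better" or "lower_is_better"
    task_id: List[str] = field(default_factory=list)

    def best(self, scores: List[float]) -> float:
        """Pick the best score according to `score_order` (defaults to higher)."""
        if self.score_order == "lower_is_better":
            return min(scores)
        return max(scores)


accuracy_info = ExtraMetricInfo(
    requirements=["scikit-learn"],
    test_cases=[{"predictions": [0, 1], "references": [0, 1], "expected": {"accuracy": 1.0}}],
    score_range=(0.0, 1.0),
    score_order="higher_is_better",
    task_id=["binary-classification"],
)
print(accuracy_info.best([0.71, 0.84, 0.79]))  # 0.84
```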

cc @sashavor @lhoestq

Add support for other languages for rouge

I calculate rouge with

from datasets import load_metric
rouge = load_metric("rouge")
rouge_output = rouge.compute(predictions=['тест тест привет'], references=['тест тест пока'], rouge_types=[
    "rouge2"])["rouge2"].mid
print(rouge_output)

the result is
Score(precision=0.0, recall=0.0, fmeasure=0.0)
It seems like the rouge_score library that this metric uses filters out all characters other than Latin letters and digits
in rouge_scorer/tokenize.py with text = re.sub(r"[^a-z0-9]+", " ", six.ensure_str(text)).
Please add support for other languages.
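
The effect of that pattern can be reproduced with the standard library alone. In the sketch below, `latin_only_tokens` applies the quoted regular expression after lowercasing, and `unicode_tokens` is a hypothetical alternative that keeps any Unicode word characters; neither function is part of rouge_score.

```python
import re


def latin_only_tokens(text):
    """Apply the pattern quoted above: anything outside a-z0-9 becomes a space."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).split()


def unicode_tokens(text):
    """Alternative sketch: keep runs of Unicode word characters."""
    return re.findall(r"\w+", text.lower())


prediction = "тест тест привет"
print(latin_only_tokens(prediction))  # []
print(unicode_tokens(prediction))     # ['тест', 'тест', 'привет']
```

With no tokens left on either side, every n-gram overlap is empty, which is consistent with the all-zero score reported above.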

Only retain relevant statistics in certain metrics

Is your feature request related to a problem? Please describe.
As I understand, in the add_batch() function, the raw predictions and references are kept (in memory?) until compute() is called.
https://github.com/huggingface/datasets/blob/e248247518140d5b0527ce2843a1a327e2902059/src/datasets/metric.py#L423-L442

This takes O(n) memory. However, for many (most?) metrics, this is not necessary. E.g., for accuracy, only the # correct and # total need to be recorded.

Describe the solution you'd like
Probably an inheritance hierarchy where "predictions" and "references" are not always the two keys for the final metric computation. Each metric should create and maintain its own relevant statistics, again for example, "n_correct" and "n_total" for accuracy.

I believe the metrics in AllenNLP (https://github.com/allenai/allennlp/tree/39c40fe38cd2fd36b3465b0b3c031f54ec824160/allennlp/training/metrics) can be used as a good reference.

Describe alternatives you've considered
At least Metric.compute() shouldn't hard-code "predictions" and "references" so that custom subclasses may override this behavior.
https://github.com/huggingface/datasets/blob/e248247518140d5b0527ce2843a1a327e2902059/src/datasets/metric.py#L399-L400
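
A minimal sketch of the idea for accuracy, keeping only "n_correct" and "n_total" between batches. The class name `StreamingAccuracy` is hypothetical and the class does not inherit from the library's Metric; it only illustrates the constant-memory bookkeeping.

```python
class StreamingAccuracy:
    """Accuracy that stores two counters instead of all predictions and references."""

    def __init__(self):
        self.n_correct = 0
        self.n_total = 0

    def add_batch(self, predictions, references):
        if len(predictions) != len(references):
            raise ValueError("predictions and references must have the same length")
        # Only the sufficient statistics are kept; the raw batch is discarded.
        self.n_correct += sum(int(p == r) for p, r in zip(predictions, references))
        self.n_total += len(references)

    def compute(self):
        if self.n_total == 0:
            raise ValueError("no examples were added")
        result = {"accuracy": self.n_correct / self.n_total}
        # Reset so the object can be reused for the next evaluation run.
        self.n_correct = 0
        self.n_total = 0
        return result


metric = StreamingAccuracy()
metric.add_batch(predictions=[1, 0, 1], references=[1, 1, 1])
metric.add_batch(predictions=[0], references=[0])
print(metric.compute())  # {'accuracy': 0.75}
```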

Add support for IR metrics

Metrics like nDCG and DCG.
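
For reference, a small sketch of the two metrics named above, using the linear-gain form of DCG (relevance divided by log2 of rank + 1). The function names and the optional cutoff `k` are choices made for this example; other variants, such as exponential gain, exist.

```python
import math


def dcg(relevances, k=None):
    """Discounted cumulative gain of a ranked list of graded relevances."""
    ranked = relevances[:k] if k is not None else relevances
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(ranked, start=1))


def ndcg(relevances, k=None):
    """DCG normalised by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0


ranking = [3, 2, 3, 0, 1, 2]
print(round(dcg(ranking), 3))  # 6.861
print(ndcg(ranking))
```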

Add COCO evaluation metrics

I'm currently working on adding Facebook AI's DETR model (end-to-end object detection with Transformers) to HuggingFace Transformers. The model is working fine, but regarding evaluation, I'm currently relying on external CocoEvaluator and PanopticEvaluator objects which are defined in the original repository (here and here respectively).

Running these in a notebook gives you nice summaries like this:

It would be great if we could import these metrics from the Datasets library, something like this:

import datasets

metric = datasets.load_metric('coco')

for model_input, gold_references in evaluation_dataset:
    model_predictions = model(model_input)
    metric.add_batch(predictions=model_predictions, references=gold_references)

final_score = metric.compute()

I think this would be great for object detection and semantic/panoptic segmentation in general, not just for DETR. Reproducing results of object detection papers would be way easier.

However, object detection and panoptic segmentation evaluation is a bit more complex than accuracy (it's more like a summary of metrics at different thresholds rather than a single one). I'm not sure how to proceed here, but happy to help making this possible.

Connect `evaluate` with the Data Measurements Tool

Given that much of the information from the Data Measurements Tool is relevant to the evaluate library, it would be useful to connect the two.
For instance, having a measurements folder in the evaluate repo that would allow users to call, e.g.

evaluate.measurements.npmi('glue','woman','housewife')
evaluate.measurements.npmi('glue','man','programmer')

and get the nPMI values for those terms and the GLUE dataset.

The input/output structure of measurements may be a bit different, but they are useful additions to the repo.

cc @mmitchellai @yjernite @TristanThrush

Add `Evaluator` class to easily evaluate a combination of (model, dataset, metric)

Similar to the Trainer class in transformers it would be nice to easily evaluate a model on a dataset given a metric. We could use the Trainer but it comes with a lot of unused extra stuff and is transformers centric. Alternatively we could build an Evaluator as follows:

from evaluate import Evaluator
from evaluate import load_metric
from datasets import load_dataset
from transformers import pipeline

metric = load_metric("bleu")
dataset = load_dataset("wmt19", language_pair=("de", "en"))
pipe = pipeline("translation", model="opus-mt-de-en")

# WMT specific transform
dataset = dataset.map(lambda x: {"source": x["translation"]["de"], "target": x["translation"]["en"]}) 

evaluator = Evaluator(
    model=pipe,
    dataset=dataset,
    metric=metric,
    dataset_mapping={"model_input": "source", "references": "target"}
)

evaluator.evaluate()
>>> {"bleu": 12.4}

The dataset_mapping maps the dataset columns to inputs for the model and metric. Using the pipeline API as the standard for the Evaluator this could easily be extended to any other framework. The user would just need to setup a pipeline class with the main requirement being that inputs and outputs follow the same format and that the class has implemented a __call__ method.

The advantage of starting with the pipeline API is that in transformers it already implements a lot of quality of life functionality such as batching and GPU. Also it abstracts away the pre/post-processing.

In #16 it is mentioned that statistical significance testing would be a desired feature. The above example could be extended to enable this:

evaluator.evaluate(n_runs=42)
>>> [{"bleu": 12.4}, {"bleu": 8.3}, ...]

Where under the hood the random seed is changed between the runs.

cc @douwekiela @osanseviero @NRajani @lhoestq

Rouge and Meteor for multiple references

Hi,

Currently rouge and meteor support only single references. Can we use these metrics with multiple references?

Statistical significance testing

As mentioned in a recent paper about evaluating MT approaches (and probably other sources too), statistical significance testing can be used to confirm that one method is superior to another, saying that "it remains one of the most cost-effective tools to check how trustworthy a particular difference between two metric scores is."

We could possibly use the Wilcoxon signed-rank test (implemented in scipy) or another similar approach.
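
As an illustration of the "another similar approach" option, here is a standard-library sketch of an exact paired permutation (sign-flip) test. It is not the Wilcoxon signed-rank test mentioned above; the function name and the example scores are made up for this sketch, and the full enumeration is only practical for small samples.

```python
from itertools import product


def paired_sign_flip_test(scores_a, scores_b):
    """Exact two-sided paired permutation test on per-example score differences.

    Enumerates all 2**n sign assignments, so keep n small.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    extreme = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        # Count assignments whose summed difference is at least as large as observed.
        if abs(sum(s * d for s, d in zip(signs, diffs))) >= observed - 1e-12:
            extreme += 1
    return extreme / total


# Made-up per-example scores for two systems on the same five examples.
system_a = [0.62, 0.70, 0.55, 0.81, 0.66]
system_b = [0.60, 0.65, 0.50, 0.80, 0.61]
print(paired_sign_flip_test(system_a, system_b))
```

The returned value is the fraction of sign assignments that produce a difference at least as extreme as the observed one, which serves as the p-value under the null hypothesis that the two systems are interchangeable.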

Docs: add documentation page

Before the release we need to setup the documentation for the library:

update documentation
add examples
deploy docs page

Calling existing metrics from other metrics

There are several cases of metrics calling other metrics, e.g. Wiki Split which calls BLEU and SARI. These are all currently re-implemented each time (often with external code).

A potentially more efficient and centralized way of doing things would be to have a single implementation and call that implementation.

E.g. @lhoestq 's proposal:

def _compute(...):
    bleu = load_metric("bleu", cache_dir=self.cache_dir, seed=self.seed)
    output = bleu._compute(...)

Something to keep in mind for the big metric reorg!

Seq2Seq Metrics QOL: Bleu, Rouge

Putting all my QOL issues here, idt I will have time to propose fixes, but I didn't want these to be lost, in case they are useful. I tried using rouge and bleu for the first time and wrote down everything I didn't immediately understand:

Bleu expects tokenization, can I just kwarg it like sacrebleu?
different signatures, means that I would have had to add a lot of conditionals + pre and post processing: if I were going to replace the calculate_rouge and calculate_bleu functions here: https://github.com/huggingface/transformers/blob/master/examples/seq2seq/utils.py#L61

What I tried

Rouge experience:

rouge = load_metric('rouge')
rouge.add_batch(['hi im sam'], ['im daniel']) # fails
rouge.add_batch(predictions=['hi im sam'], references=['im daniel']) # works
rouge.compute() # huge messy output, but reasonable. Not worth integrating b/c don't want to rewrite all the postprocessing.

BLEU experience:

bleu = load_metric('bleu')
bleu.add_batch(predictions=['hi im sam'], references=['im daniel'])
bleu.add_batch(predictions=[['hi im sam']], references=[['im daniel']])

All of these raise ValueError: Got a string but expected a list instead: 'im daniel'

Doc Typo

This says dataset=load_metric(...) which seems wrong, will cause NameError

cc @lhoestq, feel free to ignore.

Feature: integration standard libraries

Create a uniform integration for metrics libraries such as scikit-learn, keras.metrics or torchmetrics instead of integrating the metrics from these frameworks one by one. The goal of the integration is to be able to load the metrics from these frameworks with a syntax as follows:

accuracy = load_metric("scikit-learn/accuracy")

This way there is also more transparency where the metrics come from especially if they are just a wrapper around existing libraries. This would also simplify maintaining them by avoiding the need to change every metric separately if needed.

Popular metrics libraries:

Others (maybe a bit more niche):

If all metrics (incl. "canonical" ones) will be on the hub such an integration could just be a script that automatically creates (or updates) the necessary repositories on the hub.

Feature: save metrics results locally

Create a way to store the result of a metric locally. Especially useful when running many experiments. Besides the score of the evaluation it is important to be able to store as much additional information as possible:

for model in all_models:
    for dataset in all_datasets:
        predictions, references = evaluate(model, dataset)
        metric.add(predictions, references)
        metric.compute()
        metric.save(name="experiment_name", dataset_name=dataset.name, model_name=model.name, **model.hyperparams)

Besides the information provided by the users (essentially key, value pairs) we could also save some additional information:

Time of experiment
path to script that created the scores
git commit hash of repo that produced the scores
others?

The results could be stored in simple JSON such that we would not even need to provide a loading mechanism as a user could easily load the results e.g. as a pandas.DataFrame.
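
A sketch of what such a JSON record could look like, using only the standard library. The function names, the record layout, and the example score are assumptions made for illustration; the git lookup simply returns None when no repository or git executable is available.

```python
import json
import subprocess
import sys
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def current_git_hash():
    """Return the current commit hash, or None when it cannot be determined."""
    try:
        out = subprocess.run(
            ["git", "rev-parse", "HEAD"], capture_output=True, text=True, check=True
        )
        return out.stdout.strip()
    except (subprocess.CalledProcessError, FileNotFoundError):
        return None


def save_result(path, result, **params):
    """Write the scores plus user-provided and automatic metadata as JSON."""
    record = {
        "result": result,
        "params": params,  # user-provided key/value pairs
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "script": sys.argv[0],
        "git_hash": current_git_hash(),
    }
    path = Path(path)
    path.write_text(json.dumps(record, indent=2))
    return path


# The score below is a made-up value for illustration.
out_file = save_result(
    Path(tempfile.gettempdir()) / "experiment_name.json",
    {"accuracy": 0.91},
    dataset_name="glue",
    model_name="my-model",
)
print(json.loads(out_file.read_text())["params"])
```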

Feature: calibration error estimators

While implementations of standard metrics are scattered, this is certainly the case for L^p calibration error estimators.

Why is it important to measure model calibration?
When fine-tuning a model with cross-entropy loss (or any other strictly proper loss optimizing both accuracy and calibration) there is no guarantee that your model will turn out well-calibrated. Empirically, large NNs were shown to "overfit" on accuracy, leading to sub-optimal calibration.

Additionally, binned estimators typically require setting some arguments such as the binning scheme (equal-range/equal-mass/...), number of bins, and the p-norm. More advanced settings include debiasing ([1] Kumar et al. 2019) or the proxy used for average bin probability (bin center or bin left/right edge).
This library might provide the right standardization and documentation on which arguments are important and how they impact estimation and comparisons.

Without complicating matters from the start, it might already be nice to have a simple calibration error estimator along the lines of ECE ([2] Guo et al. 2017), which (despite some flaws and differing implementations) is commonly used to report top-1 miscalibration. Some "reasonable" defaults are equal-range binning with 15 bins and p-norm 1, to be discussed :).

With regards to the implementation, there is a clean way to create a hashmap of unique bin assignments which keeps running averages of the conditional expectation and average bin probabilities. In the future, the hashmap can even be created on the validation set and when evaluating on the test set retrieves values by the hash, resulting in unbiased estimates.
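
To anchor the discussion, here is a minimal sketch of a binned ECE estimator with the defaults suggested above (equal-range binning, 15 bins, p-norm 1), using a dictionary keyed by bin index to accumulate per-bin statistics. The function name and argument layout are assumptions for this example, and it covers neither debiasing nor other binning schemes.

```python
def expected_calibration_error(confidences, correct, n_bins=15):
    """Binned ECE with equal-range bins over [0, 1] and p-norm 1.

    confidences: top-1 predicted probabilities; correct: matching 0/1 outcomes.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    bins = {}  # bin index -> (count, sum of confidences, sum of hits)
    for conf, hit in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        count, conf_sum, hit_sum = bins.get(idx, (0, 0.0, 0.0))
        bins[idx] = (count + 1, conf_sum + conf, hit_sum + float(hit))
    total = len(confidences)
    # Weighted average of |accuracy - mean confidence| over the non-empty bins.
    return sum(
        (count / total) * abs(hit_sum / count - conf_sum / count)
        for count, conf_sum, hit_sum in bins.values()
    )


print(expected_calibration_error([0.9, 0.9, 0.6, 0.55], [1, 0, 1, 1]))
```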

I would very much appreciate a discussion on this so that the community is aligned on the approach.
Over the weekend, I will get my hands dirty for a first PR, maybe better to discuss it there :)
First functional code dump here: https://huggingface.co/spaces/jordyvl/ece

[1] Kumar, A., Liang, P.S. and Ma, T., 2019. Verified uncertainty calibration. Advances in Neural Information Processing Systems, 32.
[2] Guo, C., Pleiss, G., Sun, Y. and Weinberger, K.Q., 2017, July. On calibration of modern neural networks. In International Conference on Machine Learning (pp. 1321-1330). PMLR.

Defining and standardizing metric input structures

Documenting the types of inputs and data structures used for each metric

accuracy

predictions = Sequence(Value("int32"))
references = Sequence(Value("int32"))

or if "multilabel" mode:

predictions= Value("int32")
references = Value("int32")

bertscore

predictions = Value("string", id="sequence")
references = Sequence(Value("string", id="sequence"), id="references")

bleu

predictions = Sequence(Value("string", id="token"), id="sequence")
references = Sequence(Sequence(Value("string", id="token"), id="sequence"), id="references")

bleurt

predictions = Value("string", id="sequence")
references = Value("string", id="sequence")

cer

predictions = Value("string", id="sequence")
references = Value("string", id="sequence")

chrf

predictions = Value("string", id="sequence")
references = Sequence(Value("string", id="sequence"), id="references")

code_eval

predictions = Sequence(Value("string"))
references = Value("string")

comet

sources = Value("string", id="sequence")
predictions = Value("string", id="sequence")
references = Value("string", id="sequence")

competition_math

predictions = Value("string")
references = Value("string")

coval

predictions = Sequence(Value("string"))
references = Sequence(Value("string"))

N.B. The sentences have to be in CoNLL format, which may be tricky to handle in some cases

cuad

"predictions": {
    "id": Value("string"),
    "prediction_text": Sequence(Value("string")),
},
"references": {
    "id": Value("string"),
    "answers": Sequence(
        {
            "text": Value("string"),
            "answer_start": Value("int32"),
        }
    ),
},

exact_match

predictions = Value("string", id="sequence")
references = Value("string", id="sequence")

predictions = Sequence(Value("int32"))
references = Sequence(Value("int32"))

frugalscore

references = Value("string")
predictions = Value("string")

gleu

predictions = Sequence(Value("string", id="token"), id="sequence")
references = Sequence(Sequence(Value("string", id="token"), id="sequence"), id="references")

glue

predictions = Value("int64" if self.config_name != "stsb" else "float32")
references = Value("int64" if self.config_name != "stsb" else "float32")

The type of input depends on the GLUE subset used.

google_bleu

predictions = Sequence(Value("string", id="token"), id="sequence")
references = Sequence(Sequence(Value("string", id="token"), id="sequence"), id="references")

indic_glue

predictions = Value("int64") if self.config_name != "cvit-mkb-clsr" else Sequence(Value("float32"))
references = Value("int64") if self.config_name != "cvit-mkb-clsr" else Sequence(Value("float32"))

mae

predictions = Value("float")
references = Value("float")

or if multilist:

predictions = Sequence(Value("float"))
references = Sequence(Value("float"))

mahalanobis

"X": Sequence(Value("float", id="sequence"), id="X")
reference_distribution = np.array(reference_distribution)

N.B. the names for references and predictions are different here -- maybe we should standardize? wdyt @lhoestq

matthews_correlation

predictions = Value("int32")
references = Value("int32")

mauve

predictions = Value("string", id="sequence")
references = Value("string", id="sequence")

mean_iou

predictions = Sequence(Sequence(Value("uint16")))
references = Sequence(Sequence(Value("uint16")))

What's a uint16? unicode? this is the only metric with a unicode restriction (so far).

meteor

predictions = Value("string", id="sequence")
references = Value("string", id="sequence")

mse

predictions = Value("float")
references = Value("float")

or if multilist:

predictions = Sequence(Value("float"))
references = Sequence(Value("float"))

pearsonr

references = Value("float")
predictions = Value("float")

perplexity

input_texts = Value("string")

precision

predictions = Value("int32")
references = Value("int32")

or if multilist:

predictions = Sequence(Value("int32"))
references = Sequence(Value("int32"))

recall

predictions = Value("int32")
references = Value("int32")

or if multilist:

predictions = Sequence(Value("int32"))
references = Sequence(Value("int32"))

rouge

predictions = Value("string", id="sequence")
references = Value("string", id="sequence")

sacrebleu

predictions = Value("string", id="sequence")
references = Sequence(Value("string", id="sequence"), id="references")

sari

sources = Value("string", id="sequence")
predictions = Value("string", id="sequence")
references = Sequence(Value("string", id="sequence"), id="references")

seqeval

predictions = Sequence(Value("string", id="label"), id="sequence")
references = Sequence(Value("string", id="label"), id="sequence")

N.B. both predictions and references are in IOB format
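As a rough illustration of what IOB-tagged inputs look like, here is a sketch that turns one tag sequence into entity spans. The function name and the example tags are hypothetical, and it assumes the IOB2 convention (every entity starts with a B- tag), which may not match every dataset:

```python
from typing import List, Tuple


def iob_to_spans(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Convert IOB2 tags into (entity_type, start, end_exclusive) spans."""
    spans = []
    start, current = None, None
    # A trailing "O" sentinel closes an entity that runs to the end.
    for i, tag in enumerate(tags + ["O"]):
        continues = tag.startswith("I-") and current == tag[2:]
        if not continues and current is not None:
            spans.append((current, start, i))
            start, current = None, None
        # Lenient choice: a stray I- tag with no open entity starts a new one.
        if tag.startswith("B-") or (tag.startswith("I-") and current is None):
            start, current = i, tag[2:]
    return spans


# One sentence of five tokens; predictions and references share this shape.
predictions = [["O", "B-PER", "I-PER", "O", "B-LOC"]]
references = [["O", "B-PER", "I-PER", "O", "B-LOC"]]
spans = iob_to_spans(predictions[0])
```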

spearmanr

predictions = Value("float")
references = Value("float")

squad

"predictions": {"id": Value("string"), "prediction_text": Value("string")}
"references": {
    "id": Value("string"),
    "answers": features.Sequence(
        {
            "text": Value("string"),
            "answer_start": Value("int32"),
        }
    ),
}

squad_v2

"predictions": {
    "id": Value("string"),
    "prediction_text": Value("string"),
    "no_answer_probability": Value("float32"),
}
"references": {
    "id": Value("string"),
    "answers": features.Sequence(
        {"text": Value("string"), "answer_start": Value("int32")}
    ),
}

N.B. the SQuAD and SQuAD v2 formats differ in that v2 has the 'no_answer_probability' field in predictions.
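In plain Python terms, the two prediction formats could look like the sketch below. The id, the answer text and the offset are made-up values, and the references use the dict-of-lists layout that a Sequence of a dict takes when written as ordinary Python data:

```python
# SQuAD v1 prediction: an id plus the predicted answer text (values are made up).
squad_prediction = {"id": "example-0", "prediction_text": "1976"}

# SQuAD v2 adds a no-answer probability to each prediction.
squad_v2_prediction = {**squad_prediction, "no_answer_probability": 0.0}

# References share one shape in both versions; "answers" is a dict of
# parallel lists rather than a list of dicts.
reference = {
    "id": "example-0",
    "answers": {"text": ["1976"], "answer_start": [97]},
}

# The only key that differs between the two prediction formats.
extra_keys = set(squad_v2_prediction) - set(squad_prediction)
```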

super_glue

if self.config_name == "record":
    return {
        "predictions": {
            "idx": {
                "passage": Value("int64"),
                "query": Value("int64"),
            },
            "prediction_text": Value("string"),
        },
        "references": {
            "idx": {
                "passage": Value("int64"),
                "query": Value("int64"),
            },
            "answers": Sequence(datasets.Value("string")),
        },
    }
elif self.config_name == "multirc":
    return {
        "predictions": {
            "idx": {
                "answer": Value("int64"),
                "paragraph": Value("int64"),
                "question": Value("int64"),
            },
            "prediction": Value("int64"),
        },
        "references": Value("int64"),
    }
else:
    return {
        "predictions": Value("int64"),
        "references": Value("int64"),
    }

ter

predictions = Value("string", id="sequence")
references = Sequence(Value("string", id="sequence"), id="references")

wer

predictions = Value("string", id="sequence")
references = Value("string", id="sequence")

wiki_split

predictions = Value("string", id="sequence")
references = Sequence(Value("string", id="sequence"), id="references")

xnli

predictions = Value("int64" if self.config_name != "sts-b" else "float32")
references = Value("int64" if self.config_name != "sts-b" else "float32")

xtreme_s

pred_type = "int64" if self.config_name in ["fleurs-lang_id", "minds14"] else "string"

predictions = Value(pred_type)
references = Value(pred_type)

N.B. the input depends on the XTREME-S dataset selected

Adding 3 metrics

Hello !
I hope you are doing great !

Could you add these 3 metrics:

BaryScore (oral, EMNLP 2021)
DepthScore
InfoLM (best student paper award AAAI 2022)

They are available here: https://github.com/PierreColombo/nlg_eval_via_simi_measures in a unified format!

Cheers,

Tokenized BLEU considered harmful - Discussion on community-based process

https://github.com/huggingface/nlp/blob/7d1526dfeeb29248d832f1073192dbf03ad642da/metrics/bleu/bleu.py#L76 assumes the inputs are tokenized by the user. This is bad practice because the user's tokenizer is usually not the same as the one used by mteval-v13a.pl, the closest thing we have to a standard. Moreover, tokenizers are like window managers: they can be endlessly customized and nobody has quite the same options.

As @mjpost reported in https://www.aclweb.org/anthology/W18-6319.pdf, BLEU scores can vary by 1.8 between configurations. Yet people are incorrectly putting non-comparable BLEU scores in the same table, such as Table 1 in https://arxiv.org/abs/2004.04902 .

There are a few use cases for tokenized BLEU like Thai. For Chinese, people seem to use character BLEU for better or worse.

The default easy option should be the one that's correct more often. And that is sacrebleu. Please don't make it easy for people to run what is usually the wrong option; it definitely shouldn't be bleu.

Also, I know this is inherited from TensorFlow and, paging @lmthang, they should discourage it too.
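To make the tokenization point concrete, here is a toy sketch. It computes only clipped unigram precision (one ingredient of BLEU, not BLEU itself), and both tokenizers are simplistic stand-ins rather than mteval-v13a or sacrebleu's tokenizers:

```python
import re
from collections import Counter
from typing import Callable, List


def unigram_precision(hyp: str, ref: str, tokenize: Callable[[str], List[str]]) -> float:
    """Clipped unigram precision of hyp against a single reference."""
    hyp_counts = Counter(tokenize(hyp))
    ref_counts = Counter(tokenize(ref))
    matched = sum(min(count, ref_counts[tok]) for tok, count in hyp_counts.items())
    total = sum(hyp_counts.values())
    return matched / total if total else 0.0


def whitespace_tokenize(text: str) -> List[str]:
    """Split on whitespace only, so punctuation stays glued to words."""
    return text.split()


def punct_tokenize(text: str) -> List[str]:
    """Split words and punctuation marks into separate tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


hypothesis = "Hello, world!"
reference = "Hello world!"

# Same strings, two tokenizers, two different scores.
p_whitespace = unigram_precision(hypothesis, reference, whitespace_tokenize)  # 1 of 2 tokens match
p_punct = unigram_precision(hypothesis, reference, punct_tokenize)            # 3 of 4 tokens match
```

The two values are not comparable even though the hypothesis and reference are identical, which is the concern raised above.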

Feature: add community metrics

We would like a way for community members to easily add custom or new metrics. These custom metrics could be hosted on the hub either in a dedicated repository type (like datasets or models) or on Spaces.

Spaces come with a few advantages, especially in the short term:

No dedicated repo type needed
Infrastructure already exists
We can easily set up a demo without needing a custom widget (e.g. with Gradio)

@osanseviero made a PoC here: https://huggingface.co/spaces/osanseviero/accuracy_metric

A few things that we should add to make this as user friendly and useful as possible:

create a template to add a metric space
- template README.md (like model/dataset card) with basic structure
- template my_metric.py file (similar to datasets loading scripts)
- gradio app.py that loads all information from README.md and my_metric.py to create a default widget. This probably does not need to be changed.
debugging tools
- instructions on how to clone the repository directly on spaces
- maybe an evaluate-cli command to test that everything is fine: evaluate-cli test-metric PATH_TO_REPO
dedicated tag: if we ever want to migrate the metrics to a dedicated place like datasets and models, it would be good to have a tag to find them all. Similarly, we could then easily provide a list_metrics() function in evaluate without needing to look at the content of each repo.

The goal is to enable the following behaviour (example with the space above):

accuracy_metric = load_metric("osanseviero/accuracy_metric")

Link to step-by-step guide is broken.

It's about this link in the README.

When I click it I see a page with the following text:

The documentation page CREATING_AND_SHARING.HTML doesn’t exist in v0.1.0, but exists on the main version. Click [here](https://huggingface.co/docs/evaluate/main/en/creating_and_sharing.html) to redirect to the main version of the documentation.

When I click that link I get

Bug: `SCIPY_AVAILABLE` not correct

When trying to use the evaluator, I get an error saying:

If you want to use the Evaluator you need scipy>=1.7.1. Run pip install evaluate[evaluator]

But I have scipy v1.8.1 installed, see screenshot.
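The report does not say how the availability flag is computed, so the following is only a sketch of one robust way to check a minimum version, not a diagnosis of this bug. The helper names are hypothetical; in real code `packaging.version.parse` is the usual tool. The sketch compares numeric tuples, because comparing version strings directly gives wrong answers once a component reaches two digits:

```python
from typing import Tuple


def parse_release(version: str) -> Tuple[int, ...]:
    """Parse the leading numeric release part, e.g. '1.8.1' -> (1, 8, 1)."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break  # stop at the first non-numeric component, e.g. 'dev0'
        parts.append(int(digits))
    return tuple(parts)


def version_at_least(installed: str, required: str) -> bool:
    """Return True if installed >= required, comparing numeric components."""
    return parse_release(installed) >= parse_release(required)


ok = version_at_least("1.8.1", "1.7.1")

# Pitfall: plain string comparison orders "1.10.0" before "1.7.1".
naive_string_check = "1.10.0" >= "1.7.1"
tuple_check = version_at_least("1.10.0", "1.7.1")
```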

Feature: standardize inputs/outputs of metrics

Currently, several different input/output formats are possible in Metrics. We should standardize them as much as possible while respecting the following principles:

inputs/outputs are easy to understand and use
outputs are compatible with other frameworks

For the output standardization: probably a dictionary structure, even if nested, would be ok. A dedicated output class like the one in transformers models could also be considered, but this is probably not necessary here. To make it compatible with e.g. keras we could add a postprocess function at initialization, similar to a transform in datasets.

There are three options we could implement:

load_metric(..., postprocess="metric_key") # equivalent result to `metric.compute()["metric_key"]`
load_metric(..., postprocess="flatten") # equivalent to flattening the output dict: `flatten(metric.compute())`
load_metric(..., postprocess=func) # equivalent result to `func(metric.compute())`
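The three options could be dispatched roughly as sketched below. Everything here is hypothetical: `apply_postprocess` and `flatten` are illustrative names, the underscore key separator is an arbitrary choice, and the sketch works on an already computed result dict rather than on a real metric object:

```python
from typing import Any, Callable, Dict, Optional, Union


def flatten(result: Dict[str, Any], parent_key: str = "", sep: str = "_") -> Dict[str, Any]:
    """Flatten a nested result dict into a single level of keys."""
    flat = {}
    for key, value in result.items():
        new_key = f"{parent_key}{sep}{key}" if parent_key else str(key)
        if isinstance(value, dict):
            flat.update(flatten(value, new_key, sep))
        else:
            flat[new_key] = value
    return flat


def apply_postprocess(
    result: Dict[str, Any],
    postprocess: Optional[Union[str, Callable[[Dict[str, Any]], Any]]] = None,
) -> Any:
    """Apply one of the three proposed postprocess options to a result dict."""
    if postprocess is None:
        return result
    if callable(postprocess):
        return postprocess(result)      # postprocess=func
    if postprocess == "flatten":
        return flatten(result)          # postprocess="flatten"
    return result[postprocess]          # postprocess="metric_key"


# A made-up nested result, standing in for metric.compute().
result = {"rouge1": {"precision": 0.5, "recall": 0.25}, "bleu": 0.3}
by_key = apply_postprocess(result, "bleu")
flat = apply_postprocess(result, "flatten")
custom = apply_postprocess(result, lambda r: r["bleu"] * 2)
```

One design wrinkle this exposes: a metric whose output key is literally "flatten" would be ambiguous under this scheme.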

The METEOR metric seems inconsistent with the official version

Describe the bug

The computed METEOR score seems strange because it is very different from the scores computed by other tools. For example, I use the METEOR score computed by NLGEval as the reference (it reuses the official jar file for the computation).

Steps to reproduce the bug

from datasets import load_metric
from nlgeval import NLGEval, compute_individual_metrics

meteor = load_metric('meteor')
predictions = ["It is a guide to action which ensures that the military always obeys the commands of the party"]
references = ["It is a guide to action that ensures that the military will forever heed Party commands"]
results = meteor.compute(predictions=predictions, references=references)
# print the actual result
print(round(results["meteor"], 4))
metrics_dict = compute_individual_metrics(references, predictions[0])
# print the expected result
print(round(metrics_dict["METEOR"], 4))

By the way, you need to install the nlg-eval library first. Please check the installation guide here, thanks!

Expected results

0.4474

Actual results

0.7398

Environment info

datasets version: 1.10.2
Platform: macOS-10.16-x86_64-i386-64bit
Python version: 3.8.5
PyArrow version: 4.0.1

Standardize metric ranges

Several common metrics, like exact_match and f1, sometimes range from 0-1, and sometimes from 0-100.

As discussed with @lhoestq , we think it makes more sense to report them from 0-1, which would entail changing the code of metrics such as CUAD.

@emibaylor and I will add other metrics here that we see reporting scores from 0-100 instead of 0-1.

Refactor for loading multiple evaluation categories

In addition to Metric we also want to add other types of evaluations such as Comparison (#34) and Measurement (#35) following the internal discussion (https://huggingface.slack.com/archives/C035S5G2J3D/p1652200198598789). Technically, these all behave the same way: they take some inputs and compute a score. As such they could largely be one class (essentially what Metric is today). That means we could also load them in the same fashion:

import evaluate

metric = evaluate.load("accuracy")
comparison = evaluate.load("mcnemar")
measure = evaluate.load("npmi")

While each type can live in a different folder in the repository, this can cause name clashes when a name can be used for two methods (e.g. perplexity can be a metric and a measurement). This could be solved with an additional argument like load("perplexity", type="metric") that resolves those conflicts. I think this would be fine.

However, there is a second conflict with Spaces: since each metric (or comparison/measurement) would have its own Space with a widget, it is not so easy to resolve the conflicts here, unless we create an org for each type of evaluation: e.g. evaluate-metrics, evaluate-comparisons, evaluate-measurements. Then each evaluation type is pushed to a separate org.

If that solution sounds good then we could implement the following behaviour:

if no type is provided we cycle through metric/comparison/measurement and return the first result
if a type is provided we only look for that one and raise an error if that type does not exist
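The two rules above could be sketched as follows. The in-memory `REGISTRY` is a hypothetical stand-in for whatever actually lives on the Hub, `resolve_module` is an illustrative name, and the keyword is called `module_type` here only to avoid shadowing Python's built-in `type`:

```python
from typing import Optional

# Lookup order used when no type is given.
MODULE_TYPES = ("metric", "comparison", "measurement")

# Hypothetical stand-in for the modules hosted in each org.
REGISTRY = {
    "metric": {"accuracy", "perplexity"},
    "comparison": {"mcnemar"},
    "measurement": {"npmi", "perplexity"},
}


def resolve_module(name: str, module_type: Optional[str] = None) -> str:
    """Return '<org>/<name>' for the first matching evaluation type."""
    if module_type is not None:
        if module_type not in MODULE_TYPES:
            raise ValueError(f"Unknown type {module_type!r}; expected one of {MODULE_TYPES}")
        if name not in REGISTRY[module_type]:
            raise FileNotFoundError(f"No {module_type} named {name!r}")
        return f"evaluate-{module_type}s/{name}"
    # No type given: cycle through the types and return the first hit.
    for candidate in MODULE_TYPES:
        if name in REGISTRY[candidate]:
            return f"evaluate-{candidate}s/{name}"
    raise FileNotFoundError(f"No evaluation module named {name!r}")


default_choice = resolve_module("perplexity")
explicit_choice = resolve_module("perplexity", module_type="measurement")
```

With this ordering, an ambiguous name such as perplexity silently resolves to the metric unless a type is passed.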

What do you think @douwekiela @lhoestq?

Feature: `push_to_hub` function for metrics.

Add a function to easily push the result of a metric to the hub similar to the Trainer in transformers.

Use-case: a user evaluates a model on a dataset. If this happens with a transformers model and they use the Trainer, they can easily do that during training. However, if they are not using a transformers model or the Trainer, or want to do it after training, they have to do it manually. Thus it would be nice to have a model- and framework-agnostic way to push the results to the hub and add them to the model's card.

The workflow would be roughly the following:

metric = load_metric("lvwerra/my_metric")
metric.add(some_predictions, some_references)
metric.compute()
metric.push_to_hub(name="my_new_metric", model="lvwerra/my_model", dataset="lvwerra/my_dataset")

This adds the result of the metric to the meta information of the README.md so it is displayed in the model card.

cc @osanseviero @lhoestq

Add DER metric

Add DER metric for speaker diarization task.

This is used by the SUPERB benchmark, for example.

🌟 [Metric Request] WOOD score

WOOD score paper : https://arxiv.org/pdf/2007.06898.pdf

Abstract :

Models that surpass human performance on several popular benchmarks display significant degradation in performance on exposure to Out of Distribution (OOD) data. Recent research has shown that models overfit to spurious biases and ‘hack’ datasets, in lieu of learning generalizable features like humans. In order to stop the inflation in model performance – and thus overestimation in AI systems’ capabilities – we propose a simple and novel evaluation metric, WOOD Score, that encourages generalization during evaluation.

TIMIT typically reports PER, not WER

The docs here mention that TIMIT reports WER, but this dataset typically serves as a benchmark for phone error rate (PER), because it’s one of the few resources that have manually annotated phone segments. I recommend fixing and clarifying that in the README:

evaluate/metrics/wer/README.md

Line 68 in c1141b0

 For example, datasets such as [LibriSpeech](https://huggingface.co/datasets/librispeech_asr) report a WER in the 1.8-3.3 range, whereas ASR models evaluated on [Timit](https://huggingface.co/datasets/timit_asr) report a WER in the 8.3-20.4 range. 

I think it would be good to have a clear difference between word/character/phone/token error rate (WER/CER/PER/TER) at the library level.
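One way to see the distinction is that all of these rates share the same edit-distance computation and differ only in the unit the sequences are split into. The sketch below uses a plain Levenshtein distance and made-up strings; it is not the implementation used by the library's wer module:

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance (substitutions, insertions, deletions) between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def error_rate(ref_units: Sequence, hyp_units: Sequence) -> float:
    """Edit distance divided by the number of reference units."""
    if not ref_units:
        raise ValueError("reference must contain at least one unit")
    return edit_distance(ref_units, hyp_units) / len(ref_units)


reference = "the cat sat"
hypothesis = "the cat sit"

# Same pair of strings, different units.
wer = error_rate(reference.split(), hypothesis.split())  # words: 1 error over 3 units
cer = error_rate(list(reference), list(hypothesis))      # characters: 1 error over 11 units
```

A phone error rate would pass lists of phone symbols to the same function, which is why naming the unit explicitly matters when reporting a number.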

Feature: compose multiple metrics into single object

Often models are evaluated on multiple metrics in a project. E.g. a classification project might always want to report the accuracy, precision, recall, and F1 score. In scikit-learn, one uses the widely used classification report for that. This proposal takes it a step further and allows the user to freely compose metrics. Similar to a DatasetDict, one could use the MetricsSuite like a Metric object.

metrics_suite = MetricsSuite(
     {
        "accuracy": load_metric("accuracy"),
        "recall": load_metric("recall")
     }
)

metrics_suite = MetricsSuite(
     {
        "bleu": load_metric("bleu"),
        "rouge": load_metric("rouge"),
        "perplexity": load_metric("perplexity")
     }
)

metrics_suite.add(predictions, references)
metrics_suite.compute()
>>> {"bleu": bleu_result_dict, "rouge": rouge_result_dict, "perplexity": perplexity_result_dict}

Alternatively, we could also flatten the return dict or offer that as an option. We could also add a summary option that defines how an overall result is calculated. E.g. summary="average" averages all the metrics into a summary metric, or a custom function could be passed with summary=lambda x: x["bleu"]**2 + 0.5*x["rouge"]+2. This would allow creating simple, composed metrics without needing to define a new metric (e.g. for a custom benchmark).
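A minimal sketch of how such a suite might behave is given below. It assumes each metric exposes add(predictions, references) and compute(); the toy metric classes are hypothetical stand-ins for loaded metrics, and only the callable form of summary is implemented:

```python
from typing import Any, Callable, Dict, Optional


class _ToyMetric:
    """Hypothetical stand-in for a loaded metric with add() and compute()."""

    key = "score"

    def __init__(self) -> None:
        self._pairs = []

    def add(self, predictions, references) -> None:
        self._pairs.extend(zip(predictions, references))

    def _score(self, matches: int, total: int) -> float:
        raise NotImplementedError

    def compute(self) -> Dict[str, float]:
        matches = sum(p == r for p, r in self._pairs)
        return {self.key: self._score(matches, len(self._pairs))}


class ToyAccuracy(_ToyMetric):
    key = "accuracy"

    def _score(self, matches, total):
        return matches / total


class ToyErrorRate(_ToyMetric):
    key = "error_rate"

    def _score(self, matches, total):
        return (total - matches) / total


class MetricsSuite:
    """Forward add() to every metric and collect compute() results by name."""

    def __init__(self, metrics: Dict[str, Any],
                 summary: Optional[Callable[[Dict[str, Any]], float]] = None) -> None:
        self.metrics = metrics
        self.summary = summary

    def add(self, predictions, references) -> None:
        for metric in self.metrics.values():
            metric.add(predictions, references)

    def compute(self) -> Dict[str, Any]:
        results = {name: metric.compute() for name, metric in self.metrics.items()}
        if self.summary is not None:
            results["summary"] = self.summary(results)
        return results


suite = MetricsSuite(
    {"accuracy": ToyAccuracy(), "error_rate": ToyErrorRate()},
    summary=lambda r: r["accuracy"]["accuracy"] - r["error_rate"]["error_rate"],
)
suite.add([1, 0, 1, 1], [1, 1, 1, 0])
results = suite.compute()
```

Because each metric returns its own dict, the summary function has to index two levels deep, which is one argument for the flattening option mentioned above.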

cc @douwekiela @lewtun

Initial proposal for documentation Table of Contents

Initial proposal for documentation Table of Contents:

Getting started
How-to guides
- Evaluate in a nutshell
- Choosing a metric for your task
Conceptual guides
- All about metrics
- Considerations for ML model evaluation
- Metrics vs Measurements vs Comparisons
Reference
- Main classes
- Builder classes
- Loading methods
- Pushing to the Hub
- Streamlit Demo

Wdyt @lvwerra ?

Add missing metrics

As per @douwekiela's suggestion, we should find the blind spots that we have in terms of missing metrics, especially from domains like speech recognition and computer vision.

Suggestions are welcome below!

Fresh pip installation gives error

I installed the library with pip and I get the following error.

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/tmp/ipykernel_9716/2159764109.py in <cell line: 1>()
----> 1 import evaluate

~/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/evaluate/__init__.py in <module>
----> 1 from . import estimators
      2 from . import evaluators
      3 from . import preprocessors
      4 from . import utils
      5 

~/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/evaluate/estimators.py in <module>
     41     GradientBoostingClassifier(),
     42     DecisionTreeClassifier(),
---> 43     DummyClassifier('most_frequent'),
     44 ]
     45 

TypeError: __init__() takes 1 positional argument but 2 were given

Environment info

evaluate version: 0.0.3
Platform: Linux-5.4.0-1066-aws-x86_64-with-glibc2.10
Python version: 3.8.12
PyArrow version: 7.0.0
Pandas version: 1.3.4
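The traceback is consistent with a constructor whose parameters are keyword-only being called with a positional argument; recent scikit-learn releases are, as far as I know, strict about this for DummyClassifier's strategy parameter. The module layout in the traceback (estimators, evaluators, preprocessors) also does not look like 🤗 Evaluate's, so the installed 0.0.3 package may be a different project; that is an assumption. The sketch below reproduces the same kind of error with a hypothetical stand-in class and no third-party imports:

```python
class KeywordOnlyClassifier:
    """Hypothetical stand-in for an estimator with a keyword-only constructor."""

    def __init__(self, *, strategy: str = "prior") -> None:
        self.strategy = strategy


message = ""
try:
    # Positional call, analogous to DummyClassifier('most_frequent') in the traceback.
    KeywordOnlyClassifier("most_frequent")
except TypeError as err:
    message = str(err)

# Passing the argument by keyword works.
clf = KeywordOnlyClassifier(strategy="most_frequent")
```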

`BLEURT` has no `README.md`

Currently BLEURT is the only metric without a README.md. We should add one, not super urgent, though.

cc @sashavor

Loading BERTScore

Hi,

I was testing loading bertscore with evaluate and saw the following error.

>>> import evaluate
>>> metric = evaluate.load("bertscore")
Couldn't find a directory or a metric named 'bertscore' in this version. It was picked from the master branch on github instead.

Looks a bit mystic to me. Perhaps you have some insights :)

Thank you!

Feature: type hints for metric inputs

We could add type hints to the metric inputs that go beyond the standard Python type hints. This could help users know what the input format is (e.g. logits vs. normalized scores) and improve debugging: we could run some tests on the inputs depending on their type, e.g. raise ValueError(f"Oops, looks like you passed unnormalized scores to {metric.name} but it expects values between 0 and 1. Try applying a SoftMax function first.").

Some ideas/examples:

NormalizedArray and OneHotArray/MultiHotArray for classification predictions and references, respectively.
LogitsArray e.g. for cross-entropy.
TokenList or TextList for NLP related scores
CategoricalPixelLabels or ContinuousPixelLabels for image tasks

Not the best naming, yet, but hopefully gives an idea.
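As a rough sketch of the idea, a NormalizedArray-style check could look like the code below. The class and function names are hypothetical and follow the naming suggested above, and plain Python lists stand in for real arrays:

```python
import math
from typing import List


class NormalizedArray(list):
    """Hypothetical input type: a list of scores that all lie in [0, 1]."""


def check_normalized(scores: List[float], metric_name: str) -> NormalizedArray:
    """Validate scores and wrap them, raising a helpful error otherwise."""
    if any(not 0.0 <= s <= 1.0 for s in scores):
        raise ValueError(
            f"Oops, looks like you passed unnormalized scores to {metric_name} "
            "but it expects values between 0 and 1. Try applying a softmax function first."
        )
    return NormalizedArray(scores)


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, -1.0, 0.5]
probabilities = check_normalized(softmax(logits), "my_metric")  # passes

error_message = ""
try:
    check_normalized(logits, "my_metric")  # raw logits are rejected
except ValueError as err:
    error_message = str(err)
```

A range check like this cannot tell probabilities from other values that happen to fall in [0, 1], so it catches common mistakes rather than proving the input is correct.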

We could provide nice, visual examples of these types in the docs. I know input format tripped me up a lot when working with metrics libraries.

cc @lhoestq @sashavor

Trying to use metric.compute but get OSError

I want to use metric.compute from load_metric('accuracy') to get the training accuracy, but I receive an OSError. What is the mechanism behind the metric calculation, and why would it raise an OSError?

195     for epoch in range(num_train_epochs):
196         model.train()
197         for step, batch in enumerate(train_loader):
198             # print(batch['input_ids'].shape)
199             outputs = model(**batch)
200
201             loss = outputs.loss
202             loss /= gradient_accumulation_steps
203             accelerator.backward(loss)
204
205             predictions = outputs.logits.argmax(dim=-1)
206             metric.add_batch(
207                 predictions=accelerator.gather(predictions),
208                 references=accelerator.gather(batch['labels'])
209             )
210             progress_bar.set_postfix({'loss': loss.item(), 'train batch acc.': train_metrics})
211
212             if (step + 1) % 50 == 0 or step == len(train_loader) - 1:
213                 train_metrics = metric.compute()

The error message is below:

Traceback (most recent call last):
  File "run_multi.py", line 273, in <module>
    main()
  File "/home/yshuang/.local/lib/python3.8/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/yshuang/.local/lib/python3.8/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/yshuang/.local/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/yshuang/.local/lib/python3.8/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "run_multi.py", line 213, in main
    train_metrics = metric.compute()
  File "/home/yshuang/.local/lib/python3.8/site-packages/datasets/metric.py", line 391, in compute
    self._finalize()
  File "/home/yshuang/.local/lib/python3.8/site-packages/datasets/metric.py", line 342, in _finalize
    self.writer.finalize()
  File "/home/yshuang/.local/lib/python3.8/site-packages/datasets/arrow_writer.py", line 370, in finalize
    self.stream.close()
  File "pyarrow/io.pxi", line 132, in pyarrow.lib.NativeFile.close
  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
OSError: error closing file

Environment info

datasets version: 1.6.1
Platform: Linux NAME="Ubuntu" VERSION="20.04.1 LTS (Focal Fossa)"
Python version: python3.8.5
PyArrow version: 4.0.0

Setup script action to push "canonical" metrics to hub

The goal is to have all metrics available in the library hosted on the hub as Spaces. This includes the metrics that are currently inside the evaluate repo in metrics/. To get them to the hub we should set up a script and an action that pushes them to a Space. We can probably reuse a lot of functionality from #14 to make this work.

A natural mechanism would be to run this whenever a PR is merged:

pull all canonical metrics repos
apply changes (if any)
push back to hub

Docs: Add README

Set up a README for the repository.

	class MetricInfoMixin:
	    """This base class exposes some attributes of MetricInfo
	    at the base level of the Metric for easy access.
	    """

	    def __init__(self, info: MetricInfo):
	        self._metric_info = info

	    @property
	    def info(self):
	        """:class:`datasets.MetricInfo` object containing all the metadata in the metric."""
	        return self._metric_info

	    @property
	    def name(self) -> str:
	        return self._metric_info.metric_name

	    @property
	    def experiment_id(self) -> Optional[str]:
	        return self._metric_info.experiment_id

	    @property
	    def description(self) -> str:
	        return self._metric_info.description

	    @property
	    def citation(self) -> str:
	        return self._metric_info.citation

	    @property
	    def features(self) -> Features:
	        return self._metric_info.features

	    @property
	    def inputs_description(self) -> str:
	        return self._metric_info.inputs_description

	    @property
	    def homepage(self) -> Optional[str]:
	        return self._metric_info.homepage

	    @property
	    def license(self) -> str:
	        return self._metric_info.license

	    @property
	    def codebase_urls(self) -> Optional[List[str]]:
	        return self._metric_info.codebase_urls

	    @property
	    def reference_urls(self) -> Optional[List[str]]:
	        return self._metric_info.reference_urls

	    @property
	    def streamable(self) -> bool:
	        return self._metric_info.streamable

	    @property
	    def format(self) -> Optional[str]:
	        return self._metric_info.format
