hipe-eval / hipe-scorer

A python module for evaluating NERC and NEL system performances as defined in the HIPE shared tasks (formerly CLEF-HIPE-2020-scorer).

Home Page: https://hipe-eval.github.io

License: MIT License

Languages: Python 87.91%, Shell 0.22%, Perl 11.60%, Makefile 0.28%
Topics: evaluation, machine-learning, named-entity-linking, named-entity-recognition

hipe-scorer's People

Contributors

aflueckiger, creat89, davidsbatista, e-maud, ivyleavedtoadflax, mromanello, simon-clematide


hipe-scorer's Issues

evaluation NEL with multiple links

@e-maud @simon-clematide

With c457b7f, I added two options for evaluating against multiple links:

  • option 1: take the union of an arbitrary number of columns (e.g. the literal and metonymic columns)
  • option 2: a ranked list of links, separated by pipes within a single cell. Participants can provide as many links as they want; the number actually considered is limited by a script parameter.

There are some limitations:

  • only NEL, not NERC
  • either option 1 or option 2, not both
  • alternative links are allowed in the prediction file only, not in the gold standard
  • alternative links always share the same span, whether they are separated by pipes or split across columns
  • in case of spurious links, the first link of the ranked list is the one that gets blamed

These limitations are not due to technical problems; they exist to keep things as simple as possible. Our scorer has already grown considerably in complexity due to our advanced evaluation scenarios.
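
For illustration, here is a minimal sketch of how the two options might be read from a parsed TSV row. The column names, the helper name and the n_best parameter are assumptions, not the scorer's actual code.

# Illustrative sketch only -- column names and helper are assumptions.
def candidate_links(row, columns=("NEL-LIT", "NEL-METO"), n_best=3):
    # Option 1: take the union of the links found in several columns.
    union = {row[col] for col in columns if row.get(col) and row[col] != "_"}
    # Option 2: a ranked, pipe-separated list within a single cell,
    # truncated to the top n_best candidates (the script parameter).
    ranked = [link for link in row.get(columns[0], "").split("|") if link][:n_best]
    return union, ranked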

Make file urls amenable to the scorer

Feature request: because all the modules in HIPE-pycommons can be called with either a path or a URL, I think the scorer should accept both too. Adding a simple

import urllib.request

# If the input is a URL, download the TSV content instead of reading a local file.
if url:
    response = urllib.request.urlopen(url)
    tsv_data = response.read().decode('utf-8')

would do the job!
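
A slightly fuller sketch of such a loader, handling both cases, might look like this (the function name and the URL check are assumptions, not existing scorer code):

import urllib.request
from pathlib import Path

def read_tsv(path_or_url):
    # Return the TSV content as a string, from either a URL or a local path.
    if str(path_or_url).startswith(("http://", "https://")):
        with urllib.request.urlopen(path_or_url) as response:
            return response.read().decode("utf-8")
    return Path(path_or_url).read_text(encoding="utf-8")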

ToDo CLEF scorer

decisions

  • document-level F1-score to weight long and short documents equally
  • treat nested entities (columns) individually
  • treat every column separately: NEL, NERC and components
  • no fuzziness on the type, only on the boundary (one token needs to match)
  • no macro_type in TSV

--> compare the F1-score with the CoNLL script as a sanity check

programming

  • implement script scaffolding
  • output relevant metrics as TSV
  • output all metrics as JSON
  • implement a document-level macro F1-score with std. dev. as a stability measure
  • unit tests for complicated cases (double hits including TP and FP)
  • include the raw TP / FP / FN counts
  • check the probably erroneous F1-score formula for partial matches in the official definition
  • add an argument to glue the comp and fine labels: #1 (comment)
  • sanity check whether all provided labels are available in the gold standard
  • add the system name as a separate column, extracted from the filename
  • check how to build a union of two columns (or lists) to evaluate against the gold standard
  • implement the slot error rate
    • also per type
    • compute F1/P/R
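
For reference, a plain, unweighted slot error rate can be sketched as below; HIPE-style variants may weight type and boundary substitutions differently, so this is only an assumption-laden illustration, not the scorer's implementation.

# Sketch of an unweighted slot error rate (SER).
# deletions: missed gold slots; insertions: spurious predicted slots;
# substitutions: slots with a wrong type and/or wrong boundaries.
def slot_error_rate(deletions, insertions, substitutions, n_reference_slots):
    if n_reference_slots == 0:
        return 0.0
    return (deletions + insertions + substitutions) / n_reference_slots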

Evaluation Measures: Understanding of macro average

Micro P, R, F1:

  • P, R, F1 on entity level (not on token level): micro average (= over all documents)
    • strict and fuzzy (= at least 1 token overlap)
    • separately per type and cumulative for all types

Macro as document-level average of micro P, R, F1

  • P, R, F1 on entity level (not on token level): doc-level macro average (= average of separate micro evaluation on each document)
    • strict and fuzzy (= at least 1 token overlap)
    • separately per type and cumulative for all types

@e-maud @mromanello: The following type-oriented macro average can be computed from the output of Micro P, R, F1 (spreadsheet style). Therefore the scorer should not directly compute it (for now, at least).

Macro as average over type-specific P, R, F1 measures

  • P, R, F1 per entity type: macro average over types (= average of the separate per-type micro evaluations)
    • strict and fuzzy (= at least 1 token overlap)
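
To make the two aggregation levels concrete, here is a small sketch; it assumes entity-level (TP, FP, FN) counts per document are already available and is not the scorer's actual code.

import statistics

# doc_counts: one (TP, FP, FN) triple of entity-level counts per document.
def prf(tp, fp, fn):
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def micro(doc_counts):
    # Pool the counts over all documents, then compute P/R/F1 once.
    return prf(sum(c[0] for c in doc_counts),
               sum(c[1] for c in doc_counts),
               sum(c[2] for c in doc_counts))

def doc_macro(doc_counts):
    # Compute F1 per document, then average; long and short documents weigh equally.
    f1s = [prf(*c)[2] for c in doc_counts]
    return statistics.mean(f1s), (statistics.stdev(f1s) if len(f1s) > 1 else 0.0)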

Mention the truncation danger somewhere

This is not really an issue, as the code does what it should, but I think it would be good to warn users that if they truncate samples (which, to my knowledge, is the default setting in HuggingFace and PyTorch) to fit the maximum model length (e.g. 512 tokens), their reconstructed files will contain blanks, which will lower their results.

Just dropping the idea!

handling of invalid tags vs out-of-GT tags

Proposed change to how predicted tags are handled by the scorer.

Current behaviour:

Given a certain TSV column (e.g. NE-COARSE-LIT), a predicted tag is ignored by the scorer (i.e. considered as if it were an O tag) if it is not in the set of tags contained in the ground truth (GT) for that specific column.

For example, if a system returns the tag B-PERS for the column NE-FINE-COMP and the ground truth does not contain any B-PERS tag for that column, the prediction is currently treated as an O tag and therefore does not produce a false positive error.

New behaviour:

For each column, the scorer will accept as valid any tag present in a predefined tagset known to the scorer. The default tagset corresponds to the set of tags existing in the HIPE train/dev/test corpora and can be overridden.

In this case, if a system returns the tag PERS for the column NE-COARSE-METO, it will be considered a valid tag, since PERS is present in the tagset (all tags defined in the annotation schema).

NB: this change is likely to have some impact on the evaluation of systems (i.e. slightly worse precision scores).
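
A minimal sketch of the proposed behaviour, assuming a per-column tagset mapping; the column names and tags below are illustrative, not the full HIPE tagset.

# Sketch only: per-column tagsets are illustrative, not the full HIPE tagset.
VALID_TAGS = {
    "NE-COARSE-LIT": {"O", "B-pers", "I-pers", "B-loc", "I-loc", "B-org", "I-org"},
    # ... one entry per column, defaulting to the tags of the HIPE train/dev/test corpora
}

def normalize_tag(tag, column, tagsets=VALID_TAGS):
    # Accept any tag present in the column's predefined tagset;
    # everything else falls back to 'O'.
    return tag if tag in tagsets.get(column, {"O"}) else "O"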

Wrong precision and recall in entity linking evaluation

The scorer seems to calculate precision and recall wrongly because it merges consecutive identical link IDs, in both the gold standard and the predictions, instead of relying on mention boundaries.

Let's consider that the following text is the gold standard. There are three mentions and, according to the scorer, three links: Q69345, Q78068340 and Q39:

Neuchâld    B-loc    O    B-loc.adm.reg    O    O    O    Q69345    _    LED0.33
SI    B-pers    O    B-pers.ind    O    B-comp.title    O    Q78068340    _    NoSpaceAfter|LED0.09
.    I-pers    O    I-pers.ind    O    I-comp.title    O    Q78068340    _    LED0.09
la    O    O    O    O    O    O    _    _    _
Confédération    B-loc    O    B-loc.adm.nat    O    O    O    Q39    _    LED0.00
Suisse    I-loc    O    I-loc.adm.nat    O    O    O    Q39    _    NoSpaceAfter|LED0.00

And let us consider the predicted links:

Neuchâld    B-loc    O    B-loc.adm.reg    O    O    O    NIL    _
SI    B-pers    O    B-pers.ind    O    B-comp.title    O    NIL    _
.    I-pers    O    I-pers.ind    O    I-comp.title    O    NIL    _
la    O    O    O    O    O    O    _    _
Confédération    B-loc    O    B-loc.adm.nat    O    O    O    Q39|Q340787|Q568452    _
Suisse    I-loc    O    I-loc.adm.nat    O    O    O    Q39|Q340787|Q568452    _

Although the entity linker predicted two NILs and one Q39, the scorer considers that we only predicted one NIL and one Q39. Therefore, we have:

P = 1/(1+1) = 0.5, i.e. Q39 / (Q39 + NIL)
R = 1/(1+2) = 0.33, i.e. Q39 / (Q39 + Q69345 + Q78068340)

The error can also happen the other way around. Let us consider the following text as the gold standard:

Neuchâld    B-loc    O    B-loc.adm.reg    O    O    O    NIL    _    LED0.33
SI    B-pers    O    B-pers.ind    O    B-comp.title    O    NIL    _    NoSpaceAfter|LED0.09
.    I-pers    O    I-pers.ind    O    I-comp.title    O    NIL    _    LED0.09
la    O    O    O    O    O    O    _    _    _
Confédération    B-loc    O    B-loc.adm.nat    O    O    O    Q39    _    LED0.00
Suisse    I-loc    O    I-loc.adm.nat    O    O    O    Q39    _    NoSpaceAfter|LED0.00

And this is the predicted output:

Neuchâld    B-loc    O    B-loc.adm.reg    O    O    O    Q69345    _
SI    B-pers    O    B-pers.ind    O    B-comp.title    O    NIL    _
.    I-pers    O    I-pers.ind    O    I-comp.title    O    NIL    _
la    O    O    O    O    O    O    _    _
Confédération    B-loc    O    B-loc.adm.nat    O    O    O    Q39|Q340787|Q568452    _
Suisse    I-loc    O    I-loc.adm.nat    O    O    O    Q39|Q340787|Q568452    _

In this case, the scorer says that we predicted three links while the gold standard has only two:

P = 1/(1+2) = 0.33, i.e. Q39 / (Q39 + NIL + Q69345)
R = 1/(1+1) = 0.5, i.e. Q39 / (Q39 + NIL)

The scorer should take into consideration the NER boundaries to determine where a link starts and ends. This behaviour does not occur if we add an extra line without a link, such as in:

Neuchâld    B-loc    O    B-loc.adm.reg    O    O    O    NIL    _    LED0.33
REMOVEME    O    O    O    O    O    O    _    _
SI    B-pers    O    B-pers.ind    O    B-comp.title    O    NIL    _    NoSpaceAfter|LED0.09
.    I-pers    O    I-pers.ind    O    I-comp.title    O    NIL    _    LED0.09
la    O    O    O    O    O    O    _    _    _
Confédération    B-loc    O    B-loc.adm.nat    O    O    O    Q39    _    LED0.00
Suisse    I-loc    O    I-loc.adm.nat    O    O    O    Q39    _    NoSpaceAfter|LED0.00
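
A sketch of the suggested fix: derive one link per mention from the IOB boundaries of the NE column instead of merging runs of identical link IDs. The function and its input format are illustrative, not the scorer's code.

# Sketch only: rows is a list of (iob_tag, link) pairs for one document.
def links_per_mention(rows):
    mentions = []
    for iob_tag, link in rows:
        if iob_tag.startswith("B-"):
            # A new mention starts here; keep its first (top-ranked) link.
            mentions.append(link.split("|")[0])
        # I- tokens continue the current mention; O tokens carry no link.
    return mentions

# On the first example above this yields ['Q69345', 'Q78068340', 'Q39'] for the gold
# standard and ['NIL', 'NIL', 'Q39'] for the predictions, i.e. both NILs are counted.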

[Feature request] : Make the scorer directly accessible via a python API

It would be great if the scorer could be called via a Python API. At the time of writing, the scorer can only be fed with HIPE-compliant TSV files. This is a limitation for two reasons:

  1. It makes it complicated to evaluate on the fly (e.g. at the end of each epoch).
  2. It makes it necessary to rebuild words out of each model's tokens, which can be subtokens.

This second point can be very problematic, depending on your labelling strategy. Before sub-tokenization, an input example may look like:

O      B-PERS     I-PERS  I-PERS
The  Australian  Prime   minister 

A model like BERT could tokenize and label this example like so :

O    B-PERS   B-PERS   I-PERS  I-PERS
The  Austral  ##ian    Prime   minister

However, at inference time, the model may predict something like :

O    B-PERS   I-PERS   I-PERS  I-PERS
The  Austral  ##ian    Prime   minister

To evaluate this prediction, you must first rebuild the words to match the ground-truth TSV. However, since Austral and ##ian have two different labels, it is not clear which one should be chosen.

If it were possible to feed the scorer with two simple list objects (prediction and ground truth, in a seqeval-like fashion), things would be easier.

Though the aforementioned problem could be circumvented by labelling only the first sub-token, it would still be great to be able to evaluate predictions on the fly, and even to have the API directly accessible from external frameworks such as HuggingFace.
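
Concretely, a hypothetical call could look like the sketch below; the evaluate function and the shape of its result are purely illustrative, no such API exists in the scorer at the time of writing.

# Hypothetical usage sketch -- nothing like this exists in hipe-scorer yet.
gold = ["O", "B-PERS", "I-PERS", "I-PERS"]
pred = ["O", "B-PERS", "I-PERS", "I-PERS"]

# results = hipe_scorer.evaluate(gold, pred)  # seqeval-like: two flat label lists
# print(results["strict"], results["fuzzy"])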

There are no tags in the system response for the column...

Hello,

Currently I'm having an issue with some internal evaluations when my system only predicts one type of named entity. The scorer stops when I do not provide labels for all of the columns. In my opinion, if the user does not provide labels for a specific column, the scorer should return zero for the evaluation of that column rather than stopping.

Fuzzy results are named `'ent_type'` in json output

In the JSON output of the scorer, the results corresponding to fuzzy in the CSV output are named ent_type. For instance:

# To get the strict results, you fetch the values of the 'strict' key:
results["NE-COARSE-LIT"]["TIME-ALL"]["LED-ALL"][desired_entity_type]["strict"]

# However, to get the fuzzy results, you have to retrieve the values of the 'ent_type' key:
results["NE-COARSE-LIT"]["TIME-ALL"]["LED-ALL"][desired_entity_type]["ent_type"]

This is a bit confusing. Unless I am missing something, I think the key should be renamed to 'fuzzy'.

enforce submission requirements

Check the submission files of participants before evaluating.

Rules for system response files:

  • files must be in UTF-8, TSV-encoded (.tsv extension), with annotations in IOB format
  • files need to contain all content item lines and empty lines in the order of the original input file
  • files must comply with the following naming convention:
    TEAMNAME_TASKBUNDLEID_LANG_RUNNUMBER.tsv
    where:
    TEAMNAME is the name of the team as registered via the CLEF portal
    TASKBUNDLEID is one of the bundle ids as indicated in Table 4
    LANG is one of de, fr, en
    RUNNUMBER is one of 1, 2, 3
    Example: dreamteam_bundle1_de_2.tsv
  • files must include all columns and instantiate the unspecified values in the required columns according to the chosen task bundle
  • files can include the comment lines present in the input, but do not need to
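
As a starting point, the naming convention alone could be pre-checked with a small helper like the sketch below; the accepted set of bundle ids and the bundle\d+ pattern are assumptions based on the rules above.

import re

# Sketch of a filename check for TEAMNAME_TASKBUNDLEID_LANG_RUNNUMBER.tsv;
# the exact set of accepted bundle ids is an assumption here.
FILENAME_RE = re.compile(
    r"^(?P<team>[A-Za-z0-9-]+)_(?P<bundle>bundle\d+)_(?P<lang>de|fr|en)_(?P<run>[123])\.tsv$"
)

def check_filename(name):
    match = FILENAME_RE.match(name)
    if match is None:
        raise ValueError(f"{name} does not follow TEAMNAME_TASKBUNDLEID_LANG_RUNNUMBER.tsv")
    return match.groupdict()

# check_filename("dreamteam_bundle1_de_2.tsv")
# -> {'team': 'dreamteam', 'bundle': 'bundle1', 'lang': 'de', 'run': '2'}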
