hipe-eval / hipe-scorer

A python module for evaluating NERC and NEL system performances as defined in the HIPE shared tasks (formerly CLEF-HIPE-2020-scorer).

Home Page: https://hipe-eval.github.io

License: MIT License

Languages: Python 87.91%, Shell 0.22%, Perl 11.60%, Makefile 0.28%
Topics: evaluation, machine-learning, named-entity-linking, named-entity-recognition

hipe-scorer's People

Contributors

aflueckiger, creat89, davidsbatista, e-maud, ivyleavedtoadflax, mromanello, simon-clematide


hipe-scorer's Issues

evaluation NEL with multiple links

@e-maud @simon-clematide

With c457b7f, I added two options for evaluating against multiple links:

  • option 1: take the union of an arbitrary number of columns (e.g. the literal and metonymic columns)
  • option 2: a ranked list of links, separated by pipes within a single cell. Participants can provide as many links as they want; the number actually considered is limited by a script parameter.

There are some limitations:

  • only NEL, not NERC
  • either option 1 or option 2, not both
  • alternative links are allowed in the prediction file only, not in the gold standard
  • alternative links always share the same span, whether they are separated by pipes or split across columns
  • in case of spurious links, the first link of the ranked list is the one that gets blamed

These limitations are not due to technical problems; they exist to keep things as simple as possible. Our scorer has already grown considerably in complexity due to our advanced evaluation scenarios.
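
For illustration, here is a minimal sketch of how the two options might be read from a parsed TSV row. The column names, the helper name and the n_best parameter are assumptions, not the scorer's actual code.

# Illustrative sketch only -- column names and helper are assumptions.
def candidate_links(row, columns=("NEL-LIT", "NEL-METO"), n_best=3):
    # Option 1: take the union of the links found in several columns.
    union = {row[col] for col in columns if row.get(col) and row[col] != "_"}
    # Option 2: a ranked, pipe-separated list within a single cell,
    # truncated to the top n_best candidates (the script parameter).
    ranked = [link for link in row.get(columns[0], "").split("|") if link][:n_best]
    return union, ranked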

Make file urls amenable to the scorer

Feature request: because all the modules in HIPE-pycommons can be called with either a path or a URL, I think the scorer should accept both too. Adding a simple

import urllib.request

# If the input is a URL, download the TSV content instead of reading a local file.
if url:
    response = urllib.request.urlopen(url)
    tsv_data = response.read().decode('utf-8')

would do the job!
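
A slightly fuller sketch of such a loader, handling both cases, might look like this (the function name and the URL check are assumptions, not existing scorer code):

import urllib.request
from pathlib import Path

def read_tsv(path_or_url):
    # Return the TSV content as a string, from either a URL or a local path.
    if str(path_or_url).startswith(("http://", "https://")):
        with urllib.request.urlopen(path_or_url) as response:
            return response.read().decode("utf-8")
    return Path(path_or_url).read_text(encoding="utf-8")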

ToDo CLEF scorer

decisions

  • document-level F1-score to weight long and short documents equally
  • treat nested entities (columns) individually
  • treat every column separately: NEL, NERC and components
  • no fuzziness on the type, only on the boundary (one token needs to match)
  • no macro_type in TSV

--> compare the F1-score with the CoNLL script as a sanity check

programming

  • implement script scaffolding
  • output relevant metrics as TSV
  • output all metrics as JSON
  • implement a document-level macro F1-score with std. dev. as a stability measure
  • unit tests for complicated cases (double hits including TP and FP)
  • include the raw TP / FP / FN counts
  • check the probably erroneous F1-score formula for partial matches in the official definition
  • add an argument to glue the comp and fine labels: #1 (comment)
  • sanity check whether all provided labels are available in the gold standard
  • add the system name as a separate column, extracted from the filename
  • check how to build a union of two columns (or lists) to evaluate against the gold standard
  • implement the slot error rate
    • also per type
    • compute F1/P/R
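
For reference, a plain, unweighted slot error rate can be sketched as below; HIPE-style variants may weight type and boundary substitutions differently, so this is only an assumption-laden illustration, not the scorer's implementation.

# Sketch of an unweighted slot error rate (SER).
# deletions: missed gold slots; insertions: spurious predicted slots;
# substitutions: slots with a wrong type and/or wrong boundaries.
def slot_error_rate(deletions, insertions, substitutions, n_reference_slots):
    if n_reference_slots == 0:
        return 0.0
    return (deletions + insertions + substitutions) / n_reference_slots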

Evaluation Measures: Understanding of macro average

Micro P, R, F1:

  • P, R, F1 on entity level (not on token level): micro average (= over all documents)
    • strict and fuzzy (= at least 1 token overlap)
    • separately per type and cumulative for all types

Macro as document-level average of micro P, R, F1

  • P, R, F1 on entity level (not on token level): doc-level macro average (= average of separate micro evaluation on each document)
    • strict and fuzzy (= at least 1 token overlap)
    • separately per type and cumulative for all types

@e-maud @mromanello: The following type-oriented macro average can be computed from the output of Micro P, R, F1 (spreadsheet style). Therefore the scorer should not directly compute it (for now, at least).

Macro as average over type-specific P, R, F1 measures

  • P, R, F1 per entity type: macro average over types (= average of the separate per-type micro evaluations)
    • strict and fuzzy (= at least 1 token overlap)
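
To make the two aggregation levels concrete, here is a small sketch; it assumes entity-level (TP, FP, FN) counts per document are already available and is not the scorer's actual code.

import statistics

# doc_counts: one (TP, FP, FN) triple of entity-level counts per document.
def prf(tp, fp, fn):
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def micro(doc_counts):
    # Pool the counts over all documents, then compute P/R/F1 once.
    return prf(sum(c[0] for c in doc_counts),
               sum(c[1] for c in doc_counts),
               sum(c[2] for c in doc_counts))

def doc_macro(doc_counts):
    # Compute F1 per document, then average; long and short documents weigh equally.
    f1s = [prf(*c)[2] for c in doc_counts]
    return statistics.mean(f1s), (statistics.stdev(f1s) if len(f1s) > 1 else 0.0)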

Mention the truncation danger somewhere

This is not really an issue, as the code does what it should, but I think it would be good to warn users that if they truncate samples (which, to my knowledge, is the default setting in HuggingFace and PyTorch) to fit the maximum model length (e.g. 512 tokens), their reconstructed files will contain blanks, which will lower their results.

Just dropping the idea!

handling of invalid tags vs out-of-GT tags

Proposed change to how predicted tags are handled by the scorer.

Current behaviour:

Given a certain TSV column (e.g. NE-COARSE-LIT), a predicted tag is ignored by the scorer (i.e. considered as if it were an O tag) if it is not in the set of tags contained in the ground truth (GT) for that specific column.

For example, if a system returns the tag B-PERS for the column NE-FINE-COMP and the ground truth does not contain any B-PERS tag for that column, the prediction is currently treated as an O tag and therefore does not produce a false positive error.

New behaviour:

For each column, the scorer will accept as valid any tag present in a predefined tagset known to the scorer. The default tagset corresponds to the set of tags existing in the HIPE train/dev/test corpora and can be overridden.

In this case, if a system returns the tag PERS for the column NE-COARSE-METO, it will be considered a valid tag, since PERS is present in the tagset (all tags defined in the annotation schema).

NB: this change is likely to have some impact on the evaluation of systems (i.e. slightly worse precision scores).
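
A minimal sketch of the proposed behaviour, assuming a per-column tagset mapping; the column names and tags below are illustrative, not the full HIPE tagset.

# Sketch only: per-column tagsets are illustrative, not the full HIPE tagset.
VALID_TAGS = {
    "NE-COARSE-LIT": {"O", "B-pers", "I-pers", "B-loc", "I-loc", "B-org", "I-org"},
    # ... one entry per column, defaulting to the tags of the HIPE train/dev/test corpora
}

def normalize_tag(tag, column, tagsets=VALID_TAGS):
    # Accept any tag present in the column's predefined tagset;
    # everything else falls back to 'O'.
    return tag if tag in tagsets.get(column, {"O"}) else "O"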

Wrong precision and recall in entity linking evaluation

The scorer seems to calculate precision and recall wrongly because it merges consecutive identical link IDs, in both the gold standard and the predictions, instead of relying on mention boundaries.

Let's consider that the following text is the gold standard. There are three mentions and, according to the scorer, three links: Q69345, Q78068340 and Q39:

Neuchâld    B-loc    O    B-loc.adm.reg    O    O    O    Q69345    _    LED0.33
SI    B-pers    O    B-pers.ind    O    B-comp.title    O    Q78068340    _    NoSpaceAfter|LED0.09
.    I-pers    O    I-pers.ind    O    I-comp.title    O    Q78068340    _    LED0.09
la    O    O    O    O    O    O    _    _    _
Confédération    B-loc    O    B-loc.adm.nat    O    O    O    Q39    _    LED0.00
Suisse    I-loc    O    I-loc.adm.nat    O    O    O    Q39    _    NoSpaceAfter|LED0.00

And let us consider the predicted links:

Neuchâld    B-loc    O    B-loc.adm.reg    O    O    O    NIL    _
SI    B-pers    O    B-pers.ind    O    B-comp.title    O    NIL    _
.    I-pers    O    I-pers.ind    O    I-comp.title    O    NIL    _
la    O    O    O    O    O    O    _    _
Confédération    B-loc    O    B-loc.adm.nat    O    O    O    Q39|Q340787|Q568452    _
Suisse    I-loc    O    I-loc.adm.nat    O    O    O    Q39|Q340787|Q568452    _

Although the entity linker predicted two NILs and one Q39, the scorer considers that we only predicted one NIL and one Q39. Therefore, we have:

P = 1/(1+1) = 0.5, i.e. Q39 / (Q39 + NIL)
R = 1/(1+2) = 0.33, i.e. Q39 / (Q39 + Q69345 + Q78068340)

The error can also happen the other way around. Let us consider the following text as the gold standard:

Neuchâld    B-loc    O    B-loc.adm.reg    O    O    O    NIL    _    LED0.33
SI    B-pers    O    B-pers.ind    O    B-comp.title    O    NIL    _    NoSpaceAfter|LED0.09
.    I-pers    O    I-pers.ind    O    I-comp.title    O    NIL    _    LED0.09
la    O    O    O    O    O    O    _    _    _
Confédération    B-loc    O    B-loc.adm.nat    O    O    O    Q39    _    LED0.00
Suisse    I-loc    O    I-loc.adm.nat    O    O    O    Q39    _    NoSpaceAfter|LED0.00

And this is the predicted output:

Neuchâld    B-loc    O    B-loc.adm.reg    O    O    O    Q69345    _
SI    B-pers    O    B-pers.ind    O    B-comp.title    O    NIL    _
.    I-pers    O    I-pers.ind    O    I-comp.title    O    NIL    _
la    O    O    O    O    O    O    _    _
Confédération    B-loc    O    B-loc.adm.nat    O    O    O    Q39|Q340787|Q568452    _
Suisse    I-loc    O    I-loc.adm.nat    O    O    O    Q39|Q340787|Q568452    _

In this case, the scorer says that we predicted three links while the gold standard has only two:

P = 1/(1+2) = 0.33, i.e. Q39 / (Q39 + NIL + Q69345)
R = 1/(1+1) = 0.5, i.e. Q39 / (Q39 + NIL)

The scorer should take into consideration the NER boundaries to determine where a link starts and ends. This behaviour does not occur if we add an extra line without a link, such as in:

Neuchâld    B-loc    O    B-loc.adm.reg    O    O    O    NIL    _    LED0.33
REMOVEME    O    O    O    O    O    O    _    _
SI    B-pers    O    B-pers.ind    O    B-comp.title    O    NIL    _    NoSpaceAfter|LED0.09
.    I-pers    O    I-pers.ind    O    I-comp.title    O    NIL    _    LED0.09
la    O    O    O    O    O    O    _    _    _
Confédération    B-loc    O    B-loc.adm.nat    O    O    O    Q39    _    LED0.00
Suisse    I-loc    O    I-loc.adm.nat    O    O    O    Q39    _    NoSpaceAfter|LED0.00
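
A sketch of the suggested fix: derive one link per mention from the IOB boundaries of the NE column instead of merging runs of identical link IDs. The function and its input format are illustrative, not the scorer's code.

# Sketch only: rows is a list of (iob_tag, link) pairs for one document.
def links_per_mention(rows):
    mentions = []
    for iob_tag, link in rows:
        if iob_tag.startswith("B-"):
            # A new mention starts here; keep its first (top-ranked) link.
            mentions.append(link.split("|")[0])
        # I- tokens continue the current mention; O tokens carry no link.
    return mentions

# On the first example above this yields ['Q69345', 'Q78068340', 'Q39'] for the gold
# standard and ['NIL', 'NIL', 'Q39'] for the predictions, i.e. both NILs are counted.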

[Feature request] : Make the scorer directly accessible via a python API

It would be great if the scorer could be called via a Python API. At the time of writing, the scorer can only be fed with HIPE-compliant TSV files. This is a limitation for two reasons:

  1. It makes it complicated to evaluate on the fly (e.g. at the end of each epoch).
  2. It makes it necessary to rebuild words out of each model's tokens, which can be subtokens.

This second point can be very problematic, depending on your labelling strategy. Before sub-tokenization, an input example may look like:

O      B-PERS     I-PERS  I-PERS
The  Australian  Prime   minister 

A model like BERT could tokenize and label this example like so :

O    B-PERS   B-PERS   I-PERS  I-PERS
The  Austral  ##ian    Prime   minister

However, at inference time, the model may predict something like :

O    B-PERS   I-PERS   I-PERS  I-PERS
The  Austral  ##ian    Prime   minister

To evaluate this prediction, you must first rebuild the words to match the ground-truth TSV. However, since Austral and ##ian have two different labels, it is not clear which one should be chosen.

If it were possible to feed the scorer with two simple list objects (prediction and ground truth, in a seqeval-like fashion), things would be easier.

Though the aforementioned problem could be circumvented by labelling only the first sub-token, it would still be great to be able to evaluate predictions on the fly, and even to have the API directly accessible from external frameworks such as HuggingFace.
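
Concretely, a hypothetical call could look like the sketch below; the evaluate function and the shape of its result are purely illustrative, no such API exists in the scorer at the time of writing.

# Hypothetical usage sketch -- nothing like this exists in hipe-scorer yet.
gold = ["O", "B-PERS", "I-PERS", "I-PERS"]
pred = ["O", "B-PERS", "I-PERS", "I-PERS"]

# results = hipe_scorer.evaluate(gold, pred)  # seqeval-like: two flat label lists
# print(results["strict"], results["fuzzy"])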

There are no tags in the system response for the column...

Hello,

Currently I'm having an issue with some internal evaluations when my system only predicts one type of named entity. The scorer stops when I do not provide labels for all of the columns. In my opinion, if the user does not provide labels for a specific column, the scorer should return zero for the evaluation of that column rather than stopping.

Fuzzy results are named `'ent_type'` in json output

In the JSON output of the scorer, the results corresponding to fuzzy in the CSV output are named ent_type. For instance:

# To get the strict results, you fetch the values of the 'strict' key:
results["NE-COARSE-LIT"]["TIME-ALL"]["LED-ALL"][desired_entity_type]["strict"]

# However, to get the fuzzy results, you have to retrieve the values of the 'ent_type' key:
results["NE-COARSE-LIT"]["TIME-ALL"]["LED-ALL"][desired_entity_type]["ent_type"]

This is a bit confusing. Unless I am missing something, I think the key should be renamed to 'fuzzy'.

enforce submission requirements

Check the submission files of participants before evaluating.

Rules for system response files:

  • files must be in UTF-8, TSV-encoded (.tsv extension), with annotations in IOB format
  • files need to contain all content item lines and empty lines in the order of the original input file
  • files must comply with the following naming convention:
    TEAMNAME_TASKBUNDLEID_LANG_RUNNUMBER.tsv
    where:
    TEAMNAME is the name of the team as registered via the CLEF portal
    TASKBUNDLEID is one of the bundle ids as indicated in Table 4
    LANG is one of de, fr, en
    RUNNUMBER is one of 1, 2, 3
    Example: dreamteam_bundle1_de_2.tsv
  • files must include all columns and instantiate the unspecified values in the required columns according to the chosen task bundle
  • files can include the comment lines present in the input, but do not need to
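
As a starting point, the naming convention alone could be pre-checked with a small helper like the sketch below; the accepted set of bundle ids and the bundle\d+ pattern are assumptions based on the rules above.

import re

# Sketch of a filename check for TEAMNAME_TASKBUNDLEID_LANG_RUNNUMBER.tsv;
# the exact set of accepted bundle ids is an assumption here.
FILENAME_RE = re.compile(
    r"^(?P<team>[A-Za-z0-9-]+)_(?P<bundle>bundle\d+)_(?P<lang>de|fr|en)_(?P<run>[123])\.tsv$"
)

def check_filename(name):
    match = FILENAME_RE.match(name)
    if match is None:
        raise ValueError(f"{name} does not follow TEAMNAME_TASKBUNDLEID_LANG_RUNNUMBER.tsv")
    return match.groupdict()

# check_filename("dreamteam_bundle1_de_2.tsv")
# -> {'team': 'dreamteam', 'bundle': 'bundle1', 'lang': 'de', 'run': '2'}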
