
hipe-eval / hipe-2022-data


Data for the HIPE 2022 shared task.

License: Other

Jupyter Notebook 100.00%
dataset evaluation historical-documents information-extraction named-entity-extraction named-entity-linking

Contributors: e-maud, mromanello, simon-clematide

hipe-2022-data's Issues

Annotation Issue (NewsEye)

I encountered an issue while processing the NEL data in the file HIPE-2022-v2.1-newseye-dev-de.tsv:

lines 32928-32931

Haa¬	B-PER	_	O	_	_	B-ORG	Q1405350	_	NoSpaceAfter|EndOfLine
senstein	I-PER	_	O	_	_	I-ORG	Q56322697	_	_
&	O	_	O	_	_	I-ORG	Q56322697	_	_
Vogler	O	_	O	_	_	I-ORG	Q56322697	_	_

The QID points to the correct entity, which is of type ORG. "Haasenstein & Vogler" is a nested entity: an organisation name that contains a person name. By the definition of nested entities, the smaller entity should be the nested one, thus:

Haa¬	B-ORG	_	O	_	_	B-PER	Q1405350	_	NoSpaceAfter|EndOfLine
senstein	I-ORG	_	O	_	_	I-PER	Q56322697	_	_
&	I-ORG	_	O	_	_	O	Q56322697	_	_
Vogler	I-ORG	_	O	_	_	O	Q56322697	_	_

Later on, in the same file (lines 33541-33543), the correct annotation is used:

Haasenstein	B-ORG	_	O	_	_	B-PER	Q56322697	_	_
&	I-ORG	_	O	_	_	O	Q56322697	_	_
Vogler	I-ORG	_	O	_	_	O	Q56322697	_	_
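
A check like the following could help spot similar cases elsewhere. It is only a minimal sketch: the column positions (0 = token, 1 = outer tag, 6 = nested tag) are an assumption read off the excerpts above, and `find_orphan_nested` is a hypothetical helper name, not part of any HIPE tooling. It flags tokens that carry a nested tag while the outer column is empty, which is what happens to "&" and "Vogler" in lines 32928-32931.

```python
# Column positions are an assumption taken from the excerpts in this issue.
TOKEN, OUTER, NESTED = 0, 1, 6
EMPTY = {"O", "_", ""}


def find_orphan_nested(lines):
    """Return (line_number, token) pairs whose nested tag has no enclosing outer tag."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if not line.strip() or line.startswith("# "):
            continue  # skip blank lines and metadata comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) <= NESTED:
            continue  # not a full annotation row
        if cols[NESTED] not in EMPTY and cols[OUTER] in EMPTY:
            hits.append((number, cols[TOKEN]))
    return hits


# The four rows quoted at the top of this issue.
sample = [
    "Haa¬\tB-PER\t_\tO\t_\t_\tB-ORG\tQ1405350\t_\tNoSpaceAfter|EndOfLine",
    "senstein\tI-PER\t_\tO\t_\t_\tI-ORG\tQ56322697\t_\t_",
    "&\tO\t_\tO\t_\t_\tI-ORG\tQ56322697\t_\t_",
    "Vogler\tO\t_\tO\t_\t_\tI-ORG\tQ56322697\t_\t_",
]
print(find_orphan_nested(sample))
```

Note that this only catches nested tags that spill outside the outer span; it does not decide which of the two types should be the outer one.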

Missing Entities in TopRes19th Dataset

Hi,

during the review of adding the HIPE-2022 dataset to Flair, we found that some of the listed entity types do not exist in the actual dataset.

These entities are: ALIEN, OTHER, FICTION.

Could you please clarify what happened to these entities? Will they be added later, or will they appear in the final test dataset?

Many thanks,

Stefan
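
For reference, this kind of gap can be found by counting the types that actually occur in a tag column. The snippet below is a rough sketch under assumptions: tags are taken to sit in column 1 with a `B-`/`I-` prefix, the sample rows are invented for illustration (they are not real dataset lines), and `entity_types` and `missing_types` are hypothetical helper names.

```python
from collections import Counter


def entity_types(lines, column=1):
    """Count entity types in one tag column, counting each entity at its B- token."""
    counts = Counter()
    for line in lines:
        if not line.strip() or line.startswith("# "):
            continue  # skip blank lines and metadata comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) <= column:
            continue
        tag = cols[column]
        if tag.startswith("B-"):
            counts[tag[2:]] += 1
    return counts


def missing_types(lines, expected, column=1):
    """Return the expected types that never occur in the given column, sorted."""
    return sorted(set(expected) - set(entity_types(lines, column)))


# Invented rows for illustration only; column layout is an assumption.
sample = [
    "London\tB-LOC\tO",
    "Bridge\tI-LOC\tO",
    "is\tO\tO",
    "Leeds\tB-LOC\tO",
]
print(missing_types(sample, ["LOC", "ALIEN", "OTHER", "FICTION"]))
```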

Annotation Issue (coarse.meto-fine.comp)

Hello!

I've noticed a possibly missing entity type in COARSE-METO in HIPE-2022-v2.1-hipe2020-train-fr.tsv, where M. Théodore Reinach should (possibly) be a pers.ind (lines 2,141-2,150):

M	O	O	O	O	B-comp.title	O	_	_	NoSpaceAfter
.	O	O	O	O	I-comp.title	O	_	_	_
Théodore	O	O	O	O	B-comp.name	O	_	_	_
Reinach	O	O	O	O	I-comp.name	O	_	_	NoSpaceAfter
,	O	O	O	O	O	O	_	_	_
député	O	O	O	O	B-comp.function	O	_	_	_
radical	O	O	O	O	I-comp.function	O	_	_	_
de	O	O	O	O	I-comp.function	O	_	_	_
la	O	O	O	O	I-comp.function	O	_	_	EndOfLine
Savoie	B-loc	O	B-loc.adm.reg	O	I-comp.function	O	Q12745	_	NoSpaceAfter

Because I run several evaluation processes on my side, I'll also be checking the other annotated files in more depth and will open an issue for each problem I find (if any).

Inconsistent naming of masked test files

Masked test files are sometimes called ...-test-allmasked-... (e.g. in ajmc) and sometimes ...-test_allmasked-... (notice the underscore). This should be harmonized.
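
Until the names are harmonized, a loader could normalize them on its side. This is a small sketch only: which spelling ends up being the canonical one is an assumption (the hyphen variant is picked here), `harmonize_name` is a hypothetical helper, and the file names in the example are made up to show the two patterns.

```python
def harmonize_name(filename):
    """Map the underscore variant 'test_allmasked' to 'test-allmasked'."""
    return filename.replace("test_allmasked", "test-allmasked")


# Hypothetical file names illustrating both spellings.
names = [
    "HIPE-2022-v2.1-ajmc-test-allmasked-de.tsv",
    "HIPE-2022-v2.1-newseye-test_allmasked-de.tsv",
]
for name in names:
    print(harmonize_name(name))
```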

Minor issues in AJMC v2.0

Hi,

I've just written some test cases for reading the v2.0 version of the corpus in Flair, and there seem to be some issues with AJMC:

  • HIPE-2022-v2.0-ajmc-train-de.tsv: In line 16.537 the token ἄνδοα starts with a leading whitespace (very minor issue). Leading spaces also appear in other AJMC splits.
  • HIPE-2022-v2.0-ajmc-train-en.tsv: There are two "empty" tokens in the dataset, at lines 5.157 and 5.645. Those tokens should be removed or replaced with a non-whitespace character.
  • HIPE-2022-v2.0-ajmc-dev-en.tsv: Line 5.660 unfortunately has two tokens (separated by whitespace): περάνας sa

Would be awesome if this could be fixed in the next release(s), I'm going to catch these issues in Flair for now :)
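
The three problems above can be caught with a check on the first column of each row. The following is a minimal sketch, not the code used in Flair: it assumes the token is the first tab-separated field and that metadata lines start with `# `, and `check_tokens` is a hypothetical helper name.

```python
def check_tokens(lines):
    """Return (line_number, problem) pairs for suspicious token fields."""
    problems = []
    for number, line in enumerate(lines, start=1):
        raw = line.rstrip("\n")
        if not raw.strip() or raw.startswith("# "):
            continue  # skip blank lines and metadata comments
        token = raw.split("\t")[0]
        if token.strip() == "":
            problems.append((number, "empty token"))
        elif token != token.lstrip():
            problems.append((number, "leading whitespace"))
        elif any(ch.isspace() for ch in token.strip()):
            problems.append((number, "whitespace inside token"))
    return problems


# Rows modelled on the three cases reported above; the tag column is shortened.
sample = [
    " ἄνδοα\tO",
    " \tO",
    "περάνας sa\tO",
    "ok\tO",
]
print(check_tokens(sample))
```

Each row is reported at most once, so a token with both a leading space and an inner space shows up only as "leading whitespace".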
