historical-entity-linking's Introduction

Historical Named Entities Recognition, Classification and Linking

Named Entity Recognition and Classification (NERC) and Entity Linking (EL) are foundational tasks in knowledge extraction. NERC consists of extracting mentions of named entities in text and assigning each of them to one of a pre-defined set of entity types, such as person, location or organisation. EL consists of linking each named entity mention to the actual entity it refers to, choosing among candidates from a reference Knowledge Base (KB), such as Wikipedia, DBpedia or Wikidata.

NERC and EL have become progressively more relevant for historical documents due to the massive digitisation campaigns carried out in recent years. Both tasks are particularly challenging on historical documents because the plain text derived through Optical Character Recognition (OCR) is noisy. In addition, non-contemporary language differs from present-day varieties in lexical, morpho-syntactic and semantic aspects, such as spelling variation, sentence structure and naming conventions. Furthermore, annotated datasets of historical texts are scarce, and state-of-the-art (SotA) models, which follow the supervised paradigm, are trained on contemporary data drawn mainly from web documents. Considerable research effort in this area has recently been carried out by HIPE - Identifying Historical People, Places and Other Entities, a CLEF evaluation lab.

Historical Entity Linking

EL is particularly challenging when applied to historical text, especially when it involves entity disambiguation, i.e. linking a mention whose surface form is shared by many named entities to its correct entry in a knowledge base. We hypothesise that the popularity bias in state-of-the-art (SotA) neural entity linkers, a phenomenon investigated in Evaluating Entity Disambiguation and the Role of Popularity in Retrieval-Based NLP (Chen et al., ACL-IJCNLP 2021), intensifies when applied to entities extracted from historical texts. Popularity bias causes popular entities, i.e. entities that are more frequent in a training set, to be preferred to less frequent entities, even if the context in which they appear is unambiguous.
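The effect of popularity bias can be illustrated with a toy ranking function. This is a minimal sketch, not the scoring used by any specific entity linker: the `ScoredCandidate` class, the `rank` function, the linear interpolation and all numeric values are hypothetical and chosen only to show how a strong popularity prior can override contextual evidence.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ScoredCandidate:
    name: str
    prior: float          # hypothetical share of training mentions linked to this entity
    context_score: float  # hypothetical fit between the mention context and the entity


def rank(candidates: List[ScoredCandidate], prior_weight: float) -> List[ScoredCandidate]:
    """Rank candidates by a linear mix of popularity prior and context fit."""
    def score(c: ScoredCandidate) -> float:
        return prior_weight * c.prior + (1.0 - prior_weight) * c.context_score
    return sorted(candidates, key=score, reverse=True)


# Toy values (assumptions): the rare entity fits the context far better.
candidates = [
    ScoredCandidate("popular entity", prior=0.9, context_score=0.2),
    ScoredCandidate("rare entity", prior=0.1, context_score=0.8),
]

biased = rank(candidates, prior_weight=0.8)    # prior dominates
balanced = rank(candidates, prior_weight=0.2)  # context dominates
print(biased[0].name, "|", balanced[0].name)
```

With these toy values, the popular entity wins whenever the prior carries most of the weight, even though the context clearly favours the rare one.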

The cause of this heightened bias can be traced back to historical entities being under-represented in popular training datasets. Furthermore, in the context of historical documents, popularity bias tends to take a specific form, which we term temporal bias: the tendency of neural entity linkers to favour contemporary entities over historical entities, even when other unambiguous historical entities are mentioned in the same context. As a result, these linkers perform worse on historical entities found in historical texts than on contemporary entities found in contemporary texts.

For example, this phenomenon can be observed in the AMR graph produced by the Polifonia Knowledge Extractor (PKE) from the sentence "Fabio Constantini flourished about the year 1630, and ultimately became maestro at the chapel of Loretto.", taken from The Quarterly Musical Magazine And Review, n. 22 (1824), part of the Periodicals module of the PTC. The graph is reported in the figure below:

[Figure: Fabio Constantini AMR Graph]

In the AMR graph, the NERC step performed by SPRING correctly recognises the mention Fabio Constantini, but BLINK erroneously links it to the Wikipedia page of Claudio Constantini, a Peruvian composer born in 1983. The correct link is the Wikidata entity Fabio Constantini (QID Q3737659), an Italian composer born in 1575.
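The mismatch in this example is detectable from dates alone: a source published in 1824 cannot mention a person born in 1983. The sketch below shows one simple way such a temporal plausibility check could look. It is only an illustration and not the method implemented in this repository; the `DatedCandidate` class and the `temporally_plausible` function are hypothetical names, and the check assumes that a birth year is available for every candidate.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class DatedCandidate:
    label: str
    birth_year: int  # assumed to be available from the knowledge base


def temporally_plausible(candidates: List[DatedCandidate],
                         document_year: int) -> List[DatedCandidate]:
    """Keep only candidates born no later than the document's publication year."""
    return [c for c in candidates if c.birth_year <= document_year]


# Values taken from the example above.
candidates = [
    DatedCandidate("Claudio Constantini", birth_year=1983),
    DatedCandidate("Fabio Constantini (Q3737659)", birth_year=1575),
]

plausible = temporally_plausible(candidates, document_year=1824)
print([c.label for c in plausible])
```

A hard filter like this is deliberately crude: it only removes candidates that are impossible given the publication date, and it does not help to choose among several historical candidates.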

Contribution

We aim to mitigate the scarcity of gold-standard resources containing NERC- and EL-annotated historical documents by releasing a new model for historical entity linking and a new benchmark for the task. We release:

High-level Repository Structure

(first to third level only)

Polifonia-Corpus
│   README.md
│
├───benchmark
│   │   README.md
│   │
│   ├───IAA
│   │
│   ├───preliminary_study
│   │
│   └───v0.1
│
├───images
│
└───models
    │   README.md
    │
    ├───images
    │
    ├───src
    │
    └───vocabs

historical-entity-linking's People

Contributors

arianna-graciotti, roccotrip

historical-entity-linking's Issues

Repeated item

#document_id:TheQuarterlyMusicalMagazineAndReview__1825-025.txt_1373

Missing annotation

#document_id:TheQuarterlyMusicalMagazineAndReview__1825-025.txt_556

T. Howell

Wrong IOB tag

#document_id:TheMusicalWorld__1843-038.txt_67
"Misses" should not be tagged B-person