Post-processing filter for (Named) Entity Linking

License: MIT License

This program filters entity linking candidates and reorders them using heuristics and data from Wikidata and DBpedia, including DBpedia Chapters. The filter tries to delete improbable candidates, such as disambiguation pages or people born after the publication date of the document (if dates are available). It can improve the overall performance of an Entity Linking system.

The example below shows how the filter can remove certain candidates:

[Figure: Example_filter]
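The two deletion heuristics mentioned above (disambiguation pages and people born after the publication date) can be sketched as follows. This is only a minimal illustration, not the code of this repository: the candidate dictionaries, the keys is_disambiguation and birth_date, and the function name filter_candidates are assumptions made for the example, and the reordering step is not shown.

```python
from datetime import date


def filter_candidates(candidates, publication_date=None):
    """Drop improbable candidates (hypothetical data layout, for illustration)."""
    kept = []
    for cand in candidates:
        # Heuristic 1: disambiguation pages are never a valid link target.
        if cand.get("is_disambiguation"):
            continue
        # Heuristic 2: a person born after the document was published
        # cannot be mentioned in it. Only applied when both dates are known.
        born = cand.get("birth_date")
        if publication_date and born and born > publication_date:
            continue
        kept.append(cand)
    return kept


# Placeholder identifiers, not real Wikidata entries.
candidates = [
    {"id": "Q_A", "is_disambiguation": False, "birth_date": None},
    {"id": "Q_B", "is_disambiguation": True, "birth_date": None},
    {"id": "Q_C", "is_disambiguation": False, "birth_date": date(1850, 5, 1)},
]
kept = filter_candidates(candidates, publication_date=date(1790, 1, 2))
print([c["id"] for c in kept])  # ['Q_A']
```

When no publication date is available, the second heuristic is simply skipped and only disambiguation pages are removed.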

Where has it been used?

This filter has been used in several publications:

Input format

The code uses a column-based format that must contain the tokens, the named entity tags and the entity linking candidates. It was created for the format used in CLEF-HIPE-2020:

TOKEN	NE-COARSE-LIT	NE-COARSE-METO	NE-FINE-LIT	NE-FINE-METO	NE-FINE-COMP	NE-NESTED	NEL-LIT	NEL-METO
# language = en
# newspaper = sn83030483
# date = 1790-01-02
# document_id = sn83030483-1790-01-02-a-i0004
FROM	O	O	O	O	O	_	_	_
A	O	O	O	O	O	_	_	_
VIRGINIA	B-loc	O	B-loc	O	O	_	Q1370|Q1070529|NIL|Q16155633|Q4112016	_
PAPER	O	O	O	O	O	_	_	_
.	O	O	O	O	O	_	_	_

We use the comment line # date = to extract the publication date.
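As a rough illustration of the format above, the following sketch reads one document, takes the publication date from the date comment and splits the NEL-LIT cell into candidates. The function name parse_document and the returned structure are assumptions made for this example and are not the reader used in this repository.

```python
from datetime import date

# A shortened copy of the sample above (tab-separated columns).
SAMPLE = (
    "TOKEN\tNE-COARSE-LIT\tNE-COARSE-METO\tNE-FINE-LIT\tNE-FINE-METO\t"
    "NE-FINE-COMP\tNE-NESTED\tNEL-LIT\tNEL-METO\n"
    "# language = en\n"
    "# date = 1790-01-02\n"
    "FROM\tO\tO\tO\tO\tO\t_\t_\t_\n"
    "VIRGINIA\tB-loc\tO\tB-loc\tO\tO\t_\tQ1370|Q1070529|NIL|Q16155633|Q4112016\t_\n"
)


def parse_document(text):
    """Return (publication_date, rows) for one document in the column format."""
    lines = text.splitlines()
    header = lines[0].split("\t")
    publication_date = None
    rows = []
    for line in lines[1:]:
        if not line.strip():
            continue
        # Simplification: every line starting with "#" is treated as a comment.
        if line.startswith("#"):
            key, _, value = line.lstrip("# ").partition("=")
            if key.strip() == "date":
                publication_date = date.fromisoformat(value.strip())
            continue
        row = dict(zip(header, line.split("\t")))
        # "_" marks an empty cell; candidates are separated by "|".
        cell = row["NEL-LIT"]
        row["candidates"] = [] if cell == "_" else cell.split("|")
        rows.append(row)
    return publication_date, rows


publication_date, rows = parse_document(SAMPLE)
print(publication_date)                          # 1790-01-02
print(rows[1]["TOKEN"], rows[1]["candidates"][:2])  # VIRGINIA ['Q1370', 'Q1070529']
```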

Although the latest version (ICADL) makes it possible to specify the columns in which this data is found, as well as the separators and comment markers, we have not tested it with other formats.

Furthermore, the filter uses the data provided by the NER tags to process the candidates. Currently, it only supports NER tags encoded in an IOB format.
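For readers unfamiliar with IOB encoding, here is a minimal sketch of how a sequence of tags can be grouped into entity spans. The function iob_spans is hypothetical and only illustrates the encoding; the repository may handle edge cases differently. In this sketch, a stray I- tag with no open span of the same label is leniently treated as the start of a new span.

```python
def iob_spans(tags):
    """Group IOB tags into (start, end, label) spans; end is exclusive."""
    spans = []
    start, label = None, None
    for i, tag in enumerate(tags):
        prefix, _, tag_label = tag.partition("-")
        # A span continues only on an I- tag with the same label as the open span.
        if prefix == "I" and start is not None and tag_label == label:
            continue
        # Any other tag closes the span that is currently open, if any.
        if start is not None:
            spans.append((start, i, label))
            start, label = None, None
        # B- (or a stray I-) opens a new span; O opens nothing.
        if prefix in ("B", "I"):
            start, label = i, tag_label
    if start is not None:
        spans.append((start, len(tags), label))
    return spans


tags = ["O", "O", "B-loc", "O", "B-pers", "I-pers"]
print(iob_spans(tags))  # [(2, 3, 'loc'), (4, 6, 'pers')]
```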

Citing

Please use this publication for citing this work:

@Article{LinharesPontes2021,
	author={Linhares Pontes, Elvys
	and Cabrera-Diego, Luis Adrián
	and Moreno, Jose G.
	and Boros, Emanuela
	and Hamdi, Ahmed
	and Doucet, Antoine
	and Sidere, Nicolas
	and Coustaty, Mickaël},
	title={MELHISSA: a multilingual entity linking architecture for historical press articles},
	journal={International Journal on Digital Libraries},
	year={2021},
	month={Nov},
	day={29},
	issn={1432-1300},
	doi={10.1007/s00799-021-00319-6},
	url={https://doi.org/10.1007/s00799-021-00319-6}
}

If you use the Weighted-Levenshtein distance with the weights provided in the code, please also cite:

@Inproceedings{8791206,
  author={Nguyen, Thi-Tuyet-Hai and Jatowt, Adam and Coustaty, Mickael and Nguyen, Nhu-Van and Doucet, Antoine},
  booktitle={2019 ACM/IEEE Joint Conference on Digital Libraries (JCDL)}, 
  title={Deep Statistical Analysis of OCR Errors for Effective Post-OCR Processing}, 
  year={2019},
  volume={},
  number={},
  pages={29-38},
  doi={10.1109/JCDL.2019.00015}}
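For context, a weighted Levenshtein distance assigns a lower cost to character substitutions that are likely to be OCR confusions. The sketch below is a generic dynamic-programming version written for illustration: the function name, the cost table layout and the example weight are assumptions, and they are not the weights or the implementation shipped with this code.

```python
def weighted_levenshtein(source, target, substitution_costs=None, default_cost=1.0):
    """Edit distance where some substitutions are cheaper than others."""
    costs = substitution_costs or {}
    # previous[j] holds the distance between the processed prefix of source
    # and the first j characters of target.
    previous = [j * default_cost for j in range(len(target) + 1)]
    for i, s_char in enumerate(source, start=1):
        current = [i * default_cost]
        for j, t_char in enumerate(target, start=1):
            if s_char == t_char:
                substitution = 0.0
            else:
                substitution = costs.get((s_char, t_char), default_cost)
            current.append(min(
                previous[j] + default_cost,      # deletion
                current[j - 1] + default_cost,   # insertion
                previous[j - 1] + substitution,  # substitution or match
            ))
        previous = current
    return previous[-1]


# Hypothetical weight: reading "e" as "c" is assumed to be a cheap OCR confusion.
example_costs = {("e", "c"): 0.3}
print(weighted_levenshtein("paper", "papcr"))                 # 1.0
print(weighted_levenshtein("paper", "papcr", example_costs))  # 0.3
```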

DBpedia Chapters

Some of the DBpedia chapters went offline during 2020-2021, and we do not know whether they will come back online. Thus, there might be issues with specific configurations. This version should be more robust if a chapter goes offline.
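One way to tolerate an unreachable chapter is to skip it and continue with the remaining ones. The sketch below shows that idea only; query_chapters, the chapter names and the stand-in query functions are hypothetical, and no real DBpedia endpoint is contacted.

```python
def query_chapters(entity, chapters):
    """Query each chapter; skip the ones that fail instead of aborting."""
    results, offline = {}, []
    for name, query in chapters.items():
        try:
            results[name] = query(entity)
        except (ConnectionError, TimeoutError):
            # The chapter is unreachable: remember it and carry on.
            offline.append(name)
    return results, offline


# Stand-ins for real chapter clients (assumptions for the example).
def working_chapter(entity):
    return {"entity": entity, "types": ["Place"]}


def offline_chapter(entity):
    raise ConnectionError("chapter unreachable")


results, offline = query_chapters(
    "Virginia",
    {"chapter_a": working_chapter, "chapter_b": offline_chapter},
)
print(sorted(results))  # ['chapter_a']
print(offline)          # ['chapter_b']
```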

Cached data

We provide the cached data that was used for the latest publication. Using a cache reduces the number of queries sent to DBpedia and Wikidata, and therefore speeds up processing.
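The effect of a cache can be illustrated with a small file-backed lookup table: a remote query is issued only for keys that have not been seen before. The class QueryCache, the JSON file layout and the fake_fetch function are assumptions for this sketch; the cached data distributed with the project may be stored differently.

```python
import json
import tempfile
from pathlib import Path


class QueryCache:
    """Dictionary-backed cache persisted as JSON (illustrative layout)."""

    def __init__(self, path, fetch):
        self.path = Path(path)
        self.fetch = fetch  # callable used only on a cache miss
        if self.path.exists():
            self.data = json.loads(self.path.read_text(encoding="utf-8"))
        else:
            self.data = {}

    def get(self, key):
        if key not in self.data:
            self.data[key] = self.fetch(key)
        return self.data[key]

    def save(self):
        self.path.write_text(json.dumps(self.data), encoding="utf-8")


calls = []


def fake_fetch(entity_id):
    """Stand-in for a remote query; records how often it is called."""
    calls.append(entity_id)
    return {"id": entity_id, "is_disambiguation": False}


with tempfile.TemporaryDirectory() as tmp:
    cache_file = Path(tmp) / "cache.json"
    cache = QueryCache(cache_file, fake_fetch)
    cache.get("Q1370")  # miss: triggers one fetch
    cache.get("Q1370")  # hit: served from memory
    cache.save()
    reloaded = QueryCache(cache_file, fake_fetch)
    reloaded.get("Q1370")  # hit: served from the saved file

print(len(calls))  # 1
```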

Dependencies

This project has been tested with Python 3.8. The requirements can be found in requirements.txt and can be installed using pip.

Parent projects

This work is the result of the European Union H2020 projects Embeddia and NewsEye. Embeddia creates NLP tools focused on under-represented European languages and aims to make these tools more accessible to the general public and to media enterprises. Visit Embeddia's GitHub to discover more NLP tools and models created within this project. NewsEye develops methods and tools for digital humanities that improve access to historical newspapers for a wide range of users. Visit NewsEye's GitHub to discover the range of tools developed for the digital humanities.