RONEC - the Romanian Named Entity Corpus - v2.0

RONEC, at version 2.0, holds 12,330 sentences with over 0.5M tokens, annotated with 15 classes, for a total of 80,283 distinctly annotated entities.

It is more than twice the size of the previous version. Quality has increased significantly, as everything has been annotated from scratch to conform to a single standard. For the older version 1, please see the ronec_v1 folder.

Update - 16 Nov 2021: RONEC v2 is available in Hugging Face's datasets library. To use it, run pip install datasets and then the following:

from datasets import load_dataset

# Downloads RONEC v2 (cached after the first call) and returns its splits
ronec = load_dataset("ronec")

The ronec dict contains the 'train', 'validation' and 'test' splits. Access elements like ronec["train"][1000]; the instance format is described in the Format section below. You can also explore the dataset here.

Corpus details

The corpus has the following classes and distribution in the train/valid/test splits:

Class	Total	Train #	Train %	Valid #	Valid %	Test #	Test %
PERSON	26130	19167	73.35	2733	10.46	4230	16.19
GPE	11103	8193	73.79	1182	10.65	1728	15.56
LOC	2467	1824	73.94	270	10.94	373	15.12
ORG	7880	5688	72.18	880	11.17	1312	16.65
LANGUAGE	467	342	73.23	52	11.13	73	15.63
NAT_REL_POL	4970	3673	73.90	516	10.38	781	15.71
DATETIME	9614	6960	72.39	1029	10.70	1625	16.90
PERIOD	1188	862	72.56	129	10.86	197	16.58
QUANTITY	1588	1161	73.11	181	11.40	246	15.49
MONEY	1424	1041	73.10	159	11.17	224	15.73
NUMERIC	7735	5734	74.13	814	10.52	1187	15.35
ORDINAL	1893	1377	72.74	212	11.20	304	16.06
FACILITY	1126	840	74.60	113	10.04	173	15.36
WORK_OF_ART	1596	1157	72.49	176	11.03	263	16.48
EVENT	1102	826	74.95	107	9.71	169	15.34
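
Each percentage column is that split's share of the class total. As a quick sanity check, the sketch below recomputes the shares for three rows copied from the table. The helper name `split_shares` is illustrative and is not part of the corpus tooling.

```python
# Rows copied from the table above: class -> (total, train, valid, test).
COUNTS = {
    "PERSON": (26130, 19167, 2733, 4230),
    "GPE": (11103, 8193, 1182, 1728),
    "EVENT": (1102, 826, 107, 169),
}


def split_shares(total, train, valid, test):
    """Return each split's share of the class total, in percent (2 decimals)."""
    if train + valid + test != total:
        raise ValueError("split counts do not add up to the class total")
    return tuple(round(100 * n / total, 2) for n in (train, valid, test))


for name, row in COUNTS.items():
    # Prints the class name followed by its (train, valid, test) percentages.
    print(name, split_shares(*row))
```

The same check can be run on any other row by adding it to `COUNTS`.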

Format

The data is available as a train/valid/test split in the data folder, as JSON files. Each file is a list of instances, where an instance is a dictionary that contains the following keys:

{
  "id": 10454,
  "tokens": ["Pentru", "a", "vizita", "locația", "care", "va", "fi", "pusă", "la", "dispoziția", "reprezentanților", "consiliilor", "județene", ",", "o", "delegație", "a", "U.N.C.J.R.", ",", "din", "care", "a", "făcut", "parte", "și", "dl", "Constantin", "Ostaficiuc", ",", "președintele", "C.J.T.", ",", "a", "fost", "prezentă", "la", "Bruxelles", ",", "între", "1-3", "martie", "."], 
  "ner_tags": ["O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "B-PERSON", "O", "O", "O", "O", "O", "O", "B-ORG", "O", "O", "O", "O", "O", "O", "O", "B-PERSON", "I-PERSON", "I-PERSON", "I-PERSON", "I-PERSON", "B-ORG", "O", "O", "O", "O", "O", "B-GPE", "O", "B-PERIOD", "I-PERIOD", "I-PERIOD", "O"], 
  "ner_ids": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 1, 2, 2, 2, 2, 3, 0, 0, 0, 0, 0, 5, 0, 19, 20, 20, 0], 
  "space_after": [true, true, true, true, true, true, true, true, true, true, true, true, false, true, true, true, true, false, true, true, true, true, true, true, true, true, true, false, true, true, false, true, true, true, true, true, false, true, true, true, false, false]
}
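
Reading one of the split files needs only the Python standard library. The following is a minimal sketch, assuming each file is a JSON list of instances shaped like the one above. The helper name `load_split` and the alignment check are illustrative additions, not part of the corpus tooling.

```python
import json
from pathlib import Path


def load_split(path):
    """Load one split file: a JSON list of instance dictionaries."""
    with Path(path).open(encoding="utf-8") as handle:
        instances = json.load(handle)
    # Sanity check (an addition of this sketch): the per-token lists
    # of every instance must all have the same length.
    for inst in instances:
        n = len(inst["tokens"])
        lengths = (len(inst["ner_tags"]), len(inst["ner_ids"]), len(inst["space_after"]))
        if any(length != n for length in lengths):
            raise ValueError(f"misaligned annotation lists in instance {inst['id']}")
    return instances
```

A call would look like `load_split("data/train.json")`, where the file name is an assumption; check the data folder for the actual names.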

The tokens are the words of the sentence. The ner_tags are the string tags assigned to each token, following the BIO2 format. For example, the span "între", "1-3", "martie" has three tokens but forms a single PERIOD entity, marked as "B-PERIOD", "I-PERIOD", "I-PERIOD". The ner_ids are the integer encoding of each tag, provided for compatibility with the standard and for direct use in model training. As in the example above, "O" is encoded as 0; every B- tag has an odd id and every I- tag has an even id. The space_after list helps when the dataset needs to be detokenized: a true value means that the token at that position is followed by a space.
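
To make the three annotation fields concrete, here is a small sketch that groups BIO2 tags into entity spans, rebuilds the text from space_after, and recovers the BIO prefix from an integer id. The function names are illustrative, and skipping a stray I- tag that has no matching B- tag is an assumption of this sketch, not documented corpus behaviour. The sample data is the tail of the instance shown above.

```python
def extract_entities(tokens, ner_tags):
    """Group BIO2 tags into (class, start, end_exclusive, tokens) spans."""
    spans = []
    start, label = None, None
    for i, tag in enumerate(ner_tags):
        # An I- tag continues the open span only if its class matches.
        if tag.startswith("I-") and label == tag[2:]:
            continue
        # Any other tag closes the currently open span, if there is one.
        if label is not None:
            spans.append((label, start, i, tokens[start:i]))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
        # Assumption: "O" and stray I- tags open no span and are skipped.
    if label is not None:
        spans.append((label, start, len(ner_tags), tokens[start:]))
    return spans


def detokenize(tokens, space_after):
    """Rebuild the text, adding a space only where space_after is true."""
    return "".join(tok + (" " if sp else "") for tok, sp in zip(tokens, space_after))


def tag_prefix(ner_id):
    """Recover the BIO prefix from an id: 0 is O, odd is B, even is I."""
    if ner_id == 0:
        return "O"
    return "B" if ner_id % 2 == 1 else "I"


# Last seven tokens of the example instance above.
tokens = ["la", "Bruxelles", ",", "între", "1-3", "martie", "."]
ner_tags = ["O", "B-GPE", "O", "B-PERIOD", "I-PERIOD", "I-PERIOD", "O"]
ner_ids = [0, 5, 0, 19, 20, 20, 0]
space_after = [True, False, True, True, True, False, False]

print(extract_entities(tokens, ner_tags))
print(detokenize(tokens, space_after))  # la Bruxelles, între 1-3 martie.
print([tag_prefix(i) for i in ner_ids])
```

Span ends are exclusive, so `tokens[start:end]` returns exactly the tokens of an entity.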

Authors

Stefan Daniel Dumitrescu

Acknowledgements

Big thanks to termene.ro for carefully annotating the full expanded dataset. RONEC v2 would not have seen the light of day without them!

Cite

Please consider citing the following paper as a thank you to the authors of RONEC, even though it describes v1 of the corpus and you are using v2:

Dumitrescu, Stefan Daniel, and Andrei-Marius Avram. "Introducing RONEC--the Romanian Named Entity Corpus." arXiv preprint arXiv:1909.01247 (2019).

or in BibTeX format:

@article{dumitrescu2019introducing,
  title={Introducing RONEC--the Romanian Named Entity Corpus},
  author={Dumitrescu, Stefan Daniel and Avram, Andrei-Marius},
  journal={arXiv preprint arXiv:1909.01247},
  year={2019}
}
