
dumitrescustefan / ronec


Romanian Named Entity Corpus (RONEC) version 2.0

License: MIT License

Language: Python 100.00%
Topics: romanian, ner, corpus, named-entity-recognition, ronec

ronec's Introduction

About me

I'm a Machine Learning Engineer working on cool projects at the intersection of NLP and CV. I finished my PhD in 2011, worked as a Research Scientist at the Research Institute for AI (Romanian Academy) for 7 years, then switched to applied ML as an ML Engineer at Sustainalytics (2017-2019) and now at Adobe.

I'm active in open source, especially on Romanian NLP. Over the years I've published, taught and coded, all while having fun. I like to build stuff.

Showcase on HuggingFace:


Projects I'm proud of

Under development:

  • Romanian Text Corpus (joint project with Mihai Ilie)
  • Word Sense Disambiguation Corpus & Models for Romanian (large scale, long running project)
  • NLI Corpus for Romanian
  • Sentence segmentation for Romanian (because current Romanian tools fail miserably for anything but clean text)

2023

  • May Appeared on live TV discussing AI (#1, #2)
  • Apr Participated in WE Smart Diaspora conference in Timisoara, Romania, presenting "The Impact of Large Language Models"

2022

2021

2020

  • Aug Led the development of the first Romanian ML leaderboard, the LiRo Benchmark, together with Viorica Patraucean and other amazing RomaniaAI volunteers.
  • Jun Proposed and led the development of the Romanian Semantic Textual Similarity dataset, a 1:1 high-quality human translation of the English STS dataset.
  • Apr Trained and released the first monolingual Romanian BERT model, which became the most used BERT model in Romania, with thousands of monthly downloads (a minimal loading sketch follows below).
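
As a quick illustration, here is a hedged sketch of loading that model with the HuggingFace transformers library; the model id below is the one commonly listed on the Hub for this work and should be verified before use:

from transformers import AutoTokenizer, AutoModel

# Assumed Hub model id for the Romanian BERT mentioned above; verify before use.
MODEL_ID = "dumitrescustefan/bert-base-romanian-cased-v1"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID)

# Encode a short Romanian sentence and inspect the hidden states.
inputs = tokenizer("Acesta este un test.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)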

2019 and before

  • RoWordNet pip package providing quick access to the Romanian WordNet. After all these years it's still the only plug-and-play Python package for the Romanian WordNet - and it seems to be working well :)
  • Developed NLP-Cube with Tiberiu Boros (lead). It started as an entry in the CoNLL 2018 shared task and evolved into a multilingual toolkit providing tokenization, sentence segmentation, lemmatization, POS tagging and dependency parsing, trained on the Universal Dependencies datasets.

Selected publications

Google Scholar profile, h-index: 9


ronec's People

Contributors

avramandrei, dumitrescustefan, iliemihai, oodapow


ronec's Issues

Can't train external NER model on RONEC corpus using Spacy

Hello,

I attempted to use the RONEC corpus with spaCy for NER and ran into some problems while following the tutorial for using spaCy with RONEC:

1

I cloned the repository and tried to obtain the .json train and dev files using the convert_conllubio.py script and spaCy's convert tool, as shown in the tutorial:

!python3 ronec/spacy/train-local-model/convert_conllubio.py ronec/ronec/conllup/raw/ronec.conllup .

!python -m spacy convert train_ronec.conllubio . --converter conllubio

When I ran the second command on the train data set, I got this error:

Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.6/dist-packages/spacy/__main__.py", line 33, in <module>
    plac.call(commands[command], sys.argv[1:])
  File "/usr/local/lib/python3.6/dist-packages/plac_core.py", line 367, in call
    cmd, result = parser.consume(arglist)
  File "/usr/local/lib/python3.6/dist-packages/plac_core.py", line 232, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/spacy/cli/convert.py", line 106, in convert
    no_print=no_print,
  File "/usr/local/lib/python3.6/dist-packages/spacy/cli/converters/conllu2json.py", line 25, in conllu2json
    for i, (raw_text, tokens) in enumerate(conll_tuples):
  File "/usr/local/lib/python3.6/dist-packages/spacy/cli/converters/conllu2json.py", line 68, in read_conllx
    id_, word, lemma, pos, tag, morph, head, dep, _1, iob = parts
ValueError: too many values to unpack (expected 10)

When I looked at the train_ronec.conllubio file, I noticed that there were 11 columns on the first line instead of 10, as shown below:

1 Tot tot ADV Rp _ 3 advmod _ _ *

I found that deleting the "*" on the first line solved this problem, but I couldn't really understand why this happened.
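
In case it helps anyone hitting the same ValueError, below is a minimal sketch that generalizes the manual fix: it keeps only the first ten columns of every token line before running spacy convert. It is not part of the repository; the file names are placeholders, and it assumes the file is tab-separated, as in standard CoNLL-U:

# Keep only the first 10 tab-separated columns of each token line so that
# spaCy's conllubio converter can unpack exactly 10 values; comment lines
# and blank lines pass through unchanged. File names are placeholders.
with open("train_ronec.conllubio", encoding="utf-8") as src, \
     open("train_ronec.fixed.conllubio", "w", encoding="utf-8") as dst:
    for line in src:
        stripped = line.rstrip("\n")
        if not stripped or stripped.startswith("#"):
            dst.write(line)
            continue
        dst.write("\t".join(stripped.split("\t")[:10]) + "\n")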

2

Looking at the .json dev and train output files, I noticed that the "ner" field was missing from the train file.
Here is the representation of a word from the dev file:

{
"id":0,
"orth":"Tinerii",
"tag":"Ncmpry",
"head":35,
"dep":"nsubj:pass",
"ner":"U-PERSON"
}

Here is a representation of a word from the train file:

{
"id":35,
"orth":"ROMANIA",
"tag":"Np",
"head":-2,
"dep":"nmod"
}
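
A quick sanity check for how widespread the missing "ner" field is could look like the sketch below; it assumes the converted file follows the usual spaCy v2 training JSON layout (documents containing paragraphs, sentences and tokens), and the file name is a placeholder:

import json

# Count tokens without a "ner" field in a spaCy v2 training JSON file.
# Assumes the documents -> paragraphs -> sentences -> tokens layout.
with open("train_ronec.json", encoding="utf-8") as f:
    docs = json.load(f)

total = missing = 0
for doc in docs:
    for paragraph in doc.get("paragraphs", []):
        for sentence in paragraph.get("sentences", []):
            for token in sentence.get("tokens", []):
                total += 1
                if "ner" not in token:
                    missing += 1

print(f"{missing}/{total} tokens have no 'ner' field")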

3

I moved on with the tutorial and attempted to train the open-source BiLSTM-CNN model found here: https://github.com/kamalkraj/Named-Entity-Recognition-with-Bidirectional-LSTM-CNNs with spaCy's train tool. I cloned the repo and used the command below to train:

!python3 -m spacy train ro Named-Entity-Recognition-with-Bidirectional-LSTM-CNNs/models/ train_ronec.json dev_ronec.json -p ner

I noticed some very strange behavior: the model got stuck at 36%, no matter how long I let it run. This is the output I got:

Training pipeline: ['ner']
Starting with blank model 'ro'
Counting training words (limit=0)
Itn NER Loss NER P NER R NER F Token % CPU WPS
36% 58105/159192 [00:10<00:16, 6216.60it/s]

I tried a few more times and the behavior did not change: it always stopped progressing at around 30-36%. Since it did not return any errors, I am not sure how to debug it, or if I am using it right.

Environment

I am running this on Google Colab. Here is some information about the environment:

  • spaCy version: 2.2.4
  • Platform: Linux-4.19.104+-x86_64-with-Ubuntu-18.04-bionic
  • Python version: 3.6.9

No compatible models found for v2.3.4 of spaCy

Hi,

When I try to download the RONEC corpus by running:
python3 -m spacy download ro_core_news_sm
I get the following error:
No compatible models found for v2.3.4 of spaCy
This also happens when I try to follow the download commands from this link.
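
For reference, this error usually means that no pretrained Romanian pipeline was published for that particular spaCy release; as far as I know, ro_core_news_sm is only available for spaCy v3. A possible workaround, offered as an assumption rather than an official recommendation, is to upgrade spaCy and retry the download:

pip install -U spacy
python -m spacy download ro_core_news_sm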

On a second note, I would like to thank you for your work in bringing the RONEC corpus to the world; any guidance would be much appreciated.

ronec question

Hello,

I've tried to use RONEC in spaCy on the following legal text from the Romanian Official Gazette:

"Aprobă revocarea din calitatea de cenzor a următoarelor persoane: IORGA RADU-GABRIEL,GUCEANU DORINA și TEAHA RODICA și aprobă numirea în calitate de auditor financiar al societatii ARYA CONSULTING S.R.L.,CUI 41617624,J23/3968/2019, pentru o perioadă de 2 ani."

(Roughly: "Approves the revocation of the following persons from the position of censor: IORGA RADU-GABRIEL, GUCEANU DORINA and TEAHA RODICA, and approves the appointment of ARYA CONSULTING S.R.L., CUI 41617624, J23/3968/2019, as the company's financial auditor for a period of 2 years.")

...and I got the following result:

persoane 56 64 PERSON
IORGA RADU 66 76 ORGANIZATION
GABRIEL 77 84 ORGANIZATION
TEAHA RODICA 103 115 ORGANIZATION
ARYA CONSULTING S.R.L.,CUI 181 209 ORGANIZATION
41617624,J23/3968/2019 209 231 NUMERIC_VALUE
2 ani 254 259 DATETIME

I am guessing the legal text from the Official Gazette looks very different from the text in RONEC. Would I need to create a completely new annotated corpus from the Official Gazette, or just add more annotated legal text to the existing RONEC?

Thanks,
Mihai

Weird labeling results

Hi,

I tried to use spaCy with RONEC to check whether some phone contact names are people's names or not, and I got weird results:

nlp = spacy.load("ro_core_news_sm")
doc = 'Digi24'
doc = nlp(doc)
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)
(out): Digi24 0 6 PERSON

It gave me the same results for Pizza Zimbru, BRD, Analize, Vox Roaming...
My spaCy version is '3.0.0rc2'.

Is this normal behavior, or am I not using it correctly?
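
For what it's worth, a single out-of-context token like 'Digi24' gives a statistical NER model almost nothing to condition on, so odd labels are not surprising. Below is a small hedged sketch that compares the bare name with the name used inside a sentence (the Romanian sentence is an invented example):

import spacy

# Compare a bare name with the same name used in context.
# The Romanian sentence below is an invented example.
nlp = spacy.load("ro_core_news_sm")
for text in ("Digi24", "Am urmărit știrile la Digi24 aseară."):
    doc = nlp(text)
    print(text, "->", [(ent.text, ent.label_) for ent in doc.ents])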
