
dumitrescustefan / ronec


Romanian Named Entity Corpus (RONEC) version 2.0

License: MIT License

Language: Python 100.00%
Topics: romanian, ner, corpus, named-entity-recognition, ronec

ronec's Introduction

About me

I'm a Machine Learning Engineer working on cool projects at the intersection of NLP and CV. I finished my PhD in 2011, worked as a Research Scientist at the Research Institute for AI (Romanian Academy) for 7 years, then switched to applied ML as an ML Engineer at Sustainalytics (2017-2019) and now at Adobe.

I'm active in open source, especially on Romanian NLP. Over the years I've published, taught and coded, all while having fun. I like to build stuff.

Showcase on HuggingFace:


Projects I'm proud of

Under development:

  • Romanian Text Corpus (joint project with Mihai Ilie)
  • Word Sense Disambiguation Corpus & Models for Romanian (large scale, long running project)
  • NLI Corpus for Romanian
  • Sentence segmentation for Romanian (because current Romanian tools fail miserably for anything but clean text)

2023

  • May Appeared on live TV discussing AI (#1, #2)
  • Apr Participated in WE Smart Diaspora conference in Timisoara, Romania, presenting "The Impact of Large Language Models"

2022

2021

2020

  • Aug Led the development of the first Romanian ML leaderboard, the LiRo Benchmark, together with Viorica Patraucean and other amazing RomaniaAI volunteers.
  • Jun Proposed and led the development of the Romanian Semantic Textual Similarity dataset, a 1:1 high-quality human translation of the English STS dataset.
  • Apr Trained and released the first monolingual Romanian BERT model, which became the most used BERT model in Romania, with thousands of monthly downloads (a minimal loading sketch follows below).
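
As a quick illustration, here is a hedged sketch of loading that model with the HuggingFace transformers library; the model id below is the one commonly listed on the Hub for this work and should be verified before use:

from transformers import AutoTokenizer, AutoModel

# Assumed Hub model id for the Romanian BERT mentioned above; verify before use.
MODEL_ID = "dumitrescustefan/bert-base-romanian-cased-v1"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID)

# Encode a short Romanian sentence and inspect the hidden states.
inputs = tokenizer("Acesta este un test.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)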

2019 and before

  • RoWordNet pip package providing quick access to the Romanian WordNet. After all these years it's still the only plug-and-play Python package for the Romanian WordNet - and it seems to be working well :)
  • Developed NLP-Cube with Tiberiu Boros (lead). It started as an entry in the CoNLL 2018 shared task and evolved into a multilingual toolkit providing tokenization, sentence segmentation, lemmatization, POS tagging and dependency parsing, trained on the Universal Dependencies datasets.

Selected publications

Google Scholar profile, h-index: 9


ronec's People

Contributors

avramandrei, dumitrescustefan, iliemihai, oodapow


ronec's Issues

Can't train external NER model on RONEC corpus using Spacy

Hello,

I attempted to use the RONEC corpus with spaCy for NER and ran into some problems while following the tutorial for using spaCy with RONEC:

1

I cloned the repository and tried to obtain the .json train and dev files using the convert_conllubio.py script and spaCy's convert tool, as shown in the tutorial:

!python3 ronec/spacy/train-local-model/convert_conllubio.py ronec/ronec/conllup/raw/ronec.conllup .

!python -m spacy convert train_ronec.conllubio . --converter conllubio

When I ran the second command on the train data set, I got this error:

Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.6/dist-packages/spacy/__main__.py", line 33, in <module>
    plac.call(commands[command], sys.argv[1:])
  File "/usr/local/lib/python3.6/dist-packages/plac_core.py", line 367, in call
    cmd, result = parser.consume(arglist)
  File "/usr/local/lib/python3.6/dist-packages/plac_core.py", line 232, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/spacy/cli/convert.py", line 106, in convert
    no_print=no_print,
  File "/usr/local/lib/python3.6/dist-packages/spacy/cli/converters/conllu2json.py", line 25, in conllu2json
    for i, (raw_text, tokens) in enumerate(conll_tuples):
  File "/usr/local/lib/python3.6/dist-packages/spacy/cli/converters/conllu2json.py", line 68, in read_conllx
    id_, word, lemma, pos, tag, morph, head, dep, _1, iob = parts
ValueError: too many values to unpack (expected 10)

When I looked at the train_ronec.conllubio file, I noticed that there were 11 columns on the first line instead of 10, as shown below:

1 Tot tot ADV Rp _ 3 advmod _ _ *

I found that deleting the "*" on the first line solved this problem, but I couldn't really understand why this happened.
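
In case it helps anyone hitting the same ValueError, below is a minimal sketch that generalizes the manual fix: it keeps only the first ten columns of every token line before running spacy convert. It is not part of the repository; the file names are placeholders, and it assumes the file is tab-separated, as in standard CoNLL-U:

# Keep only the first 10 tab-separated columns of each token line so that
# spaCy's conllubio converter can unpack exactly 10 values; comment lines
# and blank lines pass through unchanged. File names are placeholders.
with open("train_ronec.conllubio", encoding="utf-8") as src, \
     open("train_ronec.fixed.conllubio", "w", encoding="utf-8") as dst:
    for line in src:
        stripped = line.rstrip("\n")
        if not stripped or stripped.startswith("#"):
            dst.write(line)
            continue
        dst.write("\t".join(stripped.split("\t")[:10]) + "\n")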

2

Looking at the .json dev and train output files, I noticed that the "ner" field was missing from the train file.
Here is the representation of a word from the dev file:

{
"id":0,
"orth":"Tinerii",
"tag":"Ncmpry",
"head":35,
"dep":"nsubj:pass",
"ner":"U-PERSON"
}

Here is a representation of a word from the train file:

{
"id":35,
"orth":"ROMANIA",
"tag":"Np",
"head":-2,
"dep":"nmod"
}
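
A quick sanity check for how widespread the missing "ner" field is could look like the sketch below; it assumes the converted file follows the usual spaCy v2 training JSON layout (documents containing paragraphs, sentences and tokens), and the file name is a placeholder:

import json

# Count tokens without a "ner" field in a spaCy v2 training JSON file.
# Assumes the documents -> paragraphs -> sentences -> tokens layout.
with open("train_ronec.json", encoding="utf-8") as f:
    docs = json.load(f)

total = missing = 0
for doc in docs:
    for paragraph in doc.get("paragraphs", []):
        for sentence in paragraph.get("sentences", []):
            for token in sentence.get("tokens", []):
                total += 1
                if "ner" not in token:
                    missing += 1

print(f"{missing}/{total} tokens have no 'ner' field")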

3

I moved on with the tutorial and attempted to train the open-source BiLSTM-CNN model found here: https://github.com/kamalkraj/Named-Entity-Recognition-with-Bidirectional-LSTM-CNNs with spaCy's train tool. I cloned the repo and used the command below to train:

!python3 -m spacy train ro Named-Entity-Recognition-with-Bidirectional-LSTM-CNNs/models/ train_ronec.json dev_ronec.json -p ner

I noticed some very strange behavior: the model got stuck at 36%, no matter how long I let it run. This is the output I got:

Training pipeline: ['ner']
Starting with blank model 'ro'
Counting training words (limit=0)
Itn NER Loss NER P NER R NER F Token % CPU WPS
36% 58105/159192 [00:10<00:16, 6216.60it/s]

I tried a few more times and the behavior did not change: it always stopped progressing at around 30-36%. Since it did not return any errors, I am not sure how to debug it, or if I am using it right.

Environment

I am running this on Google Colab. Here is some information about the environment:

  • spaCy version: 2.2.4
  • Platform: Linux-4.19.104+-x86_64-with-Ubuntu-18.04-bionic
  • Python version: 3.6.9

No compatible models found for v2.3.4 of spaCy

Hi,

When I try to download the RONEC corpus by running:
python3 -m spacy download ro_core_news_sm
I get the following error:
No compatible models found for v2.3.4 of spaCy
This also happens when I try to follow the download commands from this link.
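
For reference, this error usually means that no pretrained Romanian pipeline was published for that particular spaCy release; as far as I know, ro_core_news_sm is only available for spaCy v3. A possible workaround, offered as an assumption rather than an official recommendation, is to upgrade spaCy and retry the download:

pip install -U spacy
python -m spacy download ro_core_news_sm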

On a second note, I would like to thank you for your work in bringing the RONEC corpus to the world; any guidance would be much appreciated.

ronec question

Hello,

I've tried to use RONEC in spaCy on the following legal text from the Romanian Official Gazette:

"Aprobă revocarea din calitatea de cenzor a următoarelor persoane: IORGA RADU-GABRIEL,GUCEANU DORINA și TEAHA RODICA și aprobă numirea în calitate de auditor financiar al societatii ARYA CONSULTING S.R.L.,CUI 41617624,J23/3968/2019, pentru o perioadă de 2 ani."

(Roughly: "Approves the revocation of the following persons from the position of censor: IORGA RADU-GABRIEL, GUCEANU DORINA and TEAHA RODICA, and approves the appointment of ARYA CONSULTING S.R.L., CUI 41617624, J23/3968/2019, as the company's financial auditor for a period of 2 years.")

...and I got the following result:

persoane 56 64 PERSON
IORGA RADU 66 76 ORGANIZATION
GABRIEL 77 84 ORGANIZATION
TEAHA RODICA 103 115 ORGANIZATION
ARYA CONSULTING S.R.L.,CUI 181 209 ORGANIZATION
41617624,J23/3968/2019 209 231 NUMERIC_VALUE
2 ani 254 259 DATETIME

I am guessing the legal text from the Official Gazette looks very different from the text in RONEC. Would I need to create a completely new annotated corpus from the Official Gazette, or just add more annotated legal text to the existing RONEC?

Thanks,
Mihai

Weird labeling results

Hi,

I tried to use spaCy with RONEC to check whether some phone contact names are people's names or not, and I got weird results:

nlp = spacy.load("ro_core_news_sm")
doc = 'Digi24'
doc = nlp(doc)
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)
(out): Digi24 0 6 PERSON

It gave me the same results for Pizza Zimbru, BRD, Analize, Vox Roaming...
My spaCy version is '3.0.0rc2'.

Is this normal behavior, or am I not using it correctly?
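
For what it's worth, a single out-of-context token like 'Digi24' gives a statistical NER model almost nothing to condition on, so odd labels are not surprising. Below is a small hedged sketch that compares the bare name with the name used inside a sentence (the Romanian sentence is an invented example):

import spacy

# Compare a bare name with the same name used in context.
# The Romanian sentence below is an invented example.
nlp = spacy.load("ro_core_news_sm")
for text in ("Digi24", "Am urmărit știrile la Digi24 aseară."):
    doc = nlp(text)
    print(text, "->", [(ent.text, ent.label_) for ent in doc.ents])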
