dumitrescustefan / ronec
Romanian Named Entity Corpus (RONEC) version 2.0
License: MIT License
Hi,
When I try to download the RONEC corpus by running:
python3 -m spacy download ro_core_news_sm
I am getting the following error:
No compatible models found for v2.3.4 of spaCy
This also happens when I try to follow the download commands from this link.
On a second note, I would like to thank you for your work in bringing the RONEC corpus to the world. Any guidance would be much appreciated.
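The error text says that the download command found no model it considers compatible with the installed spaCy version (2.3.4 here), so a useful first step is to confirm which versions are actually installed in the active environment. A minimal sketch using only the standard library (`importlib.metadata`, Python 3.8+); the helper name `installed_version` is hypothetical, and the assumption that the model's distribution name equals the model name is mine:

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # "spacy" is the distribution name of spaCy on PyPI.
    print("spaCy version:", installed_version("spacy"))
    # Assumption: a downloaded model is installed as a package whose
    # distribution name matches the model name.
    print("model version:", installed_version("ro_core_news_sm"))
```

If the second line prints `None`, the model is not installed in the environment that Python is running from.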
Hi,
I tried to use spaCy with RONEC to check whether some phone contact names are people's names, and I got odd results:
import spacy

nlp = spacy.load("ro_core_news_sm")
doc = nlp("Digi24")
# Print each detected entity with its character offsets and label.
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)
(out): Digi24 0 6 PERSON
It gave me the same result for Pizza Zimbru, BRD, Analize, Vox Roaming...
My spaCy version is '3.0.0rc2'.
Is this normal behavior, or am I not using it correctly?
Hello,
I've tried to use RONEC in spaCy on the following legal text from the Romanian Official Gazette:
"Aprobă revocarea din calitatea de cenzor a următoarelor persoane: IORGA RADU-GABRIEL,GUCEANU DORINA și TEAHA RODICA și aprobă numirea în calitate de auditor financiar al societatii ARYA CONSULTING S.R.L.,CUI 41617624,J23/3968/2019, pentru o perioadă de 2 ani."
...and I got the following result:
persoane 56 64 PERSON
IORGA RADU 66 76 ORGANIZATION
GABRIEL 77 84 ORGANIZATION
TEAHA RODICA 103 115 ORGANIZATION
ARYA CONSULTING S.R.L.,CUI 181 209 ORGANIZATION
41617624,J23/3968/2019 209 231 NUMERIC_VALUE
2 ani 254 259 DATETIME
I am guessing that legal text from the Official Gazette looks very different from the text in RONEC. Would I need to create a completely new annotated corpus from the Official Gazette, or just add more annotated legal text to the existing RONEC?
Thanks,
Mihai
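Whichever route is taken, the legal sentences would need corrected annotations expressed as character offsets, like the ones the model printed. A minimal sketch of deriving such offsets from hand-picked mentions; the labels below are illustrative choices made for this example (reusing the PERSON and ORGANIZATION names from the output), not annotations from RONEC, and `to_spans` is a hypothetical helper:

```python
from typing import List, Tuple

TEXT = (
    "Aprobă revocarea din calitatea de cenzor a următoarelor persoane: "
    "IORGA RADU-GABRIEL,GUCEANU DORINA și TEAHA RODICA și aprobă numirea "
    "în calitate de auditor financiar al societatii ARYA CONSULTING S.R.L.,"
    "CUI 41617624,J23/3968/2019, pentru o perioadă de 2 ani."
)

# Hand-picked mentions and labels, for illustration only.
MENTIONS = [
    ("IORGA RADU-GABRIEL", "PERSON"),
    ("GUCEANU DORINA", "PERSON"),
    ("TEAHA RODICA", "PERSON"),
    ("ARYA CONSULTING S.R.L.", "ORGANIZATION"),
]


def to_spans(text: str, mentions: List[Tuple[str, str]]) -> List[Tuple[int, int, str]]:
    """Convert (surface, label) pairs into (start, end, label) character spans.

    Uses the first occurrence of each surface string in `text`.
    """
    spans = []
    for surface, label in mentions:
        start = text.find(surface)
        if start == -1:
            raise ValueError(f"mention not found in text: {surface!r}")
        spans.append((start, start + len(surface), label))
    return spans


for start, end, label in to_spans(TEXT, MENTIONS):
    print(TEXT[start:end], start, end, label)
```

Note that the hyphenated name is kept as one span here, whereas the model output above split it at the hyphen.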
Hello,
I attempted to use the RONEC corpus with spaCy for NER and encountered some problems while following the tutorial for using spaCy with RONEC:
I cloned the repository and tried to obtain the .json train and dev files using the convert_conllubio.py script and spaCy's convert tool, as shown in the tutorial:
!python3 ronec/spacy/train-local-model/convert_conllubio.py ronec/ronec/conllup/raw/ronec.conllup .
!python -m spacy convert train_ronec.conllubio . --converter conllubio
When I ran the second command on the train data set, I got this error:
Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.6/dist-packages/spacy/__main__.py", line 33, in <module>
    plac.call(commands[command], sys.argv[1:])
  File "/usr/local/lib/python3.6/dist-packages/plac_core.py", line 367, in call
    cmd, result = parser.consume(arglist)
  File "/usr/local/lib/python3.6/dist-packages/plac_core.py", line 232, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/spacy/cli/convert.py", line 106, in convert
    no_print=no_print,
  File "/usr/local/lib/python3.6/dist-packages/spacy/cli/converters/conllu2json.py", line 25, in conllu2json
    for i, (raw_text, tokens) in enumerate(conll_tuples):
  File "/usr/local/lib/python3.6/dist-packages/spacy/cli/converters/conllu2json.py", line 68, in read_conllx
    id, word, lemma, pos, tag, morph, head, dep, _1, iob = parts
ValueError: too many values to unpack (expected 10)
When I looked at the train_ronec.conllubio file, I noticed that the first line had 11 columns instead of 10, as shown below:
1 Tot tot ADV Rp _ 3 advmod _ _ *
I found that deleting the "*" on the first line solved this problem, but I couldn't really understand why this happened.
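To see whether the first line is the only offender, the whole file can be scanned for rows whose column count differs from the 10 fields that the converter unpacks in the traceback. A minimal sketch, assuming the file is tab-separated and that blank lines and lines starting with "#" are not token rows; `find_bad_rows` is a hypothetical helper, and the second sample row is invented for illustration:

```python
from typing import List, Tuple

EXPECTED_COLUMNS = 10  # read_conllx in the traceback unpacks exactly 10 fields


def find_bad_rows(lines: List[str], expected: int = EXPECTED_COLUMNS) -> List[Tuple[int, int]]:
    """Return (line_number, column_count) for token rows with an unexpected column count."""
    bad = []
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        # Skip blank sentence separators and comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        columns = line.split("\t")
        if len(columns) != expected:
            bad.append((number, len(columns)))
    return bad


sample = [
    "1\tTot\ttot\tADV\tRp\t_\t3\tadvmod\t_\t_\t*\n",  # 11 columns, like the reported first line
    "2\tce\tce\tPRON\tPw\t_\t3\tnsubj\t_\t_\n",       # invented 10-column row
]
print(find_bad_rows(sample))  # [(1, 11)]
```

For the real file, the lines could be read with `open("train_ronec.conllubio", encoding="utf-8").readlines()` and passed to the same function.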
Looking at the .json dev and train output files, I noticed that the "ner" field was missing from the train file.
Here is the representation of a word from the dev file:
{
  "id": 0,
  "orth": "Tinerii",
  "tag": "Ncmpry",
  "head": 35,
  "dep": "nsubj:pass",
  "ner": "U-PERSON"
}
Here is a representation of a word from the train file:
{
  "id": 35,
  "orth": "ROMANIA",
  "tag": "Np",
  "head": -2,
  "dep": "nmod"
}
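A quick way to measure how widespread the missing field is would be to count the token objects that lack a "ner" key. A minimal sketch that makes no assumption about the nesting of the converted file: it treats any dict with an "orth" key as a token, wherever it appears. The function names are hypothetical, and the sample reuses the two token objects quoted above:

```python
import json
from typing import Any, Iterator, Tuple


def iter_tokens(node: Any) -> Iterator[dict]:
    """Yield every dict that looks like a token (has an "orth" key), at any depth."""
    if isinstance(node, dict):
        if "orth" in node:
            yield node
        for value in node.values():
            yield from iter_tokens(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_tokens(item)


def count_missing_ner(data: Any) -> Tuple[int, int]:
    """Return (total_tokens, tokens_without_ner)."""
    total = missing = 0
    for token in iter_tokens(data):
        total += 1
        if "ner" not in token:
            missing += 1
    return total, missing


# The two token objects from the dev and train files, wrapped in a list.
sample = json.loads("""
[
  {"id": 0, "orth": "Tinerii", "tag": "Ncmpry", "head": 35, "dep": "nsubj:pass", "ner": "U-PERSON"},
  {"id": 35, "orth": "ROMANIA", "tag": "Np", "head": -2, "dep": "nmod"}
]
""")
print(count_missing_ner(sample))  # (2, 1)
```

Running it on the full train file would mean loading it with `json.load` and passing the result to `count_missing_ner`.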
I moved on with the tutorial and attempted to train the open-source BiLSTM-CNN model found here: https://github.com/kamalkraj/Named-Entity-Recognition-with-Bidirectional-LSTM-CNNs with spaCy's train tool. I cloned the repo and used the command below to train:
!python3 -m spacy train ro Named-Entity-Recognition-with-Bidirectional-LSTM-CNNs/models/ train_ronec.json dev_ronec.json -p ner
I noticed very strange behavior: the run got stuck at 36%, no matter how long I let it run. This is the output I got:
Training pipeline: ['ner']
Starting with blank model 'ro'
Counting training words (limit=0)
Itn NER Loss NER P NER R NER F Token % CPU WPS
36% 58105/159192 [00:10<00:16, 6216.60it/s]
I tried a few more times and the behavior did not change: it always stopped progressing at around 30-36%. Since it did not return any errors, I am not sure how to debug it or whether I am using it correctly.
I am running this on Google Colab. Here is some information about the environment:
Hello,
I would like to use your model together with the KNIME platform and its spaCy node, but I don't know how to download it from here and use it in KNIME.
Could you help me with this?
Thank you!