dumitrescustefan / ronec

Romanian Named Entity Corpus (RONEC) version 2.0

License: MIT License

Python 100.00%
romanian ner corpus named-entity-recognition ronec

ronec's Issues

No compatible models found for v2.3.4 of spaCy

Hi,

When I try to download the RONEC corpus by running:
python3 -m spacy download ro_core_news_sm
I get the following error:
No compatible models found for v2.3.4 of spaCy
This also happens when I follow the download commands from this link.

On a separate note, thank you for your work in bringing the RONEC corpus to the world. Any guidance would be much appreciated.

Weird labeling results

Hi,

I tried to use spaCy with RONEC to check whether the names of some phone contacts are person names, and I got weird results:

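import spacy

# Load the Romanian pipeline and run it on a single contact name.
nlp = spacy.load("ro_core_news_sm")
doc = nlp("Digi24")
# Print each detected entity with its character offsets and label.
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)
(out): Digi24 0 6 PERSON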

It gave me the same result for Pizza Zimbru, BRD, Analize, Vox Roaming...
My spaCy version is '3.0.0rc2'.

Is this normal behavior, or am I not using it correctly?

ronec question

Hello,

I've tried to use RONEC in spaCy on the following legal text from the Romanian Official Gazette:

"Aprobă revocarea din calitatea de cenzor a următoarelor persoane: IORGA RADU-GABRIEL,GUCEANU DORINA și TEAHA RODICA și aprobă numirea în calitate de auditor financiar al societatii ARYA CONSULTING S.R.L.,CUI 41617624,J23/3968/2019, pentru o perioadă de 2 ani."

...and I got the following result:

persoane 56 64 PERSON
IORGA RADU 66 76 ORGANIZATION
GABRIEL 77 84 ORGANIZATION
TEAHA RODICA 103 115 ORGANIZATION
ARYA CONSULTING S.R.L.,CUI 181 209 ORGANIZATION
41617624,J23/3968/2019 209 231 NUMERIC_VALUE
2 ani 254 259 DATETIME

I am guessing the legal text from the Official Gazette looks very different from the text in RONEC. Would I need to create a completely new annotated corpus from the Official Gazette, or could I just add more annotated legal text to the existing RONEC?

Thanks,
Mihai

Can't train external NER model on RONEC corpus using Spacy

Hello,

I attempted to use the RONEC corpus with spaCy for NER and ran into some problems while following the tutorial for using spaCy with RONEC:

1

I cloned the repository and tried to obtain the .json train and dev files using the convert_conllubio.py script and spaCy's convert tool, as shown in the tutorial:

!python3 ronec/spacy/train-local-model/convert_conllubio.py ronec/ronec/conllup/raw/ronec.conllup .

!python -m spacy convert train_ronec.conllubio . --converter conllubio

When I ran the second command on the train data set, I got this error:

Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.6/dist-packages/spacy/__main__.py", line 33, in <module>
    plac.call(commands[command], sys.argv[1:])
  File "/usr/local/lib/python3.6/dist-packages/plac_core.py", line 367, in call
    cmd, result = parser.consume(arglist)
  File "/usr/local/lib/python3.6/dist-packages/plac_core.py", line 232, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/spacy/cli/convert.py", line 106, in convert
    no_print=no_print,
  File "/usr/local/lib/python3.6/dist-packages/spacy/cli/converters/conllu2json.py", line 25, in conllu2json
    for i, (raw_text, tokens) in enumerate(conll_tuples):
  File "/usr/local/lib/python3.6/dist-packages/spacy/cli/converters/conllu2json.py", line 68, in read_conllx
    id_, word, lemma, pos, tag, morph, head, dep, _1, iob = parts
ValueError: too many values to unpack (expected 10)

When I looked at the train_ronec.conllubio file, I noticed that there were 11 columns on the first line instead of 10, as shown below:

1 Tot tot ADV Rp _ 3 advmod _ _ *

I found that deleting the "*" on the first line solved this problem, but I couldn't really understand why this happened.
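
One way to find every row with an unexpected column count before running the converter is a small check like the one below. This is only a sketch, not part of the RONEC tooling: the name find_bad_rows is made up, the second sample row is invented for illustration, and it assumes columns are separated by tabs (falling back to any whitespace when a line has no tab).

```python
import io

EXPECTED_COLUMNS = 10  # the count the traceback above says the reader unpacks


def find_bad_rows(lines, expected=EXPECTED_COLUMNS):
    """Return (line_number, column_count) for each token row whose
    column count differs from `expected`."""
    bad = []
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        # Blank lines separate sentences; '#' lines are comments.
        if not line.strip() or line.startswith("#"):
            continue
        # Assumption: tab-separated columns, else split on any whitespace.
        parts = line.split("\t") if "\t" in line else line.split()
        if len(parts) != expected:
            bad.append((number, len(parts)))
    return bad


# First row is the one quoted above; the second is a made-up 10-column row.
sample = io.StringIO(
    "1 Tot tot ADV Rp _ 3 advmod _ _ *\n"
    "2 ce ce PRON Pw3--r _ 3 nsubj _ O\n"
)
print(find_bad_rows(sample))  # [(1, 11)]
```

To run it on the real file, pass an open file handle instead of the sample, for example open("train_ronec.conllubio", encoding="utf-8").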

2

Looking at the .json dev and train output files, I noticed that the "ner" field was missing from the train file.
Here is the representation of a word from the dev file:

{
  "id":0,
  "orth":"Tinerii",
  "tag":"Ncmpry",
  "head":35,
  "dep":"nsubj:pass",
  "ner":"U-PERSON"
}

Here is a representation of a word from the train file:

{
  "id":35,
  "orth":"ROMANIA",
  "tag":"Np",
  "head":-2,
  "dep":"nmod"
}
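
To see how widespread the missing field is, the converted file can be scanned for token records that lack "ner". The sketch below makes no claim about the exact layout of the converted JSON: it simply treats any object with an "orth" key as a token, wherever it is nested. The function names iter_tokens and ner_coverage are made up for this example.

```python
import json


def iter_tokens(node):
    """Yield every dict that looks like a token (has an "orth" key),
    wherever it sits in the decoded JSON."""
    if isinstance(node, dict):
        if "orth" in node:
            yield node
        for value in node.values():
            yield from iter_tokens(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_tokens(item)


def ner_coverage(data):
    """Return (tokens_with_ner, total_tokens)."""
    total = with_ner = 0
    for token in iter_tokens(data):
        total += 1
        if "ner" in token:
            with_ner += 1
    return with_ner, total


# The two token records quoted above, wrapped in a list for the demo.
sample = json.loads("""
[
  {"id": 0, "orth": "Tinerii", "tag": "Ncmpry", "head": 35,
   "dep": "nsubj:pass", "ner": "U-PERSON"},
  {"id": 35, "orth": "ROMANIA", "tag": "Np", "head": -2, "dep": "nmod"}
]
""")
print(ner_coverage(sample))  # (1, 2)
```

For the real files, load them with json.load on an open handle to train_ronec.json or dev_ronec.json and pass the result to ner_coverage. A first number of 0 would mean no token in the file carries a "ner" value.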

3

I moved on with the tutorial and attempted to train the open-source BiLSTM-CNN model found here: https://github.com/kamalkraj/Named-Entity-Recognition-with-Bidirectional-LSTM-CNNs with spaCy's train tool. I cloned the repo and used the command below to train:

!python3 -m spacy train ro Named-Entity-Recognition-with-Bidirectional-LSTM-CNNs/models/ train_ronec.json dev_ronec.json -p ner

The behavior was very strange: progress got stuck at 36%, no matter how long I let it run. This is the output I got:

Training pipeline: ['ner']
Starting with blank model 'ro'
Counting training words (limit=0)
Itn NER Loss NER P NER R NER F Token % CPU WPS
36% 58105/159192 [00:10<00:16, 6216.60it/s]

I tried a few more times and the behavior did not change: it always stopped progressing at around 30-36%. Since it did not return any errors, I am not sure how to debug it, or if I am using it right.

Environment

I am running this on Google Colab. Here is some information about the environment:

  • spaCy version: 2.2.4
  • Platform: Linux-4.19.104+-x86_64-with-Ubuntu-18.04-bionic
  • Python version: 3.6.9
