
nlp4j-old's Introduction

NLP4J

The NLP4J project provides an NLP toolkit for JVM languages. The project is under the Apache 2 license and is currently developed by the NLP Research Group at Emory University. Please join our forum to get notifications about new releases and to give feedback on this project.

Quick Start

Components

Supplements

nlp4j-old's People

Contributors

benson-basis, jdchoi77


nlp4j-old's Issues

Errors in POS Tagging

For the following cases,

    1   You you PRP _   2   nsubj   _
    2   know    know    VBP _   7   parataxis   _   O
    3   what    what    WP  _   2   ccomp   _   O
    4   ,   ,   ,   _   7   punct   _   O
    5   you you PRP _   7   nsubj   _
    6   ’ve   ’ve   NNP _   7   nsubj   _   U-PERSON
    7   convinced   convince    VBD _   0   root    _   O
    8   me  me  PRP _   7   dobj    _
    9   ,   ,   ,   _   7   punct   _   O
    10  maybe   maybe   RB  _   14  advmod  _   O
    11  tonight tonight NN  _   14  npadvmod    _   U-TIME
    12  we  we  PRP _   14  nsubj   _   O
    13  should  should  MD  _   14  aux _   O
    14  sneak   sneak   VB  _   7   ccomp   _   O
    15  in  in  RP  _   14  prt _   O
    16  and and CC  _   14  cc  _   O
    17  shampoo shampoo VB  _   14  conj    _   O
    18  her her PRP$    _   19  poss    _
    19  carpet  carpet  NN  _   17  dobj    _   O
    20  .   .   .   _   7   punct   _   O

    1   You you PRP _   4   nsubj   _
    2   do  do  VBP _   4   aux _   O
    3   n’t   n’t   PRP _   4   nsubj   _
    4   think   think   VB  _   0   root    _   O
    5   that    that    IN  _   6   nsubj   _   O
    6   crosses cross   VBZ _   4   ccomp   _   O
    7   a   a   DT  _   8   det _   O
    8   line    line    NN  _   6   dobj    _   O
    9   ?   ?   .   _   4   punct   _   O

The tokens 've and n't appear to have the wrong POS tags. This is a recurring error with conversational data.

POM in NLP4J 1.1.1 does not use fixed version

I have upgraded the NLP4J dependency in my project to 1.1.1, but the transitive dependencies are still resolved to 1.1.0. This seems to be because the POM of the nlp4j artifact does not pin its dependencies to fixed versions, but instead uses RELEASE:

<dependency>
  <groupId>edu.emory.mathcs.nlp</groupId>
  <artifactId>nlp4j-core</artifactId>
  <version>RELEASE</version>
</dependency>
<dependency>
  <groupId>edu.emory.mathcs.nlp</groupId>
  <artifactId>nlp4j-tokenization</artifactId>
  <version>RELEASE</version>
</dependency>
<dependency>
  <groupId>edu.emory.mathcs.nlp</groupId>
  <artifactId>nlp4j-morphology</artifactId>
  <version>RELEASE</version>
</dependency>

It would be nice if fixed versions were used, so that one doesn't have to add explicitly versioned dependencies to their POM for all the transitive NLP4J dependencies.
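For reference, the requested fix would look like pinning each dependency to an explicit version (the version numbers below are illustrative; the actual release to pin against should match the parent artifact):

```xml
<dependency>
  <groupId>edu.emory.mathcs.nlp</groupId>
  <artifactId>nlp4j-core</artifactId>
  <version>1.1.1</version>
</dependency>
<dependency>
  <groupId>edu.emory.mathcs.nlp</groupId>
  <artifactId>nlp4j-tokenization</artifactId>
  <version>1.1.1</version>
</dependency>
<dependency>
  <groupId>edu.emory.mathcs.nlp</groupId>
  <artifactId>nlp4j-morphology</artifactId>
  <version>1.1.1</version>
</dependency>
```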

Get fields from FeatureTemplate

It would be nice if there were a getter for the fields (features) used in a feature template. The information can be crudely extracted from the toString() method (or via reflection), but a proper getter would be cleaner.
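What is being requested amounts to exposing the template's internal field list through an accessor. A generic sketch with hypothetical names (not the actual FeatureTemplate internals):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class TemplateWithGetter {
    private final List<String> featureFields = new ArrayList<>();

    public void addFeatureField(String field) {
        featureFields.add(field);
    }

    // The requested getter: returns a read-only view of the fields, so
    // callers no longer need to parse toString() or resort to reflection.
    public List<String> getFeatureFields() {
        return Collections.unmodifiableList(featureFields);
    }
}
```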

POS tagger model takes time to load

Hi,
I was trying to use your POS tagger via the NLPDecodeRaw class. I am developing an app in Python where I need to POS-tag one sentence at a time, so every time I call the Java class it loads the model, which takes around 10 seconds; that is a lot in a real-time scenario. I tried serializing the decoder object to reuse one loaded copy of the model, but the NLPDecodeRaw class is not serializable.
Can you please suggest a way to POS-tag on a sentence-by-sentence basis (not a file) without loading the model every time, or any other way to reduce the turnaround time?
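One common workaround for this class of problem is to keep the model inside a single long-lived JVM process and stream sentences to it over stdin, so the load cost is paid exactly once. The sketch below is generic: the decoder is a placeholder function, since the real NLPDecodeRaw API is not reproduced here.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.function.Function;

public class TaggerServer {
    // Placeholder for the expensive, load-once model. In a real setup this
    // would construct NLPDecodeRaw (or NLPDecoder) a single time.
    static Function<String, String> loadModelOnce() {
        // ... the ~10s deserialization would happen here, once ...
        return sentence -> "TAGGED: " + sentence; // hypothetical stand-in
    }

    public static void main(String[] args) throws Exception {
        Function<String, String> tagger = loadModelOnce(); // pay the cost once
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        String line;
        while ((line = in.readLine()) != null) { // one sentence per line
            System.out.println(tagger.apply(line)); // no reload per sentence
        }
    }
}
```

A Python client can then write sentences to this process's stdin and read tagged output back, instead of spawning a fresh JVM per sentence.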

StringIndexOutOfBoundsException

special7.txt
config-decode-en.xml.txt

Not sure how to characterize this, other than the tokenizer does not seem to do enough error handling or bounds checking. I tried to reduce the input as much as possible to reproduce the issue. Would appreciate any feedback on how to pre-process input data.

Running from command line:
java -Xmx4g -XX:+UseConcMarkSweepGC -cp nlp4j-1.1.1.jar edu.emory.mathcs.nlp.bin.NLPDecode -c config-decode-en.xml -i special7.txt

special7.txt contains:

keywords: {words},
URL: http://anyurl.com A. Abbot, "Help BL(1) nephew,"

This appears to break the online demo as well.

java.lang.StringIndexOutOfBoundsException: String index out of range: 63
at java.lang.String.substring(String.java:1963)
at edu.emory.mathcs.nlp.tokenization.Tokenizer.mergeParenthesis(Tokenizer.java:650)
at edu.emory.mathcs.nlp.tokenization.Tokenizer.finalize(Tokenizer.java:608)
at edu.emory.mathcs.nlp.tokenization.Tokenizer.tokenizeWhiteSpaces(Tokenizer.java:165)
at edu.emory.mathcs.nlp.tokenization.Tokenizer.tokenize(Tokenizer.java:113)
at edu.emory.mathcs.nlp.tokenization.Tokenizer.segmentize(Tokenizer.java:133)
at edu.emory.mathcs.nlp.decode.AbstractNLPDecoder.decodeRaw(AbstractNLPDecoder.java:221)
at edu.emory.mathcs.nlp.decode.AbstractNLPDecoder.decode(AbstractNLPDecoder.java:182)
at edu.emory.mathcs.nlp.decode.AbstractNLPDecoder$NLPTask.run(AbstractNLPDecoder.java:345)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

CLI for Eval?

I went looking for a CLI for the Eval component, so that I could get scores for existing models. I didn't see one. If I'm right that it's not there, would you like a PR that creates one?

GlobalLexica unable to support multiple lexica

The static implementation of GlobalLexica prevents loading different types of lexica, e.g. in a multi-threaded environment. Examples would be domain-specific lexica or lexica for different languages.
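A possible direction, sketched here with hypothetical names, is to hold lexica in instance state (keyed by name or language) rather than in static fields, so each decoder or thread can load its own set independently:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical instance-scoped registry: each decoder (or language) owns its
// own lexica map, so concurrent threads can load different lexica without
// clobbering shared static state.
public class LexicaRegistry {
    private final Map<String, Object> lexica = new ConcurrentHashMap<>();

    public void put(String name, Object lexicon) {
        lexica.put(name, lexicon);
    }

    public Object get(String name) {
        return lexica.get(name);
    }
}
```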

Error execute: component not found

I'm trying to train with ClearParser and I get this error. Before executing the command I set export CLASSPATH=nlp4j-1.1.0.jar:. and running java edu.emory.mathcs.nlp.bin.Version prints the version info, so it is installed correctly.

Command line: java -Xmx5g -XX:+UseConcMarkSweepGC edu.emory.mathcs.nlp.bin.NLPTrain -mode dep -c config-train-dep.xml -t /home/iago/Escritorio/idiomasClearParser/UD_English/en-ud-train.conllu -d /home/iago/Escritorio/idiomasClearParser/UD_English/en-ud-dev.conllu -m bestModel-dep.xz

I'm using this config file: https://github.com/emorynlp/nlp4j/blob/master/src/main/resources/edu/emory/mathcs/nlp/configuration/config-train-dep.xml

Error:

log4j:WARN No appenders could be found for logger (edu.emory.mathcs.nlp.common.util.BinUtils).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
java.io.FileNotFoundException: edu/emory/mathcs/nlp/lexica/en-brown-clusters-simplified-lowercase.xz (No such file or directory)
    at java.io.FileInputStream.open0(Native Method)
    at java.io.FileInputStream.open(FileInputStream.java:195)
    at java.io.FileInputStream.<init>(FileInputStream.java:138)
    at java.io.FileInputStream.<init>(FileInputStream.java:93)
    at edu.emory.mathcs.nlp.common.util.IOUtils.createFileInputStream(IOUtils.java:147)
    at edu.emory.mathcs.nlp.common.util.IOUtils.getInputStream(IOUtils.java:316)
    at edu.emory.mathcs.nlp.component.template.util.GlobalLexica.getLexiconFieldPair(GlobalLexica.java:82)
    at edu.emory.mathcs.nlp.component.template.util.GlobalLexica.getLexiconFieldPair(GlobalLexica.java:72)
    at edu.emory.mathcs.nlp.component.template.util.GlobalLexica.<init>(GlobalLexica.java:64)
    at edu.emory.mathcs.nlp.component.template.util.GlobalLexica.<init>(GlobalLexica.java:55)
    at edu.emory.mathcs.nlp.bin.NLPTrain$1.createGlobalLexica(NLPTrain.java:108)
    at edu.emory.mathcs.nlp.component.template.train.OnlineTrainer.train(OnlineTrainer.java:193)
    at edu.emory.mathcs.nlp.component.template.train.OnlineTrainer.train(OnlineTrainer.java:187)
    at edu.emory.mathcs.nlp.bin.NLPTrain.train(NLPTrain.java:76)
    at edu.emory.mathcs.nlp.bin.NLPTrain.main(NLPTrain.java:115)
java.io.IOException: Stream closed
    at java.io.BufferedInputStream.getInIfOpen(BufferedInputStream.java:159)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
    at java.io.DataInputStream.readFully(DataInputStream.java:195)
    at java.io.DataInputStream.readFully(DataInputStream.java:169)
    at org.tukaani.xz.SingleXZInputStream.initialize(Unknown Source)
    at org.tukaani.xz.SingleXZInputStream.<init>(Unknown Source)
    at org.tukaani.xz.XZInputStream.<init>(Unknown Source)
    at org.tukaani.xz.XZInputStream.<init>(Unknown Source)
    at edu.emory.mathcs.nlp.common.util.IOUtils.createXZBufferedInputStream(IOUtils.java:220)
    at edu.emory.mathcs.nlp.common.util.IOUtils.createObjectXZBufferedInputStream(IOUtils.java:259)
    at edu.emory.mathcs.nlp.component.template.util.GlobalLexica.getLexiconFieldPair(GlobalLexica.java:82)
    at edu.emory.mathcs.nlp.component.template.util.GlobalLexica.getLexiconFieldPair(GlobalLexica.java:72)
    at edu.emory.mathcs.nlp.component.template.util.GlobalLexica.<init>(GlobalLexica.java:64)
    at edu.emory.mathcs.nlp.component.template.util.GlobalLexica.<init>(GlobalLexica.java:55)
    at edu.emory.mathcs.nlp.bin.NLPTrain$1.createGlobalLexica(NLPTrain.java:108)
    at edu.emory.mathcs.nlp.component.template.train.OnlineTrainer.train(OnlineTrainer.java:193)
    at edu.emory.mathcs.nlp.component.template.train.OnlineTrainer.train(OnlineTrainer.java:187)
    at edu.emory.mathcs.nlp.bin.NLPTrain.train(NLPTrain.java:76)
    at edu.emory.mathcs.nlp.bin.NLPTrain.main(NLPTrain.java:115)
Exception in thread "main" java.lang.NullPointerException
    at java.io.ObjectInputStream$PeekInputStream.read(ObjectInputStream.java:2338)
    at java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2351)
    at java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:2822)
    at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:804)
    at java.io.ObjectInputStream.<init>(ObjectInputStream.java:301)
    at edu.emory.mathcs.nlp.common.util.IOUtils.createObjectXZBufferedInputStream(IOUtils.java:259)
    at edu.emory.mathcs.nlp.component.template.util.GlobalLexica.getLexiconFieldPair(GlobalLexica.java:82)
    at edu.emory.mathcs.nlp.component.template.util.GlobalLexica.getLexiconFieldPair(GlobalLexica.java:72)
    at edu.emory.mathcs.nlp.component.template.util.GlobalLexica.<init>(GlobalLexica.java:64)
    at edu.emory.mathcs.nlp.component.template.util.GlobalLexica.<init>(GlobalLexica.java:55)
    at edu.emory.mathcs.nlp.bin.NLPTrain$1.createGlobalLexica(NLPTrain.java:108)
    at edu.emory.mathcs.nlp.component.template.train.OnlineTrainer.train(OnlineTrainer.java:193)
    at edu.emory.mathcs.nlp.component.template.train.OnlineTrainer.train(OnlineTrainer.java:187)
    at edu.emory.mathcs.nlp.bin.NLPTrain.train(NLPTrain.java:76)
    at edu.emory.mathcs.nlp.bin.NLPTrain.main(NLPTrain.java:115)

Why am I getting this error? I unpacked the .jar and there is no "lexica" folder, nor an "en-brown-clusters-simplified-lowercase.xz" file. Where can I find it?

Regards

Universal Dependencies

Do you know of any converter from the NLP4J representation to Universal Dependencies?

Or, conversely, do you have any sense of what would happen if someone trained the dependency model on UD input?

Are components reentrant? Can components be cloned?

It takes quite a while to decompress and deserialize a component, and we're going to need to do so often. If they are thread-safe, it's not an issue. If they can be cloned, it's not too bad, but the interface does not advertise this. Can you give me any advice? I can certainly dive into the code and start to look at how to modify it in this direction if needed.

nbest parser output

Hi, how do I get n-best output?

    nodes = nlp4j.decode(sentence);

This seems to generate only the 1-best parse.

thanks

Where are the models?

I am trying to get a very basic dependency parser working (give it a sentence in a string and spit out the dependency tree).

I can't seem to find the models at the indicated location though:
https://bitbucket.org/emorynlp/nlp4j-english/src/fc6cf377142cb554ab74c7b6377eff6d28e43620/src/main/resources/edu/emory/mathcs/nlp/models/?at=master

Only en-pos.xz shows up there, no other models. Am I missing something?

Once I get all those models listed there I assume I'll use this:

https://github.com/emorynlp/nlp4j-demo/blob/master/src/main/java/edu/emory/mathcs/nlp/demo/DEPDecode.java

To get the dependency parser up and running? Do you have any better (quick-start-guide) kind of thing for getting the basic dependency parser up and running?

Also, a final question :) What is the ETA for SRL and sentiment analysis?

The dependency parser appears to produce multiple nodes with the head of the tree as the root.

I need to represent the output of the dependency parser as a collection of dependency tuples: [dprel, governor, dependency].

So, I wrote:

private Dependency nodeToDep(NLPNode n) {
    // Node IDs are 1-based (0 is the artificial root), so shift both the
    // governor and the dependent to 0-based indices for the tuple.
    int gov = n.getDependencyHead().getID();
    return new Dependency.Builder(n.getDependencyLabel(), gov - 1, n.getID() - 1).build();
}

This works fine for many examples. However, for this sentence:

and the modern, lightweight, steel, collapsible wheelchair was created by Harry Jennings and his disabled friend Herbert Everest, in 1933.

I end up with two nodes with a head of 0. My representation is below; see the two occurrences of 'ROOT'. Am I misinterpreting the output data structure?

cc(created-12, and-1)
det(modern-3, the-2)
conj(ROOT-0, modern-3)
punct(modern-3, ,-4)
conj(modern-3, lightweight-5)
punct(lightweight-5, ,-6)
conj(lightweight-5, steel-7)
punct(created-12, ,-8)
nmod(wheelchair-10, collapsible-9)
nsubjpass(created-12, wheelchair-10)
auxpass(created-12, was-11)
root(ROOT-0, created-12)
agent(created-12, by-13)
compound(Jennings-15, Harry-14)
pobj(by-13, Jennings-15)
cc(Jennings-15, and-16)
poss(friend-19, his-17)
nmod(friend-19, disabled-18)
conj(Jennings-15, friend-19)
compound(Everest-21, Herbert-20)
appos(friend-19, Everest-21)
punct(created-12, ,-22)
prep(created-12, in-23)
pobj(in-23, 1933-24)
punct(created-12, .-25)

Can i use nlp4j with spanish?

Hello,
I was thinking about using this library in a research project; the issue is that the project is oriented toward working with Spanish text. Is there any way to use the library with Spanish?

Thanks. 😄

Two root nodes from one sentence to the DP, any advice?

We trained a UD model with the UD treebank plus the WSJ converted to UD with the Stanford converter. Every so often, a sentence we run comes out with a seemingly impossible structure with an 'extra' root node. The cases we've seen have always involved the 'conj' label.

Does this suggest anything to you? I could share the data and/or the model file if you are interested.

In Anglo-American common law courts, appellate review of lower court decisions may also be obtained by filing a petition for review by prerogative writ in certain cases.

case(courts-5, In-1)
amod(courts-5, Anglo-American-2)
amod(courts-5, common-3)
compound(courts-5, law-4)
conj(ROOT-0, courts-5)
punct(courts-5, ,-6)
amod(review-8, appellate-7)
conj(courts-5, review-8)
case(decisions-12, of-9)
amod(decisions-12, lower-10)
compound(decisions-12, court-11)
nmod(review-8, decisions-12)
aux(obtained-16, may-13)
advmod(obtained-16, also-14)
auxpass(obtained-16, be-15)
root(ROOT-0, obtained-16)
mark(filing-18, by-17)
advcl(obtained-16, filing-18)
det(petition-20, a-19)
dobj(filing-18, petition-20)
case(review-22, for-21)
nmod(filing-18, review-22)
case(writ-25, by-23)
compound(writ-25, prerogative-24)
nmod(filing-18, writ-25)
case(cases-28, in-26)
amod(cases-28, certain-27)
nmod(filing-18, cases-28)
punct(obtained-16, .-29)

Garbage tokens?

I just ran the POS tagger using the code below. Unfortunately, the first token seems to be garbage. Can I always assume this will be the case?

    val config = DecodeConfig(IOUtils.createFileInputStream(configUri))
    val decoder = NLPDecoder(config)
    val tokens = decoder.decode("For god so loved.")

    for(p in tokens) println(p)

Output:

0   @#r$%   @#r$%   @#r$%   _   _   _   _   @#r$%
1   For for IN  _   _   _   _   @#r$%
2   god god NN  pos2=UH _   _   _   @#r$%
3   so  so  RB  _   _   _   _   @#r$%
4   loved   love    VBD pos2=VBN    _   _   _   @#r$%
5   .   .   .   _   _   _   _   @#r$%

[Fatal Error] :52:71:org.xml.sax.SAXParseException

Hi, I'm trying to train a model with a train and dev set from Universal Dependencies 1.2, and after adapting them to the ClearNLP format I'm getting this error on the command line:

[Fatal Error] :52:71: The attribute name "data-pjax-transient" associated with an element type "meta" must be followed by the '=' character.
org.xml.sax.SAXParseException; lineNumber: 52; columnNumber: 71; The attribute name "data-pjax-transient" associated with an element type "meta" must be followed by the '=' character.
    at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:257)
    at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:339)
    at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:121)
    at edu.emory.mathcs.nlp.common.util.XMLUtils.getDocumentElement(XMLUtils.java:107)
    at edu.emory.mathcs.nlp.component.template.util.GlobalLexica.<init>(GlobalLexica.java:55)
    at edu.emory.mathcs.nlp.bin.NLPTrain$1.createGlobalLexica(NLPTrain.java:108)
    at edu.emory.mathcs.nlp.component.template.train.OnlineTrainer.train(OnlineTrainer.java:193)
    at edu.emory.mathcs.nlp.component.template.train.OnlineTrainer.train(OnlineTrainer.java:187)
    at edu.emory.mathcs.nlp.bin.NLPTrain.train(NLPTrain.java:76)
    at edu.emory.mathcs.nlp.bin.NLPTrain.main(NLPTrain.java:115)
Exception in thread "main" java.lang.NullPointerException
    at edu.emory.mathcs.nlp.common.util.XMLUtils.getFirstElementByTagName(XMLUtils.java:74)
    at edu.emory.mathcs.nlp.component.template.util.GlobalLexica.<init>(GlobalLexica.java:60)
    at edu.emory.mathcs.nlp.component.template.util.GlobalLexica.<init>(GlobalLexica.java:55)
    at edu.emory.mathcs.nlp.bin.NLPTrain$1.createGlobalLexica(NLPTrain.java:108)

My train and dev files look like this (the separator is '\t'):

1 [ [ PUNCT _ 10 punct _ _
2 This this DET Number=Sing|PronType=Dem 3 det _ _
3 killing killing NOUN Number=Sing 10 nsubj _ _
4 of of ADP _ 7 case _ _
5 a a DET Definite=Ind|PronType=Art 7 det _ _
6 respected respected ADJ Degree=Pos 7 amod _ _
7 cleric cleric NOUN Number=Sing 3 nmod _ _
8 will will AUX VerbForm=Fin 10 aux _ _
9 be be AUX VerbForm=Inf 10 aux _ _
10 causing cause VERB VerbForm=Ger 0 root _ _
11 us we PRON Case=Acc|Number=Plur|Person=1|PronType=Prs 10 iobj _ _
12 trouble trouble NOUN Number=Sing 10 dobj _ _
13 for for ADP _ 14 case _ _
14 years year NOUN Number=Plur 10 nmod _ _
15 to to PART _ 16 mark _ _
16 come come VERB VerbForm=Inf 14 acl _ _
17 . . PUNCT _ 10 punct _ _
18 ] ] PUNCT _ 10 punct _ _

1 DPA DPA PROPN Number=Sing 0 root _ _
2 : : PUNCT _ 1 punct _ _
3 Iraqi iraqi ADJ Degree=Pos 4 amod _ _
4 authorities authority NOUN Number=Plur 5 nsubj _ _
5 announced announce VERB Mood=Ind|Tense=Past|VerbForm=Fin 1 parataxis _ _
6 that that SCONJ _ 9 mark _ _
7 they they PRON Case=Nom|Number=Plur|Person=3|PronType=Prs 9 nsubj _ _
8 had have AUX Mood=Ind|Tense=Past|VerbForm=Fin 9 aux _ _
9 busted bust VERB Tense=Past|VerbForm=Part 5 ccomp _ _
10 up up ADP _ 9 compound:prt _ _
11 3 3 NUM NumType=Card 13 nummod _ _
12 terrorist terrorist ADJ Degree=Pos 13 amod _ _
13 cells cell NOUN Number=Plur 9 dobj _ _
14 operating operate VERB VerbForm=Ger 13 acl _ _
15 in in ADP _ 16 case _ _
16 Baghdad Baghdad PROPN Number=Sing 14 nmod _ _
17 . . PUNCT _ 1 punct _ _

What's wrong with my columns? Thanks in advance.

Part-of-Speech Tagging example

For one of our use cases I don't need syntax analysis or NER; I only need to get the root forms of all words in a document. Could you please provide an example for POS tagging?

only POS tagger

Hi,

the command-line tagging takes a lot of time and is not suitable for real-time systems. Please provide an API for individual taggers, e.g. POS only.

ArrayIndexOutOfBoundException in NER Training

Hi,
We are facing a problem with NLP training in "ner" mode. The command we used is the following:

$ java -Xmx1g -XX:+UseConcMarkSweepGC edu.emory.mathcs.nlp.bin.NLPTrain -mode ner -c config-train-sample.xml -t train.tsv -d sample-dev.tsv -m sample-dep.xz

The train.tsv we used contains the following:
1 Peruvanthanam peruvanthanam NNP pos2=NN 0 root _ U-GPE.

The exception we are getting is:
java.lang.ArrayIndexOutOfBoundsException: 0
at edu.emory.mathcs.nlp.learning.optimization.OnlineOptimizer.getPredictedLabelHingeLoss(OnlineOptimizer.java:239)
at edu.emory.mathcs.nlp.learning.optimization.method.AdaGradMiniBatch.getPredictedLabel(AdaGradMiniBatch.java:50)
at edu.emory.mathcs.nlp.learning.optimization.OnlineOptimizer.train(OnlineOptimizer.java:176)
at edu.emory.mathcs.nlp.learning.optimization.OnlineOptimizer.train(OnlineOptimizer.java:167)
at edu.emory.mathcs.nlp.component.template.OnlineComponent.process(OnlineComponent.java:201)
at edu.emory.mathcs.nlp.component.template.OnlineComponent.process(OnlineComponent.java:173)
at edu.emory.mathcs.nlp.component.template.train.OnlineTrainer.iterate(OnlineTrainer.java:299)
at edu.emory.mathcs.nlp.component.template.train.OnlineTrainer.train(OnlineTrainer.java:229)
at edu.emory.mathcs.nlp.component.template.train.OnlineTrainer.train(OnlineTrainer.java:200)
at edu.emory.mathcs.nlp.component.template.train.OnlineTrainer.train(OnlineTrainer.java:187)
at edu.emory.mathcs.nlp.bin.NLPTrain.train(NLPTrain.java:77)
at edu.emory.mathcs.nlp.bin.NLPTrain.main(NLPTrain.java:117)

Please help.
Thanks in advance.

IllegalStateException: unread block data

On running the NLPDemo I get the following exception

java.lang.IllegalStateException: unread block data
at java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2431)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1383)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
at edu.emory.mathcs.nlp.decode.NLPDecoder.getComponent(NLPDecoder.java:271)
at edu.emory.mathcs.nlp.decode.NLPDecoder.init(NLPDecoder.java:87)
at edu.emory.mathcs.nlp.decode.NLPDecoder.<init>(NLPDecoder.java:72)

Typo in example in Data Format?

When looking at this file, I noticed that the lemma column in the example doesn't match the word form ("founder" is lemmatized into "owner" and "EmoryNLP" into "emory").

Is this intentional?

Also, I noticed that "'s" in "He's" is lemmatized into "be" with the POS tag "VBZ". I can't reproduce this using the EnglishMorphAnalyzer in nlp4j-morphology (following this example).

Is the example manually written? Or is there some other process that preprocesses the words?

I'm asking this because I believe it's good if the example matches exactly the system behavior.

tsv option not working

This command produces an empty sample-trn.tsv.nlp:

java -cp lib/*:src/main/resources/:. edu.emory.mathcs.nlp.bin.NLPDecode -c src/main/resources/edu/emory/mathcs/nlp/configuration/config-decode-pos.xml -i sample-trn.tsv -format tsv

The other options, raw and line, work; only the tsv option doesn't. Any ideas?

ClearNLP semantic Role Labelling + Entity Co referencing

Hello,
I have used ClearNLP for semantic role labelling and entity coreference.

(1) Entity coreference resolves pronouns against proper nouns.
I am not clear on how to interpret the coreference output; basically I need the entity names and the set of sentences (or sentence numbers) associated with each. What are the clusters here?

[email protected] COMMON UNKNOWN SINGULAR
Clusters: [-1, -1, -1, 1, -1, -1, -1, -1, 7, 1, 9, -1, 10, -1, 9, -1
Confidence: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,

===== Clusters =====
0: morning (Singleton)
3: He -> 1: Stewart

(2) How do I interpret the dependency parser output? Is there any documentation on the labels used? Sometimes we see output with 7 slots and sometimes with 9 slots; why? How does TVRSReader recognize this?

(3) Adding semantic role annotation adds an extra slot; how do I link the semantic role outputs?

Thanks

Demo NLPDemo not working

Hi,
awesome project, but I think a recent commit broke something: NLPDemo produces an error when I try to run it:

Exception in thread "main" java.lang.NullPointerException
at edu.emory.mathcs.nlp.common.util.DSUtils.isRange(DSUtils.java:182)
at edu.emory.mathcs.nlp.component.template.node.AbstractNLPNode.getLeftMostDependent(AbstractNLPNode.java:545)
at edu.emory.mathcs.nlp.component.template.node.AbstractNLPNode.getLeftMostDependent(AbstractNLPNode.java:534)
at edu.emory.mathcs.nlp.component.template.node.AbstractNLPNode.getLeftValency(AbstractNLPNode.java:972)
at edu.emory.mathcs.nlp.component.template.node.AbstractNLPNode.getValency(AbstractNLPNode.java:959)
at edu.emory.mathcs.nlp.component.template.feature.FeatureTemplate.getFeature(FeatureTemplate.java:301)
at edu.emory.mathcs.nlp.component.template.feature.FeatureTemplate.getFeature(FeatureTemplate.java:288)
at edu.emory.mathcs.nlp.component.template.feature.FeatureTemplate.getFeature(FeatureTemplate.java:276)
at edu.emory.mathcs.nlp.component.template.feature.FeatureTemplate.createSparseVector(FeatureTemplate.java:239)
at edu.emory.mathcs.nlp.component.template.feature.FeatureTemplate.createFeatureVector(FeatureTemplate.java:221)
at edu.emory.mathcs.nlp.component.template.OnlineComponent.process(OnlineComponent.java:194)
at edu.emory.mathcs.nlp.component.template.OnlineComponent.process(OnlineComponent.java:172)
at edu.emory.mathcs.nlp.decode.AbstractNLPDecoder.decode(AbstractNLPDecoder.java:286)
at edu.emory.mathcs.nlp.decode.AbstractNLPDecoder.decode(AbstractNLPDecoder.java:280)
at edu.emory.mathcs.nlp.bin.NLPDemo.main(NLPDemo.java:38)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)

How to speed up the dependency parser by reworking one method

Changing this method in ColumnMajorVector got the time on 10 MB of English Wikipedia from:

Decode: 0:04:08.748

to

Decode: 0:03:17.926

    @Override
    public void addScores(SparseVector x, float[] scores)
    {
        List<SparseItem> itemVector = x.getVector();
        int featureSize = getFeatureSize();

        // For each in-range active feature, accumulate its weight column
        // into the per-label score vector in a single pass.
        itemVector.stream().filter(p -> p.getIndex() < featureSize)
                .forEach(p -> {
                    int index = p.getIndex() * label_size;
                    for (int i = 0; i < scores.length; i++) {
                        scores[i] += get(index++) * p.getValue();
                    }
                });
    }

Exception reading dependency model

Have you ever seen this?

java.lang.IllegalStateException: unread block data
    at java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2431)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1383)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
    at edu.emory.mathcs.nlp.decode.NLPUtils.getComponent(NLPUtils.java:60)
    at edu.emory.mathcs.nlp.decode.AbstractNLPDecoder.init(AbstractNLPDecoder.java:102)
    at edu.emory.mathcs.nlp.decode.AbstractNLPDecoder.<init>(AbstractNLPDecoder.java:65)
    at edu.emory.mathcs.nlp.decode.NLPDecoder.<init>(NLPDecoder.java:36)

Nlp Training for NER

Hi

I am trying to use NLPTrain in ner mode. I have been using the attached file but get the error below.

Command
./bin/nlptrain -c config-train-ner.xml -mode ner -t sample-trn.tsv -d sample-dev.tsv -m sample-dep.xz

Error
java.lang.IllegalArgumentException: No enum constant edu.emory.mathcs.nlp.component.template.util.BILOU.2
at java.lang.Enum.valueOf(Enum.java:238)

Any ideas?

It does generate an output; see sample-dep.xz attached. Is there any way of previewing it to verify the content?

Also, when I manage to generate the output, which file does it replace in my configuration file? Does it replace this one?

<named_entity_gazetteers field="word_form_simplified">edu/emory/mathcs/nlp/lexica/en-named-entity-gazetteers-simplified.xz</named_entity_gazetteers>

Tom

Attachments

sample-dep.xz.zip
config-train-ner.xml.zip

Static NLPDecoder.getComponent(InputStream)

The method NLPDecoder.getComponent(InputStream) could be static. It feels wrong to have to create an NLPDecoder instance without any configuration just in order to load a serialized component:

    NLPDecoder decoder = new NLPDecoder();
    OnlineComponent<POSState> component =
            (OnlineComponent<POSState>) decoder.getComponent(aStream);
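The underlying operation is plain Java deserialization, so a static helper along these lines would work. This is a generic sketch; the real getComponent additionally wraps the stream in XZ decompression, which is omitted here:

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.ObjectInputStream;

public class ComponentIO {
    // Static helper: deserialize an object from a stream without
    // requiring a configured NLPDecoder instance first.
    @SuppressWarnings("unchecked")
    public static <T> T readComponent(InputStream in)
            throws IOException, ClassNotFoundException {
        try (ObjectInputStream oin = new ObjectInputStream(in)) {
            return (T) oin.readObject();
        }
    }
}
```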
