
mate-tools's Issues

Parser kills first token of each sentence

We built an implementation of mate-tools version 3.5 that feeds input in the 
CollReader format (one token per line, with a line break between sentences). 
The parser, however, shows strange behaviour: it deletes the first token of 
each sentence and starts with the second. This is a relatively new issue and 
might have something to do with the input format or encoding. The lemmatizer 
and POS tagger, however, work fine. All the data is encoded in UTF-8.

Example output (Der Buchstabe A hat eine durchschnittliche Häufigkeit von 
6.51%.):
 -------- TOKEN FORMS @AFTER PARSE
2       Buchstabe       _       buchstabe       _       NN      _       case=nom|number=sg|gender=masc  -1      3       _       SB      _       _
3       A       _       --      _       NE      _       case=nom|number=sg|gender=*     -1      1       _       NK      _       _
4       hat     _       haben   _       VAFIN   _       number=sg|person=3|tense=pres|mood=ind  -1      0       _       --      _       _
5       in      _       in      _       APPR    _       _       -1      3       _       MO      _       _
6       deutschen       _       deutsch _       ADJA    _       case=dat|number=pl|gender=fem|degree=pos        -1      6       _       NK      _       _
7       Texten  _       text    _       NN      _       case=dat|number=pl|gender=fem   -1      4       _       NK      _       _
8       eine    _       ein     _       ART     _       case=acc|number=sg|gender=fem   -1      9       _       NK      _       _
9       durchschnittliche       _       durchschnittlich        _       ADJA    _       case=acc|number=sg|gender=fem|degree=pos        -1      9       _       NK      _       _
10      Häufigkeit      _       häufigkeit      _       NN      _       case=acc|number=sg|gender=fem   -1      3       _       OA      _       _
11      von     _       von     _       APPR    _       _       -1      9       _       MNR     _       _
12      6,51    _       6,51    _       CARD    _       _       -1      12      _       NK      _       _
13      %       _       %       _       NN      _       case=*|number=*|gender=neut     -1      10      _       NK      _       _
14      .       _       --      _       $.      _       _       -1      12      _       --      _       _
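
A quick way to confirm the symptom on a larger file is to check that every sentence starts with token ID 1. The following is a minimal sketch, assuming the parser output is tab-separated and contains only token rows and blank separator lines (header lines such as the "TOKEN FORMS" banner would have to be removed first). The function name is illustrative and not part of mate-tools.

```python
from typing import Iterable, List


def sentences_missing_first_token(lines: Iterable[str]) -> List[int]:
    """Return 0-based indices of sentences whose first token ID is not 1."""
    bad: List[int] = []
    sentence_index = 0
    at_sentence_start = True
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence (if one was open).
            if not at_sentence_start:
                sentence_index += 1
            at_sentence_start = True
            continue
        if at_sentence_start:
            # The ID is the first tab-separated column of the first token row.
            first_id = line.split("\t", 1)[0].strip()
            if first_id != "1":
                bad.append(sentence_index)
            at_sentence_start = False
    return bad
```

Passing an open file object (for example the file written via -out) returns the list of affected sentences; an empty list means every sentence begins with ID 1.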



Thanks

Original issue reported on code.google.com by [email protected] on 31 Oct 2013 at 11:15

debug and output of morphtagger go on the same stream if -out is set to /dev/stdout

What steps will reproduce the problem?

1. Input (see attached file):
1   Michael Michael Michael NE  NE  _   _   _   _   _   _   _   _   _
2   war sein    sein    VAFIN   VAFIN   _   _   _   _   _   _   _   _   _
3   ein eine    eine    ART ART _   _   _   _   _   _   _   _   _
4   guter   gut gut ADJA    ADJA    _   _   _   _   _   _   _   _   _
5   Junge   Junge   Junge   NN  NN  _   _   _   _   _   _   _   _   _
6   .   .   .   $.  $.  _   _   _   _   _   _   _   _   _


2. Command for testing:
java -Xmx2G -cp anna-3.3.jar is2.mtag.Tagger -model tiger-complete.anna-3-1.morphtagger.model -test /dev/stdin -out /dev/stdout

3. Current output (see attached file):
45.20.675  is2.data.ParametersFloat 121:read ->        read parameters 134217727 not zero 4044229
45.20.677  is2.data.Cluster 113:<init> ->              Read cluster with 0 words 
45.20.678  is2.mtag.Tagger 148:readModel ->            Loading data finished. 
45.20.679  is2.mtag.Tagger 150:readModel ->            number of parameter 134217727
45.20.679  is2.mtag.Tagger 151:readModel ->            number of classes   268
Processing Sentence: 
1   Michael Michael Michael NE  NE  _   case=nom|number=sg|gender=masc  -1  -1  _   _   _   _
2   war sein    sein    VAFIN   VAFIN   _   number=sg|person=3|tense=past|mood=ind  -1  -1  _   _   _   _
3   ein eine    eine    ART ART _   case=nom|number=sg|gender=masc  -1  -1  _   _   _   _
4   guter   gut gut ADJA    ADJA    _   case=nom|number=sg|gender=masc|degree=pos   -1  -1  _   _   _   _
5   Junge   Junge   Junge   NN  NN  _   case=nom|number=sg|gender=masc  -1  -1  _   _   _   _
6   .   .   .   $.  $.  _   _   -1  -1  _   _   _   _

2 0.0095 seconds/sentnece 
Used time 0.019 seconds 

What is the expected output? What do you see instead?

In the latest change to DB.java, the debug variable was switched on by default. 
But since all debug info is printed to the same stream as the processed output, 
that output can no longer be fed into the parser via a pipe. Would it be 
possible to switch debugging off or, even better, to print the debug info to 
System.err, so that it can be separated from the rest when -out is set to 
/dev/stdout? It is of course possible to fiddle with file descriptors, but 
sending debug output to System.err would probably be nicer.
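
Until the debug output goes to System.err, one possible workaround is to post-filter the mixed stream before piping it into the parser. The following is a minimal sketch, assuming the real tagger output is tab-separated and that token rows have at least ten columns with a numeric ID in the first one. The function name and the column threshold are illustrative assumptions, not part of mate-tools.

```python
from typing import Iterable, List, Tuple


def separate_streams(
    lines: Iterable[str], min_columns: int = 10
) -> Tuple[List[str], List[str]]:
    """Split mixed tagger output into (token rows, debug lines)."""
    conll: List[str] = []
    debug: List[str] = []
    for raw in lines:
        line = raw.rstrip("\n")
        cols = line.split("\t")
        if len(cols) >= min_columns and cols[0].isdigit():
            # Looks like a token row: numeric ID plus enough columns.
            conll.append(line)
        elif not line.strip():
            # Keep a single blank line as the sentence boundary.
            if conll and conll[-1] != "":
                conll.append("")
        else:
            # Everything else is treated as debug or status output.
            debug.append(line)
    return conll, debug
```

The first list can then be written to standard output and the second to standard error, which gives the separation requested above without changing the tagger itself.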

What version of the product are you using? On what operating system?
anna-3.3.jar
tiger-complete.anna-3-1.morphtagger.model

OS: Linux 3.4.47-2.38-desktop x86_64

Please provide any additional information below.


Original issue reported on code.google.com by [email protected] on 4 Sep 2013 at 7:40


mixing up LEMMA and PLEMMA columns

What steps will reproduce the problem?
1. take a sentence, e.g. "Quick brown fox jumps over the lazy dog ." 
2. split, lemmatize, tag and parse it
3. have a look at the intermediate results

Here's what I get:

1   Quick   _   _   _   _   _   _   _   _   _   _   _   _   _
2   brown   _   _   _   _   _   _   _   _   _   _   _   _   _
3   fox _   _   _   _   _   _   _   _   _   _   _   _   _
4   jumps   _   _   _   _   _   _   _   _   _   _   _   _   _
5   over    _   _   _   _   _   _   _   _   _   _   _   _   _
6   the _   _   _   _   _   _   _   _   _   _   _   _   _
7   lazy    _   _   _   _   _   _   _   _   _   _   _   _   _
8   dog _   _   _   _   _   _   _   _   _   _   _   _   _
9   .   _   _   _   _   _   _   _   _   _   _   _   _   _

1   Quick   _   quick   _   _   _   _   -1  _   _   _   _   _
2   brown   _   brown   _   _   _   _   -1  _   _   _   _   _
3   fox _   fox _   _   _   _   -1  _   _   _   _   _
4   jumps   _   jump    _   _   _   _   -1  _   _   _   _   _
5   over    _   over    _   _   _   _   -1  _   _   _   _   _
6   the _   the _   _   _   _   -1  _   _   _   _   _
7   lazy    _   lazy    _   _   _   _   -1  _   _   _   _   _
8   dog _   dog _   _   _   _   -1  _   _   _   _   _
9   .   _   .   _   _   _   _   -1  _   _   _   _   _

1   Quick   quick   _   _   JJ  _   _   -1  _   _   _   _   _
2   brown   brown   _   _   JJ  _   _   -1  _   _   _   _   _
3   fox fox _   _   NN  _   _   -1  _   _   _   _   _
4   jumps   jump    _   _   VBZ _   _   -1  _   _   _   _   _
5   over    over    _   _   IN  _   _   -1  _   _   _   _   _
6   the the _   _   DT  _   _   -1  _   _   _   _   _
7   lazy    lazy    _   _   JJ  _   _   -1  _   _   _   _   _
8   dog dog _   _   NN  _   _   -1  _   _   _   _   _
9   .   .   _   _   .   _   _   -1  _   _   _   _   _

1   Quick   _   quick   _   JJ  _   _   3   3   NMOD    NMOD    _   _
2   brown   _   brown   _   JJ  _   _   3   3   NMOD    NMOD    _   _
3   fox _   fox _   NN  _   _   4   4   SBJ SBJ _   _
4   jumps   _   jump    _   VBZ _   _   0   0   ROOT    ROOT    _   _
5   over    _   over    _   IN  _   _   4   4   ADV ADV _   _
6   the _   the _   DT  _   _   8   8   NMOD    NMOD    _   _
7   lazy    _   lazy    _   JJ  _   _   8   8   NMOD    NMOD    _   _
8   dog _   dog _   NN  _   _   5   5   PMOD    PMOD    _   _
9   .   _   .   _   .   _   _   4   4   P   P   _   _

Note that the value the lemmatizer wrote to the PLEMMA column ended up in the 
LEMMA column after tagging. I believe this is not supposed to happen. 
The morphological tagger and the dependency parser also swap the predicted and 
gold-standard lemma, so if one skips the morphological tagging step, the two 
swaps cancel out and the end result is fine; otherwise the role labeler reads 
the lemma value from the third column and we end up with "_" in place of the 
lemma.
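
As a stopgap between pipeline steps, the two lemma columns can be swapped back by a small filter. This is a minimal sketch, assuming tab-separated CoNLL-2009 rows in which LEMMA is the third column and PLEMMA the fourth. The helper name is hypothetical and the snippet does not touch mate-tools itself.

```python
LEMMA = 2   # 0-based index of the LEMMA column in CoNLL-2009
PLEMMA = 3  # 0-based index of the PLEMMA column in CoNLL-2009


def swap_lemma_columns(line: str) -> str:
    """Swap LEMMA and PLEMMA on a token row; return other lines unchanged."""
    stripped = line.rstrip("\n")
    cols = stripped.split("\t")
    # Only touch rows that start with a numeric ID and have both columns.
    if len(cols) <= PLEMMA or not cols[0].isdigit():
        return stripped
    cols[LEMMA], cols[PLEMMA] = cols[PLEMMA], cols[LEMMA]
    return "\t".join(cols)
```

Applying the function to every line of an intermediate file restores the layout that the next tool expects, so that the role labeler finds the lemma where it looks for it.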

Original issue reported on code.google.com by [email protected] on 18 Jul 2011 at 3:14

Exception when trying to run several instances of the mate parser (is2.parser) in the same JVM

What steps will reproduce the problem?
1. Run two instances of the mate parser in the same JVM

What do you see instead?

Running two instances of the mate parser in the same JVM leads to the following 
exception:

java.lang.ArrayIndexOutOfBoundsException: 45
2013-11-14 14:37:49 STDIO [ERROR] at is2.parser.ParallelDecoder.call(ParallelDecoder.java:74)
2013-11-14 14:37:49 STDIO [ERROR] at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
2013-11-14 14:37:49 STDIO [ERROR] at java.util.concurrent.FutureTask.run(FutureTask.java:138)
2013-11-14 14:37:49 STDIO [ERROR] at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
2013-11-14 14:37:49 STDIO [ERROR] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
2013-11-14 14:37:49 STDIO [ERROR] at java.lang.Thread.run(Thread.java:662)



What version of the product are you using? On what operating system?

mate-tools 3.5 checked out from http://mate-tools.googlecode.com/svn/trunk/

Please provide any additional information below.


I noticed this problem while trying to run mate-tools within a Storm 
topology (see http://storm-project.net/). In Storm, you can parallelize a 
unit of operation called a bolt; in my case the mate parser was the bolt that 
I wanted to parallelize. The Storm manager then deployed two instances of the 
parser in the same JVM, and this led to the exception described above.

Regards,
Abou Drame

Original issue reported on code.google.com by [email protected] on 14 Nov 2013 at 2:40

One sentence per line

Hi all,

Does anyone know where or how to get a one-sentence-per-line corpus?
I need a dependency parser, so I want to use this tool, but the input is a 
one-sentence-per-line corpus.

Please help me.

Thanks.
Kopro
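
If the question is how to feed a one-sentence-per-line file to the tools, one option is to convert each line to the one-token-per-line layout shown in the other reports on this page. The following is a minimal sketch, assuming the sentences are already tokenized with spaces and that the tools accept 15 tab-separated columns with "_" as the empty value. The function name is illustrative.

```python
def sentence_to_conll09(sentence: str, n_columns: int = 15) -> str:
    """Turn one whitespace-tokenized sentence into blank CoNLL-2009 rows."""
    tokens = sentence.split()
    if not tokens:
        return ""
    rows = []
    for token_id, form in enumerate(tokens, start=1):
        # ID and FORM are filled in; all remaining columns are left as "_".
        rows.append("\t".join([str(token_id), form] + ["_"] * (n_columns - 2)))
    # A blank line terminates the sentence.
    return "\n".join(rows) + "\n\n"
```

Writing the result of this function for every input line to a file gives one block per sentence, which can then be passed to the lemmatizer with -test.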

Original issue reported on code.google.com by [email protected] on 3 Oct 2014 at 7:00

Potential conflict between German and English parser resources when run simultaneously

What steps will reproduce the problem?
1. Load a German pipeline with the following resources:
prs-ger-cs_1.model
tagger-ct.model
lemmatizer.model

2. Load a SEPARATE(!) English pipeline with the following resources:
prs-eng.model
tag-eng.model
lemma-eng.model

3. Now parse a German sentence with the parser from 1. and inspect the output 
(everything is fine):
1 - Karin - Karin - SB - 2 - fliegt - NE - 
2 - fliegt - fliegen - ROOT - 0 - ROOTnode - VVFIN - 
3 - nach - nach - MO - 2 - fliegt - APPR - 
4 - New - New - PNC - 5 - York - NE - 
5 - York - York - NK - 3 - nach - NE - 
6 - . - _ - PUNC - 2 - fliegt - $. - 

4. Now parse an English sentence with the parser from 2. and inspect the output 
(everything is fine):
1 - This - this - SBJ - 2 - is - DT - 
2 - is - be - ROOT - 0 - ROOTnode - VBZ - 
3 - nice - nice - PRD - 2 - is - JJ - 
4 - and - and - COORD - 3 - nice - CC - 
5 - pretty - pretty - CONJ - 4 - and - RB - 
6 - . - . - P - 2 - is - . - 


5. NEW: Again use the parser from 1. and parse the German sentence (OUTPUT 
CONTAINS ERRORS NOW AND LOOKS STRANGE!!!):
1 - Karin - Ka - ROOT - 0 - ROOTnode - NNP - 
2 - fliegt - flieg - P - 1 - Karin - POS - 
3 - nach - nach - MNR - 1 - Karin - NNP - 
4 - New - New - APPO - 1 - Karin - NNP - 
5 - York - York - APPO - 1 - Karin - NNP - 
6 - . - . - P - 1 - Karin - POS - 

Any suggestions? Any help is appreciated.

Original issue reported on code.google.com by nikoschenk on 6 Mar 2012 at 1:54

Wrong path in the build.xml file for mate-tools

What steps will reproduce the problem?
1. svn checkout of the mate-tools package
2. try to build with ANT
3.

Please provide any additional information below.
In the build.xml file the classpath points to ./libs, while the library 
directory in the project tree is named ./lib.

Original issue reported on code.google.com by [email protected] on 18 Apr 2013 at 1:43

parse_full.sh points to wrong .jar files in classpath

What steps will reproduce the problem?
1. Execute sh scripts/parse_full.sh


What is the expected output? What do you see instead?

Output gives

Exception in thread "main" java.lang.NoClassDefFoundError: is2/util/OptionsSuper
    at se.lth.cs.srl.languages.Language.getLemmatizer(Language.java:99)
    at se.lth.cs.srl.languages.Language.getPreprocessor(Language.java:72)
    at se.lth.cs.srl.CompletePipeline.getCompletePipeline(CompletePipeline.java:37)
    at se.lth.cs.srl.CompletePipeline.main(CompletePipeline.java:93)


What version of the product are you using? On what operating system?
SRL pipeline with all required models.
srl-20130917

Please provide any additional information below.

Changing the classpath variable from

CP="srl.jar:lib/anna.jar:lib/liblinear-1.51-with-deps.jar:lib/opennlp-tools-1.4.3.jar:lib/maxent-2.5.2.jar:lib/trove.jar:lib/seg.jar"

to

CP="srl.jar:lib/anna-3.3.jar:lib/liblinear-1.51-with-deps.jar:lib/opennlp-tools-1.4.3.jar:lib/maxent-2.5.2.jar:lib/trove.jar:lib/seg.jar"

with the required models (cf. next command)

java -cp srl.jar:lib/anna-3.3.jar:lib/liblinear-1.51-with-deps.jar:lib/opennlp-tools-1.4.3.jar:lib/maxent-2.5.2.jar:lib/trove.jar:lib/seg.jar -Xmx3g se.lth.cs.srl.CompletePipeline eng -tagger models/CoNLL2009-ST-English-ALL.anna-3.3.postagger.model -parser models/CoNLL2009-ST-English-ALL.anna-3.3.parser.model -srl models/CoNLL2009-ST-English-ALL.anna-3.3.srl-4.1.srl.model -lemma models/CoNLL2009-ST-English-ALL.anna-3.3.lemmatizer.model -test input.txt -out output.txt

solves the problem.
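
A NoClassDefFoundError of this kind can be caught before starting the JVM by checking that every classpath entry exists on disk. This is a minimal sketch, assuming a ":"-separated classpath string as used in the script and paths relative to the working directory. The function name is illustrative.

```python
import os
from typing import List


def missing_classpath_entries(
    classpath: str, base_dir: str = ".", sep: str = ":"
) -> List[str]:
    """Return the classpath entries that do not exist under base_dir."""
    missing = []
    for entry in classpath.split(sep):
        # Skip empty entries produced by a leading, trailing or doubled separator.
        if entry and not os.path.exists(os.path.join(base_dir, entry)):
            missing.append(entry)
    return missing
```

Called with the original CP value from the directory that holds srl.jar, the function would list lib/anna.jar if only lib/anna-3.3.jar is present.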

Original issue reported on code.google.com by nikoschenk on 8 Oct 2014 at 9:53

Private methods disallow building a MATE wrapper

When building a wrapper for the MATE pipeline that processes several documents 
without reloading each model for every document, one needs to call the out() 
methods of the respective tools (after initialising each tool once with a 
model).

However, the following methods are not public and thus prevent direct use of 
is2.parser.Parser:

- is2.parser.Parser.out()
- is2.parser.Pipe.nextInstance() 

Also, using the morph tagger at is2.mtag.Tagger is not possible because these 
fields are not public:

- is2.mtag.Tagger.pipe
- is2.mtag.Tagger.params 

I do not believe this is intended as the respective fields/methods are public 
in the other processor classes, e.g. the lemmatizer or POS tagger.

The problem can easily be solved by setting these fields/methods to public.

For more details and links to the respective source code, also check 
http://korap.ids-mannheim.de/2013/07/issues-with-mate-pipeline/ 

Original issue reported on code.google.com by [email protected] on 16 Jul 2013 at 12:56

SRL

Can the SRL component run on Windows, via the API or the command line?

Original issue reported on code.google.com by [email protected] on 27 Oct 2013 at 4:29

Problem with CoNLL2009-ST-English-ALL.anna-3.3.srl-4.1.srl.model

What steps will reproduce the problem?
1. Trying SRL 4.3
2. Exception in thread "main" java.lang.NullPointerException: entry
    at java.util.zip.ZipFile.getInputStream(ZipFile.java:342)
    at se.lth.cs.srl.pipeline.Reranker.<init>(Reranker.java:79)

What is the expected output? What do you see instead?

I tried the SRL model given above, but it seems that the file named "global" 
is missing from the zipped model.
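
Whether the entry is really absent can be checked without running the SRL system, since the model is a zip archive. The following is a minimal sketch, assuming the reranker looks for an entry named exactly "global" (taken from the report above, not verified against the source code). The function name is illustrative.

```python
import zipfile
from typing import IO, Union


def model_has_entry(model: Union[str, IO[bytes]], entry_name: str = "global") -> bool:
    """Report whether the zipped model contains an entry with the given name."""
    with zipfile.ZipFile(model) as archive:
        return entry_name in archive.namelist()
```

Calling the function with the path to the downloaded .model file returns False if the entry is missing, which would be consistent with the NullPointerException in ZipFile.getInputStream.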

What version of the product are you using? On what operating system?
I'm trying SRL 4.3 on Debian 7.0

Please provide any additional information below.

Original issue reported on code.google.com by [email protected] on 11 Dec 2013 at 1:59

Wrong example in usage description

It says:

 java -cp srl.jar:lib/liblinear-1.51-with-deps.jarse.lth.cs.srl.Parse [...]

I believe that it should be 

 java -cp srl.jar:lib/liblinear-1.51-with-deps.jar se.lth.cs.srl.Parse [...] (a white space after jar)

Original issue reported on code.google.com by [email protected] on 9 Sep 2014 at 1:22

Wrong path for lib directory in build.xml in source code

What steps will reproduce the problem?

1. svn checkout
2. ant compile

What is the expected output? What do you see instead?

I expect it to work; it complains about missing gnu trove.
"ln -s lib libs" does the job

What version of the product are you using? On what operating system?

The checked out source code, revision 168.


Original issue reported on code.google.com by [email protected] on 10 Jul 2012 at 10:18
