
smartschat / cort


A toolkit for coreference resolution and error analysis.

License: MIT License

Python 30.16% JavaScript 34.08% CSS 0.30% Perl 11.88% Java 23.29% R 0.29%

cort's Issues

CoreNLP error

Hi! I get an error from CoreNLP when trying to predict coreference on raw text using "cort-predict-raw" with the standard parameters described in the manual. This is what I get:

2015-11-05 10:37:23,639 INFO Loading model.
2015-11-05 10:37:55,518 INFO Reading in and preprocessing data.
2015-11-05 10:37:55,519 INFO Starting java subprocess, and waiting for signal it's ready, with command: exec java -Xmx4g -XX:ParallelGCThreads=1 -cp '/Library/Python/2.7/site-packages/stanford_corenlp_pywrapper/lib/:/Users/yuliagrishina/Documents/Software/CoreNLP' corenlp.SocketServer --outpipe /tmp/corenlp_pywrap_pipe_pypid=2721_time=1446716275.52 --configfile /Library/Python/2.7/site-packages/cort/config_files/corenlp.ini
INFO:CoreNLP_JavaServer: Using CoreNLP configuration file: /Library/Python/2.7/site-packages/cort/config_files/corenlp.ini
Exception in thread "main" java.lang.NoClassDefFoundError: edu/stanford/nlp/pipeline/StanfordCoreNLP
at corenlp.JsonPipeline.initializeCorenlpPipeline(JsonPipeline.java:206)
at corenlp.SocketServer.main(SocketServer.java:102)
Caused by: java.lang.ClassNotFoundException: edu.stanford.nlp.pipeline.StanfordCoreNLP
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 2 more

Do you have any idea why this happens?
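A likely cause, inferred from the command line above rather than confirmed: the -cp entries are bare directories, so the CoreNLP jars inside them are never put on the classpath and edu.stanford.nlp.pipeline.StanfordCoreNLP cannot be found. The working setups elsewhere in these issues use wildcard entries instead, e.g.:

    -cp '/Library/Python/2.7/site-packages/stanford_corenlp_pywrapper/lib/*:/Users/yuliagrishina/Documents/Software/CoreNLP/*'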

Code not working directly

I was using the model-pair-train.obj model provided in your repository. On running the command provided for coreference resolution on raw data, I got the following error:
Traceback (most recent call last):
  File "/usr/local/bin/cort-predict-raw", line 132, in <module>
    testing_corpus = p.run_on_docs("corpus", args.input_filename)
  File "/usr/local/lib/python3.4/dist-packages/cort/preprocessing/pipeline.py", line 38, in run_on_docs
    codecs.open(doc, "r", "utf-8")
  File "/usr/local/lib/python3.4/dist-packages/cort/preprocessing/pipeline.py", line 82, in run_on_doc
    pdeprel=None
TypeError: __new__() missing 1 required positional argument: 'extra'
This got resolved when I added the line

extra = None

after line 82 in cort/preprocessing/pipeline.py.
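For context, a minimal sketch of the failure mode with a hypothetical stand-in for the token class constructed at pipeline.py line 82 (cort's real class has more fields, but the mechanics are the same):

from collections import namedtuple

# Hypothetical stand-in: the token class gained an "extra" field, so every
# constructor call must now supply it.
Token = namedtuple("Token", ["form", "pdeprel", "extra"])

try:
    Token(form="a", pdeprel=None)  # reproduces the reported TypeError
except TypeError as error:
    print(error)  # __new__() missing 1 required positional argument: 'extra'

Token(form="a", pdeprel=None, extra=None)  # the workaround described above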

Set IDs are lost during writing

I find this behaviour counter-intuitive: if you read a corpus (using Corpus.from_file) and write it out right away (using write_to_file), all set IDs are lost, i.e. the last column contains only minus signs.

I attached a test script and sample document.

test.zip
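For reference, a minimal version of the attached round trip (a sketch with placeholder file names, assuming write_to_file takes an open file object just as from_file does):

import codecs
from cort.core import corpora

with codecs.open("sample.conll", "r", "utf-8") as in_file:
    corpus = corpora.Corpus.from_file("test", in_file)

with codecs.open("roundtrip.conll", "w", "utf-8") as out_file:
    corpus.write_to_file(out_file)

# Expected: roundtrip.conll keeps the coreference column of sample.conll;
# observed: the column contains only "-" entries.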

exceptions while training with gold conll data

Hello,

I encountered a few problems while trying to train a model on the gold-standard version of the CoNLL-2012 training set (*_gold_conll).

The first issue occurs during the conversion of certain trees, when some nodes of the trees are deleted but accessed later:

 File "$HOME/.local/bin/cort-train", line 132, in <module>
    "r", "utf-8"))
  File "$HOME/.local/lib/python2.7/site-packages/cort/core/corpora.py", line 79, in from_file
    document_as_strings]))
  File "$HOME/.local/lib/python2.7/site-packages/cort/core/corpora.py", line 14, in from_string
    return documents.CoNLLDocument(string)
  File "$HOME/.local/lib/python2.7/site-packages/cort/core/documents.py", line 401, in __init__
    [parse.replace("NOPARSE", "S") for parse in parses]#, include_erased=True
  File "$HOME/.local/lib/python2.7/site-packages/StanfordDependencies/StanfordDependencies.py", line 116, in convert_trees
    for ptb_tree in ptb_trees)
  File "$HOME/.local/lib/python2.7/site-packages/StanfordDependencies/StanfordDependencies.py", line 116, in <genexpr>
    for ptb_tree in ptb_trees)
  File "$HOME/.local/lib/python2.7/site-packages/StanfordDependencies/JPypeBackend.py", line 141, in convert_tree
    sentence.renumber()
  File "$HOME/.local/lib/python2.7/site-packages/StanfordDependencies/CoNLL.py", line 111, in renumber
    for token in self]
KeyError: 18

This happens for several sentences in the training data set (e.g., document bn/cnn/04/cnn_0432, sentence on lines 272-296). One way to avoid the exception is to set include_erased=True.
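Concretely, the commented-out argument visible in the documents.py traceback suggests where this would go; a sketch of the change, based only on the lines shown above:

# In cort/core/documents.py, around line 401: keep erased tokens so that
# renumber() no longer looks up a deleted node. Whether the kept tokens are
# safe for downstream processing is untested.
dep_trees = sd.convert_trees(
    [parse.replace("NOPARSE", "S") for parse in parses],
    include_erased=True,
)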

The second issue is caused by one sentence in the training set (document mz/sinorama/10/ectb_1005, lines 980-1012):

Traceback (most recent call last):
  File "$HOME/.local/bin/cort-train", line 132, in <module>
    "r", "utf-8"))
  File "$HOME/.local/lib/python2.7/site-packages/cort/core/corpora.py", line 79, in from_file
    document_as_strings]))
  File "$HOME/.local/lib/python2.7/site-packages/cort/core/corpora.py", line 14, in from_string
    return documents.CoNLLDocument(string)
  File "$HOME/.local/lib/python2.7/site-packages/cort/core/documents.py", line 414, in __init__
    super(CoNLLDocument, self).__init__(identifier, sentences, coref)
  File "$HOME/.local/lib/python2.7/site-packages/cort/core/documents.py", line 97, in __init__
    self.annotated_mentions = self.__get_annotated_mentions()
  File "$HOME/.local/lib/python2.7/site-packages/cort/core/documents.py", line 111, in __get_annotated_mentions
    span, self, first_in_gold_entity=set_id not in seen
  File "$HOME/.local/lib/python2.7/site-packages/cort/core/mentions.py", line 174, in from_document
    mention_property_computer.compute_gender(attributes)
  File "$HOME/.local/lib/python2.7/site-packages/cort/core/mention_property_computer.py", line 89, in compute_gender
    if __wordnet_lookup_gender(" ".join(attributes["head"])):
TypeError: sequence item 0: expected string, ParentedTree found

The problems seem to be data-related, as none of them occur when using the *_auto_conll version of the CoNLL-2012 training data.

Collins head finder iterator should be a list instead?

I noticed some discrepancies between the two different traversal orders.
reversed() on line 108 returns an iterator; if it gets exhausted in the first iteration of the loop on line 111, the subsequent results are incorrect.

if traverse_reversed:
    to_traverse = reversed(tree)
else:
    to_traverse = tree

for val in values:
    for child in to_traverse:

I suggest changing

to_traverse = reversed(tree)

to

to_traverse = list(reversed(tree))
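A self-contained demonstration of the exhaustion problem (with made-up data):

tree = ["a", "b", "c"]
to_traverse = reversed(tree)
for val in (1, 2):
    print(val, [child for child in to_traverse])
# 1 ['c', 'b', 'a']
# 2 []   <- the iterator is already exhausted on the second pass

Wrapping the iterator in list() materializes it once, so every pass over to_traverse sees the full reversed sequence.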

Unable to use out of the box

I was trying to use cort straight out of the box to predict coreference chains on raw text, but was unable to get it running. Here's what I did:

  1. Created a virtualenv called cort and installed cort using pip. The GitHub repo was in a separate folder called cort_tool; the Stanford CoreNLP tools were in another folder called stanford-corenlp.
  2. Downloaded model-pair-train.obj and placed it in the cort_tool folder.
  3. Created an input.txt file with a single sentence.
  4. Ran the following commands after activating the venv:
$ cd cort_tool
$ cort-predict-raw -in ~/input.txt -model model-pair-train.obj -extractor cort.coreference.approaches.mention_ranking.extract_substructures -perceptron cort.coreference.approaches.mention_ranking.RankingPerceptron -clusterer cort.coreference.clusterer.all_ante -corenlp ~/stanford-corenlp -suffix out 2>&1 | tee ~/output.txt

I got the following output:

2016-10-03 17:17:55,338 INFO Loading model.
In file included from /home/cil/cort/lib/python3.4/site-packages/numpy/core/include/numpy/ndarraytypes.h:1777:0,
                 from /home/cil/cort/lib/python3.4/site-packages/numpy/core/include/numpy/ndarrayobject.h:18,
                 from /home/cil/cort/lib/python3.4/site-packages/numpy/core/include/numpy/arrayobject.h:4,
                 from /home/cil/.pyxbld/temp.linux-x86_64-3.4/pyrex/cort/coreference/perceptrons.c:274:
/home/cil/cort/lib/python3.4/site-packages/numpy/core/include/numpy/npy_1_7_deprecated_api.h:15:2: warning: #warning "Using deprecated NumPy API, disable it by " "#defining NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION" [-Wcpp]
 #warning "Using deprecated NumPy API, disable it by " \
  ^
In file included from /home/cil/cort/lib/python3.4/site-packages/numpy/core/include/numpy/ndarrayobject.h:27:0,
                 from /home/cil/cort/lib/python3.4/site-packages/numpy/core/include/numpy/arrayobject.h:4,
                 from /home/cil/.pyxbld/temp.linux-x86_64-3.4/pyrex/cort/coreference/perceptrons.c:274:
/home/cil/cort/lib/python3.4/site-packages/numpy/core/include/numpy/__multiarray_api.h:1448:1: warning: ‘_import_array’ defined but not used [-Wunused-function]
 _import_array(void)
 ^
2016-10-03 17:18:02,512 INFO Reading in and preprocessing data.
2016-10-03 17:18:02,513 INFO Starting java subprocess, and waiting for signal it's ready, with command: exec java -Xmx4g -XX:ParallelGCThreads=1 -cp '/home/cil/cort/lib/python3.4/site-packages/stanford_corenlp_pywrapper/lib/*:/home/cil/stanford-corenlp/*'      corenlp.SocketServer --outpipe /tmp/corenlp_pywrap_pipe_pypid=20030_time=1475486282.512968  --configfile /home/cil/cort/lib/python3.4/site-packages/cort/config_files/corenlp.ini
INFO:CoreNLP_JavaServer: Using CoreNLP configuration file: /home/cil/cort/lib/python3.4/site-packages/cort/config_files/corenlp.ini
Exception in thread "main" java.lang.UnsupportedClassVersionError: edu/stanford/nlp/pipeline/StanfordCoreNLP : Unsupported major.minor version 52.0
    at java.lang.ClassLoader.defineClass1(Native Method)
    at java.lang.ClassLoader.defineClass(ClassLoader.java:803)
    at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
    at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
    at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
    at corenlp.JsonPipeline.initializeCorenlpPipeline(JsonPipeline.java:206)
    at corenlp.SocketServer.main(SocketServer.java:102)

At this point, the memory usage of the cort-predict-raw task dropped to zero, so I did a keyboard interrupt and tried again, but got the same result.

I'm on Ubuntu 16.04.

Can you please help me out?
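For reference: class file version 52.0 corresponds to Java 8, so this UnsupportedClassVersionError means the CoreNLP jars (compiled for Java 8) are being loaded by an older runtime, most likely Java 7. Checking which runtime the shell picks up should confirm it:

$ java -version

Pointing the shell at a Java 8+ JRE (e.g. via JAVA_HOME, or update-alternatives on Ubuntu) should get past this error.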

Memory requirements

cort currently needs a lot of RAM: predicting with the latent ranking model on the CoNLL-2012 development data takes ~8 GB, mainly due to multiprocessing during feature extraction.

adjust_head_for_nam doesn't handle DURATION entities

Hi,
the adjust_head_for_nam function in cort.core.head_finders crashes whenever it encounters a named entity of type DURATION. This entity type is sometimes generated by the latest version of CoreNLP.
I guess it shouldn't be too difficult to add a pattern for it, but I don't know what would make sense.
/Christian

2016-04-09 05:04:42,285 INFO Preprocessing en/ep-00-06-15.xml.gz.
2016-04-09 05:20:08,227 INFO Extracting system mentions from en/ep-00-06-15.xml.gz.
2016-04-09 05:20:11,552 ERROR Discarding document en/ep-00-06-15.xml.gz
2016-04-09 05:20:11,619 ERROR Traceback (most recent call last):
File "/home/staff/ch/PycharmProjects/cort/extra/annot-wmt.py", line 197, in
doc.system_mentions = mention_extractor.extract_system_mentions(doc)
File "/home/staff/ch/PycharmProjects/cort/cort/core/mention_extractor.py", line 36, in extract_system_mentions
for span in __extract_system_mention_spans(document)]
File "/home/staff/ch/PycharmProjects/cort/cort/core/mention_extractor.py", line 36, in
for span in __extract_system_mention_spans(document)]
File "/home/staff/ch/PycharmProjects/cort/cort/core/mentions.py", line 153, in from_document
mention_property_computer.compute_head_information(attributes)
File "/home/staff/ch/PycharmProjects/cort/cort/core/mention_property_computer.py", line 248, in compute_head_information
attributes["ner"][head_index])
File "/home/staff/ch/PycharmProjects/cort/cort/core/head_finders.py", line 214, in adjust_head_for_nam
raise Exception("Unknown named entity annotation: " + ner_type)
Exception: Unknown named entity annotation: DURATION
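A minimal, self-contained sketch of the kind of guard being asked for; the entity types cort's real adjust_head_for_nam handles and the right fallback behaviour are assumptions here, not cort's actual code:

TEMPORAL_NER_TYPES = {"DATE", "TIME", "DURATION"}

def adjust_head_for_nam_sketch(head_tokens, ner_type):
    # Treat DURATION like the other temporal types and keep the span
    # unchanged as a conservative fallback.
    if ner_type in TEMPORAL_NER_TYPES:
        return head_tokens
    raise Exception("Unknown named entity annotation: " + ner_type)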

ParentedTree creeps into "head" attribute

When I read this document from CoNLL-2012 into cort, a TypeError is thrown. A ParentedTree ends up in the "head" attribute in mention_property_computer.py around line 241 (head = [head_tree[0]]). The value can be traced back to the head finder, but I stopped there because there are a lot of alternative rules.

>>> from cort.core.corpora import Corpus
>>> with open('output/debug.conll') as f:
...     Corpus.from_file('test', f)
...
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/home/minhle/.local/lib/python3.5/site-packages/cort-0.2.4.5-py3.5.egg/cort/core/corpora.py", line 79, in from_file
    documents.append(from_string("".join(current_document)))
  File "/home/minhle/.local/lib/python3.5/site-packages/cort-0.2.4.5-py3.5.egg/cort/core/corpora.py", line 14, in from_string
    return documents.CoNLLDocument(string)
  File "/home/minhle/.local/lib/python3.5/site-packages/cort-0.2.4.5-py3.5.egg/cort/core/documents.py", line 414, in __init__
    super(CoNLLDocument, self).__init__(identifier, sentences, coref)
  File "/home/minhle/.local/lib/python3.5/site-packages/cort-0.2.4.5-py3.5.egg/cort/core/documents.py", line 97, in __init__
    self.annotated_mentions = self.__get_annotated_mentions()
  File "/home/minhle/.local/lib/python3.5/site-packages/cort-0.2.4.5-py3.5.egg/cort/core/documents.py", line 111, in __get_annotated_mentions
    span, self, first_in_gold_entity=set_id not in seen
  File "/home/minhle/.local/lib/python3.5/site-packages/cort-0.2.4.5-py3.5.egg/cort/core/mentions.py", line 174, in from_document
    mention_property_computer.compute_gender(attributes)
  File "/home/minhle/.local/lib/python3.5/site-packages/cort-0.2.4.5-py3.5.egg/cort/core/mention_property_computer.py", line 91, in compute_gender
    if __wordnet_lookup_gender(" ".join(attributes["head"])):
TypeError: sequence item 0: expected str instance, ParentedTree found
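The failure is easy to reproduce in isolation with NLTK (the tree below is made up for illustration):

from nltk.tree import ParentedTree

head_tree = ParentedTree.fromstring("(NP (NN picture))")
head = [head_tree[0]]      # a ParentedTree object ends up in the "head" list
# " ".join(head) now raises: sequence item 0: expected str instance, ParentedTree found
head = head_tree.leaves()  # ['picture'] -- flattening to leaf strings avoids it
print(" ".join(head))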

struct.error and OverflowError

I am running cort-train when these errors happen. My setup is Ubuntu 16.04.3, 64 GB RAM, 4 CPUs.

Process ForkPoolWorker-9:
Traceback (most recent call last):
  File "/usr/lib/python3.5/multiprocessing/pool.py", line 125, in worker
    put((job, i, result))
  File "/usr/lib/python3.5/multiprocessing/queues.py", line 355, in put
    self._writer.send_bytes(obj)
  File "/usr/lib/python3.5/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/usr/lib/python3.5/multiprocessing/connection.py", line 393, in _send_bytes
    header = struct.pack("!i", n)
struct.error: 'i' format requires -2147483648 <= number <= 2147483647

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/usr/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.5/multiprocessing/pool.py", line 130, in worker
    put((job, i, (False, wrapped)))
  File "/usr/lib/python3.5/multiprocessing/queues.py", line 349, in put
    obj = ForkingPickler.dumps(obj)
  File "/usr/lib/python3.5/multiprocessing/reduction.py", line 50, in dumps
    cls(buf, protocol).dump(obj)
OverflowError: cannot serialize a string larger than 4GiB
(The identical struct.error and OverflowError tracebacks follow for ForkPoolWorker-10 and ForkPoolWorker-11.)
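Both errors point to the same bottleneck: the pickled feature-extraction results each worker sends back through the multiprocessing pipe exceed the ~2 GiB frame limit of Python 3.5's connection layer (the struct.pack("!i", n) in the traceback), and even pickling the resulting error wrapper then hits pickle's 4 GiB string limit. As an inference from the traceback rather than a confirmed cort fix, reducing how much data each worker returns (fewer documents per chunk, or fewer worker processes) may keep individual payloads under the limit.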

"KeyError: None" while training

I was trying to train a model on data derived from the CoNLL-2012 training set when I got this error.

Here are the details of the model:
('-extractor', 'cort.coreference.approaches.mention_ranking.extract_substructures', '-perceptron', 'cort.coreference.approaches.mention_ranking.RankingPerceptron', '-cost_function', 'cort.coreference.cost_functions.cost_based_on_consistency', '-cost_scaling', '100')

This is the error:

    2018-09-10 19:57:49,116 INFO Started epoch 1
    Traceback (most recent call last):
    File "output/cort/venv/bin/cort-train", line 4, in <module>
        __import__('pkg_resources').run_script('cort==0.2.4.5', 'cort-train')
    File "/Users/minh/EvEn/output/cort/venv/lib/python3.7/site-packages/pkg_resources/__init__.py", line 658, in run_script
        self.require(requires)[0].run_script(script_name, ns)
    File "/Users/minh/EvEn/output/cort/venv/lib/python3.7/site-packages/pkg_resources/__init__.py", line 1438, in run_script
        exec(code, namespace, namespace)
    File "/Users/minh/EvEn/output/cort/venv/lib/python3.7/site-packages/cort-0.2.4.5-py3.7.egg/EGG-INFO/scripts/cort-train", line 141, in <module>
        perceptron
    File "/Users/minh/EvEn/output/cort/venv/lib/python3.7/site-packages/cort-0.2.4.5-py3.7.egg/cort/coreference/experiments.py", line 43, in learn
        perceptron.fit(substructures, arc_information)
    File "output/cort/venv/lib/python3.7/site-packages/cort-0.2.4.5-py3.7.egg/cort/coreference/perceptrons.pyx", line 182, in cort.coreference.perceptrons.Perceptron.fit
        self.__update(cons_arcs,
    File "output/cort/venv/lib/python3.7/site-packages/cort-0.2.4.5-py3.7.egg/cort/coreference/perceptrons.pyx", line 331, in cort.coreference.perceptrons.Perceptron.__update
        arc_information[arc][0]
    KeyError: None

Could you please have a look?

UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 13: ordinal not in range(128)

I was trying to load a file composed of all gold sentences in the CoNLL-2012 dev set when this error occurred. Below is the full stack trace:

In [2]: reference = corpora.Corpus.from_file("reference", open("output/Thu-Jan-12-17-22-15-CET-2017.gold.txt"))
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-2-57d8e778731d> in <module>()
----> 1 reference = corpora.Corpus.from_file("reference", open("output/Thu-Jan-12-17-22-15-CET-2017.gold.txt"))

/Users/cumeo/anaconda/lib/python2.7/site-packages/cort/core/corpora.pyc in from_file(description, coref_file)
     77
     78         return Corpus(description, sorted([from_string(doc) for doc in
---> 79                                            document_as_strings]))
     80
     81

/Users/cumeo/anaconda/lib/python2.7/site-packages/cort/core/corpora.pyc in from_string(string)
     12
     13 def from_string(string):
---> 14     return documents.CoNLLDocument(string)
     15
     16

/Users/cumeo/anaconda/lib/python2.7/site-packages/cort/core/documents.pyc in __init__(self, document_as_string)
    399         sd = StanfordDependencies.get_instance()
    400         dep_trees = sd.convert_trees(
--> 401             [parse.replace("NOPARSE", "S") for parse in parses],
    402         )
    403         sentences = []

/Users/cumeo/.local/lib/python2.7/site-packages/StanfordDependencies/StanfordDependencies.pyc in convert_trees(self, ptb_trees, representation, universal, include_punct, include_erased, **kwargs)
    114                       include_erased=include_erased)
    115         return Corpus(self.convert_tree(ptb_tree, **kwargs)
--> 116                       for ptb_tree in ptb_trees)
    117
    118     @abstractmethod

/Users/cumeo/.local/lib/python2.7/site-packages/StanfordDependencies/StanfordDependencies.pyc in <genexpr>((ptb_tree,))
    114                       include_erased=include_erased)
    115         return Corpus(self.convert_tree(ptb_tree, **kwargs)
--> 116                       for ptb_tree in ptb_trees)
    117
    118     @abstractmethod

/Users/cumeo/.local/lib/python2.7/site-packages/StanfordDependencies/JPypeBackend.pyc in convert_tree(self, ptb_tree, representation, include_punct, include_erased, add_lemmas, universal)
     85         self._raise_on_bad_input(ptb_tree)
     86         self._raise_on_bad_representation(representation)
---> 87         tree = self.treeReader(ptb_tree)
     88         if tree is None:
     89             raise ValueError("Invalid Penn Treebank tree: %r" % ptb_tree)

UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 13: ordinal not in range(128)

The data looks like this:

Minhs-MacBook-Pro:EvEn cumeo$ head output/Thu-Jan-12-17-22-15-CET-2017.gold.txt
#begin document (bc/cctv/00/cctv_0000); part 000
bc/cctv/00/cctv_0000	0	0	In	IN	(TOP(S(PP*	-	-	-	Speaker#1	*	*	*	*-
bc/cctv/00/cctv_0000	0	1	the	DT	(NP(NP*	-	-	-	Speaker#1	(DATE*	*	*	*	-
bc/cctv/00/cctv_0000	0	2	summer	NN	*)	summer	-	1	Speaker#1	*	*	*	*	-
bc/cctv/00/cctv_0000	0	3	of	IN	(PP*	-	-	-	Speaker#1	*	*	*	*	-
bc/cctv/00/cctv_0000	0	4	2005	CD	(NP*))))	-	-	-	Speaker#1	*)	*	*	*-
bc/cctv/00/cctv_0000	0	5	,	,	*	-	-	-	Speaker#1	*	*	*	*	-
bc/cctv/00/cctv_0000	0	6	a	DT	(NP(NP*	-	-	-	Speaker#1	*	(ARG0*	*	*	-
bc/cctv/00/cctv_0000	0	7	picture	NN	*)	picture	-	8	Speaker#1	*	*)	*	*	-
bc/cctv/00/cctv_0000	0	8	that	WDT	(SBAR(WHNP*)	-	-	-	Speaker#1	*	(R-ARG0*)	**	-

Does anyone have any ideas on how to fix this?

Best regards,
Minh
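A likely fix, given that cort itself opens CoNLL files as UTF-8 elsewhere (see the codecs.open(doc, "r", "utf-8") call in the pipeline traceback above), is to decode the file explicitly rather than relying on Python 2's ASCII default:

import codecs
from cort.core import corpora

reference = corpora.Corpus.from_file(
    "reference",
    codecs.open("output/Thu-Jan-12-17-22-15-CET-2017.gold.txt", "r", "utf-8"))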

cort-predict-raw cannot run on all specific raw text

I'm trying to run cort-predict-raw OOTB using the following setup:

cort-predict-raw -in ~/data/test1/*.txt \
		-model models/model-pair-train.obj \
		-extractor cort.coreference.approaches.mention_ranking.extract_substructures \
		-perceptron cort.coreference.approaches.mention_ranking.RankingPerceptron \
		-clusterer cort.coreference.clusterer.all_ante \
		-corenlp ~/systems/stanford/stanford-corenlp-full-2016-10-31 \
		#-features my_features.txt \

For some reason it throws an exception for the string "SEC" (with the quotation marks) in:

Hello my name is "SEC".

If I replace SEC or remove the quotation marks, the file passes through.

The exception:

Traceback (most recent call last):
  File "/home/ubuntu/.local/bin/cort-predict-raw", line 136, in <module>
    doc.system_mentions = mention_extractor.extract_system_mentions(doc)
  File "/home/ubuntu/.local/lib/python3.5/site-packages/cort/core/mention_extractor.py", line 36, in extract_system_mentions
    for span in __extract_system_mention_spans(document)]
  File "/home/ubuntu/.local/lib/python3.5/site-packages/cort/core/mention_extractor.py", line 36, in <listcomp>
    for span in __extract_system_mention_spans(document)]
  File "/home/ubuntu/.local/lib/python3.5/site-packages/cort/core/mentions.py", line 153, in from_document
    mention_property_computer.compute_head_information(attributes)
  File "/home/ubuntu/.local/lib/python3.5/site-packages/cort/core/mention_property_computer.py", line 248, in compute_head_information
    attributes["ner"][head_index])
  File "/home/ubuntu/.local/lib/python3.5/site-packages/cort/core/head_finders.py", line 214, in adjust_head_for_nam
    raise Exception("Unknown named entity annotation: " + ner_type)
Exception: Unknown named entity annotation: DURATION
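This looks like the same unhandled DURATION entity type as in the adjust_head_for_nam issue above; presumably CoreNLP's NER tags the quoted "SEC" (as in seconds) as a DURATION.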

Display only files with errors

Is it possible to visualize only files that contain at least one error? I have a large corpus, and after filtering there are only a couple hundred errors, so I find myself looking at clean files most of the time (i.e. files with no annotations besides mention spans).

error visualization problems

Hello,
I encountered a problem when trying to visualize a system's recall errors by type, as described in the documentation. My reference and system files are in CoNLL format; no errors are reported when running the code, but the resulting HTML file doesn't display the document text or any fields in the left panel except for "Documents". When the jquery and jquery.jsPlumb imports in the HTML file are commented out, everything displays correctly (document text, left panel, and gold/system mention boundaries), but without the possibility to interact. Reproduced in the latest Firefox and Chrome; Python 2.7. The visualization of a document processed with cort-predict-raw seems to work fine.
Thanks!

error analysis visualization: unable to scroll

I visualised coreference errors (errors_by_type.visualize()), but it is not possible to scroll the left part of the visualisation (the right part, with the text, works well).
I am still using macOS and Safari; I know that setup hasn't been tested, I just thought you might be interested to know.

cort-predict-raw runs on python2 but not python3.5

I was trying to run cort-predict-raw with the following command:

python3.5 /usr/local/bin/cort-predict-raw -in ~/data/pilot_44_docs/*.txt \
    -model models/model-pair-train.obj \
    -extractor cort.coreference.approaches.mention_ranking.extract_substructures \
    -perceptron cort.coreference.approaches.mention_ranking.RankingPerceptron \
    -clusterer cort.coreference.clusterer.all_ante \
    -corenlp ~/systems/stanford/stanford-corenlp-full-2016-10-31

and got the following error message:

Traceback (most recent call last):
  File "/usr/local/bin/cort-predict-raw", line 136, in <module>
    doc.system_mentions = mention_extractor.extract_system_mentions(doc)
  File "/usr/local/lib/python3.5/dist-packages/cort/core/mention_extractor.py", line 36, in extract_system_mentions
    for span in __extract_system_mention_spans(document)]
  File "/usr/local/lib/python3.5/dist-packages/cort/core/mention_extractor.py", line 36, in <listcomp>
    for span in __extract_system_mention_spans(document)]
  File "/usr/local/lib/python3.5/dist-packages/cort/core/mentions.py", line 126, in from_document
    i, sentence_span = document.get_sentence_id_and_span(span)
TypeError: 'NoneType' object is not iterable
2017-04-27 09:17:06,058 WARNING Killing subprocess 14154
2017-04-27 09:17:06,395 INFO Subprocess seems to be stopped, exit code -9

It works without a problem with Python 2, though. I'm running this on Ubuntu 16.04.

Retraining models

Is it possible to retrain models (for example, the ones from https://github.com/smartschat/cort/blob/master/COREFERENCE.md#model-downloads) with new data?

I tried training using:

cort-train -in new_retraining_data.conll \
           -out pretrained_model.obj \
           -extractor cort.coreference.approaches.mention_ranking.extract_substructures \
           -perceptron cort.coreference.approaches.mention_ranking.RankingPerceptron \
           -cost_function cort.coreference.cost_functions.cost_based_on_consistency \
           -n_iter 5 \
           -cost_scaling 100 \
           -random_seed 23

but I think it overwrites the model.
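As far as the command-line interface suggests, -in is the training data and -out is simply the path where the newly trained model is written, so pointing -out at pretrained_model.obj replaces the downloaded model rather than updating it. There does not appear to be a warm-start option for continuing training from an existing model file; using a fresh -out path at least preserves the original.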
