
smartschat / cort

A toolkit for coreference resolution and error analysis.

License: MIT License

Languages: Python 30.16%, JavaScript 34.08%, CSS 0.30%, Perl 11.88%, Java 23.29%, R 0.29%

cort's Introduction

cort

cort is a coreference resolution toolkit. It consists of two parts: the coreference resolution component implements a framework for coreference resolution based on latent variables, which allows you to rapidly devise new approaches, while the error analysis component provides extensive functionality for analyzing and visualizing the errors made by coreference resolution systems.

If you have any questions or comments, drop me an e-mail at [email protected].

Branches/Forks

  • the kbest branch contains code for k-best extraction of coreference information, as described in Ji et al. (2017).
  • the v03 branch contains a version of cort with more models and a better train/dev/test workflow. For more details on the models, see Martschat (2017).
  • Nafise Moosavi's fork of cort implements search space pruning on top of cort, as described in Moosavi and Strube (2016).

Documentation

Installation

cort is available on PyPI. You can install it via

pip install cort

Dependencies (automatically installed by pip) are nltk, numpy, matplotlib, mmh3, PyStanfordDependencies, cython, future, jpype and beautifulsoup. It ships with stanford_corenlp_pywrapper and the reference implementation of the CoNLL scorer.

cort is written for use on Linux with Python 3.3+. While cort also runs under Python 2.7, I strongly recommend running cort with Python 3, since the Python 3 version is much more efficient.

References

Yangfeng Ji, Chenhao Tan, Sebastian Martschat, Yejin Choi and Noah A. Smith (2017). Dynamic Entity Representations in Neural Language Models. To appear in Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP), Copenhagen, Denmark, 7-11 September 2017.
PDF

Sebastian Martschat (2017). Structured Representations for Coreference Resolution. PhD thesis, Heidelberg University.
PDF

Nafise Sadat Moosavi and Michael Strube (2016). Search space pruning: A simple solution for better coreference resolvers. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, Cal., 12-17 June 2016, pages 1005-1011.
PDF

Sebastian Martschat and Michael Strube (2015). Latent Structures for Coreference Resolution. Transactions of the Association for Computational Linguistics, 3, pages 405-418.
PDF

Sebastian Martschat, Patrick Claus and Michael Strube (2015). Plug Latent Structures and Play Coreference Resolution. In Proceedings of the ACL-IJCNLP 2015 System Demonstrations, Beijing, China, 26-31 July 2015, pages 61-66.
PDF

Sebastian Martschat, Thierry Göckel and Michael Strube (2015). Analyzing and Visualizing Coreference Resolution Errors. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations, Denver, Colorado, USA, 31 May-5 June 2015, pages 6-10.
PDF

Sebastian Martschat and Michael Strube (2014). Recall Error Analysis for Coreference Resolution. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25-29 October 2014, pages 2070-2081.
PDF

Sebastian Martschat (2013). Multigraph Clustering for Unsupervised Coreference Resolution. In Proceedings of the Student Research Workshop at the 51st Annual Meeting of the Association for Computational Linguistics, Sofia, Bulgaria, 5-7 August 2013, pages 81-88.
PDF

If you use the error analysis component in your research, please cite the EMNLP'14 paper. If you use the coreference component in your research, please cite the TACL paper. If you use the multigraph system, please cite the ACL'13-SRW paper.

Changelog

Wednesday, 4 November 2015
Support for numeric features. Due to a change in the feature representation the models changed, hence I have updated the downloadable models.

Friday, 9 October 2015
Now supports label-dependent cost functions.

Tuesday, 15 September 2015
Minor bugfixes.

Monday, 27 July 2015
Now can perform coreference resolution on raw text.

Tuesday, 21 July 2015
Updated to status of TACL paper.

Wednesday, 3 June 2015
Improvements to visualization (mention highlighting and scrolling).

Monday, 1 June 2015
Fixed a bug in mention highlighting for visualization.

Sunday, 31 May 2015
Updated to status of NAACL'15 demo paper.

Wednesday, 13 May 2015
Fixed another bug in the documentation regarding format of antecedent data.

Tuesday, 3 February 2015
Fixed a bug in the documentation: the part number in the antecedent file must be given with trailing 0s.

Thursday, 30 October 2014
Fixed data structure bug in documents.py. The results from the paper are not affected by this bug.

Wednesday, 22 October 2014
Initial release.

cort's People

Contributors

bheinzerling, smartschat


cort's Issues

Memory requirements

cort currently needs a lot of RAM: predicting with the latent ranking model on the CoNLL-2012 development data takes ~8 GB, mainly due to multiprocessing during feature extraction.

error visualization problems

Hello,
I encounter a problem when trying to visualize a system's recall errors by type, as described in the documentation. My reference and system files are in CoNLL format, and no errors are reported when running the code, but the resulting HTML file does not display the document text or any fields in the left panel except for "Documents". When the jquery and jquery.jsPlumb imports in the HTML file are commented out, everything is displayed correctly (document text, left panel, and gold/system mention boundaries), but it is no longer possible to interact. Reproduced in the latest Firefox and Chrome with Python 2.7. The visualization of a document processed with cort-predict-raw seems to work fine.
Thanks!

adjust_head_for_nam doesn't handle DURATION entities

Hi,
the adjust_head_for_nam function in cort.core.head_finders crashes whenever it encounters a named entity type of DURATION. This entity type is sometimes generated by the latest version of CoreNLP.
I guess it shouldn't be too difficult to add a pattern for it, but I don't know what would make sense.
/Christian

2016-04-09 05:04:42,285 INFO Preprocessing en/ep-00-06-15.xml.gz.
2016-04-09 05:20:08,227 INFO Extracting system mentions from en/ep-00-06-15.xml.gz.
2016-04-09 05:20:11,552 ERROR Discarding document en/ep-00-06-15.xml.gz
2016-04-09 05:20:11,619 ERROR Traceback (most recent call last):
  File "/home/staff/ch/PycharmProjects/cort/extra/annot-wmt.py", line 197, in <module>
    doc.system_mentions = mention_extractor.extract_system_mentions(doc)
  File "/home/staff/ch/PycharmProjects/cort/cort/core/mention_extractor.py", line 36, in extract_system_mentions
    for span in __extract_system_mention_spans(document)]
  File "/home/staff/ch/PycharmProjects/cort/cort/core/mention_extractor.py", line 36, in <listcomp>
    for span in __extract_system_mention_spans(document)]
  File "/home/staff/ch/PycharmProjects/cort/cort/core/mentions.py", line 153, in from_document
    mention_property_computer.compute_head_information(attributes)
  File "/home/staff/ch/PycharmProjects/cort/cort/core/mention_property_computer.py", line 248, in compute_head_information
    attributes["ner"][head_index])
  File "/home/staff/ch/PycharmProjects/cort/cort/core/head_finders.py", line 214, in adjust_head_for_nam
    raise Exception("Unknown named entity annotation: " + ner_type)
Exception: Unknown named entity annotation: DURATION

Code not working directly

I was using model-pair-train.obj model provided in your repository. On running the command provided for coreference resolution for raw data, I got the following error.
Traceback (most recent call last):
  File "/usr/local/bin/cort-predict-raw", line 132, in <module>
    testing_corpus = p.run_on_docs("corpus", args.input_filename)
  File "/usr/local/lib/python3.4/dist-packages/cort/preprocessing/pipeline.py", line 38, in run_on_docs
    codecs.open(doc, "r", "utf-8")
  File "/usr/local/lib/python3.4/dist-packages/cort/preprocessing/pipeline.py", line 82, in run_on_doc
    pdeprel=None
TypeError: __new__() missing 1 required positional argument: 'extra'
This got resolved when I added a line
extra = None
following line 82 in cort/preprocessing/pipeline.py

Set IDs are lost during writing

I find this behaviour counter-intuitive: if you read a corpus (using Corpus.from_file) and write it out right away (using write_to_file), all set IDs are lost, i.e. the last column contains only minus signs.

I attached a test script and sample document.

test.zip
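A minimal sketch of the round trip, using only the calls named in this report (the exact signatures, in particular write_to_file's argument, are assumptions):

from cort.core import corpora

# Read a corpus from a CoNLL file, then write it straight back out.
with open("sample.conll") as in_file:
    corpus = corpora.Corpus.from_file("test", in_file)

with open("out.conll", "w") as out_file:
    corpus.write_to_file(out_file)
# The coreference column of out.conll now contains only "-": the set IDs are gone.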

struct.error and OverflowError

I am running cort-train when these errors happen. My setup is Ubuntu 16.04.3 with 64 GB RAM and 4 CPUs.

Process ForkPoolWorker-9:
Traceback (most recent call last):
  File "/usr/lib/python3.5/multiprocessing/pool.py", line 125, in worker
    put((job, i, result))
  File "/usr/lib/python3.5/multiprocessing/queues.py", line 355, in put
    self._writer.send_bytes(obj)
  File "/usr/lib/python3.5/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/usr/lib/python3.5/multiprocessing/connection.py", line 393, in _send_bytes
    header = struct.pack("!i", n)
struct.error: 'i' format requires -2147483648 <= number <= 2147483647

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/usr/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.5/multiprocessing/pool.py", line 130, in worker
    put((job, i, (False, wrapped)))
  File "/usr/lib/python3.5/multiprocessing/queues.py", line 349, in put
    obj = ForkingPickler.dumps(obj)
  File "/usr/lib/python3.5/multiprocessing/reduction.py", line 50, in dumps
    cls(buf, protocol).dump(obj)
OverflowError: cannot serialize a string larger than 4GiB
(The identical traceback repeats for ForkPoolWorker-10 and ForkPoolWorker-11.)

Collins head finder iterator should be a list instead?

I noticed some discrepancies between the two different traversal orders.
reversed() on line 108 returns an iterator; if it gets exhausted in the first iteration of the loop on line 111, the subsequent results are incorrect.

if traverse_reversed:
    to_traverse = reversed(tree)
else:
    to_traverse = tree

for val in values:
    for child in to_traverse:

I suggest changing

to_traverse = reversed(tree)

to

to_traverse = list(reversed(tree))
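A minimal, self-contained illustration of the pitfall (the tree and values below are made-up stand-ins, not cort code):

tree = ["NP", "VP", "PP"]
values = ["NN", "VB"]

to_traverse = reversed(tree)
for val in values:
    # The first iteration consumes the iterator and sees ['PP', 'VP', 'NP'];
    # every later iteration sees [].
    print(val, [child for child in to_traverse])

to_traverse = list(reversed(tree))
for val in values:
    # Wrapped in a list, the reversed order is reusable on every iteration.
    print(val, [child for child in to_traverse])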

ParentedTree creeps into "head" attribute

When I read this document from CoNLL-2012 into cort, a TypeError is thrown. A ParentedTree enters the "head" attribute in mention_property_computer.py around line 241 (head = [head_tree[0]]). The value can be traced back to the head finder, but I stopped there because there are a lot of alternative rules.

>>> from cort.core.corpora import Corpus
>>> with open('output/debug.conll') as f:
...     Corpus.from_file('test', f)
...
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/home/minhle/.local/lib/python3.5/site-packages/cort-0.2.4.5-py3.5.egg/cort/core/corpora.py", line 79, in from_file
    documents.append(from_string("".join(current_document)))
  File "/home/minhle/.local/lib/python3.5/site-packages/cort-0.2.4.5-py3.5.egg/cort/core/corpora.py", line 14, in from_string
    return documents.CoNLLDocument(string)
  File "/home/minhle/.local/lib/python3.5/site-packages/cort-0.2.4.5-py3.5.egg/cort/core/documents.py", line 414, in __init__
    super(CoNLLDocument, self).__init__(identifier, sentences, coref)
  File "/home/minhle/.local/lib/python3.5/site-packages/cort-0.2.4.5-py3.5.egg/cort/core/documents.py", line 97, in __init__
    self.annotated_mentions = self.__get_annotated_mentions()
  File "/home/minhle/.local/lib/python3.5/site-packages/cort-0.2.4.5-py3.5.egg/cort/core/documents.py", line 111, in __get_annotated_mentions
    span, self, first_in_gold_entity=set_id not in seen
  File "/home/minhle/.local/lib/python3.5/site-packages/cort-0.2.4.5-py3.5.egg/cort/core/mentions.py", line 174, in from_document
    mention_property_computer.compute_gender(attributes)
  File "/home/minhle/.local/lib/python3.5/site-packages/cort-0.2.4.5-py3.5.egg/cort/core/mention_property_computer.py", line 91, in compute_gender
    if __wordnet_lookup_gender(" ".join(attributes["head"])):
TypeError: sequence item 0: expected str instance, ParentedTree found
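The failing call is " ".join(attributes["head"]) with a ParentedTree inside the list. A minimal reproduction of just the type error (assuming nltk, one of cort's dependencies, is installed; the tree content is invented):

from nltk.tree import ParentedTree

# A head list that contains a tree node instead of a plain string:
head = [ParentedTree.fromstring("(NN picture)")]
" ".join(head)  # TypeError: sequence item 0: expected str instance, ParentedTree found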

Retraining models

Is it possible to retrain models (for example, the ones from https://github.com/smartschat/cort/blob/master/COREFERENCE.md#model-downloads) with new data?

I tried training using:

cort-train -in new_retraining_data.conll \
           -out pretrained_model.obj \
           -extractor cort.coreference.approaches.mention_ranking.extract_substructures \
           -perceptron cort.coreference.approaches.mention_ranking.RankingPerceptron \
           -cost_function cort.coreference.cost_functions.cost_based_on_consistency \
           -n_iter 5 \
           -cost_scaling 100 \
           -random_seed 23

but I think it overwrites the model.

"KeyError: None" while training

I was trying to train a model on data derived from the CoNLL-2012 training set when I got this error.

These are the details of the model:
('-extractor', 'cort.coreference.approaches.mention_ranking.extract_substructures', '-perceptron', 'cort.coreference.approaches.mention_ranking.RankingPerceptron', '-cost_function', 'cort.coreference.cost_functions.cost_based_on_consistency', '-cost_scaling', '100')

This is the error:

    2018-09-10 19:57:49,116 INFO Started epoch 1
    Traceback (most recent call last):
    File "output/cort/venv/bin/cort-train", line 4, in <module>
        __import__('pkg_resources').run_script('cort==0.2.4.5', 'cort-train')
    File "/Users/minh/EvEn/output/cort/venv/lib/python3.7/site-packages/pkg_resources/__init__.py", line 658, in run_script
        self.require(requires)[0].run_script(script_name, ns)
    File "/Users/minh/EvEn/output/cort/venv/lib/python3.7/site-packages/pkg_resources/__init__.py", line 1438, in run_script
        exec(code, namespace, namespace)
    File "/Users/minh/EvEn/output/cort/venv/lib/python3.7/site-packages/cort-0.2.4.5-py3.7.egg/EGG-INFO/scripts/cort-train", line 141, in <module>
        perceptron
    File "/Users/minh/EvEn/output/cort/venv/lib/python3.7/site-packages/cort-0.2.4.5-py3.7.egg/cort/coreference/experiments.py", line 43, in learn
        perceptron.fit(substructures, arc_information)
    File "output/cort/venv/lib/python3.7/site-packages/cort-0.2.4.5-py3.7.egg/cort/coreference/perceptrons.pyx", line 182, in cort.coreference.perceptrons.Perceptron.fit
        self.__update(cons_arcs,
    File "output/cort/venv/lib/python3.7/site-packages/cort-0.2.4.5-py3.7.egg/cort/coreference/perceptrons.pyx", line 331, in cort.coreference.perceptrons.Perceptron.__update
        arc_information[arc][0]
    KeyError: None

Could you please have a look?

Display only files with errors

Is it possible to visualize only files that contain at least one error? I have a large corpus, and after filtering there are only a couple hundred errors. So I find myself looking at clean files most of the time (i.e., no annotations besides mention spans).

Unable to use out of the box

I was trying to use cort straight out of the box to predict coreference chains on raw text, but was unable to get it running. Here's what I did:

  1. Created a virtualenv called cort and installed cort using pip. The GitHub repo was in another folder called cort_tool, and the Stanford CoreNLP tools were in a folder called stanford-corenlp.
  2. Downloaded model-pair-train.obj and placed it in the cort_tool folder.
  3. Created an input.txt file with a single sentence.
  4. Ran the following commands after activating the venv:
$ cd cort_tool
$ cort-predict-raw -in ~/input.txt -model model-pair-train.obj -extractor cort.coreference.approaches.mention_ranking.extract_substructures -perceptron cort.coreference.approaches.mention_ranking.RankingPerceptron -clusterer cort.coreference.clusterer.all_ante -corenlp ~/stanford-corenlp -suffix out 2>&1 | tee ~/output.txt

I got the following output:

2016-10-03 17:17:55,338 INFO Loading model.
In file included from /home/cil/cort/lib/python3.4/site-packages/numpy/core/include/numpy/ndarraytypes.h:1777:0,
                 from /home/cil/cort/lib/python3.4/site-packages/numpy/core/include/numpy/ndarrayobject.h:18,
                 from /home/cil/cort/lib/python3.4/site-packages/numpy/core/include/numpy/arrayobject.h:4,
                 from /home/cil/.pyxbld/temp.linux-x86_64-3.4/pyrex/cort/coreference/perceptrons.c:274:
/home/cil/cort/lib/python3.4/site-packages/numpy/core/include/numpy/npy_1_7_deprecated_api.h:15:2: warning: #warning "Using deprecated NumPy API, disable it by " "#defining NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION" [-Wcpp]
 #warning "Using deprecated NumPy API, disable it by " \
  ^
In file included from /home/cil/cort/lib/python3.4/site-packages/numpy/core/include/numpy/ndarrayobject.h:27:0,
                 from /home/cil/cort/lib/python3.4/site-packages/numpy/core/include/numpy/arrayobject.h:4,
                 from /home/cil/.pyxbld/temp.linux-x86_64-3.4/pyrex/cort/coreference/perceptrons.c:274:
/home/cil/cort/lib/python3.4/site-packages/numpy/core/include/numpy/__multiarray_api.h:1448:1: warning: ‘_import_array’ defined but not used [-Wunused-function]
 _import_array(void)
 ^
2016-10-03 17:18:02,512 INFO Reading in and preprocessing data.
2016-10-03 17:18:02,513 INFO Starting java subprocess, and waiting for signal it's ready, with command: exec java -Xmx4g -XX:ParallelGCThreads=1 -cp '/home/cil/cort/lib/python3.4/site-packages/stanford_corenlp_pywrapper/lib/*:/home/cil/stanford-corenlp/*'      corenlp.SocketServer --outpipe /tmp/corenlp_pywrap_pipe_pypid=20030_time=1475486282.512968  --configfile /home/cil/cort/lib/python3.4/site-packages/cort/config_files/corenlp.ini
INFO:CoreNLP_JavaServer: Using CoreNLP configuration file: /home/cil/cort/lib/python3.4/site-packages/cort/config_files/corenlp.ini
Exception in thread "main" java.lang.UnsupportedClassVersionError: edu/stanford/nlp/pipeline/StanfordCoreNLP : Unsupported major.minor version 52.0
    at java.lang.ClassLoader.defineClass1(Native Method)
    at java.lang.ClassLoader.defineClass(ClassLoader.java:803)
    at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
    at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
    at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
    at corenlp.JsonPipeline.initializeCorenlpPipeline(JsonPipeline.java:206)
    at corenlp.SocketServer.main(SocketServer.java:102)

At this point, the memory usage of the cort-predict-raw task dropped to zero, so I did a keyboard interrupt and tried again, but got the same result.

I'm on Ubuntu 16.04.

Can you please help me out?

error analysis visualization: unable to scroll

I visualized coreference errors (errors_by_type.visualize()), but it is not possible to scroll the left part of the visualization (the right part with the text works well).
I am still using macOS and Safari; I know it hasn't been tested there, I just thought you might be interested to know.

exceptions while training with gold conll data

Hello,

I encountered a few problems while trying to train a model with the gold-standard version of the CoNLL-2012 training set (*_gold_conll).

The first issue occurs during the conversion of certain trees, when some nodes of the trees are deleted but accessed later:

 File "$HOME/.local/bin/cort-train", line 132, in <module>
    "r", "utf-8"))
  File "$HOME/.local/lib/python2.7/site-packages/cort/core/corpora.py", line 79, in from_file
    document_as_strings]))
  File "$HOME/.local/lib/python2.7/site-packages/cort/core/corpora.py", line 14, in from_string
    return documents.CoNLLDocument(string)
  File "$HOME/.local/lib/python2.7/site-packages/cort/core/documents.py", line 401, in __init__
    [parse.replace("NOPARSE", "S") for parse in parses]#, include_erased=True
  File "$HOME/.local/lib/python2.7/site-packages/StanfordDependencies/StanfordDependencies.py", line 116, in convert_trees
    for ptb_tree in ptb_trees)
  File "$HOME/.local/lib/python2.7/site-packages/StanfordDependencies/StanfordDependencies.py", line 116, in <genexpr>
    for ptb_tree in ptb_trees)
  File "$HOME/.local/lib/python2.7/site-packages/StanfordDependencies/JPypeBackend.py", line 141, in convert_tree
    sentence.renumber()
  File "$HOME/.local/lib/python2.7/site-packages/StanfordDependencies/CoNLL.py", line 111, in renumber
    for token in self]
KeyError: 18

This happens for several sentences in the training data set (e.g., document bn/cnn/04/cnn_0432, sentence on lines 272-296). One way to avoid the exception is to set include_erased=True.
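For reference, a hedged sketch of that workaround, based on the convert_trees call visible in the tracebacks in this section (cort/core/documents.py, around line 401); the exact surrounding code is an assumption:

# In cort/core/documents.py, pass include_erased=True to PyStanfordDependencies:
dep_trees = sd.convert_trees(
    [parse.replace("NOPARSE", "S") for parse in parses],
    include_erased=True,  # keep erased tokens so renumber() finds all nodes
)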

The second issue is caused by one sentence in the training set (document mz/sinorama/10/ectb_1005, lines 980-1012):

Traceback (most recent call last):
  File "$HOME/.local/bin/cort-train", line 132, in <module>
    "r", "utf-8"))
  File "$HOME/.local/lib/python2.7/site-packages/cort/core/corpora.py", line 79, in from_file
    document_as_strings]))
  File "$HOME/.local/lib/python2.7/site-packages/cort/core/corpora.py", line 14, in from_string
    return documents.CoNLLDocument(string)
  File "$HOME/.local/lib/python2.7/site-packages/cort/core/documents.py", line 414, in __init__
    super(CoNLLDocument, self).__init__(identifier, sentences, coref)
  File "$HOME/.local/lib/python2.7/site-packages/cort/core/documents.py", line 97, in __init__
    self.annotated_mentions = self.__get_annotated_mentions()
  File "$HOME/.local/lib/python2.7/site-packages/cort/core/documents.py", line 111, in __get_annotated_mentions
    span, self, first_in_gold_entity=set_id not in seen
  File "$HOME/.local/lib/python2.7/site-packages/cort/core/mentions.py", line 174, in from_document
    mention_property_computer.compute_gender(attributes)
  File "$HOME/.local/lib/python2.7/site-packages/cort/core/mention_property_computer.py", line 89, in compute_gender
    if __wordnet_lookup_gender(" ".join(attributes["head"])):
TypeError: sequence item 0: expected string, ParentedTree found

The problems seem to be data-related, as none of them occur when using the *_auto_conll version of the CoNLL-2012 training data.

cort-predict-raw runs on python2 but not python3.5

I was trying to run cort-predict-raw with the following command:

python3.5 /usr/local/bin/cort-predict-raw -in ~/data/pilot_44_docs/*.txt \
    -model models/model-pair-train.obj \
    -extractor cort.coreference.approaches.mention_ranking.extract_substructures \
    -perceptron cort.coreference.approaches.mention_ranking.RankingPerceptron \
    -clusterer cort.coreference.clusterer.all_ante \
    -corenlp ~/systems/stanford/stanford-corenlp-full-2016-10-31

and got the following error message:

Traceback (most recent call last):
  File "/usr/local/bin/cort-predict-raw", line 136, in <module>
    doc.system_mentions = mention_extractor.extract_system_mentions(doc)
  File "/usr/local/lib/python3.5/dist-packages/cort/core/mention_extractor.py", line 36, in extract_system_mentions
    for span in __extract_system_mention_spans(document)]
  File "/usr/local/lib/python3.5/dist-packages/cort/core/mention_extractor.py", line 36, in <listcomp>
    for span in __extract_system_mention_spans(document)]
  File "/usr/local/lib/python3.5/dist-packages/cort/core/mentions.py", line 126, in from_document
    i, sentence_span = document.get_sentence_id_and_span(span)
TypeError: 'NoneType' object is not iterable
2017-04-27 09:17:06,058 WARNING Killing subprocess 14154
2017-04-27 09:17:06,395 INFO Subprocess seems to be stopped, exit code -9

It works without a problem with Python 2, though. I'm running this on Ubuntu 16.04.

CoreNLP error

Hi! I get an error from CoreNLP when I'm trying to predict coreference on raw text using cort-predict-raw and standard parameters as described in the manual. This is what I get:

2015-11-05 10:37:23,639 INFO Loading model.
2015-11-05 10:37:55,518 INFO Reading in and preprocessing data.
2015-11-05 10:37:55,519 INFO Starting java subprocess, and waiting for signal it's ready, with command: exec java -Xmx4g -XX:ParallelGCThreads=1 -cp '/Library/Python/2.7/site-packages/stanford_corenlp_pywrapper/lib/*:/Users/yuliagrishina/Documents/Software/CoreNLP/*' corenlp.SocketServer --outpipe /tmp/corenlp_pywrap_pipe_pypid=2721_time=1446716275.52 --configfile /Library/Python/2.7/site-packages/cort/config_files/corenlp.ini
INFO:CoreNLP_JavaServer: Using CoreNLP configuration file: /Library/Python/2.7/site-packages/cort/config_files/corenlp.ini
Exception in thread "main" java.lang.NoClassDefFoundError: edu/stanford/nlp/pipeline/StanfordCoreNLP
at corenlp.JsonPipeline.initializeCorenlpPipeline(JsonPipeline.java:206)
at corenlp.SocketServer.main(SocketServer.java:102)
Caused by: java.lang.ClassNotFoundException: edu.stanford.nlp.pipeline.StanfordCoreNLP
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 2 more

Do you have any idea why this happens?

UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 13: ordinal not in range(128)

I was trying to load a file composed of all gold sentences in the CoNLL-2012 dev set when this error occurred. Below is the full stack trace:

In [2]: reference = corpora.Corpus.from_file("reference", open("output/Thu-Jan-12-17-22-15-CET-2017.gold.txt"))
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-2-57d8e778731d> in <module>()
----> 1 reference = corpora.Corpus.from_file("reference", open("output/Thu-Jan-12-17-22-15-CET-2017.gold.txt"))

/Users/cumeo/anaconda/lib/python2.7/site-packages/cort/core/corpora.pyc in from_file(description, coref_file)
     77
     78         return Corpus(description, sorted([from_string(doc) for doc in
---> 79                                            document_as_strings]))
     80
     81

/Users/cumeo/anaconda/lib/python2.7/site-packages/cort/core/corpora.pyc in from_string(string)
     12
     13 def from_string(string):
---> 14     return documents.CoNLLDocument(string)
     15
     16

/Users/cumeo/anaconda/lib/python2.7/site-packages/cort/core/documents.pyc in __init__(self, document_as_string)
    399         sd = StanfordDependencies.get_instance()
    400         dep_trees = sd.convert_trees(
--> 401             [parse.replace("NOPARSE", "S") for parse in parses],
    402         )
    403         sentences = []

/Users/cumeo/.local/lib/python2.7/site-packages/StanfordDependencies/StanfordDependencies.pyc in convert_trees(self, ptb_trees, representation, universal, include_punct, include_erased, **kwargs)
    114                       include_erased=include_erased)
    115         return Corpus(self.convert_tree(ptb_tree, **kwargs)
--> 116                       for ptb_tree in ptb_trees)
    117
    118     @abstractmethod

/Users/cumeo/.local/lib/python2.7/site-packages/StanfordDependencies/StanfordDependencies.pyc in <genexpr>((ptb_tree,))
    114                       include_erased=include_erased)
    115         return Corpus(self.convert_tree(ptb_tree, **kwargs)
--> 116                       for ptb_tree in ptb_trees)
    117
    118     @abstractmethod

/Users/cumeo/.local/lib/python2.7/site-packages/StanfordDependencies/JPypeBackend.pyc in convert_tree(self, ptb_tree, representation, include_punct, include_erased, add_lemmas, universal)
     85         self._raise_on_bad_input(ptb_tree)
     86         self._raise_on_bad_representation(representation)
---> 87         tree = self.treeReader(ptb_tree)
     88         if tree is None:
     89             raise ValueError("Invalid Penn Treebank tree: %r" % ptb_tree)

UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 13: ordinal not in range(128)

The data looks like this:

Minhs-MacBook-Pro:EvEn cumeo$ head output/Thu-Jan-12-17-22-15-CET-2017.gold.txt
#begin document (bc/cctv/00/cctv_0000); part 000
bc/cctv/00/cctv_0000	0	0	In	IN	(TOP(S(PP*	-	-	-	Speaker#1	*	*	*	*-
bc/cctv/00/cctv_0000	0	1	the	DT	(NP(NP*	-	-	-	Speaker#1	(DATE*	*	*	*	-
bc/cctv/00/cctv_0000	0	2	summer	NN	*)	summer	-	1	Speaker#1	*	*	*	*	-
bc/cctv/00/cctv_0000	0	3	of	IN	(PP*	-	-	-	Speaker#1	*	*	*	*	-
bc/cctv/00/cctv_0000	0	4	2005	CD	(NP*))))	-	-	-	Speaker#1	*)	*	*	*-
bc/cctv/00/cctv_0000	0	5	,	,	*	-	-	-	Speaker#1	*	*	*	*	-
bc/cctv/00/cctv_0000	0	6	a	DT	(NP(NP*	-	-	-	Speaker#1	*	(ARG0*	*	*	-
bc/cctv/00/cctv_0000	0	7	picture	NN	*)	picture	-	8	Speaker#1	*	*)	*	*	-
bc/cctv/00/cctv_0000	0	8	that	WDT	(SBAR(WHNP*)	-	-	-	Speaker#1	*	(R-ARG0*)	**	-

Does anyone have any idea how to fix this?

Best regards,
Minh
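A plausible fix, offered as a sketch rather than a confirmed resolution: byte 0xef is the lead byte of a multi-byte UTF-8 sequence, so the file contains non-ASCII text that Python 2's default ascii codec cannot decode. cort's own pipeline opens files with codecs.open(doc, "r", "utf-8") (see the traceback in "Code not working directly" above), so opening the file the same way before handing it to Corpus.from_file should avoid the implicit ASCII decoding:

import codecs
from cort.core import corpora

# Open the CoNLL file with an explicit UTF-8 codec so all lines are unicode:
with codecs.open("output/Thu-Jan-12-17-22-15-CET-2017.gold.txt", "r", "utf-8") as f:
    reference = corpora.Corpus.from_file("reference", f)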

cort-predict-raw cannot run on all specific raw text

I'm trying to run cort-predict-raw out of the box using the following setup:

cort-predict-raw -in ~/data/test1/*.txt \
		-model models/model-pair-train.obj \
		-extractor cort.coreference.approaches.mention_ranking.extract_substructures \
		-perceptron cort.coreference.approaches.mention_ranking.RankingPerceptron \
		-clusterer cort.coreference.clusterer.all_ante \
		-corenlp ~/systems/stanford/stanford-corenlp-full-2016-10-31 \
		#-features my_features.txt \

For some reason it throws an exception for the string "SEC" (with quotation marks) in:

Hello my name is "SEC".

If I replace SEC or remove the quotation marks, the file passes through.

The exception:

Traceback (most recent call last):
  File "/home/ubuntu/.local/bin/cort-predict-raw", line 136, in <module>
    doc.system_mentions = mention_extractor.extract_system_mentions(doc)
  File "/home/ubuntu/.local/lib/python3.5/site-packages/cort/core/mention_extractor.py", line 36, in extract_system_mentions
    for span in __extract_system_mention_spans(document)]
  File "/home/ubuntu/.local/lib/python3.5/site-packages/cort/core/mention_extractor.py", line 36, in <listcomp>
    for span in __extract_system_mention_spans(document)]
  File "/home/ubuntu/.local/lib/python3.5/site-packages/cort/core/mentions.py", line 153, in from_document
    mention_property_computer.compute_head_information(attributes)
  File "/home/ubuntu/.local/lib/python3.5/site-packages/cort/core/mention_property_computer.py", line 248, in compute_head_information
    attributes["ner"][head_index])
  File "/home/ubuntu/.local/lib/python3.5/site-packages/cort/core/head_finders.py", line 214, in adjust_head_for_nam
    raise Exception("Unknown named entity annotation: " + ner_type)
Exception: Unknown named entity annotation: DURATION
