vene / marseille

Mining Argument Structures with Expressive Inference (Linear and LSTM Engines)

License: BSD 3-Clause "New" or "Revised" License

Python 100.00%

nlp nlp-machine-learning argumentation discourse-analysis structured-learning deep-learning machine-learning natural-language-processing

marseille's Introduction

marseille

mining argument structures with expressive inference (with linear and lstm engines)

What is it?

Marseille learns to predict argumentative proposition types and the support relations between them, formulated as inference in an expressive factor graph.

Requirements

numpy
scipy
scikit-learn
pystruct
nltk
dill
docopt
dynet v1.1
lightning
ad3 >= v2.1 (pip install ad3)
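
Before running anything, it can help to confirm that these packages are importable. The sketch below is not part of marseille. The import names (for instance `sklearn` for scikit-learn) are assumptions about how each package installs, and the check does not verify versions such as dynet v1.1 or ad3 >= v2.1.

```python
import importlib.util

# Import names assumed for the requirements listed above
# (not taken from the marseille source).
REQUIRED_MODULES = [
    "numpy", "scipy", "sklearn", "pystruct", "nltk",
    "dill", "docopt", "dynet", "lightning", "ad3",
]


def missing_modules(names=REQUIRED_MODULES):
    """Return the names that cannot be found on the current Python path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("missing:", ", ".join(missing))
    else:
        print("all requirements importable")
```

`find_spec` only locates a top-level module without importing it, so the check stays cheap even for heavy packages.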

Usage

(replace $ds with cdcp or ukp)

download the data from http://joonsuk.org/ and unzip it in the subdirectory data, so that the path ./data/process/erule/train/ exists.
extract relevant subset of GloVe embeddings:

    python -m marseille.preprocess embeddings $ds --glove-file=/p/glove.840B.300d.txt

extract features:

    python -m marseille.features $ds

    # (for cdcp only:)
    python -m marseille.features cdcp-test

generate vectorized train-test split (for baselines only)

    mkdir data/process/.../
    python -m marseille.vectorize split cdcp

run chosen model, for example:

    python -m experiments.exp_train_test $ds --method rnn-struct --model strict

(for dynet models, set --dynet-seed=42 for exact reproducibility)

compare results:

    python -m experiments.plot_test_results $ds

To reproduce cross-validation model selection, you would also need to run:

    python -m marseille.vectorize folds $ds
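
The main steps above can be sketched as a small helper that assembles the command lines for a given dataset. This is an illustration rather than part of the repository: `pipeline_commands` is a hypothetical name, and it leaves out the baseline-only vectorize step, the cross-validation folds, and the plotting step.

```python
import sys


def pipeline_commands(ds, glove_file, method="rnn-struct", model="strict",
                      dynet_seed=None):
    """Build the command lines for the steps above, in order."""
    if ds not in ("cdcp", "ukp"):
        raise ValueError("ds must be 'cdcp' or 'ukp'")
    py = [sys.executable, "-m"]
    cmds = [
        # Extract the relevant subset of GloVe embeddings.
        py + ["marseille.preprocess", "embeddings", ds,
              "--glove-file=" + glove_file],
        # Extract features for the chosen dataset.
        py + ["marseille.features", ds],
    ]
    if ds == "cdcp":
        # The separate test-set feature extraction applies to cdcp only.
        cmds.append(py + ["marseille.features", "cdcp-test"])
    train = py + ["experiments.exp_train_test", ds,
                  "--method", method, "--model", model]
    if dynet_seed is not None:
        # Optional fixed seed for reproducibility of dynet models.
        train.append("--dynet-seed={}".format(dynet_seed))
    cmds.append(train)
    return cmds
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` from the repository root, so that a failing step stops the sequence instead of feeding stale files to the next one.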

Running a model on your own data:

If you have some documents e.g. F.txt, G.txt that you would like to run a pretrained model on, read on.

download the required preprocessing toolkits: Stanford CoreNLP (tested with version 3.6.0) and the WING-NUS PDTB discourse parser (tested with this commit) and configure their paths:

    export MARSEILLE_CORENLP_PATH=/home/vlad/corenlp  #  path to CoreNLP
    export MARSEILLE_WINGNUS_PATH=/home/vlad/wingnus  #  path to WING-NUS parser

Note: If you already generated F.txt.json with CoreNLP and F.txt.pipe with the WING-NUS parser (e.g., on a different computer), you may skip this step and marseille will detect those files automatically.

Otherwise, these files are generated the first time that a UserDoc object is instantiated for a given document. In particular, the step below will do this automatically.
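
To see in advance whether the toolkits will be needed for a given document, a check along the following lines may help. It is a sketch, not marseille's own detection logic: the function names are hypothetical, and it assumes the generated files sit next to the raw text as F.txt.json and F.txt.pipe, as described above.

```python
import os


def preprocessing_status(doc_id, directory="."):
    """Report which files exist for a document such as F."""
    base = os.path.join(directory, doc_id + ".txt")
    return {
        "raw": os.path.isfile(base),
        "corenlp_json": os.path.isfile(base + ".json"),
        "wingnus_pipe": os.path.isfile(base + ".pipe"),
    }


def needs_toolkits(doc_id, directory="."):
    """True if CoreNLP or the WING-NUS parser would still have to run."""
    status = preprocessing_status(doc_id, directory)
    return not (status["corenlp_json"] and status["wingnus_pipe"])


def unset_toolkit_paths(environ=os.environ):
    """Return the toolkit path variables that are not set."""
    names = ("MARSEILLE_CORENLP_PATH", "MARSEILLE_WINGNUS_PATH")
    return [name for name in names if not environ.get(name)]
```

If `needs_toolkits("F")` is true and `unset_toolkit_paths()` is non-empty, the paths should be exported before the feature extraction step.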

extract the features:

    python -m marseille.features user F G  # raw input must be in F.txt & G.txt

This is needed for the RNN models too, because the feature files encode some metadata about the document structure.

predict, e.g. using the model saved by the exp_train_test run above:

    python -m experiments.predict_pretrained --method=rnn-struct \
    test_results/exact=True_cdcp_rnn-struct_strict F G

marseille's Issues

Info about data output

Is there any information available about the format of the output from running a model on custom data?
I'm not completely sure how to interpret it.

Thanks in advance.

Function missing in experiments.predict_pretrained

Hi!

I'm trying to run a trained model (linear, non-struct, full) on my data, but I found that "linear" is not implemented, even though it is listed as an option on line 15 of experiments.predict_pretrained.py.

Thanks!

about dataset

Hi vene,
could you share the ukp dataset?

Interrupted by signal 11:SIGSEGV

I got the error "Process finished with exit code 139 (interrupted by signal 11: SIGSEGV)"
when trying to reproduce the results on the ukp dataset.

The problem appears while running exp_train_test.py with the arguments "ukp --method rnn-struct --model strict [--dynet-seed=42]".

The console output is as follows:

[dynet] random seed: 3694361057
[dynet] allocating memory: 512MB
[dynet] memory allocation done.
2017-07-18 12:27:07,154 - root - INFO - rnn-struct strict on ukp ({'max_iter': 10, 'mlp_dropout': 0.15})
2017-07-18 12:27:13,659 - root - INFO - Setting node class weights Claim: 1.0, MajorClaim: 1.0, Premise: 1.0
2017-07-18 12:27:13,660 - root - INFO - Setting link class weights False: 1.0, True: 4.725530458590007
2017-07-18 12:27:13,660 - root - INFO - Overriding n_embeds to glove size 300
2017-07-18 12:27:13,671 - root - INFO - Initializing embeddings...
2017-07-18 12:27:13,799 - root - INFO - ...done

Process finished with exit code 139 (interrupted by signal 11: SIGSEGV)

Do you know what could be causing this problem? I am using dynet v1.1.

Unable to install dynet version 1.1

The code in this repository uses an older version of dynet (1.1) and gives an error when run with the latest version.

However, installing the older version of dynet seems tricky because:

The installer tries to clone the repository for the library 'Eigen' but gets a 404 Not Found error.
I tried manually updating the command to point to the current URL for Eigen, but that leads to the following error:

Running setup.py clean for dyNET
Failed to build dyNET
Installing collected packages: dyNET
    Running setup.py install for dyNET ... error
    ERROR: Command errored out with exit status 1:
     command: /home/webis/anaconda3/envs/vishal_arg_3_6/bin/python -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-req-build-yupx_wwp/setup.py'"'"'; __file__='"'"'/tmp/pip-req-build-yupx_wwp/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /tmp/pip-record-mt6j83jp/install-record.txt --single-version-externally-managed --compile --install-headers /home/webis/anaconda3/envs/vishal_arg_3_6/include/python3.6m/dyNET
         cwd: /tmp/pip-req-build-yupx_wwp/
    Complete output (93 lines):
    running install
    running build
    INFO:root:==============================
    INFO:root:CMake path: /usr/bin/cmake
    INFO:root:Make path: /usr/bin/make
    INFO:root:Make flags: -j 16
    INFO:root:Mercurial path: /home/webis/anaconda3/envs/vishal_arg_3_6/bin/hg
    INFO:root:C compiler path: /usr/bin/gcc
    INFO:root:CXX compiler path: /usr/bin/g++
    INFO:root:---
    INFO:root:Script directory: /tmp/pip-req-build-yupx_wwp
    INFO:root:Build directory: /tmp/pip-req-build-yupx_wwp/build/py3.6-64bit
    INFO:root:Library installation directory: /home/webis/anaconda3/envs/vishal_arg_3_6/lib/python3.6/site-packages/../../..
    INFO:root:Python executable: /home/webis/anaconda3/envs/vishal_arg_3_6/bin/python
    INFO:root:==============================
    cmake version 2.8.12.2
    g++ (Ubuntu 4.8.4-2ubuntu1~14.04.4) 4.8.4
    Copyright (C) 2013 Free Software Foundation, Inc.
    This is free software; see the source for copying conditions.  There is NO
    warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
    
    INFO:root:Creating build directory /tmp/pip-req-build-yupx_wwp/build/py3.6-64bit
    INFO:root:Cloning Eigen (gitlab)...
    Cloning into 'eigen'...
    INFO:root:Configuring...
    -- The C compiler identification is GNU 4.8.4
    -- The CXX compiler identification is GNU 4.8.4
    -- Check for working C compiler: /usr/bin/gcc
    -- Check for working C compiler: /usr/bin/gcc -- works
    -- Detecting C compiler ABI info
    -- Detecting C compiler ABI info - done
    -- Check for working CXX compiler: /usr/bin/g++
    -- Check for working CXX compiler: /usr/bin/g++ -- works
    -- Detecting CXX compiler ABI info
    -- Detecting CXX compiler ABI info - done
    CMake Error at /usr/share/cmake-2.8/Modules/FindBoost.cmake:1131 (message):
      Unable to find the requested Boost libraries.
    
      Boost version: 1.54.0
    
      Boost include path: /usr/include
    
      Could not find the following Boost libraries:
    
              boost_regex
              boost_serialization
    
      Some (but not all) of the required Boost libraries were found.  You may
      need to install these additional Boost libraries.  Alternatively, set
      BOOST_LIBRARYDIR to the directory containing Boost libraries or BOOST_ROOT
      to the location of Boost.
    Call Stack (most recent call first):
      CMakeLists.txt:111 (find_package)
    
    
    -- Boost dir is /usr/include
    -- BACKEND not specified, defaulting to eigen.
    -- Eigen dir is /tmp/pip-req-build-yupx_wwp/build/py3.6-64bit/eigen
    -- Looking for include file pthread.h
    -- Looking for include file pthread.h - found
    -- Looking for pthread_create
    -- Looking for pthread_create - not found
    -- Looking for pthread_create in pthreads
    -- Looking for pthread_create in pthreads - not found
    -- Looking for pthread_create in pthread
    -- Looking for pthread_create in pthread - found
    -- Found Threads: TRUE
    CMake Error at /usr/share/cmake-2.8/Modules/FindBoost.cmake:1131 (message):
      Unable to find the requested Boost libraries.
    
      Boost version: 1.54.0
    
      Boost include path: /usr/include
    
      Could not find the following Boost libraries:
    
              boost_unit_test_framework
              boost_serialization
    
      Some (but not all) of the required Boost libraries were found.  You may
      need to install these additional Boost libraries.  Alternatively, set
      BOOST_LIBRARYDIR to the directory containing Boost libraries or BOOST_ROOT
      to the location of Boost.
    Call Stack (most recent call first):
      tests/CMakeLists.txt:4 (find_package)
    
    
    -- Found Cython version 0.29.22
    
    -- Configuring incomplete, errors occurred!
    See also "/tmp/pip-req-build-yupx_wwp/build/py3.6-64bit/CMakeFiles/CMakeOutput.log".
    See also "/tmp/pip-req-build-yupx_wwp/build/py3.6-64bit/CMakeFiles/CMakeError.log".
    error: /usr/bin/cmake /tmp/pip-req-build-yupx_wwp -DCMAKE_INSTALL_PREFIX=/home/webis/anaconda3/envs/vishal_arg_3_6/lib/python3.6/site-packages/../../.. -DEIGEN3_INCLUDE_DIR=/tmp/pip-req-build-yupx_wwp/build/py3.6-64bit/eigen -DPYTHON=/home/webis/anaconda3/envs/vishal_arg_3_6/bin/python
    ----------------------------------------
ERROR: Command errored out with exit status 1: /home/webis/anaconda3/envs/vishal_arg_3_6/bin/python -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-req-build-yupx_wwp/setup.py'"'"'; __file__='"'"'/tmp/pip-req-build-yupx_wwp/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /tmp/pip-record-mt6j83jp/install-record.txt --single-version-externally-managed --compile --install-headers /home/webis/anaconda3/envs/vishal_arg_3_6/include/python3.6m/dyNET Check the logs for full command output.

Is there a way to work around this issue? If you could share a zip file containing the Eigen repository (at the revision required by dynet 1.1), that would be really helpful.

Thanks in advance.

locating data and data-repository

hello!

I am unable to locate the data at http://joonsuk.org/. Could you provide a direct link?
Also, the directory ./data/process/erule/train/ no longer seems to exist.

best, Vald

Error on preprocess cannot import StringIO

Hi,

I am looking into using your framework as part of a tool I am developing for my thesis to assess the usability and credibility of whitepapers using internet information triangulation. One of the aspects we are interested in, among other factors, is the amount of reasoning used in these whitepapers as an indicator of credibility.

I read your paper and it looked quite promising, so I would like to see how it performs in our use case.
I am, however, running into some problems.

When trying to run the following command, I get an error saying that StringIO cannot be imported. I installed all the requirements listed in the README using pip.
I am using a clean virtualenv running Python 2.7.12 on Ubuntu.

(marseille) ubuntu@ubuntu-xenial:/code/marseille$ python -m marseille.preprocess embeddings cdcp --glove-file=/p/glove.840B.300d.txt
Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/code/marseille/marseille/preprocess.py", line 242, in <module>
    store_optimized_embeddings(dataset, args['--glove-file'])
  File "/code/marseille/marseille/preprocess.py", line 213, in store_optimized_embeddings
    from marseille.datasets import get_dataset_loader
  File "marseille/datasets.py", line 6, in <module>
    from marseille.argdoc import UkpEssayArgumentationDoc, CdcpArgumentationDoc
  File "marseille/argdoc.py", line 9, in <module>
    from io import StringIO
ImportError: cannot import name StringIO
(marseille) ubuntu@ubuntu-xenial:/code/marseille$

I have the feeling that this is due to the io file that is part of the project, but I assume that file serves a purpose, so I am a bit lost at the moment.
Furthermore, I noticed that the file structure of the downloaded dataset does not match the path described in the README.

Any help would be greatly appreciated.
With kind regards,
Tim Jongsma

KeyError: 'collapsed-ccprocessed-dependencies'

Hi,

When trying to run the pretrained model on a custom document, I run into the following error,
which I have not been able to solve.

The command 'python -m marseille.features user F'
results in the following error:
Traceback (most recent call last):
  File "/usr/lib/python3.5/runpy.py", line 184, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.5/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/code/marseille/marseille/features.py", line 614, in <module>
    for prop_id in prop_ids]
  File "/code/marseille/marseille/features.py", line 614, in <listcomp>
    for prop_id in prop_ids]
  File "/code/marseille/marseille/features.py", line 205, in prop_features
    for arc in sents[sent]['collapsed-ccprocessed-dependencies']
KeyError: 'collapsed-ccprocessed-dependencies'

For debugging purposes, I tried replacing the custom document with one of the files from the cdcp dataset, which yields the same error.

Any help would be appreciated,
With kind regards,
Tim
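
One way to narrow down an error like this is to inspect the CoreNLP output that the feature extractor reads (F.txt.json in the README's naming). The sketch below is a diagnostic only and does not establish the cause. It assumes the JSON has a top-level "sentences" list of dicts, and the helper names are hypothetical.

```python
import json

# Key that marseille/features.py looks up, per the traceback above.
EXPECTED_KEY = "collapsed-ccprocessed-dependencies"


def load_corenlp_json(path):
    """Load a CoreNLP JSON output file such as F.txt.json."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def sentences_missing_key(corenlp_doc, key=EXPECTED_KEY):
    """Return indices of sentences that lack `key`.

    Assumes a top-level "sentences" list of dicts.
    """
    sentences = corenlp_doc.get("sentences", [])
    return [i for i, sent in enumerate(sentences) if key not in sent]


def dependency_keys(corenlp_doc):
    """Collect the sentence-level keys whose names mention dependencies."""
    keys = set()
    for sent in corenlp_doc.get("sentences", []):
        keys.update(k for k in sent if "dependenc" in k.lower())
    return sorted(keys)
```

If `dependency_keys(load_corenlp_json("F.txt.json"))` lists differently named keys, the installed CoreNLP version is probably emitting another key naming scheme than the one the code expects; the README states that testing was done with version 3.6.0.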

DeprecationWarning

Hey,

I followed the instructions in the README and tried training the rnn-struct model. While running python -m experiments.exp_train_test cdcp --method rnn-struct --model strict I get the following warning multiple times:

/data/defacto/env/lib/python3.5/site-packages/sklearn/preprocessing/label.py:151: DeprecationWarning: The truth value of an empty array is ambiguous. Returning False, but in future this will result in an error. Use array.size > 0 to check that an array is not empty.

Complete Output:

[dynet] random seed: 3187696204
[dynet] allocating memory: 512MB
[dynet] memory allocation done.
2018-04-03 19:55:42,096 - root - INFO - rnn-struct strict on cdcp ({'max_iter': 10, 'mlp_dropout': 0.2})
2018-04-03 19:55:56,745 - root - INFO - Setting node class weights fact: 1.0, policy: 1.0, reference: 1.0, testimony: 1.0, value: 1.0
2018-04-03 19:55:56,745 - root - INFO - Setting link class weights False: 1.0, True: 30.359851988899166
2018-04-03 19:55:56,745 - root - INFO - Overriding n_embeds to glove size 300
2018-04-03 19:55:56,752 - root - INFO - Initializing embeddings...
2018-04-03 19:55:57,006 - root - INFO - ...done
[epoch=1 eta=0.001 clips=568 updates=579] Iter 0 loss 0.5513
74.4% integer, 24.4% fractional, 1.2% not solved
[epoch=2 eta=0.001 clips=544 updates=567] Iter 1 loss 0.5325
84.9% integer, 13.9% fractional, 1.2% not solved
[epoch=3 eta=0.001 clips=502 updates=560] Iter 2 loss 0.5270
83.0% integer, 16.2% fractional, 0.9% not solved
[epoch=4 eta=0.001 clips=465 updates=541] Iter 3 loss 0.5169
84.5% integer, 13.8% fractional, 1.7% not solved
[epoch=5 eta=0.001 clips=435 updates=540] Iter 4 loss 0.5120
82.3% integer, 15.8% fractional, 1.9% not solved
[epoch=6 eta=0.001 clips=418 updates=535] Iter 5 loss 0.5089
84.7% integer, 12.9% fractional, 2.4% not solved
[epoch=7 eta=0.001 clips=437 updates=526] Iter 6 loss 0.5083
85.0% integer, 12.6% fractional, 2.4% not solved
[epoch=8 eta=0.001 clips=425 updates=513] Iter 7 loss 0.4793
86.6% integer, 11.9% fractional, 1.5% not solved
[epoch=9 eta=0.001 clips=395 updates=489] Iter 8 loss 0.4506
82.3% integer, 16.0% fractional, 1.7% not solved
[epoch=10 eta=0.001 clips=373 updates=459] Iter 9 loss 0.4102
85.0% integer, 13.1% fractional, 1.9% not solved
2018-04-03 21:22:50,187 - root - INFO - Training time: 579.24s/iteration (1.00s/doc-iter)
/data/defacto/env/lib/python3.5/site-packages/sklearn/preprocessing/label.py:151: DeprecationWarning: The truth value of an empty array is ambiguous. Returning False, but in future this will result in an error. Use array.size > 0 to check that an array is not empty.
  if diff:
/data/defacto/env/lib/python3.5/site-packages/sklearn/preprocessing/label.py:151: DeprecationWarning: The truth value of an empty array is ambiguous. Returning False, but in future this will result in an error. Use array.size > 0 to check that an array is not empty.
.
.
.
.
2018-04-03 21:23:48,273 - root - INFO - Prediction time: 0.37s/doc
2018-04-03 21:23:48,274 - root - INFO - Test inference status: 100.0% integer

Also, I am not able to run python -m experiments.plot_test_results.py cdcp which gives the following error:

/data/defacto/env/bin/python: Error while finding spec for 'experiments.plot_test_results.py' (AttributeError: module 'experiments.plot_test_results' has no attribute '__path__')