josydipa's Introduction

Joint Syntacto-Discourse Parsing and Treebank

License: MIT

This repository contains the implementation of the joint syntacto-discourse parser and the syntacto-discourse treebank. For more details, please refer to the paper Joint Syntacto-Discourse Parsing and the Syntacto-Discourse Treebank.

Syntacto-Discourse Treebank

Due to copyright restrictions, we cannot provide the joint treebank in a form that can be used directly to train a parser. Instead, we provide a patch toolkit that generates the Syntacto-Discourse Treebank given the RST Discourse Treebank and the Penn Treebank.

Required Python Dependencies

  1. python-gflags for parsing script arguments.

  2. nltk for tokenization.

Procedures to Generate Treebank

Please follow the steps below to generate the treebank:

  1. Place the RST Discourse Treebank in folder dataset/rst. Put the training discourse trees (wsj_xxxx.out.dis files) of the RST Discourse Treebank in dataset/rst/train and the test discourse trees in dataset/rst/test. Each wsj_xxxx.out.dis file corresponds to one WSJ article, where xxxx is the article number.

  2. Place the Penn Treebank trees in folder dataset/ptb. These constituency trees are in parenthesized format. The trees of each WSJ article are grouped in one treebank file named wsj_xxxx.cleangold.

  3. Apply patches to the RST Discourse Treebank files and Penn Treebank files. This step is necessary because there are some small mismatches between the RST Discourse tree texts and the Penn tree texts.

    cd dataset/rst/train
    patch -p0 < ../../../patches/rst-ptb.train.patch
    cd ../test
    patch -p0 < ../../../patches/rst-ptb.test.patch
    cd ../../ptb
    patch -p0 < ../../patches/ptb-rst.patch
    cd ../..

  4. Run tokenization.

    python src/tokenize_rst.py --rst_path dataset/rst/train
    python src/tokenize_rst.py --rst_path dataset/rst/test

  5. Generate the training set and testing set for the joint treebank separately:

    mkdir dataset/joint
    python src/aligner.py --rst_path dataset/rst/train --const_path dataset/ptb > dataset/joint/train.txt
    python src/aligner.py --rst_path dataset/rst/test --const_path dataset/ptb > dataset/joint/test.txt
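
Before running the steps above, it can help to confirm that every discourse tree has a matching Penn Treebank file. The following is a minimal sketch in Python 3, not part of this repository: the helper name `missing_ptb_files` is hypothetical, and it assumes only the file naming described in steps 1 and 2 (wsj_xxxx.out.dis and wsj_xxxx.cleangold).

```python
from pathlib import Path


def missing_ptb_files(rst_dir, ptb_dir):
    """Return article ids (wsj_xxxx) that have a .out.dis file but no .cleangold file.

    Hypothetical helper: it only compares file names and never reads the trees.
    """
    missing = []
    for dis_file in sorted(Path(rst_dir).glob("wsj_*.out.dis")):
        # "wsj_0601.out.dis" -> "wsj_0601"
        article = dis_file.name[: -len(".out.dis")]
        if not (Path(ptb_dir) / (article + ".cleangold")).is_file():
            missing.append(article)
    return missing
```

Calling `missing_ptb_files("dataset/rst/train", "dataset/ptb")` from the repository root should return an empty list when the two folders line up; any article id it returns is one for which the aligner would have no constituency trees.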

To test the scripts above, you can play with the sample data:

python src/tokenize_rst.py --rst_path sampledata/rst
python src/aligner.py --rst_path sampledata/rst --const_path sampledata/ptb > sampledata/joint.txt
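
A quick sanity check on the generated file can catch truncated output. The sketch below is in Python 3 and rests on an assumption that this README does not confirm: that each non-empty line of the joint file holds one bracketed tree, and that literal parentheses in the text are escaped as in the Penn Treebank. The function name `unbalanced_lines` is hypothetical.

```python
def unbalanced_lines(path):
    """Return 1-based numbers of non-empty lines whose parentheses do not balance.

    Assumes one bracketed tree per line (an assumption about the file format).
    """
    bad = []
    with open(path, encoding="utf-8") as handle:
        for number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue
            depth = 0
            for char in line:
                if char == "(":
                    depth += 1
                elif char == ")":
                    depth -= 1
                if depth < 0:
                    # A closing bracket appeared before its opening bracket.
                    break
            if depth != 0:
                bad.append(number)
    return bad
```

If the one-tree-per-line assumption holds, `unbalanced_lines("sampledata/joint.txt")` returning an empty list suggests that the aligner wrote complete trees.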

Syntacto-Discourse Parser

Required Python Dependencies

Since the joint parser is based on the Span-based Constituency Parser, please have the following required dependencies installed:

  1. DyNet for the underlying neural model.

  2. numpy for interacting with DyNet.

Training

To train on the provided sample data, you can simply run:

mkdir exps
python src/trainer.py --train sampledata/joint.txt --dev sampledata/joint.txt --epoch 200 --save exps/sampledata.model

You can find the training parameters and their descriptions in src/trainer.py.

Evaluating

To evaluate the trained model on the sample data, you can run:

python src/parser.py --train sampledata/joint.txt --test sampledata/joint.txt --model exps/sampledata.model --verbose
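
The training and evaluation commands can also be chained from a small driver. This is a sketch in Python 3, not a script shipped with the repository: the function names `build_commands` and `run_all` are hypothetical, and the only flags used are the ones shown in the two commands above.

```python
import subprocess


def build_commands(train, dev, test, model, epochs=200):
    """Build the argument lists for the trainer and parser commands in this README."""
    train_cmd = ["python", "src/trainer.py", "--train", train, "--dev", dev,
                 "--epoch", str(epochs), "--save", model]
    # The evaluation command above also passes --train, so it is kept here.
    eval_cmd = ["python", "src/parser.py", "--train", train, "--test", test,
                "--model", model, "--verbose"]
    return [train_cmd, eval_cmd]


def run_all(commands):
    """Run each command in order and stop at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)
```

For the sample data, `run_all(build_commands("sampledata/joint.txt", "sampledata/joint.txt", "sampledata/joint.txt", "exps/sampledata.model"))` would train and then evaluate, provided the exps folder already exists and the call is made from the repository root.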

josydipa's Issues

Where to get WSJ .cleangold files from?

Dear Kai,

In your README you refer to the WSJ articles as wsj_xxxx.cleangold. In the PTB,
I only have wsj_xxxx.mrg and wsj_xxxx.prd files. Is an extra conversion step necessary?

It seems that the conversion from .mrg

( (S
    (NP-SBJ (NNP Alan) (NNP Murray) )
    (VP (VBD contributed)
      (PP-CLR (TO to)
        (NP (DT this) (NN article) )))
    (. .) ))

to .cleangold

 (TOP (S (NP (NNP Alan) (NNP Murray)) (VP (VBD contributed) (PP (TO to) (NP (DT this) (NN article)))) (. .)))

is relatively straightforward, but since the .cleangold files have to match perfectly for the patch files to work, I thought I'd better ask.

Kind regards,
Arne
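
The conversion asked about in this issue can be approximated as follows. This is a sketch in Python 3 under stated assumptions, not the author's confirmed procedure: it strips function tags such as -SBJ and -CLR, wraps the tree in TOP, and prints it on one line. It does not remove empty elements (-NONE- traces), which real .mrg files contain, so its output may still differ from what the patch files expect. All function names are hypothetical.

```python
import re


def parse_sexpr(text):
    """Parse one bracketed tree into nested lists of labels, words, and subtrees."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read():
        nonlocal pos
        assert tokens[pos] == "("
        pos += 1
        node = []
        # The outermost .mrg bracket has no label, so a node may start with "(".
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                node.append(read())
            else:
                node.append(tokens[pos])
                pos += 1
        pos += 1  # consume ")"
        return node

    return read()


def strip_function_tags(label):
    """NP-SBJ -> NP, PP-CLR -> PP; labels such as -NONE- are left untouched."""
    if label.startswith("-"):
        return label
    return re.split(r"[-=]", label, maxsplit=1)[0]


def to_string(node):
    """Serialize a labeled node on one line, stripping tags from phrase labels."""
    label, children = node[0], node[1:]
    if len(children) == 1 and isinstance(children[0], str):
        return "(%s %s)" % (label, children[0])  # preterminal: (POS word)
    inner = " ".join(to_string(child) for child in children)
    return "(%s %s)" % (strip_function_tags(label), inner)


def mrg_to_oneline(mrg_text):
    """Convert one .mrg tree into a single-line tree rooted at TOP."""
    outer = parse_sexpr(mrg_text)
    # Skip the unlabeled wrapper bracket if it is present.
    root = outer[0] if isinstance(outer[0], list) else outer
    return "(TOP %s)" % to_string(root)
```

Applied to the .mrg tree quoted above, `mrg_to_oneline` produces the same bracketing as the quoted .cleangold line, without the leading space. Whether whitespace and trace handling match the real .cleangold files is exactly the open question of this issue.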
