
Syntax-Guided Controlled Generation of Paraphrases

Source code for TACL 2020 paper: Syntax-Guided Controlled Generation of Paraphrases

[Figure: SGCP architecture]

  • Overview: Architecture of SGCP (proposed method). SGCP aims to paraphrase an input sentence while conforming to the syntax of an exemplar sentence (provided along with the input). The input sentence is encoded by the Sentence Encoder to obtain a semantic signal c_t. The Syntactic Encoder takes the constituency parse tree (pruned at height H) of the exemplar sentence as input and produces representations for all the nodes in the pruned tree. Once both are encoded, the Syntactic Paraphrase Decoder uses a pointer-generator network: at each time step it takes the semantic signal c_t, the decoder recurrent state s_t, the embedding of the previous token, and the syntactic signal h^Y_t to generate a new token. Note that the syntactic signal remains the same for every token in a span (indicated by the curly braces in the figure). The gray shaded region (not part of the model) illustrates a qualitative comparison of the exemplar syntax tree and the syntax tree obtained from the generated paraphrase.
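
To make the roles of these signals concrete, here is a minimal, hypothetical PyTorch sketch of a single decoder step. The class name, tensor names (prev_emb, sem_signal, syn_signal) and all dimensions are illustrative assumptions, not the repository's implementation, and the copy/pointer half of the pointer-generator distribution is omitted.

    # Illustrative sketch only -- NOT the repository's actual decoder.
    # prev_emb, sem_signal (c_t), syn_signal (h^Y_t) and all sizes are assumptions.
    import torch
    import torch.nn as nn

    class ToyDecoderStep(nn.Module):
        def __init__(self, emb_dim=64, hid_dim=128, vocab_size=1000):
            super().__init__()
            # recurrent cell that updates the decoder state s_t
            self.cell = nn.GRUCell(emb_dim + 2 * hid_dim, hid_dim)
            # "generate" half of a pointer-generator: project the state to vocabulary logits
            self.vocab_proj = nn.Linear(hid_dim, vocab_size)
            # gate p_gen that would mix the vocabulary and copy distributions
            self.gen_gate = nn.Linear(hid_dim, 1)

        def forward(self, prev_emb, sem_signal, syn_signal, state):
            # concatenate previous-token embedding, semantic signal and syntactic signal
            step_input = torch.cat([prev_emb, sem_signal, syn_signal], dim=-1)
            state = self.cell(step_input, state)          # new decoder state s_t
            vocab_logits = self.vocab_proj(state)         # generation distribution
            p_gen = torch.sigmoid(self.gen_gate(state))   # copy-vs-generate gate (copy part omitted)
            return vocab_logits, p_gen, state

    # one decoding step with dummy batch-of-1 tensors
    step = ToyDecoderStep()
    logits, p_gen, s = step(torch.zeros(1, 64), torch.zeros(1, 128),
                            torch.zeros(1, 128), torch.zeros(1, 128))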

Dependencies

  • Compatible with PyTorch 1.3.0 and Python 3.x
  • The necessary packages can be installed through requirements.txt

Setup

To get the project's source code, clone the GitHub repository:

$ git clone https://github.com/malllabiisc/SGCP

Install VirtualEnv using the following (optional):

$ [sudo] pip install virtualenv

Create and activate your virtual environment (optional):

$ virtualenv -p python3 venv
$ source venv/bin/activate

Install all the required packages:

$ pip install -r requirements.txt

Create essential folders in the repository using:

$ chmod a+x setup.sh
$ ./setup.sh

Resources

Dataset

  • Download the following dataset(s): Data
  • Extract and place them in the SGCP/data directory

Path: SGCP/data/<dataset-folder-name>.

A sample dataset folder might look like:

data/QQPPos/<train/test/val>/<src.txt/tgt.txt/refs.txt/src.txt-corenlp-opti/tgt.txt-corenlp-opti/refs.txt-corenlp-opti>
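
As a quick, hypothetical sanity check (not part of the repository), you can verify that each split contains the expected files and that their line counts agree. The folder and file names below follow the sample layout above; adjust them for your dataset.

    # Hypothetical sanity check for the dataset layout described above.
    import os

    def check_split(split_dir, names=("src.txt", "tgt.txt", "refs.txt")):
        counts = {}
        for name in names:
            path = os.path.join(split_dir, name)
            if not os.path.isfile(path):
                print("missing file: {}".format(path))
                continue
            with open(path, encoding="utf-8") as f:
                counts[name] = sum(1 for _ in f)
        if len(set(counts.values())) > 1:
            print("line-count mismatch in {}: {}".format(split_dir, counts))

    for split in ("train", "val", "test"):
        check_split(os.path.join("data", "QQPPos", split))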

Pre-trained Models:

  • Download the following pre-trained models for both QQPPos and ParaNMT50m datasets: Models
  • Extract and place them in the SGCP/Models directory

Path: SGCP/Models/<dataset_Models>

Evaluation Essentials

  • Download the evaluation file: evaluation
  • Extract and place it in SGCP/src/evaluation directory
  • Give executable permissions to SGCP/src/evaluation/apps/multi-bleu.perl

Path: SGCP/src/evaluation/<apps/data/ParaphraseDetection>

This contains all the files needed to evaluate the model. It also contains the Paraphrase Detection models used for model-based evaluation.

Training the model

  • For training the model with default hyperparameter settings, execute the following command:

    python -m src.main -mode train -run_name testrun -dataset <DatasetName> -gpu <GPU-ID> -bpe
    
    • -run_name: To specify the name of the run for storing model parameters
    • -dataset: Which dataset to train the model on, choose from QQPPos and ParaNMT50m
    • -gpu: For a multi-GPU machine, specify the ID of the GPU on which you wish to run the code. For a single-GPU machine, simply use 0 as the ID
    • -bpe: To enable byte-pair encoding for tokenizing data.
  • Other hyperparameters can be viewed in src/args.py

Generation and Evaluation

  • For generating paraphrases on the QQPPos dataset, execute the following command:

    python -m src.main -mode decode -dataset QQPPos -run_name QQP_Models -gpu <gpu-num> -beam_width 10 -max_length 60 -res_file generations.txt
    
  • Similarly for ParaNMT dataset:

    python -m src.main -mode decode -dataset ParaNMT50m -run_name ParaNMT_Models -gpu <gpu-num> -beam_width 10 -max_length 60 -res_file generations.txt
    
  • To evaluate BLEU, ROUGE, METEOR, TED and Prec. scores, first clean the generations:

    • For QQPPos:
    python -m src.utils.clean_generations -gen_dir Generations/QQP_Models -data_dir data/QQPPos/test
    -gen_file generations.txt
    
    • For ParaNMT50m:
    python -m src.utils.clean_generations -gen_dir Generations/ParaNMT_Models -data_dir data/ParaNMT50m/test
    -gen_file generations.txt
    
  • Since the model generates multiple paraphrases corresponding to different pruning heights of the syntax tree, select a single generation with:

    python -m src.utils.candidate_selection -gen_dir Generations/QQP_Models
    -clean_gen_file clean_generations.csv -res_file final_paraphrases.txt -crt <SELECTION CRITERIA>
    
    • -crt: Criterion to use for selecting a single generation from the given candidates. Choose 'rouge' for ROUGE-based selection as given in the paper (SGCP-R) and 'maxht' for selecting the generation corresponding to the full height of the tree (SGCP-F). A hypothetical sketch of ROUGE-based selection appears after this list.
  • Finally, to obtain the scores, run:

    • For QQPPos:
    python -m src.evaluation.eval -i Generations/QQP_Models/final_paraphrases.txt
    -r data/QQPPos/test/ref.txt -t data/QQPPos/test/tgt.txt
    
    • For ParaNMT50m:
    python -m src.evaluation.eval -i Generations/ParaNMT_Models/final_paraphrases.txt
    -r data/ParaNMT50m/test/ref.txt -t data/ParaNMT50m/test/tgt.txt
    
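For intuition, the following is a hypothetical sketch of what ROUGE-based candidate selection could look like: for each input, pick the candidate paraphrase with the highest ROUGE-L F1 against the source sentence (scoring against the source is an assumption of this sketch). The repository's actual logic lives in src/utils/candidate_selection.py.

    # Illustrative ROUGE-L based selection (not the repository's implementation).

    def lcs_len(a, b):
        # dynamic-programming longest-common-subsequence length over token lists
        dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
        for i, x in enumerate(a, 1):
            for j, y in enumerate(b, 1):
                dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
        return dp[-1][-1]

    def rouge_l_f1(candidate, reference):
        c, r = candidate.split(), reference.split()
        lcs = lcs_len(c, r)
        if lcs == 0:
            return 0.0
        prec, rec = lcs / len(c), lcs / len(r)
        return 2 * prec * rec / (prec + rec)

    def select_candidate(candidates, source):
        # candidates: paraphrases generated at different pruning heights of the tree
        return max(candidates, key=lambda cand: rouge_l_f1(cand, source))

    print(select_candidate(["how do i improve my english writing ?",
                            "how can i get better at writing ?"],
                           "how do i get better at writing english ?"))
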

Custom Dataset Processing

Preprocess and parse the data using the following steps.

  1. Move the contents of your custom dataset into the data/ directory, with files arranged as follows:

    • data
      • Custom_Dataset
        • train
          • src.txt
          • tgt.txt
        • val
          • src.txt
          • tgt.txt
          • ref.txt
        • test
          • src.txt
          • tgt.txt
          • ref.txt

    Here, src.txt contains the source sentences, tgt.txt contains exemplars and ref.txt contains the paraphrases.

  2. Construct a byte-pair codes file, which will be used to generate byte-pair encodings of the dataset. From the main directory of this repo, run:

     subword-nmt learn-bpe < data/Custom_Dataset/train/src.txt > data/Custom_Dataset/train/codes.txt

     Note: [Optional] To learn codes from both src.txt and tgt.txt, first concatenate the two files and replace src.txt in the command above with the name of the concatenated file (see the sketch below).
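
     For the optional note above, a hypothetical helper that concatenates src.txt and tgt.txt before learning the BPE codes (the combined file name src_tgt.txt is an assumption, not a name used by the repository):

     # Hypothetical concatenation step for learning joint BPE codes.
     src = "data/Custom_Dataset/train/src.txt"
     tgt = "data/Custom_Dataset/train/tgt.txt"
     both = "data/Custom_Dataset/train/src_tgt.txt"   # hypothetical file name

     with open(both, "w", encoding="utf-8") as out:
         for path in (src, tgt):
             with open(path, encoding="utf-8") as f:
                 out.writelines(f)
     # then learn the codes over the combined file:
     # subword-nmt learn-bpe < data/Custom_Dataset/train/src_tgt.txt > data/Custom_Dataset/train/codes.txt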

  3. Parse the data files using Stanford CoreNLP. First start a CoreNLP server by executing the following commands:

cd src/evaluation/apps/stanford-corenlp-full-2018-10-05
java -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -preload tokenize,ssplit,pos,lemma,ner,parse -parse.model /edu/stanford/nlp/models/srparser/englishSR.ser.gz -status_port <PORT_NUMBER> -port <PORT_NUMBER> -timeout 15000
  4. Finally, run the parser on the text files.
cd <PATH_TO_THIS_REPO>
python -m src.utils.con_parser -infile data/Custom_Dataset/train/src.txt -codefile data/Custom_Dataset/train/codes.txt -port <PORT_NUMBER> -host localhost

Here, <PORT_NUMBER> is the port on which the CoreNLP server from step 3 is running. This will generate a file in the train folder called src.txt-corenlp-opti. Run this for all the other files as well, i.e. tgt.txt in the train folder; src.txt, tgt.txt and ref.txt in the val folder; and likewise for the files in the test folder.
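
If you want to confirm that the CoreNLP server from step 3 is reachable before running src.utils.con_parser, a minimal sketch such as the following can request a constituency parse directly over HTTP (assuming the requests package is available; the port value 9000 is only an example, use your <PORT_NUMBER>):

    # Hypothetical check that the CoreNLP server is up and can parse a sentence.
    import json
    import requests  # assumed to be installed

    def constituency_parse(sentence, port=9000):   # 9000 is an example port
        props = {"annotators": "tokenize,ssplit,pos,parse", "outputFormat": "json"}
        resp = requests.post(
            "http://localhost:{}/".format(port),
            params={"properties": json.dumps(props)},
            data=sentence.encode("utf-8"),
        )
        resp.raise_for_status()
        # the JSON response contains one parse string per sentence
        return resp.json()["sentences"][0]["parse"]

    print(constituency_parse("How do I improve my writing skills ?"))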

Citing:

Please cite the following paper if you use this code in your work.

@article{sgcp2020,
author = {Kumar, Ashutosh and Ahuja, Kabir and Vadapalli, Raghuram and Talukdar, Partha},
title = {Syntax-Guided Controlled Generation of Paraphrases},
journal = {Transactions of the Association for Computational Linguistics},
volume = {8},
number = {},
pages = {330-345},
year = {2020},
doi = {10.1162/tacl\_a\_00318},
URL = { https://doi.org/10.1162/tacl_a_00318 },
eprint = { https://doi.org/10.1162/tacl_a_00318 }
}

For any clarification, comments, or suggestions, please create an issue or contact [email protected].
