
salt-nlp / formalitystyletransfer

Code for "Semi-supervised Formality Style Transfer using Language Model Discriminator and Mutual Information Maximization"

License: MIT License

styletransfer languagemodel semi-supervised-learning mutual-information textgeneration formality

formalitystyletransfer's Introduction

Code for Semi-supervised Formality Style Transfer

Kunal Chawla, Diyi Yang: Semi-supervised Formality Style Transfer using Language Model Discriminator and Mutual Information Maximization. In Findings of the 2020 Conference on Empirical Methods in Natural Language Processing (Findings of EMNLP 2020)

If you would like to refer to it, please cite the paper mentioned above.

Getting Started

Following are the instructions to get started with the code.

Requirements

  • Python 3.6 or higher
  • PyTorch >= 1.2.0 (preferably with CUDA support)
  • nltk

Code Structure

|__ fairseq/
        |__ models
            |__ BART
            |__ model.py                            Model file
        |__ criterions
            |__ classification.py                   Pre-training the discriminator
            |__ label_smoothed_cross_entropy.py     Training the main model
        |__ data
            |__ language_pair_dataset.py            Dataset processing
        |__ trainer.py                              Training helper code
        |__ options.py                              Options and default values

|__ fairseq_cli/
        |__ train.py                                Main training code
        |__ generate.py                             Generation code
|__ preprocess.sh                                   Data preprocessing
|__ pipeline.sh                                     Training script

Build the code

The code is based on Fairseq (https://github.com/pytorch/fairseq). To build it, run

pip install --editable .

Further instructions can be found on the official Fairseq page.
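
For example, a typical editable install from a fresh clone might look like this (the repository URL below is taken from the issue links further down; adjust if your clone lives elsewhere):

# clone the repository and install it (with its bundled fairseq) in editable mode
git clone https://github.com/GT-SALT/FormalityStyleTransfer.git
cd FormalityStyleTransfer
pip install --editable .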

Dataset and Pre-Processing

The GYAFC dataset (Grammarly's Yahoo Answers Formality Corpus) is available on request from here. Please download it and place it in the root directory.

To preprocess the dataset, run

bash preprocess.sh [options]

We follow the same preprocessing steps as Fairseq's BART model; follow those instructions to convert the data into the binary format used for training.
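
As a rough sketch of that recipe (following fairseq's public BART fine-tuning example rather than this repository's preprocess.sh, which remains authoritative, and using placeholder data/ paths), the data is BPE-encoded and then binarized:

# 1. Fetch the GPT-2 BPE vocabulary used by BART (URLs from the fairseq BART examples)
wget -N 'https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/encoder.json'
wget -N 'https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/vocab.bpe'
wget -N 'https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/dict.txt'

# 2. BPE-encode each split (placeholder paths: data/{train,val}.{source,target})
for SPLIT in train val; do
  for LANG in source target; do
    python -m examples.roberta.multiprocessing_bpe_encoder \
      --encoder-json encoder.json \
      --vocab-bpe vocab.bpe \
      --inputs "data/$SPLIT.$LANG" \
      --outputs "data/$SPLIT.bpe.$LANG" \
      --workers 8 \
      --keep-empty
  done
done

# 3. Binarize into the format fairseq-train expects
fairseq-preprocess \
  --source-lang source --target-lang target \
  --trainpref data/train.bpe \
  --validpref data/val.bpe \
  --destdir data-bin/ \
  --srcdict dict.txt --tgtdict dict.txt \
  --workers 8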

Training

Download the pre-trained "bart.large" model. To start training, run

bash pipeline.sh [options]

All parameters and their default values are listed in fairseq/options.py; for the settings used in our experiments, refer to the paper and its appendix.
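
The "bart.large" checkpoint is the one released with fairseq; a typical way to fetch and unpack it (URL from fairseq's BART release, correct at the time of writing) is:

# download and unpack the pre-trained BART-large checkpoint released with fairseq
wget 'https://dl.fbaipublicfiles.com/fairseq/models/bart.large.tar.gz'
tar -xzvf bart.large.tar.gz    # extracts bart.large/model.pt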

Evaluation and Outputs

For generation, run

python evaluation/gen.py

Some folder paths may need to be changed depending on configuration. For evaluation and BLEU scores, run

python evaluation/calc_score.py path_to_output_file

Outputs from our model and several baselines are provided in evaluation/outputs. As reported in Table 4 of the paper, we include outputs for Hybrid Annotations, Pretrained w/ rules, Ours, and Target. "_family" refers to the F&R (Family & Relationships) domain and "_music" to the E&M (Entertainment & Music) domain.
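
For example, scoring one of the provided output files could look like this (the file name below is hypothetical; check evaluation/outputs for the actual names):

# hypothetical example: compute BLEU for one of the provided output files
python evaluation/calc_score.py evaluation/outputs/ours_family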


formalitystyletransfer's Issues

What does "plain" mean in preprocess.sh ?

Hi, thanks for sharing your code with us.

I wonder what the SPLIT "plain" means in preprocess.sh:

https://github.com/GT-SALT/FormalityStyleTransfer/blob/a86d287d0c48238f7cd39f6f34b465b0b7ccb2f4/preprocess.sh#L6

https://github.com/GT-SALT/FormalityStyleTransfer/blob/a86d287d0c48238f7cd39f6f34b465b0b7ccb2f4/preprocess.sh#L26

Also, the preprocess.sh script requires files named like "train.source" and "val.target". The GYAFC dataset does not follow this format, however. Should I reformat the original files into the required format? In that case, which files should be renamed to "plain.source" and "plain.target"?

File missing

Hi,

Could you please provide the file language_pair_dataset.py for data preprocessing? Thank you.

raise StopIteration StopIteration

In the module fairseq/data/iterators.py, line 49 raises a "StopIteration" error, which is not compatible with the updated fairseq. This suggests a conflict between the fairseq folder in this repository and pytorch/fairseq. How did you run this code? Please write a clear description of the process for running it.

(A screenshot of the error was attached to the original issue.)
