tingofurro / keep_it_simple

Codebase, data and models for the Keep it Simple paper at ACL2021

License: Apache License 2.0

Languages: Python 70.33%, Jupyter Notebook 4.14%, JavaScript 2.32%, HTML 23.21%
Topics: news, reinforcement-learning, simplification, text-simplification, unsupervised-learning, acl2021, bert

keep_it_simple's Introduction

Keep it Simple (KiS)

This repository contains the code for ACL2021 paper: Keep It Simple: Unsupervised Simplification of Multi-Paragraph Text.

Running the KiS model

From the HuggingFace Hub

The easiest way to use the model is through the hosted Hub model: https://huggingface.co/philippelaban/keep_it_simple. Basic usage looks like this:

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("philippelaban/keep_it_simple")
kis_model = AutoModelForCausalLM.from_pretrained("philippelaban/keep_it_simple")

See the model card for a detailed example.

Manual approach

To simplify text with a trained model, an example script is provided:

python run_keep_it_simple.py --model_card gpt2-medium --model_file /home/phillab/models/ACL2021/gpt2_med_keep_it_simple.bin

The script outputs several candidate simplifications for a given input paragraph, emphasizing the insertions and deletions made by the model using color (green, red).
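
As a rough illustration of that kind of colored output, here is a minimal sketch that marks word-level edits between a paragraph and one candidate simplification using Python's standard difflib and ANSI terminal colors. It is not the repository's implementation: the function name highlight_edits, the word-level granularity, and the mapping of green to insertions and red to deletions are all assumptions made for the example.

```python
import difflib

# ANSI escape codes (assumed mapping: green = inserted, red = deleted).
GREEN, RED, RESET = "\033[92m", "\033[91m", "\033[0m"


def highlight_edits(original: str, simplified: str) -> str:
    """Return one string showing deleted words in red and inserted words in green."""
    src, tgt = original.split(), simplified.split()
    matcher = difflib.SequenceMatcher(a=src, b=tgt, autojunk=False)
    pieces = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            pieces.append(" ".join(src[i1:i2]))
        if tag in ("delete", "replace"):
            # Words present only in the original paragraph.
            pieces.append(RED + " ".join(src[i1:i2]) + RESET)
        if tag in ("insert", "replace"):
            # Words present only in the candidate simplification.
            pieces.append(GREEN + " ".join(tgt[j1:j2]) + RESET)
    return " ".join(pieces)


if __name__ == "__main__":
    print(highlight_edits(
        "The committee convened to deliberate on the proposal.",
        "The committee met to talk about the proposal.",
    ))
```

Running the file in a terminal that supports ANSI colors prints the sentence with the replaced words colored; unchanged words are left as-is.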

In the Keep it Simple Release, we provide a model checkpoint trained with the Keep it Simple procedure that achieves a high average reward on news paragraphs: gpt2_med_keep_it_simple.bin (this is identical to the model hosted on the HuggingFace Hub).

Training

Installation Requirements

The requirements.txt provides the list of pip packages required to use and train models. One must also install a spaCy model:

python -m spacy download en_core_web_sm

You must also manually install the apex library, used for mixed-precision training (see: https://github.com/nvidia/apex), as it is not available on pip.

Training Script

For training, two pre-trained models are needed, which we provide in the Keep it Simple Release:

  • coverage_roberta.bin: A checkpoint compatible with the roberta-base model of the HuggingFace RoBERTa implementation, used for the salience scorer (coverage model).
  • gpt2_med_cp90.bin: A checkpoint compatible with the gpt2-medium model of the HuggingFace GPT2 implementation, used as the initial model for the generator.

Once the packages are installed, and the models are downloaded, the training script can be run:

python train_keep_it_simple.py --experiment initial_run --model_start_file /path/to/gpt2_med_cp90.bin

See the script for additional hyperparameters. With the default hyperparameters provided, the script should converge within 16-24 hours to a model achieving a strong (yet not optimal) score when trained on a single V100 GPU or equivalent.

The provided training script uses CCNews as a rudimentary demonstration dataset. It is not the dataset used to obtain the results in our experiments (we used a larger news corpus that we cannot release due to copyright). We recommend replacing CCNews with in-domain data for better results.

Example Training Run

To help with debugging and reproducibility, we release the log of an example training run of Keep it Simple. It can be accessed as a view-only Wandb report.

Human Evaluation Details

The /study_interface folder contains details from the usability study, including the HTML / JavaScript used during the study and the data file simplification_user_study.json, which holds all model candidate simplifications, the comprehension questions, and the distractors.

Cite the work

If you make use of the code, models, or algorithm, please cite our paper:

@inproceedings{laban2021keep_it_simple,
  title={Keep It Simple: Unsupervised Simplification of Multi-Paragraph Text},
  author={Philippe Laban and Tobias Schnabel and Paul N. Bennett and Marti A. Hearst},
  booktitle={Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics},
  volume={1},
  year={2021}
}

Contributing

If you'd like to contribute, or have questions or suggestions, you can contact us at [email protected]. All contributions are welcome, for example if you have a type of text data to which you want to apply Keep it Simple.

keep_it_simple's Issues

Processing in data collator

Hi Tingofurro,

Thanks for sharing a nice simplification repository.
I have a question about the processing that happens in the data collator:

def cc_news_collate(inps):
    batch_paras = []
    for inp in inps:
        text = inp["text"]
        paragraphs = sorted(text.split("\n"), key=lambda p: abs(p.count(" ")-35))
        batch_paras.append(paragraphs[0])
    return batch_paras

Why are you only appending the largest paragraph (if I am correct) rather than the complete text?

Looking forward to your response.
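
Reading the snippet as quoted, the sort key is the distance between a paragraph's space count and 35, so paragraphs[0] is the paragraph closest to roughly 36 words rather than the longest one. The sketch below replays that selection on made-up text; the helper name closest_to_35_spaces and the sample paragraphs are hypothetical, and it says nothing about why the authors chose this heuristic.

```python
def closest_to_35_spaces(text: str) -> str:
    """Mirror the quoted collator's selection for a single document."""
    paragraphs = sorted(text.split("\n"), key=lambda p: abs(p.count(" ") - 35))
    return paragraphs[0]


# Made-up paragraphs of different lengths.
short_para = "A short headline."              # 2 spaces, distance 33 from 35
long_para = " ".join(["word"] * 120)          # 119 spaces, distance 84 from 35
medium_para = " ".join(["word"] * 36)         # 35 spaces, distance 0 from 35
document = "\n".join([short_para, long_para, medium_para])

chosen = closest_to_35_spaces(document)
print(chosen.count(" "))  # 35: the medium paragraph wins, not the longest one
```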

utils_misc can't find freer GPU

The nvidia-smi parsing code results in an empty sequence.

>>> import utils_misc
>>> utils_misc.select_freer_gpu()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/corey/workspace/keep_it_simple/utils_misc.py", line 24, in select_freer_gpu
    freer_gpu = str(get_freer_gpu())
  File "/home/corey/workspace/keep_it_simple/utils_misc.py", line 11, in get_freer_gpu
    return np.argmax(memory_available)
  File "<__array_function__ internals>", line 200, in argmax
  File "/home/corey/workspace/keep_it_simple/.venv/lib/python3.10/site-packages/numpy/core/fromnumeric.py", line 1242, in argmax
    return _wrapfunc(a, 'argmax', axis=axis, out=out, **kwds)
  File "/home/corey/workspace/keep_it_simple/.venv/lib/python3.10/site-packages/numpy/core/fromnumeric.py", line 54, in _wrapfunc
    return _wrapit(obj, method, *args, **kwds)
  File "/home/corey/workspace/keep_it_simple/.venv/lib/python3.10/site-packages/numpy/core/fromnumeric.py", line 43, in _wrapit
    result = getattr(asarray(obj), method)(*args, **kwds)
ValueError: attempt to get argmax of an empty sequence

Changing to 'grep -A5' appears to work for running the sample on a system with one consumer GPU, but I'm not equipped to evaluate the overall impact of the change.

os.system('nvidia-smi -q -d Memory |grep -A5 GPU|grep Free >tmp_smi')
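
A fixed grep -A window depends on how many lines separate each GPU header from its Free line, and that layout can differ between driver versions. One possible alternative, sketched below, parses the report text by section and falls back to a default index when nothing is found, instead of calling argmax on an empty list. This is not the repository's code: the function names parse_free_memory and pick_freer_gpu are hypothetical, and SAMPLE is an illustrative approximation of `nvidia-smi -q -d Memory` output, not a captured log.

```python
import re
from typing import List


def parse_free_memory(smi_text: str) -> List[int]:
    """Collect the framebuffer 'Free' value (in MiB) for each GPU in the report text."""
    free_mib, in_fb_section = [], False
    for line in smi_text.splitlines():
        stripped = line.strip()
        if stripped.endswith("Memory Usage"):
            # Track only the "FB Memory Usage" section; skip others such as BAR1.
            in_fb_section = stripped.startswith("FB ")
        elif in_fb_section and stripped.startswith("Free"):
            match = re.search(r":\s*(\d+)\s*MiB", stripped)
            if match:
                free_mib.append(int(match.group(1)))
    return free_mib


def pick_freer_gpu(smi_text: str, default: int = 0) -> int:
    """Return the index of the GPU with the most free memory, or `default` if none parsed."""
    free_mib = parse_free_memory(smi_text)
    if not free_mib:
        return default
    return max(range(len(free_mib)), key=free_mib.__getitem__)


# Illustrative report layout (assumed; real output varies by driver version).
SAMPLE = """\
GPU 00000000:01:00.0
    FB Memory Usage
        Total                             : 8192 MiB
        Reserved                          : 200 MiB
        Used                              : 6000 MiB
        Free                              : 1992 MiB
    BAR1 Memory Usage
        Total                             : 256 MiB
        Used                              : 5 MiB
        Free                              : 251 MiB
GPU 00000000:02:00.0
    FB Memory Usage
        Total                             : 8192 MiB
        Reserved                          : 200 MiB
        Used                              : 100 MiB
        Free                              : 7892 MiB
"""

print(parse_free_memory(SAMPLE))
print(pick_freer_gpu(SAMPLE))
```

In practice the text could come from subprocess.run(["nvidia-smi", "-q", "-d", "Memory"], capture_output=True, text=True).stdout, which also avoids writing a temporary tmp_smi file.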
