lmcor's Introduction

LM-Corrector

Small Language Models Improve Giants by Rewriting Their Outputs
Giorgos Vernikos, Arthur Bražinskas, Jakub Adamek, Jonathan Mallinson, Aliaksei Severyn, Eric Malmi
European Chapter of the Association for Computational Linguistics (EACL) 2024


Overview

LMCor is a novel method for enhancing the performance of Large Language Models (LLMs) by leveraging only their outputs. We introduce the LM-Corrector (LMCor), a small model trained to rank, combine, and edit diverse candidate outputs generated by LLMs, consistently outperforming in-context learning and reranking strategies.

This repository contains code for training a T5 (or similar) model as an LM-Corrector or for standard fine-tuning for text generation tasks. These tasks include grammatical error correction, data-to-text generation, summarization, and machine translation.
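As a rough illustration of the correction setup, the sketch below shows how a source text and several LLM candidates could be assembled into a single corrector input. The concatenation scheme, separators, and example strings are assumptions for illustration only; the exact format is defined in train_t5.py.

# Hypothetical sketch of a corrector input; the actual separators and ordering
# used by train_t5.py may differ.
def build_corrector_input(source: str, candidates: list[str]) -> str:
    # The corrector sees the source together with all LLM candidates, so it can
    # rank, combine, and edit them into one improved output.
    parts = [f"source: {source}"]
    parts += [f"candidate {i}: {c}" for i, c in enumerate(candidates)]
    return " ".join(parts)

example = build_corrector_input(
    "He go to school every day.",
    ["He goes to school every day.", "He went to school every day."],
)
# A T5-based corrector is then trained to map such inputs to the reference output.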

Installation

This project requires Python 3.10, PyTorch 1.12.1, and transformers 4.34.0.

It's advisable to set up a separate environment for this project and install the necessary dependencies:

conda create -n lmcor python=3.10
conda activate lmcor
pip install -r requirements.txt

Datasets

LMCor is evaluated on various tasks and datasets:

  • Grammatical Error Correction: CoNLL-14
  • Data-to-Text Generation: E2E NLG (cleaned)
  • Summarization: XSum
  • Machine Translation: WMT22 En->De

The code loads the E2E and XSum datasets via the HuggingFace Datasets library. For WMT22 En->De, you need to manually download the validation and test sets from sacreBLEU and store them in the data/wmt22/en-de/ folder with filenames validation.<x> and test.<x>, where <x> = en, de. For training, 200k sentences are sampled from News Commentary v16 (available from statmt.org).
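For the Hub-hosted datasets, loading looks roughly like the following (a minimal sketch using the public dataset IDs e2e_nlg_cleaned and xsum; the repository's scripts handle this internally):

# Minimal sketch of loading the HuggingFace-hosted datasets used in the paper.
# Only the dataset IDs are relevant; the repo scripts do the actual loading.
from datasets import load_dataset

e2e = load_dataset("e2e_nlg_cleaned")   # data-to-text generation
xsum = load_dataset("xsum")             # abstractive summarization

print(e2e["train"][0])
print(xsum["validation"][0])

# The WMT22 En->De sets are not loaded from the Hub; they can be fetched with
# sacreBLEU (e.g. sacrebleu -t wmt22 -l en-de --echo src) and stored under
# data/wmt22/en-de/ as described above.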

Training LMCor

LLM-generated candidates

To train an LM-Corrector, you first need predictions on the training and validation sets from a large language model (LLM). These files are assumed to be stored in the corresponding data/<task>/ folder as train_[llm_name] and validation_[llm_name]. In this project, we used the greedily decoded output along with 4 sampled outputs from the LLM. You can edit the filenames in the train_t5.py script via the FILE_SAMPLE and FILE_GREEDY global variables. A sketch of how such candidates could be produced is shown below.
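If you need to produce the candidate files yourself, the following sketch shows one way to obtain a greedy output plus 4 samples per input with the transformers library. The model name, prompt, and one-output-per-line file format are illustrative assumptions, not the exact setup used in the paper.

# Hypothetical sketch for generating LLM candidates (greedy + 4 samples).
# Model choice, prompting, and output format are illustrative assumptions.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "google/flan-t5-large"  # placeholder; the paper uses a much larger LLM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

def generate_candidates(text, num_samples=4, max_new_tokens=64):
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    # One deterministic candidate via greedy decoding.
    greedy = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    # Additional diverse candidates via nucleus sampling.
    sampled = model.generate(
        **inputs, max_new_tokens=max_new_tokens,
        do_sample=True, top_p=0.9, num_return_sequences=num_samples,
    )
    decode = lambda ids: tokenizer.batch_decode(ids, skip_special_tokens=True)
    return decode(greedy), decode(sampled)

greedy, samples = generate_candidates("summarize: The quick brown fox jumped over the lazy dog.")
# Write the greedy and sampled outputs to the files referenced by FILE_GREEDY and
# FILE_SAMPLE in train_t5.py (assumed format: one output per line, aligned with
# the dataset order).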

To train the corrector execute the train_t5.py script:

python train_t5.py --task xsum --corrector --bsize 8 --grad_acc_steps 16 --output_dir lmcor_xsum

Note: to train a standard T5 model, remove the --corrector flag.

To change the directory where the HuggingFace models are stored, edit the MODELS_DIR global variable in the t5_utils.py script.

Evaluate LMCor

To obtain predictions from the corrector, use the eval_t5.py script:

python eval_t5.py --task xsum --corrector --ckpt lmcor_xsum --split test --bsize 32 

The outputs of LMCor will be saved in the model folder in the file model_preds.txt.

Finally, to compute scores for various metrics, run the compute_textgen_metrics.py script:

python compute_textgen_metrics.py --task xsum --hyp lmcor_xsum/model_preds.txt
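The same kind of scores can also be computed independently, for instance ROUGE for XSum with the evaluate library. This is a sketch, not the implementation inside compute_textgen_metrics.py, and the reference source and paths are assumptions:

# Illustrative ROUGE computation on the corrector's predictions; not the code
# inside compute_textgen_metrics.py. Paths and reference source are assumptions.
import evaluate
from datasets import load_dataset

predictions = [line.strip() for line in open("lmcor_xsum/model_preds.txt")]
references = load_dataset("xsum", split="test")["summary"][: len(predictions)]

rouge = evaluate.load("rouge")
print(rouge.compute(predictions=predictions, references=references))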

Reference

Please feel free to cite our paper if you use our code or the proposed method:

@inproceedings{vernikos-etal-2024-small,
    title = "Small Language Models Improve Giants by Rewriting Their Outputs",
    author = "Vernikos, Giorgos  and Brazinskas, Arthur  and Adamek, Jakub  and Mallinson, Jonathan  and Severyn, Aliaksei  and Malmi, Eric",
    booktitle = "Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = mar,
    year = "2024",
    address = "St. Julian{'}s, Malta",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.eacl-long.165",
}

Contact

Please feel free to raise an issue or contact me in case you require any help setting up the repo!

lmcor's Issues

LLM-generated candidates

Hi, I am a beginner in NLP. Should I download a dataset myself (for example XSum), run it through the large model to get its outputs, and put those into the train_[llm_name] and validation_[llm_name] files? What format should they be in?
