Code of ACL 2024 Findings paper: Towards Better Utilization of Multi-Reference Training Data for Chinese Grammatical Error Correction

Home Page: https://github.com/ymliucs/Publications/blob/main/MrGEC(ACL'24%20Findings).pdf

License: MIT License

Towards Better Utilization of Multi-Reference Training Data for Chinese Grammatical Error Correction

Yumeng Liu, Zhenghua Li✉️, Haochen Jiang, Bo Zhang, Chen Li, Ji Zhang

Abstract

This repo contains the code for our ACL 2024 Findings paper: Towards Better Utilization of Multi-Reference Training Data for Chinese Grammatical Error Correction.

Set up

  1. Prepare the conda environment for MrGEC
conda create -n mrgec python==3.10.10
conda activate mrgec
pip install -r requirements.txt
python -m spacy download en
  2. Prepare the conda environment for the evaluation tool ChERRANT
conda create -n cherrant python==3.8
conda activate cherrant
pip install -r utils/ChERRANT/requirements.txt
  3. Download the pre-trained models
python utils/download.py --repo_id HillZhang/pseudo_native_bart_CGEC
python utils/download.py --repo_id fnlp/bart-large-chinese

Before running, preprocess each instance into the following format:

S   [src]
T   [tgt1]
T   [tgt2]
T   [tgt3]

S   [src]
T   [tgt1]
T   [tgt2]

Where [src] and [tgt] are the source and target sentences, respectively. A \t separates the prefix S or T from the sentence, and instances are separated by a blank line.
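As a quick sanity check on preprocessed files, the format above can be read back with a few lines of Python. This is a minimal sketch, not code from this repository: the function name read_instances and the Instance type are hypothetical, and it assumes every instance starts with one S line followed by its T lines.

```python
from typing import Iterable, List, Optional, Tuple

# Hypothetical container: (source sentence, [target1, target2, ...])
Instance = Tuple[str, List[str]]


def read_instances(lines: Iterable[str]) -> List[Instance]:
    """Parse blank-line-separated S/T blocks into (source, targets) pairs."""
    instances: List[Instance] = []
    src: Optional[str] = None
    tgts: List[str] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current instance, if there is one.
            if src is not None:
                instances.append((src, tgts))
            src, tgts = None, []
            continue
        # The prefix (S or T) and the sentence are separated by a tab.
        prefix, _, sentence = line.partition("\t")
        if prefix == "S":
            src = sentence
        elif prefix == "T":
            tgts.append(sentence)
        else:
            raise ValueError(f"unexpected prefix: {prefix!r}")
    if src is not None:  # the input may not end with a blank line
        instances.append((src, tgts))
    return instances
```

Passing an open file object (for example, open(path, encoding="utf-8")) as lines works, since the function only iterates over it line by line.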

Handle data leakage

We find that FCGEC-Train and NaSGEC-Exam/NaCGEC have a severe data leakage problem. The code in utils/handle_data_leakage_tool can handle any Chinese GEC dataset that has this problem.

All datasets must first be converted to the following format:

[idx] [src] [tgt1] [tgt2] ... 

Where [idx] is the index number of an instance, starting from 1, and the sentences are separated by \t.
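To illustrate how this one-line format relates to the S/T training format, the sketch below converts a single line into an S/T block. It is not part of the repository: para_line_to_block is a hypothetical helper, and it assumes that [idx] is separated from [src] by a tab as well.

```python
def para_line_to_block(line: str) -> str:
    """Convert one tab-separated index/source/targets line into an S/T block."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 2:
        raise ValueError("expected at least an index and a source sentence")
    # The index is dropped; the S/T format does not carry it.
    _idx, src, *tgts = fields
    rows = [f"S\t{src}"] + [f"T\t{tgt}" for tgt in tgts]
    return "\n".join(rows)
```

Joining the returned blocks with a blank line between them gives a file in the S/T layout.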

Usage example:

python handle_leakage.py --data_dir data/ns_original --out_dir data/ns_leakage_processed --train_file FCGEC_train_filtered.para --extract_test_files nasgec.exam.para,nacgec.all.para  --frozen_test_files fcgec.dev.para,fcgec.test.para

python handle_leakage.py --data_dir data/ns_original --out_dir data/ns_leakage_processed --train_file FCGEC_train_filtered.para --extract_test_files nasgec.exam.para

python handle_leakage.py --data_dir data/ns_original --out_dir data/ns_leakage_processed --train_file FCGEC_train_filtered.para --frozen_test_files nasgec.exam.para
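This README does not describe how handle_leakage.py detects leakage or what each flag does, so the snippet below is only a generic illustration of a train/test overlap check, not the tool's actual logic. It assumes exact string matching on the [src] field of tab-separated lines; the names source_of and leaked_indices are hypothetical.

```python
from typing import Iterable, List, Set


def source_of(line: str) -> str:
    """Return the source sentence (second tab-separated field) of a line."""
    return line.rstrip("\n").split("\t")[1]


def leaked_indices(train: Iterable[str], test: Iterable[str]) -> List[str]:
    """List the indices of training lines whose source also occurs in test."""
    # Collect every test source once so that each lookup is O(1).
    test_sources: Set[str] = {source_of(l) for l in test if l.strip()}
    return [
        l.split("\t")[0]
        for l in train
        if l.strip() and source_of(l) in test_sources
    ]
```

A real check would probably also normalize whitespace and punctuation before comparing, which this sketch leaves out.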

Download data

You can download all the data we use here.

Run

You can see all the commands for running our experiments in the bash folder, and the hyperparameters can be set in the configs folder.

Examples:

bash bash/run_lang8_cat.sh
bash bash/run_lang8_avgl_minl.sh
bash bash/run_fcgec_cat.sh
bash bash/run_fcgec_avgl_minl.sh

You can download and check all the logs of our experiments here.

Acknowledgements

  1. This repository is built entirely on SuPar.
  2. We use ChERRANT for all evaluation.

Citation

If you find this repo helpful, please cite the following paper:

@inproceedings{liu-etal-2024-towards,
    title = "{Towards Better Utilization of Multi-Reference Training Data for Chinese Grammatical Error Correction}",
    author = "Liu, Yumeng  and
      Li, Zhenghua  and
      Jiang, Haochen  and
      Zhang, Bo  and
      Li, Chen  and
      Zhang, Ji",
    booktitle = "Findings of ACL",
    year = "2024",
}
