Formatting data about neuqe (CLOSED)

nusnlp commented on August 26, 2024
Formatting data

from neuqe.

Comments (3)

shamilcm commented on August 26, 2024

Currently, we have only released the source code of the toolkit neuqe, along with pre-trained models and a sample evaluation dataset.

The training datasets for GEC QE need to be obtained from their respective owners and processed following the description in our paper. Sections 5.1, 5.2, and 5.3 of our paper (https://aclweb.org/anthology/D18-1274) describe how we obtained the datasets for training, validating, and testing the predictor and estimator models, and how we prepared the GEC system that generates the system outputs used to train and evaluate the estimator.

For training the predictor, you should have the following files:
$TRAIN_PATH_PREFIX.src and $TRAIN_PATH_PREFIX.trg, and $VALID_PATH_PREFIX.src and $VALID_PATH_PREFIX.trg,
where $TRAIN_PATH_PREFIX is the prefix of the path to the training dataset
and $VALID_PATH_PREFIX is the prefix of the path to the validation dataset.
The .src file contains the erroneous source sentences and the .trg file contains the corresponding human-corrected target sentences.
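
Because the .src and .trg files are described as corresponding sentence by sentence, a quick sanity check before training is to confirm that both files exist and have the same number of lines. The following is a minimal sketch of such a check; it is not part of neuqe, and the helper names `count_lines` and `check_parallel` are hypothetical. It assumes one sentence per line and UTF-8 encoding.

```python
from pathlib import Path


def count_lines(path):
    """Return the number of lines in a UTF-8 text file."""
    with open(path, encoding="utf-8") as f:
        return sum(1 for _ in f)


def check_parallel(prefix, suffixes=("src", "trg")):
    """Check that PREFIX.<suffix> files exist and have equal line counts.

    Assumption: one sentence per line in every file.
    Returns the shared line count; raises on a missing file or a mismatch.
    """
    counts = {}
    for suffix in suffixes:
        path = Path(f"{prefix}.{suffix}")
        if not path.is_file():
            raise FileNotFoundError(f"missing file: {path}")
        counts[suffix] = count_lines(path)
    if len(set(counts.values())) != 1:
        raise ValueError(f"line counts differ for {prefix}: {counts}")
    return next(iter(counts.values()))
```

For example, `check_parallel(train_prefix)` and `check_parallel(valid_prefix)` could be run on the values you plan to use for $TRAIN_PATH_PREFIX and $VALID_PATH_PREFIX.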

For training the estimator, you should have the following files:
$QE_TRAIN_DATA_PATH_PREFIX.src, $QE_TRAIN_DATA_PATH_PREFIX.hyp, and $QE_TRAIN_DATA_PATH_PREFIX.$SCORE_SUFFIX,
where $QE_TRAIN_DATA_PATH_PREFIX is the prefix of the path to the estimator training dataset, .src contains the erroneous source sentences, .hyp contains the GEC output sentences, and $SCORE_SUFFIX is the suffix (file extension) of the file containing the sentence-level scores (one score per line). For example, if the estimator training dataset files are train.src, train.hyp, and train.hter, then QE_TRAIN_DATA_PATH_PREFIX=train and SCORE_SUFFIX=hter.
Similarly, you need the corresponding $QE_VALID_DATA_PATH_PREFIX files for validation.
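
To illustrate how the three estimator files line up, here is a small sketch that reads them into (source, hypothesis, score) triples. It is not taken from neuqe; the function names `read_lines` and `load_qe_data` are hypothetical, and it assumes UTF-8 files with one sentence or one plain numeric score per line.

```python
def read_lines(path):
    """Read a UTF-8 text file into a list of lines without trailing newlines."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]


def load_qe_data(prefix, score_suffix):
    """Load (source, hypothesis, score) triples for estimator data.

    Assumption: PREFIX.src, PREFIX.hyp and PREFIX.<score_suffix> are
    line-aligned, and each score line parses as a float.
    """
    sources = read_lines(f"{prefix}.src")
    hyps = read_lines(f"{prefix}.hyp")
    scores = [float(s) for s in read_lines(f"{prefix}.{score_suffix}")]
    if not (len(sources) == len(hyps) == len(scores)):
        raise ValueError(
            f"misaligned files for {prefix}: "
            f"{len(sources)} src, {len(hyps)} hyp, {len(scores)} scores"
        )
    return list(zip(sources, hyps, scores))
```

With the file names from the example above, `load_qe_data("train", "hter")` would read train.src, train.hyp, and train.hter.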

We have provided an example validation dataset for training the estimator model: https://github.com/nusnlp/neuqe/tree/master/examples/gec_emnlp18/data

The validation data consists of sentences from the development portion of CLC-FCE and CoNLL-2013 test sets (mentioned in Section 5.2 of the paper).

For further assistance with data processing and/or any queries on the methodology in the paper, please email me at [email protected].
If you face any issues with the toolkit neuqe, please report them as an issue in the GitHub repo.

gcunhase commented on August 26, 2024

Thank you for your reply. Is there a pre-processing script that I can use after downloading the datasets from their respective owners?

shamilcm commented on August 26, 2024

Unfortunately, I do not have a consolidated script to process the datasets. The preprocessing pipeline for the Lang-8 dataset follows https://github.com/nusnlp/mlconvgec2018/blob/master/data/prepare_data.sh.

Email me if you have further questions about the paper or the data preprocessing.
