TLDR; Is there a data formatting available? I've downloaded C


Formatting data about neuqe HOT 3 CLOSED

nusnlp commented on August 26, 2024

Formatting data


Comments (3)

shamilcm commented on August 26, 2024

Currently, we have only released the source code of the toolkit neuqe, along with pre-trained models and a sample evaluation dataset.

The training datasets for GEC QE need to be obtained from their respective owners and processed following the description in our paper. Sections 5.1 and 5.2 of our paper (https://aclweb.org/anthology/D18-1274) describe how we obtained the datasets for training, validating, and testing the predictor and estimator models, and how we prepared the GEC system that generates the system outputs used to train and evaluate the estimator.

To train the predictor, you need the following files:
$TRAIN_PATH_PREFIX.src and $TRAIN_PATH_PREFIX.trg, and $VALID_PATH_PREFIX.src and $VALID_PATH_PREFIX.trg,
where $TRAIN_PATH_PREFIX is the prefix of the path to the training dataset and $VALID_PATH_PREFIX is the prefix of the path to the validation dataset.
The .src file contains the erroneous source sentences, and the .trg file contains the corresponding human-corrected target sentences.
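
Before training, it may help to confirm that the predictor files are in place and line up. The sketch below is not part of neuqe; `check_parallel` and `count_lines` are hypothetical helper names, and it assumes one sentence per line, with line i of the .src file paired with line i of the .trg file.

```python
from pathlib import Path


def count_lines(path):
    """Count the lines in a UTF-8 text file."""
    with open(path, encoding="utf-8") as f:
        return sum(1 for _ in f)


def check_parallel(prefix, suffixes=("src", "trg")):
    """Return the shared line count of the prefix.<suffix> files.

    Raises FileNotFoundError if a file is missing and ValueError if the
    files do not all have the same number of lines.
    (Hypothetical helper, not a neuqe function.)
    """
    counts = {}
    for suffix in suffixes:
        path = Path(f"{prefix}.{suffix}")
        if not path.is_file():
            raise FileNotFoundError(path)
        counts[suffix] = count_lines(path)
    if len(set(counts.values())) != 1:
        raise ValueError(f"line counts differ for {prefix}: {counts}")
    return next(iter(counts.values()))
```

Calling `check_parallel(train_prefix)` and `check_parallel(valid_prefix)` with the values of $TRAIN_PATH_PREFIX and $VALID_PATH_PREFIX would catch a missing or misaligned file before a training run starts.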

To train the estimator, you need the following files:
$QE_TRAIN_DATA_PATH_PREFIX.src, $QE_TRAIN_DATA_PATH_PREFIX.hyp, and $QE_TRAIN_DATA_PATH_PREFIX.$SCORE_SUFFIX,
where $QE_TRAIN_DATA_PATH_PREFIX is the prefix of the path to the estimator training dataset, .src contains the erroneous source sentences, .hyp contains the GEC output sentences, and $SCORE_SUFFIX is the suffix (file extension) of the file containing the sentence-level scores (one score per line). For example, if the estimator training dataset files are train.src, train.hyp, and train.hter, then QE_TRAIN_DATA_PATH_PREFIX=train and SCORE_SUFFIX=hter.
The validation files follow the same layout, with $QE_VALID_DATA_PATH_PREFIX as the prefix.
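
To make the estimator layout concrete, here is a minimal sketch that reads the three files into (source, hypothesis, score) triples. It is not neuqe's own loader; `load_qe_examples` is a hypothetical name, and the sketch assumes that every line of the score file parses as a single floating-point number and that the three files are aligned line by line.

```python
def load_qe_examples(prefix, score_suffix):
    """Read prefix.src, prefix.hyp, and prefix.<score_suffix> into a list
    of (source, hypothesis, score) tuples.

    (Hypothetical helper, not a neuqe function.)
    """
    def read_lines(suffix):
        with open(f"{prefix}.{suffix}", encoding="utf-8") as f:
            return [line.rstrip("\n") for line in f]

    sources = read_lines("src")
    hypotheses = read_lines("hyp")
    # One sentence-level score per line, parsed as a float (assumption).
    scores = [float(value) for value in read_lines(score_suffix)]

    if not (len(sources) == len(hypotheses) == len(scores)):
        raise ValueError(
            f"misaligned files for {prefix}: {len(sources)} src, "
            f"{len(hypotheses)} hyp, {len(scores)} scores"
        )
    return list(zip(sources, hypotheses, scores))
```

With the file names from the example above, `load_qe_examples("train", "hter")` would read train.src, train.hyp, and train.hter.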

We have provided an example validation dataset for training the estimator model: https://github.com/nusnlp/neuqe/tree/master/examples/gec_emnlp18/data

The validation data consists of sentences from the development portion of CLC-FCE and the CoNLL-2013 test set (mentioned in Section 5.2 of the paper).

For further assistance with data processing and/or any queries on the methodology in the paper, please email me at [email protected].
If you face any issues with the neuqe toolkit, please report them as an issue in this GitHub repo.


gcunhase commented on August 26, 2024

Thank you for your reply. Is there a pre-processing script that I can use after downloading the datasets from their respective owners?


shamilcm commented on August 26, 2024

Unfortunately, I do not have a consolidated script to process the dataset. The preprocessing pipeline for the Lang-8 dataset follows https://github.com/nusnlp/mlconvgec2018/blob/master/data/prepare_data.sh.

Email me if you have further questions about the paper or the data preprocessing.

