
awasthiabhijeet / pie

226 stars · 9 watchers · 40 forks · 2.43 MB

Fast + Non-Autoregressive Grammatical Error Correction using BERT. Code and Pre-trained models for paper "Parallel Iterative Edit Models for Local Sequence Transduction": www.aclweb.org/anthology/D19-1435.pdf (EMNLP-IJCNLP 2019)

License: MIT License

Python 33.13% Shell 1.36% Macaulay2 65.51%
grammatical-error-correction sequence-transduction bert bert-model bert-models natural-language-processing sequence-editing sequence-labeling nlp post-editing

pie's People

Contributors

awasthiabhijeet, gurunathparasaram


pie's Issues

About Synthetic training

Hi, I am working on PIE-BERT-Base. Using the released synthetic data (specifically, the a2 set), I performed 2 epochs of training on a2 and then fine-tuned on Lang-8 + FCE + NUCLE for 2 epochs, using PIE-Base. I found that the synthetic training did not improve the model.
During the synthetic training stage, I observed that the loss started to fall and then jumped between about 10 and 14 until the 2 epochs of training finished. I used the pickles you mentioned in another issue during both synthetic training and fine-tuning. Hyperparameters were copied from Appendix A.5 for PIE-Base.

  • I wonder if you have experimental results showing that synthetic training boosts PIE-Base, and by how many F-0.5 points? In the ablation study part of the paper, there is only one result for PIE-Base (56.6). I would like to know by how many F-0.5 points that result was boosted by synthetic training, and whether you observed the same loss behavior during synthetic training.

  • Is there anything else to note about synthetic training?

Thank you so much.

Which source of correct sentences did you use to make the errorful sentences?

Hi, you mentioned in the README that in order to construct errorful sentences we need to specify the path to a file of correct sentences along with an output path. My question is: from which source did you extract the correct sentences used to form the erroneous dataset provided in the repository? I also want to construct an erroneous dataset of preposition errors, but first I need a correct dataset for that. Kindly also share your suggestions on how I can proceed in constructing a dataset with just preposition errors.
Thanks in advance.
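One simple way to synthesize such a corpus, sketched here as an assumption rather than as the repository's actual method, is to corrupt correct sentences by swapping prepositions (all names below are illustrative):

    import random

    # Hypothetical helper, not part of the pie repository.
    PREPOSITIONS = ["in", "on", "at", "for", "of", "to", "with", "by"]

    def add_preposition_error(sentence, p=0.5):
        # With probability p, replace one preposition in the sentence
        # with a different preposition from the list above.
        tokens = sentence.split()
        idxs = [i for i, t in enumerate(tokens) if t.lower() in PREPOSITIONS]
        if idxs and random.random() < p:
            i = random.choice(idxs)
            choices = [q for q in PREPOSITIONS if q != tokens[i].lower()]
            tokens[i] = random.choice(choices)
        return " ".join(tokens)

    print(add_preposition_error("She waited at the station for an hour.", p=1.0))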

What is the use case for last_dot_first_capital?

What is the use case for the following check:

def last_dot_first_capital(text):

If the sentence ends in a capitalized word followed by a dot, the two are treated as a single token, and another dot is appended:
"My name is John." -> "My", "name", "is", "John.", "."

The strange thing is that this is only done for ".", but not for "!" or "?".

@awasthiabhijeet, can you tell me what the purpose of this was?
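For reference, a minimal reconstruction of what such a check could look like (only the function name comes from the repository; the body below is an assumption):

    import re

    def last_dot_first_capital(text):
        # Assumed behavior: True when the text ends in a capitalized
        # word immediately followed by a period, e.g. "... John."
        words = text.split()
        return bool(words) and re.match(r"[A-Z]\w*\.$", words[-1]) is not None

    print(last_dot_first_capital("My name is John."))  # True
    print(last_dot_first_capital("my name is john."))  # False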

Try to export estimator for more efficient deployment

Hi,
I noticed that using estimator.evaluate is a very inefficient way to run this model on new data points. The best method I know of is to export the model with estimator.export_savedmodel.

I tried adding a few flags and these lines of code:

    if FLAGS.do_export:
        estimator._export_to_tpu = False
        estimator.export_savedmodel(FLAGS.export_dir, serving_input_fn)

where serving_input_fn is:

    def serving_input_fn():
        edit_sequence = tf.placeholder(tf.int32, [None, FLAGS.max_seq_length], name='edit_sequence')
        input_ids = tf.placeholder(tf.int32, [None, FLAGS.max_seq_length], name='input_ids')
        input_mask = tf.placeholder(tf.int32, [None, FLAGS.max_seq_length], name='input_mask')
        segment_ids = tf.placeholder(tf.int32, [None, FLAGS.max_seq_length], name='segment_ids')
        input_fn = tf.estimator.export.build_raw_serving_input_receiver_fn({
            'edit_sequence': edit_sequence,
            'input_ids': input_ids,
            'input_mask': input_mask,
            'segment_ids': segment_ids,
        })()
        return input_fn

However, I was unable to do so. I think this would be a very good addition to your code, but I get an error: ValueError: Couldn't find trained model at PIE_ckpt.

I'm very confused by this error.
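For context, once an export like this succeeds, the resulting SavedModel can typically be loaded for fast inference with the TF 1.x predictor utility. A minimal sketch under that assumption (the path and batch shapes are illustrative, and max_seq_length is assumed to be 128):

    import numpy as np
    from tensorflow.contrib import predictor

    # Load the SavedModel produced by estimator.export_savedmodel.
    predict_fn = predictor.from_saved_model("exported_model_dir")

    # Feed the four placeholders defined in serving_input_fn above.
    batch = np.zeros((1, 128), dtype=np.int32)
    outputs = predict_fn({
        "edit_sequence": batch,
        "input_ids": batch,
        "input_mask": batch,
        "segment_ids": batch,
    })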

Bad results trying to train using end_to_end.sh

Hi,

I am trying to run end_to_end.sh, following the instructions in the example_scripts README; however, I am getting bad predictions and therefore cannot reproduce anywhere near the 26.6 F-score mentioned.

I have corrected the typo in end_to_end.sh (./m2eval.sh --> ./m2_eval.sh) and am using https://storage.googleapis.com/bert_models/2018_10_18/cased_L-24_H-1024_A-16.zip on my end, so the code runs, but for example I see results like this:

Test data: Keeping the Secret of Genetic Testing

Prediction round 1: . . . . . . . keeping on a on for the Good , and Secret . . So . . , and of , the , and the , and genetic , and , and for the Testing . . . . . . . . [SEP] . . . . . . . [SEP] .

Any tips on where I'm going wrong?

Could you please release the script for producing the synthetic data?

Hi dear authors of PIE, I am a student from Peking University, currently studying the NAR-GEC task. This is excellent and impressive work. To reproduce your results, I would like to know how many sentence pairs from the One-Billion-Word corpus were used to pre-train the model. I would also be very grateful if you would release the script for producing the synthetic data. My email is [email protected]. Thank you so much.

Attention mask for computation of replace and append operation

Hi, you mentioned in the paper that r_i^l is computed over h_j^l for all j except i, while a_i^l is computed over h_j^l for all j including i.
Why is there such a difference, i.e. why can't we use information about the current token x_i for the replace operation, while the append operation does have access to the current token?
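As a concrete illustration of the masking being asked about, here is a minimal sketch based purely on the description above (it is not code from the repository):

    import numpy as np

    def replace_mask(seq_len):
        # Replace logits r_i^l attend to h_j^l for all j != i:
        # zero out the diagonal so position i cannot see itself.
        mask = np.ones((seq_len, seq_len), dtype=np.float32)
        np.fill_diagonal(mask, 0.0)
        return mask

    def append_mask(seq_len):
        # Append logits a_i^l attend to h_j^l for all j, including j == i.
        return np.ones((seq_len, seq_len), dtype=np.float32)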

Releasing weights

Hi authors,
Great work with the paper! I'm interested in whether you're going to release the weights, not of the single best model, but of the one mentioned in example_scripts/README.md with an F_{0.5} score close to 26.6.
Thank you

Small bug in errorify/error.py

Hi, first of all, thank you for releasing your code.

I just found a small bug in errorify/error.py: in the function readn at line 53, the extra "yield clist" outside the for loop will yield a duplicate batch if the number of lines in the file is divisible by the batch size.

(So for a source file of 1,000 lines with batch size 200, corr_sentences.txt and incorr_sentences.txt would each end up with 1,200 lines.)

A simple fix is to remove the boolean "start" checks and set clist = [] after every "yield clist" inside the for loop; then, if the number of lines in the file is divisible by the batch size, the yield outside the loop will just return an empty list. A sketch of the fixed generator is below.
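Here is a minimal sketch of a batching generator with that fix applied (the real signature of readn in the repository may differ; the final "if clist:" guard is a common variant that also skips the trailing empty batch):

    def readn(f, n):
        # Yield lines from the file object f in batches of size n.
        clist = []
        for line in f:
            clist.append(line)
            if len(clist) == n:
                yield clist
                clist = []  # reset inside the loop, so no batch is duplicated
        if clist:  # yield only a trailing partial batch, never a duplicate
            yield clist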

Running pretrained PIE model gets stuck at INFO:tensorflow:Done running local_init_op.

Hi there, I am trying to run the pretrained PIE model as per the instructions in the repo. However, it gets stuck at the point where the output says "INFO:tensorflow:Done running local_init_op." after just the first call of the pie_infer.sh file. I have tried solutions such as commenting out the d.repeat() in word_edit_model.py, but I am still encountering the problem. I would really appreciate some help with this issue, thank you!

EDIT: I have managed to get past that line, but it takes quite long (about 5 minutes). After that, it takes about 3 minutes per 10 iterations when enumerating wem_utils.timer(result). Is there any way to speed this up? Thank you!

Trying to use the pretrained model

Hi, thank you for contributing this model. I am currently trying to use it with your pretrained PIE. Is it correct to download the pretrained PIE into a directory which I then pass as the "output_dir" argument? Thank you in advance.

About the Edit-factorized BERT architecture


For replace, when we calculate the attention score at position i, we don't consider the token w(i).

At the first layer, I think this is not a problem, but at the second and higher layers we use the information of w(i) indirectly.

Is that OK?

Pretrained model bad correction performance?

Hi, I am testing out your model and I noticed that if I run your pretrained model on the conll_test.txt file you provide, it gets very poor performance. The output is something of this sort:

Day I think I think I think remember I think I think I
Day What I think is here and I think and I
Day I think large refers every day large your every day chance every day of I think every daying every day large I think I am disease every day every day

which bears no resemblance to the input. Do you maybe know what might be going on? I am just running multi_round_infer.sh; the only flag I have changed is "use_tpu" to false (because I don't have a TPU and am running on a GPU).
