
kanyun-inc / fairseq-gec


Source code for paper: Improving Grammatical Error Correction via Pre-Training a Copy-Augmented Architecture with Unlabeled Data

License: Other

Languages: Python 98.10%, C++ 0.44%, Shell 0.91%, Lua 0.54%
Topics: grammar, nlp

fairseq-gec's People

Contributors

zhawe01


fairseq-gec's Issues

Details about the token-level labeling Task

Thank you for your excellent work on the GEC task. You mention a token-level labeling task, but I could not find it in this code. Could you give me more details about how to combine the token-level labeling task with the present work? Thank you again for your amazing work.

Multi-Task

Hello! We are very interested in the ideas in this paper, but we cannot find the multi-task code. Will you release it later?

Is negative loss normal?

I trained my own model with the train.sh script, but found that the loss becomes negative after many epochs. Is a negative loss expected here?

Question about model parameters, UNK words, loss function and spell error correction system

Hi,
I find that some parameters mentioned in Section 5.2, such as the learning rate and weight decay, are not consistent with the released train.sh script. For the single models shown in Table 5, how did you set these parameters for the single-model ablation study?

Also, do all single models shown in Table 5 use the edit-weighted MLE? And does "ignoring UNK words as edits" mean replacing the UNK token with the source word (i.e., using the "--replace-unk" parameter), or just dropping the token?

I notice you use a statistics-based spelling error correction system to pre-process the training data. Where can I find this system?

Cross entropy loss giving RuntimeError: CUDA error: device-side assert triggered

Hi,

I tried to train a model from scratch on the data downloaded from the link in your README. However, when training the model I hit a runtime error. I suspect the issue is the data format, because when I ran the same code on a different dataset (English-German) it ran without any glitch.
Currently, I am using Python 3.7 and PyTorch 1.3.
Any help would be deeply appreciated.
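
For reference, a device-side assert raised by the cross-entropy loss on CUDA is most often caused by a target index that falls outside the output vocabulary range; running the same batch on CPU, or setting CUDA_LAUNCH_BLOCKING=1, usually surfaces a clearer error message. A minimal sanity-check sketch (the function and its usage are illustrative assumptions, not part of this repo):

import torch

def find_bad_targets(targets: torch.Tensor, vocab_size: int) -> torch.Tensor:
    # nn.CrossEntropyLoss asserts on CUDA when a target id is < 0 or >= vocab_size
    # (padding ids should be handled via ignore_index rather than exceeding the vocabulary).
    return (targets < 0) | (targets >= vocab_size)

# Toy usage: a vocabulary of 10 symbols and one out-of-range target id.
targets = torch.tensor([3, 7, 12])
bad = find_bad_targets(targets, vocab_size=10)
print(bad.sum().item(), "bad ids:", targets[bad].tolist())  # 1 bad ids: [12]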

How to make data/train_merge and data/valid?

Hello, thank you for your great work.
I am trying to replicate this work and build on it,
using this copy-attention model as a baseline for GEC.

Currently I am trying to switch the dataset (to Korean).
I already have the preprocessed data.

But I have trouble creating the input for the model.
I think I have to produce the files in out/data_bin,
which are [train/valid].src-tgt.[src/tgt].[bin/idx] and [train/valid].label.[src/tgt].txt.

By analyzing the code, I found that I can create the label files and the binary files by running preprocess.sh.
But I always get the error:
FileNotFoundError: [Errno 2] No such file or directory: 'data/train_merge.src'
By analyzing your code further, I found that we need "trainpref" and "validpref", which are set to
'data/train_merge' and 'data/valid', to generate the label and binary files.
But I couldn't find the code that generates these files, which means that I have to create them myself.

My questions are these.
Overall: how can I create the inputs for training the model?

  1. How can I create data/train_merge.src, data/train_merge.tgt, data/valid, and the align file (data/train_merge.forward)? What are their formats?
    (For example, dict.src.txt has the [word] [frequency] format over the training data.)
  2. Are there any other files I need, besides data/train_merge.[src/forward], data/valid, and dicts/dict.src.txt, in order to run preprocess.sh and get all the required inputs?

Also, it would be very helpful if you could describe the general process for running the model with a different preprocessed dataset.
Here, "preprocessed" means that I have already done all of this (#14) to make training data, and I now have a clean dataset of [grammatically correct / grammatically incorrect] sentence pairs.

Thank you for reading my question. I will be waiting for your answer.
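
For reference, under standard fairseq and fast_align conventions (an assumption, since the repo does not document these files), train_merge.src and train_merge.tgt would be parallel plain-text files with one tokenized sentence per line, data/valid.[src/tgt] would follow the same layout, and train_merge.forward would hold one line of fast_align-style "srcIndex-tgtIndex" pairs per sentence pair. A toy sketch of that assumed layout:

import os

os.makedirs("data", exist_ok=True)  # illustrative data only, not the authors' corpus
src_lines = ["I has a apple .", "she go to school ."]        # train_merge.src: errorful source sentences
tgt_lines = ["I have an apple .", "she goes to school ."]    # train_merge.tgt: corrected target sentences
fwd_lines = ["0-0 1-1 2-2 3-3 4-4", "0-0 1-1 2-2 3-3 4-4"]   # train_merge.forward: fast_align word alignments

for name, lines in [("data/train_merge.src", src_lines),
                    ("data/train_merge.tgt", tgt_lines),
                    ("data/train_merge.forward", fwd_lines)]:
    with open(name, "w", encoding="utf-8") as f:
        f.write("\n".join(lines) + "\n")
# data/valid.src and data/valid.tgt would use the same one-sentence-per-line format.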

How to train this on a new language?

Hello,

Thanks for sharing the repository. What are the steps to follow to train this model on the GEC task for languages other than English (German, Italian, French, etc.)?

Question about the difference between the paper's and the code's formulas for composite_score and copy_score

As defined in Equation 5 of the paper, the final probability distribution p_t is a mixture of the generation distribution and the copy distribution:
p_t(w) = (1 − α_copy) * p_gen(w) + α_copy * p_copy(w)
However, I found the formula used in the code is as follows:
composite_scores = copy_alpha * composite_scores
copy_scores = (1 - copy_alpha) * copy_attn
I am concerned about whether this difference influences the model and how it affects performance.
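
For concreteness, a minimal sketch of the mixture as written in the paper's Equation 5 (illustrative only; the tensor names are assumptions and this is not the repo's scoring code):

import torch

def mix_copy_and_generation(p_gen, p_copy, copy_alpha):
    # Paper's Eq. 5: p_t(w) = (1 - alpha_copy) * p_gen(w) + alpha_copy * p_copy(w)
    # p_gen, p_copy: (batch, vocab) probability distributions; copy_alpha: (batch, 1) gate in [0, 1].
    return (1 - copy_alpha) * p_gen + copy_alpha * p_copy

p_gen = torch.softmax(torch.randn(2, 5), dim=-1)
p_copy = torch.softmax(torch.randn(2, 5), dim=-1)
alpha = torch.full((2, 1), 0.3)
p_t = mix_copy_and_generation(p_gen, p_copy, alpha)
print(p_t.sum(dim=-1))  # each row sums to 1 for any alpha in [0, 1]

Whether the quoted code matches the paper depends on what copy_alpha denotes there: if it corresponds to 1 − α_copy in the paper's notation, the two are equivalent; if it is α_copy itself, the coefficients are swapped.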

About the validation set

Thanks for your excellent work!
After running train.sh, I found that there were about 5,000 validation samples. Since there are only 1,381 sentences in the CoNLL-2013 test set, could you please tell me where those validation samples come from?

How to run this model

Thanks for your program.
I watched your presentation video at ACL, and I want to use this program in my research, but I do not know how to run it.
Thanks.

Some problems when running pretrain.sh

Hello, I want to modify this code to use subtokens instead of the original tokens. After generating the training data, the following error occurred when running pretrain.sh.

Exception: process 1 terminated with signal SIGKILL

Have you seen a similar error? And how much memory did you use when you ran on the 100 million examples?

Details about training the best single model described in the paper

In the paper, Table 5 (single-model ablation study) reports an F-score of 58.80 for the copy-augmented Transformer architecture with the (pre-trained) denoising auto-encoder.

  • I tried downloading the code, the pre-trained models, and the data provided in the repo, and got an F-score of 52.76.
  • Does the current repo, with the given train.sh, the pre-trained model file checkpoint9.pt, and the data provided in the download link, replicate the 58.8 F-score with the 9 epochs set in the train.sh script?
  • Should we change any of the hyper-parameters to achieve the F-score of 58.8 mentioned in the paper?

Thanks.

Inference results using "generate.sh"

Hi ☺

Thank you for your great work in advance.

I tried to generate inference results using your pretrained model and my dataset (it consists of two files: source and target).

But after preprocessing ("preprocess.sh"), I encountered a problem when I ran "generate.sh":

>>> bash generate.sh 0 _expr_ht
_last
Traceback (most recent call last):
  File "generate.py", line 196, in <module>
    cli_main()
  File "generate.py", line 192, in cli_main
    main(args)
  File "generate.py", line 111, in main
    hypos = task.inference_step(generator, models, sample, prefix_tokens)
  File "/data02/jeiyoon_park/fairseq-gec/fairseq/tasks/fairseq_task.py", line 243, in inference_step
    return generator.generate(models, sample, prefix_tokens=prefix_tokens)
  File "/data02/jeiyoon_park/anaconda3/envs/ydfu/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/data02/jeiyoon_park/fairseq-gec/fairseq/sequence_generator.py", line 382, in generate
    scores.view(bsz, beam_size, -1)[:, :, :step],
  File "/data02/jeiyoon_park/fairseq-gec/fairseq/search.py", line 83, in step
    torch.div(self.indices_buf, vocab_size, out=self.beams_buf)
RuntimeError: result type Float can't be cast to the desired output type Long

I checked the generated "outputema_last.nbest.txt" file and found these logs:

Label file not found: /data02/jeiyoon_park/fairseq-gec/ht_result/test.label.src.txt
Label file not found: /data02/jeiyoon_park/fairseq-gec/ht_result/test.label.tgt.txt

I just want to run inference using your pretrained model and my dataset, not train a model.

Could you let me know how to infer the results?

Thank you!
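
For context, this RuntimeError is typical of running an older fairseq fork under a newer PyTorch, where torch.div performs true division and returns a float that cannot be written into the Long beams_buf tensor. A commonly used workaround (an assumption based on later fairseq/PyTorch behaviour, not a fix shipped with this repo) is to make the floor division explicit, as in this self-contained sketch:

import torch

indices_buf = torch.tensor([12, 37, 55])   # flat indices over (beam, vocab), dtype Long
beams_buf = torch.empty_like(indices_buf)  # Long output buffer, as in fairseq/search.py
vocab_size = 10

# Plain torch.div(...) on recent PyTorch yields a Float result and raises the error above;
# rounding_mode='floor' keeps the integer semantics the beam search expects.
# (On PyTorch versions without rounding_mode, torch.floor_divide(indices_buf, vocab_size, out=beams_buf) is equivalent.)
torch.div(indices_buf, vocab_size, rounding_mode='floor', out=beams_buf)
print(beams_buf.tolist())  # [1, 3, 5]

Applying the same change to the torch.div call in fairseq/search.py, or using the PyTorch version this repo was written for, may resolve the crash; the "Label file not found" lines appear to be a separate matter.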

About the model performance

I have tried this code without pre-training, and the performance is as follows:
P: 0.6580  R: 0.2570  F0.5: 0.5015
However, the paper reports that this setting (Transformer + Copy) should reach about 54.

Where is "g.sh"?

According to README.md, to evaluate on the CoNLL-2014 test set one should run the following script:

sh g.sh ${device_id} ${experiment_name}

But where is "g.sh"? I can't find it in this repository.

Dropout causing negative loss values.

Hello,

PyTorch implements inverted dropout, which scales up the surviving values. This means that the attention weights in the MultiheadAttention module from fairseq can go over 1, since dropout is applied after the softmax.

Since the model uses the attention weights from the copy module to compute the final probability distribution p_t(w), values over 1 are possible there as well. In turn, this makes it possible for the cross-entropy loss to be negative: cross entropy expects a probability distribution, which is not what is given during the training of this model.
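
A minimal sketch of that effect, assuming nothing about this repo's actual loss code: if the value assigned to the target token can exceed 1, the per-token negative log-likelihood becomes negative.

import torch

# An unnormalized "distribution" whose target entry exceeds 1, e.g. because
# dropout rescaled attention weights after the softmax.
probs = torch.tensor([[0.2, 1.4, 0.1]])
target = torch.tensor([1])

nll = -torch.log(probs[0, target])
print(nll.item())  # about -0.34: the loss contribution of this token is negative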

Is it something that you considered?

I've noticed that no negative loss occurs using your training data, which surprises me. Maybe there is something I'm not seeing in the way you compute the loss; it seems to be related to the labels built from the train.forward file. However, using my own training data (and your code), it does happen.

Thank you.

Got an error while retraining the model using ./train.sh

Hi,
I am trying to retrain the given model with a new dataset for my thesis. Preprocessing worked fine but now I get the following error when trying to run train.sh:

neg_target = target.new_tensor(target).masked_fill_(target_label, self.padding_idx)
RuntimeError: The expanded size of the tensor (384) must match the existing size (832) at non-singleton dimension 0. Target sizes: [384]. Tensor sizes: [832]

I didn't change anything besides the pretrained model path in train.sh.

I previously fixed this error

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [128, 1, 1536]], which is output 0 of AddBackward0, is at version 2; expected version 1 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).

by changing q *= self.scaling to q = q * self.scaling in line 109 of fairseq's multihead_attention.py.
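
For anyone hitting the same anomaly, a toy reproduction of the failure mode and of the out-of-place rewrite described above (this is illustrative PyTorch, not the fairseq attention code):

import torch

x = torch.randn(3, requires_grad=True)
q = x + 1.0          # output of AddBackward0, as in the error message
y = (q ** 2).sum()   # pow() saves q for its backward pass

# The in-place form `q *= 0.5` would bump q's version counter here and make
# y.backward() fail with the "modified by an inplace operation" error.
# The out-of-place form allocates a new tensor and leaves the saved one intact:
q = q * 0.5
y.backward()
print(x.grad)        # gradient of sum(q**2) w.r.t. x, i.e. 2 * (x + 1)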

Thank you.

Question about the sentence-copy task

First, thank you for your good work. I would like to know how a batch is constructed in the sentence-copy task: does each batch contain half correct sentence pairs and half edited sentence pairs, or does one batch contain only correct sentence pairs and the next batch only edited sentence pairs?

About pre-training the DAE

I see that you generated 9 noised copies of the one-billion-word data. Do you use all of these data for pre-training? Also, the paper says "We set Λ = 3 when we train the denoising auto-encoder, and set Λ = [1, 1.8] when we train GEC models", but in the code Λ = 1.3. Which setting gives better performance?

I can't run bash train.sh

Hello , :)

I got this error when I tried to run train.sh. How can I fix this issue?

aiman@ta:~/fairseq-gec-master$ bash train.sh 0 aiman
fatal: not a git repository (or any of the parent directories): .git
GIT: unknown unknown
2021-12-01 20:14:16
--------------------------------------------------------------------------------
Namespace(adaptive_softmax_cutoff=None, adaptive_softmax_dropout=0, arch='transformer', attention_dropout=0.2, bucket_cap_mb=25, clip_norm=2.0, copy_attention=True, copy_attention_dropout=0.2, copy_attention_heads=1, copy_ext_dict=False, cpu=False, criterion='cross_entropy', curriculum=0, data=['out/data_bin'], ddp_backend='c10d', decoder_attention_heads=8, decoder_embed_dim=512, decoder_embed_path=None, decoder_ffn_embed_dim=4096, decoder_input_dim=512, decoder_layers=6, decoder_learned_pos=False, decoder_normalize_before=False, decoder_output_dim=512, device_id=0, distributed_backend='nccl', distributed_init_method=None, distributed_port=-1, distributed_rank=0, distributed_world_size=1, dropout=0.2, ema_decay=0.9999, encoder_attention_heads=8, encoder_embed_dim=512, encoder_embed_path=None, encoder_ffn_embed_dim=4096, encoder_layers=6, encoder_learned_pos=False, encoder_normalize_before=False, fix_batches_to_gpus=False, fp16=False, fp16_init_scale=128, fp16_scale_tolerance=0.0, fp16_scale_window=None, keep_interval_updates=-1, keep_last_epochs=-1, lazy_load=False, left_pad_source='True', left_pad_target='False', log_format=None, log_interval=1000, lr=[0.001], lr_period_updates=73328.0, lr_scheduler='triangular', lr_shrink=0.95, max_epoch=9, max_lr=0.004, max_sentences=64, max_sentences_valid=64, max_source_positions=1024, max_target_positions=1024, max_tokens=3000, max_update=0, memory_efficient_fp16=False, min_loss_scale=0.0001, min_lr=1e-05, momentum=0.99, no_ema=False, no_epoch_checkpoints=False, no_progress_bar=True, no_save=False, no_token_positional_embeddings=False, num_workers=0, optimizer='nag', optimizer_overrides='{}', positive_label_weight=1.2, pretrained_model='./out/models_pretrain/checkpoint9.pt', raw_text=False, relu_dropout=0.2, reset_lr_scheduler=False, reset_optimizer=False, restore_file='checkpoint_last.pt', save_dir='out/modelsaiman', save_interval=1, save_interval_updates=0, seed=4321, sentence_avg=False, share_all_embeddings=True, share_decoder_input_output_embed=False, shrink_min=True, skip_invalid_size_inputs_valid_test=False, source_lang=None, target_lang=None, task='translation', tensorboard_logdir='', threshold_loss_scale=None, train_subset='train', update_freq=[1], upsample_primary=1, user_dir=None, valid_subset='valid', validate_interval=1, weight_decay=0.0)
Traceback (most recent call last):
  File "train.py", line 435, in <module>
    cli_main()
  File "train.py", line 431, in cli_main
    main(args)
  File "train.py", line 42, in main
    task = tasks.setup_task(args)
  File "/home/aiman/fairseq-gec-master/fairseq/tasks/__init__.py", line 19, in setup_task
    return TASK_REGISTRY[args.task].setup_task(args)
  File "/home/aiman/fairseq-gec-master/fairseq/tasks/translation.py", line 112, in setup_task
    raise Exception('Could not infer language pair, please provide it explicitly')
Exception: Could not infer language pair, please provide it explicitly

Kind regards
Aiman Solyman

Why is an align file needed for this implementation? Please document it better.

An align file is needed to train the model on new data; however, in the align script I see:

source ./config.sh
mkdir data_align

trainpref='data/train_merge'
trainpref='data/valid'

python scripts/build_sym_alignment.py --fast_align_dir ~/software/fast_align/build/ --mosesdecoder_dir fakkk --source_file $trainpref.src --target_file $trainpref.tgt --output_dir data_align 

cp data_align/align.forward $trainpref.forward
cp data_align/align.backward $trainpref.backward

rm -rf data_align

I'm assuming that setting trainpref twice is a bug, so should I remove the second assignment? Do I need a validation align file as well? Why is an align file even necessary? When I actually run it I get:

sh: 1: /h/user/software/fast_align/build/fast_align: not found
Traceback (most recent call last):
  File "scripts/build_sym_alignment.py", line 101, in <module>
    main()
  File "scripts/build_sym_alignment.py", line 75, in main
    assert os.system(fwd_fast_align_cmd) == 0
AssertionError
cp: cannot stat 'data_align/align.backward': No such file or directory

Why aren't your paths relative instead of absolute? Which commit of the fast_align implementation are you using? I'm assuming just https://github.com/clab/fast_align master.

Please actually test your code for the use-cases you're advertising before releasing. For reference, https://github.com/kanekomasahiro/bert-gec is also based on the fairseq library and manages to get up and running painlessly in minutes. Please improve your documentation if you actually want people to adopt your work!!!!!

bash train.sh gives an error

I ran bash train.sh 0 name ... but I got this error:

RuntimeError: Output 0 of SplitBackward0 is a view and is being modified inplace. This view is the output of a function that returns multiple views. Such functions do not allow the output views to be modified inplace. You should replace the inplace operation by an out-of-place one.

Can you help me?
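
For context, this looks like the same family of problem as the multihead_attention.py fix described a few issues above (replacing q *= self.scaling with q = q * self.scaling): a newer PyTorch rejecting an in-place operation on a view. A toy sketch of the pattern and the usual out-of-place rewrite (not this repo's actual code; the offending line has to be located from the full traceback):

import torch

x = torch.randn(4, requires_grad=True)
a, b = x.split(2)    # split() returns multiple views, like SplitBackward0 in the error
# a *= 2.0           # in-place on such a view raises the RuntimeError quoted above
a = a * 2.0          # the out-of-place form is accepted
(a.sum() + b.sum()).backward()
print(x.grad)        # tensor([2., 2., 1., 1.])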

How should I use the pretrained model to process new data?

I want to use the pretrained model checkpoint9.pt directly to process new data. How should I organize the code? (The methods I found on Baidu did not work, so I am asking you for suggestions, e.g., how to use the torch.load() function.)
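
For orientation, the checkpoint is an ordinary PyTorch pickle; another issue in this list shows the repo's own code loading it with torch.load(args.pretrained_model)['model']. A minimal sketch of inspecting it (illustrative only; actually generating corrections is done through this repo's generate.sh or interactive.sh rather than a bare torch.load):

import torch

# Load the checkpoint on CPU and look at what it contains.
state = torch.load('out/models_pretrain/checkpoint9.pt', map_location='cpu')
print(list(state.keys()))   # typically includes 'model' and 'args', among others
weights = state['model']    # an OrderedDict of parameter tensors
print(len(weights), 'parameter tensors')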

Questions regarding checkpoints and interactive.sh

To run interactive.sh, it requires a model file ./out_big_art/models_denoise/checkpoint5.pt. I created such directory and copied the checkpoint from out/model/checkpoint5.pt and it can still run and give seemingly good results. How is this supposedly existing ./out/models_denoise/checkpoint5.pt different from checkpoints in out/model/, i.e., can I use checkpoints from out/model to run interactive.sh and will it give different results? And what are the ./out/modelscheckpointema.pt for? How are they different from normal checkpoints in the same directory?

Thank you so much for all the amazing work and gracious help!

Question about evaluation?

Can I do the evaluations without any training? If so, where can I find the script g.sh? Or do I have to train it with the pre-trained model first and then it will generate a g.sh script? Or do I have to create a g.sh script myself by calling some of the scripts? Thank you for the gracious help!

Questions about training data

I found two fields in the training data, source_label and target_label. Each item consists of 0s and 1s and has the same length as src_tokens and target, respectively.
Can you please explain the meaning of these two fields? How are they used during training?

How to make training data.

Thank you for your excellent work on the GEC task. I am having trouble with the preprocessing of the training data.
Please tell me how to preprocess the training data; specifically, I want to know how to build it.

It does not run on a single node with multiple GPUs

Hi,
Problem description: the error occurs at this line:
states = torch.load(args.pretrained_model)['model']
Error message:
RuntimeError: CUDA error: all CUDA-capable devices are busy or unavailable
Please give me a hand.
