kakaobrain / helo-word

Team Kakao&Brain's Grammatical Error Correction System for the ACL 2019 BEA Shared Task

License: Apache License 2.0

Python 97.70% Makefile 0.05% Batchfile 0.07% Shell 1.47% C++ 0.32% Lua 0.39%

nlp deep-learning grammatical-error-correction pre-training transfer-learning transformer fairseq

helo-word's Issues

ImportError: cannot import name 'LMScorer'

When I run preprocess.py, it fails with the import error in the title.
Has anyone else encountered this problem?

ImportError: cannot import name 'LMScorer'

Hi everyone,
When I ran preprocess.py, it gave me this error message, even though I had installed the lm_scorer package beforehand. The way ./gec/spell.py calls it differs from the usage documented at https://github.com/simonepri/lm-scorer. How is LMScorer meant to be used in this project, which version of lm_scorer is required, and how should it be installed? Can anyone help me? ^_^

When I use "from lm_scorer import LMScorer" in ./gec/spell.py, the error message is as follows.
File "./gec/spell.py", line 8, in
from lm_scorer import LMScorer
ImportError: cannot import name 'LMScorer'
When I use "from lm_scorer.models.auto import AutoLMScorer as LMScorer" from https://github.com/simonepri/lm-scorer, the error message is as follows.
File "./gec/spell.py", line 36, in load_lm
return LMScorer(args)
TypeError: __init__() takes 1 positional argument but 2 were given
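
One way to narrow this down is to check which lm_scorer module Python actually resolves, since the package from simonepri and a local module of the same name would expose different classes. The sketch below uses only the standard library; the helper name describe_module is hypothetical and not part of this repository.

```python
import importlib.util


def describe_module(name):
    """Return the file a top-level module would be imported from, or None if not found."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None
    return spec.origin


# Show where lm_scorer would be loaded from, without importing it.
print(describe_module("lm_scorer"))
```

If the printed path points into site-packages rather than into the repository, the import is picking up the third-party package, which would be consistent with both errors above.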

Exception: Could not infer language pair, please provide it explicitly

Thank you so much for taking the time to re-upload your code for the BEA 2019 Shared Task! Unfortunately, when I ran the command python train.py --track 1 --train-mode pretrain --model base --ngpu 1, I got the following error:

Traceback (most recent call last):
  File "/home/pohjie/anaconda3/envs/helo/bin/fairseq-train", line 11, in <module>
    load_entry_point('fairseq', 'console_scripts', 'fairseq-train')()
  File "/home/pohjie/beast19/helo_word/fairseq/fairseq_cli/train.py", line 407, in cli_main
    main(args)
  File "/home/pohjie/beast19/helo_word/fairseq/fairseq_cli/train.py", line 39, in main
    task = tasks.setup_task(args)
  File "/home/pohjie/beast19/helo_word/fairseq/fairseq/tasks/__init__.py", line 19, in setup_task
    return TASK_REGISTRY[args.task].setup_task(args)
  File "/home/pohjie/beast19/helo_word/fairseq/fairseq/tasks/translation.py", line 109, in setup_task
    raise Exception('Could not infer language pair, please provide it explicitly')
Exception: Could not infer language pair, please provide it explicitly

I have tried my best to debug this, to no avail. What might be a possible solution? Thank you.
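
For what it is worth, upstream fairseq raises this exception when --source-lang and --target-lang are not given and the pair cannot be read off the file names in the data directory, which it expects to look like train.src-tgt.src.idx. The sketch below re-implements that file-name check as an assumption about the fairseq copy bundled here; find_language_pair is a hypothetical helper, not a fairseq function.

```python
import os


def find_language_pair(data_dir):
    """Return (src, tgt) if a file named <split>.<src>-<tgt>.<...> exists, else None."""
    for filename in sorted(os.listdir(data_dir)):
        parts = filename.split(".")
        # The second dot-separated field must look like "<src>-<tgt>".
        if len(parts) >= 3 and len(parts[1].split("-")) == 2:
            src, tgt = parts[1].split("-")
            return src, tgt
    return None
```

If this returns None for the directory that train.py passes to fairseq-train, the binarized data is probably missing or named differently, which would point back to an incomplete preprocessing run.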

Domain specific corpus

Hi,
Could you explain how to incorporate a domain-specific corpus when training the model? My work involves identifying n-grams prevalent in medical texts, such as "sudden infant death syndrome", which appears in only a handful of instances in the corpus files. Are there any scripts we can tweak to include additional files, and if so, how?
Otherwise, can the current model perform well across domains?

Is there any script for ensembling?

In the paper you ensemble 5 models with different architectures, and I wonder how to achieve this. Does fairseq support it? It would be nice if there were some scripts.
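
Upstream fairseq's generation scripts accept several checkpoints in --path, separated by colons, and combine their predictions at decoding time. Whether the fairseq copy bundled here supports mixing architectures this way is not confirmed in this thread; the sketch below only shows how such an argument could be assembled, and the checkpoint paths and data directory are hypothetical.

```python
import shlex


def ensemble_path_arg(checkpoints):
    """Join checkpoint paths into the colon-separated form used for --path."""
    return ":".join(checkpoints)


# Hypothetical checkpoint locations, for illustration only.
ckpts = ["ckpt/model_a.pt", "ckpt/model_b.pt"]
command = ["fairseq-generate", "data-bin", "--path", ensemble_path_arg(ckpts)]
print(" ".join(shlex.quote(part) for part in command))
```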

Trained models

Would it be possible to upload the trained models, and perhaps add documentation on how to correct text using trained models (whether supplied or trained by the user)?

TypeError: __init__() got an unexpected keyword argument 'bucket_cap_mb'

Hi, I am running the pre-train section of the code by executing the command 'python train.py --track 1 --train-mode pretrain --model base --ngpu 2'

| model transformer, criterion LabelSmoothedCrossEntropyCriterion
| num. model params: 60522496 (num. trained: 60522496)
| training on 2 GPUs
| max tokens per GPU = 4000 and max sentences per GPU = None

| WARNING: 1 samples have invalid sizes and will be skipped, max_positions=(1024, 1024), first few sample ids=[5747045]
| no existing checkpoint found /home/pohjie/beast19/helo_word/track1/ckpt/pretrain-base-lr0.0005-dr0.3/checkpoint_last.pt

However, after the messages above, I encounter this error:

Traceback (most recent call last):
  File "/home/pohjie/anaconda3/envs/helo/bin/fairseq-train", line 11, in <module>
    load_entry_point('fairseq', 'console_scripts', 'fairseq-train')()
  File "/home/pohjie/beast19/helo_word/fairseq/fairseq_cli/train.py", line 392, in cli_main
    distributed_main(args.device_id, args)
  File "/home/pohjie/beast19/helo_word/fairseq/fairseq_cli/train.py", line 380, in distributed_main
    main(args, init_distributed=True)
  File "/home/pohjie/beast19/helo_word/fairseq/fairseq_cli/train.py", line 94, in main
    trainer.dummy_train_step([dummy_batch])
  File "/home/pohjie/beast19/helo_word/fairseq/fairseq/trainer.py", line 350, in dummy_train_step
    self.train_step(dummy_batch, dummy_batch=True)
  File "/home/pohjie/beast19/helo_word/fairseq/fairseq/trainer.py", line 163, in train_step
    self.model.train()
  File "/home/pohjie/beast19/helo_word/fairseq/fairseq/trainer.py", line 81, in model
    self.args, self._model,
  File "/home/pohjie/beast19/helo_word/fairseq/fairseq/models/distributed_fairseq_model.py", line 67, in DistributedFairseqModel
    return _DistributedFairseqModel(**init_kwargs)
  File "/home/pohjie/beast19/helo_word/fairseq/fairseq/models/distributed_fairseq_model.py", line 59, in __init__
    super().__init__(*args, **kwargs)
TypeError: __init__() got an unexpected keyword argument 'bucket_cap_mb'

May I know how I should go about resolving this? Thank you!
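
The traceback suggests that the distributed wrapper class in the installed PyTorch does not accept bucket_cap_mb, which usually points to a version mismatch between fairseq and PyTorch. A minimal sketch of how to check which keyword arguments a constructor accepts, using a stand-in class rather than the real torch one (OldStyleDDP and accepted_kwargs are hypothetical names):

```python
import inspect


class OldStyleDDP:
    """Stand-in for a wrapper class whose constructor lacks bucket_cap_mb."""

    def __init__(self, module, device_ids=None, broadcast_buffers=True):
        self.module = module


def accepted_kwargs(cls, kwargs):
    """Keep only the keyword arguments that cls.__init__ declares."""
    params = inspect.signature(cls.__init__).parameters
    return {k: v for k, v in kwargs.items() if k in params}


init_kwargs = {"module": object(), "device_ids": [0], "bucket_cap_mb": 150}
filtered = accepted_kwargs(OldStyleDDP, init_kwargs)
print(sorted(filtered))
model = OldStyleDDP(**filtered)  # no TypeError once the unknown key is dropped
```

Running the same inspection against the real wrapper class shows whether it declares bucket_cap_mb. Installing the PyTorch version the repository was developed against is likely the cleaner fix than filtering arguments.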

Bad results when I replicate this model

When I replicate this model by following the instructions for track 1, I get an F-score of only about 28 on the validation set and less than 10 on the test set.
Has anyone replicated this model successfully? Does it work? Could someone give me some guidance?

About spell correction

After reading your paper and code, I have a question about the spell correction phase.
Did you apply spell correction to the pre-training dataset, or only to the fine-tuning dataset?

Request for dev and test set outputs

Hi guys,

I fear that I cannot reproduce your impressive results as mentioned in your paper. May I know if your team can share the dev and test set outputs please?

Thank you.

generator raised StopIteration when preprocessing

Hi, thank you for releasing your code!
I ran the preprocessing script preprocess.py and encountered a runtime error.

INFO:root:skip this step as /workspace/helo_word/data/conll2014 is NOT empty
INFO:root:STEP 0-8. Download language model
INFO:root:skip this step as /workspace/helo_word/data/language_model/data-bin is NOT empty
INFO:root:STEP 1. Word-tokenize the original files and merge them
INFO:root:STEP 1-1. gutenberg
INFO:root:skip this step as /workspace/helo_word/data/gutenberg/gutenberg.txt already exists
INFO:root:STEP 1-2. tatoeba
INFO:root:skip this step as /workspace/helo_word/data/tatoeba/tatoeba.txt already exists
INFO:root:STEP 1-3. wiki103
INFO:root:skip this step as /workspace/helo_word/data/wiki103/wiki103.txt already exists
INFO:root:STEP 2. Train bpe model
INFO:root:skip this step as /workspace/helo_word/data/bpe-model/gutenberg.model already exists
INFO:root:STEP 3. Split wi.dev into wi.dev.3k and wi.dev.1k
INFO:root:skip this step as /workspace/helo_word/data/bea19/wi+locness/m2/ABCN.dev.gold.bea19.3k.m2 already exists
INFO:root:STEP 4. Perturb and make parallel files
INFO:root:Track 1
INFO:root:STEP 4-1. writing perturbation scenario
INFO:root:STEP 4-2. gutenberg
# multiprocessing settings
# prepare inputs
# work
  0%|                                                                                        | 0/1 [00:00<?, ?it/s]
--- SKIP ---
  0%|                                                                                        | 0/1 [00:08<?, ?it/s]
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/pattern3/text/__init__.py", line 412, in _read
    raise StopIteration
StopIteration

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/multiprocessing/pool.py", line 121, in worker
    result = (True, func(*args, **kwds))
  File "/opt/conda/lib/python3.7/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "/workspace/helo_word/gec/perturb.py", line 160, in make_parallel
    perturbation = apply_perturbation(words, word2ptbs, word_change_prob, type_change_prob)
  File "/workspace/helo_word/gec/perturb.py", line 121, in apply_perturbation
    w = change_type(w, t, type_change_prob)
  File "/workspace/helo_word/gec/perturb.py", line 34, in change_type
    word = conjugate(word, verb_type)
  File "/opt/conda/lib/python3.7/site-packages/pattern3/text/__init__.py", line 2123, in conjugate
    b = self.lemma(verb, parse=kwargs.get("parse", True))
  File "/opt/conda/lib/python3.7/site-packages/pattern3/text/__init__.py", line 2088, in lemma
    self.load()
  File "/opt/conda/lib/python3.7/site-packages/pattern3/text/__init__.py", line 2042, in load
    for v in _read(self._path):
RuntimeError: generator raised StopIteration
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "preprocess.py", line 169, in <module>
    args.word_change_prob, args.type_change_prob))
  File "preprocess.py", line 15, in maybe_do
    func(*inputs)
  File "/workspace/helo_word/gec/perturb.py", line 183, in do
    p.map(make_parallel, inputs_li)
  File "/opt/conda/lib/python3.7/multiprocessing/pool.py", line 268, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/opt/conda/lib/python3.7/multiprocessing/pool.py", line 657, in get
    raise self._value
RuntimeError: generator raised StopIteration

I tried skipping the gutenberg corpus, but the same error was raised when processing the next corpus.
How can I fix it?
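
The inner traceback matches the behavior introduced by PEP 479: from Python 3.7 on, a StopIteration raised inside a generator body is converted to RuntimeError. The sketch below reproduces this with two toy generators (both hypothetical, not the pattern3 code) and shows that ending the generator with return avoids the error. Whether patching pattern3 this way is the appropriate fix for this repository is an assumption.

```python
def read_old_style(lines):
    """Pre-PEP 479 style: signals the end by raising StopIteration."""
    for line in lines:
        yield line
    raise StopIteration


def read_new_style(lines):
    """PEP 479 style: signals the end by returning."""
    for line in lines:
        yield line
    return


try:
    list(read_old_style(["a", "b"]))
    outcome = "no error"
except RuntimeError as err:
    # On Python 3.7+ the StopIteration is converted to RuntimeError.
    outcome = str(err)

print(outcome)
print(list(read_new_style(["a", "b"])))
```

The raise StopIteration at pattern3/text/__init__.py line 412 in the traceback is the same pattern as read_old_style, so running under an older Python or replacing that statement with return are the two options this sketch suggests.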

The training of the copy_augmented Transformer model is very slow!

I tried to train the model with the copy option on 4 GPUs (12 GB each), but training was very slow. Tuning some hyperparameters made almost no difference. Am I missing something? Here are my settings:

python train.py --track 1 --train-mode pretrain --model copy --ngpu 4

        model_config = f"--ddp-backend=no_c10d --arch copy_augmented_transformer " \
            f"--update-freq 4 --alpha-warmup 10000 --optimizer adam --lr {lr} " \
            f"--dropout {dropout} --max-tokens 3000 --min-lr '1e-09' --save-interval-updates 5000 " \
            f"--lr-scheduler inverse_sqrt --weight-decay 0.0001 --max-epoch {max_epoch} " \
            f"--warmup-updates 4000 --warmup-init-lr '1e-07' --adam-betas '(0.9, 0.98)' "
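
As a rough sanity check on these settings, the number of tokens per optimizer step is at most max-tokens × number of GPUs × update-freq, since --update-freq accumulates gradients over several batches before each update. Using the values from the command and config above (the helper name is hypothetical):

```python
def tokens_per_update(max_tokens, ngpu, update_freq):
    """Upper bound on tokens consumed per optimizer step."""
    return max_tokens * ngpu * update_freq


effective = tokens_per_update(max_tokens=3000, ngpu=4, update_freq=4)
print(effective)
```

With --update-freq 4, each logged update covers four forward and backward passes per GPU, so updates will appear roughly four times slower than with --update-freq 1 even when throughput in tokens per second is unchanged.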

Need help replicating the low-resource track

I followed the instructions for the low-resource track. After the DAE step, I get a score of 0.33 with evaluate.py on the best model.
But after training on the 3k dev set and evaluating, the score becomes very low. I tried increasing the number of epochs and lowering the learning rate, but I cannot even retain the F-score from the DAE pre-training.
I need help replicating the train step.

No such file or directory: '/home/helo_word/data/parallel/raw/ABCN.test.bea19.orig'

Hi @hammouse, I am running preprocess.py. When it reaches step 6-6, I get the error mentioned above.

INFO:root:STEP 6-6. wi test
Namespace(cpu=False, data='/home/pohjie/beast19/helo_word/data/language_model/data-bin', fp16=False, fp16_init_scale=128, fp16_scale_tolerance=0.0, fp16_scale_window=None, fpath=None, future_target=False, gen_subset='test', lazy_load=False, log_format=None, log_interval=1000, max_sentences=None, max_tokens=None, memory_efficient_fp16=False, min_loss_scale=0.0001, model_overrides='{}', no_progress_bar=True, num_shards=1, num_workers=4, output_dictionary_size=-1, output_sent=False, past_target=False, path='/home/pohjie/beast19/helo_word/data/language_model/wiki103.pt', quiet=True, raw_text=False, remove_bpe=None, sample_break_mode=None, seed=1, self_target=False, shard_id=0, skip_invalid_size_inputs_valid_test=False, task='language_modeling', tensorboard_logdir='', threshold_loss_scale=None, tokens_per_sample=1024, user_dir=None)
| dictionary: 267744 types
| loading model(s) from /home/pohjie/beast19/helo_word/data/language_model/wiki103.pt
| dictionary: 267744 types
Traceback (most recent call last):
  File "preprocess.py", line 237, in <module>
    maybe_do(fp.WI_TEST_SP_ORI, spell.check, (fp.WI_TEST_ORI, fp.WI_TEST_SP_ORI))
  File "preprocess.py", line 15, in maybe_do
    func(*inputs)
  File "/home/pohjie/beast19/helo_word/gec/spell.py", line 168, in check
    spellcheck(model, fin, fout, speller=speller)
  File "/home/pohjie/beast19/helo_word/gec/spell.py", line 103, in spellcheck
    lines = open(fin, 'r', encoding='utf-8').read().splitlines()
FileNotFoundError: [Errno 2] No such file or directory: '/home/pohjie/beast19/helo_word/data/parallel/raw/ABCN.test.bea19.orig'

I've tried running the script a few times to make sure the previous steps ran correctly. May I know what I am doing wrong? I've also checked the original wi+locness folder, and it does not contain the file either.

Thanks!
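
A small pre-flight check can make this kind of failure easier to diagnose before the language model is loaded. The sketch below lists which expected input files are missing; the helper name is hypothetical, and the example path is the relative form of the one in the traceback.

```python
from pathlib import Path


def missing_files(paths):
    """Return the subset of paths that do not exist as regular files."""
    return [str(p) for p in map(Path, paths) if not p.is_file()]


# Path modeled on the traceback; adjust to the local checkout.
expected = ["data/parallel/raw/ABCN.test.bea19.orig"]
print(missing_files(expected))
```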
