kakaobrain / helo-word
Team Kakao&Brain's Grammatical Error Correction System for the ACL 2019 BEA Shared Task
License: Apache License 2.0
When I run preprocess.py, some errors occur.
Has anyone encountered this problem?
Hi everyone,
When I ran preprocess.py, it gave me this error message. I had installed the lm_scorer package before running preprocess.py, but the way it is called in ./gec/spell.py differs from the usage shown at this link: https://github.com/simonepri/lm-scorer. How is LM_Scorer meant to be used in this project, which version of lm_scorer is required, and how should it be installed? Can anyone help me? ^_^
Thank you so much for taking the time to reupload your code for the BEA 19 Shared Task! Unfortunately, while trying to run the command python train.py --track 1 --train-mode pretrain --model base --ngpu 1, I ran into the following error:
Traceback (most recent call last):
File "/home/pohjie/anaconda3/envs/helo/bin/fairseq-train", line 11, in <module>
load_entry_point('fairseq', 'console_scripts', 'fairseq-train')()
File "/home/pohjie/beast19/helo_word/fairseq/fairseq_cli/train.py", line 407, in cli_main
main(args)
File "/home/pohjie/beast19/helo_word/fairseq/fairseq_cli/train.py", line 39, in main
task = tasks.setup_task(args)
File "/home/pohjie/beast19/helo_word/fairseq/fairseq/tasks/__init__.py", line 19, in setup_task
return TASK_REGISTRY[args.task].setup_task(args)
File "/home/pohjie/beast19/helo_word/fairseq/fairseq/tasks/translation.py", line 109, in setup_task
raise Exception('Could not infer language pair, please provide it explicitly')
Exception: Could not infer language pair, please provide it explicitly
I have tried my best to debug this to no avail. May I know what is a possible solution to this? Thank you.
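A possible way to narrow this down: in upstream fairseq releases of that period, the translation task guesses the language pair from file names in the binarized data directory, expecting names shaped like `<split>.<src>-<tgt>.<lang>.idx`. Whether the fairseq copy bundled with helo_word behaves identically is an assumption. The sketch below re-implements that check as a standalone helper (not fairseq's own function) so a data directory can be inspected; the `ori`/`cor` names are hypothetical placeholders.

```python
import os
import tempfile
from typing import Optional, Tuple


def guess_language_pair(data_dir: str) -> Optional[Tuple[str, str]]:
    """Return (src, tgt) if a file such as 'train.ori-cor.ori.idx' exists.

    Mirrors the naming convention described above; returns None when no
    file in the directory matches, which is the situation in which the
    'Could not infer language pair' exception would be expected.
    """
    for filename in sorted(os.listdir(data_dir)):
        parts = filename.split(".")
        if len(parts) >= 3 and len(parts[1].split("-")) == 2:
            src, tgt = parts[1].split("-")
            return src, tgt
    return None


if __name__ == "__main__":
    # Demonstration on a throwaway directory with a hypothetical file name.
    with tempfile.TemporaryDirectory() as tmp:
        print(guess_language_pair(tmp))  # nothing to infer from yet
        open(os.path.join(tmp, "train.ori-cor.ori.idx"), "w").close()
        print(guess_language_pair(tmp))
```

If the helper returns None for the directory passed to fairseq-train, the binarized files are probably missing or named differently, which would suggest that an earlier preprocessing step did not finish.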
Hi
Could you explain how to incorporate a domain-specific corpus when training the model? My work involves identifying n-grams prevalent in medical texts, such as "sudden infant death syndrome", which appear in only a handful of instances in the corpus files. Are there any scripts we can tweak to include additional files, and if so, how?
Otherwise, can the current model perform well across domains?
Hhhhh
In the paper you ensemble 5 models with different architectures. I wonder how to achieve this. Does fairseq support it? It would be nice if there were some scripts for it.
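For reference, upstream fairseq documents that the `--path` option of `fairseq-generate` accepts several checkpoint files separated by colons and decodes with them as an ensemble. Whether this is how the authors built their ensemble, and whether it works across every architecture in this repository, is not confirmed here. The sketch below only assembles such a command line; the data directory and checkpoint paths are hypothetical.

```python
from typing import List


def build_ensemble_command(data_bin: str, checkpoints: List[str]) -> List[str]:
    """Build a fairseq-generate argument list that ensembles checkpoints.

    The checkpoints are joined with ':' because that is the separator
    fairseq documents for passing multiple models to --path.
    """
    if not checkpoints:
        raise ValueError("at least one checkpoint is required")
    return ["fairseq-generate", data_bin, "--path", ":".join(checkpoints)]


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    cmd = build_ensemble_command(
        "data-bin", ["ckpt/model_a.pt", "ckpt/model_b.pt"]
    )
    print(" ".join(cmd))
```

The resulting list can be handed to `subprocess.run` together with whatever decoding flags the single-model command already uses.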
Is it possible to upload the trained models, and perhaps add documentation on how to correct text using trained models (whether supplied or trained by the user)?
Hi, I am running the pre-train section of the code by executing the command 'python train.py --track 1 --train-mode pretrain --model base --ngpu 2'
| model transformer, criterion LabelSmoothedCrossEntropyCriterion
| num. model params: 60522496 (num. trained: 60522496)
| training on 2 GPUs
| max tokens per GPU = 4000 and max sentences per GPU = None
| WARNING: 1 samples have invalid sizes and will be skipped, max_positions=(1024, 1024), first few sample ids=[5747045]
| no existing checkpoint found /home/pohjie/beast19/helo_word/track1/ckpt/pretrain-base-lr0.0005-dr0.3/checkpoint_last.pt
However, after the messages above, I encounter this error:
Traceback (most recent call last):
File "/home/pohjie/anaconda3/envs/helo/bin/fairseq-train", line 11, in <module>
load_entry_point('fairseq', 'console_scripts', 'fairseq-train')()
File "/home/pohjie/beast19/helo_word/fairseq/fairseq_cli/train.py", line 392, in cli_main
distributed_main(args.device_id, args)
File "/home/pohjie/beast19/helo_word/fairseq/fairseq_cli/train.py", line 380, in distributed_main
main(args, init_distributed=True)
File "/home/pohjie/beast19/helo_word/fairseq/fairseq_cli/train.py", line 94, in main
trainer.dummy_train_step([dummy_batch])
File "/home/pohjie/beast19/helo_word/fairseq/fairseq/trainer.py", line 350, in dummy_train_step
self.train_step(dummy_batch, dummy_batch=True)
File "/home/pohjie/beast19/helo_word/fairseq/fairseq/trainer.py", line 163, in train_step
self.model.train()
File "/home/pohjie/beast19/helo_word/fairseq/fairseq/trainer.py", line 81, in model
self.args, self._model,
File "/home/pohjie/beast19/helo_word/fairseq/fairseq/models/distributed_fairseq_model.py", line 67, in DistributedFairseqModel
return _DistributedFairseqModel(**init_kwargs)
File "/home/pohjie/beast19/helo_word/fairseq/fairseq/models/distributed_fairseq_model.py", line 59, in __init__
super().__init__(*args, **kwargs)
TypeError: __init__() got an unexpected keyword argument 'bucket_cap_mb'
May I know how I should go about resolving this? Thank you!
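One plausible reading of this traceback is that the wrapper class fairseq subclasses (likely PyTorch's `DistributedDataParallel`) does not accept `bucket_cap_mb` in the installed PyTorch version, in which case the PyTorch and fairseq versions are mismatched. That is an assumption, not a confirmed diagnosis. The sketch below shows a generic way to check whether a constructor accepts a given keyword; the two wrapper classes are hypothetical stand-ins so that the example runs without PyTorch.

```python
import inspect


def accepts_keyword(callable_obj, name: str) -> bool:
    """Return True if callable_obj can be called with keyword `name`."""
    params = inspect.signature(callable_obj).parameters
    if name in params:
        return True
    # A **kwargs parameter would also swallow the keyword.
    return any(
        p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()
    )


class OlderWrapper:
    """Hypothetical stand-in for a wrapper without bucket_cap_mb."""

    def __init__(self, module, device_ids=None, broadcast_buffers=True):
        self.module = module


class NewerWrapper:
    """Hypothetical stand-in for a wrapper that accepts bucket_cap_mb."""

    def __init__(self, module, device_ids=None, bucket_cap_mb=25):
        self.module = module


if __name__ == "__main__":
    print(accepts_keyword(OlderWrapper, "bucket_cap_mb"))
    print(accepts_keyword(NewerWrapper, "bucket_cap_mb"))
```

In the real environment the same helper could be called as `accepts_keyword(torch.nn.parallel.DistributedDataParallel, "bucket_cap_mb")`; a False result would support the version-mismatch explanation.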
When I replicated this model by following the instructions for track 1, I got an F score of only about 28 on the validation set and less than 10 on the test set.
Has anyone replicated this model? Does it work? Could someone give me some guidance?
After reading your paper and code, I have a question about the spell correction phase.
Did you apply spell correction to the pre-training dataset, or only to the fine-tuning dataset?
Hi guys,
I fear that I cannot reproduce your impressive results as mentioned in your paper. May I know if your team can share the dev and test set outputs please?
Thank you.
Hi, thank you for releasing your code!
I ran the preprocessing code preprocess.py and hit a runtime error.
INFO:root:skip this step as /workspace/helo_word/data/conll2014 is NOT empty
INFO:root:STEP 0-8. Download language model
INFO:root:skip this step as /workspace/helo_word/data/language_model/data-bin is NOT empty
INFO:root:STEP 1. Word-tokenize the original files and merge them
INFO:root:STEP 1-1. gutenberg
INFO:root:skip this step as /workspace/helo_word/data/gutenberg/gutenberg.txt already exists
INFO:root:STEP 1-2. tatoeba
INFO:root:skip this step as /workspace/helo_word/data/tatoeba/tatoeba.txt already exists
INFO:root:STEP 1-3. wiki103
INFO:root:skip this step as /workspace/helo_word/data/wiki103/wiki103.txt already exists
INFO:root:STEP 2. Train bpe model
INFO:root:skip this step as /workspace/helo_word/data/bpe-model/gutenberg.model already exists
INFO:root:STEP 3. Split wi.dev into wi.dev.3k and wi.dev.1k
INFO:root:skip this step as /workspace/helo_word/data/bea19/wi+locness/m2/ABCN.dev.gold.bea19.3k.m2 already exists
INFO:root:STEP 4. Perturb and make parallel files
INFO:root:Track 1
INFO:root:STEP 4-1. writing perturbation scenario
INFO:root:STEP 4-2. gutenberg
# multiprocessing settings
# prepare inputs
# work
0%| | 0/1 [00:00<?, ?it/s]
--- SKIP ---
0%| | 0/1 [00:08<?, ?it/s]
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/opt/conda/lib/python3.7/site-packages/pattern3/text/__init__.py", line 412, in _read
raise StopIteration
StopIteration
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/opt/conda/lib/python3.7/multiprocessing/pool.py", line 121, in worker
result = (True, func(*args, **kwds))
File "/opt/conda/lib/python3.7/multiprocessing/pool.py", line 44, in mapstar
return list(map(*args))
File "/workspace/helo_word/gec/perturb.py", line 160, in make_parallel
perturbation = apply_perturbation(words, word2ptbs, word_change_prob, type_change_prob)
File "/workspace/helo_word/gec/perturb.py", line 121, in apply_perturbation
w = change_type(w, t, type_change_prob)
File "/workspace/helo_word/gec/perturb.py", line 34, in change_type
word = conjugate(word, verb_type)
File "/opt/conda/lib/python3.7/site-packages/pattern3/text/__init__.py", line 2123, in conjugate
b = self.lemma(verb, parse=kwargs.get("parse", True))
File "/opt/conda/lib/python3.7/site-packages/pattern3/text/__init__.py", line 2088, in lemma
self.load()
File "/opt/conda/lib/python3.7/site-packages/pattern3/text/__init__.py", line 2042, in load
for v in _read(self._path):
RuntimeError: generator raised StopIteration
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "preprocess.py", line 169, in <module>
args.word_change_prob, args.type_change_prob))
File "preprocess.py", line 15, in maybe_do
func(*inputs)
File "/workspace/helo_word/gec/perturb.py", line 183, in do
p.map(make_parallel, inputs_li)
File "/opt/conda/lib/python3.7/multiprocessing/pool.py", line 268, in map
return self._map_async(func, iterable, mapstar, chunksize).get()
File "/opt/conda/lib/python3.7/multiprocessing/pool.py", line 657, in get
raise self._value
RuntimeError: generator raised StopIteration
I tried skipping the gutenberg corpus, but the same error was raised when processing the next corpus.
How can I fix it?
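The first traceback shows pattern3's `_read` generator executing `raise StopIteration`. Since Python 3.7 (PEP 479), a StopIteration raised inside a generator body is converted into `RuntimeError: generator raised StopIteration`, which matches the final error. The sketch below reproduces the behaviour with two small stand-in generators (not pattern3's actual code) and shows the usual remedy of ending the generator with `return`.

```python
def read_old_style(lines):
    """Stand-in generator that ends the pre-3.7 way."""
    for line in lines:
        yield line
    raise StopIteration  # converted to RuntimeError on Python 3.7+


def read_fixed(lines):
    """Same generator, ended with a plain return instead."""
    for line in lines:
        yield line
    return


if __name__ == "__main__":
    try:
        list(read_old_style(["a", "b"]))
    except RuntimeError as err:
        print("old style failed:", err)
    print("fixed:", list(read_fixed(["a", "b"])))
```

A likely workaround, assuming the installed pattern3 matches the traceback, is to change the `raise StopIteration` at line 412 of pattern3/text/__init__.py to `return`, or to run the preprocessing under a Python version older than 3.7.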
I tried to train the model with the copy option on 4 GPUs (12 GB each), but training was very slow. Tuning some hyperparameters left me with almost the same issue. Am I missing something? Here are my settings:
python train.py --track 1 --train-mode pretrain --model copy --ngpu 4
model_config = f"--ddp-backend=no_c10d --arch copy_augmented_transformer " \
f"--update-freq 4 --alpha-warmup 10000 --optimizer adam --lr {lr} " \
f"--dropout {dropout} --max-tokens 3000 --min-lr '1e-09' --save-interval-updates 5000 " \
f"--lr-scheduler inverse_sqrt --weight-decay 0.0001 --max-epoch {max_epoch} " \
f"--warmup-updates 4000 --warmup-init-lr '1e-07' --adam-betas '(0.9, 0.98)' "
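For context on these settings: in fairseq, `--max-tokens` is a per-GPU limit and `--update-freq` accumulates gradients over that many batches before each optimizer step, so the upper bound on tokens per parameter update is the product of the two and the GPU count. The small helper below (a hypothetical name, not part of fairseq) works through that arithmetic for the values quoted above. It does not by itself explain the slowdown.

```python
def max_tokens_per_update(
    max_tokens_per_gpu: int, num_gpus: int, update_freq: int
) -> int:
    """Upper bound on tokens contributing to one optimizer step."""
    return max_tokens_per_gpu * num_gpus * update_freq


if __name__ == "__main__":
    # Values from the post: --max-tokens 3000, --ngpu 4, --update-freq 4
    print(max_tokens_per_update(3000, 4, 4))  # 3000 * 4 * 4 = 48000
```

With `--update-freq 4`, each reported update involves four forward and backward passes per GPU, so updates per second will look roughly four times lower than with `--update-freq 1` even when token throughput is unchanged.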
I followed the instructions for the low-resource track. After the DAE step, I get a score of 0.33 with evaluate.py on the best model.
But after training on the 3k dev set and evaluating, the score becomes very low. I tried increasing the number of epochs and lowering the learning rate, but I cannot even retain the F score from the DAE pre-training.
I need help replicating the train step.
Hi @hammouse, I am running preprocess.py. When it reaches step 6-6, I get the error below.
INFO:root:STEP 6-6. wi test
Namespace(cpu=False, data='/home/pohjie/beast19/helo_word/data/language_model/data-bin', fp16=False, fp16_init_scale=128, fp16_scale_tolerance=0.0, fp16_scale_window=None, fpath=None, future_target=False, gen_subset='test', lazy_load=False, log_format=None, log_interval=1000, max_sentences=None, max_tokens=None, memory_efficient_fp16=False, min_loss_scale=0.0001, model_overrides='{}', no_progress_bar=True, num_shards=1, num_workers=4, output_dictionary_size=-1, output_sent=False, past_target=False, path='/home/pohjie/beast19/helo_word/data/language_model/wiki103.pt', quiet=True, raw_text=False, remove_bpe=None, sample_break_mode=None, seed=1, self_target=False, shard_id=0, skip_invalid_size_inputs_valid_test=False, task='language_modeling', tensorboard_logdir='', threshold_loss_scale=None, tokens_per_sample=1024, user_dir=None)
| dictionary: 267744 types
| loading model(s) from /home/pohjie/beast19/helo_word/data/language_model/wiki103.pt
| dictionary: 267744 types
Traceback (most recent call last):
File "preprocess.py", line 237, in <module>
maybe_do(fp.WI_TEST_SP_ORI, spell.check, (fp.WI_TEST_ORI, fp.WI_TEST_SP_ORI))
File "preprocess.py", line 15, in maybe_do
func(*inputs)
File "/home/pohjie/beast19/helo_word/gec/spell.py", line 168, in check
spellcheck(model, fin, fout, speller=speller)
File "/home/pohjie/beast19/helo_word/gec/spell.py", line 103, in spellcheck
lines = open(fin, 'r', encoding='utf-8').read().splitlines()
FileNotFoundError: [Errno 2] No such file or directory: '/home/pohjie/beast19/helo_word/data/parallel/raw/ABCN.test.bea19.orig'
I've tried running the script a few times to make sure the previous steps ran correctly. May I know what I am doing wrong? I've checked the original wi+locness folder and it does not contain the mentioned file either.
Thanks!