Comments (21)
Hi
It's likely that you have a corrupted model file. This can happen if training was terminated while the checkpoint was being saved.
Solutions:
- On the terminal, try using torch.load on the model checkpoint to see if it loads. If it doesn't, that's the issue.
- Inspect your previous pretraining logs to see whether everything was saved properly. Check the model file sizes to see whether they look right.
- Try to use the model with the pure_model suffix. Hopefully it's not corrupt.
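The first two checks can be scripted. The following is only a minimal sketch: check_checkpoint and its load_fn parameter are hypothetical names, not part of YANMTT, and by default it assumes the checkpoint can be read with torch.load on the CPU.

```python
import os


def check_checkpoint(path, load_fn=None):
    """Return (size_bytes, loads_ok, error) for the file at `path`.

    `load_fn` is a hypothetical hook: any callable that raises on a bad
    file. When omitted, torch.load is used with map_location="cpu".
    """
    if load_fn is None:
        import torch  # imported lazily; only needed for the default loader

        def load_fn(p):
            return torch.load(p, map_location="cpu")

    size_bytes = os.path.getsize(path)
    try:
        load_fn(path)
    except Exception as exc:  # a truncated file usually fails to deserialize
        return size_bytes, False, f"{type(exc).__name__}: {exc}"
    return size_bytes, True, ""
```

Calling it on both the full checkpoint and the pure_model file, and comparing the reported sizes with those of an earlier checkpoint that is known to be good, should show quickly which file was cut off.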
from yanmtt.
@prajdabre Thanks for your patience. I had already guessed that the problem was an incompletely saved model. I saw that when saving succeeds, the pretrain_nmt.py script writes two models. One is MyModel, which holds the state_dict, optimizer, scheduler and ctr; it is very large, 6.9GB after further pre-training from mBART-50.
The other is MyModel.pure_model, which holds only the state_dict, in the same form as the open-source pre-trained model; it is 2.3GB.
I have now confirmed the problem: when I run train_mbart_model.sh on multiple GPUs, the big model (6.9GB) is not saved completely, but the small one is, and I can reload it with pure_model.
As I understand it, the big model (6.9GB) does not actually need to be saved. Can I save and reload only pure_model every 1000 steps, and keep using the optimizer and scheduler from the previous step? Maybe that would resolve my error on multiple GPUs. I would then save the big model only once, at the last step. Can I make these changes, or do you have other suggestions?
Actually, I don't understand the reason for saving this large model.
from yanmtt.
Hi
I'm not sure of the exact problem but I typically pass the flag: --save_intermediate_checkpoints
This saves a separate checkpoint every 10k iterations; the interval can be set to any value with another flag, --long_save_every. I then use an appropriate checkpoint.
I have never used the last checkpoint to be honest. I should look whether there's a bug in the last checkpoint saving or not. That being said using the .pure_model for fine tuning is not a problem. I designed the training so that the big checkpoint with the optimizer and scheduler and counter can be used to resume training a failed run. The pure_model checkpoint is the one that should be used for fine tuning on a downstream task where optimizer params are not needed or for sharing with someone or for uploading to huggingface.
Hope this makes sense.
from yanmtt.
@prajdabre Thank you for your reply. After many adjustments, I finally decided to remove the intermediate model-loading code, shown below,
checkpoint_dict = torch.load(CHECKPOINT_PATH, map_location=map_location)
model.load_state_dict(checkpoint_dict['model'])
optimizer.load_state_dict(checkpoint_dict['optimizer'])
scheduler.load_state_dict(checkpoint_dict['scheduler'])
del checkpoint_dict
and keep only the torch.save() part. I hope this is OK.
One more thing I haven't understood: where is the stopping condition for this pre-training stage set? I can't find a parameter that controls the number of training steps.
from yanmtt.
Hi
Your modification is ok. There's no real reason to load a saved model. I just wanted to make sure that everything is hard synchronized.
The argument you are looking for is: --num_batches
from yanmtt.
Thanks a lot
from yanmtt.
@prajdabre By the way, when I pre-train on more than one language, say 8 languages, what settings should I use? Are there any other parameters to be aware of? For example, I see this:
parser.add_argument('--num_domains_for_domain_classifier', type=int, default=1, help='If we have multiple domains then we should set this to a value higher than one.')
Is it necessary to set num_domains_for_domain_classifier to the number of languages?
I ask because my script failed again when I switched from one language to 8 languages for multilingual pre-training.
This time the error is:
****/miniconda3/envs/yanmtt/lib/python3.7/site-packages/torch/optim/lr_scheduler.py:247: UserWarning: To get the last learning rate computed by the scheduler, please use `get_last_lr()`.
warnings.warn("To get the last learning rate computed by the scheduler, "
****miniconda3/envs/yanmtt/lib/python3.7/site-packages/torch/optim/lr_scheduler.py:136: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`. Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
"https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate", UserWarning)
/miniconda3/envs/yanmtt/lib/python3.7/site-packages/torch/optim/lr_scheduler.py:216: UserWarning: Please also save or load the state of the optimizer when saving or loading the scheduler.
warnings.warn(SAVE_STATE_WARNING, UserWarning)
Traceback (most recent call last):
File "pretrain_nmt_new.py", line 970, in <module>
run_demo()
File "pretrain_nmt_new.py", line 967, in run_demo
mp.spawn(model_create_load_run_save, nprocs=args.gpus, args=(args,files,train_files,)) #
File "/miniconda3/envs/yanmtt/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 199, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/miniconda3/envs/yanmtt/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 157, in start_processes
while not context.join():
File "/miniconda3/envs/yanmtt/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 118, in join
raise Exception(msg)
Exception:
-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/miniconda3/envs/yanmtt/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
fn(i, args)
File "/mt_mbart/yanmtt/pretrain_nmt_new.py", line 523, in model_create_load_run_save
lprobs, labels, args.label_smoothing, ignore_index=tok.pad_token_id
File "***/mt_mbart/yanmtt/common_utils.py", line 147, in label_smoothed_nll_loss
smooth_loss.masked_fill_(pad_mask, 0.0)
RuntimeError: The expanded size of the tensor (27) must match the existing size (26) at non-singleton dimension 1. Target sizes: [15, 27, 1]. Tensor sizes: [15, 26, 1]
and my script settings are:
nohup python pretrain_nmt_new.py -n 1 -nr 0 -g 2 --model_path gen_model/mbart_v1/mbart-50-v1 --tokenizer_name_or_path pretrain_model/mbart-50 --langs en,es,vi,id,th,pt,zh,ko --mono_src hyb_train_v1/train.en,hyb_train_v1/train.es,hyb_train_v1/train.vi,hyb_train_v1/train.id,hyb_train_v1/train.th,hyb_train_v1/train.pt,hyb_train_v1/train.zh,hyb_train_v1/train.ko --encoder_layers 12 --decoder_layers 12 --encoder_attention_heads=12 --decoder_attention_heads=12 --encoder_ffn_dim=128 --decoder_ffn_dim=4096 --d_model=1024 --batch_size 512 --use_official_pretrained --pretrained_model pretrain_model/mbart-50 --no_reload_optimizer_ctr_and_scheduler --long_save_every 50000 --save_intermediate_checkpoints --shard_files > gen_model/mbart_v1/run_train.log 2>&1 &
from yanmtt.
No, you don't need the domain classifier flags.
The failure happens because the language IDs you pass must correspond to the ones used in mBART.
en should be en_XX
Look at the official mBART model repo to find the IDs for the other languages.
from yanmtt.
@prajdabre Thank you very much for your quick reply.
This is the official format of the language codes in mBART-large-50, from https://huggingface.co/facebook/mbart-large-50:
Arabic (ar_AR), Czech (cs_CZ), German (de_DE), English (en_XX), Spanish (es_XX), Estonian (et_EE), Finnish (fi_FI), French (fr_XX), Gujarati (gu_IN), Hindi (hi_IN), Italian (it_IT), Japanese (ja_XX), Kazakh (kk_KZ), Korean (ko_KR), Lithuanian (lt_LT), Latvian (lv_LV), Burmese (my_MM), Nepali (ne_NP), Dutch (nl_XX), Romanian (ro_RO), Russian (ru_RU), Sinhala (si_LK), Turkish (tr_TR), Vietnamese (vi_VN), Chinese (zh_CN), Afrikaans (af_ZA), Azerbaijani (az_AZ), Bengali (bn_IN), Persian (fa_IR), Hebrew (he_IL), Croatian (hr_HR), Indonesian (id_ID), Georgian (ka_GE), Khmer (km_KH), Macedonian (mk_MK), Malayalam (ml_IN), Mongolian (mn_MN), Marathi (mr_IN), Polish (pl_PL), Pashto (ps_AF), Portuguese (pt_XX), Swedish (sv_SE), Swahili (sw_KE), Tamil (ta_IN), Telugu (te_IN), Thai (th_TH), Tagalog (tl_XX), Ukrainian (uk_UA), Urdu (ur_PK), Xhosa (xh_ZA), Galician (gl_ES), Slovene (sl_SI)
What I don't understand is that when I pre-trained on a single language I used just "en", not "en_XX", and it ran successfully. The project's example in example/train_mbart_model.sh also uses the short format, as in "--langs hi,en,vi".
from yanmtt.
That's because coincidentally the token en existed in the Mbart tokenizer. I'm betting that the token zh isn't present in the tokenizer and is split as "z# h". Whenever a language token is split into multiple parts my code crashes (intentionally).
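A quick way to catch this before launching training is to tokenize each language code and flag any that come back as more than one piece. This is a sketch under assumptions: find_split_language_tokens and the toy tokenizer are made up for illustration, and whether this matches the exact check inside YANMTT has not been verified. With a Hugging Face tokenizer, its tokenize method could be passed as the callable.

```python
def find_split_language_tokens(langs, tokenize):
    """Return {lang: pieces} for every language code that `tokenize`
    breaks into more than one piece."""
    split = {}
    for lang in langs:
        pieces = tokenize(lang)
        if len(pieces) != 1:
            split[lang] = pieces
    return split


# Toy stand-in for a real tokenizer (hypothetical): codes found in the
# vocabulary stay whole, anything else falls apart into characters.
toy_vocab = {"en", "en_XX", "zh_CN", "ko_KR"}


def toy_tokenize(text):
    return [text] if text in toy_vocab else list(text)


print(find_split_language_tokens(["en", "zh", "zh_CN"], toy_tokenize))
# prints {'zh': ['z', 'h']}
```

An empty result means every code passed to --langs maps to a single token.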
from yanmtt.
OK, thanks, I will try it. By the way, besides --langs, does --mono_src also need to strictly follow a format like zh_CN? For example, should train.zh be renamed to train.zh_CN?
from yanmtt.
No the training files can have any suffix.
Only during tokenizer training, the training files should have proper suffixes which act as language indicator tokens which you plan to use for model training and decoding.
from yanmtt.
Thank you very much. After adjusting the settings, it ran successfully.
By the way, does that mean I have to pre-train from scratch if I want to continue pre-training with a new language? For example, I am continuing pre-training from mBART-50, which has only Simplified Chinese and no Traditional Chinese. If I add Traditional Chinese, do I have to pre-train from scratch? @prajdabre
from yanmtt.
To my understanding, mbart-50 does not officially support traditional Chinese. Firstly you will have to check if the mbart-50 tokenizer can handle all traditional characters or not.
If it does then you may directly train.
If not you have 2 strategies:
- Convert from traditional to simplified using some mapping table.
- Pretrain from scratch.
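The first strategy can be sketched with a character mapping table. The table below is a tiny illustrative sample, not a usable resource; a real conversion needs a table with thousands of entries and handling for characters that map differently depending on context.

```python
# Illustrative sample only: a real table is far larger.
TRAD_TO_SIMP = str.maketrans({
    "漢": "汉",
    "語": "语",
    "學": "学",
    "國": "国",
    "體": "体",
})


def to_simplified(text):
    """Replace every traditional character found in the table;
    characters without an entry are left unchanged."""
    return text.translate(TRAD_TO_SIMP)


print(to_simplified("學漢語"))
# prints 学汉语
```

The converted text can then be trained under the existing zh_CN language token.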
from yanmtt.
Thanks a lot @prajdabre
from yanmtt.
@prajdabre I have another point of confusion. I'm running the continued pre-training task train_mbart_model.sh on 2 GPUs with 8 languages. The log reports:
Training for: ['en_XX', 'es_XX', 'vi_VN', 'id_ID', 'th_TH', 'pt_XX', 'zh_CN', 'ko_KR']
and the settings are as follows:
nohup python pretrain_nmt.py -n 1 -nr 0 -g 2 --model_path gen_model/mbart_v1/mbart-50-v1 --tokenizer_name_or_path pretrain_model/mbart-50 --langs en_XX,es_XX,vi_VN,id_ID,th_TH,pt_XX,zh_CN,ko_KR --mono_src hyb_train_v1/train.en,hyb_train_v1/train.es,hyb_train_v1/train.vi,hyb_train_v1/train.id,hyb_train_v1/train.th,hyb_train_v1/train.pt,hyb_train_v1/train.zh,hyb_train_v1/train.ko --encoder_layers 12 --decoder_layers 12 --encoder_attention_heads=12 --decoder_attention_heads=12 --encoder_ffn_dim=128 --decoder_ffn_dim=4096 --d_model=1024 --batch_size 512 --use_official_pretrained --pretrained_model pretrain_model/mbart-50 --no_reload_optimizer_ctr_and_scheduler --long_save_every 50000 --save_intermediate_checkpoints --shard_files > gen_model/mbart_v1/run_train.log 2>&1 &
The strange thing is that after 380k batches, run_train.log shows training for only one language (ko_KR). The other languages do not appear in run_train.log, which suggests they have not taken part in training yet:
Finished epoch 10 for language: ko_KR
379500 6.06 42.11 seconds for 100 batches. Memory used post forward / backward passes: 11.9 / 13.32 GB.
379600 5.85 42.44 seconds for 100 batches. Memory used post forward / backward passes: 11.92 / 13.33 GB.
379700 5.74 42.34 seconds for 100 batches. Memory used post forward / backward passes: 11.98 / 13.36 GB.
379800 5.44 38.15 seconds for 100 batches. Memory used post forward / backward passes: 10.72 / 12.69 GB.
379900 6.02 42.54 seconds for 100 batches. Memory used post forward / backward passes: 12.08 / 13.37 GB.
Do you know the reason?
from yanmtt.
By the way, here is some supplementary information about the data:
lang | rows
en | 84747184
es | 7923542
vi | 21776227
pt | 3865782
th | 1343809
id | 14221194
zh_cn | 15662924
ko | 38642
and --num_batches = 2000000, batch_size = 512
from yanmtt.
Hi
Your supplementary information about corpora sizes answers it all.
Since Korean has the smallest data it will finish far more epochs before others finish one epoch. This is because there is a data sampling hyperparam called --data_sampling_temperature which is set to 5. This means that smaller data will be seen more often to keep training from focusing on the higher resource languages. I think you will see 1 epoch for Thai after 20 or so epochs for Korean.
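The effect can be estimated with the usual temperature-based sampling formula, in which language i is drawn with probability proportional to (n_i / N)^(1/T). This is a rough sketch that assumes YANMTT's sampler follows that formula and samples per example; the function name is made up, and the row counts are the ones reported in this thread.

```python
def sampling_probs(sizes, temperature):
    """Temperature-based sampling: p_i is proportional to (n_i / N) ** (1 / T)."""
    total = sum(sizes.values())
    weights = {lang: (n / total) ** (1.0 / temperature) for lang, n in sizes.items()}
    norm = sum(weights.values())
    return {lang: w / norm for lang, w in weights.items()}


# Row counts per language, as reported in the thread.
rows = {
    "en": 84747184, "es": 7923542, "vi": 21776227, "pt": 3865782,
    "th": 1343809, "id": 14221194, "zh_cn": 15662924, "ko": 38642,
}

probs = sampling_probs(rows, temperature=5.0)

# Expected number of sampled examples needed to cover one epoch of a language.
draws_per_epoch = {lang: rows[lang] / probs[lang] for lang in rows}

# How many Korean epochs pass while Thai completes a single epoch.
ko_epochs_per_th_epoch = draws_per_epoch["th"] / draws_per_epoch["ko"]
print(round(ko_epochs_per_th_epoch, 1))
```

Under these assumptions the ratio comes out at roughly 17, in line with the estimate of 20 or so Korean epochs per Thai epoch.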
from yanmtt.
Oh, thank you very much. I see that I overlooked the --data_sampling_temperature parameter. If that's the case, I'll have to think about resetting --num_batches to a larger value.
Previously I assumed that, without oversampling, every language would get about 7 full epochs:
number of batches per epoch: sum of rows (0.14 billion) / batch_size (512) = 273437
number of epochs: num_batches (2000000) / batches per epoch (273437) ≈ 7
But now, after 380k batches, the other 7 languages haven't finished even one epoch, so after 2000000 batches some high-resource languages will probably not be fully trained. I don't know whether I'm right to think of it this way. @prajdabre
from yanmtt.
A correction: The batch size doesn't indicate lines but number of tokens. If your corpus contains paragraphs then each batch per GPU contains only 2 or 3 entries for a batch of 512 tokens. If it's sentences then probably 8 or 10 sentences. Note that you should set --hard_truncate_length to 512 and --max_length to 512 as well. Else your training will skip all the data in case of paragraphs.
Anyway, with just 2 GPUs, you are going to need several tens of millions of steps. I'm afraid that with the scale of data you want to work with you need more GPUs to get results quickly. I recommend filtering the dataset to a more manageable size like 14 million examples across all languages.
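A back-of-the-envelope calculation shows the scale. This sketch assumes that each training step consumes one batch on every GPU, that a 512-token batch holds 8 to 10 sentences as estimated above, and that temperature-based oversampling is ignored; the row counts are the ones reported earlier in the thread.

```python
import math

# Row counts per language, as reported in the thread.
rows = {
    "en": 84747184, "es": 7923542, "vi": 21776227, "pt": 3865782,
    "th": 1343809, "id": 14221194, "zh_cn": 15662924, "ko": 38642,
}
total_sentences = sum(rows.values())


def steps_per_epoch(sentences, sentences_per_batch, gpus):
    """Steps needed to see every sentence once, assuming one batch
    per GPU per step (an assumption, not a verified YANMTT detail)."""
    return math.ceil(sentences / (sentences_per_batch * gpus))


for per_batch in (8, 10):
    steps = steps_per_epoch(total_sentences, per_batch, gpus=2)
    print(f"{per_batch} sentences per batch: {steps:,} steps per epoch")
```

At roughly 7.5 to 9.3 million steps for a single pass over the data, a few epochs already run into tens of millions of steps, which is why reducing the dataset or adding GPUs is suggested.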
from yanmtt.
@prajdabre Thank you very much. It seems my previous settings were not reasonable. My training data in every language consists of sentences, not paragraphs. I will set --hard_truncate_length and --max_length to 512, use a higher value for --num_batches, and reduce the training set to less than half of its current size.
from yanmtt.