Hi, When I use the train_mbart_model.sh to get furthe


RuntimeError: [enforce fail at inline_container.cc:145] . PytorchStreamReader failed reading file data/94862161006912: file read failed,about prajdabre/yanmtt

Comments (21)

prajdabre commented on August 21, 2024

It's likely that you have a corrupted model file. This can happen if training was terminated while the checkpoint was being saved.

Solutions:

On the terminal, try using torch.load on the model checkpoint to see if it loads. If it doesn't, that's the issue.
Inspect your previous pretraining logs to see whether everything was saved properly. Check the model file sizes and see whether they look OK.
Try to use the model with the pure_model suffix. Hopefully it's not corrupt.
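
As a quick pre-check before calling torch.load, here is a minimal sketch. It assumes the checkpoint was written in PyTorch's zip-based format (the default for torch.save since PyTorch 1.6, and the format the inline_container.cc error points to). The helper name checkpoint_looks_intact is hypothetical, not part of yanmtt. A file cut off mid-save usually lacks the zip end-of-archive record, so this check fails without loading several GB into memory.

```python
import zipfile


def checkpoint_looks_intact(path):
    """Return True if `path` is a readable zip archive whose members pass CRC checks.

    Assumption: the checkpoint uses PyTorch's zip-based serialization format.
    """
    # A truncated file normally has no end-of-central-directory record.
    if not zipfile.is_zipfile(path):
        return False
    try:
        with zipfile.ZipFile(path) as archive:
            # testzip() returns the name of the first corrupt member, or None.
            return archive.testzip() is None
    except (zipfile.BadZipFile, OSError, EOFError):
        return False


if __name__ == "__main__":
    import sys

    for checkpoint_path in sys.argv[1:]:
        status = "ok" if checkpoint_looks_intact(checkpoint_path) else "CORRUPT or not a zip checkpoint"
        print(f"{checkpoint_path}: {status}")
```

Passing this check does not guarantee that torch.load succeeds, but failing it is a strong sign that the save was interrupted.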

from yanmtt.

raullese commented on August 21, 2024

@prajdabre Thanks for your patience. I had in fact already guessed that the problem was that the model was not fully saved. When saving succeeds, the pretrain_nmt.py script writes two models. One is MyModel, which holds the state_dict, optimizer, scheduler and ctr; it is very large, 6.9GB after further pre-training from mBART-50.
The other is MyModel.pure_model, which holds only the state_dict, in the same form as the open-source pre-trained model; it is 2.3GB.

I have confirmed that when I run train_mbart_model.sh on multiple GPUs, the big model (6.9GB) is not saved completely, but the small one is, and I can reload from pure_model.

As I understand it, the big model (6.9GB) does not actually need to be saved? Can I save and reload only pure_model every 1000 steps, and keep using the optimizer and scheduler from the previous step? That might resolve my error on multiple GPUs. I would then save the big model only once, at the last step. Can I make these changes, or do you have other suggestions?

Actually, I don't understand the reason for saving this large model.

prajdabre commented on August 21, 2024

I'm not sure of the exact problem but I typically pass the flag: --save_intermediate_checkpoints

This will save a separate checkpoint every 10k iterations, and this 10k can be set to any value using another flag, --long_save_every. I then use an appropriate checkpoint.

To be honest, I have never used the last checkpoint. I should check whether there's a bug in the last checkpoint saving. That being said, using the .pure_model for fine-tuning is not a problem. I designed the training so that the big checkpoint, with the optimizer, scheduler and counter, can be used to resume training after a failed run. The pure_model checkpoint is the one that should be used for fine-tuning on a downstream task where optimizer params are not needed, for sharing with someone, or for uploading to huggingface.

Hope this makes sense.


raullese commented on August 21, 2024

@prajdabre Thank you for your reply. After many adjustments, I finally decided to remove the intermediate model-loading code, shown below,
checkpoint_dict = torch.load(CHECKPOINT_PATH, map_location=map_location)
model.load_state_dict(checkpoint_dict['model'])
optimizer.load_state_dict(checkpoint_dict['optimizer'])
scheduler.load_state_dict(checkpoint_dict['scheduler'])
del checkpoint_dict
and keep only the torch.save() part. I hope this is OK.

Another point I haven't understood: where is the stopping criterion for this pre-training stage configured? I can't seem to find a parameter that controls the number of training steps.

prajdabre commented on August 21, 2024

Your modification is ok. There's no real reason to load a saved model. I just wanted to make sure that everything is hard synchronized.

The argument you are looking for is: --num_batches


raullese commented on August 21, 2024

Hi

Thanks a lot

raullese commented on August 21, 2024

@prajdabre By the way, I want to ask what settings are needed when I pre-train on more than one language, for example 8 languages.
Are there any other parameters to be aware of? For example, I see:
parser.add_argument('--num_domains_for_domain_classifier', type=int, default=1, help='If we have multiple domains then we should set this to a value higher than one.')

Is it necessary to set num_domains_for_domain_classifier to the number of languages?

I ask because my script errored out again when I switched from one language to 8 languages for multilingual pre-training.
This time the error is:

****/miniconda3/envs/yanmtt/lib/python3.7/site-packages/torch/optim/lr_scheduler.py:247: UserWarning: To get the last learning rate computed by the scheduler, please use `get_last_lr()`.
  warnings.warn("To get the last learning rate computed by the scheduler, "
****miniconda3/envs/yanmtt/lib/python3.7/site-packages/torch/optim/lr_scheduler.py:136: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`. Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
  "https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate", UserWarning)
/miniconda3/envs/yanmtt/lib/python3.7/site-packages/torch/optim/lr_scheduler.py:216: UserWarning: Please also save or load the state of the optimizer when saving or loading the scheduler.
  warnings.warn(SAVE_STATE_WARNING, UserWarning)
Traceback (most recent call last):
  File "pretrain_nmt_new.py", line 970, in <module>
    run_demo()
  File "pretrain_nmt_new.py", line 967, in run_demo
    mp.spawn(model_create_load_run_save, nprocs=args.gpus, args=(args,files,train_files,)) #
  File "/miniconda3/envs/yanmtt/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 199, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/miniconda3/envs/yanmtt/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 157, in start_processes
    while not context.join():
  File "/miniconda3/envs/yanmtt/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 118, in join
    raise Exception(msg)
Exception:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/miniconda3/envs/yanmtt/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, args)
  File "/mt_mbart/yanmtt/pretrain_nmt_new.py", line 523, in model_create_load_run_save
    lprobs, labels, args.label_smoothing, ignore_index=tok.pad_token_id
  File "***/mt_mbart/yanmtt/common_utils.py", line 147, in label_smoothed_nll_loss
    smooth_loss.masked_fill_(pad_mask, 0.0)
RuntimeError: The expanded size of the tensor (27) must match the existing size (26) at non-singleton dimension 1. Target sizes: [15, 27, 1]. Tensor sizes: [15, 26, 1]

My script settings are:
nohup python pretrain_nmt_new.py -n 1 -nr 0 -g 2 --model_path gen_model/mbart_v1/mbart-50-v1 --tokenizer_name_or_path pretrain_model/mbart-50 --langs en,es,vi,id,th,pt,zh,ko --mono_src hyb_train_v1/train.en,hyb_train_v1/train.es,hyb_train_v1/train.vi,hyb_train_v1/train.id,hyb_train_v1/train.th,hyb_train_v1/train.pt,hyb_train_v1/train.zh,hyb_train_v1/train.ko --encoder_layers 12 --decoder_layers 12 --encoder_attention_heads=12 --decoder_attention_heads=12 --encoder_ffn_dim=128 --decoder_ffn_dim=4096 --d_model=1024 --batch_size 512 --use_official_pretrained --pretrained_model pretrain_model/mbart-50 --no_reload_optimizer_ctr_and_scheduler --long_save_every 50000 --save_intermediate_checkpoints --shard_files > gen_model/mbart_v1/run_train.log 2>&1 &

prajdabre commented on August 21, 2024

No, you don't need the domain classifier flags.

The failure occurs because the language IDs you pass must correspond to the ones used in mBART.

en should be en_XX

Look at the official mbart model repo and find the ids for other languages.
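
To illustrate the point about language IDs, here is a small sketch that rewrites a `--langs` value made of short codes into mBART-50 codes. The eight codes are the ones listed for these languages on the facebook/mbart-large-50 model card. The MBART50_CODES table and the to_mbart50_langs helper are hypothetical conveniences for this thread, not part of yanmtt.

```python
# Short code -> mBART-50 language code, for the eight languages in this thread.
# Hypothetical lookup table; extend it from the official model card as needed.
MBART50_CODES = {
    "en": "en_XX",
    "es": "es_XX",
    "vi": "vi_VN",
    "id": "id_ID",
    "th": "th_TH",
    "pt": "pt_XX",
    "zh": "zh_CN",
    "ko": "ko_KR",
}


def to_mbart50_langs(langs):
    """Convert a comma-separated list of short codes into mBART-50 codes."""
    codes = []
    for short in langs.split(","):
        short = short.strip()
        if short not in MBART50_CODES:
            # Fail loudly rather than pass an unknown token to training.
            raise KeyError(f"no mBART-50 code known for {short!r}")
        codes.append(MBART50_CODES[short])
    return ",".join(codes)


if __name__ == "__main__":
    print(to_mbart50_langs("en,es,vi,id,th,pt,zh,ko"))
```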


raullese commented on August 21, 2024

@prajdabre Thank you very much for your quick reply.

This is the official format of the language codes in mBART-large-50:
Arabic (ar_AR), Czech (cs_CZ), German (de_DE), English (en_XX), Spanish (es_XX), Estonian (et_EE), Finnish (fi_FI), French (fr_XX), Gujarati (gu_IN), Hindi (hi_IN), Italian (it_IT), Japanese (ja_XX), Kazakh (kk_KZ), Korean (ko_KR), Lithuanian (lt_LT), Latvian (lv_LV), Burmese (my_MM), Nepali (ne_NP), Dutch (nl_XX), Romanian (ro_RO), Russian (ru_RU), Sinhala (si_LK), Turkish (tr_TR), Vietnamese (vi_VN), Chinese (zh_CN), Afrikaans (af_ZA), Azerbaijani (az_AZ), Bengali (bn_IN), Persian (fa_IR), Hebrew (he_IL), Croatian (hr_HR), Indonesian (id_ID), Georgian (ka_GE), Khmer (km_KH), Macedonian (mk_MK), Malayalam (ml_IN), Mongolian (mn_MN), Marathi (mr_IN), Polish (pl_PL), Pashto (ps_AF), Portuguese (pt_XX), Swedish (sv_SE), Swahili (sw_KE), Tamil (ta_IN), Telugu (te_IN), Thai (th_TH), Tagalog (tl_XX), Ukrainian (uk_UA), Urdu (ur_PK), Xhosa (xh_ZA), Galician (gl_ES), Slovene (sl_SI)
from https://huggingface.co/facebook/mbart-large-50

What I don't understand is that when I pre-trained on a single language, I used just "en", not "en_XX", and it ran successfully. Also, the examples in the project's example/train_mbart_model.sh all use a format like "--langs hi,en,vi".

prajdabre commented on August 21, 2024

That's because coincidentally the token en existed in the Mbart tokenizer. I'm betting that the token zh isn't present in the tokenizer and is split as "z# h". Whenever a language token is split into multiple parts my code crashes (intentionally).
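
The check described here can be sketched before launching a run. This is a minimal illustration, not yanmtt's actual validation code: find_split_language_tokens is a hypothetical helper that takes any `tokenize` callable returning a list of pieces. With a Hugging Face tokenizer one would presumably pass its `tokenize` method; the toy tokenizer below only stands in for it.

```python
def find_split_language_tokens(tokenize, lang_tokens):
    """Return {lang: pieces} for every language token split into several pieces."""
    split = {}
    for lang in lang_tokens:
        pieces = tokenize(lang)
        if len(pieces) != 1:
            split[lang] = pieces
    return split


# Toy stand-in for a real tokenizer (an assumption for illustration only):
# known entries stay whole, anything else falls apart into characters.
TOY_VOCAB = {"en", "en_XX", "zh_CN"}


def toy_tokenize(text):
    return [text] if text in TOY_VOCAB else list(text)


if __name__ == "__main__":
    # "en" happens to be in the vocabulary, "zh" is not.
    print(find_split_language_tokens(toy_tokenize, ["en", "zh"]))
    print(find_split_language_tokens(toy_tokenize, ["en_XX", "zh_CN"]))
```

An empty result means every language token maps to a single piece, which is the condition the training script relies on according to the comment above.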


raullese commented on August 21, 2024

OK, thanks, I will try it. By the way, besides --langs, does --mono_src also need to strictly follow a format like zh_CN? For example, should I rename train.zh to train.zh_CN?

prajdabre commented on August 21, 2024

No, the training files can have any suffix.

Only during tokenizer training should the training files have proper suffixes. These act as the language indicator tokens that you plan to use for model training and decoding.

raullese commented on August 21, 2024

Thank you very much. After adjusting some settings, it now runs successfully.

By the way, does that mean I have to pre-train from scratch if I continue pre-training with a new language? For example, I am continuing pre-training from mBART-50, which has Simplified Chinese but not Traditional Chinese. If I add Traditional Chinese, do I have to pre-train from scratch? @prajdabre

prajdabre commented on August 21, 2024

To my understanding, mbart-50 does not officially support Traditional Chinese. First, you will have to check whether the mbart-50 tokenizer can handle all Traditional characters.

If it does, then you may train directly.
If not, you have 2 strategies:

Convert from traditional to simplified using some mapping table.
Pretrain from scratch.
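
The tokenizer coverage check mentioned above could look roughly like this. It is a sketch under assumptions: untokenizable_characters is a hypothetical helper that takes an `encode` callable returning token ids and the id of the unknown token. With a Hugging Face tokenizer one would presumably pass something like `lambda s: tokenizer.encode(s, add_special_tokens=False)` and `tokenizer.unk_token_id`. The toy encoder below only demonstrates the logic.

```python
def untokenizable_characters(text, encode, unk_id):
    """Return the distinct non-space characters of `text` that encode to `unk_id`."""
    missing = []
    for char in dict.fromkeys(text):  # distinct characters, first-seen order
        if char.isspace():
            continue
        if unk_id in encode(char):
            missing.append(char)
    return missing


# Toy encoder (an assumption for illustration): id 0 plays the unknown token.
TOY_IDS = {"汉": 5, "语": 6}


def toy_encode(text):
    return [TOY_IDS.get(char, 0) for char in text]


if __name__ == "__main__":
    # The Traditional form "漢" is absent from the toy vocabulary.
    print(untokenizable_characters("漢语 汉", toy_encode, unk_id=0))
```

Running this over a sample of the Traditional Chinese corpus gives a list of characters the tokenizer cannot represent; an empty list suggests direct training is feasible.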


raullese commented on August 21, 2024

Thanks a lot @prajdabre

raullese commented on August 21, 2024

@prajdabre Oh, I have another point of confusion. I'm running the continued pre-training task train_mbart_model.sh on 2 GPUs with 8 languages (Training for: ['en_XX', 'es_XX', 'vi_VN', 'id_ID', 'th_TH', 'pt_XX', 'zh_CN', 'ko_KR']), with the following settings:
nohup python pretrain_nmt.py -n 1 -nr 0 -g 2 --model_path gen_model/mbart_v1/mbart-50-v1 --tokenizer_name_or_path pretrain_model/mbart-50 --langs en_XX,es_XX,vi_VN,id_ID,th_TH,pt_XX,zh_CN,ko_KR --mono_src hyb_train_v1/train.en,hyb_train_v1/train.es,hyb_train_v1/train.vi,hyb_train_v1/train.id,hyb_train_v1/train.th,hyb_train_v1/train.pt,hyb_train_v1/train.zh,hyb_train_v1/train.ko --encoder_layers 12 --decoder_layers 12 --encoder_attention_heads=12 --decoder_attention_heads=12 --encoder_ffn_dim=128 --decoder_ffn_dim=4096 --d_model=1024 --batch_size 512 --use_official_pretrained --pretrained_model pretrain_model/mbart-50 --no_reload_optimizer_ctr_and_scheduler --long_save_every 50000 --save_intermediate_checkpoints --shard_files > gen_model/mbart_v1/run_train.log 2>&1 &
The strange thing is that after 380k batches, run_train.log shows training for only one language (ko_KR). The other languages do not appear in run_train.log, which suggests they have not participated in training yet:
Finished epoch 10 for language: ko_KR
379500 6.06 42.11 seconds for 100 batches. Memory used post forward / backward passes: 11.9 / 13.32 GB.
379600 5.85 42.44 seconds for 100 batches. Memory used post forward / backward passes: 11.92 / 13.33 GB.
379700 5.74 42.34 seconds for 100 batches. Memory used post forward / backward passes: 11.98 / 13.36 GB.
379800 5.44 38.15 seconds for 100 batches. Memory used post forward / backward passes: 10.72 / 12.69 GB.
379900 6.02 42.54 seconds for 100 batches. Memory used post forward / backward passes: 12.08 / 13.37 GB.

Do you know the reason?

raullese commented on August 21, 2024

By the way, here is supplementary information about the data (number of rows per language):
lang | rows
en | 84747184
es | 7923542
vi | 21776227
pt | 3865782
th | 1343809
id | 14221194
zh_cn | 15662924
ko | 38642

I also set --num_batches = 2000000 and batch_size = 512.

prajdabre commented on August 21, 2024

Your supplementary information about corpora sizes answers it all.

Since Korean has the smallest data it will finish far more epochs before others finish one epoch. This is because there is a data sampling hyperparam called --data_sampling_temperature which is set to 5. This means that smaller data will be seen more often to keep training from focusing on the higher resource languages. I think you will see 1 epoch for Thai after 20 or so epochs for Korean.
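
To make the effect of the temperature concrete, here is a worked sketch using the row counts posted above. It assumes the common temperature-smoothed sampling rule, in which language i is sampled with probability proportional to (n_i / N) ** (1 / T). Whether yanmtt implements exactly this formula is an assumption, and sampling_probabilities is a hypothetical helper, not the project's code.

```python
def sampling_probabilities(sizes, temperature):
    """Temperature-smoothed sampling: p_i proportional to (n_i / N) ** (1 / T)."""
    total = sum(sizes.values())
    weights = {lang: (n / total) ** (1.0 / temperature) for lang, n in sizes.items()}
    norm = sum(weights.values())
    return {lang: w / norm for lang, w in weights.items()}


# Row counts per language as reported in the thread.
ROWS = {
    "en": 84747184,
    "es": 7923542,
    "vi": 21776227,
    "pt": 3865782,
    "th": 1343809,
    "id": 14221194,
    "zh_cn": 15662924,
    "ko": 38642,
}

probs = sampling_probabilities(ROWS, temperature=5.0)

# How fast each corpus is cycled through: sampling probability per row.
epoch_rate = {lang: probs[lang] / ROWS[lang] for lang in ROWS}

if __name__ == "__main__":
    for lang in sorted(ROWS, key=ROWS.get):
        relative = epoch_rate[lang] / epoch_rate["th"]
        print(f"{lang}: p={probs[lang]:.3f}, epochs per Thai epoch={relative:.2f}")
```

Under this assumed formula, Korean completes roughly 17 epochs for every Thai epoch, which is in line with the "20 or so" estimate above, while English needs many Thai epochs to complete one of its own.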


raullese commented on August 21, 2024

Hi

Oh, thank you very much. I see that I had overlooked the --data_sampling_temperature parameter. If that's the case, I'll have to think about resetting --num_batches to a larger value.
Previously I assumed that, without oversampling, the number of epochs over all languages would be about 7:
number of batches per epoch: sum of rows (0.14 billion) / batch_size (512) = 273437
number of epochs: num_batches (2000000) / batches per epoch (273437) ≈ 7

But now, after 380k batches, the other 7 languages haven't finished even one epoch, so after 2000000 batches some high-resource languages will probably not be fully trained. I'm not sure whether I'm thinking about this correctly, @prajdabre.

prajdabre commented on August 21, 2024

A correction: The batch size doesn't indicate lines but number of tokens. If your corpus contains paragraphs then each batch per GPU contains only 2 or 3 entries for a batch of 512 tokens. If it's sentences then probably 8 or 10 sentences. Note that you should set --hard_truncate_length to 512 and --max_length to 512 as well. Else your training will skip all the data in case of paragraphs.

Anyway, with just 2 GPUs, you are going to need several tens of millions of steps. I'm afraid that with the scale of data you want to work with you need more GPUs to get results quickly. I recommend filtering the dataset to a more manageable size like 14 million examples across all languages.
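
A rough back-of-the-envelope version of this estimate is sketched below. The average of 50 tokens per sentence is an assumption chosen to match the "8 or 10 sentences" per 512-token batch mentioned above, and the sketch also assumes each GPU consumes one such batch per step. steps_per_epoch is a hypothetical helper, not a yanmtt function.

```python
import math


def steps_per_epoch(total_lines, batch_tokens, avg_tokens_per_line, gpus):
    """Estimate the steps needed to see every line once (no oversampling)."""
    # Lines that fit in one per-GPU batch of `batch_tokens` tokens.
    lines_per_gpu_batch = max(1, batch_tokens // avg_tokens_per_line)
    lines_per_step = lines_per_gpu_batch * gpus
    return math.ceil(total_lines / lines_per_step)


# Sum of the row counts reported in the thread.
TOTAL_LINES = (
    84747184 + 7923542 + 21776227 + 3865782
    + 1343809 + 14221194 + 15662924 + 38642
)

if __name__ == "__main__":
    estimate = steps_per_epoch(TOTAL_LINES, batch_tokens=512, avg_tokens_per_line=50, gpus=2)
    print(f"approx. steps for one pass over all data: {estimate:,}")
```

With these assumptions, a single pass over the data already takes several million steps, so the configured --num_batches of 2000000 covers only a fraction of one pass, and several passes land in the tens of millions.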


raullese commented on August 21, 2024

@prajdabre Thank you very much. It seems my previous settings were not reasonable. My training sets in all languages actually consist of sentences, not paragraphs. I will set --hard_truncate_length to 512 and --max_length to 512, use a higher value for --num_batches, and reduce the training set to less than half of its current size.
