Comments (11)
Can you post full repro and logs from your run? Do you use default dataset?
from deeplearningexamples.
@jbaczek Yes, sure! I checked the code and found that the problem occurs at line 392 of train.py:
sacrebleu_score = sacrebleu.corpus_bleu(predictions, refs, lowercase=args.ignore_case)
I found that len(predictions) = 2799 but len(refs) = 1, which is why the error occurred. Do you know how I can fix it? Thx!
Yes, I am using the default dataset, WMT2014 and the default pre-processing code.
Here is the log:
| distributed init (rank 0): env://
| distributed env init. MASTER_ADDR: 127.0.0.1, MASTER_PORT: 29500, WORLD_SIZE: 4, RANK: 3
| distributed init (rank 0): env://
| distributed env init. MASTER_ADDR: 127.0.0.1, MASTER_PORT: 29500, WORLD_SIZE: 4, RANK: 1
| distributed init (rank 0): env://
| distributed env init. MASTER_ADDR: 127.0.0.1, MASTER_PORT: 29500, WORLD_SIZE: 4, RANK: 0
| distributed init (rank 0): env://
| distributed env init. MASTER_ADDR: 127.0.0.1, MASTER_PORT: 29500, WORLD_SIZE: 4, RANK: 2
| distributed init done!
| distributed init done!
| distributed init done!
| distributed init done!
| initialized host dc90e6f8bbcc as rank 0 and device id 0
Namespace(adam_betas='(0.9, 0.997)', adam_eps=1e-09, adaptive_softmax_cutoff=None, arch='transformer_wmt_en_de_big_t2t', attention_dropout=0.1, beam=4, bpe_codes=None, clip_norm=0.0, cpu=False, criterion='label_smoothed_cross_entropy', data='/workspace/data-bin/wmt14_en_de_joined_dict', decoder_attention_heads=16, decoder_embed_dim=1024, decoder_embed_path=None, decoder_ffn_embed_dim=4096, decoder_layers=6, decoder_learned_pos=False, decoder_normalize_before=True, device_id=0, distributed_backend='nccl', distributed_init_method='env://', distributed_port=-1, distributed_rank=0, distributed_world_size=4, dropout=0.1, enable_parallel_backward_allred_opt=False, enable_parallel_backward_allred_opt_correctness_check=False, encoder_attention_heads=16, encoder_embed_dim=1024, encoder_embed_path=None, encoder_ffn_embed_dim=4096, encoder_layers=6, encoder_learned_pos=False, encoder_normalize_before=True, fp16=True, fuse_dropout_add=False, fuse_relu_dropout=False, gen_subset='test', ignore_case=True, keep_interval_updates=-1, label_smoothing=0.1, left_pad_source='True', left_pad_target='False', lenpen=1, local_rank=0, log_format=None, log_interval=1000, lr=[0.0006], lr_scheduler='inverse_sqrt', lr_shrink=0.1, max_epoch=0, max_len_a=0, max_len_b=200, max_sentences=None, max_sentences_valid=None, max_source_positions=1024, max_target_positions=1024, max_tokens=5120, max_update=0, min_len=1, min_loss_scale=0.0001, min_lr=0.0, model_overrides='{}', momentum=0.99, nbest=1, no_beamable_mm=False, no_early_stop=False, no_epoch_checkpoints=False, no_progress_bar=False, no_save=False, no_token_positional_embeddings=False, num_shards=1, online_eval=False, optimizer='adam', pad_sequence=1, parallel_backward_allred_opt_threshold=0, path=None, prefix_size=0, print_alignment=False, profile=None, quiet=False, raw_text=False, relu_dropout=0.1, remove_bpe=None, replace_unk=None, restore_file='checkpoint_last.pt', sampling=False, sampling_temperature=1, sampling_topk=-1, 
save_dir='/workspace/checkpoints', save_interval=1, save_interval_updates=0, score_reference=False, seed=1, sentence_avg=False, sentencepiece=False, shard_id=0, share_all_embeddings=True, share_decoder_input_output_embed=False, skip_invalid_size_inputs_valid_test=False, source_lang=None, target_bleu=28.3, target_lang=None, task='translation', train_subset='train', unkpen=0, unnormalized=False, update_freq=[1], valid_subset='valid', validate_interval=1, warmup_init_lr=0.0, warmup_updates=4000, weight_decay=0.0)
| [en] dictionary: 33712 types
| [de] dictionary: 33712 types
| /workspace/data-bin/wmt14_en_de_joined_dict train 4575637 examples
| Sentences are being padded to multiples of: 1
| /workspace/data-bin/wmt14_en_de_joined_dict valid 3000 examples
| Sentences are being padded to multiples of: 1
| model transformer_wmt_en_de_big_t2t, criterion LabelSmoothedCrossEntropyCriterion
| num. model params: 210808832
| training on 4 GPUs
| max tokens per GPU = 5120 and max sentences per GPU = None
generated batches in 2.2408082485198975 s
| WARNING: overflow detected, setting loss scale to: 64.0
| WARNING: overflow detected, setting loss scale to: 32.0
| WARNING: overflow detected, setting loss scale to: 16.0
| epoch 001: 1000 / 7872 loss=10.728, nll_loss=10.037, ppl=1050.91, wps=112765, ups=6.1, wpb=18072, bsz=595, num_updates=998, lr=0.0001497, gnorm=88859.330, clip=100%, oom=0, loss_scale=16.000, wall=164
| epoch 001: 2000 / 7872 loss=9.220, nll_loss=8.283, ppl=311.46, wps=115974, ups=6.3, wpb=18106, bsz=588, num_updates=1998, lr=0.0002997, gnorm=71073.070, clip=100%, oom=0, loss_scale=16.000, wall=316
| epoch 001: 3000 / 7872 loss=8.221, nll_loss=7.122, ppl=139.25, wps=117589, ups=6.4, wpb=18097, bsz=586, num_updates=2998, lr=0.0004497, gnorm=75402.434, clip=100%, oom=0, loss_scale=32.000, wall=466
| WARNING: overflow detected, setting loss scale to: 16.0
| epoch 001: 4000 / 7872 loss=7.599, nll_loss=6.402, ppl=84.59, wps=118400, ups=6.5, wpb=18085, bsz=583, num_updates=3997, lr=0.00059955, gnorm=65879.740, clip=100%, oom=0, loss_scale=16.000, wall=615
| epoch 001: 5000 / 7872 loss=7.164, nll_loss=5.904, ppl=59.86, wps=118999, ups=6.5, wpb=18084, bsz=583, num_updates=4997, lr=0.000536817, gnorm=58532.792, clip=100%, oom=0, loss_scale=16.000, wall=764
| epoch 001: 6000 / 7872 loss=6.843, nll_loss=5.537, ppl=46.43, wps=119286, ups=6.6, wpb=18076, bsz=583, num_updates=5997, lr=0.00049002, gnorm=56855.371, clip=100%, oom=0, loss_scale=32.000, wall=913
| epoch 001: 7000 / 7872 loss=6.588, nll_loss=5.247, ppl=37.98, wps=119602, ups=6.6, wpb=18078, bsz=583, num_updates=6997, lr=0.000453655, gnorm=55107.830, clip=100%, oom=0, loss_scale=32.000, wall=1062
| WARNING: overflow detected, setting loss scale to: 32.0
Epoch time: 1187.9658298492432
| epoch 001 | loss 6.412 | nll_loss 5.048 | ppl 33.09 | wps 119756 | ups 6.6 | wpb 18076 | bsz 581 | num_updates 7867 | lr 0.000427835 | gnorm 57321.792 | clip 100% | oom 0 | loss_scale 32.000 | wall 1192
generated batches in 0.0007636547088623047 s
| epoch 001 | valid on 'valid' subset | valid_loss 4.55658 | valid_nll_loss 2.8718 | valid_ppl 7.32 | num_updates 7867
| /workspace/data-bin/wmt14_en_de_joined_dict test 3003 examples
| Sentences are being padded to multiples of: 1
generated batches in 0.0006475448608398438 s
Traceback (most recent call last):
  File "/workspace/examples/transformer/train.py", line 525, in <module>
    distributed_main(args)
  File "/workspace/examples/transformer/distributed_train.py", line 57, in main
    single_process_main(args)
  File "/workspace/examples/transformer/train.py", line 128, in main
    current_bleu, current_sc_bleu = score(args, trainer, task, epoch_itr, args.gen_subset)
  File "/workspace/examples/transformer/train.py", line 392, in score
    sacrebleu_score = sacrebleu.corpus_bleu(predictions, refs, lowercase=args.ignore_case)
  File "/opt/conda/lib/python3.6/site-packages/sacrebleu.py", line 1031, in corpus_bleu
    raise EOFError("Source and reference streams have different lengths!")
EOFError: Source and reference streams have different lengths!
Here is what I run:
nohup python -m torch.distributed.launch --nproc_per_node 4 /workspace/examples/transformer/train.py /workspace/data-bin/wmt14_en_de_joined_dict \
  --arch transformer_wmt_en_de_big_t2t \
  --share-all-embeddings \
  --optimizer adam \
  --adam-betas '(0.9, 0.997)' \
  --adam-eps "1e-9" \
  --clip-norm 0.0 \
  --lr-scheduler inverse_sqrt \
  --warmup-init-lr 0.0 \
  --warmup-updates 4000 \
  --lr 0.0006 \
  --min-lr 0.0 \
  --dropout 0.1 \
  --weight-decay 0.0 \
  --criterion label_smoothed_cross_entropy \
  --label-smoothing 0.1 \
  --max-tokens 5120 \
  --seed 1 \
  --target-bleu 28.3 \
  --ignore-case \
  --fp16 \
  --save-dir /workspace/checkpoints \
  --distributed-init-method env:// &
from deeplearningexamples.
I printed predictions and refs and found that predictions is a list of 2997 sentences, whereas refs[0] is a list of 3003 sentences. So their lengths do not match.
from deeplearningexamples.
refs should be a list with one element. That is how sacrebleu handles arguments. I ran this code on DGX-1 16G and everything seems fine (I didn't use nohup though). What platform do you use? Have you tried to run training without nohup?
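To make the expected argument shapes concrete, here is a minimal sketch of the length check that fails in the traceback above. It does not call sacrebleu itself; `check_bleu_inputs` is a hypothetical helper written for illustration, and it only assumes what is stated in this thread: predictions is a flat list of sentences and refs is a list of reference streams (here a single one), each of which must have as many sentences as predictions.

```python
from typing import Sequence


def check_bleu_inputs(predictions: Sequence[str],
                      refs: Sequence[Sequence[str]]) -> None:
    """Hypothetical pre-check: every reference stream must match predictions in length."""
    for i, ref_stream in enumerate(refs):
        if len(ref_stream) != len(predictions):
            raise ValueError(
                f"reference stream {i} has {len(ref_stream)} sentences "
                f"but there are {len(predictions)} predictions"
            )


predictions = ["hello world", "good morning"]
refs = [["hello world", "good morning"]]  # one stream, so len(refs) == 1 is expected

check_bleu_inputs(predictions, refs)  # lengths agree, no exception

# Dropping a prediction reproduces the kind of mismatch reported in this issue.
try:
    check_bleu_inputs(predictions[:1], refs)
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
```

So len(refs) = 1 on its own is not the problem; what matters is whether len(refs[0]) equals len(predictions).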
from deeplearningexamples.
I am using a DGX server with 8 V100s (using 4 of them), Ubuntu 16.04, and CUDA driver 384.111. I will think more about it, but do you know why predictions has 2799 sentences and refs has 3003 sentences? Will it cause problems if the numbers do not match? On your machine, do predictions and refs have the same number of sentences?
from deeplearningexamples.
BTW, I just pulled the image nvcr.io/nvidia/pytorch:19.05-py3, created a container directly from it, and used the code in /workspace/examples/transformer. Is that code the same as the code on GitHub at https://github.com/NVIDIA/DeepLearningExamples.git ?
from deeplearningexamples.
It is a known issue that on configurations other than 8xV100 this part of the code can misbehave due to memory limitations (this will be addressed in the next release). But this error is new to me; it doesn't appear on my machine. Try running the training on the whole DGX.
Yes, the code inside the 19.05 container is the same as the one on GitHub, but if you use the code from the container you still have to install all dependencies.
from deeplearningexamples.
Thanks for your suggestions! I have four more small questions; it would be very helpful if you could answer them briefly:
- How do I adjust the batch size? In /Transformer/fairseq/options.py, I didn't see where hyperparameters like batch size are set. I only see "group.add_argument('--max-tokens', type=int, metavar='N', help='maximum number of tokens in a batch')". If max_tokens per batch is 5120, would the real batch size in the usual sense be only about 300 sentences per batch?
- What do fp16 and fp32 mean?
- Since the readme says that good performance can also be achieved with 4 GPUs, how should I run the code with 4 GPUs? Besides setting --nproc_per_node 4 on the command line, do you know what command line the people who produced the 4-GPU results in the readme used to run train.py? The readme mentions that "when training in FP32 mode on 4 GPUs, use the --update-freq=4 and --warmup-updates 16000 options", but what should we do in fp16 mode?
- Do the GPUs update trainable variables synchronously or asynchronously? That is, at each training step, do the 8 GPUs (if we use 8 rather than 4) take in different data and calculate 8 sets of gradients, wait for each other until all 8 gradient sets are calculated, and then the optimizer averages those 8 sets of gradients for the parameter update?
from deeplearningexamples.
- In NLP models you usually define batch size in terms of tokens, not whole sentences. Sentences have different lengths, and every token is represented as a high-dimensional vector in embedding space, so the amount of memory required per batch differs. The batching algorithm for the Transformer sorts the dataset by sentence length and batches sentences of similar length together to minimize the number of padding tokens per batch. This means that some batches can have 500 sentences while others have only 100. --max-tokens is the option that sets the batch size.
- fp16 and fp32 are arithmetic types. When you don't specify the --fp16 option, computation is performed in regular 32-bit floating point format. When set, --fp16 enables mixed precision training, meaning that nearly all computation is performed in half precision and only numerically vulnerable operators are computed in regular precision. For more info see the NVIDIA guidelines linked in the readme.
- The best result was achieved with fp16, batch size 5120, 8 GPUs and a linear warmup of 4000 updates. If you want to train with fewer GPUs and get similar results, use the --update-freq option with a value equal to the reciprocal of the scaling factor. Training in fp32 mode takes nearly twice as much memory, so you need to divide the batch size by 2 and use --update-freq 2 to simulate the same batch size. You also need to scale the number of warmup updates by the same amount. For example, to train with 4 GPUs in fp16, use --update-freq 2 --warmup-updates 8000. Also, if you encounter problems with evaluation, you can disable online evaluation and test the model after training with the generate.py script.
- The only synchronization point is when gradients are gathered. After that, each worker averages the gathered gradients and updates its own copy of the model.
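The token-based batching described in the first answer can be sketched roughly as follows. This is a simplified illustration, not the actual fairseq implementation: `batch_by_tokens` is a hypothetical helper, and it assumes a batch's cost is its padded size (number of sentences times the longest sentence in the batch).

```python
from typing import List


def batch_by_tokens(lengths: List[int], max_tokens: int) -> List[List[int]]:
    """Group sentence indices so each batch's padded size stays within max_tokens."""
    # Sort indices by sentence length so neighbours need little padding.
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    batches: List[List[int]] = []
    current: List[int] = []
    for idx in order:
        if lengths[idx] > max_tokens:
            raise ValueError(f"sentence {idx} alone exceeds max_tokens")
        # Lengths are ascending, so the new sentence is the longest in the batch
        # and every other sentence would be padded up to its length.
        padded_size = (len(current) + 1) * lengths[idx]
        if current and padded_size > max_tokens:
            batches.append(current)
            current = []
        current.append(idx)
    if current:
        batches.append(current)
    return batches


lengths = [5, 50, 6, 48, 5, 52, 7]  # toy sentence lengths in tokens
batches = batch_by_tokens(lengths, max_tokens=100)
print(batches)  # [[0, 4, 2, 6], [3, 1], [5]]
```

With the same token budget, the short sentences end up four to a batch while the long ones fit only one or two, which is why the number of sentences per batch varies so much.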
from deeplearningexamples.
Thanks a lot! Now I understand what fp32 and fp16 are. As for the batch size, I think it is 5120 tokens per GPU per time step, so the more GPUs, the larger the batch size. When using fp32 with 8 GPUs, we need to set "--update-freq 2" because fp32 takes twice the memory. However, when using fp16 with 4 GPUs, the number of steps per epoch doubles. Do we still need to set "--update-freq 2" even though each GPU still takes up to 5120 tokens per time step, so the GPUs should not need to split their batches?
from deeplearningexamples.
If you use 4 GPUs, the global batch size is 4x5120 tokens, which is half the size of the original one. --update-freq 2 virtually doubles it.
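The arithmetic behind this can be written out as a short sketch. The helper names are hypothetical, and the warmup rule (multiply by the --update-freq value) is only inferred from the examples quoted in this thread (4000 updates for 8 GPUs, 8000 with --update-freq 2, 16000 with --update-freq 4).

```python
def global_batch_tokens(max_tokens: int, num_gpus: int, update_freq: int = 1) -> int:
    """Tokens contributing to one optimizer step across all GPUs."""
    return max_tokens * num_gpus * update_freq


def scaled_warmup(base_warmup: int, update_freq: int) -> int:
    """Warmup updates scaled by the same factor (assumption based on this thread)."""
    return base_warmup * update_freq


reference = global_batch_tokens(5120, num_gpus=8)                        # 8 GPUs, fp16
four_gpus = global_batch_tokens(5120, num_gpus=4)                        # half of reference
four_gpus_accum = global_batch_tokens(5120, num_gpus=4, update_freq=2)   # matches reference

print(reference, four_gpus, four_gpus_accum)  # 40960 20480 40960
print(scaled_warmup(4000, update_freq=2))     # 8000
```

So with 4 GPUs each GPU still processes up to 5120 tokens per step, but gradients are accumulated over two steps before each update to match the 8-GPU global batch.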
from deeplearningexamples.