
Comments (11)

jbaczek commented on May 21, 2024

Can you post full repro and logs from your run? Do you use default dataset?

yaoyiran commented on May 21, 2024

@jbaczek Yes, sure! I have checked the code and found that the problem occurs at line 392 of train.py:
sacrebleu_score = sacrebleu.corpus_bleu(predictions, refs, lowercase=args.ignore_case). I found that len(predictions) = 2799 but len(refs) = 1, which is why the error happened. Do you know how I can fix it? Thanks!

Yes, I am using the default dataset, WMT2014 and the default pre-processing code.

Here is the log:

| distributed init (rank 0): env://
| distributed env init. MASTER_ADDR: 127.0.0.1, MASTER_PORT: 29500, WORLD_SIZE: 4, RANK: 3
| distributed init (rank 0): env://
| distributed env init. MASTER_ADDR: 127.0.0.1, MASTER_PORT: 29500, WORLD_SIZE: 4, RANK: 1
| distributed init (rank 0): env://
| distributed env init. MASTER_ADDR: 127.0.0.1, MASTER_PORT: 29500, WORLD_SIZE: 4, RANK: 0
| distributed init (rank 0): env://
| distributed env init. MASTER_ADDR: 127.0.0.1, MASTER_PORT: 29500, WORLD_SIZE: 4, RANK: 2
| distributed init done!
| distributed init done!
| distributed init done!
| distributed init done!
| initialized host dc90e6f8bbcc as rank 0 and device id 0
Namespace(adam_betas='(0.9, 0.997)', adam_eps=1e-09, adaptive_softmax_cutoff=None, arch='transformer_wmt_en_de_big_t2t', attention_dropout=0.1, beam=4, bpe_codes=None, clip_norm=0.0, cpu=False, criterion='label_smoothed_cross_entropy', data='/workspace/data-bin/wmt14_en_de_joined_dict', decoder_attention_heads=16, decoder_embed_dim=1024, decoder_embed_path=None, decoder_ffn_embed_dim=4096, decoder_layers=6, decoder_learned_pos=False, decoder_normalize_before=True, device_id=0, distributed_backend='nccl', distributed_init_method='env://', distributed_port=-1, distributed_rank=0, distributed_world_size=4, dropout=0.1, enable_parallel_backward_allred_opt=False, enable_parallel_backward_allred_opt_correctness_check=False, encoder_attention_heads=16, encoder_embed_dim=1024, encoder_embed_path=None, encoder_ffn_embed_dim=4096, encoder_layers=6, encoder_learned_pos=False, encoder_normalize_before=True, fp16=True, fuse_dropout_add=False, fuse_relu_dropout=False, gen_subset='test', ignore_case=True, keep_interval_updates=-1, label_smoothing=0.1, left_pad_source='True', left_pad_target='False', lenpen=1, local_rank=0, log_format=None, log_interval=1000, lr=[0.0006], lr_scheduler='inverse_sqrt', lr_shrink=0.1, max_epoch=0, max_len_a=0, max_len_b=200, max_sentences=None, max_sentences_valid=None, max_source_positions=1024, max_target_positions=1024, max_tokens=5120, max_update=0, min_len=1, min_loss_scale=0.0001, min_lr=0.0, model_overrides='{}', momentum=0.99, nbest=1, no_beamable_mm=False, no_early_stop=False, no_epoch_checkpoints=False, no_progress_bar=False, no_save=False, no_token_positional_embeddings=False, num_shards=1, online_eval=False, optimizer='adam', pad_sequence=1, parallel_backward_allred_opt_threshold=0, path=None, prefix_size=0, print_alignment=False, profile=None, quiet=False, raw_text=False, relu_dropout=0.1, remove_bpe=None, replace_unk=None, restore_file='checkpoint_last.pt', sampling=False, sampling_temperature=1, sampling_topk=-1, save_dir='/workspace/checkpoints', save_interval=1, save_interval_updates=0, score_reference=False, seed=1, sentence_avg=False, sentencepiece=False, shard_id=0, share_all_embeddings=True, share_decoder_input_output_embed=False, skip_invalid_size_inputs_valid_test=False, source_lang=None, target_bleu=28.3, target_lang=None, task='translation', train_subset='train', unkpen=0, unnormalized=False, update_freq=[1], valid_subset='valid', validate_interval=1, warmup_init_lr=0.0, warmup_updates=4000, weight_decay=0.0)
| [en] dictionary: 33712 types
| [de] dictionary: 33712 types
| /workspace/data-bin/wmt14_en_de_joined_dict train 4575637 examples
| Sentences are being padded to multiples of: 1
| /workspace/data-bin/wmt14_en_de_joined_dict valid 3000 examples
| Sentences are being padded to multiples of: 1
| model transformer_wmt_en_de_big_t2t, criterion LabelSmoothedCrossEntropyCriterion
| num. model params: 210808832
| training on 4 GPUs
| max tokens per GPU = 5120 and max sentences per GPU = None
generated batches in 2.2408082485198975 s
| WARNING: overflow detected, setting loss scale to: 64.0
| WARNING: overflow detected, setting loss scale to: 32.0
| WARNING: overflow detected, setting loss scale to: 16.0
| epoch 001: 1000 / 7872 loss=10.728, nll_loss=10.037, ppl=1050.91, wps=112765, ups=6.1, wpb=18072, bsz=595, num_updates=998, lr=0.0001497, gnorm=88859.330, clip=100%, oom=0, loss_scale=16.000, wall=164
| epoch 001: 2000 / 7872 loss=9.220, nll_loss=8.283, ppl=311.46, wps=115974, ups=6.3, wpb=18106, bsz=588, num_updates=1998, lr=0.0002997, gnorm=71073.070, clip=100%, oom=0, loss_scale=16.000, wall=316
| epoch 001: 3000 / 7872 loss=8.221, nll_loss=7.122, ppl=139.25, wps=117589, ups=6.4, wpb=18097, bsz=586, num_updates=2998, lr=0.0004497, gnorm=75402.434, clip=100%, oom=0, loss_scale=32.000, wall=466
| WARNING: overflow detected, setting loss scale to: 16.0
| epoch 001: 4000 / 7872 loss=7.599, nll_loss=6.402, ppl=84.59, wps=118400, ups=6.5, wpb=18085, bsz=583, num_updates=3997, lr=0.00059955, gnorm=65879.740, clip=100%, oom=0, loss_scale=16.000, wall=615
| epoch 001: 5000 / 7872 loss=7.164, nll_loss=5.904, ppl=59.86, wps=118999, ups=6.5, wpb=18084, bsz=583, num_updates=4997, lr=0.000536817, gnorm=58532.792, clip=100%, oom=0, loss_scale=16.000, wall=764
| epoch 001: 6000 / 7872 loss=6.843, nll_loss=5.537, ppl=46.43, wps=119286, ups=6.6, wpb=18076, bsz=583, num_updates=5997, lr=0.00049002, gnorm=56855.371, clip=100%, oom=0, loss_scale=32.000, wall=913
| epoch 001: 7000 / 7872 loss=6.588, nll_loss=5.247, ppl=37.98, wps=119602, ups=6.6, wpb=18078, bsz=583, num_updates=6997, lr=0.000453655, gnorm=55107.830, clip=100%, oom=0, loss_scale=32.000, wall=1062
| WARNING: overflow detected, setting loss scale to: 32.0
Epoch time: 1187.9658298492432
| epoch 001 | loss 6.412 | nll_loss 5.048 | ppl 33.09 | wps 119756 | ups 6.6 | wpb 18076 | bsz 581 | num_updates 7867 | lr 0.000427835 | gnorm 57321.792 | clip 100% | oom 0 | loss_scale 32.000 | wall 1192
generated batches in 0.0007636547088623047 s
| epoch 001 | valid on 'valid' subset | valid_loss 4.55658 | valid_nll_loss 2.8718 | valid_ppl 7.32 | num_updates 7867
| /workspace/data-bin/wmt14_en_de_joined_dict test 3003 examples
| Sentences are being padded to multiples of: 1
generated batches in 0.0006475448608398438 s
Traceback (most recent call last):
  File "/workspace/examples/transformer/train.py", line 525, in <module>
    distributed_main(args)
  File "/workspace/examples/transformer/distributed_train.py", line 57, in main
    single_process_main(args)
  File "/workspace/examples/transformer/train.py", line 128, in main
    current_bleu, current_sc_bleu = score(args, trainer, task, epoch_itr, args.gen_subset)
  File "/workspace/examples/transformer/train.py", line 392, in score
    sacrebleu_score = sacrebleu.corpus_bleu(predictions, refs, lowercase=args.ignore_case)
  File "/opt/conda/lib/python3.6/site-packages/sacrebleu.py", line 1031, in corpus_bleu
    raise EOFError("Source and reference streams have different lengths!")
EOFError: Source and reference streams have different lengths!

Here is what I run:

nohup python -m torch.distributed.launch --nproc_per_node 4 /workspace/examples/transformer/train.py /workspace/data-bin/wmt14_en_de_joined_dict \
--arch transformer_wmt_en_de_big_t2t \
--share-all-embeddings \
--optimizer adam \
--adam-betas '(0.9, 0.997)' \
--adam-eps "1e-9" \
--clip-norm 0.0 \
--lr-scheduler inverse_sqrt \
--warmup-init-lr 0.0 \
--warmup-updates 4000 \
--lr 0.0006 \
--min-lr 0.0 \
--dropout 0.1 \
--weight-decay 0.0 \
--criterion label_smoothed_cross_entropy \
--label-smoothing 0.1 \
--max-tokens 5120 \
--seed 1 \
--target-bleu 28.3 \
--ignore-case \
--fp16 \
--save-dir /workspace/checkpoints \
--distributed-init-method env:// &

yaoyiran commented on May 21, 2024

I have printed predictions and refs out and found that predictions is a list (len 2997) each element being a sentence whereas refs[0] is a list with 3003 sentences. So, their shapes do not match.

jbaczek commented on May 21, 2024

refs should be a list with one element; that is how sacrebleu handles its arguments. I ran this code on a DGX-1 16G and everything seems fine (I didn't use nohup, though). What platform do you use? Have you tried running training without nohup?
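
For reference, here is a toy sketch (made-up sentences, not taken from the repo) of the argument shapes this version of sacrebleu.corpus_bleu expects: predictions is a flat list of system outputs, and refs is a list of reference streams, where each stream must have the same length as predictions.

```python
import sacrebleu

predictions = ["the cat sat on the mat", "hello world"]   # one hypothesis per test sentence
refs = [["the cat sat on the mat", "hello there world"]]  # a single reference stream of equal length

bleu = sacrebleu.corpus_bleu(predictions, refs, lowercase=True)
print(bleu.score)

# If len(refs[0]) != len(predictions), corpus_bleu raises the
# "Source and reference streams have different lengths!" EOFError seen in the traceback above.
```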

yaoyiran commented on May 21, 2024

I am using a DGX server with 8 V100s (using 4 of them), Ubuntu 16.04, and CUDA driver 384.111. I will think more about it, but do you know why predictions has 2799 sentences while refs has 3003 sentences? Will it cause problems if the numbers do not match? On your machine, do predictions and refs have the same number of sentences?

yaoyiran commented on May 21, 2024

BTW, I just pulled the image nvcr.io/nvidia/pytorch:19.05-py3, built a container directly from the image, and used the code in /workspace/examples/transformer. Is that code the same as the one on GitHub at https://github.com/NVIDIA/DeepLearningExamples.git ?

jbaczek commented on May 21, 2024

It is a known issue that on configurations other than 8x V100 this part of the code can misbehave due to memory limitations (this will be addressed in the next release). But this error is new to me; it doesn't appear on my machine. Try running training on the whole DGX.
Yes, the code inside the 19.05 container is the same as the one on GitHub, but if you use the code from the container you still have to install all the dependencies.

yaoyiran commented on May 21, 2024

Thanks for your suggestions! I have four more small questions. It would be very helpful if you could answer them:

  1. How do I adjust the batch size? In /Transformer/fairseq/options.py I didn't see how hyperparameters like batch size are set; I only see "group.add_argument('--max-tokens', type=int, metavar='N', help='maximum number of tokens in a batch')". If max_tokens per batch is 5120, then the real batch size in the usual sense would only be around 300 sentences per batch?
  2. What do fp16 and fp32 mean?
  3. Since the readme says the model can also achieve good performance with 4 GPUs, I am wondering how I should run the code with 4 GPUs. In addition to setting --nproc_per_node 4 on the command line, do you know how the people who provided the 4-GPU results in the readme set up their command line to run train.py? The readme mentions that "when training in FP32 mode on 4 GPUs, use the --update-freq=4 and --warmup-updates 16000 options", but how should we do that in fp16 mode?
  4. Do different GPUs update trainable variables synchronously or asynchronously? I mean, at each training step the 8 GPUs (if we use 8 rather than 4) take in different data and compute 8 sets of gradients; do the GPUs wait for each other until all 8 gradient sets are computed, and does the optimizer then take the average over those 8 sets of gradients for the update?

jbaczek commented on May 21, 2024
  1. In NLP models you usually define the batch size in terms of tokens, not whole sentences. Sentences can have different lengths, and every token is represented as a high-dimensional vector in the embedding space, so the amount of memory a batch requires will differ. The batching algorithm for the Transformer sorts the dataset by sentence length and batches sentences of similar length together to minimize the number of padding tokens per batch. This means that some batches can have 500 sentences while others have only 100. --max-tokens is the option to set the batch size (see the first sketch after this list).
  2. fp16 and fp32 are arithmetic types. When you don't specify the --fp16 option, computation is performed in the regular 32-bit floating-point format. When set, the --fp16 option enables mixed precision training, meaning that nearly all computation is performed in half precision and only numerically vulnerable operators are computed in regular precision (see the second sketch after this list). For more info see the NVIDIA guidelines linked in the readme.
  3. The best result was achieved with fp16, batch size 5120, 8 GPUs, and a linear warmup of 4000 updates. If you want to train with fewer GPUs and get similar results, use the --update-freq option with a value equal to the reciprocal of the scaling factor. Training in fp32 mode takes nearly twice as much memory, so you need to divide the batch size by 2 and use --update-freq 2 to simulate the same batch size; you also need to scale the number of warmup updates by the same factor. For example, if you want to train with 4 GPUs in fp16, use --update-freq 2 --warmup-updates 8000. Also, if you encounter problems with the evaluation, you can disable online evaluation and test the model after training with the generate.py script.
  4. The only synchronization point is when gradients are gathered. After that, each worker averages the gathered gradients and updates its copy of the model separately.
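
To make the token-count batching in point 1 concrete, here is a minimal Python sketch (not the actual fairseq batching code; `sentences` is assumed to be a list of already-tokenized sentences): sort by length, then keep adding sentences to the current batch as long as the sentence count times the longest sentence length (i.e. tokens including padding) stays within --max-tokens.

```python
def batch_by_tokens(sentences, max_tokens=5120):
    """Group tokenized sentences into batches of at most `max_tokens` tokens, padding included."""
    batches, current = [], []
    for sent in sorted(sentences, key=len):
        longest = max(len(sent), max(map(len, current), default=0))
        # Cost of the would-be batch: sentence count * longest (padded) sentence length.
        if current and (len(current) + 1) * longest > max_tokens:
            batches.append(current)
            current = []
        current.append(sent)
    if current:
        batches.append(current)
    return batches
```

For point 2, this repo ships its own fp16 loss-scaling code (hence the "overflow detected, setting loss scale" lines in the log above), but the same mixed-precision idea in present-day PyTorch looks roughly like this illustrative torch.cuda.amp sketch:

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=6e-4)
scaler = torch.cuda.amp.GradScaler()       # dynamic loss scaling

x = torch.randn(8, 1024, device="cuda")
with torch.cuda.amp.autocast():            # run ops in fp16 where it is numerically safe
    loss = model(x).pow(2).mean()
scaler.scale(loss).backward()              # scale the loss to avoid fp16 gradient underflow
scaler.step(optimizer)                     # unscales gradients; skips the step on overflow
scaler.update()                            # grows/shrinks the loss scale, like the log lines above
```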

yaoyiran commented on May 21, 2024

Thanks a lot! Now I understand what fp32 and fp16 are. For the batch size, I think it is 5120 tokens per GPU per step, so the more GPUs, the larger the global batch size. When using fp32 on 8 GPUs we need to set "--update-freq 2" because fp32 takes double the memory. However, when using fp16 on 4 GPUs the number of steps per epoch doubles, so do we still need to set "--update-freq 2", given that each GPU still takes up to 5120 tokens per step and I think the GPUs do not need to split their batch?

jbaczek commented on May 21, 2024

If you use 4 GPUs, the global batch size is 4x5120, which is half the size of the original one. --update-freq 2 virtually doubles it.
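
As a rough illustration of that arithmetic (a sketch only, not the repo's training loop; `model` is assumed to return a scalar loss), --update-freq N accumulates gradients over N forward/backward passes before a single optimizer step, so 4 GPUs x 5120 tokens x update-freq 2 matches the 8 GPUs x 5120 tokens global batch:

```python
import torch

def train_step(model, optimizer, batches, update_freq=2):
    """One optimizer update built from `update_freq` accumulated mini-batches."""
    optimizer.zero_grad()
    for inputs, targets in batches[:update_freq]:
        loss = model(inputs, targets)      # assumed to return a scalar loss
        (loss / update_freq).backward()    # accumulate averaged gradients
    optimizer.step()                       # single parameter update

# Effective tokens per update: world_size * max_tokens * update_freq
# 4 GPUs * 5120 * 2 == 8 GPUs * 5120 * 1 == 40960 tokens
```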
