
haoyuhu / bert-multi-gpu


Feel free to fine-tune large BERT models with Multi-GPU and FP16 support.

License: Apache License 2.0

Python 98.64% Shell 1.36%

bert-multi-gpu's People

Contributors

haoyuhu, shishishu


bert-multi-gpu's Issues

How to freeze bert layers?

I am using this repo in my BERT project and would like to freeze some layers of BERT. May I know how to do it?
Thanks!
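
As far as I know this repo does not ship a freezing flag; below is a minimal TF 1.x sketch of one common approach, which excludes chosen variable scopes from the gradient update. The scope names assume the standard BERT variable naming (bert/embeddings, bert/encoder/layer_N), and the helper name is made up for illustration.

    import tensorflow as tf

    # Scopes to freeze; standard BERT variable naming is assumed here.
    FROZEN_SCOPES = ("bert/embeddings", "bert/encoder/layer_0", "bert/encoder/layer_1")

    def trainable_variables_without_frozen():
        """Trainable variables whose names do not start with a frozen scope."""
        return [v for v in tf.trainable_variables()
                if not v.name.startswith(FROZEN_SCOPES)]

    # Inside model_fn, compute and apply gradients only for the kept variables:
    #   tvars = trainable_variables_without_frozen()
    #   grads = tf.gradients(loss, tvars)
    #   train_op = optimizer.apply_gradients(zip(grads, tvars), global_step=global_step)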

In prediction, only one gpu is available

First, thanks for your amazing repository.

I have a problem: during training all GPUs are used, but during prediction only gpu:0 is used.

I saw that someone else has raised this issue before.

Thanks.

Any benchmark results?

What GPUs are you using? What's the maximum batch_size for 11/12/16/32 GB GPUs (e.g.: GTX 1080 Ti, Titan X, V100) and the corresponding performance? Thanks.

run_pretraining.py script for multi-gpu

Hi. Very nice project! Is there a code equivalent to the original "run_pretraining.py" for pre-training BERT from scratch using new texts and your bert-multi-gpu code?
Thanks.

Pretrain with multi gpus

Thanks for your excellent work on multi-GPU fine-tuning of BERT. Will you release a multi-GPU version of pretraining?

Do we see different results with different global_batch_size but same iteration_steps? How does Global Batch Size play a role? Will time taken to complete training change?

Situation 1 (multi-gpu): train_batch_size = 8, num_gpu_cores = 4, num_train_epochs = 1
global_batch_size = train_batch_size * num_gpu_cores = 32
iteration_steps = num_train_examples * num_train_epochs / train_batch_size = 4000

Situation 2 (single-gpu): train_batch_size = 8, num_gpu_cores = 1, num_train_epochs = 1
global_batch_size = train_batch_size * num_gpu_cores = 8
iteration_steps = num_train_examples * num_train_epochs / train_batch_size = 4000

Let me explain why I put forward these two situations.
Even though I increase the number of GPU cores when fine-tuning the model, I don't see any decrease in training time.
It seems that stability might have improved because of multi-GPU, but the goal of reducing the total training time is not achieved at all.
Is it because iteration_steps does not change?
Or am I missing something?
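
As a side note, assuming each global step consumes one global batch (train_batch_size examples on every GPU, which is how MirroredStrategy feeds the replicas), the per-epoch step count differs between the two situations. A small sketch, where the helper is hypothetical and 32000 is the example count implied by the numbers above:

    def steps_per_epoch(num_train_examples, train_batch_size, num_gpu_cores):
        """Steps for one epoch if every step consumes one *global* batch."""
        global_batch_size = train_batch_size * num_gpu_cores
        return num_train_examples // global_batch_size

    print(steps_per_epoch(32000, 8, 4))  # situation 1: 1000 steps
    print(steps_per_epoch(32000, 8, 1))  # situation 2: 4000 steps

If iteration_steps stays at 4000 in both situations, the 4-GPU run effectively processes four times as much data per epoch, which would be consistent with the wall-clock time not dropping.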

Is the main difference between the original BERT and bert-multi-gpu the lines below?

    tf.logging.info("Use normal RunConfig")
    dist_strategy = tf.contrib.distribute.MirroredStrategy(
        num_gpus=FLAGS.num_gpu_cores,
        cross_device_ops=AllReduceCrossDeviceOps('nccl', num_packs=FLAGS.num_gpu_cores),
    )
    log_every_n_steps = 8
    run_config = RunConfig(
        train_distribute=dist_strategy,
        eval_distribute=dist_strategy,
        log_step_count_steps=log_every_n_steps,
        model_dir=FLAGS.output_dir,
        save_checkpoints_steps=FLAGS.save_checkpoints_steps)

Reported error when using FP16

Hi,

I tried to run run_custom_classifier.py with fp16=true and got the following error:

I1126 13:48:27.714882 545560 saver.py:1284] Restoring parameters from E:\BERT\workspace\outputmodel\model.ckpt-0
2019-11-26 13:48:28.693963: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at save_restore_v2_ops.cc:184 : Not found: Key bad_steps not found in checkpoint

I commented out the fp16 part in the optimizer and then it ran successfully. I initialized from the Google official BERT base model. My guess is that the init checkpoint does not contain the bad_steps variable used by the fp16 mixed-precision optimizer.
How do I fix this?

Thanks!
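
Not an answer from the repo itself, but one generic TF 1.x pattern that matches the guess above is to restore only the variables that actually exist in the checkpoint, leaving optimizer-only variables such as bad_steps at their initializers; the helper below is purely illustrative.

    import tensorflow as tf

    def safe_assignment_map(init_checkpoint):
        """Map only those model variables that also exist in the checkpoint."""
        ckpt_vars = {name for name, _ in tf.train.list_variables(init_checkpoint)}
        assignment_map = {}
        for var in tf.trainable_variables():
            name = var.name.split(":")[0]
            if name in ckpt_vars:
                assignment_map[name] = name
        return assignment_map

    # e.g. in model_fn, instead of restoring everything unconditionally:
    # tf.train.init_from_checkpoint(init_checkpoint, safe_assignment_map(init_checkpoint))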

Can evaluation be done with multiple GPUs?

I see that in the code, evaluation switches to a single-GPU estimator. Can evaluation be done with multiple GPUs? And if I want to use train_and_evaluate on multiple GPUs, can that be done?
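
For reference, the standard tf.estimator.train_and_evaluate call looks like the sketch below, assuming the estimator was built with a RunConfig that sets both train_distribute and eval_distribute (as in the snippet quoted in another issue); train_input_fn, eval_input_fn and num_train_steps are placeholders.

    import tensorflow as tf

    # estimator, train_input_fn, eval_input_fn and num_train_steps are assumed
    # to be defined elsewhere; they are placeholders in this sketch.
    train_spec = tf.estimator.TrainSpec(input_fn=train_input_fn, max_steps=num_train_steps)
    eval_spec = tf.estimator.EvalSpec(input_fn=eval_input_fn, steps=None, throttle_secs=600)
    tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)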

Meaning of the results for a multi-label task

Hi,

I wrote a two-label task based on your code, where each label is a multi-class problem. In the CSV file saved in the output folder I see two columns, and I don't understand what these two columns mean. In a single-label task, n columns are generated, one per class with that class's probability. How should I interpret the output here? Thanks.
(A screenshot was attached to the original issue.)

No OpKernel was registered to support Op 'NcclAllReduce' used by node NcclAllReduce

Hello, when I run the code I get the following error:
No OpKernel was registered to support Op 'NcclAllReduce' used by node NcclAllReduce() with these attrs: [shared_name=c0, T=DT_FLOAT, num_devices=4, reduction="sum"]
Do you know why this happens? I suspect NCCL is not installed, but a colleague is able to do multi-GPU training with MXNet, and I don't know how to check whether NCCL is installed on the server. Do you know what causes the error above?

Hoping for an answer, thanks a lot!
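
Not the repo's own fix, but if NCCL really is unavailable on the machine, one commonly suggested TF 1.x workaround is to switch the cross-device reduction away from 'nccl'; a sketch mirroring the MirroredStrategy snippet quoted in another issue (num_gpus=4 matches the error message above):

    import tensorflow as tf
    from tensorflow.contrib.distribute import AllReduceCrossDeviceOps

    # 'hierarchical_copy' avoids the NcclAllReduce op, which is not registered
    # in some TensorFlow builds (e.g. on Windows or CPU-only installs).
    dist_strategy = tf.contrib.distribute.MirroredStrategy(
        num_gpus=4,
        cross_device_ops=AllReduceCrossDeviceOps('hierarchical_copy', num_packs=4),
    )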

AttributeError: 'list' object has no attribute 'text_a'

I got this error while training with multiple GPUs. I was getting good accuracy (~90%) with BERT on a single GPU, and now I am training the model on multiple GPUs with extra data but get this error:



I0927 12:31:52.653539 140638168917760 run_custom_classifier.py:1006]   Num examples = 2400
I0927 12:31:52.654157 140638168917760 run_custom_classifier.py:1007]   Batch size = 32
I0927 12:31:52.654301 140638168917760 run_custom_classifier.py:1008]   Num steps = 75
W0927 12:31:52.654463 140638168917760 deprecation_wrapper.py:119] From run_custom_classifier.py:580: The name tf.FixedLenFeature is deprecated. Please use tf.io.FixedLenFeature instead.

I0927 12:31:52.661087 140638168917760 estimator.py:360] Skipping training since max_steps has already saved.
I0927 12:31:52.668826 140638168917760 run_custom_classifier.py:552] Writing example 0 of 2
Writing example 0 of 2
Traceback (most recent call last):
  File "run_custom_classifier.py", line 1139, in <module>
    tf.app.run()
  File "/home/serving/.virtualenvs/bert_env/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/home/serving/.virtualenvs/bert_env/lib/python3.6/site-packages/absl/app.py", line 300, in run
    _run_main(main, args)
  File "/home/serving/.virtualenvs/bert_env/lib/python3.6/site-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "run_custom_classifier.py", line 1078, in main
    predict_file)
  File "run_custom_classifier.py", line 556, in file_based_convert_examples_to_features
    max_seq_length, tokenizer)
  File "run_custom_classifier.py", line 458, in convert_single_example
    tokens_a = tokenizer.tokenize(example.text_a)
AttributeError: 'list' object has no attribute 'text_a'


This is my training sample:

item_id            word_phrase                                                          value
445095147       Labrada Nutrition Charge Supershot - 2.5 fl oz - (Pack of 6)    Energy Drinks

Is it because my column names are different? Do I have to change them somewhere?

Help is appreciated.
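
Judging from the traceback, the examples handed to convert_single_example are plain lists rather than objects with a text_a attribute, so the custom processor is probably returning raw rows. Without seeing that processor, here is a minimal sketch in the style of the BERT classifier scripts showing how the sample above could be turned into InputExample objects (the column order and tab delimiter are assumptions):

    import csv

    class InputExample(object):
        """Same shape as the InputExample used by the BERT classifier scripts."""
        def __init__(self, guid, text_a, text_b=None, label=None):
            self.guid = guid
            self.text_a = text_a
            self.text_b = text_b
            self.label = label

    def read_examples(tsv_path, set_type):
        """Turn (item_id, word_phrase, value) rows into InputExample objects."""
        examples = []
        with open(tsv_path, "r", encoding="utf-8") as f:
            for i, row in enumerate(csv.reader(f, delimiter="\t")):
                if i == 0:
                    continue  # skip the header row
                guid = "%s-%s" % (set_type, row[0])
                examples.append(InputExample(guid=guid, text_a=row[1], label=row[2]))
        return examples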

Query regarding calculation of iteration_steps and role of global_batch_size.

I have 100000 observations (num_train_examples), train_batch_size = 4, num_gpu_cores = 8,
and I want to train with num_train_epochs = 1.
Could you please suggest whether I should go for
situation 1: iteration_steps = 100000 / 32, or
situation 2: iteration_steps = 100000 / 4?

If I go with situation 2, what is the role of global_batch_size?
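
For concreteness, the two options evaluate as follows (plain arithmetic only, not a statement about which one this repo expects):

    num_train_examples = 100000
    train_batch_size = 4           # per GPU
    num_gpu_cores = 8
    global_batch_size = train_batch_size * num_gpu_cores   # 32

    print(num_train_examples // global_batch_size)  # situation 1: 3125 steps
    print(num_train_examples // train_batch_size)   # situation 2: 25000 steps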

some problems about multi gpu training

Thanks a lot for providing multi-GPU support for BERT. I have some problems when running your bert-multi-gpu code run_custom_classifier.py.

  1. Line 929: log_every_n_steps = 8
    When I use 2 GPUs the log prints every 16 steps. However, when I use 4 GPUs the log still prints every 16 steps. Should it be 8*4 steps?
    In addition, the running time for 16 steps is almost the same whether I use 2 or 4 GPUs (4 GPUs even cost more time). So what do the 16 logged steps mean? Does each GPU run a different 16 steps?

  2. Line 953: num_train_steps = int(len(train_examples) / FLAGS.train_batch_size * FLAGS.num_train_epochs)
    If I use 4 GPUs, should num_train_steps = len(train_examples) / FLAGS.train_batch_size * FLAGS.num_train_epochs / 4?

Prediction using Multi-GPU bert

Hi,
Thanks a lot for providing the multi-gpu-support for bert.
I am currently working on a BERT project where prediction usually takes about 30 minutes on a single GPU. I have 8 GPUs and want to split the data into 8 parts and give one part to each GPU. Would I be able to run the predictions simultaneously on the 8 GPUs using your repo?

Best,
Aswin
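
For what it's worth, the TF 1.x Estimator RunConfig used here exposes train_distribute and eval_distribute but nothing equivalent for predict, so a common do-it-yourself route is to shard the input and launch one prediction process per GPU. A rough sketch, assuming the data has already been split into shard directories and that run_custom_classifier.py is invoked with the usual BERT-style flags (the paths and flags below are placeholders):

    import os
    import subprocess

    NUM_GPUS = 8
    procs = []
    for gpu_id in range(NUM_GPUS):
        env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu_id))  # pin one GPU per process
        cmd = [
            "python", "run_custom_classifier.py",
            "--do_predict=true",
            "--data_dir=shards/shard_%d" % gpu_id,     # placeholder shard directory
            "--output_dir=outputs/shard_%d" % gpu_id,  # placeholder output directory
            # ...plus the usual vocab/config/checkpoint flags...
        ]
        procs.append(subprocess.Popen(cmd, env=env))

    for p in procs:
        p.wait()  # per-shard results can then be concatenated afterwards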
