haoyuhu / bert-multi-gpu
Feel free to fine tune large BERT models with Multi-GPU and FP16 support.
License: Apache License 2.0
Which parameter of the graphics card does num_gpu_cores correspond to?
I am using this repo for my BERT project and I would like to freeze some layers of BERT. May I know how to do it?
Thanks!
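Not an official answer, but a common approach in TF 1.x BERT code is to filter the trainable-variable list before gradients are computed; the scope names below are assumptions based on the standard BERT variable naming:

```python
import tensorflow as tf

# A minimal sketch, assuming standard BERT variable scopes: freeze the
# embeddings and the first 8 encoder layers by excluding them from the
# list handed to the optimizer.
frozen = ['bert/embeddings/'] + ['bert/encoder/layer_%d/' % i for i in range(8)]

tvars = [
    v for v in tf.trainable_variables()
    if not any(p in v.name for p in frozen)
]

# Pass the filtered list wherever the training code computes gradients,
# e.g. grads = tf.gradients(loss, tvars) in the optimizer module.
```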
First, thanks for your amazing repository.
I have a problem: during training all GPUs are used, but during prediction only gpu:0 is used.
I saw someone else has raised this issue before.
Thanks.
W tensorflow/core/framework/op_kernel.cc:1502] OP_REQUIRES failed at save_restore_v2_ops.cc:184 : Not found: Key bert/embeddings/LayerNorm/beta/AdamWeightDecayOptimizer not found in checkpoint
This happened when I was modifying run_squad.py for multi-GPU support. I am not sure what went wrong. Can you help me fix the error? Many thanks.
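For context (my own reading, not a confirmed diagnosis): this error usually means a Saver is restoring the full training graph, including the AdamWeightDecayOptimizer slot variables, from a checkpoint that only contains model weights. The stock BERT code avoids this by mapping only the variables that exist in the checkpoint; a sketch, assuming init_checkpoint points at the pretrained model:

```python
import tensorflow as tf
from modeling import get_assignment_map_from_checkpoint  # shipped with BERT

# Map only the model variables that are actually present in the
# checkpoint; the optimizer's Adam accumulators are left at their
# initializers instead of being restored.
tvars = tf.trainable_variables()
assignment_map, _ = get_assignment_map_from_checkpoint(tvars, init_checkpoint)
tf.train.init_from_checkpoint(init_checkpoint, assignment_map)
```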
Can I use it for multi-GPU prediction? There are only train_distribute and eval_distribute parameters in tf.estimator.RunConfig.
What GPUs are you using? What's the maximum batch_size for 11/12/16/32 GB GPUs (e.g.: GTX 1080 Ti, Titan X, V100) and the corresponding performance? Thanks.
Hi. Very nice project! Is there a code equivalent to the original "run_pretraining.py" for pre-training BERT from scratch using new texts and your bert-multi-gpu code?
Thanks.
Thanks for your excellent work on multi-GPU fine-tuning of BERT. Will you release a multi-GPU version of pretraining?
Situation 1 (multi-GPU): train_batch_size = 8, num_gpu_cores = 4, num_train_epochs = 1
global_batch_size = train_batch_size * num_gpu_cores = 32
iteration_steps = num_train_examples * num_train_epochs / train_batch_size = 4000
Situation 2 (single-GPU): train_batch_size = 8, num_gpu_cores = 1, num_train_epochs = 1
global_batch_size = train_batch_size * num_gpu_cores = 8
iteration_steps = num_train_examples * num_train_epochs / train_batch_size = 4000
Let me explain why I put forward these two situations.
Even though I increase the number of GPU cores when I fine-tune the model, I don't see any decrease in training time.
It seems that stability might have improved because of multi-GPU, but the goal of reducing the total time taken is not achieved at all.
Is it because iteration_steps does not change?
Or am I missing something?
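For context (my reading of the code, not an authoritative answer): with MirroredStrategy each training step consumes one global batch, so if the step count is still derived from the per-GPU batch size, adding GPUs just means more passes over the data rather than a shorter run. A sketch of the scaled computation, using the repo's flag names:

```python
# A sketch, not the repo's exact code: divide by the *global* batch size
# so that one epoch corresponds to one pass over the training data.
global_batch_size = FLAGS.train_batch_size * FLAGS.num_gpu_cores
num_train_steps = int(
    len(train_examples) * FLAGS.num_train_epochs / global_batch_size)
```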
```python
import tensorflow as tf
from tensorflow.python.distribute.cross_device_ops import AllReduceCrossDeviceOps
from tensorflow.estimator import RunConfig

# Multi-GPU branch of run_custom_classifier.py: mirror the model across
# FLAGS.num_gpu_cores GPUs with NCCL all-reduce. (The single-GPU branch
# instead logs "Use normal RunConfig".)
dist_strategy = tf.contrib.distribute.MirroredStrategy(
    num_gpus=FLAGS.num_gpu_cores,
    cross_device_ops=AllReduceCrossDeviceOps(
        'nccl', num_packs=FLAGS.num_gpu_cores),
)

log_every_n_steps = 8
run_config = RunConfig(
    train_distribute=dist_strategy,
    eval_distribute=dist_strategy,
    log_step_count_steps=log_every_n_steps,
    model_dir=FLAGS.output_dir,
    save_checkpoints_steps=FLAGS.save_checkpoints_steps)
```
Hi:
I tried to run run_custom_classifier.py with fp16=true and got the following error:
I1126 13:48:27.714882 545560 saver.py:1284] Restoring parameters from E:\BERT\workspace\outputmodel\model.ckpt-0
2019-11-26 13:48:28.693963: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at save_restore_v2_ops.cc:184 : Not found: Key bad_steps not found in checkpoint
I commented out the fp16 part in the optimizer and then it ran successfully. I initialized from the official Google BERT base model. My guess is that the init checkpoint does not contain the bad_steps variable, which is used by the fp16 mixed-precision optimizer.
How do I fix this?
Thanks!
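One observation (a guess, not a confirmed fix): the restore in the log reads model.ckpt-0 from your output directory, not the Google checkpoint, so a checkpoint written by an earlier non-fp16 run would indeed lack bad_steps. Starting from an empty output_dir, with the Google model passed only as init_checkpoint, should avoid the mismatch. You can verify which keys a checkpoint really contains with tf.train.list_variables:

```python
import tensorflow as tf

# Diagnostic sketch: print every variable key stored in a checkpoint.
# The path is the one from the log above, used here only as an example.
ckpt = r'E:\BERT\workspace\outputmodel\model.ckpt-0'
for name, shape in tf.train.list_variables(ckpt):
    print(name, shape)
```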
Hey,
Thanks a lot for the multi-GPU support.
Due to a limited budget I have to work with 8 GB GPUs for my studies.
Right now I wonder if upgrading to TensorFlow 2.0 would bring any benefit.
Best regards
Andreas
I see in the code that evaluation switches to a single-GPU estimator. Is it possible to evaluate with multiple GPUs? And if I want to use train_and_evaluate with multiple GPUs, can that be done?
Hello, I get the following error when using the code:
No OpKernel was registered to support Op 'NcclAllReduce' used by node NcclAllReduce() with these attrs: [shared_name=c0, T=DT_FLOAT, num_devices=4, reduction="sum"]
Do you know why? I suspect NCCL is not installed, but a colleague can run multi-GPU training with MXNet on the same server, and I don't know how to check the NCCL installation there. So do you know what causes the error above?
Looking forward to your reply, thanks!
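Not an official answer: this error usually means the installed TensorFlow build has no NCCL kernel registered, independent of what MXNet uses. If installing NCCL is not an option, MirroredStrategy can be given a non-NCCL all-reduce algorithm instead; 'hierarchical_copy' below is the commonly suggested fallback (typically somewhat slower):

```python
import tensorflow as tf
from tensorflow.python.distribute.cross_device_ops import AllReduceCrossDeviceOps

# Sketch: replace the 'nccl' all-reduce with 'hierarchical_copy', which
# runs without the NCCL library. FLAGS refers to the repo's flags.
dist_strategy = tf.contrib.distribute.MirroredStrategy(
    num_gpus=FLAGS.num_gpu_cores,
    cross_device_ops=AllReduceCrossDeviceOps(
        'hierarchical_copy', num_packs=FLAGS.num_gpu_cores),
)
```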
I got this error while training with multiple GPUs. I get good accuracy (~90%) with BERT on a single GPU, and now I am training the model on multiple GPUs with extra data but getting this error:
I0927 12:31:52.653539 140638168917760 run_custom_classifier.py:1006] Num examples = 2400
I0927 12:31:52.654157 140638168917760 run_custom_classifier.py:1007] Batch size = 32
I0927 12:31:52.654301 140638168917760 run_custom_classifier.py:1008] Num steps = 75
W0927 12:31:52.654463 140638168917760 deprecation_wrapper.py:119] From run_custom_classifier.py:580: The name tf.FixedLenFeature is deprecated. Please use tf.io.FixedLenFeature instead.
I0927 12:31:52.661087 140638168917760 estimator.py:360] Skipping training since max_steps has already saved.
I0927 12:31:52.668826 140638168917760 run_custom_classifier.py:552] Writing example 0 of 2
```
Traceback (most recent call last):
  File "run_custom_classifier.py", line 1139, in <module>
    tf.app.run()
  File "/home/serving/.virtualenvs/bert_env/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/home/serving/.virtualenvs/bert_env/lib/python3.6/site-packages/absl/app.py", line 300, in run
    _run_main(main, args)
  File "/home/serving/.virtualenvs/bert_env/lib/python3.6/site-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "run_custom_classifier.py", line 1078, in main
    predict_file)
  File "run_custom_classifier.py", line 556, in file_based_convert_examples_to_features
    max_seq_length, tokenizer)
  File "run_custom_classifier.py", line 458, in convert_single_example
    tokens_a = tokenizer.tokenize(example.text_a)
AttributeError: 'list' object has no attribute 'text_a'
```
This is my training sample:

| item_id | word_phrase | value |
| --- | --- | --- |
| 445095147 | Labrada Nutrition Charge Supershot - 2.5 fl oz - (Pack of 6) | Energy Drinks |

Is it because my column names are different? Do I have to change them somewhere?
Help is appreciated.
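For what it's worth (a guess from the traceback, not a confirmed diagnosis): convert_single_example expects each element to be an InputExample with a text_a attribute, and the AttributeError says it got a plain list instead, which points at how the rows are parsed into examples rather than at the column names as such. A sketch of the expected shape, assuming word_phrase is the text and value is the label:

```python
from run_custom_classifier import InputExample  # class defined in the BERT code

# Hypothetical row -> example conversion; the column meanings are assumed.
def row_to_example(i, row):
    item_id, word_phrase, value = row
    return InputExample(
        guid='train-%d' % i,  # any unique id
        text_a=word_phrase,   # the sentence to classify
        text_b=None,          # unused for single-sentence tasks
        label=value)          # the class label
```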
I have 100000 observations (num_train_examples), train_batch_size = 4, num_gpu_cores = 8,
and I want to train with num_train_epochs = 1.
Could you please suggest whether I should go for
situation 1: iteration_steps = 100000 / 32, or
situation 2: iteration_steps = 100000 / 4?
If I go with situation 2, what is the role of global_batch_size?
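Working through the formulas quoted earlier (my understanding, not an authoritative answer): with MirroredStrategy each step consumes one global batch, so global_batch_size = 4 * 8 = 32 and iteration_steps = 100000 * 1 / 32 ≈ 3125, i.e. situation 1. Under situation 2, each of the 100000 / 4 = 25000 steps would still see 32 examples, so the model would effectively train for 8 epochs instead of 1.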
Thanks a lot for providing multi-GPU support for BERT. I have some problems when running your bert-multi-gpu code run_custom_classifier.py.
line 929: log_every_n_steps = 8
When I use 2 GPUs, the log prints every 16 steps. However, when I use 4 GPUs, the log still prints every 16 steps. Should it be 8*4 steps?
In addition, the running time for 16 steps is almost the same whether I use 2 or 4 GPUs (4 GPUs even cost more time). So what do the logged 16 steps mean? Does each GPU run a different 16 steps?
line 953: num_train_steps = int(len(train_examples) / FLAGS.train_batch_size * FLAGS.num_train_epochs)
If I use 4 GPUs, should num_train_steps = len(train_examples) / FLAGS.train_batch_size * FLAGS.num_train_epochs / 4?
Hi,
Thanks a lot for providing multi-GPU support for BERT.
I am currently working on a BERT project where prediction usually takes 30 minutes on a single GPU. I have 8 GPUs and I want to split the data into 8 parts, giving one part to each GPU. Would I be able to run the predictions simultaneously on the 8 GPUs using your BERT repo?
Best,
Aswin
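Not something the repo supports out of the box (tf.estimator.RunConfig has no predict counterpart to train_distribute), but a common workaround is exactly what you describe: shard the input and run one predictor process per GPU, pinning each process to its card with CUDA_VISIBLE_DEVICES. A minimal sketch; run_predict_on_shard and the shard file names are hypothetical:

```python
import os
from multiprocessing import Process

NUM_GPUS = 8

def worker(gpu_id, shard_path):
    # Pin this process to one GPU before TensorFlow initializes.
    os.environ['CUDA_VISIBLE_DEVICES'] = str(gpu_id)
    # run_predict_on_shard is a hypothetical helper: build the estimator
    # as in run_custom_classifier.py and predict on this shard only.
    run_predict_on_shard(shard_path)

if __name__ == '__main__':
    procs = [Process(target=worker, args=(i, 'shard_%d.tsv' % i))
             for i in range(NUM_GPUS)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```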
When I ran the script run_custom_classifier.py, this error occurred:
ImportError: No module named 'tensorflow.python.distribute.cross_device_ops'
python: 3.5.2
tensorflow-gpu: 1.11.0
ubuntu 16.04
How can I handle this?
Thanks
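For what it's worth (based on my recollection of the TF release history, not tested against this repo): the tensorflow.python.distribute.cross_device_ops module only appeared in later 1.x releases; on 1.11 the corresponding classes still lived under tf.contrib.distribute with older names. Upgrading within the 1.x line, e.g. pip install --upgrade "tensorflow-gpu>=1.13,<2.0", should make the import work.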
Hi, Haoyu!
I got this when I ran run_custom_classifier.py.
Is that normal?