Comments (23)
The env looks fine. Is this in sync mode or async mode? I checked our run; it was in async mode with 8 machines x 2 GPUs and got 0.9 steps/s. Since you are using 4 GPUs, you should expect ~0.2 steps/s, and it seems you are getting slightly less than that.
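That estimate can be reproduced with back-of-envelope arithmetic, assuming the aggregate async step rate scales roughly linearly with the number of GPUs (the function name below is just for illustration):

```python
# Back-of-envelope: in async mode the aggregate step rate scales
# roughly linearly with the number of replicas (GPUs).
def expected_steps_per_sec(reference_rate, reference_gpus, target_gpus):
    """Scale a measured aggregate step rate to a different GPU count."""
    per_gpu = reference_rate / reference_gpus
    return per_gpu * target_gpus

# Reference run: 8 machines x 2 GPUs = 16 GPUs at 0.9 steps/s.
est = expected_steps_per_sec(0.9, 16, 4)
print(f"expected: ~{est:.3f} steps/s")  # ~0.225 steps/s for 4 GPUs
```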
from lingvo.
When you update the lingvo version, you will almost always need to update the tf-nightly version as well.
That's the GPU utilization rate; sometimes it was zero. But I have recorded the time taken by the data pipeline and found it very fast, about 0.2 s per step when I switch to my own data pipeline.
Right, it seems that the GPU utilization is pretty low. Can you try some more TensorFlow profiling to see where the time is spent?
That's the TensorFlow timeline profiling output: timeline_3.json
OK, I'm not an expert in reading traces, and I don't really see a smoking gun. Can you follow the profiling instructions, especially trying "advise" and the profiler-ui, to see if you find anything useful?
I once suspected the TensorFlow version and tried many versions of tf-nightly-gpu, but it had no effect. My current env is: tf_nightly_gpu-1.13.0.dev20181210, CUDA 9.0, cuDNN 7.3.0. Is that right?
Then I tried TensorBoard to look at compute time, but it didn't display the time spent.
Next, I followed your advice to profile and looked at the timeline, but I couldn't find any evidence: timeline_20.json
It looks like much time is spent before the forward and backward passes, e.g. in the op 'RandomStandardNormal', but I don't know how to fix it.
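One way to quantify which ops dominate is to aggregate the events in the timeline JSON by op name. A minimal sketch, assuming the Chrome trace format that TensorFlow's timeline writer emits (`"ph": "X"` complete events with a `dur` field in microseconds); a tiny synthetic trace stands in for the real timeline_20.json, and its numbers are made up:

```python
from collections import defaultdict

# Tiny synthetic trace; a real one would come from
# json.load(open("timeline_20.json")).
trace = {
    "traceEvents": [
        {"ph": "X", "name": "RandomStandardNormal", "dur": 4200},
        {"ph": "X", "name": "MatMul", "dur": 1500},
        {"ph": "X", "name": "RandomStandardNormal", "dur": 3800},
    ]
}

def top_ops(trace, n=10):
    """Return the n ops with the largest total duration (microseconds)."""
    totals = defaultdict(int)
    for ev in trace.get("traceEvents", []):
        if ev.get("ph") == "X":  # complete events carry a duration
            totals[ev["name"]] += ev.get("dur", 0)
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:n]

for name, us in top_ops(trace):
    print(f"{name}: {us} us")
```

Sorting the aggregated totals makes ops like 'RandomStandardNormal' easy to spot if they really dominate the step.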
I have tried both sync mode and async mode:
bazel-bin/lingvo/trainer --run_locally=gpu --mode=sync --model=asr.librispeech.Librispeech960Grapheme --logdir=/data/dingzhenyou/speech_data/librispeech/log/ --logtostderr --enable_asserts=false
bazel-bin/lingvo/trainer --run_locally=gpu --mode=async --model=asr.librispeech.Librispeech960Grapheme --logdir=/data/dingzhenyou/speech_data/librispeech/log/ --logtostderr --enable_asserts=false --job=controller,trainer
Am I running the commands correctly for local runs?
My lingvo code version is from Oct 2018, which is different from the current version, so I recompiled it, but an error occurred:
[dingzhenyou@gpu46 lingvo]$ bazel build -c opt //lingvo:trainer
INFO: Build options have changed, discarding analysis cache.
INFO: Analysed target //lingvo:trainer (37 packages loaded).
INFO: Found 1 target...
ERROR: /data/dingzhenyou/lingvo/lingvo/core/ops/BUILD:314:1: C++ compilation of rule '//lingvo/core/ops:hyps_proto' failed (Exit 1)
In file included from bazel-out/k8-opt/genfiles/lingvo/core/ops/hyps.pb.cc:4:0:
bazel-out/k8-opt/genfiles/lingvo/core/ops/hyps.pb.h:10:40: fatal error: google/protobuf/port_def.inc: No such file or directory
#include <google/protobuf/port_def.inc>
^
compilation terminated.
Target //lingvo:trainer failed to build
Use --verbose_failures to see the command lines of failed build steps.
INFO: Elapsed time: 6.402s, Critical Path: 5.42s
INFO: 9 processes: 9 processwrapper-sandbox.
FAILED: Build did NOT complete successfully
That's my env:
== cat /etc/issue ===============================================
Linux gpu46 3.10.0-327.el7.x86_64 #1 SMP Thu Nov 19 22:10:57 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
VERSION="7 (Core)"
VERSION_ID="7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT_VERSION="7"
== are we in docker =============================================
No
== compiler =====================================================
c++ (GCC) 4.8.5 20150623 (Red Hat 4.8.5-16)
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
== bazel =====================================================
Build label: 0.17.2
Build time: Fri Sep 21 10:31:42 2018 (1537525902)
Build timestamp: 1537525902
Build timestamp as int: 1537525902
== uname -a =====================================================
Linux gpu46 3.10.0-327.el7.x86_64 #1 SMP Thu Nov 19 22:10:57 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
== check pips ===================================================
numpy (1.14.5)
numpydoc (0.6.0)
protobuf (3.6.1)
tensorflow-estimator (1.13.0)
== check for virtualenv =========================================
False
== tensorflow import ============================================
tf.VERSION = 1.13.0-dev20181210
tf.GIT_VERSION = v1.12.0-rc0-4213-g54b110ae43
tf.COMPILER_VERSION = v1.12.0-rc0-4213-g54b110ae43
Sanity check: array([1], dtype=int32)
== env ==========================================================
LD_LIBRARY_PATH /data/dingzhenyou/train_data_loader/deps/boost/lib/:/data/dingzhenyou/train_data_loader/deps/opencv/lib/:/data/dingzhenyou/cuda/lib64:/usr/local/cuda-9.0/lib:/usr/local/cuda-9.0/lib64:
DYLD_LIBRARY_PATH is unset
== nvidia-smi ===================================================
Wed Apr 17 07:15:53 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.77 Driver Version: 390.77 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 TITAN Xp Off | 00000000:02:00.0 Off | N/A |
| 26% 46C P0 63W / 250W | 0MiB / 12196MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 TITAN Xp Off | 00000000:03:00.0 Off | N/A |
| 26% 47C P0 61W / 250W | 0MiB / 12196MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 TITAN Xp Off | 00000000:82:00.0 Off | N/A |
| 24% 43C P0 64W / 250W | 0MiB / 12196MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 TITAN Xp Off | 00000000:83:00.0 Off | N/A |
| 23% 42C P0 59W / 250W | 0MiB / 12196MiB | 3% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
== cuda libs ===================================================
/usr/local/cuda-8.0/doc/man/man7/libcudart.7
/usr/local/cuda-8.0/doc/man/man7/libcudart.so.7
/usr/local/cuda-8.0/lib64/libcudart_static.a
/usr/local/cuda-8.0/lib64/libcudart.so.8.0.61
/usr/local/cuda-9.0/doc/man/man7/libcudart.7
/usr/local/cuda-9.0/doc/man/man7/libcudart.so.7
/usr/local/cuda-9.0/lib64/libcudart_static.a
/usr/local/cuda-9.0/lib64/libcudart.so.9.0.176
I have updated tf-nightly, and the training speed is still very slow. That's my env:
tf_env.txt
And here is the training log; each step still takes 7-8 s:
nohup.txt
That training was in async mode:
nohup bazel-bin/lingvo/trainer --run_locally=gpu --mode=async --model=asr.librispeech.Librispeech960Grapheme --logdir=/data/dingzhenyou/speech_data/librispeech/log/ --logtostderr --enable_asserts=false --job=controller,trainer &
Actually, I think that might be the right speed. We get ~1.1 s per step on 16 P100s, which would be ~5 s per step on 4 P100s, and a P100 is supposedly maybe 30% faster than a Titan Xp.
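That estimate follows from simple scaling arithmetic; both the linear inverse scaling with GPU count and the ~30% P100-vs-Titan Xp gap are rough assumptions, not measured facts:

```python
# Rough scaling estimate, assuming step time scales inversely with
# GPU count and a Titan Xp is ~30% slower than a P100 (both assumptions).
step_p100_16 = 1.1                     # seconds/step measured on 16 P100s
step_p100_4 = step_p100_16 * (16 / 4)  # ~4.4 s/step expected on 4 P100s
step_titanxp_4 = step_p100_4 * 1.3     # ~5.7 s/step expected on 4 Titan Xps
print(f"~{step_titanxp_4:.1f} s/step expected on 4 Titan Xp")
```

That lands in the same ballpark as the observed 7-8 s per step.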
Thanks, I will try more advanced machines.
@dingevin did it help to run on P100s/V100s? I am getting around 6-7 seconds per step (Steps/second: 0.114680) on V100s (single host with 8 GPUs) in async mode. I was expecting it to be faster than P100s. From @jonathanasdf's numbers, for 8 GPUs it should be around ~3 seconds per step. Do you know any other reason for the higher time per step? I am using the latest nightly build and lingvo, with the default recipe unchanged (except for the learning rate).
Hm, that's not good news...
Unfortunately it's very hard to debug performance issues remotely. Please try out some tensorflow profiling options to see if the GPUs are being utilized efficiently and whether the input pipeline is keeping up.
@datavizweb I have trained lingvo on a Tesla V100 lately, but the training speed is not faster.
@jonathanasdf From my experiments, GPU utilization was very high, but each step is still slow. That's very strange. This is the log.
I have also found that the CPU load is very high; the load average numbers are always > 5,
and the run queue is also high:
$ cat /proc/cpuinfo | grep name | cut -f2 -d: | uniq -c
48 Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz
$ cat /proc/cpuinfo| grep "physical id"| sort| uniq| wc -l
2
$ cat /proc/cpuinfo| grep "cpu cores"| uniq
cpu cores : 12
$ cat /proc/cpuinfo| grep "processor"| wc -l
48
$ vmstat 1
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
3 0 161664 5572960 231776 232380256 0 0 2 4 0 0 18 12 70 0 0
9 0 161664 5570836 231776 232390528 0 0 10240 0 38752 396294 25 2 73 0 0
62 0 161664 5572952 231776 232390176 0 0 0 0 35005 435967 13 2 85 0 0
6 0 161664 5551236 231776 232405296 0 0 15104 8 107598 484743 19 11 69 0 0
5 0 161664 5551484 231776 232405296 0 0 0 0 30623 508293 16 2 82 0 0
2 0 161664 5551732 231776 232405296 0 0 0 0 31678 177303 7 1 92 0 0
10 0 161664 5551776 231776 232405296 0 0 0 0 43848 262920 9 2 89 0 0
6 0 161664 5551904 231776 232405296 0 0 0 0 57625 445091 14 2 84 0 0
12 0 161664 5552776 231776 232405296 0 0 0 16 59159 443863 14 2 84 0 0
13 0 161664 5552280 231776 232405296 0 0 0 0 48835 525581 15 2 83 0 0
5 0 161664 5534872 231776 232425952 0 0 20608 0 113457 338928 17 10 73 0 0
4 0 161664 5534932 231776 232425952 0 0 0 0 37296 666458 18 3 79 0 0
1 0 161664 5535120 231776 232425952 0 0 0 0 36611 494778 13 2 84 0 0
8 0 161664 5541208 231776 232486576 0 0 61696 0 41747 370774 24 3 73 0 0
9 0 161664 5540580 231776 232487808 0 0 0 16 36153 728851 19 3 78 0 0
1 0 161664 5535660 231776 232488320 0 0 0 0 27509 145425 8 1 91 0 0
3 0 161664 5537648 231776 232487968 0 0 0 0 35210 176295 7 1 92 0 0
3 0 161664 5538180 231776 232487968 0 0 0 0 52778 319345 10 2 87 0 0
4 0 161664 5538068 231776 232487968 0 0 0 0 55697 442565 13 2 85 0 0
9 0 161664 5538332 231776 232487968 0 0 0 4 56004 445039 13 2 85 0 0
10 0 161664 5538956 231776 232487968 0 0 0 0 49435 502612 15 2 83 0 0
10 0 161664 5538716 231776 232488720 0 0 896 0 52583 245029 14 8 78 0 0
8 0 161664 5531764 231776 232488864 0 0 0 0 35941 582441 17 3 80 0 0
12 0 161664 5532012 231776 232488864 0 0 0 0 25962 328696 12 2 86 0 0
64 0 161664 5532880 231776 232488864 0 0 0 0 32843 206023 8 1 91 0 0
10 0 161664 5511304 231776 232488864 0 0 0 39 61909 448895 14 2 84 0 0
7 0 161664 5511056 231776 232488864 0 0 0 0 60732 441861 14 2 84 0 0
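The vmstat output above can be summarized programmatically. A minimal sketch that averages the run-queue column (`r`, the first field), using a few sample lines copied from the output above:

```python
# Summarize vmstat's run-queue column ("r", first field). Sustained or
# spiking values well above the core count indicate CPU contention;
# this host has 48 logical cores, so spikes like 62 are notable.
vmstat_lines = """\
3 0 161664 5572960 231776 232380256 0 0 2 4 0 0 18 12 70 0 0
62 0 161664 5572952 231776 232390176 0 0 0 0 35005 435967 13 2 85 0 0
5 0 161664 5551484 231776 232405296 0 0 0 0 30623 508293 16 2 82 0 0
""".splitlines()

runq = [int(line.split()[0]) for line in vmstat_lines]
avg = sum(runq) / len(runq)
print(f"avg run queue: {avg:.1f}, max: {max(runq)}")
```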
It seems like there is a dip in the GPU graph every 6-8 seconds, which also corresponds to your step time. Could it be maybe the CPU/disk cannot keep up with the input processing?
I have tested the disk I/O but didn't see any bottleneck; that's the test info in librispeech.log. I have also placed the training data on a shared server to test network I/O, and I can't see anything wrong there either. This is the network I/O:
Sorry, I tried checking with a few people on our side and we didn't have any other hypothesis for why it's slower for you :(
@dingevin we can generate random features (batches in memory) and feed them to the model from the input_generator. If it still takes 6-7 seconds per step, I/O is not the issue. We then need to profile and figure out the exact component that is taking more time. As my next debugging step, I am planning to do that at the end of next week. If you have time, can you check it?
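The synthetic-input idea can be sketched as follows. This is not the lingvo input_generator API; `make_synthetic_input`, the batch shape, and all parameter values are illustrative. The key point is that the batch is built once and returned unchanged every step, so input-pipeline cost drops to nearly zero:

```python
import random

# Sketch of the "feed random features" debugging idea: build one random
# batch up front and return it on every call. If the step time stays at
# 6-7 s with this generator, I/O is not the bottleneck.
def make_synthetic_input(batch_size=16, seq_len=1000, feature_dim=80, seed=0):
    rng = random.Random(seed)
    batch = [[[rng.gauss(0.0, 1.0) for _ in range(feature_dim)]
              for _ in range(seq_len)]
             for _ in range(batch_size)]
    def get_batch():
        return batch  # same object every step: no disk or network reads
    return get_batch

get_batch = make_synthetic_input(batch_size=2, seq_len=3, feature_dim=4)
b = get_batch()
print(len(b), len(b[0]), len(b[0][0]))  # 2 3 4
```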
Related to this issue and how speeds are measured and compared in this thread: are the metrics printed during the run cumulative, per GPU, or both? It seems that steps/sec is fairly constant, but examples/sec grows linearly with the number of GPUs. Any clarification on this would be appreciated.
GPUs Steps/sec Examples/sec
1 0.128 6.484
2 0.111 11.513
4 0.103 21.002
8 0.105 42.436
I believe that depends on whether you are running in sync mode (the default) or async mode.
In sync mode all GPUs need to complete their computation before the step is updated, so the step rate is constant. In async mode each GPU updates parameters independently so the step rate should scale with the number of GPUs.
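A quick consistency check on the table a few comments up supports the sync-mode explanation: if examples/sec ≈ steps/sec × GPUs × per-GPU batch, then examples/sec divided by (steps/sec × GPUs) should be roughly constant across rows. The ~51 that falls out is a derived value; the thread does not state the actual batch size:

```python
# (gpus, steps_per_sec, examples_per_sec), copied from the table above.
rows = [
    (1, 0.128, 6.484),
    (2, 0.111, 11.513),
    (4, 0.103, 21.002),
    (8, 0.105, 42.436),
]

# In sync mode: examples/sec ~= steps/sec * num_gpus * batch_per_gpu,
# so this ratio should be nearly constant if the table is from sync runs.
for gpus, sps, eps in rows:
    per_gpu_batch = eps / (sps * gpus)
    print(f"{gpus} GPUs: ~{per_gpu_batch:.1f} examples/step/GPU")
```

All four rows come out near the same value, which is what sync mode predicts.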
Watching this.
Could you guys give some tips on how to profile with GPipe? I've created a model that's GPipe-compatible (it uses 4 GPUs at the moment). Any particular tools?