Comments (23)
The env looks fine. Is this in sync mode or async mode? I checked our run; it was in async mode with 8 machines x 2 GPUs and got 0.9 steps/s. Since you are using 4 GPUs, you should expect ~0.2 steps/s, and it seems you are getting slightly less than that.
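That estimate can be reproduced with back-of-envelope arithmetic, assuming the aggregate async step rate scales roughly linearly with the number of GPUs (the function name below is just for illustration):

```python
# Back-of-envelope: in async mode the aggregate step rate scales
# roughly linearly with the number of replicas (GPUs).
def expected_steps_per_sec(reference_rate, reference_gpus, target_gpus):
    """Scale a measured aggregate step rate to a different GPU count."""
    per_gpu = reference_rate / reference_gpus
    return per_gpu * target_gpus

# Reference run: 8 machines x 2 GPUs = 16 GPUs at 0.9 steps/s.
est = expected_steps_per_sec(0.9, 16, 4)
print(f"expected: ~{est:.3f} steps/s")  # ~0.225 steps/s for 4 GPUs
```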
from lingvo.
When you update the lingvo version, you will almost always need to update the tf-nightly version as well.
That's the GPU utilization rate; sometimes it was zero. But I have recorded the time taken by the data pipeline and found it very fast, about 0.2 s per step when I switch to my own data pipeline.
Right, it seems that the GPU utilization is pretty low. Can you try some more TensorFlow profiling to see where the time is spent?
That's the TensorFlow timeline profiling output: timeline_3.json
OK, I'm not an expert in reading traces, and I don't really see a smoking gun. Can you follow the profiling instructions, especially trying "advise" and the profiler-ui, to see if you find anything useful?
I once suspected the TensorFlow version and tried many versions of tf-nightly-gpu, but it had no effect. My current env is: tf_nightly_gpu-1.13.0.dev20181210, CUDA 9.0, cuDNN 7.3.0. Is that right?
Then I tried TensorBoard to look at compute time, but it didn't display the time spent.
Next, I followed your advice to profile and looked at the timeline, but I couldn't find any evidence: timeline_20.json
It looks like much time is spent before the forward and backward passes, e.g. in the op 'RandomStandardNormal', but I don't know how to fix it.
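One way to quantify which ops dominate is to aggregate the events in the timeline JSON by op name. A minimal sketch, assuming the Chrome trace format that TensorFlow's timeline writer emits (`"ph": "X"` complete events with a `dur` field in microseconds); a tiny synthetic trace stands in for the real timeline_20.json, and its numbers are made up:

```python
from collections import defaultdict

# Tiny synthetic trace; a real one would come from
# json.load(open("timeline_20.json")).
trace = {
    "traceEvents": [
        {"ph": "X", "name": "RandomStandardNormal", "dur": 4200},
        {"ph": "X", "name": "MatMul", "dur": 1500},
        {"ph": "X", "name": "RandomStandardNormal", "dur": 3800},
    ]
}

def top_ops(trace, n=10):
    """Return the n ops with the largest total duration (microseconds)."""
    totals = defaultdict(int)
    for ev in trace.get("traceEvents", []):
        if ev.get("ph") == "X":  # complete events carry a duration
            totals[ev["name"]] += ev.get("dur", 0)
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:n]

for name, us in top_ops(trace):
    print(f"{name}: {us} us")
```

Sorting the aggregated totals makes ops like 'RandomStandardNormal' easy to spot if they really dominate the step.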
I have tried both sync mode and async mode:
bazel-bin/lingvo/trainer --run_locally=gpu --mode=sync --model=asr.librispeech.Librispeech960Grapheme --logdir=/data/dingzhenyou/speech_data/librispeech/log/ --logtostderr --enable_asserts=false
bazel-bin/lingvo/trainer --run_locally=gpu --mode=async --model=asr.librispeech.Librispeech960Grapheme --logdir=/data/dingzhenyou/speech_data/librispeech/log/ --logtostderr --enable_asserts=false --job=controller,trainer
Am I running the commands correctly for local runs?
My lingvo code version is from Oct 2018, which is different from the current version, so I recompiled it, but an error occurred:
[dingzhenyou@gpu46 lingvo]$ bazel build -c opt //lingvo:trainer
INFO: Build options have changed, discarding analysis cache.
INFO: Analysed target //lingvo:trainer (37 packages loaded).
INFO: Found 1 target...
ERROR: /data/dingzhenyou/lingvo/lingvo/core/ops/BUILD:314:1: C++ compilation of rule '//lingvo/core/ops:hyps_proto' failed (Exit 1)
In file included from bazel-out/k8-opt/genfiles/lingvo/core/ops/hyps.pb.cc:4:0:
bazel-out/k8-opt/genfiles/lingvo/core/ops/hyps.pb.h:10:40: fatal error: google/protobuf/port_def.inc: No such file or directory
#include <google/protobuf/port_def.inc>
^
compilation terminated.
Target //lingvo:trainer failed to build
Use --verbose_failures to see the command lines of failed build steps.
INFO: Elapsed time: 6.402s, Critical Path: 5.42s
INFO: 9 processes: 9 processwrapper-sandbox.
FAILED: Build did NOT complete successfully
That's my env:
== cat /etc/issue ===============================================
Linux gpu46 3.10.0-327.el7.x86_64 #1 SMP Thu Nov 19 22:10:57 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
VERSION="7 (Core)"
VERSION_ID="7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT_VERSION="7"
== are we in docker =============================================
No
== compiler =====================================================
c++ (GCC) 4.8.5 20150623 (Red Hat 4.8.5-16)
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
== bazel =====================================================
Build label: 0.17.2
Build time: Fri Sep 21 10:31:42 2018 (1537525902)
Build timestamp: 1537525902
Build timestamp as int: 1537525902
== uname -a =====================================================
Linux gpu46 3.10.0-327.el7.x86_64 #1 SMP Thu Nov 19 22:10:57 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
== check pips ===================================================
numpy (1.14.5)
numpydoc (0.6.0)
protobuf (3.6.1)
tensorflow-estimator (1.13.0)
== check for virtualenv =========================================
False
== tensorflow import ============================================
tf.VERSION = 1.13.0-dev20181210
tf.GIT_VERSION = v1.12.0-rc0-4213-g54b110ae43
tf.COMPILER_VERSION = v1.12.0-rc0-4213-g54b110ae43
Sanity check: array([1], dtype=int32)
== env ==========================================================
LD_LIBRARY_PATH /data/dingzhenyou/train_data_loader/deps/boost/lib/:/data/dingzhenyou/train_data_loader/deps/opencv/lib/:/data/dingzhenyou/cuda/lib64:/usr/local/cuda-9.0/lib:/usr/local/cuda-9.0/lib64:
DYLD_LIBRARY_PATH is unset
== nvidia-smi ===================================================
Wed Apr 17 07:15:53 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.77 Driver Version: 390.77 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 TITAN Xp Off | 00000000:02:00.0 Off | N/A |
| 26% 46C P0 63W / 250W | 0MiB / 12196MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 TITAN Xp Off | 00000000:03:00.0 Off | N/A |
| 26% 47C P0 61W / 250W | 0MiB / 12196MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 TITAN Xp Off | 00000000:82:00.0 Off | N/A |
| 24% 43C P0 64W / 250W | 0MiB / 12196MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 TITAN Xp Off | 00000000:83:00.0 Off | N/A |
| 23% 42C P0 59W / 250W | 0MiB / 12196MiB | 3% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
== cuda libs ===================================================
/usr/local/cuda-8.0/doc/man/man7/libcudart.7
/usr/local/cuda-8.0/doc/man/man7/libcudart.so.7
/usr/local/cuda-8.0/lib64/libcudart_static.a
/usr/local/cuda-8.0/lib64/libcudart.so.8.0.61
/usr/local/cuda-9.0/doc/man/man7/libcudart.7
/usr/local/cuda-9.0/doc/man/man7/libcudart.so.7
/usr/local/cuda-9.0/lib64/libcudart_static.a
/usr/local/cuda-9.0/lib64/libcudart.so.9.0.176
I have updated tf-nightly, and the training speed is still very slow. That's my env:
tf_env.txt
And here is the training log; each step still takes 7-8 s:
nohup.txt
That training was in async mode:
nohup bazel-bin/lingvo/trainer --run_locally=gpu --mode=async --model=asr.librispeech.Librispeech960Grapheme --logdir=/data/dingzhenyou/speech_data/librispeech/log/ --logtostderr --enable_asserts=false --job=controller,trainer &
Actually, I think that might be the right speed. We get ~1.1 s per step on 16 P100s, which would be ~5 s per step on 4 P100s, and a P100 is supposedly maybe 30% faster than a Titan Xp.
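That estimate follows from simple scaling arithmetic; both the linear inverse scaling with GPU count and the ~30% P100-vs-Titan Xp gap are rough assumptions, not measured facts:

```python
# Rough scaling estimate, assuming step time scales inversely with
# GPU count and a Titan Xp is ~30% slower than a P100 (both assumptions).
step_p100_16 = 1.1                     # seconds/step measured on 16 P100s
step_p100_4 = step_p100_16 * (16 / 4)  # ~4.4 s/step expected on 4 P100s
step_titanxp_4 = step_p100_4 * 1.3     # ~5.7 s/step expected on 4 Titan Xps
print(f"~{step_titanxp_4:.1f} s/step expected on 4 Titan Xp")
```

That lands in the same ballpark as the observed 7-8 s per step.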
Thanks, I will try more advanced machines.
@dingevin did it help to run on P100s/V100s? I am getting around 6-7 seconds per step (Steps/second: 0.114680) on V100s (single host with 8 GPUs) in async mode. I was expecting it to be faster than P100s. From @jonathanasdf's numbers, for 8 GPUs it should be around ~3 seconds per step. Do you know any other reason for the higher time per step? I am using the latest nightly build and lingvo, with the default recipe unchanged (except for the learning rate).
Hm, that's not good news...
Unfortunately it's very hard to debug performance issues remotely. Please try out some tensorflow profiling options to see if the GPUs are being utilized efficiently and whether the input pipeline is keeping up.
@datavizweb I have trained lingvo on a Tesla V100 lately, but the training speed is not faster.
@jonathanasdf From my experiments, GPU utilization was very high, but each step is still slow. That's very strange. This is the log.
I have also found that the CPU load is very high; the load average numbers are always > 5,
and the run queue is also high:
$ cat /proc/cpuinfo | grep name | cut -f2 -d: | uniq -c
48 Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz
$ cat /proc/cpuinfo| grep "physical id"| sort| uniq| wc -l
2
$ cat /proc/cpuinfo| grep "cpu cores"| uniq
cpu cores : 12
$ cat /proc/cpuinfo| grep "processor"| wc -l
48
$ vmstat 1
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
3 0 161664 5572960 231776 232380256 0 0 2 4 0 0 18 12 70 0 0
9 0 161664 5570836 231776 232390528 0 0 10240 0 38752 396294 25 2 73 0 0
62 0 161664 5572952 231776 232390176 0 0 0 0 35005 435967 13 2 85 0 0
6 0 161664 5551236 231776 232405296 0 0 15104 8 107598 484743 19 11 69 0 0
5 0 161664 5551484 231776 232405296 0 0 0 0 30623 508293 16 2 82 0 0
2 0 161664 5551732 231776 232405296 0 0 0 0 31678 177303 7 1 92 0 0
10 0 161664 5551776 231776 232405296 0 0 0 0 43848 262920 9 2 89 0 0
6 0 161664 5551904 231776 232405296 0 0 0 0 57625 445091 14 2 84 0 0
12 0 161664 5552776 231776 232405296 0 0 0 16 59159 443863 14 2 84 0 0
13 0 161664 5552280 231776 232405296 0 0 0 0 48835 525581 15 2 83 0 0
5 0 161664 5534872 231776 232425952 0 0 20608 0 113457 338928 17 10 73 0 0
4 0 161664 5534932 231776 232425952 0 0 0 0 37296 666458 18 3 79 0 0
1 0 161664 5535120 231776 232425952 0 0 0 0 36611 494778 13 2 84 0 0
8 0 161664 5541208 231776 232486576 0 0 61696 0 41747 370774 24 3 73 0 0
9 0 161664 5540580 231776 232487808 0 0 0 16 36153 728851 19 3 78 0 0
1 0 161664 5535660 231776 232488320 0 0 0 0 27509 145425 8 1 91 0 0
3 0 161664 5537648 231776 232487968 0 0 0 0 35210 176295 7 1 92 0 0
3 0 161664 5538180 231776 232487968 0 0 0 0 52778 319345 10 2 87 0 0
4 0 161664 5538068 231776 232487968 0 0 0 0 55697 442565 13 2 85 0 0
9 0 161664 5538332 231776 232487968 0 0 0 4 56004 445039 13 2 85 0 0
10 0 161664 5538956 231776 232487968 0 0 0 0 49435 502612 15 2 83 0 0
10 0 161664 5538716 231776 232488720 0 0 896 0 52583 245029 14 8 78 0 0
8 0 161664 5531764 231776 232488864 0 0 0 0 35941 582441 17 3 80 0 0
12 0 161664 5532012 231776 232488864 0 0 0 0 25962 328696 12 2 86 0 0
64 0 161664 5532880 231776 232488864 0 0 0 0 32843 206023 8 1 91 0 0
10 0 161664 5511304 231776 232488864 0 0 0 39 61909 448895 14 2 84 0 0
7 0 161664 5511056 231776 232488864 0 0 0 0 60732 441861 14 2 84 0 0
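The vmstat output above can be summarized programmatically. A minimal sketch that averages the run-queue column (`r`, the first field), using a few sample lines copied from the output above:

```python
# Summarize vmstat's run-queue column ("r", first field). Sustained or
# spiking values well above the core count indicate CPU contention;
# this host has 48 logical cores, so spikes like 62 are notable.
vmstat_lines = """\
3 0 161664 5572960 231776 232380256 0 0 2 4 0 0 18 12 70 0 0
62 0 161664 5572952 231776 232390176 0 0 0 0 35005 435967 13 2 85 0 0
5 0 161664 5551484 231776 232405296 0 0 0 0 30623 508293 16 2 82 0 0
""".splitlines()

runq = [int(line.split()[0]) for line in vmstat_lines]
avg = sum(runq) / len(runq)
print(f"avg run queue: {avg:.1f}, max: {max(runq)}")
```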
It seems like there is a dip in the GPU graph every 6-8 seconds, which also corresponds to your step time. Could it be maybe the CPU/disk cannot keep up with the input processing?
I have tested the disk I/O but didn't see any bottleneck; that's the test info in librispeech.log. I have also placed the training data on a shared server to test network I/O, and I can't see anything wrong there either. This is the network I/O:
Sorry, I tried checking with a few people on our side and we didn't have any other hypothesis for why it's slower for you :(
@dingevin we can generate random features (batches in memory) and feed them to the model from the input_generator. If it still takes 6-7 seconds per step, I/O is not the issue. We then need to profile and figure out the exact component that is taking more time. As my next debugging step, I am planning to do that at the end of next week. If you have time, can you check it?
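The synthetic-input idea can be sketched as follows. This is not the lingvo input_generator API; `make_synthetic_input`, the batch shape, and all parameter values are illustrative. The key point is that the batch is built once and returned unchanged every step, so input-pipeline cost drops to nearly zero:

```python
import random

# Sketch of the "feed random features" debugging idea: build one random
# batch up front and return it on every call. If the step time stays at
# 6-7 s with this generator, I/O is not the bottleneck.
def make_synthetic_input(batch_size=16, seq_len=1000, feature_dim=80, seed=0):
    rng = random.Random(seed)
    batch = [[[rng.gauss(0.0, 1.0) for _ in range(feature_dim)]
              for _ in range(seq_len)]
             for _ in range(batch_size)]
    def get_batch():
        return batch  # same object every step: no disk or network reads
    return get_batch

get_batch = make_synthetic_input(batch_size=2, seq_len=3, feature_dim=4)
b = get_batch()
print(len(b), len(b[0]), len(b[0][0]))  # 2 3 4
```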
Related to this issue and how speeds are measured and compared in this thread: are the metrics printed during the run cumulative, per GPU, or both? It seems that steps/sec is fairly constant, but examples/sec grows linearly with the number of GPUs. Any clarification on this would be appreciated.
GPUs Steps/sec Examples/sec
1 0.128 6.484
2 0.111 11.513
4 0.103 21.002
8 0.105 42.436
I believe that depends on whether you are running in sync mode (the default) or async mode.
In sync mode all GPUs need to complete their computation before the step is updated, so the step rate is constant. In async mode each GPU updates parameters independently so the step rate should scale with the number of GPUs.
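A quick consistency check on the table a few comments up supports the sync-mode explanation: if examples/sec ≈ steps/sec × GPUs × per-GPU batch, then examples/sec divided by (steps/sec × GPUs) should be roughly constant across rows. The ~51 that falls out is a derived value; the thread does not state the actual batch size:

```python
# (gpus, steps_per_sec, examples_per_sec), copied from the table above.
rows = [
    (1, 0.128, 6.484),
    (2, 0.111, 11.513),
    (4, 0.103, 21.002),
    (8, 0.105, 42.436),
]

# In sync mode: examples/sec ~= steps/sec * num_gpus * batch_per_gpu,
# so this ratio should be nearly constant if the table is from sync runs.
for gpus, sps, eps in rows:
    per_gpu_batch = eps / (sps * gpus)
    print(f"{gpus} GPUs: ~{per_gpu_batch:.1f} examples/step/GPU")
```

All four rows come out near the same value, which is what sync mode predicts.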
Watching this.
Could you guys give some tips on how to profile with GPipe? I've created a model that's GPipe-compatible (it uses 4 GPUs at the moment). Any particular tools?