Comments (13)

jonathanasdf avatar jonathanasdf commented on July 4, 2024

Can you kill the job and restart it? It should resume training.

I've never seen it just stop progressing without any error messages, so I have no idea what could be going on.

from lingvo.

iamxiaoyubei avatar iamxiaoyubei commented on July 4, 2024

Thank you~
I have run into this twice, and both times it happened after about two or three days of running. Stopping and restarting the job let training resume.

iamxiaoyubei avatar iamxiaoyubei commented on July 4, 2024

Could you please have a look at this: I tried to set up the environment without Docker, but I hit the error described in #32, where I pasted my error at the end. Could you please help me? Thanks a lot!

I have tried many ways to solve it, but it still doesn't work.
Besides, I did not see TensorFlow being installed in the Dockerfile, yet TensorFlow 1.14.1 appears after setting up the environment via Docker. Without Docker, TensorFlow 1.14.1 has to be built from source, because PyPI doesn't have it. I'd like to know why Docker can install that version directly. Is my problem related to the TensorFlow version?

jonathanasdf avatar jonathanasdf commented on July 4, 2024

The Docker image installs tf-nightly via this line in the Dockerfile:

RUN pip --no-cache-dir install tf-nightly$(test "$base_image" != "$cpu_base_image" && echo "-gpu")

That might be the source of the problem if you have a different TensorFlow version.
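If you suspect a version mismatch, one quick way to see which TensorFlow distribution is actually installed (without importing TensorFlow itself) is to query the package metadata. This is a minimal sketch, not part of lingvo:

```python
from importlib import metadata

def installed_tf_dist():
    """Return (distribution_name, version) for the first TensorFlow
    package found, or (None, None) if none is installed."""
    # Common distribution names TensorFlow is published under.
    for name in ("tf-nightly", "tf-nightly-gpu",
                 "tensorflow", "tensorflow-gpu"):
        try:
            return name, metadata.version(name)
        except metadata.PackageNotFoundError:
            continue
    return None, None

name, version = installed_tf_dist()
print(name, version)
```

If this prints something other than a tf-nightly build, your environment differs from the one the Dockerfile produces.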

iamxiaoyubei avatar iamxiaoyubei commented on July 4, 2024

That's right! That was indeed the problem. Thanks!

datavizweb avatar datavizweb commented on July 4, 2024

I am seeing the same issue with the Librispeech recipe.

While running the Librispeech grapheme recipe (with default params, no changes to the recipe), the loss starts decreasing over steps. But after some time, losses are not computed at all, and training stays at the same step (the same step is checkpointed again). The first time it happened I killed and restarted the job; after a few more steps, GPU utilization dropped to zero again.

** Runs with decent GPU utilization; loss decreases; steps per second looks okay too **

I0505 21:07:48.797380 140580822693632 trainer.py:371] Steps/second: 0.117520, Examples/second: 48.137719
I0505 21:07:53.444988 140580814300928 trainer.py:520] step: 11449 fraction_of_correct_next_step_preds:0.98058969 fraction_of_correct_next_step_preds/logits:0.98058969 grad_norm/all:1.6253868 grad_scale_all:0.61523813 log_pplx:0.062722519 log_pplx/logits:0.062722519 loss:0.062722519 loss/logits:0.06272
2519 num_samples_in_batch:384 var_norm/all:608.57135
I0505 21:07:58.806945 140580822693632 trainer.py:371] Steps/second: 0.117520, Examples/second: 48.137722
I0505 21:08:02.775715 140580814300928 trainer.py:520] step: 11450 fraction_of_correct_next_step_preds:0.98301238 fraction_of_correct_next_step_preds/logits:0.98301238 grad_norm/all:1.4966037 grad_scale_all:0.66817957 log_pplx:0.054031234 log_pplx/logits:0.054031234 loss:0.054031234 loss/logits:0.05403
1234 num_samples_in_batch:384 var_norm/all:608.56183

** From here on, losses are no longer computed and GPU usage drops to zero **

I0505 21:08:08.816323 140580822693632 trainer.py:371] Steps/second: 0.117520, Examples/second: 48.137720
I0505 21:08:18.826544 140580822693632 trainer.py:371] Steps/second: 0.117506, Examples/second: 48.132058
I0505 21:08:28.836873 140580822693632 trainer.py:371] Steps/second: 0.117492, Examples/second: 48.126397
I0505 21:08:38.846771 140580822693632 trainer.py:371] Steps/second: 0.117479, Examples/second: 48.120738
I0505 21:08:48.856947 140580822693632 trainer.py:371] Steps/second: 0.117465, Examples/second: 48.115080
I0505 21:08:58.866631 140580822693632 trainer.py:371] Steps/second: 0.117451, Examples/second: 48.109423
I0505 21:09:08.877096 140580822693632 trainer.py:371] Steps/second: 0.117437, Examples/second: 48.103767
I0505 21:09:18.887014 140580822693632 trainer.py:371] Steps/second: 0.117423, Examples/second: 48.098113

** Same checkpoint saved again **
I0505 22:16:35.073483 140580822693632 trainer.py:270] Save checkpoint done: /tmp/librispeech/train/ckpt-00011450
I0505 22:16:42.773545 140580822693632 trainer.py:371] Steps/second: 0.112100, Examples/second: 45.917725
I0505 22:16:52.783426 140580822693632 trainer.py:371] Steps/second: 0.112088, Examples/second: 45.912573
I0505 22:17:02.793808 140580822693632 trainer.py:371] Steps/second: 0.112075, Examples/second: 45.907422

I0505 22:26:35.644196 140580822693632 trainer.py:270] Save checkpoint done: /tmp/librispeech/train/ckpt-00011450
I0505 22:26:43.352466 140580822693632 trainer.py:371] Steps/second: 0.111351, Examples/second: 45.610651
I0505 22:26:53.362185 140580822693632 trainer.py:371] Steps/second: 0.111338, Examples/second: 45.605568
I0505 22:27:03.372393 140580822693632 trainer.py:371] Steps/second: 0.111326, Examples/second: 45.600485

I0506 04:06:55.412421 140580822693632 trainer.py:270] Save checkpoint done: /tmp/librispeech/train/ckpt-00011450
I0506 04:07:03.148743 140580822693632 trainer.py:371] Steps/second: 0.090723, Examples/second: 37.161114
I0506 04:07:13.158670 140580822693632 trainer.py:371] Steps/second: 0.090714, Examples/second: 37.157740
I0506 04:07:23.168779 140580822693632 trainer.py:371] Steps/second: 0.090706, Examples/second: 37.154366
I0506 04:07:33.178835 140580822693632 trainer.py:371] Steps/second: 0.090698, Examples/second: 37.150993
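A stall like the one above can be spotted mechanically from the trainer log: the "Save checkpoint done" lines keep naming the same checkpoint file. Here is a rough sketch of such a check; the log format is assumed from the excerpt above, and this is not a lingvo utility:

```python
import re

CKPT_RE = re.compile(r"Save checkpoint done: (\S+)")

def is_stalled(log_lines, repeats=3):
    """Return True if the same checkpoint path was saved `repeats`
    times in a row, i.e. the global step has stopped advancing."""
    saves = [m.group(1) for line in log_lines
             if (m := CKPT_RE.search(line))]
    if len(saves) < repeats:
        return False
    return len(set(saves[-repeats:])) == 1

log = [
    "I0505 22:16:35 trainer.py:270] Save checkpoint done: /tmp/librispeech/train/ckpt-00011450",
    "I0505 22:26:35 trainer.py:270] Save checkpoint done: /tmp/librispeech/train/ckpt-00011450",
    "I0506 04:06:55 trainer.py:270] Save checkpoint done: /tmp/librispeech/train/ckpt-00011450",
]
print(is_stalled(log))  # -> True
```

A watchdog built on a check like this could kill and restart the trainer automatically, which is the workaround described in this thread.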

iamxiaoyubei avatar iamxiaoyubei commented on July 4, 2024

That's very strange. In my case the experiment stopped printing any info and also stopped saving checkpoints.

datavizweb avatar datavizweb commented on July 4, 2024

In async mode I am seeing the same issue. It stops after roughly 14k steps and GPU utilization drops to zero, while memory usage stays the same. Unlike sync mode (previous post), I don't see any progress here. It runs normally after I kill and restart it.

AaronSeunghi avatar AaronSeunghi commented on July 4, 2024

In the async mode with two trainers for Librispeech960Wpm, I observed exactly the same phenomenon as datavizweb reported. The same step (33299) is checkpointed again and again.

jonathanasdf avatar jonathanasdf commented on July 4, 2024

I wonder if it is some kind of threading issue / race condition due to running controller and trainer in the same binary. Internally we always run the jobs as separate binaries and have never observed this problem. That is the only real difference I can think of.
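One way to try the separate-binaries setup locally is to launch one process per job instead of combining controller and trainer in one binary. The flag names below (--job, --model, --logdir) and the model name are assumptions based on lingvo's trainer.py and may need adjusting; this only sketches building the per-job command lines:

```python
import subprocess

def job_cmd(job, model="asr.librispeech.Librispeech960Grapheme",
            logdir="/tmp/librispeech/train"):
    """Build the command line for one lingvo job. Flag names are
    assumptions based on lingvo's trainer.py; adjust as needed."""
    return [
        "python", "-m", "lingvo.trainer",
        "--job=" + job,
        "--model=" + model,
        "--logdir=" + logdir,
    ]

# Launch controller and trainer as two separate processes, so that a
# hang in one cannot block the other (commented out: illustrative only).
# procs = [subprocess.Popen(job_cmd(j)) for j in ("controller", "trainer")]
print(job_cmd("controller"))
```

If the hang really is a race between the two jobs sharing a process, running them separately like this should make it disappear.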

NiHaoUCAS avatar NiHaoUCAS commented on July 4, 2024

@iamxiaoyubei Did you solve the problem? I'm running into the same issue.

iamxiaoyubei avatar iamxiaoyubei commented on July 4, 2024

I didn't solve it; I just restart the run whenever it happens. 😂

NiHaoUCAS avatar NiHaoUCAS commented on July 4, 2024

I solved the problem by setting: export TF_CUDNN_USE_AUTOTUNE=0
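For anyone launching the trainer from Python rather than a shell, the equivalent is to set the variable in the environment before TensorFlow initializes cuDNN. A minimal sketch (the variable only takes effect if set before the first GPU op runs):

```python
import os

# Disable cuDNN autotuning. This must be set before TensorFlow
# creates its first cuDNN handle, i.e. before any GPU op runs.
os.environ["TF_CUDNN_USE_AUTOTUNE"] = "0"

# ... only now import tensorflow and start training, e.g.:
# import tensorflow as tf
```

Setting it in the shell (`export TF_CUDNN_USE_AUTOTUNE=0`) before launching the trainer achieves the same thing.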
