Comments (12)
It's clearly still a problem with the memory. You can probably just remove the last mini-batch (utterance set), or the several longest utterances, from your training set. I am redesigning the data I/O, which will hopefully solve the issue more elegantly.
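Pruning the longest utterances can be scripted; a minimal sketch, assuming Kaldi-style feats.scp lists and the feat-to-len / filter_scp.pl tools from the toolkit (the paths and the helper name are illustrative):

```shell
# Keep all but the N longest utterances, given "utt-id num-frames"
# pairs on stdin; prints the utterance ids to keep.
drop_longest() {
    sort -k2,2n | head -n "-$1" | awk '{print $1}'
}

# Illustrative use with Kaldi-style features (feat-to-len is a Kaldi binary):
#   feat-to-len scp:data/train/feats.scp ark,t:- | drop_longest 100 > keep.list
#   utils/filter_scp.pl keep.list data/train/feats.scp > feats_pruned.scp
```

Note that `head -n -N` (drop the last N lines) needs GNU coreutils, which is fine on the Linux boxes discussed here.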
from eesen.
Hi @rightfront, are you using Linux? If so, could you please provide the output of the command "free -h" before and after executing the script train_ctc_parallel.sh (preferably just after rebooting the machine)?
I'm also experiencing memory errors with eesen, where RAM is allocated by the program but, oddly, not returned to the OS when one iteration of train-ctc-parallel finishes.
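One way to capture those before/after snapshots around a run (the training-script call is commented out as a placeholder; the helper name is just illustrative):

```shell
# Snapshot host RAM before and after training, so memory that is not
# returned to the OS shows up as a difference in the "used" column.
mem_used_mb() {
    # field 3 of the "Mem:" row of `free -m` is the used MiB
    awk '/^Mem:/{print $3}' "$1"
}

free -m > /tmp/mem_before.txt
# steps/train_ctc_parallel.sh ...   # run one training iteration here
free -m > /tmp/mem_after.txt
echo "$(( $(mem_used_mb /tmp/mem_after.txt) - $(mem_used_mb /tmp/mem_before.txt) )) MiB not returned"
```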
@jlerouge - I've actually removed the last few thousand utterances from my set as @yajiemiao suggested, and that seems to have 'fixed' my issue - I've been able to run several iterations of train-ctc-parallel now.
If it fails again, I will let you know the results of the free memory test.
@rightfront I have the same problem as you. I use only a little more than one hundred utterances, but I still get a memory allocation failure. Can you give me any suggestions or thoughts?
As a follow-up: there were updates last week which hopefully resolve this issue. The size of the buffer is now determined before (rather than after) adding the next utterance.
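The policy of that fix can be illustrated with a small sketch (the real change is in the C++ data loader; the function and variable names here are hypothetical):

```shell
# Illustrative sketch of the fixed buffering policy: check whether the
# next utterance still fits under the frame limit BEFORE adding it,
# instead of adding first and discovering the overflow afterwards.
# admit_utterances FRAME_LIMIT LEN1 LEN2 ...  -> prints how many fit.
admit_utterances() {
    limit=$1; shift
    total=0; count=0
    for len in "$@"; do
        if [ $((total + len)) -gt "$limit" ]; then
            break              # would exceed the limit: stop before adding
        fi
        total=$((total + len))
        count=$((count + 1))
    done
    echo "$count"
}

admit_utterances 10000 4000 3000 2500 2000   # prints 3: adding the 2000-frame
                                             # utterance would exceed the limit
```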
@jlerouge FYI kaldi-asr/kaldi#473
Thank you very much! This is exactly what I've encountered.
On June 7, 2016, at 05:01, "Feiteng Li" [email protected] wrote:
@jlerouge https://github.com/jlerouge FYI kaldi-asr/kaldi#473
Hi there, I am trying to run the TEDLIUM (v2) recipe on a g2.2 AWS instance (4GB RAM on a K520) and I get the same (or a similar) error. What is the memory requirement to run this recipe without issues?
VLOG1 After 35000 sequences (53.5094Hr): Obj(log[Pzx]) = -229.912 TokenAcc = 53.9965%
VLOG1 After 36000 sequences (56.0337Hr): Obj(log[Pzx]) = -235.598 TokenAcc = 54.5475%
WARNING (train-ctc-parallel:MallocInternal():cuda-device.cc:658) Allocation of 18620 rows, each of size 2560 bytes failed, releasing cached memory and retrying.
WARNING (train-ctc-parallel:MallocInternal():cuda-device.cc:665) Allocation failed for the second time. Printing device memory usage and exiting
LOG (train-ctc-parallel:PrintMemoryUsage():cuda-device.cc:334) Memory used: 4180164608 bytes.
ERROR (train-ctc-parallel:MallocInternal():cuda-device.cc:668) Memory allocation failure
WARNING (train-ctc-parallel:Close():kaldi-io.cc:446) Pipe copy-feats scp:exp/train_char_l5_c320/train_local.scp
I am running with the default settings of:
input_feat_dim=120 # dimension of the input features; we will use 40-dimensional fbanks with deltas and double deltas
lstm_layer_num=5 # number of LSTM layers
lstm_cell_dim=320 # number of memory cells in every LSTM layer
As I said, I have an AWS instance (g2.2) with 4GB RAM on a K520. I tried with both CUDA 6.5 and 7.5. I seem to recall there was some weird memory leak in CUDA 7... but I get the same error either way.
To try to get one iteration to run, I have changed the nnet topo to
input_feat_dim=120
lstm_layer_num=3
lstm_cell_dim=240
Any thoughts?
Thank you!
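For a rough sense of why the smaller topology helps, here is a back-of-envelope parameter count for a bidirectional LSTM stack (standard 4-gate formula; peepholes and projection are ignored, so eesen's exact counts will differ somewhat):

```shell
# Rough BLSTM parameter count (4-gate LSTM, two directions per layer).
# blstm_params INPUT_DIM NUM_LAYERS CELL_DIM
blstm_params() {
    in=$1; layers=$2; cell=$3
    total=0; l=1
    while [ "$l" -le "$layers" ]; do
        # per direction: 4 gates x cell x (input + recurrent + bias)
        per_dir=$((4 * cell * (in + cell + 1)))
        total=$((total + 2 * per_dir))
        in=$((2 * cell))   # next layer sees concatenated fwd+bwd outputs
        l=$((l + 1))
    done
    echo "$total"
}

blstm_params 120 5 320   # default topology: about 11.0M parameters
blstm_params 120 3 240   # reduced topology: about 3.5M parameters
```

Parameters are only part of the story, though: the activation buffers kept for long utterances usually dominate GPU memory during CTC training.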
@yajiemiao I was able to successfully run one iteration with the smaller network topology. So, maybe it's just that I need more than 4GB of RAM on the GPU to run the recipe. Do you have a "rule of thumb" on how much RAM is needed for a given task? For example, this is just about 200 hours of speech... what if I run with 2000 hours? What about 10,000 hours? Thanks for your help!
The RAM you need depends on the amount of data that you attempt to hold in memory for one mini-batch update, not on the total amount of data that you process. You should also play with the num_sequence and frame_num_limit (set to 10000, maybe) parameters, which determine how many utterances you process in parallel. By reducing these, you keep the model size the same but process fewer utterances in parallel, thus requiring less memory. Unfortunately we do not have a real rule of thumb for this, but we are also working successfully on AWS instances.
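A crude way to see how frame_num_limit drives the activation footprint (the per-frame constant of 8 state vectors is an assumption, so treat the results as orders of magnitude only):

```shell
# Crude estimate of per-minibatch activation memory for a BLSTM:
# frames held in memory x layers x 2 directions x ~8 state vectors
# of cell_dim floats x 4 bytes each. The "8" is an assumed constant.
# est_batch_mb FRAME_LIMIT CELL_DIM NUM_LAYERS
est_batch_mb() {
    frames=$1; cell=$2; layers=$3
    echo $(( frames * layers * 2 * 8 * cell * 4 / 1048576 ))
}

est_batch_mb 25000 320 5   # default frame limit: roughly 2.4 GB
est_batch_mb 10000 320 5   # reduced limit: roughly 1 GB
```

On a 4GB K520 the default limit leaves little headroom once weights and CUDA overhead are added, which is consistent with the failure reported above.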
@fmetze thank you for your response! I will give that a try in the next few days!
Yep, that worked. I used '--frame-num-limit 10000' instead of the 25000 used in the recipe by default. Thank you!