
Comments (12)

yajiemiao commented on May 16, 2024

It's clearly still a problem with the memory. Probably you can simply remove the last mini-batch (utterance set), or the longest several utterances, from your training set. I am redesigning the data IO manner, which hopefully can solve the issue more elegantly.

from eesen.

jlerouge commented on May 16, 2024

Hi @rightfront, are you using Linux? If so, could you please provide the output of the command "free -h" before and after executing the script train_ctc_parallel.sh (preferably just after rebooting the machine)?

I'm also experiencing memory errors with eesen, where RAM is allocated by the program but weirdly not given back to the OS when one iteration of train-ctc-parallel is over.
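A minimal way to capture the numbers requested above (a sketch, assuming a Linux machine; it reads MemAvailable from /proc/meminfo instead of parsing "free -h" output, and the training command is a placeholder):

```shell
# Hypothetical helper for the check above: snapshot available host RAM before
# and after one training iteration. Linux-only (reads /proc/meminfo).
mem_available_kb() { awk '/^MemAvailable:/ {print $2}' /proc/meminfo; }

before=$(mem_available_kb)
# bash steps/train_ctc_parallel.sh ...   # run the actual training iteration here
after=$(mem_available_kb)
echo "MemAvailable before: ${before} kB, after: ${after} kB"
```

If the "after" number stays far below the "before" number once the process has exited, that supports the leak hypothesis.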


rightfront commented on May 16, 2024

@jlerouge - I've actually removed the last few thousand utterances from my set as @yajiemiao suggested, and that seems to have 'fixed' my issue - I've been able to run several iterations of train-ctc-parallel now.

If it fails again, I will let you know the results of the free memory test.


ZhixiuYe commented on May 16, 2024

@rightfront I have the same problem as you. I am using just over one hundred utterances, but I still get a memory allocation failure. Can you give me any suggestions or thoughts?


yajiemiao commented on May 16, 2024

As a follow-up: there were updates last week which hopefully resolve this issue. The size of the buffer is now determined prior to (rather than after) adding the next utterance.
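The effect of that change can be sketched as follows (hypothetical shell pseudocode, not Eesen's actual C++ data loader): the buffer is flushed when the next utterance would exceed the limit, instead of only after it has already been added.

```shell
# Sketch of the pre-check buffering rule described above (illustrative only).
frame_num_limit=10000   # max frames held in one mini-batch buffer
buffer=0                # frames currently buffered
batches=0               # mini-batches emitted so far
for utt_frames in 4000 3000 2500 6000 2000; do
  # Pre-check: would adding this utterance overflow the buffer?
  if [ $((buffer + utt_frames)) -gt "$frame_num_limit" ]; then
    batches=$((batches + 1))   # flush the current mini-batch first
    buffer=0
  fi
  buffer=$((buffer + utt_frames))
done
if [ "$buffer" -gt 0 ]; then
  batches=$((batches + 1))     # flush the final partial mini-batch
fi
echo "emitted $batches mini-batches"
```

With the old post-check ordering, the 6000-frame utterance would have been appended before the limit was tested, so the buffer could transiently exceed frame_num_limit and trigger the allocation failure.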


lifeiteng commented on May 16, 2024

@jlerouge FYI kaldi-asr/kaldi#473


jlerouge commented on May 16, 2024

Thank you very much! This is exactly what I've encountered.
On June 7, 2016 at 05:01, "Feiteng Li" wrote:

@jlerouge FYI kaldi-asr/kaldi#473



migueljette commented on May 16, 2024

Hi there, I am trying to run the TEDLIUM (v2) recipe on a g2.2 AWS instance (4 GB RAM on a K520) and I get the same (or a similar) error. What is the memory requirement to run this recipe without issues?

VLOG1 After 35000 sequences (53.5094Hr): Obj(log[Pzx]) = -229.912 TokenAcc = 53.9965%
VLOG1 After 36000 sequences (56.0337Hr): Obj(log[Pzx]) = -235.598 TokenAcc = 54.5475%
WARNING (train-ctc-parallel:MallocInternal():cuda-device.cc:658) Allocation of 18620 rows, each of size 2560 bytes failed, releasing cached memory and retrying.
WARNING (train-ctc-parallel:MallocInternal():cuda-device.cc:665) Allocation failed for the second time. Printing device memory usage and exiting
LOG (train-ctc-parallel:PrintMemoryUsage():cuda-device.cc:334) Memory used: 4180164608 bytes.
ERROR (train-ctc-parallel:MallocInternal():cuda-device.cc:668) Memory allocation failure
WARNING (train-ctc-parallel:Close():kaldi-io.cc:446) Pipe copy-feats scp:exp/train_char_l5_c320/train_local.scp

I am running with the default settings of:
input_feat_dim=120 # dimension of the input features; we will use 40-dimensional fbanks with deltas and double deltas
lstm_layer_num=5 # number of LSTM layers
lstm_cell_dim=320 # number of memory cells in every LSTM layer

As I said, I have an AWS instance (g2.2) with 4 GB RAM on a K520. I tried both CUDA 6.5 and 7.5. I seem to recall that there was some weird memory leak in CUDA 7... but I get the same error with both.

To try to get one iteration to run, I have changed the nnet topo to
input_feat_dim=120
lstm_layer_num=3
lstm_cell_dim=240

Any thoughts?
Thank you!


migueljette commented on May 16, 2024

@yajiemiao I was able to successfully run one iteration with the smaller network topology. So maybe I just need more than 4 GB of RAM on the GPU to run the recipe. Do you have a "rule of thumb" for how much RAM is needed for a given task? For example, this is only about 200 hours of speech... what if I run with 2,000 hours? What about 10,000 hours? Thanks for your help!


fmetze commented on May 16, 2024

The RAM you need depends on the amount of data you attempt to hold in memory for one mini-batch update, not on the total amount of data you process. You should also play with the num_sequence and frame_num_limit (set to 10000, maybe) parameters, which determine how many utterances you process in parallel. By reducing these, you can keep the model size the same but process fewer utterances in parallel, thus requiring less memory. Unfortunately we do not have a real rule of thumb for this, but we are also working successfully on AWS instances.
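The tuning suggested above might look like this (a sketch: the flag names are taken from this thread, and the data/exp paths and the num_sequence value are placeholders; check your copy of steps/train_ctc_parallel.sh for the exact option spellings):

```shell
# Keep the model topology, shrink the per-batch workload to fit a 4 GB GPU.
num_sequence=10         # utterances processed in parallel (hypothetical value)
frame_num_limit=10000   # max frames per mini-batch (recipe default: 25000)

train_cmd="steps/train_ctc_parallel.sh \
  --num-sequence $num_sequence \
  --frame-num-limit $frame_num_limit \
  data/train data/dev exp/train_char_l5_c320"
echo "$train_cmd"
```

Halving frame_num_limit roughly halves the activation memory held per update, while the learned model stays the same size.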


migueljette commented on May 16, 2024

@fmetze thank you for your response! I will give that a try in the next few days!


migueljette commented on May 16, 2024

Yep, that worked. I used '--frame-num-limit 10000' instead of the recipe's default of 25000. Thank you!

