
Comments (7)

qmgzhao avatar qmgzhao commented on July 23, 2024

I have the same problem.


pzelasko avatar pzelasko commented on July 23, 2024

Can you also set max_steps to something other than -1? E.g. 100000. Let us know if this helps.
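For reference, a sketch of where this would typically go in a NeMo-style training config (the exact section path may differ per recipe; max_steps is the usual Lightning Trainer argument, and the value 100000 is just an example):

```yaml
trainer:
  max_epochs: -1      # epochs are ill-defined with lhotse-style dynamic sampling
  max_steps: 100000   # stop after a fixed number of optimizer updates instead
```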


dhoore123 avatar dhoore123 commented on July 23, 2024

Setting max_steps as suggested seems to do the trick. Training now runs. Thanks!
I'll close the ticket once I see some epochs completing successfully.


dhoore123 avatar dhoore123 commented on July 23, 2024

I finally got a training running for a few (pseudo-)epochs now. Even though I am running on 2 80GB GPUs, I had to tune down the batch_duration to 750, with batch_size removed from the configuration. The GPU ran out of RAM with higher values. I did not expect this as the example in the nvidia docs suggests using a batch_duration of 1100 for a 32GB GPU.


pzelasko avatar pzelasko commented on July 23, 2024

I had to tune down the batch_duration to 750, with batch_size removed from the configuration.

It seems that your actual batch sizes became larger after removing the batch_size constraint, leading to this outcome. This is a net benefit: despite decreasing batch_duration, you are still enjoying larger batch sizes.

I did not expect this as the example in the nvidia docs suggests using a batch_duration of 1100 for a 32GB GPU.

The maximum possible batch_duration setting is determined by several factors:

  • available GPU RAM
  • model size
  • objective function
  • data duration distribution / max_duration / number of buckets / optional quadratic_duration penalty

The setting of 1100s was specific to FastConformer-L CTC+RNN-T trained on ASRSet 3. It is expected that with a different model, data, objective function, etc. you may need to tune it again. I am hoping to simplify the tuning process in the future.
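To make the quadratic_duration point concrete, here is a minimal Python sketch of duration-based batch packing. The penalized-duration formula d + d²/quadratic_duration follows lhotse's documented convention, but the function names and the greedy packing loop are illustrative, not NeMo's actual implementation:

```python
def effective_duration(d, quadratic_duration=None):
    """Penalized duration: long cuts count extra, so batches of long
    utterances pack fewer items and use less activation memory."""
    if quadratic_duration is None:
        return d
    return d + d ** 2 / quadratic_duration


def pack_batch(durations, batch_duration, quadratic_duration=None):
    """Greedily fill one batch until adding the next cut would push the
    (penalized) total past the batch_duration budget."""
    batch, total = [], 0.0
    for d in durations:
        eff = effective_duration(d, quadratic_duration)
        if batch and total + eff > batch_duration:
            break
        batch.append(d)
        total += eff
    return batch


# Short cuts pack densely into a 750s budget...
short = pack_batch([5.0] * 300, batch_duration=750)     # 150 cuts
# ...while long cuts, once penalized, pack far fewer.
long_ = pack_batch([30.0] * 100, batch_duration=750,
                   quadratic_duration=15.0)             # 8 cuts
```

This is why the same batch_duration can be safe for one duration distribution and OOM on another: the number of cuts (and the padded tensor shape) per batch depends on how long the sampled cuts are.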


dhoore123 avatar dhoore123 commented on July 23, 2024

Thanks for your reply, pzelasko. It reassures me that this batch_duration value does not seem odd to you, and does not point to something I did wrong.
On a different note: the effective batch size is normally defined as batch_size x accumulate_grad_batches (or fused_batch_size in the case of hybrid training?) x the number of GPUs. This makes the number of steps per epoch a function of the number of GPUs.
When using lhotse, the number of steps in a "pseudo" epoch looks to be the same, independent of the number of GPUs. Does this mean that the amount of data seen in one "pseudo" epoch depends on the number of GPUs one uses, or is lhotse spreading the same amount of data over fewer effective batches when running on more GPUs with each step?


pzelasko avatar pzelasko commented on July 23, 2024

It means that if you keep the “pseudoepoch” size constant, the amount of data seen during a “pseudoepoch” is proportional to the number of GPUs. Generally I don’t encourage thinking in epochs in this flavor of data loading, the only thing that counts is the number of updates. And yeah the total batch duration is the product of num GPUs, batch duration, and grad accumulation factor.
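A quick back-of-the-envelope illustration of that last point (all numbers hypothetical): with duration-based sampling, each GPU draws roughly batch_duration seconds of audio per micro-batch, so the data consumed in a fixed-length pseudoepoch scales with the number of GPUs while the step count stays constant.

```python
def audio_seconds_per_pseudoepoch(steps, batch_duration, num_gpus, grad_accum=1):
    """Total audio consumed: every GPU pulls ~batch_duration seconds
    of cuts for each micro-batch, on every step."""
    return steps * batch_duration * num_gpus * grad_accum


one_gpu = audio_seconds_per_pseudoepoch(steps=1000, batch_duration=750, num_gpus=1)
two_gpus = audio_seconds_per_pseudoepoch(steps=1000, batch_duration=750, num_gpus=2)
# Same number of steps per pseudoepoch, but twice the data seen on 2 GPUs.
```

In other words, the steps-per-pseudoepoch number is fixed by configuration, and it is the amount of data behind each step that grows with world size.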

