Question: why not dividing by target length in CTC loss (NeMo, closed)

nvidia commented on May 13, 2024

Question: why not dividing by target length in CTC loss

Comments (2)

okuchaiev commented on May 13, 2024

Yes, this is intentional. Basically, there are two options that I think make sense for CTCLoss:

(1) "mean": average everything across sequence length and batch (note that this is the default behavior in PyTorch).
(2) Sum losses over sequence lengths and then average over the batch.

We found empirically that option (2) works best. Longer sequences do have a greater impact, but keep in mind that in our setup (1) we randomly shuffle examples and (2) we cap the max duration at 16.7 seconds.

But, perhaps, we should expose (1) as an option.
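To make the two options concrete, here is a minimal sketch using PyTorch's `torch.nn.functional.ctc_loss`. The tensor shapes, random inputs, and variable names are illustrative assumptions; this is not NeMo's actual loss code, only a comparison of the two reductions described above.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Illustrative sizes (assumptions): time steps, batch size, classes (index 0 = blank)
T, N, C = 50, 4, 20
log_probs = torch.randn(T, N, C).log_softmax(dim=2)
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.tensor([5, 10, 15, 20])
# Padded targets of shape (N, max target length); labels avoid the blank index 0
targets = torch.randint(1, C, (N, int(target_lengths.max())))

# Per-utterance negative log-likelihoods, already summed over each sequence
per_utt = F.ctc_loss(log_probs, targets, input_lengths, target_lengths,
                     blank=0, reduction="none")

# Option (1): PyTorch's default "mean" divides each loss by its target length,
# then averages over the batch
loss_mean = F.ctc_loss(log_probs, targets, input_lengths, target_lengths,
                       blank=0, reduction="mean")
manual_mean = (per_utt / target_lengths).mean()

# Option (2): keep the per-sequence sums and average only over the batch
loss_sum_then_batch_mean = per_utt.mean()

print(loss_mean.item(), manual_mean.item(), loss_sum_then_batch_mean.item())
```

Under option (2), an utterance with a longer target contributes a proportionally larger share of the batch loss, which is the "greater impact" mentioned above.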

vadimkantorov commented on May 13, 2024

> (1) we randomly shuffle examples

Don't you sort by duration (so that duration is similar within the batch) by default?
https://github.com/NVIDIA/NeMo/blob/master/collections/nemo_asr/nemo_asr/parts/manifest.py#L129

> But, perhaps, we should expose ("mean") as an option.

Yeah, I wonder whether longer sequences really do provide more reliable gradients. If they don't, then raising the learning rate should have a somewhat similar impact.
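A short worked relation may help show why the learning-rate comparison is plausible. The symbols are assumptions introduced here for illustration: $\ell_i$ is the summed CTC loss of utterance $i$, $S_i$ its target length, $N$ the batch size, and $\theta$ the model parameters.

```latex
\mathcal{L}_{\text{mean}} = \frac{1}{N} \sum_{i=1}^{N} \frac{\ell_i}{S_i},
\qquad
\mathcal{L}_{\text{sum}} = \frac{1}{N} \sum_{i=1}^{N} \ell_i
\quad\Longrightarrow\quad
\nabla_\theta \mathcal{L}_{\text{sum}}
  = \frac{1}{N} \sum_{i=1}^{N} S_i \, \nabla_\theta \frac{\ell_i}{S_i}.
% If every target in the batch has the same length S_i = S:
\nabla_\theta \mathcal{L}_{\text{sum}} = S \, \nabla_\theta \mathcal{L}_{\text{mean}}.
```

So when target lengths within a batch are equal, option (2) under plain SGD behaves like option (1) with the learning rate multiplied by $S$. When lengths differ, $S_i$ acts as a per-utterance weight, and the two are only roughly comparable.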
