Hi, we're trying to fine-tune the stt_en fastconformer_ctc large mode

Fastconformer-CTC crashing with Watchdog caught collective operation timeout

duhtapioca commented on July 17, 2024

Fastconformer-CTC crashing with Watchdog caught collective operation timeout


Comments (2)

titu1994 commented on July 17, 2024

First, to triage whether the model or the data store is the problem, run with a subset of the data, maybe 50 hours or so. What is the max duration of the data? Reduce it to at most 40 seconds, preferably 30 seconds. We have some tools to segment data automatically.
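As a rough illustration of the subset step, the sketch below filters a manifest down to short utterances and caps the total audio kept. It assumes the usual NeMo JSON-lines manifest layout with a `duration` field in seconds; the function name `filter_manifest` and the file names are hypothetical, and this is not the automatic segmentation tooling mentioned above (it drops long utterances rather than splitting them).

```python
import json
from typing import Iterable, Iterator


def filter_manifest(
    lines: Iterable[str],
    max_duration_s: float = 30.0,
    max_total_hours: float = 50.0,
) -> Iterator[dict]:
    """Yield manifest entries no longer than max_duration_s.

    Stops once adding another entry would exceed max_total_hours of audio.
    Assumes each non-empty line is a JSON object with a "duration" key (seconds).
    """
    budget_s = max_total_hours * 3600.0
    kept_s = 0.0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        entry = json.loads(line)
        duration = float(entry["duration"])
        if duration > max_duration_s:
            continue  # drop utterances that are too long
        if kept_s + duration > budget_s:
            break  # subset budget reached
        kept_s += duration
        yield entry
```

One possible way to use it, with hypothetical paths:

```python
with open("train_manifest.json") as src, open("train_subset.json", "w") as dst:
    for entry in filter_manifest(src, max_duration_s=30.0, max_total_hours=50.0):
        dst.write(json.dumps(entry) + "\n")
```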

Next, an NCCL timeout is hard to debug: NeMo code mostly uses PyTorch and we don't do much at the NCCL level, so it can have many different causes. First check whether fine-tuning works on a single GPU with a small batch size, then try two GPUs.
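The staged check could be written down as a small set of Hydra-style command-line overrides, one list per stage. This is only a sketch: the keys `trainer.devices`, `model.train_ds.batch_size` and `model.train_ds.max_duration` are assumed to match the stock NeMo ASR configs, the batch size of 4 is an arbitrary small value, and `staged_overrides` is a hypothetical helper.

```python
from typing import List, Tuple


def staged_overrides(devices: int, batch_size: int, max_duration_s: float = 30.0) -> List[str]:
    """Build Hydra-style overrides for one debugging stage (assumed config keys)."""
    return [
        f"trainer.devices={devices}",
        f"model.train_ds.batch_size={batch_size}",
        f"model.train_ds.max_duration={max_duration_s}",
    ]


# Run stage 1 first; move to stage 2 only if stage 1 trains without a timeout.
STAGES: List[Tuple[str, List[str]]] = [
    ("single GPU, small batch", staged_overrides(devices=1, batch_size=4)),
    ("two GPUs, same batch", staged_overrides(devices=2, batch_size=4)),
]

for name, overrides in STAGES:
    print(name + ": " + " ".join(overrides))
```

Keeping everything except the device count identical between the two stages makes it easier to tell whether the timeout only appears once collective communication is involved.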

The LR and optimizer state are preserved in the ckpt files saved by Lightning during training. If you use exp manager, resuming a job is quite easy; see the exp manager docs and the tutorials showcasing training with it (just run the same script again with the same output dir, provided you have set the two resume flags in exp manager).
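For reference, a resume setup along those lines might look like the sketch below. It assumes the two flags meant here are `resume_if_exists` and `resume_ignore_no_checkpoint` from the exp manager config; the output directory, the experiment name and the helper `to_overrides` are hypothetical placeholders.

```python
from typing import Dict, List, Union

# Assumed exp manager settings for a resumable run; the path and name are placeholders.
exp_manager_cfg: Dict[str, Union[str, bool]] = {
    "exp_dir": "/path/to/output_dir",      # must be the same directory on every rerun
    "name": "fastconformer_ctc_finetune",  # hypothetical experiment name
    "resume_if_exists": True,              # pick up the latest checkpoint if one exists
    "resume_ignore_no_checkpoint": True,   # start fresh on the first run instead of failing
}


def to_overrides(cfg: Dict[str, Union[str, bool]], prefix: str = "exp_manager") -> List[str]:
    """Render a flat config dict as Hydra-style key=value overrides."""
    rendered = []
    for key, value in cfg.items():
        text = str(value).lower() if isinstance(value, bool) else str(value)
        rendered.append(f"{prefix}.{key}={text}")
    return rendered


print(" ".join(to_overrides(exp_manager_cfg)))
```

With both flags set, the same command can be used for the first launch and for every restart after a crash.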

Our team doesn't have much information about hardware effects on specific operations; we rely on PyTorch and PyTorch Lightning to provide a stable training engine.


zhang7346 commented on July 17, 2024

I have the same issue. Have you solved it? How can we avoid it?
