Comments (7)
I have the same problem.
Can you also set max_steps to something other than -1? E.g. 100000. Let us know if this helps.
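For reference, a minimal sketch of what that override could look like, assuming a Hydra/OmegaConf-style NeMo trainer config (the trainer fields follow PyTorch Lightning's Trainer; the values are illustrative only):

```python
from omegaconf import OmegaConf

# Illustrative trainer section of a NeMo training config; max_epochs and
# max_steps follow PyTorch Lightning's Trainer arguments. Values are examples.
cfg = OmegaConf.create("""
trainer:
  devices: 2
  max_epochs: -1     # keep epochs open-ended
  max_steps: 100000  # explicit step budget instead of the default -1
""")

print(cfg.trainer.max_steps)  # -> 100000
```

The same override can typically be passed on the command line of a Hydra-based training script, e.g. trainer.max_steps=100000.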
Setting max_steps as suggested seems to do the trick. Training now runs. Thanks!
I'll close the ticket once I see some epochs completing successfully.
I finally got training running for a few (pseudo-)epochs now. Even though I am running on two 80 GB GPUs, I had to reduce batch_duration to 750, with batch_size removed from the configuration; the GPUs ran out of memory with higher values. I did not expect this, as the example in the NVIDIA docs suggests a batch_duration of 1100 for a 32 GB GPU.
> I had to tune down the batch_duration to 750, with batch_size removed from the configuration.
It seems that your actual batch sizes became larger after removing the batch_size constraint, leading to this outcome. This is a net benefit: despite decreasing batch_duration, you are still getting larger batch sizes.
> I did not expect this as the example in the nvidia docs suggests using a batch_duration of 1100 for a 32GB GPU.
The maximum possible batch_duration setting is determined by several factors:
- available GPU RAM
- model size
- objective function
- data duration distribution / max_duration / number of buckets / optional quadratic_duration penalty
The setting of 1100s was specific to FastConformer-L CTC+RNN-T trained on ASRSet 3. It is expected that with a different model, data, objective function, etc. you may need to tune it again. I am hoping to simplify the tuning process in the future.
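To make the tuning knobs concrete, here is a minimal sketch of the Lhotse-related dataloader fields discussed in this thread, assuming NeMo's Lhotse dynamic-batching configuration; the key names mirror those mentioned above, and the values are only a starting point that has to be re-tuned for your model, objective, data, and GPU memory:

```python
from omegaconf import OmegaConf

# Illustrative Lhotse dynamic-batching fields for a train_ds section.
# All values are assumptions for illustration, not recommendations.
train_ds = OmegaConf.create("""
use_lhotse: true
batch_duration: 750       # total seconds of audio per GPU per step; lower it on OOM
max_duration: 40.0        # skip utterances longer than this many seconds
num_buckets: 30           # duration bucketing yields more uniform batch shapes
quadratic_duration: 30.0  # optional penalty that shrinks batches of long utterances
# batch_size is deliberately left unset so batch size is driven by duration alone
""")

print(OmegaConf.to_yaml(train_ds))
```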
Thanks for your reply, pzelasko. It reassures me that this batch_duration value does not seem odd to you, and does not point to something I did wrong.
On a different note: the effective batch size is normally defined by batch_size x accumulate_grad_batches (or fused_batch_size in the case of hybrid training?) x number of GPUs. This makes the number of steps per epoch a function of the number of GPUs.
When using lhotse, the number of steps in a "pseudo" epoch looks to be the same regardless of the number of GPUs. Does this mean that the amount of data seen in one "pseudo" epoch depends on the number of GPUs one uses, or does lhotse spread the same amount of data over fewer, larger effective batches when running on more GPUs?
It means that if you keep the “pseudoepoch” size constant, the amount of data seen during a “pseudoepoch” is proportional to the number of GPUs. Generally I don’t encourage thinking in epochs with this flavor of data loading; the only thing that counts is the number of updates. And yeah, the total batch duration is the product of num GPUs, batch duration, and grad accumulation factor.
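A quick back-of-envelope sketch of that relation, with illustrative numbers (accumulate_grad_batches is assumed to be 1 here):

```python
# Total audio seen per optimizer update:
#   num_gpus * batch_duration * accumulate_grad_batches
num_gpus = 2                 # two 80 GB GPUs, as in this thread
batch_duration = 750.0       # seconds of audio per GPU per step
accumulate_grad_batches = 1  # assumed value; use your trainer setting

total_seconds_per_update = num_gpus * batch_duration * accumulate_grad_batches
print(f"{total_seconds_per_update:.0f} s of audio per optimizer update")  # -> 1500 s

# Consequence: with a fixed number of steps per "pseudo-epoch", doubling the
# GPU count doubles the audio seen per pseudo-epoch instead of halving the steps.
```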