Comments (2)
First, to triage whether the problem is the model or the data store, run with a subset of the data, maybe 50 hours or so. What is the maximum duration of the utterances in your data? Reduce it to at most 40 seconds, preferably 30 seconds. We have some tools to segment data automatically.
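As a rough sketch of the subset step above (not an official NeMo tool), the snippet below filters a JSON-lines manifest by utterance length and then samples up to a target number of hours. It assumes the usual NeMo manifest layout with a `duration` field in seconds on each line; the function name `filter_manifest`, its parameters, and the file paths are hypothetical.

```python
import json
import random


def filter_manifest(in_path, out_path, max_duration=30.0, target_hours=50.0, seed=0):
    """Keep utterances of at most max_duration seconds, then sample up to target_hours.

    Assumes each non-empty line is a JSON object with a "duration" field in seconds.
    Returns (number of kept utterances, total kept hours).
    """
    with open(in_path, encoding="utf-8") as f:
        entries = [json.loads(line) for line in f if line.strip()]

    # Drop long utterances first.
    short = [e for e in entries if float(e["duration"]) <= max_duration]

    # Shuffle deterministically so the subset is not biased by manifest order.
    random.Random(seed).shuffle(short)

    budget_seconds = target_hours * 3600.0
    kept, total_seconds = [], 0.0
    for entry in short:
        duration = float(entry["duration"])
        if total_seconds + duration > budget_seconds:
            continue  # skip anything that would exceed the hour budget
        kept.append(entry)
        total_seconds += duration

    with open(out_path, "w", encoding="utf-8") as f:
        for entry in kept:
            f.write(json.dumps(entry, ensure_ascii=False) + "\n")

    return len(kept), total_seconds / 3600.0
```

For example, `filter_manifest("train_manifest.json", "train_subset.json")` would write a subset of roughly 50 hours with no utterance longer than 30 seconds. Note that this drops long utterances instead of segmenting them.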
Next, an NCCL timeout is hard to debug: NeMo code mostly uses PyTorch and we don't do much at the NCCL level, so it can have many different causes. First check whether fine-tuning the model on a single GPU with a small batch size works, then try two GPUs.
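One way to lay out that staged check is sketched below. It only builds command lines with Hydra-style overrides. The script name `finetune.py`, the batch size of 4, and the override keys `trainer.devices` and `model.train_ds.batch_size` are assumptions based on the layout of typical NeMo ASR configs, so adjust them to your own config.

```python
def triage_stages(script="finetune.py", base_overrides=(), batch_size=4):
    """Return (description, command) pairs, smallest setup first.

    The script name and override keys are illustrative assumptions,
    not taken from the original thread.
    """
    stages = [
        ("single GPU, small batch", 1),
        ("two GPUs, small batch", 2),
    ]
    commands = []
    for description, devices in stages:
        overrides = list(base_overrides) + [
            f"trainer.devices={devices}",
            f"model.train_ds.batch_size={batch_size}",
        ]
        commands.append((description, " ".join(["python", script] + overrides)))
    return commands


if __name__ == "__main__":
    for description, command in triage_stages():
        print(f"# {description}\n{command}")
```

If the single-GPU run trains normally and the two-GPU run hangs, the problem is more likely in the multi-GPU communication setup than in the model or the data.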
The LR and optimizer state are preserved in the checkpoint files saved by Lightning during training. If you use exp manager, resuming a job is quite easy; see the exp manager docs and the tutorials that showcase training with it (just run the same script again with the same output dir, provided you have set the two resume flags in exp manager).
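A minimal sketch of what that resume setup could look like, written as a plain Python dict that mirrors the `exp_manager` section of a config. I am assuming the two resume flags meant here are `resume_if_exists` and `resume_ignore_no_checkpoint`; check the exp manager docs for your NeMo version. The `exp_dir` and `name` values and the helper `to_hydra_overrides` are hypothetical.

```python
# Mirrors the exp_manager section of a training config (values are placeholders).
exp_manager_cfg = {
    "exp_dir": "experiments/asr_finetune",  # hypothetical output dir, keep it the same across runs
    "name": "finetune_run",                 # hypothetical experiment name
    "resume_if_exists": True,               # assumed flag: resume from an existing checkpoint
    "resume_ignore_no_checkpoint": True,    # assumed flag: start fresh if none exists yet
}


def to_hydra_overrides(cfg, prefix="exp_manager"):
    """Turn the dict into Hydra-style command-line overrides (helper is illustrative)."""
    overrides = []
    for key, value in cfg.items():
        if isinstance(value, bool):
            value = str(value).lower()  # Hydra/YAML booleans are lowercase
        overrides.append(f"{prefix}.{key}={value}")
    return overrides


if __name__ == "__main__":
    print(" ".join(to_hydra_overrides(exp_manager_cfg)))
```

With both flags set and the same output dir, rerunning the same script is meant to pick up the latest checkpoint, including the LR and optimizer state.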
Our team doesn't have much information about how hardware affects particular operations; we rely on PyTorch and PyTorch Lightning to provide a stable training engine.
from nemo.
I have the same issue. Have you solved it? How can we avoid it?
Related Issues (20)
- Citrinet CTC Decoder Alphabet size mismatch.
- Segmentation fault when fine-tuning Ambernet HOT 3
- Internal error when running model.transcribe() on FastConformer-Hybrid-Transducer-CTC-BPE model.
- [question] What datasets are used in the training of stt eu model?
- Very poor WER for Conformer_CTC_large model in streaming mode
- RuntimeError "Unexpected key" when running checkpoint_converters script convert_got_nemo_to_mcore.py
- How to adapt myself speaker model into the diarization pipeline?
- More complete example of using S3CheckpointIO
- Object shard /models/Nemotron-4-340B-Reward/model_weights/model.rm_head._extra_state/shard_0_1.pt not found
- When should mcore_gpt: True be used?
- Add sequence packing and proper attention masking support for LLM pretraining? HOT 1
- Util for measuring MFU? HOT 2
- RuntimeError: Error(s) in loading state_dict for MegaMolBARTModel after ANY fine tuning HOT 1
- Add KV-Cache for MegatronLMEncoderDecoderModel
- Question: Which decoder are we supposed to use on parakeet-tdt_ctc-1.1b model?
- Not work even use the official docker when multiple GPU training LLM
- CPU memory keeps increasing in every step during training LLM with Nemo framework? HOT 1
- Unable to reproduce cache aware streaming results for Conformer that were there for Fastconformer.
- I wanted to train a multitask model, like canary. needed more information on how to build the tokenizer, and data manifest file.