Comments (13)
Does this only happen in decode.2.log, while the others are fine? If so, I expect this is just a problem of some accidental data access. If it always happens, it would be due to a bug. Can you re-run only the decoding by adding the option --stage 5?
from espnet.
No, this happens in all decode.*.log files. I re-ran it with the option --stage 5, and it happened again.
Thanks.
I'll check it.
I did not test exactly the same setup, but I did not observe the issue. The training may have had some issues. Can you take a look at exp/.../train.log? Also, can you check that the model exists at exp/.../results/model.acc.best?
@chiayuli @sw005320
This is caused by torch.nn.DataParallel.
We have to change the saving function when using DataParallel as follows:
(This is from the code of another project of mine.)
    if args.n_gpus > 1:
        torch.save({"model": model.module.state_dict()}, args.expdir + "/checkpoint-final.pkl")
    else:
        torch.save({"model": model.state_dict()}, args.expdir + "/checkpoint-final.pkl")
@bobchennan Could you fix it?
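A minimal sketch of the same idea without the GPU-count branch, assuming only two known DataParallel behaviours: the wrapper keeps the real network in its `module` attribute, and its `state_dict()` keys carry a `module.` prefix. The classes `TinyNet` and `FakeDataParallel` and the helper `unwrapped_state_dict` are hypothetical stand-ins so the sketch runs without PyTorch; they are not part of espnet or torch.

```python
class TinyNet(object):
    """Stand-in for a network; NOT a PyTorch class."""

    def state_dict(self):
        return {"enc.weight": 1.0, "dec.weight": 2.0}


class FakeDataParallel(object):
    """Stand-in that mimics two DataParallel behaviours:
    it stores the wrapped network in .module and prefixes
    every state_dict key with "module."."""

    def __init__(self, module):
        self.module = module

    def state_dict(self):
        inner = self.module.state_dict()
        return {"module." + k: v for k, v in inner.items()}


def unwrapped_state_dict(model):
    # Hypothetical helper: use the wrapped network if there is one,
    # otherwise the model itself, so saved keys never carry "module.".
    return getattr(model, "module", model).state_dict()


plain = TinyNet()
wrapped = FakeDataParallel(plain)

print(sorted(wrapped.state_dict()))            # keys with the prefix
print(sorted(unwrapped_state_dict(wrapped)))   # keys without the prefix
```

With a real model, the dictionary returned by such a helper would be what gets passed to torch.save, so the checkpoint looks the same whether or not training used several GPUs.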
Thanks, I'll try it and report back.
Yes that is caused by DataParallel. I will fix it soon.
Hi all,
I modified the code in asr_pytorch.py as in #157, but now a different error occurs (in torch_load) while training the acoustic model. Is there any modification to the torch_load(path, obj) function? Many thanks.
=== commands ===
export CUDA_VISIBLE_DEVICES=0,2,3 ; nohup ./run.sh --ngpu 3 --stage 4 --backend pytorch --etype blstmp >> run.log&
=== log ===
Exception in main training loop: 'unexpected key "predictor.enc.enc1.bilstm0.weight_ih_l0" in state_dict'
Traceback (most recent call last):
File "/mount/arbeitsdaten40/projekte/asr/licu/Espnet/tools/venv/lib/python2.7/site-packages/chainer/training/trainer.py", line 309, in run
entry.extension(self)
File "/mount/arbeitsdaten/asr/licu/Espnet/src/asr/asr_utils.py", line 110, in restore_snapshot
_restore_snapshot(model, snapshot, load_fn)
File "/mount/arbeitsdaten/asr/licu/Espnet/src/asr/asr_utils.py", line 116, in _restore_snapshot
load_fn(snapshot, model)
File "/mount/arbeitsdaten/asr/licu/Espnet/src/asr/asr_pytorch.py", line 270, in torch_load
model.load_state_dict(torch.load(path))
File "/mount/arbeitsdaten40/projekte/asr/licu/Espnet/tools/venv/lib/python2.7/site-packages/torch/nn/modules/module.py", line 522, in load_state_dict
.format(name))
Will finalize trainer extensions and updater before reraising the exception.
Traceback (most recent call last):
File "/mount/arbeitsdaten/asr/licu/Espnet/egs/chime5/asr1/../../../src/bin/asr_train.py", line 196, in <module>
main()
File "/mount/arbeitsdaten/asr/licu/Espnet/egs/chime5/asr1/../../../src/bin/asr_train.py", line 190, in main
train(args)
File "/mount/arbeitsdaten/asr/licu/Espnet/src/asr/asr_pytorch.py", line 308, in train
trainer.run()
File "/mount/arbeitsdaten40/projekte/asr/licu/Espnet/tools/venv/lib/python2.7/site-packages/chainer/training/trainer.py", line 320, in run
six.reraise(*sys.exc_info())
File "/mount/arbeitsdaten40/projekte/asr/licu/Espnet/tools/venv/lib/python2.7/site-packages/chainer/training/trainer.py", line 309, in run
entry.extension(self)
File "/mount/arbeitsdaten/asr/licu/Espnet/src/asr/asr_utils.py", line 110, in restore_snapshot
_restore_snapshot(model, snapshot, load_fn)
File "/mount/arbeitsdaten/asr/licu/Espnet/src/asr/asr_utils.py", line 116, in _restore_snapshot
load_fn(snapshot, model)
File "/mount/arbeitsdaten/asr/licu/Espnet/src/asr/asr_pytorch.py", line 270, in torch_load
model.load_state_dict(torch.load(path))
File "/mount/arbeitsdaten40/projekte/asr/licu/Espnet/tools/venv/lib/python2.7/site-packages/torch/nn/modules/module.py", line 522, in load_state_dict
.format(name))
KeyError: 'unexpected key "predictor.enc.enc1.bilstm0.weight_ih_l0" in state_dict'
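One way to narrow down this kind of error is to compare the key names the model expects with the key names stored in the snapshot. The sketch below is only an illustration: `diff_keys` is a hypothetical helper, and the two key lists assume, without confirmation from the log, that the mismatch is a `module.` prefix present on the model side only.

```python
def diff_keys(model_keys, checkpoint_keys):
    """Return (unexpected, missing) as sorted lists.

    unexpected: keys in the checkpoint that the model does not have
    missing:    keys the model has that the checkpoint lacks
    """
    model_keys = set(model_keys)
    checkpoint_keys = set(checkpoint_keys)
    unexpected = sorted(checkpoint_keys - model_keys)
    missing = sorted(model_keys - checkpoint_keys)
    return unexpected, missing


# Assumed key sets for illustration only (not taken from a real snapshot):
wrapped_model_keys = ["module.predictor.enc.enc1.bilstm0.weight_ih_l0"]
snapshot_keys = ["predictor.enc.enc1.bilstm0.weight_ih_l0"]

unexpected, missing = diff_keys(wrapped_model_keys, snapshot_keys)
print("unexpected:", unexpected)
print("missing:", missing)
```

In a real session the two arguments could be model.state_dict().keys() and torch.load(path).keys(); a pair of names that differ only by the `module.` prefix would point at DataParallel wrapping on one side.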
@chiayuli new updates of #157 should fix it.
Still, I think there are some problems with PyTorch multi-GPU. I would suggest merging with #155 so that we can test it as soon as possible.
@bobchennan There is still an error when loading the model trained with multi-gpu.
    def remove_dataparallel(state_dict):
        from collections import OrderedDict
        new_state_dict = OrderedDict()
        for k, v in state_dict.items():
            if k.startswith("module."):
                name = k[7:]
                new_state_dict[name] = v
        return new_state_dict
This should be
    def remove_dataparallel(state_dict):
        from collections import OrderedDict
        new_state_dict = OrderedDict()
        for k, v in state_dict.items():
            if k.startswith("module."):
                name = k[7:]
                new_state_dict[name] = v
            else:
                new_state_dict[k] = v
        return new_state_dict
I will make a PR to fix it.
It is included in #173:
    for k, v in state_dict.items():
        if k.startswith("module."):
            k = k[7:]
        new_state_dict[k] = v
but I agree it is better to make a separate pull request and merge it as soon as possible.
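For reference, the loop above can be wrapped into a complete function and checked on a small dictionary. This is a sketch rather than the code merged in either PR: the `prefix` parameter and the sample keys are assumptions added for illustration.

```python
from collections import OrderedDict


def remove_dataparallel(state_dict, prefix="module."):
    """Return a copy of state_dict with a leading `prefix` removed from keys.

    Keys without the prefix are copied unchanged, so checkpoints saved
    with or without DataParallel both come out with plain key names.
    """
    new_state_dict = OrderedDict()
    for k, v in state_dict.items():
        if k.startswith(prefix):
            k = k[len(prefix):]  # same as k[7:] for "module."
        new_state_dict[k] = v
    return new_state_dict


# Hypothetical keys: one saved through DataParallel, one saved without it.
mixed = OrderedDict([("module.enc.weight", 1), ("dec.weight", 2)])
cleaned = remove_dataparallel(mixed)
print(list(cleaned.keys()))
```

The first version quoted earlier in the thread would have dropped `dec.weight` here, because it only copied keys that start with `module.`.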
Oh sorry, I overlooked it.
I will merge the PR with the fix.
Now fixed.