unexpected keyError during decoding about espnet HOT 13 CLOSED

espnet commented on May 13, 2024

unexpected keyError during decoding

from espnet.

Comments (13)

sw005320 commented on May 13, 2024

Does this happen only in decode.2.log, while the other logs are fine? If so, I expect it is just an accidental data access problem. If it always happens, it is probably due to a bug. Can you re-run only the decoding step by adding the option --stage 5?


chiayuli commented on May 13, 2024

No, this happens in all decode.*.log files. I re-ran it with the --stage 5 option, and it happened again.


sw005320 commented on May 13, 2024

Thanks.
I'll check it.


sw005320 commented on May 13, 2024

I did not test exactly the same setup, but I did not observe the issue. The training may have had some problems. Can you take a look at exp/.../train.log? Also, can you check that the model exists at exp/.../results/model.acc.best?


kan-bayashi commented on May 13, 2024

@chiayuli @sw005320
This is caused by torch.nn.DataParallel.
We have to change the saving function when using DataParallel as follows:
(This is from the code of another project of mine.)

    if args.n_gpus > 1:
        torch.save({"model": model.module.state_dict()}, args.expdir + "/checkpoint-final.pkl")
    else:
        torch.save({"model": model.state_dict()}, args.expdir + "/checkpoint-final.pkl")

@bobchennan Could you fix it?
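
For readers unfamiliar with the cause: torch.nn.DataParallel keeps the wrapped model in its `.module` attribute, so the keys of the wrapper's `state_dict()` carry a `module.` prefix that a plain model does not expect. The sketch below illustrates the same idea as the snippet above without requiring PyTorch. `TinyModel`, `FakeDataParallel`, and `unwrapped_state_dict` are hypothetical stand-ins for illustration, not ESPnet or PyTorch code.

```python
from collections import OrderedDict


class TinyModel:
    """Hypothetical stand-in for a torch.nn.Module; only state_dict() is modelled."""

    def state_dict(self):
        return OrderedDict([("enc.weight", [1.0, 2.0]), ("enc.bias", [0.0])])


class FakeDataParallel:
    """Hypothetical stand-in for torch.nn.DataParallel.

    Like the real wrapper, it stores the model in .module and
    prefixes every state_dict key with "module.".
    """

    def __init__(self, module):
        self.module = module

    def state_dict(self):
        inner = self.module.state_dict()
        return OrderedDict(("module." + k, v) for k, v in inner.items())


def unwrapped_state_dict(model):
    # Use the inner model when a wrapper is present, otherwise the model itself.
    inner = getattr(model, "module", model)
    return inner.state_dict()


plain = TinyModel()
wrapped = FakeDataParallel(plain)

print(list(wrapped.state_dict()))            # keys with the "module." prefix
print(list(unwrapped_state_dict(wrapped)))   # keys without the prefix
print(list(unwrapped_state_dict(plain)))     # unchanged for a plain model
```

Checking for the `.module` attribute instead of `args.n_gpus > 1` is a design choice of this sketch: it avoids duplicating the save call, at the cost of assuming no unwrapped model has an unrelated attribute named `module`.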


chiayuli commented on May 13, 2024

Thanks, I'll try it and report back.


bobchennan commented on May 13, 2024

Yes that is caused by DataParallel. I will fix it soon.


chiayuli commented on May 13, 2024

Hi all,
I modified the code in asr_pytorch.py as in #157.
But now another error occurs (in torch_load) while training the acoustic model. Does the torch_load(path, obj) function also need to be modified? Many thanks.

=== commands ===
export CUDA_VISIBLE_DEVICES=0,2,3 ; nohup ./run.sh --ngpu 3 --stage 4 --backend pytorch --etype blstmp >> run.log&
=== log ===
Exception in main training loop: 'unexpected key "predictor.enc.enc1.bilstm0.weight_ih_l0" in state_dict'
Traceback (most recent call last):
File "/mount/arbeitsdaten40/projekte/asr/licu/Espnet/tools/venv/lib/python2.7/site-packages/chainer/training/trainer.py", line 309, in run
entry.extension(self)
File "/mount/arbeitsdaten/asr/licu/Espnet/src/asr/asr_utils.py", line 110, in restore_snapshot
_restore_snapshot(model, snapshot, load_fn)
File "/mount/arbeitsdaten/asr/licu/Espnet/src/asr/asr_utils.py", line 116, in _restore_snapshot
load_fn(snapshot, model)
File "/mount/arbeitsdaten/asr/licu/Espnet/src/asr/asr_pytorch.py", line 270, in torch_load
model.load_state_dict(torch.load(path))
File "/mount/arbeitsdaten40/projekte/asr/licu/Espnet/tools/venv/lib/python2.7/site-packages/torch/nn/modules/module.py", line 522, in load_state_dict
.format(name))
Will finalize trainer extensions and updater before reraising the exception.
Traceback (most recent call last):
File "/mount/arbeitsdaten/asr/licu/Espnet/egs/chime5/asr1/../../../src/bin/asr_train.py", line 196, in <module>
main()
File "/mount/arbeitsdaten/asr/licu/Espnet/egs/chime5/asr1/../../../src/bin/asr_train.py", line 190, in main
train(args)
File "/mount/arbeitsdaten/asr/licu/Espnet/src/asr/asr_pytorch.py", line 308, in train
trainer.run()
File "/mount/arbeitsdaten40/projekte/asr/licu/Espnet/tools/venv/lib/python2.7/site-packages/chainer/training/trainer.py", line 320, in run
six.reraise(*sys.exc_info())
File "/mount/arbeitsdaten40/projekte/asr/licu/Espnet/tools/venv/lib/python2.7/site-packages/chainer/training/trainer.py", line 309, in run
entry.extension(self)
File "/mount/arbeitsdaten/asr/licu/Espnet/src/asr/asr_utils.py", line 110, in restore_snapshot
_restore_snapshot(model, snapshot, load_fn)
File "/mount/arbeitsdaten/asr/licu/Espnet/src/asr/asr_utils.py", line 116, in _restore_snapshot
load_fn(snapshot, model)
File "/mount/arbeitsdaten/asr/licu/Espnet/src/asr/asr_pytorch.py", line 270, in torch_load
model.load_state_dict(torch.load(path))
File "/mount/arbeitsdaten40/projekte/asr/licu/Espnet/tools/venv/lib/python2.7/site-packages/torch/nn/modules/module.py", line 522, in load_state_dict
.format(name))
KeyError: 'unexpected key "predictor.enc.enc1.bilstm0.weight_ih_l0" in state_dict'
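
One possible reading of this log, which is an assumption and not confirmed in the thread, is that the snapshot was saved without the `module.` prefix while the model being restored was still wrapped in DataParallel, so the two sets of keys no longer match. A small check like the following sketch can make such a mismatch visible before calling `load_state_dict`. The helper `compare_keys` is hypothetical; in a real run the two key lists would come from `model.state_dict().keys()` and `torch.load(path).keys()`.

```python
def compare_keys(model_keys, checkpoint_keys):
    """Return (unexpected, missing) key lists, both sorted.

    unexpected: keys in the checkpoint that the model does not have
    missing:    keys the model has that the checkpoint lacks
    """
    model_keys = set(model_keys)
    checkpoint_keys = set(checkpoint_keys)
    unexpected = sorted(checkpoint_keys - model_keys)
    missing = sorted(model_keys - checkpoint_keys)
    return unexpected, missing


# Illustrative key lists: the checkpoint key is taken from the log above,
# the model key assumes a DataParallel-wrapped model.
checkpoint_keys = ["predictor.enc.enc1.bilstm0.weight_ih_l0"]
model_keys = ["module.predictor.enc.enc1.bilstm0.weight_ih_l0"]

unexpected, missing = compare_keys(model_keys, checkpoint_keys)
print("unexpected:", unexpected)
print("missing:", missing)
```

If every unexpected key equals a missing key with `module.` removed, the mismatch is only the DataParallel prefix and not a difference in model architecture.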


bobchennan commented on May 13, 2024

@chiayuli The new updates of #157 should fix it.

I think there are still some problems with PyTorch multi-GPU, though. I would suggest merging this with #155 so that we can test it as soon as possible.


kan-bayashi commented on May 13, 2024

@bobchennan There is still an error when loading a model trained with multi-GPU. The current function drops every key that does not start with "module.":

    def remove_dataparallel(state_dict):
        from collections import OrderedDict
        new_state_dict = OrderedDict()
        for k, v in state_dict.items():
            if k.startswith("module."):
                name = k[7:]
                new_state_dict[name] = v
        return new_state_dict

This should be

    def remove_dataparallel(state_dict):
        from collections import OrderedDict
        new_state_dict = OrderedDict()
        for k, v in state_dict.items():
            if k.startswith("module."):
                name = k[7:]
                new_state_dict[name] = v
            else:
                new_state_dict[k] = v
        return new_state_dict

I will make a PR to fix it.
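
The difference between the two versions above can be checked on plain dictionaries, without PyTorch. In this sketch the function names `remove_dataparallel_buggy` and `remove_dataparallel_fixed` and the example keys are hypothetical, chosen only to tell the two variants apart.

```python
from collections import OrderedDict


def remove_dataparallel_buggy(state_dict):
    # First version: keys without the "module." prefix are silently skipped.
    new_state_dict = OrderedDict()
    for k, v in state_dict.items():
        if k.startswith("module."):
            new_state_dict[k[7:]] = v
    return new_state_dict


def remove_dataparallel_fixed(state_dict):
    # Corrected version: strip the prefix when present, keep every key.
    new_state_dict = OrderedDict()
    for k, v in state_dict.items():
        if k.startswith("module."):
            k = k[7:]  # len("module.") == 7
        new_state_dict[k] = v
    return new_state_dict


# Illustrative state dicts; values stand in for parameter tensors.
single_gpu = OrderedDict([("predictor.enc.weight", 1), ("predictor.dec.weight", 2)])
multi_gpu = OrderedDict(("module." + k, v) for k, v in single_gpu.items())

print(list(remove_dataparallel_buggy(single_gpu)))          # []
print(list(remove_dataparallel_fixed(single_gpu)))          # both keys kept
print(remove_dataparallel_fixed(multi_gpu) == single_gpu)   # True
```

With the first version, a state dict that has no prefixed keys comes back empty, so the subsequent load sees no parameters at all.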


bobchennan commented on May 13, 2024

It is included in #173:

    for k, v in state_dict.items():
        if k.startswith("module."):
            k = k[7:]
        new_state_dict[k] = v

but I agree it is better to make a separate pull request and merge it as soon as possible.


kan-bayashi commented on May 13, 2024

Oh sorry, I overlooked it.
I will merge the PR that fixes it.


kan-bayashi commented on May 13, 2024

Now fixed.

