Comments (13)

sw005320 commented on May 13, 2024

Does this happen only in decode.2.log, while the other logs are fine? If so, I expect it is just an accidental data-access problem. If it always happens, it is probably due to a bug. Can you re-run only the decoding by adding the option --stage 5?

from espnet.

chiayuli commented on May 13, 2024

No, this happens in all decode.*.log files. I re-ran it with the --stage 5 option, and it happened again.

sw005320 commented on May 13, 2024

Thanks.
I'll check it.

sw005320 commented on May 13, 2024

I did not test exactly the same setup, but I did not observe the issue. The training itself may have had a problem. Can you take a look at exp/.../train.log? Also, can you check that the model exists at exp/.../results/model.acc.best?

kan-bayashi commented on May 13, 2024

@chiayuli @sw005320
This is caused by torch.nn.DataParallel.
We have to change the saving function when using DataParallel, as follows
(this is taken from the code of another project of mine):

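    # DataParallel keeps the original network in `.module`; saving its
    # state_dict keeps the "module." prefix out of the checkpoint keys.
    if args.n_gpus > 1:
        torch.save({"model": model.module.state_dict()}, args.expdir + "/checkpoint-final.pkl")
    else:
        torch.save({"model": model.state_dict()}, args.expdir + "/checkpoint-final.pkl")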

@bobchennan Could you fix it?
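The same idea can be written without branching on the GPU count. This is only a minimal sketch, assuming the wrapper exposes the original network as `.module` (as torch.nn.DataParallel does); `unwrap_model`, `TinyNet` and `FakeDataParallel` are hypothetical names, and the two classes are stand-ins so that the sketch runs without PyTorch.

```python
def unwrap_model(model):
    """Return the inner network if `model` is a DataParallel-style wrapper."""
    # torch.nn.DataParallel stores the original network in `.module`;
    # a plain model normally has no such attribute and is returned as-is.
    return getattr(model, "module", model)


# Stand-in classes (assumptions for illustration, not ESPnet or PyTorch code).
class TinyNet:
    def state_dict(self):
        return {"enc.weight": 1.0}


class FakeDataParallel:
    def __init__(self, module):
        self.module = module

    def state_dict(self):
        # Mimics how the wrapper prefixes every key with "module."
        return {"module." + k: v for k, v in self.module.state_dict().items()}


net = TinyNet()
wrapped = FakeDataParallel(net)
print(sorted(wrapped.state_dict()))                # keys carry the "module." prefix
print(sorted(unwrap_model(wrapped).state_dict()))  # prefix-free keys
```

With real PyTorch objects one would then save `unwrap_model(model).state_dict()`. Note that the `getattr` shortcut would misfire on a plain model that happens to define its own `module` attribute.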

chiayuli commented on May 13, 2024

Thanks, I'll try it and report back.

bobchennan commented on May 13, 2024

Yes that is caused by DataParallel. I will fix it soon.

chiayuli commented on May 13, 2024

Hi all,
I modified the code in asr_pytorch.py as in #157,
but now another error occurs (in torch_load) while training the acoustic model. Does the torch_load(path, obj) function need any modification? Many thanks.

=== commands ===
export CUDA_VISIBLE_DEVICES=0,2,3 ; nohup ./run.sh --ngpu 3 --stage 4 --backend pytorch --etype blstmp >> run.log&
=== log ===
Exception in main training loop: 'unexpected key "predictor.enc.enc1.bilstm0.weight_ih_l0" in state_dict'
Traceback (most recent call last):
File "/mount/arbeitsdaten40/projekte/asr/licu/Espnet/tools/venv/lib/python2.7/site-packages/chainer/training/trainer.py", line 309, in run
entry.extension(self)
File "/mount/arbeitsdaten/asr/licu/Espnet/src/asr/asr_utils.py", line 110, in restore_snapshot
_restore_snapshot(model, snapshot, load_fn)
File "/mount/arbeitsdaten/asr/licu/Espnet/src/asr/asr_utils.py", line 116, in _restore_snapshot
load_fn(snapshot, model)
File "/mount/arbeitsdaten/asr/licu/Espnet/src/asr/asr_pytorch.py", line 270, in torch_load
model.load_state_dict(torch.load(path))
File "/mount/arbeitsdaten40/projekte/asr/licu/Espnet/tools/venv/lib/python2.7/site-packages/torch/nn/modules/module.py", line 522, in load_state_dict
.format(name))
Will finalize trainer extensions and updater before reraising the exception.
Traceback (most recent call last):
File "/mount/arbeitsdaten/asr/licu/Espnet/egs/chime5/asr1/../../../src/bin/asr_train.py", line 196, in <module>
main()
File "/mount/arbeitsdaten/asr/licu/Espnet/egs/chime5/asr1/../../../src/bin/asr_train.py", line 190, in main
train(args)
File "/mount/arbeitsdaten/asr/licu/Espnet/src/asr/asr_pytorch.py", line 308, in train
trainer.run()
File "/mount/arbeitsdaten40/projekte/asr/licu/Espnet/tools/venv/lib/python2.7/site-packages/chainer/training/trainer.py", line 320, in run
six.reraise(*sys.exc_info())
File "/mount/arbeitsdaten40/projekte/asr/licu/Espnet/tools/venv/lib/python2.7/site-packages/chainer/training/trainer.py", line 309, in run
entry.extension(self)
File "/mount/arbeitsdaten/asr/licu/Espnet/src/asr/asr_utils.py", line 110, in restore_snapshot
_restore_snapshot(model, snapshot, load_fn)
File "/mount/arbeitsdaten/asr/licu/Espnet/src/asr/asr_utils.py", line 116, in _restore_snapshot
load_fn(snapshot, model)
File "/mount/arbeitsdaten/asr/licu/Espnet/src/asr/asr_pytorch.py", line 270, in torch_load
model.load_state_dict(torch.load(path))
File "/mount/arbeitsdaten40/projekte/asr/licu/Espnet/tools/venv/lib/python2.7/site-packages/torch/nn/modules/module.py", line 522, in load_state_dict
.format(name))
KeyError: 'unexpected key "predictor.enc.enc1.bilstm0.weight_ih_l0" in state_dict'
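
One plausible reading of this error, which the thread does not confirm, is that the snapshot was written without the "module." prefix while it is being loaded back into a DataParallel-wrapped model, whose parameter names do carry that prefix. The sketch below only compares key names to show the mismatch; `diff_keys` is a hypothetical helper, and plain lists stand in for the real state_dict keys.

```python
def diff_keys(model_keys, checkpoint_keys):
    """Return (unexpected, missing) key names, as sorted lists."""
    model_keys, checkpoint_keys = set(model_keys), set(checkpoint_keys)
    unexpected = sorted(checkpoint_keys - model_keys)  # in file, not in model
    missing = sorted(model_keys - checkpoint_keys)     # in model, not in file
    return unexpected, missing


# Assumed situation: the wrapped model expects prefixed names,
# while the snapshot holds the same parameter without the prefix.
model_keys = ["module.predictor.enc.enc1.bilstm0.weight_ih_l0"]
checkpoint_keys = ["predictor.enc.enc1.bilstm0.weight_ih_l0"]

unexpected, missing = diff_keys(model_keys, checkpoint_keys)
print("unexpected:", unexpected)
print("missing:", missing)
```

If the two lists differ only by the "module." prefix, saving and loading are not using the same (wrapped or unwrapped) object.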

bobchennan commented on May 13, 2024

@chiayuli The new updates to #157 should fix it.

I think there are still some problems with PyTorch multi-GPU, though. I suggest merging with #155 so that we can test it as soon as possible.

kan-bayashi commented on May 13, 2024

@bobchennan There is still an error when loading a model trained with multiple GPUs. The current function drops every key that does not start with "module.":

    def remove_dataparallel(state_dict):
        from collections import OrderedDict
        new_state_dict = OrderedDict()
        for k, v in state_dict.items():
            if k.startswith("module."):
                name = k[7:]
                new_state_dict[name] = v
        return new_state_dict

This should be

    def remove_dataparallel(state_dict):
        from collections import OrderedDict
        new_state_dict = OrderedDict()
        for k, v in state_dict.items():
            if k.startswith("module."):
                name = k[7:]
                new_state_dict[name] = v
            else:
                new_state_dict[k] = v
        return new_state_dict

I will make a PR to fix it.

bobchennan commented on May 13, 2024

It is included in #173:

    for k, v in state_dict.items():
        if k.startswith("module."):
            k = k[7:]
        new_state_dict[k] = v

but I agree it is better to make a separate pull request and merge it as soon as possible.
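
For reference, the loop quoted above can be placed in a complete function and tried on a small dictionary. This is a sketch rather than the exact code of #173: the function name follows the earlier comment, and an OrderedDict with integer values stands in for a real state_dict.

```python
from collections import OrderedDict


def remove_dataparallel(state_dict):
    """Strip a leading "module." from keys and keep all other keys unchanged."""
    new_state_dict = OrderedDict()
    for k, v in state_dict.items():
        if k.startswith("module."):
            k = k[len("module."):]  # same as k[7:]
        new_state_dict[k] = v
    return new_state_dict


# Mixed input: one key saved from a wrapped model, one from a plain model.
saved = OrderedDict([("module.enc.weight", 1), ("dec.bias", 2)])
cleaned = remove_dataparallel(saved)
print(list(cleaned.items()))
```

Because unprefixed keys pass through untouched, the same function works on checkpoints saved with or without DataParallel.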

kan-bayashi commented on May 13, 2024

Oh sorry, I overlooked it.
I will merge the PR with the fix.

kan-bayashi commented on May 13, 2024

Now fixed.
