Toward a stable version (espnet, closed, 26 comments)

espnet commented on May 13, 2024

Toward a stable version

Comments (26)

sw005320 commented on May 13, 2024

Guys, by combining LSTMLM and joint attention/CTC decoding, we finally get CER 5.3 -> 3.8 and WER 14.7 -> 9.3 in the WSJ task!!! The nice thing is that we don't have to set min/max length ratios or the penalty (all set to 0.0), though we might need to tune the CTC and LM weights (0.3 and 1.0, respectively; see #76).
@kan-bayashi, can you play with LSTMLM and joint decoding in the TEDLIUM recipe? You can train the LSTMLM on the text data by referring to tools/kaldi/egs/tedlium/s5_r2/local/ted_train_lm.sh and simply using

gunzip -c db/TEDLIUM_release2/LM/*.en.gz | sed 's/ <\/s>//g' | local/join_suffix.py | gzip -c  > ${dir}/data/text/train.txt.gz
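
For reference, a minimal sketch of how the three scores are presumably combined per hypothesis during beam search (the interpolation form is my assumption based on the weights above; the names are hypothetical):

```python
def joint_score(logp_att, logp_ctc, logp_lm, ctc_weight=0.3, lm_weight=1.0):
    """Combine attention, CTC, and LSTMLM log-probabilities.

    ctc_weight interpolates the attention and CTC scores; lm_weight
    scales the separately trained LSTMLM score (0.3 and 1.0 above).
    """
    return ((1.0 - ctc_weight) * logp_att
            + ctc_weight * logp_ctc
            + lm_weight * logp_lm)
```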

kan-bayashi commented on May 13, 2024

The results on TEDLIUM with CTC joint decoding and LM rescoring are as follows:

All rows are Sum/Avg from exp/train_trim_vggblstmp_e6_subsample1_2_2_1_1_unit320_proj320_d1_unit300_location_aconvc10_aconvf100_mtlalpha0.5_adadelta_bs30_mli800_mlo150, decoded with beam20_eacc.best_p0.1_len0.0-0.0_ctcw0.3_rnnlm1.0:

| Set (level) | # Snt | # Wrd | Corr | Sub | Del | Ins | Err | S.Err |
|---|---|---|---|---|---|---|---|---|
| dev (char, result.txt) | 507 | 95429 | 91.8 | 4.2 | 4.0 | 2.7 | 10.8 | 89.3 |
| test (char, result.txt) | 1155 | 145066 | 92.2 | 3.7 | 4.1 | 2.4 | 10.1 | 85.3 |
| dev (word, result.wrd.txt) | 507 | 17783 | 83.2 | 13.7 | 3.1 | 3.0 | 19.8 | 89.3 |
| test (word, result.wrd.txt) | 1155 | 27500 | 84.0 | 12.3 | 3.7 | 2.6 | 18.6 | 85.3 |

for dev set, CER 12.6 -> 10.8, WER 24.8 -> 19.8
for test set, CER 11.9 -> 10.1, WER 23.4 -> 18.6

kan-bayashi commented on May 13, 2024

At least, we should add docstrings for src/nets.

ShigekiKarita commented on May 13, 2024

I agree. The type (class) and shape of each tensor are essential information for everyone.
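
As a hypothetical illustration of the kind of docstring this would mean for src/nets (the function and shapes are made up for the example):

```python
def forward(self, xs, ilens):
    """Encoder forward pass (illustrative only).

    :param list xs: batch of input feature sequences, each a
        numpy.ndarray of shape (T_i, D) with T_i frames and D features
    :param list ilens: input lengths T_i of each sequence (int)
    :return: padded hidden state sequences of shape (B, T_max, H)
    """
```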

sw005320 commented on May 13, 2024

Frankly, I don't have much experience with this, so if @kan-bayashi initiates it, I'll follow and add/modify/enhance the documentation.
Also, we should host a webpage somewhere.
Do you have any ideas (e.g., just using GitHub's website hosting service)?

sw005320 commented on May 13, 2024

The implementation of end detection is finished (#46).

The performance decreased only (really) slightly, which is quite effective considering that we no longer have to tune the maxlenratio parameter. We can make this the default in the future (with maxlenratio=0.0, end detection kicks in).
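
For intuition, here is a rough sketch of the end-detection rule as I read it from this description (not necessarily the exact code in #46): stop the beam search once the hypotheses that ended at each of the last few lengths all fall far below the best ended hypothesis.

```python
def end_detect(ended_hyps, step, m=3, d_end=-10.0):
    """Return True when the beam search at this step can stop.

    ended_hyps: hypotheses that already emitted <eos>, as dicts with
    'score' (log-probability) and 'yseq' (output token sequence).
    If, for each of the last m hypothesis lengths, the best hypothesis
    of that length is more than |d_end| below the overall best, no
    longer hypothesis is likely to win, so decoding can end.
    """
    if not ended_hyps:
        return False
    best = max(h['score'] for h in ended_hyps)
    count = 0
    for length in range(step - m + 1, step + 1):
        scores = [h['score'] for h in ended_hyps if len(h['yseq']) == length]
        if scores and max(scores) - best < d_end:
            count += 1
    return count == m
```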

Manual setting (maxlenratio=0.8)

$ grep Avg exp/tr_it_a03_pt_enddetect/decode_*_it_beam20_eacc.best_p0_len0.0-0.8/result.txt
exp/tr_it_a03_pt_enddetect/decode_dt_it_beam20_eacc.best_p0_len0.0-0.8/result.txt:| Sum/Avg               | 1080   78951 | 84.2    7.3    8.5    3.7   19.4   99.1 |
exp/tr_it_a03_pt_enddetect/decode_et_it_beam20_eacc.best_p0_len0.0-0.8/result.txt:| Sum/Avg               | 1050   77586 | 84.2    7.1    8.7    3.5   19.3   98.9 |

Automatic with end detection (maxlenratio=0.0)

$ grep Avg exp/tr_it_a03_pt_enddetect/decode_*_it_beam20_eacc.best_p0_len0.0-0.0/result.txt
exp/tr_it_a03_pt_enddetect/decode_dt_it_beam20_eacc.best_p0_len0.0-0.0/result.txt:| Sum/Avg               | 1080   78951 | 84.3    7.3    8.5    3.8   19.5   99.1 |
exp/tr_it_a03_pt_enddetect/decode_et_it_beam20_eacc.best_p0_len0.0-0.0/result.txt:| Sum/Avg               | 1050   77586 | 84.2    7.1    8.7    3.5   19.3   98.9 |

sw005320 commented on May 13, 2024

@ShigekiKarita I'm thinking of implementing the LM integration. The plan is to train an LSTMLM by modifying chainer's existing ptb example (https://github.com/chainer/chainer/blob/master/examples/ptb/train_ptb.py), and then integrate the LSTMLM with our main E2E model. Can I ask you to make a pytorch version of the training part later? Once you have the LSTMLM training part, I can implement the pytorch integration part. If you agree, I'll start the chainer-based implementation. If you instead think we should implement the LSTMLM training part in a more seamless way shared between pytorch and chainer, rather than separately as above, I'm happy to do so and would like to discuss it further with you.

ShigekiKarita commented on May 13, 2024

@sw005320 That sounds nice. I like the separate approach, because I'll be a little away from here for a few weeks around Jan 1st, but I will keep watching and discussing with you.

You can find the PTB example in pytorch here: https://github.com/pytorch/examples/tree/master/word_language_model

sw005320 commented on May 13, 2024

Which would be easier for you to port to the pytorch backend for LSTMLM training?
(chainer trainer based) https://github.com/chainer/chainer/blob/master/examples/ptb/train_ptb.py
or
(manual training loop) https://github.com/chainer/chainer/blob/master/examples/ptb/train_ptb_custom_loop.py

ShigekiKarita commented on May 13, 2024

I prefer the manual training loop, because the trainer performs device operations internally, unlike e2e_asr_train.py (where model.__call__ does them instead of the trainer).
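
For concreteness, a minimal sketch of what a manual-loop pytorch LSTMLM trainer might look like (an assumption about the eventual shape, not committed code; the class and function names are made up):

```python
import torch
import torch.nn as nn

class LSTMLM(nn.Module):
    """Word-level LSTM language model (hypothetical minimal version)."""

    def __init__(self, vocab_size, unit=300):
        super(LSTMLM, self).__init__()
        self.embed = nn.Embedding(vocab_size, unit)
        self.lstm = nn.LSTM(unit, unit, batch_first=True)
        self.out = nn.Linear(unit, vocab_size)

    def forward(self, x, state=None):
        h, state = self.lstm(self.embed(x), state)
        return self.out(h), state

def train_epoch(model, batches, optimizer):
    """Manual loop: batches yield (input, target) LongTensors of shape (B, T)."""
    criterion = nn.CrossEntropyLoss()
    for x, t in batches:
        logits, _ = model(x)
        loss = criterion(logits.view(-1, logits.size(-1)), t.view(-1))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```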

sw005320 commented on May 13, 2024

Thanks.
That matches my expectation.
I'll work on it.

sw005320 commented on May 13, 2024

@ShigekiKarita, @takaaki-hori and I discussed the possibility of implementing attention/CTC joint decoding, but it seems that warp_ctc does not provide a sufficient interface to compute CTC scores efficiently during decoding. @takaaki-hori will explain in a bit more detail, but we may consider implementing rescoring rather than joint decoding.
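
To illustrate the distinction: rescoring only needs the CTC log-likelihood of each complete hypothesis, which is the standard CTC forward algorithm and needs nothing special from warp_ctc, whereas joint decoding needs incremental prefix scores for every partial hypothesis inside the beam search. A self-contained sketch of the rescoring computation (illustrative, not espnet code):

```python
import numpy as np

def ctc_log_likelihood(log_probs, labels, blank=0):
    """CTC forward algorithm: log P(labels | input).

    log_probs: (T, V) frame-level log-softmax outputs.
    labels: hypothesis label ids (no blanks).
    """
    # Interleave blanks: blank, l1, blank, l2, ..., blank
    ext = [blank]
    for l in labels:
        ext += [l, blank]
    T, S = log_probs.shape[0], len(ext)
    alpha = np.full((T, S), -np.inf)
    alpha[0, 0] = log_probs[0, ext[0]]
    if S > 1:
        alpha[0, 1] = log_probs[0, ext[1]]
    for t in range(1, T):
        for s in range(S):
            cands = [alpha[t - 1, s]]                 # stay on same state
            if s > 0:
                cands.append(alpha[t - 1, s - 1])     # advance one state
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                cands.append(alpha[t - 1, s - 2])     # skip over a blank
            alpha[t, s] = np.logaddexp.reduce(cands) + log_probs[t, ext[s]]
    # Sum the two valid final states (ending in last label or in blank).
    return np.logaddexp(alpha[-1, -1], alpha[-1, -2]) if S > 1 else alpha[-1, -1]
```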

takaaki-hori commented on May 13, 2024

@sw005320, @ShigekiKarita, I added attention/CTC joint decoding and tested it with Voxforge and WSJ.
I got some CER reduction (14.7 -> 12.5 on Voxforge and 5.9 -> 5.5 on WSJ), using the decoding options "--minlenratio 0.0 --maxlenratio 0.0 --ctc-weight 0.3".
Can you take a look at the code and try it with other tasks? To test it, first switch to the "joint-decoding" branch and add the "--ctc-weight" option in run.sh, as in egs/wsj/asr1/run.sh.

sw005320 commented on May 13, 2024

Great, Hori-san. I'll review it. BTW, I'm also about to finish the LM integration and am preparing to commit it (CER 5.9 -> 5.3, WER 18.0 -> 14.7 in the WSJ task).

kan-bayashi commented on May 13, 2024

@sw005320 Great result! I will do it.

sw005320 commented on May 13, 2024

I just added the fisher_swbd recipe; the results will be added later. Also, I finished LibriSpeech experiments with pytorch, and we got 7.7% WER for the clean condition, which is not bad. I'll work on making a language model training script for pytorch; then we'll get some more improvements in the LibriSpeech and fisher_swbd recipes, as in the WSJ case.

sw005320 commented on May 13, 2024

It seems that #85 solves randomness issues in the pytorch backend.

kan-bayashi commented on May 13, 2024

Updated CSJ recipe results (#91).

All rows are character-level Sum/Avg from exp/train_nodup_vggblstmp_e6_subsample1_2_2_1_1_unit320_proj320_d1_unit300_location_aconvc10_aconvf100_mtlalpha0.5_adadelta_bs30_mli800_mlo150.

# Deep VGGBLSTMP (elayers=6) with chainer backend (decode beam20_eacc.best_p0.1_len0.1-0.5)

| Set | # Snt | # Wrd | Corr | Sub | Del | Ins | Err | S.Err |
|---|---|---|---|---|---|---|---|---|
| eval1 | 1272 | 43897 | 91.4 | 6.4 | 2.3 | 1.6 | 10.2 | 67.6 |
| eval2 | 1292 | 43623 | 93.7 | 5.1 | 1.3 | 1.2 | 7.5 | 65.2 |
| eval3 | 1385 | 28225 | 93.6 | 5.0 | 1.4 | 1.6 | 8.0 | 47.9 |

# + CTC joint decoding (decode beam20_eacc.best_p0.0_len0.0-0.0_ctcw0.3_rnnlm0.0)

| Set | # Snt | # Wrd | Corr | Sub | Del | Ins | Err | S.Err |
|---|---|---|---|---|---|---|---|---|
| eval1 | 1272 | 43897 | 91.6 | 6.0 | 2.3 | 1.4 | 9.7 | 66.5 |
| eval2 | 1292 | 43623 | 94.1 | 4.6 | 1.3 | 1.0 | 6.9 | 64.5 |
| eval3 | 1385 | 28225 | 93.9 | 4.7 | 1.4 | 1.4 | 7.5 | 47.7 |

# ++ LM rescoring (decode beam20_eacc.best_p0.0_len0.0-0.0_ctcw0.3_rnnlm0.3)

| Set | # Snt | # Wrd | Corr | Sub | Del | Ins | Err | S.Err |
|---|---|---|---|---|---|---|---|---|
| eval1 | 1272 | 43897 | 92.5 | 5.3 | 2.2 | 1.3 | 8.8 | 63.4 |
| eval2 | 1292 | 43623 | 94.7 | 4.1 | 1.2 | 0.9 | 6.2 | 60.7 |
| eval3 | 1385 | 28225 | 94.3 | 4.2 | 1.5 | 1.2 | 7.0 | 45.2 |

CER summary (VGGBLSTMP -> + CTC joint decoding -> ++ LM rescoring):
eval1: 10.2 -> 9.7 -> 8.8
eval2: 7.5 -> 6.9 -> 6.2
eval3: 8.0 -> 7.5 -> 7.0

sw005320 commented on May 13, 2024

I think we have almost finished all our targets except for VGG (@ShigekiKarita, is this still difficult?). After the refactoring in #102, we can move on to the next development plan.

ShigekiKarita commented on May 13, 2024

Yes, it is still difficult. I hope that someone else also takes a look at the connection part between VGG and BLSTMP.

sw005320 commented on May 13, 2024

@ShigekiKarita, did you verify that the problem is the connection part? You mean VGG and BLSTMP themselves are correct?

ShigekiKarita commented on May 13, 2024

Yes. For BLSTMP, the experimental results are equal, as seen in #9. For VGG, the chainer/pytorch implementations show equal activations and gradients when all parameters are initialized with constants w[:]=1 and b[:]=0 (i.e., without random values); see #47 and https://github.com/ShigekiKarita/espnet/blob/2a44f292d44c9e23c6ac8e24ea7eb9e2c64b0cb8/test/test_vgg.py

Hence, the remaining suspect is the connection between VGG and BLSTMP.
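
As an aside, a minimal sketch of this kind of cross-framework parity check (hypothetical, in the spirit of test_vgg.py above): with constant parameters, any output difference must come from the layer logic rather than from random initialization.

```python
import numpy as np
import torch

def constant_init(module):
    # Deterministic parameters (w[:] = 1, b[:] = 0) so the pytorch and
    # chainer implementations can be compared without matching RNG seeds.
    for name, p in module.named_parameters():
        with torch.no_grad():
            p.fill_(1.0 if 'weight' in name else 0.0)

conv = torch.nn.Conv2d(in_channels=1, out_channels=64,
                       kernel_size=3, stride=1, padding=1)
constant_init(conv)

x = torch.from_numpy(np.ones((1, 1, 8, 40), dtype=np.float32))
y = conv(x)
# The same statistics from the chainer counterpart should match exactly.
print(tuple(y.shape), float(y.sum()))
```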

kan-bayashi commented on May 13, 2024

To use the current version, we have to update warp-ctc (#105).
Please run the following commands to update:

cd tools
rm -r warp-ctc    # remove the old checkout/build
make warp-ctc     # fetch and rebuild the updated version

geniki commented on May 13, 2024

Is the pytorch version of the LM integration a priority for this stable version?

Many thanks for the great repo.

sw005320 commented on May 13, 2024

@kan-bayashi is considering it. It is not a super-high priority for now, but we're working on it anyway.

sw005320 commented on May 13, 2024

@geniki We have finished the pytorch LM integration (#114).

We have finished most of the action items, so I'll close this issue.
