Comments (26)
Guys, by combining the LSTMLM and joint attention/CTC decoding, we finally get CER 5.3 -> 3.8 and WER 14.7 -> 9.3 on the WSJ task! The nice thing is that we don't have to set the min/max length ratios or the penalty (all set to 0.0), although we may still need to tune the CTC and LM weights (0.3 and 1.0, respectively; see #76).
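For reference, the decoding score behind these numbers is a log-linear combination of the attention, CTC, and LM log-probabilities of each partial hypothesis. A minimal sketch (the function name and arguments are illustrative, not the actual espnet API), using the weights quoted above:

```python
def combined_score(att_logp, ctc_logp, lm_logp,
                   ctc_weight=0.3, lm_weight=1.0):
    """Log-linear combination of per-hypothesis log-probabilities
    used to rank beam-search hypotheses (illustrative sketch)."""
    return ((1.0 - ctc_weight) * att_logp
            + ctc_weight * ctc_logp
            + lm_weight * lm_logp)
```

With ctc_weight=0.0 this reduces to pure attention decoding, and lm_weight=0.0 disables the LM fusion.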
@kan-bayashi, can you play with the LSTMLM and joint decoding on the TEDLIUM recipe? You can train the LSTMLM on the text data by referring to tools/kaldi/egs/tedlium/s5_r2/local/ted_train_lm.sh
and simply using
gunzip -c db/TEDLIUM_release2/LM/*.en.gz | sed 's/ <\/s>//g' | local/join_suffix.py | gzip -c > ${dir}/data/text/train.txt.gz
from espnet.
The TEDLIUM results with CTC joint decoding and LM rescoring are as follows:
exp/train_trim_vggblstmp_e6_subsample1_2_2_1_1_unit320_proj320_d1_unit300_location_aconvc10_aconvf100_mtlalpha0.5_adadelta_bs30_mli800_mlo150/decode_dev_beam20_eacc.best_p0.1_len0.0-0.0_ctcw0.3_rnnlm1.0/result.txt:| Sum/Avg | 507 95429 | 91.8 4.2 4.0 2.7 10.8 89.3 |
exp/train_trim_vggblstmp_e6_subsample1_2_2_1_1_unit320_proj320_d1_unit300_location_aconvc10_aconvf100_mtlalpha0.5_adadelta_bs30_mli800_mlo150/decode_test_beam20_eacc.best_p0.1_len0.0-0.0_ctcw0.3_rnnlm1.0/result.txt:| Sum/Avg | 1155 145066 | 92.2 3.7 4.1 2.4 10.1 85.3 |
exp/train_trim_vggblstmp_e6_subsample1_2_2_1_1_unit320_proj320_d1_unit300_location_aconvc10_aconvf100_mtlalpha0.5_adadelta_bs30_mli800_mlo150/decode_dev_beam20_eacc.best_p0.1_len0.0-0.0_ctcw0.3_rnnlm1.0/result.wrd.txt:| Sum/Avg | 507 17783 | 83.2 13.7 3.1 3.0 19.8 89.3 |
exp/train_trim_vggblstmp_e6_subsample1_2_2_1_1_unit320_proj320_d1_unit300_location_aconvc10_aconvf100_mtlalpha0.5_adadelta_bs30_mli800_mlo150/decode_test_beam20_eacc.best_p0.1_len0.0-0.0_ctcw0.3_rnnlm1.0/result.wrd.txt:| Sum/Avg | 1155 27500 | 84.0 12.3 3.7 2.6 18.6 85.3 |
For the dev set: CER 12.6 -> 10.8, WER 24.8 -> 19.8
For the test set: CER 11.9 -> 10.1, WER 23.4 -> 18.6
from espnet.
At least, we should add docstrings for src/nets.
from espnet.
I agree. Type (class) and shape are essential information for everyone.
from espnet.
Frankly, I don't have much experience with this, so if @kan-bayashi initiates it, I'll follow and add/modify/enhance the documentation.
Also, we should set up a webpage somewhere.
Do you have any ideas (e.g., just using GitHub's website hosting service)?
from espnet.
The implementation of the end detection is finished (#46).
The performance decreased only very slightly, which is a good trade-off considering that we no longer have to tune the maxlenratio parameter.
We could make this (maxlenratio=0.0, which enables the end detection) the default in the future.
Manual setting (maxlenratio=0.8)
$ grep Avg exp/tr_it_a03_pt_enddetect/decode_*_it_beam20_eacc.best_p0_len0.0-0.8/result.txt
exp/tr_it_a03_pt_enddetect/decode_dt_it_beam20_eacc.best_p0_len0.0-0.8/result.txt:| Sum/Avg | 1080 78951 | 84.2 7.3 8.5 3.7 19.4 99.1 |
exp/tr_it_a03_pt_enddetect/decode_et_it_beam20_eacc.best_p0_len0.0-0.8/result.txt:| Sum/Avg | 1050 77586 | 84.2 7.1 8.7 3.5 19.3 98.9 |
Automatic with end detection (maxlenratio=0.0)
$ grep Avg exp/tr_it_a03_pt_enddetect/decode_*_it_beam20_eacc.best_p0_len0.0-0.0/result.txt
exp/tr_it_a03_pt_enddetect/decode_dt_it_beam20_eacc.best_p0_len0.0-0.0/result.txt:| Sum/Avg | 1080 78951 | 84.3 7.3 8.5 3.8 19.5 99.1 |
exp/tr_it_a03_pt_enddetect/decode_et_it_beam20_eacc.best_p0_len0.0-0.0/result.txt:| Sum/Avg | 1050 77586 | 84.2 7.1 8.7 3.5 19.3 98.9 |
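The end-detection heuristic above can be sketched as follows — a simplified illustration of the idea, not the exact espnet implementation: search stops once, for several recent output lengths, the best hypothesis ending at that length scores far below the best ended hypothesis overall.

```python
def end_detect(ended_hyps, step, m=3, d_end=-10.0):
    """Return True when beam search can stop: for each of the last m
    output lengths, the best hypothesis that ended at that length
    scores more than |d_end| below the best ended hypothesis overall.
    ended_hyps: list of dicts with 'score' and 'length' keys.
    (Simplified sketch; parameter names are illustrative.)"""
    if not ended_hyps:
        return False
    best = max(h["score"] for h in ended_hyps)
    worse_steps = 0
    for length in range(step - m + 1, step + 1):
        scores = [h["score"] for h in ended_hyps if h["length"] == length]
        if scores and max(scores) < best + d_end:
            worse_steps += 1
    return worse_steps == m
```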
from espnet.
@ShigekiKarita I'm thinking of implementing the LM integration. This would be done by modifying chainer's existing ptb recipe to train an LSTMLM (https://github.com/chainer/chainer/blob/master/examples/ptb/train_ptb.py), and then integrating the LSTMLM with our main E2E model. Can I ask you to make a pytorch version of the training part later? Once you make the LSTMLM training part, I can implement the pytorch integration part. If you agree, I'll start the chainer-based implementation. If you think we should instead implement the LSTMLM training part in a more seamless way that covers both pytorch and chainer, rather than the separate approach above, I'm happy to do so and would like to discuss it with you further.
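For context, LSTMLM training in either backend minimizes next-token cross-entropy, which is usually reported as perplexity on held-out text. A small stdlib-only helper to make the metric concrete:

```python
import math

def perplexity(token_log_probs):
    """Perplexity of a language model over a held-out sequence,
    computed from the per-token natural-log probabilities it assigned."""
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)
```

For example, a model that assigns each token probability 1/4 has perplexity 4.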
from espnet.
@sw005320 Sounds good. I like the separate approach because I'll be away for a few weeks around Jan 1st, but I will keep watching and discussing with you.
You can find the PTB example in pytorch here: https://github.com/pytorch/examples/tree/master/word_language_model
from espnet.
Which is easier for you to port to the pytorch-backend LSTMLM training?
(chainer trainer based) https://github.com/chainer/chainer/blob/master/examples/ptb/train_ptb.py
or
(manual training loop) https://github.com/chainer/chainer/blob/master/examples/ptb/train_ptb_custom_loop.py
from espnet.
I prefer the manual training loop, because the trainer performs device operations internally, unlike e2e_asr_train.py (where model.__call__ does them instead of the trainer).
from espnet.
Thanks.
This is my expectation.
I'll work on it.
from espnet.
@ShigekiKarita, @takaaki-hori and I discussed the possibility of implementing attention/CTC joint decoding, but it seems that warp_ctc does not provide enough of an interface to compute CTC scores efficiently during decoding. @takaaki-hori will explain in a bit more detail, but we may consider implementing rescoring rather than joint decoding.
from espnet.
@sw005320, @ShigekiKarita, I added attention/CTC joint decoding and tested it with Voxforge and WSJ.
I got some CER reduction (14.7 -> 12.5 on Voxforge and 5.9 -> 5.5 on WSJ), using the decoding options "--minlenratio 0.0 --maxlenratio 0.0 --ctc-weight 0.3".
Can you take a look at the code and try it with other tasks? To test it, first check out the "joint-decoding" branch, then add the "--ctc-weight" option in run.sh as in egs/wsj/asr1/run.sh.
from espnet.
Great, Hori-san. I'll review it. BTW, I'm also about to finish the LM integration and am preparing to commit it (CER 5.9 -> 5.3, WER 18.0 -> 14.7 on the WSJ task).
from espnet.
@sw005320 Great result! I will do it.
from espnet.
I just added the fisher_swbd recipe; the results will be added later. I also finished Librispeech experiments with pytorch, and we got 7.7% WER for the clean condition, which is not bad. I'll work on a language model training script for pytorch; then we should get some further improvements in the Librispeech and fisher_swbd recipes, as in the WSJ case.
from espnet.
It seems that #85 solves randomness issues in the pytorch backend.
from espnet.
Updated CSJ recipe results (#91).
# Deep VGGBLSTMP (elayers=6) with chainer backend
exp/train_nodup_vggblstmp_e6_subsample1_2_2_1_1_unit320_proj320_d1_unit300_location_aconvc10_aconvf100_mtlalpha0.5_adadelta_bs30_mli800_mlo150/decode_eval1_beam20_eacc.best_p0.1_len0.1-0.5/result.txt:| SPKR | # Snt # Wrd | Corr Sub Del Ins Err S.Err |
exp/train_nodup_vggblstmp_e6_subsample1_2_2_1_1_unit320_proj320_d1_unit300_location_aconvc10_aconvf100_mtlalpha0.5_adadelta_bs30_mli800_mlo150/decode_eval1_beam20_eacc.best_p0.1_len0.1-0.5/result.txt:| Sum/Avg | 1272 43897 | 91.4 6.4 2.3 1.6 10.2 67.6 |
exp/train_nodup_vggblstmp_e6_subsample1_2_2_1_1_unit320_proj320_d1_unit300_location_aconvc10_aconvf100_mtlalpha0.5_adadelta_bs30_mli800_mlo150/decode_eval2_beam20_eacc.best_p0.1_len0.1-0.5/result.txt:| SPKR | # Snt # Wrd | Corr Sub Del Ins Err S.Err |
exp/train_nodup_vggblstmp_e6_subsample1_2_2_1_1_unit320_proj320_d1_unit300_location_aconvc10_aconvf100_mtlalpha0.5_adadelta_bs30_mli800_mlo150/decode_eval2_beam20_eacc.best_p0.1_len0.1-0.5/result.txt:| Sum/Avg | 1292 43623 | 93.7 5.1 1.3 1.2 7.5 65.2 |
exp/train_nodup_vggblstmp_e6_subsample1_2_2_1_1_unit320_proj320_d1_unit300_location_aconvc10_aconvf100_mtlalpha0.5_adadelta_bs30_mli800_mlo150/decode_eval3_beam20_eacc.best_p0.1_len0.1-0.5/result.txt:| SPKR | # Snt # Wrd | Corr Sub Del Ins Err S.Err |
exp/train_nodup_vggblstmp_e6_subsample1_2_2_1_1_unit320_proj320_d1_unit300_location_aconvc10_aconvf100_mtlalpha0.5_adadelta_bs30_mli800_mlo150/decode_eval3_beam20_eacc.best_p0.1_len0.1-0.5/result.txt:| Sum/Avg | 1385 28225 | 93.6 5.0 1.4 1.6 8.0 47.9 |
# Deep VGGBLSTMP (elayers=6) with chainer backend + CTC joint decoding
exp/train_nodup_vggblstmp_e6_subsample1_2_2_1_1_unit320_proj320_d1_unit300_location_aconvc10_aconvf100_mtlalpha0.5_adadelta_bs30_mli800_mlo150/decode_eval1_beam20_eacc.best_p0.0_len0.0-0.0_ctcw0.3_rnnlm0.0/result.txt:| SPKR | # Snt # Wrd | Corr Sub Del Ins Err S.Err |
exp/train_nodup_vggblstmp_e6_subsample1_2_2_1_1_unit320_proj320_d1_unit300_location_aconvc10_aconvf100_mtlalpha0.5_adadelta_bs30_mli800_mlo150/decode_eval1_beam20_eacc.best_p0.0_len0.0-0.0_ctcw0.3_rnnlm0.0/result.txt:| Sum/Avg | 1272 43897 | 91.6 6.0 2.3 1.4 9.7 66.5 |
exp/train_nodup_vggblstmp_e6_subsample1_2_2_1_1_unit320_proj320_d1_unit300_location_aconvc10_aconvf100_mtlalpha0.5_adadelta_bs30_mli800_mlo150/decode_eval2_beam20_eacc.best_p0.0_len0.0-0.0_ctcw0.3_rnnlm0.0/result.txt:| SPKR | # Snt # Wrd | Corr Sub Del Ins Err S.Err |
exp/train_nodup_vggblstmp_e6_subsample1_2_2_1_1_unit320_proj320_d1_unit300_location_aconvc10_aconvf100_mtlalpha0.5_adadelta_bs30_mli800_mlo150/decode_eval2_beam20_eacc.best_p0.0_len0.0-0.0_ctcw0.3_rnnlm0.0/result.txt:| Sum/Avg | 1292 43623 | 94.1 4.6 1.3 1.0 6.9 64.5 |
exp/train_nodup_vggblstmp_e6_subsample1_2_2_1_1_unit320_proj320_d1_unit300_location_aconvc10_aconvf100_mtlalpha0.5_adadelta_bs30_mli800_mlo150/decode_eval3_beam20_eacc.best_p0.0_len0.0-0.0_ctcw0.3_rnnlm0.0/result.txt:| SPKR | # Snt # Wrd | Corr Sub Del Ins Err S.Err |
exp/train_nodup_vggblstmp_e6_subsample1_2_2_1_1_unit320_proj320_d1_unit300_location_aconvc10_aconvf100_mtlalpha0.5_adadelta_bs30_mli800_mlo150/decode_eval3_beam20_eacc.best_p0.0_len0.0-0.0_ctcw0.3_rnnlm0.0/result.txt:| Sum/Avg | 1385 28225 | 93.9 4.7 1.4 1.4 7.5 47.7 |
# Deep VGGBLSTMP (elayers=6) with chainer backend + CTC joint decoding + LM rescoring
exp/train_nodup_vggblstmp_e6_subsample1_2_2_1_1_unit320_proj320_d1_unit300_location_aconvc10_aconvf100_mtlalpha0.5_adadelta_bs30_mli800_mlo150/decode_eval1_beam20_eacc.best_p0.0_len0.0-0.0_ctcw0.3_rnnlm0.3/result.txt:| SPKR | # Snt # Wrd | Corr Sub Del Ins Err S.Err |
exp/train_nodup_vggblstmp_e6_subsample1_2_2_1_1_unit320_proj320_d1_unit300_location_aconvc10_aconvf100_mtlalpha0.5_adadelta_bs30_mli800_mlo150/decode_eval1_beam20_eacc.best_p0.0_len0.0-0.0_ctcw0.3_rnnlm0.3/result.txt:| Sum/Avg | 1272 43897 | 92.5 5.3 2.2 1.3 8.8 63.4 |
exp/train_nodup_vggblstmp_e6_subsample1_2_2_1_1_unit320_proj320_d1_unit300_location_aconvc10_aconvf100_mtlalpha0.5_adadelta_bs30_mli800_mlo150/decode_eval2_beam20_eacc.best_p0.0_len0.0-0.0_ctcw0.3_rnnlm0.3/result.txt:| SPKR | # Snt # Wrd | Corr Sub Del Ins Err S.Err |
exp/train_nodup_vggblstmp_e6_subsample1_2_2_1_1_unit320_proj320_d1_unit300_location_aconvc10_aconvf100_mtlalpha0.5_adadelta_bs30_mli800_mlo150/decode_eval2_beam20_eacc.best_p0.0_len0.0-0.0_ctcw0.3_rnnlm0.3/result.txt:| Sum/Avg | 1292 43623 | 94.7 4.1 1.2 0.9 6.2 60.7 |
exp/train_nodup_vggblstmp_e6_subsample1_2_2_1_1_unit320_proj320_d1_unit300_location_aconvc10_aconvf100_mtlalpha0.5_adadelta_bs30_mli800_mlo150/decode_eval3_beam20_eacc.best_p0.0_len0.0-0.0_ctcw0.3_rnnlm0.3/result.txt:| SPKR | # Snt # Wrd | Corr Sub Del Ins Err S.Err |
exp/train_nodup_vggblstmp_e6_subsample1_2_2_1_1_unit320_proj320_d1_unit300_location_aconvc10_aconvf100_mtlalpha0.5_adadelta_bs30_mli800_mlo150/decode_eval3_beam20_eacc.best_p0.0_len0.0-0.0_ctcw0.3_rnnlm0.3/result.txt:| Sum/Avg | 1385 28225 | 94.3 4.2 1.5 1.2 7.0 45.2 |
CER summary: vggblstmp -> + CTC joint decoding -> ++ LM rescoring
eval1: 10.2 -> 9.7 -> 8.8
eval2: 7.5 -> 6.9 -> 6.2
eval3: 8.0 -> 7.5 -> 7.0
from espnet.
I think we have almost finished all our targets except for VGG (@ShigekiKarita, is this still difficult?). After the refactoring in #102, we can move on to the next development plan.
from espnet.
Yes, it is still difficult. I hope someone else can also take a look at the connection part between VGG and BLSTMP.
from espnet.
@ShigekiKarita, did you confirm that the problem is in the connection part? You mean VGG and BLSTMP themselves are correct?
from espnet.
Yes. For BLSTMP, the experimental results are equal, as seen in #9. For VGG, the chainer/pytorch implementations show equal activations and gradients when all the parameters are initialized with constants w[:]=1 and b[:]=0 (without random values); see #47 and https://github.com/ShigekiKarita/espnet/blob/2a44f292d44c9e23c6ac8e24ea7eb9e2c64b0cb8/test/test_vgg.py
Hence, the last suspect is the connection between VGG and BLSTMP.
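The constant-initialization trick described above can be sketched like this (helper names are illustrative; numpy arrays stand in for the framework parameter tensors):

```python
import numpy as np

def constant_init(weights, biases):
    """Make two framework implementations comparable by removing
    randomness: set every weight to 1 and every bias to 0
    (the w[:] = 1, b[:] = 0 initialization described above)."""
    for w in weights:
        w[:] = 1.0
    for b in biases:
        b[:] = 0.0

def activations_match(y_a, y_b, atol=1e-6):
    """True when two implementations produce numerically equal
    activations (or gradients)."""
    return np.allclose(y_a, y_b, atol=atol)
```

Any mismatch under this deterministic initialization points to the layer wiring rather than to random seeds, which is how the VGG-to-BLSTMP connection was isolated as the remaining suspect.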
from espnet.
To use the current version, you have to update warp-ctc (#105).
Please run the following commands to update:
cd tools
rm -r warp-ctc
make warp-ctc
from espnet.
Is the pytorch version of the LM integration a priority for this stable version?
Many thanks for the great repo.
from espnet.
@kan-bayashi is considering it. It is not a super high priority for now, but we're working on it anyway.
from espnet.
@geniki We have finished the pytorch LM integration (#114).
We have finished most of the action items, so I'll close this issue.
from espnet.