
kaituoxu / listen-attend-spell

201 stars · 56 forks · 670 KB

A PyTorch implementation of Listen, Attend and Spell (LAS), an End-to-End ASR framework.

Shell 19.38% Python 79.84% Makefile 0.78%
asr end-to-end listen-attend-and-spell pytorch

listen-attend-spell's People

Contributors

kaituoxu


listen-attend-spell's Issues

Unable to extract aishell data

I downloaded the data_aishell.tgz file from the OpenSLR website. It is around 15 GB in size. When I later tried to extract files from it, it displayed an error that says: "Invalid Compressed Data: Unable to Inflate". I used WinZip for the extraction. Can someone please help me with this?
I can proceed only once I have extracted the data. Thanks in advance.
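The "Invalid Compressed Data: Unable to Inflate" message from WinZip usually indicates a truncated or corrupted download (data_aishell.tgz is roughly 15 GB, so interrupted transfers are common); re-downloading and verifying the published checksum is the usual remedy. As a minimal sketch, the archive can also be sanity-checked programmatically before extraction (`is_valid_tgz` is a hypothetical helper, not part of this repo):

```python
import tarfile

def is_valid_tgz(path):
    """Return True if the gzip'd tar opens and every member header parses.

    A truncated download typically raises EOFError or tarfile.ReadError
    while iterating -- the same failure WinZip reports as "Unable to Inflate".
    """
    try:
        with tarfile.open(path, "r:gz") as tar:
            for _ in tar:  # iterating forces each header (and the gzip stream) to be read
                pass
        return True
    except (tarfile.TarError, EOFError, OSError):
        return False
```

If the check fails, re-download the archive rather than retrying the extraction.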

Time Resolution Question.

Hi @kaituoxu,
thanks for your good project.
If I don't use bucketing on the input data, should I still use time resolution?
I think time resolution produces unbalanced splits when bucketing is not used.

Alternate dataset

Can we use the TIMIT dataset instead of the Aishell dataset? If so, what vocabulary dictionary should we use? I am having some problems with the Aishell dataset, which is why I want to switch to TIMIT. Moreover, an English dataset would help me understand the models better.

Python encoding problem

Thank you very much for sharing the code! I ran some experiments with it. Some files are not in Unix line-ending style and need to be converted with the `set ff=unix` command. You recommend running the code with Python 3, but text2token.py is written in Python 2 style, so the JSON file produced later by data2json.sh is empty. How should the code be modified? Thanks.
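If text2token.py uses Python 2 idioms, running it under Python 3 can indeed produce no output and leave data2json.sh with an empty JSON. A minimal Python 3 sketch of what such a character tokenizer does (a hypothetical rewrite for illustration, not the repo's actual text2token.py), with UTF-8 forced explicitly so CJK characters survive:

```python
import io
import sys

def text2token(line, skip_fields=1):
    """Split a transcript line into space-separated characters,
    keeping the leading utterance-ID field(s) intact."""
    parts = line.rstrip("\n").split()
    head, body = parts[:skip_fields], parts[skip_fields:]
    tokens = [ch for word in body for ch in word]
    return " ".join(head + tokens)

def main(stdin_buffer=None):
    # Python 3 decodes stdin with the locale encoding by default; wrapping the
    # raw buffer forces UTF-8 so CJK transcripts are not mangled or dropped.
    stream = io.TextIOWrapper(stdin_buffer or sys.stdin.buffer, encoding="utf-8")
    for line in stream:
        print(text2token(line))
```

Running the original script through `2to3` plus pinning I/O to UTF-8 is the usual minimal fix.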

can't run aishell_data_prep.sh on my own data, organised the same way as aishell

Hello again!

Even though I didn't succeed with the training and decoding stages on the original dataset, I decided to try run.sh on my own data (annotations in Russian), organised exactly the same way as the aishell data.

I created a dir rus_data with transcript and wav dirs inside. In the transcript dir I put a txt file in the format: filename-without-extension, space, transcription.
In the wav dir I created train, test and dev dirs, and put dirs with wavs inside them.

After that I changed the paths in run.sh that aishell_data_prep.sh requires.

But run.sh throws the error:

Stage 0: Data Preparation
Error: local/aishell_data_prep.sh requires two directory arguments

If I run aishell_data_prep.sh separately with 2 paths, I get the same error.

Do you have any idea why this happens? What should I change?

Mistakes in training

Hello!
This is my first time running LAS, and I want to learn from your script. I get the following errors during training with Python 3.6 and torch 1.7. Is it because one of the two versions is too new?

[screenshot of the training error]

Looking forward to your reply
Thank you!

Some error in stage=3

Thank you for sharing your code.
But I have some trouble at stage=3:
TypeError: lstm() received an invalid combination of arguments - got (Tensor, Tensor, tuple, list, bool, int, float, bool, int), but expected one of: (Tensor data, Tensor batch_sizes, tuple of Tensors hs, tuple of Tensors params, bool has_biases, int num_layers, float dropout, bool train, bool bidirectional)
Have you met this problem before? Thanks again.
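This TypeError on lstm() usually points to a PyTorch version mismatch: this repo targets an older PyTorch, and newer releases changed the internal call path that packed-sequence LSTMs go through. Using a PyTorch version matching the repo's requirements, or updating the encoder to the current packing API, typically resolves it. A minimal sketch of the current `pack_padded_sequence` API (toy sizes for illustration, not the repo's code; the repo's encoder uses LSTM(240, 256, num_layers=3)):

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

# Toy bidirectional LSTM encoder over variable-length sequences.
rnn = nn.LSTM(input_size=4, hidden_size=8, num_layers=1,
              batch_first=True, bidirectional=True)

x = torch.randn(3, 5, 4)           # (batch, max_time, features)
lengths = torch.tensor([5, 3, 2])  # sorted descending; kept on CPU

packed = pack_padded_sequence(x, lengths, batch_first=True)
out, _ = rnn(packed)
out, out_lens = pad_packed_sequence(out, batch_first=True)
# out: (3, 5, 16) -- hidden_size doubled because bidirectional=True
```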

train from previous checkpoint

Hi

I tried to train the model from a previous checkpoint.

For example, I trained the model during 100 epochs and got the final.pth.tar file.
I put the absolute path to it in run.sh in these lines:

...
# logging and visualize
checkpoint=0
continue_from="/home/karina/Listen-Attend-Spell/egs/aishell/exp/train_in240_hidden256_e3_lstm_drop0.2_dot_emb512_hidden512_d1_epoch100_norm5_bs64_mli800_mlo150_adam_lr1e-3_mmt0_l21e-5_delta/final.pth.tar"
print_freq=10
visdom=0
visdom_id="LAS Training"
...

but training exits with this log:

# train.py --train_json dump/train/deltatrue/data.json --valid_json dump/dev/deltatrue/data.json --dict data/lang_1char/train_chars.txt --einput 240 --ehidden 256 --elayer 3 --edropout 0.2 --ebidirectional 1 --etype lstm --atype dot --dembed 512 --dhidden 512 --dlayer 1 --epochs 10 --half_lr 1 --early_stop 0 --max_norm 5 --batch_size 64 --maxlen_in 800 --maxlen_out 150 --optimizer adam --lr 1e-3 --momentum 0 --l2 1e-5 --save_folder exp/train_in240_hidden256_e3_lstm_drop0.2_dot_emb512_hidden512_d1_epoch10_norm5_bs64_mli800_mlo150_adam_lr1e-3_mmt0_l21e-5_delta --checkpoint 1 --continue_from /home/karina/Listen-Attend-Spell/egs/aishell/exp/train_in240_hidden256_e3_lstm_drop0.2_dot_emb512_hidden512_d1_epoch100_norm5_bs64_mli800_mlo150_adam_lr1e-3_mmt0_l21e-5_delta/final.pth.tar --print_freq 10 --visdom 0 --visdom_id "LAS Training" 
# Started at Fri Sep 13 03:00:41 MSK 2019
#
Namespace(atype='dot', batch_size=64, checkpoint=1, continue_from='/home/karina/Listen-Attend-Spell/egs/aishell/exp/train_in240_hidden256_e3_lstm_drop0.2_dot_emb512_hidden512_d1_epoch100_norm5_bs64_mli800_mlo150_adam_lr1e-3_mmt0_l21e-5_delta/final.pth.tar', dembed=512, dhidden=512, dict='data/lang_1char/train_chars.txt', dlayer=1, early_stop=0, ebidirectional=1, edropout=0.2, ehidden=256, einput=240, elayer=3, epochs=10, etype='lstm', half_lr=1, l2=1e-05, lr=0.001, max_norm=5.0, maxlen_in=800, maxlen_out=150, model_path='final.pth.tar', momentum=0.0, num_workers=4, optimizer='adam', print_freq=10, save_folder='exp/train_in240_hidden256_e3_lstm_drop0.2_dot_emb512_hidden512_d1_epoch10_norm5_bs64_mli800_mlo150_adam_lr1e-3_mmt0_l21e-5_delta', train_json='dump/train/deltatrue/data.json', valid_json='dump/dev/deltatrue/data.json', visdom=0, visdom_id='LAS Training')
Seq2Seq(
  (encoder): Encoder(
    (rnn): LSTM(240, 256, num_layers=3, batch_first=True, dropout=0.2, bidirectional=True)
  )
  (decoder): Decoder(
    (embedding): Embedding(38, 512)
    (rnn): ModuleList(
      (0): LSTMCell(1024, 512)
    )
    (attention): DotProductAttention()
    (mlp): Sequential(
      (0): Linear(in_features=1024, out_features=512, bias=True)
      (1): Tanh()
      (2): Linear(in_features=512, out_features=38, bias=True)
    )
  )
)
Loading checkpoint model /home/karina/Listen-Attend-Spell/egs/aishell/exp/train_in240_hidden256_e3_lstm_drop0.2_dot_emb512_hidden512_d1_epoch100_norm5_bs64_mli800_mlo150_adam_lr1e-3_mmt0_l21e-5_delta/final.pth.tar
Traceback (most recent call last):
  File "/home/karina/Listen-Attend-Spell/egs/aishell/../../src/bin/train.py", line 146, in <module>
    main(args)
  File "/home/karina/Listen-Attend-Spell/egs/aishell/../../src/bin/train.py", line 139, in main
    solver = Solver(data, model, optimizier, args)
  File "/home/karina/Listen-Attend-Spell/src/solver/solver.py", line 43, in __init__
    self._reset()
  File "/home/karina/Listen-Attend-Spell/src/solver/solver.py", line 53, in _reset
    self.tr_loss[:self.start_epoch] = package['tr_loss'][:self.start_epoch]
RuntimeError: The expanded size of the tensor (10) must match the existing size (13) at non-singleton dimension 0.  Target sizes: [10].  Tensor sizes: [13]
# Accounting: time=4 threads=1
# Ended (code 1) at Fri Sep 13 03:00:45 MSK 2019, elapsed time 4 seconds

What object could cause this tensor size problem?
Am I using training from a checkpoint correctly?
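The size mismatch (10 vs 13) suggests the resumed run's --epochs is smaller than the number of epochs already recorded in the checkpoint: solver.py allocates tr_loss with the new --epochs and then copies start_epoch entries from the checkpoint into it, which cannot fit. Setting --epochs to a value larger than the epochs already trained should avoid it. A minimal sketch of the failing copy (hypothetical shapes, mirroring the traceback):

```python
import torch

start_epoch = 13                        # epochs recorded in the checkpoint
ckpt_tr_loss = torch.randn(start_epoch) # per-epoch training losses saved in the package

# Resuming with --epochs smaller than start_epoch makes this copy fail:
#   tr_loss = torch.zeros(10); tr_loss[:13] = ckpt_tr_loss[:13]  -> RuntimeError
# With a larger epoch budget the copy is well-defined:
epochs = 20
tr_loss = torch.zeros(epochs)
tr_loss[:start_epoch] = ckpt_tr_loss[:start_epoch]
```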

At which point in the model is the temporal dimension of the input features reduced?

In the original LAS paper, we can read:

In our model, we stack 3 pBLSTMs on top of the bottom BLSTM layer to reduce the time resolution 2³ = 8 times. This allows the attention model (see next section) to extract the relevant information from a smaller number of time steps.

My understanding is that this temporal squishing is performed by the encoder. However, when I pass a tensor of size [4, 963, 128], along with the length tensor, to the encoder (bs = 4, max_length = 963, num_features = 128), I get an output of size [4, 963, 1024]. 1024 is the size of the hidden layer, so this makes sense, but 963 is the original max_length. Is this supposed to happen? Should the length be smaller after passing through the encoder, or is this perfectly normal and something happens in the decoder?
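Judging by the model summary printed in the issue above, this repo's Encoder is a plain 3-layer BLSTM (LSTM(240, 256, num_layers=3, ...)), not the paper's pyramidal BLSTM, so the time axis is not reduced and an input of length 963 comes out with length 963; that behaviour appears expected here. The paper's 8x reduction comes from a reshape between pBLSTM layers that concatenates each pair of consecutive frames, halving time (and doubling features) at each of the three layers. A minimal sketch of one such reduction step (a hypothetical helper, not code from this repo):

```python
import torch

def pyramid_reshape(x):
    """Halve the time axis by concatenating consecutive frame pairs --
    the reduction step applied between pBLSTM layers in the LAS paper."""
    b, t, d = x.shape
    if t % 2:                # drop the odd trailing frame so pairs line up
        x = x[:, :t - 1, :]
    return x.reshape(b, t // 2, 2 * d)

x = torch.randn(4, 963, 128)
y = pyramid_reshape(x)       # (4, 481, 256): time halved, features doubled
```

Applying this before each of three stacked BLSTM layers yields the paper's 2³ = 8 time reduction.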

decoding error after successful aishell train

Hi! I managed to train LAS on aishell data without errors. This is the end of the log:

Epoch 20 | Iter 441 | Average Loss 0.406 | Current Loss 0.505424 | 64.8 ms/batch
Epoch 20 | Iter 451 | Average Loss 0.409 | Current Loss 0.383116 | 64.1 ms/batch
-------------------------------------------------------------------------------------
Valid Summary | End of Epoch 20 | Time 956.81s | Valid Loss 0.410
-------------------------------------------------------------------------------------
Learning rate adjusted to: 0.000000
Find better validated model, saving to exp/train_in240_hidden256_e3_lstm_drop0.2_dot_emb512_hidden512_d1_epoch20_norm5_bs32_mli800_mlo150_adam_lr1e-3_mmt0_l21e-5_delta/final.pth.tar
# Accounting: time=21312 threads=1
# Ended (code 0) at Fri Aug 30 17:15:39 MSK 2019, elapsed time 21312 seconds

but decoding stage gave an error:

Stage 4: Decoding
run.pl: job failed, log is in exp/train_in240_hidden256_e3_lstm_drop0.2_dot_emb512_hidden512_d1_epoch20_norm5_bs32_mli800_mlo150_adam_lr1e-3_mmt0_l21e-5_delta/decode_test_beam30_nbest1_ml100/decode.log
2019-08-30 17:15:39,608 (json2trn:24) INFO: reading exp/train_in240_hidden256_e3_lstm_drop0.2_dot_emb512_hidden512_d1_epoch20_norm5_bs32_mli800_mlo150_adam_lr1e-3_mmt0_l21e-5_delta/decode_test_beam30_nbest1_ml100/data.json
Traceback (most recent call last):
 File "/home/karina/Listen-Attend-Spell/egs/aishell/../../src/utils/json2trn.py", line 25, in <module>
   with open(args.json, 'r') as f:
FileNotFoundError: [Errno 2] No such file or directory: 'exp/train_in240_hidden256_e3_lstm_drop0.2_dot_emb512_hidden512_d1_epoch20_norm5_bs32_mli800_mlo150_adam_lr1e-3_mmt0_l21e-5_delta/decode_test_beam30_nbest1_ml100/data.json'
write a CER (or TER) result in exp/train_in240_hidden256_e3_lstm_drop0.2_dot_emb512_hidden512_d1_epoch20_norm5_bs32_mli800_mlo150_adam_lr1e-3_mmt0_l21e-5_delta/decode_test_beam30_nbest1_ml100/result.txt
|      SPKR        |         # Snt                   # Wrd         |      Corr              Sub              Del              Ins              Err            S.Err      |
|      Sum/Avg     |             0                       0         |       0.0              0.0              0.0              0.0              0.0              0.0      |

I don't understand why some files are missing from that directory. I thought everything run.pl needs would be generated there automatically.

can't run train and decode stages on aishell dataset

Hello! I tried to run run.sh on the aishell dataset to test your code, but I did not succeed. I ran into a problem at stage 3:

Stage 3: Network Training
run.pl: job failed, log is in exp/train_in240_hidden256_e3_lstm_drop0.2_dot_emb512_hidden512_d1_epoch20_norm5_bs32_mli800_mlo150_adam_lr1e-3_mmt0_l21e-5_delta/train.log
Stage 4: Decoding
run.pl: job failed, log is in exp/train_in240_hidden256_e3_lstm_drop0.2_dot_emb512_hidden512_d1_epoch20_norm5_bs32_mli800_mlo150_adam_lr1e-3_mmt0_l21e-5_delta/decode_test_beam30_nbest1_ml100/decode.log
2019-08-23 20:05:08,030 (json2trn:24) INFO: reading exp/train_in240_hidden256_e3_lstm_drop0.2_dot_emb512_hidden512_d1_epoch20_norm5_bs32_mli800_mlo150_adam_lr1e-3_mmt0_l21e-5_delta/decode_test_beam30_nbest1_ml100/data.json
Traceback (most recent call last):
  File "/home/karina/Listen-Attend-Spell/egs/aishell/../../src/utils/json2trn.py", line 25, in <module>
    with open(args.json, 'r') as f:
IOError: [Errno 2] No such file or directory: 'exp/train_in240_hidden256_e3_lstm_drop0.2_dot_emb512_hidden512_d1_epoch20_norm5_bs32_mli800_mlo150_adam_lr1e-3_mmt0_l21e-5_delta/decode_test_beam30_nbest1_ml100/data.json'
cp: cannot stat 'exp/train_in240_hidden256_e3_lstm_drop0.2_dot_emb512_hidden512_d1_epoch20_norm5_bs32_mli800_mlo150_adam_lr1e-3_mmt0_l21e-5_delta/decode_test_beam30_nbest1_ml100/ref.trn': No such file or directory
cp: cannot stat 'exp/train_in240_hidden256_e3_lstm_drop0.2_dot_emb512_hidden512_d1_epoch20_norm5_bs32_mli800_mlo150_adam_lr1e-3_mmt0_l21e-5_delta/decode_test_beam30_nbest1_ml100/hyp.trn': No such file or directory
Traceback (most recent call last):
  File "/home/karina/Listen-Attend-Spell/egs/aishell/../../src/utils/filt.py", line 21, in <module>
    with open(args.infile) as textfile:
IOError: [Errno 2] No such file or directory: 'exp/train_in240_hidden256_e3_lstm_drop0.2_dot_emb512_hidden512_d1_epoch20_norm5_bs32_mli800_mlo150_adam_lr1e-3_mmt0_l21e-5_delta/decode_test_beam30_nbest1_ml100/ref.trn.org'
Traceback (most recent call last):
  File "/home/karina/Listen-Attend-Spell/egs/aishell/../../src/utils/filt.py", line 21, in <module>
    with open(args.infile) as textfile:
IOError: [Errno 2] No such file or directory: 'exp/train_in240_hidden256_e3_lstm_drop0.2_dot_emb512_hidden512_d1_epoch20_norm5_bs32_mli800_mlo150_adam_lr1e-3_mmt0_l21e-5_delta/decode_test_beam30_nbest1_ml100/hyp.trn.org'
write a CER (or TER) result in exp/train_in240_hidden256_e3_lstm_drop0.2_dot_emb512_hidden512_d1_epoch20_norm5_bs32_mli800_mlo150_adam_lr1e-3_mmt0_l21e-5_delta/decode_test_beam30_nbest1_ml100/result.txt
|      SPKR        |         # Snt                   # Wrd         |      Corr              Sub              Del              Ins              Err            S.Err      |
|      Sum/Avg     |             0                       0         |       0.0              0.0              0.0              0.0              0.0              0.0      |

What should I do to fix it?
