

An MXNet implementation of Baidu's DeepSpeech architecture

License: Apache License 2.0

Topics: mxnet, warp-ctc, speech, baidu, deepspeech, arch, stt, speech-recognition, speech-to-text

deepspeech.mxnet's Introduction

deepSpeech.mxnet: Rich Speech Example

This example, based on Baidu's DeepSpeech2, helps you build Speech-To-Text (STT) models at scale using

  • CNNs, fully connected networks, (Bi-)RNNs, (Bi-)LSTMs, and (Bi-)GRUs for network layers,
  • batch normalization and dropout for training efficiency,
  • and Warp CTC for loss calculation.

To build your own STT model, all you need to do is edit a configuration file, not the actual code.


Motivation

This example is intended to guide people who want to build practical STT models with MXNet. With the rich functionality and convenience described above, you can build your own speech recognition models more easily than with earlier examples.


Environments

  • MXNet version: 0.9.5+
  • GPU memory size: 2.4GB+
  • Install tensorboard (for logging) and soundfile (for audio I/O):
pip install tensorboard
pip install soundfile
  • Warp CTC: Follow these instructions to install Baidu's Warp CTC.
  • We strongly recommend that you first test a model of small networks.
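
Before training, a quick sanity check that the Python packages are importable can save time (a minimal check, assuming the environment above is set up):

python -c "import mxnet; print(mxnet.__version__)"
python -c "import soundfile; print(soundfile.__version__)"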

How it works

Preparing data

Input data are described in a JSON file, Libri_sample.json, as follows.

{"duration": 2.9450625, "text": "and sharing her house which was near by", "key": "./Libri_sample/3830-12531-0030.wav"}
{"duration": 3.94, "text": "we were able to impart the information that we wanted", "key": "./Libri_sample/3830-12529-0005.wav"}

You can download the two wave files above from this link. Put them under /path/to/yourproject/Libri_sample/.
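
If you need to build such a manifest for your own recordings, a minimal sketch is shown below (make_manifest.py is a hypothetical helper; it assumes soundfile is installed and the wav paths are your own):

# make_manifest.py - hypothetical helper: writes one JSON object per line
import json
import soundfile as sf

wav_and_text = [
    ("./Libri_sample/3830-12531-0030.wav", "and sharing her house which was near by"),
    ("./Libri_sample/3830-12529-0005.wav", "we were able to impart the information that we wanted"),
]

with open("Libri_sample.json", "w") as out:
    for path, text in wav_and_text:
        audio, rate = sf.read(path)  # audio samples and sample rate
        entry = {"duration": len(audio) / float(rate),  # length in seconds
                 "text": text,
                 "key": path}
        out.write(json.dumps(entry) + "\n")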

Setting the configuration file

[Notice] The included configuration file "default.cfg" describes DeepSpeech2 with slight changes. You can test the original DeepSpeech2 ("deepspeech.cfg") by changing a few lines of the cfg file:


[common]
...
learning_rate = 0.0003
# constant learning rate annealing by factor
learning_rate_annealing = 1.1
optimizer = sgd
...
is_bi_graphemes = True
...
[arch]
...
num_rnn_layer = 7
num_hidden_rnn_list = [1760, 1760, 1760, 1760, 1760, 1760, 1760]
num_hidden_proj = 0
num_rear_fc_layers = 1
num_hidden_rear_fc_list = [1760]
act_type_rear_fc_list = ["relu"]
...
[train]
...
learning_rate = 0.0003
# constant learning rate annealing by factor
learning_rate_annealing = 1.1
optimizer = sgd
...

Run the example

Train

cd /path/to/your/project/
mkdir checkpoints
mkdir log
python main.py --configfile default.cfg

Checkpoints of the model will be saved at every n-th epoch.

Load

You can (re-)train saved models by loading checkpoints (starting from epoch 0). For this, you need to modify only two lines of the file "default.cfg".

...
[common]
# mode can be one of the following: train, predict, load
mode = load
...
model_file = 'file name of your saved model'
...

Predict

You can run prediction (or testing) on audio files by specifying the mode, model, and test data in the file "default.cfg".

...
[common]
# mode can be one of the following: train, predict, load
mode = predict
...
model_file = 'file name of your model to be tested'
...
[data]
...
test_json = 'a json file describing the test audio files'
...
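
The test JSON uses the same one-object-per-line format as the training manifest Libri_sample.json shown earlier, for example:

{"duration": 3.94, "text": "we were able to impart the information that we wanted", "key": "./Libri_sample/3830-12529-0005.wav"}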

Run the following line after making all the modifications explained above:
python main.py --configfile default.cfg

Train and test your own models

Train and test your own models by preparing two files.

  1. A new configuration file, e.g., custom.cfg, corresponding to the file 'default.cfg'. The new file should specify the items under the '[arch]' section of the original file.
  2. A new implementation file, e.g., arch_custom.py, corresponding to the file 'arch_deepspeech.py'. The new file should implement two functions, prepare_data() and arch(), which build the networks described in the new configuration file (a minimal skeleton follows this list).
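
A minimal skeleton of such a file is sketched below; the exact signatures and return contracts should be copied from arch_deepspeech.py, and the layer stack here is only an illustrative assumption:

# arch_custom.py - illustrative skeleton only; mirror the real signatures
# and return values found in arch_deepspeech.py before using it.
import mxnet as mx

def prepare_data(args):
    # Derive the input/label shapes and names from the parsed config (args).
    # See prepare_data() in arch_deepspeech.py for what main.py expects here.
    raise NotImplementedError("copy the contract from arch_deepspeech.py")

def arch(args):
    # Build an mxnet Symbol graph from the [arch] section of custom.cfg.
    data = mx.sym.Variable('data')
    net = mx.sym.FullyConnected(data=data, num_hidden=1024, name='fc0')
    net = mx.sym.Activation(data=net, act_type='relu', name='relu0')
    # ... stack CNN/RNN layers according to args, then attach the Warp CTC
    # loss the same way arch_deepspeech.py does.
    return net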

Run the following line after preparing the files.

python main.py --configfile custom.cfg --archfile arch_custom

Furthermore

You can prepare the full LibriSpeech dataset by following the instructions at https://github.com/baidu-research/ba-dls-deepspeech.
Replace Baidu's flac_to_wav.sh script with the flac_to_wav.sh in this repository to avoid a bug:

git clone https://github.com/baidu-research/ba-dls-deepspeech
cd ba-dls-deepspeech
./download.sh
cp -f /path/to/example/flac_to_wav.sh ./
./flac_to_wav.sh
python create_desc_json.py /path/to/ba-dls-deepspeech/LibriSpeech/train-clean-100 train_corpus.json
python create_desc_json.py /path/to/ba-dls-deepspeech/LibriSpeech/dev-clean validation_corpus.json
python create_desc_json.py /path/to/ba-dls-deepspeech/LibriSpeech/test-clean test_corpus.json
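
After generating the corpus files, point the [data] section of your configuration at them. A sketch, assuming the key names follow the test_json pattern shown earlier (check default.cfg for the exact names):

[data]
...
train_json = 'train_corpus.json'
val_json = 'validation_corpus.json'
test_json = 'test_corpus.json'
...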

deepspeech.mxnet's People

Contributors

minsoo-jade-kim, soonhwan-kwon


deepspeech.mxnet's Issues

KeyError

When training on LibriSpeech, I got the error below after validation:
Traceback (most recent call last):
File "main.py", line 306, in
do_training(args=args, module=module, data_train=data_train, data_val=data_val)
File "/export/fanlu/deepspeech.mxnet/train.py", line 141, in do_training
for nbatch, data_batch in enumerate(data_val):
File "/export/fanlu/deepspeech.mxnet/stt_io_bucketingiter.py", line 132, in next
save_feature_as_csvfile=self.save_feature_as_csvfile)
File "/export/fanlu/deepspeech.mxnet/stt_datagenerator.py", line 185, in prepare_minibatch
label = labelUtil.convert_bi_graphemes_to_num(label)
File "/export/fanlu/deepspeech.mxnet/label_util.py", line 89, in convert_bi_graphemes_to_num
label_num.append(int(self.byChar[char]))
KeyError: u'pz'

from tensorboard import SummaryWriter failed in train.py

from tensorboard import SummaryWriter
Traceback (most recent call last):
File "", line 1, in
ImportError: cannot import name SummaryWriter

tensorboard (1.6.0) failed to import SummaryWriter.
I used tensorboardX (1.1) instead and changed the import to "from tensorboardX import SummaryWriter".
It works well.
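
A version-agnostic import based on this workaround (a sketch; it assumes one of the two packages is installed):

try:
    from tensorboard import SummaryWriter   # older tensorboard releases
except ImportError:
    from tensorboardX import SummaryWriter  # tensorboardX fallback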

Potential memory leak?

As I'm training the model (1000 hours of data, batch size is 100 on 4 GPUs, code stays pretty much unchanged), I noticed the CPU memory usage gets higher and higher and eventually the system has to kill the process.

So I'm wondering whether others have had this problem too, and if so, could there be a potential memory leak in the code?

Does it have CPU support, or only GPU?

As the Environments section shows a GPU memory requirement, and the .cfg file only has the option:
#ex: gpu0,gpu1,gpu2,gpu3
context = gpu0
I want to make sure whether the program supports GPU mode only.

Asking for advice on setting parameters

Hi,
I am trying to feed DeepSpeech with my own dataset, but I am not sure how to set certain parameters like width, height, channel, etc.
Could you help? Also, could you share your experience with setting the training parameters as well?
Thanks in advance!

Distributed training

When I use launcher.py to run distributed training,
I get the error below:
Traceback (most recent call last):
File "main.py", line 327, in
do_training(args=args, module=module, data_train=data_train, data_val=data_val)
File "/export/fanlu/deepspeech_zh_word_dist/train.py", line 166, in do_training
module.save_checkpoint(prefix=get_checkpoint_path(args), epoch=n_epoch, save_optimizer_states=save_optimizer_states)
File "/mxnet/python/mxnet/module/module.py", line 154, in save_checkpoint
self.save_optimizer_states(state_name)
File "/mxnet/python/mxnet/module/module.py", line 738, in save_optimizer_states
self._kvstore.save_optimizer_states(fname)
File "/mxnet/python/mxnet/kvstore.py", line 315, in save_optimizer_states
assert self._updater is not None, "Cannot save states for distributed training"

How much training time does it usually take?

Hi,

Thank you for developing this very nice benchmark for the speech recognition application. Could you please tell me how much time it usually takes for the training to converge? (I am training on a single machine with a 1080 Ti GPU.)

Currently I am on the 24k-th training iteration of the first epoch (which took around 3 days to reach), and I am still getting prediction outputs that are very different from the label.

[    INFO][2019/06/13 12:20:35.564] label: in mail their horses clad yet fleet and strong prauncing their riders bore the flower and choice of many provinces from bound to bound from arachosia from candaor east 
[    INFO][2019/06/13 12:20:35.564] pred : th th th th th th  th the the th th    th th  th the th the the  th th th th  e  , cer: 0.711538 (distance: 74/ label length: 104)
[    INFO][2019/06/13 12:20:35.564] Epoch[0] Batch[23999] SAVE CHECKPOINT

Thanks in advance.

GRU implementation

Hi,

I want to do some optimization for this model using mkldnn backend library.

But I find that GRU.py does not seem to implement the standard GRU. The difference is that, in this model, Batch Norm is applied to i2h after indata*weight+bias.

Could you kindly point me to the reference paper?

Thanks very much.

Can't reproduce CER

Hi

We trained this model using one NVIDIA P100 GPU, but we can't reproduce the result you provided.

Could you kindly provide the pretrained model?

Or is your Epoch 0 our Epoch 19?

Thanks very much!

This is our CER:
[ INFO][2018/03/06 16:52:02.162] Epoch[0] val cer=0.538507 (63447 / 137482)
[ INFO][2018/03/06 22:41:14.731] Epoch[1] val cer=0.388775 (83967 / 137375)
[ INFO][2018/03/07 04:30:03.758] Epoch[2] val cer=0.319315 (93501 / 137363)
[ INFO][2018/03/07 10:18:48.315] Epoch[3] val cer=0.279020 (99178 / 137560)
[ INFO][2018/03/07 16:08:08.181] Epoch[4] val cer=0.254146 (102584 / 137539)
[ INFO][2018/03/07 21:56:56.019] Epoch[5] val cer=0.239708 (104525 / 137480)
[ INFO][2018/03/08 03:45:09.380] Epoch[6] val cer=0.225611 (106401 / 137400)
[ INFO][2018/03/08 09:33:38.506] Epoch[7] val cer=0.210001 (108515 / 137361)
[ INFO][2018/03/08 15:22:11.360] Epoch[8] val cer=0.209912 (108664 / 137534)
[ INFO][2018/03/08 21:10:18.962] Epoch[9] val cer=0.195206 (110664 / 137506)
[ INFO][2018/03/09 02:58:42.151] Epoch[10] val cer=0.192939 (110899 / 137411)
[ INFO][2018/03/09 08:47:19.200] Epoch[11] val cer=0.188388 (111569 / 137466)
[ INFO][2018/03/09 14:35:52.631] Epoch[12] val cer=0.183532 (112266 / 137502)
[ INFO][2018/03/09 20:24:17.168] Epoch[13] val cer=0.187081 (111852 / 137593)
[ INFO][2018/03/10 02:12:39.140] Epoch[14] val cer=0.183611 (112180 / 137410)
[ INFO][2018/03/10 08:00:59.198] Epoch[15] val cer=0.180528 (112697 / 137524)
[ INFO][2018/03/10 13:49:28.977] Epoch[16] val cer=0.180026 (112771 / 137530)
[ INFO][2018/03/10 19:37:55.186] Epoch[17] val cer=0.182596 (112505 / 137637)
[ INFO][2018/03/11 01:25:56.847] Epoch[18] val cer=0.177797 (113048 / 137494)
[ INFO][2018/03/11 07:14:02.295] Epoch[19] val cer=0.178752 (112874 / 137442)
[ INFO][2018/03/11 13:02:28.010] Epoch[20] val cer=0.176924 (113233 / 137573)
[ INFO][2018/03/11 18:50:45.359] Epoch[21] val cer=0.172450 (113784 / 137495)
[ INFO][2018/03/12 00:39:22.560] Epoch[22] val cer=0.178593 (112996 / 137564)
[ INFO][2018/03/12 06:28:00.156] Epoch[23] val cer=0.172375 (113575 / 137230)
[ INFO][2018/03/12 12:16:44.326] Epoch[24] val cer=0.176732 (113010 / 137270)
[ INFO][2018/03/12 18:05:27.260] Epoch[25] val cer=0.173333 (113751 / 137602)
[ INFO][2018/03/12 23:54:03.362] Epoch[26] val cer=0.179838 (112865 / 137613)
[ INFO][2018/03/13 05:42:37.309] Epoch[27] val cer=0.173278 (113666 / 137490)

And this is yours:
Epoch[0] unfortunately powered off at 19 * 3000 batches (batch size is 12)
Epoch[1] 0.177740 (we restarted from the checkpoint at the 19*3000th batch)
Epoch[2] 0.144390
Epoch[3] 0.126324
Epoch[4] 0.121056
Epoch[5] 0.110635
Epoch[6] 0.102347
Epoch[7] 0.100333
Epoch[8] 0.098945
(test-clean dataset, wav files limited to shorter than 16 seconds).

How many epochs does DeepSpeech2 need to converge on LibriSpeech?

I read the issue regarding the performance of DeepSpeech2 and noticed that the CER result reported by @Soonhwan-Kwon is 0.15648 at epoch 3.

It seems really promising, so I'm trying to reproduce the result. But right now I'm at epoch 5 and my validation CER (val-clean and val-other) is still 0.3122... So I'm wondering whether I did anything wrong or whether that was the intended result.

Also, the test-other CER on LibriSpeech reported in the DeepSpeech2 paper was 0.1325. Have you ever come close to this number? And if so, how many epochs did it take to get there?

Thanks in advance!

src/storage/storage.cc:119: Check failed: e == cudaSuccess || e == cudaErrorCudartUnloading CUDA: unknown error

I ran the command below, and it throws the following error:
python main.py --configfile default.cfg

[ INFO][2018/05/18 17:42:58.046] load_optimizer_states = True
[ INFO][2018/05/18 17:42:58.046] is_start_from_batch = False
[ INFO][2018/05/18 17:42:58.046]
[ INFO][2018/05/18 17:42:58.047] [optimizer]
[ INFO][2018/05/18 17:42:58.047] optimizer = adam
[ INFO][2018/05/18 17:42:58.047] optimizer_params_dictionary = {"beta1":0.9,"beta2":0.999}
[ INFO][2018/05/18 17:42:58.047] clip_gradient = 0
[ INFO][2018/05/18 17:42:58.047] weight_decay = 0.
[ INFO][2018/05/18 17:42:58.047]
Traceback (most recent call last):
File "main.py", line 305, in
do_training(args=args, module=module, data_train=data_train, data_val=data_val)
File "/gruntdata/zhimo.bmz/deepspeech.mxnet/train.py", line 93, in do_training
for_training=True)
File "/gruntdata/zhimo.bmz/mxnet/python/mxnet/module/module.py", line 430, in bind
state_names=self._state_names)
File "/gruntdata/zhimo.bmz/mxnet/python/mxnet/module/executor_group.py", line 265, in init
self.bind_exec(data_shapes, label_shapes, shared_group)
File "/gruntdata/zhimo.bmz/mxnet/python/mxnet/module/executor_group.py", line 361, in bind_exec
shared_group))
File "/gruntdata/zhimo.bmz/mxnet/python/mxnet/module/executor_group.py", line 639, in _bind_ith_exec
shared_buffer=shared_data_arrays, **input_shapes)
File "/gruntdata/zhimo.bmz/mxnet/python/mxnet/symbol/symbol.py", line 1524, in simple_bind
raise RuntimeError(error_msg)
RuntimeError: simple_bind error. Arguments:
data: (2, 393, 161)
forward_l0_init_h: (2, 1760)
backward_l0_init_h: (2, 1760)
label: (2, 53)
[17:42:58] src/storage/storage.cc:119: Check failed: e == cudaSuccess || e == cudaErrorCudartUnloading CUDA: unknown error

Stack trace returned 10 entries:
[bt] (0) /gruntdata/zhimo.bmz/mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::StackTrace()+0x4a) [0x7fc1be6dce0a]
[bt] (1) /gruntdata/zhimo.bmz/mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x21) [0x7fc1be6dd411]
[bt] (2) /gruntdata/zhimo.bmz/mxnet/python/mxnet/../../lib/libmxnet.so(+0x35cf6f0) [0x7fc1c103f6f0]
[bt] (3) /gruntdata/zhimo.bmz/mxnet/python/mxnet/../../lib/libmxnet.so(+0x35d03d9) [0x7fc1c10403d9]
[bt] (4) /gruntdata/zhimo.bmz/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::StorageImpl::Alloc(mxnet::Storage::Handle*)+0x3f) [0x7fc1c10417bf]
[bt] (5) /gruntdata/zhimo.bmz/mxnet/python/mxnet/../../lib/libmxnet.so(+0x301a753) [0x7fc1c0a8a753]
[bt] (6) /gruntdata/zhimo.bmz/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::exec::InitZeros(mxnet::NDArrayStorageType, nnvm::TShape const&, mxnet::Context const&, int)+0x3d) [0x7fc1c0aa233d]
[bt] (7) /gruntdata/zhimo.bmz/mxnet/python/mxnet/../../lib/libmxnet.so(+0x3021211) [0x7fc1c0a91211]
[bt] (8) /gruntdata/zhimo.bmz/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::exec::GraphExecutor::InitArguments(...)+0xa0c) [0x7fc1c0a9567c]
[bt] (9) /gruntdata/zhimo.bmz/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::exec::GraphExecutor::Init(...)+0x842) [0x7fc1c0a9dd42]

How do I use dropout?

I found that there are several places where dropout can be applied. What settings have worked in your experience?
