

An MXNet implementation of Baidu's DeepSpeech architecture

License: Apache License 2.0

Topics: mxnet, warp-ctc, speech, baidu, deepspeech, arch, stt, speech-recognition, speech-to-text

deepspeech.mxnet's Introduction

deepSpeech.mxnet: Rich Speech Example

This example, based on Baidu's DeepSpeech2, helps you build Speech-To-Text (STT) models at scale using

  • CNNs, fully connected networks, (Bi-)RNNs, (Bi-)LSTMs, and (Bi-)GRUs for network layers,
  • batch normalization and dropout for training efficiency,
  • and Warp CTC for loss calculation.

To build your own STT model, all you need to do is edit a configuration file, not the actual code.


Motivation

This example is intended to guide people who want to build practical STT models with MXNet. With the rich functionality and convenience described above, you can build your own speech recognition models more easily than with earlier examples.


Environments

  • MXNet version: 0.9.5+
  • GPU memory size: 2.4GB+
  • Install tensorboard (for logging) and soundfile (for audio I/O):
pip install tensorboard
pip install soundfile
  • Warp CTC: Follow these instructions to install Baidu's Warp CTC.
  • We strongly recommend that you first test a model of small networks.
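
Before training, a quick sanity check that the Python packages are importable can save time (a minimal check, assuming the environment above is set up):

python -c "import mxnet; print(mxnet.__version__)"
python -c "import soundfile; print(soundfile.__version__)"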

How it works

Preparing data

Input data are described in a JSON file, Libri_sample.json, as follows.

{"duration": 2.9450625, "text": "and sharing her house which was near by", "key": "./Libri_sample/3830-12531-0030.wav"}
{"duration": 3.94, "text": "we were able to impart the information that we wanted", "key": "./Libri_sample/3830-12529-0005.wav"}

You can download the two wave files above from this link. Put them under /path/to/yourproject/Libri_sample/.
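
If you need to build such a manifest for your own recordings, a minimal sketch is shown below (make_manifest.py is a hypothetical helper; it assumes soundfile is installed and the wav paths are your own):

# make_manifest.py - hypothetical helper: writes one JSON object per line
import json
import soundfile as sf

wav_and_text = [
    ("./Libri_sample/3830-12531-0030.wav", "and sharing her house which was near by"),
    ("./Libri_sample/3830-12529-0005.wav", "we were able to impart the information that we wanted"),
]

with open("Libri_sample.json", "w") as out:
    for path, text in wav_and_text:
        audio, rate = sf.read(path)  # audio samples and sample rate
        entry = {"duration": len(audio) / float(rate),  # length in seconds
                 "text": text,
                 "key": path}
        out.write(json.dumps(entry) + "\n")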

Setting the configuration file

[Notice] The included configuration file "default.cfg" describes DeepSpeech2 with slight changes. You can test the original DeepSpeech2 ("deepspeech.cfg") by changing a few lines of the cfg file:


[common]
...
learning_rate = 0.0003
# constant learning rate annealing by factor
learning_rate_annealing = 1.1
optimizer = sgd
...
is_bi_graphemes = True
...
[arch]
...
num_rnn_layer = 7
num_hidden_rnn_list = [1760, 1760, 1760, 1760, 1760, 1760, 1760]
num_hidden_proj = 0
num_rear_fc_layers = 1
num_hidden_rear_fc_list = [1760]
act_type_rear_fc_list = ["relu"]
...
[train]
...
learning_rate = 0.0003
# constant learning rate annealing by factor
learning_rate_annealing = 1.1
optimizer = sgd
...

Run the example

Train

cd /path/to/your/project/
mkdir checkpoints
mkdir log
python main.py --configfile default.cfg

Checkpoints of the model will be saved at every n-th epoch.

Load

You can (re-)train saved models by loading checkpoints (starting from epoch 0). For this, you need to modify only two lines of the file "default.cfg".

...
[common]
# mode can be one of the following: train, predict, load
mode = load
...
model_file = 'file name of your saved model'
...

Predict

You can run prediction (or testing) on audio files by specifying the mode, model, and test data in the file "default.cfg".

...
[common]
# mode can be one of the following: train, predict, load
mode = predict
...
model_file = 'file name of your model to be tested'
...
[data]
...
test_json = 'a json file describing the test audio files'
...
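
The test JSON uses the same one-object-per-line format as the training manifest Libri_sample.json shown earlier, for example:

{"duration": 3.94, "text": "we were able to impart the information that we wanted", "key": "./Libri_sample/3830-12529-0005.wav"}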

Run the following line after making all the modifications explained above:
python main.py --configfile default.cfg

Train and test your own models

Train and test your own models by preparing two files.

  1. A new configuration file, e.g., custom.cfg, corresponding to the file 'default.cfg'. The new file should specify the items under the '[arch]' section of the original file.
  2. A new implementation file, e.g., arch_custom.py, corresponding to the file 'arch_deepspeech.py'. The new file should implement two functions, prepare_data() and arch(), which build the networks described in the new configuration file (a minimal skeleton follows this list).
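
A minimal skeleton of such a file is sketched below; the exact signatures and return contracts should be copied from arch_deepspeech.py, and the layer stack here is only an illustrative assumption:

# arch_custom.py - illustrative skeleton only; mirror the real signatures
# and return values found in arch_deepspeech.py before using it.
import mxnet as mx

def prepare_data(args):
    # Derive the input/label shapes and names from the parsed config (args).
    # See prepare_data() in arch_deepspeech.py for what main.py expects here.
    raise NotImplementedError("copy the contract from arch_deepspeech.py")

def arch(args):
    # Build an mxnet Symbol graph from the [arch] section of custom.cfg.
    data = mx.sym.Variable('data')
    net = mx.sym.FullyConnected(data=data, num_hidden=1024, name='fc0')
    net = mx.sym.Activation(data=net, act_type='relu', name='relu0')
    # ... stack CNN/RNN layers according to args, then attach the Warp CTC
    # loss the same way arch_deepspeech.py does.
    return net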

Run the following line after preparing the files.

python main.py --configfile custom.cfg --archfile arch_custom

Furthermore

You can prepare the full LibriSpeech dataset by following the instructions at https://github.com/baidu-research/ba-dls-deepspeech.
Replace Baidu's flac_to_wav.sh script with the flac_to_wav.sh in this repository to avoid a bug:

git clone https://github.com/baidu-research/ba-dls-deepspeech
cd ba-dls-deepspeech
./download.sh
cp -f /path/to/example/flac_to_wav.sh ./
./flac_to_wav.sh
python create_desc_json.py /path/to/ba-dls-deepspeech/LibriSpeech/train-clean-100 train_corpus.json
python create_desc_json.py /path/to/ba-dls-deepspeech/LibriSpeech/dev-clean validation_corpus.json
python create_desc_json.py /path/to/ba-dls-deepspeech/LibriSpeech/test-clean test_corpus.json
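
After generating the corpus files, point the [data] section of your configuration at them. A sketch, assuming the key names follow the test_json pattern shown earlier (check default.cfg for the exact names):

[data]
...
train_json = 'train_corpus.json'
val_json = 'validation_corpus.json'
test_json = 'test_corpus.json'
...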

deepspeech.mxnet's People

Contributors

minsoo-jade-kim, soonhwan-kwon


deepspeech.mxnet's Issues

KeyError

When training on LibriSpeech, I got the error below after validation:
Traceback (most recent call last):
File "main.py", line 306, in
do_training(args=args, module=module, data_train=data_train, data_val=data_val)
File "/export/fanlu/deepspeech.mxnet/train.py", line 141, in do_training
for nbatch, data_batch in enumerate(data_val):
File "/export/fanlu/deepspeech.mxnet/stt_io_bucketingiter.py", line 132, in next
save_feature_as_csvfile=self.save_feature_as_csvfile)
File "/export/fanlu/deepspeech.mxnet/stt_datagenerator.py", line 185, in prepare_minibatch
label = labelUtil.convert_bi_graphemes_to_num(label)
File "/export/fanlu/deepspeech.mxnet/label_util.py", line 89, in convert_bi_graphemes_to_num
label_num.append(int(self.byChar[char]))
KeyError: u'pz'

from tensorboard import SummaryWriter failed in train.py

from tensorboard import SummaryWriter
Traceback (most recent call last):
File "", line 1, in
ImportError: cannot import name SummaryWriter

tensorboard (1.6.0) failed to import SummaryWriter.
I used tensorboardX (1.1) instead and changed the import to "from tensorboardX import SummaryWriter".
It works well.
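
A version-agnostic import based on this workaround (a sketch; it assumes one of the two packages is installed):

try:
    from tensorboard import SummaryWriter   # older tensorboard releases
except ImportError:
    from tensorboardX import SummaryWriter  # tensorboardX fallback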

Potential memory leak?

As I'm training the model (1000 hours of data, batch size is 100 on 4 GPUs, code stays pretty much unchanged), I noticed the CPU memory usage gets higher and higher and eventually the system has to kill the process.

So I'm wondering whether others have had this problem too, and if so, could there be a potential memory leak in the code?

Does it have CPU support, or only GPU?

As the Environments section shows a GPU memory requirement, and the .cfg file only has the option:
#ex: gpu0,gpu1,gpu2,gpu3
context = gpu0
I want to make sure whether the program supports GPU mode only.

Asking for advice on setting parameters

Hi,
I am trying to feed DeepSpeech with my own dataset, but I am not sure how to set certain parameters like width, height, channel, etc.
Could you help? Also, could you share your experience with setting the training parameters as well?
Thanks in advance!

Distributed training

When I use launcher.py to run distributed training,
I get the error below:
Traceback (most recent call last):
File "main.py", line 327, in
do_training(args=args, module=module, data_train=data_train, data_val=data_val)
File "/export/fanlu/deepspeech_zh_word_dist/train.py", line 166, in do_training
module.save_checkpoint(prefix=get_checkpoint_path(args), epoch=n_epoch, save_optimizer_states=save_optimizer_states)
File "/mxnet/python/mxnet/module/module.py", line 154, in save_checkpoint
self.save_optimizer_states(state_name)
File "/mxnet/python/mxnet/module/module.py", line 738, in save_optimizer_states
self._kvstore.save_optimizer_states(fname)
File "/mxnet/python/mxnet/kvstore.py", line 315, in save_optimizer_states
assert self._updater is not None, "Cannot save states for distributed training"

How much training time does it usually take?

Hi,

Thank you for developing this very nice benchmark for the speech recognition application. Could you please tell me how much time it usually takes for the training to converge? (I am training on a single machine with a 1080 Ti GPU.)

Currently I am on the 24k-th training iteration of the first epoch (which took around 3 days to reach), and I am still getting prediction outputs that are very different from the label.

[    INFO][2019/06/13 12:20:35.564] label: in mail their horses clad yet fleet and strong prauncing their riders bore the flower and choice of many provinces from bound to bound from arachosia from candaor east 
[    INFO][2019/06/13 12:20:35.564] pred : th th th th th th  th the the th th    th th  th the th the the  th th th th  e  , cer: 0.711538 (distance: 74/ label length: 104)
[    INFO][2019/06/13 12:20:35.564] Epoch[0] Batch[23999] SAVE CHECKPOINT

Thanks in advance.

GRU implementation

Hi,

I want to do some optimization for this model using mkldnn backend library.

But I find that GRU.py does not seem to implement the standard GRU. The difference is that, in this model, Batch Norm is applied to i2h after indata*weight+bias.

Could you kindly point me to the reference paper?

Thanks very much.

Can't reproduce CER

Hi

We trained this model using one NVIDIA P100 GPU, but we can't reproduce the result you provided.

Could you kindly provide the pretrained model?

Or is your Epoch 0 our Epoch 19?

Thanks very much!

This is our CER:
[ INFO][2018/03/06 16:52:02.162] Epoch[0] val cer=0.538507 (63447 / 137482)
[ INFO][2018/03/06 22:41:14.731] Epoch[1] val cer=0.388775 (83967 / 137375)
[ INFO][2018/03/07 04:30:03.758] Epoch[2] val cer=0.319315 (93501 / 137363)
[ INFO][2018/03/07 10:18:48.315] Epoch[3] val cer=0.279020 (99178 / 137560)
[ INFO][2018/03/07 16:08:08.181] Epoch[4] val cer=0.254146 (102584 / 137539)
[ INFO][2018/03/07 21:56:56.019] Epoch[5] val cer=0.239708 (104525 / 137480)
[ INFO][2018/03/08 03:45:09.380] Epoch[6] val cer=0.225611 (106401 / 137400)
[ INFO][2018/03/08 09:33:38.506] Epoch[7] val cer=0.210001 (108515 / 137361)
[ INFO][2018/03/08 15:22:11.360] Epoch[8] val cer=0.209912 (108664 / 137534)
[ INFO][2018/03/08 21:10:18.962] Epoch[9] val cer=0.195206 (110664 / 137506)
[ INFO][2018/03/09 02:58:42.151] Epoch[10] val cer=0.192939 (110899 / 137411)
[ INFO][2018/03/09 08:47:19.200] Epoch[11] val cer=0.188388 (111569 / 137466)
[ INFO][2018/03/09 14:35:52.631] Epoch[12] val cer=0.183532 (112266 / 137502)
[ INFO][2018/03/09 20:24:17.168] Epoch[13] val cer=0.187081 (111852 / 137593)
[ INFO][2018/03/10 02:12:39.140] Epoch[14] val cer=0.183611 (112180 / 137410)
[ INFO][2018/03/10 08:00:59.198] Epoch[15] val cer=0.180528 (112697 / 137524)
[ INFO][2018/03/10 13:49:28.977] Epoch[16] val cer=0.180026 (112771 / 137530)
[ INFO][2018/03/10 19:37:55.186] Epoch[17] val cer=0.182596 (112505 / 137637)
[ INFO][2018/03/11 01:25:56.847] Epoch[18] val cer=0.177797 (113048 / 137494)
[ INFO][2018/03/11 07:14:02.295] Epoch[19] val cer=0.178752 (112874 / 137442)
[ INFO][2018/03/11 13:02:28.010] Epoch[20] val cer=0.176924 (113233 / 137573)
[ INFO][2018/03/11 18:50:45.359] Epoch[21] val cer=0.172450 (113784 / 137495)
[ INFO][2018/03/12 00:39:22.560] Epoch[22] val cer=0.178593 (112996 / 137564)
[ INFO][2018/03/12 06:28:00.156] Epoch[23] val cer=0.172375 (113575 / 137230)
[ INFO][2018/03/12 12:16:44.326] Epoch[24] val cer=0.176732 (113010 / 137270)
[ INFO][2018/03/12 18:05:27.260] Epoch[25] val cer=0.173333 (113751 / 137602)
[ INFO][2018/03/12 23:54:03.362] Epoch[26] val cer=0.179838 (112865 / 137613)
[ INFO][2018/03/13 05:42:37.309] Epoch[27] val cer=0.173278 (113666 / 137490)

And this is yours:
Epoch[0] unfortunately powered off at 19 * 3000 batches (batch size is 12)
Epoch[1] 0.177740 (we restarted from the checkpoint at the 19*3000th batch)
Epoch[2] 0.144390
Epoch[3] 0.126324
Epoch[4] 0.121056
Epoch[5] 0.110635
Epoch[6] 0.102347
Epoch[7] 0.100333
Epoch[8] 0.098945
(test-clean dataset, wav files limited to shorter than 16 seconds).

How many epochs does DeepSpeech2 need to converge on LibriSpeech?

I read the issue regarding the performance of DeepSpeech2 and noticed that the CER result reported by @Soonhwan-Kwon is 0.15648 at epoch 3.

It seems really promising, so I'm trying to reproduce the result. But right now I'm at epoch 5 and my validation CER (val-clean and val-other) is still 0.3122... So I'm wondering whether I did anything wrong or whether that was the intended result.

Also, the test-other CER on LibriSpeech reported in the DeepSpeech2 paper was 0.1325. Have you ever come close to this number? And if so, how many epochs did it take to get there?

Thanks in advance!

src/storage/storage.cc:119: Check failed: e == cudaSuccess || e == cudaErrorCudartUnloading CUDA: unknown error

I ran the command below, and it throws the following error:
python main.py --configfile default.cfg

[ INFO][2018/05/18 17:42:58.046] load_optimizer_states = True
[ INFO][2018/05/18 17:42:58.046] is_start_from_batch = False
[ INFO][2018/05/18 17:42:58.046]
[ INFO][2018/05/18 17:42:58.047] [optimizer]
[ INFO][2018/05/18 17:42:58.047] optimizer = adam
[ INFO][2018/05/18 17:42:58.047] optimizer_params_dictionary = {"beta1":0.9,"beta2":0.999}
[ INFO][2018/05/18 17:42:58.047] clip_gradient = 0
[ INFO][2018/05/18 17:42:58.047] weight_decay = 0.
[ INFO][2018/05/18 17:42:58.047]
Traceback (most recent call last):
File "main.py", line 305, in
do_training(args=args, module=module, data_train=data_train, data_val=data_val)
File "/gruntdata/zhimo.bmz/deepspeech.mxnet/train.py", line 93, in do_training
for_training=True)
File "/gruntdata/zhimo.bmz/mxnet/python/mxnet/module/module.py", line 430, in bind
state_names=self._state_names)
File "/gruntdata/zhimo.bmz/mxnet/python/mxnet/module/executor_group.py", line 265, in init
self.bind_exec(data_shapes, label_shapes, shared_group)
File "/gruntdata/zhimo.bmz/mxnet/python/mxnet/module/executor_group.py", line 361, in bind_exec
shared_group))
File "/gruntdata/zhimo.bmz/mxnet/python/mxnet/module/executor_group.py", line 639, in _bind_ith_exec
shared_buffer=shared_data_arrays, **input_shapes)
File "/gruntdata/zhimo.bmz/mxnet/python/mxnet/symbol/symbol.py", line 1524, in simple_bind
raise RuntimeError(error_msg)
RuntimeError: simple_bind error. Arguments:
data: (2, 393, 161)
forward_l0_init_h: (2, 1760)
backward_l0_init_h: (2, 1760)
label: (2, 53)
[17:42:58] src/storage/storage.cc:119: Check failed: e == cudaSuccess || e == cudaErrorCudartUnloading CUDA: unknown error

Stack trace returned 10 entries:
[bt] (0) /gruntdata/zhimo.bmz/mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::StackTrace()+0x4a) [0x7fc1be6dce0a]
[bt] (1) /gruntdata/zhimo.bmz/mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x21) [0x7fc1be6dd411]
[bt] (2) /gruntdata/zhimo.bmz/mxnet/python/mxnet/../../lib/libmxnet.so(+0x35cf6f0) [0x7fc1c103f6f0]
[bt] (3) /gruntdata/zhimo.bmz/mxnet/python/mxnet/../../lib/libmxnet.so(+0x35d03d9) [0x7fc1c10403d9]
[bt] (4) /gruntdata/zhimo.bmz/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::StorageImpl::Alloc(mxnet::Storage::Handle*)+0x3f) [0x7fc1c10417bf]
[bt] (5) /gruntdata/zhimo.bmz/mxnet/python/mxnet/../../lib/libmxnet.so(+0x301a753) [0x7fc1c0a8a753]
[bt] (6) /gruntdata/zhimo.bmz/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::exec::InitZeros(mxnet::NDArrayStorageType, nnvm::TShape const&, mxnet::Context const&, int)+0x3d) [0x7fc1c0aa233d]
[bt] (7) /gruntdata/zhimo.bmz/mxnet/python/mxnet/../../lib/libmxnet.so(+0x3021211) [0x7fc1c0a91211]
[bt] (8) /gruntdata/zhimo.bmz/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::exec::GraphExecutor::InitArguments(...)+0xa0c) [0x7fc1c0a9567c]
[bt] (9) /gruntdata/zhimo.bmz/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::exec::GraphExecutor::Init(...)+0x842) [0x7fc1c0a9dd42]

How do I use dropout?

I found that there are several places where dropout can be applied. What settings have worked in your experience?
