robmsmt / kerasdeepspeech

A Keras CTC implementation of Baidu's DeepSpeech for model experimentation

License: GNU Affero General Public License v3.0

keras deepspeech asr ctc coreml speechrecognition speech-to-text deep-learning machine-learning neural-networks

kerasdeepspeech's Introduction

Keras DeepSpeech


Repository for experimenting with different CTC-based model designs for ASR. Supports live recording and testing of speech, and quickly creates customised datasets using the own-voice dataset creation scripts!

OVERVIEW

SETUP

  1. Recommended: use a virtualenv with Python 2.7 (3.x is untested and will not work with Core ML)
  2. git clone https://github.com/robmsmt/KerasDeepSpeech
  3. pip install -r requirements.txt
  4. Get the data using the import/download scripts in the data folder; LibriSpeech is a good example.
  5. Download the language model (large file): run ./lm/get_lm.sh

RUN

  1. To train, simply run python run-train.py. To specify training/validation files, use python run-train.py --train_files <csvfile> --valid_files <csvfile> (see run-train.py for the complete argument list)
  2. To test, run python run-test.py --test_files <datacsvfile>

CREDIT

  1. Mozilla DeepSpeech
  2. Baidu DS1 & DS2 papers

Licence

The content of this project itself is licensed under the GNU General Public License. Copyright © 2018

Contributing

Have a question? Like the tool? Don't like it? Open an issue and let's talk about it! Pull requests are appreciated!


kerasdeepspeech's Issues

Input dimension mismatch error on training the model ds1 dropout

This is the error on running model.fit_generator with Keras (2.0.9) and Theano (0.9.0) as the backend, under Python 3:

ValueError: Input dimension mis-match. (input[0].shape[1] = 1, input[1].shape[1] = 16)
Apply node that caused the error: Elemwise{eq,no_inplace}(training/ctc_target, Elemwise{round_half_to_even,no_inplace}.0)
Toposort index: 762 Inputs types: [TensorType(float64, matrix), TensorType(float64, row)]
Inputs shapes: [(16, 1), (1, 16)]
Inputs strides: [(8, 8), (128, 8)]
Inputs values: ['not shown', 'not shown']
Outputs clients: [[Sum{acc_dtype=int64}(Elemwise{eq,no_inplace}.0)]]
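The traceback shows an elementwise comparison between tensors of shape (16, 1) and (1, 16). A plausible reading, sketched below in numpy for illustration (the variable names are assumptions, not from the repo): NumPy silently broadcasts such a comparison, whereas Theano only broadcasts axes that are declared broadcastable, so a runtime size-1 axis on a plain matrix triggers exactly this "Input dimension mis-match" error.

```python
import numpy as np

batch_size = 16
ctc_target = np.zeros((batch_size, 1))    # shape reported: (16, 1)
rounded_pred = np.zeros((1, batch_size))  # shape reported: (1, 16)

# NumPy broadcasts (16, 1) == (1, 16) to a (16, 16) result; Theano
# refuses because neither size-1 axis is declared broadcastable.
broadcast_eq = (ctc_target == rounded_pred)
print(broadcast_eq.shape)  # (16, 16)

# Flattening both tensors to (batch_size,) makes the elementwise
# comparison unambiguous for either backend.
flat_eq = (ctc_target.ravel() == rounded_pred.ravel())
print(flat_eq.shape)  # (16,)
```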

WER

Hi, I want to know why the WER is 0.8. I used the default parameters with TIMIT. Have I done something wrong?

How to adjust the timesteps

Hello, thank you for sharing the great project.

I want to adjust the timesteps in ownModel, but I can't find where it should be adjusted.
In def ownModel(), it has:

input_data = Input(name='_the_input', shape=(None,input_dim))
...
x = TimeDistributed(Dense(fc_size...))(x)

Where is the definition of timesteps? Thanks a lot!
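For what it's worth, with shape=(None, input_dim) the time dimension is deliberately left variable, so there is no timesteps constant to edit: the number of timesteps is whatever length each padded batch has at runtime. A minimal numpy sketch of that idea (sizes here are illustrative assumptions, not values from the repo):

```python
import numpy as np

# Each utterance is (timesteps, input_dim); timesteps varies per file,
# which is why the model declares Input(shape=(None, input_dim)).
input_dim = 26
batch = [np.random.rand(t, input_dim) for t in (73, 120, 95)]

# The effective "timesteps" of a batch is fixed at runtime by padding
# every sequence up to the longest one in that batch.
max_t = max(x.shape[0] for x in batch)
padded = np.zeros((len(batch), max_t, input_dim))
for i, x in enumerate(batch):
    padded[i, :x.shape[0], :] = x

print(padded.shape)  # (3, 120, 26)
```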

link is broken!

" live recording and testing of speech and quickly creates customised datasets using own-voice dataset creation scripts!"

The "live recording and testing of speech" link and the "own-voice dataset creation scripts" link are broken.
Could you re-link those?

ZeroPadding1D at the ds2_gru_model

Hi, I don't understand why you use ZeroPadding1D in this model; you are adding 2048 zeros in the second (time) dimension.
For example, when the input shape is (1, 280, 161), after passing through the ZeroPadding1D layer the output is (1, 2328, 161).

Do you want to keep the sequence fixed at 2048?
If yes, isn't it necessary to calculate the number of zeros required for each input?

Thank you for your response.
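As I understand the Keras semantics, ZeroPadding1D(padding=(0, 2048)) appends 2048 zero frames after the sequence (and none before); it does not clamp the length to 2048. A numpy sketch of that behaviour, matching the shapes in the issue:

```python
import numpy as np

# Sketch of ZeroPadding1D((left, right)) semantics: zeros are prepended
# and appended along the time axis; the sequence itself is unchanged.
def zero_pad_1d(x, left, right):
    batch, steps, feats = x.shape
    out = np.zeros((batch, left + steps + right, feats), dtype=x.dtype)
    out[:, left:left + steps, :] = x
    return out

x = np.ones((1, 280, 161))
y = zero_pad_1d(x, 0, 2048)
print(y.shape)  # (1, 2328, 161) -- 280 + 2048, as reported in the issue
```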

Only 1 conv layer where supposed to be many

At model.py line 241 you have code like:

if use_conv:
    conv = ZeroPadding1D(padding=(0, 2048))(x)
    for l in range(conv_layers):
        x = Conv1D(filters=fc_size, name='conv_{}'.format(l+1), kernel_size=11, padding='valid', activation='relu', strides=2)(conv)

It should be something like:

if use_conv:
    conv = ZeroPadding1D(padding=(0, 2048))(x)
    x = Conv1D(filters=fc_size, name='conv_{}'.format(1), kernel_size=11, padding='valid', activation='relu', strides=2)(conv)
    for l in range(1, conv_layers):
        x = Conv1D(filters=fc_size, name='conv_{}'.format(l+1), kernel_size=11, padding='valid', activation='relu', strides=2)(x)
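A minimal pure-Python sketch of the dataflow bug being described (the layer function here is a stand-in, not Keras code): because every loop iteration consumes the same tensor conv, each new layer replaces x rather than stacking on top of it, so only one conv layer ends up connected in the graph.

```python
# Stand-in for applying a named layer to an input tensor; the returned
# string records the dataflow so the graph structure is visible.
def layer(name, inp):
    return '%s(%s)' % (name, inp)

conv = 'padded_input'
conv_layers = 3

# Buggy version: every Conv1D reads from `conv`, so only the last
# assignment to x survives -- a single conv layer in the graph.
for l in range(conv_layers):
    x = layer('conv_%d' % (l + 1), conv)
print(x)   # conv_3(padded_input)

# Fixed version: feed each layer the previous layer's output.
x = conv
for l in range(conv_layers):
    x = layer('conv_%d' % (l + 1), x)
print(x)   # conv_3(conv_2(conv_1(padded_input)))
```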

Padding character: #27 or 28?

t.append(27) # replace with a space char to pad

In generator.py, get_intseq(), padding is done with character 27. In the char map, that index stands for an apostrophe, not for the extra 28th padding character. In utils.py, int_to_text_sequence, character 28 is mentioned as the one for padding. Is that intended?
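A sketch of the index clash being described. The actual char map lives in the repo's utils.py; the map and helper below are illustrative assumptions (1–26 for a–z, 27 for apostrophe, 28 as the pad symbol the decoder skips), not the repo's exact code.

```python
# Hypothetical char map: 1-26 -> a-z, 27 -> apostrophe, 28 -> pad.
char_map = {i: chr(ord('a') + i - 1) for i in range(1, 27)}
char_map[27] = "'"
PAD = 28  # extra symbol the decoder is expected to skip

def pad_labels(seq, target_len, pad_index):
    return seq + [pad_index] * (target_len - len(seq))

# Padding with 27 makes the padding decode as apostrophes...
print(''.join(char_map.get(i, '') for i in pad_labels([3, 1, 20], 6, 27)))
# ...while padding with the dedicated PAD symbol keeps the text clean.
print(''.join(char_map.get(i, '') for i in pad_labels([3, 1, 20], 6, PAD)))
```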

Accuracy of `model_arch==3` i.e. `own_model`

  1. Is there any result on any dataset for your own model, i.e. model_arch == 3?
  2. Secondly, if I select model_arch == 3, the console prints it as DS3. I don't suppose it is that model, or is it?
    Thanks in advance.

Language model for bangla

I want to implement this for both Bangla isolated and continuous speech. Where can I find a language model for Bangla? If one is not available, how can I make a language model?

Could you give some examples of the shapes below?

3. input_length (required for CTC loss)

    # this is the time dimension of CTC (batch x time x mfcc)
    #input_length = np.array([get_xsize(mfcc) for mfcc in X_data])
    input_length = np.array(x_val)
    # print("3. input_length shape:", input_length.shape)   
    # print("3. input_length =", input_length)
    assert(input_length.shape == (self.batch_size,))

    # 4. label_length (required for CTC loss)
    # this is the length of the number of label of a sequence
    #label_length = np.array([len(l) for l in labels])
    label_length = np.array(y_val)
    # print("4. label_length shape:", label_length.shape)
    # print("4. label_length =", label_length)
    assert(label_length.shape == (self.batch_size,))

Hi, I want to make a CTC demo, but I do not know what label_length.shape and input_length.shape are. How do I calculate them, and what do they mean? Thank you.
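Based on the commented-out lines in the snippet above, the two arrays are per-utterance lengths, one entry per batch item. A numpy sketch with illustrative values (the frame counts and transcripts are made up):

```python
import numpy as np

# input_length[i]: number of valid time steps fed to CTC for utterance
# i (after any downsampling), i.e. the time axis of the softmax output.
# label_length[i]: number of characters in transcript i before padding.
mfcc_frames = [120, 95]          # valid time steps per utterance
labels = ['hello', 'hi there']   # ground-truth transcripts
batch_size = 2

input_length = np.array(mfcc_frames)
label_length = np.array([len(l) for l in labels])

assert input_length.shape == (batch_size,)
assert label_length.shape == (batch_size,)
print(input_length)  # [120  95]
print(label_length)  # [5 8]
```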

About Lookahead Convolution!

Could you tell me how to design a lookahead convolution with Keras? I have designed one but it doesn't work. Thanks!
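For reference, the lookahead ("row") convolution from the Deep Speech 2 paper computes each output frame as a per-feature weighted sum of the current frame and the next few future frames. A numpy sketch of that operation under my reading of the paper (weights and sizes here are illustrative; in Keras it would need a custom layer or a depthwise Conv1D with the input shifted so padding covers only future frames):

```python
import numpy as np

def lookahead_conv(h, W):
    """Row convolution: out[t, d] = sum_j W[j, d] * h[t + j, d].

    h: (timesteps, features) activations; W: (context + 1, features)
    per-feature weights over the current and future frames."""
    T, D = h.shape
    C = W.shape[0]
    padded = np.vstack([h, np.zeros((C - 1, D))])  # zero-pad the future
    out = np.zeros_like(h)
    for t in range(T):
        out[t] = (W * padded[t:t + C]).sum(axis=0)
    return out

h = np.ones((5, 3))
W = np.ones((3, 3))  # context of 2 future frames, all-ones weights
y = lookahead_conv(h, W)
print(y[0])   # [3. 3. 3.]  full context available at t=0
print(y[-1])  # [1. 1. 1.]  only the current frame at the last step
```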

Why for loop adding only one Conv1D layer in ds2_gru_model

Hello @robmsmt,

I'm working with your repo. In your model.py file, the code below should add three Conv1D layers, but it adds only one:

conv = ZeroPadding1D(padding=(0, 2048))(x)
for l in range(conv_layers):
  x = Conv1D(filters=fc_size, name='conv_{}'.format(l+1), kernel_size=11, padding='valid', activation='relu', strides=2)(conv)

This is the model summary I get,

Model: "model_1"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
the_input (InputLayer)          (None, None, 161)    0                                            
__________________________________________________________________________________________________
batch_normalization_1 (BatchNor (None, None, 161)    644         the_input[0][0]                  
__________________________________________________________________________________________________
zero_padding1d_1 (ZeroPadding1D (None, None, 161)    0           batch_normalization_1[0][0]      
__________________________________________________________________________________________________
conv_3 (Conv1D)                 (None, None, 512)    907264      zero_padding1d_1[0][0]           
__________________________________________________________________________________________________
batch_normalization_2 (BatchNor (None, None, 512)    2048        conv_3[0][0]                     
__________________________________________________________________________________________________
bidirectional_1 (Bidirectional) (None, None, 1024)   9443328     batch_normalization_2[0][0]      
__________________________________________________________________________________________________
bidirectional_2 (Bidirectional) (None, None, 1024)   12589056    bidirectional_1[0][0]            
__________________________________________________________________________________________________
bidirectional_3 (Bidirectional) (None, None, 1024)   12589056    bidirectional_2[0][0]            
__________________________________________________________________________________________________
batch_normalization_3 (BatchNor (None, None, 1024)   4096        bidirectional_3[0][0]            
__________________________________________________________________________________________________
time_distributed_1 (TimeDistrib (None, None, 512)    524800      batch_normalization_3[0][0]      
__________________________________________________________________________________________________
time_distributed_2 (TimeDistrib (None, None, 1102)   565326      time_distributed_1[0][0]         
__________________________________________________________________________________________________
the_labels (InputLayer)         (None, None)         0                                            
__________________________________________________________________________________________________
input_length (InputLayer)       (None, 1)            0                                            
__________________________________________________________________________________________________
label_length (InputLayer)       (None, 1)            0                                            
__________________________________________________________________________________________________
ctc (Lambda)                    (None, 1)            0           time_distributed_2[0][0]         
                                                                 the_labels[0][0]                 
                                                                 input_length[0][0]               
                                                                 label_length[0][0]               
==================================================================================================
Total params: 36,625,618
Trainable params: 36,622,224
Non-trainable params: 3,394
__________________________________________________________________________________________________

What could be the possible reason?
Thanks in advance
