rolczynski / automatic-speech-recognition Goto Github PK

🎧 Automatic Speech Recognition: DeepSpeech & Seq2Seq (TensorFlow)

License: GNU Affero General Public License v3.0

Python 100.00%

deep-learning machine-learning neural-networks keras speech-recognition speech-to-text deepspeech language-model distill tensorflow

automatic-speech-recognition's Introduction

Automatic Speech Recognition

The project aim is to distill the Automatic Speech Recognition research. At the beginning, you can load a ready-to-use pipeline with a pre-trained model. Benefit from the eager TensorFlow 2.0 and freely monitor model weights, activations or gradients.

import automatic_speech_recognition as asr

file = 'to/test/sample.wav'  # sample rate 16 kHz, and 16 bit depth
sample = asr.utils.read_audio(file)
pipeline = asr.load('deepspeech2', lang='en')
pipeline.model.summary()     # TensorFlow model
sentences = pipeline.predict([sample])

We support english (thanks to Open Seq2Seq). The evaluation results of the English benchmark LibriSpeech dev-clean are in the table. To reference, the DeepSpeech (Mozilla) achieves around 7.5% WER, whereas the state-of-the-art (RWTH Aachen University) equals 2.3% WER (recent evaluation results can be found here). Both of them, use the external language model to boost results. By comparison, humans achieve 5.83% WER here (LibriSpeech dev-clean)

Model Name	Decoder	WER-dev
`deepspeech2`	greedy	6.71

Shortly it turns out that you need to adjust pipeline a little bit. Take a look at the CTC Pipeline. The pipeline is responsible for connecting a neural network model with all non-differential transformations (features extraction or prediction decoding). Pipeline components are independent. You can adjust them to your needs e.g. use more sophisticated feature extraction, different data augmentation, or add the language model decoder (static n-grams or huge transformers). You can do much more like distribute the training using the Strategy, or experiment with mixed precision policy.

import numpy as np
import tensorflow as tf
import automatic_speech_recognition as asr

dataset = asr.dataset.Audio.from_csv('train.csv', batch_size=32)
dev_dataset = asr.dataset.Audio.from_csv('dev.csv', batch_size=32)
alphabet = asr.text.Alphabet(lang='en')
features_extractor = asr.features.FilterBanks(
    features_num=160,
    winlen=0.02,
    winstep=0.01,
    winfunc=np.hanning
)
model = asr.model.get_deepspeech2(
    input_dim=160,
    output_dim=29,
    rnn_units=800,
    is_mixed_precision=False
)
optimizer = tf.optimizers.Adam(
    lr=1e-4,
    beta_1=0.9,
    beta_2=0.999,
    epsilon=1e-8
)
decoder = asr.decoder.GreedyDecoder()
pipeline = asr.pipeline.CTCPipeline(
    alphabet, features_extractor, model, optimizer, decoder
)
pipeline.fit(dataset, dev_dataset, epochs=25)
pipeline.save('/checkpoint')

test_dataset = asr.dataset.Audio.from_csv('test.csv')
wer, cer = asr.evaluate.calculate_error_rates(pipeline, test_dataset)
print(f'WER: {wer}   CER: {cer}')

Installation

You can use pip:

pip install automatic-speech-recognition

Otherwise clone the code and create a new environment via conda:

git clone https://github.com/rolczynski/Automatic-Speech-Recognition.git
conda env create -f=environment.yml     # or use: environment-gpu.yml
conda activate Automatic-Speech-Recognition

References

The fundamental repositories:

Baidu - DeepSpeech2 - A PaddlePaddle implementation of DeepSpeech2 architecture for ASR
NVIDIA - Toolkit for efficient experimentation with Speech Recognition, Text2Speech and NLP
RWTH Aachen University - The RWTH extensible training framework for universal recurrent neural networks
TensorFlow - The implementation of DeepSpeech2 model
Mozilla - DeepSpeech - A TensorFlow implementation of Baidu's DeepSpeech architecture
Espnet - End-to-End Speech Processing Toolkit
Sean Naren - Speech Recognition using DeepSpeech2

Moreover, you can explore the GitHub using key phrases like ASR, DeepSpeech, or Speech-To-Text. The list wer_are_we, an attempt at tracking states of the art, can be helpful too.

automatic-speech-recognition's People

Contributors

Stargazers

Watchers

automatic-speech-recognition's Issues

CtcPipeline not implemented correctly

def fit(self, dataset: dataset.Dataset, dev_dataset: dataset.Dataset, augmentation: augmentation.Augmentation = None, prepared_features: bool = False, **kwargs) -> keras.callbacks.History: """ Get ready data, compile and train a model. """ dataset = self.wrap_preprocess(dataset) dev_dataset = self.wrap_preprocess(dev_dataset) if not self._model.optimizer: # a loss function and an optimizer self.compile_model() # have to be set before the training return self._model.fit(dataset, validation_data=dev_dataset, **kwargs)

from example configuration (examples/use_augmentation.py)
pipeline.fit(dataset, dev_dataset, epochs=25, augmentation=spec_augment)

SpecAugment is passed in the pipeline.fit() method, but never used.

Could the model be run on CPU?

Hi,
Could the provided pretrained model be run on CPU?
Thank you.

Repeated Constant Output

I use your code to train on a specific language. After some epoch, I test the output of the network on a sample of the training dataset. The output is a repeated constant char. For example 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'.

What is the problem?

How to test the model which I have generated with new dataset.

Hi Mr. Rolczynski,
I have tried to generate a new model with a new training dataset(code is as below which is same as you have mentioned in github)
dataset = asr.dataset.Audio.from_csv('C:/Users/XXXXX/Automatic-Speech-Recognition-master/84-121123-dev.csv', batch_size=25)
dev_dataset = asr.dataset.Audio.from_csv('C:/Users/XXXXX/Automatic-Speech-Recognition-master/84-121550-dev.csv', batch_size=25)

alphabet = asr.text.Alphabet(lang='en')
features_extractor = asr.features.FilterBanks(
features_num=160,
winlen=0.02,
winstep=0.01,
winfunc=np.hanning
)
model = asr.model.get_deepspeech2(
input_dim=160,
output_dim=29,
rnn_units=800,
is_mixed_precision=False
)
optimizer = tf.optimizers.Adam(
lr=1e-4,
beta_1=0.9,
beta_2=0.999,
epsilon=1e-8
)
decoder = asr.decoder.GreedyDecoder()
pipeline = asr.pipeline.CTCPipeline(
alphabet, features_extractor, model, optimizer, decoder
)
pipeline.fit(dataset, dev_dataset, epochs=25)
pipeline.save('C:/Users/XXXX/Automatic-Speech-Recognition-master/Automatic-Speech-Recognition-master/automatic_speech_recognition/checkpoint/')

The training resulted in some files, which are as below in the checkpoint directory
'alphabet.bin, , 'decoder.bin', 'feature_extractor.bin', and 'model.h5'

But my question is how to load the model which i have just created. I believe the code which you have provided to test a pre -trained model (below) works only with deep speech model and not my own model.

file = 'to/test/sample.wav' # sample rate 16 kHz, and 16 bit depth
sample = asr.utils.read_audio(file)
pipeline = asr.load('deepspeech2', lang='en')
pipeline.model.summary() # TensorFlow model
sentences = pipeline.predict([sample])

Can you please help me to resolve this. I really appreciate your effort in helping the larger audience to get the knowledge of how speech to text recognition works.

Problem in CTC_Loss

I am getting this error in tf.nn.ctc_loss
TypeError: Value passed to parameter 'indices' has DataType float32 not in list of allowed values: uint8, int32, int64

Not sure where is this float type coming from.

Repeating Benchmark results

Hi @rolczynski

I am experimenting with your code and would like to know how to repeat benchmark results from the Table?

Is it the pipeline from readme? With 25epoch and batch size of 32? How many gpu-s did you use (4x8 I guess)?

Dataset should be full librispeech.
Was data augmentation used?

Does the code support decoding on the whole dev-clean subset?

Unable to load file

Getting error while running below code -
pipeline = asr.load('deepspeech2', lang='en')

OSError: Unable to open file (truncated file: eof = 136667136, sblock->base_addr = 0, stored_eof = 454666120)

Note- I tried deleting model file and reinstalling, but nothing worked.

Not able to use GPU for prediction! by using pip env instead of Conda env !

Im not able to use my GPU ! what could be the possible reason or solution ?

I created the virtual pip environment(not conda env)
pip install automatic-speech-recognition
I have manually uninstalled the tensorflow from my virtual environment and installed
tf gpu 2.2.0

and Im getting these logs :

2020-07-03 19:01:14.700922: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
2020-07-03 19:01:16.568359: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library nvcuda.dll
2020-07-03 19:01:17.000059: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: GeForce GTX 980M computeCapability: 5.2
coreClock: 1.1265GHz coreCount: 12 deviceMemorySize: 8.00GiB deviceMemoryBandwidth: 149.31GiB/s
2020-07-03 19:01:17.000274: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
2020-07-03 19:01:17.003927: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_10.dll
2020-07-03 19:01:17.007769: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cufft64_10.dll
2020-07-03 19:01:17.009106: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library curand64_10.dll
2020-07-03 19:01:17.013800: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusolver64_10.dll
2020-07-03 19:01:17.016093: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusparse64_10.dll
2020-07-03 19:01:17.017343: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'cudnn64_7.dll'; dlerror: cudnn64_7.dll not found
2020-07-03 19:01:17.017550: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1598] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed
properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
2020-07-03 19:01:17.018224: I tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2020-07-03 19:01:17.026847: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x21f43c313f0 initialized for platform Host (this does not guarantee that XLA will be used). Devi
ces:
2020-07-03 19:01:17.027059: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2020-07-03 19:01:17.027661: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-07-03 19:01:17.027804: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1108]
2020-07-03 19:01:17.032795: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: GeForce GTX 980M computeCapability: 5.2
coreClock: 1.1265GHz coreCount: 12 deviceMemorySize: 8.00GiB deviceMemoryBandwidth: 149.31GiB/s
2020-07-03 19:01:17.033024: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
2020-07-03 19:01:17.033299: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_10.dll
2020-07-03 19:01:17.033420: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cufft64_10.dll
2020-07-03 19:01:17.033786: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library curand64_10.dll
2020-07-03 19:01:17.034116: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusolver64_10.dll
2020-07-03 19:01:17.034508: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusparse64_10.dll
2020-07-03 19:01:17.036010: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'cudnn64_7.dll'; dlerror: cudnn64_7.dll not found
2020-07-03 19:01:17.036895: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1598] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed
properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
2020-07-03 19:01:17.101046: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-07-03 19:01:17.101254: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1108] 0
2020-07-03 19:01:17.101499: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1121] 0: N
2020-07-03 19:01:17.103170: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x21f53ac4e20 initialized for platform CUDA (this does not guarantee that XLA will be used). Devi
ces:
2020-07-03 19:01:17.103373: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): GeForce GTX 980M, Compute Capability 5.2
WARNING:tensorflow:Mixed precision compatibility check (mixed_float16): WARNING
The dtype policy mixed_float16 may run slowly because this machine does not have a GPU.
If you will use compatible GPU(s) not attached to this host, e.g. by running a multi-worker model, you can ignore this warning. This message will only be logged once
Model: "DeepSpeech2"

Layer (type) Output Shape Param #

X (InputLayer) [(None, None, 160)] 0

lambda (Lambda) (None, None, 160, 1) 0

conv_1 (Conv2D) (None, None, 80, 32) 14432

conv_1_bn (BatchNormalizatio (None, None, 80, 32) 128

conv_1_relu (ReLU) (None, None, 80, 32) 0

conv_2 (Conv2D) (None, None, 40, 32) 236544

conv_2_bn (BatchNormalizatio (None, None, 40, 32) 128

conv_2_relu (ReLU) (None, None, 40, 32) 0

reshape (Reshape) (None, None, 1280) 0

bidirectional_1 (Bidirection (None, None, 1600) 9993600

dropout (Dropout) (None, None, 1600) 0

bidirectional_2 (Bidirection (None, None, 1600) 11529600

dropout_1 (Dropout) (None, None, 1600) 0

bidirectional_3 (Bidirection (None, None, 1600) 11529600

dropout_2 (Dropout) (None, None, 1600) 0

bidirectional_4 (Bidirection (None, None, 1600) 11529600

dropout_3 (Dropout) (None, None, 1600) 0

bidirectional_5 (Bidirection (None, None, 1600) 11529600

dense_1 (TimeDistributed) (None, None, 1600) 2561600

dense_1_relu (ReLU) (None, None, 1600) 0

dropout_4 (Dropout) (None, None, 1600) 0

dense_2 (TimeDistributed) (None, None, 29) 46429

Total params: 58,971,261
Trainable params: 58,971,133
Non-trainable params: 128

TensorFlow multi_gpu_model function is deprecated

Hello, i want to train my dataset, and i have two gpus.

bellwo is code.

`pipeline = asr.pipeline.CTCPipeline(
alphabet, features_extractor, model, optimizer, decoder, gpus=['gpu:0','gpu:1']
)
dataset =pipeline.wrap_preprocess(dataset, False, None)
dev_dataset =pipeline.wrap_preprocess(dev_dataset, False, None)

y = tf.keras.layers.Input(name='y', shape=[None], dtype='int32')
loss = pipeline.get_loss()
pipeline._model.compile(pipeline._optimizer, loss, target_tensors=[y])
pipeline._model.fit(dataset,validation_data=dev_dataset,epochs=100)
pipeline._model.save(os.path.join('/checkpoint', 'model.h5'))`

But, model use only one gpu.

It seems that an OOM occurs when the batch size is increased.

Installation Conda

Love the project . I'm looking to use this to adapt models to my voice. But I'm struggling on the installation process with conda. Getting a "Solving environment : Failed " message after running the command :

conda env create -f=environment.yml

ResolvePackagesNotFound
scipy=1.1.0=py36h7c811a0_2
.......
tensorflow=1.12.0=gpu_py36he74679b_0

Anyway Thank you keep up the good work 👍

Log filterbank and keras version

Hi,

I see you recently you change from MFCC to log filterbank for acoustic feature extraction. Is there any particular reason (improvement, etc)? Should it better to keep those two so user can choose and experimenting between those two feature extractions?

Also, as the main principle of your proposed platform is for easiness and understandable, isn't it better to keep to use Keras instead of tf.keras? I see it your to-do/contributing list.

Not installable. Setup py is missing.

The module cannot be installed as python package.

ImportError: DLL load failed: The specified module could not be found.

Hello, after installing Automatic-Speech-Recognition via pip when I am trying to import automatic_speech_recognition it is giving me error as below:

ImportError Traceback (most recent call last)
~\Anaconda3\lib\site-packages\tensorflow\python\pywrap_tensorflow.py in
57
---> 58 from tensorflow.python.pywrap_tensorflow_internal import *
59

~\Anaconda3\lib\site-packages\tensorflow\python\pywrap_tensorflow_internal.py in
27 return _mod
---> 28 _pywrap_tensorflow_internal = swig_import_helper()
29 del swig_import_helper

~\Anaconda3\lib\site-packages\tensorflow\python\pywrap_tensorflow_internal.py in swig_import_helper()
23 try:
---> 24 _mod = imp.load_module('_pywrap_tensorflow_internal', fp, pathname, description)
25 finally:

~\Anaconda3\lib\imp.py in load_module(name, file, filename, details)
241 else:
--> 242 return load_dynamic(name, filename, file)
243 elif type_ == PKG_DIRECTORY:

~\Anaconda3\lib\imp.py in load_dynamic(name, path, file)
341 name=name, loader=loader, origin=path)
--> 342 return _load(spec)
343

ImportError: DLL load failed: The specified module could not be found.

During handling of the above exception, another exception occurred:

ImportError Traceback (most recent call last)
in
----> 1 import automatic_speech_recognition as asr
2
3 file = 'data_wav/sample_000000.wav' # sample rate 16 kHz, and 16 bit depth
4 sample = asr.utils.read_audio(file)
5 pipeline = asr.load('deepspeech2', lang='en')

~\Anaconda3\lib\site-packages\automatic_speech_recognition_init_.py in
1 from . import augmentation
----> 2 from . import callback
3 from . import dataset
4 from . import decoder
5 from . import evaluate

~\Anaconda3\lib\site-packages\automatic_speech_recognition\callback_init_.py in
----> 1 from tensorflow.keras.callbacks import *
2 from .batch_logger import BatchLogger
3 from .distributed_model_checkpoint import DistributedModelCheckpoint

~\Anaconda3\lib\site-packages\tensorflow_init_.py in
39 import sys as _sys
40
---> 41 from tensorflow.python.tools import module_util as _module_util
42 from tensorflow.python.util.lazy_loader import LazyLoader as _LazyLoader
43

~\Anaconda3\lib\site-packages\tensorflow\python_init_.py in
48 import numpy as np
49
---> 50 from tensorflow.python import pywrap_tensorflow
51
52 # Protocol buffers

~\Anaconda3\lib\site-packages\tensorflow\python\pywrap_tensorflow.py in
67 for some common reasons and solutions. Include the entire stack trace
68 above this error message when asking for help.""" % traceback.format_exc()
---> 69 raise ImportError(msg)
70
71 # pylint: enable=wildcard-import,g-import-not-at-top,unused-import,line-too-long

ImportError: Traceback (most recent call last):
File "C:\Users\Jeet\Anaconda3\lib\site-packages\tensorflow\python\pywrap_tensorflow.py", line 58, in
from tensorflow.python.pywrap_tensorflow_internal import *
File "C:\Users\Jeet\Anaconda3\lib\site-packages\tensorflow\python\pywrap_tensorflow_internal.py", line 28, in
_pywrap_tensorflow_internal = swig_import_helper()
File "C:\Users\Jeet\Anaconda3\lib\site-packages\tensorflow\python\pywrap_tensorflow_internal.py", line 24, in swig_import_helper
_mod = imp.load_module('_pywrap_tensorflow_internal', fp, pathname, description)
File "C:\Users\Jeet\Anaconda3\lib\imp.py", line 242, in load_module
return load_dynamic(name, filename, file)
File "C:\Users\Jeet\Anaconda3\lib\imp.py", line 342, in load_dynamic
return _load(spec)
ImportError: DLL load failed: The specified module could not be found.

Please anyone can help me with this?
Thank you in advance.

Running the code without installing the package

Hello and thank you for the great work!

I am trying to run the code (training) without installing the package for experimentation.
Do you have a script which can help me run it?

I cloned the code and made venv (with pip). When running baseline.py or another examples it gives syntax error for print and No module named 'automatic_speech_recognition'

Thank you

OSError: Unable to open file

Hello, thank you for sharing this project!

I'm trying to run the code but I'm getting the following error. Could someone help me please?
------- INPUT
import automatic_speech_recognition as asr
pipeline = asr.load('deepspeech2', lang='en')

------- OUTPUT
OSError: Unable to open file (truncated file: eof = 19259392, sblock->base_addr = 0, stored_eof = 454666120)

OSError Traceback (most recent call last)
in
----> 1 pipeline = asr.load('deepspeech2', lang='en')

~/anaconda3/lib/python3.7/site-packages/automatic_speech_recognition/load/load.py in load(name, lang, version)
11 def load(name: str, lang: str, version=0.1) -> pipeline.Pipeline:
12 if name == 'deepspeech2' and lang == 'en' and version == 0.1:
---> 13 return load_deepspeech2_en()
14 raise ValueError('Specified model is not supported')
15

~/anaconda3/lib/python3.7/site-packages/automatic_speech_recognition/load/load.py in wrapper()
25 local_path = f'{os.path.dirname(file)}/models/{file_name}'
26 utils.maybe_download_from_bucket(bucket, remote_path, local_path)
---> 27 return loader(weights_path=local_path)
28 return wrapper
29 return closure

~/anaconda3/lib/python3.7/site-packages/automatic_speech_recognition/load/load.py in load_deepspeech2_en(weights_path)
33 def load_deepspeech2_en(weights_path: str) -> pipeline.CTCPipeline:
34 deepspeech2 = model.get_deepspeech2(input_dim=160, output_dim=29)
---> 35 deepspeech2.load_weights(weights_path)
36 alphabet_en = text.Alphabet(lang='en')
37 spectrogram = features.Spectrogram(

~/anaconda3/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training.py in load_weights(self, filepath, by_name, skip_mismatch)
232 raise ValueError('Load weights is not yet supported with TPUStrategy '
233 'with steps_per_run greater than 1.')
--> 234 return super(Model, self).load_weights(filepath, by_name, skip_mismatch)
235
236 @trackable.no_automatic_dependency_tracking

~/anaconda3/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/network.py in load_weights(self, filepath, by_name, skip_mismatch)
1213 'first, then load the weights.')
1214 self._assert_weights_created()
-> 1215 with h5py.File(filepath, 'r') as f:
1216 if 'layer_names' not in f.attrs and 'model_weights' in f:
1217 f = f['model_weights']

~/anaconda3/lib/python3.7/site-packages/h5py/_hl/files.py in init(self, name, mode, driver, libver, userblock_size, swmr, rdcc_nslots, rdcc_nbytes, rdcc_w0, track_order, **kwds)
392 fid = make_fid(name, mode, userblock_size,
393 fapl, fcpl=make_fcpl(track_order=track_order),
--> 394 swmr=swmr)
395
396 if swmr_support:

~/anaconda3/lib/python3.7/site-packages/h5py/_hl/files.py in make_fid(name, mode, userblock_size, fapl, fcpl, swmr)
168 if swmr and swmr_support:
169 flags |= h5f.ACC_SWMR_READ
--> 170 fid = h5f.open(name, flags, fapl=fapl)
171 elif mode == 'r+':
172 fid = h5f.open(name, h5f.ACC_RDWR, fapl=fapl)

h5py/_objects.pyx in h5py._objects.with_phil.wrapper()

h5py/h5f.pyx in h5py.h5f.open()

OSError: Unable to open file (truncated file: eof = 19259392, sblock->base_addr = 0, stored_eof = 454666120)

Negative loss value

Hi,

I am trying to run the sample training on a librispeech clean 100h.

After few hours of training with batch size=10 the printed loss value becomes negative.
It happens in the first epoch.

The thing i changed is read_audio function to use soundfile for reading flac files insted of waves with wavfile.read. Although both give the same output when reading files so it shouldn't make a difference.

Are you familiar with the issue? Loss seems to decrease too fast.
Any guess what is going wrong?

missing optimizer class

when i try to train, i cant because the optimizer class doesnt exists

Traceback (most recent call last):
File "C:\Users\benja\OneDrive\Desktop\Drazcat\python\Automatic-Speech-Recognition\train.py", line 19, in
optimizer = asr.optimizer.Adam(
AttributeError: module 'automatic_speech_recognition' has no attribute 'optimizer'

Installation readme problem

Hi,

I've a problem with installation of your version of deepspeech.
The
'''pip install deepspeech-keras'''
command doesn't work since there is no such package at pypi ( https://pypi.org/search/?q=deepspeech ) (or at least, I'm unable to find it).

Also there's a problem with alternative installation precedure, since there is no requirements.txt file in the repo (there is only environment.yml, which is usable by conda, however it's significantly harder to use deepspeech without conda (for example with pipenv)).

It'd be really helpful for pip install option to work, since there is only a few STT open-sourced tools with premade models : )

Skipping optimization due to error while loading function libraries

Hello !

I run into this error while trying to make a simple prediction with this package :
W tensorflow/core/grappler/optimizers/implementation_selector.cc:310] Skipping optimization due to error while loading function libraries: Invalid argument: Functions '__inference_standard_gru_13758' and '__inference_standard_gru_13758_specialized_for_bidirectional_3_2_backward_gru_3_StatefulPartitionedCall_at___inference_keras_scratch_graph_16375' both implement 'gru_fed7d4c6-fe30-487d-b5ea-3c3017a988bb' but their signatures do not match.

Any idea where that comes from ?

After activating in google collab when i try to import deepspeech it says "No module named 'deepspeech'".

PIP not working

PIP cmd not working

Conda packages are unavailable in windows 10.

Some of the mentioned conda packages are available in pip but not in conda.
Apart from that, conda is unable to identify the packages due to the mentioning of build strings in environment yml file.

How to transcribe from wav audio?

Dear Mr. Rolczynski, I have try to train with my language (Bahasa Indonesia), because of the alphabet's character from my language is identical with englis language, so when i do trainning (with augmentation), i set the aplphabet to english.

just try 5 epoch, the setting is similar with the example that you give it to me.

The WER and CER record for 5 epoch, each one is similar:
917s 738ms/step - loss: -0.6931
2586s 2s/step - loss: -0.1466 - val_loss: -0.6931

then when do:
wer, cer = asr.evaluate.calculate_error_rates(pipeline, test_dataset) print(f'WER: {wer} CER: {cer}')
the result is:
WER: 1.0 CER: 1.0

the traiining result some files, that is:
'alphabet.bin, , 'decoder.bin', 'feature_extractor.bin', and 'model.h5'

from the pipy documentation, i perceive the way to transcribe is:
import automatic_speech_recognition as asr

but i see, the script looks like try to load deepspeech2 model from tensorflow. and I don't see any manual to use all file that produce from trainning process. the error is:


---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-3-185fe2c59fa1> in <module>
      5 pipeline = asr.load('deepspeech2', lang='en',version=0.1)
      6 pipeline.model.summary()     # TensorFlow model
----> 7 sentences = pipeline.predict([sample])

~/Documents/coding/speech/Automatic-Speech-Recognition/automatic_speech_recognition/pipeline/ctc_pipeline.py in predict(self, batch_audio, **kwargs)
     92     def predict(self, batch_audio: List[np.ndarray], **kwargs) -> List[str]:
     93         """ Get ready features, and make a prediction. """
---> 94         features = self._features_extractor(batch_audio)
     95         batch_logits = self._model.predict(features, **kwargs)
     96         decoded_labels = self._decoder(batch_logits)

~/Documents/coding/speech/Automatic-Speech-Recognition/automatic_speech_recognition/features/feature_extractor.py in __call__(self, batch_audio)
      8     def __call__(self, batch_audio: List[np.ndarray]) -> np.ndarray:
      9         """ Extract features from the file list. """
---> 10         features = [self.make_features(audio) for audio in batch_audio]
     11         X = self.align(features)
     12         return X.astype(np.float16)

~/Documents/coding/speech/Automatic-Speech-Recognition/automatic_speech_recognition/features/feature_extractor.py in <listcomp>(.0)
      8     def __call__(self, batch_audio: List[np.ndarray]) -> np.ndarray:
      9         """ Extract features from the file list. """
---> 10         features = [self.make_features(audio) for audio in batch_audio]
     11         X = self.align(features)
     12         return X.astype(np.float16)

~/Documents/coding/speech/Automatic-Speech-Recognition/automatic_speech_recognition/features/spectrogram.py in make_features(self, audio)
     28         audio = self.pad(audio) if self.pad_to else audio
     29         frames = python_speech_features.sigproc.framesig(
---> 30             audio, self.frame_len, self.frame_step, self.winfunc
     31         )
     32         features = python_speech_features.sigproc.logpowspec(

~/.local/lib/python3.6/site-packages/python_speech_features/sigproc.py in framesig(sig, frame_len, frame_step, winfunc)
     31 
     32     zeros = numpy.zeros((padlen - slen,))
---> 33     padsignal = numpy.concatenate((sig,zeros))
     34 
     35     indices = numpy.tile(numpy.arange(0,frame_len),(numframes,1)) + numpy.tile(numpy.arange(0,numframes*frame_step,frame_step),(frame_len,1)).T

ValueError: all the input arrays must have same number of dimensions

I think the error caused by the tensorflow not read the model that already build / result from trainning process
.
i means, there is no code that reffer /binding to read the models...
already try to understood your code and try to reproduce base some code in your script, e.g:

mymodel = asr.utils.load('/home/bram/Documents/coding/speech/MyDeepSpeech/save_model/model.h5')
alphabet = asr.utils.load('/home/bram/Documents/coding/speech/MyDeepSpeech/save_model/alphabet.bin')
decoder = asr.utils.load('/home/bram/Documents/coding/speech/MyDeepSpeech/save_model/decoder.bin')

and still error:


UnpicklingError                           Traceback (most recent call last)
<ipython-input-15-dc0c73ed9f58> in <module>
----> 1 mymodel = asr.utils.load('/home/bram/Documents/coding/speech/MyDeepSpeech/save_model/model.h5')
      2 alphabet = asr.utils.load('/home/bram/Documents/coding/speech/MyDeepSpeech/save_model/alphabet.bin')
      3 decoder = asr.utils.load('/home/bram/Documents/coding/speech/MyDeepSpeech/save_model/decoder.bin')

~/Documents/coding/speech/Automatic-Speech-Recognition/automatic_speech_recognition/utils/utils.py in load(file_path)
     16     """ Load arbitrary python objects from the pickled file. """
     17     with open(file_path, mode='rb') as file:
---> 18         return pickle.load(file)
     19 
     20 

UnpicklingError: invalid load key, 'H'.

because i don't understood how to use the model, maybe I can asking some question:

I check the function in pipeline.model, by dir (pipeline.model) after run:
pipeline = asr.load('deepspeech2', lang='en',version=0.1)

and this is all the function inside it:
'activity_regularizer', 'add_loss', 'add_metric', 'add_update', 'add_variable', 'add_weight', 'apply', 'build', 'built', 'call', 'compile', 'compute_mask', 'compute_output_shape', 'compute_output_signature', 'count_params', 'dtype', 'dynamic', 'evaluate', 'evaluate_generator', 'fit', 'fit_generator', 'from_config', 'get_config', 'get_input_at', 'get_input_mask_at', 'get_input_shape_at', 'get_layer', 'get_losses_for', 'get_output_at', 'get_output_mask_at', 'get_output_shape_at', 'get_updates_for', 'get_weights', 'inbound_nodes', 'input', 'input_mask', 'input_names', 'input_shape', 'input_spec', 'inputs', 'layers', 'load_weights', 'losses', 'metrics', 'metrics_names', 'name', 'name_scope', 'non_trainable_variables', 'non_trainable_weights', 'optimizer', 'outbound_nodes', 'output', 'output_mask', 'output_names', 'output_shape', 'outputs', 'predict', 'predict_generator', 'predict_on_batch', 'reset_metrics', 'reset_states', 'run_eagerly', 'sample_weights', 'save', 'save_weights', 'set_weights', 'state_updates', 'stateful', 'submodules', 'summary', 'supports_masking', 'test_on_batch', 'to_json', 'to_yaml', 'train_on_batch', 'trainable', 'trainable_variables', 'trainable_weights', 'updates', 'variables', 'weights', 'with_name_scope'.
which one is the fuction to load / fit the model to memory....
what the means WER 1.0 and CER 1.0 that result in evaluate script? are that's means 100% error or 1% error?
how to transcribe with that model, or i must use another repos? if the file only the *.h5 (model file), i think, i will not have many problem with that, i will try to read it with another speech recognition repos that base on h5py, but because of it resulting another file, that is: 'alphabet.bin, , 'decoder.bin', 'feature_extractor.bin'...and i'm not familiar with those file, event I guess it's alphabet.bin similar with ken-lm alphabet. but still needs time to figure out i think.
may be you can give some example to transcribe using those file.
not to much in expectation, because I alredy try to build my own language speech recognition, with many failure, including with deepspeech 2, mostly failure caused by lack of dataset (mine only 3 hours dataset from Mozilla Common Voice), and that causing overfitting problem. so to solve it, I plan to implemented this method. inspired after watch this video in youtube. that's make me found your script, a python script that have possibilities to implemented that research.

btw have any idea to do that? I think it's only need arrangement, the number of the rnn_units, which is, giving the minimum value but enough to create overfitting in trainning process, then when the overfitting detected, raise the number to raise complexity continously (but not reach the limitation / capabilities of the GPU memory) , until the condition that the research means happens. so it will be a trully atomatic speech recognition engine...

But yeah, if you can give an example code, it's more perfect...

sorry for all of my request, hope inspiring something...
sorry for my bad english language, i don't know, how many i edit it (after re reads again)...
you can reply me here or in [email protected] ...for send some code may be (just hope it ..he..he..)

cheers...
and thanks...for what have you made...

Where to add tf.keras.backend.clear_session()?

Recently I put your model behind my flask web server. Every time a POST/GET comes to Flask server, the server will execute pipeline.predict([sample]). After couples of POST/GET requests, the server will be out of memory, so I would like to add tf.keras.backend.clear_session() to your code. I am trying to put tf.keras.backend.clear_session() to some position, but it seems like not work. I would be greatly appreciated if you help solve the problem.

Some Flask server code:

@app.route('/test', methods=['POST'])
def hello_world():
    ....
    audio_file = request.files['audio']
    audio_file.save(audio_path)
    sample = asr.utils.read_audio(audio_path)
    pipeline = asr.load('deepspeech2', lang='en')
    sentences = pipeline.predict([sample])
    ....
    return sentences

Not prediction text

sentences = pipeline.predict([sample])
taking too long for predicting or decoding the text

what's the correct data format?

I run your code on Common Voice dataset but got this problem.
`---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/op_def_library.py in _apply_op_helper(op_type_name, name, **keywords)
467 as_ref=input_arg.is_ref,
--> 468 preferred_dtype=default_dtype)
469 except TypeError as err:

22 frames
ValueError: Tensor conversion requested dtype float16 for Tensor with dtype float32: <tf.Tensor 'loss/dense_2_loss/ctc_loss_dense/ExpandDims:0' shape=(1, None, None, 29) dtype=float32>

During handling of the above exception, another exception occurred:

TypeError Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/op_def_library.py in _apply_op_helper(op_type_name, name, **keywords)
502 "%s type %s of argument '%s'." %
503 (prefix, dtypes.as_dtype(attrs[input_arg.type_attr]).name,
--> 504 inferred_from[input_arg.type_attr]))
505
506 types = [values.dtype]

TypeError: Input 'y' of 'Mul' Op has type float32 that does not match type float16 of argument 'x'.`

The csv file is like this:

filename, text up_votes, down_votes, age gender, accent, duration
0 cv-other-train/sample-000000.mp3, he had to spit some tobacco out of his mouth, 0, 0, seventies male, england, NaN
1 cv-other-train/sample-000001.mp3, it took her a while to get used to it, 1, 1, twenties, male, scotland, NaN
2 cv-other-train/sample-000002.mp3, you will need some rubber boots, 0, 0, NaN, NaN, NaN, NaN
3 cv-other-train/sample-000003.mp3, you can speak a label to click on an element, 0, 0, fourties, male, us, NaN
4 cv-other-train/sample-000004.mp3, the priest collapsed backwards, 0, 0, NaN, NaN, NaN, NaN

Can you show me your training data sample? what's the correct data format?

Problem when training

Hi!
When training the model, I am getting negative val_loss.
Also when calling the predict function, the model always outputs empty strings.
Tried different datasets. But the problem is persisting.
Any help would be appreciated.
Thanks!

Tried running the model with a wav file, got `ValueError: all the input arrays must have same number of dimensions`

Hello, I tried running it on a wav file I had and got this error, what does it mean?

WARNING:tensorflow:Mixed precision compatibility check (mixed_float16): WARNING
The dtype policy mixed_float16 may run slowly because this machine does not have a GPU. Only Nvidia GPUs with compute capability of at least 7.0 run quickly with mixed_float16.
If you will use compatible GPU(s) not attached to this host, e.g. by running a multi-worker model, you can ignore this warning. This message will only be logged once
WARNING:tensorflow:From /opt/conda/envs/lab42/lib/python3.7/site-packages/tensorflow/python/keras/mixed_precision/loss_scale.py:56: DynamicLossScale.__init__ (from tensorflow.python.training.experimental.loss_scale) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.keras.mixed_precision.LossScaleOptimizer instead. LossScaleOptimizer now has all the functionality of DynamicLossScale
Model: "DeepSpeech2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
X (InputLayer)               [(None, None, 160)]       0         
_________________________________________________________________
lambda (Lambda)              (None, None, 160, 1)      0         
_________________________________________________________________
conv_1 (Conv2D)              (None, None, 80, 32)      14432     
_________________________________________________________________
conv_1_bn (BatchNormalizatio (None, None, 80, 32)      128       
_________________________________________________________________
conv_1_relu (ReLU)           (None, None, 80, 32)      0         
_________________________________________________________________
conv_2 (Conv2D)              (None, None, 40, 32)      236544    
_________________________________________________________________
conv_2_bn (BatchNormalizatio (None, None, 40, 32)      128       
_________________________________________________________________
conv_2_relu (ReLU)           (None, None, 40, 32)      0         
_________________________________________________________________
reshape (Reshape)            (None, None, 1280)        0         
_________________________________________________________________
bidirectional_1 (Bidirection (None, None, 1600)        9993600   
_________________________________________________________________
dropout (Dropout)            (None, None, 1600)        0         
_________________________________________________________________
bidirectional_2 (Bidirection (None, None, 1600)        11529600  
_________________________________________________________________
dropout_1 (Dropout)          (None, None, 1600)        0         
_________________________________________________________________
bidirectional_3 (Bidirection (None, None, 1600)        11529600  
_________________________________________________________________
dropout_2 (Dropout)          (None, None, 1600)        0         
_________________________________________________________________
bidirectional_4 (Bidirection (None, None, 1600)        11529600  
_________________________________________________________________
dropout_3 (Dropout)          (None, None, 1600)        0         
_________________________________________________________________
bidirectional_5 (Bidirection (None, None, 1600)        11529600  
_________________________________________________________________
dense_1 (TimeDistributed)    (None, None, 1600)        2561600   
_________________________________________________________________
dense_1_relu (ReLU)          (None, None, 1600)        0         
_________________________________________________________________
dropout_4 (Dropout)          (None, None, 1600)        0         
_________________________________________________________________
dense_2 (TimeDistributed)    (None, None, 29)          46429     
=================================================================
Total params: 58,971,261
Trainable params: 58,971,133
Non-trainable params: 128
_________________________________________________________________
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-5-3a5155b23ea3> in <module>
      4 pipeline = asr.load('deepspeech2', lang='en')
      5 pipeline.model.summary()     # TensorFlow model
----> 6 sentences = pipeline.predict([sample])

/opt/conda/envs/lab42/lib/python3.7/site-packages/automatic_speech_recognition/pipeline/ctc_pipeline.py in predict(self, batch_audio, **kwargs)
     92     def predict(self, batch_audio: List[np.ndarray], **kwargs) -> List[str]:
     93         """ Get ready features, and make a prediction. """
---> 94         features = self._features_extractor(batch_audio)
     95         batch_logits = self._model.predict(features, **kwargs)
     96         decoded_labels = self._decoder(batch_logits)

/opt/conda/envs/lab42/lib/python3.7/site-packages/automatic_speech_recognition/features/feature_extractor.py in __call__(self, batch_audio)
      8     def __call__(self, batch_audio: List[np.ndarray]) -> np.ndarray:
      9         """ Extract features from the file list. """
---> 10         features = [self.make_features(audio) for audio in batch_audio]
     11         X = self.align(features)
     12         return X.astype(np.float16)

/opt/conda/envs/lab42/lib/python3.7/site-packages/automatic_speech_recognition/features/feature_extractor.py in <listcomp>(.0)
      8     def __call__(self, batch_audio: List[np.ndarray]) -> np.ndarray:
      9         """ Extract features from the file list. """
---> 10         features = [self.make_features(audio) for audio in batch_audio]
     11         X = self.align(features)
     12         return X.astype(np.float16)

/opt/conda/envs/lab42/lib/python3.7/site-packages/automatic_speech_recognition/features/spectrogram.py in make_features(self, audio)
     28         audio = self.pad(audio) if self.pad_to else audio
     29         frames = python_speech_features.sigproc.framesig(
---> 30             audio, self.frame_len, self.frame_step, self.winfunc
     31         )
     32         features = python_speech_features.sigproc.logpowspec(

/opt/conda/envs/lab42/lib/python3.7/site-packages/python_speech_features/sigproc.py in framesig(sig, frame_len, frame_step, winfunc)
     31 
     32     zeros = numpy.zeros((padlen - slen,))
---> 33     padsignal = numpy.concatenate((sig,zeros))
     34 
     35     indices = numpy.tile(numpy.arange(0,frame_len),(numframes,1)) + numpy.tile(numpy.arange(0,numframes*frame_step,frame_step),(frame_len,1)).T

<__array_function__ internals> in concatenate(*args, **kwargs)

ValueError: all the input arrays must have same number of dimensions, but the array at index 0 has 2 dimension(s) and the array at index 1 has 1 dimension(s)

Pretrained model available?

Hi, first of all, thanks for the work and the project.
Are there any pretrained models for your implementation?
And if so, where can I find them?
thx

shape problem

I have an error when I run the basic,py 'ValueError: Error when checking input: expected X to have 3 dimensions, but got array with shape (41440, 1)' error.