Speech recognition using Google's TensorFlow deep learning framework, with sequence-to-sequence neural networks.
Replaces caffe-speech-recognition; see there for some background.
The goal is a decent standalone speech recognition system for Linux and other platforms. Some people say we have the models but not enough training data. We disagree: there is plenty of training data (100 GB here, on Project Gutenberg, synthetic text-to-speech snippets, movies with transcripts, YouTube videos with captions, etc.); we just need a simple yet powerful model. It's only a question of time...
Sample spectrogram: Karen uttering 'zero' at 160 words per minute.
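A spectrogram like the one above can be computed from a raw waveform with a short-time Fourier transform. The sketch below uses plain NumPy; the 16 kHz sample rate and the 25 ms / 10 ms framing are assumptions for illustration, not the repo's actual preprocessing parameters.

```python
import numpy as np

def spectrogram(signal, frame_len=400, hop=160):
    """Magnitude spectrogram of a 1-D audio signal.

    frame_len=400 and hop=160 correspond to 25 ms windows with a
    10 ms step at an assumed 16 kHz sample rate.
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    # Real FFT of each windowed frame -> shape (n_frames, frame_len // 2 + 1)
    return np.abs(np.fft.rfft(frames, axis=1))

# Example: one second of a synthetic 440 Hz tone at 16 kHz
t = np.arange(16000) / 16000.0
spec = spectrogram(np.sin(2 * np.pi * 440 * t))
```

For real recordings you would load the waveform from a WAV file first (e.g. with the standard-library `wave` module) and usually take the log of the magnitudes before feeding them to a network.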
Toy examples:
./number_classifier_tflearn.py
./speaker_classifier_tflearn.py
Some less trivial architectures:
./densenet_layer.py
Later:
./train.sh
./record.py
We are in the process of tackling this project in earnest. Drop an email to [email protected] if you want to join the party.
Update: Nervana demonstrated that it is possible for 'independents' to build state-of-the-art models. Unfortunately, they didn't open-source the software. Sphinx has started using TensorFlow LSTMs.
### Fun tasks for newcomers
- Data augmentation: create on-the-fly modulations of the data: speed up the speech, add background noise, alter the pitch, etc.
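The augmentations above can be prototyped directly on the waveform. This is a minimal NumPy sketch, not the repo's implementation: speed change via linear-interpolation resampling, additive white noise at a chosen SNR, and a crude pitch shift by resampling and trimming back to the original length. The function names and parameters are illustrative assumptions.

```python
import numpy as np

def change_speed(signal, factor):
    """Resample by linear interpolation; factor > 1 speeds up (output is shorter)."""
    old_idx = np.arange(len(signal))
    new_idx = np.arange(0, len(signal), factor)
    return np.interp(new_idx, old_idx, signal)

def add_noise(signal, snr_db=20.0, seed=0):
    """Mix in white noise at the given signal-to-noise ratio (in dB)."""
    rng = np.random.default_rng(seed)
    signal_power = np.mean(signal ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    return signal + rng.normal(0.0, np.sqrt(noise_power), len(signal))

def shift_pitch(signal, semitones):
    """Crude pitch shift: resample, then pad/trim back to the original
    length. This raises the pitch but slightly distorts duration at
    the edges; a production version would use a phase vocoder."""
    factor = 2 ** (semitones / 12)
    shifted = change_speed(signal, factor)
    if len(shifted) >= len(signal):
        return shifted[:len(signal)]
    return np.pad(shifted, (0, len(signal) - len(shifted)))

# Example on a synthetic 220 Hz tone (assumed 16 kHz sample rate)
t = np.arange(16000) / 16000.0
clean = np.sin(2 * np.pi * 220 * t)
fast = change_speed(clean, 1.1)
noisy = add_noise(clean, snr_db=10)
higher = shift_pitch(clean, 2)
```

Applied on the fly inside the input pipeline, each training example can be perturbed differently in every epoch, effectively multiplying the training data.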
### Extensions
Extensions to the current TensorFlow which are probably needed:
- Sliding window GPU implementation
- Continuous densenet->seq2seq adaptation
- Modular graphs/models + persistence
- Incremental collaborative snapshots ('P2P learning')