
adroitanandai / indian-accent-speech-recognition


Traditional ASR (Signal & Cepstral Analysis, DTW, HMM) & DNNs (Custom Models + DeepSpeech) on Indian Accent Speech

Home Page: https://towardsdatascience.com/indian-accent-speech-recognition-2d433eb7edac

License: Creative Commons Zero v1.0 Universal

Languages: Jupyter Notebook 99.08%, Python 0.92%
Topics: accent, accented-speech, asr, cepstral-analysis, custom-training, deepspeech, dnn, hmm, indian, indian-language

indian-accent-speech-recognition's Introduction

Indian Accent Speech Recognition

Traditional ASR (Signal Analysis, MFCC, DTW, HMM & Language Modelling) and DNNs (Custom Models & Baidu DeepSpeech Model) on Indian Accent Speech

<< Uploaded the pre-trained model owing to requests >>
The generated trie file has been uploaded to the pre-trained-models directory, so you can skip the KenLM toolkit step.

To understand the context, theory and explanation of this project, head over to my blog:
https://towardsdatascience.com/indian-accent-speech-recognition-2d433eb7edac

How to Use?

Starter code to use the model is given in Starter.ipynb. You can run it in Google Colab if you upload the 3 files (given in the params) to your Google Drive.

  • Install DeepSpeech 0.6.1.
  • Download the pre-trained model (.pbmm), language model and trie file. Download instructions are given in the pre-trained-models folder.
  • After downloading, pass the files as arguments:
!deepspeech --model speech/output_graph.pbmm --lm speech/lm.binary --trie speech/trie --audio /content/06_M_artic_01_004.wav

If you run into an issue while loading the pre-trained model, it is most likely due to your DeepSpeech version (this model was built with 0.6.1).
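For reference, below is a minimal sketch of the same call through the DeepSpeech 0.6.1 Python API. The model, lm and trie paths and the WAV file are the ones from the command above; the beam width (500) is an illustrative assumption, and 0.75/1.85 are the lm_alpha/lm_beta values used later in this README.

# Minimal sketch: load the model and decode one WAV with DeepSpeech 0.6.1.
# Beam width (500) is an illustrative assumption.
import wave
import numpy as np
from deepspeech import Model

ds = Model('speech/output_graph.pbmm', 500)                # acoustic model + beam width
ds.enableDecoderWithLM('speech/lm.binary', 'speech/trie',  # language model + trie
                       0.75, 1.85)                         # lm_alpha, lm_beta

with wave.open('/content/06_M_artic_01_004.wav', 'rb') as w:
    audio = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)

print(ds.stt(audio))                                       # transcription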

Contents:

  • vui_notebook.ipynb: DNN Custom Models and Comparative Analysis to make a custom Speech Recognition model.
  • DeepSpeech_Training.ipynb: Retraining of DeepSpeech Model with Indian Accent Voice Data.
  • Training_Instructions.docx: Instructions to train DeepSpeech model.

Data Source/ Training Data:

Indic TTS Project: Downloaded 50+ GB of the Indic TTS voice DB from the Speech and Music Technology Lab, IIT Madras, which comprises 10,000+ spoken sentences from 20+ states (both male and female native speakers).

https://www.iitm.ac.in/donlab/tts/index.php

You can also record your own audio or let ebook reader apps read a document, but I found that to be insufficient to train such a heavy model. I then requested the support of the IIT Madras Speech Lab, who kindly granted access to their voice database.

DNN Custom Models for Speech Recognition:

Model 1: CNN + RNN + TimeDistributed Dense

Model 2: Deeper RNN + TimeDistributed Dense

Comparison: Training Loss & Validation Loss of Model 1 (CNN) & Model 2 (RNN)

Model 3: Pooled CNN+Deep Bidirectional RNN +Time-distributed Dense
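For a sense of the architectures above, here is a minimal Keras sketch of Model 1's shape (CNN + RNN + TimeDistributed Dense). The layer sizes and the 29-symbol character output are illustrative assumptions, not the notebook's exact hyperparameters.

# Minimal sketch of a CNN + RNN + TimeDistributed Dense acoustic model.
# Layer sizes are illustrative assumptions, not the notebook's values.
from tensorflow.keras.layers import (Input, Conv1D, BatchNormalization,
                                     GRU, TimeDistributed, Dense, Activation)
from tensorflow.keras.models import Model

def cnn_rnn_model(input_dim=161, filters=200, kernel_size=11,
                  units=200, output_dim=29):
    # input_dim: spectrogram features per frame; output_dim: chars + blank
    inputs = Input(shape=(None, input_dim))
    x = Conv1D(filters, kernel_size, strides=2, padding='same',
               activation='relu')(inputs)        # local feature extraction
    x = BatchNormalization()(x)
    x = GRU(units, return_sequences=True)(x)     # temporal modelling
    x = BatchNormalization()(x)
    x = TimeDistributed(Dense(output_dim))(x)    # per-frame character scores
    outputs = Activation('softmax')(x)
    return Model(inputs, outputs)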

DeepSpeech Model Training:

These are the high-level steps:

  • Get a pre-trained model.
  • Load the Indian Accent English Speech dataset.
  • Convert the data to the input format the DeepSpeech model expects.
  • Compare the trained model with the DeepSpeech base model to validate the improvement.

Step-by-step instructions

  • The dataset contains the audio and its description, but to load the data into the DeepSpeech model we need to generate a CSV containing each audio file's path, its transcription and its file size (see the sketch after this list).
  • Split the CSV file into 3 parts: test.csv, train.csv and valid.csv.
  • Write a Python program to set the frame rate of all audio files to 16,000 Hz (the DeepSpeech model requirement).
  • Clone version 0.4.1 of the Baidu DeepSpeech project.
  • Execute DeepSpeech.py with the appropriate parameters (given below).
  • The export_dir will contain output_graph.pbmm, which you load via the deepspeech Model() function.
  • The KenLM toolkit is used to generate the trie file, which must be passed to the DeepSpeech decoder function.
  • model.enableDecoderWithLM(lm_file, trie, 0.75, 1.85): lm_file is the KenLM language model binary (lm.binary) and trie is the output of the KenLM toolkit.
  • Use the DeepSpeech stt function to do the speech-to-text conversion.
./DeepSpeech.py --train_files ../data/CV/en/clips/train.csv --dev_files ../data/CV/en/clips/dev.csv --test_files ../data/CV/en/clips/test.csv
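As referenced in the step list above, here is a minimal sketch of the CSV-generation and resampling steps. The build_csv helper and the pydub dependency are assumptions, not code from this repo; the column order (wav_filename, wav_filesize, transcript) follows the DeepSpeech importer format.

# Minimal sketch (assumed helper, not from this repo): resample clips to
# 16 kHz 16-bit mono and emit a DeepSpeech-style CSV.
import csv
import os
from pydub import AudioSegment

def build_csv(pairs, csv_path, out_dir, rate=16000):
    # pairs: iterable of (source_wav_path, transcript) tuples
    os.makedirs(out_dir, exist_ok=True)
    with open(csv_path, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['wav_filename', 'wav_filesize', 'transcript'])
        for src, transcript in pairs:
            dst = os.path.join(out_dir, os.path.basename(src))
            clip = AudioSegment.from_wav(src)
            clip = clip.set_frame_rate(rate).set_channels(1).set_sample_width(2)
            clip.export(dst, format='wav')
            writer.writerow([dst, os.path.getsize(dst), transcript.lower()])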

To fine-tune the entire graph using data in train.csv, dev.csv and test.csv for 3 epochs, we can set the hyperparameters as below:

python3 DeepSpeech.py --n_hidden 2048 --checkpoint_dir path/to/checkpoint/folder --epochs 3 --train_files my-train.csv --dev_files my-dev.csv --test_files my-test.csv --learning_rate 0.0001

Hyperparameters for Training:

python -u DeepSpeech.py \
   --train_files /home/prem/ds_project/datavoice/data_voice_train.csv \
   --test_files /home/prem/ds_project/datavoice/data_voice_test.csv \
   --dev_files /home/prem/ds_project/datavoice/data_voice_dev.csv \
   --n_hidden 2048 \
   --epoch 100 \
   --use_seq_length False \
   --checkpoint_dir /home/prem/ds_project/datavoice/checkpoints/ \
   --learning_rate 0.0001 \
   --export_dir /home/prem/ds_project/datavoice/model_export/ \
   --train_batch_size 64 \
   --test_batch_size 32 \
   --dev_batch_size 32

Comparing the Indian Accent English Model with the DeepSpeech Base Model

To check accuracy, we used 3 metrics: WER (word error rate), WAcc (word accuracy) and BLEU score.

The metrics show that the trained model performs much better for Indian Accent English.
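A minimal sketch of the three metrics is given below, using a plain Levenshtein-distance WER (with WAcc = 1 - WER) and NLTK's sentence_bleu. This is illustrative code, not the repo's evaluation script.

# Minimal sketch of the 3 metrics: WER, WAcc (= 1 - WER) and BLEU.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def wer(reference, hypothesis):
    # word error rate via Levenshtein distance over words
    r, h = reference.split(), hypothesis.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(r)][len(h)] / len(r)

ref, hyp = "the weather is nice today", "the weather was nice today"
print('WER :', wer(ref, hyp))       # 0.2
print('WAcc:', 1 - wer(ref, hyp))   # 0.8
print('BLEU:', sentence_bleu([ref.split()], hyp.split(),
                             smoothing_function=SmoothingFunction().method1))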

Model Comparison Results:

Let's plot the above metrics, feeding the Indian Accent Speech test set to both the DeepSpeech pre-trained model and our trained model. The 3 bins in the graphs below represent low, medium and high accuracy, from left to right.

DeepSpeech Base Model: most data points are classified as "Low Accuracy" on all 3 metrics.

Trained Model: most data points are classified as "Medium & High Accuracy" on all 3 metrics.

The above comparison shows that the trained model performs much better for Indian Accent Speech Recognition than the DeepSpeech base model.

Conclusion

'Cepstral Analysis' separates out the accent components in speech signals while doing feature extraction (MFCC) in traditional ASR. In state-of-the-art deep neural networks, features are learnt intrinsically. Hence, we can transfer-learn a pre-trained model with multiple accents, letting the model learn the accent peculiarities on its own.
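To illustrate the MFCC extraction step mentioned above, a minimal sketch using librosa (an assumed dependency; 'sample.wav' and the 13-coefficient choice are placeholders):

# Minimal sketch: extract MFCC (cepstral) features from a clip with librosa.
import librosa

audio, sr = librosa.load('sample.wav', sr=16000)        # mono audio at 16 kHz
mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13)  # cepstral features
print(mfcc.shape)                                       # (13, number_of_frames)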

We have demonstrated this by transfer-learning Baidu's DeepSpeech pre-trained model on Indian-English speech data from multiple states. You can easily extend the approach to any root language or locale accent as well.

If you have any query or suggestion, you can reach me here. https://www.linkedin.com/in/ananduthaman/

References

[1] https://www.iitm.ac.in/donlab/tts/database.php
[2] https://www.udacity.com/course/natural-language-processing-nanodegree--nd892


indian-accent-speech-recognition's Issues

Missing licence file

I am assuming that the code and model provided in this public repository are available for reuse in other applications and personal projects, but it would be great to have a licence such as MIT added to the repository to make that decision easier.

Cannot find pre-trained model

Hi,
I'm trying to access your pre-trained model, but it says the "file is in owner's trash". Can you please re-upload it?
Thanks

deepspeech: error: argument --lm_alpha: invalid float value: 'speech/lm.binary'

Hi,
Thank you for the article and the model. But when I run the command given in the doc, I get
deepspeech: error: ambiguous option: --lm could match --lm_alpha, --lm_beta
and on adding lm_alpha like so:
deepspeech --model speech/output_graph.pbmm --lm speech/lm.binary --trie speech/trie --audio content/8455-210777-0068.wav
I get
deepspeech: error: argument --lm_alpha: invalid float value: 'speech/lm.binary'
My DeepSpeech version is 0.8.2.
I am kinda new at this, and any help will be appreciated.
Thank you

Pre-trained model is not available

Hello sir, "output_graph.pbmm" is currently not available in your repository. Can you please make it available or provide the file? It is the pre-trained model and is needed for this project. Please accept this request.

Provide Pretrained model

Hello,
Loved your article and your work. Can you provide the trained model, along with access to the raw data, for reference?
Really appreciate it.

CreateModel failed with error code 15

Very good article, and I hope the model is good as well. However, I am not able to run it due to an error. I have downloaded the model you put in this repo. Please suggest:

deepspeech --model output_graph_indian_test.pbmm --lm models/lm.binary --audio chunk02_speaker0_16.wav --alphabet models/alphabet.txt --trie models/trie

TensorFlow: v1.12.0-10-ge232881
DeepSpeech: v0.4.1-0-g0e40db6
Data loss: Corrupted memmapped model file: output_graph_indian_test.pbmm Invalid directory offset
Traceback (most recent call last):
  File "/anaconda3/envs/berttorchenv_dev/bin/deepspeech", line 8, in <module>
    sys.exit(main())
  File "/anaconda3/envs/berttorchenv_dev/lib/python3.6/site-packages/deepspeech/client.py", line 80, in main
    ds = Model(args.model, N_FEATURES, N_CONTEXT, args.alphabet, BEAM_WIDTH)
  File "/anaconda3/envs/berttorchenv_dev/lib/python3.6/site-packages/deepspeech/__init__.py", line 14, in __init__
    raise RuntimeError("CreateModel failed with error code {}".format(status))
RuntimeError: CreateModel failed with error code 15

Bundle and release model checkpoints

Hi @AdroitAnandAI! This is some impressive work, and it would be great to make it more usable by releasing model checkpoints as well. The current protobuf model files are harder to use with newer DeepSpeech releases, but checkpoints can be more easily re-formatted.

Does this also support voice cloning with indian accent?

Greetings ,

The repo is awesome, and thanks for the Google Colab implementation; it's easy to use. After testing the repo, I got the output from the .wav file as text (as mentioned), but I am also curious: can I do a real-time voice clone of my own voice on the generated text in an Indian accent?

Any help/guidance would be really helpful,
Thanks,
satyam.
