This project forked from tensorspeech/tensorflowasr

A sweet automatic speech recognition toolkit, like tiramisu cake, built on TensorFlow 2. It supports languages with small character sets, such as English, Vietnamese, and German.

Home Page: https://huylenguyen.com/asr

License: Apache License 2.0

TiramisuASR 🍰

The Newest Automatic Speech Recognition in Tensorflow 2

TiramisuASR implements several speech recognition and speech enhancement architectures: CTC-based models (Deep Speech 2, etc.), RNN Transducer models (Conformer, etc.), and the Speech Enhancement Generative Adversarial Network (SEGAN). These models can be converted to TFLite to reduce memory use and computation at deployment 😄

What's New?

  • Support transducer tflite greedy decoding (conversion and invocation)
  • Distributed training using tf.distribute.MirroredStrategy
  • Fixed transducer beam search
  • Add log_gammatone_spectrogram

😋 Supported Models

Requirements

  • Ubuntu distribution (ctc-decoders and semetrics require some packages from apt)
  • Python 3.6+
  • Tensorflow 2.2+: pip install tensorflow

Setup Environment and Datasets

Install gammatone: pip3 install git+https://github.com/detly/gammatone.git

Install tensorflow: pip3 install tensorflow or pip3 install tf-nightly (for using tflite)

Install packages: python3 setup.py install

For setting up datasets, see datasets

  • For training, testing and using CTC Models, run ./scripts/install_ctc_decoders.sh

  • For training Transducer Models, export CUDA_HOME and run ./scripts/install_rnnt_loss.sh

  • For testing the Speech Enhancement Model (i.e., SEGAN), install octave and run ./scripts/install_semetrics.sh

  • The method tiramisu_asr.utils.setup_environment() automatically enables mixed_precision if available.

  • To enable XLA, run TF_XLA_FLAGS=--tf_xla_auto_jit=2 $python_train_script
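
The XLA flag above can also be set from Python when launching a training script. This is a minimal sketch using only the standard library; the helper names xla_env and run_training and the script path passed to them are illustrative assumptions, not part of TiramisuASR.

```python
import os
import subprocess
import sys


def xla_env(base_env=None):
    """Return a copy of the environment with XLA auto-JIT enabled."""
    env = dict(os.environ if base_env is None else base_env)
    # Same flag as in the shell command: TF_XLA_FLAGS=--tf_xla_auto_jit=2
    # Note: this overwrites any TF_XLA_FLAGS value already present.
    env["TF_XLA_FLAGS"] = "--tf_xla_auto_jit=2"
    return env


def run_training(script_path, *script_args):
    """Run a training script (hypothetical path) in a subprocess with XLA enabled."""
    cmd = [sys.executable, script_path, *script_args]
    return subprocess.run(cmd, env=xla_env(), check=True)
```

Setting the variable only for the child process keeps the flag from leaking into other TensorFlow programs started from the same shell.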

Clean up: python3 setup.py clean --all (this will remove /build contents)

TFLite Conversion

After conversion, the TFLite model behaves like a function that maps an audio signal directly to Unicode code points, which can then be converted to a string.
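
Turning code points back into text needs only the built-in chr. The sketch below assumes the model output has already been read into a plain sequence of integers, and that zero values are padding to be skipped; both points are assumptions, and code_points_to_text is an illustrative name.

```python
def code_points_to_text(code_points):
    """Join Unicode code points into a string, skipping zeros (assumed padding)."""
    return "".join(chr(int(cp)) for cp in code_points if int(cp) > 0)


print(code_points_to_text([104, 101, 108, 108, 111, 0, 0]))  # prints "hello"
```

Because the output is raw code points, accented characters such as Vietnamese "à" (code point 224) need no extra vocabulary lookup.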

  1. Install tf-nightly using pip install tf-nightly
  2. Build a model with the same architecture as the trained model (if the model has a tflite argument, you must set it to True), then load the trained weights into it
  3. Attach TFSpeechFeaturizer and TextFeaturizer to the model using the add_featurizers function
  4. Convert model's function to tflite as follows:
import tensorflow as tf

# `model` is the built model with weights loaded and featurizers attached (steps 2-3)
func = model.make_tflite_function(greedy=True)  # or False
concrete_func = func.get_concrete_function()
converter = tf.lite.TFLiteConverter.from_concrete_functions([concrete_func])
converter.experimental_new_converter = True
converter.optimizations = [tf.lite.Optimize.DEFAULT]
# Allow TensorFlow ops that have no TFLite builtin equivalent
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS,
                                       tf.lite.OpsSet.SELECT_TF_OPS]
tflite_model = converter.convert()
  5. Save the converted tflite model as follows:
import os

# Create the output directory if the path has one, then write the serialized model
out_dir = os.path.dirname(tflite_path)
if out_dir:
    os.makedirs(out_dir, exist_ok=True)
with open(tflite_path, "wb") as tflite_out:
    tflite_out.write(tflite_model)
  6. Then the .tflite model is ready to be deployed
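
A converted model can be invoked with the TFLite interpreter. The following is a minimal sketch, assuming the converted function takes a single 1-D float32 waveform and returns a single tensor of Unicode code points; the name run_tflite and its arguments are illustrative, not part of TiramisuASR. TensorFlow and NumPy are imported inside the function so the module loads even where they are not installed.

```python
def run_tflite(tflite_path, signal):
    """Run a .tflite model on a waveform and return the output code points as ints."""
    import numpy as np
    import tensorflow as tf

    interpreter = tf.lite.Interpreter(model_path=tflite_path)
    input_details = interpreter.get_input_details()
    output_details = interpreter.get_output_details()

    signal = np.asarray(signal, dtype=np.float32)
    # Utterance length varies, so resize the input tensor before allocating memory.
    interpreter.resize_tensor_input(input_details[0]["index"], signal.shape)
    interpreter.allocate_tensors()
    interpreter.set_tensor(input_details[0]["index"], signal)
    interpreter.invoke()

    output = interpreter.get_tensor(output_details[0]["index"])
    return [int(cp) for cp in output.flatten()]
```

The returned integers can be joined into text with "".join(chr(cp) for cp in code_points).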

Features Extraction

See features_extraction

Augmentations

See augmentations

Training & Testing

Example YAML Config Structure

speech_config: ...
model_config: ...
decoder_config: ...
learning_config:
  augmentations: ...
  dataset_config:
    train_paths: ...
    eval_paths: ...
    test_paths: ...
    tfrecords_dir: ...
  optimizer_config: ...
  running_config:
    batch_size: 8
    num_epochs: 20
    outdir: ...
    log_interval_steps: 500
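
Before training, it can help to check that a loaded config has the sections shown above. This is a sketch under assumptions: the YAML is assumed to be already parsed into a dict (for example with a YAML library), treating these keys as required is a guess based on the example structure, and missing_keys plus every value in the example dict are hypothetical.

```python
# Keys taken from the example structure; treating them as required is an assumption.
TOP_LEVEL_KEYS = ("speech_config", "model_config", "decoder_config", "learning_config")
LEARNING_KEYS = ("dataset_config", "optimizer_config", "running_config")
RUNNING_KEYS = ("batch_size", "num_epochs", "outdir", "log_interval_steps")


def missing_keys(config):
    """Return dotted paths of expected keys that are absent from a config dict."""
    missing = [key for key in TOP_LEVEL_KEYS if key not in config]
    learning = config.get("learning_config") or {}
    missing += ["learning_config." + key for key in LEARNING_KEYS if key not in learning]
    running = learning.get("running_config") or {}
    missing += ["learning_config.running_config." + key
                for key in RUNNING_KEYS if key not in running]
    return missing


# Placeholder values for illustration only.
example = {
    "speech_config": {},
    "model_config": {},
    "decoder_config": {},
    "learning_config": {
        "augmentations": {},
        "dataset_config": {},
        "optimizer_config": {},
        "running_config": {
            "batch_size": 8,
            "num_epochs": 20,
            "outdir": "/tmp/asr_run",
            "log_interval_steps": 500,
        },
    },
}
print(missing_keys(example))  # prints []
```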

See examples for some predefined ASR models.

Corpus Sources and Pretrained Models

For pretrained models, go to drive

English

Name | Source | Hours
LibriSpeech | LibriSpeech | 970h
Common Voice | https://commonvoice.mozilla.org | 1932h

Vietnamese

Name | Source | Hours
Vivos | https://ailab.hcmus.edu.vn/vivos | 15h
InfoRe Technology 1 | InfoRe1 (passwd: BroughtToYouByInfoRe) | 25h
InfoRe Technology 2 (used in VLSP2019) | InfoRe2 (passwd: BroughtToYouByInfoRe) | 415h

German

Name | Source | Hours
Common Voice | https://commonvoice.mozilla.org/ | 750h

References & Credits

  1. NVIDIA OpenSeq2Seq Toolkit
  2. https://github.com/santi-pdp/segan
  3. https://github.com/noahchalifour/warp-transducer
  4. Sequence Transduction with Recurrent Neural Network
  5. End-to-End Speech Processing Toolkit in PyTorch

Contributors

nglehuy, pquochuy
