TiramisuASR implements several speech recognition and speech enhancement architectures, such as CTC-based models (Deep Speech 2, etc.), the Speech Enhancement Generative Adversarial Network (SEGAN), and RNN Transducer models (Conformer, etc.). These models can be converted to TFLite to reduce memory and computation for deployment.
- Supported transducer TFLite greedy decoding (conversion and invocation)
- Distributed training using `tf.distribute.MirroredStrategy`
- Fixed transducer beam search
- Added `log_gammatone_spectrogram`
- CTCModel (end-to-end models using CTC loss for training)
- SEGAN (refer to https://github.com/santi-pdp/segan), see examples/segan
- Transducer Models (end-to-end models using RNNT loss for training)
  - Conformer Transducer (reference: https://arxiv.org/abs/2005.08100), see examples/conformer
- Ubuntu distribution (`ctc-decoders` and `semetrics` require some packages from apt)
- Python 3.6+
- TensorFlow 2.2+: `pip install tensorflow`
- Install gammatone: `pip3 install git+https://github.com/detly/gammatone.git`
- Install TensorFlow: `pip3 install tensorflow` or `pip3 install tf-nightly` (for using TFLite)
- Install packages: `python3 setup.py install`
- For setting up datasets, see datasets
- For training, testing and using CTC Models, run `./scripts/install_ctc_decoders.sh`
- For training Transducer Models, export `CUDA_HOME` and run `./scripts/install_rnnt_loss.sh`
- For testing the speech enhancement model (i.e. SEGAN), install `octave` and run `./scripts/install_semetrics.sh`
- The method `tiramisu_asr.utils.setup_environment()` automatically enables mixed precision if available.
- To enable XLA, run `TF_XLA_FLAGS=--tf_xla_auto_jit=2 $python_train_script`
- Clean up: `python3 setup.py clean --all` (this will remove the contents of `/build`)
After conversion to TFLite, the model acts as a function that maps an audio signal directly to unicode code points, which can then be converted to a string.
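That last decoding step can be sketched in plain Python; the code points below are illustrative placeholders, not real model output:

```python
# A TFLite-converted model emits unicode code points; joining the chr()
# values of those points recovers the transcript string.
code_points = [104, 101, 108, 108, 111]  # illustrative output
transcript = "".join(chr(cp) for cp in code_points)
print(transcript)  # hello
```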
- Install `tf-nightly` using `pip install tf-nightly`
- Build a model with the same architecture as the trained model (if the model has a `tflite` argument, you must set it to `True`), then load the weights from the trained model into the built model
- Load `TFSpeechFeaturizer` and `TextFeaturizer` into the model using the function `add_featurizers`
- Convert the model's function to TFLite as follows:

```python
func = model.make_tflite_function(greedy=True)  # or greedy=False
concrete_func = func.get_concrete_function()
converter = tf.lite.TFLiteConverter.from_concrete_functions([concrete_func])
converter.experimental_new_converter = True
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS,
                                       tf.lite.OpsSet.SELECT_TF_OPS]
tflite_model = converter.convert()
```
- Save the converted tflite model as follows:

```python
import os

os.makedirs(os.path.dirname(tflite_path), exist_ok=True)
with open(tflite_path, "wb") as tflite_out:
    tflite_out.write(tflite_model)
```
- The `.tflite` model is then ready to be deployed
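End to end, conversion and invocation can be sketched with a toy `tf.function` standing in for the trained model's exported function (a real TiramisuASR model returns unicode code points rather than the doubled samples used here, and `make_tflite_function` is what produces the function in practice):

```python
import numpy as np
import tensorflow as tf


# Toy stand-in for the model's exported function: it just doubles the
# samples, but it is converted and invoked the same way an ASR model is.
@tf.function(input_signature=[tf.TensorSpec([None], tf.float32)])
def toy_fn(signal):
    return signal * 2.0


# Convert the concrete function to a TFLite flatbuffer in memory.
converter = tf.lite.TFLiteConverter.from_concrete_functions(
    [toy_fn.get_concrete_function()])
tflite_model = converter.convert()

# Invocation mirrors how a converted model is called on a raw audio signal.
interpreter = tf.lite.Interpreter(model_content=tflite_model)
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

audio = np.array([0.1, 0.2, 0.3], dtype=np.float32)
interpreter.resize_tensor_input(inp["index"], audio.shape)
interpreter.allocate_tensors()
interpreter.set_tensor(inp["index"], audio)
interpreter.invoke()
result = interpreter.get_tensor(out["index"])
```

With a real model, `result` would hold the code points of the transcript instead of the doubled samples.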
See augmentations
Example YAML Config Structure

```yaml
speech_config: ...
model_config: ...
decoder_config: ...
learning_config:
  augmentations: ...
  dataset_config:
    train_paths: ...
    eval_paths: ...
    test_paths: ...
    tfrecords_dir: ...
  optimizer_config: ...
  running_config:
    batch_size: 8
    num_epochs: 20
    outdir: ...
    log_interval_steps: 500
```
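Once parsed (e.g. with `yaml.safe_load`), the config above becomes a nested mapping; a minimal sketch of reading its fields, with placeholder values filled in for the elided `...` entries:

```python
# Parsed form of the example config; placeholder values are illustrative only.
config = {
    "speech_config": {},
    "model_config": {},
    "decoder_config": {},
    "learning_config": {
        "augmentations": {},
        "dataset_config": {
            "train_paths": ["/data/train.tsv"],
            "eval_paths": [],
            "test_paths": [],
            "tfrecords_dir": "/data/tfrecords",
        },
        "optimizer_config": {},
        "running_config": {
            "batch_size": 8,
            "num_epochs": 20,
            "outdir": "/tmp/outdir",
            "log_interval_steps": 500,
        },
    },
}

# Training parameters live under learning_config -> running_config.
running = config["learning_config"]["running_config"]
print(running["batch_size"])  # 8
```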
See examples for some predefined ASR models.
For pretrained models, go to drive
| Name | Source | Hours |
| --- | --- | --- |
| LibriSpeech | LibriSpeech | 970h |
| Common Voice | https://commonvoice.mozilla.org | 1932h |
| Name | Source | Hours |
| --- | --- | --- |
| Vivos | https://ailab.hcmus.edu.vn/vivos | 15h |
| InfoRe Technology 1 | InfoRe1 (passwd: BroughtToYouByInfoRe) | 25h |
| InfoRe Technology 2 (used in VLSP2019) | InfoRe2 (passwd: BroughtToYouByInfoRe) | 415h |
| Name | Source | Hours |
| --- | --- | --- |
| Common Voice | https://commonvoice.mozilla.org/ | 750h |