A minimal Seq2Seq example of Automatic Speech Recognition (ASR) based on Transformer
Before launching training, you should download the train and test subsets of LRS2 and prepare the following three files in the format that train.py requires:
- ./data/LRS2/train.paths
- ./data/LRS2/train.text
- ./data/LRS2/train.lengths
Each line in train.paths is the local path of an audio file.
Each line in train.text is the corresponding transcript.
Each line in train.lengths is the length of the audio in seconds.
The following table shows a minimal example of the three files.
| train.paths | train.text | train.lengths |
|---|---|---|
| 1.wav | good morning | 1.6 |
| 2.wav | good afternoon | 2 |
| 3.wav | nice to meet you | 3.1 |
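The three manifest files above can be generated with a short script. A minimal sketch, assuming your utterances are already collected as (path, transcript, seconds) tuples; the sample entries below are hypothetical placeholders:

```python
# Sketch: write the three aligned manifest files that train.py reads.
# The sample entries are hypothetical; replace them with your own data.
import os

samples = [
    ("1.wav", "good morning", 1.6),
    ("2.wav", "good afternoon", 2.0),
    ("3.wav", "nice to meet you", 3.1),
]

out_dir = "./data/LRS2"
os.makedirs(out_dir, exist_ok=True)

# Line i of each file must describe the same utterance.
with open(os.path.join(out_dir, "train.paths"), "w") as f_paths, \
     open(os.path.join(out_dir, "train.text"), "w") as f_text, \
     open(os.path.join(out_dir, "train.lengths"), "w") as f_len:
    for path, text, seconds in samples:
        f_paths.write(path + "\n")
        f_text.write(text + "\n")
        f_len.write(f"{seconds}\n")
```

The key invariant is that the files are line-aligned: line i of train.paths, train.text, and train.lengths all refer to the same utterance.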
💡 If you have difficulty accessing the LRS2 dataset, you may use other ASR datasets, such as LibriSpeech or TEDLIUM-v3.
💡 Use torchaudio, ffmpeg, or any other tool to get the length of each audio file.
💡 If you are experiencing convergence issues, try subword-based tokenizers (ref) or more sophisticated feature extractors (e.g. a 1D ResNet).
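For WAV files, the duration can be computed without any extra dependencies. A minimal sketch using Python's stdlib wave module (torchaudio or ffprobe work equally well, and are needed for other formats); the function name is ours, not part of this repo:

```python
# Sketch: compute the length of a WAV file in seconds using only the stdlib.
import wave

def wav_duration_seconds(path: str) -> float:
    """Return the duration of a WAV file in seconds."""
    with wave.open(path, "rb") as w:
        # number of audio frames divided by frames per second
        return w.getnframes() / w.getframerate()
```

Apply it to every path listed in train.paths and write one value per line to train.lengths.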
Training: `python3 ./train.py`
Inference: `python3 ./test.py <ckpt_path>`
bingquanxia AT qq.com