End-to-End Speech Processing Toolkit

Home Page: https://espnet.github.io/espnet/

License: Apache License 2.0

deep-learning end-to-end chainer pytorch kaldi speech-recognition speech-synthesis speech-translation machine-translation voice-conversion


ESPnet: end-to-end speech processing toolkit

CI status (system / Python environment, tested against PyTorch 1.12.1, 1.13.1, 2.0.1, and 2.1.0 where applicable):

  • ubuntu / python3.10 / pip
  • ubuntu / python3.9 / pip
  • ubuntu / python3.8 / pip
  • ubuntu / python3.7 / pip
  • debian11 / python3.10 / conda
  • centos7 / python3.10 / conda
  • windows / python3.10 / pip
  • macos / python3.10 / pip
  • macos / python3.10 / conda

ESPnet is an end-to-end speech processing toolkit covering end-to-end speech recognition, text-to-speech, speech translation, speech enhancement, speaker diarization, spoken language understanding, and more. ESPnet uses PyTorch as its deep learning engine and follows Kaldi-style data processing, feature extraction/formats, and recipes to provide a complete setup for various speech processing experiments.

Tutorial Series

Key Features

Kaldi-style complete recipe

  • Supports a number of ASR recipes (WSJ, Switchboard, CHiME-4/5, Librispeech, TED, CSJ, AMI, HKUST, Voxforge, REVERB, Gigaspeech, etc.)
  • Supports a number of TTS recipes in a similar manner to the ASR recipes (LJSpeech, LibriTTS, M-AILABS, etc.)
  • Supports a number of ST recipes (Fisher-CallHome Spanish, Libri-trans, IWSLT'18, How2, Must-C, Mboshi-French, etc.)
  • Supports a number of MT recipes (IWSLT'14, IWSLT'16, the above ST recipes, etc.)
  • Supports a number of SLU recipes (CATSLU-MAPS, FSC, Grabo, IEMOCAP, JDCINAL, SNIPS, SLURP, SWBD-DA, etc.)
  • Supports a number of SE/SS recipes (DNS-IS2020, LibriMix, SMS-WSJ, VCTK-noisyreverb, WHAM!, WHAMR!, WSJ-2mix, etc.)
  • Supports a voice conversion recipe (VCC2020 baseline)
  • Supports speaker diarization recipes (mini_librispeech, librimix)
  • Supports singing voice synthesis recipes (ofuton_p_utagoe_db, opencpop, m4singer, etc.)

ASR: Automatic Speech Recognition

  • State-of-the-art performance in several ASR benchmarks (comparable/superior to hybrid DNN/HMM and CTC)
  • Hybrid CTC/attention based end-to-end ASR
    • Fast/accurate training with CTC/attention multitask training
    • CTC/attention joint decoding to boost monotonic alignment decoding
    • Encoder: VGG-like CNN + BiRNN (LSTM/GRU), sub-sampling BiRNN (LSTM/GRU), Transformer, Conformer, Branchformer, or E-Branchformer
    • Decoder: RNN (LSTM/GRU), Transformer, or S4
  • Attention: Dot product, location-aware attention, variants of multi-head
  • Incorporate RNNLM/LSTMLM/TransformerLM/N-gram trained only with text data
  • Batch GPU decoding
  • Data augmentation
  • Transducer based end-to-end ASR
    • Architecture:
      • Custom encoder supporting RNNs, Conformer, Branchformer (w/ variants), 1D Conv / TDNN.
      • Decoder w/ parameters shared across blocks supporting RNN, stateless w/ 1D Conv, MEGA, and RWKV.
      • Pre-encoder: VGG2L or Conv2D available.
    • Search algorithms:
    • Features:
      • Unified interface for offline and streaming speech recognition.
      • Multi-task learning with various auxiliary losses:
        • Encoder: CTC, auxiliary Transducer and symmetric KL divergence.
        • Decoder: cross-entropy w/ label smoothing.
      • Transfer learning with an acoustic model and/or language model.
      • Training with FastEmit regularization method [Yu et al., 2021].

    Please refer to the tutorial page for complete documentation.

  • CTC segmentation
  • Non-autoregressive model based on Mask-CTC
  • ASR examples for supporting endangered language documentation (Please refer to egs/puebla_nahuatl and egs/yoloxochitl_mixtec for details)
  • Wav2Vec2.0 pre-trained model as Encoder, imported from FairSeq.
  • Self-supervised learning representations as features, using upstream models in S3PRL in frontend.
    • Set frontend to s3prl
    • Select any upstream model by setting the frontend_conf to the corresponding name.
  • Transfer Learning:
  • Streaming Transformer/Conformer ASR with blockwise synchronous beam search.
  • Restricted Self-Attention based on Longformer as an encoder for long sequences
  • OpenAI Whisper model, robust ASR based on large-scale, weakly-supervised multitask learning

Demonstration
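
As a rough sketch of how a pre-trained ESPnet2 ASR model can be used from Python (a minimal example, assuming the espnet_model_zoo package; the model tag here is the WSJ model used in the CTC segmentation section below, and other ASR model tags work similarly):

# Minimal ESPnet2 ASR inference sketch with a pre-trained model from the model zoo
import soundfile
from espnet_model_zoo.downloader import ModelDownloader
from espnet2.bin.asr_inference import Speech2Text

d = ModelDownloader(cachedir="./modelcache")
# download_and_unpack returns the config/model paths expected by Speech2Text
speech2text = Speech2Text(**d.download_and_unpack("kamo-naoyuki/wsj"), device="cpu")

# The sampling rate of the WAV file must match that of the training data
speech, rate = soundfile.read("example.wav")
nbests = speech2text(speech)
text, tokens, token_ints, hypothesis = nbests[0]
print(text)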

TTS: Text-to-speech

  • Architecture
    • Tacotron2
    • Transformer-TTS
    • FastSpeech
    • FastSpeech2
    • Conformer FastSpeech & FastSpeech2
    • VITS
    • JETS
  • Multi-speaker & multi-language extension
    • Pre-trained speaker embedding (e.g., X-vector)
    • Speaker ID embedding
    • Language ID embedding
    • Global style token (GST) embedding
    • Mix of the above embeddings
  • End-to-end training
    • End-to-end text-to-wav model (e.g., VITS, JETS, etc.)
    • Joint training of text2mel and vocoder
  • Various language support
    • En / Jp / Zh / De / Ru / and more...
  • Integration with neural vocoders
    • Parallel WaveGAN
    • MelGAN
    • Multi-band MelGAN
    • HiFiGAN
    • StyleMelGAN
    • Mix of the above models

Demonstration

To train the neural vocoder, please check the following repositories:

SE: Speech enhancement (and separation)

  • Single-speaker speech enhancement
  • Multi-speaker speech separation
  • Unified encoder-separator-decoder structure for time-domain and frequency-domain models
  • Flexible ASR integration: working as an individual task or as the ASR frontend
  • Easy to import pre-trained models from Asteroid
    • Both the pre-trained models from Asteroid and the specific configuration are supported.

Demonstration

  • Interactive SE demo with ESPnet2 Open In Colab
  • Streaming SE demo with ESPnet2 Open In Colab

ST: Speech Translation & MT: Machine Translation

  • State-of-the-art performance in several ST benchmarks (comparable/superior to cascaded ASR and MT)
  • Transformer-based end-to-end ST (new!)
  • Transformer-based end-to-end MT (new!)

VC: Voice conversion

  • Transformer and Tacotron2-based parallel VC using Mel spectrogram
  • End-to-end VC based on cascaded ASR+TTS (Baseline system for Voice Conversion Challenge 2020!)

SLU: Spoken Language Understanding

  • Architecture
    • Transformer-based Encoder
    • Conformer-based Encoder
    • Branchformer-based Encoder
    • E-Branchformer-based Encoder
    • RNN-based Decoder
    • Transformer-based Decoder
  • Support Multitasking with ASR
    • Predict both intent and ASR transcript
  • Support Multitasking with NLU
    • Deliberation-encoder-based two-pass model
  • Support using pre-trained ASR models
    • Hubert
    • Wav2vec2
    • VQ-APC
    • TERA and more ...
  • Support using pre-trained NLP models
    • BERT
    • MPNet and more...
  • Various language support
    • En / Jp / Zh / Nl / and more...
  • Supports using context from previous utterances
  • Supports using other tasks like SE in a pipeline manner
  • Supports two-pass SLU that combines audio and the ASR transcript

Demonstration

  • Performing noisy spoken language understanding using a speech enhancement model followed by a spoken language understanding model. Open In Colab
  • Performing two-pass spoken language understanding where the second pass model attends to both acoustic and semantic information. Open In Colab
  • Integrated into Hugging Face Spaces via Gradio. See the SLU demo on multiple languages: Hugging Face Spaces

SUM: Speech Summarization

  • End to End Speech Summarization Recipe for Instructional Videos using Restricted Self-Attention [Sharma et al., 2022]

SVS: Singing Voice Synthesis

  • Framework merge from Muskits
  • Architecture
    • RNN-based non-autoregressive model
    • Xiaoice
    • Tacotron-singing
    • DiffSinger (in progress)
    • VISinger
    • VISinger 2 (and its variants with different vocoder architectures)
  • Support multi-speaker & multilingual singing synthesis
    • Speaker ID embedding
    • Language ID embedding
  • Various language support
    • Jp / En / Kr / Zh
  • Tight integration with neural vocoders (the same as TTS)

SSL: Self-supervised Learning

UASR: Unsupervised ASR (EURO: ESPnet Unsupervised Recognition - Open-source)

  • Architecture
    • wav2vec-U (with different self-supervised models)
    • wav2vec-U 2.0 (in progress)
  • Support PrefixBeamSearch and K2-based WFST decoding

S2T: Speech-to-text with Whisper-style multilingual multitask models

  • Reproduces Whisper-style training from scratch using public data: OWSM
  • Supports multiple tasks in a single model
    • Multilingual speech recognition
    • Any-to-any speech translation
    • Language identification
    • Utterance-level timestamp prediction (segmentation)

DNN Framework

  • Flexible network architecture thanks to Chainer and PyTorch
  • Flexible front-end processing thanks to kaldiio and HDF5 support
  • Tensorboard-based monitoring

ESPnet2

See ESPnet2.

  • Independent of Kaldi/Chainer, unlike ESPnet1
  • On-the-fly feature extraction and text processing during training
  • Supports both DistributedDataParallel and DataParallel
  • Supports multi-node training, integrated with Slurm or MPI
  • Supports sharded training provided by fairscale
  • A template recipe that can be applied to all corpora
  • Possible to train on any size of corpus without CPU memory errors
  • ESPnet Model Zoo
  • Integrated with wandb

Installation

  • If you intend to do full experiments, including DNN training, then see Installation.

  • If you only need the Python module:

    # We recommend installing PyTorch before espnet, following https://pytorch.org/get-started/locally/
    pip install espnet
    # To install the latest development version
    # pip install git+https://github.com/espnet/espnet
    # To install additional packages
    # pip install "espnet[all]"

    If you use ESPnet1, please install chainer and cupy.

    pip install chainer==6.0.0 cupy==6.0.0    # [Option]

    You might need to install some packages depending on each task. We prepared various installation scripts at tools/installers. A quick post-install sanity check is sketched after this list.

  • (ESPnet2) Once installed, run wandb login and set --use_wandb true to enable tracking runs using W&B.
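
After installation, a quick sanity check from Python looks like the following (a minimal sketch; the printed versions and CUDA availability depend on your environment):

# Check that the espnet module and its PyTorch backend are importable
import torch
import espnet

print("ESPnet version:", espnet.__version__)
print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())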

Docker Container

Go to docker/ and follow the instructions.

Contribution

Thank you for taking the time to contribute to ESPnet! Any contributions to ESPnet are welcome; feel free to ask questions or make requests in issues. If this is your first ESPnet contribution, please follow the contribution guide.

ASR results


We list the character error rate (CER) and word error rate (WER) of major ASR tasks.

Task CER (%) WER (%) Pre-trained model
Aishell dev/test 4.6/5.1 N/A link
ESPnet2 Aishell dev/test 4.1/4.4 N/A link
Common Voice dev/test 1.7/1.8 2.2/2.3 link
CSJ eval1/eval2/eval3 5.7/3.8/4.2 N/A link
ESPnet2 CSJ eval1/eval2/eval3 4.5/3.3/3.6 N/A link
ESPnet2 GigaSpeech dev/test N/A 10.6/10.5 link
HKUST dev 23.5 N/A link
ESPnet2 HKUST dev 21.2 N/A link
Librispeech dev_clean/dev_other/test_clean/test_other N/A 1.9/4.9/2.1/4.9 link
ESPnet2 Librispeech dev_clean/dev_other/test_clean/test_other 0.6/1.5/0.6/1.4 1.7/3.4/1.8/3.6 link
Switchboard (eval2000) callhm/swbd N/A 14.0/6.8 link
ESPnet2 Switchboard (eval2000) callhm/swbd N/A 13.4/7.3 link
TEDLIUM2 dev/test N/A 8.6/7.2 link
ESPnet2 TEDLIUM2 dev/test N/A 7.3/7.1 link
TEDLIUM3 dev/test N/A 9.6/7.6 link
WSJ dev93/eval92 3.2/2.1 7.0/4.7 N/A
ESPnet2 WSJ dev93/eval92 1.1/0.8 2.8/1.8 link

Note that the performance of the CSJ, HKUST, and Librispeech tasks was significantly improved by using the wide network (#units = 1024) and, where necessary, large subword units, as reported by RWTH.

If you want to check the results of the other recipes, please check egs/<name_of_recipe>/asr1/RESULTS.md.

ASR demo


You can recognize speech in a WAV file using pre-trained models. Go to a recipe directory and run utils/recog_wav.sh as follows:

# go to the recipe directory and source path of espnet tools
cd egs/tedlium2/asr1 && . ./path.sh
# let's recognize speech!
recog_wav.sh --models tedlium2.transformer.v1 example.wav

where example.wav is a WAV file to be recognized. The sampling rate must be consistent with that of the data used in training.
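
If your recording has a different sampling rate, resample it before recognition. A minimal sketch using librosa and soundfile (both are generic tools, not part of the recipe; sox works equally well, and 16 kHz is only an assumed target rate):

# Resample a mono WAV file to the sampling rate expected by the model (assumed 16 kHz here)
import librosa
import soundfile

speech, rate = soundfile.read("input_44k.wav")
target_rate = 16000  # use the sampling rate of the model's training data
speech_resampled = librosa.resample(speech, orig_sr=rate, target_sr=target_rate)
soundfile.write("example.wav", speech_resampled, target_rate)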

Available pre-trained models in the demo script are listed below.

Model Notes
tedlium2.rnn.v1 Streaming decoding based on CTC-based VAD
tedlium2.rnn.v2 Streaming decoding based on CTC-based VAD (batch decoding)
tedlium2.transformer.v1 Joint-CTC attention Transformer trained on Tedlium 2
tedlium3.transformer.v1 Joint-CTC attention Transformer trained on Tedlium 3
librispeech.transformer.v1 Joint-CTC attention Transformer trained on Librispeech
commonvoice.transformer.v1 Joint-CTC attention Transformer trained on CommonVoice
csj.transformer.v1 Joint-CTC attention Transformer trained on CSJ
csj.rnn.v1 Joint-CTC attention VGGBLSTM trained on CSJ

SE results


We list results from three different models on WSJ0-2mix, which is one of the most widely used benchmark datasets for speech separation.

Model STOI SAR SDR SIR
TF Masking 0.89 11.40 10.24 18.04
Conv-Tasnet 0.95 16.62 15.94 25.90
DPRNN-Tasnet 0.96 18.82 18.29 28.92

SE demos

You can try the interactive demo with Google Colab. Please click the following button to get access to the demos.

Open In Colab

It is based on ESPnet2. Pre-trained models are available for both speech enhancement and speech separation tasks.
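
As a rough sketch of offline separation from Python (assuming the SeparateSpeech interface in espnet2.bin.enh_inference; the model tag is a placeholder, and the exact constructor arguments and keys returned by download_and_unpack may differ per model):

# Minimal sketch: separate a speech mixture with a pre-trained ESPnet2 enhancement/separation model
import soundfile
from espnet_model_zoo.downloader import ModelDownloader
from espnet2.bin.enh_inference import SeparateSpeech

d = ModelDownloader(cachedir="./modelcache")
# Replace the placeholder tag with an actual enhancement/separation model tag
separate_speech = SeparateSpeech(**d.download_and_unpack("<enh-model-tag>"))

mixture, rate = soundfile.read("mixture.wav")
# The model expects a batch dimension: (batch, num_samples)
separated = separate_speech(mixture[None, :], fs=rate)
for i, wav in enumerate(separated):
    soundfile.write(f"speaker{i + 1}.wav", wav[0], rate)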

Speech separation streaming demos:

Open In Colab

ST results


We list the 4-gram BLEU scores of major ST tasks.

end-to-end system

Task BLEU Pre-trained model
Fisher-CallHome Spanish fisher_test (Es->En) 51.03 link
Fisher-CallHome Spanish callhome_evltest (Es->En) 20.44 link
Libri-trans test (En->Fr) 16.70 link
How2 dev5 (En->Pt) 45.68 link
Must-C tst-COMMON (En->De) 22.91 link
Mboshi-French dev (Fr->Mboshi) 6.18 N/A

cascaded system

Task BLEU Pre-trained model
Fisher-CallHome Spanish fisher_test (Es->En) 42.16 N/A
Fisher-CallHome Spanish callhome_evltest (Es->En) 19.82 N/A
Libri-trans test (En->Fr) 16.96 N/A
How2 dev5 (En->Pt) 44.90 N/A
Must-C tst-COMMON (En->De) 23.65 N/A

If you want to check the results of the other recipes, please check egs/<name_of_recipe>/st1/RESULTS.md.

ST demo


(New!) We made a new real-time E2E-ST + TTS demonstration in Google Colab. Please access the notebook from the following button and enjoy the real-time speech-to-speech translation!

Open In Colab


You can translate speech in a WAV file using pre-trained models. Go to a recipe directory and run utils/translate_wav.sh as follows:

# Go to recipe directory and source path of espnet tools
cd egs/fisher_callhome_spanish/st1 && . ./path.sh
# download example wav file
wget -O - https://github.com/espnet/espnet/files/4100928/test.wav.tar.gz | tar zxvf -
# let's translate speech!
translate_wav.sh --models fisher_callhome_spanish.transformer.v1.es-en test.wav

where test.wav is a WAV file to be translated. The sampling rate must be consistent with that of the data used in training.

Available pre-trained models in the demo script are listed below.

Model Notes
fisher_callhome_spanish.transformer.v1 Transformer-ST trained on Fisher-CallHome Spanish Es->En

MT results

Task BLEU Pre-trained model
Fisher-CallHome Spanish fisher_test (Es->En) 61.45 link
Fisher-CallHome Spanish callhome_evltest (Es->En) 29.86 link
Libri-trans test (En->Fr) 18.09 link
How2 dev5 (En->Pt) 58.61 link
Must-C tst-COMMON (En->De) 27.63 link
IWSLT'14 test2014 (En->De) 24.70 link
IWSLT'14 test2014 (De->En) 29.22 link
IWSLT'14 test2014 (De->En) 32.2 link
IWSLT'16 test2014 (En->De) 24.05 link
IWSLT'16 test2014 (De->En) 29.13 link

TTS results

ESPnet2

You can listen to the generated samples at the following URL.

Note that in the generation, we use Griffin-Lim (wav/) and Parallel WaveGAN (wav_pwg/).

You can download pre-trained models via espnet_model_zoo.

You can download pre-trained vocoders via kan-bayashi/ParallelWaveGAN.

ESPnet1

NOTE: We are moving to ESPnet2-based development for TTS. Please check the latest results in the above ESPnet2 results.

You can listen to our samples on the demo homepage espnet-tts-sample. Here we list some notable ones:

You can download all of the pre-trained models and generated samples:

Note that in the generated samples, we use the following vocoders: Griffin-Lim (GL), WaveNet vocoder (WaveNet), Parallel WaveGAN (ParallelWaveGAN), and MelGAN (MelGAN). The neural vocoders are based on the following repositories.

If you want to build your own neural vocoder, please check the above repositories. kan-bayashi/ParallelWaveGAN provides a manual on how to decode ESPnet-TTS models' features with neural vocoders. Please check it.
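
As a rough sketch of that workflow (assuming the parallel_wavegan package with its download_pretrained_model and load_model helpers; the vocoder tag is illustrative, and the mel input here is a random placeholder for features generated by a TTS model with matching settings):

# Minimal sketch: decode TTS mel-spectrogram features with a pre-trained Parallel WaveGAN vocoder
import torch
from parallel_wavegan.utils import download_pretrained_model, load_model

vocoder = load_model(download_pretrained_model("ljspeech_parallel_wavegan.v1"))
vocoder.remove_weight_norm()
vocoder.eval()

mel = torch.randn(200, 80)  # placeholder for (num_frames, num_mels) features from a TTS model
with torch.no_grad():
    wav = vocoder.inference(mel)  # waveform tensor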

Here we list all of the pre-trained neural vocoders. Please download and enjoy the generation of high-quality speech!

Model Lang Fs [Hz] Mel range [Hz] FFT / Shift / Win [pt] Model type
ljspeech.wavenet.softmax.ns.v1 EN 22.05k None 1024 / 256 / None Softmax WaveNet
ljspeech.wavenet.mol.v1 EN 22.05k None 1024 / 256 / None MoL WaveNet
ljspeech.parallel_wavegan.v1 EN 22.05k None 1024 / 256 / None Parallel WaveGAN
ljspeech.wavenet.mol.v2 EN 22.05k 80-7600 1024 / 256 / None MoL WaveNet
ljspeech.parallel_wavegan.v2 EN 22.05k 80-7600 1024 / 256 / None Parallel WaveGAN
ljspeech.melgan.v1 EN 22.05k 80-7600 1024 / 256 / None MelGAN
ljspeech.melgan.v3 EN 22.05k 80-7600 1024 / 256 / None MelGAN
libritts.wavenet.mol.v1 EN 24k None 1024 / 256 / None MoL WaveNet
jsut.wavenet.mol.v1 JP 24k 80-7600 2048 / 300 / 1200 MoL WaveNet
jsut.parallel_wavegan.v1 JP 24k 80-7600 2048 / 300 / 1200 Parallel WaveGAN
csmsc.wavenet.mol.v1 ZH 24k 80-7600 2048 / 300 / 1200 MoL WaveNet
csmsc.parallel_wavegan.v1 ZH 24k 80-7600 2048 / 300 / 1200 Parallel WaveGAN

If you want to use the above pre-trained vocoders, please exactly match the feature setting with them.

TTS demo

ESPnet2

You can try the real-time demo in Google Colab. Please access the notebook from the following button and enjoy the real-time synthesis!

  • Real-time TTS demo with ESPnet2 Open In Colab

English, Japanese, and Mandarin models are available in the demo.
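
As a rough Python sketch of the same kind of synthesis (assuming the ESPnet2 Text2Speech interface and the espnet_model_zoo package; the model tag is a placeholder, and a separate vocoder is needed unless an end-to-end text2wav model such as VITS is used):

# Minimal sketch: synthesize speech with a pre-trained ESPnet2 TTS model
import soundfile
from espnet_model_zoo.downloader import ModelDownloader
from espnet2.bin.tts_inference import Text2Speech

d = ModelDownloader(cachedir="./modelcache")
# Replace the placeholder tag with an actual TTS model tag from the model zoo
text2speech = Text2Speech(**d.download_and_unpack("<tts-model-tag>"))

output = text2speech("This is a demonstration of text to speech.")
wav = output["wav"]  # waveform tensor (for end-to-end text2wav models such as VITS)
soundfile.write("out.wav", wav.numpy(), text2speech.fs)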

ESPnet1

NOTE: We are moving to ESPnet2-based development for TTS. Please check the latest demo in the above ESPnet2 demo.

You can try the real-time demo in Google Colab. Please access the notebook from the following button and enjoy the real-time synthesis.

  • Real-time TTS demo with ESPnet1 Open In Colab

We also provide a shell script to perform synthesis. Go to a recipe directory and run utils/synth_wav.sh as follows:

# Go to recipe directory and source path of espnet tools
cd egs/ljspeech/tts1 && . ./path.sh
# We use an upper-case char sequence for the default model.
echo "THIS IS A DEMONSTRATION OF TEXT TO SPEECH." > example.txt
# let's synthesize speech!
synth_wav.sh example.txt

# Also, you can use multiple sentences
echo "THIS IS A DEMONSTRATION OF TEXT TO SPEECH." > example_multi.txt
echo "TEXT TO SPEECH IS A TECHNIQUE TO CONVERT TEXT INTO SPEECH." >> example_multi.txt
synth_wav.sh example_multi.txt

You can change the pre-trained model as follows:

synth_wav.sh --models ljspeech.fastspeech.v1 example.txt

Waveform synthesis is performed with the Griffin-Lim algorithm and neural vocoders (WaveNet and ParallelWaveGAN). You can change the pre-trained vocoder model as follows:

synth_wav.sh --vocoder_models ljspeech.wavenet.mol.v1 example.txt

The WaveNet vocoder provides very high-quality speech, but generation takes time.

See more details or available models via --help.

synth_wav.sh --help

VC results

  • Transformer and Tacotron2-based VC

You can listen to some samples on the demo webpage.

  • Cascade ASR+TTS as one of the baseline systems of VCC2020

The Voice Conversion Challenge 2020 (VCC2020) adopted ESPnet to build an end-to-end baseline system. In VCC2020, the objective is intra-/cross-lingual nonparallel VC. You can download converted samples of the cascade ASR+TTS baseline system here.

SLU results


We list the performance on various SLU tasks and datasets using the metrics reported in the original dataset papers.

Task Dataset Metric Result Pre-trained Model
Intent Classification SLURP Acc 86.3 link
Intent Classification FSC Acc 99.6 link
Intent Classification FSC Unseen Speaker Set Acc 98.6 link
Intent Classification FSC Unseen Utterance Set Acc 86.4 link
Intent Classification FSC Challenge Speaker Set Acc 97.5 link
Intent Classification FSC Challenge Utterance Set Acc 78.5 link
Intent Classification SNIPS F1 91.7 link
Intent Classification Grabo (Nl) Acc 97.2 link
Intent Classification CAT SLU MAP (Zn) Acc 78.9 link
Intent Classification Google Speech Commands Acc 98.4 link
Slot Filling SLURP SLU-F1 71.9 link
Dialogue Act Classification Switchboard Acc 67.5 link
Dialogue Act Classification Jdcinal (Jp) Acc 67.4 link
Emotion Recognition IEMOCAP Acc 69.4 link
Emotion Recognition swbd_sentiment Macro F1 61.4 link
Emotion Recognition slue_voxceleb Macro F1 44.0 link

If you want to check the results of the other recipes, please check egs2/<name_of_recipe>/asr1/RESULTS.md.

CTC Segmentation demo

ESPnet1

CTC segmentation determines utterance segments within audio files. Aligned utterance segments constitute the labels of speech datasets.

As a demo, we align the start and end of utterances within the audio file ctc_align_test.wav, using the example script utils/asr_align_wav.sh. For preparation, set up a data directory:

cd egs/tedlium2/align1/
# data directory
align_dir=data/demo
mkdir -p ${align_dir}
# wav file
base=ctc_align_test
wav=../../../test_utils/${base}.wav
# recipe files
echo "batchsize: 0" > ${align_dir}/align.yaml

cat << EOF > ${align_dir}/utt_text
${base} THE SALE OF THE HOTELS
${base} IS PART OF HOLIDAY'S STRATEGY
${base} TO SELL OFF ASSETS
${base} AND CONCENTRATE
${base} ON PROPERTY MANAGEMENT
EOF

Here, utt_text is the file containing the list of utterances. Choose a pre-trained ASR model that includes a CTC layer to find utterance segments:

# pre-trained ASR model
model=wsj.transformer_small.v1
mkdir ./conf && cp ../../wsj/asr1/conf/no_preprocess.yaml ./conf

../../../utils/asr_align_wav.sh \
    --models ${model} \
    --align_dir ${align_dir} \
    --align_config ${align_dir}/align.yaml \
    ${wav} ${align_dir}/utt_text

Segments are written to aligned_segments as a list of file/utterance names, utterance start and end times in seconds, and a confidence score. The confidence score is a probability in log space that indicates how well the utterance was aligned. If needed, remove bad utterances:

min_confidence_score=-5
awk -v ms=${min_confidence_score} '{ if ($5 > ms) {print} }' ${align_dir}/aligned_segments

The demo script utils/ctc_align_wav.sh uses an already pre-trained ASR model (see the list above for more models). It is recommended to use models with RNN-based encoders (such as BLSTMP) for aligning large audio files, rather than Transformer models, which have high memory consumption on longer audio data. The sample rate of the audio must be consistent with that of the data used in training; adjust with sox if needed. A full example recipe is in egs/tedlium2/align1/.

ESPnet2

CTC segmentation determines utterance segments within audio files. Aligned utterance segments constitute the labels of speech datasets.

As a demo, we align the start and end of utterances within the audio file ctc_align_test.wav. This can be done either directly from the Python command line or using the script espnet2/bin/asr_align.py.

From the Python command line interface:

# load a model with character tokens
from espnet_model_zoo.downloader import ModelDownloader
d = ModelDownloader(cachedir="./modelcache")
wsjmodel = d.download_and_unpack("kamo-naoyuki/wsj")
# load the example file included in the ESPnet repository
import soundfile
speech, rate = soundfile.read("./test_utils/ctc_align_test.wav")
# CTC segmentation
from espnet2.bin.asr_align import CTCSegmentation
aligner = CTCSegmentation(**wsjmodel, fs=rate)
text = """
utt1 THE SALE OF THE HOTELS
utt2 IS PART OF HOLIDAY'S STRATEGY
utt3 TO SELL OFF ASSETS
utt4 AND CONCENTRATE ON PROPERTY MANAGEMENT
"""
segments = aligner(speech, text)
print(segments)
# utt1 utt 0.26 1.73 -0.0154 THE SALE OF THE HOTELS
# utt2 utt 1.73 3.19 -0.7674 IS PART OF HOLIDAY'S STRATEGY
# utt3 utt 3.19 4.20 -0.7433 TO SELL OFF ASSETS
# utt4 utt 4.20 6.10 -0.4899 AND CONCENTRATE ON PROPERTY MANAGEMENT

Aligning also works with fragments of the text. For this, set the gratis_blank option that allows skipping unrelated audio sections without penalty. It's also possible to omit the utterance names at the beginning of each line by setting kaldi_style_text to False.

aligner.set_config(gratis_blank=True, kaldi_style_text=False)
text = ["SALE OF THE HOTELS", "PROPERTY MANAGEMENT"]
segments = aligner(speech, text)
print(segments)
# utt_0000 utt 0.37 1.72 -2.0651 SALE OF THE HOTELS
# utt_0001 utt 4.70 6.10 -5.0566 PROPERTY MANAGEMENT

The script espnet2/bin/asr_align.py uses a similar interface. To align utterances:

# ASR model and config files from pre-trained model (e.g., from cachedir):
asr_config=<path-to-model>/config.yaml
asr_model=<path-to-model>/valid.*best.pth
# prepare the text file
wav="test_utils/ctc_align_test.wav"
text="test_utils/ctc_align_text.txt"
cat << EOF > ${text}
utt1 THE SALE OF THE HOTELS
utt2 IS PART OF HOLIDAY'S STRATEGY
utt3 TO SELL OFF ASSETS
utt4 AND CONCENTRATE
utt5 ON PROPERTY MANAGEMENT
EOF
# obtain alignments:
python espnet2/bin/asr_align.py --asr_train_config ${asr_config} --asr_model_file ${asr_model} --audio ${wav} --text ${text}
# utt1 ctc_align_test 0.26 1.73 -0.0154 THE SALE OF THE HOTELS
# utt2 ctc_align_test 1.73 3.19 -0.7674 IS PART OF HOLIDAY'S STRATEGY
# utt3 ctc_align_test 3.19 4.20 -0.7433 TO SELL OFF ASSETS
# utt4 ctc_align_test 4.20 4.97 -0.6017 AND CONCENTRATE
# utt5 ctc_align_test 4.97 6.10 -0.3477 ON PROPERTY MANAGEMENT

The output of the script can be redirected to a segments file by adding the argument --output segments. Each line contains the file/utterance name, utterance start and end times in seconds, and a confidence score; optionally also the utterance text. The confidence score is a probability in log space that indicates how well the utterance was aligned. If needed, remove bad utterances:

min_confidence_score=-7
# here, we assume that the output was written to the file `segments`
awk -v ms=${min_confidence_score} '{ if ($5 > ms) {print} }' segments

See the module documentation for more information. It is recommended to use models with RNN-based encoders (such as BLSTMP) for aligning large audio files, rather than Transformer models, which have high memory consumption on longer audio data. The sample rate of the audio must be consistent with that of the data used in training; adjust with sox if needed.

Also, we can use this tool to provide token-level segmentation information if we prepare a list of tokens instead of a list of utterances in the text file. See the discussion in #4278 (comment).

Citations

@inproceedings{watanabe2018espnet,
  author={Shinji Watanabe and Takaaki Hori and Shigeki Karita and Tomoki Hayashi and Jiro Nishitoba and Yuya Unno and Nelson {Enrique Yalta Soplin} and Jahn Heymann and Matthew Wiesner and Nanxin Chen and Adithya Renduchintala and Tsubasa Ochiai},
  title={{ESPnet}: End-to-End Speech Processing Toolkit},
  year={2018},
  booktitle={Proceedings of Interspeech},
  pages={2207--2211},
  doi={10.21437/Interspeech.2018-1456},
  url={http://dx.doi.org/10.21437/Interspeech.2018-1456}
}
@inproceedings{hayashi2020espnet,
  title={{Espnet-TTS}: Unified, reproducible, and integratable open source end-to-end text-to-speech toolkit},
  author={Hayashi, Tomoki and Yamamoto, Ryuichi and Inoue, Katsuki and Yoshimura, Takenori and Watanabe, Shinji and Toda, Tomoki and Takeda, Kazuya and Zhang, Yu and Tan, Xu},
  booktitle={Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  pages={7654--7658},
  year={2020},
  organization={IEEE}
}
@inproceedings{inaguma-etal-2020-espnet,
    title = "{ESP}net-{ST}: All-in-One Speech Translation Toolkit",
    author = "Inaguma, Hirofumi  and
      Kiyono, Shun  and
      Duh, Kevin  and
      Karita, Shigeki  and
      Yalta, Nelson  and
      Hayashi, Tomoki  and
      Watanabe, Shinji",
    booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations",
    month = jul,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.acl-demos.34",
    pages = "302--311",
}
@article{hayashi2021espnet2,
  title={Espnet2-tts: Extending the edge of tts research},
  author={Hayashi, Tomoki and Yamamoto, Ryuichi and Yoshimura, Takenori and Wu, Peter and Shi, Jiatong and Saeki, Takaaki and Ju, Yooncheol and Yasuda, Yusuke and Takamichi, Shinnosuke and Watanabe, Shinji},
  journal={arXiv preprint arXiv:2110.07840},
  year={2021}
}
@inproceedings{li2020espnet,
  title={{ESPnet-SE}: End-to-End Speech Enhancement and Separation Toolkit Designed for {ASR} Integration},
  author={Chenda Li and Jing Shi and Wangyou Zhang and Aswin Shanmugam Subramanian and Xuankai Chang and Naoyuki Kamo and Moto Hira and Tomoki Hayashi and Christoph Boeddeker and Zhuo Chen and Shinji Watanabe},
  booktitle={Proceedings of IEEE Spoken Language Technology Workshop (SLT)},
  pages={785--792},
  year={2021},
  organization={IEEE},
}
@inproceedings{arora2021espnet,
  title={{ESPnet-SLU}: Advancing Spoken Language Understanding through ESPnet},
  author={Arora, Siddhant and Dalmia, Siddharth and Denisov, Pavel and Chang, Xuankai and Ueda, Yushi and Peng, Yifan and Zhang, Yuekai and Kumar, Sujay and Ganesan, Karthik and Yan, Brian and others},
  booktitle={ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  pages={7167--7171},
  year={2022},
  organization={IEEE}
}
@inproceedings{shi2022muskits,
  author={Shi, Jiatong and Guo, Shuai and Qian, Tao and Huo, Nan and Hayashi, Tomoki and Wu, Yuning and Xu, Frank and Chang, Xuankai and Li, Huazhe and Wu, Peter and Watanabe, Shinji and Jin, Qin},
  title={{Muskits}: an End-to-End Music Processing Toolkit for Singing Voice Synthesis},
  year={2022},
  booktitle={Proceedings of Interspeech},
  pages={4277--4281},
  url={https://www.isca-speech.org/archive/pdfs/interspeech_2022/shi22d_interspeech.pdf}
}
@inproceedings{lu22c_interspeech,
  author={Yen-Ju Lu and Xuankai Chang and Chenda Li and Wangyou Zhang and Samuele Cornell and Zhaoheng Ni and Yoshiki Masuyama and Brian Yan and Robin Scheibler and Zhong-Qiu Wang and Yu Tsao and Yanmin Qian and Shinji Watanabe},
  title={{ESPnet-SE++: Speech Enhancement for Robust Speech Recognition, Translation, and Understanding}},
  year={2022},
  booktitle={Proc. Interspeech 2022},
  pages={5458--5462},
}
@article{gao2022euro,
  title={{EURO}: {ESPnet} Unsupervised ASR Open-source Toolkit},
  author={Gao, Dongji and Shi, Jiatong and Chuang, Shun-Po and Garcia, Leibny Paola and Lee, Hung-yi and Watanabe, Shinji and Khudanpur, Sanjeev},
  journal={arXiv preprint arXiv:2211.17196},
  year={2022}
}
@article{peng2023reproducing,
  title={Reproducing Whisper-Style Training Using an Open-Source Toolkit and Publicly Available Data},
  author={Peng, Yifan and Tian, Jinchuan and Yan, Brian and Berrebbi, Dan and Chang, Xuankai and Li, Xinjian and Shi, Jiatong and Arora, Siddhant and Chen, William and Sharma, Roshan and others},
  journal={arXiv preprint arXiv:2309.13876},
  year={2023}
}

espnet's People

Contributors

a-quarter-mile, b-flo, bloodraven66, bobchennan, d-keqi, emrys365, fhrozen, ftshijt, gtache, hirofumi0810, jerryuhoo, jungjee, kamo-naoyuki, kan-bayashi, lichenda, masao-someki, mergify[bot], neillu23, popcornell, potato-inoue, pre-commit-ci[bot], pyf98, roshansh-cmu, shigekikarita, siddhu001, simpleoier, sw005320, unilight, yosukehiguchi, yushiueda


espnet's Issues

A sudden change in "validation/main/acc" after loading the snapshot

Hi,

After loading the training model from the snapshot snapshot_iter_136356, it is strange to see a sudden change in "validation/main/loss_ctc".

Before loading the snapshot
{
    "validation/main/loss": 51.400978088378906, 
    "main/loss": 128.95794677734375, 
    "main/loss_ctc": 190.42391967773438, 
    "iteration": 136400, 
    "eps": 1e-08, 
    "main/loss_att": 67.49195861816406, 
    "validation/main/loss_att": 25.587093353271484, 
    "elapsed_time": 112193.59979605675, 
    "epoch": 11, 
    "validation/main/acc": 0.7833305597305298, 
    "main/acc": 0.7837930917739868, 
    "validation/main/loss_ctc": 77.21483612060547
}, 

After loading the snapshot:
{
    "main/loss_ctc": 28.216089248657227, 
    "main/loss": 18.790611267089844, 
    "validation/main/loss": 51.408992767333984, 
    "iteration": 136357, 
    "eps": 1e-08, 
    "main/loss_att": 9.365135192871094, 
    "validation/main/loss_att": 25.660505294799805, 
    "elapsed_time": 112159.20889616013, 
    "epoch": 11, 
    "validation/main/acc": 0.7828497886657715, 
    "main/acc": 0.8584474921226501, 
    "validation/main/loss_ctc": 77.15746307373047
}, 
{
    "main/loss": 147.9554901123047, 
    "main/loss_ctc": 218.45945739746094, 
    "iteration": 136400, 
    "eps": 1e-08, 
    "main/loss_att": 77.45149230957031, 
    "elapsed_time": 112206.56107521057, 
    "epoch": 11, 
    "main/acc": 0.7835178971290588
}, 

Thank you very much!

warpctc issue

It seems that warpctc is not correctly installed, which makes the Travis check fail.

CUDA check memory error

I am using a single GPU for the Librispeech task. Is it necessary to use multiple GPUs? I am getting the following error while training.
THCudaCheck FAIL file=/pytorch/torch/lib/THC/generic/THCStorage.cu line=58 error=2 : out of memory
Exception in main training loop: cuda runtime error (2) : out of memory at /pytorch/torch/lib/THC/generic/THCStorage.cu:58

I have incorporated the commit suggested in FIXED VARIABLE FLAG IN EVALUATION #192, but the error remains the same.
Can someone please guide me?

Thanks

deeper better

My previous experiments show that deeper encoders always yielded better performance (up to 10-12 layers). @kan-bayashi, could you perform such experiments for CSJ, TEDLIUM, and Voxforge? I'll do it for CHiME-4, HKUST, and WSJ. Then, we can find the optimal elayers and update run.sh and RESULTS. Now it's time to tune the performance. If GPUs are available, we should try it for both chainer and pytorch.

  • CSJ
  • CHiME-4
  • HKUST
  • Voxforge
  • TEDLIUM
  • WSJ

chime5 results

Does the RESULTS file in chime5/asr1 correspond to run.sh?
I got about 10% worse WER than yours after running the original chime5/asr1/run.sh script, which confuses me. Could you offer me some tips on that?
Thanks

Multiple GPU supports

It seems that we can easily use multiple GPUs (data parallel) with the chainer trainer. I want someone to consider this extension. Several tasks (e.g., Librispeech) take around one week, even with the pytorch backend (although this is still quite fast). With this extension, we could finish such tasks within only a few days.

Multi GPUs training freezes after first iter

I found this error while I was training the network for CHiME5.

PC setup:
Server (CentOS) with nvidia K80 and docker environment.

It stops after reporting the MTL loss and does not update (I needed to stop one container after 2 days without any update).

I am not sure if this is related to the setup, the chainer version (4.0), or the docker environment. I tested different models on chainer and docker with a single GPU without any problem.
I will test on my computer with 2 GPUs, in both the docker and normal environments.

I am just reporting this in case something else related to multi-GPU has an issue.

Installation issues

ESPnet has several installation problems (as many people have pointed out).
#125 is one possible solution, and @jtrmal also proposed a shell script as follows:

I have had a look at the Makefile and I'd suggest converting it to a shell script (or we should write a shell install script for each of the components). The way it's designed, it's bound to generate tons of issues even in trivial cases like using python3.

So, I'd like to ask for your opinions; I want to fix this within a month for the workshop activity.

Modularization

@hirofumi0810 pointed out that it would be better to modularize the network part (say, split it into encoder, decoder, and attention parts). I want to ask for your opinions about this direction.

plotting validation accuracy during training the model

Hi,

The curve "acc.png" plotted in the stage "Network Training" presenting the DNN accuracy. We can also predict the sequence using beam search and get a CER report in the stage "Decoding". However, A high classification accuracy in DNN but a low ASR performance, such as CER obtained in decoding.

Many thanks for your help!

anaconda make

@ShigekiKarita can you include your anaconda-based make in the Makefile?
If it is complicated, you can prepare a different Makefile. Also, do you have your own anaconda-based path.sh? Is it possible to use one path.sh to handle both venv and anaconda environments?

python 2 and 3 compatible

Some people reported issues due to Python 2 and 3 compatibility.
Right now, the unicode type in Python 2, which is used everywhere, causes some issues and needs to be fixed.

Mobile inference and training?

Hi ESPnet,

I wonder if I could deploy my Chainer-trained ESPnet model on mobile devices for on-device inference.

On-device training would be amazing, but inference would do for now.

Thanks

Toward a stable version

I think we have fixed many issues, and we can add version 1.0 (or 0.1) as a stable version.
Toward that, we need to finish:

  • VGG2L for pytorch by @ShigekiKarita
  • AN4 recipe by me
  • AMI recipe
  • swbd recipe
  • fisher_swbd recipe
  • LM integration @sw005320
  • Attention/CTC joint decoding @takaaki-hori
  • End detection
  • Documentation by @sw005320 @kan-bayashi
  • Modify L.embed to avoid the randomness @takaaki-hori
  • Add WER scoring
  • label smoothing by @takaaki-hori
  • replace _ilens_to_index with np.cumsum
  • refactor main training and recognition to be independent of pytorch and chainer backends.

If you have any action items, please add them to this issue.
Then, we can move on to more research-related implementations.

KeyError: 'missing keys in state_dict when decoding

Traceback (most recent call last):
File "/home/bing/espnet/egs/librispeech/asr1/../../../src/bin/asr_recog.py", line 117, in
main()
File "/home/bing/espnet/egs/librispeech/asr1/../../../src/bin/asr_recog.py", line 111, in main
recog(args)
File "/home/bing/espnet/src/asr/asr_pytorch.py", line 391, in recog
model.load_state_dict(remove_dataparallel(torch.load(args.model, map_location=cpu_loader)))
File "/home/bing/espnet/tools/venv/local/lib/python2.7/site-packages/torch/nn/modules/module.py", line 526, in load_state_dict
raise KeyError('missing keys in state_dict: "{}"'.format(missing))
KeyError: 'missing keys in state_dict: "set(['predictor.enc.enc1.bilstm1.bias_ih_l0_reverse', 'predictor.dec.att.gvec.bias', 'predictor.enc.enc1.bilstm1.weight_ih_l0', 'predictor.enc.enc1.bt1.bias', 'predictor.dec.att.mlp_enc.weight', 'predictor.enc.enc1.bilstm1.weight_hh_l0', 'predictor.dec.output.weight', 'predictor.dec.embed.weight', 'predictor.enc.enc1.bt1.weight', 'predictor.enc.enc1.bt0.weight', 'predictor.enc.enc1.bt2.bias', 'predictor.enc.enc1.bilstm0.bias_ih_l0_reverse', 'predictor.dec.decoder.0.weight_hh', 'predictor.enc.enc1.bilstm0.weight_ih_l0_reverse', 'predictor.enc.enc1.bilstm2.bias_ih_l0_reverse', 'predictor.att.loc_conv.weight', 'predictor.enc.enc1.bilstm2.bias_ih_l0', 'predictor.dec.att.gvec.weight', 'predictor.enc.enc1.bilstm0.weight_hh_l0_reverse', 'predictor.enc.enc1.bilstm1.weight_ih_l0_reverse', 'predictor.enc.enc1.bilstm0.bias_ih_l0', 'predictor.enc.enc1.bilstm0.weight_hh_l0', 'predictor.enc.enc1.bilstm1.bias_ih_l0', 'predictor.enc.enc1.bilstm0.bias_hh_l0', 'predictor.enc.enc1.bilstm2.weight_ih_l0_reverse', 'predictor.ctc.ctc_lo.weight', 'predictor.enc.enc1.bilstm2.weight_hh_l0_reverse', 'predictor.att.mlp_dec.weight', 'predictor.att.mlp_att.weight', 'predictor.dec.decoder.0.bias_ih', 'predictor.ctc.ctc_lo.bias', 'predictor.enc.enc1.bilstm1.bias_hh_l0_reverse', 'predictor.att.mlp_enc.bias', 'predictor.dec.att.mlp_att.weight', 'predictor.enc.enc1.bt0.bias', 'predictor.dec.att.loc_conv.weight', 'predictor.dec.decoder.0.bias_hh', 'predictor.enc.enc1.bilstm1.bias_hh_l0', 'predictor.enc.enc1.bt2.weight', 'predictor.enc.enc1.bilstm2.bias_hh_l0_reverse', 'predictor.dec.decoder.0.weight_ih', 'predictor.dec.att.mlp_dec.weight', 'predictor.enc.enc1.bilstm2.weight_ih_l0', 'predictor.att.gvec.bias', 'predictor.enc.enc1.bilstm0.weight_ih_l0', 'predictor.enc.enc1.bilstm0.bias_hh_l0_reverse', 'predictor.dec.output.bias', 'predictor.enc.enc1.bilstm1.weight_hh_l0_reverse', 'predictor.att.gvec.weight', 'predictor.att.mlp_enc.weight', 'predictor.enc.enc1.bilstm2.bias_hh_l0', 'predictor.dec.att.mlp_enc.bias', 'predictor.enc.enc1.bilstm2.weight_hh_l0'])"'

pytorch GPU memory full error at validation

I don't know why, but the pytorch backend uses a lot of GPU memory during validation (which it should not), and fails with an out-of-memory error.
This is the Librispeech task, which used to work without such issues.
Can someone fix it?

THCudaCheck FAIL file=/pytorch/torch/lib/THC/generic/THCStorage.cu line=58 error=2 : out of memory
Exception in main training loop: cuda runtime error (2) : out of memory at /pytorch/torch/lib/THC/generic/THCStorage.cu:58
Traceback (most recent call last):hs
  File "/export/a08/shinji/201707e2e/espnet_dev2/tools/venv/local/lib/python2.7/site-packages/chainer/training/trainer.py", line 309, in run
    entry.extension(self)
  File "/export/a08/shinji/201707e2e/espnet_dev2/tools/venv/local/lib/python2.7/site-packages/chainer/training/extensions/evaluator.py", line 140, in __call__
    result = self.evaluate()
  File "/export/a08/shinji/201707e2e/espnet_dev2/src/asr/asr_pytorch.py", line 77, in evaluate
    self.model(x)
  File "/export/a08/shinji/201707e2e/espnet_dev2/tools/venv/local/lib/python2.7/site-packages/torch/nn/modules/module.py", line 357, in __call__
    result = self.forward(*input, **kwargs)
  File "/export/a08/shinji/201707e2e/espnet_dev2/src/nets/e2e_asr_attctc_th.py", line 117, in forward
    loss_ctc, loss_att, acc = self.predictor(x)
  File "/export/a08/shinji/201707e2e/espnet_dev2/tools/venv/local/lib/python2.7/site-packages/torch/nn/modules/module.py", line 357, in __call__
    result = self.forward(*input, **kwargs)
  File "/export/a08/shinji/201707e2e/espnet_dev2/src/nets/e2e_asr_attctc_th.py", line 297, in forward
    hpad, hlens = self.enc(xpad, ilens)
  File "/export/a08/shinji/201707e2e/espnet_dev2/tools/venv/local/lib/python2.7/site-packages/torch/nn/modules/module.py", line 357, in __call__
    result = self.forward(*input, **kwargs)
  File "/export/a08/shinji/201707e2e/espnet_dev2/src/nets/e2e_asr_attctc_th.py", line 1907, in forward
    xs, ilens = self.enc1(xs, ilens)
  File "/export/a08/shinji/201707e2e/espnet_dev2/tools/venv/local/lib/python2.7/site-packages/torch/nn/modules/module.py", line 357, in __call__
    result = self.forward(*input, **kwargs)
  File "/export/a08/shinji/201707e2e/espnet_dev2/src/nets/e2e_asr_attctc_th.py", line 1951, in forward
    ys, (hy, cy) = bilstm(xpack)
  File "/export/a08/shinji/201707e2e/espnet_dev2/tools/venv/local/lib/python2.7/site-packages/torch/nn/modules/module.py", line 357, in __call__
    result = self.forward(*input, **kwargs)
  File "/export/a08/shinji/201707e2e/espnet_dev2/tools/venv/local/lib/python2.7/site-packages/torch/nn/modules/rnn.py", line 204, in forward
    output, hidden = func(input, self.all_weights, hx)
  File "/export/a08/shinji/201707e2e/espnet_dev2/tools/venv/local/lib/python2.7/site-packages/torch/nn/_functions/rnn.py", line 385, in forward
    return func(input, *fargs, **fkwargs)
  File "/export/a08/shinji/201707e2e/espnet_dev2/tools/venv/local/lib/python2.7/site-packages/torch/autograd/function.py", line 328, in _do_forward
    flat_output = super(NestedIOFunction, self)._do_forward(*flat_input)
  File "/export/a08/shinji/201707e2e/espnet_dev2/tools/venv/local/lib/python2.7/site-packages/torch/autograd/function.py", line 350, in forward
    result = self.forward_extended(*nested_tensors)
  File "/export/a08/shinji/201707e2e/espnet_dev2/tools/venv/local/lib/python2.7/site-packages/torch/nn/_functions/rnn.py", line 294, in forward_extended
    cudnn.rnn.forward(self, input, hx, weight, output, hy)
  File "/export/a08/shinji/201707e2e/espnet_dev2/tools/venv/local/lib/python2.7/site-packages/torch/backends/cudnn/rnn.py", line 281, in forward
    fn.reserve = torch.cuda.ByteTensor(reserve_size.value)
Will finalize trainer extensions and updater before reraising the exception.
Traceback (most recent call last):
  File "/export/a08/shinji/201707e2e/espnet_dev2/egs/librispeech/asr1/../../../src/bin/asr_train.py", line 205, in <module>
    main()
  File "/export/a08/shinji/201707e2e/espnet_dev2/egs/librispeech/asr1/../../../src/bin/asr_train.py", line 199, in main
    train(args)
  File "/export/a08/shinji/201707e2e/espnet_dev2/src/asr/asr_pytorch.py", line 358, in train
    trainer.run()
  File "/export/a08/shinji/201707e2e/espnet_dev2/tools/venv/local/lib/python2.7/site-packages/chainer/training/trainer.py", line 320, in run
    six.reraise(*sys.exc_info())
  File "/export/a08/shinji/201707e2e/espnet_dev2/tools/venv/local/lib/python2.7/site-packages/chainer/training/trainer.py", line 309, in run
    entry.extension(self)
  File "/export/a08/shinji/201707e2e/espnet_dev2/tools/venv/local/lib/python2.7/site-packages/chainer/training/extensions/evaluator.py", line 140, in __call__
    result = self.evaluate()
  File "/export/a08/shinji/201707e2e/espnet_dev2/src/asr/asr_pytorch.py", line 77, in evaluate
    self.model(x)
  File "/export/a08/shinji/201707e2e/espnet_dev2/tools/venv/local/lib/python2.7/site-packages/torch/nn/modules/module.py", line 357, in __call__
    result = self.forward(*input, **kwargs)
  File "/export/a08/shinji/201707e2e/espnet_dev2/src/nets/e2e_asr_attctc_th.py", line 117, in forward
    loss_ctc, loss_att, acc = self.predictor(x)
  File "/export/a08/shinji/201707e2e/espnet_dev2/tools/venv/local/lib/python2.7/site-packages/torch/nn/modules/module.py", line 357, in __call__
    result = self.forward(*input, **kwargs)
  File "/export/a08/shinji/201707e2e/espnet_dev2/src/nets/e2e_asr_attctc_th.py", line 297, in forward
    hpad, hlens = self.enc(xpad, ilens)
  File "/export/a08/shinji/201707e2e/espnet_dev2/tools/venv/local/lib/python2.7/site-packages/torch/nn/modules/module.py", line 357, in __call__
    result = self.forward(*input, **kwargs)
  File "/export/a08/shinji/201707e2e/espnet_dev2/src/nets/e2e_asr_attctc_th.py", line 1907, in forward
    xs, ilens = self.enc1(xs, ilens)
  File "/export/a08/shinji/201707e2e/espnet_dev2/tools/venv/local/lib/python2.7/site-packages/torch/nn/modules/module.py", line 357, in __call__
    result = self.forward(*input, **kwargs)
  File "/export/a08/shinji/201707e2e/espnet_dev2/src/nets/e2e_asr_attctc_th.py", line 1951, in forward
    ys, (hy, cy) = bilstm(xpack)
  File "/export/a08/shinji/201707e2e/espnet_dev2/tools/venv/local/lib/python2.7/site-packages/torch/nn/modules/module.py", line 357, in __call__
    result = self.forward(*input, **kwargs)
  File "/export/a08/shinji/201707e2e/espnet_dev2/tools/venv/local/lib/python2.7/site-packages/torch/nn/modules/rnn.py", line 204, in forward
    output, hidden = func(input, self.all_weights, hx)
  File "/export/a08/shinji/201707e2e/espnet_dev2/tools/venv/local/lib/python2.7/site-packages/torch/nn/_functions/rnn.py", line 385, in forward
    return func(input, *fargs, **fkwargs)
  File "/export/a08/shinji/201707e2e/espnet_dev2/tools/venv/local/lib/python2.7/site-packages/torch/autograd/function.py", line 328, in _do_forward
    flat_output = super(NestedIOFunction, self)._do_forward(*flat_input)
  File "/export/a08/shinji/201707e2e/espnet_dev2/tools/venv/local/lib/python2.7/site-packages/torch/autograd/function.py", line 350, in forward
    result = self.forward_extended(*nested_tensors)
  File "/export/a08/shinji/201707e2e/espnet_dev2/tools/venv/local/lib/python2.7/site-packages/torch/nn/_functions/rnn.py", line 294, in forward_extended
    cudnn.rnn.forward(self, input, hx, weight, output, hy)
  File "/export/a08/shinji/201707e2e/espnet_dev2/tools/venv/local/lib/python2.7/site-packages/torch/backends/cudnn/rnn.py", line 281, in forward
    fn.reserve = torch.cuda.ByteTensor(reserve_size.value)
RuntimeError: cuda runtime error (2) : out of memory at /pytorch/torch/lib/THC/generic/THCStorage.cu:58
# Accounting: time=1815 threads=1
# Finished at Fri May 25 17:35:47 EDT 2018 with status 1

Input frame lengths become shorter.....

@ShigekiKarita

old kaldi io

2017-12-15 18:52:53,217 (e2e_asr_attctc:770) INFO: VGG2L input lengths: [1162 1162 1162 1162 1162 1162 1162 1162 1162 1162 1162 1162 1162 1162 1162]
2017-12-15 18:52:54,578 (e2e_asr_attctc:699) INFO: BLSTMP input lengths: [291 291 291 291 291 291 291 291 291 291 291 291 291 291 291]
2017-12-15 18:52:55,201 (e2e_asr_attctc:239) INFO: CTC input lengths:  [291 291 291 291 291 291 291 291 291 291 291 291 291 291 291]
2017-12-15 18:52:55,201 (e2e_asr_attctc:240) INFO: CTC output lengths: [147 133 136 130 150 140 143 128 139 146 154 130 144 104 135]
2017-12-15 18:52:56,377 (e2e_asr_attctc:244) INFO: ctc loss:878.386901855
2017-12-15 18:52:56,382 (e2e_asr_attctc:427) INFO: Decoder input lengths:  [291 291 291 291 291 291 291 291 291 291 291 291 291 291 291]
2017-12-15 18:52:56,383 (e2e_asr_attctc:428) INFO: Decoder output lengths: [148 134 137 131 151 141 144 129 140 147 155 131 145 105 136]
2017-12-15 18:52:58,724 (e2e_asr_attctc:456) INFO: att loss:549.235229492
2017-12-15 18:52:58,748 (e2e_asr_attctc:471) INFO: groundtruth[0]: FOR CHRISTMAS LAST YEAR HE GAVE ALL SEVENTY EMPLOYEES VIDEOCASSETTE RECORDERS AND SET UP A FREE VIDEO LIBRARY WITH ONE THOUSAND FIVE HUNDRED MOVIES<eos>

new kaldi io

2017-12-15 18:59:11,724 (e2e_asr_attctc:770) INFO: VGG2L input lengths: [784 784 784 784 784 784 784 784 784 784 784 784 784 784 784]
2017-12-15 18:59:12,722 (e2e_asr_attctc:699) INFO: BLSTMP input lengths: [196 196 196 196 196 196 196 196 196 196 196 196 196 196 196]
2017-12-15 18:59:13,120 (e2e_asr_attctc:239) INFO: CTC input lengths:  [196 196 196 196 196 196 196 196 196 196 196 196 196 196 196]
2017-12-15 18:59:13,120 (e2e_asr_attctc:240) INFO: CTC output lengths: [147 133 136 130 150 140 143 128 139 146 154 130 144 104 135]
2017-12-15 18:59:13,879 (e2e_asr_attctc:244) INFO: ctc loss:630.451171875
2017-12-15 18:59:13,885 (e2e_asr_attctc:427) INFO: Decoder input lengths:  [196 196 196 196 196 196 196 196 196 196 196 196 196 196 196]
2017-12-15 18:59:13,885 (e2e_asr_attctc:428) INFO: Decoder output lengths: [148 134 137 131 151 141 144 129 140 147 155 131 145 105 136]
2017-12-15 18:59:24,061 (e2e_asr_attctc:456) INFO: att loss:549.195800781
2017-12-15 18:59:25,978 (e2e_asr_attctc:471) INFO: groundtruth[0]: FOR CHRISTMAS LAST YEAR HE GAVE ALL SEVENTY EMPLOYEES VIDEOCASSETTE RECORDERS AND SET UP A FREE VIDEO LIBRARY WITH ONE THOUSAND FIVE H
UNDRED MOVIES<eos>

Unexpected KeyError during decoding

Hi all,
I'm a beginner with ESPnet, and I followed the instructions in ESpnet/egs/wsj/asr1/run.sh.
The training of the language model and acoustic model looks fine:

export CUDA_VISIBLE_DEVICES=0 ; ./run.sh --ngpu 1 --backend pytorch --etype blstmp

but I face a problem during decoding... The following message is from decode.2.log.
Does anyone face the same problem? Any suggestions will be appreciated. Thanks.

2018-05-04 15:26:04,898 (asr_recog:97) INFO: python path = /mount/arbeitsdaten/asr/licu/Espnet/egs/wsj/asr1/../../../src/lm/:/mount/arbeitsdaten/asr/licu/Espnet/egs/wsj/asr1/../../../src/asr/:/mount/arbeitsdaten/asr/licu/Espnet/egs/wsj/asr1/../../../src/nets/:/mount/arbeitsdaten/asr/licu/Espnet/egs/wsj/asr1/../../../src/utils/:/mount/arbeitsdaten/asr/licu/Espnet/egs/wsj/asr1/../../../src/bin/:/mount/arbeitsdaten/asr/licu/Espnet/egs/wsj/asr1/../../../src/lm/:/mount/arbeitsdaten/asr/licu/Espnet/egs/wsj/asr1/../../../src/asr/:/mount/arbeitsdaten/asr/licu/Espnet/egs/wsj/asr1/../../../src/nets/:/mount/arbeitsdaten/asr/licu/Espnet/egs/wsj/asr1/../../../src/utils/:/mount/arbeitsdaten/asr/licu/Espnet/egs/wsj/asr1/../../../src/bin/:/mount/arbeitsdaten/asr/licu/Espnet/egs/wsj/asr1/../../../src/lm/:/mount/arbeitsdaten/asr/licu/Espnet/egs/wsj/asr1/../../../src/asr/:/mount/arbeitsdaten/asr/licu/Espnet/egs/wsj/asr1/../../../src/nets/:/mount/arbeitsdaten/asr/licu/Espnet/egs/wsj/asr1/../../../src/utils/:/mount/arbeitsdaten/asr/licu/Espnet/egs/wsj/asr1/../../../src/bin/:
2018-05-04 15:26:04,898 (asr_recog:102) INFO: set random seed = 1
2018-05-04 15:26:04,898 (asr_recog:105) INFO: backend = pytorch
2018-05-04 15:28:18,098 (asr_pytorch:314) INFO: reading a model config file fromexp/train_si284_503-lm/results/model.conf
2018-05-04 15:28:18,106 (asr_pytorch:318) INFO: ARGS: backend: pytorch
2018-05-04 15:28:18,106 (asr_pytorch:318) INFO: ARGS: beam_size: 20
2018-05-04 15:28:18,106 (asr_pytorch:318) INFO: ARGS: ctc_weight: 0.3
2018-05-04 15:28:18,106 (asr_pytorch:318) INFO: ARGS: debugmode: 1
2018-05-04 15:28:18,106 (asr_pytorch:318) INFO: ARGS: gpu: None
2018-05-04 15:28:18,106 (asr_pytorch:318) INFO: ARGS: lm_weight: 1.0
2018-05-04 15:28:18,106 (asr_pytorch:318) INFO: ARGS: maxlenratio: 0.0
2018-05-04 15:28:18,106 (asr_pytorch:318) INFO: ARGS: minlenratio: 0.0
2018-05-04 15:28:18,107 (asr_pytorch:318) INFO: ARGS: model: exp/train_si284_503-lm/results/model.acc.best
2018-05-04 15:28:18,107 (asr_pytorch:318) INFO: ARGS: model_conf: exp/train_si284_503-lm/results/model.conf
2018-05-04 15:28:18,107 (asr_pytorch:318) INFO: ARGS: nbest: 1
2018-05-04 15:28:18,107 (asr_pytorch:318) INFO: ARGS: ngpu: 0
2018-05-04 15:28:18,107 (asr_pytorch:318) INFO: ARGS: penalty: 0.0
2018-05-04 15:28:18,107 (asr_pytorch:318) INFO: ARGS: recog_feat: ark,s,cs:apply-cmvn --norm-vars=true data/train_si284/cmvn.ark scp:data/test_dev93/split32utt/2/feats.scp ark:- |
2018-05-04 15:28:18,107 (asr_pytorch:318) INFO: ARGS: recog_label: data/test_dev93/data.json
2018-05-04 15:28:18,107 (asr_pytorch:318) INFO: ARGS: result_label: exp/train_si284_503-lm/decode_test_dev93_beam20_eacc.best_p0.0_len0.0-0.0_ctcw0.3_rnnlm1.0/data.2.json
2018-05-04 15:28:18,107 (asr_pytorch:318) INFO: ARGS: rnnlm: exp/train_rnnlm_2layer_bs2048/rnnlm.model.best
2018-05-04 15:28:18,107 (asr_pytorch:318) INFO: ARGS: seed: 1
2018-05-04 15:28:18,107 (asr_pytorch:318) INFO: ARGS: verbose: 1
2018-05-04 15:28:18,107 (asr_pytorch:321) INFO: reading model parameters fromexp/train_si284_503-lm/results/model.acc.best
2018-05-04 15:28:18,108 (e2e_asr_attctc_th:170) INFO: subsample: 1 2 2 1 1 1 1
2018-05-04 15:28:18,108 (e2e_asr_attctc_th:175) INFO: Use label smoothing with unigram
2018-05-04 15:28:26,029 (e2e_asr_attctc_th:1927) INFO: BLSTM with every-layer projection for encoder
Traceback (most recent call last):
  File "/mount/arbeitsdaten/asr/licu/Espnet/egs/wsj/asr1/../../../src/bin/asr_recog.py", line 117, in <module>
    main()
  File "/mount/arbeitsdaten/asr/licu/Espnet/egs/wsj/asr1/../../../src/bin/asr_recog.py", line 111, in main
    recog(args)
  File "/mount/arbeitsdaten/asr/licu/Espnet/src/asr/asr_pytorch.py", line 327, in recog
    model.load_state_dict(torch.load(args.model, map_location=cpu_loader))
  File "/mount/arbeitsdaten40/projekte/asr/licu/Espnet/tools/venv/lib/python2.7/site-packages/torch/nn/modules/module.py", line 522, in load_state_dict
    .format(name))
KeyError: 'unexpected key "module.predictor.enc.enc1.bilstm0.weight_ih_l0" in state_dict'
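The "module." prefix in the offending key usually means the checkpoint was saved from a model wrapped in torch.nn.DataParallel (multi-GPU training) while decoding builds the bare model. A minimal sketch, assuming that diagnosis, of stripping the prefix before load_state_dict; strip_dataparallel_prefix is a hypothetical helper, not ESPnet code:

    from collections import OrderedDict

    def strip_dataparallel_prefix(state_dict):
        """Drop the leading 'module.' that torch.nn.DataParallel adds to parameter names."""
        return OrderedDict(
            (key[len("module."):] if key.startswith("module.") else key, value)
            for key, value in state_dict.items()
        )

    # Usage, following the names in the traceback above (model and args are placeholders):
    # state_dict = torch.load(args.model, map_location=lambda storage, loc: storage)
    # model.load_state_dict(strip_dataparallel_prefix(state_dict))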

PyTorch installation

I'm just trying to play with the pytorch backend.
Here is my note:

$ cd egs/wsj/asr1
$ . ./path.sh
$ pip install pip --upgrade
$ pip install http://download.pytorch.org/whl/cu80/torch-0.3.0.post4-cp27-cp27mu-linux_x86_64.whl
$ pip install torchvision

Then,

$ ./run.sh --stage 3 --gpu jhu --backend pytorch

Then, I got the error:

Traceback (most recent call last):
  File "/export/a08/shinji/201707e2e/espnet_dev/egs/wsj/asr1/../../../src/bin/asr_train_th.py", line 567, in <module>
    main()
  File "/export/a08/shinji/201707e2e/espnet_dev/egs/wsj/asr1/../../../src/bin/asr_train_th.py", line 433, in main
    e2e = E2E(idim, odim, args)
  File "/export/a08/shinji/201707e2e/espnet_dev/src/nets/e2e_asr_attctc_th.py", line 152, in __init__
    self.ctc = CTC(odim, args.eprojs, args.dropout_rate)
  File "/export/a08/shinji/201707e2e/espnet_dev/src/nets/e2e_asr_attctc_th.py", line 247, in __init__
    from warpctc_pytorch import CTCLoss
ImportError: No module named warpctc_pytorch
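The traceback just means the warp-ctc PyTorch binding was never installed into this virtualenv; the pip commands above only install torch and torchvision. A small hedged check you can run inside the same venv (the suggestion to rebuild via the tools directory is an assumption based on the tools Makefile referenced elsewhere in these issues, not an official procedure):

    try:
        from warpctc_pytorch import CTCLoss  # the binding asr_train_th.py imports
        print("warpctc_pytorch is available:", CTCLoss)
    except ImportError as err:
        print("warpctc_pytorch is missing:", err)
        print("Rebuild the ESPnet tools (e.g. run make under tools/) so the binding "
              "is compiled against the installed torch version.")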

Need help to analyze the results

Hi,

I am trying to optimize the MTL architecture for my data, and I need help analyzing the errors made by the model. I would like to generate alignments between characters and acoustic frames, as depicted in this paper on page 4.

Can I use the att_c and att_w variables here to generate these plots?

I will be grateful for any guidance.

Thanks in advance!
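If att_w holds the attention weights for one utterance (one row per output character, one column per encoder frame), then plotting that matrix as a heatmap reproduces the alignment figure in the paper; att_c is only the weighted context vector, so it is not needed for the plot. A minimal matplotlib sketch, assuming att_w has been collected as a NumPy array (the random array below is just a placeholder):

    import matplotlib.pyplot as plt
    import numpy as np

    # Placeholder: replace with the attention weights gathered during decoding,
    # shaped (num_output_characters, num_encoder_frames).
    att_w = np.random.rand(25, 200)

    plt.imshow(att_w, aspect="auto", origin="lower", interpolation="nearest")
    plt.xlabel("encoder frame")
    plt.ylabel("output character index")
    plt.colorbar(label="attention weight")
    plt.tight_layout()
    plt.savefig("alignment.png")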

CPU decoding uses a lot of threads

The decoding process uses a lot of threads.
If we use the default recipe, the load average becomes a very big number.

Why don't we add a way to limit the number of CPU threads used during decoding?
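Until the recipe caps this itself, the usual workaround is to limit the BLAS/OpenMP thread pools before the recognition process starts. A minimal sketch (the environment variables are the standard OpenMP/MKL ones, not ESPnet options, and must be set before numpy/torch/chainer are imported):

    import os

    # Must happen before numpy / torch / chainer are imported in the worker.
    os.environ.setdefault("OMP_NUM_THREADS", "1")
    os.environ.setdefault("MKL_NUM_THREADS", "1")

    import torch

    torch.set_num_threads(1)  # intra-op CPU threads for the PyTorch backend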

wrong symlink to best model in lm_train.py

lm_train.py creates the symlink to the best model incorrectly.

(lm_train.py: line 400)
                dest = args.outdir + '/rnnlm.model.best'
                if os.path.lexists(dest):
                    os.remove(dest)
                os.symlink(args.outdir + '/rnnlm.model.' + str(epoch_now), dest)

I think the above code should be as follows:

(lm_train.py: line 400)
                dest = args.outdir + '/rnnlm.model.best'
                if os.path.lexists(dest):
                    os.remove(dest)
                os.symlink('rnnlm.model.' + str(epoch_now), dest)
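The reason the relative target matters: os.symlink stores the source string verbatim, and a relative source is resolved against the directory containing the link, so prefixing it with args.outdir duplicates the path whenever outdir is itself relative. A quick illustration with hypothetical paths:

    import os
    import tempfile

    os.chdir(tempfile.mkdtemp())
    outdir = "exp/rnnlm"  # a relative outdir, as the recipes use
    os.makedirs(outdir)
    open(os.path.join(outdir, "rnnlm.model.3"), "w").close()

    # Current code: the stored target resolves to exp/rnnlm/exp/rnnlm/rnnlm.model.3.
    os.symlink(outdir + "/rnnlm.model.3", outdir + "/broken.best")
    assert not os.path.exists(outdir + "/broken.best")  # dangling link

    # Proposed fix: a target relative to the link's own directory.
    os.symlink("rnnlm.model.3", outdir + "/rnnlm.model.best")
    assert os.path.exists(outdir + "/rnnlm.model.best")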

Word boundary in CSJ

CSJ includes word boundary information (wb), though it depends on morphological analysis.

However, the current preprocessing removes the wb markers, so we cannot compute WER.
I know including wb increases computational complexity, but it provides additional information to the network.
In addition, if we can utilize wb, word-level LMs can be used in the decoding stage (of course, @takaaki-hori's multi-level LM decoding can also be used).

What do you think about this?

Chainer or Pytorch

Hi All,

Is it possible to use only Pytorch as the backend? If not, could someone tell me the purpose of having two deep learning frameworks?

Thanks,
Lahiru

WER increases with larger beam size and when including the LM

Hello,

I am running experiments with the WSJ corpus using PyTorch. I only use the seq2seq part of the code, with --mtlalpha set to 0. During decoding, even without a language model, the CER/WER with greedy decoding is the best, and increasing the beam size only leads to more errors.

Upon LM training on the WSJ corpus, the best model perplexity reached is:

2018-04-04 21:44:27,006 (lm_pytorch:248) INFO: iteration: 79400
2018-04-04 21:44:27,006 (lm_pytorch:249) INFO: training perplexity: 2.81606144442938
2018-04-04 21:45:16,637 (lm_pytorch:255) INFO: epoch: 24
2018-04-04 21:45:16,638 (lm_pytorch:256) INFO: validation perplexity: 4.354437210237004 

And decoding with such an LM included worsens the CER/WER results further.

I believe this might be due to some issue in the decoder or the beam search. Please let me know if you need any further details.
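One common cause with pure attention decoding (mtlalpha set to 0, so no joint CTC score) is the short-hypothesis bias: a larger beam surfaces short candidates whose total log-probability is high simply because they have fewer tokens. A per-token bonus (what the penalty decode option is typically used for) or explicit length normalization usually counters this. A toy sketch of how a length bonus changes the ranking (illustrative only, not the ESPnet beam search):

    # (hypothesis, total log-probability); the short one wins on raw score.
    hyps = [("THE CAT", -4.0), ("THE CAT SAT ON THE MAT", -7.0)]

    def rescore(hyp, logp, penalty):
        # Add a bonus per output token, the way an insertion penalty /
        # length bonus is typically applied during beam search.
        return logp + penalty * len(hyp.split())

    for penalty in (0.0, 1.0):
        best = max(hyps, key=lambda pair: rescore(pair[0], pair[1], penalty))
        print("penalty=%.1f -> %s" % (penalty, best[0]))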

Babel recipe

We're planning to create a Babel recipe for JSALT18.

map int instead of float

Hi,

I think that this cast from str to float might be wrong:

ys = [self.xp.array(map(float, i[1]['tokenid'].split()), dtype=np.int32) for i in data]

It could be:

ys = [self.xp.array(map(int, i[1]['tokenid'].split()), dtype=np.int32) for i in data]
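For integer token-id strings the two variants produce the same int32 array, since the dtype argument truncates the intermediate floats; the int version is simply the more direct cast. A minimal check (plain numpy instead of self.xp, and list(map(...)) so it also runs on Python 3):

    import numpy as np

    tokenid = "7 23 4 981"

    ys_float = np.array(list(map(float, tokenid.split())), dtype=np.int32)
    ys_int = np.array(list(map(int, tokenid.split())), dtype=np.int32)

    assert (ys_float == ys_int).all()  # identical for integer token-id strings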

chainer_ctc installation

@jheymann85, I have the following error message when installing it. Do you have any ideas?

    Complete output from command python setup.py egg_info:
    /usr/lib/python2.7/distutils/extension.py:133: UserWarning: Unknown Extension options: 'include'
      warnings.warn(msg)
    zip_safe flag not set; analyzing archive contents...

    Installed /tmp/pip-0PFC4n-build/setuptools_cython-0.2.1-py2.7.egg
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-0PFC4n-build/setup.py", line 29, in <module>
        language='c++')
      File "/usr/lib/python2.7/distutils/core.py", line 111, in setup
        _setup_distribution = dist = klass(attrs)
      File "/export/a08/shinji/201707e2e/espnet_conda/tools/venv/local/lib/python2.7/site-packages/setuptools/dist.py", line 266, in __init__
        _Distribution.__init__(self,attrs)
      File "/usr/lib/python2.7/distutils/dist.py", line 287, in __init__
        self.finalize_options()
      File "/export/a08/shinji/201707e2e/espnet_conda/tools/venv/local/lib/python2.7/site-packages/setuptools/dist.py", line 301, in finalize_options
        ep.load()(self, ep.name, value)
      File "/export/a08/shinji/201707e2e/espnet_conda/tools/venv/local/lib/python2.7/site-packages/pkg_resources.py", line 2190, in load
        ['__name__'])
      File "build/bdist.linux-x86_64/egg/setuptools_cython.py", line 21, in <module>
    AttributeError: 'module' object has no attribute 'Distutils'

    ----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-0PFC4n-build/
Makefile:41: recipe for target 'chainer_ctc' failed

Publishing commit changes

Hello,

Would it be possible to publish "commit changes" -- a changelog of sorts with detailed information about the commits, bug fixes, and pull requests merged into the master branch, which would also state (or help users understand) what changes in the results are to be expected across commits?

This would help us understand why the results change, even with the same seed, across different commits of the master branch.

Thank you!

Automatic doc build & deploy based on Travis-CI and travis-sphinx

Manual deployment is troublesome. I've already tested travis-sphinx, which automates the Sphinx doc build & gh-pages deployment with Travis CI.

  • .travis.yml
  • gh-pages
    • I confirmed that my fix of the typo ENCODER NE(W)TWORK CLASS was deployed automatically
  • you can build the HTML docs locally with travis-sphinx build --source=doc --nowarn and open doc/build/index.html

This enables us to add and verify documentation rapidly.
Do you see any problem with automating this?

New development plan

I'm now managing several development plans based on your requests.
Please feel free to ask us for any functionality that you feel would be beneficial for your research.
The requests I have received so far are:

  • Fix VGG-BLSTM in PyTorch
  • Generalize to accept multiple input/output streams
  • Backward compatibility
  • Add a unidirectional LSTM layer in the encoder
  • Output attention weights
  • Output more detailed decoder information (e.g., each score at every output time step) for post-processing
  • Scheduled sampling to handle exposure bias
  • Completely switch to pure CTC or seq2seq with mtlalpha = 0.0 or 1.0
  • End-of-sentence norm and length norm
  • TensorboardX visualization for pytorch and chainer (https://github.com/lanpa/tensorboard-pytorch ???)

Travis CI (error: travis-sphinx build)

@kan-bayashi @ShigekiKarita @sw005320
I got the following error message when testing my pull request #95 with Travis CI.
Do you have any ideas on how to make this test pass?

$ travis-sphinx build --source=doc --nowarn
Traceback (most recent call last):
File "/home/travis/virtualenv/python2.7.14/bin/travis-sphinx", line 11, in
sys.exit(main())
File "/home/travis/virtualenv/python2.7.14/lib/python2.7/site-packages/click/core.py", line 722, in call
return self.main(*args, **kwargs)
File "/home/travis/virtualenv/python2.7.14/lib/python2.7/site-packages/click/core.py", line 697, in main
rv = self.invoke(ctx)
File "/home/travis/virtualenv/python2.7.14/lib/python2.7/site-packages/click/core.py", line 1066, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/travis/virtualenv/python2.7.14/lib/python2.7/site-packages/click/core.py", line 895, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/travis/virtualenv/python2.7.14/lib/python2.7/site-packages/click/core.py", line 535, in invoke
return callback(*args, **kwargs)
File "/home/travis/virtualenv/python2.7.14/lib/python2.7/site-packages/click/decorators.py", line 17, in new_func
return f(get_current_context(), *args, **kwargs)
File "/home/travis/virtualenv/python2.7.14/lib/python2.7/site-packages/travis_sphinx/build.py", line 39, in build
if sphinx.build_main(args + [source, outdir]):
AttributeError: 'module' object has no attribute 'build_main'
The command "travis-sphinx build --source=doc --nowarn" exited with 1.

Code linters

Although @kan-bayashi cleaned up some lines in #25, we still have bad lines:

# there are too many lines violating E501 (line too long, > 79 chars), so filter those out
$ flake8 ./src/bin src/nets | grep -v E501
./src/bin/asr_recog.py:132:1: E303 too many blank lines (3)
./src/bin/asr_recog_th.py:97:5: E306 expected 1 blank line before a nested definition, found 0
./src/bin/asr_train_th.py:14:1: F401 'subprocess' imported but unused
./src/bin/asr_train_th.py:26:1: F401 'chainer.function' imported but unused
./src/bin/asr_train_th.py:54:9: F841 local variable 'eval_func' is assigned to but never used
./src/bin/asr_train_th.py:446:1: W293 blank line contains whitespace
./src/bin/asr_train_th.py:448:5: E303 too many blank lines (2)
./src/bin/asr_train.py:13:1: F401 'subprocess' imported but unused
src/nets/e2e_asr_attctc.py:628:13: W291 trailing whitespace
src/nets/e2e_asr_attctc.py:679:29: W291 trailing whitespace
src/nets/e2e_asr_attctc_th.py:784:32: E265 block comment should start with '# '
src/nets/e2e_asr_attctc_th.py:791:31: E265 block comment should start with '# '

Do you think it's time to wake up these linters in .travis.yml?

espnet/.travis.yml

Lines 26 to 32 in d50a8c2

script:
# TODO test coding style?
# - flake8
# - autopep8 -r . --global-config .pep8 --diff | tee check_autopep8
# - test ! -s check_autopep8
- export PYTHONPATH=`pwd`/src/nets:`pwd`/src/utils
- pytest test

cupy out of memory error when running the voxforge example

When I was running the voxforge example, the job failed at stage 3 (network training). The log file indicates that an out-of-memory error occurred.

My work computer has only one Quadro M1000M GPU, which has 2 GB of memory. What should I do to avoid this kind of error?

The log file is provided below:
train.log

Chainer CSJ results

I finished testing the CSJ recipe.

Results from exp/train_nodup_vggblstmp_e4_subsample1_2_2_1_1_unit320_proj320_d1_unit300_location_aconvc10_aconvf100_mtlalpha0.5_adadelta_bs30_mli800_mlo150, decode_eval{1,2,3}_beam20_eacc.best_p0_len0.0-0.8/result.txt (sclite Sum/Avg rows):

| Eval set | # Snt | # Wrd | Corr | Sub | Del | Ins | Err | S.Err |
| -------- | ----- | ----- | ---- | --- | --- | --- | ---- | ----- |
| eval1 | 1272 | 43897 | 84.9 | 6.2 | 8.9 | 1.4 | 16.5 | 70.6 |
| eval2 | 1292 | 43623 | 89.2 | 5.0 | 5.8 | 1.0 | 11.7 | 65.9 |
| eval3 | 1385 | 28225 | 89.3 | 5.6 | 5.1 | 1.6 | 12.3 | 53.8 |

exploding gradients

Hi, when I tried playing with my own dataset at the network training stage, some warnings appeared:

2018-05-23 13:05:04,260 (e2e_asr_attctc:88) WARNING: loss (=250000064.000000) is not correct
2018-05-23 13:05:12,361 (e2e_asr_attctc:88) WARNING: loss (=1500000000.000000) is not correct
2018-05-23 13:05:13,682 (asr_chainer:132) WARNING: grad norm is nan. Do not update model.
2018-05-23 13:05:14,925 (e2e_asr_attctc:88) WARNING: loss (=250000048.000000) is not correct
2018-05-23 13:05:29,965 (e2e_asr_attctc:88) WARNING: loss (=250000096.000000) is not correct
2018-05-23 13:05:34,489 (e2e_asr_attctc:88) WARNING: loss (=250000064.000000) is not correct
2018-05-23 13:05:46,163 (e2e_asr_attctc:88) WARNING: loss (=250000080.000000) is not correct

Do you have any suggestions for solving this issue? Many thanks!
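Losses around 2.5e8 followed by a NaN grad norm usually mean the CTC/attention loss is diverging; checking the data preparation (e.g. input/target lengths) and lowering the learning rate or tightening gradient clipping are the usual first steps. A minimal PyTorch sketch of norm clipping plus skipping non-finite updates, mirroring the "grad norm is nan. Do not update model." message above (illustrative only, not the ESPnet trainer code; the stand-in model and Adadelta optimizer are placeholders):

    import torch

    model = torch.nn.Linear(80, 50)            # stand-in for the E2E model
    optimizer = torch.optim.Adadelta(model.parameters())

    loss = model(torch.randn(4, 80)).sum()     # stand-in for the CTC/attention loss
    loss.backward()

    # Clip the gradient norm; skip the update entirely if it is inf/nan.
    grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
    if torch.isfinite(torch.as_tensor(grad_norm)):
        optimizer.step()
    optimizer.zero_grad()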

Using NCCLv1.x instead of 2.x for chainer MultiGPU

Hello there.

Just to note that NCCL 2.x is not working with Chainer when using multiple GPUs. I tried installing the versions for CUDA 9.0 and CUDA 9.2, but the grad norm swings between inf and nan only when those versions are used. This was only tested with versions installed via apt.
For the moment, to work with multiple GPUs, build the NCCL library from the git repository (ver. 1.x).

I would like to know if anyone else has tried multiple GPUs with a different setup.

CTC in pytorch 0.4 degrades the performance

I have confirmed that the warpctc pytorch binding degrades CTC performance after updating to pytorch 0.4.
In addition, surprisingly, when building the master branch of pytorch (0.5), the CTC performance degraded further.
This is consistent with my experiments on the TIMIT and CSJ corpora.

Therefore, I strongly recommend using pytorch 0.3 until a new version of warp_ctc is released.
There was no significant difference between 0.3 and 0.4 for the attention-based seq2seq models.

If anyone has confirmed the same problem, please tell me. Thank you.

chainer vs. pytorch in terms of training speed

Can you summarize the chainer vs. pytorch backends in terms of training time?
Thanks to @ShigekiKarita's efforts, we can compare them under almost the same conditions (maybe with blstmp? @ShigekiKarita, do you still need some time to finish vggblstmp in pytorch?). I will report the numbers in the chainer issue. The chainer developers may give us some solutions.
