
freewym / espresso


Espresso: A Fast End-to-End Neural Speech Recognition Toolkit

License: Other

Python 98.20% C++ 0.45% Lua 0.09% Shell 0.09% Makefile 0.06% Cuda 0.82% Cython 0.28%
asr end-to-end fairseq kaldi python pytorch speech-recognition

espresso's People

Contributors

alexeib, cndn, davidecaroselli, dianaml0, edunov, erip, freewym, huihuifan, jhcross, jingfeidu, joshim5, kahne, kartikayk, lematt1991, liezl200, liuchen9494, louismartin, maigoakisame, mortimerp9, multipath, myleott, pipibjc, skritika, sravyapopuri388, sshleifer, tangyuq, theweiho, xu-song, xutaima, yuntang


espresso's Issues

AttributeError: 'SpeechRecognitionEspressoTask' object has no attribute 'feat_dim'

๐Ÿ› Bug

The SpeechRecognitionEspressoTask has no attribute 'feat_dim'. Issue seen when trying to evaluate the model.

To Reproduce

Steps to reproduce the behavior (always include the command you ran):

Try to import a pretrained model using a checkpoint file.
See error

model = models.build_model(args, self)
File "git/espresso/fairseq/models/init.py", line 48, in build_model
return ARCH_MODEL_REGISTRY[args.arch].build_model(args, task)
File "git/espresso/espresso/models/speech_transformer.py", line 128, in build_model
logger.info("input feature dimension: {}, channels: {}".format(task.feat_dim, task.feat_in_channels))
AttributeError: 'SpeechRecognitionEspressoTask' object has no attribute 'feat_dim'

Code sample
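
A minimal sketch of the failing call path (checkpoint path and setup are hypothetical; task.feat_dim appears to be populated only once a dataset has been loaded on the task, so building the model straight from a checkpoint's args hits the AttributeError above):

    from fairseq import checkpoint_utils, models, tasks

    # Load a trained checkpoint on CPU (hypothetical path)
    state = checkpoint_utils.load_checkpoint_to_cpu("exp/transformer/checkpoint_best.pt")
    args = state["args"]

    task = tasks.setup_task(args)
    # task.feat_dim is not set yet at this point, so build_model() fails when
    # speech_transformer.py logs "input feature dimension: ...".
    model = models.build_model(args, task)
    # Loading a dataset first (e.g. task.load_dataset(args.valid_subset)) may
    # populate feat_dim, but that is an untested guess, not a confirmed fix.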

Expected behavior

No error

Environment

fairseq Version (e.g., 1.0 or master): 0.9.0
PyTorch Version (e.g., 1.0): 1.6.0
OS (e.g., Linux): Linux
How you installed fairseq (pip, source): pip

Verify WER by scoring with Kaldi

Hi authors,
I'm using the Librispeech run.sh recipe. I trained the acoustic model (speech_conv_lstm_librispeech) using 4 GPUs (1080 Ti), but I'm facing this error while doing Kaldi scoring.
local/score.sh data/test_clean exp/lstm/decode_test_clean_shallow_fusion
run.pl: job failed, log is in exp/lstm/decode_test_clean_shallow_fusion/scoring_kaldi/log/score.log
My second question: is there any documentation for using my pre-trained model to decode an audio wav? I would like to compare the decoding speed between ESPnet and Espresso (https://arxiv.org/abs/1909.08723).

SpecAug slows down training time

Hey there,

I am training a Librispeech transformer on 4 P100 GPUs, which works fine so far at ~1.1h/epoch. As I was now experimenting with SpecAug, I noticed that the training time roughly doubles to ~2.38h/epoch.

Is this expected behaviour?

I suspected that SpecAug might be part of dataloading, so I tried to increase num-workers during training to something > 0, but that gave me errors which seem to be caused by an insufficient shared memory size (which I unfortunately cannot change due to missing root privileges).

So is there any other way to speed up SpecAug training?

Thanks, Timo
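
For context, a rough sketch of the per-utterance work a SpecAugment-style transform does (this is not Espresso's implementation, just an illustration of why such masking typically lives in the data-loading path and adds CPU time per batch):

    import numpy as np

    def spec_augment(feat, num_freq_masks=2, F=27, num_time_masks=2, T=100, p=1.0):
        """Mask random frequency bands and time spans of a (frames x bins)
        feature matrix, roughly following the SpecAugment recipe parameters."""
        feat = feat.copy()
        n_frames, n_bins = feat.shape
        for _ in range(num_freq_masks):
            f = np.random.randint(0, F + 1)
            f0 = np.random.randint(0, max(1, n_bins - f))
            feat[:, f0:f0 + f] = 0.0
        max_t = min(T, int(p * n_frames))
        for _ in range(num_time_masks):
            t = np.random.randint(0, max_t + 1) if max_t > 0 else 0
            t0 = np.random.randint(0, max(1, n_frames - t))
            feat[t0:t0 + t, :] = 0.0
        return feat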

LM shallow fusion for the Japanese Language.

What is your question?

I'm looking for a fast speech recognition toolkit available for Japanese.
I tried to build a CSJ (Corpus of Spontaneous Japanese) recipe for Espresso, referring to the librispeech recipe.
The CSJ is a Japanese corpus also used in Kaldi and Espnet.
https://github.com/kaldi-asr/kaldi/tree/master/egs/csj/

However, LM shallow fusion does not seem to be effective in my recipe and I can't obtain sufficient results.

What have you tried?

I obtained a model with the character error rates shown below.

eval1 eval2 eval3
espresso(my recipe) 11.89% 8.30% 8.94%
kaldi(https://www.merl.com/publications/docs/TR2018-036.pdf) 9.0% 7.2% 9.6%

I tried turning the language model off (lm_shallow_fusion=false).
Unexpectedly, the character error rates improved.

eval1 eval2 eval3
espresso(lm_shallow_fusion=false) 11.53% 7.82% 8.56%

I don't know why, but the language model I built does not seem to be effective for speech recognition.
Training specifications and logs are shown below.

LM training (This is the same as the librispeech recipe)

if [ ${stage} -le 5 ]; then
  echo "Stage 5: subword LM Training"
  valid_subset=valid
  mkdir -p $lmdir/log
  log_file=$lmdir/log/train.log
  [ -f $lmdir/checkpoint_last.pt ] && log_file="-a $log_file"
  CUDA_VISIBLE_DEVICES=$free_gpu python3 ../../fairseq_cli/train.py $lmdatadir --seed 1 --user-dir espresso \
    --task language_modeling_for_asr --dict $lmdict \
    --log-interval $((16000/ngpus)) --log-format simple \
    --num-workers 0 --max-tokens 32000 --max-sentences 1024 --curriculum 1 \
    --valid-subset $valid_subset --max-sentences-valid 1536 \
    --distributed-world-size $ngpus --distributed-port $(if [ $ngpus -gt 1 ]; then echo 100; else echo -1; fi) \
    --max-epoch 30 --optimizer adam --lr 0.001 --clip-norm 1.0 \
    --lr-scheduler reduce_lr_on_plateau --lr-shrink 0.5 \
    --save-dir $lmdir --restore-file checkpoint_last.pt --save-interval-updates $((16000/ngpus)) \
    --keep-interval-updates 3 --keep-last-epochs 5 --validate-interval 1 \
    --arch lstm_lm_librispeech --criterion cross_entropy --sample-break-mode eos 2>&1 | tee $log_file
fi

LM training log

2020-04-05 07:25:11 | INFO | fairseq.data.data_utils | loaded 1209204 examples from: data/lm_text/train
2020-04-05 07:25:12 | INFO | fairseq.trainer | NOTE: your device may support faster training with --fp16
2020-04-05 07:32:04 | INFO | train | epoch 001 | loss 8.388 | ppl 334.94 | wps 52376.4 | ups 3.06 | wpb 17090.1 | bsz 969.7 | num_updates 1247 | lr 0.001 | gnorm 0.582 | clip 14.2 | train_wall 328 | wall 413
2020-04-05 07:32:04 | INFO | valid | epoch 001 | valid on 'valid' subset | loss 8.256 | ppl 305.75 | wps 107397 | wpb 18047.8 | bsz 800 | num_updates 1247
2020-04-05 07:32:05 | INFO | fairseq.checkpoint_utils | saved checkpoint exp/lm_lstm/checkpoint1.pt (epoch 1 @ 1247 updates, score 8.256) (writing took 1.0231214840023313 seconds)
2020-04-05 07:38:55 | INFO | train | epoch 002 | loss 7.194 | ppl 146.42 | wps 51840.6 | ups 3.03 | wpb 17090.1 | bsz 969.7 | num_updates 2494 | lr 0.001 | gnorm 0.624 | clip 12.3 | train_wall 324 | wall 824
2020-04-05 07:38:56 | INFO | valid | epoch 002 | valid on 'valid' subset | loss 6.767 | ppl 108.94 | wps 106708 | wpb 18047.8 | bsz 800 | num_updates 2494 | best_loss 6.767
...

2020-04-05 10:44:54 | INFO | train | epoch 029 | loss 4.56 | ppl 23.59 | wps 51650.9 | ups 3.02 | wpb 17090.1 | bsz 969.7 | num_updates 36163 | lr 1.52588e-08 | gnorm 0.482 | clip 6.4 | train_wall 325 | wall 11983
2020-04-05 10:44:55 | INFO | valid | epoch 029 | valid on 'valid' subset | loss 6.794 | ppl 110.97 | wps 106502 | wpb 18047.8 | bsz 800 | num_updates 36163 | best_loss 6.387
2020-04-05 10:44:58 | INFO | fairseq.checkpoint_utils | saved checkpoint exp/lm_lstm/checkpoint29.pt (epoch 29 @ 36163 updates, score 6.794) (writing took 2.9264725770044606 seconds)
2020-04-05 10:51:46 | INFO | train | epoch 030 | loss 4.56 | ppl 23.59 | wps 51649.6 | ups 3.02 | wpb 17090.1 | bsz 969.7 | num_updates 37410 | lr 1.52588e-08 | gnorm 0.482 | clip 6.4 | train_wall 325 | wall 12396
2020-04-05 10:51:47 | INFO | valid | epoch 030 | valid on 'valid' subset | loss 6.794 | ppl 110.97 | wps 106266 | wpb 18047.8 | bsz 800 | num_updates 37410 | best_loss 6.387

Model Training (This is the same as the librispeech recipe)

if [ ${stage} -le 8 ]; then
  echo "Stage 8: Model Training"
  valid_subset=valid
  mkdir -p $dir/log
  log_file=$dir/log/train.log
  [ -f $dir/checkpoint_last.pt ] && log_file="-a $log_file"
  opts=""
  if $apply_specaug; then
    opts="$opts --max-epoch 95 --lr-scheduler tri_stage --warmup-steps $((2000/ngpus)) --hold-steps $((600000/ngpus)) --decay-steps $((1040000/ngpus))"
    opts="$opts --encoder-rnn-layers 5"
    specaug_config="{'W': 80, 'F': 27, 'T': 100, 'num_freq_masks': 2, 'num_time_masks': 2, 'p': 1.0}"
  else
    opts="$opts --max-epoch 30 --lr-scheduler reduce_lr_on_plateau_v2 --lr-shrink 0.5 --start-reduce-lr-epoch 10"
  fi
  CUDA_VISIBLE_DEVICES=$free_gpu speech_train.py data --task speech_recognition_espresso --seed 1 --user-dir espresso \
    --log-interval $((8000/ngpus)) --log-format simple --print-training-sample-interval $((4000/ngpus)) \
    --num-workers 0 --max-tokens 26000 --max-sentences 24 --curriculum 1 \
    --valid-subset $valid_subset --max-sentences-valid 48 --ddp-backend no_c10d \
    --distributed-world-size $ngpus --distributed-port $(if [ $ngpus -gt 1 ]; then echo 100; else echo -1; fi) \
    --optimizer adam --lr 0.001 --weight-decay 0.0 --clip-norm 2.0 \
    --save-dir $dir --restore-file checkpoint_last.pt --save-interval-updates $((6000/ngpus)) \
    --keep-interval-updates 3 --keep-last-epochs 5 --validate-interval 1 --best-checkpoint-metric wer \
    --arch speech_conv_lstm_librispeech --criterion label_smoothed_cross_entropy_v2 \
    --label-smoothing 0.1 --smoothing-type uniform \
    --scheduled-sampling-probs 1.0 --start-scheduled-sampling-epoch 1 \
    --dict $dict --bpe sentencepiece --sentencepiece-vocab ${sentencepiece_model}.model \
    --max-source-positions 9999 --max-target-positions 999 \
    $opts --specaugment-config "$specaug_config" 2>&1 | tee $log_file
fi

if [ ${stage} -le 9 ]; then
  echo "Stage 9: Decoding"
  opts=""
  path=$dir/$checkpoint
  decode_affix=
  if $lm_shallow_fusion; then
    path="$path:$lmdir/$lm_checkpoint"
    opts="$opts --lm-weight 0.47 --eos-factor 1.5"
    if $apply_specaug; then
      # overwrite the existing opts
      opts="$opts --lm-weight 0.4"
    fi
    decode_affix=shallow_fusion
  fi
  for dataset in $test_set; do
    decode_dir=$dir/decode_$dataset${decode_affix:+_${decode_affix}}
    CUDA_VISIBLE_DEVICES=$(echo $free_gpu | sed 's/,/ /g' | awk '{print $1}') speech_recognize.py data \
      --task speech_recognition_espresso --user-dir espresso --max-tokens 15000 --max-sentences 24 \
      --num-shards 1 --shard-id 0 --dict $dict --bpe sentencepiece --sentencepiece-vocab ${sentencepiece_model}.model \
      --gen-subset $dataset --max-source-positions 9999 --max-target-positions 999 \
      --path $path --beam 60 --max-len-a 0.08 --max-len-b 0 --lenpen 1.0 \
      --results-path $decode_dir $opts

    echo "log saved in ${decode_dir}/decode.log"
    if $kaldi_scoring; then
      echo "verify WER by scoring with Kaldi..."
      local/score_e2e.sh data/$dataset $decode_dir
      cat ${decode_dir}/scoring_kaldi/wer
    fi
  done
fi

Model Training log

2020-04-05 15:29:52 | INFO | espresso.criterions.label_smoothed_cross_entropy_v2 | sample REF: え
2020-04-05 15:29:52 | INFO | espresso.criterions.label_smoothed_cross_entropy_v2 | sample PRD: 出る燿
2020-04-05 15:50:20 | INFO | espresso.criterions.label_smoothed_cross_entropy_v2 | sample REF: 喫茶店行ったり
2020-04-05 15:50:20 | INFO | espresso.criterions.label_smoothed_cross_entropy_v2 | sample PRD: 三さに
2020-04-05 16:03:00 | INFO | valid | epoch 001 | valid on 'valid' subset | loss 12.629 | nll_loss 12.256 | wer 96.275 | cer 94.7993 | ppl 4892.8 | wps 1669.6 | wpb 699.5 | bsz 31 | num_updates 6000
2020-04-05 16:03:18 | INFO | fairseq.checkpoint_utils | saved checkpoint exp/lstm/checkpoint_1_6000.pt (epoch 1 @ 6000 updates, score 94.7993) (writing took 18.49637553100183 seconds)
2020-04-05 16:15:50 | INFO | train_inner | epoch 001:   8000 / 52726 loss=4.376, nll_loss=3.062, ppl=8.35, wps=293.8, ups=2.9, wpb=101.3, bsz=24, num_updates=8000, lr=0.001, gnorm=1.54, clip=7.7, train_wall=654, wall=3035
2020-04-05 16:15:50 | INFO | espresso.criterions.label_smoothed_cross_entropy_v2 | sample REF: ああすこで犬が
2020-04-05 16:15:50 | INFO | espresso.criterions.label_smoothed_cross_entropy_v2 | sample PRD: ああすこののが
2020-04-05 16:46:41 | INFO | valid | epoch 001 | valid on 'valid' subset | loss 9.753 | nll_loss 9.16 | wer 95.05 | cer 89.5882 | ppl 571.93 | wps 1507.5 | wpb 699.5 | bsz 31 | num_updates 12000 | best_cer 89.5882

....

2020-04-24 08:41:54 | INFO | espresso.criterions.label_smoothed_cross_entropy_v2 | sample REF: またえー予稿集には間に合わなかったのですがえー阻害音と共鳴音の違いを表わしている
2020-04-24 08:41:54 | INFO | espresso.criterions.label_smoothed_cross_entropy_v2 | sample PRD: またえー予稿集には間に合わなかったのですがえー阻害音と共鳴音の違いを表わしている
2020-04-24 09:11:43 | INFO | train | epoch 030 | loss 1.776 | nll_loss 0.192 | ppl 1.14 | wps 381.1 | ups 0.94 | wpb 404.2 | bsz 22.9 | num_updates 1.58178e+06 | lr 1e-05 | gnorm 965926 | clip 0.1 | train_wall 19236 | wall 1.61919e+06
2020-04-24 09:13:05 | INFO | valid | epoch 030 | valid on 'valid' subset | loss 2.08 | nll_loss 0.544 | wer 53.725 | cer 8.7176 | ppl 1.46 | wps 1094.4 | wpb 699.5 | bsz 31 | num_updates 1.58178e+06 | best_cer 8.6631

Could you give me any hint?
Are there things to be careful of when applying LM shallow fusion to a language other than English?
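
For reference, shallow fusion here is just a per-step log-linear combination of ASR and external-LM scores; a minimal sketch (not Espresso's code) of what --lm-weight controls:

    import torch

    def shallow_fusion_step(asr_log_probs, lm_log_probs, lm_weight=0.47):
        """Combine per-token ASR and external-LM log-probabilities at one
        decoding step; 0.47 is the --lm-weight used in stage 9 above. If the
        subword LM is poorly matched to the acoustic model's output units or
        has high perplexity, this extra term can hurt rather than help."""
        return asr_log_probs + lm_weight * lm_log_probs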

What's your environment?

fairseq Version (e.g., 1.0 or master): 0.9.0
PyTorch Version (e.g., 1.0): 1.4.0
OS (e.g., Linux): Ubuntu 18.04.4 LTS
How you installed fairseq (pip, source): pip
Python version: 3.7
CUDA/cuDNN version: 10.0.130 / libcudnn.so.7.5.1
GPU models and configuration: Tesla K80

thanks.

issues about speech_fconv.py

What is your question?

I read your code about applying fairseq to ASR. Regarding the decoder part, I noticed that position embedding is not enabled in the default parameters. Since I don't have the librispeech dataset, I used a Chinese dataset I have to run an experiment. I found that when the position embedding is not added, the loss can be reduced to about 0.6, but decoding misbehaves: due to the lack of position information, the decoded sentence is too short or comes out empty. But when I set decoder_positional_embed to True, the loss starts to oscillate around 3. I want to ask if this phenomenon occurs because I haven't trained for enough epochs. (In my experience, the loss generally needs to drop below 1 before the decoded result can be partially correct.)

Code

(screenshot of the relevant code omitted)

Besides, I saw in the fairseq paper that the implementation of position embedding is different from the traditional sine and cosine formula. I want to ask if I can adjust the weight when adding x and pos_emb? Thanks a lot!
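
If it helps, one way to experiment with down-weighting the positional term is to scale it before the addition; a rough sketch (illustrative only, not fairseq's fconv code):

    import torch
    import torch.nn as nn

    class ScaledPositionalEmbedding(nn.Module):
        """Add learned positional embeddings to the input with a (possibly
        learnable) scale alpha, so the positional signal can be weighted."""
        def __init__(self, num_positions, embed_dim, alpha=0.1, learnable=True):
            super().__init__()
            self.pos_emb = nn.Embedding(num_positions, embed_dim)
            self.alpha = nn.Parameter(torch.tensor(alpha)) if learnable else alpha

        def forward(self, x):
            # x: (batch, time, embed_dim)
            positions = torch.arange(x.size(1), device=x.device)
            return x + self.alpha * self.pos_emb(positions)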

Getting OOM error at the middle of the training in asr_swbd recipe on lstm encoder decoder model

error message

2021-03-19 12:09:30 | WARNING | fairseq.trainer | attempting to recover from OOM in forward/backward pass
2021-03-19 12:09:30 | WARNING | fairseq.trainer | OOM: Ran out of memory with exception: CUDA out of memory. Tried to allocate 600.00 MiB (GPU 0; 10.92 GiB total capacity; 8.55 GiB already allocated; 385.56 MiB free; 9.09 GiB reserved in total by PyTorch)
2021-03-19 12:09:30 | WARNING | fairseq.trainer | |===========================================================================|
|                  PyTorch CUDA memory summary, device ID 0                 |
|---------------------------------------------------------------------------|
|            CUDA OOMs: 51           |        cudaMalloc retries: 66        |
|===========================================================================|
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      |    6951 MB |    9130 MB |   16478 GB |   16471 GB |
|       from large pool |    6938 MB |    9118 MB |   16447 GB |   16441 GB |
|       from small pool |      12 MB |      16 MB |      30 GB |      30 GB |
|---------------------------------------------------------------------------|
| Active memory         |    6951 MB |    9130 MB |   16478 GB |   16471 GB |
|       from large pool |    6938 MB |    9118 MB |   16447 GB |   16441 GB |
|       from small pool |      12 MB |      16 MB |      30 GB |      30 GB |
|---------------------------------------------------------------------------|
| GPU reserved memory   |    9308 MB |    9526 MB |  396082 MB |  386774 MB |
|       from large pool |    9294 MB |    9508 MB |  395864 MB |  386570 MB |
|       from small pool |      14 MB |      18 MB |     218 MB |     204 MB |
|---------------------------------------------------------------------------|
| Non-releasable memory |  569634 KB |     770 MB |    1982 GB |    1982 GB |
|       from large pool |  568358 KB |     766 MB |    1946 GB |    1946 GB |
|       from small pool |    1276 KB |      12 MB |      36 GB |      36 GB |
|---------------------------------------------------------------------------|
| Allocations           |     441    |     688    |    1162 K  |    1162 K  |
|       from large pool |     140    |     150    |     143 K  |     143 K  |
|       from small pool |     301    |     548    |    1019 K  |    1019 K  |
|---------------------------------------------------------------------------|
| Active allocs         |     441    |     688    |    1162 K  |    1162 K  |
|       from large pool |     140    |     150    |     143 K  |     143 K  |
|       from small pool |     301    |     548    |    1019 K  |    1019 K  |
|---------------------------------------------------------------------------|
| GPU reserved segments |      99    |     102    |    1279    |    1180    |
|       from large pool |      92    |      93    |    1170    |    1078    |
|       from small pool |       7    |       9    |     109    |     102    |
|---------------------------------------------------------------------------|
| Non-releasable allocs |     103    |     120    |  547671    |  547568    |
|       from large pool |      78    |      80    |   60015    |   59937    |
|       from small pool |      25    |      44    |  487656    |  487631    |
|===========================================================================|

nvidia Driver Version: 460.32.03

environment

blas                      1.0                         mkl  
mkl                       2020.1                      217  
mkl-service               2.3.0            py37he904b0f_0  
mkl_fft                   1.1.0            py37h23d657b_0  
mkl_random                1.1.1            py37h0573a6f_0  
torch                     1.7.1+cu101              pypi_0    pypi
torchaudio                0.7.2                    pypi_0    pypi
torchvision               0.8.2+cu101              pypi_0    pypi

I reduced the batch_size to 1 and --empty-cache-freq to 1; still, OOM happens in the middle of training.

TypeError: get_asr_dataset_from_json() got an unexpected keyword argument 'combined'

๐Ÿ› Bug

The following error was encountered while loading a checkpoint file.
TypeError: get_asr_dataset_from_json() got an unexpected keyword argument 'combined'

To Reproduce

Steps to reproduce the behavior (always include the command you ran):
See Error:

Traceback (most recent call last):
File "tests/test_export_asr.py", line 26, in test_jit_and_export_lstm
'dict':'units.txt'})
File "src/fairseq/fairseq/checkpoint_utils.py", line 273, in load_model_ensemble
state,
File "/src/fairseq/fairseq/checkpoint_utils.py", line 319, in load_model_ensemble_and_task
task = tasks.setup_task(cfg.task)
File "src/fairseq/fairseq/tasks/init.py", line 44, in setup_task
return task.setup_task(cfg, **kwargs)
File "src/fairseq/espresso/tasks/speech_recognition.py", line 271, in setup_task
src_dataset = get_asr_dataset_from_json(data_path, cfg.gen_subset, tgt_dict, combined=False).src
TypeError: get_asr_dataset_from_json() got an unexpected keyword argument 'combined'
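
A guess at what changed (unverified against the current source): the speech_recognition_hybrid traceback further down this page calls the same helper with combine=False, so the keyword may simply have been renamed from combined to combine:

    # unverified guess, mirroring speech_recognition_hybrid.py elsewhere on this page
    src_dataset = get_asr_dataset_from_json(
        data_path, cfg.gen_subset, tgt_dict, combine=False
    ).src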

Expected behavior

No error when loading the checkpoint file.

Environment

  • fairseq Version (e.g., 1.0 or master): 1.10

Using wav2vec with Espresso

Hi

Wav2vec is included under examples. Can it be used with Espresso and are there any examples where features from hdf5 files are used in Espresso?

All the best

Instructions for Training from Scratch?

Hi,
Thanks for releasing this code.
I am trying to do ASR for Gujarati (an Indian language) and have custom labelled data. It would be great if you could release a README file for:

  1. how to train models from scratch
  2. how to perform inference from scratch

Thanks,
Kalpit

I tried to run a librispeech recipe but a word error rate remains very large.

What is your question?

I tried to run a librispeech recipe (examples/asr_librispeech/run.sh), but the word error rate remains very high (around 100%) in "Stage 8: Model Training" despite 30 epochs.
I think one possible cause is a difference in the execution environment.

What's your environment?

My environment is as follows.

  • fairseq Version (e.g., 1.0 or master): 0.9.0
  • PyTorch Version (e.g., 1.0): 1.4.0
  • OS (e.g., Linux): Ubuntu 18.04.3 LTS
  • How you installed fairseq (pip, source): pip
  • Python version: 3.6.5
  • CUDA/cuDNN version: 10.0.130 / libcudnn.so.7.3.0
  • GPU models and configuration: Tesla V100-SXM2-16GB
$ python collect_env.py 
Collecting environment information...
PyTorch version: 1.4.0
Is debug build: No
CUDA used to build PyTorch: 10.0

OS: Ubuntu 18.04.3 LTS
GCC version: (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0
CMake version: version 3.10.2

Python version: 3.6
Is CUDA available: Yes
CUDA runtime version: 10.0.130
GPU models and configuration: GPU 0: Tesla V100-SXM2-16GB
Nvidia driver version: 440.33.01
cuDNN version: /usr/lib/x86_64-linux-gnu/libcudnn.so.7.3.0

Versions of relevant libraries:
[pip] numpy==1.18.1
[pip] torch==1.4.0
[conda] blas                      1.0                         mkl  
[conda] mkl                       2020.0                      166  
[conda] mkl-service               2.3.0            py36he904b0f_0  
[conda] mkl_fft                   1.0.15           py36ha843d7b_0  
[conda] mkl_random                1.1.0            py36hd6b4f25_0  
[conda] pytorch                   1.4.0           py3.6_cuda10.0.130_cudnn7.6.3_0    pytorch
[conda] torch                     1.4.0                    pypi_0    pypi

The commit hash of espresso is f933e8c.
An output log is as follows, but I can't find any problem in it.

2020-03-08 22:27:56 | INFO | espresso.criterions.label_smoothed_cross_entropy_v2 | sample REF: I SEE I MUST GET SETTLED QUICKLY SO THAT I SHALL HAVE THE POWER TO RESTRAIN YOU THEY ROLLICKED FORTH THEN AND BOUGHT SEVERAL THINGS A BIG STEAMER RUG FOR THE CAR A PAIR OF LONG GRAY MOCHA GLOVES TO MATCH THE HAND BAG A SILK UMBRELLA
2020-03-08 22:27:56 | INFO | espresso.criterions.label_smoothed_cross_entropy_v2 | sample PRD: AND HAVE THAT' BE A IN AND I I CAN BE TO PLEASURE OF GET THE I ARE UPED AND AND THE THERE THE OF ANDPIECEGERER ANDG AND THE SHIPS FEW OF SHOES WHITE HAIRSACKS WHICH MAN OF PAIR HANDKERCHIEF
2020-03-08 23:13:04 | INFO | valid | epoch 025 | valid on 'valid' subset | loss 6.804 | nll_loss 5.856 | wer 109.669 | cer 100.424 | ppl 57.94 | wps 822.3 | wpb 715.6 | bsz 29.1 | num_updates 414000 | best_wer 96.7413
2020-03-08 23:13:27 | INFO | fairseq.checkpoint_utils | saved checkpoint exp/lstm/checkpoint_25_414000.pt (epoch 25 @ 414000 updates, score 109.6687) (writing took 23.222585418028757 seconds)
2020-03-08 23:13:28 | INFO | espresso.criterions.label_smoothed_cross_entropy_v2 | sample REF: CONTINUED DUNCAN SPEAKING SLOWLY AND USING THE SIMPLEST FRENCH OF WHICH HE WAS THE MASTER TO BELIEVE THAT NONE OF THIS WISE AND BRAVE NATION UNDERSTAND THE LANGUAGE THAT THE GRAND MONARQUE USES WHEN HE TALKS TO HIS CHILDREN
2020-03-08 23:13:28 | INFO | espresso.criterions.label_smoothed_cross_entropy_v2 | sample PRD: AND THECAN WITH IN AND IING THE WORDSST WAY LANGUAGE THE HE WAS THE MOST OF BE THAT HE OF THE WAS AND IN MAN COULDS LANGUAGE OF HE WORLDESTITTSS HE ISS TO THE PEOPLE
2020-03-08 23:23:27 | INFO | train | epoch 025:  15999 / 16601 loss=6.624, nll_loss=5.649, ppl=50.18, wps=537.4, ups=0.75, wpb=716.6, bsz=16.9, num_updates=414424, lr=1e-05, gnorm=0.399, clip=0, oom=0, train_wall=13155, wall=554591
2020-03-08 23:34:58 | INFO | train | epoch 025 | loss 6.624 | nll_loss 5.649 | ppl 50.18 | wps 539.9 | ups 0.75 | wpb 716.2 | bsz 16.9 | num_updates 415025 | lr 1e-05 | gnorm 0.4 | clip 0 | oom 0 | train_wall 13641 | wall 555281
2020-03-08 23:37:49 | INFO | valid | epoch 025 | valid on 'valid' subset | loss 6.803 | nll_loss 5.856 | wer 108.198 | cer 99.5523 | ppl 57.92 | wps 822.3 | wpb 715.6 | bsz 29.1 | num_updates 415025 | best_wer 96.7413
2020-03-08 23:38:12 | INFO | fairseq.checkpoint_utils | saved checkpoint exp/lstm/checkpoint25.pt (epoch 25 @ 415025 updates, score 108.1984) (writing took 23.870150407077745 seconds)

Because librispeech is a very large dataset, I struggle with debugging.
Could you give me any hint?

I think that if espresso had a recipe for a small dataset, like the an4 recipe in espnet, a trial run would be easier.
Do you have any plan to implement a recipe for a small dataset?

thanks.

SIGSEGV while running train.py on a multi GPU setup

I have set up an Ubuntu 18.04 environment with 4 CPUs and 4 GPUs to run the librispeech dataset training.

The prepare step went through fine.

But when I launch the training using:
python train.py ./librispeech-workdir/preprocessed-data/ --save-dir ./librispeech-workdir/train-output/ --max-epoch 80 --task speech_recognition_e --arch vggtransformer_2 --optimizer adadelta --lr 1.0 --adadelta-eps 1e-8 --adadelta-rho 0.95 --clip-norm 10.0 --max-tokens 5000 --log-format json --log-interval 1 --criterion cross_entropy_acc --user-dir examples/speech_recognition/

I get the following error right at the outset:

| model vggtransformer_2, criterion CrossEntropyWithAccCriterion
| num. model params: 315190057 (num. trained: 315190057)
| training on 4 GPUs
| max tokens per GPU = 5000 and max sentences per GPU = None
| no existing checkpoint found ./librispeech-workdir/train-output/checkpoint_last.pt
| loading train data for epoch 0
Traceback (most recent call last):
File "train.py", line 343, in
cli_main()
File "train.py", line 335, in cli_main
nprocs=args.distributed_world_size,
File "/home/chandraka/anaconda3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn
while not spawn_context.join():
File "/home/chandraka/anaconda3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 107, in join
(error_index, name)
Exception: process 0 terminated with signal SIGSEGV

Unable to proceed in the absence of any clues as to what might be causing it, etc.

Please help

It starts out with


| distributed init (rank 3): tcp://localhost:15160
| distributed init (rank 0): tcp://localhost:15160
| distributed init (rank 2): tcp://localhost:15160
| distributed init (rank 1): tcp://localhost:15160
| initialized host espresso-2 as rank 2
| initialized host espresso-2 as rank 1
| initialized host espresso-2 as rank 3
| initialized host espresso-2 as rank 0
Namespace(adadelta_eps=1e-08, adadelta_rho=0.95, anneal_eps=False, arch='vggtransformer_2', best_checkpoint_metric='loss', bpe=None,
bucket_cap_mb=25, clip_norm=10.0, conv_dec_config='((256, 3, True),) * 4', cpu=False, criterion='cross_entropy_acc', curriculum=0,
data='./librispeech-workdir/preprocessed-data/', dataset_impl=None, ddp_backend='c10d', device_id=0, disable_validation=False,
distributed_backend='nccl', distributed_init_method='tcp://localhost:15160', distributed_no_spawn=False, distributed_port=-1,
distributed_rank=0, distributed_world_size=4, empty_cache_freq=0, enc_output_dim=1024, fast_stat_sync=False, find_unused_parameters=False,
fix_batches_to_gpus=False, fixed_validation_seed=None, force_anneal=None, fp16=False, fp16_init_scale=128, fp16_scale_tolerance=0.0,
fp16_scale_window=None, input_feat_per_channel=80, keep_interval_updates=-1, keep_last_epochs=-1, log_format='json', log_interval=1,
lr=[1.0], lr_scheduler='fixed', lr_shrink=0.1, max_epoch=80, max_sentences=None, max_sentences_valid=None, max_tokens=5000,
max_tokens_valid=5000, max_update=0, maximize_best_checkpoint_metric=False, memory_efficient_fp16=False, min_loss_scale=0.0001,
min_lr=-1, no_epoch_checkpoints=False, no_last_checkpoints=False, no_progress_bar=False, no_save=False, no_save_optimizer_state=False,
num_workers=1, optimizer='adadelta', optimizer_overrides='{}', required_batch_size_multiple=8, reset_dataloader=False,
reset_lr_scheduler=False, reset_meters=False, reset_optimizer=False, restore_file='checkpoint_last.pt',
save_dir='./librispeech-workdir/train-output/', save_interval=1, save_interval_updates=0, seed=1, sentence_avg=False, silence_token='▁',
skip_invalid_size_inputs_valid_test=False, task='speech_recognition_e', tbmf_wrapper=False, tensorboard_logdir='', tgt_embed_dim=512,
threshold_loss_scale=None, tokenizer=None, train_subset='train', transformer_dec_config='((1024, 16, 4096, True, 0.15, 0.15, 0.15),) * 6',
transformer_enc_config='((1024, 16, 4096, True, 0.15, 0.15, 0.15),) * 16', update_freq=[1], use_bmuf=False,
user_dir='examples/speech_recognition/', valid_subset='valid', validate_interval=1,
vggblock_enc_config='[(64, 3, 2, 2, True), (128, 3, 2, 2, True)]', warmup_updates=0, weight_decay=0.0)
| dictionary: 5001 types


(I have had to rename the speech_recognition task to speech_recognition_e as there is a similarly named task in fairseq directory as well)

GPU Distributed Data Parallel Error

I see that in the run.sh script for asr_swbd, in stage 6 (model training), the distributed_world_size parameter is set to ngpus (=1 in my case). This causes the following ValueError during GPU training:

File "M/espresso/fairseq/distributed_utils.py", line 73, in distributed_init                                                                                                                         raise ValueError('Cannot initialize distributed with distributed_world_size=1')                 

Is there a fix to this issue?
I am using the latest Espresso from the repo, Ubuntu 18.04, Slurm Scheduler, NVIDIA 1080 Ti if that helps.

I have set (exported) CUDA_VISIBLE_DEVICES, and the --free-gpu option.
If more details are required, I would be happy to provide them.

SWBD Recipe Error

Hi, I am trying to run the SWBD recipe on my local machine. I am getting errors at Stage 2 of the run script, building the dictionary and text tokenization. The error seems to be coming from the "tokenizing text for train/valid/test sets..." stage running spm_encode.py.

Code

This is the full shell output:

sentencepiece_trainer.cc(116) LOG(INFO) Running command: --bos_id=-1 --pad_id=0 --eos_id=1 --unk_id=2 --input=data/lang/input --vocab_size=1003 --character_coverage=1.0 --model_type=unigram --model_prefix=data/lang/train_nodup_unigram1000 --input_sentence_size=10000000 --user_defined_symbols=[laughter],[noise],[vocalized-noise]
sentencepiece_trainer.cc(49) LOG(INFO) Starts training with :
TrainerSpec {
  input: data/lang/input
  input_format:
  model_prefix: data/lang/train_nodup_unigram1000
  model_type: UNIGRAM
  vocab_size: 1003
  self_test_sample_size: 0
  character_coverage: 1
  input_sentence_size: 10000000
  shuffle_input_sentence: 1
  seed_sentencepiece_size: 1000000
  shrinking_factor: 0.75
  max_sentence_length: 4192
  num_threads: 16
  num_sub_iterations: 2
  max_sentencepiece_length: 16
  split_by_unicode_script: 1
  split_by_number: 1
  split_by_whitespace: 1
  treat_whitespace_as_suffix: 0
  user_defined_symbols: [laughter]
  user_defined_symbols: [noise]
  user_defined_symbols: [vocalized-noise]
  hard_vocab_limit: 1
  use_all_vocab: 0
  unk_id: 2
  bos_id: -1
  eos_id: 1
  pad_id: 0
  unk_piece: <unk>
  bos_piece: <s>
  eos_piece: </s>
  pad_piece: <pad>
  unk_surface:  ⁇
}
NormalizerSpec {
  name: nmt_nfkc
  add_dummy_prefix: 1
  remove_extra_whitespaces: 1
  escape_whitespaces: 1
  normalization_rule_tsv:
}

trainer_interface.cc(267) LOG(INFO) Loading corpus: data/lang/input
trainer_interface.cc(139) LOG(INFO) Loaded 1000000 lines
trainer_interface.cc(139) LOG(INFO) Loaded 2000000 lines
trainer_interface.cc(114) LOG(WARNING) Too many sentences are loaded! (2416025), which may slow down training.
trainer_interface.cc(116) LOG(WARNING) Consider using --input_sentence_size=<size> and --shuffle_input_sentence=true.
trainer_interface.cc(119) LOG(WARNING) They allow to randomly sample <size> sentences from the entire corpus.
trainer_interface.cc(315) LOG(INFO) Loaded all 2416025 sentences
trainer_interface.cc(330) LOG(INFO) Adding meta_piece: <pad>
trainer_interface.cc(330) LOG(INFO) Adding meta_piece: </s>
trainer_interface.cc(330) LOG(INFO) Adding meta_piece: <unk>
trainer_interface.cc(330) LOG(INFO) Adding meta_piece: [laughter]
trainer_interface.cc(330) LOG(INFO) Adding meta_piece: [noise]
trainer_interface.cc(330) LOG(INFO) Adding meta_piece: [vocalized-noise]
trainer_interface.cc(335) LOG(INFO) Normalizing sentences...
trainer_interface.cc(384) LOG(INFO) all chars count=120465092
trainer_interface.cc(392) LOG(INFO) Done: 100% characters are covered.
trainer_interface.cc(402) LOG(INFO) Alphabet size=43
trainer_interface.cc(403) LOG(INFO) Final character coverage=1
trainer_interface.cc(435) LOG(INFO) Done! preprocessed 2416025 sentences.
unigram_model_trainer.cc(129) LOG(INFO) Making suffix array...
unigram_model_trainer.cc(133) LOG(INFO) Extracting frequent sub strings...
unigram_model_trainer.cc(184) LOG(INFO) Initialized 166028 seed sentencepieces
trainer_interface.cc(441) LOG(INFO) Tokenizing input sentences with whitespace: 2416025
trainer_interface.cc(451) LOG(INFO) Done! 69957
unigram_model_trainer.cc(470) LOG(INFO) Using 69957 sentences for EM training
unigram_model_trainer.cc(486) LOG(INFO) EM sub_iter=0 size=59852 obj=9.23769 num_tokens=130093 num_tokens/piece=2.17358
unigram_model_trainer.cc(486) LOG(INFO) EM sub_iter=1 size=44412 obj=7.29956 num_tokens=132354 num_tokens/piece=2.98014
unigram_model_trainer.cc(486) LOG(INFO) EM sub_iter=0 size=33308 obj=7.24442 num_tokens=141637 num_tokens/piece=4.25234
unigram_model_trainer.cc(486) LOG(INFO) EM sub_iter=1 size=33303 obj=7.23651 num_tokens=141660 num_tokens/piece=4.25367
unigram_model_trainer.cc(486) LOG(INFO) EM sub_iter=0 size=24977 obj=7.21871 num_tokens=158375 num_tokens/piece=6.34083
unigram_model_trainer.cc(486) LOG(INFO) EM sub_iter=1 size=24977 obj=7.21644 num_tokens=158399 num_tokens/piece=6.34179
unigram_model_trainer.cc(486) LOG(INFO) EM sub_iter=0 size=18732 obj=7.21162 num_tokens=175442 num_tokens/piece=9.3659
unigram_model_trainer.cc(486) LOG(INFO) EM sub_iter=1 size=18732 obj=7.20821 num_tokens=175404 num_tokens/piece=9.36387
unigram_model_trainer.cc(486) LOG(INFO) EM sub_iter=0 size=14049 obj=7.21798 num_tokens=192101 num_tokens/piece=13.6736
unigram_model_trainer.cc(486) LOG(INFO) EM sub_iter=1 size=14049 obj=7.21295 num_tokens=192059 num_tokens/piece=13.6707
unigram_model_trainer.cc(486) LOG(INFO) EM sub_iter=0 size=10536 obj=7.23918 num_tokens=207654 num_tokens/piece=19.709
unigram_model_trainer.cc(486) LOG(INFO) EM sub_iter=1 size=10536 obj=7.23244 num_tokens=207609 num_tokens/piece=19.7047
unigram_model_trainer.cc(486) LOG(INFO) EM sub_iter=0 size=7902 obj=7.27241 num_tokens=221580 num_tokens/piece=28.041
unigram_model_trainer.cc(486) LOG(INFO) EM sub_iter=1 size=7902 obj=7.26387 num_tokens=221484 num_tokens/piece=28.0289
unigram_model_trainer.cc(486) LOG(INFO) EM sub_iter=0 size=5926 obj=7.32839 num_tokens=234743 num_tokens/piece=39.6124
unigram_model_trainer.cc(486) LOG(INFO) EM sub_iter=1 size=5926 obj=7.31716 num_tokens=234693 num_tokens/piece=39.6039
unigram_model_trainer.cc(486) LOG(INFO) EM sub_iter=0 size=4444 obj=7.40817 num_tokens=248571 num_tokens/piece=55.9341
unigram_model_trainer.cc(486) LOG(INFO) EM sub_iter=1 size=4444 obj=7.39317 num_tokens=248418 num_tokens/piece=55.8996
unigram_model_trainer.cc(486) LOG(INFO) EM sub_iter=0 size=3333 obj=7.50897 num_tokens=262750 num_tokens/piece=78.8329
unigram_model_trainer.cc(486) LOG(INFO) EM sub_iter=1 size=3333 obj=7.49001 num_tokens=262534 num_tokens/piece=78.7681
unigram_model_trainer.cc(486) LOG(INFO) EM sub_iter=0 size=2499 obj=7.64161 num_tokens=276859 num_tokens/piece=110.788
unigram_model_trainer.cc(486) LOG(INFO) EM sub_iter=1 size=2499 obj=7.61733 num_tokens=276640 num_tokens/piece=110.7
unigram_model_trainer.cc(486) LOG(INFO) EM sub_iter=0 size=1874 obj=7.80273 num_tokens=292799 num_tokens/piece=156.243
unigram_model_trainer.cc(486) LOG(INFO) EM sub_iter=1 size=1874 obj=7.77333 num_tokens=292543 num_tokens/piece=156.106
unigram_model_trainer.cc(486) LOG(INFO) EM sub_iter=0 size=1405 obj=7.99379 num_tokens=309225 num_tokens/piece=220.089
unigram_model_trainer.cc(486) LOG(INFO) EM sub_iter=1 size=1405 obj=7.95503 num_tokens=308821 num_tokens/piece=219.801
unigram_model_trainer.cc(486) LOG(INFO) EM sub_iter=0 size=1103 obj=8.15973 num_tokens=321388 num_tokens/piece=291.376
unigram_model_trainer.cc(486) LOG(INFO) EM sub_iter=1 size=1103 obj=8.12422 num_tokens=321274 num_tokens/piece=291.273
trainer_interface.cc(507) LOG(INFO) Saving model: data/lang/train_nodup_unigram1000.model
trainer_interface.cc(531) LOG(INFO) Saving vocabs: data/lang/train_nodup_unigram1000.vocab
Traceback (most recent call last):
  File "../../scripts/spm_encode.py", line 99, in <module>
    main()
  File "../../scripts/spm_encode.py", line 90, in main
    print(" ".join(enc_line), file=output_h)
UnicodeEncodeError: 'ascii' codec can't encode character '\u2581' in position 0: ordinal not in range(128)

What have you tried?

My setup should be ok as I have been running the WSJ recipe without issue but I notice that a different script is used here for the tokenizing. Any help or advice would be great!
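
For what it's worth, the UnicodeEncodeError comes from printing the sentencepiece meta symbol '\u2581' ("▁") to an output handle whose default encoding is ASCII; a hedged workaround sketch (the file name is hypothetical, and exporting PYTHONIOENCODING=utf-8 or a UTF-8 locale before running the recipe is an alternative):

    import io
    import sys

    # Force UTF-8 on the output side so "\u2581" can be written regardless of locale.
    output_h = open("tokenized.txt", "w", encoding="utf-8")  # hypothetical output path
    # or, when writing to stdout:
    sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding="utf-8")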

Error in training stage of run_chain_e2e_bichar.sh: 'odict_items' object is not an iterator

I am trying to train a model on a custom dataset.
All the data preparation stages are done flawlessly (at least it seems so).
But at the beginning of the training stage (stage=6) of run_chain_e2e_bichar.sh I get the following error:

<class 'odict_items'> Traceback (most recent call last): File "../../fairseq_cli/train.py", line 510, in <module> cli_main() File "../../fairseq_cli/train.py", line 503, in cli_main distributed_utils.call_main(cfg, main) File "../../fairseq/distributed/utils.py", line 369, in call_main main(cfg, **kwargs) File "../../fairseq_cli/train.py", line 86, in main task = tasks.setup_task(cfg.task) File "../../espresso/fairseq/tasks/__init__.py", line 44, in setup_task return task.setup_task(cfg, **kwargs) File "../../espresso/espresso/tasks/speech_recognition_hybrid.py", line 432, in setup_task src_dataset = get_asr_dataset_from_json(data_path, split, dictionary, combine=False).src File "../../espresso/espresso/tasks/speech_recognition_hybrid.py", line 236, in get_asr_dataset_from_json if "feat" in next(loaded_json.items()): TypeError: 'odict_items' object is not an iterator

I also tried the biphone training (run_chain_e2e.sh) and got exactly the same error.
Any ideas on what the problem is or what I am doing wrong are appreciated.
Thank you.
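
For context, the error itself is plain Python 3 behaviour: dict.items() returns a view object, and next() needs an explicit iterator; a minimal illustration (the actual fix in espresso may look different):

    from collections import OrderedDict

    loaded_json = OrderedDict([("utt1", {"feat": "feats.ark:12", "text": "hello"})])

    # next(loaded_json.items())       # TypeError: 'odict_items' object is not an iterator
    first_key, first_entry = next(iter(loaded_json.items()))  # works: ('utt1', {...})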

  • espresso Version: master
  • PyTorch Version: 1.8.1
  • OS: CentOS Linux 7
  • Python version: 3.6

SWBD ASR Expected Results- WER

Hi, can the expected Switchboard test set WERs using the code be confirmed?

I ran the code as per the instructions with no errors and with default hyperparams, and was able to obtain WERs of 9.5% on the SWBD test set, 14.5% on Eval2000, and 19.5% on Callhome using the provided recipe, but I couldn't match the performance reported in the paper (https://arxiv.org/pdf/1909.08723.pdf).

Would it be possible to share intermediate results such as perplexity on Subword LM testing, maybe loss and WER plots?

Here are some of mine for verification:
ASR Decoding Results:

Callhm

WER=19.5%, Sub=12.6%, Ins=3.5%, Del=3.4%

Eval2000

WER=14.5%, Sub=9.2%, Ins=2.5%, Del=2.7%

LM Results:
Eval2000:

Evaluated 60377 tokens in 2.4s (24979.13 tokens/s)
Loss: 3.6210, Perplexity: 37.37

on RT03 -

Evaluated 109920 tokens in 3.9s (28378.85 tokens/s)                                                                                                                                                                    

Loss: 3.7787, Perplexity: 43.76

Thanks

tensorized_lookahead_language_model SyntaxError

Hi~ I was running the asr_wsj recipe and got SyntaxError: invalid syntax.

this is the info.

File "/share/nas165/QAQ/espresso/fairseq/models/tensorized_lookahead_language_model.py", line 61
    self.lm_decoder: FairseqIncrementalDecoder = word_lm.decoder
                   ^
SyntaxError: invalid syntax
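
For context, the flagged line is a PEP 526 annotated assignment, which is only valid syntax on Python 3.6 and newer; on older interpreters the same statement has to drop the inline annotation, e.g. (illustration only, not the actual espresso code):

    class Demo:
        def __init__(self, word_lm):
            # equivalent without PEP 526 syntax (valid on Python < 3.6)
            self.lm_decoder = word_lm.decoder  # type: FairseqIncrementalDecoder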

Could anyone help me?
tyvm

ONNX exportation of speech_lstm based model

โ“ Questions and Help

Not able to ONNX-export a speech_lstm based model.

What is your question?

Is the speech_lstm model expected to be ONNX exportable? Currently I get the error shown below:

Code

        torch.onnx.export(model, dummy_input,  
                          f.name+".onnx", verbose=True, opset_version=12,
                          input_names=input_names, output_names=output_names)

Traceback (most recent call last):
File "export_77/../test_scripts/test_export_asr.py", line 67, in _test_save_and_onnx_model
input_names=input_names, output_names=output_names)
File "export_77/.env/export/lib/python3.6/site-packages/torch/onnx/init.py", line 230, in export
custom_opsets, enable_onnx_checker, use_external_data_format)
File "export_77/.env/export/lib/python3.6/site-packages/torch/onnx/utils.py", line 91, in export
use_external_data_format=use_external_data_format)
File "export_77/.env/export/lib/python3.6/site-packages/torch/onnx/utils.py", line 639, in _export
dynamic_axes=dynamic_axes)
File "export_77/.env/export/lib/python3.6/site-packages/torch/onnx/utils.py", line 421, in _model_to_graph
dynamic_axes=dynamic_axes, input_names=input_names)
File "export_77/.env/export/lib/python3.6/site-packages/torch/onnx/utils.py", line 203, in _optimize_graph
graph = torch._C._jit_pass_onnx(graph, operator_export_type)
File "export_77/.env/export/lib/python3.6/site-packages/torch/onnx/init.py", line 263, in _run_symbolic_function
return utils._run_symbolic_function(*args, **kwargs)
File "export_77/.env/export/lib/python3.6/site-packages/torch/onnx/utils.py", line 934, in _run_symbolic_function
return symbolic_fn(g, *inputs, **attrs)
File "export_77/.env/export/lib/python3.6/site-packages/torch/onnx/symbolic_helper.py", line 133, in wrapper
return fn(g, *args, **kwargs)
File "export_77/.env/export/lib/python3.6/site-packages/torch/onnx/symbolic_opset9.py", line 441, in transpose
axes[dim0], axes[dim1] = axes[dim1], axes[dim0]
IndexError: list index out of range

What's your environment?

  • PyTorch Version (e.g., 1.0): 1.7.0
  • OS (e.g., Linux): Linux
  • How you installed fairseq (pip, source): via espresso
  • Python version: 3.6

How to train speech transformer models using wsj?

When I was trying to use espresso/espresso/models/speech_transformer.py,
espresso stopped with this log:

TypeError: __init__() missing 1 required positional argument: 'decoder'

Thus, I added "args" to line 144 like below.

return SpeechTransformerModel(args, encoder=encoder, decoder=decoder)

However, other errors continued to occur.

TypeError: forward() got an unexpected keyword argument 'epoch'
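
For context, the epoch keyword comes from the criterion: label_smoothed_cross_entropy_v2 calls the model as model(**sample["net_input"], epoch=self.epoch) (visible in a traceback in the token_text issue further down this page), so a model used with that criterion needs a forward() that accepts or swallows the extra argument. A rough sketch of a tolerant signature (illustrative only, not the actual espresso signature):

    class SpeechTransformerModelSketch:
        # Accept the extra kwarg the criterion passes (used for scheduled sampling).
        def forward(self, src_tokens, src_lengths, prev_output_tokens, epoch=1, **kwargs):
            return None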

In order to use speech_transformer,
I modified "Stage 9" in espresso/examples/asr_wsj/run.sh,

  CUDA_VISIBLE_DEVICES=$free_gpu speech_train.py data --task speech_recognition_espresso --seed 1 --user-dir espresso \
    --log-interval 400 --log-format simple --print-training-sample-interval 1000 \
    --num-workers 0 --max-tokens 24000 --max-sentences 32 --curriculum 2 \
    --valid-subset $valid_subset --max-sentences-valid 64 --ddp-backend no_c10d \
    --distributed-world-size $ngpus --distributed-port $(if [ $ngpus -gt 1 ]; then echo 100; else echo -1; fi) \
    --max-epoch 70 --optimizer adam --lr 0.001 --weight-decay 0.0 \
    --lr-scheduler reduce_lr_on_plateau_v2 --lr-shrink 0.5 --start-reduce-lr-epoch 11 \
    --save-dir $dir --restore-file checkpoint_last.pt --save-interval-updates 400 \
    --keep-interval-updates 5 --keep-last-epochs 5 --validate-interval 1 --best-checkpoint-metric wer \
    --arch speech_transformer_wsj --criterion label_smoothed_cross_entropy_v2 \
    --label-smoothing 0.05 --smoothing-type temporal \
    --no-scale-embedding \
    --dict $dict --non-lang-syms $nlsyms \
    --max-source-positions 9999 --max-target-positions 999 $opts 2>&1 | tee $log_file

And also I registered new model "speech_transformer_wsj".
/home/sephiroce/open_source/espresso/espresso/models/speech_transformer.py

@register_model_architecture('speech_transformer', 'speech_transformer_wsj')
def speech_transformer_wsj(args):
    args.encoder_conv_channels = getattr(
        args, 'encoder_conv_channels', '[64, 64]',
    )
    args.encoder_conv_kernel_sizes = getattr(
        args, 'encoder_conv_kernel_sizes', '[(3, 3), (3, 3)]',
    )
    args.encoder_conv_strides = getattr(
        args, 'encoder_conv_strides', '[(2, 2), (2, 2)]',
    )
    args.encoder_layers = getattr(args, 'encoder_layers', 12)
    args.decoder_layers = getattr(args, 'decoder_layers', 6)
    args.encoder_embed_dim = getattr(args, 'encoder_embed_dim', 256)
    args.encoder_ffn_embed_dim = getattr(args, 'encoder_ffn_embed_dim', 2048)
    args.encoder_attention_heads = getattr(args, 'encoder_attention_heads', 4)
    args.encoder_normalize_before = getattr(args, 'encoder_normalize_before', True)
    args.decoder_embed_dim = getattr(args, 'decoder_embed_dim', 256)
    args.decoder_ffn_embed_dim = getattr(args, 'decoder_ffn_embed_dim', 2048)
    args.decoder_normalize_before = getattr(args, 'decoder_normalize_before', True)
    args.decoder_attention_heads = getattr(args, 'decoder_attention_heads', 4)
    args.dropout = getattr(args, 'dropout', 0.1)
    args.attention_dropout = getattr(args, 'attention_dropout', 0.1)
    args.activation_dropout = getattr(args, 'activation_dropout', 0.1)
    base_architecture(args)

Request for recipe for Librispeech 100hr

I have tried to run the recipe with the Librispeech 960-hour train set, and it works well.
But after reducing the data to Librispeech 100 hours, I found it does not work and gives WER > 100% on the validation set.
So I would like to know if there is any recipe for Librispeech 100 hours, or another train set with a similar amount of data, as a reference?

Thanks.

CUDA OOM issue

Hi, I was running the Libri example and got a CUDA OOM issue (with either 1 or 4 V100 GPUs). I tried the --empty-cache-freq flag; OOM still occurs eventually, although it takes longer.

Has anyone seen the same issue?

My setup: centos 7.5 | cuda 9.2 | python 3.6 | pytorch 1.3.0 | nccl 2.4.6

token_text as outputs

โ“ Questions and Help

Hello,

I'm a bit confused after reading the changes from #58. I am currently using token_text for my work, and I would prefer to continue using token_text instead of text, if possible (because I use special tags similar to <space>).
After reading the changes from #58 I have the impression that it's still possible to use token_text as outputs for ASR systems, is that right?

If so, what argument should be given to the training script?
With no additional argument, when training with JSON that includes token_text instead of text, I keep getting this error:

Traceback (most recent call last):
  File "/lium/home/vpelloin/miniconda3/envs/espresso/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
    fn(i, *args)
  File "/export/home/lium/vpelloin/git/espresso/fairseq/distributed/utils.py", line 328, in distributed_main
    main(cfg, **kwargs)
  File "/export/home/lium/vpelloin/git/espresso/fairseq_cli/train.py", line 176, in main
    valid_losses, should_stop = train(cfg, trainer, task, epoch_itr)
  File "/lium/home/vpelloin/miniconda3/envs/espresso/lib/python3.7/contextlib.py", line 74, in inner
    return func(*args, **kwds)
  File "/export/home/lium/vpelloin/git/espresso/fairseq_cli/train.py", line 287, in train
    log_output = trainer.train_step(samples)
  File "/lium/home/vpelloin/miniconda3/envs/espresso/lib/python3.7/contextlib.py", line 74, in inner
    return func(*args, **kwds)
  File "/export/home/lium/vpelloin/git/espresso/fairseq/trainer.py", line 674, in train_step
    ignore_grad=is_dummy_batch,
  File "/export/home/lium/vpelloin/git/espresso/fairseq/tasks/fairseq_task.py", line 476, in train_step
    loss, sample_size, logging_output = criterion(model, sample)
  File "/lium/home/vpelloin/miniconda3/envs/espresso/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/export/home/lium/vpelloin/git/espresso/espresso/criterions/label_smoothed_cross_entropy_v2.py", line 150, in forward
    net_output = model(**sample["net_input"], epoch=self.epoch)
  File "/lium/home/vpelloin/miniconda3/envs/espresso/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/export/home/lium/vpelloin/git/espresso/fairseq/distributed/module_proxy_wrapper.py", line 55, in forward
    return self.module(*args, **kwargs)
  File "/lium/home/vpelloin/miniconda3/envs/espresso/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/export/home/lium/vpelloin/git/espresso/fairseq/distributed/legacy_distributed_data_parallel.py", line 74, in forward
    return self.module(*inputs, **kwargs)
  File "/lium/home/vpelloin/miniconda3/envs/espresso/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
TypeError: forward() missing 1 required positional argument: 'prev_output_tokens'

Thank you so much for the incredible work you're doing with this tool!

Error found when running librispeech recipe with latest version of espresso

๐Ÿ› Bug

There are two issues after installing the latest version of espresso:

  1. The specaug parameter parsing error occurs once we enable the specaug function:
2020-11-11 12:04:42 | INFO | espresso.speech_train | --max-tokens is the maximum number of input frames in a batch
Traceback (most recent call last):
  File "/nfs/mercury-13/u20/cli/src/espresso-11112020/espresso/examples/asr_librispeech/../../espresso/speech_train.py", line 415, in <module>
    cli_main()
  File "/nfs/mercury-13/u20/cli/src/espresso-11112020/espresso/examples/asr_librispeech/../../espresso/speech_train.py", line 404, in cli_main
    cfg = convert_namespace_to_omegaconf(args)
  File "/nfs/mercury-13/u20/cli/src/espresso-11112020/espresso/fairseq/dataclass/utils.py", line 324, in convert_namespace_to_omegaconf
    composed_cfg = compose("config", overrides=overrides, strict=False)
  File "/nfs/mercury-13/u20/cli/miniconda3/envs/espresso-11112020/lib/python3.8/site-packages/hydra/experimental/compose.py", line 31, in compose
    cfg = gh.hydra.compose_config(
  File "/nfs/mercury-13/u20/cli/miniconda3/envs/espresso-11112020/lib/python3.8/site-packages/hydra/_internal/hydra.py", line 507, in compose_config
    cfg = self.config_loader.load_configuration(
  File "/nfs/mercury-13/u20/cli/miniconda3/envs/espresso-11112020/lib/python3.8/site-packages/hydra/_internal/config_loader_impl.py", line 151, in load_configuration
    return self._load_configuration(
  File "/nfs/mercury-13/u20/cli/miniconda3/envs/espresso-11112020/lib/python3.8/site-packages/hydra/_internal/config_loader_impl.py", line 180, in _load_configuration
    parsed_overrides = parser.parse_overrides(overrides=overrides)
  File "/nfs/mercury-13/u20/cli/miniconda3/envs/espresso-11112020/lib/python3.8/site-packages/hydra/core/override_parser/overrides_parser.py", line 95, in parse_overrides
    raise OverrideParseException(
hydra.errors.OverrideParseException: mismatched input 'W' expecting <EOF>
See https://hydra.cc/docs/next/advanced/override_grammar/basic for details
  2. It crashes in the model training step (step 8) without any error message:
2020-11-11 12:38:55 | INFO | espresso.speech_train | task: SpeechRecognitionEspressoTask
2020-11-11 12:38:55 | INFO | espresso.speech_train | model: SpeechLSTMModel
2020-11-11 12:38:55 | INFO | espresso.speech_train | criterion: LabelSmoothedCrossEntropyV2Criterion)
2020-11-11 12:38:55 | INFO | espresso.speech_train | num. model params: 159660204 (num. trained: 159660204)
2020-11-11 12:38:55 | INFO | fairseq.trainer | detected shared parameter: decoder.attention.query_proj.bias <- decoder.attention.value_proj.bias
2020-11-11 12:38:55 | INFO | espresso.speech_train | training on 1 devices (GPUs/TPUs)
2020-11-11 12:38:55 | INFO | espresso.speech_train | max tokens per GPU = 26000 and batch size per GPU = 24
2020-11-11 12:38:55 | INFO | fairseq.trainer | no existing checkpoint found exp/lstm_wsj.specaug.bpe1k/checkpoint_last.pt
2020-11-11 12:38:55 | INFO | fairseq.trainer | loading train data for epoch 1
2020-11-11 12:39:05 | INFO | espresso.tasks.speech_recognition | /nfs/mercury-13/u20/cli/src/espresso.latest/espresso/examples/asr_librispeech/data-bulgarian-bpe1k/train.json 33004 examples
./run.sh: line 259:  4839 Segmentation fault      CUDA_VISIBLE_DEVICES=$free_gpu speech_train.py $data_dir --task speech_recognition_espresso --seed 1 --log-interval $((8000/ngpus/update_freq)) --log-format simple --print-training-sample-interval $((4000/ngpus/update_freq)) --num-workers 0 --data-buffer-size 0 --max-tokens 26000 --batch-size 24 --curriculum 1 --empty-cache-freq 50 --valid-subset $valid_subset --batch-size-valid 48 --ddp-backend no_c10d --update-freq $update_freq --distributed-world-size $ngpus --optimizer adam --lr 0.001 --weight-decay 0.0 --clip-norm 2.0 --save-dir $dir --restore-file checkpoint_last.pt --save-interval-updates $((6000/ngpus/update_freq)) --keep-interval-updates 3 --keep-last-epochs 5 --validate-interval 1 --best-checkpoint-metric wer --criterion label_smoothed_cross_entropy_v2 --label-smoothing 0.1 --smoothing-type uniform --dict $dict --bpe sentencepiece --sentencepiece-model ${sentencepiece_model}.model --max-source-positions 9999 --max-target-positions 999 $opts --specaugment-config "$specaug_config" 2>&1

To Reproduce

Steps to reproduce the behavior (always include the command you ran):

  1. Run cmd: ./run.sh
  2. See error: listed above

Expected behavior

Able to train model with the recipe

Environment

  • fairseq Version (e.g., 1.0 or master): 1.0.0a0+d966482
  • PyTorch Version (e.g., 1.0): 1.4.0
  • OS (e.g., Linux): CentOS Linux release 7.7.1908 (Core)
  • How you installed fairseq (pip, source): pip install from source
  • Build command you used (if compiling from source): pip install --editable .
  • Python version: 3.8.5
  • CUDA/cuDNN version: py3.8_cuda10.0.130_cudnn7.6.3_0
  • GPU models and configuration:
  • Any other relevant information:

Additional context

Non-ASCII characters in sample PRD and REF

Hi,
While training the swbd recipe, the log on screen shows:

| sample PRD: \xe2\x96\x81maybe\xe2\x96\x81c'
| sample REF: b'\xe2\x96\x81[vocalized-noise]'

Also the WER on swbd val set is very large. Is this normal? Thanks in advance.
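
For what it's worth, the escaped bytes in those lines are just the UTF-8 encoding of the sentencepiece meta symbol U+2581 "▁":

    # Decoding the logged byte string gives back the sentencepiece pieces:
    print(b'\xe2\x96\x81maybe\xe2\x96\x81c'.decode("utf-8"))  # -> ▁maybe▁c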

Transformer model recipe for Librispeech is not working

๐Ÿ› Bug

Running the librispeech training with the speech_transformer_librispeech architecture fails.

To Reproduce
Steps to reproduce the behavior (always include the command you ran):

I change the script "run.sh" under the folder "<espresso_root>/examples/asr_librispeech" to use the arch "speech_transformer_librispeech".

I changed the "run.sh" on below lines:

# Just start training and skip other preparation process
# stage=1 
stage=8

and

# Change arch from speech_conv_lstm_librispeech to speech_transformer_librispeech
  CUDA_VISIBLE_DEVICES=$free_gpu speech_train.py data --task speech_recognition_espresso --seed 1 --user-dir espresso \
    --num-workers 0 --data-buffer-size 0 --max-tokens 26000 --max-sentences 24 --curriculum 1 \
    --valid-subset $valid_subset --max-sentences-valid 48 --ddp-backend no_c10d \
    --distributed-world-size $ngpus --distributed-port $(if [ $ngpus -gt 1 ]; then echo 100; else echo -1; fi) \
    --optimizer adam --lr 0.001 --weight-decay 0.0 --clip-norm 2.0 \
    --save-dir $dir --restore-file checkpoint_last.pt --save-interval-updates $((6000/ngpus)) \
    --keep-interval-updates 3 --keep-last-epochs 5 --validate-interval 1 --best-checkpoint-metric wer \
    --dict $dict --bpe sentencepiece --sentencepiece-vocab ${sentencepiece_model}.model \
    --max-source-positions 9999 --max-target-positions 999 \
    --log-interval $((8000/ngpus)) --log-format simple \
    --arch **speech_transformer_librispeech** --criterion cross_entropy_v2 \
    --print-training-sample-interval $((4000/ngpus)) \
    $opts --specaugment-config "$specaug_config" 2>&1 | tee $log_file

Run cmd './run.sh'
Got error

Traceback (most recent call last):
  File "/nfs/mercury-13/u20/cli/miniconda3/envs/espresso/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
    fn(i, *args)
  File "/nfs/mercury-13/u20/cli/src/espresso/espresso/speech_train.py", line 341, in distributed_main
    main(args, init_distributed=True)
  File "/nfs/mercury-13/u20/cli/src/espresso/espresso/speech_train.py", line 72, in main
    model = task.build_model(args)
  File "/nfs/mercury-13/u20/cli/src/espresso/espresso/tasks/speech_recognition.py", line 339, in build_model
    model = super().build_model(args)
  File "/nfs/mercury-13/u20/cli/src/espresso/fairseq/tasks/fairseq_task.py", line 211, in build_model
    model = models.build_model(args, self)
  File "/nfs/mercury-13/u20/cli/src/espresso/fairseq/models/__init__.py", line 48, in build_model
    return ARCH_MODEL_REGISTRY[args.arch].build_model(args, task)
  File "/nfs/mercury-13/u20/cli/src/espresso/espresso/models/speech_transformer.py", line 132, in build_model
    return cls(encoder, decoder)
TypeError: __init__() missing 1 required positional argument: 'decoder'

This is because the constructor is called without args; it can be fixed by adding the args parameter back in speech_transformer.py as below:

        # return cls(encoder, decoder)
        return cls(args, encoder, decoder)

But after that, a more complicated error follows, as below:

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/nfs/mercury-13/u20/cli/miniconda3/envs/espresso/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
    fn(i, *args)
  File "/nfs/mercury-13/u20/cli/src/espresso/espresso/speech_train.py", line 341, in distributed_main
    main(args, init_distributed=True)
  File "/nfs/mercury-13/u20/cli/src/espresso/espresso/speech_train.py", line 121, in main
    valid_losses, should_stop = train(args, trainer, task, epoch_itr)
  File "/nfs/mercury-13/u20/cli/miniconda3/envs/espresso/lib/python3.7/contextlib.py", line 74, in inner
    return func(*args, **kwds)
  File "/nfs/mercury-13/u20/cli/src/espresso/espresso/speech_train.py", line 210, in train
    log_output = trainer.train_step(samples)
  File "/nfs/mercury-13/u20/cli/miniconda3/envs/espresso/lib/python3.7/contextlib.py", line 74, in inner
    return func(*args, **kwds)
  File "/nfs/mercury-13/u20/cli/src/espresso/fairseq/trainer.py", line 408, in train_step
    ignore_grad=is_dummy_batch,
  File "/nfs/mercury-13/u20/cli/src/espresso/fairseq/tasks/fairseq_task.py", line 342, in train_step
    loss, sample_size, logging_output = criterion(model, sample)
  File "/nfs/mercury-13/u20/cli/miniconda3/envs/espresso/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/nfs/mercury-13/u20/cli/src/espresso/espresso/criterions/cross_entropy_v2.py", line 49, in forward
    net_output = model(**sample["net_input"], epoch=self.epoch)
  File "/nfs/mercury-13/u20/cli/miniconda3/envs/espresso/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/nfs/mercury-13/u20/cli/src/espresso/fairseq/legacy_distributed_data_parallel.py", line 86, in forward
    return self.module(*inputs, **kwargs)
  File "/nfs/mercury-13/u20/cli/miniconda3/envs/espresso/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
TypeError: forward() got an unexpected keyword argument 'epoch'
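
A small diagnostic sketch (plain Python, nothing Espresso-specific) to confirm whether a built model's forward() will tolerate the `epoch` keyword that cross_entropy_v2 forwards from the sample; the real fix is aligning the criterion with the model's forward signature rather than wrapping anything:

    import inspect

    # Returns True if `model.forward` declares `kwarg` explicitly or accepts **kwargs.
    def forward_accepts(model, kwarg: str) -> bool:
        params = inspect.signature(model.forward).parameters
        return kwarg in params or any(
            p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()
        )

    # e.g. forward_accepts(model, "epoch") returning False reproduces the TypeError above.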

Which recipe involves multi-level LM training and decoding? Also, can we use word + subword for multi-level decoding?

What is your question?

As stated in the subject:

  1. Which recipe involves multi-level LM training and decoding?
  2. Can we use word + subword for multi-level decoding? If so, how?

What have you tried?

I have read the librispeech and wsj recipes, but I could not find a clear way to enable multi-level (word + subword) decoding with the LSTM (ASR) model.

What's your environment?

  • fairseq Version (e.g., 1.0 or master):
  • PyTorch Version : 1.4.0
  • OS (e.g., Linux): Centos7
  • How you installed fairseq (pip, source): pip
  • Python version: 3.7
  • CUDA/cuDNN version: 10.0

TIMIT Demo example

🚀 Feature Request

Would it be possible to upload an example for TIMIT for demonstration purposes? All the other speech recognition datasets are rather large to download when just trying out this repo. Having TIMIT would allow people new to ASR to quickly try out and appreciate the convenience of this framework. Thanks.


Support for RNNLM use while decoding?

Does Espresso support beam-search decoding with an RNNLM or another LM?

Also, is there a version requirement for Kaldi? Which Kaldi installation can be connected to Espresso?

Language Model Inference

How can I run inference with an LSTM-based language model (lstm_lm) trained on subword tokens (similar to generate.py for inference in seq2seq)?
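
In case it helps, here is a minimal sketch of scoring subword-tokenized text with a trained LM checkpoint via fairseq's checkpoint utilities. The checkpoint path and the example line are placeholders, and the saved args must still resolve to a valid dictionary file; the exact return types may differ across fairseq versions:

    import torch
    from fairseq import checkpoint_utils

    # Load the LM checkpoint together with the task it was trained with
    # (hypothetical path; adapt to your own experiment directory).
    models, _, task = checkpoint_utils.load_model_ensemble_and_task(
        ["exp/lm_lstm/checkpoint_best.pt"]
    )
    model = models[0].eval()
    d = task.target_dictionary

    line = "▁this ▁is ▁a ▁test"  # already subword-tokenized text
    tokens = d.encode_line(line, add_if_not_exist=False).long()

    with torch.no_grad():
        net_output = model(tokens.unsqueeze(0))
        lprobs = model.get_normalized_probs(net_output, log_probs=True)
    print(lprobs.shape)  # (1, num_tokens, vocab): per-position next-token log-probs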

Different WERs when decoding with different batch size (--max-sentences)

Hi,
I would like to ask why I get different results when I decode with different batch sizes (--max-sentences).

For example, with the same language model and same attention-based encoder-decoder model:

  1. by setting the batch size (--max-sentences) as 32, in WSJ, the WER is 3.46% in eval92 and 5.71% in dev93.
  2. by setting the batch size (--max-sentences) as 1, in WSJ, the WER is 3.42% in eval92 and 5.67% in dev93.

The difference is not large; I guess it is caused by beam search in batches, but I am not sure where exactly the difference comes from. If you know, I would be glad to have your answer.
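
One contributing factor worth ruling out (a generic PyTorch observation, not specific to Espresso): the same linear layer applied to a row alone versus inside a larger batch can differ in the last few bits, because the batched GEMM may use a different reduction order, and such tiny differences can flip near-tied beam-search decisions. Larger discrepancies usually point to padding handling instead. A minimal illustration:

    import torch

    torch.manual_seed(0)
    device = "cuda" if torch.cuda.is_available() else "cpu"
    lin = torch.nn.Linear(1024, 1024).to(device)
    x = torch.randn(8, 1024, device=device)

    alone = lin(x[:1])       # the first row processed by itself
    in_batch = lin(x)[:1]    # the same row processed inside a batch of 8
    # Typically a tiny but non-zero value on GPU; may be exactly 0 on CPU.
    print((alone - in_batch).abs().max().item())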

Problem with Long Utterances for MALACH Corpus

I am trying to use Espresso to decode the MALACH Corpus. One of the characteristics of MALACH is that the training utterances are all short (< 8 seconds on the whole) but the test data contains a significant number of long utterances (> 20 seconds). I am observing that on these long utterances it produces decent output for the first 5-6 seconds, deteriorates rapidly thereafter, puts out some repeated words, and then stops decoding, resulting in many deletions. This is for a transformer model based on the wsj recipe. MALACH has about 160 hours of training data. I would welcome some suggestions/help here - it almost looks like some parameter setting would fix things.

Thanks
Michael

Slow training...

Hello,

I have spent some time comparing PyChain LF-MMI in Espresso and the pychain_example, which seems to borrow some code from Espresso. I get very slow forward passes in Espresso while they are much faster in pychain_example (I use DistributedDataParallel for both Espresso (the 'no_c10d' backend, which uses NCCL anyway?) and PyChain (with 'nccl')). I use the same TDNN model in both, with the cnn/bn/relu implementation matched from Espresso to PyChain: 6 TDNN+BN+ReLU layers, strides=(1,1,1,1,1,3), dilation=(1,1,1,3,3,3), kernels=(3,3,3,3,3,3), no residual connections. Both use curriculum learning in the first epoch and start with the shortest batches.

Espresso code:

        start= time.time()
        for i in range(len(self.tdnn)):
            if self.residual and i > 0:  # residual connection starts from the 2nd layer
                prev_x = x
            x, x_lengths, padding_mask = self.tdnn[i](x, x_lengths)
            x = self.dropout_out_module(x)
            x = x + prev_x if self.residual and i > 0 and x.size(1) == prev_x.size(1) else x
        print ('6xTDNN time %.5fs' % (time.time() - start,), 'tensor_in_size', s, 'gpu', x.get_device())

PyChain code:

        start = time.time()
        for i in range(len(self.tdnn)):
            if self.residual and i>0:
              x_prev = x
            x, x_lengths = self.tdnn[i](x, x_lengths)
            x = F.dropout(x, p=self.dropout, training=self.training)
            if self.residual and i>0 and x.size(1)==x_prev.size(1):
                x += x_prev
        print ('6xTDNN time %.5fs' % (time.time() - start,), 'tensor_in_size', s, 'gpu', x.get_device())

So, the code is almost line-by-line the same, and the architecture is the same. Yet, after using DistributedDataParallel, Espresso is much slower. This was run on the same machine, the same 2 GPUs, one experiment right after the other (so no load-change issues on the machine). I checked that computing the padding does not significantly affect the timing. Here are the timings for several forward passes of similar size.

Espresso:
6xTDNN time 2.42642s tensor_in_size torch.Size([64, 158, 40]) tensor_out_size torch.Size([64, 53, 640]) gpu 1
6xTDNN time 2.39317s tensor_in_size torch.Size([64, 177, 40]) tensor_out_size torch.Size([64, 59, 640]) gpu 1
6xTDNN time 1.95155s tensor_in_size torch.Size([64, 144, 40]) tensor_out_size torch.Size([64, 48, 640]) gpu 0
6xTDNN time 2.50637s tensor_in_size torch.Size([64, 170, 40]) tensor_out_size torch.Size([64, 57, 640]) gpu 0
6xTDNN time 1.79735s tensor_in_size torch.Size([64, 192, 40]) tensor_out_size torch.Size([64, 64, 640]) gpu 1
6xTDNN time 2.37481s tensor_in_size torch.Size([64, 186, 40]) tensor_out_size torch.Size([64, 62, 640]) gpu 0

...
PyChain:
6xTDNN time 0.07956s tensor_in_size torch.Size([64, 170, 40]) tensor_out_size torch.Size([64, 57, 640]) gpu 0
6xTDNN time 0.08923s tensor_in_size torch.Size([64, 194, 40]) tensor_out_size torch.Size([64, 65, 640]) gpu 1
6xTDNN time 0.08312s tensor_in_size torch.Size([64, 211, 40]) tensor_out_size torch.Size([64, 71, 640]) gpu 0
6xTDNN time 0.08275s tensor_in_size torch.Size([64, 224, 40]) tensor_out_size torch.Size([64, 75, 640]) gpu 1
6xTDNN time 0.08598s tensor_in_size torch.Size([64, 233, 40]) tensor_out_size torch.Size([64, 78, 640]) gpu 0
6xTDNN time 0.08788s tensor_in_size torch.Size([64, 241, 40]) tensor_out_size torch.Size([64, 81, 640]) gpu 1
...
So, PyChain is 10-20 times faster... Espresso uses 40-50% of each GPU, while PyChain uses 85-95% when put together with the LF-MMI loss. I wonder how to make Espresso train as fast as PyChain shows is possible. Is it a matter of the DistributedDataParallel implementation in fairseq? The backend? Any help is welcome.
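
One thing to double-check before comparing the two numbers (a generic CUDA-timing caveat, not a claim about either codebase): CUDA kernels are launched asynchronously, so wrapping a forward pass in time.time() without synchronization can attribute queued work from elsewhere in the step (e.g. the previous backward pass) to whichever op happens to block next. A hedged timing sketch:

    import time
    import torch

    def timed_forward(module, *inputs):
        """Time a forward pass with explicit CUDA synchronization."""
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        start = time.time()
        out = module(*inputs)
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        return out, time.time() - start

If the gap persists with synchronized timing, the DDP backend and gradient-bucketing settings would be the next suspects.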

Is there a list of reference papers about conv-lstm in examples/asr_wsj?

Hello,
before building speech-transformer models, I first tried to train the default model using run.sh in examples/asr_wsj. The accuracy is WER 12.35% on test_92 without an LM.
This is somewhat behind the accuracy of the speech-transformer (WER 10.92%), but it is still competitive.

Could you explain which references you followed when writing the training script?

Transformer LM in ASR

Hi,
Thank you for providing the transformer ASR recipe. Is it possible to use a transformer language model instead of an LSTM? I have looked at the recipe run script: it provides a use_transformer option for acoustic model training, but no such option for language model training.

Thank you in advance for your answer.
Martha.

โ“ Questions and Help

Before asking:

  1. search the issues.
  2. search the docs.

What is your question?

Code

What have you tried?

What's your environment?

  • fairseq Version (e.g., 1.0 or master):
  • PyTorch Version (e.g., 1.0)
  • OS (e.g., Linux):
  • How you installed fairseq (pip, source):
  • Build command you used (if compiling from source):
  • Python version:
  • CUDA/cuDNN version:
  • GPU models and configuration:
  • Any other relevant information:

Any plan for SpecAug?

Thanks for open sourcing this great package! Is there any plan to add SpecAug or other augmentation methods into the code base?
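
For anyone who wants to experiment on extracted features in the meantime, here is a minimal SpecAugment-style sketch (frequency and time masking only, no time warping); the parameter names and defaults are illustrative, not Espresso's own configuration:

    import torch

    def spec_augment(feats: torch.Tensor, num_freq_masks: int = 2, F: int = 27,
                     num_time_masks: int = 2, T: int = 100) -> torch.Tensor:
        """feats: (time, freq) log-mel features; returns a masked copy."""
        out = feats.clone()
        t_len, f_len = out.shape
        for _ in range(num_freq_masks):       # mask a random band of frequency bins
            f = int(torch.randint(0, F + 1, (1,)))
            f0 = int(torch.randint(0, max(1, f_len - f), (1,)))
            out[:, f0:f0 + f] = 0.0
        for _ in range(num_time_masks):       # mask a random span of time frames
            t = int(torch.randint(0, min(T, t_len) + 1, (1,)))
            t0 = int(torch.randint(0, max(1, t_len - t), (1,)))
            out[t0:t0 + t, :] = 0.0
        return out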

ASR_WSJ: LM is training but no logging output?

๐Ÿ› Bug

I am running the asr_wsj recipe. It has been training the word_lm (stage 6) since last night but does not produce any output, logging or otherwise.

When I run nvtop or nvidia-smi the GPUs seem to be busy with my jobs. I am running 4 GPUs in parallel. Early on there were some OOM problems that it tried to recover from. Is it possible it is stuck in some sort of infinite loop and doing nothing?

Attached is the screen output - at the top you can see nvidia-smi is run along with the early OOM messages.

no change /misc/vlgscratch4/PichenyGroup/picheny/anaconda3/envs/espnet-may142020/condabin/conda
no change /misc/vlgscratch4/PichenyGroup/picheny/anaconda3/envs/espnet-may142020/bin/conda
no change /misc/vlgscratch4/PichenyGroup/picheny/anaconda3/envs/espnet-may142020/bin/conda-env
no change /misc/vlgscratch4/PichenyGroup/picheny/anaconda3/envs/espnet-may142020/bin/activate
no change /misc/vlgscratch4/PichenyGroup/picheny/anaconda3/envs/espnet-may142020/bin/deactivate
no change /misc/vlgscratch4/PichenyGroup/picheny/anaconda3/envs/espnet-may142020/etc/profile.d/conda.sh
no change /misc/vlgscratch4/PichenyGroup/picheny/anaconda3/envs/espnet-may142020/etc/fish/conf.d/conda.fish
no change /misc/vlgscratch4/PichenyGroup/picheny/anaconda3/envs/espnet-may142020/shell/condabin/Conda.psm1
no change /misc/vlgscratch4/PichenyGroup/picheny/anaconda3/envs/espnet-may142020/shell/condabin/conda-hook.ps1
no change /misc/vlgscratch4/PichenyGroup/picheny/anaconda3/envs/espnet-may142020/lib/python3.7/site-packages/xontrib/conda.xsh
no change /misc/vlgscratch4/PichenyGroup/picheny/anaconda3/envs/espnet-may142020/etc/profile.d/conda.csh
no change /home/map22/.bashrc
No action taken.
Tue Dec 8 22:30:53 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.36 Driver Version: 440.36 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 108... On | 00000000:02:00.0 Off | N/A |
| 23% 18C P8 9W / 250W | 1MiB / 11178MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 108... On | 00000000:03:00.0 Off | N/A |
| 23% 21C P8 9W / 250W | 1MiB / 11178MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 GeForce GTX 108... On | 00000000:82:00.0 Off | N/A |
| 23% 22C P8 8W / 250W | 1MiB / 11178MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 GeForce GTX 108... On | 00000000:83:00.0 Off | N/A |
| 23% 22C P8 8W / 250W | 1MiB / 11178MiB | 0% Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
Stage 3: Text Binarization for LM Training
./run.sh: binarizing word text...
Unable to get 4 GPUs
Stage 6: word LM Training
2020-12-08 22:32:29 | INFO | fairseq.distributed_utils | distributed init (rank 0): tcp://localhost:19801
2020-12-08 22:32:29 | INFO | fairseq.distributed_utils | distributed init (rank 2): tcp://localhost:19801
2020-12-08 22:32:29 | INFO | fairseq.distributed_utils | distributed init (rank 1): tcp://localhost:19801
2020-12-08 22:32:29 | INFO | fairseq.distributed_utils | distributed init (rank 3): tcp://localhost:19801
2020-12-08 22:32:39 | INFO | fairseq.distributed_utils | initialized host lion6.cs.nyu.edu as rank 3
2020-12-08 22:32:39 | INFO | fairseq.distributed_utils | initialized host lion6.cs.nyu.edu as rank 2
2020-12-08 22:32:39 | INFO | fairseq.distributed_utils | initialized host lion6.cs.nyu.edu as rank 0
2020-12-08 22:32:39 | INFO | fairseq.distributed_utils | initialized host lion6.cs.nyu.edu as rank 1
2020-12-08 22:32:39 | INFO | fairseq_cli.train | {'_name': None, 'common': {'_name': None, 'no_progress_bar': False, 'log_interval': 1000, 'log_format': 'simple', 'tensorboard_logdir': None, 'wandb_project': None, 'seed': 1, 'cpu': False, 'tpu': False, 'bf16': False, 'memory_efficient_bf16': False, 'fp16': False, 'memory_efficient_fp16': False, 'fp16_no_flatten_grads': False, 'fp16_init_scale': 128, 'fp16_scale_window': None, 'fp16_scale_tolerance': 0.0, 'min_loss_scale': 0.0001, 'threshold_loss_scale': None, 'user_dir': None, 'empty_cache_freq': 0, 'all_gather_list_size': 16384, 'model_parallel_size': 1, 'quantization_config_path': None, 'profile': False, 'reset_logging': True}, 'common_eval': {'_name': None, 'path': None, 'post_process': None, 'quiet': False, 'model_overrides': '{}', 'results_path': None}, 'distributed_training': {'_name': None, 'distributed_world_size': 4, 'distributed_rank': 0, 'distributed_backend': 'nccl', 'distributed_init_method': 'tcp://localhost:19801', 'distributed_port': -1, 'device_id': 0, 'distributed_no_spawn': False, 'ddp_backend': 'c10d', 'bucket_cap_mb': 25, 'fix_batches_to_gpus': False, 'find_unused_parameters': False, 'fast_stat_sync': False, 'broadcast_buffers': False, 'distributed_wrapper': 'DDP', 'slowmo_momentum': None, 'slowmo_algorithm': 'LocalSGD', 'localsgd_frequency': 3, 'nprocs_per_node': 4, 'pipeline_model_parallel': False, 'pipeline_balance': None, 'pipeline_devices': None, 'pipeline_chunks': 0, 'pipeline_encoder_balance': None, 'pipeline_encoder_devices': None, 'pipeline_decoder_balance': None, 'pipeline_decoder_devices': None, 'pipeline_checkpoint': 'never', 'zero_sharding': 'none', 'tpu': False, 'distributed_num_procs': 4}, 'dataset': {'_name': None, 'num_workers': 0, 'skip_invalid_size_inputs_valid_test': False, 'max_tokens': 6400, 'batch_size': 256, 'required_batch_size_multiple': 8, 'required_seq_len_multiple': 1, 'dataset_impl': None, 'data_buffer_size': 10, 'train_subset': 'train', 'valid_subset': 'valid', 'validate_interval': 1, 'validate_interval_updates': 0, 'validate_after_updates': 0, 'fixed_validation_seed': None, 'disable_validation': False, 'max_tokens_valid': 6400, 'batch_size_valid': 512, 'curriculum': 0, 'gen_subset': 'test', 'num_shards': 1, 'shard_id': 0}, 'optimization': {'_name': None, 'max_epoch': 25, 'max_update': 0, 'stop_time_hours': 0.0, 'clip_norm': 0.0, 'sentence_avg': False, 'update_freq': [1], 'lr': [0.001], 'min_lr': -1.0, 'use_bmuf': False}, 'checkpoint': {'_name': None, 'save_dir': 'exp/wordlm_lstm', 'restore_file': 'checkpoint_last.pt', 'finetune_from_model': None, 'reset_dataloader': False, 'reset_lr_scheduler': False, 'reset_meters': False, 'reset_optimizer': False, 'optimizer_overrides': '{}', 'save_interval': 1, 'save_interval_updates': 1000, 'keep_interval_updates': 5, 'keep_last_epochs': 5, 'keep_best_checkpoints': -1, 'no_save': False, 'no_epoch_checkpoints': False, 'no_last_checkpoints': False, 'no_save_optimizer_state': False, 'best_checkpoint_metric': 'loss', 'maximize_best_checkpoint_metric': False, 'patience': -1, 'checkpoint_suffix': '', 'checkpoint_shard_count': 1, 'model_parallel_size': 1, 'distributed_rank': 0}, 'bmuf': {'_name': None, 'block_lr': 1.0, 'block_momentum': 0.875, 'global_sync_iter': 50, 'warmup_iterations': 500, 'use_nbm': False, 'average_sync': False, 'distributed_world_size': 4}, 'generation': {'_name': None, 'beam': 5, 'nbest': 1, 'max_len_a': 0.0, 'max_len_b': 200, 'min_len': 1, 'match_source_len': False, 'unnormalized': False, 'no_early_stop': False, 'no_beamable_mm': 
False, 'lenpen': 1.0, 'unkpen': 0.0, 'replace_unk': None, 'sacrebleu': False, 'score_reference': False, 'prefix_size': 0, 'no_repeat_ngram_size': 0, 'sampling': False, 'sampling_topk': -1, 'sampling_topp': -1.0, 'constraints': None, 'temperature': 1.0, 'diverse_beam_groups': -1, 'diverse_beam_strength': 0.5, 'diversity_rate': -1.0, 'print_alignment': False, 'print_step': False, 'lm_path': None, 'lm_weight': 0.0, 'iter_decode_eos_penalty': 0.0, 'iter_decode_max_iter': 10, 'iter_decode_force_max_iter': False, 'iter_decode_with_beam': 1, 'iter_decode_with_external_reranker': False, 'retain_iter_history': False, 'retain_dropout': False, 'retain_dropout_modules': None, 'decoding_format': None, 'no_seed_provided': False, 'eos_factor': None, 'subwordlm_weight': 0.8, 'oov_penalty': 0.0001, 'disable_open_vocab': False, 'apply_log_softmax': False, 'state_prior_file': None}, 'eval_lm': {'_name': None, 'output_word_probs': False, 'output_word_stats': False, 'context_window': 0, 'softmax_batch': 9223372036854775807}, 'interactive': {'_name': None, 'buffer_size': 0, 'input': '-'}, 'model': Namespace(_name='lstm_wordlm_wsj', adam_betas='(0.9, 0.999)', adam_eps=1e-08, adaptive_softmax_cutoff=None, add_bos_token=False, all_gather_list_size=16384, arch='lstm_wordlm_wsj', batch_size=256, batch_size_valid='512', best_checkpoint_metric='loss', bf16=False, bpe=None, broadcast_buffers=False, bucket_cap_mb=25, checkpoint_shard_count=1, checkpoint_suffix='', clip_norm=0.0, cpu=False, criterion='cross_entropy', curriculum=0, data='data/wordlm_text', data_buffer_size=10, dataset_impl=None, ddp_backend='c10d', decoder_dropout_in=0.35, decoder_dropout_out=0.35, decoder_embed_dim=1200, decoder_embed_path=None, decoder_freeze_embed=False, decoder_hidden_size=1200, decoder_layers=3, decoder_out_embed_dim=1200, decoder_rnn_residual=False, device_id=0, dict='data/lang/wordlist_65000.txt', disable_validation=False, distributed_backend='nccl', distributed_init_method=None, distributed_no_spawn=False, distributed_port=-1, distributed_rank=0, distributed_world_size=4, distributed_wrapper='DDP', dropout=0.35, empty_cache_freq=0, eos=2, fast_stat_sync=False, find_unused_parameters=False, finetune_from_model=None, fix_batches_to_gpus=False, fixed_validation_seed=None, fp16=False, fp16_init_scale=128, fp16_no_flatten_grads=False, fp16_scale_tolerance=0.0, fp16_scale_window=None, future_target=False, gen_subset='test', is_wordlm=True, keep_best_checkpoints=-1, keep_interval_updates=5, keep_last_epochs=5, localsgd_frequency=3, log_format='simple', log_interval=1000, lr=[0.001], lr_patience=0, lr_scheduler='reduce_lr_on_plateau', lr_shrink=0.5, lr_threshold=0.0001, max_epoch=25, max_target_positions=None, max_tokens=6400, max_tokens_valid=6400, max_update=0, maximize_best_checkpoint_metric=False, memory_efficient_bf16=False, memory_efficient_fp16=False, min_loss_scale=0.0001, min_lr=-1.0, model_parallel_size=1, no_epoch_checkpoints=False, no_last_checkpoints=False, no_progress_bar=False, no_save=False, no_save_optimizer_state=False, no_seed_provided=False, nprocs_per_node=4, num_shards=1, num_workers=0, optimizer='adam', optimizer_overrides='{}', output_dictionary_size=-1, pad=1, past_target=False, patience=-1, pipeline_balance=None, pipeline_checkpoint='never', pipeline_chunks=0, pipeline_decoder_balance=None, pipeline_decoder_devices=None, pipeline_devices=None, pipeline_encoder_balance=None, pipeline_encoder_devices=None, pipeline_model_parallel=False, profile=False, quantization_config_path=None, required_batch_size_multiple=8, 
required_seq_len_multiple=1, reset_dataloader=False, reset_logging=True, reset_lr_scheduler=False, reset_meters=False, reset_optimizer=False, restore_file='checkpoint_last.pt', sample_break_mode='eos', save_dir='exp/wordlm_lstm', save_interval=1, save_interval_updates=1000, scoring='bleu', seed=1, self_target=False, sentence_avg=False, shard_id=0, share_embed=True, shorten_data_split_list='', shorten_method='none', skip_invalid_size_inputs_valid_test=False, slowmo_algorithm='LocalSGD', slowmo_momentum=None, stop_time_hours=0, task='language_modeling_for_asr', tensorboard_logdir=None, threshold_loss_scale=None, tokenizer=None, tokens_per_sample=1024, tpu=False, train_subset='train', unk=3, update_freq=[1], use_bmuf=False, use_old_adam=False, user_dir=None, valid_subset='valid', validate_after_updates=0, validate_interval=1, validate_interval_updates=0, wandb_project=None, warmup_init_lr=-1, warmup_updates=0, weight_decay=0.0, zero_sharding='none'), 'task': {'_name': 'language_modeling_for_asr', 'data': 'data/wordlm_text', 'sample_break_mode': 'eos', 'tokens_per_sample': 1024, 'output_dictionary_size': -1, 'self_target': False, 'future_target': False, 'past_target': False, 'add_bos_token': False, 'max_target_positions': None, 'shorten_method': 'none', 'shorten_data_split_list': '', 'seed': 1, 'dataset_impl': None, 'data_buffer_size': 10, 'tpu': False, 'dict': 'data/lang/wordlist_65000.txt'}, 'criterion': {'_name': 'cross_entropy', 'sentence_avg': False}, 'optimizer': {'_name': 'adam', 'adam_betas': '(0.9, 0.999)', 'adam_eps': 1e-08, 'weight_decay': 0.0, 'use_old_adam': False, 'tpu': False, 'lr': [0.001]}, 'lr_scheduler': {'_name': 'reduce_lr_on_plateau', 'lr_shrink': 0.5, 'lr_threshold': 0.0001, 'lr_patience': 0, 'warmup_updates': 0, 'warmup_init_lr': -1.0, 'lr': [0.001], 'maximize_best_checkpoint_metric': False}, 'scoring': {'_name': 'bleu', 'pad': 1, 'eos': 2, 'unk': 3}, 'bpe': None, 'tokenizer': None}
2020-12-08 22:32:39 | INFO | espresso.tasks.language_modeling_for_asr | dictionary: 65003 types
2020-12-08 22:32:39 | INFO | fairseq.data.data_utils | loaded 503 examples from: data/wordlm_text/valid
2020-12-08 22:32:42 | INFO | fairseq_cli.train | LSTMLanguageModelEspresso(
(decoder): SpeechLSTMDecoder(
(dropout_in_module): FairseqDropout()
(dropout_out_module): FairseqDropout()
(embed_tokens): Embedding(65003, 1200, padding_idx=0)
(layers): ModuleList(
(0): LSTMCell(1200, 1200)
(1): LSTMCell(1200, 1200)
(2): LSTMCell(1200, 1200)
)
)
)
2020-12-08 22:32:42 | INFO | fairseq_cli.train | task: LanguageModelingForASRTask
2020-12-08 22:32:42 | INFO | fairseq_cli.train | model: LSTMLanguageModelEspresso
2020-12-08 22:32:42 | INFO | fairseq_cli.train | criterion: CrossEntropyCriterion)
2020-12-08 22:32:42 | INFO | fairseq_cli.train | num. model params: 112592400 (num. trained: 112592400)
2020-12-08 22:32:43 | INFO | fairseq.utils | CUDA enviroments for all 4 workers
2020-12-08 22:32:43 | INFO | fairseq.utils | rank 0: capabilities = 6.1 ; total memory = 10.917 GB ; name = GeForce GTX 1080 Ti
2020-12-08 22:32:43 | INFO | fairseq.utils | rank 1: capabilities = 6.1 ; total memory = 10.917 GB ; name = GeForce GTX 1080 Ti
2020-12-08 22:32:43 | INFO | fairseq.utils | rank 2: capabilities = 6.1 ; total memory = 10.917 GB ; name = GeForce GTX 1080 Ti
2020-12-08 22:32:43 | INFO | fairseq.utils | rank 3: capabilities = 6.1 ; total memory = 10.917 GB ; name = GeForce GTX 1080 Ti
2020-12-08 22:32:43 | INFO | fairseq.utils | CUDA enviroments for all 4 workers
2020-12-08 22:32:43 | INFO | fairseq_cli.train | training on 4 devices (GPUs/TPUs)
2020-12-08 22:32:43 | INFO | fairseq_cli.train | max tokens per GPU = 6400 and batch size per GPU = 256
2020-12-08 22:32:43 | INFO | fairseq.trainer | no existing checkpoint found exp/wordlm_lstm/checkpoint_last.pt
2020-12-08 22:32:43 | INFO | fairseq.trainer | loading train data for epoch 1
/misc/vlgscratch4/PichenyGroup/picheny/anaconda3/envs/espresso-dec082020/lib/python3.7/site-packages/torch/nn/parallel/distributed.py:398: UserWarning: The check_reduction argument in DistributedDataParallel module is deprecated. Please avoid using it.
"The check_reduction argument in DistributedDataParallel "
2020-12-08 22:41:58 | INFO | fairseq.data.data_utils | loaded 1662964 examples from: data/wordlm_text/train
/misc/vlgscratch4/PichenyGroup/picheny/anaconda3/envs/espresso-dec082020/lib/python3.7/site-packages/torch/nn/parallel/distributed.py:398: UserWarning: The check_reduction argument in DistributedDataParallel module is deprecated. Please avoid using it.
"The check_reduction argument in DistributedDataParallel "
/misc/vlgscratch4/PichenyGroup/picheny/anaconda3/envs/espresso-dec082020/lib/python3.7/site-packages/torch/nn/parallel/distributed.py:398: UserWarning: The check_reduction argument in DistributedDataParallel module is deprecated. Please avoid using it.
"The check_reduction argument in DistributedDataParallel "
/misc/vlgscratch4/PichenyGroup/picheny/anaconda3/envs/espresso-dec082020/lib/python3.7/site-packages/torch/nn/parallel/distributed.py:398: UserWarning: The check_reduction argument in DistributedDataParallel module is deprecated. Please avoid using it.
"The check_reduction argument in DistributedDataParallel "
2020-12-08 22:42:06 | INFO | fairseq.trainer | begin training epoch 1
/misc/vlgscratch5/PichenyGroup/picheny/espresso/fairseq/utils.py:347: UserWarning: amp_C fused kernels unavailable, disabling multi_tensor_l2norm; you may get better performance by installing NVIDIA's apex library
"amp_C fused kernels unavailable, disabling multi_tensor_l2norm; "
/misc/vlgscratch5/PichenyGroup/picheny/espresso/fairseq/utils.py:347: UserWarning: amp_C fused kernels unavailable, disabling multi_tensor_l2norm; you may get better performance by installing NVIDIA's apex library
"amp_C fused kernels unavailable, disabling multi_tensor_l2norm; "
/misc/vlgscratch5/PichenyGroup/picheny/espresso/fairseq/utils.py:347: UserWarning: amp_C fused kernels unavailable, disabling multi_tensor_l2norm; you may get better performance by installing NVIDIA's apex library
"amp_C fused kernels unavailable, disabling multi_tensor_l2norm; "
/misc/vlgscratch5/PichenyGroup/picheny/espresso/fairseq/utils.py:347: UserWarning: amp_C fused kernels unavailable, disabling multi_tensor_l2norm; you may get better performance by installing NVIDIA's apex library
"amp_C fused kernels unavailable, disabling multi_tensor_l2norm; "
2020-12-08 22:42:08 | INFO | root | Reducer buckets have been rebuilt in this iteration.
2020-12-08 22:42:14 | WARNING | fairseq.trainer | OOM: Ran out of memory with exception: CUDA out of memory. Tried to allocate 1.55 GiB (GPU 1; 10.92 GiB total capacity; 7.68 GiB already allocated; 1.37 GiB free; 8.91 GiB reserved in total by PyTorch)
2020-12-08 22:42:14 | WARNING | fairseq.trainer | [PyTorch CUDA memory summary tables for devices 0-3 followed here; the numeric columns were lost in this capture. Device 1 reports CUDA OOMs: 1, devices 0, 2 and 3 report CUDA OOMs: 0.]

2020-12-08 22:42:14 | WARNING | fairseq.trainer | attempting to recover from OOM in forward/backward pass
2020-12-08 22:42:14 | WARNING | fairseq.trainer | OOM: Ran out of memory with exception: CUDA out of memory. Tried to allocate 1.55 GiB (GPU 2; 10.92 GiB total capacity; 7.66 GiB already allocated; 945.06 MiB free; 9.36 GiB reserved in total by PyTorch)
2020-12-08 22:42:14 | WARNING | fairseq.trainer | [PyTorch CUDA memory summary tables for devices 0-3 followed here; the numeric columns were lost in this capture. Device 2 reports CUDA OOMs: 1, devices 0, 1 and 3 report CUDA OOMs: 0.]

2020-12-08 22:42:14 | WARNING | fairseq.trainer | attempting to recover from OOM in forward/backward pass

Error in fp16 training

Hi @freewym, have you had a chance to train the model with float16 precision? I experienced the following error in the swbd recipe:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "<path>/codebase/espresso/env/lib64/python3.6/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "<path>/codebase/espresso/speech_train.py", line 354, in distributed_main
    main(args, init_distributed=True)
  File "<path>/codebase/espresso/speech_train.py", line 128, in main
    train(args, trainer, task, epoch_itr)
  File "<path>/codebase/espresso/speech_train.py", line 173, in train
    log_output = trainer.train_step(samples)
  File "<path>/codebase/espresso/fairseq/trainer.py", line 342, in train_step
    raise e
  File "<path>/codebase/espresso/fairseq/trainer.py", line 306, in train_step
    ignore_grad
  File "<path>/codebase/espresso/fairseq/tasks/fairseq_task.py", line 249, in train_step
    optimizer.backward(loss)
  File "<path>/codebase/espresso/fairseq/optim/fp16_optimizer.py", line 103, in backward
    loss.backward()
  File "<path>/codebase/espresso/env/lib64/python3.6/site-packages/torch/tensor.py", line 150, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "<path>/codebase/espresso/env/lib64/python3.6/site-packages/torch/autograd/__init__.py", line 99, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: expected scalar type Float but found Half

Stream ASR

Is there any recipe for streaming/online decoding?

WER difference when decoding with different batch sizes

๐Ÿ› Bug

WER differs when decoding with different batch sizes.
As the subject says, decoding with different values of the "--batch-size" parameter results in different WERs.

To Reproduce

Just run decoding with any of the recipes (in my case no LM is used) on a test set containing only 3 utterances; the decoding results differ, i.e.:

I got the WERs below by setting batch-size to 1 (ms-1) and 3 (ms-3):

==> test.ms-1.log <==

Recognize valid_3utt with beam=4: WER=115.56%, Sub=66.67%, Ins=48.89%, Del=0.00%
WER saved in /nfs/mercury-13/u20/cli/src/espresso-11112020/espresso/expts/uke-lstm/lstm_uke/decode_valid_3utt_e_best_lp-0.9_lw-0.00/wer
                                  CER=136.92%, Sub=60.00%, Ins=75.38%, Del=1.54%
CER saved in /nfs/mercury-13/u20/cli/src/espresso-11112020/espresso/expts/uke-lstm/lstm_uke/decode_valid_3utt_e_best_lp-0.9_lw-0.00/cer

==> test.ms-3.log <==

Recognize valid_3utt with beam=4: WER=128.89%, Sub=64.44%, Ins=64.44%, Del=0.00%
WER saved in /nfs/mercury-13/u20/cli/src/espresso-11112020/espresso/expts/uke-lstm/lstm_uke/decode_valid_3utt_e_best_lp-0.9_lw-0.00/wer
                                  CER=153.85%, Sub=50.77%, Ins=100.00%, Del=3.08%

I added some debug code to dump the lprobs and found they differ with different batch sizes.

Debug code (fairseq/sequence_generator.py):

            lprobs, avg_attn_scores = self.model.forward_decoder(
                tokens[:, : step + 1],
                encoder_outs,
                incremental_states,
                self.temperature,
            )

            print("step({}) lprobs[:, :8] = {}".format(step, lprobs[:, :8]))

Debug output:

test.ms-3.log :
step(0) lprobs[:, :8] = tensor(
       [[ -9.5018,  -7.5052,  -9.4293,  -8.1257,  -9.2582,  -9.7691, -10.7347,          -8.4094],
        [ -9.5018,  -7.5052,  -9.4293,  -8.1257,  -9.2582,  -9.7691, -10.7347,          -8.4094],
        [ -9.5018,  -7.5052,  -9.4293,  -8.1257,  -9.2582,  -9.7691, -10.7347,          -8.4094],
        [ -9.5018,  -7.5052,  -9.4293,  -8.1257,  -9.2582,  -9.7691, -10.7347,          -8.4094],
        [ -9.4023,  -8.1336,  -9.3596,  -8.4404, -10.0957,  -8.6689, -11.4972,          -9.5354],
        [ -9.4023,  -8.1336,  -9.3596,  -8.4404, -10.0957,  -8.6689, -11.4972,          -9.5354],
        [ -9.4023,  -8.1336,  -9.3596,  -8.4404, -10.0957,  -8.6689, -11.4972,          -9.5354],
        [ -9.4023,  -8.1336,  -9.3596,  -8.4404, -10.0957,  -8.6689, -11.4972,          -9.5354],
        [ -8.8710,  -6.7084,  -9.0103,  -7.3384,  -8.3311,  -8.8002, -10.2661,          -8.6761],
        [ -8.8710,  -6.7084,  -9.0103,  -7.3384,  -8.3311,  -8.8002, -10.2661,          -8.6761],
        [ -8.8710,  -6.7084,  -9.0103,  -7.3384,  -8.3311,  -8.8002, -10.2661,          -8.6761],
        [ -8.8710,  -6.7084,  -9.0103,  -7.3384,  -8.3311,  -8.8002, -10.2661,          -8.6761]], device='cuda:0')
test.ms-1.log :
step(0) lprobs[:, :8] = tensor(
       [[ -8.7959,  -6.7410,  -8.9221,  -7.2738,  -8.2759,  -8.6486, -10.0568,          -8.6627],
        [ -8.7959,  -6.7410,  -8.9221,  -7.2738,  -8.2759,  -8.6486, -10.0568,          -8.6627],
        [ -8.7959,  -6.7410,  -8.9221,  -7.2738,  -8.2759,  -8.6486, -10.0568,          -8.6627],
        [ -8.7959,  -6.7410,  -8.9221,  -7.2738,  -8.2759,  -8.6486, -10.0568,          -8.6627]], device='cuda:0')
step(0) lprobs[:, :8] = tensor(
       [[ -9.4368,  -8.1758,  -9.4051,  -8.4860, -10.0284,  -8.6903, -11.5442,          -9.6263],
        [ -9.4368,  -8.1758,  -9.4051,  -8.4860, -10.0284,  -8.6903, -11.5442,          -9.6263],
        [ -9.4368,  -8.1758,  -9.4051,  -8.4860, -10.0284,  -8.6903, -11.5442,          -9.6263],
        [ -9.4368,  -8.1758,  -9.4051,  -8.4860, -10.0284,  -8.6903, -11.5442,          -9.6263]], device='cuda:0')
step(0) lprobs[:, :8] = tensor(
       [[ -9.5018,  -7.5052,  -9.4293,  -8.1257,  -9.2582,  -9.7691, -10.7347,          -8.4094],
        [ -9.5018,  -7.5052,  -9.4293,  -8.1257,  -9.2582,  -9.7691, -10.7347,          -8.4094],
        [ -9.5018,  -7.5052,  -9.4293,  -8.1257,  -9.2582,  -9.7691, -10.7347,          -8.4094],
        [ -9.5018,  -7.5052,  -9.4293,  -8.1257,  -9.2582,  -9.7691, -10.7347,          -8.4094]], device='cuda:0')

Since there are only 3 utterances in my test set and I use beam size 4 in decoding, the 12 rows of lprobs should be the same; the batch size parameter should only affect decoding throughput and how the decoder fits into GPU memory. But I found that the two outputs differ even at the first step.

Environment

  • fairseq Version (master): master
  • PyTorch Version : 1.4.0 py3.8_cuda10.0.130_cudnn7.6.3_0
  • OS (e.g., Linux): CentOS Linux release 7.8.2003 (Core)
  • Python version: 3.8.5
  • CUDA/cuDNN version: cuda10.0.130_cudnn7.6.3_0

Additional context

GPU utilization is very low

Dear All,

I'm running the asr_librispeech recipe in an Ubuntu 16.04 virtual machine with 4 V100s.

My problem is that the CPUs are all busy but the GPU utilization is always very low, so training is slow (in both the LSTM and transformer cases). How can I solve this issue?

Thanks a lot!


top - 19:31:44 up 3 days, 16:48,  1 user,  load average: 68.33, 76.75, 61.09
Tasks:  14 total,   5 running,   9 sleeping,   0 stopped,   0 zombie
%Cpu(s):  9.0 us, 25.6 sy,  0.0 ni, 65.3 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 52826323+total, 21278182+free, 35688764 used, 27979264+buff/cache
KiB Swap:        0 total,        0 free,        0 used. 48815801+avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                                                                                                          
34316 liao      20   0 32.798g 7.950g 541428 R 791.4  1.6 128:56.46 python3                                                                                                                                                          
34315 liao      20   0 32.940g 7.943g 532860 R 701.7  1.6 166:19.57 python3                                                                                                                                                          
34317 liao      20   0 32.643g 7.951g 542212 R 636.9  1.6 141:01.29 python3                                                                                                                                                          
34318 liao      20   0 32.706g 7.952g 542148 R 613.0  1.6 139:37.17 python3                                                                                                                                                          

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.05    Driver Version: 450.51.05    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:85:00.0 Off |                    0 |
| N/A   42C    P0    85W / 300W |   3902MiB / 16160MiB |     17%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:86:00.0 Off |                    0 |
| N/A   44C    P0    78W / 300W |   3890MiB / 16160MiB |     12%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 00000000:89:00.0 Off |                    0 |
| N/A   49C    P0   162W / 300W |   3902MiB / 16160MiB |      6%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  On   | 00000000:8A:00.0 Off |                    0 |
| N/A   45C    P0   169W / 300W |   3900MiB / 16160MiB |     17%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

WSJ Recipe: "wsj_data_prep.sh: Spot check of command line arguments failed"

What is your question?

I am trying to run the wsj recipe using ./run.sh but I get the following error:

Stage 0: Data Preparation
ln: failed to create symbolic link 'links/??-?.?': File exists
ln: failed to create symbolic link 'links/??-??.?': File exists
wsj_data_prep.sh: Spot check of command line arguments failed
Command line arguments must be absolute pathnames to WSJ directories
with names like 11-13.1.
Note: if you have old-style WSJ distribution,
local/cstr_wsj_data_prep.sh may work instead, see run.sh for example.

Code

./run.sh

What have you tried?

I don't see cstr_wsj_data_prep.sh in the local directory of the wsj recipe.

!ls local/cstr_wsj_data_prep.sh
ls: cannot access 'local/cstr_wsj_data_prep.sh': No such file or directory

What's your environment?

  • fairseq Version (e.g., 1.0 or master): 1.0
  • PyTorch Version (e.g., 1.0): 1.7.0+cu101
  • OS (e.g., Linux): Google Colab (Linux)
  • How you installed fairseq (pip, source): pip install --editable . in espresso source code
  • Build command you used (if compiling from source): To install Espresso commands in readme
  • Python version: 3.6
  • CUDA/cuDNN version: Not using GPU now
  • GPU models and configuration: Not using GPU now, my problem is in downloading dataset stage
  • Any other relevant information: I am using source code from master branch

Results.md request with ASR Recipes, SWBD Scores w/o LM

What is your question?

What numbers are expected with the SWBD Transformer?
I have built AM-only models with LSTM and Transformer models and get similar numbers:

LSTM: SWBD (10.4), CALLHM(20.7)
Transformer: SWBD(10.8),CALLHM(20.8)

How much improvement does SpecAug give?
I am seeing no improvement with Transformer + SpecAug compared to Transformer without SpecAug.

Could you please add a RESULTS.md file to each recipe with the current best working numbers to compare against?

Note: with an LSTM-based AM and LM, I was able to match the numbers reported in your paper.

Deprecated `AT_CHECK` in pychain module

Since I don't seem to be able to report issues in PyChain, reporting the compilation issue here:
In pychain/pytorch_binding/src/pychain.cc:23, the AT_CHECK macro seems to be already deprecated in pytorch 1.5. I had to change it to TORCH_CHECK to finish the compilation.
