
streamingtransformer's People

Contributors

b-flo, bobchennan, butsugiri, creatorscan, cywang97, emrys365, enamoria, fhrozen, ftshijt, gtache, hirofumi0810, jnishi, jzmo, kamo-naoyuki, kan-bayashi, lumaku, masao-someki, mn5k, potato-inoue, r9y9, sas91, shigekikarita, simpleoier, sw005320, takenori-y, unilight, xiaofei-wang, yosukehiguchi, yuekaizhang, zh794390558


streamingtransformer's Issues

Issue about delay

When the chunk size is 32, the maximum decoding delay should be 1280 ms. Is that right?
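For reference, a back-of-the-envelope check under the usual assumptions (the 10 ms frame shift and 4x conv2d subsampling below are assumptions, not values confirmed by the authors):

    frame_shift_ms = 10   # assumed fbank frame shift
    subsampling = 4       # assumed conv2d front-end reduction
    chunk_size = 32       # encoder frames per chunk

    # worst case: a frame at the start of a chunk waits for the whole chunk
    print(chunk_size * subsampling * frame_shift_ms)  # -> 1280 (ms)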

IndexKernel.cu:53 errors

Hi,
I am currently trying to run the LibriSpeech demo, but when I run ./train.sh the following error occurs:

/opt/conda/conda-bld/pytorch_1549636813070/work/aten/src/ATen/native/cuda/IndexKernel.cu:53: lambda ->auto::operator()(int)->auto: block: [34,0,0], thread: [94,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed.
(the same assertion repeats for threads [95,0,0] through [99,0,0] of block [34,0,0])

Have you encountered this error before?

Thank you very much!
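Not an answer from the authors, but a generic way to localize this class of failure: a device-side assert from IndexKernel.cu hides the Python line that did the bad indexing, whereas the same code on CPU raises a plain IndexError with a full traceback. A minimal sketch of the usual culprit (an index beyond a lookup table, e.g. a token id >= the vocabulary size):

    import torch

    # On CPU this raises IndexError with a usable traceback; on CUDA it
    # produces the IndexKernel.cu device-side assert quoted above. Running
    # one failing batch on CPU (or with CUDA_LAUNCH_BLOCKING=1) is often
    # the fastest way to find which tensor holds the out-of-range index.
    emb = torch.nn.Embedding(num_embeddings=10, embedding_dim=4)
    ids = torch.tensor([3, 12])  # 12 is out of range for a 10-entry table
    emb(ids)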

Question about chunk mask

In the chunk-based streaming strategy, the encoder mask is calculated by the method "adaptive_enc_mask". I tried to reproduce the mask shown in the figure below. As the figure shows, the encoder has full history context, and the future context is 32 * n_encoder_layer. Is that right?
[figure: attention mask screenshot, 2021-02-10_095005]
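For discussion, here is a minimal reimplementation of the kind of chunk mask described above. This is my own sketch of the idea, not the repository's adaptive_enc_mask (in particular it assumes fixed, equal-size chunks):

    import torch

    def chunk_attention_mask(size: int, chunk_size: int) -> torch.Tensor:
        """(size, size) boolean mask: frame i may attend to the full history
        and to every frame up to the end of its own chunk. Note that if the
        same boundaries are reused in every layer, the look-ahead does not
        grow with depth, since every visible frame is capped by the same
        chunk end; per-layer growth only occurs if the boundaries shift."""
        idx = torch.arange(size)
        chunk_end = (idx // chunk_size + 1) * chunk_size  # boundary per frame
        return idx.unsqueeze(0) < chunk_end.unsqueeze(1)

    print(chunk_attention_mask(8, 4).int())  # two 4-frame chunks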

Question about ImportError: No module named sentencepiece

I run the script with the following command:
./run.sh --stage 2, and the error is:

File "~/StreamingTransformer/egs/librispeech/asr1/../../../utils/spm_encode", line 14, in
import sentencepiece as spm
ImportError: No module named sentencepiece

But in ESPnet there was no such error.
I tried the fix suggested in espnet/espnet#1656, but it did not work.
Is there any solution? Thanks!
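For what it's worth, this error usually just means the sentencepiece Python package is missing from whichever interpreter the recipe resolves (typically ESPnet's tools/venv environment rather than the system Python). Installing it there with pip install sentencepiece and then verifying the import is a reasonable first check:

    # Run this with the same Python that utils/spm_encode uses; if the
    # import succeeds here but the script still fails, the recipe is
    # picking up a different interpreter.
    import sentencepiece as spm
    print(getattr(spm, "__version__", "unknown"))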

align-variable

Hello, I ran it following the README, but got the issue below:
File "/home3/mgd/NNStudy/ASR/StreamingTransformer-master/espnet/asr/pytorch_backend/asr_ddp.py", line 118, in <listcomp>
ys_pad = pad_list([torch.from_numpy(y) for y in ys]

I checked asr_ddp.py and saw the align variable; does the data preparation need to produce alignment information?

Thank you very much!

Issue about StreamingConverter

I didn't find the third input of the function trigger_mask, named trigger, in StreamingConverter, but I found align. Should I change trigger to align?

Is it Streaming?

I think the code you provided does not support a true streaming mode.

In a real streaming condition, the recognizer cannot access the full input sequence.

But prefix_recognize() assumes it receives the full sequence and performs encoding only once.
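To make the distinction concrete, here is a purely illustrative contrast; encode_chunk, cache, and step are hypothetical names for this sketch, not APIs of this repository:

    import torch

    def offline_decode(model, feats: torch.Tensor):
        enc = model.encode(feats)      # needs the whole utterance up front
        return model.step(enc)

    def streaming_decode(model, chunk_stream):
        cache = None
        for chunk in chunk_stream:     # chunks arrive incrementally in real time
            enc, cache = model.encode_chunk(chunk, cache)
            yield model.step(enc)      # emit partial hypotheses per chunk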

Issue about performance

I have finished training and decoding on the AISHELL-1 dataset and got CER = 12.4% on the test set. I found that my model.json, which uses the default config, differs from that of Streaming_transformer-chunk32 with ESPnet Conv2d Encoder; it seems my model lacks something, such as the adaptive decoder. Can you release the result on AISHELL-1?

issue about Viterbi decoding step

I applied your Viterbi decoding step to the AISHELL-1 dataset, and the run seems to be successful, but I don't quite understand the meaning of the numbers generated for each sentence in the alignment. Does each number represent the starting frame of the corresponding token in the sentence? Thanks a lot!

How can I get the ''/path/to/model'' in Step 2, Viterbi decoding?

Hi,
I am confused about how to get the ''/path/to/model'' in Step 2, Viterbi decoding:
Step 2. Viterbi decoding
To train a TA based streaming Transformer, the alignments between CTC paths and transcriptions are required. In our work, we apply Viterbi decoding using the offline Transformer model.

cd egs/librispeech/asr1
./viterbi_decode.sh /path/to/model

Thanks a lot.

Joint CTC-triggered attention decoding algorithm code

Hi, did you finish the implementation of the joint CTC-triggered attention decoding algorithm?

I completed training the Streaming Transformer model in TensorFlow, but when I tried to implement the joint CTC-triggered attention decoding algorithm I got stuck on the CTC prefix algorithm. Is there any good idea for completing it?

Papers:
[1] Streaming Automatic Speech Recognition with the Transformer Model
[2] Streaming End-to-End Speech Recognition with Joint CTC-Attention Based Models

Missing import in streaming_transformer.py

Hi!
I'm currently trying to run the viterbi decoding with asr_recog.py, and when it gets to viterbi_decode in streaming_transformer.py it crashes because it doesn't find the viterbi_align function.

This seems to be solved by adding

from espnet.nets.viterbi_align import viterbi_align

at the beginning of the module.

Is this the right solution? If not, any idea why I'm getting this error?

To give a broader picture, I'm working with commit 19bcd9d.

Thank you very much in advance!

bad performance for streaming transformer using trigger

Hello, I trained a streaming transformer with the following config. It seems that the loss is OK,
but the decoding performance is bad. Is it necessary to use the prefix decoder?
When I use prefix_recognize, an error occurs; if I don't use prefix_recognize, the performance is bad.

File "/home/storage15/username/tools/espnet/egs/librispeech/asr1/../../../espnet/bin/asr_recog.py", line 368, in
main(sys.argv[1:])
File "/home/storage15/username/tools/espnet/egs/librispeech/asr1/../../../espnet/bin/asr_recog.py", line 335, in main
recog_v2(args)
File "/home/storage15/username/tools/espnet/espnet/asr/pytorch_backend/recog.py", line 174, in recog_v2
best, ids, score = model.prefix_recognize(feat, args, train_args, train_args.char_list, lm)
File "/home/storage15/username/tools/espnet/espnet/nets/pytorch_backend/streaming_transformer.py", line 553, in prefix_recognize
self.compute_hyps(tmp,i,h_len,enc_output, hat_att[chunk_index], mask, train_args.chunk)
File "/home/storage15/username/tools/espnet/espnet/nets/pytorch_backend/streaming_transformer.py", line 776, in compute_hyps
enc_output4use, partial_mask4use, cache4use)
File "/home/storage15/username/tools/espnet/espnet/nets/pytorch_backend/transformer/decoder.py", line 310, in forward_one_step
x, tgt_mask, memory, memory_mask, cache=c
File "/home/storage15/username/tools/anaconda3/envs/py36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in call
result = self.forward(*input, **kwargs)
File "/home/storage15/username/tools/espnet/espnet/nets/pytorch_backend/transformer/decoder_layer.py", line 94, in forward
), f"{cache.shape} == {(tgt.shape[0], tgt.shape[1] - 1, self.size)}"
AssertionError: torch.Size([5, 1, 512]) == (5, 2, 512)

train config:

This configuration requires 4 GPUs with 12 GB of memory each.

accum-grad: 1
adim: 512
aheads: 8
batch-bins: 3000000
dlayers: 6
dropout-rate: 0.1
dunits: 2048
elayers: 12
epochs: 120
eunits: 2048
grad-clip: 5
lsm-weight: 0.1
model-module: espnet.nets.pytorch_backend.streaming_transformer:E2E
mtlalpha: 0.3
opt: noam
patience: 0
sortagrad: 0
transformer-attn-dropout-rate: 0.0
transformer-init: pytorch
transformer-input-layer: conv2d
transformer-length-normalized-loss: false
transformer-lr: 1.0
transformer-warmup-steps: 2500
n-iter-processes: 0

#enc-init: exp/train_960_pytorch_train_specaug/results/model.val5.avg.best
#/path/to/model
enc-init-mods: encoder,ctc,decoder

streaming: true
chunk: true
chunk-size: 32

decode_config:
lm-weight: 0.5
beam-size: 5
penalty: 2.0
maxlenratio: 0.0
minlenratio: 0.0
ctc-weight: 0.5
threshold: 0.0005
ctc-lm-weight: 0.5
prefix-decode: true

Start train stage, data read error.

File "/root/anaconda3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
fn(i, *args)
File "/data/app/lilong/StreamingTransformer/espnet/asr/pytorch_backend/asr_ddp.py", line 319, in dist_train
train_epoch(train_loader, model, optimizer, epoch, args)
File "/data/app/lilong/StreamingTransformer/espnet/asr/pytorch_backend/asr_ddp.py", line 348, in train_epoch
for i, batch in enumerate(train_loader):
File "/root/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 615, in __next__
batch = self.collate_fn([self.dataset[i] for i in indices])
File "/root/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 615, in <listcomp>
batch = self.collate_fn([self.dataset[i] for i in indices])
File "/root/anaconda3/lib/python3.7/site-packages/chainer/dataset/dataset_mixin.py", line 67, in __getitem__
return self.get_example(index)
File "/root/anaconda3/lib/python3.7/site-packages/chainer/datasets/transform_dataset.py", line 52, in get_example
return self._transform(in_data)
File "/data/app/lilong/StreamingTransformer/espnet/asr/pytorch_backend/asr_ddp.py", line 291, in <lambda>
train_dataset = TransformDataset(train, lambda data: converter(load_tr(data)))
File "/data/app/lilong/StreamingTransformer/espnet/asr/pytorch_backend/asr_ddp.py", line 118, in __call__
ys_pad = pad_list([torch.from_numpy(y) for y in ys],
File "/data/app/lilong/StreamingTransformer/espnet/asr/pytorch_backend/asr_ddp.py", line 118, in <listcomp>
ys_pad = pad_list([torch.from_numpy(y) for y in ys],
TypeError: expected np.ndarray (got numpy.int64)

I printed ys, like this:
(1021,)

I am running train.sh under the egs/librispeech/asr1/ folder.
Is this a PyTorch or NumPy version problem?

Help me please.
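Not an official answer, but the traceback itself narrows this down: torch.from_numpy only accepts np.ndarray, so at least one y in ys is a bare NumPy scalar (e.g. a single label id) rather than an array. A minimal reproduction and the usual wrap-the-scalar fix (whether the scalar comes from a version difference or from the data preparation is a separate question):

    import numpy as np
    import torch

    y = np.int64(7)                    # a lone label id, not an array
    # torch.from_numpy(y)              # TypeError: expected np.ndarray (got numpy.int64)
    y_ok = torch.from_numpy(np.asarray(y))  # np.asarray promotes it to a 0-d ndarray
    print(y_ok)                        # tensor(7)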

Bug in prefix_recognize for TA

I trained a model with TA (triggered attention) using the Transformer on AISHELL-1, with an encoder window of left 15 / right 15 and a decoder window of left 15 / right 2. I got better accuracy on the training data, but when decoding with prefix_recognize the WER was 9.3 on the test set! That is worse than chunk32, which gets WER 6.3.
Comparing the chunk and TA training logs, the TA accuracy was better than chunk, so I suspect the decoding algorithm is wrong. By removing hat_att, which acts like a cache, TA got WER 6.5 with ctc_weight 0.5, but a terrible RTF, maybe 8-10.

Could you modify the algorithm to fix TA decoding so that it achieves both better WER and RTF?

Decode with CPU

Is there any option to decode on the CPU?
I found that there are many hard-coded CUDA arrays in streaming_transformer.py.
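There doesn't appear to be a ready-made flag; the generic PyTorch approach is to load the checkpoint with map_location and replace hard-coded .cuda() calls with an explicit device. A sketch of the pattern (not code from this repository; the checkpoint path is a stand-in):

    import torch

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # loading a checkpoint that was saved on GPU onto a CPU-only machine:
    torch.save({"w": torch.zeros(3)}, "/tmp/toy.ckpt")       # stand-in checkpoint
    state = torch.load("/tmp/toy.ckpt", map_location=device)

    # inside decoding, replace hard-coded constructions like
    # torch.zeros(...).cuda() with an explicit device:
    buf = torch.zeros(1, 10, device=device)
    print(device, buf.device)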

Can you explain the algorithm that prefix_recognize uses?

In streaming_transformer.py, prefix_recognize looks like a frame-synchronous decoding algorithm that merges chunk decoding and trigger decoding. I tried to find papers about the chunk transformer and triggered attention, but found none. Can you point me to the paper that introduces this algorithm?
I also have some questions about the CTC prefix search in the code.
In lines 662-664:
if l_plus not in hype:
Pb[l_plus] += lpz[i][0] + ...
Pb[l_plus] += lpz[i][c] * Pnb_prev[l_plus]

I suspect line 664 should be:
Pnb[l_plus] += lpz[i][c] * Pnb_prev[l_plus]
This would satisfy Algorithm 1 in the paper "First-Pass Large Vocabulary Continuous Speech Recognition using Bi-Directional Recurrent DNNs".
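For comparison, here is a self-contained sketch of one frame of CTC prefix beam search as in Algorithm 1 of that paper (Hannun et al., 2014). It is written from the paper, not from this repository's code, and kept in the linear domain for readability; note that the non-blank extension accumulates into Pnb, which is consistent with the correction suggested above:

    from collections import defaultdict

    def ctc_prefix_step(probs_t, beam, Pb_prev, Pnb_prev, blank=0):
        """One time step of CTC prefix beam search. A prefix is a tuple of
        token ids; Pb/Pnb hold the probability of the prefix whose last
        emitted frame is blank / non-blank."""
        Pb, Pnb = defaultdict(float), defaultdict(float)
        for l in beam:
            for c, p in enumerate(probs_t):
                if c == blank:
                    # a blank keeps the prefix unchanged
                    Pb[l] += p * (Pb_prev[l] + Pnb_prev[l])
                elif l and c == l[-1]:
                    Pnb[l] += p * Pnb_prev[l]        # repeat collapses into l
                    Pnb[l + (c,)] += p * Pb_prev[l]  # repeat after a blank extends l
                else:
                    Pnb[l + (c,)] += p * (Pb_prev[l] + Pnb_prev[l])
        return Pb, Pnb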

dict download

Could you upload the dict as well, especially the .model file?

Bug: raise Exception("Number of expected symbols more than the time stamps")?

Hi, sorry to disturb you.
When I use my own corpus and use the offline model to align the CTC paths, some instances hit this bug. Can you tell me how to solve this problem?

logit shape: (1220, 52) 59 52
Traceback (most recent call last):
File "../../../espnet/bin/asr_recog.py", line 180, in <module>
main(sys.argv[1:])
File "../../../espnet/bin/asr_recog.py", line 176, in main
viterbi_decode(args)
File "/espnet/asr/pytorch_backend/asr_recog.py", line 177, in viterbi_decode
align = model.viterbi_decode(feat[0][0], y)
File "/espnet/nets/pytorch_backend/e2e_asr_transformer.py", line 253, in viterbi_decode
align = viterbi_align(logit, y)[0]
File "/espnet/nets/viterbi_align.py", line 26, in viterbi_align
raise Exception("Number of expected symbols more than the time stamps")

Some instances do not have enough frames.
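For context (my reading, not the authors'): a valid CTC path needs at least one frame per output symbol, plus a mandatory blank frame between adjacent repeated symbols, so an utterance whose post-subsampling frame count falls below that bound cannot be aligned, and a guard like this exception is expected. A quick check of the bound:

    def min_ctc_frames(labels) -> int:
        """Lower bound on encoder frames a CTC alignment needs: one frame
        per label plus a blank between each pair of adjacent repeats."""
        repeats = sum(1 for a, b in zip(labels, labels[1:]) if a == b)
        return len(labels) + repeats

    print(min_ctc_frames([5, 5, 7]))  # -> 4: a blank must separate the two 5s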
