Comments (6)
Hi @DinnoKoluh, the larger latency you see for the first chunk might come from the block processing done by the streaming Conformer encoder as it needs to fill the entire initial block with (downsampled) input frames before it can actually compute any encoder output.
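To make that concrete, here is a rough back-of-the-envelope estimate of the first-chunk delay. All numbers (10 ms frame shift, 4x convolutional subsampling, block size 40, look-ahead 16) are illustrative assumptions in the spirit of typical streaming Conformer configs, not values read from any particular ESPnet config:

```python
# Rough estimate of how much audio must be buffered before the streaming
# Conformer encoder can emit its first block. All constants below are
# illustrative assumptions, not values from a specific ESPnet config.

FRAME_SHIFT_MS = 10   # frontend STFT hop (assumed)
SUBSAMPLING = 4       # conv subsampling factor before the encoder (assumed)

def first_output_delay_ms(block_size, look_ahead,
                          subsampling=SUBSAMPLING,
                          frame_shift_ms=FRAME_SHIFT_MS):
    """Audio (in ms) needed before the first encoder output can be computed."""
    needed_subsampled_frames = block_size + look_ahead
    return needed_subsampled_frames * subsampling * frame_shift_ms

# With block_size=40 and look_ahead=16: (40 + 16) * 4 * 10 = 2240 ms
print(first_output_delay_ms(40, 16))
```

So with these (assumed) settings, over two seconds of audio are consumed before any output appears, which would explain a large first-chunk latency.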
from espnet.
I understand; it's a pity the latency is that large.
Do you have any clue about the other issues I mentioned?
My guess for the increase in latency is that for each update of the transcription I get the whole transcription back instead of just the update. I am not sure if that is the desired behaviour, but it would seem more natural to me to receive updates for, say, the last 5 words instead of the whole transcript.
AFAIU, the decoding process will always return the best full-length hypothesis (or n-best) for the entire available encoded input sequence. The (label-synchronous) beam search might pick a different path when more encoder input becomes available, and thus change words that appeared in earlier results.
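If you only want the incremental part on the client side, one simple option is to diff consecutive hypotheses yourself. A minimal sketch (not part of ESPnet, just a word-level longest-common-prefix diff; note it cannot represent revisions of earlier words except by re-emitting them):

```python
def transcript_update(prev: str, curr: str) -> str:
    """Return only the suffix of `curr` that differs from `prev`.

    Since the beam search may revise earlier words when more encoder
    input arrives, we diff on words and keep everything after the
    longest common word prefix.
    """
    prev_words = prev.split()
    curr_words = curr.split()
    common = 0
    for a, b in zip(prev_words, curr_words):
        if a != b:
            break
        common += 1
    return " ".join(curr_words[common:])

# The second hypothesis revises the third word, so the update starts there:
print(transcript_update("how to recognize speech",
                        "how to wreck a nice beach"))  # "wreck a nice beach"
```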
Also, for very long audio the current decoder implementation will slow down noticeably because it keeps the whole "history" of encoded inputs. So for very long audio it is better to use VAD to split it into smaller segments before decoding.
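As an illustration of that pre-splitting step, here is a toy energy-based segmenter. A real setup would use a proper VAD model; this sketch only shows the idea of cutting on sustained low-energy stretches (all thresholds are arbitrary assumptions):

```python
import numpy as np

def split_on_silence(samples, sr, frame_ms=30, threshold=1e-3, min_silence_ms=300):
    """Toy energy-based segmentation: cut wherever the per-frame RMS stays
    below `threshold` for at least `min_silence_ms`. Returns a list of
    (start, end) sample indices. Thresholds are illustrative only."""
    frame_len = int(sr * frame_ms / 1000)
    n_frames = len(samples) // frame_len
    rms = np.array([
        np.sqrt(np.mean(samples[i * frame_len:(i + 1) * frame_len] ** 2))
        for i in range(n_frames)
    ])
    silent = rms < threshold
    min_silent_frames = max(1, min_silence_ms // frame_ms)
    segments, start, run = [], 0, 0
    for i, s in enumerate(silent):
        if s:
            run += 1
            if run == min_silent_frames:
                end = (i + 1 - run) * frame_len  # cut before the silent run
                if end > start:
                    segments.append((start, end))
            if run >= min_silent_frames:
                start = (i + 1) * frame_len      # push start past the silence
        else:
            run = 0
    if start < len(samples):
        segments.append((start, len(samples)))
    return segments
```

Each resulting segment can then be decoded independently, so the decoder's history never grows beyond one segment.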
I would also like to add that switching the device between GPU and CPU doesn't have any effect on the latency, which is odd; I would expect switching to a GPU to decrease the latency by a lot.
I haven't run any inference on GPU so can't really comment on that.
I understand, but it is supposed to be a streaming model, so I am just mimicking streaming by chunking a long audio file. In practice I would expect a stream of audio chunks of some fixed length, and that stream could last, for example, for hours.
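The chunking I am doing looks roughly like this. The helper below is generic; the commented feeding loop assumes ESPnet's `Speech2TextStreaming` interface with an `is_final` keyword (taken from the streaming demo, so check your version):

```python
def audio_chunks(samples, chunk_size):
    """Yield (chunk, is_final) pairs, mimicking a live stream by slicing
    a long recording into fixed-size chunks. The flag marks the last chunk."""
    for start in range(0, len(samples), chunk_size):
        yield samples[start:start + chunk_size], start + chunk_size >= len(samples)

# Assumed feeding loop with ESPnet's Speech2TextStreaming (names may differ
# between versions; `speech2text` is a hypothetical instance):
#
# for chunk, final in audio_chunks(speech, chunk_size=640):
#     results = speech2text(speech=chunk, is_final=final)
```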
And doesn't the `is_final` parameter reset the history (buffer), as I already mentioned?
I should maybe ask the main question with an example: Is the ESPnet streaming model capable, on a live audio stream (let's say listening to a live news channel on YouTube), of producing a live transcript which lags behind the audio stream by at most 500 ms (or some other fixed amount)?
> Is the ESPnet streaming model capable, on a live audio stream (let's say listening to a live news channel on YouTube), of producing a live transcript which lags behind the audio stream by at most 500 ms (or some other fixed amount)?
As is, it won't be able to work on a live audio stream from YouTube that runs for hours (as the decoder code keeps the entire history). You would need to implement some simple endpointing in combination with the `is_final` parameter to cut the audio at appropriate times (pauses etc.) and reset the internal buffer. You can control the delay via the encoder's block_size, hop_size and look_ahead parameters in the model config; 500 ms (except for the initial phase) will be challenging but should be possible.
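A minimal sketch of such endpointing, assuming the `is_final` keyword of `Speech2TextStreaming` resets the internal buffer (a real system would use a trained VAD instead of this energy heuristic; all thresholds are arbitrary):

```python
class SimpleEndpointer:
    """Toy endpointer: declares an endpoint after `patience` consecutive
    low-energy chunks. Only illustrates when one might pass is_final=True
    to reset the streaming decoder's buffer; thresholds are assumptions."""

    def __init__(self, threshold=1e-3, patience=8):
        self.threshold = threshold  # mean-square energy below this = "quiet"
        self.patience = patience    # quiet chunks in a row before cutting
        self.quiet = 0

    def is_endpoint(self, chunk) -> bool:
        energy = sum(x * x for x in chunk) / max(len(chunk), 1)
        self.quiet = self.quiet + 1 if energy < self.threshold else 0
        if self.quiet >= self.patience:
            self.quiet = 0  # start counting the next utterance fresh
            return True
        return False

# Assumed usage with a hypothetical chunk source and Speech2TextStreaming
# instance (names may differ between ESPnet versions):
#
# ep = SimpleEndpointer()
# for chunk in audio_stream:
#     results = speech2text(speech=chunk, is_final=ep.is_endpoint(chunk))
```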
Okay, thank you for the info.