espnetUser commented on May 27, 2024

Hi @DinnoKoluh, the larger latency you see for the first chunk likely comes from the block processing done by the streaming Conformer encoder: it needs to fill the entire initial block with (downsampled) input frames before it can compute any encoder output.
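
For a rough sense of how large that initial delay is, here is a back-of-the-envelope sketch; the 10 ms frame shift and 4x subsampling factor are typical Conformer frontend values I am assuming here, not values read from your config:

```python
# Back-of-the-envelope estimate of first-chunk latency for a blockwise
# streaming encoder. The frame shift and subsampling factor below are
# common Conformer frontend defaults, assumed here for illustration.
FRAME_SHIFT_MS = 10  # feature frontend hop size in ms (assumed)
SUBSAMPLING = 4      # conv subsampling factor before the encoder (assumed)

def first_output_latency_ms(block_size: int, look_ahead: int) -> int:
    """Milliseconds of audio that must be buffered before the encoder
    can emit its first block. block_size and look_ahead are counted in
    downsampled encoder frames, as in the model config."""
    return (block_size + look_ahead) * SUBSAMPLING * FRAME_SHIFT_MS

# e.g. block_size=40, look_ahead=16 -> (40 + 16) * 4 * 10 = 2240 ms
print(first_output_latency_ms(40, 16))
```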

DinnoKoluh commented on May 27, 2024

I understand; it's unfortunate that the initial latency is fairly large.
Do you have any insight into the other issues I mentioned?

espnetUser commented on May 27, 2024

> My guess for the increase in latency is that for each update of the transcription I get the whole transcription back instead of just the update. I am not sure if that is the desired behaviour, but it seems to me the more natural behaviour would be to return updates for, say, the last 5 words instead of the whole transcript.

AFAIU, the decoding process will always return the best (full-length) hypothesis (or n-best list) for the entire available encoded input sequence. The (label-synchronous) beam search might pick a different path when more encoder input becomes available, thus changing words that appeared in earlier results.
Also, for very long audio the current decoder implementation will slow down noticeably because it keeps the whole "history" of encoded inputs. So for very long audio it is better to use VAD to split it into smaller segments before decoding.
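
So if you only want the update, you would have to diff consecutive hypotheses on the caller's side. A minimal sketch of that post-processing (this is client-side code, not part of ESPnet's API):

```python
# Turn the full-length hypotheses the decoder returns into incremental
# updates: keep the previous best hypothesis and re-emit only from the
# first word that changed.
def transcript_update(prev_words, new_words):
    """Return (stable_prefix_len, changed_suffix)."""
    i = 0
    while i < min(len(prev_words), len(new_words)) and prev_words[i] == new_words[i]:
        i += 1
    return i, new_words[i:]

print(transcript_update("the quick brown".split(),
                        "the quick brown fox".split()))
# -> (3, ['fox'])            pure append

print(transcript_update("the quick brown socks".split(),
                        "the quick brown fox jumps".split()))
# -> (3, ['fox', 'jumps'])   beam search revised an earlier word
```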

> I would also like to add that changing the device from GPU to CPU, or vice versa, doesn't have any effect on the latency, which is odd; I would expect switching to a GPU to decrease the latency by a lot.

I haven't run any inference on GPU, so I can't really comment on that.

DinnoKoluh commented on May 27, 2024

I understand, but it is supposed to be a streaming model, so I am just mimicking streaming by chunking a long audio file. In practice, though, I would expect a stream of audio chunks of some fixed length, and that stream could last for, say, 1 hour. So inference should be done on just the incoming chunk, or the chunks near it, since it may need context to update the transcript.
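
For concreteness, here is a minimal sketch of what I mean by chunking; the model paths, audio file, and chunk size are placeholders, and I am assuming the `Speech2TextStreaming` interface from `espnet2.bin.asr_inference_streaming`:

```python
# Mimic a live stream by feeding fixed-size chunks of a long recording
# to ESPnet's streaming interface. Paths and chunk size are placeholders.
import soundfile
from espnet2.bin.asr_inference_streaming import Speech2TextStreaming

speech2text = Speech2TextStreaming(
    asr_train_config="exp/asr_train/config.yaml",      # placeholder path
    asr_model_file="exp/asr_train/valid.acc.ave.pth",  # placeholder path
)

speech, rate = soundfile.read("long_audio.wav")  # placeholder, 16 kHz assumed
chunk = 2048                                     # samples per simulated chunk

for start in range(0, len(speech), chunk):
    is_final = start + chunk >= len(speech)
    results = speech2text(speech=speech[start:start + chunk], is_final=is_final)
    if results:
        print(results[0][0])  # best hypothesis so far: the *full* transcript
```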

And doesn't the `is_final` parameter reset the history (buffer), as I already mentioned?

Maybe I should ask the main question with an example: is the ESPnet streaming model capable, on a live audio stream (say, listening to a live news channel on YouTube), of producing a live transcript that lags behind the audio by at most 500 ms (or some other fixed amount)?

espnetUser commented on May 27, 2024

> Maybe I should ask the main question with an example: is the ESPnet streaming model capable, on a live audio stream (say, listening to a live news channel on YouTube), of producing a live transcript that lags behind the audio by at most 500 ms (or some other fixed amount)?

As is, it won't be able to work on a live audio stream from YouTube that runs for hours (as the decoder code keeps the entire history). You would need to implement some simple endpointing in combination with the `is_final` parameter to cut the audio at appropriate times (pauses etc.) and reset the internal buffer. You can control the delay via the encoder's block_size, hop_size, and look_ahead parameters in the model config; 500 ms (except for the initial phase) will be challenging but should be possible.
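
A rough sketch of that endpointing idea, where the energy threshold, pause length, and the `speech2text` callable are illustrative assumptions rather than ESPnet defaults:

```python
# Simple energy-based endpointing: when enough consecutive chunks look
# silent, pass is_final=True so the decoder flushes the current segment
# and resets its internal buffer. Thresholds are illustrative only.
import numpy as np

SILENCE_RMS = 1e-3   # below this RMS a chunk counts as silence (assumed)
PAUSE_CHUNKS = 8     # this many silent chunks in a row end a segment (assumed)

def stream_with_endpointing(speech2text, chunks):
    """Yield one final transcript per detected segment."""
    silent = 0
    for chunk in chunks:
        silent = silent + 1 if np.sqrt(np.mean(chunk ** 2)) < SILENCE_RMS else 0
        end_of_segment = silent >= PAUSE_CHUNKS
        results = speech2text(speech=chunk, is_final=end_of_segment)
        if end_of_segment:
            if results:
                yield results[0][0]  # final transcript for this segment
            silent = 0  # is_final=True reset the decoder history
```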

DinnoKoluh commented on May 27, 2024

Okay, thank you for the info.
