
Comments (7)

tohe91 commented on August 20, 2024

I share your observations, especially on multiple words sharing the same timestamp, which occurs very frequently in my use cases. I have set suppress_silence = False and suppress_middle = False to get more reliable results, but of course that just creates other inaccuracies. I think your proposed solution could certainly help to stabilize the results and avoid tight word clusters.


jianfch commented on August 20, 2024

You raised interesting points. Off the top of my head, I have three solutions to address some of the issues brought up.
Let's combine the two examples you brought up into one case, and make it less favorable for the current suppression logic by shifting the 3rd candidate slightly earlier:

1  2             3      4  5   6      7
|  |             |      |  |   |      |
                   ######## ### ###### #############                  

1. Average pooling the timestamp probability distribution.
Then it becomes something like this, where the posts with a number on top have the higher probabilities:

1  2             3      4  5   6      7
|||||           |||    |||||| |||    |||
                   ######## ### ###### #############                  
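A minimal sketch of that smoothing (the function name and window size are hypothetical, not stable-ts's actual code):

    import numpy as np

    def smooth_timestamp_probs(probs: np.ndarray, pool_size: int = 5) -> np.ndarray:
        # Moving average over the 1-D timestamp probability distribution,
        # spreading probability mass onto neighboring frames.
        kernel = np.ones(pool_size) / pool_size
        # mode="same" keeps the output aligned with the original frame indices.
        return np.convolve(probs, kernel, mode="same")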

2. Min pooling the suppression mask.
This way it only suppresses up to slightly before the start of a sound/speech. Thus it leaves room for timestamps right at the edges to be chosen instead of just being ignored (without min pooling, the small gaps at 5, 6, 7 would have been suppressed).

                  3     4  5   6      7
                  |    |||||| |||    |||
                   ######## ### ###### #############                  
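A sketch of that erosion, with a hypothetical window size (True = suppress this frame):

    import numpy as np

    def min_pool_mask(suppress: np.ndarray, pool_size: int = 3) -> np.ndarray:
        # Sliding-window minimum over a boolean suppression mask: a frame
        # stays suppressed only if every frame in its window is suppressed,
        # which frees up the frames bordering speech.
        half = pool_size // 2
        out = suppress.copy()
        for i in range(len(suppress)):
            lo, hi = max(0, i - half), min(len(suppress), i + half + 1)
            out[i] = suppress[lo:hi].min()
        return out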

3. Suppressing non-gaps and min pooling the suppression mask for that as well.

                  3        5   6      7
                  |       ||| |||    |||
                   ######## ### ###### #############                  

The last method will not work as well for audio that has more than just speech. If the 4th candidate is actually a gap in speech, but background noise or some other loud sound occurs during that gap, then applying this step actually makes it perform worse. I would have it disabled for word-level timestamps.
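Purely for illustration, combining the non-gap mask with the silence mask (reusing the min_pool_mask sketch above; names are hypothetical) could look like:

    def combined_suppression(speech: np.ndarray, pool_size: int = 3) -> np.ndarray:
        # Suppress both silence (gaps) and the interior of speech (non-gaps),
        # eroding each mask so that only frames near the edges of speech
        # segments remain selectable as timestamps.
        silence_mask = min_pool_mask(~speech, pool_size)
        non_gap_mask = min_pool_mask(speech, pool_size)
        return silence_mask | non_gap_mask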


ryanheise commented on August 20, 2024

Regarding background noise, note that if I used this I would definitely replace the suppression mask with actual VAD like silero-vad. I assumed the reason you chose not to use it was that you were just showing a proof of concept, with the real intention of using proper VAD, and of making it easy for us to swap out that code for real VAD should we choose to.
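For concreteness, a minimal sketch of such a swap, based on the published silero-vad hub snippet (the exact utils layout may differ between versions):

    import torch

    model, utils = torch.hub.load('snakers4/silero-vad', 'silero_vad')
    get_speech_timestamps, _, read_audio, _, _ = utils

    wav = read_audio('audio.wav', sampling_rate=16000)
    # Each entry is {'start': sample, 'end': sample}; frames outside these
    # regions would drive the suppression mask instead of an energy-based
    # silence check.
    speech_regions = get_speech_timestamps(wav, model, sampling_rate=16000)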

Regarding your ideas, I don't exactly understand how to interpret your average pooling and min pooling. Just to be precise here, what are the actual pools/batches that you take the average or min of? In the min pooling case, it sounds like you would have a left-leaning bias, in that it would only help to find the start timestamps. I would like to see some effort to also detect the end timestamps by using the suppression mask or proper VAD. In my approach, each gap (by which I mean a silent or non-speech region) has TWO boundaries, a left and a right one. The left boundary of a gap is an attractor for the end timestamp of the preceding word, while the right boundary of a gap is an attractor for the start timestamp of the next word. It is sort of like the feature in diagramming tools where you can "snap to grid", except here we are snapping to both the left and right boundaries.
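A toy sketch of that snapping (the tolerance and data layout are hypothetical):

    SNAP_TOL = 0.25  # seconds; hypothetical attraction radius

    def snap_to_gaps(words, gaps, tol=SNAP_TOL):
        # words: list of dicts with 'start'/'end' in seconds.
        # gaps: (left, right) non-speech intervals from VAD or the mask.
        for left, right in gaps:
            for w in words:
                if abs(w['end'] - left) <= tol:
                    w['end'] = left      # gap's left edge attracts the word's end
                if abs(w['start'] - right) <= tol:
                    w['start'] = right   # gap's right edge attracts the next start
        return words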

As for "Suppressing non-gaps", I am also not clear on how to interpret "non-gap". If what you mean is that you want to suppress background noise that is not speech, then it is the same point as using a proper VAD.

As for average pooling the timestamp probability distribution, I am not really sure how this would work out in practice. It could be that "snap to vad" may still improve accuracy further on top of that.

Finally, regarding the problem where multiple words are merged together especially across sentence boundaries, I also want to suggest that in this situation the Whisper timestamps should probably not be respected at all and can be thrown out. The timestamps could be completely wrong, and yet it would still be safe for us to assume that a "." or a "。" should snap to the left boundary of a gap in the VAD, and the next word should start on the right boundary of that gap.

P.S. An afterthought here. I wonder whether these more accurate timestamps are actually fed into the prompt for the next segment, to help the Whisper model get more on track for the next segment. I don't think it does that.


ryanheise commented on August 20, 2024

Another thought.

Consider the following audio file with two sentences. Again, # indicates a probable speech signal inferred by VAD or other means, and | indicates a candidate timestamp for the start of the second sentence.

                      1 2             34          5
                      | |             ||          |
               ########   #   #                    ################
                  A       B   C                            D

               I am bob                            Nice to meet you

B and C are false positives in the speech detection. A and D are the correct designations for the first and second sentences.

The following shows how stable-ts can get this wrong:

                      1
                      |
               ########                            ################
                  A                                        B

               I am bob                            Nice to meet you

stable-ts inference:

               ########|###########################################
               I am bob  N  i  c  e    t o    m  e  e  t    y  o  u

For some reason, all the words in the second sentence are spread out. The same thing occurs in Whisper when no measures are taken to mask the large silence that may exist before the first word in a sentence.

Now it is obvious to us that right after candidate timestamp 1 there is mostly silence for a very long time, making it actually a poor candidate. The second sentence has enough speech in it that it would need a sizable amount of speech signal after its start to properly be assigned to that region. If we just look at the start timestamp alone, though, we can't see that.

We can estimate the average token duration by taking the total duration of all speech signals and dividing that by the total number of tokens in the audio file as inferred by Whisper. From that we can estimate the expected duration of sentence 2. And using that, we can figure out that candidate 5 is the only start timestamp with a sufficient amount of voice signal after it to fit the number of inferred tokens for that sentence.
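A sketch of that heuristic (all names and the slack factor are hypothetical):

    def pick_start_candidate(candidates, speech_intervals, n_tokens,
                             total_tokens, slack=1.5):
        # Average token duration = total speech time / total tokens in the file.
        total_speech = sum(r - l for l, r in speech_intervals)
        avg_token_dur = total_speech / total_tokens
        needed = n_tokens * avg_token_dur  # expected speech duration of the sentence
        for t in sorted(candidates):
            # Allow some slack for pauses inside the sentence.
            window_end = t + needed * slack
            speech_in_window = sum(max(0.0, min(r, window_end) - max(l, t))
                                   for l, r in speech_intervals)
            if speech_in_window >= needed:
                return t  # earliest candidate that can actually fit the tokens
        return min(candidates)  # fallback if no candidate qualifies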


jianfch commented on August 20, 2024

_ is where the timestamp tokens will be suppressed. # is non-silence. The silence suppression works like this right now:

                      #######    ##### #### ###     ####
______________________       ____     _    _   _____    

After min pooling it will look something like this:

                      #######    ##### #### ###     ####
_____________________         __                ___     
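A toy numeric version of the same before/after (frame values are made up; window of 3):

    import numpy as np
    from scipy.ndimage import minimum_filter1d

    # 1 = suppressed frame ('_' above), 0 = open frame.
    suppress = np.array([1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1])
    # A frame stays suppressed only if its whole window is suppressed, so
    # frames bordering speech open up and narrow suppressed gaps vanish.
    pooled = minimum_filter1d(suppress, size=3)
    print(''.join('._'[v] for v in suppress))  # before
    print(''.join('._'[v] for v in pooled))    # after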

I had considered using VAD in place of the silence mask, but adding another network to the pipeline adds more complexity, and the failures of both models will stack. You can never be certain that VAD gives you correct results, but you can always determine with certainty whether a part is silent or not.


ryanheise commented on August 20, 2024

Thanks for the example. The end result looks similar to simple padding (which also has the downside I brought up earlier), but I still don't understand the calculation, because I don't understand precisely which batches/pools are being min'ed.

But that still doesn't leverage much of the valuable timestamp data that's available in the VAD or silence mask described above.

Regarding VAD, the problem is that most of the audio I deal with has a longish intro with music, and silence detection never works correctly there. There can also be different sections of the whole audio that each have their own intro music. Silero-vad is quite reliable and overall improves the accuracy, which is the main metric. Either way, the timestamps will still be off sometimes, of course, but with Silero-vad they are comparatively far more accurate, so it's an improvement that doesn't lose any parts of the transcript and just results in more accurate timestamps overall.


jianfch commented on August 20, 2024

To deal with music, you can use a music source separation model to preprocess the audio (e.g. https://github.com/facebookresearch/demucs).
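For example, using demucs's Python entry point as documented in its README (two-stems mode separates vocals from everything else; transcribe the resulting vocals stem instead of the original mix):

    import demucs.separate

    # Writes 'vocals' and 'no_vocals' stems under the separated/ directory.
    demucs.separate.main(['--two-stems', 'vocals', 'audio.mp3'])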

