Comments (9)
For the above output I used beam_size=5, best_of=5, vad='silero:v4.0', temperature=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0). These correspond to the --accurate option, right?
The best result was using no options at all, i.e. transcribe('language-and-voice-lab/whisper-large-icelandic-62640-steps-967h', "demo1_ice.webm", language="is"). "Best result" means: there were no obvious segment metadata issues.
I will rerun with your proposals
from whisper-timestamped.
Thank you @lumpidu
Can you please also give the options you use to get the transcription with the bad (too short) second segment?
If I just run
whisper_timestamped iceland.webm --model language-and-voice-lab/whisper-large-icelandic-62640-steps-967h
I get this, which seems more correct:
I could see some problems with the --accurate option.
And here is my guess:
That model was finetuned on segments of less than 30 seconds only, without predicting the timestamp of the end of each segment.
That's why each text segment is quite long.
So with such finetuning, Whisper models lose their ability to predict timestamps.
And you will have problems transcribing long-form audio with it (audio longer than 30 seconds).
The only thing you can do to alleviate the impact with whisper-timestamped is to use the option --recompute_all_timestamps True (if you are using the CLI; in Python code it's whisper_timestamped.transcribe(..., trust_whisper_timestamps=False)).
What this option does is simply ignore the timestamps predicted by the Whisper model (which seem to be quite bad with such a finetuned model).
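As a rough mental model of what ignoring the model's timestamps means (this is not the actual whisper-timestamped implementation, and the data shapes below are assumptions for illustration): discard the segment boundaries the model predicted and rederive them from the word-level alignment.

```python
# Conceptual sketch only: segment boundaries recomputed from word-level
# alignment instead of the model's (possibly bad) predicted timestamps.
# The dict layout mimics whisper-timestamped's JSON output but is an
# assumption here, not a documented API.

def recompute_segment_times(segment):
    """Overwrite segment start/end with the first/last aligned word times."""
    words = segment["words"]
    segment["start"] = words[0]["start"]
    segment["end"] = words[-1]["end"]
    return segment

seg = {
    "start": 0.0, "end": 29.98,  # bad model-predicted boundaries
    "words": [
        {"text": "halló", "start": 1.2, "end": 1.6},
        {"text": "heimur", "start": 1.7, "end": 2.3},
    ],
}
print(recompute_segment_times(seg)["end"])
# -> 2.3
```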
Can you please try that option @lumpidu, and tell me if it solves the issue for you?
There will still be the issue that some parts of the audio might be either repeated or missing in your transcription, when transcribing audio longer than 30 seconds with such a model.
The solution I see is to use a VAD to cut the audio into pieces of at most 30 seconds.
(This is not what the VAD option of whisper-timestamped does: that one just removes silent parts to avoid Whisper hallucinating on them.)
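A minimal sketch of the suggested workaround, assuming you already have speech intervals from a VAD (the (start, end) tuples below mimic silero-vad output in seconds; the grouping helper is hypothetical, not part of whisper-timestamped): greedily merge consecutive speech regions while the overall span stays under Whisper's 30-second window, then transcribe each chunk separately.

```python
MAX_CHUNK = 30.0  # Whisper's context window in seconds

def group_speech_intervals(intervals, max_len=MAX_CHUNK):
    """Greedily merge consecutive (start, end) speech intervals while the
    total span (first start to last end) stays under max_len.
    Note: a single interval longer than max_len is not split here."""
    chunks = []
    current = []
    for start, end in intervals:
        if current and end - current[0][0] > max_len:
            chunks.append((current[0][0], current[-1][1]))
            current = []
        current.append((start, end))
    if current:
        chunks.append((current[0][0], current[-1][1]))
    return chunks

# Three speech regions; the last one would overflow the 30 s window.
print(group_speech_intervals([(0.5, 10.0), (12.0, 25.0), (28.0, 40.0)]))
# -> [(0.5, 25.0), (28.0, 40.0)]
```

Each resulting chunk can then be cut out of the audio and passed to the model independently, so no segment exceeds the 30-second context.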
Another thing you could try is the regular model, --model large-v3 --language is, instead of the finetuned model.
Maybe the transcription won't be as accurate in some places, but I guess you won't have those alignment issues.
And you will see that text segments are much shorter (a few seconds), corresponding more to what one would see in subtitles.
I just ran with whisper-large-v3. This splits the audio into much smaller segments and does a complete inverse normalization, which I actually don't want. Is there a possibility to prevent the inverse normalization and just get normalized text?
Could you elaborate on what exactly would be needed to fine-tune models that predict better timestamps?
Have you tried whisper_timestamped.transcribe(..., trust_whisper_timestamps=False) with the finetuned model?
Concerning text normalization, you mean that there are digits instead of numbers written with letters, upper-case letters, and punctuation marks?
Apart from normalizing the text yourself, I don't see another option.
Removing upper-case letters and punctuation marks is easy.
Converting digits to letters can be done with, for instance, num2words (https://pypi.org/project/num2words/).
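The easy part can be sketched with the standard library alone (the function below is a hypothetical example, not part of whisper-timestamped; spelling out the digits would be a further step delegated to a library like num2words, not shown here):

```python
import re

def normalize_text(text):
    """Lowercase and strip punctuation, collapsing extra whitespace.
    Digits are left as-is; converting them to words (language-aware)
    could be done with num2words, e.g. num2words(3, lang="is")."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)   # drop punctuation, keep letters/digits
    return re.sub(r"\s+", " ", text).strip()

print(normalize_text("Halló, heimur! Það eru 3 orð."))
# -> halló heimur það eru 3 orð
```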
Concerning fine-tuning, models should be finetuned to predict timestamps at the end of each segment.
Most people finetune Whisper models to only predict the transcription of small segments, without predicting the start/end timestamps, which makes Whisper lose its ability to be applied to audio longer than 30 seconds.
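To illustrate the difference in training targets (the helper below is hypothetical; the token format follows Whisper's convention of special timestamp tokens like <|0.00|>, quantized in 0.02 s steps): a finetuning target that preserves timestamp prediction wraps each segment's text in boundary tokens, whereas a bare-text target teaches the model to stop emitting them.

```python
# Illustrative only: shape of Whisper training targets with and without
# segment timestamp tokens. The helper is hypothetical, but the
# <|t.tt|> token format matches Whisper's timestamp-token convention.

def with_timestamps(text, start, end):
    """Wrap a segment's text in start/end timestamp tokens (seconds)."""
    return f"<|{start:.2f}|>{text}<|{end:.2f}|>"

# Target that keeps timestamp prediction alive during finetuning:
print(with_timestamps("halló heimur", 0.0, 4.5))
# -> <|0.00|>halló heimur<|4.50|>

# Finetuning on bare text like "halló heimur" (no timestamp tokens)
# teaches the model to drop them, breaking long-form transcription.
```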
@lumpidu, have you tried the option trust_whisper_timestamps=False (in Python, or --recompute_all_timestamps True in the CLI) with the finetuned model?