Comments (9)

lumpidu commented on June 12, 2024

For the above output I used beam_size=5, best_of=5, vad='silero:v4.0', temperature=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0). These correspond to the --accurate option, right?

The best result was with no extra options at all, i.e. transcribe('language-and-voice-lab/whisper-large-icelandic-62640-steps-967h', "demo1_ice.webm", language="is"). "Best result" means: there were no obvious segment metadata issues.
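For reference, the two runs correspond roughly to the following Python calls (just a sketch, assuming whisper_timestamped can load the HuggingFace checkpoint directly via load_model; the audio file name is the one from above):

import whisper_timestamped as whisper

model = whisper.load_model("language-and-voice-lab/whisper-large-icelandic-62640-steps-967h")
audio = whisper.load_audio("demo1_ice.webm")

# Run with the --accurate-style options listed above
result_accurate = whisper.transcribe(
    model, audio,
    language="is",
    beam_size=5,
    best_of=5,
    vad="silero:v4.0",
    temperature=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
)

# Run with no extra options (the one that gave clean segment metadata)
result_default = whisper.transcribe(model, audio, language="is")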

I will rerun with your proposals.

Jeronymous commented on June 12, 2024

Thank you @lumpidu

Can you please also give the options you used to get the transcription with the bad (too short) second segment?

If I just run

whisper_timestamped iceland.webm --model language-and-voice-lab/whisper-large-icelandic-62640-steps-967h

I get this, which seems more correct:
[screenshot of the transcription output]

Jeronymous commented on June 12, 2024

I could see some problems with the --accurate option.

And here is my guess:
That model was finetuned only on segments of less than 30 seconds, without predicting the timestamp of the end of each segment.
That's why each text segment is quite long.
So with such finetuning, Whisper models lose their ability to predict timestamps,
and you will have problems transcribing long-form audio (audio of more than 30 seconds) with it.

The only thing you can do to alleviate the impact with whisper-timestamped is to use the option --recompute_all_timestamps True (if you are using the CLI; otherwise, in Python code, it's whisper_timestamped.transcribe(..., trust_whisper_timestamps=False)).
What this option does is simply ignore the timestamps predicted by the Whisper model (which seem to be quite bad with such a finetuned model).
Can you please try that option @lumpidu, and tell me if it solves the issue for you?
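In Python that would look roughly like this (a sketch; the model and file name are simply the ones used above):

import whisper_timestamped as whisper

model = whisper.load_model("language-and-voice-lab/whisper-large-icelandic-62640-steps-967h")
audio = whisper.load_audio("iceland.webm")

# Ignore the segment timestamps predicted by the model and recompute all of them
result = whisper.transcribe(model, audio, language="is", trust_whisper_timestamps=False)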

There will still be the issue that some parts of the audio might be either repeated or missing in your transcription when transcribing audio of more than 30 seconds with such a model.
The solution I see is to use a VAD to cut the audio into pieces of at most 30 seconds, as sketched below.
(This is not what the VAD option of whisper-timestamped does: that one just removes silent parts to avoid Whisper hallucinating on them.)
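Something along these lines could serve as that pre-processing step (a rough sketch using the Silero VAD directly, not the built-in VAD option of whisper-timestamped; it assumes 16 kHz mono audio, that the file can be decoded, and it does not handle a single speech region longer than 30 seconds):

import torch
import torchaudio

SAMPLE_RATE = 16000
MAX_CHUNK_SAMPLES = 30 * SAMPLE_RATE

# Load the Silero VAD model and its helper functions
vad_model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
get_speech_timestamps, _, read_audio, _, _ = utils

wav = read_audio("iceland.webm", sampling_rate=SAMPLE_RATE)
speech = get_speech_timestamps(wav, vad_model, sampling_rate=SAMPLE_RATE)

# Greedily merge consecutive speech regions into chunks of at most 30 seconds
chunks = []
start = end = None
for seg in speech:
    if start is None:
        start, end = seg["start"], seg["end"]
    elif seg["end"] - start <= MAX_CHUNK_SAMPLES:
        end = seg["end"]
    else:
        chunks.append((start, end))
        start, end = seg["start"], seg["end"]
if start is not None:
    chunks.append((start, end))

# Save each chunk; each one can then be transcribed separately, and the resulting
# timestamps offset by start / SAMPLE_RATE to map back to the original audio
for i, (s, e) in enumerate(chunks):
    torchaudio.save(f"chunk_{i:03d}.wav", wav[s:e].unsqueeze(0), SAMPLE_RATE)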

Jeronymous commented on June 12, 2024

Another thing you could try is the regular model, --model large-v3 --language is, instead of the finetuned model.
Maybe the transcription won't be as accurate in some places, but I guess you won't have those alignment issues.
And you will see that the text segments are much shorter (a few seconds), corresponding more to what one would see in subtitles.

lumpidu commented on June 12, 2024

I just ran with whisper-large-v3. This splits the audio into much smaller segments and does a complete reverse normalization, which I actually don't want. Is there a way to prevent the reverse normalization and just get normalized text?

Could you elaborate on what exactly would be needed to fine-tune models so they predict better timestamps?

Jeronymous commented on June 12, 2024

Have you tried that with the finetuned model? whisper_timestamped.transcribe(..., trust_whisper_timestamps=False)

Concerning text normalization, you mean that there are digits instead of numbers written out in letters, plus upper-case letters and punctuation marks?
Apart from normalizing the text yourself as you want, I don't see another option.
Removing upper-case letters and punctuation marks is easy.
Converting digits to letters can be done with, for instance, num2words (https://pypi.org/project/num2words/).
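For example, something like this (just a sketch; num2words may not support every language, in which case the digits are left as they are):

import re
import string

from num2words import num2words

def normalize(text, lang="is"):
    # Lower-case and strip (ASCII) punctuation marks
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))

    # Convert runs of digits to words; keep the digits if the language is not supported
    def digits_to_words(match):
        try:
            return num2words(int(match.group(0)), lang=lang)
        except NotImplementedError:
            return match.group(0)

    return re.sub(r"\d+", digits_to_words, text)

# e.g. normalize("Það eru 3 hundar.") -> "það eru þrír hundar" (if num2words supports Icelandic)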

Concerning fine-tuning, models should be finetuned to predict timestamps at the end of each segment.
Most people finetune Whisper models to only predict the transcription of small segments, without predicting the start/end timestamps, which makes Whisper lose its ability to be applied to audio of more than 30 seconds.
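To illustrate, here is a rough sketch of the difference in the training targets (Whisper's timestamp tokens are quantized to 0.02 s; the text and timings are made up):

Typical finetuning target (no timestamps):
  <|startoftranscript|><|is|><|transcribe|><|notimestamps|> short segment text <|endoftext|>

Target that keeps timestamp prediction (and long-form transcription) working:
  <|startoftranscript|><|is|><|transcribe|><|0.00|> first segment text <|5.84|><|5.84|> second segment text <|11.20|><|endoftext|>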

Jeronymous commented on June 12, 2024

@lumpidu have you tried the option trust_whisper_timestamps=False (in Python, or --recompute_all_timestamps True in the CLI) with the finetuned model?
