Comments (9)
For the above output I used beam_size=5, best_of=5, vad='silero:v4.0', temperature=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0). These correspond to the --accurate option, right?
The best result was using no options at all, i.e. transcribe('language-and-voice-lab/whisper-large-icelandic-62640-steps-967h', "demo1_ice.webm", language="is"). "Best result" means: there were no obvious segment metadata issues.
I will rerun with your proposals
from whisper-timestamped.
Thank you @lumpidu
Can you please also give the options you use to get the transcription with the bad (too short) second segment?
If I just run
whisper_timestamped iceland.webm --model language-and-voice-lab/whisper-large-icelandic-62640-steps-967h
I get this, which seems more correct:
I could see some problems with the --accurate option.
And here is my guess:
That model was finetuned on segments of less than 30 seconds only, without predicting the timestamp of the end of each segment.
That's why each text segment is quite long.
So with such finetuning, Whisper models lose their ability to predict timestamps.
And you will have problems transcribing long-form audio with it (audio longer than 30 seconds).
The only thing you can do to alleviate the impact with whisper-timestamped is to use the option --recompute_all_timestamps True (if you are using the CLI; in Python code it's whisper_timestamped.transcribe(..., trust_whisper_timestamps=False)).
What this option does is simply ignore the timestamps predicted by the Whisper model (which seem to be quite bad with such a finetuned model).
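As a rough mental model of what ignoring the model's timestamps means (this is not the actual whisper-timestamped implementation, and the data shapes below are assumptions for illustration): discard the segment boundaries the model predicted and rederive them from the word-level alignment.

```python
# Conceptual sketch only: segment boundaries recomputed from word-level
# alignment instead of the model's (possibly bad) predicted timestamps.
# The dict layout mimics whisper-timestamped's JSON output but is an
# assumption here, not a documented API.

def recompute_segment_times(segment):
    """Overwrite segment start/end with the first/last aligned word times."""
    words = segment["words"]
    segment["start"] = words[0]["start"]
    segment["end"] = words[-1]["end"]
    return segment

seg = {
    "start": 0.0, "end": 29.98,  # bad model-predicted boundaries
    "words": [
        {"text": "halló", "start": 1.2, "end": 1.6},
        {"text": "heimur", "start": 1.7, "end": 2.3},
    ],
}
print(recompute_segment_times(seg)["end"])
# -> 2.3
```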
Can you please try that option @lumpidu, and tell me if it solves the issue for you?
There will still be the issue that some parts of the audio might be either repeated or missing in your transcription, when transcribing audio longer than 30 seconds with such a model.
The solution I see is to use a VAD to cut the audio into pieces of at most 30 seconds.
(This is not what the VAD option of whisper-timestamped does: that one just removes silent parts to avoid Whisper hallucinating on them.)
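A minimal sketch of the suggested workaround, assuming you already have speech intervals from a VAD (the (start, end) tuples below mimic silero-vad output in seconds; the grouping helper is hypothetical, not part of whisper-timestamped): greedily merge consecutive speech regions while the overall span stays under Whisper's 30-second window, then transcribe each chunk separately.

```python
MAX_CHUNK = 30.0  # Whisper's context window in seconds

def group_speech_intervals(intervals, max_len=MAX_CHUNK):
    """Greedily merge consecutive (start, end) speech intervals while the
    total span (first start to last end) stays under max_len.
    Note: a single interval longer than max_len is not split here."""
    chunks = []
    current = []
    for start, end in intervals:
        if current and end - current[0][0] > max_len:
            chunks.append((current[0][0], current[-1][1]))
            current = []
        current.append((start, end))
    if current:
        chunks.append((current[0][0], current[-1][1]))
    return chunks

# Three speech regions; the last one would overflow the 30 s window.
print(group_speech_intervals([(0.5, 10.0), (12.0, 25.0), (28.0, 40.0)]))
# -> [(0.5, 25.0), (28.0, 40.0)]
```

Each resulting chunk can then be cut out of the audio and passed to the model independently, so no segment exceeds the 30-second context.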
Another thing you could try is the regular model, --model large-v3 --language is, instead of the finetuned model.
Maybe the transcription won't be as accurate in some places, but I guess you won't have those alignment issues.
And you will see that text segments are much shorter (a few seconds), corresponding more to what one would see in subtitles.
I just ran with whisper-large-v3. This splits the audio into much smaller segments and does a complete inverse normalization, which I actually don't want. Is there a possibility to prevent the inverse normalization and just get normalized text?
Could you elaborate on what exactly would be needed to fine-tune models that predict better timestamps?
Have you tried whisper_timestamped.transcribe(..., trust_whisper_timestamps=False) with the finetuned model?
Concerning text normalization, you mean that there are digits instead of numbers written with letters, upper-case letters, and punctuation marks?
Apart from normalizing the text yourself, I don't see another option.
Removing upper-case letters and punctuation marks is easy.
Converting digits to letters can be done with, for instance, num2words (https://pypi.org/project/num2words/).
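The easy part can be sketched with the standard library alone (the function below is a hypothetical example, not part of whisper-timestamped; spelling out the digits would be a further step delegated to a library like num2words, not shown here):

```python
import re

def normalize_text(text):
    """Lowercase and strip punctuation, collapsing extra whitespace.
    Digits are left as-is; converting them to words (language-aware)
    could be done with num2words, e.g. num2words(3, lang="is")."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)   # drop punctuation, keep letters/digits
    return re.sub(r"\s+", " ", text).strip()

print(normalize_text("Halló, heimur! Það eru 3 orð."))
# -> halló heimur það eru 3 orð
```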
Concerning fine-tuning, models should be finetuned to predict timestamps at the end of each segment.
Most people finetune Whisper models to only predict the transcription of small segments, without predicting the start/end timestamps, which makes Whisper lose its ability to be applied to audio longer than 30 seconds.
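To illustrate the difference in training targets (the helper below is hypothetical; the token format follows Whisper's convention of special timestamp tokens like <|0.00|>, quantized in 0.02 s steps): a finetuning target that preserves timestamp prediction wraps each segment's text in boundary tokens, whereas a bare-text target teaches the model to stop emitting them.

```python
# Illustrative only: shape of Whisper training targets with and without
# segment timestamp tokens. The helper is hypothetical, but the
# <|t.tt|> token format matches Whisper's timestamp-token convention.

def with_timestamps(text, start, end):
    """Wrap a segment's text in start/end timestamp tokens (seconds)."""
    return f"<|{start:.2f}|>{text}<|{end:.2f}|>"

# Target that keeps timestamp prediction alive during finetuning:
print(with_timestamps("halló heimur", 0.0, 4.5))
# -> <|0.00|>halló heimur<|4.50|>

# Finetuning on bare text like "halló heimur" (no timestamp tokens)
# teaches the model to drop them, breaking long-form transcription.
```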
@lumpidu, have you tried the option trust_whisper_timestamps=False (in Python, or --recompute_all_timestamps True in the CLI) with the finetuned model?