Recently I have noticed that when I transcribe certain files with the default beam size, segments are occasionally missing, typically about 30 seconds long.
Can you share an audio sample with the issue?
from whisper-standalone-win.
Strangely, when I converted the whole video file into an MP3 so I could share it, and transcribed that MP3 with the default beam size before uploading, the missing text reappeared.
So transcribing the video directly and transcribing it after converting it to MP3 give different results. I then transcribed the video directly after splitting and cutting it:
when I cut the video so that only the 30 seconds before the missing part are kept and transcribe it again, the previously missing text appears. But when I keep everything before the missing part and only cut off what comes after it, the text is still missing.
Any tiny change in the audio can result in a different transcription.
Use verbose=true and take a screenshot of the console at the point where the text is missing.
The missing part is 13:05.660 to 13:36.640.
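A gap like this can also be located automatically by scanning segment timestamps for long stretches with no text. A minimal sketch, assuming the segments are available as (start, end) pairs in seconds (for example, parsed from the SRT output); find_gaps, to_timestamp, and the 20-second threshold are hypothetical names and values, not part of whisper-standalone-win, and the sample segments are invented to mirror the reported gap.

```python
def find_gaps(segments, min_gap=20.0):
    """Return (gap_start, gap_end) pairs where no segment covers at least
    `min_gap` seconds. `segments` is an iterable of (start, end) in seconds."""
    ordered = sorted(segments)
    gaps = []
    if not ordered:
        return gaps
    covered_until = ordered[0][1]  # latest end time seen so far
    for start, end in ordered[1:]:
        if start - covered_until >= min_gap:
            gaps.append((covered_until, start))
        covered_until = max(covered_until, end)
    return gaps


def to_timestamp(seconds):
    """Format seconds as MM:SS.mmm, the style used in this thread."""
    minutes, secs = divmod(seconds, 60)
    return f"{int(minutes):02d}:{secs:06.3f}"


if __name__ == "__main__":
    # Invented segments reproducing the reported 13:05.660-13:36.640 gap.
    segs = [(780.0, 785.66), (816.64, 820.0), (820.5, 825.0)]
    for gap_start, gap_end in find_gaps(segs):
        print(to_timestamp(gap_start), "->", to_timestamp(gap_end))
    # prints: 13:05.660 -> 13:36.640
```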
Yeah, it's just missing. The model sometimes simply refuses to output anything for some parts; I've seen an example where it doesn't transcribe a particular speaker.
By the way, check that VAD didn't remove that segment.
I guess not, but I will check it, because when I raised the beam size to 15, the missing part appeared. (In the segment above, the text was still missing even with the beam size increased to 10.)
Is this a random occurrence with Whisper? Is increasing the beam size the only effective way to address this?
Then VAD is not the culprit.
Whether it's a random occurrence with Whisper: maybe, I don't know.
Whether increasing the beam size is the only fix: lots of things can make the model transcribe differently; maybe even -bs=1 would make the text appear.
Sometimes almost nothing helps, as in this case: "Missing the first 21 seconds in small.en and large-v2".
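Because results vary from run to run (the default beam size and 10 missed the passage, 15 recovered it), comparing runs can be scripted by checking which ones produced text inside the problem interval. A hypothetical helper, assuming each run's segments were collected as (start, end) pairs in seconds and keyed by beam size; it does not run Whisper itself, the value 5 stands in for the default only as an assumption, and the segment data is made up to mirror this thread.

```python
def overlap_seconds(segments, start, end):
    """Seconds of [start, end] covered by (seg_start, seg_end) segments.
    Assumes segments don't overlap each other, as is typical of Whisper output."""
    return sum(max(0.0, min(seg_end, end) - max(seg_start, start))
               for seg_start, seg_end in segments)


def runs_covering(runs, start, end, min_fraction=0.5):
    """Sorted keys (e.g. beam sizes) of runs whose segments cover at least
    `min_fraction` of the interval [start, end]."""
    needed = min_fraction * (end - start)
    return sorted(key for key, segments in runs.items()
                  if overlap_seconds(segments, start, end) >= needed)


if __name__ == "__main__":
    # Made-up segments: the passage from 785.66 s to 816.64 s
    # (13:05.660-13:36.640) is only transcribed in the beam-15 run.
    runs = {
        5: [(780.0, 785.66), (816.64, 820.0)],
        10: [(780.0, 785.66), (816.64, 820.0)],
        15: [(780.0, 785.66), (785.66, 800.0), (800.0, 816.64), (816.64, 820.0)],
    }
    print(runs_covering(runs, 785.66, 816.64))  # prints [15]
```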
By the way, I have a personal test version with various ffmpeg preprocessing settings. I didn't release it because I didn't find any of the preprocessing helpful, but maybe there are some use cases for it.
Yes, converting video files to MP3 or WAV and then transcribing them seems to solve the missing-text issue. It would be great if this processing could be built in. :)
In addition, splitting the problem video into several parts also solves it, but I don't know how to automate that.
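Splitting can be scripted. A minimal sketch, assuming ffmpeg is on PATH and the total duration in seconds is already known (for example from ffprobe); it only builds the command lines, using standard ffmpeg options (-ss, -t, -i, -vn, -ac, -ar, -c:a). The function names, the 10-minute chunk length, and the part_NNN.wav naming are hypothetical choices; the 30-second overlap is loosely inspired by the observation above that keeping only the 30 seconds before the gap brought the text back.

```python
def chunk_bounds(duration, chunk=600.0, overlap=30.0):
    """Split [0, duration] seconds into windows of at most `chunk` seconds.

    Each window after the first starts `overlap` seconds before the previous
    one ended, so text near a boundary gets some preceding context.
    """
    if chunk <= overlap:
        raise ValueError("chunk must be longer than overlap")
    bounds = []
    start = 0.0
    while start < duration:
        end = min(start + chunk, duration)
        bounds.append((start, end))
        if end >= duration:
            break
        start = end - overlap
    return bounds


def ffmpeg_commands(src, duration, chunk=600.0, overlap=30.0):
    """One ffmpeg command per window, extracting 16 kHz mono 16-bit WAV audio."""
    commands = []
    for i, (start, end) in enumerate(chunk_bounds(duration, chunk, overlap)):
        commands.append([
            "ffmpeg", "-y",
            "-ss", f"{start:.3f}",        # seek to the window start
            "-t", f"{end - start:.3f}",   # window length
            "-i", src,
            "-vn",                        # drop the video stream
            "-ac", "1", "-ar", "16000",   # mono, 16 kHz
            "-c:a", "pcm_s16le",
            f"part_{i:03d}.wav",          # hypothetical output naming
        ])
    return commands


if __name__ == "__main__":
    for cmd in ffmpeg_commands("input.mp4", duration=1500.0):
        print(" ".join(cmd))
```

Each command can then be run with subprocess.run(cmd, check=True). When merging the transcriptions, each part's timestamps need to be shifted by its window start, and lines inside the overlaps will appear twice and need de-duplicating.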
You can transcribe in Subtitle Edit. When transcribing, Subtitle Edit extracts the audio to .WAV format and then transcribes it.
However, this may not help with the problem you encountered, because in some audio, sentences simply get lost. This happens less often with English audio and more often with other languages, such as the Portuguese I transcribed recently. In that case, all you can do is try changing some transcription settings, such as --compute_type, --initial_prompt auto, --initial_prompt default, and other parameters.
Everything is converted to WAV internally before anything else happens, so converting to WAV beforehand just means converting twice, which is pointless. :)
It's eerie: when I convert this video to MP3 or WAV and transcribe that, the text is indeed not missing. But if I transcribe the video directly, the text is lost, almost like a ghost story. :(
I use third-party software for lossless conversion, and it should also be using ffmpeg.
It's already using ffmpeg internally.
As for "lossless converting": maybe it's not completely lossless, and a one-bit difference anywhere in the WAV can propagate into a very different transcription.
It fixed this particular issue, but maybe it created issues in other parts, or would in other audio.
By the way, different versions of ffmpeg produce different WAVs, so it's not mathematically lossless in my book.
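Whether two conversions really are bit-identical can be checked directly. A minimal sketch using Python's standard wave module, assuming both files are uncompressed PCM WAVs (as ffmpeg writes with pcm_s16le); first_diff_frame is a hypothetical helper that compares only the sample data, so differing header or metadata chunks are ignored.

```python
import wave


def first_diff_frame(path_a, path_b):
    """Return the index of the first audio frame that differs between two PCM
    WAV files, or None if their sample data is byte-for-byte identical.

    Only the sample data is compared; header/metadata differences are ignored.
    Raises ValueError if channel count, sample width, or sample rate differ.
    """
    with wave.open(path_a, "rb") as a, wave.open(path_b, "rb") as b:
        fmt_a = (a.getnchannels(), a.getsampwidth(), a.getframerate())
        fmt_b = (b.getnchannels(), b.getsampwidth(), b.getframerate())
        if fmt_a != fmt_b:
            raise ValueError(f"different formats: {fmt_a} vs {fmt_b}")
        frame_size = fmt_a[0] * fmt_a[1]  # bytes per frame
        data_a = a.readframes(a.getnframes())
        data_b = b.readframes(b.getnframes())

    shared = min(len(data_a), len(data_b))
    for offset in range(0, shared, frame_size):
        if data_a[offset:offset + frame_size] != data_b[offset:offset + frame_size]:
            return offset // frame_size
    if len(data_a) != len(data_b):
        return shared // frame_size  # identical prefix, but one file is longer
    return None


if __name__ == "__main__":
    import sys
    idx = first_diff_frame(sys.argv[1], sys.argv[2])
    print("identical samples" if idx is None else f"first difference at frame {idx}")
```

Dividing the returned frame index by the sample rate gives the position in seconds.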
I've tried different models, beam sizes, conversion methods, and video clipping, and they all give different results. There seems to be no universal way to solve this problem.
Luckily, this is only an occasional occurrence. :)
As for building this processing in: here is a quick test1 build with a few filters added -> https://we.tl/t-F8VG46rY6V
Maybe --ff_speechnorm is the most useful(?) of these; I'll add more later. Use ffmpeg v5 to compare apples to apples.
@despairTK, weren't you the one who asked for a feature to specify which parts of the audio to transcribe? I can add that too.
Thank you very much for your special attention. I will test the beta version and give feedback.
Thanks for your work, but after testing, this version still doesn't fix that issue.
Not sure I understand you. With what command doesn't it work?
How can you say it still doesn't work after testing if you weren't able to run it... 😆
Just run ffmpeg.exe in a console and check its version; it's probably a very old one.
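The version check can be scripted too, which helps when comparing results across machines, since different ffmpeg versions can produce different WAVs. A small sketch, assuming ffmpeg is on PATH and that the first line of `ffmpeg -version` has the usual form "ffmpeg version X ..."; the helper names are hypothetical, and the handling of "n"-prefixed and snapshot ("N-...") version strings is a best-effort assumption about common build naming.

```python
import re
import shutil
import subprocess


def parse_ffmpeg_version(banner):
    """Return the version token from `ffmpeg -version` output, or None."""
    lines = banner.splitlines()
    if not lines:
        return None
    match = re.match(r"ffmpeg version (\S+)", lines[0])
    return match.group(1) if match else None


def ffmpeg_major(version):
    """Leading major number, e.g. '5.1.2-essentials' -> 5 or 'n6.0' -> 6.
    Snapshot builds such as 'N-109...' carry no release number and give None."""
    match = re.match(r"n?(\d+)", version or "")
    return int(match.group(1)) if match else None


def installed_ffmpeg_version(exe="ffmpeg"):
    """Run `exe -version` and parse it; None if the executable is not found."""
    if shutil.which(exe) is None:
        return None
    result = subprocess.run([exe, "-version"], capture_output=True, text=True)
    return parse_ffmpeg_version(result.stdout)


if __name__ == "__main__":
    version = installed_ffmpeg_version()
    print("ffmpeg:", version, "major:", ffmpeg_major(version))
```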
OK, I will check it, thanks.
Sir, I updated my ffmpeg and tried again with this command. There still seems to be text missing in THIS video. Perhaps other filters might be effective.
Here is the test2 build -> https://we.tl/t-6NPHtKQbpx
Check with --ff_mp3; this pre-processes the audio with MP3 conversion.
I don't know whether it makes a difference, but the conversion happens at the end of the audio pre-processing, not at the start. This way it's much faster.
Anyway, if it doesn't make that segment appear, I can move the conversion to the start.
I looked at the spectrograms and see that MP3 conversion at 16000 Hz cuts off frequencies above ~7300 Hz, and I think I see a slight timeline distortion, as if the audio were shifted to the right.
So here is test3 -> https://we.tl/t-MFz6hagPfA. It does the MP3 conversion at the start, so it processes the original audio.
Compare it with the test2 transcription results.
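For context on the ~7300 Hz cutoff: at a 16000 Hz sample rate, nothing above the Nyquist frequency can be represented in any format, and Whisper itself works on 16 kHz audio. The observed cutoff therefore removes roughly the top 700 Hz of the band that 16 kHz PCM could still carry. Attributing that loss to the MP3 encoder's own low-pass filter is an assumption, not something measured in this thread.

```latex
f_{\mathrm{Nyquist}} = \frac{f_s}{2} = \frac{16000\ \mathrm{Hz}}{2} = 8000\ \mathrm{Hz},
\qquad
8000\ \mathrm{Hz} - 7300\ \mathrm{Hz} \approx 700\ \mathrm{Hz} \;\; (\approx 9\%\ \text{of the band})
```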
@Purfview, could I have a Linux build of the test version? I have had a similar problem with many audio files: sometimes text is lost, sometimes it repeats itself. I generally use beam_size=5.
Based on test feedback, test3 significantly reduces missing text with the large model compared to the original, but there are still instances of missing text with the medium model.
So MP3 is not some magic "filter" that makes Whisper work better; it just alters the audio and triggers a different transcription, for better or worse... or maybe those other instances of missing text have a different cause.
Anyway, here is the test4 build -> https://we.tl/t-BqvQ33LgqT
It now has these filters ("rnndn" I find interesting; I see improvements with it):
--ff_dump
--ff_mp3
--ff_sync
--ff_rnndn_sh
--ff_rnndn_xiph
--ff_fftdn
--ff_tempo
--ff_gate
--ff_speechnorm
--ff_loudnorm
--ff_lowhighpass
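For experimenting outside the test builds, comparable preprocessing can be applied with a stock ffmpeg before transcription. A sketch under stated assumptions: the mapping from the test-build switches above to ffmpeg's afftdn, agate, speechnorm, loudnorm, highpass, and lowpass filters is inferred from the names only, the parameter values are illustrative rather than what the builds use, and FILTERS, build_af, and preprocess_cmd are hypothetical names. rnndn is left out because ffmpeg's arnndn filter needs a separate model file. Filters run in the order listed, which matters, as the discussion above about where the MP3 conversion sits in the chain illustrates.

```python
# Illustrative filter settings; the test builds' actual parameters are unknown.
FILTERS = {
    "highpass": "highpass=f=80",
    "lowpass": "lowpass=f=7500",
    "fftdn": "afftdn=nf=-25",
    "gate": "agate",
    "speechnorm": "speechnorm=e=6.25:r=0.00001:l=1",
    "loudnorm": "loudnorm=I=-16:TP=-1.5:LRA=11",
}


def build_af(names):
    """Join the selected filters, in the given order, into one -af argument."""
    unknown = [n for n in names if n not in FILTERS]
    if unknown:
        raise ValueError(f"unknown filter(s): {', '.join(unknown)}")
    return ",".join(FILTERS[n] for n in names)


def preprocess_cmd(src, dst, names):
    """ffmpeg command applying the chain, then writing 16 kHz mono 16-bit WAV."""
    cmd = ["ffmpeg", "-y", "-i", src, "-vn"]
    if names:
        cmd += ["-af", build_af(names)]
    cmd += ["-ac", "1", "-ar", "16000", "-c:a", "pcm_s16le", dst]
    return cmd


if __name__ == "__main__":
    print(" ".join(preprocess_cmd("input.mkv", "clean.wav", ["highpass", "speechnorm"])))
```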
I wrote a silence suppressor, thinking it would be a good addition to VAD, but somehow Silero works much worse with it... I'm very unhappy... maybe I need to add some artificial noise after it or something... 😧
Another possible filter, for manually selecting parts of the audio, is implemented in the latest Whisper PR, so I'll skip it.
Sorry, you'll need to wait for a non-test release. I rarely make non-Windows builds because I have to mess with VMs...
I will keep testing them. Thanks for your excellent work, sir :)
@Purfview, the new one fixed my issue, I guess. I will let you know if it happens again. I switched to Windows 11 to use the new one ;D
For discussion of the audio filters, please post in #178.