Comments (30)

Purfview commented on June 1, 2024

Recently, I have noticed that when using the default beam size to transcribe certain files, there are occasional occurrences of missing segments, typically around 30 seconds.

Can you share an audio sample with the issue?

from whisper-standalone-win.

gkngkngkn commented on June 1, 2024

Recently, I have noticed that when using the default beam size to transcribe certain files, there are occasional occurrences of missing segments, typically around 30 seconds.

Can you share an audio sample with the issue?

The strange thing is, when I converted the entire video file to MP3 so I could share it, and then transcribed that MP3 with the default beam size before uploading, the missing text reappeared.

Transcribing the video directly and transcribing it after converting it to MP3 yield different results. I then transcribed the video directly after splitting and cutting it:

When I cut the video so that it keeps only the 30 seconds before the missing part and transcribe it again, the previously missing text appears. But when I keep everything before the missing part and cut off only what comes after it, the text is still missing.

Purfview commented on June 1, 2024

Any micro change in the audio results in a different transcription.

Use verbose=true and take a screenshot of the console at the point where it's missing.

gkngkngkn commented on June 1, 2024

Any micro change in the audio results in a different transcription.

Use verbose=true and take a screenshot of the console at the point where it's missing.

[Screenshot: console output]

The missing part is from 13:05.660 to 13:36.640.
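For anyone who wants to locate such holes without reading the whole console log, here is a minimal sketch that scans segment timestamps for long gaps. The helper names (`to_seconds`, `find_gaps`) and the 20-second threshold are assumptions made for illustration; they are not part of whisper-standalone-win.

```python
import re

# Matches "MM:SS.mmm" or "HH:MM:SS,mmm" style timestamps (an assumed format).
TS = re.compile(r"(?:(\d+):)?(\d{1,2}):(\d{2})[.,](\d{3})")


def to_seconds(stamp: str) -> float:
    """Convert a timestamp such as '13:05.660' to seconds."""
    match = TS.fullmatch(stamp.strip())
    if match is None:
        raise ValueError(f"unrecognized timestamp: {stamp!r}")
    hours, minutes, seconds, millis = match.groups()
    return (int(hours or 0) * 3600 + int(minutes) * 60
            + int(seconds) + int(millis) / 1000)


def find_gaps(segments, min_gap=20.0):
    """Return (gap_start, gap_end) pairs where consecutive segments are
    separated by at least min_gap seconds.

    segments: iterable of (start, end) in seconds, sorted by start time.
    """
    gaps = []
    prev_end = None
    for start, end in segments:
        if prev_end is not None and start - prev_end >= min_gap:
            gaps.append((prev_end, start))
        prev_end = end if prev_end is None else max(prev_end, end)
    return gaps


# Usage with the range reported above as the hole between two segments:
segments = [
    (to_seconds("13:00.000"), to_seconds("13:05.660")),
    (to_seconds("13:36.640"), to_seconds("13:40.000")),
]
print(find_gaps(segments))
```

A long gap is only a hint, since it can also be genuine silence or music, so the flagged range still needs a listen.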

Purfview commented on June 1, 2024

Yeah, it's just missing. The model sometimes just refuses to output anything for some parts; I've seen an example where it doesn't transcribe a particular speaker.

Btw, check whether VAD removed that segment.
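One way to do that check, sketched under assumptions: the input format below (a list of dicts with `start` and `end` given in samples) is modeled on what Silero VAD returns from its speech-timestamp helper, and `speech_overlap` is a hypothetical name. If the overlap with the missing range is zero, VAD dropped it before the model ever saw it.

```python
def speech_overlap(speech_ts, start_s, end_s, sample_rate=16000):
    """Return how many seconds of [start_s, end_s] fall inside detected speech.

    speech_ts: list of {"start": int, "end": int} chunks in samples
               (assumed format, non-overlapping chunks).
    """
    covered = 0.0
    for chunk in speech_ts:
        chunk_start = chunk["start"] / sample_rate
        chunk_end = chunk["end"] / sample_rate
        # Length of the intersection between this chunk and the queried range.
        covered += max(0.0, min(chunk_end, end_s) - max(chunk_start, start_s))
    return covered


# Usage: made-up VAD output in which nothing was detected around 13:05-13:36.
speech_ts = [
    {"start": 0, "end": 16000 * 785},
    {"start": 16000 * 817, "end": 16000 * 900},
]
print(speech_overlap(speech_ts, 785.66, 816.64))
```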

gkngkngkn commented on June 1, 2024

Btw, check whether VAD removed that segment.

I guess not, but I will check it, because when I raised bs to 15, the missing part appeared. (In the segment above, the text was still missing even when I increased the beam size to 10.)
[Screenshot]

Yeah, it's just missing. The model sometimes just refuses to output anything for some parts; I've seen an example where it doesn't transcribe a particular speaker.

Is this a random occurrence with Whisper? Is increasing the beam size the only effective way to address this?

Purfview commented on June 1, 2024

I guess not, but I will check it, because when I raised bs to 15, the missing part appeared.

Then VAD is not the culprit.

Is this a random occurrence with Whisper?

Maybe, I dunno.

Is increasing the beam size the only effective way to address this?

Lots of things can trigger a model to transcribe differently; maybe even -bs=1 would make it appear.
Sometimes almost nothing helps, like in this case -> Missing the first 21 seconds in small.en and large-v2

Purfview commented on June 1, 2024

Btw, I have a personal test version with various ffmpeg preprocessing settings. I didn't release it because I didn't find any of the preprocessing helpful, but maybe there are some use cases for them.

gkngkngkn commented on June 1, 2024

Btw, I have a personal test version with various ffmpeg preprocessing settings. I didn't release it because I didn't find any of the preprocessing helpful, but maybe there are some use cases for them.

Yes, converting video files to MP3 or WAV format and then transcribing them seems to be a solution to the missing-text issue. It would be great if this processing could be built in. :)

In addition, splitting the problem video into several parts can also solve this problem, but I don't know how to automate this task.
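A rough sketch of how the conversion and splitting could be scripted with ffmpeg. The function names, file names, and the 600-second chunk length are assumptions for illustration; the ffmpeg options themselves (`-vn`, `-ac`, `-ar`, `-c:a pcm_s16le`, and the `segment` muxer with `-segment_time`) are standard ones. The functions only build the argument lists, so nothing runs until they are passed to `subprocess.run`.

```python
def extract_wav_cmd(src, dst, sample_rate=16000):
    """Build an ffmpeg command that drops the video stream and writes
    mono 16-bit PCM WAV at the given sample rate."""
    return [
        "ffmpeg", "-y", "-i", src,
        "-vn",                    # ignore the video stream
        "-ac", "1",               # downmix to mono
        "-ar", str(sample_rate),  # resample
        "-c:a", "pcm_s16le",      # 16-bit PCM
        dst,
    ]


def split_cmd(src, pattern, chunk_seconds=600):
    """Build an ffmpeg command that cuts src into fixed-length chunks
    without re-encoding, e.g. pattern='part_%03d.wav'."""
    return [
        "ffmpeg", "-y", "-i", src,
        "-f", "segment",
        "-segment_time", str(chunk_seconds),
        "-c", "copy",
        pattern,
    ]


# Usage (hypothetical file names); uncomment the subprocess lines to run:
# import subprocess
# subprocess.run(extract_wav_cmd("video.mkv", "audio.wav"), check=True)
# subprocess.run(split_cmd("audio.wav", "part_%03d.wav"), check=True)
print(" ".join(split_cmd("audio.wav", "part_%03d.wav")))
```

Fixed-length cuts can land in the middle of a word, and each chunk's timestamps restart at zero, so the resulting subtitles would need their offsets added back when merging.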

despairTK commented on June 1, 2024

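Btw, I have a personal test version with various ffmpeg preprocessing settings. I didn't release it because I didn't find any of the preprocessing helpful, but maybe there are some use cases for them.

Yes, converting video files to MP3 or WAV format and then transcribing them seems to be a solution to the missing-text issue. It would be great if this processing could be built in. :)

In addition, splitting the problem video into several parts can also solve this problem, but I don't know how to automate this task.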

You can transcribe in Subtitle Edit. When transcribing, Subtitle Edit extracts the audio into .WAV format and then transcribes it.

However, it may not help with the problem you encountered, because some sentences get lost in some audio files. This happens less with English audio and more with other languages, such as the Portuguese audio I transcribed recently. When it happens, you can only try changing some transcription settings, such as --compute_type --initial_prompt auto --initial_prompt default and other parameters.

Purfview commented on June 1, 2024

Converting video files to MP3 or WAV format and then transcribing them seems to be a solution...

...extract the audio into .WAV format and then transcribe it.

Everything is converted to WAV internally before anything else happens, so converting to WAV twice is pointless. :)

gkngkngkn commented on June 1, 2024

Converting video files to MP3 or WAV format and then transcribing them seems to be a solution...

...extract the audio into .WAV format and then transcribe it.

Everything is converted to WAV internally before anything else happens, so converting to WAV twice is pointless. :)

It's eerie that when I convert this video to MP3 or WAV format before transcribing, nothing goes missing. However, if I transcribe directly from the video, text is lost, almost like a ghost story. :(

I use third-party software for lossless conversion, and it should also be using ffmpeg.

Purfview commented on June 1, 2024

It's using ffmpeg internally already.

lossless converting

Maybe it's not completely lossless; a one-bit difference in the WAV can propagate into a very different transcription.
It fixed this particular issue, but maybe it created issues in other parts, or would in other audio files.

Btw, different versions of ffmpeg produce different WAVs, so it's not mathematically lossless in my book.
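To see whether two extractions really are identical, the decoded samples can be compared directly. This is a minimal sketch using Python's standard `wave` module; `count_differing_frames` is a hypothetical helper, and it assumes both files are uncompressed PCM WAVs with the same layout.

```python
import wave


def count_differing_frames(path_a, path_b):
    """Compare two PCM WAV files frame by frame.

    Returns (differing, length_delta): the number of frames that differ
    within the common length, and the difference in frame count.
    """
    with wave.open(path_a, "rb") as a, wave.open(path_b, "rb") as b:
        # Channels, sample width and sample rate must match to compare.
        if a.getparams()[:3] != b.getparams()[:3]:
            raise ValueError("channel count, sample width or sample rate differ")
        frame_bytes = a.getsampwidth() * a.getnchannels()
        data_a = a.readframes(a.getnframes())
        data_b = b.readframes(b.getnframes())

    common = min(len(data_a), len(data_b)) // frame_bytes
    differing = sum(
        data_a[i * frame_bytes:(i + 1) * frame_bytes]
        != data_b[i * frame_bytes:(i + 1) * frame_bytes]
        for i in range(common)
    )
    length_delta = abs(len(data_a) - len(data_b)) // frame_bytes
    return differing, length_delta


# Usage (hypothetical file names):
# print(count_differing_frames("ffmpeg5_dump.wav", "ffmpeg6_dump.wav"))
```

A result of `(0, 0)` means the audio data is bit-identical; anything else means the model is being fed different input, which is enough to change the transcription.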

gkngkngkn commented on June 1, 2024

It's using ffmpeg internally already.

lossless converting

Maybe it's not completely lossless; a one-bit difference anywhere in the WAV can propagate into a very different transcription. It fixed this particular issue, but maybe it created issues in other parts, or would in other audio files.

Btw, different versions of ffmpeg produce different WAVs, so it's not mathematically lossless in my book.

So far I have tried different models, beam sizes, conversion methods, and video clipping, and they all yield different results. It seems there is no universal method to solve this problem.
Luckily, this is just an incidental occurrence. :)

Purfview commented on June 1, 2024

It would be great if this processing could be built-in. :)

Here is a quick test1 build with a few filters added -> https://we.tl/t-F8VG46rY6V
Maybe --ff_speechnorm is the most useful(?) of these; I'll add more later. Use ffmpeg v5 to compare apples to apples.

@despairTK wasn't it you who asked for a feature to specify the audio parts to transcribe? I can add that too.

despairTK commented on June 1, 2024

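It would be great if this processing could be built in. :)

Here is a quick test1 build with a few filters added -> https://we.tl/t-F8VG46rY6V Maybe --ff_speechnorm is the most useful(?) of these; I'll add more later. Use ffmpeg v5 to compare apples to apples.

Wasn't it you who asked for a feature to specify the audio parts to transcribe? I can add that too.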

Thank you very much for your special attention. I will test the beta version and give feedback.

gkngkngkn commented on June 1, 2024

It would be great if this processing could be built-in. :)

Here is a quick test1 build with a few filters added -> https://we.tl/t-F8VG46rY6V Maybe --ff_speechnorm is the most useful(?) of these; I'll add more later. Use ffmpeg v5 to compare apples to apples.

@despairTK wasn't it you who asked for a feature to specify the audio parts to transcribe? I can add that too.

Thx for your work, but this version still doesn't work on that issue after testing.

Purfview commented on June 1, 2024

Thx for your work, but this version still doesn't work on that issue after testing.

Not sure that I understand you. With which command doesn't it work?

gkngkngkn commented on June 1, 2024

Thx for your work, but this version still doesn't work on that issue after testing.

Not sure that I understand you. With which command doesn't it work?

[Screenshot]

Purfview commented on June 1, 2024

How is it that it "still doesn't work on that issue after testing" if you weren't able to run it... 😆

Just run ffmpeg.exe in the console and check its version; it's probably some very old one.

gkngkngkn commented on June 1, 2024

How is it that it "still doesn't work on that issue after testing" if you weren't able to run it... 😆

Just run ffmpeg.exe in the console and check its version; it's probably some very old one.

OK, I will check it, thx.

gkngkngkn commented on June 1, 2024

How is it that it "still doesn't work on that issue after testing" if you weren't able to run it... 😆

Just run ffmpeg.exe in the console and check its version; it's probably some very old one.

Sir, I updated my ffmpeg and tried again with this command. There still seems to be text missing in THIS video. Perhaps other filters might be effective.

Purfview commented on June 1, 2024

Here is test2 build -> https://we.tl/t-6NPHtKQbpx

Check with --ff_mp3; this will pre-process the audio with MP3 conversion.
Dunno if it makes a difference, but the conversion is at the end of the audio pre-processing, not at the start. This way it's much faster.
Anyway, if it doesn't make that segment appear, then I can move this conversion to the start.

Purfview commented on June 1, 2024

I looked at the spectrograms and I see that MP3 conversion at 16000 Hz cuts off frequencies above ~7300 Hz, and I think I see a slight timeline distortion, as if the audio moved to the right.
So here is test3 -> https://we.tl/t-MFz6hagPfA . It does the MP3 conversion at the start, so it processes the original audio.
Compare its transcription results to test2's.

Original:
[Image: Infinity Pool test mkv_dump wav]

test2:
[Image: Infinity Pool test mkv_dump_mp3 wav]

test3:
[Image: Infinity Pool test mkv_dump_mp3_2 wav]

 avatar commented on June 1, 2024

@Purfview can I have a Linux build of the test version? I have had a similar problem with many audio files. Sometimes there is text loss, sometimes it repeats itself. I generally use beam_size=5.

gkngkngkn commented on June 1, 2024

Based on the feedback from the tests, the use of test3 has significantly improved the situation of missing text in the large model compared to the original, but there are still instances of missing text in the medium model.

Purfview commented on June 1, 2024

Based on the feedback from the tests, the use of test3 has significantly improved the situation of missing text in the large model compared to the original, but there are still instances of missing text in the medium model.

So, MP3 is not some magic "filter" that makes Whisper work better; it just alters the audio and triggers a different transcription, for better or worse... or maybe those other instances of missing text are caused by a different issue.

Anyway, here is test4 build -> https://we.tl/t-BqvQ33LgqT
Now it has these filters (I find "rnndn" interesting; I see improvements with it):

--ff_dump
--ff_mp3
--ff_sync
--ff_rnndn_sh
--ff_rnndn_xiph
--ff_fftdn
--ff_tempo
--ff_gate
--ff_speechnorm
--ff_loudnorm
--ff_lowhighpass

I wrote a silence suppressor and thought it would be a good addition to VAD, but somehow Silero works much worse with it... I'm very unhappy... maybe I need to add some artificial noise after it or something... 😧
Another possible filter, for manually selecting parts of the audio, is implemented in the latest Whisper PR, so I'll skip it.

@Purfview can I have a Linux build of the test version? I have had a similar problem with many audio files. Sometimes there is text loss, sometimes it repeats itself. I generally use beam_size=5.

Sorry, you need to wait for a non-test release. I rarely build non-Windows releases, because I need to mess with VMs...

gkngkngkn commented on June 1, 2024

will continue testing them, thx for ur excellent job sir :)

 avatar commented on June 1, 2024

@Purfview that new one fixed my issue ig i will let u know if it happens again, switched to w11 to use the new one ;D

Purfview commented on June 1, 2024

For anything about the audio filters, post there: #178
