
Comments (11)

Jeronymous avatar Jeronymous commented on June 20, 2024

Can you please provide the code or command that is failing?
And the version of whisper-timestamped?

Some related issues were fixed on Monday, and whisper-timestamped should now work perfectly with large-v3 (I double-checked: not specifying the language works with that latest model).
It's also not clear how you use whisper-timestamped (which is designed for timestamped transcriptions) for language detection...

from whisper-timestamped.

andruxa-smirnov avatar andruxa-smirnov commented on June 20, 2024

whisper_timestamped 1.13.2

```python
import whisper_timestamped as whisper

audio = whisper.load_audio("test.tmp")

lang_audio = whisper.pad_or_trim(audio)

mel = whisper.log_mel_spectrogram(lang_audio).to("cuda")

# detect the spoken language
model = whisper.load_model("large-v3", download_root="/opt/models/", device="cuda")
_, probs = model.detect_language(mel)

result = whisper.transcribe(model, audio)

import json
print(json.dumps(result, indent=2, ensure_ascii=False))
```


andruxa-smirnov avatar andruxa-smirnov commented on June 20, 2024

whisper - 20231106


Jeronymous avatar Jeronymous commented on June 20, 2024

OK, then you should just call log_mel_spectrogram with the right number of features (it changed in large-v3, from 80 to 128):

mel = whisper.log_mel_spectrogram(lang_audio, model.dims.n_mels).to("cuda")

Nothing related to this repo as your usage is not documented here.

Side note: why don't you pass the detected language to whisper.transcribe?


andruxa-smirnov avatar andruxa-smirnov commented on June 20, 2024

I want to detect the language first, then transcribe using the top language


andruxa-smirnov avatar andruxa-smirnov commented on June 20, 2024

mel = whisper.log_mel_spectrogram(lang_audio, model.dims.n_mels).to("cuda")

Ok. You are absolutely right. Now it works fine. Thank you!


Jeronymous avatar Jeronymous commented on June 20, 2024

I want to detect the language first, then transcribe using the top language

Yes, look at my snippet above.
I suggest you pass the detected language to whisper.transcribe.
This saves computation, and it also guarantees that both steps agree on the language (they can differ in some settings, for instance if VAD is used in whisper.transcribe).

And thinking about it, your usage is not optimal, because you possibly compute the mel spectrogram on a very long audio only to use the first 30 seconds in the end (language detection is performed on the first 30 seconds only).
It seems easy to add something in whisper-timestamped that puts the language probabilities as a new key in the output dictionary.
You can open an issue requesting this feature if you are interested (in having a simple, optimized way to do what you do).
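Putting the two suggestions together, a minimal sketch might look like this (assumptions: whisper-timestamped is installed, a CUDA device is available, and the paths/filenames are the ones from the snippet above):

```python
import whisper_timestamped as whisper

model = whisper.load_model("large-v3", download_root="/opt/models/", device="cuda")

audio = whisper.load_audio("test.tmp")

# Language detection only looks at the first 30 seconds
lang_audio = whisper.pad_or_trim(audio)

# Pass the model's number of mel bins (128 for large-v3, 80 for earlier models)
mel = whisper.log_mel_spectrogram(lang_audio, model.dims.n_mels).to("cuda")
_, probs = model.detect_language(mel)
top_lang = max(probs, key=probs.get)

# Reuse the detected language so transcription cannot disagree with it
result = whisper.transcribe(model, audio, language=top_lang)
```

The `language=` argument mirrors the standard whisper transcribe option; whether you also want VAD or other options on top of this is a separate choice.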


andruxa-smirnov avatar andruxa-smirnov commented on June 20, 2024

I don't understand why it's not optimal. I am taking 30 seconds of the audio, then transforming it to a mel spectrogram for language detection. After detecting the language, I move on to transcription.


Jeronymous avatar Jeronymous commented on June 20, 2024

OK my bad, I missed the use of pad_or_trim.

Then the only remaining problem is guaranteeing that the detected languages are the same (between model.detect_language and whisper.transcribe). The devil is in the details (butterfly effects...), and there can be corner cases where the two calls detect two different languages.
You can probably start by checking that they are the same.

Note that using VAD (an option available in whisper_timestamped.transcribe) can improve language detection a lot in cases where the first 30 seconds of audio mainly contain silence or background music.
So again, adding the language detection probabilities "inside" whisper_timestamped.transcribe would be more user-friendly and would unlock possible improvements.


andruxa-smirnov avatar andruxa-smirnov commented on June 20, 2024

So again, adding the language detection probabilities "inside" whisper_timestamped.transcribe would be more user-friendly and would unlock possible improvements.

That would be interesting.


Jeronymous avatar Jeronymous commented on June 20, 2024

@andruxa-smirnov I added the feature.
Now, if you don't specify the language of the audio, you will have a new key in the output dictionary, with the language probabilities.
So the output will look like

{
  ...
  "language": "fr",
  "language_probs": {
    "en": 0.027954353019595146,
    "zh": 0.02743500843644142,
    ...
    "su": 3.0119704064190955e-08,
    "yue": 2.2565967810805887e-05
  }
}

and it should work with all options.
You can read https://github.com/linto-ai/whisper-timestamped#options-that-may-improve-results to see options that can improve accuracy (like VAD)
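Downstream code can then read the top language back from that dictionary. A minimal sketch on a made-up result shaped like the example above (the probability values here are illustrative, not real model output):

```python
# Illustrative output dictionary shaped like the example above (values are made up)
result = {
    "language": "fr",
    "language_probs": {"en": 0.028, "zh": 0.027, "fr": 0.91},
}

# Pick the most probable language and its score
top_lang = max(result["language_probs"], key=result["language_probs"].get)
top_prob = result["language_probs"][top_lang]
print(top_lang, top_prob)  # fr 0.91
```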
