
Comments (11)

Jeronymous avatar Jeronymous commented on June 20, 2024

Can you please provide the code or command that is failing?
And the version of whisper-timestamped?

Some related issues were fixed on Monday, and whisper-timestamped should now work perfectly with large-v3 (I double-checked: not specifying the language works with that latest model).
It's also not clear how you use whisper-timestamped (which is designed for timestamped transcriptions) for language detection...

from whisper-timestamped.

andruxa-smirnov avatar andruxa-smirnov commented on June 20, 2024

whisper_timestamped 1.13.2

```python
import whisper_timestamped as whisper

audio = whisper.load_audio("test.tmp")

lang_audio = whisper.pad_or_trim(audio)

mel = whisper.log_mel_spectrogram(lang_audio).to("cuda")

# detect the spoken language
model = whisper.load_model("large-v3", download_root="/opt/models/", device="cuda")
_, probs = model.detect_language(mel)

result = whisper.transcribe(model, audio)

import json
print(json.dumps(result, indent=2, ensure_ascii=False))
```


andruxa-smirnov avatar andruxa-smirnov commented on June 20, 2024

whisper - 20231106


Jeronymous avatar Jeronymous commented on June 20, 2024

OK, then you should just call log_mel_spectrogram with the right number of features (it changed in large-v3, from 80 to 128):

mel = whisper.log_mel_spectrogram(lang_audio, model.dims.n_mels).to("cuda")

Nothing related to this repo as your usage is not documented here.

Side note: why don't you pass the detected language to whisper.transcribe?


andruxa-smirnov avatar andruxa-smirnov commented on June 20, 2024

I want to detect the language first, then transcribe using the top language


andruxa-smirnov avatar andruxa-smirnov commented on June 20, 2024

mel = whisper.log_mel_spectrogram(lang_audio, model.dims.n_mels).to("cuda")

Ok. You are absolutely right. Now it works fine. Thank you!


Jeronymous avatar Jeronymous commented on June 20, 2024

I want to detect the language first, then transcribe using the top language

Yes, look at my snippet above.
I suggest you pass the detected language to whisper.transcribe.
This saves computation, and it also guarantees that both steps agree on the language (they can differ in some settings, for instance if VAD is used in whisper.transcribe).

And thinking about it, your usage is not optimal, because you possibly compute the mel spectrogram on a very long audio only to use the first 30 seconds in the end (language detection is performed on the first 30 seconds only).
It seems easy to add something in whisper-timestamped that puts the language probabilities as a new key in the output dictionary.
You can open an issue requesting this feature if you are interested (in having a simple, optimized way to do what you do).
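Putting the two suggestions together, a minimal sketch might look like this (assumptions: whisper-timestamped is installed, a CUDA device is available, and the paths/filenames are the ones from the snippet above):

```python
import whisper_timestamped as whisper

model = whisper.load_model("large-v3", download_root="/opt/models/", device="cuda")

audio = whisper.load_audio("test.tmp")

# Language detection only looks at the first 30 seconds
lang_audio = whisper.pad_or_trim(audio)

# Pass the model's number of mel bins (128 for large-v3, 80 for earlier models)
mel = whisper.log_mel_spectrogram(lang_audio, model.dims.n_mels).to("cuda")
_, probs = model.detect_language(mel)
top_lang = max(probs, key=probs.get)

# Reuse the detected language so transcription cannot disagree with it
result = whisper.transcribe(model, audio, language=top_lang)
```

The `language=` argument mirrors the standard whisper transcribe option; whether you also want VAD or other options on top of this is a separate choice.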


andruxa-smirnov avatar andruxa-smirnov commented on June 20, 2024

I don't understand why it's not optimal. I am taking 30 seconds of the audio, then transforming it to a mel spectrogram for language detection. After detecting the language, I move on to transcription.


Jeronymous avatar Jeronymous commented on June 20, 2024

OK my bad, I missed the use of pad_or_trim.

Then the only remaining problem is guaranteeing that the detected languages are the same (between model.detect_language and whisper.transcribe). The devil is in the details (butterfly effects...), and there can be corner cases where the two calls detect two different languages.
You can probably start by checking that they are the same.

Note that using VAD (an option available in whisper_timestamped.transcribe) can improve language detection a lot in cases where the first 30 seconds of audio mainly contain silence or background music.
So again, adding the language detection probabilities "inside" whisper_timestamped.transcribe would be more user-friendly and would unlock possible improvements.


andruxa-smirnov avatar andruxa-smirnov commented on June 20, 2024

So again, adding the language detection probabilities "inside" whisper_timestamped.transcribe would be more user-friendly and would unlock possible improvements.

That would be interesting.


Jeronymous avatar Jeronymous commented on June 20, 2024

@andruxa-smirnov I added the feature.
Now, if you don't specify the language of the audio, you will have a new key in the output dictionary, with the language probabilities.
So the output will look like

{
  ...
  "language": "fr",
  "language_probs": {
    "en": 0.027954353019595146,
    "zh": 0.02743500843644142,
    ...
    "su": 3.0119704064190955e-08,
    "yue": 2.2565967810805887e-05
  }
}

and it should work with all options.
You can read https://github.com/linto-ai/whisper-timestamped#options-that-may-improve-results to see options that can improve accuracy (like VAD)
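Downstream code can then read the top language back from that dictionary. A minimal sketch on a made-up result shaped like the example above (the probability values here are illustrative, not real model output):

```python
# Illustrative output dictionary shaped like the example above (values are made up)
result = {
    "language": "fr",
    "language_probs": {"en": 0.028, "zh": 0.027, "fr": 0.91},
}

# Pick the most probable language and its score
top_lang = max(result["language_probs"], key=result["language_probs"].get)
top_prob = result["language_probs"][top_lang]
print(top_lang, top_prob)  # fr 0.91
```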
