Comments (11)
Can you please provide the code or command that is failing?
And the version of whisper-timestamped?
Some related issues were fixed on Monday, and whisper-timestamped should now work perfectly with large-v3 (I double-checked: not specifying the language works with that latest model).
It's not clear how you use whisper-timestamped (which is designed for timestamped transcriptions) for language detection...
from whisper-timestamped.
whisper_timestamped 1.13.2
```python
import whisper_timestamped as whisper

audio = whisper.load_audio("test.tmp")
lang_audio = whisper.pad_or_trim(audio)
mel = whisper.log_mel_spectrogram(lang_audio).to("cuda")

# detect the spoken language
model = whisper.load_model("large-v3", download_root="/opt/models/", device="cuda")
_, probs = model.detect_language(mel)

result = whisper.transcribe(model, audio)

import json
print(json.dumps(result, indent=2, ensure_ascii=False))
```
whisper - 20231106
OK, then you should just call `log_mel_spectrogram` with the right number of features (it changed from 80 to 128 in large-v3):

```python
mel = whisper.log_mel_spectrogram(lang_audio, model.dims.n_mels).to("cuda")
```
This is not related to this repo, as your usage is not documented here.
Side note: why don't you pass the detected language to `whisper.transcribe`?
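The suggested flow can be sketched as follows. This is a hypothetical illustration, not code from the thread: `fake_transcribe` is a stand-in for `whisper.transcribe(model, audio, language=...)`, and `probs` stands in for the output of `model.detect_language`.

```python
# Hypothetical sketch: run language detection once, then pass the winner
# to the transcription call so the language is not detected a second time.
def fake_transcribe(audio, language=None):
    # whisper.transcribe only re-runs detection when language is None
    return {"language": language or "auto-detected", "text": "..."}

probs = {"en": 0.91, "ru": 0.06, "fr": 0.03}  # stand-in for model.detect_language output
top_language = max(probs, key=probs.get)       # pick the most probable language code
result = fake_transcribe("audio.wav", language=top_language)
print(result["language"])   # → en
```

With the real library, the last call would be `whisper.transcribe(model, audio, language=top_language)`.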
I want to detect the language first, then transcribe using the top language.
```python
mel = whisper.log_mel_spectrogram(lang_audio, model.dims.n_mels).to("cuda")
```

OK. You are absolutely right. Now it works fine. Thank you!
I want to detect the language first, then transcribe using the top language.
Yes, look at my command above.
I suggest you pass the detected language to the model, to save computation and also to guarantee that the languages are the same (they could differ in some settings, for instance if VAD is used in whisper.transcribe).
And thinking about it, your usage is not optimal: you possibly extract the mel spectrogram of a very long audio, only to use the first 30 seconds in the end (language detection is performed on the first 30 seconds).
It seems easy to add something in whisper-timestamped that exposes the language probabilities as a new key in the output dictionary. You can open an issue requesting this feature if you are interested (in having a simple, optimized way to do what you do).
I don't understand why it's not optimal. I am taking 30 seconds from the audio, then transforming it to a mel spectrogram for language detection. After detecting the language I move on to transcription.
OK, my bad, I missed the use of pad_or_trim.
Then there is only the problem of guaranteeing that the detected languages are the same (between model.detect_language and whisper.transcribe). The devil is in the details (butterfly effects...) and there can be corner cases where they detect two different languages. You can probably start by checking that they are the same.
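The suggested sanity check can be sketched like this. It is a hypothetical illustration: `probs` and `result` are stand-ins for the real outputs of `model.detect_language` and `whisper.transcribe` in the snippets above.

```python
# Hypothetical sketch: compare the language chosen by model.detect_language
# with the one whisper.transcribe reports in its output dictionary.
probs = {"en": 0.91, "ru": 0.06, "fr": 0.03}  # stand-in for detect_language probabilities
result = {"language": "en", "text": "..."}     # stand-in for transcribe output

detected = max(probs, key=probs.get)           # top-probability language code
if detected != result["language"]:
    print(f"Warning: detect_language chose {detected!r}, "
          f"but transcribe reported {result['language']!r}")
```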
Note that using VAD (an option available in whisper_timestamped.transcribe) can greatly improve language detection in cases where the first 30 seconds of audio mainly contain silence or background music.
So again, adding language detection probability "inside" whisper_timestamped.transcribe would be more user-friendly and unlock possible improvements.
> So again, adding language detection probability "inside" whisper_timestamped.transcribe would be more user-friendly and unlock possible improvements.

It would be interesting.
@andruxa-smirnov I added the feature.
Now, if you don't specify the language of the audio, you will have a new key in the output dictionary, with the language probabilities.
So the output will look like:

```json
{
  ...
  "language": "fr",
  "language_probs": {
    "en": 0.027954353019595146,
    "zh": 0.02743500843644142,
    ...
    "su": 3.0119704064190955e-08,
    "yue": 2.2565967810805887e-05
  }
}
```
and it should work with all options.
You can read https://github.com/linto-ai/whisper-timestamped#options-that-may-improve-results to see options that can improve accuracy (like VAD)
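A minimal sketch of consuming the new key: rank the languages by probability and read off the winner. The probabilities here are hypothetical, abbreviated from the example output above.

```python
# Hypothetical sketch: rank languages using the new "language_probs" key
# of the output dictionary (values are illustrative, not real output).
result = {
    "language": "fr",
    "language_probs": {"fr": 0.93, "en": 0.028, "zh": 0.027},
}

# sort language codes by descending probability
ranked = sorted(result["language_probs"].items(), key=lambda kv: kv[1], reverse=True)
top_language, top_prob = ranked[0]
print(top_language)   # → fr  (matches result["language"])
```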