Comments (9)
Thank you for this!
It is clear now how it works and I will be able to fix it.
from whisper.cpp.
This is the Python code to decode it, adapted from https://github.com/openai/gpt-2/blob/master/src/encoder.py
You need the vocab.json file to run it.
import json
import sys

def bytes_to_unicode():
    """
    Returns list of utf-8 byte and a corresponding list of unicode strings.
    The reversible bpe codes work on unicode strings.
    This means you need a large # of unicode characters in your vocab if you want to avoid UNKs.
    When you're at something like a 10B token dataset you end up needing around 5K for decent coverage.
    This is a significant percentage of your normal, say, 32K bpe vocab.
    To avoid that, we want lookup tables between utf-8 bytes and unicode strings.
    And avoids mapping to whitespace/control characters the bpe code barfs on.
    """
    _chr = unichr if sys.version_info[0] == 2 else chr
    # Printable bytes map to themselves.
    bs = list(range(ord("!"), ord("~")+1))+list(range(ord("¡"), ord("¬")+1))+list(range(ord("®"), ord("ÿ")+1))
    cs = bs[:]
    n = 0
    # Every remaining byte gets a code point starting at 256.
    for b in range(2**8):
        if b not in bs:
            bs.append(b)
            cs.append(2**8+n)
            n += 1
    cs = [_chr(n) for n in cs]
    return dict(zip(bs, cs))

with open('./vocab.json', 'r', encoding='utf-8') as f:
    vocab = json.loads(f.read())

# Token id -> vocab string, and vocab character -> raw byte.
rev = {v: k for k, v in vocab.items()}
byte_encoder = bytes_to_unicode()
byte_decoder = {v: k for k, v in byte_encoder.items()}

def decode(tokens):
    # Join the vocab strings first, then turn the whole sequence back into bytes.
    text = ''.join([rev[token] for token in tokens])
    text = bytearray([byte_decoder[c] for c in text]).decode('utf-8')
    return text

print(decode([2415, 229])) # '宇'
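For anyone without vocab.json at hand, here is a minimal sketch of just the byte table, so the mapping can be checked on its own. It restates bytes_to_unicode in a Python 3 only form; to_vocab_string and from_vocab_string are hypothetical helper names that are not in the GPT-2 source.

```python
def bytes_to_unicode():
    """Python 3 only restatement of the byte -> unicode table builder."""
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))

byte_encoder = bytes_to_unicode()
byte_decoder = {v: k for k, v in byte_encoder.items()}

def to_vocab_string(text):
    # Hypothetical helper: UTF-8 bytes of `text`, each mapped to its table character.
    return "".join(byte_encoder[b] for b in text.encode("utf-8"))

def from_vocab_string(s):
    # Hypothetical helper: inverse of to_vocab_string.
    return bytes(byte_decoder[c] for c in s).decode("utf-8")

mapped = to_vocab_string("宇")
print(len(byte_encoder), len(mapped))  # 256 3
print(from_vocab_string(mapped))       # 宇
```

The table covers all 256 byte values, so any UTF-8 text survives the round trip, but a single table character only stands for one byte, not for a whole character.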
from whisper.cpp.
For my part, I can confirm this fixes the issue for me.
Before fix:
./main -m models/ggml-large.bin -l ja -f output.wav
[00:00.000 --> 00:04.040] さくらちゃん**神��もすっごくいいし、バトンもうまいんだけど。
After fix:
./main -m models/ggml-large.bin -l ja -f output.wav
[00:00.000 --> 00:04.040] さくらちゃん**神経もすっごくいいし、バトンもうまいんだけど。
from whisper.cpp.
Can you provide a short audio sample that fails?
from whisper.cpp.
If it helps I can give an example for Croatian. It also happens for Croatian sometimes.
[01:11.000 --> 01:15.000] Sanadar u završnjoj riječi i nasu��enju za HIPO rekao da nije kriv presuda sljedećeg tjedna.
I do not know what the missing letter is / what the word means. In another audio file, which is not on youtube anymore, I got dovi��enja instead of doviđenja.
I produced the input file with:
youtube-dl -x --audio-format=mp3 $video_url
ffmpeg -i $mp3_file -ar 16000 -ac 1 -c:a pcm_s16le whisper_input.wav
On a side note: I am very impressed. With the normal whisper code on my CPU 1 minute of audio took about 1 hour of runtime with the large model. With your C++ project it is much less, maybe a few minutes per audio minute.
from whisper.cpp.
@ggerganov this is the sample audio, hope it helps. PS: it is a zip file, so unzip it first.
from whisper.cpp.
So I found the reason why it fails to transcribe these characters, but I don't know how to fix it.
The tokenizer is more complicated than I expected. I thought that each token corresponds to a certain text and you simply have to convert each token separately and join the texts. However, it turns out that there are certain tokens that can be "chained" to produce a text.
I tried to understand the decoding algorithm of the tokenizer, using the original Python implementation, but I get lost in the code and cannot figure it out.
What I need to understand is how the following example works:
https://github.com/ggerganov/whisper.cpp/blob/tokenizer/tokenizer-test.py
Notice that the 2 tokens 2415 and 229 individually are decoded to garbage, while together they are decoded as 宇.
I think the tokenizer somehow uses the merges.txt data, which I currently completely ignore.
Anyway, hopefully someone can give me some tips on how this decoding process works. For now, I am not able to fix this.
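A minimal sketch of the likely cause, assuming the tokens stand for raw UTF-8 bytes and that a multi-byte character can be split across two tokens. The two-byte plus one-byte split used here is an assumption for illustration; it does not read the real vocabulary. It uses only the standard library.

```python
import codecs

data = "宇".encode("utf-8")       # three bytes: e5 ae 87
pieces = [data[:2], data[2:]]      # assumed split across two tokens

# Decoding each piece on its own yields replacement characters (the "garbage").
for p in pieces:
    print(repr(p.decode("utf-8", errors="replace")))

# Joining the bytes first and decoding once gives the intended character.
print(b"".join(pieces).decode("utf-8"))

# When text must be emitted token by token, an incremental decoder holds
# back incomplete byte sequences until the rest arrives.
decoder = codecs.getincrementaldecoder("utf-8")()
out = [decoder.decode(p) for p in pieces]
print(out)  # ['', '宇']
```

Under that assumption merges.txt is not needed for decoding: it is enough to collect the bytes of consecutive tokens and decode them together.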
from whisper.cpp.
@yujinqiu @aufziehvogel
Thanks to @r0y6a3n0 this should be resolved now.
Download the model files again and give it a try.
from whisper.cpp.
It's fixed in some cases.
from whisper.cpp.