Comments (9)

ggerganov commented on May 22, 2024

Thank you for this!
It is clear now how it works and I will be able to fix it.

from whisper.cpp.

wcchoi commented on May 22, 2024

This is the Python code to decode it, from https://github.com/openai/gpt-2/blob/master/src/encoder.py

You need the vocab.json file:

import json
import sys

def bytes_to_unicode():
    """
    Returns list of utf-8 byte and a corresponding list of unicode strings.
    The reversible bpe codes work on unicode strings.
    This means you need a large # of unicode characters in your vocab if you want to avoid UNKs.
    When you're at something like a 10B token dataset you end up needing around 5K for decent coverage.
    This is a significant percentage of your normal, say, 32K bpe vocab.
    To avoid that, we want lookup tables between utf-8 bytes and unicode strings.
    And avoids mapping to whitespace/control characters the bpe code barfs on.
    """
    _chr = unichr if sys.version_info[0] == 2 else chr
    bs = list(range(ord("!"), ord("~")+1))+list(range(ord("¡"), ord("¬")+1))+list(range(ord("®"), ord("ÿ")+1))
    cs = bs[:]
    n = 0
    for b in range(2**8):
        if b not in bs:
            bs.append(b)
            cs.append(2**8+n)
            n += 1
    cs = [_chr(n) for n in cs]
    return dict(zip(bs, cs))

with open('./vocab.json', 'r', encoding='utf-8') as f:
    vocab = json.loads(f.read())
rev = {v: k for k, v in vocab.items()}

byte_encoder = bytes_to_unicode()
byte_decoder = {v:k for k, v in byte_encoder.items()}

def decode(tokens):
    text = ''.join([rev[token] for token in tokens])
    text = bytearray([byte_decoder[c] for c in text]).decode('utf-8')
    return text

print(decode([2415, 229])) # '宇'

ChristopherFritz commented on May 22, 2024

For my part, I can confirm this fixes the issue for me.

Before fix:

./main -m models/ggml-large.bin -l ja -f output.wav
[00:00.000 --> 00:04.040]  さくらちゃん**神��もすっごくいいし、バトンもうまいんだけど。

After fix:

./main -m models/ggml-large.bin -l ja -f output.wav
[00:00.000 --> 00:04.040]  さくらちゃん**神経もすっごくいいし、バトンもうまいんだけど。

ggerganov commented on May 22, 2024

Can you provide a short audio sample that fails?

commented on May 22, 2024

If it helps, I can give an example for Croatian; it happens there sometimes too.

Input Video

[01:11.000 --> 01:15.000]   Sanadar u završnjoj riječi i nasu��enju za HIPO rekao da nije kriv presuda sljedećeg tjedna.

I do not know what the missing letter is or what the word means. In another audio file, which is no longer on YouTube, I got dovi��enja instead of doviđenja.

I produced the input file with:

youtube-dl -x --audio-format=mp3 $video_url
ffmpeg -i $mp3_file -ar 16000 -ac 1 -c:a pcm_s16le whisper_input.wav

On a side note: I am very impressed. With the original Whisper code on my CPU, 1 minute of audio took about 1 hour of runtime with the large model. With your C++ project it is much faster, maybe a few minutes per minute of audio.

yujinqiu commented on May 22, 2024

samplecn16k.wav.zip

@ggerganov this is the sample audio, hope it helps. PS: it is a zip file, unzip it first.

ggerganov commented on May 22, 2024

So I found the reason why it fails to transcribe these characters, but I don't know how to fix it.

The tokenizer is more complicated than I expected. I thought that each token corresponds to a certain piece of text and you simply convert each token separately and join the results. However, it turns out that certain tokens must be "chained" together to produce a piece of text.

I tried to understand the decoding algorithm of the tokenizer, using the original Python implementation, but I get lost in the code and cannot figure it out.

What I need to understand is how the following example works:

https://github.com/ggerganov/whisper.cpp/blob/tokenizer/tokenizer-test.py

Notice that the two tokens 2415 and 229 individually decode to garbage, while together they decode to 宇.
I think the tokenizer somehow uses the merges.txt data, which I currently completely ignore.

Anyway, hopefully someone can give me some tips on how this decoding process works. For now, I am not able to fix this.
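(For anyone puzzling over the same behaviour: it can be reproduced with plain UTF-8, since each token carries only a fragment of a multi-byte sequence. A minimal sketch; the exact byte split per token is an assumption for illustration, the real byte pieces come from vocab.json:)

```python
# Why two byte-level BPE tokens decode to garbage alone but to a
# character together: each token carries raw UTF-8 byte fragments.
data = "宇".encode("utf-8")        # b'\xe5\xae\x87' - one char, three bytes
part1, part2 = data[:2], data[2:]  # fragments as two tokens might carry them

# Each fragment alone is an incomplete UTF-8 sequence -> replacement chars:
print(part1.decode("utf-8", errors="replace"))   # garbage, like the �� above

# Concatenate the raw bytes first, then decode once -> the character:
print((part1 + part2).decode("utf-8"))           # 宇
```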

ggerganov commented on May 22, 2024

@yujinqiu @aufziehvogel
Thanks to @r0y6a3n0 this should be resolved now.
Download the model files again and give it a try.

yujinqiu commented on May 22, 2024

It's fixed in some cases.
