Comments (9)

ggerganov commented on May 22, 2024

Thank you for this!
It is clear now how it works and I will be able to fix it.

from whisper.cpp.

wcchoi commented on May 22, 2024

This is the Python code to decode it, from https://github.com/openai/gpt-2/blob/master/src/encoder.py

You need the vocab.json file:

import json
import sys

def bytes_to_unicode():
    """
    Returns list of utf-8 byte and a corresponding list of unicode strings.
    The reversible bpe codes work on unicode strings.
    This means you need a large # of unicode characters in your vocab if you want to avoid UNKs.
    When you're at something like a 10B token dataset you end up needing around 5K for decent coverage.
    This is a significant percentage of your normal, say, 32K bpe vocab.
    To avoid that, we want lookup tables between utf-8 bytes and unicode strings.
    And avoids mapping to whitespace/control characters the bpe code barfs on.
    """
    _chr = unichr if sys.version_info[0] == 2 else chr
    bs = list(range(ord("!"), ord("~")+1))+list(range(ord("¡"), ord("¬")+1))+list(range(ord("®"), ord("ÿ")+1))
    cs = bs[:]
    n = 0
    for b in range(2**8):
        if b not in bs:
            bs.append(b)
            cs.append(2**8+n)
            n += 1
    cs = [_chr(n) for n in cs]
    return dict(zip(bs, cs))

with open('./vocab.json', 'r', encoding='utf-8') as f:
    vocab = json.loads(f.read())
rev = {v: k for k, v in vocab.items()}

byte_encoder = bytes_to_unicode()
byte_decoder = {v:k for k, v in byte_encoder.items()}

def decode(tokens):
    text = ''.join([rev[token] for token in tokens])
    text = bytearray([byte_decoder[c] for c in text]).decode('utf-8')
    return text

print(decode([2415, 229])) # '宇'

ChristopherFritz commented on May 22, 2024

For my part, I can confirm this fixes the issue for me.

Before fix:

./main -m models/ggml-large.bin -l ja -f output.wav
[00:00.000 --> 00:04.040]  さくらちゃん**神��もすっごくいいし、バトンもうまいんだけど。

After fix:

./main -m models/ggml-large.bin -l ja -f output.wav
[00:00.000 --> 00:04.040]  さくらちゃん**神経もすっごくいいし、バトンもうまいんだけど。

ggerganov commented on May 22, 2024

Can you provide a short audio sample that fails?

commented on May 22, 2024

If it helps, I can give an example for Croatian; it happens there sometimes too.

Input Video

[01:11.000 --> 01:15.000]   Sanadar u završnjoj riječi i nasu��enju za HIPO rekao da nije kriv presuda sljedećeg tjedna.

I do not know what the missing letter is or what the word means. In another audio file, which is no longer on YouTube, I got dovi��enja instead of doviđenja.

I produced the input file with:

youtube-dl -x --audio-format=mp3 $video_url
ffmpeg -i $mp3_file -ar 16000 -ac 1 -c:a pcm_s16le whisper_input.wav

On a side note: I am very impressed. With the original Whisper code on my CPU, 1 minute of audio took about 1 hour of runtime with the large model. With your C++ project it is much faster, maybe a few minutes per minute of audio.

yujinqiu commented on May 22, 2024

samplecn16k.wav.zip

@ggerganov this is the sample audio, hope it helps. PS: it is a zip file, unzip it first.

ggerganov commented on May 22, 2024

So I found the reason why it fails to transcribe these characters, but I don't know how to fix it.

The tokenizer is more complicated than I expected. I thought that each token corresponds to a certain piece of text and you simply convert each token separately and join the results. However, it turns out that certain tokens must be "chained" together to produce a piece of text.

I tried to understand the decoding algorithm of the tokenizer, using the original Python implementation, but I get lost in the code and cannot figure it out.

What I need to understand is how the following example works:

https://github.com/ggerganov/whisper.cpp/blob/tokenizer/tokenizer-test.py

Notice that the two tokens 2415 and 229 individually decode to garbage, while together they decode to 宇.
I think the tokenizer somehow uses the merges.txt data, which I currently completely ignore.

Anyway, hopefully someone can give me some tips on how this decoding process works. For now, I am not able to fix this.
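(For anyone puzzling over the same behaviour: it can be reproduced with plain UTF-8, since each token carries only a fragment of a multi-byte sequence. A minimal sketch; the exact byte split per token is an assumption for illustration, the real byte pieces come from vocab.json:)

```python
# Why two byte-level BPE tokens decode to garbage alone but to a
# character together: each token carries raw UTF-8 byte fragments.
data = "宇".encode("utf-8")        # b'\xe5\xae\x87' - one char, three bytes
part1, part2 = data[:2], data[2:]  # fragments as two tokens might carry them

# Each fragment alone is an incomplete UTF-8 sequence -> replacement chars:
print(part1.decode("utf-8", errors="replace"))   # garbage, like the �� above

# Concatenate the raw bytes first, then decode once -> the character:
print((part1 + part2).decode("utf-8"))           # 宇
```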

ggerganov commented on May 22, 2024

@yujinqiu @aufziehvogel
Thanks to @r0y6a3n0 this should be resolved now.
Download the model files again and give it a try.

yujinqiu commented on May 22, 2024

It's fixed in some cases.
