Comments (4)
GGUF isn't obscure. There are 2000+ projects on Hugging Face tagged gguf. The issue is that the technology is so new and fast-moving that best practices change quite rapidly. A few months ago, GGUF didn't even exist, and llama.cpp was using a totally different file format.
So you're asking how to convert models to the GGUF file format that llamafile uses. I used to know how to do it. I don't know how to do it anymore. Only TheBloke knows how to convert models to GGUF. Just use his weights. If you want to know more about what his process looks like, it usually goes like this: model makers publish their models in a Python format like PyTorch, and the llama.cpp project has a bunch of Python scripts for converting those formats into float16 GGUF files.
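If you're curious what that first conversion step looks like in practice, here's a rough sketch using the llama.cpp conversion script as it existed around this time (the script names and flags have been shuffled around since, so treat this as illustrative; path/to/model/ is a placeholder for wherever you downloaded the PyTorch checkpoint):

# fetch llama.cpp and the Python deps its conversion script needs
git clone https://github.com/ggerganov/llama.cpp
pip install -r llama.cpp/requirements.txt
# convert a PyTorch checkpoint directory into a float16 GGUF file
python3 llama.cpp/convert.py path/to/model/ --outtype f16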
Having the float16 (f16) weights is sort of like having a FLAC file or CD for a song. The issue is that float16 goes slow due to memory bandwidth constraints. But once you have float16 weights in the GGUF format, it's trivial to use the llamafile-quantize program we publish on our releases page to convert them to a smaller format, e.g. Q4, Q5, Q3, Q8, etc. I think Q4 is fine.
Here's an example of how you can try quantizing weights yourself. My LLaVA weights are the top trending GGUF weights on Hugging Face right now. Unlike TheBloke, I only quantized them to the Q4 and Q8 formats. Let's say you want to try LLaVA using the Q5 format instead. The good news is that I uploaded the float16 files:
wget https://huggingface.co/jartine/llava-v1.5-7B-GGUF/resolve/main/llava-v1.5-7b-f16.gguf
wget https://huggingface.co/jartine/llava-v1.5-7B-GGUF/resolve/main/llava-v1.5-7b-mmproj-f16.gguf
You can convert F16 GGUF -> Q5 GGUF using the following commands:
./llamafile-llava-quantize-0.2.1 \
llava-v1.5-7b-mmproj-f16.gguf \
llava-v1.5-7b-mmproj-q5.gguf \
7
./llamafile-quantize-0.2.1 \
llava-v1.5-7b-f16.gguf \
17
mv ggml-model-Q5_K.gguf llava-v1.5-7b-q5.gguf
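As a quick sanity check, the quantized files should come out noticeably smaller than the f16 originals (the format listing further down suggests roughly 4.45G for Q5_K_M at 7B):

ls -lh llava-v1.5-7b-*.gguf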
You can now run:
./llamafile-server-0.2.1 -m llava-v1.5-7b-q5.gguf --mmproj llava-v1.5-7b-mmproj-q5.gguf
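The server listens on port 8080 by default, so once it's up you can poke it with the llama.cpp server's completion endpoint. A minimal sketch (the prompt is just an example):

curl http://localhost:8080/completion \
    -H 'Content-Type: application/json' \
    -d '{"prompt": "Why is the sky blue?", "n_predict": 64}'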
If you're wondering where the numbers 7 and 17 came from, those represent the Q5 quantization format. To get a list of available formats, run:
$ ./llamafile-llava-quantize-0.2.1 -h
Usage: ./llamafile-llava-quantize-0.2.1 INPUT OUTPUT FORMAT
- 2 is Q4_0
- 3 is Q4_1
- 6 is Q5_0
- 7 is Q5_1
- 8 is Q8_0
$ ./llamafile-quantize-0.2.1
usage: ./llamafile-quantize-0.2.1 [--help] [--allow-requantize] [--leave-output-tensor] [--pure] model-f32.gguf [model-quant.gguf] type [nthreads]
--allow-requantize: Allows requantizing tensors that have already been quantized. Warning: This can severely reduce quality compared to quantizing from 16bit or 32bit
--leave-output-tensor: Will leave output.weight un(re)quantized. Increases model size but may also increase quality, especially when requantizing
--pure: Disable k-quant mixtures and quantize all tensors to the same type
Allowed quantization types:
2 or Q4_0 : 3.56G, +0.2166 ppl @ LLaMA-v1-7B
3 or Q4_1 : 3.90G, +0.1585 ppl @ LLaMA-v1-7B
8 or Q5_0 : 4.33G, +0.0683 ppl @ LLaMA-v1-7B
9 or Q5_1 : 4.70G, +0.0349 ppl @ LLaMA-v1-7B
10 or Q2_K : 2.63G, +0.6717 ppl @ LLaMA-v1-7B
12 or Q3_K : alias for Q3_K_M
11 or Q3_K_S : 2.75G, +0.5551 ppl @ LLaMA-v1-7B
12 or Q3_K_M : 3.07G, +0.2496 ppl @ LLaMA-v1-7B
13 or Q3_K_L : 3.35G, +0.1764 ppl @ LLaMA-v1-7B
15 or Q4_K : alias for Q4_K_M
14 or Q4_K_S : 3.59G, +0.0992 ppl @ LLaMA-v1-7B
15 or Q4_K_M : 3.80G, +0.0532 ppl @ LLaMA-v1-7B
17 or Q5_K : alias for Q5_K_M
16 or Q5_K_S : 4.33G, +0.0400 ppl @ LLaMA-v1-7B
17 or Q5_K_M : 4.45G, +0.0122 ppl @ LLaMA-v1-7B
18 or Q6_K : 5.15G, -0.0008 ppl @ LLaMA-v1-7B
7 or Q8_0 : 6.70G, +0.0004 ppl @ LLaMA-v1-7B
1 or F16 : 13.00G @ 7B
0 or F32 : 26.00G @ 7B
COPY : only copy tensors, no quantizing
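One footnote on the usage line above: since the output filename is optional and the type can be given by name, you should also be able to quantize straight to the filename you want and skip the mv step, e.g.:

./llamafile-quantize-0.2.1 \
    llava-v1.5-7b-f16.gguf \
    llava-v1.5-7b-q5.gguf \
    Q5_K_M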