
Comments (4)

jart commented on July 30, 2024

GGUF isn't obscure. There are 2,000+ projects on Hugging Face tagged gguf. The issue is that the technology is so new and fast-evolving that best practices change quite rapidly. A few months ago GGUF didn't even exist, and llama.cpp was using a totally different file format.


jart commented on July 30, 2024

So you're asking how to convert models to the GGUF file format that llamafile uses. I used to know how to do it. I don't know how to do it anymore. Only TheBloke knows how to convert models to GGUF. Just use his weights. If you want to know more about what his process looks like: usually model makers publish their models in a Python framework format like PyTorch, and the llama.cpp project has a bunch of Python scripts for converting those model formats into float16 GGUF files.
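A rough sketch of that conversion step, assuming a local checkout of llama.cpp and of the original model repo (the model repo name below is hypothetical, and the script names and flags have shifted between llama.cpp versions, so check the version you have):

git clone https://github.com/ggerganov/llama.cpp
pip install -r llama.cpp/requirements.txt
# hypothetical PyTorch model repo; substitute the real one
git clone https://huggingface.co/some-org/some-model
# writes a float16 GGUF that the quantize tools below take as input
python llama.cpp/convert.py some-model --outtype f16 --outfile some-model-f16.gguf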

Having the float16 (f16) weights is sort of like having a FLAC file or CD for a song. The issue is that float16 inference goes slow due to memory bandwidth constraints. But once you have float16 weights in the GGUF format, it's trivial to use the llamafile-quantize program we publish on our releases page to convert them to a smaller format, e.g. Q4, Q5, Q3, Q8, etc. I think Q4 is fine.

Here's an example of how you can try quantizing weights yourself. My LLaVA weights are the top trending GGUF weights on Hugging Face right now. Unlike TheBloke, I only quantized them to the Q4 and Q8 formats. Let's say you want to try LLaVA using the Q5 format instead. The good news is that I uploaded the float16 files:

wget https://huggingface.co/jartine/llava-v1.5-7B-GGUF/resolve/main/llava-v1.5-7b-f16.gguf
wget https://huggingface.co/jartine/llava-v1.5-7B-GGUF/resolve/main/llava-v1.5-7b-mmproj-f16.gguf

You can convert F16 GGUF -> Q5 GGUF using the following commands:

# quantize the multimodal projector to Q5_1 (format 7)
./llamafile-llava-quantize-0.2.1 \
  llava-v1.5-7b-mmproj-f16.gguf \
  llava-v1.5-7b-mmproj-q5.gguf \
  7
# quantize the language model to Q5_K_M (format 17); with no output
# name given, the result is written to ggml-model-Q5_K.gguf
./llamafile-quantize-0.2.1 \
  llava-v1.5-7b-f16.gguf \
  17
mv ggml-model-Q5_K.gguf llava-v1.5-7b-q5.gguf

You can now run:

./llamafile-server-0.2.1 -m llava-v1.5-7b-q5.gguf --mmproj llava-v1.5-7b-mmproj-q5.gguf
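If you want to sanity-check the server from another terminal, here's a minimal sketch, assuming the default port of 8080 and the llama.cpp-style /completion endpoint (adjust if you launched the server with different options):

curl http://localhost:8080/completion \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Say hello.", "n_predict": 64}'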

If you're wondering where the numbers 7 and 17 came from, those select the Q5 quantization formats (7 is Q5_1 for the projector, 17 is Q5_K_M for the language model). To get a list of available formats, run:

$ ./llamafile-llava-quantize-0.2.1 -h
Usage: ./llamafile-llava-quantize-0.2.1 INPUT OUTPUT FORMAT
  - 2 is Q4_0
  - 3 is Q4_1
  - 6 is Q5_0
  - 7 is Q5_1
  - 8 is Q8_0

$ ./llamafile-quantize-0.2.1
usage: ./llamafile-quantize-0.2.1 [--help] [--allow-requantize] [--leave-output-tensor] [--pure] model-f32.gguf [model-quant.gguf] type [nthreads]

  --allow-requantize: Allows requantizing tensors that have already been quantized. Warning: This can severely reduce quality compared to quantizing from 16bit or 32bit
  --leave-output-tensor: Will leave output.weight un(re)quantized. Increases model size but may also increase quality, especially when requantizing
  --pure: Disable k-quant mixtures and quantize all tensors to the same type

Allowed quantization types:
   2  or  Q4_0   :  3.56G, +0.2166 ppl @ LLaMA-v1-7B
   3  or  Q4_1   :  3.90G, +0.1585 ppl @ LLaMA-v1-7B
   8  or  Q5_0   :  4.33G, +0.0683 ppl @ LLaMA-v1-7B
   9  or  Q5_1   :  4.70G, +0.0349 ppl @ LLaMA-v1-7B
  10  or  Q2_K   :  2.63G, +0.6717 ppl @ LLaMA-v1-7B
  12  or  Q3_K   : alias for Q3_K_M
  11  or  Q3_K_S :  2.75G, +0.5551 ppl @ LLaMA-v1-7B
  12  or  Q3_K_M :  3.07G, +0.2496 ppl @ LLaMA-v1-7B
  13  or  Q3_K_L :  3.35G, +0.1764 ppl @ LLaMA-v1-7B
  15  or  Q4_K   : alias for Q4_K_M
  14  or  Q4_K_S :  3.59G, +0.0992 ppl @ LLaMA-v1-7B
  15  or  Q4_K_M :  3.80G, +0.0532 ppl @ LLaMA-v1-7B
  17  or  Q5_K   : alias for Q5_K_M
  16  or  Q5_K_S :  4.33G, +0.0400 ppl @ LLaMA-v1-7B
  17  or  Q5_K_M :  4.45G, +0.0122 ppl @ LLaMA-v1-7B
  18  or  Q6_K   :  5.15G, -0.0008 ppl @ LLaMA-v1-7B
   7  or  Q8_0   :  6.70G, +0.0004 ppl @ LLaMA-v1-7B
   1  or  F16    : 13.00G              @ 7B
   0  or  F32    : 26.00G              @ 7B
          COPY   : only copy tensors, no quantizing
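So, going by the usage line above, if you wanted e.g. a Q4_K_M file (type 15) with an explicit output name and thread count, rather than renaming the default ggml-model-*.gguf afterward, it would look something like this (filenames here are just placeholders):

./llamafile-quantize-0.2.1 \
  llava-v1.5-7b-f16.gguf \
  llava-v1.5-7b-q4_k_m.gguf \
  15 \
  8   # optional nthreads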


