Comments (4)
GGUF isn't obscure. There are 2000+ projects on Hugging Face tagged gguf. The issue is that the technology is so new and fast-moving that best practices change quite rapidly. A few months ago, GGUF didn't even exist, and llama.cpp was using a totally different file format.
So you're asking how to convert models to the GGUF file format that llamafile uses. I used to know how to do it. I don't know how to do it anymore. Only TheBloke knows how to convert models to GGUF. Just use his weights. If you want to know more about what his process looks like, it usually goes like this: model makers publish their models in a Python format like PyTorch, and the llama.cpp project has a bunch of Python scripts for converting those formats into float16 GGUF files.
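If you're curious what that first conversion step looks like in practice, here's a rough sketch using the llama.cpp conversion script as it existed around this time (the script names and flags have been shuffled around since, so treat this as illustrative; path/to/model/ is a placeholder for wherever you downloaded the PyTorch checkpoint):

# fetch llama.cpp and the Python deps its conversion script needs
git clone https://github.com/ggerganov/llama.cpp
pip install -r llama.cpp/requirements.txt
# convert a PyTorch checkpoint directory into a float16 GGUF file
python3 llama.cpp/convert.py path/to/model/ --outtype f16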
Having the float16 (f16) weights is sort of like having a FLAC file or CD for a song. The issue is that float16 goes slow due to memory bandwidth constraints. But once you have float16 weights in the GGUF format, it's trivial to use the llamafile-quantize program we publish on our releases page to convert them to a smaller format, e.g. Q4, Q5, Q3, Q8, etc. I think Q4 is fine.
Here's an example of how you can try quantizing weights yourself. My LLaVA weights are the top trending GGUF weights on Hugging Face right now. Unlike TheBloke, I only quantized them to the Q4 and Q8 formats. Let's say you want to try LLaVA using the Q5 format instead. The good news is that I uploaded the float16 files:
wget https://huggingface.co/jartine/llava-v1.5-7B-GGUF/resolve/main/llava-v1.5-7b-f16.gguf
wget https://huggingface.co/jartine/llava-v1.5-7B-GGUF/resolve/main/llava-v1.5-7b-mmproj-f16.gguf
You can convert F16 GGUF -> Q5 GGUF using the following commands:
./llamafile-llava-quantize-0.2.1 \
llava-v1.5-7b-mmproj-f16.gguf \
llava-v1.5-7b-mmproj-q5.gguf \
7
./llamafile-quantize-0.2.1 \
llava-v1.5-7b-f16.gguf \
17
mv ggml-model-Q5_K.gguf llava-v1.5-7b-q5.gguf
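As a quick sanity check, the quantized files should come out noticeably smaller than the f16 originals (the format listing further down suggests roughly 4.45G for Q5_K_M at 7B):

ls -lh llava-v1.5-7b-*.gguf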
You can now run:
./llamafile-server-0.2.1 -m llava-v1.5-7b-q5.gguf --mmproj llava-v1.5-7b-mmproj-q5.gguf
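The server listens on port 8080 by default, so once it's up you can poke it with the llama.cpp server's completion endpoint. A minimal sketch (the prompt is just an example):

curl http://localhost:8080/completion \
    -H 'Content-Type: application/json' \
    -d '{"prompt": "Why is the sky blue?", "n_predict": 64}'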
If you're wondering where the numbers 7 and 17 came from, those represent the Q5 quantization format. To get a list of available formats, run:
$ ./llamafile-llava-quantize-0.2.1 -h
Usage: ./llamafile-llava-quantize-0.2.1 INPUT OUTPUT FORMAT
- 2 is Q4_0
- 3 is Q4_1
- 6 is Q5_0
- 7 is Q5_1
- 8 is Q8_0
$ ./llamafile-quantize-0.2.1
usage: ./llamafile-quantize-0.2.1 [--help] [--allow-requantize] [--leave-output-tensor] [--pure] model-f32.gguf [model-quant.gguf] type [nthreads]
--allow-requantize: Allows requantizing tensors that have already been quantized. Warning: This can severely reduce quality compared to quantizing from 16bit or 32bit
--leave-output-tensor: Will leave output.weight un(re)quantized. Increases model size but may also increase quality, especially when requantizing
--pure: Disable k-quant mixtures and quantize all tensors to the same type
Allowed quantization types:
2 or Q4_0 : 3.56G, +0.2166 ppl @ LLaMA-v1-7B
3 or Q4_1 : 3.90G, +0.1585 ppl @ LLaMA-v1-7B
8 or Q5_0 : 4.33G, +0.0683 ppl @ LLaMA-v1-7B
9 or Q5_1 : 4.70G, +0.0349 ppl @ LLaMA-v1-7B
10 or Q2_K : 2.63G, +0.6717 ppl @ LLaMA-v1-7B
12 or Q3_K : alias for Q3_K_M
11 or Q3_K_S : 2.75G, +0.5551 ppl @ LLaMA-v1-7B
12 or Q3_K_M : 3.07G, +0.2496 ppl @ LLaMA-v1-7B
13 or Q3_K_L : 3.35G, +0.1764 ppl @ LLaMA-v1-7B
15 or Q4_K : alias for Q4_K_M
14 or Q4_K_S : 3.59G, +0.0992 ppl @ LLaMA-v1-7B
15 or Q4_K_M : 3.80G, +0.0532 ppl @ LLaMA-v1-7B
17 or Q5_K : alias for Q5_K_M
16 or Q5_K_S : 4.33G, +0.0400 ppl @ LLaMA-v1-7B
17 or Q5_K_M : 4.45G, +0.0122 ppl @ LLaMA-v1-7B
18 or Q6_K : 5.15G, -0.0008 ppl @ LLaMA-v1-7B
7 or Q8_0 : 6.70G, +0.0004 ppl @ LLaMA-v1-7B
1 or F16 : 13.00G @ 7B
0 or F32 : 26.00G @ 7B
COPY : only copy tensors, no quantizing
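One footnote on the usage line above: since the output filename is optional and the type can be given by name, you should also be able to quantize straight to the filename you want and skip the mv step, e.g.:

./llamafile-quantize-0.2.1 \
    llava-v1.5-7b-f16.gguf \
    llava-v1.5-7b-q5.gguf \
    Q5_K_M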