Comments (20)
Yes, better (using server):
print_timings: prompt eval time = 4921.98 ms / 17 tokens ( 289.53 ms per token, 3.45 tokens per second)
print_timings: eval time = 27586.19 ms / 84 runs ( 328.41 ms per token, 3.05 tokens per second)
print_timings: total time = 32508.17 ms
Thanks.
from llamafile.
The Q5 perf improvement hasn't made it into a release yet. You have to build llamafile yourself right now using the Source Instructions in the README.
Your Sandy Bridge CPU is something we intend to support, and I'm surprised it's not working. I have a ThinkPad I bought back in 2011 which should help me get to the bottom of this. I'll update this issue as I learn more. If there are any clues you can provide in the meantime about which specific instruction is faulting, and what its address is in memory, then it'd help if you shared those too.
I have a similar result:
$ ./llamafile-server-0.1-llava-v1.5-7b-q4
(...)
llama_new_context_with_model: kv self size = 1024.00 MB
llama_build_graph: non-view tensors processed: 740/740
llama_new_context_with_model: compute buffer total size = 162.63 MB
Illegal instruction
My CPU is an i3:
Model name: Intel(R) Core(TM) i3-2120T CPU @ 2.60GHz
Stepping: 7
gdb can't load the cosmo binary. I tried lldb as suggested in another issue, but it doesn't start:
lldb -o "run" ~/.ape-1.9 ./llamafile-server-0.1-llava-v1.5-7b-q4
(lldb) target create "/home/domi/.ape-1.9"
Current executable set to '/home/domi/.ape-1.9' (x86_64).
(lldb) settings set -- target.run-args "./llamafile-server-0.1-llava-v1.5-7b-q4"
(lldb) run
error: Lost debug server connection
(lldb) exit
Try using llamafile's --ftrace flag, which logs each C/C++ function call from its prologue. That might at least give a clue which function is responsible.
Last few lines when running with --ftrace:
FUN 485044 485059 57'262'404'097 592 &ggml_cuda_compute_forward
FUN 485044 485059 57'262'407'510 624 &cosmo_once
FUN 485044 485059 57'262'410'722 592 &ggml_compute_forward.part.0
FUN 485044 485059 57'262'413'886 592 &ggml_cuda_compute_forward
FUN 485044 485059 57'262'416'362 624 &cosmo_once
FUN 485044 485059 57'262'418'630 592 &ggml_compute_forward.part.0
FUN 485044 485059 57'262'421'393 992 &ggml_compute_forward_mul_mat
FUN 485044 485059 57'262'425'497 1'472 &ggml_fp32_to_fp16_row
FUN 485044 485059 57'262'493'584 592 &ggml_cuda_compute_forward
FUN 485044 485044 57'262'494'197 -123'142'188'231'864 &ggml_cuda_compute_forward
FUN 485044 485059 57'262'498'401 624 &cosmo_once
FUN 485044 485060 57'262'494'951 592 &ggml_cuda_compute_forward
FUN 485044 485044 57'262'500'740 -123'142'188'231'832 &cosmo_once
FUN 485044 485059 57'262'502'921 592 &ggml_compute_forward.part.0
FUN 485044 485060 57'262'509'922 624 &cosmo_once
FUN 485044 485044 57'262'512'740 -123'142'188'231'864 &ggml_compute_forward.part.0
FUN 485044 485059 57'262'515'562 992 &ggml_compute_forward_mul_mat
FUN 485044 485060 57'262'519'002 592 &ggml_compute_forward.part.0
FUN 485044 485044 57'262'522'768 -123'142'188'231'448 &ggml_compute_forward_mul_mat
FUN 485044 485059 57'262'524'849 1'472 &ggml_vec_dot_f16
FUN 485044 485060 57'262'528'242 992 &ggml_compute_forward_mul_mat
FUN 485044 485044 57'262'529'680 -123'142'188'230'968 &ggml_vec_dot_f16
FUN 485044 485060 57'262'535'307 1'472 &ggml_vec_dot_f16
Illegal instruction
It's most likely these two little opcodes here:
While I work on fixing this, are you familiar with the process for building llamafile from source so you can test my changes?
I just cloned the repository, compiled it, and ran it with the gguf files from the binary release; it behaves the same as the binary release.
I ran it with:
llamafile$ o//llama.cpp/server/server -m models/llava-v1.5-7b-Q4_K.gguf --mmproj models/llava-v1.5-7b-mmproj-Q4_0.gguf
So, I believe I can try your patch when available :)
I just realized I can use qemu-x86_64 -cpu core2duo to easily test backwards compatibility. So I'm nearly positive the change I'm about to push is going to get you back in business. It fixes a bug. I also took special care to ensure that supporting PCs like yours won't slow down folks with modern processors (at least for q4, q8, and f16 weights so far). We'll get more comprehensive as time goes on. Once this issue is closed, please test it and report back. If it doesn't work I'll reopen this.
One last thing. If you notice llamafile performing poorly compared to llama.cpp upstream on your CPU model, then I consider that a bug and I'd ask anyone who experiences that to please file an issue, so I can address it. Thanks!
Thanks!
I pulled the latest commit, and now it works. Really slow (as expected), but without crashing.
Out of curiosity, (1) what weights are you using, and (2) do you know if it's going equally slow as llama.cpp upstream?
I'm running with "llava-v1.5-7b-q4".
I didn't record timings when I tried llama.cpp some time ago, and it surely wasn't the same weights.
But it was really slow too :)
Glad to know you're back in business. I love fast the most, but even a slow LLM is useful if you're doing backend work. That's one of the reasons I'm happy to support you.
I have an Intel 2500K (overclocked to 4.2 GHz) and it's much slower compared to llama.cpp, using mistral-7b-instruct-v0.1.Q5_K_M.gguf and the prompt "Can you explain why the sky is blue?"
With llama.cpp (-n 128 -m mistral-7b-instruct-v0.1.Q5_K_M.gguf --color -c 4096 --temp 0.7 --repeat_penalty 1.1 -n -1 -ins):
llama_print_timings: prompt eval time = 5510.25 ms / 28 tokens ( 196.79 ms per token, 5.08 tokens per second)
llama_print_timings: eval time = 24443.14 ms / 98 runs ( 249.42 ms per token, 4.01 tokens per second)
llama_print_timings: total time = 53965.80 ms
With llamafile-server-0.2 -m mistral-7b-instruct-v0.1.Q5_K_M.gguf:
print_timings: prompt eval time = 67785.46 ms / 61 tokens ( 1111.24 ms per token, 0.90 tokens per second)
print_timings: eval time = 121366.07 ms / 103 runs ( 1178.31 ms per token, 0.85 tokens per second)
print_timings: total time = 189151.53 ms
EDIT: updated llama.cpp to the latest version
@euh Are you running on Linux? If so, could you follow the source instructions in the readme on how to make o//llama.cpp/main/main, then run perf record --call-graph dwarf o//llama.cpp/main/main.com.dbg -n 10 -m ... and once it exits run perf report. That'll tell us which functions are going the slowest. Once you have that, could you copy/paste it here, or take a screenshot? That way I'll know what to focus on. Thanks!
Yes, Linux. I hope this is what you wanted:
That's exactly what I needed. Looks like this quant hasn't been ported yet. Let me try and do that now.
@euh You just got your performance back. Build at HEAD and your Q5 weights will go potentially 2.5x faster.
Thanks so much for all the great work around this issue.
Quick question: what is the new file called now?