Comments (20)
Yes, better (using server):
print_timings: prompt eval time = 4921.98 ms / 17 tokens ( 289.53 ms per token, 3.45 tokens per second)
print_timings: eval time = 27586.19 ms / 84 runs ( 328.41 ms per token, 3.05 tokens per second)
print_timings: total time = 32508.17 ms
Thanks.
from llamafile.
The Q5 perf improvement hasn't made it into a release yet. You have to build llamafile yourself right now using the Source Instructions in the README.
Your Sandy Bridge CPU is something we intend to support, and I'm surprised it's not working. I have a ThinkPad I bought back in 2011 which should help me get to the bottom of this. I'll update this issue as I learn more. If there are any clues you can provide in the meantime about which specific instruction is faulting, and what its address is in memory, then it'd help if you shared those too.
I have a similar result:
$ ./llamafile-server-0.1-llava-v1.5-7b-q4
(...)
llama_new_context_with_model: kv self size = 1024.00 MB
llama_build_graph: non-view tensors processed: 740/740
llama_new_context_with_model: compute buffer total size = 162.63 MB
Illegal instruction
My CPU is an i3:
Model name: Intel(R) Core(TM) i3-2120T CPU @ 2.60GHz
Stepping: 7
gdb can't load the cosmo binary. I tried lldb as suggested in another issue, but it doesn't start:
lldb -o "run" ~/.ape-1.9 ./llamafile-server-0.1-llava-v1.5-7b-q4
(lldb) target create "/home/domi/.ape-1.9"
Current executable set to '/home/domi/.ape-1.9' (x86_64).
(lldb) settings set -- target.run-args "./llamafile-server-0.1-llava-v1.5-7b-q4"
(lldb) run
error: Lost debug server connection
(lldb) exit
Try using llamafile's --ftrace flag, which logs each C/C++ function call from its prologue. That might at least give a clue which function is responsible.
Last few lines when running with --ftrace:
FUN 485044 485059 57'262'404'097 592 &ggml_cuda_compute_forward
FUN 485044 485059 57'262'407'510 624 &cosmo_once
FUN 485044 485059 57'262'410'722 592 &ggml_compute_forward.part.0
FUN 485044 485059 57'262'413'886 592 &ggml_cuda_compute_forward
FUN 485044 485059 57'262'416'362 624 &cosmo_once
FUN 485044 485059 57'262'418'630 592 &ggml_compute_forward.part.0
FUN 485044 485059 57'262'421'393 992 &ggml_compute_forward_mul_mat
FUN 485044 485059 57'262'425'497 1'472 &ggml_fp32_to_fp16_row
FUN 485044 485059 57'262'493'584 592 &ggml_cuda_compute_forward
FUN 485044 485044 57'262'494'197 -123'142'188'231'864 &ggml_cuda_compute_forward
FUN 485044 485059 57'262'498'401 624 &cosmo_once
FUN 485044 485060 57'262'494'951 592 &ggml_cuda_compute_forward
FUN 485044 485044 57'262'500'740 -123'142'188'231'832 &cosmo_once
FUN 485044 485059 57'262'502'921 592 &ggml_compute_forward.part.0
FUN 485044 485060 57'262'509'922 624 &cosmo_once
FUN 485044 485044 57'262'512'740 -123'142'188'231'864 &ggml_compute_forward.part.0
FUN 485044 485059 57'262'515'562 992 &ggml_compute_forward_mul_mat
FUN 485044 485060 57'262'519'002 592 &ggml_compute_forward.part.0
FUN 485044 485044 57'262'522'768 -123'142'188'231'448 &ggml_compute_forward_mul_mat
FUN 485044 485059 57'262'524'849 1'472 &ggml_vec_dot_f16
FUN 485044 485060 57'262'528'242 992 &ggml_compute_forward_mul_mat
FUN 485044 485044 57'262'529'680 -123'142'188'230'968 &ggml_vec_dot_f16
FUN 485044 485060 57'262'535'307 1'472 &ggml_vec_dot_f16
Illegal instruction
It's most likely these two little opcodes here:
While I work on fixing this, are you familiar with the process for building llamafile from source so you can test my changes?
I just cloned the repository, compiled it, and ran it with the gguf files from the binary release; it behaves the same as the binary release.
I ran it with:
llamafile$ o//llama.cpp/server/server -m models/llava-v1.5-7b-Q4_K.gguf --mmproj models/llava-v1.5-7b-mmproj-Q4_0.gguf
So, I believe I can try your patch when available :)
I just realized I can use qemu-x86_64 -cpu core2duo to easily test backwards compatibility. So I'm nearly positive the change I'm about to push is going to get you back in business. It fixes a bug. I also took special care to ensure that supporting PCs like yours won't slow down folks with modern processors (at least for q4, q8, and f16 weights so far). We'll get more comprehensive as time goes on. Once this issue is closed, please test it and report back. If it doesn't work I'll reopen this.
One last thing. If you notice llamafile performing poorly compared to llama.cpp upstream on your CPU model, then I consider that a bug and I'd ask anyone who experiences that to please file an issue, so I can address it. Thanks!
Thanks!
I pulled the latest commit, and now it works. Really slow (as expected), but without crashing.
Out of curiosity, (1) what weights are you using, and (2) do you know if it's going equally slow as llama.cpp upstream?
I'm running with "llava-v1.5-7b-q4".
I didn't record timings when I tried llama.cpp some time ago, and it surely wasn't the same weights.
But it was really slow too :)
Glad to know you're back in business. I love fast the most, but even a slow LLM is useful if you're doing backend work. That's one of the reasons I'm happy to support you.
I have an Intel 2500K (overclocked to 4.2 GHz) and it's much slower compared to llama.cpp, using mistral-7b-instruct-v0.1.Q5_K_M.gguf and the prompt "Can you explain why the sky is blue?"
With llama.cpp (-n 128 -m mistral-7b-instruct-v0.1.Q5_K_M.gguf --color -c 4096 --temp 0.7 --repeat_penalty 1.1 -n -1 -ins):
llama_print_timings: prompt eval time = 5510.25 ms / 28 tokens ( 196.79 ms per token, 5.08 tokens per second)
llama_print_timings: eval time = 24443.14 ms / 98 runs ( 249.42 ms per token, 4.01 tokens per second)
llama_print_timings: total time = 53965.80 ms
With llamafile-server-0.2 -m mistral-7b-instruct-v0.1.Q5_K_M.gguf:
print_timings: prompt eval time = 67785.46 ms / 61 tokens ( 1111.24 ms per token, 0.90 tokens per second)
print_timings: eval time = 121366.07 ms / 103 runs ( 1178.31 ms per token, 0.85 tokens per second)
print_timings: total time = 189151.53 ms
EDIT: updated llama.cpp to the latest version
@euh Are you running on Linux? If so, could you follow the source instructions in the readme on how to make o//llama.cpp/main/main, then run perf record --call-graph dwarf o//llama.cpp/main/main.com.dbg -n 10 -m ... and once it exits run perf report. That'll tell us which functions are going the slowest. Once you have that, could you copy/paste it here, or take a screenshot? That way I'll know what to focus on. Thanks!
Yes, Linux. I hope this is what you wanted:
That's exactly what I needed. Looks like this quant hasn't been ported yet. Let me try and do that now.
@euh You just got your performance back. Build at HEAD and your Q5 weights will go potentially 2.5x faster.
Thanks so much for all the great work around this issue.
Quick question: what is the new file called now?