Comments (20)

euh commented on July 30, 2024

Yes, better (using server):

print_timings: prompt eval time = 4921.98 ms / 17 tokens ( 289.53 ms per token, 3.45 tokens per second)
print_timings: eval time = 27586.19 ms / 84 runs ( 328.41 ms per token, 3.05 tokens per second)
print_timings: total time = 32508.17 ms

Thanks.

jart commented on July 30, 2024

The Q5 perf improvement hasn't made it into a release yet. You have to build llamafile yourself right now using the Source Instructions in the README.
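
Roughly, the source build boils down to something like this (just a sketch; the Source Instructions in the README are authoritative, especially for setting up the toolchain):

# clone and build; artifacts land under o//
git clone https://github.com/Mozilla-Ocho/llamafile
cd llamafile
make -j8

# then point the freshly built server at your weights, e.g.
o//llama.cpp/server/server -m mistral-7b-instruct-v0.1.Q5_K_M.gguf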

jart commented on July 30, 2024

Your Sandy Bridge CPU is something we intend to support, and I'm surprised it's not working. I have a ThinkPad I bought back in 2011 which should help me get to the bottom of this. I'll update this issue as I learn more. In the meantime, if there are any clues you can provide about which specific instruction is faulting and what its address is in memory, sharing those would help too.
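
For example, on Linux the kernel normally writes a trap line with the faulting instruction pointer when a process dies from SIGILL, so something like this might be enough (a rough sketch):

# reproduce the Illegal instruction crash, then check the kernel log
sudo dmesg | grep -i 'invalid opcode' | tail -n 1
# the ip:... field in that line is the address of the faulting instruction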

drousseau commented on July 30, 2024

I have a similar result:

$ ./llamafile-server-0.1-llava-v1.5-7b-q4
(...)
llama_new_context_with_model: kv self size = 1024.00 MB
llama_build_graph: non-view tensors processed: 740/740
llama_new_context_with_model: compute buffer total size = 162.63 MB
Illegal instruction

My CPU is an i3:

Model name: Intel(R) Core(TM) i3-2120T CPU @ 2.60GHz
Stepping: 7
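
For what it's worth, here's a quick way to check which SIMD extensions this CPU advertises (on a Sandy Bridge part I'd expect avx and sse4_2 to show up, but not avx2, f16c, or fma):

# print only the relevant flags from /proc/cpuinfo
grep -m1 '^flags' /proc/cpuinfo | tr ' ' '\n' | grep -x -E 'sse4_2|avx|avx2|f16c|fma'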

gdb can't load the cosmo binary. I tried lldb, as suggested in another issue, but it doesn't start:

lldb -o "run" ~/.ape-1.9 ./llamafile-server-0.1-llava-v1.5-7b-q4
(lldb) target create "/home/domi/.ape-1.9"
Current executable set to '/home/domi/.ape-1.9' (x86_64).
(lldb) settings set -- target.run-args "./llamafile-server-0.1-llava-v1.5-7b-q4"
(lldb) run
error: Lost debug server connection
(lldb) exit

jart commented on July 30, 2024

Try using llamafile's --ftrace flag, which logs each C/C++ function call from its prologue. That might at least give a clue as to which function is responsible.
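
For example (a sketch, assuming the trace goes to stderr; it's chatty, so capture it to a file and look at the tail after the crash):

./llamafile-server-0.1-llava-v1.5-7b-q4 --ftrace 2> ftrace.log
tail -n 40 ftrace.log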

drousseau commented on July 30, 2024

Last few lines when running with --ftrace:

FUN 485044 485059 57'262'404'097 592 &ggml_cuda_compute_forward
FUN 485044 485059 57'262'407'510 624 &cosmo_once
FUN 485044 485059 57'262'410'722 592 &ggml_compute_forward.part.0
FUN 485044 485059 57'262'413'886 592 &ggml_cuda_compute_forward
FUN 485044 485059 57'262'416'362 624 &cosmo_once
FUN 485044 485059 57'262'418'630 592 &ggml_compute_forward.part.0
FUN 485044 485059 57'262'421'393 992 &ggml_compute_forward_mul_mat
FUN 485044 485059 57'262'425'497 1'472 &ggml_fp32_to_fp16_row
FUN 485044 485059 57'262'493'584 592 &ggml_cuda_compute_forward
FUN 485044 485044 57'262'494'197 -123'142'188'231'864 &ggml_cuda_compute_forward
FUN 485044 485059 57'262'498'401 624 &cosmo_once
FUN 485044 485060 57'262'494'951 592 &ggml_cuda_compute_forward
FUN 485044 485044 57'262'500'740 -123'142'188'231'832 &cosmo_once
FUN 485044 485059 57'262'502'921 592 &ggml_compute_forward.part.0
FUN 485044 485060 57'262'509'922 624 &cosmo_once
FUN 485044 485044 57'262'512'740 -123'142'188'231'864 &ggml_compute_forward.part.0
FUN 485044 485059 57'262'515'562 992 &ggml_compute_forward_mul_mat
FUN 485044 485060 57'262'519'002 592 &ggml_compute_forward.part.0
FUN 485044 485044 57'262'522'768 -123'142'188'231'448 &ggml_compute_forward_mul_mat
FUN 485044 485059 57'262'524'849 1'472 &ggml_vec_dot_f16
FUN 485044 485060 57'262'528'242 992 &ggml_compute_forward_mul_mat
FUN 485044 485044 57'262'529'680 -123'142'188'230'968 &ggml_vec_dot_f16
FUN 485044 485060 57'262'535'307 1'472 &ggml_vec_dot_f16
Illegal instruction

jart commented on July 30, 2024

It's most likely these two little opcodes here:

[screenshot of the disassembly, highlighting the two opcodes]

While I work on fixing this, are you familiar with the process for building llamafile from source so you can test my changes?

drousseau commented on July 30, 2024

I just cloned the repository, compiled it, and ran it with the gguf files from the binary release, and it behaves the same as the release binary.

I ran it with:

llamafile$ o//llama.cpp/server/server -m models/llava-v1.5-7b-Q4_K.gguf --mmproj models/llava-v1.5-7b-mmproj-Q4_0.gguf

So, I believe I can try your patch when available :)

jart commented on July 30, 2024

I just realized I can use qemu-x86_64 -cpu core2duo to easily test backwards compatibility. So I'm nearly positive the change I'm about to push is going to get you back in business. It fixes a bug. I also took special care to ensure that supporting PCs like yours won't slow down folks with modern processors (at least for q4, q8, and f16 weights so far). We'll get more comprehensive as time goes on. Once this issue is closed, please test it and report back. If it doesn't work I'll re-open this.
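
If anyone wants to reproduce that kind of check, the idea is roughly this (a sketch; it assumes the ELF debug artifact runs under qemu's user-mode emulator, and reuses the llava weights from this thread):

# emulate a pre-AVX CPU so any unsupported instruction faults immediately
qemu-x86_64 -cpu core2duo o//llama.cpp/main/main.com.dbg \
  -m models/llava-v1.5-7b-Q4_K.gguf -p "hello" -n 16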

jart commented on July 30, 2024

One last thing. If you notice llamafile performing poorly compared to llama.cpp upstream on your CPU model, then I consider that a bug and I'd ask anyone who experiences that to please file an issue, so I can address it. Thanks!

drousseau commented on July 30, 2024

Thanks!
I pulled the latest commit, and now it works: really slow (as expected), but without crashing.

jart commented on July 30, 2024

Out of curiosity, (1) what weights are you using, and (2) do you know whether it's equally slow on llama.cpp upstream?

drousseau commented on July 30, 2024

I'm running with "llava-v1.5-7b-q4".
I didn't record timings when I tried llama.cpp some time ago, and it surely wasn't the same weights.
But it was really slow too :)

jart commented on July 30, 2024

Glad to know you're back in business. I love fast the most, but even a slow LLM is useful if you're doing backend work. That's one of the reasons I'm happy to support you.

euh commented on July 30, 2024

I have an Intel 2500K (overclocked to 4.2 GHz) and llamafile is much slower than llama.cpp using mistral-7b-instruct-v0.1.Q5_K_M.gguf and this prompt: "Can you explain why the sky is blue?"

With llama.cpp (-n 128 -m mistral-7b-instruct-v0.1.Q5_K_M.gguf --color -c 4096 --temp 0.7 --repeat_penalty 1.1 -n -1 -ins):

llama_print_timings: prompt eval time = 5510.25 ms / 28 tokens ( 196.79 ms per token, 5.08 tokens per second)
llama_print_timings: eval time = 24443.14 ms / 98 runs ( 249.42 ms per token, 4.01 tokens per second)
llama_print_timings: total time = 53965.80 ms

With llamafile-server-0.2 -m mistral-7b-instruct-v0.1.Q5_K_M.gguf:

print_timings: prompt eval time = 67785.46 ms / 61 tokens ( 1111.24 ms per token, 0.90 tokens per second)
print_timings: eval time = 121366.07 ms / 103 runs ( 1178.31 ms per token, 0.85 tokens per second)
print_timings: total time = 189151.53 ms

EDIT: updated llama.cpp to the latest version.

jart commented on July 30, 2024

@euh Are you running on Linux? If so, could you follow the source instructions in the README on how to make o//llama.cpp/main/main, then run perf record --call-graph dwarf o//llama.cpp/main/main.com.dbg -n 10 -m ... and, once it exits, run perf report? That'll tell us which functions are slowest. Once you have that, could you copy/paste it here or take a screenshot? That way I'll know what to focus on. Thanks!
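
Concretely, something like this (a sketch that reuses the Q5 weights and prompt from your earlier test; adjust paths as needed):

# build the native CLI target, then profile a short run
make -j8 o//llama.cpp/main/main
perf record --call-graph dwarf o//llama.cpp/main/main.com.dbg \
  -n 10 -m mistral-7b-instruct-v0.1.Q5_K_M.gguf -p "Can you explain why the sky is blue?"
perf report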

euh commented on July 30, 2024

Yes, Linux. I hope this is what you wanted:

[screenshot: perf report output]

jart commented on July 30, 2024

That's exactly what I needed. Looks like this quant hasn't been ported yet. Let me try and do that now.

jart commented on July 30, 2024

@euh You just got your performance back. Build at HEAD and your Q5 weights will potentially go 2.5x faster.

kreely commented on July 30, 2024

Thanks so much for all the great work around this issue.
Quick question: what is the new file called now?
