Comments (8)
Try passing the flag -ngl 43
and let me know if that fixes it.
from llamafile.
Ok, with external GPU disconnected, it becomes massively faster and drives onboard GPU usage to 100%, so that's pretty conclusive. Thanks!
Logs include:
llm_load_tensors: ggml ctx size = 0.13 MB
llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: mem required = 107.55 MB
llm_load_tensors: offloading 40 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 43/43 layers to GPU
llm_load_tensors: VRAM used: 8694.21 MB
warning: posix_madvise(.., POSIX_MADV_WILLNEED) failed: No error information (win32 error 998)
...................................................................................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: offloading v cache to GPU
llama_kv_cache_init: offloading k cache to GPU
llama_kv_cache_init: VRAM kv self = 400.00 MB
llama_new_context_with_model: kv self size = 400.00 MB
llama_build_graph: non-view tensors processed: 924/924
llama_new_context_with_model: compute buffer total size = 81.63 MB
llama_new_context_with_model: VRAM scratch buffer: 75.00 MB
llama_new_context_with_model: total VRAM used: 9169.21 MB (model: 8694.21 MB, context: 475.00 MB)
Available slots:
-> Slot 0 - max context: 512
llama server listening at http://127.0.0.1:8082
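As a sanity check, the VRAM figures in this log are self-consistent: the 400 MB KV cache plus the 75 MB scratch buffer make up the 475 MB "context" share, which together with the 8694.21 MB of model weights gives the reported 9169.21 MB total. A quick check in Python (numbers copied from the log above):

```python
# VRAM figures copied from the llamafile log above (all in MB)
model_mb = 8694.21    # llm_load_tensors: VRAM used
kv_mb = 400.00        # llama_kv_cache_init: VRAM kv self
scratch_mb = 75.00    # llama_new_context_with_model: VRAM scratch buffer

context_mb = kv_mb + scratch_mb
total_mb = model_mb + context_mb
print(f"context: {context_mb:.2f} MB, total: {total_mb:.2f} MB")
# prints: context: 475.00 MB, total: 9169.21 MB
```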
Trying multiple NVIDIA GPUs:
I've just noticed --main-gpu and --tensor-split; sorry, I missed those earlier.
1.) With both GPUs available (plus a non-NVIDIA one that's useless here but complicates the numbering scheme in Task Manager):
a) nvidia-smi numbers them like this:
- 0 NVIDIA GeForce RTX 3080
- 1 NVIDIA GeForce RTX 3090
b) llamafile numbers them like this:
C:\Users...\LLMs>llamafile -ngl 43 -m slimorca-13b.Q5_K_M.gguf --port 8082
NVIDIA cuBLAS GPU support successfully loaded
ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6
Device 1: NVIDIA GeForce RTX 3080 Ti Laptop GPU, compute capability 8.6
c) Windows Task Manager numbers them like this: GPU 0 = Intel(R), GPU 1 = NVIDIA 3080, GPU 2 = NVIDIA 3090.
d) with only -ngl 43 I think it defaults to the 3080. However, watching the load charts in Task Manager, the 3080 shows load for the duration of the task, whereas the 3090 shows only a tiny, brief flash of 100% load immediately after we send some text in the web UI.
llm_load_tensors: ggml ctx size = 0.13 MB
llm_load_tensors: using CUDA for GPU acceleration
ggml_cuda_set_main_device: using device 0 (NVIDIA GeForce RTX 3090) as main device
llm_load_tensors: mem required = 107.55 MB
llm_load_tensors: offloading 40 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 43/43 layers to GPU
llm_load_tensors: VRAM used: 8694.21 MB
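One possible explanation for the mismatched numbering (an assumption on my part, not something llamafile documents): by default the CUDA runtime enumerates devices fastest-first, while nvidia-smi and Task Manager tend to follow PCI bus order. Setting the `CUDA_DEVICE_ORDER` environment variable before launching may make the two schemes agree:

```shell
# Force CUDA to enumerate devices in PCI bus order (the order nvidia-smi uses).
# On Windows cmd.exe this would be:  set CUDA_DEVICE_ORDER=PCI_BUS_ID
export CUDA_DEVICE_ORDER=PCI_BUS_ID
# then launch as usual, e.g.:
# llamafile -ngl 43 -m slimorca-13b.Q5_K_M.gguf --port 8082
echo "CUDA_DEVICE_ORDER=$CUDA_DEVICE_ORDER"
```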
--main-gpu
Trying to make the 3090 the main GPU:
C:\Users\danbr\LLMs>llamafile -ngl 43 -m slimorca-13b.Q5_K_M.gguf --port 8082 --main-gpu 0
NVIDIA cuBLAS GPU support successfully loaded
ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6
Device 1: NVIDIA GeForce RTX 3080 Ti Laptop GPU, compute capability 8.6
[...]
llm_load_tensors: ggml ctx size = 0.13 MB
llm_load_tensors: using CUDA for GPU acceleration
ggml_cuda_set_main_device: using device 0 (NVIDIA GeForce RTX 3090) as main device
llm_load_tensors: mem required = 107.55 MB
llm_load_tensors: offloading 40 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 43/43 layers to GPU
llm_load_tensors: VRAM used: 8694.21 MB
This setup behaves as before (and the logs report the same). Perhaps this means it is officially trying to use the 3090, but in practice the second GPU (the 'scratch'?) is doing all the heavy lifting?
Retrying with --main-gpu 1
The behaviour in the Task Manager charts looks similar: the bigger 3090 still shows only an initial momentary spike. However, the reporting is flipped:
llm_load_tensors: ggml ctx size = 0.13 MB
llm_load_tensors: using CUDA for GPU acceleration
ggml_cuda_set_main_device: using device 1 (NVIDIA GeForce RTX 3080 Ti Laptop GPU) as main device
llm_load_tensors: mem required = 107.55 MB
llm_load_tensors: offloading 40 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 43/43 layers to GPU
llm_load_tensors: VRAM used: 8694.21 MB
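If llamafile behaves like upstream llama.cpp here, --main-gpu mainly selects which device hosts the scratch and small-tensor work, while the layer distribution is governed by --tensor-split, which takes comma-separated proportions; --tensor-split 1,0 should pin everything on device 0 (the 3090, per llamafile's numbering). Conceptually the proportions are normalized and each GPU gets its share of the layers; a simplified sketch with a hypothetical helper, not llamafile's actual code:

```python
# Simplified illustration of how a --tensor-split style option could divide
# layers between GPUs in proportion to the given weights
# (hypothetical helper; not llamafile's actual algorithm).
def split_layers(n_layers, proportions):
    total = sum(proportions)
    counts = [round(n_layers * p / total) for p in proportions]
    counts[-1] += n_layers - sum(counts)  # absorb rounding drift
    return counts

print(split_layers(43, [1, 0]))    # [43, 0]  -> all layers on device 0
print(split_layers(43, [24, 12]))  # weighted toward the larger card
```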
I've heard that an external GPU setup like this gets unhappy when disconnected/reconnected without a full OS reboot, so that'll be the next thing to investigate.
Thanks for all this interesting information. It's kind of weird that Task Manager lists your GPUs in the opposite order from llamafile. I would imagine that impacts the upstream llama.cpp project too, since it's the kind of thing I could easily imagine going unnoticed.
I'm still working my way up to having a system with multiple GPUs so I'm not the best person to ask. Hopefully someone following can chime in. Would love to learn anything you discover while fiddling with these flags!
Thanks for the resources! I'll leave this open for now, since I'm curious about the GPU ordering.
Looks like you made a ton of progress - maybe you can close this?
the README still says:

Known Issues
Multiple GPUs isn't supported yet

however, there's also:
- --main-gpu (although the numbering scheme was confusing on Windows)
- 4616816 multiple gpus
- also 04d6e93 Introduce --gpu flag

Given an external 28G NVIDIA RTX 3090 and a smaller internal RTX 3080 Ti, is there any testing I could help with here?