
Comments (8)

jart commented on September 6, 2024

Try passing the flag -ngl 43 and let me know if that fixes it.
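
For reference, -ngl (short for --n-gpu-layers) sets how many model layers llamafile offloads to the GPU; a full invocation might look something like the following sketch, with the model name and port borrowed from later in this thread:

rem -ngl 43 offloads all 43 layers to the GPU (matches the 43/43 reported in the logs below)
llamafile -m slimorca-13b.Q5_K_M.gguf -ngl 43 --port 8082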


danbri commented on September 6, 2024

OK, with the external GPU disconnected, it becomes massively faster and drives onboard GPU usage to 100%, so that's pretty conclusive. Thanks!

Logs include:

llm_load_tensors: ggml ctx size = 0.13 MB
llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: mem required = 107.55 MB
llm_load_tensors: offloading 40 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 43/43 layers to GPU
llm_load_tensors: VRAM used: 8694.21 MB
warning: posix_madvise(.., POSIX_MADV_WILLNEED) failed: No error information (win32 error 998)
...................................................................................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: offloading v cache to GPU
llama_kv_cache_init: offloading k cache to GPU
llama_kv_cache_init: VRAM kv self = 400.00 MB
llama_new_context_with_model: kv self size = 400.00 MB
llama_build_graph: non-view tensors processed: 924/924
llama_new_context_with_model: compute buffer total size = 81.63 MB
llama_new_context_with_model: VRAM scratch buffer: 75.00 MB
llama_new_context_with_model: total VRAM used: 9169.21 MB (model: 8694.21 MB, context: 475.00 MB)
Available slots:
-> Slot 0 - max context: 512

llama server listening at http://127.0.0.1:8082

Trying multiple NVIDIA GPUs:

I've just noticed --main-gpu and --tensor-split; sorry, I missed those earlier.

1.) With both GPUs available (plus another non-NVIDIA one that's useless here but complicates the numbering scheme in Task Manager):

a) nvidia-smi numbers them like this:

  • 0 NVIDIA GeForce RTX 3080
  • 1 NVIDIA GeForce RTX 3090

b) llamafile numbers them like this:

C:\Users...\LLMs>llamafile -ngl 43 -m slimorca-13b.Q5_K_M.gguf --port 8082
NVIDIA cuBLAS GPU support successfully loaded
ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6
Device 1: NVIDIA GeForce RTX 3080 Ti Laptop GPU, compute capability 8.6

c) Windows Task Manager numbers them like this: GPU 0 = Intel(R), GPU 1 = NVIDIA 3080, GPU 2 = NVIDIA 3090.

d) With only -ngl 43 I think it defaults to the 3080. However, when we watch the load charts in Task Manager, the 3080 shows load for the duration of the task, whereas the 3090 shows only a tiny, brief flash of 100% load immediately after we send some text in the web UI.

llm_load_tensors: ggml ctx size = 0.13 MB
llm_load_tensors: using CUDA for GPU acceleration
ggml_cuda_set_main_device: using device 0 (NVIDIA GeForce RTX 3090) as main device
llm_load_tensors: mem required = 107.55 MB
llm_load_tensors: offloading 40 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 43/43 layers to GPU
llm_load_tensors: VRAM used: 8694.21 MB
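
One possible explanation for the numbering mismatch (my assumption, not something confirmed in the logs): the CUDA runtime enumerates devices fastest-first by default, while nvidia-smi lists them by PCI bus ID, so the two orderings can disagree. If that is what's happening here, the standard CUDA_DEVICE_ORDER environment variable should make them line up, e.g.:

rem make CUDA enumeration follow PCI bus order, matching nvidia-smi (standard CUDA env var, not llamafile-specific)
set CUDA_DEVICE_ORDER=PCI_BUS_ID
llamafile -ngl 43 -m slimorca-13b.Q5_K_M.gguf --port 8082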

--main-gpu

Trying to make the 3090 main:

C:\Users\danbr\LLMs>llamafile -ngl 43 -m slimorca-13b.Q5_K_M.gguf --port 8082 --main-gpu 0

NVIDIA cuBLAS GPU support successfully loaded
ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6
Device 1: NVIDIA GeForce RTX 3080 Ti Laptop GPU, compute capability 8.6
[...]
llm_load_tensors: ggml ctx size = 0.13 MB
llm_load_tensors: using CUDA for GPU acceleration
ggml_cuda_set_main_device: using device 0 (NVIDIA GeForce RTX 3090) as main device
llm_load_tensors: mem required = 107.55 MB
llm_load_tensors: offloading 40 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 43/43 layers to GPU
llm_load_tensors: VRAM used: 8694.21 MB

This setup behaves as before (and the logs report as before). Perhaps this means it is officially trying to use the 3090, but in practice the second GPU ('scratch'?) is doing all the heavy lifting?

Retrying with --main-gpu 1

The behaviour in the Task Manager charts seems similar: the bigger 3090 GPU only shows an initial momentary spike. However, the reporting is flipped:

llm_load_tensors: ggml ctx size = 0.13 MB
llm_load_tensors: using CUDA for GPU acceleration
ggml_cuda_set_main_device: using device 1 (NVIDIA GeForce RTX 3080 Ti Laptop GPU) as main device
llm_load_tensors: mem required = 107.55 MB
llm_load_tensors: offloading 40 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 43/43 layers to GPU
llm_load_tensors: VRAM used: 8694.21 MB
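
For splitting the work across both cards rather than picking one, --tensor-split (mentioned above) should take a comma-separated list of per-device proportions, in CUDA device order; a sketch with illustrative ratios, assuming the llama.cpp-style syntax:

rem put roughly 60% of the layers on device 0 (3090) and 40% on device 1 (3080 Ti); ratios are illustrative
llamafile -ngl 43 -m slimorca-13b.Q5_K_M.gguf --port 8082 --tensor-split 60,40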

I've heard that an external GPU setup like this gets unhappy when disconnected/reconnected without a full OS reboot, so that'll be the thing to investigate next.


jart commented on September 6, 2024

Thanks for all this interesting information. It's kind of weird that Task Manager lists your GPUs in the opposite order from llamafile. I would imagine that affects the llama.cpp project upstream too, since it's the kind of thing that could easily go unnoticed.


danbri commented on September 6, 2024


jart commented on September 6, 2024

I'm still working my way up to having a system with multiple GPUs so I'm not the best person to ask. Hopefully someone following can chime in. Would love to learn anything you discover while fiddling with these flags!


danbri commented on September 6, 2024


jart commented on September 6, 2024

Thanks for the resources! I'll leave this open for now, since I'm curious about the GPU ordering.


danbri commented on September 6, 2024

Looks like you've made a ton of progress - maybe you can close this?

The README still says:

Known Issues
Multiple GPUs isn't supported yet

However, it also mentions:

  • --main-gpu (although the numbering scheme was confusing on Windows)
  • 4616816 multiple gpus
  • also 04d6e93 Introduce --gpu flag

Given an external 28G NVIDIA RTX 3090 and an internal, smaller RTX 3080 Ti, is there any testing I could help with here?
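
If it helps, one combination I'd expect to be worth testing is forcing the NVIDIA backend explicitly while still offloading all layers; a sketch (the --gpu values are my reading of the README around 04d6e93, and the model path is illustrative):

rem force the NVIDIA backend instead of auto-detection and offload all layers; --gpu values assumed from the README
llamafile --gpu NVIDIA -ngl 43 -m slimorca-13b.Q5_K_M.gguf --port 8082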
