Comments (8)
Try passing the flag -ngl 43
and let me know if that fixes it.
from llamafile.
Ok, with external GPU disconnected, it becomes massively faster and drives onboard GPU usage to 100%, so that's pretty conclusive. Thanks!
Logs include:
llm_load_tensors: ggml ctx size = 0.13 MB
llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: mem required = 107.55 MB
llm_load_tensors: offloading 40 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 43/43 layers to GPU
llm_load_tensors: VRAM used: 8694.21 MB
warning: posix_madvise(.., POSIX_MADV_WILLNEED) failed: No error information (win32 error 998)
...................................................................................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: offloading v cache to GPU
llama_kv_cache_init: offloading k cache to GPU
llama_kv_cache_init: VRAM kv self = 400.00 MB
llama_new_context_with_model: kv self size = 400.00 MB
llama_build_graph: non-view tensors processed: 924/924
llama_new_context_with_model: compute buffer total size = 81.63 MB
llama_new_context_with_model: VRAM scratch buffer: 75.00 MB
llama_new_context_with_model: total VRAM used: 9169.21 MB (model: 8694.21 MB, context: 475.00 MB)
Available slots:
-> Slot 0 - max context: 512
llama server listening at http://127.0.0.1:8082
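As a sanity check, the VRAM figures in this log are self-consistent: the 400 MB KV cache plus the 75 MB scratch buffer make up the 475 MB "context" share, which together with the 8694.21 MB of model weights gives the reported 9169.21 MB total. A quick check in Python (numbers copied from the log above):

```python
# VRAM figures copied from the llamafile log above (all in MB)
model_mb = 8694.21    # llm_load_tensors: VRAM used
kv_mb = 400.00        # llama_kv_cache_init: VRAM kv self
scratch_mb = 75.00    # llama_new_context_with_model: VRAM scratch buffer

context_mb = kv_mb + scratch_mb
total_mb = model_mb + context_mb
print(f"context: {context_mb:.2f} MB, total: {total_mb:.2f} MB")
# prints: context: 475.00 MB, total: 9169.21 MB
```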
Trying multiple NVIDIA GPUs:
I've just noticed --main-gpu and --tensor-split; sorry, I missed those earlier.
1.) With both GPUs available (plus a non-NVIDIA one that's useless here but complicates the numbering scheme in Task Manager):
a) nvidia-smi numbers them like this:
- 0 NVIDIA GeForce RTX 3080
- 1 NVIDIA GeForce RTX 3090
b) llamafile numbers them like this:
C:\Users...\LLMs>llamafile -ngl 43 -m slimorca-13b.Q5_K_M.gguf --port 8082
NVIDIA cuBLAS GPU support successfully loaded
ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6
Device 1: NVIDIA GeForce RTX 3080 Ti Laptop GPU, compute capability 8.6
c) Windows Task Manager numbers them like this: GPU 0 = Intel(R), GPU 1 = NVIDIA 3080, GPU 2 = NVIDIA 3090.
d) with only -ngl 43 I think it defaults to the 3080. However, watching the load charts in Task Manager, the 3080 shows load for the duration of the task, whereas the 3090 shows only a tiny, brief flash of 100% load immediately after we send some text in the web UI.
llm_load_tensors: ggml ctx size = 0.13 MB
llm_load_tensors: using CUDA for GPU acceleration
ggml_cuda_set_main_device: using device 0 (NVIDIA GeForce RTX 3090) as main device
llm_load_tensors: mem required = 107.55 MB
llm_load_tensors: offloading 40 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 43/43 layers to GPU
llm_load_tensors: VRAM used: 8694.21 MB
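One possible explanation for the mismatched numbering (an assumption on my part, not something llamafile documents): by default the CUDA runtime enumerates devices fastest-first, while nvidia-smi and Task Manager tend to follow PCI bus order. Setting the `CUDA_DEVICE_ORDER` environment variable before launching may make the two schemes agree:

```shell
# Force CUDA to enumerate devices in PCI bus order (the order nvidia-smi uses).
# On Windows cmd.exe this would be:  set CUDA_DEVICE_ORDER=PCI_BUS_ID
export CUDA_DEVICE_ORDER=PCI_BUS_ID
# then launch as usual, e.g.:
# llamafile -ngl 43 -m slimorca-13b.Q5_K_M.gguf --port 8082
echo "CUDA_DEVICE_ORDER=$CUDA_DEVICE_ORDER"
```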
--main-gpu
Trying to make the 3090 the main GPU:
C:\Users\danbr\LLMs>llamafile -ngl 43 -m slimorca-13b.Q5_K_M.gguf --port 8082 --main-gpu 0
NVIDIA cuBLAS GPU support successfully loaded
ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6
Device 1: NVIDIA GeForce RTX 3080 Ti Laptop GPU, compute capability 8.6
[...]
llm_load_tensors: ggml ctx size = 0.13 MB
llm_load_tensors: using CUDA for GPU acceleration
ggml_cuda_set_main_device: using device 0 (NVIDIA GeForce RTX 3090) as main device
llm_load_tensors: mem required = 107.55 MB
llm_load_tensors: offloading 40 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 43/43 layers to GPU
llm_load_tensors: VRAM used: 8694.21 MB
This setup behaves as before (and the logs report the same). Perhaps this means it is officially trying to use the 3090, but in practice the second GPU (the 'scratch'?) is doing all the heavy lifting?
Retrying with --main-gpu 1
The behaviour in the Task Manager charts looks similar: the bigger 3090 still shows only an initial momentary spike. However, the reporting is flipped:
llm_load_tensors: ggml ctx size = 0.13 MB
llm_load_tensors: using CUDA for GPU acceleration
ggml_cuda_set_main_device: using device 1 (NVIDIA GeForce RTX 3080 Ti Laptop GPU) as main device
llm_load_tensors: mem required = 107.55 MB
llm_load_tensors: offloading 40 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 43/43 layers to GPU
llm_load_tensors: VRAM used: 8694.21 MB
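If llamafile behaves like upstream llama.cpp here, --main-gpu mainly selects which device hosts the scratch and small-tensor work, while the layer distribution is governed by --tensor-split, which takes comma-separated proportions; --tensor-split 1,0 should pin everything on device 0 (the 3090, per llamafile's numbering). Conceptually the proportions are normalized and each GPU gets its share of the layers; a simplified sketch with a hypothetical helper, not llamafile's actual code:

```python
# Simplified illustration of how a --tensor-split style option could divide
# layers between GPUs in proportion to the given weights
# (hypothetical helper; not llamafile's actual algorithm).
def split_layers(n_layers, proportions):
    total = sum(proportions)
    counts = [round(n_layers * p / total) for p in proportions]
    counts[-1] += n_layers - sum(counts)  # absorb rounding drift
    return counts

print(split_layers(43, [1, 0]))    # [43, 0]  -> all layers on device 0
print(split_layers(43, [24, 12]))  # weighted toward the larger card
```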
I've heard that an external GPU setup like this gets unhappy when disconnected/reconnected without a full OS reboot, so that'll be the next thing to investigate.
Thanks for all this interesting information. It's kind of weird that Task Manager lists your GPUs in the opposite order from llamafile. I would imagine that impacts the upstream llama.cpp project too, since it's the kind of thing I could easily imagine going unnoticed.
I'm still working my way up to having a system with multiple GPUs so I'm not the best person to ask. Hopefully someone following can chime in. Would love to learn anything you discover while fiddling with these flags!
Thanks for the resources! I'll leave this open for now, since I'm curious about the GPU ordering.
Looks like you made a ton of progress - maybe you can close this?
the README still says:

Known Issues
Multiple GPUs isn't supported yet

however, there's also:
- --main-gpu (although the numbering scheme was confusing on Windows)
- 4616816 multiple gpus
- also 04d6e93 Introduce --gpu flag

Given an external 28G NVIDIA RTX 3090 and a smaller internal RTX 3080 Ti, is there any testing I could help with here?