
Comments (11)

jart avatar jart commented on July 30, 2024

There's no CUDA_CHECK() anywhere near line 6006 in ggml-cuda.cu. Could you upload /home/garcia/.llamafile/ggml-cuda.cu to this issue tracker so I can see what's failing?

from llamafile.

GarciaLnk avatar GarciaLnk commented on July 30, 2024

Sure, here it is attached. However, after a quick diff it appears to be identical to this repo's llama.cpp/ggml-cuda.cu file (and in fact, there is a CUDA_CHECK() on line 6006).

ggml-cuda.cu


jart avatar jart commented on July 30, 2024

You are correct about line 6006. Apologies for the confusion. So here's what's failing:

    CUDA_CHECK(cudaMalloc((void **) &ptr, look_ahead_size));

Looks like you're running out of GPU memory. But you said you're not passing the -ngl flag, which defaults to zero. I want to understand why it's possible to run out of GPU memory when the GPU isn't being used. Help wanted.


jart avatar jart commented on July 30, 2024

Also do you know if this happens if you use llama.cpp upstream?


r3drock avatar r3drock commented on July 30, 2024

I am getting the same error on my system, which is somewhat similar: I also have a mobile NVIDIA GPU.
Error message from the Mistral server llamafile:

llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token  = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.11 MB
llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: mem required  =   70.42 MB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 35/35 layers to GPU
llm_load_tensors: VRAM used: 4095.05 MB
...............................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: offloading v cache to GPU
llama_kv_cache_init: offloading k cache to GPU
llama_kv_cache_init: VRAM kv self = 64.00 MB
llama_new_context_with_model: kv self size  =   64.00 MB
llama_build_graph: non-view tensors processed: 740/740
llama_new_context_with_model: compute buffer total size = 79.63 MB
llama_new_context_with_model: VRAM scratch buffer: 73.00 MB
llama_new_context_with_model: total VRAM used: 4232.06 MB (model: 4095.05 MB, context: 137.00 MB)
Available slots:
 -> Slot 0 - max context: 512

llama server listening at http://127.0.0.1:8080

loading weights...
{"timestamp":1701544135,"level":"INFO","function":"main","line":3039,"message":"HTTP server listening","hostname":"127.0.0.1","port":8080}
all slots are idle and system prompt is empty, clear the KV cache
{"timestamp":1701544135,"level":"INFO","function":"log_server_request","line":2591,"message":"request","remote_addr":"127.0.0.1","remote_port":37040,"status":200,"method":"GET","path":"/","params":{}}
{"timestamp":1701544135,"level":"INFO","function":"log_server_request","line":2591,"message":"request","remote_addr":"127.0.0.1","remote_port":37040,"status":200,"method":"GET","path":"/index.js","params":{}}
{"timestamp":1701544135,"level":"INFO","function":"log_server_request","line":2591,"message":"request","remote_addr":"127.0.0.1","remote_port":37044,"status":200,"method":"GET","path":"/completion.js","params":{}}
{"timestamp":1701544135,"level":"INFO","function":"log_server_request","line":2591,"message":"request","remote_addr":"127.0.0.1","remote_port":37056,"status":200,"method":"GET","path":"/json-schema-to-grammar.mjs","params":{}}
slot 0 is processing [task id: 0]
slot 0 : in cache: 0 tokens | to process: 5 tokens
slot 0 : kv cache rm - [0, end)

CUDA error 304 at /home/r3d/.llamafile/ggml-cuda.cu:6006: OS call failed or operation not supported on this OS
current device: 0

some sysinfo:

$ lspci | grep -i nvidia
0000:01:00.0 VGA compatible controller: NVIDIA Corporation GA104M [GeForce RTX 3080 Mobile / Max-Q 8GB/16GB] (rev a1)
0000:01:00.1 Audio device: NVIDIA Corporation GA104 High Definition Audio Controller (rev a1)
$ uname -r
6.6.3-arch1-1
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Fri_Sep__8_19:17:24_PDT_2023
Cuda compilation tools, release 12.3, V12.3.52
Build cuda_12.3.r12.3/compiler.33281558_0
$ cat /sys/module/nvidia/version
545.29.06

For me, it does not happen if I use mistral with llama.cpp.


GarciaLnk avatar GarciaLnk commented on July 30, 2024

Also do you know if this happens if you use llama.cpp upstream?

It does not happen with llama.cpp, both -ngl 35 and -ngl 0 work fine there using the same model.


GarciaLnk avatar GarciaLnk commented on July 30, 2024

I just tried the earlier releases: everything works fine in v0.1, but it breaks in v0.2, so a breaking change must have been introduced there.


jart avatar jart commented on July 30, 2024

I'm reasonably certain if you pass the --unsecure flag, things will work. Could you confirm this?


r3drock avatar r3drock commented on July 30, 2024

works for me


jart avatar jart commented on July 30, 2024

Great! I'll update all the llamafiles on Hugging Face so their .args files pass the --unsecure flag. That will roll back the new security until the next release can do better.
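For reference, a .args file embedded in a llamafile holds default command-line arguments, one per line; adding the flag there would look something like this (the model filename is illustrative, not the actual Hugging Face contents, and the trailing `...` line is where user-supplied arguments pass through, assuming the documented format):

```
-m
mistral-7b-instruct-v0.1.Q4_K_M.gguf
--unsecure
...
```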


GarciaLnk avatar GarciaLnk commented on July 30, 2024

Yup, that works, thank you!

