Comments (11)
There's no CUDA_CHECK() anywhere near line 6006 in ggml-cuda.cu. Could you upload /home/garcia/.llamafile/ggml-cuda.cu to this issue tracker so I can see what's failing?
from llamafile.
Sure, here it is attached. However, after a quick diff it appears to be identical to this repo's llama.cpp/ggml-cuda.cu file (and in fact, there is a CUDA_CHECK() on line 6006).
from llamafile.
You are correct about line 6006. Apologies for any confusion. So here's what's failing:
CUDA_CHECK(cudaMalloc((void **) &ptr, look_ahead_size));
Looks like you're running out of GPU memory. But you said you're not passing the -ngl flag, which defaults to zero. I want to understand why it's possible to run out of GPU memory when the GPU isn't being used. Help wanted.
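For reference, the error-checking pattern in ggml-cuda.cu is roughly the following. This is a minimal compilable sketch, not the exact macro shipped in the file, and look_ahead_size here is just a placeholder value:

#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

// Sketch of the usual CUDA_CHECK pattern: print the numeric error code,
// file/line, and the runtime's description of the error, then abort.
#define CUDA_CHECK(err)                                                        \
    do {                                                                       \
        cudaError_t err_ = (err);                                              \
        if (err_ != cudaSuccess) {                                             \
            fprintf(stderr, "CUDA error %d at %s:%d: %s\n",                    \
                    (int) err_, __FILE__, __LINE__, cudaGetErrorString(err_)); \
            exit(1);                                                           \
        }                                                                      \
    } while (0)

int main(void) {
    void * ptr = NULL;
    size_t look_ahead_size = 256u * 1024 * 1024;    // placeholder; the real size comes from the buffer pool
    CUDA_CHECK(cudaMalloc(&ptr, look_ahead_size));  // the call that reports the error described above
    CUDA_CHECK(cudaFree(ptr));
    return 0;
}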
from llamafile.
Also do you know if this happens if you use llama.cpp upstream?
from llamafile.
I am getting the same error on my system, which is fairly similar: I also have a mobile NVIDIA GPU.
Error message for the mistral server llamafile:
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.11 MB
llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: mem required = 70.42 MB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 35/35 layers to GPU
llm_load_tensors: VRAM used: 4095.05 MB
...............................................................................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: offloading v cache to GPU
llama_kv_cache_init: offloading k cache to GPU
llama_kv_cache_init: VRAM kv self = 64.00 MB
llama_new_context_with_model: kv self size = 64.00 MB
llama_build_graph: non-view tensors processed: 740/740
llama_new_context_with_model: compute buffer total size = 79.63 MB
llama_new_context_with_model: VRAM scratch buffer: 73.00 MB
llama_new_context_with_model: total VRAM used: 4232.06 MB (model: 4095.05 MB, context: 137.00 MB)
Available slots:
-> Slot 0 - max context: 512
llama server listening at http://127.0.0.1:8080
loading weights...
{"timestamp":1701544135,"level":"INFO","function":"main","line":3039,"message":"HTTP server listening","hostname":"127.0.0.1","port":8080}
all slots are idle and system prompt is empty, clear the KV cache
{"timestamp":1701544135,"level":"INFO","function":"log_server_request","line":2591,"message":"request","remote_addr":"127.0.0.1","remote_port":37040,"status":200,"method":"GET","path":"/","params":{}}
{"timestamp":1701544135,"level":"INFO","function":"log_server_request","line":2591,"message":"request","remote_addr":"127.0.0.1","remote_port":37040,"status":200,"method":"GET","path":"/index.js","params":{}}
{"timestamp":1701544135,"level":"INFO","function":"log_server_request","line":2591,"message":"request","remote_addr":"127.0.0.1","remote_port":37044,"status":200,"method":"GET","path":"/completion.js","params":{}}
{"timestamp":1701544135,"level":"INFO","function":"log_server_request","line":2591,"message":"request","remote_addr":"127.0.0.1","remote_port":37056,"status":200,"method":"GET","path":"/json-schema-to-grammar.mjs","params":{}}
slot 0 is processing [task id: 0]
slot 0 : in cache: 0 tokens | to process: 5 tokens
slot 0 : kv cache rm - [0, end)
CUDA error 304 at /home/r3d/.llamafile/ggml-cuda.cu:6006: OS call failed or operation not supported on this OS
current device: 0
Some sysinfo:
$ lspci | grep -i nvidia
0000:01:00.0 VGA compatible controller: NVIDIA Corporation GA104M [GeForce RTX 3080 Mobile / Max-Q 8GB/16GB] (rev a1)
0000:01:00.1 Audio device: NVIDIA Corporation GA104 High Definition Audio Controller (rev a1)
$ uname -r
6.6.3-arch1-1
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Fri_Sep__8_19:17:24_PDT_2023
Cuda compilation tools, release 12.3, V12.3.52
Build cuda_12.3.r12.3/compiler.33281558_0
$ cat /sys/module/nvidia/version
545.29.06
For me, it does not happen if I use mistral with llama.cpp.
from llamafile.
> Also do you know if this happens if you use llama.cpp upstream?

It does not happen with llama.cpp; both -ngl 35 and -ngl 0 work fine there using the same model.
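For reference, a comparison like this with upstream llama.cpp's main binary (the model filename is just an example):
$ ./main -m mistral-7b-instruct.Q4_K_M.gguf -ngl 35 -p "hello"
$ ./main -m mistral-7b-instruct.Q4_K_M.gguf -ngl 0 -p "hello"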
from llamafile.
I just tried the earlier releases and everything works fine in v0.1, but it breaks in v0.2, so some breaking change must have been introduced there.
from llamafile.
I'm reasonably certain if you pass the --unsecure flag, things will work. Could you confirm this?
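(That is, launch the llamafile with the flag appended, e.g. $ ./mistral-7b-instruct-server.llamafile --unsecure; the filename here is just an example.)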
from llamafile.
works for me
from llamafile.
Great! I'll update all the llamafiles on HuggingFace so their .args files pass the --unsecure flag. That will roll back the new security until the next release can do better.
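For anyone patching their own download, the .args file embedded in a llamafile is just the default argument list, one argument per line; adding the flag would look roughly like this (contents are illustrative, and the ... line is where any extra user arguments get substituted):

-m
mistral-7b-instruct.Q4_K_M.gguf
--host
0.0.0.0
--unsecure
...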
from llamafile.
Yup, that works, thank you!
from llamafile.