
Comments (11)

jart avatar jart commented on July 30, 2024

There's no CUDA_CHECK() anywhere near line 6006 in ggml-cuda.cu. Could you upload /home/garcia/.llamafile/ggml-cuda.cu to this issue tracker so I can see what's failing?

from llamafile.

GarciaLnk avatar GarciaLnk commented on July 30, 2024

Sure, here it is attached. However, after a quick diff it appears to be identical to this repo's llama.cpp/ggml-cuda.cu file (and in fact, there is a CUDA_CHECK() on line 6006).

ggml-cuda.cu


jart avatar jart commented on July 30, 2024

You are correct about line 6006. Apologies for the confusion. So here's what's failing:

    CUDA_CHECK(cudaMalloc((void **) &ptr, look_ahead_size));

Looks like you're running out of GPU memory. But you said you're not passing the -ngl flag, which defaults to zero. I want to understand why it's possible to run out of GPU memory when the GPU isn't being used. Help wanted.


jart avatar jart commented on July 30, 2024

Also do you know if this happens if you use llama.cpp upstream?


r3drock avatar r3drock commented on July 30, 2024

I am getting the same error on my system, which is somewhat similar: I also have a mobile NVIDIA GPU.
Error message from the Mistral server llamafile:

llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token  = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.11 MB
llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: mem required  =   70.42 MB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 35/35 layers to GPU
llm_load_tensors: VRAM used: 4095.05 MB
...............................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: offloading v cache to GPU
llama_kv_cache_init: offloading k cache to GPU
llama_kv_cache_init: VRAM kv self = 64.00 MB
llama_new_context_with_model: kv self size  =   64.00 MB
llama_build_graph: non-view tensors processed: 740/740
llama_new_context_with_model: compute buffer total size = 79.63 MB
llama_new_context_with_model: VRAM scratch buffer: 73.00 MB
llama_new_context_with_model: total VRAM used: 4232.06 MB (model: 4095.05 MB, context: 137.00 MB)
Available slots:
 -> Slot 0 - max context: 512

llama server listening at http://127.0.0.1:8080

loading weights...
{"timestamp":1701544135,"level":"INFO","function":"main","line":3039,"message":"HTTP server listening","hostname":"127.0.0.1","port":8080}
all slots are idle and system prompt is empty, clear the KV cache
{"timestamp":1701544135,"level":"INFO","function":"log_server_request","line":2591,"message":"request","remote_addr":"127.0.0.1","remote_port":37040,"status":200,"method":"GET","path":"/","params":{}}
{"timestamp":1701544135,"level":"INFO","function":"log_server_request","line":2591,"message":"request","remote_addr":"127.0.0.1","remote_port":37040,"status":200,"method":"GET","path":"/index.js","params":{}}
{"timestamp":1701544135,"level":"INFO","function":"log_server_request","line":2591,"message":"request","remote_addr":"127.0.0.1","remote_port":37044,"status":200,"method":"GET","path":"/completion.js","params":{}}
{"timestamp":1701544135,"level":"INFO","function":"log_server_request","line":2591,"message":"request","remote_addr":"127.0.0.1","remote_port":37056,"status":200,"method":"GET","path":"/json-schema-to-grammar.mjs","params":{}}
slot 0 is processing [task id: 0]
slot 0 : in cache: 0 tokens | to process: 5 tokens
slot 0 : kv cache rm - [0, end)

CUDA error 304 at /home/r3d/.llamafile/ggml-cuda.cu:6006: OS call failed or operation not supported on this OS
current device: 0

some sysinfo:

$ lspci | grep -i nvidia
0000:01:00.0 VGA compatible controller: NVIDIA Corporation GA104M [GeForce RTX 3080 Mobile / Max-Q 8GB/16GB] (rev a1)
0000:01:00.1 Audio device: NVIDIA Corporation GA104 High Definition Audio Controller (rev a1)
$ uname -r
6.6.3-arch1-1
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Fri_Sep__8_19:17:24_PDT_2023
Cuda compilation tools, release 12.3, V12.3.52
Build cuda_12.3.r12.3/compiler.33281558_0
$ cat /sys/module/nvidia/version
545.29.06

For me, it does not happen if I use mistral with llama.cpp.


GarciaLnk avatar GarciaLnk commented on July 30, 2024

Also do you know if this happens if you use llama.cpp upstream?

It does not happen with llama.cpp, both -ngl 35 and -ngl 0 work fine there using the same model.


GarciaLnk avatar GarciaLnk commented on July 30, 2024

I just tried the earlier releases: everything works fine in v0.1, but it breaks in v0.2, so a breaking change must have been introduced there.


jart avatar jart commented on July 30, 2024

I'm reasonably certain if you pass the --unsecure flag, things will work. Could you confirm this?


r3drock avatar r3drock commented on July 30, 2024

works for me


jart avatar jart commented on July 30, 2024

Great! I'll update all the llamafiles on Hugging Face so their .args files pass the --unsecure flag. That will roll back the new security until the next release can do better.
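For reference, a .args file embedded in a llamafile holds default command-line arguments, one per line; adding the flag there would look something like this (the model filename is illustrative, not the actual Hugging Face contents, and the trailing `...` line is where user-supplied arguments pass through, assuming the documented format):

```
-m
mistral-7b-instruct-v0.1.Q4_K_M.gguf
--unsecure
...
```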


GarciaLnk avatar GarciaLnk commented on July 30, 2024

Yup, that works, thank you!

