Comments (15)
This looks like an error coming from llama.cpp itself, rather than LLamaSharp. Have you tried this model with llama.cpp directly to confirm if you get the same error?
from llamasharp.
I have compiled llama.cpp with CUDA support and it works. I've tried it with a few different 7B models that work with llama.cpp but give this error with LLamaSharp. I've tried sending the same prompts to llama.cpp and that also works. And to make matters more confusing, it started working for a bit, then it started failing again.
Could you get a stack trace from the exception? That'll tell us what C# code was running when it crashed.
I'm getting this too, and it hard-crashes the host app even with try/catch everywhere. In my case it's on Apple Silicon.
llama_new_context_with_model: n_ctx = 32768
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: kv self size = 4096.00 MiB
llama_build_graph: non-view tensors processed: 740/740
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M1
ggml_metal_init: picking default device: Apple M1
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: loading '/Users/jameshancock/Repos/TheDailyFactum/Server/Tools/Chat/bin/Debug/net8.0/runtimes/osx-arm64/native/ggml-metal.metal'
ggml_metal_init: GPU name: Apple M1
ggml_metal_init: GPU family: MTLGPUFamilyApple7 (1007)
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: recommendedMaxWorkingSetSize = 10922.67 MiB
ggml_metal_init: maxTransferRate = built-in GPU
llama_new_context_with_model: compute buffer total size = 2139.07 MiB
llama_new_context_with_model: max tensor size = 102.54 MiB
ggml_metal_add_buffer: allocated 'data ' buffer, size = 4893.70 MiB, ( 4895.08 / 10922.67)
ggml_metal_add_buffer: allocated 'kv ' buffer, size = 4096.02 MiB, ( 8991.09 / 10922.67)
ggml_metal_add_buffer: allocated 'alloc ' buffer, size = 2136.02 MiB, (11127.11 / 10922.67)
ggml_metal_add_buffer: warning: current allocated size is greater than the recommended max working set size
ggml_metal_graph_compute: command buffer 0 failed with status 5
GGML_ASSERT: /Users/runner/work/LLamaSharp/LLamaSharp/ggml-metal.m:1611: false
The program '[26619] Chat.dll' has exited with code 0 (0x0).
It should have fallen back automatically to CPU and swapped like crazy.
"command buffer 0 failed with status 5" seems to indicate an out-of-memory error (ref: ggerganov/llama.cpp#2048).
Right. But the actual issue here is that llama.cpp errors crash any LLamaSharp-based .NET application to the desktop; we can't handle these errors.
And beyond being unhandleable, these memory errors often stop LLamaSharp from falling back from GPU to CPU, something other systems like LM Desktop manage just fine.
The result is a doubly brittle system that isn't deployable outside of very tightly controlled environments.
Unfortunately I don't think there's any way we can handle a GGML_ASSERT. It's defined to call abort(), which is about as fatal as it gets!
According to Microsoft's docs, the best way to work around abort() is to run the native code in a separate process, spun up from C# before calling into the C++ library.
Yep, that would be the only way to handle it (an abort() just destroys the process, with no way to recover).
That's not something LLamaSharp does internally at the moment (and personally I would say we're unlikely to, remaining just a wrapper around llama.cpp).
Instead, in my opinion the two ways to handle this would be at a higher level (load LLamaSharp in a separate process and interact with it) or at a lower level (contact the llama.cpp team and ask them to use a recoverable kind of error detection where possible).
Would it not make sense for LLamaSharp as a project to request this? It would also benefit every other language consuming llama.cpp (and would help their own server).
I can ask if you'd prefer not to, but LLamaSharp doesn't have any special pull in the llama.cpp project. To be honest at the moment I suspect any such request will be largely ignored (unless it's accompanied by PRs to implement better error handling).
Could you do so? This really is killing us because it doesn't allow us to fall back to not using the GPU when this occurs.
I've opened up ggerganov/llama.cpp#4385
Although I will say I wouldn't expect this to change quickly, if at all! It would be a large change in both LLamaSharp and llama.cpp. If it's an issue you currently have, you'll want to split your usage of LLamaSharp off into a separate process.
Some interesting discussion related to error handling in llama.cpp here: ggerganov/ggml#701