Comments (11)
The binaries in the latest release (0.11.1) are a little too old. The ones in the master branch were compiled after that PR was merged, so in theory they should include mamba support. I'd be interested to hear how that goes if you try it!
from llamasharp.
Oh, I'm using 0.11.2 (from NuGet). I tried copying the binaries from master and replacing the ones in /bin with them.
Surprisingly, it loaded the model, which was not the case before, but it crashed with no printed errors and exit code -1073740791 about 3 seconds after I started an inference. I'm not sure whether that is because I'm still using the same 0.11.2 C# code. Just after the inference started and before crashing, it also printed this to the console:
GGML_ASSERT: D:\a\LLamaSharp\LLamaSharp\llama.cpp:10282: n_threads > 0
(I tested with different values for 'Threads' in ModelParams)
Is a new build of the NuGet package from master needed in this case?
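For reference, the assertion above fires when a non-positive thread count reaches llama.cpp. A toy sketch (plain Python, not LLamaSharp code) of the guard a binding would need before handing the value to native code:

```python
import os

def resolve_threads(requested):
    # llama.cpp asserts n_threads > 0, so an unset or non-positive
    # value must be replaced with a sane default before it reaches
    # the native call.
    if requested is None or requested <= 0:
        return max(1, os.cpu_count() or 1)
    return requested
```

If the assert fires even with a positive 'Threads' setting, the value is presumably being lost or overwritten somewhere between the C# parameters and the native call.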
I tried copying the binaries from master
That won't work, I'm afraid. The llama.cpp API is unstable, so every time the binaries are updated there are various internal changes on the C# side to work with the changed API. You always need to use the correct set of binaries with the matching version of the C# code.
Yep. I just compiled the main package and the CPU version, and the results were the same: same exit code and assertion log.
Maybe I could inspect the sources this weekend to try to find a cause for this, or do you have any ideas about why this is happening?
I don't have any ideas at the moment. I know Mamba is a bit of an unusual architecture, just because I've seen various comments inside llama.cpp about how certain APIs need to be adjusted for Mamba, or don't quite make sense in a Mamba context. We'd definitely be interested in any investigations/PRs for Mamba support!
Oops, correction.
It actually worked. I suspect the earlier failure was because NuGet was caching the package (0.11.2) from the remote feed (since I had built the project against version 0.11.2); after I deleted the cache, it now works.
The outputs are very strange though, but I suspect that is because I'm not formatting the inputs yet (for the tests), see here:
Also, the token limit is not working, so I made my own with the output transformer to test here.
(This is a very small model as well, but compared to something like Phi3 it is very crude.)
(On the other hand, even with weird responses, the time to first token does not increase absurdly like with the Phi3 model, so it seems like at least a partial win.)
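As an illustration, a manual token cap applied in the streaming loop (a hypothetical stand-in for the output-transformer workaround described above, not LLamaSharp's actual API) can look like:

```python
def limit_tokens(token_stream, max_tokens):
    # Stop a streaming generation after max_tokens tokens have been
    # yielded, regardless of whether the executor honours its own limit.
    count = 0
    for token in token_stream:
        if count >= max_tokens:
            break
        yield token
        count += 1

# Usage: wrap whatever iterator produces the generated tokens.
capped = list(limit_tokens(iter(["Hel", "lo", ",", " wor", "ld"]), 3))
```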
I suspect it was because NuGet was caching the package (0.11.2) from the remote feed (because I built the project using the 0.11.2 version); after I deleted the cache, it now works.
Yes, NuGet caches the package and will not pick up your locally compiled one if it has the same version tag.
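A toy model of that caching behaviour (plain Python, not NuGet's actual implementation): the cache is consulted by package name and version only, so a rebuilt package with an unchanged version tag is never fetched until the cache is cleared (e.g. with `dotnet nuget locals all --clear`).

```python
# Cache keyed by (name, version): content changes are invisible
# as long as the version tag stays the same.
cache = {}

def restore(name, version, remote):
    key = (name, version)
    if key not in cache:
        cache[key] = remote[key]
    return cache[key]

remote = {("LLamaSharp", "0.11.2"): "remote-binaries"}
first = restore("LLamaSharp", "0.11.2", remote)

remote[("LLamaSharp", "0.11.2")] = "locally-built-binaries"  # same version tag
second = restore("LLamaSharp", "0.11.2", remote)             # still the cached copy

cache.clear()                                                # ~ clearing the NuGet cache
third = restore("LLamaSharp", "0.11.2", remote)              # now picks up the rebuild
```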
The outputs are very strange
That's unexpected. What prompt were you using? If you have cmake installed on your PC, you could also try running the same model and prompt directly in llama.cpp to see if the output is still a mess.
That's unexpected. What prompt were you using? If you have cmake installed on your PC, you could also try running the same model and prompt directly in llama.cpp to see if the output is still a mess.
About this, the model was one of the few GGUF models I was able to find on Hugging Face that was actually Mamba (MambaHermes 3B).
I tested it with the same formatting, using the same processor I made for Phi3, and it also "kinda worked" (the responses were then very short, but more coherent). I also got it working a little better with the version quantized to 6 bits instead of 4.
But I noticed something a little strange: is there something in the implementation of llama.cpp that makes models run progressively slower? I thought it was because I was using transformer-based models before, but even with Mamba the time to first token often increases absurdly with each message (like, from 1 second to the first token, to 5, then 10, then 26, etc.).
(I'm asking because I later tested the same Phi3 model [not the Mamba one yet] in LM Studio and the time to first token did not change so much, more like 1-3 seconds per message at most.)
One of my tests where they performed reasonably well:
Q6 https://gist.github.com/JoaoVictorVP/92f6f30ad9d3c3dc343fdf0d7685685f
Q4 https://gist.github.com/JoaoVictorVP/f4de9ee658108898eaefa2c58c37938d
is there something in the implementation of llama.cpp that makes models run progressively slower?
AFAIK, there's no such thing in llama.cpp. Could you please post the huggingface model link here so that we can try to reproduce this case?
(I'm asking because I later tested the same Phi3 model [not the Mamba one yet] in LM Studio and the time to first token did not change so much, more like 1-3 seconds per message at most.)
Though LM Studio is not open-source, if I remember correctly it also uses llama.cpp as its backend. As you mentioned above, Phi-3 works well in LM Studio while Mamba becomes slower with llama.cpp. That doesn't necessarily indicate it's llama.cpp's problem; it could also be the model's. Could you please try Mamba in LM Studio, or try Phi-3 with llama.cpp/LLamaSharp?
You'll get a progressive slowdown if you are using a stateless executor and submitting a larger and larger chat history each time. The stateful executors internally store the chat history, so the time to first token should stay roughly constant per turn. I'm not sure exactly how the situation differs for Mamba, but it should be roughly the same, AFAIK.
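The difference can be sketched with a toy cost model (plain Python, not LLamaSharp code): a stateless executor re-processes the whole accumulated history before each reply, while a stateful one only processes the new turn's tokens because earlier ones remain in its internal state.

```python
def prefill_cost(turn_lengths, stateless):
    # Tokens processed before the first new token of each turn.
    # stateless: the full accumulated history is re-submitted every turn.
    # stateful:  only the new turn's tokens are processed.
    costs = []
    total = 0
    for turn in turn_lengths:
        total += turn
        costs.append(total if stateless else turn)
    return costs

turns = [50, 50, 50, 50]             # four messages of ~50 tokens each
stateless = prefill_cost(turns, stateless=True)   # grows every turn
stateful = prefill_cost(turns, stateless=False)   # stays flat
```

Under this model the stateless time to first token grows linearly with accumulated history, which matches the "1s, then 5s, then 10s, then 26s" pattern described above.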
I'll close this one now, since Mamba is now supported. If there are still problems, please don't hesitate to re-open or to create new issues :)