
Mamba (llamasharp), 11 comments, CLOSED

JoaoVictorVP commented on September 26, 2024
Mamba


Comments (11)

martindevans commented on September 26, 2024

The binaries in the latest release (0.11.1) are a little too old. The ones in the master branch were compiled after that PR was merged, so in theory they should include mamba support. I'd be interested to hear how that goes if you try it!


JoaoVictorVP commented on September 26, 2024

Oh, I'm using 0.11.2 (from NuGet). I tried copying the binaries from master and replacing the ones in /bin with them.
Surprisingly it loaded the model, which was not the case before, but it crashed with no printed errors and exit code -1073740791 about 3 seconds after I started an inference; I'm not sure whether that's because I'm still on the same 0.11.2 or not. Just after the inference started, and before crashing, it also printed this to the console:

GGML_ASSERT: D:\a\LLamaSharp\LLamaSharp\llama.cpp:10282: n_threads > 0
(I tested with different values for 'Threads' in ModelParams)

Is a new build of the NuGet package from master needed in this case?
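
For context, setting the thread count when loading a model with LLamaSharp looks roughly like this. This is a minimal sketch assuming the 0.11.x ModelParams/LLamaWeights API; the model path is hypothetical:

```csharp
using LLama;
using LLama.Common;

// Minimal sketch, assuming the LLamaSharp 0.11.x ModelParams/LLamaWeights API.
// The model path is hypothetical; the important part is that Threads is a
// positive value, since llama.cpp asserts n_threads > 0.
var parameters = new ModelParams("models/mamba-model.gguf")
{
    ContextSize = 2048,
    Threads = 4
};

using var weights = LLamaWeights.LoadFromFile(parameters);
using var context = weights.CreateContext(parameters);
var executor = new InteractiveExecutor(context);
```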


martindevans commented on September 26, 2024

I tried copying the binaries from master

That won't work, I'm afraid. The llama.cpp API is unstable, so every time the binaries are updated there are various internal changes on the C# side to work with the changed API. You always need to use the correct set of binaries with the correct version of the C# code.


JoaoVictorVP commented on September 26, 2024

Yep. I just compiled the main package and the CPU backend and the results were the same: same exit code and assertion log.
Maybe I could inspect the sources this weekend to try to find a cause, or do you have any ideas about why this is happening?


martindevans commented on September 26, 2024

I don't have any ideas at the moment. I know Mamba is a bit of an unusual architecture, just because I've seen various comments inside llama.cpp about how certain APIs need to be adjusted for Mamba, or don't quite make sense in a Mamba context. We'd definitely be interested in any investigations/PRs for Mamba support!


JoaoVictorVP commented on September 26, 2024

Oops, correction.

It actually worked. I suspect the problem was that NuGet was caching the package (0.11.2) from the remote feed (because I built the project against version 0.11.2); I deleted the cache and now it works.

The outputs are very strange though, but I suspect this is because I'm not formatting the inputs yet (for the tests), see here:
[screenshot of the model's output]

Also, the token limit is not working, so I implemented my own limit with the output transformer for these tests.

(This is a very small model as well, but compared to something like Phi3 it is very crude)
(On the other hand, even with the weird responses the time to first token does not increase absurdly like with the Phi3 model, so it seems like at least a partial win)
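
For reference, the token limit mentioned above is presumably the one on InferenceParams. A minimal sketch of capping output length that way, assuming LLamaSharp's LLama.Common types and reusing the executor from the earlier sketch; the prompt and the 64-token cap are arbitrary test values, not taken from the issue:

```csharp
using LLama.Common;

// Minimal sketch, assuming LLamaSharp's InferenceParams behaves as documented.
var inferenceParams = new InferenceParams
{
    MaxTokens = 64,                 // hard cap on the number of generated tokens
    AntiPrompts = new[] { "User:" } // also stop if the model starts a new turn
};

// Inside an async method, streaming the generated text piece by piece:
await foreach (var piece in executor.InferAsync("User: Hello!\nAssistant:", inferenceParams))
{
    Console.Write(piece);
}
```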


AsakusaRinne commented on September 26, 2024

I suspect the problem was that NuGet was caching the package (0.11.2) from the remote feed (because I built the project against version 0.11.2); I deleted the cache and now it works.

Yes, NuGet caches the package and will not pick up your locally compiled one if it has the same version tag.
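
(For reference: the local NuGet caches can be cleared with the .NET CLI command `dotnet nuget locals all --clear`, assuming the dotnet tooling is installed.)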

The outputs are very strange

That's unexpected. What prompt were you using? If you have cmake installed on your PC, you could also try running the same model and prompt directly in llama.cpp to see if the output is still a mess.


JoaoVictorVP commented on September 26, 2024

That's unexpected. What prompt were you using? If you have cmake installed on your PC, you could also try running the same model and prompt directly in llama.cpp to see if the output is still a mess.

About this: the model was one of the only ones I was able to find on Hugging Face in GGUF format that was actually Mamba (MambaHermes 3B).

I tested it with the same formatting, using the same processor I made for Phi3, and it also "kinda worked" (the responses were then very short, but more coherent). I also got it working a little better with the version quantized to 6 bits instead of 4.

But I noticed something a little strange: is there something in the implementation of llama.cpp that makes models run progressively slower? I thought it was because I was using transformer-based models before, but even with Mamba the time to first token often increases absurdly with each message (like, from 1 second to the first token, to 5, then 10, then 26, etc.).
(I'm asking because I later tested the same Phi3 model [not Mamba yet] in LM Studio and the time to first token was not changing as much, more like 1-3 seconds per message at most)

One of my tests where they performed reasonably well:
Q6 https://gist.github.com/JoaoVictorVP/92f6f30ad9d3c3dc343fdf0d7685685f
Q4 https://gist.github.com/JoaoVictorVP/f4de9ee658108898eaefa2c58c37938d


AsakusaRinne commented on September 26, 2024

is there something in the implementation of llama.cpp that makes models run progressively slower?

AFAIK, there's no such thing in llama.cpp. Could you please post the Hugging Face model link here so that we can try to reproduce this case?

(I'm asking because I later tested the same Phi3 model [not Mamba yet] in LM Studio and the time to first token was not changing as much, more like 1-3 seconds per message at most)

Though LM Studio is not open source, if I remember correctly it also uses llama.cpp as the backend. As you mentioned above, Phi-3 works well in LM Studio while Mamba becomes slower in llama.cpp; that doesn't necessarily indicate a problem in llama.cpp, it could also be a problem with the model. Could you please try Mamba in LM Studio, or try Phi-3 with llama.cpp/LLamaSharp?


martindevans commented on September 26, 2024

You'll get a progressive slowdown if you are using a stateless executor and submitting a larger and larger chat history each time. The stateful executors internally store the chat history, so every token should take around the same time. I'm not sure exactly how the situation differs for Mamba, but it should be roughly the same afaik.
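
To illustrate the distinction, here is a minimal sketch assuming LLamaSharp's StatelessExecutor and InteractiveExecutor types and reusing the weights/parameters from the earlier sketch; this is not code from the issue:

```csharp
// StatelessExecutor: each call re-evaluates the entire prompt you pass in, so
// resubmitting a growing chat history makes every turn slower than the last.
var stateless = new StatelessExecutor(weights, parameters);

// InteractiveExecutor: keeps the evaluated history inside its context between
// calls, so each new turn only has to process the newly appended text.
var stateful = new InteractiveExecutor(weights.CreateContext(parameters));
```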


martindevans commented on September 26, 2024

I'll close this one now, since Mamba is now supported. If there are still problems, please don't hesitate to re-open or create new issues :)

