Comments (12)
Using this:
var p = new ModelParams(modelPath) { GpuLayerCount = 0 };
Same error — model_ptr comes back as IntPtr.Zero from:
var model_ptr = NativeApi.llama_load_model_from_file(modelPath, lparams);
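For reference, a minimal sketch of that failing call with an explicit check on the returned handle (assuming the lparams construction from earlier in the thread; not a verified LLamaSharp snippet — in normal use the managed LLamaWeights.LoadFromFile wrapper does this check for you):

```csharp
// Sketch only — assumes modelPath and lparams are built as elsewhere in
// this thread. A zero handle from the native loader means the load failed;
// surfacing the path makes backend/DLL mix-ups easier to spot.
var model_ptr = NativeApi.llama_load_model_from_file(modelPath, lparams);
if (model_ptr == IntPtr.Zero)
{
    throw new InvalidOperationException(
        $"llama_load_model_from_file returned a null handle for '{modelPath}'. " +
        "Check that the GGUF file is valid and the right backend DLL is loaded.");
}
```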
Thanks,
Ash
from llamasharp.
There's a discussion in #357. It depends on llama.cpp implementation and we'll support it once llama.cpp supports it.
Relevant upstream issues:
I saw that the llama libraries were updated in the llamasharp repo and tried it.
Loading the weights took over a minute and 42 GB of my 128 GB of RAM, at 80% CPU and 28% GPU utilization, and then threw a native load failed exception.
Regards,
Ash
Could you link the GGUF file you were trying to use, I'll see if I can reproduce the problem.
Thanks @martindevans
These are my parameters
var p = new ModelParams(modelPath) { ContextSize = 4096, GpuLayerCount = 12, UseMemoryLock = true, UseMemorymap = true, Threads = 12, BatchSize = 128, EmbeddingMode = true };
var w = LLamaWeights.LoadFromFile(p);
CPU: i9 13th Gen, GPU: RTX 4090, RAM: 128 GB
Thanks,
Ash
I'm downloading it now, but it's going to take a while!
However, I've actually been testing with the Q5_K_M model from that same repo, so I'm expecting it to work.
I'd suggest getting rid of all your params options there; most of them are set automatically and you shouldn't need to change them unless you have a good reason to override the defaults. The only one you actually need to set is GpuLayerCount, but I'd suggest setting that to zero as a test first.
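Following that advice, a stripped-down load might look like this (a sketch based on the API usage shown in this thread, with only GpuLayerCount set and everything else left to the defaults baked into the GGUF file):

```csharp
// Baseline test: 0 GPU layers = pure CPU inference. If this loads cleanly,
// raise GpuLayerCount until VRAM runs out. (Sketch, not a verified snippet.)
var p = new ModelParams(modelPath) { GpuLayerCount = 0 };
using var w = LLamaWeights.LoadFromFile(p);
```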
I tested out that model, and it seems to work perfectly for me on both CPU and GPU.
If NativeApi.llama_load_model_from_file is failing, that would normally indicate an error with the model file itself or something more fundamental. Have you tried this file with one of the llama.cpp demos?
Thanks @martindevans for your help in debugging this issue.
It works! The issue was that it was picking up the llama DLL from the cuda11 folder, while I assumed it was picking it up from the cuda11.7.1 folder.
I could offload 18 layers to the GPU. Token generation was around 7.5 tokens/sec.
Are you seeing similar numbers? Is there a webpage you're aware of that lists the best parameters to set for each model?
Model output was better than the Mistral Instruct v0.2 for some of the prompts I tried.
Thanks,
Ash
I'm using CPU inference, so it's slower for me. But as a rough guide it should be around the same speed as a 13B model.
Is there a webpage you're aware of that lists the best parameters to set for each model?
Almost all of the parameters should be automatically set (they're baked into the GGUF file).
The GPU layer count I don't know much about. As I understand it you just have to experiment to see how many layers you can fit and what speedup it gets you.
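To make that experiment a little less blind, here is a hypothetical back-of-envelope estimate (not an LLamaSharp API): assume each of the model's layers takes an equal share of the GGUF file size, reserve some VRAM for the KV cache and scratch buffers, and offload whatever fits in the remainder. The figures in the example (a ~32 GB quant with 32 layers on a 24 GB card) are illustrative assumptions, not measurements from this thread.

```csharp
using System;

class GpuLayerEstimate
{
    // Rough heuristic: layers that fit = usable VRAM / per-layer size,
    // capped at the model's total layer count. Real usage varies with
    // context size and quantization, so treat the result as a starting point.
    public static int EstimateGpuLayers(double modelGb, int totalLayers,
                                        double vramGb, double reserveGb)
    {
        double perLayerGb = modelGb / totalLayers;
        double usableGb = Math.Max(0.0, vramGb - reserveGb);
        return Math.Min(totalLayers, (int)(usableGb / perLayerGb));
    }

    static void Main()
    {
        // e.g. a ~32 GB quant, 32 layers, 24 GB card, 4 GB reserved
        Console.WriteLine(EstimateGpuLayers(32.0, 32, 24.0, 4.0));  // prints 20
    }
}
```

Start a few layers below the estimate and step up until loading fails or generation slows down.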
Thanks @martindevans
As you said, the GPU layer count setting is more of a "try it and see how many layers you can fit on your GPU" thing :-)
v0.9.1 added support for Mixtral, so I'll close this issue now.