Comments (12)
Using this:
var p = new ModelParams(modelPath) { GpuLayerCount = 0 };
Same error — model_ptr comes back as IntPtr.Zero from:
var model_ptr = NativeApi.llama_load_model_from_file(modelPath, lparams);
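For reference, a minimal sketch of that failing call with an explicit check on the returned handle (assuming the lparams construction from earlier in the thread; not a verified LLamaSharp snippet — in normal use the managed LLamaWeights.LoadFromFile wrapper does this check for you):

```csharp
// Sketch only — assumes modelPath and lparams are built as elsewhere in
// this thread. A zero handle from the native loader means the load failed;
// surfacing the path makes backend/DLL mix-ups easier to spot.
var model_ptr = NativeApi.llama_load_model_from_file(modelPath, lparams);
if (model_ptr == IntPtr.Zero)
{
    throw new InvalidOperationException(
        $"llama_load_model_from_file returned a null handle for '{modelPath}'. " +
        "Check that the GGUF file is valid and the right backend DLL is loaded.");
}
```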
Thanks,
Ash
from llamasharp.
There's a discussion in #357. It depends on llama.cpp implementation and we'll support it once llama.cpp supports it.
Relevant upstream issues:
I saw that the llama libraries were updated in the llamasharp repo and tried it.
Loading the weights took over a minute and 42 GB of my 128 GB of RAM, at 80% CPU and 28% GPU utilization, and then threw a native load failed exception.
Regards,
Ash
Could you link the GGUF file you were trying to use, I'll see if I can reproduce the problem.
Thanks @martindevans
These are my parameters
var p = new ModelParams(modelPath) { ContextSize = 4096, GpuLayerCount = 12, UseMemoryLock = true, UseMemorymap = true, Threads = 12, BatchSize = 128, EmbeddingMode = true };
var w = LLamaWeights.LoadFromFile(p);
CPU: i9 13th Gen, GPU: RTX 4090, RAM: 128 GB
Thanks,
Ash
I'm downloading it now, but it's going to take a while!
However, I've actually been testing with the Q5_K_M model from that same repo, so I'm expecting it to work.
I'd suggest getting rid of all your params options there; most of them are set automatically and you shouldn't need to change them unless you have a good reason to override the defaults. The only one you actually need to set is GpuLayerCount, but I'd suggest setting that to zero as a test first.
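Following that advice, a stripped-down load might look like this (a sketch based on the API usage shown in this thread, with only GpuLayerCount set and everything else left to the defaults baked into the GGUF file):

```csharp
// Baseline test: 0 GPU layers = pure CPU inference. If this loads cleanly,
// raise GpuLayerCount until VRAM runs out. (Sketch, not a verified snippet.)
var p = new ModelParams(modelPath) { GpuLayerCount = 0 };
using var w = LLamaWeights.LoadFromFile(p);
```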
I tested out that model, and it seems to work perfectly for me on both CPU and GPU.
If NativeApi.llama_load_model_from_file is failing, that would normally indicate an error with the model file itself or something more fundamental. Have you tried this file with one of the llama.cpp demos?
Thanks @martindevans for your help in debugging this issue.
It works! The issue was that it was picking up the llama DLL from the cuda11 folder, while I assumed it was picking it up from the cuda11.7.1 folder.
I could offload 18 layers to the GPU. Token generation was around 7.5 tokens/sec.
Are you seeing similar numbers? Is there a webpage you're aware of that lists the best parameters to set for each model?
Model output was better than the Mistral Instruct v0.2 for some of the prompts I tried.
Thanks,
Ash
I'm using CPU inference, so it's slower for me. But as a rough guide it should be around the same speed as a 13B model.
Is there a webpage you're aware of that lists the best parameters to set for each model?
Almost all of the parameters should be automatically set (they're baked into the GGUF file).
The GPU layer count I don't know much about. As I understand it you just have to experiment to see how many layers you can fit and what speedup it gets you.
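To make that experiment a little less blind, here is a hypothetical back-of-envelope estimate (not an LLamaSharp API): assume each of the model's layers takes an equal share of the GGUF file size, reserve some VRAM for the KV cache and scratch buffers, and offload whatever fits in the remainder. The figures in the example (a ~32 GB quant with 32 layers on a 24 GB card) are illustrative assumptions, not measurements from this thread.

```csharp
using System;

class GpuLayerEstimate
{
    // Rough heuristic: layers that fit = usable VRAM / per-layer size,
    // capped at the model's total layer count. Real usage varies with
    // context size and quantization, so treat the result as a starting point.
    public static int EstimateGpuLayers(double modelGb, int totalLayers,
                                        double vramGb, double reserveGb)
    {
        double perLayerGb = modelGb / totalLayers;
        double usableGb = Math.Max(0.0, vramGb - reserveGb);
        return Math.Min(totalLayers, (int)(usableGb / perLayerGb));
    }

    static void Main()
    {
        // e.g. a ~32 GB quant, 32 layers, 24 GB card, 4 GB reserved
        Console.WriteLine(EstimateGpuLayers(32.0, 32, 24.0, 4.0));  // prints 20
    }
}
```

Start a few layers below the estimate and step up until loading fails or generation slows down.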
Thanks @martindevans
As you said, the GPU layer count setting is more of a "try it and see how many layers you can fit on your GPU" thing :-)
v0.9.1 added support for Mixtral, so I'll close this issue now.