Comments (11)
This seems to be resolved by #43. I'll close it tomorrow once I've double-checked a few more old models.
I'm seeing this issue with many models as well, loading models directly using auto-gptq (not through the textgen webui).
@SirWaffle Hi, can you try the main branch and see if this problem still exists?
And set strict=False in from_pretrained() on any model that throws this error.
Using the latest from the main branch, I fail to import due to missing Triton; I'm on Windows. The previous version didn't require Triton, it only printed a message if I did not have it installed. I will spend some time and see if I can get Triton compiled manually for Windows and set up, to see if this addresses the issue.
Perhaps, as a temporary mitigation, an alternate load_quantize function should be written. The function would auto-detect the basename based on common model suffixes and guess the quantize_config from the model/folder name. The function could be marked as deprecated, with instructions on how to convert old-format models to the new format without needing to re-quantize.
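For illustration, a compatibility shim along those lines might look roughly like this. The function name and the suffix list are hypothetical, and the from_quantized keyword arguments (model_basename, device, strict) are assumptions based on the auto-gptq API discussed in this thread, not an existing helper:

```python
import os
import warnings

from auto_gptq import AutoGPTQForCausalLM

# Suffixes commonly seen on old GPTQ-for-LLaMa checkpoints (illustrative, not exhaustive).
LEGACY_SUFFIXES = ("4bit-128g", "4bit", "int4", "gptq")


def load_quantize_legacy(model_dir, device="cuda:0"):
    """Compatibility shim: guess the checkpoint basename and load with strict=False."""
    warnings.warn(
        "load_quantize_legacy is a stop-gap; convert old GPTQ-for-LLaMa checkpoints "
        "to the new format instead of relying on it.",
        DeprecationWarning,
    )
    basename = None
    for fname in os.listdir(model_dir):
        stem, ext = os.path.splitext(fname)
        if ext in (".safetensors", ".pt", ".bin") and any(
            suffix in stem.lower() for suffix in LEGACY_SUFFIXES
        ):
            basename = stem
            break
    return AutoGPTQForCausalLM.from_quantized(
        model_dir,
        model_basename=basename,  # None falls back to auto-gptq's own lookup
        device=device,
        strict=False,             # tolerate the older GPTQ-for-LLaMa tensor layout
    )
```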
Hi @Interpause
Unfortunately quantize_config doesn't help us: two models can have identical quantization config and one will work and one will not.
This is because there was a silent change in the GPTQ format. Models produced with older GPTQ-for-LLaMa code will throw the error described. There is no obvious indication as to whether a model will fail or not.
But we do already have a solution for this - we can pass strict=False to AutoGPTQForCausalLM.from_quantized() and then both old and new format models will load. That is the current recommended solution for this issue.
We have been discussing whether strict=False should therefore be the default. I think perhaps it should be, although PanQiWei has been explaining that it might have implications in multi-GPU inference scenarios. So that needs to be tested.
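For reference, the recommended workaround looks roughly like this. The path and the device/use_triton arguments are placeholders; strict=False is the part the thread recommends:

```python
from auto_gptq import AutoGPTQForCausalLM

model = AutoGPTQForCausalLM.from_quantized(
    "path/to/old-gptq-model",   # folder with the old GPTQ-for-LLaMa checkpoint
    device="cuda:0",
    use_triton=False,           # sidesteps the Triton dependency mentioned below
    strict=False,               # accept checkpoints produced by older GPTQ code
)
```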
I'm having a bit of an issue getting Triton to work on Windows. Is there any hacky thing I can do to remove the reliance on Triton from the main branch, or is Triton now a required package that can't be disentangled easily?
Using the change here: #85, I can confirm the older models load with strict=False, without Triton issues under Windows.
I can confirm the newly merged change fixed the issue with the strict=False setting on model load. I would say set it as the default; most of my GPTQ models would simply not load without it.
What if the model name & version could be specified within quantize_config.json? The first would solve auto-loading models with custom names, and the second could be used to automatically determine if strict=False or other compatibility measures are needed. Maybe use version: -1 for GPTQ-for-LLaMa quantized models to automatically turn on strict=False.
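A rough sketch of how that proposal could work. The model_name and version fields are hypothetical, not part of auto-gptq's actual quantize_config.json, and the loader below is only an illustration:

```python
import json
import os

from auto_gptq import AutoGPTQForCausalLM


def load_with_config_hints(model_dir, device="cuda:0"):
    # Read the (hypothetical) extra fields from quantize_config.json.
    with open(os.path.join(model_dir, "quantize_config.json")) as f:
        cfg = json.load(f)

    basename = cfg.get("model_name")       # proposed: custom checkpoint basename
    legacy = cfg.get("version", 0) == -1   # proposed: -1 marks GPTQ-for-LLaMa output

    return AutoGPTQForCausalLM.from_quantized(
        model_dir,
        model_basename=basename,
        device=device,
        strict=not legacy,  # legacy models automatically load with strict=False
    )
```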
Related Issues (20)
- TypeError: 'NoneType' object is not subscriptable when inferencing HOT 1
- Unloading the quantized Qwen (Tongyi Qianwen) model does not release GPU memory
- [BUG] qwen-14B int8 inference slow
- install auto-gptq error HOT 1
- [BUG]
- Deploying the AutoGPTQ-quantized Qwen-7B-Chat-Int4 raises RuntimeError: "addmm_impl_cpu_" not implemented for 'Half'
- Fail compile autogptq in ppc64le rhel8 HOT 1
- [BUG]RuntimeError: The temp_state buffer is too small in the exllama backend for GPTQ with act-order.
- [FEATURE] Quantization of the Language Model Pedestal for LLAVA Multimodal Models
- [BUG]ValueError: Tokenizer class BaichuanTokenizer does not exist or is not currently imported.
- [BUG] Qwen-14B-Chat-Int4 GPTQ model is slower than original model Qwen-14B-Chat greatly
- Dequantize to fp16?
- GPTQ LoRA Training is not working on me HOT 2
- [BUG] Rocm can not compile, error: no viable conversion from '__half' to '__fp16' HOT 3
- LLaMa 2 perplexity eval error: 'Cache only has 0 layers, attempted to access layers with index 0' HOT 1
- loss is high and Inference result is incorrect
- Inference speed is 4x slower than full fp16 model when group size is enabled
- [FEATURE] Fast AWQ/Marlin repacking HOT 3
- [BUG]Tokenizer class QWenTokenizer does not exist or is not currently imported.
- question : HOT 1