Comments (5)
Found Intel's results from a quantization test of the lm-head. There was minimal accuracy loss:
@wenhuach21 Do you know how much ram/vram the intel llama3-8B lm_head quantized test saved vs non-quantized? Here is the untested branch that allows loading of a quantized lm-head that I plan to test: https://github.com/Qubitium/AutoGPTQ/tree/sym-false-lm-head combined with intel/auto-round#87
| Metric | BF16 | w4g128 w/o lm-head | w4g128 with lm-head qdq |
|---|---|---|---|
| Avg. | 0.6352 | 0.6312 | 0.6303 |
| mmlu | 0.6386 | 0.6306 | 0.6318 |
| winogrande | 0.7143 | 0.7238 | 0.7269 |
| truthfulqa_mc1 | 0.3623 | 0.3537 | 0.3525 |
| rte | 0.6751 | 0.6859 | 0.6679 |
| piqa | 0.7867 | 0.7797 | 0.7802 |
| openbookqa | 0.3400 | 0.3300 | 0.3320 |
| lambada_openai | 0.7182 | 0.7200 | 0.7173 |
| hellaswag | 0.5769 | 0.5699 | 0.5701 |
| boolq | 0.8297 | 0.8309 | 0.8284 |
| arc_easy | 0.8152 | 0.8089 | 0.8106 |
| arc_challenge | 0.5299 | 0.5102 | 0.5154 |
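For readers unfamiliar with the "qdq" column: it refers to quantize-dequantize evaluation, where weights are rounded to a low-bit grid per group and then mapped back to floats before running the model. A minimal sketch of an asymmetric group scheme, using my own illustrative function (not Intel's or AutoGPTQ's code), with the bit width and group size matching the w4g128 setting above:

```python
# Sketch of quantize-dequantize ("qdq"): round each weight to a 4-bit grid
# computed per group of 128 values, then map it back to a float.
def qdq(weights, bits=4, group_size=128):
    qmax = (1 << bits) - 1  # 15 levels above zero for 4-bit
    out = []
    for i in range(0, len(weights), group_size):
        group = weights[i:i + group_size]
        lo, hi = min(group), max(group)
        scale = (hi - lo) / qmax or 1.0      # guard against a constant group
        for w in group:
            q = round((w - lo) / scale)      # quantize to an int in [0, 15]
            out.append(q * scale + lo)       # dequantize back to float
    return out

w = [0.1 * i for i in range(256)]            # two groups of 128 toy weights
w_qdq = qdq(w)
err = max(abs(a - b) for a, b in zip(w, w_qdq))
print(f"max abs error: {err:.4f}")
```

The rounding error is bounded by half the group's scale, which is why the benchmark deltas above stay small.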
What I know is the model size at W4G128: 5.4G without the lm-head quantized, 4.7G with it.
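That 0.7G gap roughly matches a back-of-envelope estimate from llama3-8B's lm_head shape (128256 vocab x 4096 hidden). The arithmetic below is my own, under a simple w4g128 packing assumption (packed 4-bit weights plus an fp16 scale and a packed 4-bit zero point per group of 128), not a figure from the thread:

```python
# Rough estimate of the size saved by quantizing llama3-8B's lm_head
# from BF16 to w4g128 (assumed packing; real checkpoints add metadata).
vocab_size, hidden_size = 128256, 4096
params = vocab_size * hidden_size

bf16_bytes = params * 2                      # 2 bytes per BF16 weight
groups = params // 128                       # one scale/zero per 128 weights
w4_bytes = params // 2 + groups * 2 + groups // 2  # packed 4-bit + fp16 scales + 4-bit zeros

saved_gb = (bf16_bytes - w4_bytes) / 1e9
print(f"approx. saving: {saved_gb:.2f} GB")  # → approx. saving: 0.78 GB
```

Close to the observed 5.4G vs 4.7G difference, so nearly all of the gap is the lm_head weights themselves.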
Additionally, if act-order is not enabled or static groups are enabled, could AutoGPTQ refrain from dumping the group index into the quantized model, thus conserving some resources?
from autogptq.
@XeonKHJ Good question. I will test this tomorrow with intel/auto-round, which does offer the ability to quantize the lm-head. If there are no inference issues post-quantization, I will add it as an option in a new PR.
#648 can now load a quantized lm_head from intel/auto-round, but autogptq quantization of the lm-head is still in progress.
Additionally, if static grouping is not enabled, could AutoGPTQ refrain from dumping the group index into the quantized model, thus conserving some resources?
This is beyond my abilities right now. @fxmarty @LaaZa
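For context on why the group index is redundant here: without act-order reordering, `g_idx` is just the trivial mapping `i // group_size`, so a loader could reconstruct it instead of reading it from the checkpoint. A minimal illustration (my own sketch, not AutoGPTQ code):

```python
# Without act-order, the stored g_idx is the trivial mapping i // group_size,
# so it can be recomputed at load time instead of being saved. With act-order,
# channels are processed in a permuted order and g_idx must be kept.
def trivial_g_idx(in_features, group_size):
    return [i // group_size for i in range(in_features)]

stored = trivial_g_idx(4096, 128)          # what a no-act-order checkpoint saves
rebuilt = [i // 128 for i in range(4096)]  # what a loader could recompute
print("g_idx is reconstructible:", stored == rebuilt)  # → True
```

This is why the request above is scoped to the no-act-order / static-group case: only then does dropping `g_idx` lose nothing.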
Related Issues (20)
- [PR Ready for Review] [FEATURE] Extend Support for Phi-3
- [FEATURE] Backport vllm expanded Marlin kernel to autogptq.
- [DEPRECATION] Discussion on Fused attention and QiGEN
- Llama-3 8B Instruct quantized to 8 Bit spits out gibberish in transformers `model.generate()` but works fine in vLLM?
- [BUG] safetensors_rust.SafetensorError: Error while deserializing header: MetadataIncompleteBuffer
- [Question] Differences in quantization logic compared to AWQ
- [FEATURE] ADD SUPPORT DeepSeek-V2
- [BUG] ARM installation error
- [BUG] ROCm installation and building broken
- Target modules [] not found in the base model. Please check the target modules and try again.
- [BUG] Cannot install from source
- [BUG] Following the quant_with_alpaca.py example but keep getting "You shouldn't move a model that is dispatched using accelerate hooks." and the model is never saved.
- [FEATURE] Models that support MOE do GPTQ
- [FEATURE] Add marlin24 support
- How to select between different kernels?
- Question about data shape difference between quantization and forward
- [FEATURE] Added code support to 5,6,7 bits quantization can you please add me as contributor I will create a new pull request
- [BUG] Quantitative model Yi-1.5-9b-16K does not produce text output.
- How to install auto-gptq in GCC 8.5.0 environment?
- How to get a dequantized model?