I'm not sure whether I'm the only one or not, I used this to quantize two mod

the model quantized is not performant about autogptq HOT 3 CLOSED

autogptq commented on May 18, 2024

The quantized model is not performant.

from autogptq.

Comments (3)

qwopqwop200 commented on May 18, 2024

This was implemented inefficiently due to the complexity of implementing act-order and groupsize at the same time. This is also why I recommend triton in general.
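For readers who want to try the Triton path recommended above, here is a minimal sketch. It assumes the `auto_gptq` package is installed and that `AutoGPTQForCausalLM.from_quantized` accepts `device` and `use_triton` arguments; argument names may differ between AutoGPTQ versions. The helper name `load_quantized` and the model directory are hypothetical and do not come from this thread.

```python
def load_quantized(model_dir: str, use_triton: bool = True, device: str = "cuda:0"):
    """Load an already-quantized GPTQ model, optionally with the Triton kernels.

    Assumption: auto_gptq is installed and exposes
    AutoGPTQForCausalLM.from_quantized(..., device=..., use_triton=...).
    """
    # Imported lazily so this file can be imported without auto_gptq or a GPU.
    from auto_gptq import AutoGPTQForCausalLM

    # With use_triton=True the whole model is placed on the GPU,
    # so it has to fit in VRAM.
    return AutoGPTQForCausalLM.from_quantized(
        model_dir,
        device=device,
        use_triton=use_triton,
    )


# Usage (hypothetical path to a directory holding a quantized model):
# model = load_quantized("path/to/quantized-model")
```

Passing `use_triton=False` falls back to the default CUDA path discussed in this issue.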

cxfcxf commented on May 18, 2024

> This was implemented inefficiently due to the complexity of implementing act-order and groupsize at the same time. This is also why I recommend triton in general.

Thanks, that actually solved my problem. It seems triton moves the whole model to VRAM, so it makes sense that it's faster. I was not aware that the default CUDA version uses VRAM + DRAM; no wonder it's slow.

I was working on an embedding project, and being able to load a large model in a small amount of VRAM really helped, since most people would not want to feed sensitive data to an OpenAI model.
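Since the Triton path needs the whole model in VRAM, a rough back-of-the-envelope check can help before switching to it. The sketch below is not from AutoGPTQ; the function names and the 0.8 headroom factor are assumptions for illustration. It counts packed weights only (parameters × bits / 8) and ignores group-wise scales and zero points, activations, and the KV cache, so real usage will be higher.

```python
def quantized_weight_gib(n_params: float, bits: int = 4) -> float:
    """Weights-only size of a quantized model in GiB (a lower bound)."""
    total_bytes = n_params * bits / 8
    return total_bytes / 1024**3


def fits_in_vram(n_params: float, vram_gib: float, bits: int = 4,
                 headroom: float = 0.8) -> bool:
    """Return True if the packed weights fit in a fraction of the VRAM.

    headroom is an arbitrary safety factor (assumption), leaving room
    for everything this estimate does not count.
    """
    return quantized_weight_gib(n_params, bits) <= vram_gib * headroom


# Example: a 13B-parameter model at 4 bits against an 8 GiB card.
print(round(quantized_weight_gib(13e9, 4), 2), fits_in_vram(13e9, 8))
```

If the check fails, the default CUDA path, which can keep part of the model in DRAM, may be the only option, at the cost of speed.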

btw, there may be a typo in the warning message I get when I try to load it:
WARNING - use_triton will force moving the hole model to GPU, make sure you have enough VRAM.
This means "whole", right?

TheBloke commented on May 18, 2024

> btw, there may be a typo in the warning message I get when I try to load it:
> WARNING - use_triton will force moving the hole model to GPU, make sure you have enough VRAM.
> This means "whole", right?

It does, and I've just pushed a PR to fix the typo: #40
