
How to do Model Quantization? · cogvlm · 20 comments · closed

thudm commented on August 24, 2024
How to do Model Quantization?

Comments (20)

1049451037 commented on August 24, 2024

We now support 4-bit quantization! See README for more details.
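
For reference, the 4-bit path is presumably invoked the same way as the --quant 8 command shown later in this thread, i.e. something like the line below; the exact dtype flag to pair with --quant 4 is whatever the README specifies, so treat this as a sketch rather than the official command:

python cli_demo.py --from_pretrained cogvlm-chat --version chat --english --bf16 --quant 4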

aisensiy commented on August 24, 2024

But --fp16 shows another error message and the process crashes:

Floating point exception (core dumped)

1049451037 commented on August 24, 2024

It works on two 3090s with model parallelism.

1049451037 commented on August 24, 2024

It should work. If something goes wrong, feel free to post here.

aisensiy commented on August 24, 2024

It seems that quantizing part of the model shows an error like this during inference:

error message expected scalar type BFloat16 but found Half

1049451037 commented on August 24, 2024

Yeah, I got it. This is a bug in cpm_kernels, which we cannot control... You can avoid it by changing --bf16 to --fp16 when running the code.

1049451037 commented on August 24, 2024

emm... Could you post more detailed error info?

aisensiy commented on August 24, 2024

This is all the error info from my terminal:

$ python web_demo.py --from_pretrained cogvlm-chat --version chat --english --fp16 --quant 8

[2023-10-11 10:58:42,487] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-10-11 10:58:47,433] [INFO] building CogVLMModel model ...
[2023-10-11 10:58:47,437] [INFO] [RANK 0] > initializing model parallel with size 1
[2023-10-11 10:58:47,438] [INFO] [RANK 0] You are using model-only mode.
For torch.distributed users or loading model parallel models, set environment variables RANK, WORLD_SIZE and LOCAL_RANK.
[2023-10-11 10:59:00,976] [INFO] [RANK 0]  > number of parameters on model parallel rank 0: 17639685376
[2023-10-11 10:59:08,248] [INFO] [RANK 0] global rank 0 is loading checkpoint /output/sat/cogvlm-chat/1/mp_rank_00_model_states.pt
[2023-10-11 11:00:53,805] [INFO] [RANK 0] > successfully loaded /output/sat/cogvlm-chat/1/mp_rank_00_model_states.pt
[2023-10-11 11:00:55,647] [INFO] [RANK 0] > Quantizing model weight to 8 bits
[2023-10-11 11:00:55,699] [INFO] [RANK 0] > Quantized 5033164800 parameters in total.
web_demo.py:168: GradioDeprecationWarning: 'scale' value should be an integer. Using 4.5 will cause issues.
  with gr.Column(scale=4.5):
web_demo.py:182: GradioDeprecationWarning: 'scale' value should be an integer. Using 5.5 will cause issues.
  with gr.Column(scale=5.5):
web_demo.py:183: GradioDeprecationWarning: The `style` method is deprecated. Please set these arguments in the constructor instead.
  result_text = gr.components.Chatbot(label='Multi-round conversation History', value=[("", "Hi, What do you want to know about this image?")]).style(height=550)
3.47.1
3.47.1
Running on local URL:  http://0.0.0.0:8080

To create a public link, set `share=True` in `launch()`.
history []
error message 'NoneType' object has no attribute 'read'
history []
Floating point exception (core dumped)

1049451037 commented on August 24, 2024

I tested... It's also because of cpm_kernels. They do not support some operations in our model... Therefore, quantization with cpm_kernels is not supported for now.

aisensiy commented on August 24, 2024

OK, and quantizing only the LM part does not seem to shrink memory usage by much anyway... so that is acceptable...

Here is a screenshot using bf16:

[screenshot: GPU memory usage with bf16]

This is really huge memory usage... is it possible to make it work on a 4090 in the future?
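
As a rough back-of-the-envelope, using the parameter count of 17,639,685,376 reported in the log above and counting weights only (activations and the KV cache come on top), the weights alone need about 33 GiB in bf16, which is why a 24 GB 4090 cannot hold them, while 8-bit or 4-bit weights would fit:

# Weight-only memory estimate for the parameter count in the log above.
params = 17_639_685_376
for dtype, bytes_per_param in [("bf16/fp16", 2), ("int8", 1), ("int4", 0.5)]:
    print(f"{dtype}: {params * bytes_per_param / 2**30:.1f} GiB")
# bf16/fp16: 32.9 GiB, int8: 16.4 GiB, int4: 8.2 GiB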

aisensiy commented on August 24, 2024

I tested... It's also because of cpm_kernels. They do not support some operations in our model... Therefore, quantization with cpm_kernels is not supported for now.

So should I just remove the quantization-related code for now? Or wait for some progress?

miandai commented on August 24, 2024

@aisensiy Try adding a line of code to web_demo.py:

[screenshots showing the suggested change and the result]

1049451037 commented on August 24, 2024

Yes, I think you can remove the quantization-related code in this repo. If quantization is necessary for you, maybe you can try bitsandbytes to quantize our model.

aisensiy commented on August 24, 2024

Yes, I think you can remove the quantization-related code in this repo. If quantization is necessary for you, maybe you can try bitsandbytes to quantize our model.

You mean quantizing the full model is possible? (I know little about this stuff)

1049451037 commented on August 24, 2024

Yes, but it depends on CUDA kernel support. cpm_kernels is missing some implementations, for bf16 and for slicing fp16. I'm not sure whether bitsandbytes works.

Theoretically, quantizing everything is possible. But in practice, some packages may have bugs.
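
For anyone who wants to try the bitsandbytes route, here is a minimal sketch that swaps every nn.Linear in a module tree for bnb.nn.Linear8bitLt (a more targeted version of this, applied only to the GLU module, appears later in the thread). The helper name replace_linear_with_8bit is made up here, and whether the quantized model actually runs correctly still depends on the kernel-support caveats discussed above:

import torch.nn as nn
import bitsandbytes as bnb

def replace_linear_with_8bit(module, threshold=6.0):
    """Recursively replace nn.Linear layers with 8-bit bitsandbytes layers."""
    for name, child in module.named_children():
        if isinstance(child, nn.Linear):
            int8_layer = bnb.nn.Linear8bitLt(
                child.in_features,
                child.out_features,
                bias=child.bias is not None,
                has_fp16_weights=False,
                threshold=threshold,
            )
            # Copy the original weights; they are quantized lazily when the
            # module is moved to the GPU.
            int8_layer.weight.data = child.weight.data
            if child.bias is not None:
                int8_layer.bias.data = child.bias.data
            setattr(module, name, int8_layer)
        else:
            replace_linear_with_8bit(child, threshold)
    return module

# Usage sketch: load the checkpoint on CPU first, then
# model = replace_linear_with_8bit(model).cuda()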

Blankit commented on August 24, 2024

It works on two 3090s with model parallelism.

How do you set that up?

1049451037 commented on August 24, 2024
torchrun --standalone --nnodes=1 --nproc-per-node=2 cli_demo.py --from_pretrained cogvlm-chat --version chat --english --bf16

or

torchrun --standalone --nnodes=1 --nproc-per-node=2 web_demo.py --from_pretrained cogvlm-chat --version chat --english --bf16

Blankit commented on August 24, 2024
torchrun --standalone --nnodes=1 --nproc-per-node=2 cli_demo.py --from_pretrained cogvlm-chat --version chat --english --bf16

or

torchrun --standalone --nnodes=1 --nproc-per-node=2 web_demo.py --from_pretrained cogvlm-chat --version chat --english --bf16

torchrun: error: unrecognized arguments: --nproc-per-node=2

Blankit commented on August 24, 2024
torchrun --standalone --nnodes=1 --nproc-per-node=2 cli_demo.py --from_pretrained cogvlm-chat --version chat --english --bf16

or

torchrun --standalone --nnodes=1 --nproc-per-node=2 web_demo.py --from_pretrained cogvlm-chat --version chat --english --bf16

torchrun: error: unrecognized arguments: --nproc-per-node=2

It was caused by the PyTorch version.
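
Note: older torchrun releases only accept the underscore spelling of this flag, so on those versions the equivalent command should be:

torchrun --standalone --nnodes=1 --nproc_per_node=2 cli_demo.py --from_pretrained cogvlm-chat --version chat --english --bf16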

rahimentezari commented on August 24, 2024

Yes, I think you can remove the quantization-related code in this repo. If quantization is necessary for you, maybe you can try bitsandbytes to quantize our model.

While I wait for quantization support, I would like to try bitsandbytes.
Is this correct, based on what I read from the bitsandbytes docs?
In cogvlm_model.py, change the GLU class linear layers to:

import torch.nn as nn       # already available in cogvlm_model.py
import bitsandbytes as bnb  # new import needed for the 8-bit layers

class GLU(nn.Module):
    def __init__(self, args, in_features):
        super().__init__()
        # self.linear_proj = nn.Linear(in_features, args.hidden_size, bias=False)
        # self.norm1 = nn.LayerNorm(args.hidden_size)
        # self.act1 = nn.GELU()
        # self.act2 = nn.functional.silu
        # self.dense_h_to_4h = nn.Linear(args.hidden_size, args.inner_hidden_size, bias=False)
        # self.gate_proj = nn.Linear(args.hidden_size, args.inner_hidden_size, bias=False)
        # self.dense_4h_to_h = nn.Linear(args.inner_hidden_size, args.hidden_size, bias=False)

        self.linear_proj = bnb.nn.Linear8bitLt(in_features, args.hidden_size, bias=False, has_fp16_weights=False, threshold=6.0)
        self.norm1 = nn.LayerNorm(args.hidden_size)
        self.act1 = nn.GELU()
        self.act2 = nn.functional.silu
        self.dense_h_to_4h = bnb.nn.Linear8bitLt(args.hidden_size, args.inner_hidden_size, bias=False, has_fp16_weights=False, threshold=6.0)
        self.gate_proj = bnb.nn.Linear8bitLt(args.hidden_size, args.inner_hidden_size, bias=False, has_fp16_weights=False, threshold=6.0)
        self.dense_4h_to_h = bnb.nn.Linear8bitLt(args.inner_hidden_size, args.hidden_size, bias=False, has_fp16_weights=False, threshold=6.0)

    def forward(self, x):
        x = self.linear_proj(x)
        x = self.act1(self.norm1(x))
        x = self.act2(self.gate_proj(x)) * self.dense_h_to_4h(x)
        x = self.dense_4h_to_h(x)
        return x

Interestingly, when using has_fp16_weights=False, not only does the caption quality deteriorate a lot, but the time taken to caption images also increases. has_fp16_weights=True takes almost the same time as a normal nn.Linear layer.
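
For anyone who wants to reproduce that timing gap in isolation, here is a minimal, self-contained sketch (the layer sizes are arbitrary, not CogVLM's) that times a plain fp16 nn.Linear against a Linear8bitLt with has_fp16_weights=False:

import time

import torch
import torch.nn as nn
import bitsandbytes as bnb

in_f, out_f, batch = 4096, 4096, 16  # arbitrary sizes for illustration

fp16_layer = nn.Linear(in_f, out_f, bias=False).half().cuda()

int8_layer = bnb.nn.Linear8bitLt(in_f, out_f, bias=False,
                                 has_fp16_weights=False, threshold=6.0)
int8_layer.weight.data = fp16_layer.weight.data.cpu()  # quantized on .cuda()
int8_layer = int8_layer.cuda()

x = torch.randn(batch, in_f, dtype=torch.float16, device="cuda")

def bench(layer, steps=200):
    """Return the average per-call latency in milliseconds."""
    torch.cuda.synchronize()
    start = time.time()
    with torch.no_grad():
        for _ in range(steps):
            layer(x)
    torch.cuda.synchronize()
    return (time.time() - start) / steps * 1e3

print(f"fp16 nn.Linear    : {bench(fp16_layer):.3f} ms")
print(f"int8 Linear8bitLt : {bench(int8_layer):.3f} ms")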
