
How to do Model Quantization? · cogvlm · 20 comments · closed

thudm commented on August 24, 2024
How to do Model Quantization?

Comments (20)

1049451037 commented on August 24, 2024

We now support 4-bit quantization! See README for more details.
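
For reference, the 4-bit path is presumably invoked the same way as the --quant 8 command shown later in this thread, i.e. something like the line below; the exact dtype flag to pair with --quant 4 is whatever the README specifies, so treat this as a sketch rather than the official command:

python cli_demo.py --from_pretrained cogvlm-chat --version chat --english --bf16 --quant 4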

aisensiy commented on August 24, 2024

But --fp16 shows another error message and the process crashes:

Floating point exception (core dumped)

1049451037 commented on August 24, 2024

It works on two 3090s with model parallelism.

1049451037 commented on August 24, 2024

It should work. If something goes wrong, feel free to post here.

aisensiy commented on August 24, 2024

It seems that quantizing part of the model shows an error like this during inference:

error message expected scalar type BFloat16 but found Half

1049451037 commented on August 24, 2024

Yeah, I got it. This is a bug in cpm_kernels, which we cannot control... You can avoid it by changing --bf16 to --fp16 when running the code.

1049451037 commented on August 24, 2024

emm... Could you post more detailed error info?

aisensiy commented on August 24, 2024

This is all the error info from my terminal:

$ python web_demo.py --from_pretrained cogvlm-chat --version chat --english --fp16 --quant 8

[2023-10-11 10:58:42,487] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-10-11 10:58:47,433] [INFO] building CogVLMModel model ...
[2023-10-11 10:58:47,437] [INFO] [RANK 0] > initializing model parallel with size 1
[2023-10-11 10:58:47,438] [INFO] [RANK 0] You are using model-only mode.
For torch.distributed users or loading model parallel models, set environment variables RANK, WORLD_SIZE and LOCAL_RANK.
[2023-10-11 10:59:00,976] [INFO] [RANK 0]  > number of parameters on model parallel rank 0: 17639685376
[2023-10-11 10:59:08,248] [INFO] [RANK 0] global rank 0 is loading checkpoint /output/sat/cogvlm-chat/1/mp_rank_00_model_states.pt
[2023-10-11 11:00:53,805] [INFO] [RANK 0] > successfully loaded /output/sat/cogvlm-chat/1/mp_rank_00_model_states.pt
[2023-10-11 11:00:55,647] [INFO] [RANK 0] > Quantizing model weight to 8 bits
[2023-10-11 11:00:55,699] [INFO] [RANK 0] > Quantized 5033164800 parameters in total.
web_demo.py:168: GradioDeprecationWarning: 'scale' value should be an integer. Using 4.5 will cause issues.
  with gr.Column(scale=4.5):
web_demo.py:182: GradioDeprecationWarning: 'scale' value should be an integer. Using 5.5 will cause issues.
  with gr.Column(scale=5.5):
web_demo.py:183: GradioDeprecationWarning: The `style` method is deprecated. Please set these arguments in the constructor instead.
  result_text = gr.components.Chatbot(label='Multi-round conversation History', value=[("", "Hi, What do you want to know about this image?")]).style(height=550)
3.47.1
3.47.1
Running on local URL:  http://0.0.0.0:8080

To create a public link, set `share=True` in `launch()`.
history []
error message 'NoneType' object has no attribute 'read'
history []
Floating point exception (core dumped)

1049451037 commented on August 24, 2024

I tested... It's also because of cpm_kernels. They do not support some operations in our model... Therefore, quantization with cpm_kernels is not supported for now.

aisensiy commented on August 24, 2024

OK, and quantizing only the LM part does not seem to shrink memory usage by much anyway... so that is acceptable...

Here is a screenshot using bf16:

[screenshot: GPU memory usage with bf16]

This is really huge memory usage... is it possible to make it work on a 4090 in the future?
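
As a rough back-of-the-envelope, using the parameter count of 17,639,685,376 reported in the log above and counting weights only (activations and the KV cache come on top), the weights alone need about 33 GiB in bf16, which is why a 24 GB 4090 cannot hold them, while 8-bit or 4-bit weights would fit:

# Weight-only memory estimate for the parameter count in the log above.
params = 17_639_685_376
for dtype, bytes_per_param in [("bf16/fp16", 2), ("int8", 1), ("int4", 0.5)]:
    print(f"{dtype}: {params * bytes_per_param / 2**30:.1f} GiB")
# bf16/fp16: 32.9 GiB, int8: 16.4 GiB, int4: 8.2 GiB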

aisensiy commented on August 24, 2024

I tested... It's also because of cpm_kernels. They do not support some operations in our model... Therefore, quantization with cpm_kernels is not supported for now.

So should I just remove the quantization-related code for now? Or wait for some progress?

miandai commented on August 24, 2024

@aisensiy Try adding a line of code to web_demo.py:

[screenshots showing the suggested change and the result]

1049451037 commented on August 24, 2024

Yes, I think you can remove the quantization-related code in this repo. If quantization is necessary for you, maybe you can try bitsandbytes to quantize our model.

aisensiy commented on August 24, 2024

Yes, I think you can remove the quantization-related code in this repo. If quantization is necessary for you, maybe you can try bitsandbytes to quantize our model.

You mean quantizing the full model is possible? (I know little about this stuff)

1049451037 commented on August 24, 2024

Yes, but it depends on CUDA kernel support. cpm_kernels is missing some implementations, for bf16 and for slicing fp16. I'm not sure whether bitsandbytes works.

Theoretically, quantizing everything is possible. But in practice, some packages may have bugs.
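
For anyone who wants to try the bitsandbytes route, here is a minimal sketch that swaps every nn.Linear in a module tree for bnb.nn.Linear8bitLt (a more targeted version of this, applied only to the GLU module, appears later in the thread). The helper name replace_linear_with_8bit is made up here, and whether the quantized model actually runs correctly still depends on the kernel-support caveats discussed above:

import torch.nn as nn
import bitsandbytes as bnb

def replace_linear_with_8bit(module, threshold=6.0):
    """Recursively replace nn.Linear layers with 8-bit bitsandbytes layers."""
    for name, child in module.named_children():
        if isinstance(child, nn.Linear):
            int8_layer = bnb.nn.Linear8bitLt(
                child.in_features,
                child.out_features,
                bias=child.bias is not None,
                has_fp16_weights=False,
                threshold=threshold,
            )
            # Copy the original weights; they are quantized lazily when the
            # module is moved to the GPU.
            int8_layer.weight.data = child.weight.data
            if child.bias is not None:
                int8_layer.bias.data = child.bias.data
            setattr(module, name, int8_layer)
        else:
            replace_linear_with_8bit(child, threshold)
    return module

# Usage sketch: load the checkpoint on CPU first, then
# model = replace_linear_with_8bit(model).cuda()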

Blankit commented on August 24, 2024

It works on two 3090s with model parallelism.

How do you set that up?

1049451037 commented on August 24, 2024
torchrun --standalone --nnodes=1 --nproc-per-node=2 cli_demo.py --from_pretrained cogvlm-chat --version chat --english --bf16

or

torchrun --standalone --nnodes=1 --nproc-per-node=2 web_demo.py --from_pretrained cogvlm-chat --version chat --english --bf16

Blankit commented on August 24, 2024
torchrun --standalone --nnodes=1 --nproc-per-node=2 cli_demo.py --from_pretrained cogvlm-chat --version chat --english --bf16

or

torchrun --standalone --nnodes=1 --nproc-per-node=2 web_demo.py --from_pretrained cogvlm-chat --version chat --english --bf16

torchrun: error: unrecognized arguments: --nproc-per-node=2

Blankit commented on August 24, 2024
torchrun --standalone --nnodes=1 --nproc-per-node=2 cli_demo.py --from_pretrained cogvlm-chat --version chat --english --bf16

or

torchrun --standalone --nnodes=1 --nproc-per-node=2 web_demo.py --from_pretrained cogvlm-chat --version chat --english --bf16

torchrun: error: unrecognized arguments: --nproc-per-node=2

It was caused by the PyTorch version.
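
Note: older torchrun releases only accept the underscore spelling of this flag, so on those versions the equivalent command should be:

torchrun --standalone --nnodes=1 --nproc_per_node=2 cli_demo.py --from_pretrained cogvlm-chat --version chat --english --bf16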

rahimentezari commented on August 24, 2024

Yes, I think you can remove the quantization-related code in this repo. If quantization is necessary for you, maybe you can try bitsandbytes to quantize our model.

While I wait for quantization support, I would like to try bitsandbytes.
Is this correct, based on what I read from the bitsandbytes docs?
In cogvlm_model.py, change the GLU class linear layers to:

import torch.nn as nn       # already available in cogvlm_model.py
import bitsandbytes as bnb  # new import needed for the 8-bit layers

class GLU(nn.Module):
    def __init__(self, args, in_features):
        super().__init__()
        # self.linear_proj = nn.Linear(in_features, args.hidden_size, bias=False)
        # self.norm1 = nn.LayerNorm(args.hidden_size)
        # self.act1 = nn.GELU()
        # self.act2 = nn.functional.silu
        # self.dense_h_to_4h = nn.Linear(args.hidden_size, args.inner_hidden_size, bias=False)
        # self.gate_proj = nn.Linear(args.hidden_size, args.inner_hidden_size, bias=False)
        # self.dense_4h_to_h = nn.Linear(args.inner_hidden_size, args.hidden_size, bias=False)

        self.linear_proj = bnb.nn.Linear8bitLt(in_features, args.hidden_size, bias=False, has_fp16_weights=False, threshold=6.0)
        self.norm1 = nn.LayerNorm(args.hidden_size)
        self.act1 = nn.GELU()
        self.act2 = nn.functional.silu
        self.dense_h_to_4h = bnb.nn.Linear8bitLt(args.hidden_size, args.inner_hidden_size, bias=False, has_fp16_weights=False, threshold=6.0)
        self.gate_proj = bnb.nn.Linear8bitLt(args.hidden_size, args.inner_hidden_size, bias=False, has_fp16_weights=False, threshold=6.0)
        self.dense_4h_to_h = bnb.nn.Linear8bitLt(args.inner_hidden_size, args.hidden_size, bias=False, has_fp16_weights=False, threshold=6.0)

    def forward(self, x):
        x = self.linear_proj(x)
        x = self.act1(self.norm1(x))
        x = self.act2(self.gate_proj(x)) * self.dense_h_to_4h(x)
        x = self.dense_4h_to_h(x)
        return x

Interestingly, when using has_fp16_weights=False, not only does the caption quality deteriorate a lot, but the time taken to caption images also increases. has_fp16_weights=True takes almost the same time as a normal nn.Linear layer.
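
For anyone who wants to reproduce that timing gap in isolation, here is a minimal, self-contained sketch (the layer sizes are arbitrary, not CogVLM's) that times a plain fp16 nn.Linear against a Linear8bitLt with has_fp16_weights=False:

import time

import torch
import torch.nn as nn
import bitsandbytes as bnb

in_f, out_f, batch = 4096, 4096, 16  # arbitrary sizes for illustration

fp16_layer = nn.Linear(in_f, out_f, bias=False).half().cuda()

int8_layer = bnb.nn.Linear8bitLt(in_f, out_f, bias=False,
                                 has_fp16_weights=False, threshold=6.0)
int8_layer.weight.data = fp16_layer.weight.data.cpu()  # quantized on .cuda()
int8_layer = int8_layer.cuda()

x = torch.randn(batch, in_f, dtype=torch.float16, device="cuda")

def bench(layer, steps=200):
    """Return the average per-call latency in milliseconds."""
    torch.cuda.synchronize()
    start = time.time()
    with torch.no_grad():
        for _ in range(steps):
            layer(x)
    torch.cuda.synchronize()
    return (time.time() - start) / steps * 1e3

print(f"fp16 nn.Linear    : {bench(fp16_layer):.3f} ms")
print(f"int8 Linear8bitLt : {bench(int8_layer):.3f} ms")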
