mit-han-lab / smoothquant

[ICML 2023] SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models

Home Page: https://arxiv.org/abs/2211.10438

License: MIT License


smoothquant's Introduction

SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models

[paper] [slides] [video]

intuition

News

  • [2024/03] We show SmoothQuant can enable W8A8 quantization for Llama-1/2/3, Falcon, Mistral, and Mixtral models with negligible loss. Results.
  • [2023/10] SmoothQuant is integrated into NVIDIA TensorRT-LLM.
  • [2023/03] SmoothQuant is integrated into Intel Neural Compressor.

Abstract

Large language models (LLMs) show excellent performance but are compute- and memory-intensive. Quantization can reduce memory and accelerate inference. However, for LLMs beyond 100 billion parameters, existing methods cannot maintain accuracy or do not run efficiently on hardware. We propose SmoothQuant, a training-free, accuracy-preserving, and general-purpose post-training quantization (PTQ) solution to enable 8-bit weight, 8-bit activation (W8A8) quantization for LLMs. Based on the fact that weights are easy to quantize while activations are not, SmoothQuant smooths the activation outliers by offline migrating the quantization difficulty from activations to weights with a mathematically equivalent transformation. SmoothQuant enables an INT8 quantization of both weights and activations for all the matrix multiplications in LLMs, including OPT-175B, BLOOM-176B, GLM-130B, and MT-NLG 530B. SmoothQuant has better hardware efficiency than existing techniques. We demonstrate up to 1.56x speedup and 2x memory reduction for LLMs with negligible loss in accuracy. We integrate SmoothQuant into FasterTransformer, a state-of-the-art LLM serving framework, and achieve faster inference speed with half the number of GPUs compared to FP16, enabling the serving of a 530B LLM within a single node. Our work offers a turn-key solution that reduces hardware costs and democratizes LLMs.

Installation

conda create -n smoothquant python=3.8
conda activate smoothquant
pip install torch==1.12.1+cu113 torchvision==0.13.1+cu113 torchaudio==0.12.1 --extra-index-url https://download.pytorch.org/whl/cu113
pip install transformers==4.36.0 accelerate datasets zstandard

python setup.py install
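
As an optional sanity check, you can confirm that the pinned versions resolved correctly before moving on:

python -c "import torch, transformers; print(torch.__version__, transformers.__version__)"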

Usage

SmoothQuant INT8 Inference for PyTorch

We implement SmoothQuant INT8 inference for PyTorch with CUTLASS INT8 GEMM kernels, which are wrapped as PyTorch modules in torch-int. Please install torch-int before running the SmoothQuant PyTorch INT8 inference.

We implement the quantized OPT model class in smoothquant/opt.py, which uses INT8 linear layers and bundles the quantization scales. We provide already-smoothed and quantized OPT models at https://huggingface.co/mit-han-lab/opt-[MODEL-SIZE]-smoothquant, where [MODEL-SIZE] can be 125m, 1.3b, 2.7b, 6.7b, 13b, 30b, or 66b. You can load the INT8 model with the following code:

from smoothquant.opt import Int8OPTForCausalLM
model = Int8OPTForCausalLM.from_pretrained("mit-han-lab/opt-30b-smoothquant")
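
Once loaded, the INT8 model behaves like a regular Hugging Face causal LM. A minimal generation sketch (assuming the stock facebook/opt-30b tokenizer and a CUDA device; this is illustrative, not code taken from the repo):

# Hypothetical usage sketch: pair the INT8 model loaded above with the OPT tokenizer.
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-30b")  # assumption: stock OPT tokenizer
inputs = tokenizer("SmoothQuant is", return_tensors="pt").to("cuda")

model = model.cuda().eval()  # the Int8OPTForCausalLM instance from the snippet above
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))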

You can also check generate_act_scales.py and export_int8_model.py to see how we smooth, quantize and export INT8 models.
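
At a high level, those scripts load an FP16 model, apply the smoothing transformation using precomputed activation scales, calibrate static per-layer scales, and convert the result to INT8. The sketch below mirrors that flow, but treat the exact helper names and signatures as assumptions and verify them against examples/export_int8_model.py:

# Rough sketch of the smooth-quantize-export flow (names indicative, not authoritative).
import torch
from transformers import AutoTokenizer, OPTForCausalLM
from smoothquant.smooth import smooth_lm
from smoothquant.calibration import get_static_decoder_layer_scales
from smoothquant.opt import Int8OPTForCausalLM

fp16_model = OPTForCausalLM.from_pretrained("facebook/opt-13b", torch_dtype=torch.float16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-13b")

# 1. Smooth: migrate quantization difficulty from activations to weights offline.
act_scales = torch.load("act_scales/opt-13b.pt")
smooth_lm(fp16_model, act_scales, alpha=0.5)

# 2. Calibrate static decoder-layer scales on a few samples, then convert to INT8.
decoder_layer_scales, _ = get_static_decoder_layer_scales(
    fp16_model, tokenizer, "path/to/calibration_dataset.json", num_samples=512, seq_len=512)
int8_model = Int8OPTForCausalLM.from_float(fp16_model, decoder_layer_scales)
int8_model.save_pretrained("opt-13b-smoothquant")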

In examples/smoothquant_opt_real_int8_demo.ipynb, we use the OPT-30B model to demonstrate the latency and memory advantages of SmoothQuant. We demonstrate on OPT-30B because it is the largest model for which we can run both FP16 and INT8 inference on a single A100 GPU. For larger models that require multiple GPUs, we recommend using the FasterTransformer implementation of SmoothQuant.

Activation Channel Scales and Calibration

We provide the activation channel scales for Llama, Mistral, Mixtral, Falcon, OPT, and BLOOM models in act_scales/. We obtained these scales using 512 random sentences from the Pile validation set. You can use the OPT demo (examples/smoothquant_opt_demo.ipynb) and the Llama demo (examples/smoothquant_llama_demo.ipynb) to test smoothing and quantizing these models.

We also provide a script to obtain the activation channel scales for your own models; please refer to examples/generate_act_scales.py. You can use the following command:

python examples/generate_act_scales.py \
    --model-name <model_name_or_path> \
    --output-path <output_act_scales_file_path> \
    --num-samples <num_samples> \
    --seq-len <sequence_length> \
    --dataset-path <path_to_the_calibration_dataset>
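
For example, a plausible invocation for Llama-2-7B with a 512-sample, 512-token calibration setting (the dataset and output paths below are placeholders to adapt to your setup):

python examples/generate_act_scales.py \
    --model-name meta-llama/Llama-2-7b-hf \
    --output-path act_scales/llama-2-7b.pt \
    --num-samples 512 \
    --seq-len 512 \
    --dataset-path /path/to/pile/val.jsonl.zst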

Demo on OPT-13B with W8A8 Fake Quantization

In examples/smoothquant_opt_demo.ipynb, we use OPT-13B as an example to demonstrate that SmoothQuant W8A8 quantization can match the accuracy of FP16 inference, while the naive W8A8 baseline cannot. We simulate INT8 inference with FP16 (smoothquant/fake_quant.py), i.e., fake quantization.
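
Fake quantization keeps all tensors in FP16 but rounds them through an INT8 grid, so accuracy can be studied without INT8 kernels. A minimal per-tensor sketch of the idea (simplified; not the exact code in smoothquant/fake_quant.py):

import torch

@torch.no_grad()
def fake_quantize_per_tensor(x: torch.Tensor, n_bits: int = 8) -> torch.Tensor:
    # Symmetric per-tensor quantization simulated in floating point:
    # map to the signed INT8 grid, round, then map back.
    qmax = 2 ** (n_bits - 1) - 1                    # 127 for INT8
    scale = x.abs().max().clamp(min=1e-8) / qmax    # per-tensor scale
    return (x / scale).round().clamp(-qmax - 1, qmax) * scale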

Perplexity Results on Llama-1/2/3, Falcon, Mistral, and Mixtral with W8A8 Quantization

We provide an evaluation script to evaluate the language modeling perplexity of OPT, BLOOM, Llama, Falcon, Mistral, and Mixtral models with W8A8 simulated quantization. Please refer to smoothquant/ppl_eval.py. You can use the following command to evaluate the models:

python smoothquant/ppl_eval.py \
    --model_path <model_name_or_path> \
    --act_scales_path <act_scales_file_path> \
    --smooth \
    --alpha <alpha> \
    --quantize
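
For example, to reproduce the Llama-2-7B row in the table below (alpha = 0.85; the act_scales path is a placeholder for the corresponding file in act_scales/):

python smoothquant/ppl_eval.py \
    --model_path meta-llama/Llama-2-7b-hf \
    --act_scales_path act_scales/llama-2-7b.pt \
    --smooth \
    --alpha 0.85 \
    --quantize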

Results:

Model          Method    PPL     Alpha
Llama-2-7B     FP16      5.474   -
Llama-2-7B     SQ W8A8   5.515   0.85
Llama-2-13B    FP16      4.950   -
Llama-2-13B    SQ W8A8   4.929   0.85
Llama-2-70B    FP16      3.320   -
Llama-2-70B    SQ W8A8   3.359   0.9
Llama-3-8B     FP16      6.138   -
Llama-3-8B     SQ W8A8   6.258   0.85
Llama-3-70B    FP16      2.857   -
Llama-3-70B    SQ W8A8   2.982   0.85
Mistral-7B     FP16      5.253   -
Mistral-7B     SQ W8A8   5.277   0.8
Mixtral-8x7B   FP16      3.842   -
Mixtral-8x7B   SQ W8A8   3.893   0.8
Falcon-7B      FP16      6.590   -
Falcon-7B      SQ W8A8   6.629   0.6
Falcon-40B     FP16      5.228   -
Falcon-40B     SQ W8A8   5.255   0.7

For measured speedup, we recommend using the NVIDIA TensorRT-LLM implementation of SmoothQuant.

Results

  • SmoothQuant migrates part of the quantization difficulty from activations to weights, which smooths out the systematic outliers in activations and makes both weights and activations easy to quantize (see the sketch after this list).

migrate

  • SmoothQuant can achieve W8A8 quantization of LLMs (e.g., OPT-175B) without degrading performance.

accuracy

  • SmoothQuant achieves faster inference than FP16 when integrated into PyTorch, whereas the previous work LLM.int8() does not lead to acceleration (it is usually slower).

torch_latency_mem

  • We also integrate SmoothQuant into the state-of-the-art serving framework FasterTransformer, achieving faster inference speed with only half the number of GPUs compared to FP16 (1 instead of 2 for OPT-66B, 4 instead of 8 for OPT-175B).

ft_latency_mem
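
For reference, the per-channel smoothing factor from the paper is s_j = max(|X_j|)^alpha / max(|W_j|)^(1-alpha), where alpha controls how much difficulty is migrated. A schematic sketch of applying it to one linear layer (simplified relative to smoothquant/smooth.py):

import torch

@torch.no_grad()
def smooth_linear(act_max: torch.Tensor, linear: torch.nn.Linear, alpha: float = 0.5) -> torch.Tensor:
    # act_max: per-input-channel max |X_j| collected on calibration data.
    weight_max = linear.weight.abs().max(dim=0).values.clamp(min=1e-5)  # max_j |W_j| per input channel
    scales = act_max.clamp(min=1e-5).pow(alpha) / weight_max.pow(1 - alpha)
    # Mathematically equivalent rewrite: X_hat = X / s, W_hat = W * s, so X W^T == X_hat W_hat^T.
    linear.weight.mul_(scales.view(1, -1))
    return scales  # the caller folds 1/s into the producer of X (e.g., the preceding LayerNorm)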

Citation

If you find SmoothQuant useful or relevant to your research, please kindly cite our paper:

@InProceedings{xiao2023smoothquant,
    title = {{S}mooth{Q}uant: Accurate and Efficient Post-Training Quantization for Large Language Models},
    author = {Xiao, Guangxuan and Lin, Ji and Seznec, Mickael and Wu, Hao and Demouth, Julien and Han, Song},
    booktitle = {Proceedings of the 40th International Conference on Machine Learning},
    year = {2023}
}

smoothquant's People

Contributors

guangxuan-xiao, songhan, tonylins


smoothquant's Issues

circular import

Hello everyone, recently I encountered the following issue while installing this code repository. Does anyone know how to resolve it? If you could give me some guidance, I would be extremely grateful.

Traceback (most recent call last):
File "./generate_act_scales.py", line 7, in <module>
import torch
File "/opt/conda/envs/smoothquant/lib/python3.8/site-packages/torch/__init__.py", line 778, in <module>
_C._initExtension(manager_path())
AttributeError: partially initialized module 'torch' has no attribute 'UntypedStorage' (most likely due to a circular import)

Latency calculation for OPT 175B

Thank you for sharing your impressive work. I have a question regarding Figure 8 in the paper where you reported latency measurements for OPT 175B with FasterTransformer FP16 and SmoothQuant. Latency appears to be less than 1ms / token (e.g. 228 ms for a sequence of 256 output tokens), but from our previous experience with FasterTransformer FP16 and 8-way tensor parallelism, latency was often around 40ms / token for a batch size of 4 sentences. The GPT guide from the FasterTransformer repo also doesn't show any numbers near 1ms / token.

Can you please explain a bit more about the setup here? If possible, would you mind sharing your benchmark script? We also look forward to the release of SmoothQuant code as well.

The naive W8A8 quantized model accuracy of a medium-size model (e.g., opt-2.7b)

I tried to use the naive W8A8 method (the quantize_model method only, with dynamic scales) to quantize a 2.9B GPT model, and found that the PPL is 15.1, which is close to the FP16 PPL (14.6). In your smoothquant_opt_demo.ipynb, the naive W8A8 accuracy is very low. Is this because of dynamic quantization?

Thank you very much.

different smoothquant levels

In the paper there are three levels of smoothquant, O1/O2/O3, with latency decreasing as level increases. Is the implementation in this repo O3? I didn't immediately see how to switch the level in the repo.

Accuracy drop for Llama

I tried to quantize a Llama model (Llama-13B) with SmoothQuant and found that if I only quantize the LlamaDecoderLayer, the accuracy does not drop even when directly quantizing weights and activations, but accuracy drops a lot when quantizing the LlamaMLP, which contains 3 linear layers and 1 activation layer:

  • model: decapoda-research/llama-13b-hf
  • dataset: wikitext-2-raw-v1
  • split: validation[:1000]
  • fp16 accuracy: 0.545
  • quantized accuracy (w/o quantized MLP): 0.446
  • smooth quant accuracy (w/o quantized MLP): 0.481
  • quantized accuracy (w quantized MLP): 0.026
  • smooth quant accuracy (w quantized MLP): 0.067

No module named 'torch_int'

Traceback (most recent call last):
File "examples/export_int8_model.py", line 10, in <module>
from smoothquant.opt import Int8OPTForCausalLM
File "", line 259, in load_module
File "/root/anaconda3/envs/smoothquant/lib/python3.8/site-packages/smoothquant-0.0.0-py3.8.egg/smoothquant/opt.py", line 15, in <module>
ModuleNotFoundError: No module named 'torch_int'

Bloom code

Thank you for your great work. I am very interested in BLOOM INT8 models. Could you please share the code and checkpoints for INT8 BLOOM models?

Test smoothquant accuracy for just fc2 layer

I'm integrating smoothquant into a different large autoregressive transformer model with a somewhat different arch.

Say I want to test SmoothQuant accuracy by quantizing and smoothing just fc2 to INT8. Could you help me with the steps for this? I'm using the real PyTorch W8A8B8O8Linear layer from torch-int.

Roughly looks like:

  1. Get act scales, static decoder layer scales for whole model (or just fc2, but might be easier to do whole model).
  2. Use act scales + just the fc part of smooth_lm
  3. Call the W8A8 linear kernel

I'd appreciate some help with step 3 (or just some example code for getting scales, quantizing, and calling a single linear). In particular, I'm confused about what the input and output scales should be. It also looks like some of the scaling factors are fused into previous layers, so I'm not sure whether that will introduce any issues. Thanks!

SmoothQuant for llama

Hi, authors!
Are you planning on supporting Llama in SmoothQuant? I am looking forward to the application of SmoothQuant to Llama.

Thank you!

what is the required transformers version

python3.9/site-packages/transformers/modeling_utils.py", line 2529, in from_pretrained
dispatch_model(model, device_map=device_map, offload_dir=offload_folder, offload_index=offload_index)
TypeError: dispatch_model() got an unexpected keyword argument 'offload_index'

I use Python 3.9 with transformers==4.26.1. When I run python3 generate_act_scales.py ..., it causes the error above.

How to reproduce the performance described in the paper

I tested the latency of OPT-13B on a single NVIDIA A100-80GB GPU using a PyTorch implementation.
With batch=1, input seq length=512, and output seq length=1, the fp16 latency is about 74.60ms and the SmoothQuant-O3 latency is about 78.192ms. This does not show the performance improvement mentioned in the paper. Is there something wrong?

use smoothquant in different model architectures [proposed Label] Question

@zhijian-liu
[proposed Label] Question
I have been reading a lot of papers related to model efficiency, runtime, and performance, and I have a question:
is the method you introduced only applicable to LLMs, to tackle their huge parameter counts and weight quantization, or can it also be used in other types of models, such as computer vision tasks? Does it need to be adapted?
Going through the code, I noticed that it is adapted to certain cases in the implementation.

How to reproduce the result with lm-evaluation-harness

I noticed that in the newest commit you mentioned that all the results are obtained with the lm-evaluation-harness, so can you show how to evaluate the model using this framework?
I tried to directly evaluate the SmoothQuant model with lm-evaluation-harness, like this:

python main.py --model hf-causal --model_args pretrained=mit-han-lab/opt-1.3b-smoothquant --tasks lambada_openai --device cuda:0

but I only got this result:

RuntimeError: Error(s) in loading state_dict for OPTForCausalLM:
        size mismatch for model.decoder.layers.0.self_attn.k_proj.bias: copying a param with shape torch.Size([1, 2048]) from checkpoint, the shape in current model is torch.Size([2048]).
        size mismatch for model.decoder.layers.0.self_attn.v_proj.bias: copying a param with shape torch.Size([1, 2048]) from checkpoint, the shape in current model is torch.Size([2048]).
        size mismatch for model.decoder.layers.0.self_attn.q_proj.bias: copying a param with shape torch.Size([1, 2048]) from checkpoint, the shape in current model is torch.Size([2048]).
        size mismatch for model.decoder.layers.0.self_attn.out_proj.bias: copying a param with shape torch.Size([1, 2048]) from checkpoint, the shape in current model is torch.Size([2048]).
...

Size mismatch

Hey!
First thanks so so much for releasing your awesome work!!!

I've tried to export my fine-tuned OPT-13B model using export_int8_model.py.

When then trying to load this model with Int8OPTForCausalLM.from_pretrained(), I run into the errors below.

The model was fine-tuned with a block size of 2048 (using the HF CLM example script).
I have tried re-exporting to INT8 using a seq length of 2048 and using 'ignore_mismatched_sizes=True', but still no luck.

I'd really appreciate any insights here to get this working!

raise RuntimeError(f"Error(s) in loading state_dict for {model.__class__.__name__}:\n\t{error_msg}")
RuntimeError: Error(s) in loading state_dict for Int8OPTForCausalLM:
        size mismatch for model.decoder.layers.0.self_attn.k_proj.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([1, 5120]).
        size mismatch for model.decoder.layers.0.self_attn.v_proj.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([1, 5120]).
        size mismatch for model.decoder.layers.0.self_attn.q_proj.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([1, 5120]).
        size mismatch for model.decoder.layers.0.self_attn.out_proj.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([1, 5120]).
        size mismatch for model.decoder.layers.0.fc1.bias: copying a param with shape torch.Size([20480]) from checkpoint, the shape in current model is torch.Size([1, 20480]).
        size mismatch for model.decoder.layers.0.fc2.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([1, 5120]).
        ... (the same k_proj / v_proj / q_proj / out_proj / fc1 / fc2 bias mismatches repeat for layers 1 through 39) ...
        You may consider adding ignore_mismatched_sizes=True in the model's from_pretrained method.

git lfs pull ERROR

Cannot pull the .pt files in act_scales.

When running:

git lfs pull

I get the error:

batch response: This repository is over its data quota. Account responsible for LFS bandwidth should purchase more data packs to restore access.

Input to ReLU is quantized to int8? An error in quantization_flow.png?

In quantization_flow.png, which is referenced in the INT8 demo
https://github.com/mit-han-lab/smoothquant/blob/main/examples/smoothquant_opt_real_int8_demo.ipynb, the input to ReLU appears to be int8 as well. Is this an error?
As far as I know, an activation function's input is usually kept in floating point.
Also, in your quantize_model function in the other demo notebook, only the inputs of the fc layers are quantized, so the input to ReLU should remain full-precision?
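For context, a minimal fake-quantization sketch (my own illustration, not the repo's quantize_model) of the convention the question describes: only the linear input is snapped to an INT8 grid, so the ReLU that follows still sees a floating-point tensor.

```python
import torch

def fake_quantize_per_tensor(x, n_bits=8):
    # Simulated quantization: round onto an INT8 grid but keep the floating-point dtype.
    qmax = 2 ** (n_bits - 1) - 1
    scale = x.abs().max().clamp(min=1e-8) / qmax
    return (x / scale).round().clamp(-qmax - 1, qmax) * scale

fc1 = torch.nn.Linear(512, 2048)
x = torch.randn(4, 512)
# Only the input of fc1 is quantized; the ReLU input (fc1's output) stays full precision.
y = torch.relu(fc1(fake_quantize_per_tensor(x)))
```

A real INT8 pipeline can look different, for example when a kernel fuses the ReLU with the INT8 GEMM, so the figure and the simulated flow are not necessarily contradictory.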

How can smoothquant be used in ConvNets

Thanks for your good work! I was wondering how SmoothQuant can be used in ConvNets, since conv kernels' shape is often [in_ch, out_ch, ks, ks]. Or can it not be used in ConvNets at all? Looking forward to your reply, thanks!

How to use SmoothQuant in FasterTransformer?

I have built and run FasterTransformer. I see there is a parameter --int8_mode in FasterTransformer. Will it use SmoothQuant by default if I set int8_mode=1?

If not, is there an example of using SmoothQuant in FasterTransformer?

Thank you!

No module named 'torch_int'

When I run export_int8_model.py to export INT8 models, an error occurred: No module named 'torch_int'.
I have installed the following packages. Thanks a lot, looking forward to your reply.
torch 1.12.1+cu113
torchaudio 0.12.1+cu113
torchvision 0.13.1+cu113

Visualization tool

Hi,

I'm trying to reproduce the visualizations presented in the paper. The one I'm working on is the 3D visualization showing consistently high values across tokens for a channel. I tried matplotlib and Vega, but both fail due to lack of memory. Can you share details/code on how that visualization was made?

To be precise, the visualization on slide #5.

Thank you!
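For anyone hitting the same memory limit, a sketch like the following (my own approximation, not the authors' plotting code) usually renders if you downsample the token and channel axes before calling matplotlib's 3D surface plot:

```python
import numpy as np
import matplotlib.pyplot as plt

# Placeholder data: |activation| magnitudes with shape [num_tokens, num_channels].
act = np.abs(np.random.randn(2048, 5120)).astype(np.float32)

# Downsample both axes so the surface mesh stays small enough to render.
token_step, channel_step = 16, 32
sub = act[::token_step, ::channel_step]
tokens = np.arange(0, act.shape[0], token_step)
channels = np.arange(0, act.shape[1], channel_step)
C, T = np.meshgrid(channels, tokens)  # shapes match `sub`

fig = plt.figure(figsize=(8, 5))
ax = fig.add_subplot(projection="3d")
ax.plot_surface(C, T, sub, cmap="coolwarm", linewidth=0, antialiased=False)
ax.set_xlabel("channel")
ax.set_ylabel("token")
ax.set_zlabel("|activation|")
plt.show()
```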

error encountered when loading act_scales

Hit `invalid load key, 'v'` when running the demo notebook smoothquant_opt_demo.ipynb in Colab (GPU runtime).

---------------------------------------------------------------------------
UnpicklingError                           Traceback (most recent call last)
<ipython-input-4-2b8a4412555c> in <module>
1 model = OPTForCausalLM.from_pretrained('facebook/opt-13b', torch_dtype=torch.float16, device_map='auto')
 ----> 2 act_scales = torch.load('smoothquant/act_scales/opt-13b.pt')
1 frames
/usr/local/lib/python3.8/dist-packages/torch/serialization.py in _legacy_load(f, map_location, pickle_module, **pickle_load_args)
   1000             "functionality.")
   1001 
-> 1002     magic_number = pickle_module.load(f, **pickle_load_args)
   1003     if magic_number != MAGIC_NUMBER:
   1004         raise RuntimeError("Invalid magic number; corrupt file?")

UnpicklingError: invalid load key, 'v'.
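This `invalid load key, 'v'` error is typically (my assumption, not an official statement) a sign that the .pt file is still a Git LFS pointer, a small text file that starts with `version https://git-lfs...`, rather than the actual checkpoint. A quick check:

```python
from pathlib import Path

path = Path("smoothquant/act_scales/opt-13b.pt")
head = path.read_bytes()[:64]
if head.startswith(b"version https://git-lfs"):
    print("LFS pointer file; run `git lfs pull` (or download the file directly) first.")
else:
    print(f"Looks like a real binary checkpoint ({path.stat().st_size} bytes).")
```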

Post-LayerNorm support

Hi, thanks for the great work. I applied SmoothQuant to post-LayerNorm models, but saw a significant performance downgrade, since I have to introduce an elementwise-div or elementwise-mul operation in the skip path (illustrated in the figure attached to the original issue).

Do you have any other suggestions to boost the performance? Thank you!

Minghao

git lfs is currently down, could you solve this problem?

The idea is perfect, but I can't download the scale file from git-lfs right now.
Error downloading object: act_scales/bloom-176b.pt (83318e8): Smudge error: Error downloading act_scales/bloom-176b.pt (83318e8c5be645ed6ed606a9ad14502e81db8bfad980f3ec214822ba7e424866): batch response: This repository is over its data quota. Account responsible for LFS bandwidth should purchase more data packs to restore access.

Why do different models have the same size?

Thanks for your great work! But I'm a bit puzzled.
I ran smoothquant_opt_demo.ipynb and measured the sizes of model_FP16, model_w8a8, and model_smoothquant_w8a8 with print_model_size(model) from smoothquant_opt_real_int8_demo.ipynb.
I understand that you simulate INT8 inference in FP16 with fake_quant.py, so the model size should be the same as model_FP16.
But when I changed w.div_(scales).round_().mul_(scales) to w.div_(scales).round_() in fake_quant.py, I still got a similar model size.
I'm confused about how fake_quant.py works and how to achieve real INT8.
Looking forward to hearing from you soon. 🤩
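The size question can be illustrated with a small sketch (my own, not fake_quant.py itself): simulated quantization keeps every weight as an FP16 value, so dropping the final .mul_(scales) still leaves a 2-byte-per-element tensor; only storing the rounded values in an int8 tensor plus scales actually shrinks the model.

```python
import torch

w_fp32 = torch.randn(4096, 4096)
scales = w_fp32.abs().amax(dim=1, keepdim=True) / 127                 # per-output-channel scale

w_fake = ((w_fp32 / scales).round() * scales).to(torch.float16)       # simulated INT8: still FP16
w_fake_no_mul = (w_fp32 / scales).round().to(torch.float16)           # dropping mul_ keeps FP16 too
w_real = (w_fp32 / scales).round().clamp(-128, 127).to(torch.int8)    # real INT8 storage

# element_size() is bytes per element: FP16 stays at 2 bytes, INT8 drops to 1 byte.
print(w_fake.element_size(), w_fake_no_mul.element_size(), w_real.element_size())  # 2 2 1
```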

paper says smoothing all linear layers, but code seems to smooth only the qkv projection in attention and the first fc in ffn?

In the smooth_lm function that is used to export the INT8 model, only the qkv projection layers and the first fc layer in the FFN module go through the range-based smoothing process, where the scaling of the input is merged into the layernorm before the linear layer, and the scaling of the weights is merged into the weights of the linear layer.

def smooth_lm(model, scales, alpha=0.5):

What about the second linear layer in the attention and FFN modules? They don't have a layernorm before them, so how is the input scaling implemented? A pointer to the code would be appreciated!
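For reference, a minimal sketch of the smoothing fold as I understand it (the function name and the fcs argument are illustrative, not the repo's exact API): the per-channel factor s is divided out of the preceding LayerNorm and multiplied into the following linear weights, which presumably is why only layers with such a preceding op (q/k/v and fc1) are smoothed this way.

```python
import torch

@torch.no_grad()
def smooth_ln_fcs(ln, fcs, act_max, alpha=0.5):
    # act_max: per-input-channel max |activation|, shape [hidden_dim].
    # Per-channel smoothing factor s_j = max|X_j|^alpha / max|W_j|^(1 - alpha).
    w_max = torch.stack([fc.weight.abs().max(dim=0).values for fc in fcs]).max(dim=0).values
    s = (act_max.pow(alpha) / w_max.pow(1 - alpha)).clamp(min=1e-5)
    # Fold X -> X / s into the preceding LayerNorm, and W -> W * s into the linear layers.
    ln.weight.div_(s)
    ln.bias.div_(s)
    for fc in fcs:
        fc.weight.mul_(s)
```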

Error loading `AutoModelForCausalLM` in `examples/generate_act_scales.py`

Thank you for providing this great repo!

I'm trying to generate activation scales for OPT-350M:

$ python examples/generate_act_scales.py --model-name ~/huggingface-models/opt-350m/ --output-path act_scales/opt-350m.pt

And I get the following error from transformers:

Traceback (most recent call last):
  File "examples/generate_act_scales.py", line 51, in <module>
    main()
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "examples/generate_act_scales.py", line 35, in main
    model, tokenizer = build_model_and_tokenizer(args.model_name)
  File "examples/generate_act_scales.py", line 15, in build_model_and_tokenizer
    model = AutoModelForCausalLM.from_pretrained(model_name, **kwargs)
  File "/usr/lib/python3.8/site-packages/transformers/models/auto/auto_factory.py", line 446, in from_pretrained
    return model_class.from_pretrained(pretrained_model_name_or_path, *model_args, config=config, **kwargs)
  File "/usr/lib/python3.8/site-packages/transformers/modeling_utils.py", line 2225, in from_pretrained
    model, missing_keys, unexpected_keys, mismatched_keys, error_msgs = cls._load_pretrained_model(
  File "/usr/lib/python3.8/site-packages/transformers/modeling_utils.py", line 2279, in _load_pretrained_model
    if device_map is not None and "disk" in device_map.values() and offload_folder is None:
AttributeError: 'str' object has no attribute 'values'

Output of pip show transformers:

Name: transformers
Version: 4.20.0
Summary: State-of-the-art Machine Learning for JAX, PyTorch and TensorFlow
Home-page: https://github.com/huggingface/transformers
Author: The Hugging Face team (past and future) with the help of all our contributors (https://github.com/huggingface/transformers/graphs/contributors)
Author-email: [email protected]
License: Apache
Location: /usr/lib/python3.8/site-packages
Requires: pyyaml, filelock, packaging, requests, huggingface-hub, tokenizers, tqdm, numpy, regex
Required-by:

ETA for PyTorch

What's the ETA for the release in PyTorch and/or FasterTransformer?

In PyTorch, what will the interface for using SmoothQuant look like? Will it be a wrapper around a linear layer or a custom linear layer?

Doesn't work on gpt models.

Hi Guangxuan, thanks for your amazing work!
I'm now working on GPT model quantization. Unlike OPT models, which are based on nn.Linear, the GPT models are based on Conv1D, which is exactly the same as nn.Linear up to a transpose. Another difference between the GPT and OPT models is the positional embedding: GPT directly uses the nn.Embedding class, while OPT uses the OPTLearnedPositionalEmbedding class.
With the above in mind, I wrote a gpt2opt converter and modified the OPT class forward method, and the converted model reaches exactly the same accuracy as the original GPT model. However, when I try to generate the act scales, export the INT8 model, and then run the evaluation code with the converted OPT model, I get accuracy = 0.0. The act scales of the GPT-converted OPT model are also much larger than the scales generated from the original OPT models.
Also, when I run the test_opt_decoder.py script provided by the torch-int repo, the original opt-125m gets a result of 0.0359 and the converted OPT model gets 0.2607.
I've reviewed the code many times but really can't work this out. I hope you can give me some advice if you have any ideas. If you need more information, I can provide the converted model and the modified OPTForCausalLM code. Thanks!
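In case it helps with debugging the converter, a small sketch (my own, with no claim about where the accuracy drop comes from) of the Conv1D-to-Linear conversion: Hugging Face's Conv1D stores its weight as [in_features, out_features] and computes x @ W + b, so the equivalent nn.Linear takes the transposed weight.

```python
import torch
from transformers.pytorch_utils import Conv1D  # location may differ in older transformers versions

def conv1d_to_linear(conv):
    in_features, out_features = conv.weight.shape
    linear = torch.nn.Linear(in_features, out_features)
    linear.weight.data = conv.weight.data.t().contiguous()
    linear.bias.data = conv.bias.data.clone()
    return linear

# Sanity check that the conversion is numerically equivalent.
conv = Conv1D(nf=64, nx=32)
linear = conv1d_to_linear(conv)
x = torch.randn(2, 32)
assert torch.allclose(conv(x), linear(x), atol=1e-5)
```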

Out of memory

Hi, I have tried the gpt-neox-20b model, but I ran into a problem: the Python process gets killed by the system because the following line consumes too much memory:

        int8_model = Int8OPTForCausalLM.from_float(model, decoder_layer_scales)

How can I fix this problem?

Thanks!

Support for LLAMA

Hi Authors,

Are you planning on supporting LLaMA in SmoothQuant when it hits the market? I've always liked working with your projects and find LLaMA to be the next evolution of LLMs.

Thank you!

How to conduct zero-shot experiments?

Hello, I am a beginner in the field of NLP. I would like to ask how zero-shot experiments are conducted in research papers, especially for tasks like the HellaSwag dataset, which involves selecting one option out of four. How should I construct my input, and how should I handle the output?
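As a concrete starting point, one common zero-shot recipe for multiple-choice tasks like HellaSwag (a sketch of the usual approach, not the paper's exact evaluation code) is to score each candidate ending by its token log-likelihood under the model, conditioned on the context, and pick the highest-scoring option:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-1.3b")
model = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b", torch_dtype=torch.float16, device_map="auto")

@torch.no_grad()
def choose(context, endings):
    scores = []
    for ending in endings:
        ctx_ids = tokenizer(context, return_tensors="pt").input_ids.to(model.device)
        full_ids = tokenizer(context + " " + ending, return_tensors="pt").input_ids.to(model.device)
        logits = model(full_ids).logits
        log_probs = torch.log_softmax(logits[:, :-1].float(), dim=-1)
        targets = full_ids[:, 1:]
        token_lp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
        ending_len = full_ids.shape[1] - ctx_ids.shape[1]
        scores.append(token_lp[:, -ending_len:].sum().item())   # log-likelihood of the ending only
    return int(torch.tensor(scores).argmax())                   # index of the predicted option
```

Accuracy is then the fraction of examples where the predicted index matches the label; some setups additionally normalize the score by the ending length, which can change results slightly.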

[BUG] Int8 inference with torch-int encounters errors

Hi! I used export_int8_model.py to smooth, quantize, and export an INT8 model for opt-1.3b. The model size changed from 2509 MB to 1357 MB, which suggests the quantization was successful. But when I evaluated the INT8 model, the following error occurred. It seems to be a problem with CUTLASS. How do I solve this problem?
Looking forward to your reply! Thanks!

Here is my environment:

  • GPU: V100
  • ubuntu-18.04
  • cuda-11.3, cudnn-8.2.0
  • cutlass-2.11

Here is the detailed error message:

RuntimeError                              Traceback (most recent call last)
Cell In[5], line 7
      6 print_model_size(model_int8)
----> 7 acc_smoothquant, lantecy_smoothquant = evaluator.evaluate(model_int8)
      8 print(f'SmoothQuant INT8 accuracy: {acc_smoothquant}, per-sample lantecy: {lantecy_smoothquant:.3f}ms')

File ~/anaconda3/envs/smoothquant/lib/python3.8/site-packages/torch/autograd/grad_mode.py:27, in _DecoratorContextManager.__call__.<locals>.decorate_context(*args, **kwargs)
     24 @functools.wraps(func)
     25 def decorate_context(*args, **kwargs):
     26     with self.clone():
---> 27         return func(*args, **kwargs)

Cell In[2], line 29, in Evaluator.evaluate(self, model)
     27 torch.cuda.synchronize()
     28 start.record()
---> 29 outputs = model(input_ids)
     30 end.record()
     31 torch.cuda.synchronize()

File ~/anaconda3/envs/smoothquant/lib/python3.8/site-packages/torch/nn/modules/module.py:1130, in Module._call_impl(self, *input, **kwargs)
   1126 # If we don't have any hooks, we want to skip the rest of the logic in
   1127 # this function, and just call forward.
...
     43 return y

RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
------------------------------------------------------------------------------------------------------------------------------------
tlass::arch::ldsm(cutlass::Array<unsigned int, MatrixCount, true> &, const void *) [with Layout = cutlass::layout::RowMajor; int MatrixCount = 4]: block: [0,9,0], thread: [218,0,0] Assertion `0 && __PRETTY_FUNCTION__` failed.
/home/l50024761/llm/torch-int/submodules/cutlass/include/cutlass/arch/memory_sm75.h:208: void cutlass::arch::ldsm(cutlass::Array<unsigned int, MatrixCount, true> &, const void *) [with Layout = cutlass::layout::RowMajor; int MatrixCount = 4]: block: [0,9,0], thread: [219,0,0] Assertion `0 && __PRETTY_FUNCTION__` failed.
/home/l50024761/llm/torch-int/submodules/cutlass/include/cutlass/arch/memory_sm75.h:208: void cutlass::arch::ldsm(cutlass::Array<unsigned int, MatrixCount, true> &, const void *) [with Layout = cutlass::layout::RowMajor; int MatrixCount = 4]: block: [0,9,0], thread: [220,0,0] Assertion `0 && __PRETTY_FUNCTION__` failed.
/home/l50024761/llm/torch-int/submodules/cutlass/include/cutlass/arch/memory_sm75.h:208: void cutlass::arch::ldsm(cutlass::Array<unsigned int, MatrixCount, true> &, const void *) [with Layout = cutlass::layout::RowMajor; int MatrixCount = 4]: block: [0,9,0], thread: [221,0,0] Assertion `0 && __PRETTY_FUNCTION__` failed.
/home/l50024761/llm/torch-int/submodules/cutlass/include/cutlass/arch/memory_sm75.h:208: void cutlass::arch::ldsm(cutlass::Array<unsigned int, MatrixCount, true> &, const void *) [with Layout = cutlass::layout::RowMajor; int MatrixCount = 4]: block: [0,9,0], thread: [222,0,0] Assertion `0 && __PRETTY_FUNCTION__` failed.
/home/l50024761/llm/torch-int/submodules/cutlass/include/cutlass/arch/memory_sm75.h:208: void cutlass::arch::ldsm(cutlass::Array<unsigned int, MatrixCount, true> &, const void *) [with Layout = cutlass::layout::RowMajor; int MatrixCount = 4]: block: [0,9,0], thread: [223,0,0] Assertion `0 && __PRETTY_FUNCTION__` failed.
/home/l50024761/llm/torch-int/submodules/cutlass/include/cutlass/arch/memory_sm75.h:208: void cutlass::arch::ldsm(cutlass::Array<unsigned int, MatrixCount, true> &, const void *) [with Layout = cutlass::layout::RowMajor; int MatrixCount = 4]: block: [0,9,0], thread: [96,0,0] Assertion `0 && __PRETTY_FUNCTION__` failed.
/home/l50024761/llm/torch-int/submodules/cutlass/include/cutlass/arch/memory_sm75.h:208: void cutlass::arch::ldsm(cutlass::Array<unsigned int, MatrixCount, true> &, const void *) [with Layout = cutlass::layout::RowMajor; int MatrixCount = 4]: block: [0,9,0], thread: [97,0,0] Assertion `0 && __PRETTY_FUNCTION__` failed.
/home/l50024761/llm/torch-int/submodules/cutlass/include/cutlass/arch/memory_sm75.h:208: void cutlass::arch::ldsm(cutlass::Array<unsigned int, MatrixCount, true> &, const void *) [with Layout = cutlass::layout::RowMajor; int MatrixCount = 4]: block: [0,9,0], thread: [98,0,0] Assertion `0 && __PRETTY_FUNCTION__` failed.
/home/l50024761/llm/torch-int/submodules/cutlass/include/cutlass/arch/memory_sm75.h:208: void cutlass::arch::ldsm(cutlass::Array<unsigned int, MatrixCount, true> &, const void *) [with Layout = cutlass::layout::RowMajor; int MatrixCount = 4]: block: [0,9,0], thread: [99,0,0] Assertion `0 && __PRETTY_FUNCTION__` failed.
/home/l50024761/llm/torch-int/submodules/cutlass/include/cutlass/arch/memory_sm75.h:208: void cutlass::arch::ldsm(cutlass::Array<unsigned int, MatrixCount, true> &, const void *) [with Layout = cutlass::layout::RowMajor; int MatrixCount = 4]: block: [0,9,0], thread: [100,0,0] Assertion `0 && __PRETTY_FUNCTION__` failed.
/home/l50024761/llm/torch-int/submodules/cutlass/include/cutlass/arch/memory_sm75.h:208: void cutlass::arch::ldsm(cutlass::Array<unsigned int, MatrixCount, true> &, const void *) [with Layout = cutlass::layout::RowMajor; int MatrixCount = 4]: block: [0,9,0], thread: [101,0,0] Assertion `0 && __PRETTY_FUNCTION__` failed.
/home/l50024761/llm/torch-int/submodules/cutlass/include/cutlass/arch/memory_sm75.h:208: void cutlass::arch::ldsm(cutlass::Array<unsigned int, MatrixCount, true> &, const void *) [with Layout = cutlass::layout::RowMajor; int MatrixCount = 4]: block: [0,9,0], thread: [102,0,0] Assertion `0 && __PRETTY_FUNCTION__` failed.
/home/l50024761/llm/torch-int/submodules/cutlass/include/cutlass/arch/memory_sm75.h:208: void cutlass::arch::ldsm(cutlass::Array<unsigned int, MatrixCount, true> &, const void *) [with Layout = cutlass::layout::RowMajor; int MatrixCount = 4]: block: [0,9,0], thread: [103,0,0] Assertion `0 && __PRETTY_FUNCTION__` failed.
/home/l50024761/llm/torch-int/submodules/cutlass/include/cutlass/arch/memory_sm75.h:208: void cutlass::arch::ldsm(cutlass::Array<unsigned int, MatrixCount, true> &, const void *) [with Layout = cutlass::layout::RowMajor; int MatrixCount = 4]: block: [0,9,0], thread: [104,0,0] Assertion `0 && __PRETTY_FUNCTION__` failed.
/home/l50024761/llm/torch-int/submodules/cutlass/include/cutlass/arch/memory_sm75.h:208: void cutlass::arch::ldsm(cutlass::Array<unsigned int, MatrixCount, true> &, const void *) [with Layout = cutlass::layout::RowMajor; int MatrixCount = 4]: block: [0,9,0], thread: [105,0,0] Assertion `0 && __PRETTY_FUNCTION__` failed.
/home/l50024761/llm/torch-int/submodules/cutlass/include/cutlass/arch/memory_sm75.h:208: void cutlass::arch::ldsm(cutlass::Array<unsigned int, MatrixCount, true> &, const void *) [with Layout = cutlass::layout::RowMajor; int MatrixCount = 4]: block: [0,9,0], thread: [106,0,0] Assertion `0 && __PRETTY_FUNCTION__` failed.
/home/l50024761/llm/torch-int/submodules/cutlass/include/cutlass/arch/memory_sm75.h:208: void cutlass::arch::ldsm(cutlass::Array<unsigned int, MatrixCount, true> &, const void *) [with Layout = cutlass::layout::RowMajor; int MatrixCount = 4]: block: [0,9,0], thread: [107,0,0] Assertion `0 && __PRETTY_FUNCTION__` failed.
/home/l50024761/llm/torch-int/submodules/cutlass/include/cutlass/arch/memory_sm75.h:208: void cutlass::arch::ldsm(cutlass::Array<unsigned int, MatrixCount, true> &, const void *) [with Layout = cutlass::layout::RowMajor; int MatrixCount = 4]: block: [0,9,0], thread: [108,0,0] Assertion `0 && __PRETTY_FUNCTION__` failed.
/home/l50024761/llm/torch-int/submodules/cutlass/include/cutlass/arch/memory_sm75.h:208: void cutlass::arch::ldsm(cutlass::Array<unsigned int, MatrixCount, true> &, const void *) [with Layout = cutlass::layout::RowMajor; int MatrixCount = 4]: block: [0,9,0], thread: [109,0,0] Assertion `0 && __PRETTY_FUNCTION__` failed.
/home/l50024761/llm/torch-int/submodules/cutlass/include/cutlass/arch/memory_sm75.h:208: void cutlass::arch::ldsm(cutlass::Array<unsigned int, MatrixCount, true> &, const void *) [with Layout = cutlass::layout::RowMajor; int MatrixCount = 4]: block: [0,9,0], thread: [110,0,0] Assertion `0 && __PRETTY_FUNCTION__` failed.
/home/l50024761/llm/torch-int/submodules/cutlass/include/cutlass/arch/memory_sm75.h:208: void cutlass::arch::ldsm(cutlass::Array<unsigned int, MatrixCount, true> &, const void *) [with Layout = cutlass::layout::RowMajor; int MatrixCount = 4]: block: [0,9,0], thread: [111,0,0] Assertion `0 && __PRETTY_FUNCTION__` failed.
/home/l50024761/llm/torch-int/submodules/cutlass/include/cutlass/arch/memory_sm75.h:208: void cutlass::arch::ldsm(cutlass::Array<unsigned int, MatrixCount, true> &, const void *) [with Layout = cutlass::layout::RowMajor; int MatrixCount = 4]: block: [0,9,0], thread: [112,0,0] Assertion `0 && __PRETTY_FUNCTION__` failed.
/home/l50024761/llm/torch-int/submodules/cutlass/include/cutlass/arch/memory_sm75.h:208: void cutlass::arch::ldsm(cutlass::Array<unsigned int, MatrixCount, true> &, const void *) [with Layout = cutlass::layout::RowMajor; int MatrixCount = 4]: block: [0,9,0], thread: [113,0,0] Assertion `0 && __PRETTY_FUNCTION__` failed.
/home/l50024761/llm/torch-int/submodules/cutlass/include/cutlass/arch/memory_sm75.h:208: void cutlass::arch::ldsm(cutlass::Array<unsigned int, MatrixCount, true> &, const void *) [with Layout = cutlass::layout::RowMajor; int MatrixCount = 4]: block: [0,9,0], thread: [114,0,0] Assertion `0 && __PRETTY_FUNCTION__` failed.
/home/l50024761/llm/torch-int/submodules/cutlass/include/cutlass/arch/memory_sm75.h:208: void cutlass::arch::ldsm(cutlass::Array<unsigned int, MatrixCount, true> &, const void *) [with Layout = cutlass::layout::RowMajor; int MatrixCount = 4]: block: [0,9,0], thread: [115,0,0] Assertion `0 && __PRETTY_FUNCTION__` failed.
/home/l50024761/llm/torch-int/submodules/cutlass/include/cutlass/arch/memory_sm75.h:208: void cutlass::arch::ldsm(cutlass::Array<unsigned int, MatrixCount, true> &, const void *) [with Layout = cutlass::layout::RowMajor; int MatrixCount = 4]: block: [0,9,0], thread: [116,0,0] Assertion `0 && __PRETTY_FUNCTION__` failed.
/home/l50024761/llm/torch-int/submodules/cutlass/include/cutlass/arch/memory_sm75.h:208: void cutlass::arch::ldsm(cutlass::Array<unsigned int, MatrixCount, true> &, const void *) [with Layout = cutlass::layout::RowMajor; int MatrixCount = 4]: block: [0,9,0], thread: [117,0,0] Assertion `0 && __PRETTY_FUNCTION__` failed.
/home/l50024761/llm/torch-int/submodules/cutlass/include/cutlass/arch/memory_sm75.h:208: void cutlass::arch::ldsm(cutlass::Array<unsigned int, MatrixCount, true> &, const void *) [with Layout = cutlass::layout::RowMajor; int MatrixCount = 4]: block: [0,9,0], thread: [118,0,0] Assertion `0 && __PRETTY_FUNCTION__` failed.
/home/l50024761/llm/torch-int/submodules/cutlass/include/cutlass/arch/memory_sm75.h:208: void cutlass::arch::ldsm(cutlass::Array<unsigned int, MatrixCount, true> &, const void *) [with Layout = cutlass::layout::RowMajor; int MatrixCount = 4]: block: [0,9,0], thread: [119,0,0] Assertion `0 && __PRETTY_FUNCTION__` failed.
/home/l50024761/llm/torch-int/submodules/cutlass/include/cutlass/arch/memory_sm75.h:208: void cutlass::arch::ldsm(cutlass::Array<unsigned int, MatrixCount, true> &, const void *) [with Layout = cutlass::layout::RowMajor; int MatrixCount = 4]: block: [0,9,0], thread: [120,0,0] Assertion `0 && __PRETTY_FUNCTION__` failed.
/home/l50024761/llm/torch-int/submodules/cutlass/include/cutlass/arch/memory_sm75.h:208: void cutlass::arch::ldsm(cutlass::Array<unsigned int, MatrixCount, true> &, const void *) [with Layout = cutlass::layout::RowMajor; int MatrixCount = 4]: block: [0,9,0], thread: [121,0,0] Assertion `0 && __PRETTY_FUNCTION__` failed.
/home/l50024761/llm/torch-int/submodules/cutlass/include/cutlass/arch/memory_sm75.h:208: void cutlass::arch::ldsm(cutlass::Array<unsigned int, MatrixCount, true> &, const void *) [with Layout = cutlass::layout::RowMajor; int MatrixCount = 4]: block: [0,9,0], thread: [122,0,0] Assertion `0 && __PRETTY_FUNCTION__` failed.
/home/l50024761/llm/torch-int/submodules/cutlass/include/cutlass/arch/memory_sm75.h:208: void cutlass::arch::ldsm(cutlass::Array<unsigned int, MatrixCount, true> &, const void *) [with Layout = cutlass::layout::RowMajor; int MatrixCount = 4]: block: [0,9,0], thread: [123,0,0] Assertion `0 && __PRETTY_FUNCTION__` failed.
/home/l50024761/llm/torch-int/submodules/cutlass/include/cutlass/arch/memory_sm75.h:208: void cutlass::arch::ldsm(cutlass::Array<unsigned int, MatrixCount, true> &, const void *) [with Layout = cutlass::layout::RowMajor; int MatrixCount = 4]: block: [0,9,0], thread: [124,0,0] Assertion `0 && __PRETTY_FUNCTION__` failed.
/home/l50024761/llm/torch-int/submodules/cutlass/include/cutlass/arch/memory_sm75.h:208: void cutlass::arch::ldsm(cutlass::Array<unsigned int, MatrixCount, true> &, const void *) [with Layout = cutlass::layout::RowMajor; int MatrixCount = 4]: block: [0,9,0], thread: [125,0,0] Assertion `0 && __PRETTY_FUNCTION__` failed.
/home/l50024761/llm/torch-int/submodules/cutlass/include/cutlass/arch/memory_sm75.h:208: void cutlass::arch::ldsm(cutlass::Array<unsigned int, MatrixCount, true> &, const void *) [with Layout = cutlass::layout::RowMajor; int MatrixCount = 4]: block: [0,9,0], thread: [126,0,0] Assertion `0 && __PRETTY_FUNCTION__` failed.
/home/l50024761/llm/torch-int/submodules/cutlass/include/cutlass/arch/memory_sm75.h:208: void cutlass::arch::ldsm(cutlass::Array<unsigned int, MatrixCount, true> &, const void *) [with Layout = cutlass::layout::RowMajor; int MatrixCount = 4]: block: [0,9,0], thread: [127,0,0] Assertion `0 && __PRETTY_FUNCTION__` failed.
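One thing worth checking (my reading of the log, not an official diagnosis): the assertion comes from cutlass/arch/memory_sm75.h, i.e. an ldmatrix-based load that requires compute capability 7.5 or newer, while a V100 is SM 7.0. A quick check:

```python
import torch

major, minor = torch.cuda.get_device_capability()
print(f"Compute capability: {major}.{minor}")
if (major, minor) < (7, 5):
    print("These CUTLASS INT8 kernels likely will not run on this GPU; "
          "a Turing/Ampere card (e.g. T4, A100) should work.")
```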

Calculating quantization scales for new models?

I saw that you've uploaded activation scales (equation 4: $s_j = \max(|X_j|)^{\alpha} / \max(|W_j|)^{1-\alpha}$) for a number of models, but when calculating this for a new model, how do you use the calibration dataset? Do you take the maximum across all of the calibration values, or calculate the maximum for each sample individually and then average?

Apologies if this is addressed in the soon-to-be-released code.
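In case it helps, a common recipe (a sketch under my own assumptions, not necessarily identical to generate_act_scales.py) is to keep a running elementwise maximum of |activation| per channel over the whole calibration set, rather than averaging per-sample maxima:

```python
import torch

@torch.no_grad()
def collect_act_scales(model, calib_batches):
    # calib_batches: an iterable of input_ids tensors already on the model's device.
    act_scales = {}

    def make_hook(name):
        def hook(module, inputs, output):
            # Per-input-channel max |activation| for this batch, merged into the running max.
            x = inputs[0].detach().reshape(-1, inputs[0].shape[-1]).abs()
            cur = x.max(dim=0).values.float().cpu()
            act_scales[name] = torch.maximum(act_scales[name], cur) if name in act_scales else cur
        return hook

    handles = [m.register_forward_hook(make_hook(n))
               for n, m in model.named_modules() if isinstance(m, torch.nn.Linear)]
    for input_ids in calib_batches:
        model(input_ids)
    for h in handles:
        h.remove()
    return act_scales
```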

How to implement this method combined with a decoder

Details:

1. The open source code only supports the first round of text input; if we set past_key_values=True, we get an error because the dimensions of the attention mask do not match.

The following code does not work normally, unlike the opt-6.7b model before quantization:

import torch
from opt import Int8OPTForCausalLM
from transformers import AutoTokenizer

model_smoothquant = Int8OPTForCausalLM.from_pretrained(
    '/data1/lileilai/opt-6.7b-smoothquant/', torch_dtype=torch.float16, device_map='auto')
tokenizer = AutoTokenizer.from_pretrained("/data1/lileilai/opt-6.7b/")
input_sentences = ['In the last couple of days, a',
 'The New Jersey Department of Transportation is aware',
 'The New York Giants have a new head',
 'The New York Times has published its annual',
 'In a move that will likely make it',
 "The New York Giants' offensive linemen",
 'The Canadian Press has unanimously condemned the new',
 'The first time I saw the movie,'
]
batch_size = 8
inputs = input_sentences[:batch_size]

generate_kwargs = dict(max_new_tokens=100, do_sample=False)

def generate(model=None):
    # Tokenize the batch with padding and move the tensors to the GPU.
    input_tokens = tokenizer.batch_encode_plus(inputs, return_tensors="pt", padding=True)
    for t in input_tokens:
        if torch.is_tensor(input_tokens[t]):
            input_tokens[t] = input_tokens[t].to("cuda:0")

    outputs = model.generate(**input_tokens, **generate_kwargs)

    # Count how many new tokens were generated per sample.
    input_tokens_lengths = [x.shape[0] for x in input_tokens.input_ids]
    output_tokens_lengths = [x.shape[0] for x in outputs]
    total_new_tokens = [o - i for i, o in zip(input_tokens_lengths, output_tokens_lengths)]
    outputs = tokenizer.batch_decode(outputs, skip_special_tokens=True)

    return zip(inputs, outputs, total_new_tokens)

generate(model=model_smoothquant)


The ppl value of the opt-6.7b-smoothquant model shows abnormal performance

I tested the mit-han-lab/opt-6.7b-smoothquant model and the opt-6.7b model from Hugging Face. The perplexity (ppl) obtained on the WikiText-2 dataset was 20.65 and 10.92, respectively. The tests were run on an A30 device. The increase in perplexity is difficult to understand; do you know the reason?

The perplexity evaluation code:
import math

import datasets as dataset
import torch
from transformers import AutoTokenizer
from opt import Int8OPTForCausalLM

test = dataset.load_from_disk("../wikitext-2")
model_path = "../opt-6.7b-smoothquant"
token_path = "../opt-6.7b"

model = Int8OPTForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16, device_map='auto')
tokenizer = AutoTokenizer.from_pretrained(token_path)
input_ids = tokenizer("\n\n".join(test["text"]), return_tensors="pt").input_ids
seq_len = input_ids.size(1)

max_length = model.config.max_position_embeddings
stride = model.config.max_position_embeddings

num_chunks = seq_len // max_length
print(f'Calculating perplexity over {num_chunks} chunks, stride={stride}')

total_nll = 0.0
for i in range(num_chunks):
    begin_loc = i * max_length
    end_loc = min(begin_loc + max_length, seq_len)
    input_ids_ = input_ids[:, begin_loc:end_loc].cuda()
    input_ids_[0][0] = model.config.bos_token_id

    with torch.no_grad():
        outputs = model(input_ids_)
        # Negative log-likelihood of each token given the preceding tokens in the chunk.
        shift_logits = outputs.logits[..., :-1, :].float().contiguous()
        shift_labels = input_ids_[..., 1:].contiguous()
        log_probs = -torch.log_softmax(shift_logits, dim=-1)
        nll = log_probs.gather(dim=-1, index=shift_labels.unsqueeze(-1)).squeeze(-1)
        total_nll += nll.mean().item()

average_nll = total_nll / num_chunks
print("Perplexity:", math.exp(average_nll))

How to calculate Alpha?

Hi, thanks for sharing this nice paper, SmoothQuant.
I just have a few simple questions about the parameter alpha. I would appreciate it if you could provide more details about it.

  1. How do you define the outliers? Is it per channel, or calculated over the whole activation tensor?
  2. How do you get the ratio of outliers? As mentioned in the paper, e.g. 30% outliers: how do you obtain such a ratio?
  3. How do you choose the alpha value according to that ratio?

Thank you.

Support GPT-NeoX model

Hi,
I am working on quantization using the GPT-NeoX model.
During the quantization process, I got the following message:

"You are using a model of type gpt_neox to instantiate a model of type opt. This is not supported for all configurations of models and can yield errors."

Is GPT-NeoX conversion not possible?
