mobiusml / hqq

Official implementation of Half-Quadratic Quantization (HQQ)

Home Page: https://mobiusml.github.io/hqq_blog/

License: Apache License 2.0

Languages: Python 89.60%, C++ 2.63%, Cuda 7.77%
Topics: machine-learning, quantization, llm

hqq's Introduction

Half-Quadratic Quantization (HQQ)

This repository contains the official implementation of Half-Quadratic Quantization (HQQ), presented in our HQQ and HQQ+ articles (see the blog posts referenced below).

What is HQQ?

HQQ is a fast and accurate model quantizer that skips the need for calibration data. Quantize the largest models, without calibration data, in just a few minutes at most 🚀.

FAQ

Why should I use HQQ instead of other quantization methods?
  • HQQ is very fast at quantizing models.
  • It supports 8, 4, 3, 2 and 1-bit quantization.
  • You can use it on any model (LLMs, vision models, etc.).
  • The dequantization step is a linear operation, which means HQQ is compatible with various optimized CUDA/Triton kernels.
  • HQQ is compatible with PEFT training.
  • We strive to make HQQ fully compatible with `torch.compile` for faster inference and training.

What is the quality of the quantized models?
We have detailed benchmarks on both language and vision models. Please refer to our blog posts: HQQ, HQQ+.

What is the speed of the quantized models?
4-bit models with axis=1 can use optimized fused inference kernels such as torchao's int4_gemm. This is the same kernel used in gpt-fast and, based on our benchmarks, it's the fastest kernel available right now. We also support the Marlin kernel. Moreover, we focus on making hqq fully compatible with torch.compile, which speeds up both training and inference. For more details, please refer to the backend section below.

What quantization settings should I use?
You should start with nbits=4, group_size=64, axis=1. These settings offer a good balance between quality, VRAM usage and speed. If you want better quality with the same VRAM usage, switch to axis=0 and use the ATEN backend. If you want to use lower bits such as nbits=2, you should use axis=0 with a small group-size via HQQ+, i.e., add low-rank adapters and fine-tune with a small dataset.
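For reference, the starting configuration described above would look like this (a minimal sketch; BaseQuantizeConfig is covered in the Basic Usage section below):

from hqq.core.quantize import BaseQuantizeConfig

#Recommended starting point: good balance between quality, VRAM usage and speed
quant_config = BaseQuantizeConfig(nbits=4, group_size=64, axis=1)

#Better quality at the same VRAM usage (requires the ATEN backend, see the Backend section below)
#quant_config = BaseQuantizeConfig(nbits=4, group_size=64, axis=0)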

What does the axis parameter mean?
The axis parameter is the axis along which grouping is performed. In general axis=0 gives better results than axis=1, especially at lower bits. However, the optimized inference runtime only supports axis=1 for the moment.

What is the difference between HQQ and HQQ+?
HQQ+ is HQQ with trainable low-rank adapters to improve the quantization quality at lower bits.

Installation

First, make sure you have a Pytorch 2 version that matches your CUDA version: https://pytorch.org/

You can install hqq via pip install hqq.

To get the latest version, you can install the core library directly via pip install git+https://github.com/mobiusml/hqq.git.

Alternatively, clone the repo and run pip install . from this current folder.

Basic Usage

To perform quantization with HQQ, you simply need to replace the linear layers (torch.nn.Linear) as follows:

import torch
from hqq.core.quantize import *

#Quantization settings
quant_config = BaseQuantizeConfig(nbits=4, group_size=64)

#Replace your linear layer 
hqq_layer = HQQLinear(your_linear_layer, #torch.nn.Linear or None 
                      quant_config=quant_config, #quantization configuration
                      compute_dtype=torch.float16, #compute dtype
                      device='cuda', #cuda device
                      initialize=True, #Use False to quantize later
                      del_orig=True #if True, delete the original layer
                      )

The quantization parameters are set as follows:

  • nbits (int): supports 8, 4, 3, 2, 1 bits.
  • group_size (int): no restrictions as long as weight.numel() is divisible by the group_size.
  • quant_zero (bool): if True, it quantizes the zero-point to 8-bit without grouping.
  • quant_scale (bool): if True, it quantizes the scaling factor to 8-bit with a group_size of 128.
  • offload_meta (bool): if True, meta-data is offloaded to the CPU.
  • view_as_float (bool): if True, the quantized parameter is viewed as a float instead of an int type.

Setting offload_meta=True drastically decreases the GPU memory requirements but makes processing slower for smaller group-sizes. When turned on, you can run Llama2-70B and Mixtral with HQQ 2-bit using only 18.8GB and 13GB VRAM respectively.
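For example, a low-bit config with meta-data offloading might look like the sketch below (the exact nbits/group_size depend on your model and VRAM budget):

from hqq.core.quantize import BaseQuantizeConfig

#2-bit weights with the quantization meta-data (scales/zero-points) offloaded to the CPU
quant_config = BaseQuantizeConfig(nbits=2, group_size=16, offload_meta=True)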

Backend

Native Backends

The following native backends can be used by the HQQLinear module:

HQQLinear.set_backend(HQQBackend.PYTORCH)          #Pytorch backend
HQQLinear.set_backend(HQQBackend.PYTORCH_COMPILE)  #Compiled Pytorch
HQQLinear.set_backend(HQQBackend.ATEN)             #Aten/CUDA backend

The HQQBackend.ATEN backend is automatically installed and used by default when available. Note that HQQBackend.ATEN only supports axis=0. For axis=1 you need to use HQQBackend.PYTORCH or HQQBackend.PYTORCH_COMPILE.

Below you can find the speed-up benchmark with various backends, HQQBackend.PYTORCH being the baseline:

(Speed-up benchmark figures: Titan RTX and A100.)

Faster Inference

We support external backends for faster inference with fused kernels. You can enable one of the backends after the model has been quantized, as follows:

from hqq.utils.patching import prepare_for_inference

#Pytorch backend that makes the model compatible with fullgraph torch.compile: works with any settings
#prepare_for_inference(model) 

#Torchao's tiny_gemm backend (fastest): nbits=4, compute_dtype=bfloat16, axis=1
prepare_for_inference(model, backend="torchao_int4") 

#Marlin backend: nbits=4, axis=1, compute_dtype=float16, group_size=None
#prepare_for_inference(model, backend="marlin", allow_merge=True) 

#Bitblas backend: nbits=4/2/1, axis=1, compute_dtype=float16, group_size=None
#prepare_for_inference(model, backend="bitblas") 

These backends require 4-bit quantization and axis=1 (BitBlas additionally supports 2-bit and 1-bit). For Marlin, we only support group_size=None. Below you can find a comparison between the different backends. The torchao kernel reaches 195 tokens/sec (generation speed) on a 4090.

(Backend comparison figure: RTX 4090.)
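Putting it together, a typical inference flow looks like the sketch below, where model and tokenizer are assumed to be an already-quantized transformers causal LM and its tokenizer (see Usage with Models below for how to obtain them):

import torch
from hqq.utils.patching import prepare_for_inference

#Patch the quantized model with the torchao int4 kernel (nbits=4, axis=1, compute_dtype=bfloat16)
prepare_for_inference(model, backend="torchao_int4")

#Optional: compile for additional speed-ups
model = torch.compile(model)

#Generate as usual
inputs  = tokenizer("What is quantization?", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))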

Usage with Models

Transformers 🤗

For usage with HF's transformers, see the example below from the documentation:

import torch
from transformers import AutoModelForCausalLM, HqqConfig

# All linear layers will use the same quantization config
quant_config = HqqConfig(nbits=4, group_size=64, quant_zero=False, quant_scale=False, axis=1)

# Load and quantize
model = AutoModelForCausalLM.from_pretrained(
    model_id, 
    torch_dtype=torch.float16, 
    device_map="cuda", 
    quantization_config=quant_config
)

Note: You can't save/load quantized models directly via save_pretrained with this approach. Use the save/load calls from the hqq lib instead.

HQQ Lib

You can also utilize the HQQ library to quantize transformers models:

#Load the model on CPU
import torch
from transformers import AutoModelForCausalLM

compute_dtype = torch.bfloat16  #or torch.float16
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=compute_dtype)

#Quantize
from hqq.core.quantize import BaseQuantizeConfig
from hqq.models.hf.base import AutoHQQHFModel

device       = 'cuda'
quant_config = BaseQuantizeConfig(nbits=4, group_size=64, quant_scale=False, quant_zero=False, axis=1)
AutoHQQHFModel.quantize_model(model, quant_config=quant_config, compute_dtype=compute_dtype, device=device)

Save/Load

You can save/load quantized models as follows:

from hqq.models.hf.base import AutoHQQHFModel

#Save: Make sure to save the model BEFORE any patching
AutoHQQHFModel.save_quantized(model, save_dir)

#Load
model = AutoHQQHFModel.from_quantized(save_dir)

Setting a backend

You can set a native backend as follows:

HQQLinear.set_backend(HQQBackend.ATEN if axis==0 else HQQBackend.PYTORCH_COMPILE)

You can patch for faster inference as explained in the backend section:

from hqq.utils.patching import prepare_for_inference
prepare_for_inference(model, backend="torchao_int4")

Custom HF Models

AutoHQQHFModel is meant to be compatible with any transformers model. However, this generality has a drawback: it may fail or become slow when traversing the layers of certain architectures. If you run into such problems, you can create a custom model class with clearly defined patching logic instead of AutoHQQHFModel. Below are examples for popular models that you can use or extend for this purpose (a usage sketch follows the imports):

from hqq.models.hf.llama import LlamaHQQ #Llama
from hqq.models.hf.mistral import MistralHQQ #Mistral
from hqq.models.hf.mixtral import MixtralHQQ #Mixtral
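These classes are meant to expose the same quantize/save/load interface as AutoHQQHFModel. A minimal sketch, assuming a Llama-family model loaded via transformers:

import torch
from transformers import AutoModelForCausalLM
from hqq.core.quantize import BaseQuantizeConfig
from hqq.models.hf.llama import LlamaHQQ

#Load a Llama-family model (example model id)
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", torch_dtype=torch.float16)

#Quantize with the model-specific class instead of AutoHQQHFModel
quant_config = BaseQuantizeConfig(nbits=4, group_size=64)
LlamaHQQ.quantize_model(model, quant_config=quant_config, compute_dtype=torch.float16, device="cuda")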

Custom Quantization Configurations ⚙️

You can set up various quantization configurations for different layers by specifying the settings for each layer name:

Transformers 🤗

# Each linear layer with the same tag will use a dedicated quantization config
q4_config = {'nbits':4, 'group_size':64, 'quant_zero':False, 'quant_scale':False}
q3_config = {'nbits':3, 'group_size':32, 'quant_zero':False, 'quant_scale':False}

quant_config  = HqqConfig(dynamic_config={
  'self_attn.q_proj':q4_config,
  'self_attn.k_proj':q4_config,
  'self_attn.v_proj':q4_config,
  'self_attn.o_proj':q4_config,

  'mlp.gate_proj':q3_config,
  'mlp.up_proj'  :q3_config,
  'mlp.down_proj':q3_config,
})

HQQ lib

from hqq.core.quantize import *
q4_config    = BaseQuantizeConfig(nbits=4, group_size=64, quant_zero=False, quant_scale=False) 
q3_config    = BaseQuantizeConfig(nbits=3, group_size=32, quant_zero=False, quant_scale=False)

quant_config = {'self_attn.q_proj':q4_config,
  'self_attn.k_proj':q4_config,
  'self_attn.v_proj':q4_config,
  'self_attn.o_proj':q4_config,

  'mlp.gate_proj':q3_config,
  'mlp.up_proj'  :q3_config,
  'mlp.down_proj':q3_config,
}
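The per-layer dictionary can then be passed as the quant_config when quantizing, e.g. (a sketch; model, compute_dtype and device as defined in the HQQ Lib example above):

from hqq.models.hf.base import AutoHQQHFModel

#Each layer tag picks up its dedicated config from the dictionary
AutoHQQHFModel.quantize_model(model, quant_config=quant_config, compute_dtype=compute_dtype, device=device)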

Peft Training

You can use HQQ for LoRA training as follows:

#First, quantize or load an already-quantized HQQ model (see above)
import torch
from hqq.core.quantize import HQQLinear, HQQBackend
from hqq.core.peft import PeftUtils

base_lora_params = {'lora_type':'default', 'r':32, 'lora_alpha':64, 'dropout':0.05, 'train_dtype':torch.float32}
lora_params      = {'self_attn.q_proj': base_lora_params,
                    'self_attn.k_proj': base_lora_params,
                    'self_attn.v_proj': base_lora_params,
                    'self_attn.o_proj': base_lora_params,
                    'mlp.gate_proj'   : None,
                    'mlp.up_proj'     : None,
                    'mlp.down_proj'   : None}


#Add LoRA to linear/HQQ modules
PeftUtils.add_lora(model, lora_params)

#Optional: set your backend
HQQLinear.set_backend(HQQBackend.ATEN if axis==0 else HQQBackend.PYTORCH_COMPILE)

#Train ....

#Convert LoRA weights to the same model dtype for faster inference
model.eval()
PeftUtils.cast_lora_weights(model, dtype=compute_dtype)

#Save LoRA weights
PeftUtils.save_lora_weights(model, filename)

#Load LoRA weights: automatically calls add_lora 
PeftUtils.load_lora_weights(model, filename)

We provide a complete example to train a model with HQQ/LoRA that you can find in examples/lora/train_hqq_lora_example.py.

If you want to use multi-GPU training via FSDP, check out this awesome repo by Answer.AI: https://github.com/AnswerDotAI/fsdp_qlora

Examples

We provide a variety of examples demonstrating model quantization across different backends within the examples directory.

Citation 📜

@misc{badri2023hqq,
  title  = {Half-Quadratic Quantization of Large Machine Learning Models},
  url    = {https://mobiusml.github.io/hqq_blog/},
  author = {Hicham Badri and Appu Shaji},
  month  = {November},
  year   = {2023}
}
hqq's People

Contributors

appoose, bclavie, envomp, fahadh4ilyas, keremturgutlu, liberatedwinner, markiemark, mobicham, qubitium, sonald, viraatdas, warner-benjamin


hqq's Issues

Support for Mixtral with vLLM

Hi, I'm trying your quantized Mixtral and it's amazing!! Huge congrats! This is super useful!

I've been able to try it without vLLM but it seems it's not possible to load with vLLM yet. I guess adding this support is already in your backlog; I just wanted to point out that I'm looking forward to it, and thank you for this amazing work!

Best,
Haritz

Reuse Huggingface model cache directory as standard

Since the idea is (or should be) to minimize the compute resources needed for these models, one useful optimization is to avoid unnecessarily re-downloading models from Huggingface, for instance when they have already been downloaded with the huggingface_hub API to the standard cache directory such as ~/.cache/huggingface/hub/.

diff --git a/hqq/models/base.py b/hqq/models/base.py
index e305391..2ba7fa1 100755
--- a/hqq/models/base.py
+++ b/hqq/models/base.py
@@ -282,7 +283,10 @@ class BaseHQQModel:
         save_dir = pjoin(cache_dir, save_dir_or_hub)
 
         if not os.path.exists(save_dir):
-            save_dir = snapshot_download(repo_id=save_dir_or_hub, cache_dir=cache_dir)
+            if cache_dir == "":
+                save_dir = snapshot_download(repo_id=save_dir_or_hub)
+            else:
+                save_dir = snapshot_download(repo_id=save_dir_or_hub, cache_dir=cache_dir)
             save_dir = pjoin(save_dir)
 
         # Check

This may need some adjustment depending on how you intended it to work. For instance, I think my code would not create a new copy in a directory more local to the cwd (though it should handle older files downloaded to such locations); you may want to add that to match the original functionality.

Your session crashed after using all available RAM

I am using Free Google Colab-Notebook and GPU

I want to quantize a 7B model but I can't even download it without errors. I simply pass the model id "senseable/WestLake-7B-v2", but when the download starts it occupies all the RAM. Even though I passed cuda to use the free Colab GPU, the model is still loaded into CPU RAM and I get an error that all available RAM has been used.

(screenshots attached)

When I use BitsAndBytes quantization, I simply pass a BitsAndBytesConfig to AutoModelForCausalLM and it quantizes the model while loading; with 4-bit quantization it takes about 5.5 GB of GPU memory, so the free Colab GPU is enough:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=False,
)
model = AutoModelForCausalLM.from_pretrained(
    "senseable/WestLake-7B-v2",
    device_map="auto",
    quantization_config=bnb_config,
    trust_remote_code=True,
    use_flash_attention_2=False,
    torch_dtype=torch.bfloat16,
)


Can I perform HQQ quantization using AutoModelForCausalLM? I don't want to download the full model and then quantize it with HQQ, since that may not be possible on the free Colab GPU.

How can I perform HQQ quantization while downloading/loading the model, as I did with BitsAndBytes quantization?

Initializing the model from state_dict

I would like to only modify certain layers of a nn model. Example way of doing so:

from transformers.models.mixtral.modeling_mixtral import *
from hqq.core.quantize import *

def hqq_init(self, config: MixtralConfig):
    nn.Module.__init__(self)
    self.ffn_dim = config.intermediate_size
    self.hidden_dim = config.hidden_size

    qcfg = BaseQuantizeConfig(nbits=2, group_size=64)
    self.w1 = HQQLinear(nn.Linear(self.hidden_dim, self.ffn_dim, bias=False), quant_config=qcfg)
    self.w2 = HQQLinear(nn.Linear(self.ffn_dim, self.hidden_dim, bias=False), quant_config=qcfg)
    self.w3 = HQQLinear(nn.Linear(self.hidden_dim, self.ffn_dim, bias=False), quant_config=qcfg)
    self.act_fn = ACT2FN[config.hidden_act]

MixtralBlockSparseTop2MLP.__init__ = hqq_init

^ The sparse expert layers of the MoE have a high level of redundancy, unlike the dense layers, so quantizing only them makes sense.

This means we need to convert the respective weight matrices to quantized form and load them into the model:

from safetensors import safe_open
from hqq.core.quantize import Quantizer

tensors = {}
for i in range(1, 20):
    path = MODEL_DIR + f"/models--mistralai--Mixtral-8x7B-Instruct-v0.1/snapshots/1e637f2d7cb0a9d6fb1922f305cb784995190a83/model-{i:0>{5}}-of-00019.safetensors"

    with safe_open(path, framework="pt", device="cpu") as f:
        for k in f.keys():
            tensor = f.get_tensor(k)
            if "expert" in k:
                print("quantizising:", k)
                W_q, meta = Quantizer.quantize(tensor, nbits=2, group_size=64)
                tensors[str(k).replace(".weight", ".W_q")] = W_q
                # tensors[str(k).replace(".weight", ".meta")] = meta 
                # meta is not marked as a parameter/tensor, so can't load it using nn.Module._load_from_state_dict
                # HQQLinear(nn.Module) doesn't overwrite the method either to a suitable one
            else:
                tensors[k] = tensor

config = MixtralConfig()
modified_model = MixtralForCausalLM(config=config)

modified_model.load_state_dict(tensors)

Which results in a model:

MixtralForCausalLM(
  (model): MixtralModel(
    (embed_tokens): Embedding(32000, 4096)
    (layers): ModuleList(
      (0-31): 32 x MixtralDecoderLayer(
        (self_attn): MixtralSdpaAttention(
          (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): MixtralRotaryEmbedding()
        )
        (block_sparse_moe): MixtralSparseMoeBlock(
          (gate): Linear(in_features=4096, out_features=8, bias=False)
          (experts): ModuleList(
            (0-7): 8 x MixtralBlockSparseTop2MLP(
              (w1): HQQLinear()
              (w2): HQQLinear()
              (w3): HQQLinear()
              (act_fn): SiLU()
            )
          )
        )
        (input_layernorm): MixtralRMSNorm()
        (post_attention_layernorm): MixtralRMSNorm()
      )
    )
    (norm): MixtralRMSNorm()
  )
  (lm_head): Linear(in_features=4096, out_features=32000, bias=False)
)

But the resulting model doesn't have the correct meta parameters.

I see that HQQLinear has a load_state_dict method, but it doesn't get called; by default _load_from_state_dict gets called instead. Additionally, I'm not entirely sure the weight matrices get loaded in properly either.

Respective model serialization and deserialization could be done like this instead then:

path = EXPERIMENTS_DIR + f"/models/Mixtral-8x7B-Instruct-v0.1-HQQ-2bit"
modified_model.save_pretrained(path)

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B-Instruct-v0.1", revision="1e637f2d7cb0a9d6fb1922f305cb784995190a83")
model = AutoModelForCausalLM.from_pretrained(path, torch_dtype=torch.float16).to("cuda")

Readme save_quantized issue

In the readme part:

model.save_quantized(model, save_dir=save_dir)

seems should be

model.save_quantized(save_dir=save_dir)

to avoid passing the same structure twice

Speed benchmark

Hello guys,

Congrats for the wonderful package / paper.

I am just curious, before implementing this in OpenNMT-py, whether you have some speed benchmarks in tok/sec against other methods, for a given inference framework, whether it's pure PyTorch, vLLM or whatever.

Cheers.

NB: I have mixed feelings about torch.compile because the first pass/load is often very slow.

question about the \kappa in your blog and code

Hi:

Thanks for your wonderful code and method!

I used to work on compressed sensing, but only with L1 optimization, so I'm unfamiliar with the half-quadratic method. I noticed that \kappa is set to 1.01 in your code. If I'm not mistaken, \kappa looks to me like a Laplacian-style penalty multiplier. If that is the case, I would expect \kappa to be a large parameter, so that \beta_t is guaranteed to grow towards infinity after some iterations; 1.01 seems too small for that purpose in my view. Or does \beta_t not need to go to infinity in half-quadratic optimization?

Thanks again for your code, and looking forward to your reply!

SafetensorError: Error while deserializing header: HeaderTooLarge

I'm fine-tuning Mixtral but getting:

File /opt/conda/lib/python3.10/site-packages/transformers/modeling_utils.py:4091, in PreTrainedModel._load_pretrained_model(cls, model, state_dict, loaded_keys, resolved_archive_file, pretrained_model_name_or_path, ignore_mismatched_sizes, sharded_metadata, _fast_init, low_cpu_mem_usage, device_map, offload_folder, offload_state_dict, dtype, is_quantized, keep_in_fp32_modules)
4089 if shard_file in disk_only_shard_files:
4090 continue
-> 4091 state_dict = load_state_dict(shard_file)
4093 # Mistmatched keys contains tuples key/shape1/shape2 of weights in the checkpoint that have a shape not
4094 # matching the weights in the model.
4095 mismatched_keys += _find_mismatched_keys(
4096 state_dict,
4097 model_state_dict,
(...)
4101 ignore_mismatched_sizes,
4102 )

File /opt/conda/lib/python3.10/site-packages/transformers/modeling_utils.py:503, in load_state_dict(checkpoint_file)
498 """
499 Reads a PyTorch checkpoint file, returning properly formatted errors if they arise.
500 """
501 if checkpoint_file.endswith(".safetensors") and is_safetensors_available():
502 # Check format of the archive
--> 503 with safe_open(checkpoint_file, framework="pt") as f:
504 metadata = f.metadata()
505 if metadata.get("format") not in ["pt", "tf", "flax"]:

Packing Format

Great project and clean implementation!

Wondering what the motivation is for packing 4-bit tensors (into uint8) such that the first half of the tensor is interleaved with the second half.

More specifically, each byte contains a 4-bit val from the first half of the tensor (in the upper bits) and a 4-bit val from the second half of the tensor (in the lower bits), as opposed to packing consecutive values, such that unpacking could more easily yield contiguous segments of the original tensor.
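For reference, here is a minimal sketch of the interleaved packing described above (not the library's actual implementation): two 4-bit values, one from each half of the flattened tensor, share a byte, with the first-half value stored in the upper bits.

import torch

def pack_4bit_interleaved(w):
    #w: uint8 tensor with values in [0, 15] and an even number of elements
    w  = w.flatten()
    n  = w.numel() // 2
    hi = w[:n]   #first half of the tensor  -> upper 4 bits
    lo = w[n:]   #second half of the tensor -> lower 4 bits
    return (hi << 4) | lo

def unpack_4bit_interleaved(packed):
    hi = (packed >> 4) & 0x0F   #recovers the first half
    lo = packed & 0x0F          #recovers the second half
    return torch.cat([hi, lo])

w = torch.randint(0, 16, (8,), dtype=torch.uint8)
assert torch.equal(unpack_4bit_interleaved(pack_4bit_interleaved(w)), w)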

ImportError: cannot import name 'PagedAttentionWithRoPE'

First of all, thank you for your wonderful work!

I would like to report a minor issue with library import.
Due to a refactoring of the vLLM library, HQQ's vllm engine is not working as shown below.

The latest version of vLLM is 0.2.6, and the HQQ's vllm engine only works with vllm <= 0.2.2.

I hope this helps you.

(Screenshot 2023-12-22, 4:56 PM)

Issue with HQQLinear Layer in Stable Diffusion Model on Aten Backend

Hi,

I'm currently interested in using the HQQLinear layer in the Stable Diffusion Model. To do this, I replaced all the nn.Linear layers in the transformer block with HQQLinear while leaving the normal nn.Conv2d layers unchanged.

This naive implementation works on the PyTorch backend, but I encountered the following error when using the Aten backend:

File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/conv.py", line 456, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: GET was unable to find an engine to execute this computation

I believe there might be a way to use the default PyTorch functions only on the nn.Conv2d layer, but I'm not sure how to proceed further.

Could you suggest a starting point or provide guidance on how I can resolve this issue?

Supported Model in README

Can you please update the README to specify which models you support? All LLMs from Hugging Face, or only specific architectures?
If not, will you support other models in the future?

Extended from this issue: #30

By the way, this is great library for quantizing model.

How to load quantized model with flash_attn?

Hi, I have saved a quantized model using AutoHQQHFModel.save_quantized(model, save_dir).
Now I want to load the quantized model using model = AutoHQQHFModel.from_quantized(save_dir). What should I do if I want to set parameters like low_cpu_mem_usage, attn_implementation and torch_dtype for training?

KeyError: 'self_attn.dense'


KeyError Traceback (most recent call last)
in <cell line: 31>()
29
30 #Apply LoRA
---> 31 PeftUtils.add_lora(model, lora_params)
32
33 #Dataset

1 frames
/usr/local/lib/python3.10/dist-packages/hqq/models/hf/phi.py in patch_linearlayers(cls, model, patch_fct, patch_params, verbose)
53 )
54 layers[i].self_attn.dense = patch_fct(
---> 55 layers[i].self_attn.dense, patch_params["self_attn.dense"]
56 )
57 layers[i].mlp.fc1 = patch_fct(layers[i].mlp.fc1, patch_params["mlp.fc1"])

KeyError: 'self_attn.dense'

Unable to add lora module for training

AttributeError: module 'torch' has no attribute 'compile'

I installed it on a fresh machine with:

!pip install -U torch
!pip install -U transformers
!pip install -U hqq

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Input In [1], in <cell line: 6>()
      4 model_id = 'mobiuslabsgmbh/Mixtral-8x7B-Instruct-v0.1-hf-attn-4bit-moe-2bit-metaoffload-HQQ'
      5 #Load the model
----> 6 from hqq.engine.hf import HQQModelForCausalLM, AutoTokenizer
      7 tokenizer = AutoTokenizer.from_pretrained(model_id)
      8 model     = HQQModelForCausalLM.from_quantized(model_id)

File /usr/local/lib/python3.9/dist-packages/hqq/engine/hf.py:6, in <module>
      2 from typing import Dict
      4 _HQQ_REGISTRY = {}
----> 6 from ..models.hf.llama import LlamaHQQ
      7 _HQQ_REGISTRY['LlamaForCausalLM'] = LlamaHQQ
      9 from hqq.models.hf.mixtral import MixtralHQQ

File /usr/local/lib/python3.9/dist-packages/hqq/models/hf/llama.py:1, in <module>
----> 1 from ..base import *
      2 from .base  import *
      4 #Patch LLama functions

File /usr/local/lib/python3.9/dist-packages/hqq/models/base.py:9, in <module>
      6 from abc import abstractmethod
      8 from huggingface_hub import snapshot_download
----> 9 from ..core.quantize import HQQLinear
     11 #Cleanup GPU vram
     12 def cleanup():

File /usr/local/lib/python3.9/dist-packages/hqq/core/quantize.py:246, in <module>
    243 		return grad_input, grad_weight, grad_bias
    245 #Main linear layer 
--> 246 class HQQLinear(torch.nn.Module):
    247 	backend = HQQBackend.PYTORCH #Default
    249 	def __init__(self, linear_layer, quant_config, del_orig=True, compute_dtype=torch.float16, device_n=0):

File /usr/local/lib/python3.9/dist-packages/hqq/core/quantize.py:389, in HQQLinear()
    386 	weight = self.dequantize() 
    387 	return torch.matmul(x, weight.t() if (transpose) else weight)
--> 389 @torch.compile()
    390 def matmul_compile(self, *args, **kwargs):
    391 	return self.matmul(*args, **kwargs)
    393 def forward_pytorch_backprop(self, x):

AttributeError: module 'torch' has no attribute 'compile'

This probably occurs because it pulls in an older PyTorch version (1.x), which doesn't have torch.compile.

A workaround is to install PyTorch 2.x after installing HQQ.

torch.compile fails with view_as_float=True

Seems like torch.compile doesn't like views across dtypes. This causes the PYTORCH_COMPILE backend and model = torch.compile(model) to break when view_as_float is set to True:

BackendCompilerFailed: backend='inductor' raised:
LoweringException: NotImplementedError: bitcast torch.float16 to different bitwidth type torch.uint8 is not supported yet.

Wrapping the view with torch.jit.ignore doesn't work in this case.
Minimal code to reproduce the issue:

import torch
import hqq_aten  #compiled hqq CUDA extension used by the ATEN backend
from hqq.core.quantize import *

HQQLinear.set_backend(HQQBackend.ATEN_BACKPROP)

#######################################################################################
batch_size    = 1
context_size  = 512
compute_dtype = torch.float16
linear_layer  = torch.nn.Linear(4096, 4096)

quant_config  = BaseQuantizeConfig(nbits=4, group_size=64, quant_scale=False, quant_zero=False, offload_meta=False, view_as_float=True) 
hqq_linear    = HQQLinear(linear_layer, quant_config, compute_dtype=compute_dtype, del_orig=False)

@torch.jit.ignore
def dequantize_Wq_aten(W_q, meta):
	if meta['view_as_float']: W_q = W_q.view(meta['unpack_view_dtype'])
	return hqq_aten.dequantize(W_q, meta['scale'], meta['zero'], meta['shape'], meta['group_size'] if (meta['group_size']) else -1, meta['nbits'], meta['axis'], meta['packing'])

@torch.compile()
def dequantize(hqq_layer):
	return dequantize_Wq_aten(hqq_layer.W_q, hqq_layer.meta)

######################################################################################

#This works: 
hqq_linear.W_q.data = hqq_linear.W_q.data.view(hqq_linear.meta['unpack_view_dtype']) 
W_r = dequantize(hqq_linear)

#This breaks
hqq_linear.W_q.data = hqq_linear.W_q.data.view(compute_dtype) 
W_r = dequantize(hqq_linear)

A workaround would be moving the view call outside dequantize, but this would make the code more complicated and require another call to revert back to float bit-packing.

This is mainly a Pytorch bug, so I created the issue there as well: pytorch/pytorch#120998

@KeremTurgutlu fyi

TypeError: unsupported operand type(s) for |: 'type' and 'NoneType'

Traceback (most recent call last):
File "/home/luhao/orpo/2bit-qz.py", line 1, in
from hqq.core.quantize import *
File "/root/anaconda3/envs/train/lib/python3.9/site-packages/hqq/core/quantize.py", line 8, in
from .utils import is_divisible
File "/root/anaconda3/envs/train/lib/python3.9/site-packages/hqq/core/utils.py", line 18, in
tensor: torch.Tensor, num_rows: int, dtype: torch.dtype | None = None
TypeError: unsupported operand type(s) for |: 'type' and 'NoneType'

Support dynamic quantization config to improve mixtral perplexity?

Hi! First of all, thank you for publishing the HQQ algorithm insightful blog post. It was a pleasure to tinker with an algorithm that quantizes the model in minutes instead of hours.
We'd appreciate it if you could comment on some of our observations about HQQ performance.

We tried several quantization regimes for Mixtral-8x7b-Instruct-v0.1 models and your pre-quantized checkpoints from huggingface hub. To recall, the mixtral model consists of ~45B mlp "experts" and ~1.5B other parameters (attention, gates, embeddings, logits, etc).
In your checkpoints, it appears that you are quantizing all linear layers (except logits and MoE gates) with the same quantization config, e.g. 2-bit or 4-bit.

From our experiments, it appears that the best quality-to-size trade-offs come from quantizing experts coarsely, but keeping non-expert parameters in higher precision. We tried several combinations of both HQQ and NF4 algorithms separately for experts and non-expert (attention layers) to the following results.

The table below reports the perplexity on a slice of wikitext calibration data (not the same as wiki2 test set), but it is large enough to be statistically significant.
The "Experts" column has the quantization regime for the MixtralBlockSparseTop2MLP layers.

Stem   Experts          Perplexity   Memory (GB)
fp16   nf4              4.24         25.76
fp16   hqq_4bit_g128    4.25         26.82
fp16   hqq_3bit_g128    4.51         22.59
fp16   hqq_2bit_g16_s   4.89         20.21
fp16   hqq_2bit_g128    6.87         15.54
nf4    nf4              4.29         23.75
nf4    hqq_4bit_g128    4.30         24.80
nf4    hqq_3bit_g128    4.57         20.58
nf4    hqq_2bit_g16_s   4.98         18.20
nf4    hqq_2bit_g128    7.14         13.53

Curiously, if we consider the published 2bit g16 s128 configuration, the original checkpoint scores ~5.85 on our validation split at 18.03GB model size. However, if we switch attention layers from 2-bit HQQ to NF4 (~4-bit), we observe 4.98 perplexity with about 1.5% increase in model size (18.2GB).

Does this agree or disagree with your intuition about how HQQ should work for mixtral?

If it does, would you consider switching to this hybrid coarse/fine quantization scheme for Mixtral by default or as a built-in functionality of the hqq library?
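For context, a rough way to express such a hybrid scheme with hqq's per-layer configuration (described in the README above) could look like the sketch below; the Mixtral linear-layer tags are assumed, and the higher-precision "stem" here uses 4-bit HQQ rather than NF4:

from hqq.core.quantize import BaseQuantizeConfig

attn_config   = BaseQuantizeConfig(nbits=4, group_size=64)  #keep attention layers in higher precision
expert_config = BaseQuantizeConfig(nbits=2, group_size=16)  #quantize the experts coarsely

quant_config = {
    'self_attn.q_proj': attn_config,
    'self_attn.k_proj': attn_config,
    'self_attn.v_proj': attn_config,
    'self_attn.o_proj': attn_config,
    'block_sparse_moe.experts.w1': expert_config,
    'block_sparse_moe.experts.w2': expert_config,
    'block_sparse_moe.experts.w3': expert_config,
}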

If possible, we'd also appreciate it if you could verify these results on your side, if you have time.

The code for reproducing the values in the table can be found at this link: https://gist.github.com/dvmazur/c4d32c753e06d8fc9a78606f84224176
We also include the versions of the main libraries we used in the readme of that gist.

Yours,
Artem Eliseev (myself @lavawolfiee) and Denis Mazur ( @dvmazur )

torch cpu only?

Hi,

I've tried adding a few device=torch.device('cuda' if torch.cuda.is_available() else 'cpu') replacements for device=device or device="cuda" in hqq/models/base.py or hqq/core/quantize.py, the main error I am seeing now is that the calls to self.stream_* = torch.cuda.Stream(*) in hqq/core/quantize.py:HQQLinear.__init__() error out with the Stream()'s super() call going to a dummy base class.

My guess is that the Stream has no practical meaning on torch_cpu, so it would be either necessary to persuade the torch maintainers to handle that, or (assuming that streams are just the tip of the iceberg) adapt the HQQ code to work without cuda GPU when there is only torch_cpu?

CPU-only seems a worthwhile objective, especially as the models themselves are increasingly small

Support MPS

Would it be possible to support Apple M1/M2/M3 hardware via the MPS backend for PyTorch?
I tried running things but hit some issues with hqq_aten that was not built, then I compiled the hqq_aten_torch.py variant with a little fix (it had a mistake in the setup_torch.py linking the wrong file), but it seems like this variant doesn't support MPS acceleration and so it fell back to CPU execution which wasn't really viable.

Issue with torchao patching with loaded model

Basically, when I quantize a model and patch it to use torchao_int4 ops, it works, but if I then save this model and load it again the patching fails. Am I doing something wrong? I have been trying to follow the instructions.

This works:

import torch
from hqq.engine.hf import HQQModelForCausalLM, AutoTokenizer

#Model and settings
model_id      = 'mistralai/Mixtral-8x22B-Instruct-v0.1'
compute_dtype = torch.bfloat16
device        = 'cuda:0'

#Load model on the CPU
######################
model     = HQQModelForCausalLM.from_pretrained(model_id, torch_dtype=compute_dtype)
tokenizer = AutoTokenizer.from_pretrained(model_id) 

#Quantize the model
######################
from hqq.core.quantize import *
quant_config = BaseQuantizeConfig(nbits=4, group_size=64)
model.quantize_model(quant_config=quant_config, compute_dtype=compute_dtype, device=device) 

#Save the quantized model
model.save_quantized(save_dir="./quantized_mixtral_huge_attempt2/")

#Load from local directory or Hugging Face Hub on a specific device
#model = HQQModelForCausalLM.from_quantized(save_dir_or_hfhub, device='cuda', compute_dtype=torch.bfloat16)

from hqq.utils.patching import prepare_for_inference
prepare_for_inference(model, backend="torchao_int4") #torchao's int4mm kernel, use compute_dtype=bfloat16
#prepare_for_inference(model, backend="marlin", allow_merge=True) #marlin int4 kernel.

model = torch.compile(model, mode='max-autotune')

#Text Generation
prompt = "<s> [INST] How do I build a car? [/INST] "

inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False)
outputs = model.generate(**(inputs.to('cuda')), max_new_tokens=1000)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Output:

Loading checkpoint shards: 100%|██████████| 59/59 [33:36<00:00, 34.18s/it]
100%|██████████| 56/56 [00:01<00:00, 41.41it/s]
100%|██████████| 56/56 [16:14<00:00, 17.40s/it]
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
How do I build a car? 1. Gather resources: You will need a variety of tools, materials, and equipment to build a car. This includes a chassis, engine, transmission, suspension, brakes, wheels, tires, body panels, interior components, and electrical systems. You will also need specialized tools such as welding equipment, a lift, and diagnostic tools.

... truncated

However, when I then try to load the quantized and saved model the patching step fails:

import torch
from hqq.engine.hf import HQQModelForCausalLM, AutoTokenizer

model_id      = 'mistralai/Mixtral-8x22B-Instruct-v0.1'
tokenizer = AutoTokenizer.from_pretrained(model_id) 

#Load from local directory or Hugging Face Hub on a specific device
model = HQQModelForCausalLM.from_quantized("./quantized_mixtral_huge/", device='cuda', compute_dtype=torch.bfloat16)

from hqq.utils.patching import prepare_for_inference
prepare_for_inference(model, backend="torchao_int4") #torchao's int4mm kernel, use compute_dtype=bfloat16
#prepare_for_inference(model, backend="marlin", allow_merge=True) #marlin int4 kernel.

model = torch.compile(model, mode='max-autotune')

#Text Generation
prompt = "<s> [INST] How do I build a car? [/INST] "

inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False)
outputs = model.generate(**(inputs.to('cuda')), max_new_tokens=1000)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Traceback (most recent call last):
  File "/home/rohitg/vision_llm/scratch/infer_saved_llm.py", line 12, in <module>
    prepare_for_inference(model, backend="torchao_int4") #torchao's int4mm kernel, use compute_dtype=bfloat16
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/c3-0/rohitg/mforge/envs/quantllm/lib/python3.11/site-packages/hqq/utils/patching.py", line 74, in prepare_for_inference
    patch_linearlayers(model, patch_hqq_to_aoint4)
  File "/home/c3-0/rohitg/mforge/envs/quantllm/lib/python3.11/site-packages/hqq/utils/patching.py", line 13, in patch_linearlayers
    model.base_class.patch_linearlayers(model, fct, dict([(k, patch_param) for k in model.base_class.get_linear_tags()]), verbose=verbse)
  File "/home/c3-0/rohitg/mforge/envs/quantllm/lib/python3.11/site-packages/hqq/models/hf/mixtral.py", line 53, in patch_linearlayers
    layers[i].self_attn.q_proj = patch_fct(
                                 ^^^^^^^^^^
  File "/home/c3-0/rohitg/mforge/envs/quantllm/lib/python3.11/site-packages/hqq/backends/torchao.py", line 243, in patch_hqq_to_aoint4
    w_q_config = hqq_layer.quant_config['weight_quant_params']
                 ~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^

Pack/Unpack to Different Dtypes for FSDP

Thanks for this great package!

I've noticed that the existing packing/unpacking only works with certain dtypes. FSDP requires all the params to be float dtype for sharding, so are there any plans to extend them to different dtypes?

TypeError when load from_pretrain

Hi, I met the following error when I tried to load a llama model:

Loading checkpoint shards: 100%|██████████| 59/59 [00:35<00:00,  1.66it/s]
Traceback (most recent call last):
  File "/workspace/code/quant.py", line 11, in <module>
    model     = HQQModelForCausalLM.from_pretrained(model_id, torch_dtype=compute_dtype,trust_remote_code=True)
  File "/usr/local/lib/python3.10/dist-packages/hqq/engine/hf.py", line 67, in from_pretrained
    cls._make_quantizable(model, quantized=False)
  File "/usr/local/lib/python3.10/dist-packages/hqq/engine/hf.py", line 35, in _make_quantizable
    model.arch_key = model.config.architectures[0]
TypeError: 'NoneType' object is not subscriptable

I use PyTorch==2.2.0 and Transformers==4.39.0.
How can I solve this problem? Looking forward to your reply.

An error occurred when I was training a 1-bit model using LoRA (element 0 of tensors does not require grad and does not have a grad_fn)


	model = AutoHQQHFModel.from_quantized(base_model).cuda()
	# Add Peft
	######################################################################################
	train_dtype = torch.float32
	base_lora_params = {'lora_type': 'default', 'r': 16, 'lora_alpha': 32, 'dropout': 0.05, 'train_dtype': train_dtype}
	lora_params = {'self_attn.q_proj': base_lora_params,
				   'self_attn.k_proj': base_lora_params,
				   'self_attn.v_proj': base_lora_params,
				   'self_attn.o_proj': base_lora_params,
				   'mlp.gate_proj': base_lora_params,
				   'mlp.up_proj': base_lora_params,
				   'mlp.down_proj': base_lora_params, }

	# Apply LoRA
	PeftUtils.add_lora(model, lora_params)

	data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
	trainer = transformers.Trainer(
		model=model,
		train_dataset=train_data,
		eval_dataset=val_data,
		args=transformers.TrainingArguments(
			per_device_train_batch_size=micro_batch_size,
			gradient_accumulation_steps=gradient_accumulation_steps,
			warmup_steps=0,
			num_train_epochs=num_epochs,
			learning_rate=learning_rate,
			fp16=True,
			logging_steps=50,
			optim="adamw_torch",
			gradient_checkpointing=True,
			gradient_checkpointing_kwargs={'use_reentrant': True},
			evaluation_strategy="steps" if val_set_size > 0 else "no",
			save_strategy="steps",
			eval_steps=100 if val_set_size > 0 else None,
			save_steps=200,
			output_dir=output_dir,
			save_total_limit=2,
			report_to="wandb" if use_wandb else None,
			run_name=wandb_run_name if use_wandb else None,
			remove_unused_columns=False,
			max_grad_norm=1.0,
		),
		data_collator=data_collator
	)



	signal.signal(signal.SIGINT, save_model)
	model.train()
	trainer.train()
	model.save_pretrained(output_dir)

	print(
		"\n If there's a warning about missing keys above, please disregard :)"
	)


if __name__ == "__main__":
	fire.Fire(train)

Map: 100%|██████████| 1000/1000 [00:00<00:00, 4859.96 examples/s]
100%|██████████| 258/258 [00:00<00:00, 286.99it/s]
100%|██████████| 449/449 [00:16<00:00, 27.09it/s]
100%|██████████| 1/1 [00:00<00:00, 22550.02it/s]
Using the WANDB_DISABLED environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).
/root/miniconda3/envs/train/lib/python3.10/site-packages/accelerate/accelerator.py:432: FutureWarning: Passing the following arguments to Accelerator is deprecated and will be removed in version 1.0 of Accelerate: dict_keys(['dispatch_batches', 'split_batches', 'even_batches', 'use_seedable_sampler']). Please pass an accelerate.DataLoaderConfiguration instead:
dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)
warnings.warn(
Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
0%| | 0/1 [00:00<?, ?it/s]use_cache=True is incompatible with gradient checkpointing. Setting use_cache=False.
/root/miniconda3/envs/train/lib/python3.10/site-packages/torch/utils/checkpoint.py:90: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
warnings.warn(
/root/miniconda3/envs/train/lib/python3.10/site-packages/torch/nn/parallel/_functions.py:68: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
warnings.warn('Was asked to gather along dimension 0, but all '
Traceback (most recent call last):
File "/root/autodl-tmp/ft_4bit_freedom-hqq-lora2.py", line 271, in
fire.Fire(train)
File "/root/miniconda3/envs/train/lib/python3.10/site-packages/fire/core.py", line 143, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/root/miniconda3/envs/train/lib/python3.10/site-packages/fire/core.py", line 477, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "/root/miniconda3/envs/train/lib/python3.10/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "/root/autodl-tmp/ft_4bit_freedom-hqq-lora2.py", line 262, in train
trainer.train()
File "/root/miniconda3/envs/train/lib/python3.10/site-packages/transformers/trainer.py", line 1780, in train
return inner_training_loop(
File "/root/miniconda3/envs/train/lib/python3.10/site-packages/transformers/trainer.py", line 2118, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/root/miniconda3/envs/train/lib/python3.10/site-packages/transformers/trainer.py", line 3045, in training_step
self.accelerator.backward(loss)
File "/root/miniconda3/envs/train/lib/python3.10/site-packages/accelerate/accelerator.py", line 1999, in backward
self.scaler.scale(loss).backward(**kwargs)
File "/root/miniconda3/envs/train/lib/python3.10/site-packages/torch/_tensor.py", line 522, in backward
torch.autograd.backward(
File "/root/miniconda3/envs/train/lib/python3.10/site-packages/torch/autograd/init.py", line 266, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
0%| | 0/1 [00:03<?, ?it/s]

Inference doesn't finish

Hello, I wanted to 2-bit quantize a couple of long context llama based models.
I did so for causallm-72b-35k, yi-34b-200k, giraffe-70b-32k, all with group size 16.
I've uploaded causallm:
https://huggingface.co/KnutJaegersberg/CausalLM-72B-preview-hqq-2bit

The problem is that when I use the example code, inference never finishes. My GPU gets busy, but nothing else happens. All the models I made have this issue. I should have tested first, but instead I just ran the quantization code several times.
Are these quantizations now useless?

HQQ + Brevitas

Hi everyone,

First of all, thanks for this amazing work!

I am one of the main developers of Brevitas, and I have been recently working to include this optimization as part of our library.

Here you can find the PR where I am working on it, including an extension to also optimize the scale, based on the suggestion given in this other issue, and some practical experimentation with it.

I've been doing some experiments on CNNs (mostly for the ability to quickly iterate between configurations), and I have few questions that I was hoping you could help me with:

  • With per channel quantization, HQQ (zero point) + bias correction gives me the same accuracy as just bias correction without HQQ. I assume this is more or less normal since in both cases the goal is to reduce quantization error through an additional term, but I was curious about your opinion.
  • While implementing HQQ for scale point optimization, I noticed that I had to considerably increase the value of beta, otherwise the final accuracy would drop. Apparently, even though the mean error would go down, the max error would increase way too much, hindering the quantization process. Would you have any intuition why this is the case?
  • If I try to optimize both scale and zero point at the same time (in this order), I notice a considerable drop in accuracy compared to the case where either one of them is optimized. This seems a bit strange since a similar setup for MSE seems to work just fine.

I'll be working and testing the implementation on a few more use cases, including Transformers and Stable Diffusion, and expanding to per group quantization. I'll let you know in case I have more questions.

Thanks,
Giuseppe

HQQ OOMs on large models

Hey, I have a machine with two 4090 GPUs (24 GB VRAM each). When I try to run HQQ quantization of Llama-2-70B:

import torch
from hqq.engine.hf import HQQModelForCausalLM, AutoTokenizer

#Model and settings
model_id      = 'meta-llama/Llama-2-70b-chat-hf'
compute_dtype = torch.float16
device        = 'cuda:0'

#Load model on the CPU
######################
model     = HQQModelForCausalLM.from_pretrained(model_id, torch_dtype=compute_dtype)
tokenizer = AutoTokenizer.from_pretrained(model_id) 

#Quantize the model
######################
from hqq.core.quantize import *
quant_config = BaseQuantizeConfig(nbits=4, group_size=64)
model.quantize_model(quant_config=quant_config, compute_dtype=compute_dtype, device=device) 

the first half of the layers seem to work fine, but then it OOMs, presumably because it tries to put the entire quantized model on a single GPU device. For Llama-2-70B, I could try renting an A100 machine and that should work, but for even larger models (eg. Grok-1) it would be impossible to fit the entire thing on a single GPU. Is splitting quantization across multiple GPUs supported, or planned to be supported in the future? Thanks :)

`.to` is not supported for HQQ-quantized models

I want to quantize "senseable/WestLake-7B-v2" as I do with transformers and BnB.

As you mention in #34 (comment),
I downloaded the transformers library from your repo https://github.com/mobiusml/transformers.git from the "stable" branch.
Then I installed hqq using: pip install git+https://github.com/mobiusml/hqq.git.

Then I used hqq_transformers_llama_example.py to quantize my model, but I'm facing the error ".to is not supported for HQQ-quantized models". Is it because I use cuda?

(screenshots attached)

tensorflow or keras implementation

Do you support TensorFlow or Keras models? Any pointers on how to port it to those libraries?
Also curious whether this quantization technique has been evaluated on smaller models like BERT or DLRM.

transfer learning?

Thank you for this great project.
I wanted to ask: if I quantize a model to 4 bits, for example, and then train it on new data, is it possible to merge the 4-bit model back into the full-precision model to perform knowledge transfer?

Collaboration: Unsloth + HQQ

Hey HQQ team! Happy New Year!

I actually found out about HQQ from some Reddit posts about Mixtral - and had a look at #2 which was super insightful! Quantizing the attention layers in 4bit, and the MLP layers in 2bit is ingenious!

Also I compared BnB and HQQ's quantizations and it does seem HQQ is generally always better on the reconstruction error between the quantized and non quantized weight matrices.

I forgot to intro myself, but I'm the maintainer of Unsloth. We made QLoRA 2.2x faster and use 62% less memory!

I was thinking of supporting HQQ since it looks like a fabulous replacement to BnB's nf4, and HQQ also works for 3 and 2 bit quantization with 0 data calibration - so that's super sick!

It'll be cool if we can collaborate in anyway!

torch.compile() for quantized model

Hi,

I'm encountering errors when trying to compile a custom model using torch.compile() after replacing nn.Linear() with HQQLinear. Below is a minimal code snippet that reproduces the issue, along with the error messages.

Environment:

PyTorch version: 2.3.0

Code to Reproduce Issue:

import torch
import torch.nn as nn
from hqq.core import HQQLinear, BaseQuantizeConfig

class TestModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(320, 640)
        self.fc2 = nn.Linear(640, 320)

    def forward(self, x):
        return self.fc2(self.fc1(x))

model = TestModel().to('cuda')
rand_inps = torch.randn(1, 320).to('cuda')

quant_config = BaseQuantizeConfig(
    nbits=4, group_size=32, quant_zero=False,
    quant_scale=False, offload_meta=False, axis=0
)

hqq_layer = HQQLinear(
    model.fc1, quant_config=quant_config, compute_dtype=torch.float32,
    device='cuda', initialize=True, del_orig=True
)
setattr(model, "fc1", hqq_layer)

model = torch.compile(model, mode='max-autotune', fullgraph=True)
rand_out = model(rand_inps)

Error Messages:

Unsupported: NYI - autograd.Function with custom setup_context method
from user code:
   File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)

In some model, an ArgsMismatchError occurs:

ArgsMismatchError: too many positional arguments.
  func = 'forward' /opt/conda/lib/python3.10/site-packages/hqq/core/quantize.py:254, args = [<class 'torch.autograd.function.Function'>, <class 'torch.Tensor'>, <class 'method'>, <class 'torch.nn.parameter.Parameter'>], kwargs = {}

However, when I run the examples/hf/mixtral_13GB_example.py provided in this repo, I checked that torch.compile() is working without issues. I'm not sure why my approach isn't working.

Could you advise on the correct implementation steps for compiling a quantized model?

Thank you in advance for your assistance.

load the model into GPU or device_map using HQQModelForCausalLM.from_pretrained?

Hello, it seems that HQQModelForCausalLM.from_pretrained can't load the model with a device_map or directly onto the GPU, which causes the computer to crash due to insufficient RAM. When I use the original AutoModelForCausalLM, I can pass a device_map and it offloads layers between CPU and GPU without crashing. Because of this I am unable to use this library to load large models. Is there any way to solve this? Thanks.

TypeError: HQQWrapper.from_quantized() got an unexpected keyword argument 'adapter'

hi there!

Your results seem very promising; unfortunately I was unable to get the code to work.

lambdalabs H10 instance

installed by:

pip install git+https://github.com/mobiusml/hqq.git

trying to run:

from hqq.engine.hf import HQQModelForCausalLM, AutoTokenizer

#Load the model
model_id = 'mobiuslabsgmbh/Llama-2-7b-chat-hf_2bitgs8_hqq'
model = HQQModelForCausalLM.from_quantized(model_id, adapter='adapter_v0.1.lora')
tokenizer = AutoTokenizer.from_pretrained(model_id)

#Setup Inference Mode
tokenizer.add_bos_token = False
tokenizer.add_eos_token = False
if not tokenizer.pad_token: tokenizer.add_special_tokens({'pad_token': '[PAD]'})
model.config.use_cache = True
model.eval();

the exception

TypeError Traceback (most recent call last)
/tmp/ipykernel_3024/2511671864.py in
3 #Load the model
4 model_id = 'mobiuslabsgmbh/Llama-2-7b-chat-hf_2bitgs8_hqq'
----> 5 model = HQQModelForCausalLM.from_quantized(model_id, adapter='adapter_v0.1.lora')
6 tokenizer = AutoTokenizer.from_pretrained(model_id)
7

TypeError: HQQWrapper.from_quantized() got an unexpected keyword argument 'adapter'

[Request] Remove pytorch from the project requirements

PyTorch cannot be effectively managed through a requirements.txt due to the need for specific channels.

After adding the hqq requirement to text-generation-webui, unwanted pytorch updates happened after pip install -r requirements.txt: oobabooga/text-generation-webui#4993. For this reason I had to remove hqq from the requirements.

I think that it would be best to remove PyTorch from

hqq/setup.py

Line 12 in bb422c9

install_requires=['numpy>=1.24.4','tqdm>=4.64.1', 'torch>=2.1.1', 'huggingface_hub', 'accelerate', 'timm', 'transformers>=4.36.1', 'termcolor'], #add vllm/langchain?
and just mention the minimum PyTorch version in the README, or at least not specify the minimum version.

Sample code doesn't run

I followed the installation guide using pip install hqq and then copied the code from huggingface: https://huggingface.co/mobiuslabsgmbh/Mixtral-8x7B-Instruct-v0.1-hf-2bit_g16_s128-HQQ

It gave me this error:
Traceback (most recent call last):
File "/home/alden/hqq/hqq.py", line 4, in
from hqq.engine.hf import HQQModelForCausalLM, AutoTokenizer
File "/home/alden/hqq/hqq.py", line 4, in
from hqq.engine.hf import HQQModelForCausalLM, AutoTokenizer
ModuleNotFoundError: No module named 'hqq.engine'; 'hqq' is not a package

I've tried to install it in another way which is pip install git+https://github.com/mobiusml/hqq.git, got the same result.

OS is Ubuntu 22.04, CUDA version is 12.2 and torch version is 2.2.2

Has anyone faced this and solved it? Thanks.
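
Judging only from the paths in the traceback (the failing script is itself named hqq.py), the local file may be shadowing the installed package; the note below is a guess based on that observation, not a confirmed diagnosis:

# Guess: a local script named hqq.py takes import precedence over the
# installed 'hqq' package, which is why Python reports "'hqq' is not a package".
#
#   /home/alden/hqq/hqq.py   <- script name collides with the package name
#
# Renaming the script (any other name works; 'quantize_demo.py' is just a
# placeholder) and re-running from the same directory should let the normal
# import go through:
from hqq.engine.hf import HQQModelForCausalLM, AutoTokenizer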

Load saved quant to continue training

Thanks for your efforts, looking forward to new updates.

I'm experimenting with ternary quantization and 2-bit quantization. I'm using the fine-tuning example file and saved the quant successfully into these files:
qmodel.pt
config.json

###########################################################
trainer = SFTTrainer(
    model=WrappedModel(model),
    tokenizer=tokenizer,
    max_seq_length=max_tokens,
    train_dataset=dataset,
    eval_dataset=None,
    peft_config=None,
    args=training_args,
    packing=True,
    #neftune_noise_alpha=5,
    #data_collator=data_collator,
)

model.is_parallelizable = False
trainer.is_model_parallel = False
trainer.place_model_on_device = False

#print('perplexity', compute_perplexity_batched(model=model, tokenizer=tokenizer, predictions=[s['text'] for s in dataset_val], batch_size=1, max_length=max_tokens))
model.train()
trainer.train()

save_dir = './quantized_models/mistral7b02instruct'
print(f"Saving into {save_dir}")
model.save_quantized(save_dir)

###########################################################

However, when I reload the quant to continue fine-tuning with the same arguments and the same SFTTrainer setup, I get this error:

Traceback (most recent call last):
File "/home/abc/workspace/distill.py", line 29, in
model = HQQModelForCausalLM.from_quantized(save_dir)
File "/home/abc/miniconda3/envs/unsloth_env/lib/python3.10/site-packages/hqq/engine/base.py", line 86, in from_quantized
model = cls._get_hqq_class(arch_key).from_quantized(
File "/home/abc/miniconda3/envs/unsloth_env/lib/python3.10/site-packages/hqq/models/base.py", line 364, in from_quantized
cls.patch_model(
File "/home/abc/miniconda3/envs/unsloth_env/lib/python3.10/site-packages/hqq/models/base.py", line 160, in patch_model
cls.patch_linearlayers(model, patch_linear_fct, patch_params, verbose=verbose)
File "/home/abc/miniconda3/envs/unsloth_env/lib/python3.10/site-packages/hqq/models/hf/mistral.py", line 44, in patch_linearlayers
layers[i].self_attn.q_proj = patch_fct(
File "/home/abc/miniconda3/envs/unsloth_env/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/abc/miniconda3/envs/unsloth_env/lib/python3.10/site-packages/hqq/models/base.py", line 337, in _load_module
return module.to(device=device, dtype=compute_dtype, non_blocking=True)
File "/home/abc/miniconda3/envs/unsloth_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1160, in to
return self._apply(convert)
File "/home/abc/miniconda3/envs/unsloth_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 833, in _apply
param_applied = fn(param)
File "/home/abc/miniconda3/envs/unsloth_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1158, in convert
return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
NotImplementedError: Cannot copy out of meta tensor; no data!

###########################################################

Here's the code I modified from your example; you can see how I load the model:

###########################################################
hf_auth = None #HuggingFace token
cache_path = 'cache' #cache directory to store data
device = 'cuda:0'

#Choose a model
model_id = "mistralai/Mistral-7B-Instruct-v0.2"
save_dir = './quantized_models/mistral7b02instruct'
#model_id = "meta-llama/Llama-2-13b-hf"
#model_id = "meta-llama/Llama-2-70b-hf"

#HQQ Quantize
######################################################################################
######################################################################################
from hqq.engine.hf import HQQModelForCausalLM, AutoTokenizer
import torch
from hqq.core.quantize import *
import gc
#model = HQQModelForCausalLM.from_quantized(save_dir, device='cuda')
#tokenizer = AutoTokenizer.from_pretrained(model_id, use_auth_token=hf_auth, cache_dir=cache_path)
train_dtype = torch.float32
#model = HQQModelForCausalLM.from_pretrained(model_id, torch_dtype=train_dtype)
model = HQQModelForCausalLM.from_quantized(save_dir)
tokenizer = AutoTokenizer.from_pretrained(model_id)
#quant_config = BaseQuantizeConfig(nbits=2, group_size=16, quant_scale=False, quant_zero=False)
#model.quantize_model(quant_config=quant_config)
model.to('cuda')

QuantLinear new feature requests

Hello, I am very impressed with your great work. I am not very familiar with CUDA programming. Could you please give me instructions on how to call pack_2bit_u8 from your optimized CUDA (C++) version? I just need to pack and unpack the weights, without quantizing them. Thanks!
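
For context on what such a call does conceptually, here is an independent pure-PyTorch sketch of packing 2-bit values into uint8 (four values per byte); it is not the library's CUDA kernel, and hqq's actual pack_2bit_u8 may use a different bit layout:

import torch

def pack_2bit_u8_sketch(w_q: torch.Tensor) -> torch.Tensor:
    # Pack 2-bit values (0..3) into uint8, four per byte.
    # Assumes the number of elements is a multiple of 4.
    w_q = w_q.to(torch.uint8).reshape(-1, 4)
    return (w_q[:, 0] << 6) | (w_q[:, 1] << 4) | (w_q[:, 2] << 2) | w_q[:, 3]

def unpack_2bit_u8_sketch(packed: torch.Tensor) -> torch.Tensor:
    # Inverse of the sketch above: recover the four 2-bit values from each byte.
    return torch.stack([(packed >> s) & 0x3 for s in (6, 4, 2, 0)], dim=1).reshape(-1)

# Round-trip check on toy data.
w_q = torch.randint(0, 4, (8, 16))
packed = pack_2bit_u8_sketch(w_q)
restored = unpack_2bit_u8_sketch(packed).reshape(8, 16)
assert torch.equal(restored, w_q.to(torch.uint8))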

Issues with HQQLinear compatibility when quantized from bf16.

I am trying to integrate hqq as drop-in replacement for Bitsandbytes-4bit in Huggingface's TGI.

E.g. my approach is to quantize on the fly, similar to this line in bitsandbytes.
https://github.com/huggingface/text-generation-inference/blob/630800eed37b15c4b0c9eb8e6ab47212026720f7/server/text_generation_server/utils/layers.py#L317

elif quantize.startswith("hqq"):
    HQQLinear.set_backend(HQQBackend.PYTORCH)
    layer = nn.Linear(weight.shape[1], weight.shape[0], bias=bias is not None, dtype=weight.dtype)
    with torch.no_grad():
        layer.weight.data = weight
        if bias is not None:
            layer.bias.data = bias
    linear = HQQLinear(layer, BaseQuantizeConfig(nbits=4, group_size=64, quant_zero=True, quant_scale=False), del_orig=True)

With a float16 model this works; however, with bfloat16 it fails.
Example model: TinyLlama/TinyLlama-1.1B-Chat-v0.3

Produces Error in Backend.PYTORCH

  File "../.venv/lib/python3.10/site-packages/hqq/core/quantize.py", line 217, in forward_pytorch
    out   = torch.matmul(x, W_est.t())
RuntimeError: expected mat1 and mat2 to have the same dtype, but got: c10::BFloat16 != c10::Half

Produces Error in Backend.PYTORCH_COMPILE

tgi/utils/layers.py", line 531, in forward
    rotary_emb.apply_rotary(x1, x2, cos, sin, x1, x2, False)
RuntimeError: Expected x1.dtype() == cos.dtype()

x1 is float32, while cos is bfloat16 (the same as in the unquantized layer).
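
A hedged guess at a workaround, not a confirmed fix: HQQLinear exposes a compute_dtype argument, so matching it to the source weights' dtype (instead of relying on the float16 default) may avoid the Half vs. BFloat16 mismatch. A standalone sketch under that assumption:

import torch
import torch.nn as nn
from hqq.core.quantize import HQQLinear, HQQBackend, BaseQuantizeConfig

HQQLinear.set_backend(HQQBackend.PYTORCH)

# Toy bfloat16 weight standing in for the TGI layer weight above.
weight = torch.randn(1024, 1024, dtype=torch.bfloat16)
layer = nn.Linear(weight.shape[1], weight.shape[0], bias=False, dtype=weight.dtype)
with torch.no_grad():
    layer.weight.data = weight

# Match compute_dtype to the source dtype so dequantized weights and
# activations have consistent dtypes in the matmul.
linear = HQQLinear(
    layer,
    BaseQuantizeConfig(nbits=4, group_size=64, quant_zero=True, quant_scale=False),
    compute_dtype=torch.bfloat16,
    device='cuda',
    del_orig=True,
)

x = torch.randn(2, 1024, dtype=torch.bfloat16, device='cuda')
y = linear(x)  # expected to stay in bfloat16 if the backend honors compute_dtype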

How important is the axis parameter, and QLLM support

Hi HQQ team, happy new year.
It's great work, and thanks for the insightful idea; quantization really is much faster. I installed hqq and ran the examples in minutes.

When I dove into your code I found a parameter named "axis" that indicates the direction along which compression is applied. A question came to mind: would axis=0 be better than axis=1? I checked, and it seems GPTQ and AWQ both use axis=1 by default, whereas it's 0 in HQQ.

I tried setting it to 0 and doing the quantization, and it looks good as well. So, do you have any comments on the quantization axis?
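
For concreteness, the two settings being compared can be expressed as quantization configs roughly like this; a sketch only, since in some hqq versions axis is passed directly to BaseQuantizeConfig while in others it lives inside the nested weight-quantization parameters:

from hqq.core.quantize import BaseQuantizeConfig

# axis=0 groups along the other dimension and tends to give better quality,
# especially at low bit-widths; axis=1 is what the optimized inference
# kernels currently expect.
cfg_axis0 = BaseQuantizeConfig(nbits=2, group_size=16, axis=0)
cfg_axis1 = BaseQuantizeConfig(nbits=4, group_size=64, axis=1)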

Besides, I have integrated HQQ into my general quantization toolkit QLLM, so it now works for all models on Hugging Face. At the same time, it can easily support mixed-bit quantization, which is natively supported by qllm.
