Comments (7)
I've started working on making unit tests for quantize_model, and I had a few questions:
- quantize_model seems to expect a certain model. E.g., in ptq_evaluate, quantize_model is called after either preprocess_for_quantization or preprocess_for_flexml_quantize is called, depending on the quantization backend used. Should I make tests with the understanding that the model should be pre-processed accordingly, or just test quantize_model as it is with any vanilla model? If they should only ever be called together, should those functions be packaged together?
- The function doesn't work for quantizing Transformer models, because it quantizes the inputted tokens. Once those quantized (now float) tokens get to the embedding layer, it fails because of the non-integer tokens. This may be expected behavior for this function if it is built for quantizing CNN-based models, but I thought I would raise the issue. This is also an effect of the quantization of inputs being True by default in quantize_model (quantization is True for bias, input, and weight, but not for output), which is contrary to what is said to be the standard for layers in one of the tutorials:
By default weight_quant=Int8WeightPerTensorFloat, while bias_quant, input_quant and output_quant are set to None. That means that by default weights are quantized to 8-bit signed integer with a per-tensor floating-point scale factor (a very common type of quantization adopted by e.g. the ONNX standard opset), while quantization of bias, input, and output are disabled. We can easily verify all of this at runtime on an example:
This may be desired/expected behavior for this function, but I wasn't sure.
- I haven't yet found a case where the weight_bit_width or act_bit_width input variables for quantize_model have an impact on the model. I.e., if I set both equal to 4 with symmetric quantization, and feed in a strictly positive tensor to a QuantizedConv, then the input activation tensor is quantized at 128 values (i.e. int8 symmetric quantization with a strictly positive input), and the weight tensor is quantized with int8 values.
So yeah, I was wondering if this was all expected behavior. If so, I can add some appropriate documentation! If not, I can start working on "fixes".
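For reference on the last point, the arithmetic behind the "128 values" observation can be sketched as below. This is a minimal sketch assuming a standard signed two's-complement integer range; the helper names are hypothetical and are not part of Brevitas.

```python
from typing import Tuple


def signed_int_range(bit_width: int) -> Tuple[int, int]:
    """Return (min, max) of a signed two's-complement integer of this width."""
    if bit_width < 1:
        raise ValueError("bit_width must be a positive integer")
    return -(2 ** (bit_width - 1)), 2 ** (bit_width - 1) - 1


def levels_for_non_negative_input(bit_width: int) -> int:
    """Count the integer levels reachable when every input value is >= 0."""
    _, q_max = signed_int_range(bit_width)
    return q_max + 1  # the levels 0, 1, ..., q_max


for bits in (4, 8):
    print(bits, signed_int_range(bits), levels_for_non_negative_input(bits))
```

Under that assumption, a strictly positive input shows at most 8 distinct levels at 4 bits and 128 at 8 bits, so observing 128 values is consistent with 8-bit quantization still being applied.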
from brevitas.
quantize_model seems to expect a certain model
The parts of the pre-processing that might be needed are mostly the following: https://github.com/Xilinx/brevitas/blob/master/src/brevitas/graph/quantize.py#L275-L280
These are not always needed and there are cases where they can be skipped, except perhaps symbolic tracing, which is required with the FX quantization backend. Having them makes the quantization process easier.
Depending on how you were planning to write the tests, maybe you can just apply symbolic trace to obtain an FX graph, and ignore all the other ones.
If they should only ever be called together, should those functions be packaged together?
Conceptually, they do very different things. They are coupled for the sake of these examples, but there are cases where those transformations should not be applied or are not relevant for the model in question.
The function doesn't work for quantizing Transformer models
That is expected. We have a separate entrypoint for LLM quantization and we would like to unify the two at some point. To do that, first we might need tests to ensure we preserve all the correct functionalities.
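The failure mode described for Transformer inputs can be illustrated without any framework. The following is a minimal sketch using plain Python stand-ins (a list as the embedding table and a hypothetical fake_quantize helper), not Brevitas or PyTorch calls: once integer token ids pass through a quantize/dequantize step they become floats, and a table lookup that requires integer indices rejects them.

```python
def fake_quantize(value, scale, bit_width=8):
    """Quantize to a signed integer grid, then dequantize back to a float."""
    q_min = -(2 ** (bit_width - 1))
    q_max = 2 ** (bit_width - 1) - 1
    q = max(q_min, min(q_max, round(value / scale)))
    return q * scale  # the result is a float, even for integer inputs


# Stand-in for an embedding layer: one row of features per token id.
embedding_table = [[0.1 * i, 0.2 * i] for i in range(10)]

token_ids = [1, 5, 7]
quantized_tokens = [fake_quantize(t, scale=0.5) for t in token_ids]

lookup_error = None
try:
    rows = [embedding_table[t] for t in quantized_tokens]
except TypeError as exc:
    # Integer-indexed lookups reject float indices.
    lookup_error = exc

print(quantized_tokens, type(lookup_error).__name__)
```

The values survive the round trip numerically, but their type no longer fits an index-based lookup, which mirrors the embedding-layer failure reported above.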
I haven't yet found a case where weight_bit_width or act_bit_width input variables for quantize_model have an impact on the model.
Could you post an example?
from brevitas.
The parts of the pre-processing that might be needed are mostly the following: https://github.com/Xilinx/brevitas/blob/master/src/brevitas/graph/quantize.py#L275-L280
These are not always needed and there are cases where they can be skipped, except perhaps symbolic tracing, which is required with the FX quantization backend. Having them makes the quantization process easier. Depending on how you were planning to write the tests, maybe you can just apply symbolic trace to obtain an FX graph, and ignore all the other ones.
If they should only ever be called together, should those functions be packaged together?
Conceptually, they do very different things. They are coupled for the sake of these examples, but there are cases where those transformations should not be applied or are not relevant for the model in question.
Sounds good! I'll experiment a bit with the pre-processing, but will use symbolic trace as the default for now.
The function doesn't work for quantizing Transformer models
That is expected. We have a separate entrypoint for LLM quantization and we would like to unify the two at some point. To do that, first we might need tests to ensure we preserve all the correct functionalities.
That makes sense, in that case I'll delay my Transformer-based test until then.
I haven't yet found a case where weight_bit_width or act_bit_width input variables for quantize_model have an impact on the model.
Could you post an example?
I'm not entirely sure what the issue was: I was getting 8-bit quantization in every case in my larger example. I'm going to dig into it and see what my error was, and post a minimal example.
However, in the meantime, below is an example using the fx backend where all tests now pass successfully and as expected for arbitrary weight_bit_width and act_bit_width values:
import pytest
from copy import deepcopy
import torch
import torch.nn as nn
from brevitas_examples.imagenet_classification.ptq.ptq_common import quantize_model
from brevitas.quant_tensor import QuantTensor

# CONSTANTS
IMAGE_DIM = 16

##################
# EXAMPLE MODELS #
##################

@pytest.fixture
def minimal_model():
    """
    Inputs:
        Implicitly takes in a torch.Tensor, size: (batch_size, 3, height, width).
    Outputs:
        Implicitly returns a torch.Tensor, size: (batch_size, 16, height, width).
    """
    return nn.Sequential(
        nn.Conv2d(3, 16, kernel_size=3, padding=1),
        nn.ReLU(),
    )

# Unit tests
def test_quantize_model(minimal_model):
    # Tested parameters
    weight_bit_width = 3
    bias_bit_width = 16
    act_bit_width = 6
    prepared_model = torch.fx.symbolic_trace(minimal_model)
    quant_model = quantize_model(
        model=deepcopy(prepared_model),
        backend='fx',
        weight_bit_width=weight_bit_width,
        act_bit_width=act_bit_width,
        bias_bit_width=bias_bit_width,
        weight_quant_granularity='per_tensor',
        act_quant_percentile=99.9,
        act_quant_type='sym',
        scale_factor_type='float_scale',
        quant_format='int'
    )
    # Assert it is a GraphModule
    assert isinstance(quant_model, torch.fx.graph_module.GraphModule)
    # Make sure we can feed data through the model
    _ = quant_model(torch.rand(1, 3, IMAGE_DIM, IMAGE_DIM))
    # Get first layer for testing its quantization.
    # We also test we can feed data through the first layer and quant stub in isolation
    initial_quant = quant_model.get_submodule('input_1_quant')
    first_layer = quant_model.get_submodule('0')
    first_quant_input = initial_quant(torch.rand(1, 3, IMAGE_DIM, IMAGE_DIM))
    first_layer_output = first_layer(first_quant_input)
    # Assert only weight and bias are quantized by default
    assert first_layer.is_weight_quant_enabled
    assert first_layer.is_bias_quant_enabled
    assert not first_layer.is_input_quant_enabled
    assert not first_layer.is_output_quant_enabled
    # Assert quantization bit widths are as desired
    # Bias
    assert first_layer.bias_quant.tensor_quant.msb_clamp_bit_width_impl.bit_width._buffers['value'].item() == bias_bit_width
    # Weight
    assert first_layer.weight_quant.tensor_quant.msb_clamp_bit_width_impl.bit_width._buffers['value'].item() == weight_bit_width
    # Activation
    # Output of initial quant stub
    assert initial_quant.act_quant.fused_activation_quant_proxy.tensor_quant.msb_clamp_bit_width_impl.bit_width._buffers['value'].item() == act_bit_width
    assert isinstance(first_quant_input, QuantTensor)
    assert first_quant_input.bit_width.item() == act_bit_width
    # Output of Conv
    assert first_layer.output_quant._zero_hw_sentinel._buffers['value'].item() == 0  # quantization of output disabled
    assert not isinstance(first_layer_output, QuantTensor) and isinstance(first_layer_output, torch.Tensor)
from brevitas.
So the issue I ran into, where the weight_bit_width and act_bit_width values didn't seem to be used, occurred when using the layerwise backend.
Using the same minimal_model as above, the following test fails (in particular, the last 2 assertions):
def test_layerwise_quantize_model(minimal_model):
    # Tested parameters
    weight_bit_width = 3
    bias_bit_width = 16
    act_bit_width = 6
    quant_model = quantize_model(
        model=deepcopy(minimal_model),
        backend='layerwise',
        weight_bit_width=weight_bit_width,
        act_bit_width=act_bit_width,
        bias_bit_width=bias_bit_width,
        weight_quant_granularity='per_tensor',
        act_quant_percentile=99.9,
        act_quant_type='sym',
        scale_factor_type='float_scale',
        quant_format='int'
    )
    assert isinstance(quant_model, nn.Sequential)
    # Get first layer for testing its quantization.
    first_layer = quant_model.get_submodule('0')
    # Assert quantization bit widths are as desired
    # Biases
    assert first_layer.bias_quant.tensor_quant.msb_clamp_bit_width_impl.bit_width._buffers['value'].item() == bias_bit_width
    # Weights
    assert first_layer.weight_quant.tensor_quant.msb_clamp_bit_width_impl.bit_width._buffers['value'].item() == weight_bit_width
    # Activations
    assert first_layer.input_quant.fused_activation_quant_proxy.tensor_quant.msb_clamp_bit_width_impl.bit_width._buffers['value'].item() == act_bit_width
I stepped through the model, and saw that the input activation and weight tensor were being quantized to 8 bits, not the desired 6 and 3 respectively. Is this expected behavior?
from brevitas.
For layerwise quantization, there are special rules that we have for first and last layer.
In particular, there are flags that specify the activation and weight bit width of first/last, since they tend to be more susceptible to lower precision.
The way we identify first/last in that function is a bit hard-coded around the imagenet examples, where we check that the first layer has 3 input channels and the last has 1000 output channels.
If you change the number of input channels in your conv, you should see a difference.
from brevitas.
I found the function that does the 3/1000 input/output channel identification:
def layerwise_bit_width_fn(module, base_bit_width, first_last_bit_width):
    if isinstance(module, torch.nn.Conv2d) and module.in_channels == 3:
        return first_last_bit_width
    elif isinstance(module, torch.nn.Linear) and module.out_features == 1000:
        return first_last_bit_width
    else:
        return base_bit_width
I can confirm that changing the number of input channels made the quantization have the "normal" behavior. Is there any desire to un-hard-code the 3 vs 1000 channels thing? It seems a bit fragile, but I'd imagine that one would need to use FX mode to get insight into what is the first or last layer.
Alternatively, we could add a logged warning that we're defaulting to the default values if one chooses layerwise quantization, as the first/last layer being treated differently may be unexpected behavior. On that note, does Brevitas have a logger that it uses? I can see that there's a logger defined in src/brevitas_examples/bnn_pynq/logger.py.
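To make the two ideas concrete, here is a rough sketch of what a position-based rule with a logged warning could look like. It is only an illustration under stated assumptions: positional_bit_widths is a hypothetical helper, not an existing Brevitas function, it uses the standard library logging module rather than any Brevitas logger, and it assumes the caller already has the quantizable layer names in execution order (which, for non-sequential models, is exactly the information an FX trace would be needed for).

```python
import logging
from typing import Dict, Sequence

logger = logging.getLogger(__name__)


def positional_bit_widths(
        layer_names: Sequence[str],
        base_bit_width: int,
        first_last_bit_width: int) -> Dict[str, int]:
    """Assign bit widths by position in the list instead of by layer shape."""
    if not layer_names:
        return {}
    bit_widths = {name: base_bit_width for name in layer_names}
    first, last = layer_names[0], layer_names[-1]
    bit_widths[first] = first_last_bit_width
    bit_widths[last] = first_last_bit_width
    if first_last_bit_width != base_bit_width:
        # Tell the user that first/last layers are treated differently.
        logger.warning(
            "Layers %r and %r use bit width %d instead of %d",
            first, last, first_last_bit_width, base_bit_width)
    return bit_widths


print(positional_bit_widths(['conv1', 'conv2', 'fc'], 4, 8))
```

With a single-layer model the same layer is both first and last, so it would still receive the first/last bit width, but at least the warning would make that visible.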
I've opened up a PR with some preliminary tests, and I will be adding more tests (e.g. whatever tests you guys want!). The ones that are currently failing are in some cases when I give invalid inputs (e.g. when I give zero-valued or negative-valued bit widths), where quantize_model does not throw an error.
from brevitas.
I've done a bit of testing, and negative/zero bit widths are considered valid if the model isn't used for anything. E.g.
quant_model = quantize_model(
    model=fx_model,
    backend='fx',
    weight_bit_width=0,  # NOTE: this is considered valid, which may be an issue
    act_bit_width=0,
    bias_bit_width=32,
    weight_quant_granularity='per_tensor',
    act_quant_percentile=99.9,
    act_quant_type='sym',
    scale_factor_type='float_scale',
    quant_format='int',
)
first_conv_layer = quant_model.get_submodule('0')
print(first_conv_layer.weight_quant.tensor_quant.msb_clamp_bit_width_impl.bit_width._buffers['value'].item())
>> 0.0
If one feeds data through the model with zero bit-widths, it outputs NaNs. However, if one feeds data through a model with negative bit widths, it still outputs values.
I'm digging into this a bit more because I'm curious as to what's happening inside the model, but in either case I would imagine we should add some asserts to make sure all provided bit widths are positive integers. I opened a PR for it. I'm not sure if the integer constraint I added is desired behavior, or if I should add it to other functions as well.
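For discussion, the kind of check I have in mind looks roughly like the sketch below. The helper name validate_bit_widths is hypothetical and this is not the code in the PR; it also assumes every bit width must be a plain Python int, which may be too strict if some arguments are allowed to be None or float.

```python
def validate_bit_widths(**bit_widths):
    """Raise if any named bit width is not a positive integer."""
    for name, value in bit_widths.items():
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            raise TypeError(
                f"{name} must be an integer, got {type(value).__name__}")
        if value <= 0:
            raise ValueError(f"{name} must be positive, got {value}")


# Valid settings pass silently.
validate_bit_widths(weight_bit_width=3, act_bit_width=6, bias_bit_width=16)

# Zero or negative settings are rejected before any model is built.
try:
    validate_bit_widths(weight_bit_width=0, act_bit_width=6)
except ValueError as exc:
    print(exc)
```

Calling something like this at the top of quantize_model would turn the silent NaN outputs into an immediate error.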
from brevitas.