Comments (5)
Normal torch quantization works on the larger models, so anyone reading this could check that out as an alternative: https://snappishproductions.com/blog/2020/05/03/big-models-hate-this-one-weird-trick-quantization-t5--pytorch-1.4.html.html
My result was 4x smaller (with qint8) and 3x faster, so better than nothing, although I lost a little bit of accuracy.
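In case it helps anyone, here is a minimal sketch of that route with PyTorch dynamic quantization on a T5 checkpoint (the model name and the generate() smoke test are just illustrative, not taken from the blog post):

import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Load the full-precision model (any T5 size; t5-3b is only an example here).
model = T5ForConditionalGeneration.from_pretrained("t5-3b")
tokenizer = T5Tokenizer.from_pretrained("t5-3b")

# Dynamic quantization: Linear weights are stored as int8 (qint8) and
# dequantized on the fly, which is where the roughly 4x size reduction comes from.
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# Quick smoke test on CPU.
inputs = tokenizer("translate English to German: The house is wonderful.", return_tensors="pt")
outputs = quantized_model.generate(**inputs, max_length=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))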
from fastt5.
I've not tested the library with t5-11b. I'm glad that you were able to export the model by adding use_external_data_format=True. I suggest you pass the same flag when quantizing as well, and also make sure that you have enough memory.
from fastt5.
Thank you for getting back, it's highly appreciated.
I tried adding use_external_data_format=True to quantize_dynamic:
quantize_dynamic(
    model_input=model_name,
    model_output=output_model_name,
    per_channel=True,
    activation_type=QuantType.QUInt8,
    weight_type=QuantType.QUInt8,
    optimize_model=False,
    use_external_data_format=True
)  # op_types_to_quantize=['MatMul', 'Relu', 'Add', 'Mul'],
Still get the exact same error:
ValueError Traceback (most recent call last)
<ipython-input-4-032d95bca1c8> in <module>
1 os.chdir(r'/home/jupyter/models/')
----> 2 quant_model_paths = quantize(onnx_model_paths)
~/fastT5/fastT5/onnx_exporter.py in quantize(models_name_or_path)
274 weight_type=QuantType.QUInt8,
275 optimize_model=False,
--> 276 use_external_data_format=True
277 ) # op_types_to_quantize=['MatMul', 'Relu', 'Add', 'Mul' ],
278 quant_model_paths.append(output_model_name)
/opt/conda/lib/python3.7/site-packages/onnxruntime/quantization/quantize.py in quantize_dynamic(model_input, model_output, op_types_to_quantize, per_channel, reduce_range, activation_type, weight_type, nodes_to_quantize, nodes_to_exclude, optimize_model, use_external_data_format)
278 nodes_to_quantize,
279 nodes_to_exclude,
--> 280 op_types_to_quantize)
281
282 quantizer.quantize_model()
/opt/conda/lib/python3.7/site-packages/onnxruntime/quantization/onnx_quantizer.py in __init__(self, model, per_channel, reduce_range, mode, static, weight_qType, input_qType, tensors_range, nodes_to_quantize, nodes_to_exclude, op_types_to_quantize)
30
31 # run shape inference on the model
---> 32 model = onnx.shape_inference.infer_shapes(model)
33 self.value_infos = {vi.name: vi for vi in model.graph.value_info}
34 self.value_infos.update({ot.name: ot for ot in model.graph.output})
/opt/conda/lib/python3.7/site-packages/onnx/shape_inference.py in infer_shapes(model, check_type, strict_mode)
34 def infer_shapes(model, check_type=False, strict_mode=False): # type: (ModelProto, bool, bool) -> ModelProto
35 if isinstance(model, ModelProto):
---> 36 model_str = model.SerializeToString()
37 inferred_model_str = C.infer_shapes(model_str, check_type, strict_mode)
38 return onnx.load_from_string(inferred_model_str)
ValueError: Message onnx.ModelProto exceeds maximum protobuf size of 2GB: 19459248612
A bit strange though, since the documentation you sent says that setting use_external_data_format=True should solve this error...
from fastt5.
It is strange indeed! The problem seems to be in the onnxruntime library. You could follow this issue and try to solve the problem that way. If that does not help, I suggest you create a new issue in onnxruntime about it.
from fastt5.
I'm getting this same error when trying to export t5-3b. This seems to be the more relevant onnx issue: the infer_shapes method doesn't work with large models and is supposed to be replaced with infer_shapes_path, so that would need to be fixed in the onnxruntime project. I modified the code in onnx_quantizer to look like:
onnx.shape_inference.infer_shapes_path(model_name, model_name + ".inferred")
model = onnx.load(model_name + ".inferred")
while also passing a model_name into the method. With that change the code got past the shape inference step, but now fails with the following (I've put a rough sketch of the patch in context at the end of this comment, after the traceback):
Quantizing... |########## | 1/3
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-4-e72945460842> in <module>
1 # Step 2. (recommended) quantize the converted model for fast inference and to reduce model size.
----> 2 quant_model_paths = quantize(onnx_model_paths)
3
4 # step 3. setup onnx runtime
5 model_sessions = get_onnx_runtime_sessions(quant_model_paths)
~/.local/lib/python3.6/site-packages/fastT5/onnx_exporter.py in quantize(models_name_or_path)
274 weight_type=QuantType.QUInt8,
275 optimize_model=False,
--> 276 use_external_data_format=True,
277 ) # op_types_to_quantize=['MatMul', 'Relu', 'Add', 'Mul' ],
278 quant_model_paths.append(output_model_name)
~/.local/lib/python3.6/site-packages/onnxruntime/quantization/quantize.py in quantize_dynamic(model_input, model_output, op_types_to_quantize, per_channel, reduce_range, activation_type, weight_type, nodes_to_quantize, nodes_to_exclude, optimize_model, use_external_data_format)
281 op_types_to_quantize)
282
--> 283 quantizer.quantize_model()
284 quantizer.model.save_model_to_file(model_output, use_external_data_format)
285
~/.local/lib/python3.6/site-packages/onnxruntime/quantization/onnx_quantizer.py in quantize_model(self)
195 op_quantizer = CreateDefaultOpQuantizer(self, node)
196
--> 197 op_quantizer.quantize()
198
199 self._dequantize_outputs()
~/.local/lib/python3.6/site-packages/onnxruntime/quantization/operators/matmul.py in quantize(self)
17
18 (quantized_input_names, zero_point_names, scale_names, nodes) = \
---> 19 self.quantizer.quantize_inputs(node, [0, 1])
20
21 matmul_integer_output = node.output[0] + "_output_quantized"
~/.local/lib/python3.6/site-packages/onnxruntime/quantization/onnx_quantizer.py in quantize_inputs(self, node, indices, initializer_use_weight_qType)
613 if initializer is not None:
614 q_weight_name, zp_name, scale_name = self.quantize_weight(
--> 615 initializer, self.weight_qType if initializer_use_weight_qType else self.input_qType)
616
617 quantized_input_names.append(q_weight_name)
~/.local/lib/python3.6/site-packages/onnxruntime/quantization/onnx_quantizer.py in quantize_weight(self, weight, qType)
654
655 # Update packed weight, zero point, and scale initializers
--> 656 weight_data = self.tensor_proto_to_array(weight)
657 _, _, zero_point, scale, q_weight_data = quantize_data(weight_data.flatten().tolist(),
658 get_qrange_for_qType(qType, self.reduce_range), qType)
~/.local/lib/python3.6/site-packages/onnxruntime/quantization/onnx_quantizer.py in tensor_proto_to_array(initializer)
215 def tensor_proto_to_array(initializer):
216 if initializer.data_type == onnx_proto.TensorProto.FLOAT:
--> 217 weights = onnx.numpy_helper.to_array(initializer)
218 else:
219 raise ValueError('Only float type quantization is supported. Weights {} is {}. '.format(
~/.local/lib/python3.6/site-packages/onnx/numpy_helper.py in to_array(tensor)
52 return np.frombuffer(
53 tensor.raw_data,
---> 54 dtype=np_dtype).reshape(dims)
55 else:
56 data = getattr(tensor, storage_field), # type: Sequence[np.complex64]
ValueError: cannot reshape array of size 16777216 into shape (1024,4096)
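For reference, the core of my local patch as a standalone snippet (inside onnxruntime's onnx_quantizer.py it replaces the model = onnx.shape_inference.infer_shapes(model) line shown in the first traceback; infer_shapes_on_disk is just a name I'm using here for illustration):

import onnx

# infer_shapes() serializes the whole ModelProto and hits the 2GB protobuf
# limit on large models; infer_shapes_path() works on the file on disk and
# writes the inferred model to a second file instead.
def infer_shapes_on_disk(model_path):
    inferred_path = model_path + ".inferred"
    onnx.shape_inference.infer_shapes_path(model_path, inferred_path)
    return onnx.load(inferred_path)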
from fastt5.