The asr-quantization from rumeysakeskin

Speed Up Speech Recognition Inference on CPU Devices with Post-Training Quantization to Nvidia NeMo ASR Model

Model quantization is performance optimization technique that allows speeding up inference and decreasing memory requirements by performing computations and storing tensors at lower bitwidths (such as INT8 or FLOAT16) than floating-point precision. This is particularly beneficial during model deployment.

There are two model quantization methods,

Quantization Aware Training (QAT)
Post-training Quantization (PTQ)

QAT mimics the effects of quantization during training: The computations are carried-out in floating-point precision but the subsequent quantization effect is taken into account. The weights and activations are quantized into lower precision only for inference, when training is completed.

PTQ focuses on quantize the fine-tuned model without retraining. The weights and activations of ops are converted into lower precision for saving the memory and computation losses.

In this project, Post Training Static Quantization quantizes both weights and activations of the model statically. Follow below steps to aplly post training static quantization.

Pre-trained Model
Prepare
Fuse Modules
Insert Stubs and Observers
Calibration
Quantization

Prepare Quantization Backend for Hardware

PyTorch currently has two quantization backends that support quantization.

FBGEMM is specific to x86 CPUs and is intended for deployments of quantized models on server CPUs.
QNNPACK has a range of targets that includes ARM CPUs (typically found in mobile/embedded devices).

model.qconfig = torch.quantization.get_default_qconfig('qnnpack')
torch.quantization.prepare(first_asr_model, inplace=True)

Insert Stubs to the inputs and outputs

# Insert the QuantStub() before the first layer of the model
model.quant = torch.quantization.QuantStub()
model.encoder.quant = torch.quantization.QuantStub()

# Insert a DeQuantStub() at the end of the model
model.dequant = torch.quantization.DeQuantStub()
model.decoder.dequant = torch.quantization.DeQuantStub()

Reference pages:

rumeysakeskin / asr-quantization Goto Github PK

asr-quantization's Introduction

Speed Up Speech Recognition Inference on CPU Devices with Post-Training Quantization to Nvidia NeMo ASR Model

Prepare Quantization Backend for Hardware

Insert Stubs to the inputs and outputs

asr-quantization's People

Contributors

Stargazers

Watchers

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent

Jobs