nju-jet / sr_mobile_quantization

Winner solution of mobile AI (CVPRW 2021).

super-resolution network-quantization int8-quantization residual-learning


Introduction

A winning solution for the MAI 2021 Competition (CVPR 2021 Workshop). Our model outperforms other participants by a large margin in terms of both inference speed and reconstruction performance.

Challenge report: Mobile AI 2021 Real-Time Image Super-Resolution Challenge.

Our paper: Anchor-based Plain Net for Mobile Image Super-Resolution.

Contributions for an INT8-Quantized Mobile SR Network

Investigation of meta-node latency

We conduct an experiment on meta-node latency by decomposing lightweight SR architectures, which determines the portable operations we can utilize. This step is crucial if you want to deploy your model across mobile devices.

Anchor-based residual learning

For full-integer quantization, where all weights and activations are int8, it is clearly better to learn the residual (which is always close to zero) rather than to directly map the low-resolution image to the high-resolution image. In existing methods, residual learning falls into two categories: (1) image-space residual learning, which adds the interpolated input (bilinear, bicubic) to the network output; and (2) feature-space residual learning, which adds the output of a shallow convolutional layer to the network output. For a float32 model, feature-space residual learning is slightly better (+0.08dB). For an int8 quantized model, image-space residual learning is always better (+0.3dB) because it forces the whole network to learn only subtle changes, so a set of continuous real values can be represented more accurately by a fixed discrete set of numbers. However, bilinear and nearest-neighbor resizing are slow on mobile devices because of the per-pixel multiplications required for coordinate mapping. Our anchor-based residual learning enjoys the benefits of image-space residual learning while being as fast as feature-space residual learning. The core operation is repeating the input nine times (for the x3 scale) and adding it to the features before depth-to-space. See our architecture in model.
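To make the core operation concrete, here is a minimal sketch of the anchor-based residual using dummy tensors (this is illustrative only, not the repo's actual model code; see the model definition in the repo for the real thing):

import tensorflow as tf

scale = 3
inp = tf.random.uniform([1, 64, 64, 3])                   # dummy LR input (NHWC)
feat = tf.random.uniform([1, 64, 64, 3 * scale * scale])  # dummy features before depth-to-space

anchor = tf.concat([inp] * (scale * scale), axis=-1)      # repeat the LR input nine times channel-wise (x3)
out = tf.nn.depth_to_space(feat + anchor, scale)          # residual add, then rearrange channels into the HR image
print(out.shape)                                          # (1, 192, 192, 3)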

One more convolution after deep feature extraction

After deep feature extraction, existing methods use one convolution to map the features to the original image space, followed by a depth-to-space (PixelShuffle in PyTorch) layer. We find that adding one more convolution in image space improves performance significantly more than adding one more convolution in the deep feature extraction stage (+0.11dB).
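A rough Keras sketch of the network tail with both ideas in place (layer counts and channel widths are only illustrative, loosely following the D4C28 naming; the real definition lives in the repo's model code, and this style of mixing TF ops with Keras tensors needs TF >= 2.4):

import tensorflow as tf
from tensorflow.keras import layers

scale = 3
inp = tf.keras.Input(shape=(None, None, 3))
x = layers.Conv2D(28, 3, padding='same', activation='relu')(inp)            # shallow feature extraction
for _ in range(4):                                                           # deep feature extraction (plain 3x3 convs)
    x = layers.Conv2D(28, 3, padding='same', activation='relu')(x)
x = layers.Conv2D(3 * scale ** 2, 3, padding='same', activation='relu')(x)  # map features to image space
x = layers.Conv2D(3 * scale ** 2, 3, padding='same')(x)                     # the extra conv in image space (+0.11dB)
x = x + tf.concat([inp] * (scale ** 2), axis=-1)                            # anchor-based residual
out = tf.nn.depth_to_space(x, scale)
model = tf.keras.Model(inp, out)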

Requirements

Note that the TensorFlow version matters a lot: old versions lack some layers such as depth-to-space, so make sure your TF version is at least 2.4.0. Also, only tf-nightly 2.5.0 or newer can perform arbitrary-input-shape quantization. I provide two conda environments: tf.yaml for training and tfnightly.yaml for Post-Training Quantization (PTQ) and Quantization-Aware Training (QAT). You can create the two separate conda environments with:

conda env create -f tf.yaml
conda env create -f tfnightly.yaml

Pipeline

  1. Train and validate on DIV2K. We can achieve 30.22dB with 42.54K parameters.
  2. Post-Training Quantization: after int8 quantization, PSNR drops to 30.09dB.
  3. Quantization-Aware Training: Insert fake quantization nodes during training. PSNR increases to 30.15dB, which means the model size becomes 4x smaller with only 0.07dB performance loss.

Prepare DIV2K Data

Download DIV2K and put it in the data folder. The structure should then look like:

data
└── DIV2K
    ├── DIV2K_train_HR
    │   ├── 0001.png
    │   ├── ...
    │   └── 0900.png
    └── DIV2K_train_LR_bicubic
        └── X2
            ├── 0001x2.png
            ├── ...
            └── 0900x2.png

Training

python train.py --opt options/train/base7.yaml --name base7_D4C28_bs16ps64_lr1e-3 --scale 3  --bs 16 --ps 64 --lr 1e-3 --gpu_ids 0

Note: The argument --name determines the following save paths:

  • Log file will be saved in log/{name}.log
  • Checkpoint and current best weights will be saved in experiment/{name}/best_status/
  • Training and validation visualizations will be saved in Tensorboard/{name}/

You can monitor the training and validation process with TensorBoard:

tensorboard --logdir Tensorboard

Quantization-Aware Training

If you haven't worked with TensorFlow Lite and network quantization before, please refer to the official guideline. QAT inserts fake quantization nodes during training so that the weights learn to compensate for the quantization applied at inference time. For this model, you can simply run the following script to perform QAT:

python train.py --opt options/train/base7_qat.yaml --name base7_D4C28_bs16ps64_lr1e-3_qat --scale 3  --bs 16 --ps 64 --lr 1e-3 --gpu_ids 0 --qat --qat_path experiment/base7_D4C28_bs16ps64_lr1e-3/best_status
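train.py drives QAT through the yaml options above. For readers new to the mechanism, this is roughly what fake-quantization insertion looks like with the TensorFlow Model Optimization toolkit (a generic sketch, not this repo's implementation; the checkpoint path is just the one from the earlier training run, and train_ds/val_ds stand in for your data pipeline):

import tensorflow as tf
import tensorflow_model_optimization as tfmot

# Load the float32 model trained earlier (path assumed from the Training section).
float_model = tf.keras.models.load_model('experiment/base7_D4C28_bs16ps64_lr1e-3/best_status')

# Wrap supported layers with fake-quant nodes so the weights learn to tolerate int8.
qat_model = tfmot.quantization.keras.quantize_model(float_model)
qat_model.compile(optimizer=tf.keras.optimizers.Adam(1e-4), loss='mae')
# qat_model.fit(train_ds, validation_data=val_ds, epochs=...)   # fine-tune as usual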

Convert to TFLite for running on mobile devices

python generate_tflite.py

Then the converted tflite models will be saved in TFMODEL/. TFMODEL/{name}.tflite is used for predicting the high-resolution image (an arbitrary low-resolution input shape is allowed), while TFMODEL/{name}_time.tflite fixes the model input shape to [1, 360, 640, 3] for measuring inference time.
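generate_tflite.py handles the conversion; for reference, full-integer post-training quantization with the TFLite converter generally looks like this (a sketch with assumed paths, and a dummy representative dataset standing in for real calibration images):

import numpy as np
import tensorflow as tf

def representative_dataset():
    for _ in range(5):                                         # a handful of calibration samples
        yield [np.random.uniform(0, 255, (1, 360, 640, 3)).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_saved_model('experiment/base7_D4C28_bs16ps64_lr1e-3/best_status')
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.uint8                      # full-integer input/output
converter.inference_output_type = tf.uint8
with open('TFMODEL/base7_D4C28_bs16ps64_lr1e-3_time.tflite', 'wb') as f:
    f.write(converter.convert())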

Run TFLite Model on your own devices

  1. Download AI Benchmark from Google Play / the website and run its standard tests.
  2. After the tests finish, enter PRO Mode and select the Custom Model tab there.
  3. Copy your tflite model to your device, remember its location, then run the model.
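Before pushing the model to a phone, you can sanity-check the tflite file on a desktop with the TFLite interpreter (a quick sketch assuming the fixed-shape _time model from the previous section; the desktop interpreter runs on CPU, so timings here do not reflect mobile latency):

import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path='TFMODEL/base7_D4C28_bs16ps64_lr1e-3_time.tflite')
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

dummy = np.random.randint(0, 128, size=tuple(inp['shape'])).astype(inp['dtype'])  # adjust range/type as needed
interpreter.set_tensor(inp['index'], dummy)
interpreter.invoke()
print(interpreter.get_tensor(out['index']).shape)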

Contact

:) If you have any questions, feel free to contact [email protected]

sr_mobile_quantization's Issues

During QAT training for scale = 2, do I need to change the "3" below to "2"?

I have already trained a model with scale = 2, and now I want to do QAT training for it (scale = 2) using the command "python train.py --opt options/train/base7_qat.yaml --name base7_D4C28_bs16ps64_lr12-3_qat_x2 --scale 2 --bs 16 --ps 64 --lr 1e-3 --gpu_ids 1 --qat --qat_path experiment/base7_D4C28_bs16ps64_lr12-3_x2/best_status". Do I need to change the "3" in the red box below to "2"?

[screenshot of the code in question]

Do we need to change it like below?

[screenshot of the proposed change]

Thank you very much!

How to deal with QuantizeLinear and DequantizeLinear nodes when I do quantization using openvino/tnn/mnn?

I trained an x2 model and then fine-tuned it with QAT using the command:

"python train.py --opt options/train/base7_qat.yaml --name base7_D4C28_bs16ps64_lr12-3_qat_x2 --scale 2 --bs 16 --ps 64 --lr 1e-3 --gpu_ids 1 --qat --qat_path experiment/base7_D4C28_bs16ps64_lr12-3_x2/best_status".

Then I converted it to an ONNX model with the command: "python -m tf2onnx.convert --saved-model ./experiment/base7_D4C28_bs16ps128_lr1e-3_x2_20210603/best_status --opset 13 --output ./ONNX/base7_D4C28_bs16ps128_lr1e-3_x2_20210603.onnx"

Then I opened this ONNX model in Netron:

[Netron screenshot]

I want to quantize this ONNX model with openvino/tnn/mnn. My question is: do I need to remove the QuantizeLinear and DequantizeLinear nodes in the red box first and then quantize?

Or should I just quantize, and openvino/tnn/mnn will remove them automatically?

I also checked the tflite model (from generate_tflite.py) converted to an ONNX model; it seems the quantized tflite/ONNX model still contains QuantizeLinear and DequantizeLinear nodes. Is that normal?

generate_tflite failed

When I run generate_tflite.py, I get this error:
Connected to pydev debugger (build 202.6948.78)
2021-06-03 06:13:39.240280: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-06-03 06:13:39.240442: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2021-06-03 06:13:56.370643: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcuda.so.1
2021-06-03 06:13:58.464006: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties:
pciBusID: 0000:84:00.0 name: TITAN Xp computeCapability: 6.1
coreClock: 1.582GHz coreCount: 30 deviceMemorySize: 11.91GiB deviceMemoryBandwidth: 510.07GiB/s
2021-06-03 06:13:58.464464: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-06-03 06:13:58.464742: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcublas.so.11'; dlerror: libcublas.so.11: cannot open shared object file: No such file or directory
2021-06-03 06:13:58.464960: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcublasLt.so.11'; dlerror: libcublasLt.so.11: cannot open shared object file: No such file or directory
2021-06-03 06:13:58.476443: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcufft.so.10
2021-06-03 06:13:58.477513: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcurand.so.10
2021-06-03 06:13:58.477765: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcusolver.so.11'; dlerror: libcusolver.so.11: cannot open shared object file: No such file or directory
2021-06-03 06:13:58.481692: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcusparse.so.11'; dlerror: libcusparse.so.11: cannot open shared object file: No such file or directory
2021-06-03 06:13:58.481940: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudnn.so.8'; dlerror: libcudnn.so.8: cannot open shared object file: No such file or directory
2021-06-03 06:13:58.481986: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1766] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
2021-06-03 06:13:58.507288: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-06-03 06:13:58.707850: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1258] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-06-03 06:13:58.713893: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1264]
2021-06-03 06:14:12.144968: I tensorflow/core/grappler/devices.cc:69] Number of eligible GPUs (core count >= 8, compute capability >= 0.0): 1
2021-06-03 06:14:12.146383: I tensorflow/core/grappler/clusters/single_machine.cc:357] Starting new session
2021-06-03 06:14:12.323394: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties:
pciBusID: 0000:84:00.0 name: TITAN Xp computeCapability: 6.1
coreClock: 1.582GHz coreCount: 30 deviceMemorySize: 11.91GiB deviceMemoryBandwidth: 510.07GiB/s
2021-06-03 06:14:12.323562: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1766] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
2021-06-03 06:14:16.570160: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1258] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-06-03 06:14:16.570507: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1264] 0
2021-06-03 06:14:16.570664: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 0: N
2021-06-03 06:14:16.694057: I tensorflow/core/platform/profile_utils/cpu_utils.cc:114] CPU Frequency: 2100005000 Hz
2021-06-03 06:14:16.766796: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:1144] Optimization results for grappler item: graph_to_optimize
function_optimizer: Graph size after: 188 nodes (139), 288 edges (239), time = 46.723ms.
function_optimizer: function_optimizer did nothing. time = 0.165ms.

2021-06-03 06:14:19.303393: W tensorflow/compiler/mlir/lite/python/tf_tfl_flatbuffer_helpers.cc:345] Ignored output_format.
2021-06-03 06:14:19.303491: W tensorflow/compiler/mlir/lite/python/tf_tfl_flatbuffer_helpers.cc:348] Ignored drop_control_dependency.
2021-06-03 06:14:19.410415: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:210] disabling MLIR crash reproducer, set env var MLIR_CRASH_REPRODUCER_DIRECTORY to enable.
2021-06-03 06:14:19.670119: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties:
pciBusID: 0000:84:00.0 name: TITAN Xp computeCapability: 6.1
coreClock: 1.582GHz coreCount: 30 deviceMemorySize: 11.91GiB deviceMemoryBandwidth: 510.07GiB/s
2021-06-03 06:14:19.670482: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1766] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
2021-06-03 06:14:19.670555: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1258] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-06-03 06:14:19.670631: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1264] 0
2021-06-03 06:14:19.670664: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 0: N
representative data: [0]/[5]
external/ruy/ruy/apply_multiplier.cc:52: RUY_CHECK_LE condition not satisfied: [ shift <= 7 ] with values [ 16 <= 7 ].

Process finished with exit code 134

What is the possible reason? Thank you!

I want to train the model and then get an int8 ONNX model — what should I do, step by step?

Dear @NJU-Jet:

I am confused: if I use the command "python train.py --opt options/train/base7.yaml --name base7_D4C28_bs16ps64_lr1e-3 --scale 3 --bs 16 --ps 64 --lr 1e-3 --gpu_ids 0" to train a model, is this model a float32 model or an int8 model?

I want to train the model and then get an int8 ONNX model; what should I do step by step?
Do I need to run or modify generate_tflite.py to get an int8 model and then convert it to an int8 ONNX model?
Or is the pb model produced by the training command above enough to convert to an int8 ONNX model?

Thank you very much!

It seems there is a bug when I train an x2 (scale=2) model?

Hi Dear @NJU-Jet:

I want to train a model with scale=2 (as you know, the default scale is 3).
It seems there is a bug here:
[screenshot of the code]

You can see that [inp, inp, inp, inp, inp, inp, inp, inp, inp] is for scale=3, so how do I fix it if I want to train an x2 model?

Can we fix it this way: [inp, inp, inp, inp, inp, inp, inp, inp, inp] --> [inp] x scale x scale?
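(A hedged note rather than a confirmed fix: the scale-dependent form of the repeat that the question proposes would look like this.)

anchor = tf.concat([inp] * (scale * scale), axis=-1)   # 4 copies for x2, 9 copies for x3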

Thank you very much!

how to evaluate by pretrained pb?

I tried to evaluate with the pre-trained pb, but got an error.
I rarely use TF 2.x, and reading the documentation or searching Google did not solve the problem.
Do you have any solutions? Thank you.

code example

import pickle

import numpy as np
import tensorflow as tf


def evaluate_by_pb(model_path, save_path):
    model = tf.saved_model.load(model_path)
    model = model.signatures[tf.saved_model.DEFAULT_SERVING_SIGNATURE_DEF_KEY]
    for i in range(601, 701):
        lr_path = 'data/DIV2K/DIV2K_train_LR_bicubic/X3_pt/0{}x3.pt'.format(i)
        with open(lr_path, 'rb') as f:
            lr = pickle.load(f)
        h, w, c = lr.shape
        lr = np.expand_dims(lr, 0).astype(np.float32)
        input_tensor = tf.convert_to_tensor(lr)
        print(input_tensor.shape)
        output = model(input_tensor)   # <-- this call raises the TypeError below
        print(output)

error TypeError: signature_wrapper(*, input_1) missing required arguments: input_1
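(A hedged observation based on the error message, not an official answer: concrete functions obtained from model.signatures are keyword-only, so the input likely has to be passed by the name shown in the error, e.g. output = model(input_1=input_tensor), and the result comes back as a dict keyed by the output name.)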

running time on AI Benchmark App

Hello:
I tested the running time of 'base7_D4C28_bs16ps64_lr1e-3_qat_time.tflite' via the AI Benchmark App.
My device is a Snapdragon 888 and its AI score is 54.4. The model takes about 200 ms with NNAPI. In the paper, your device is a Snapdragon 820 and it takes ~30 ms. Do you have any idea about the running time difference?

Thanks so much.

What is the difference between tf.yaml and tfnightly.yaml?

Nice job!
I am confused about the difference between tf.yaml and tfnightly.yaml. Is the key difference the TensorFlow version? If so, can I use only tfnightly.yaml to create one environment for both training and QAT?

it seems the inference is very slow on my linux server?

Hi, Dear NJU-Jet

My Linux server has several 2.6GHz CPUs and several V100s, and I ran generate_tflite.py to get a quantized model.

Then, in the evaluate function, I added the code below to measure the inference time:
[screenshot of the timing code]

It seems the inference is very slow; it takes about 70 seconds per image.

[screenshot of the timing output]

I wonder whether this inference runs on the CPU or the GPU, and why it is so slow?

thank you very much!

tensorflow version

Thanks for sharing your great work. I have a question about the TF version. Can I directly use only TF 2.5.0 instead of TF 2.4 and tf-nightly 2.5?

How to convert tflite model to onnx model?

It is good work, congratulations!

I have two questions:

  1. Which framework do you use: TensorFlow or TensorFlow Lite?
  2. If you use TensorFlow Lite, how do you convert the output tflite model to an ONNX model?

Please help... thank you very much! ^_^
