nju-jet / sr_mobile_quantization

Winner solution of mobile AI (CVPRW 2021).

super-resolution network-quantization int8-quantization residual-learning


Introduction

A winning solution for the MAI 2021 Competition (CVPR 2021 Workshop). Our model outperforms other participants by a large margin in terms of both inference speed and reconstruction performance.

Challenge report: Mobile AI 2021 Real-Time Image Super-Resolution Challenge.

Our paper: Anchor-based Plain Net for Mobile Image Super-Resolution.

Contributions for an INT8-Quantized Mobile SR Network

Investigation of meta-node latency

We conduct an experiment on meta-node latency by decomposing lightweight SR architectures, which determines the portable operations we can utilize. This step is crucial if you want to deploy your model across mobile devices.

Anchor-based residual learning

For full-integer quantization, where all weights and activations are int8, it is clearly better to learn the residual (which is always close to zero) rather than to directly map the low-resolution image to the high-resolution image. In existing methods, residual learning falls into two categories: (1) image-space residual learning, which adds the interpolated input (bilinear, bicubic) to the network output; and (2) feature-space residual learning, which adds the output of a shallow convolutional layer to the network output. For a float32 model, feature-space residual learning is slightly better (+0.08dB). For an int8 quantized model, image-space residual learning is always better (+0.3dB) because it forces the whole network to learn only subtle changes, so a set of continuous real values can be represented more accurately by a fixed discrete set of numbers. However, bilinear and nearest-neighbor resizing are slow on mobile devices because of the per-pixel multiplications required for coordinate mapping. Our anchor-based residual learning enjoys the benefits of image-space residual learning while being as fast as feature-space residual learning. The core operation is repeating the input nine times (for the x3 scale) and adding it to the features before depth-to-space. See our architecture in model.
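To make the core operation concrete, here is a minimal sketch of the anchor-based residual using dummy tensors (this is illustrative only, not the repo's actual model code; see the model definition in the repo for the real thing):

import tensorflow as tf

scale = 3
inp = tf.random.uniform([1, 64, 64, 3])                   # dummy LR input (NHWC)
feat = tf.random.uniform([1, 64, 64, 3 * scale * scale])  # dummy features before depth-to-space

anchor = tf.concat([inp] * (scale * scale), axis=-1)      # repeat the LR input nine times channel-wise (x3)
out = tf.nn.depth_to_space(feat + anchor, scale)          # residual add, then rearrange channels into the HR image
print(out.shape)                                          # (1, 192, 192, 3)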

One more convolution after deep feature extraction

After deep feature extraction, existing methods use one convolution to map the features to the original image space, followed by a depth-to-space (PixelShuffle in PyTorch) layer. We find that adding one more convolution in image space improves performance significantly more than adding one more convolution in the deep feature extraction stage (+0.11dB).
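A rough Keras sketch of the network tail with both ideas in place (layer counts and channel widths are only illustrative, loosely following the D4C28 naming; the real definition lives in the repo's model code, and this style of mixing TF ops with Keras tensors needs TF >= 2.4):

import tensorflow as tf
from tensorflow.keras import layers

scale = 3
inp = tf.keras.Input(shape=(None, None, 3))
x = layers.Conv2D(28, 3, padding='same', activation='relu')(inp)            # shallow feature extraction
for _ in range(4):                                                           # deep feature extraction (plain 3x3 convs)
    x = layers.Conv2D(28, 3, padding='same', activation='relu')(x)
x = layers.Conv2D(3 * scale ** 2, 3, padding='same', activation='relu')(x)  # map features to image space
x = layers.Conv2D(3 * scale ** 2, 3, padding='same')(x)                     # the extra conv in image space (+0.11dB)
x = x + tf.concat([inp] * (scale ** 2), axis=-1)                            # anchor-based residual
out = tf.nn.depth_to_space(x, scale)
model = tf.keras.Model(inp, out)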

Requirements

Note that the TensorFlow version matters a lot: old versions lack some layers such as depth-to-space, so make sure your TF version is at least 2.4.0. Also, only tf-nightly 2.5.0 or newer can perform arbitrary-input-shape quantization. I provide two conda environments: tf.yaml for training and tfnightly.yaml for Post-Training Quantization (PTQ) and Quantization-Aware Training (QAT). You can create the two separate conda environments with:

conda env create -f tf.yaml
conda env create -f tfnightly.yaml

Pipeline

  1. Train and validate on DIV2K. We can achieve 30.22dB with 42.54K parameters.
  2. Post-Training Quantization: after int8 quantization, PSNR drops to 30.09dB.
  3. Quantization-Aware Training: Insert fake quantization nodes during training. PSNR increases to 30.15dB, which means the model size becomes 4x smaller with only 0.07dB performance loss.

Prepare DIV2K Data

Download DIV2K and put it in the data folder. The structure should then look like:

data
└── DIV2K
    ├── DIV2K_train_HR
    │   ├── 0001.png
    │   ├── ...
    │   └── 0900.png
    └── DIV2K_train_LR_bicubic
        └── X2
            ├── 0001x2.png
            ├── ...
            └── 0900x2.png

Training

python train.py --opt options/train/base7.yaml --name base7_D4C28_bs16ps64_lr1e-3 --scale 3  --bs 16 --ps 64 --lr 1e-3 --gpu_ids 0

Note: The argument --name determines the following save paths:

  • Log file will be saved in log/{name}.log
  • Checkpoint and current best weights will be saved in experiment/{name}/best_status/
  • Training and validation visualizations will be saved in Tensorboard/{name}/

You can monitor the training and validation process with TensorBoard:

tensorboard --logdir Tensorboard

Quantization-Aware Training

If you haven't worked with TensorFlow Lite and network quantization before, please refer to the official guideline. QAT inserts fake quantization nodes during training so that the weights learn to compensate for the quantization applied at inference time. For this model, you can simply run the following script to perform QAT:

python train.py --opt options/train/base7_qat.yaml --name base7_D4C28_bs16ps64_lr1e-3_qat --scale 3  --bs 16 --ps 64 --lr 1e-3 --gpu_ids 0 --qat --qat_path experiment/base7_D4C28_bs16ps64_lr1e-3/best_status
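train.py drives QAT through the yaml options above. For readers new to the mechanism, this is roughly what fake-quantization insertion looks like with the TensorFlow Model Optimization toolkit (a generic sketch, not this repo's implementation; the checkpoint path is just the one from the earlier training run, and train_ds/val_ds stand in for your data pipeline):

import tensorflow as tf
import tensorflow_model_optimization as tfmot

# Load the float32 model trained earlier (path assumed from the Training section).
float_model = tf.keras.models.load_model('experiment/base7_D4C28_bs16ps64_lr1e-3/best_status')

# Wrap supported layers with fake-quant nodes so the weights learn to tolerate int8.
qat_model = tfmot.quantization.keras.quantize_model(float_model)
qat_model.compile(optimizer=tf.keras.optimizers.Adam(1e-4), loss='mae')
# qat_model.fit(train_ds, validation_data=val_ds, epochs=...)   # fine-tune as usual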

Convert to TFLite for running on mobile devices

python generate_tflite.py

Then the converted tflite models will be saved in TFMODEL/. TFMODEL/{name}.tflite is used for predicting the high-resolution image (an arbitrary low-resolution input shape is allowed), while TFMODEL/{name}_time.tflite fixes the model input shape to [1, 360, 640, 3] for measuring inference time.
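generate_tflite.py handles the conversion; for reference, full-integer post-training quantization with the TFLite converter generally looks like this (a sketch with assumed paths, and a dummy representative dataset standing in for real calibration images):

import numpy as np
import tensorflow as tf

def representative_dataset():
    for _ in range(5):                                         # a handful of calibration samples
        yield [np.random.uniform(0, 255, (1, 360, 640, 3)).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_saved_model('experiment/base7_D4C28_bs16ps64_lr1e-3/best_status')
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.uint8                      # full-integer input/output
converter.inference_output_type = tf.uint8
with open('TFMODEL/base7_D4C28_bs16ps64_lr1e-3_time.tflite', 'wb') as f:
    f.write(converter.convert())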

Run TFLite Model on your own devices

  1. Download AI Benchmark from Google Play / the website and run its standard tests.
  2. After the tests finish, enter PRO Mode and select the Custom Model tab there.
  3. Copy your tflite model to your device, remember its location, then run the model.
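Before pushing the model to a phone, you can sanity-check the tflite file on a desktop with the TFLite interpreter (a quick sketch assuming the fixed-shape _time model from the previous section; the desktop interpreter runs on CPU, so timings here do not reflect mobile latency):

import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path='TFMODEL/base7_D4C28_bs16ps64_lr1e-3_time.tflite')
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

dummy = np.random.randint(0, 128, size=tuple(inp['shape'])).astype(inp['dtype'])  # adjust range/type as needed
interpreter.set_tensor(inp['index'], dummy)
interpreter.invoke()
print(interpreter.get_tensor(out['index']).shape)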

Contact

:) If you have any questions, feel free to contact [email protected]

sr_mobile_quantization's Issues

During QAT training for scale = 2, do I need to change the "3" below to "2"?

I have already trained a model with scale = 2, and now I want to do QAT training for it (scale = 2) using the command "python train.py --opt options/train/base7_qat.yaml --name base7_D4C28_bs16ps64_lr12-3_qat_x2 --scale 2 --bs 16 --ps 64 --lr 1e-3 --gpu_ids 1 --qat --qat_path experiment/base7_D4C28_bs16ps64_lr12-3_x2/best_status". Do I need to change the "3" in the red box below to "2"?

[screenshot of the code in question]

Do we need to change it like below?

[screenshot of the proposed change]

Thank you very much!

How to deal with QuantizeLinear and DequantizeLinear nodes when I do quantization using openvino/tnn/mnn?

I trained an x2 model and then fine-tuned it with QAT using the command:

"python train.py --opt options/train/base7_qat.yaml --name base7_D4C28_bs16ps64_lr12-3_qat_x2 --scale 2 --bs 16 --ps 64 --lr 1e-3 --gpu_ids 1 --qat --qat_path experiment/base7_D4C28_bs16ps64_lr12-3_x2/best_status".

Then I converted it to an ONNX model with the command: "python -m tf2onnx.convert --saved-model ./experiment/base7_D4C28_bs16ps128_lr1e-3_x2_20210603/best_status --opset 13 --output ./ONNX/base7_D4C28_bs16ps128_lr1e-3_x2_20210603.onnx"

Then I opened this ONNX model in Netron:

[Netron screenshot]

I want to quantize this ONNX model with openvino/tnn/mnn. My question is: do I need to remove the QuantizeLinear and DequantizeLinear nodes in the red box first and then quantize?

Or should I just quantize, and openvino/tnn/mnn will remove them automatically?

I also checked the tflite model (from generate_tflite.py) converted to an ONNX model; it seems the quantized tflite/ONNX model still contains QuantizeLinear and DequantizeLinear nodes. Is that normal?

generate_tflite failed

When I run generate_tflite.py, I get this error:
Connected to pydev debugger (build 202.6948.78)
2021-06-03 06:13:39.240280: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-06-03 06:13:39.240442: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2021-06-03 06:13:56.370643: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcuda.so.1
2021-06-03 06:13:58.464006: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties:
pciBusID: 0000:84:00.0 name: TITAN Xp computeCapability: 6.1
coreClock: 1.582GHz coreCount: 30 deviceMemorySize: 11.91GiB deviceMemoryBandwidth: 510.07GiB/s
2021-06-03 06:13:58.464464: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-06-03 06:13:58.464742: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcublas.so.11'; dlerror: libcublas.so.11: cannot open shared object file: No such file or directory
2021-06-03 06:13:58.464960: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcublasLt.so.11'; dlerror: libcublasLt.so.11: cannot open shared object file: No such file or directory
2021-06-03 06:13:58.476443: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcufft.so.10
2021-06-03 06:13:58.477513: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcurand.so.10
2021-06-03 06:13:58.477765: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcusolver.so.11'; dlerror: libcusolver.so.11: cannot open shared object file: No such file or directory
2021-06-03 06:13:58.481692: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcusparse.so.11'; dlerror: libcusparse.so.11: cannot open shared object file: No such file or directory
2021-06-03 06:13:58.481940: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudnn.so.8'; dlerror: libcudnn.so.8: cannot open shared object file: No such file or directory
2021-06-03 06:13:58.481986: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1766] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
2021-06-03 06:13:58.507288: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-06-03 06:13:58.707850: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1258] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-06-03 06:13:58.713893: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1264]
2021-06-03 06:14:12.144968: I tensorflow/core/grappler/devices.cc:69] Number of eligible GPUs (core count >= 8, compute capability >= 0.0): 1
2021-06-03 06:14:12.146383: I tensorflow/core/grappler/clusters/single_machine.cc:357] Starting new session
2021-06-03 06:14:12.323394: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties:
pciBusID: 0000:84:00.0 name: TITAN Xp computeCapability: 6.1
coreClock: 1.582GHz coreCount: 30 deviceMemorySize: 11.91GiB deviceMemoryBandwidth: 510.07GiB/s
2021-06-03 06:14:12.323562: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1766] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
2021-06-03 06:14:16.570160: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1258] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-06-03 06:14:16.570507: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1264] 0
2021-06-03 06:14:16.570664: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 0: N
2021-06-03 06:14:16.694057: I tensorflow/core/platform/profile_utils/cpu_utils.cc:114] CPU Frequency: 2100005000 Hz
2021-06-03 06:14:16.766796: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:1144] Optimization results for grappler item: graph_to_optimize
function_optimizer: Graph size after: 188 nodes (139), 288 edges (239), time = 46.723ms.
function_optimizer: function_optimizer did nothing. time = 0.165ms.

2021-06-03 06:14:19.303393: W tensorflow/compiler/mlir/lite/python/tf_tfl_flatbuffer_helpers.cc:345] Ignored output_format.
2021-06-03 06:14:19.303491: W tensorflow/compiler/mlir/lite/python/tf_tfl_flatbuffer_helpers.cc:348] Ignored drop_control_dependency.
2021-06-03 06:14:19.410415: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:210] disabling MLIR crash reproducer, set env var MLIR_CRASH_REPRODUCER_DIRECTORY to enable.
2021-06-03 06:14:19.670119: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties:
pciBusID: 0000:84:00.0 name: TITAN Xp computeCapability: 6.1
coreClock: 1.582GHz coreCount: 30 deviceMemorySize: 11.91GiB deviceMemoryBandwidth: 510.07GiB/s
2021-06-03 06:14:19.670482: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1766] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
2021-06-03 06:14:19.670555: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1258] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-06-03 06:14:19.670631: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1264] 0
2021-06-03 06:14:19.670664: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 0: N
representative data: [0]/[5]
external/ruy/ruy/apply_multiplier.cc:52: RUY_CHECK_LE condition not satisfied: [ shift <= 7 ] with values [ 16 <= 7 ].

Process finished with exit code 134

What is the possible reason? Thank you!

I want to train the model and then get an int8 ONNX model — what should I do, step by step?

Dear @NJU-Jet:

I am confused: if I use the command "python train.py --opt options/train/base7.yaml --name base7_D4C28_bs16ps64_lr1e-3 --scale 3 --bs 16 --ps 64 --lr 1e-3 --gpu_ids 0" to train a model, is this model a float32 model or an int8 model?

I want to train the model and then get an int8 ONNX model; what should I do step by step?
Do I need to run or modify generate_tflite.py to get an int8 model and then convert it to an int8 ONNX model?
Or is the pb model produced by the training command above enough to convert to an int8 ONNX model?

Thank you very much!

It seems there is a bug when I train an x2 (scale=2) model?

Hi Dear @NJU-Jet:

I want to train a model with scale=2 (as you know, the default scale is 3).
It seems there is a bug here:
[screenshot of the code]

You can see that [inp, inp, inp, inp, inp, inp, inp, inp, inp] is for scale=3, so how do I fix it if I want to train an x2 model?

Can we fix it this way: [inp, inp, inp, inp, inp, inp, inp, inp, inp] --> [inp] x scale x scale?
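(A hedged note rather than a confirmed fix: the scale-dependent form of the repeat that the question proposes would look like this.)

anchor = tf.concat([inp] * (scale * scale), axis=-1)   # 4 copies for x2, 9 copies for x3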

Thank you very much!

how to evaluate by pretrained pb?

I tried to evaluate with the pre-trained pb, but got an error.
I rarely use TF 2.x, and reading the documentation or searching Google did not solve the problem.
Do you have any solutions? Thank you.

code example

import pickle

import numpy as np
import tensorflow as tf


def evaluate_by_pb(model_path, save_path):
    model = tf.saved_model.load(model_path)
    model = model.signatures[tf.saved_model.DEFAULT_SERVING_SIGNATURE_DEF_KEY]
    for i in range(601, 701):
        lr_path = 'data/DIV2K/DIV2K_train_LR_bicubic/X3_pt/0{}x3.pt'.format(i)
        with open(lr_path, 'rb') as f:
            lr = pickle.load(f)
        h, w, c = lr.shape
        lr = np.expand_dims(lr, 0).astype(np.float32)
        input_tensor = tf.convert_to_tensor(lr)
        print(input_tensor.shape)
        output = model(input_tensor)   # <-- this call raises the TypeError below
        print(output)

error TypeError: signature_wrapper(*, input_1) missing required arguments: input_1
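(A hedged observation based on the error message, not an official answer: concrete functions obtained from model.signatures are keyword-only, so the input likely has to be passed by the name shown in the error, e.g. output = model(input_1=input_tensor), and the result comes back as a dict keyed by the output name.)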

running time on AI Benchmark App

Hello:
I tested the running time of 'base7_D4C28_bs16ps64_lr1e-3_qat_time.tflite' via the AI Benchmark App.
My device is a Snapdragon 888 and its AI score is 54.4. The model takes about 200 ms with NNAPI. In the paper, your device is a Snapdragon 820 and it takes ~30 ms. Do you have any idea about the running time difference?

Thanks so much.

What is the difference between tf.yaml and tfnightly.yaml?

Nice job!
I am confused about the difference between tf.yaml and tfnightly.yaml. Is the key difference the TensorFlow version? If so, can I use only tfnightly.yaml to create one environment for both training and QAT?

it seems the inference is very slow on my linux server?

Hi, Dear NJU-Jet

My Linux server has several 2.6GHz CPUs and several V100s, and I ran generate_tflite.py to get a quantized model.

Then, in the evaluate function, I added the code below to measure the inference time:
[screenshot of the timing code]

It seems the inference is very slow; it takes about 70 seconds per image.

[screenshot of the timing output]

I wonder whether this inference runs on the CPU or the GPU, and why it is so slow?

thank you very much!

tensorflow version

Thanks for sharing your great work. I have a question about the TF version. Can I directly use only TF 2.5.0 instead of TF 2.4 and tf-nightly 2.5?

How to convert tflite model to onnx model?

It is good work, congratulations!

I have two questions:

  1. Which framework do you use: TensorFlow or TensorFlow Lite?
  2. If you use TensorFlow Lite, how do you convert the output tflite model to an ONNX model?

Please help... thank you very much! ^_^
