merrymercy / tvm-mali

Optimizing Mobile Deep Learning on ARM GPU with TVM

Home Page: http://tvmlang.org/2018/01/16/opt-mali-gpu.html

License: MIT License

Languages: Python 26.04%, C++ 26.94%, C 45.71%, Shell 1.31%
Topics: opencl, tvm, arm, mali, deep-learning

tvm-mali's Introduction

Note: The data and scripts here are all stale. Please go to https://github.com/dmlc/tvm/wiki/Benchmark#mobile-gpu for the latest results.

Benchmarking Deep Neural Networks on ARM CPU/GPU

This repo contains the supporting material for the blog post Optimizing Mobile Deep Learning on ARM GPU with TVM.

Inference Speed on ImageNet

Tested on

Firefly-RK3399 4G, CPU: dual-core Cortex-A72 + quad-core Cortex-A53, GPU: Mali-T860MP4
Arm Compute Library: v17.12,  MXNet: v1.0.1,  OpenBLAS: v0.2.18

[Figure: benchmark results chart]

Set Test Environment

sudo /etc/init.d/lightdm stop
sudo -i
echo performance > /sys/class/misc/mali0/device/devfreq/ff9a0000.gpu/governor

Stopping the display manager and pinning the Mali GPU's frequency governor to performance makes the test environment more stable.
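To sanity-check the setting, you can read the governor back; the sysfs path below is copied from the commands above and is board-specific:

# Confirm the Mali devfreq governor is now 'performance'.
with open('/sys/class/misc/mali0/device/devfreq/ff9a0000.gpu/governor') as f:
    print(f.read().strip())  # expected output: performance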

Note: You need more than 2.5 GB of memory to run the following test. Otherwise, you must skip the vgg16 test by replacing --model all with --model resnet18 or --model mobilenet in the command.

Run Test for TVM/NNVM

In TVM, we use RPC to run the tests, so you should build the TVM runtime and start an RPC server on your device.

python -m tvm.exec.rpc_server --host 0.0.0.0 --port=9090

Then, on your host machine, run the test command

python mali_imagenet_bench.py --target-host TARGET_HOST --host HOST --port PORT --model all

Replace TARGET_HOST, HOST, and PORT with the corresponding values for your environment.

For example, on my Firefly-RK3399, the command is

python mali_imagenet_bench.py --target-host 'llvm -target=aarch64-linux-gnu -mattr=+neon' --host 10.42.0.96 --port 9090 --model all
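For reference, here is a minimal sketch of what the host side does, assuming the NNVM/TVM 0.x APIs used by this repo's scripts; the net.tar file name and the resnet18 workload are illustrative, and mali_imagenet_bench.py is the authoritative version:

import numpy as np
import nnvm.compiler
import nnvm.testing
import tvm
from tvm.contrib import rpc, util, graph_runtime

# Build for the Mali GPU, with the host-side code targeting aarch64.
net, params = nnvm.testing.resnet.get_workload(batch_size=1, num_layers=18)
graph, lib, params = nnvm.compiler.build(
    net, tvm.target.mali(), shape={'data': (1, 3, 224, 224)},
    params=params, target_host='llvm -target=aarch64-linux-gnu -mattr=+neon')

# Ship the compiled library to the device over RPC and load it there.
tmp = util.tempdir()
lib.export_library(tmp.relpath('net.tar'))
remote = rpc.connect('10.42.0.96', 9090)    # the device running rpc_server
remote.upload(tmp.relpath('net.tar'))
rlib = remote.load_module('net.tar')
ctx = remote.cl(0)                          # Mali GPU via OpenCL

# Run on the device; timing repeated runs gives the numbers reported below.
module = graph_runtime.create(graph, rlib, ctx)
module.set_input('data', tvm.nd.array(
    np.random.uniform(size=(1, 3, 224, 224)).astype('float32')))
module.set_input(**params)
module.run()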

Run Test for MXNet + OpenBLAS

This test runs locally on your device, so you need to install MXNet with OpenBLAS on the device first.

python mxnet_test.py --model all
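A rough sketch of such a local timing loop is below; the checkpoint prefix resnet18 is hypothetical, and the real mxnet_test.py may load models differently:

import time
import mxnet as mx

# Load a pretrained symbol + params checkpoint (hypothetical file prefix).
sym, arg_params, aux_params = mx.model.load_checkpoint('resnet18', 0)
mod = mx.mod.Module(symbol=sym, context=mx.cpu(), label_names=None)
mod.bind(data_shapes=[('data', (1, 3, 224, 224))], for_training=False)
mod.set_params(arg_params, aux_params)

batch = mx.io.DataBatch([mx.nd.zeros((1, 3, 224, 224))])
for _ in range(5):                  # warm up
    mod.forward(batch, is_train=False)
    mod.get_outputs()[0].wait_to_read()

n, tic = 20, time.time()
for _ in range(n):                  # timed runs
    mod.forward(batch, is_train=False)
    mod.get_outputs()[0].wait_to_read()
print('cost per image: %.4fs' % ((time.time() - tic) / n))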

Run Test for Arm Compute Library

Build ACL by cross-compiling on the host system.

scons Werror=1 neon=1 opencl=1 examples=1 benchmark_tests=1 os=linux arch=arm64-v8a embed_kernels=1 -j$(nproc)

Copy acl_test.cc to the root directory of ACL and build acl_test with

aarch64-linux-gnu-g++ acl_test.cc build/utils/*.o -O2 -std=c++11 \
    -I. -Iinclude -Lbuild -Lbuild/opencl-1.2-stubs/ \
    -larm_compute -larm_compute_graph -larm_compute_core -lOpenCL -o acl_test

Copy the binary acl_test to your device and run

./acl_test all
cat result-acl.txt

The results are recorded in result-acl.txt.
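Each line of that file is a flat list of key: value pairs (see the ACL results below), so it is easy to post-process; a small convenience sketch, not part of the repo:

import re

# Parse lines like:
# "backend: cl  model: vgg16  conv_method: gemm  dtype: float32  cost: 1.64456"
with open('result-acl.txt') as f:
    for line in f:
        fields = dict(re.findall(r'(\w+):\s*(\S+)', line))
        if fields:
            print(fields['model'], fields['conv_method'],
                  fields['dtype'], float(fields['cost']))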

Note: Some test cases (e.g. resnet) are missing because Arm Compute Library (v17.12) does not currently support skip connections in its graph runtime. Some other test cases are skipped because they run too slowly.

Result

The outputs from my board are pasted below.

TVM/NNVM

============================================================
model: vgg16, dtype: float32
warm up..
test..
cost per image: 1.2926s
============================================================
model: vgg16, dtype: float16
warm up..
test..
cost per image: 0.6896s
============================================================
model: resnet18, dtype: float32
warm up..
test..
cost per image: 0.2041s
============================================================
model: resnet18, dtype: float16
warm up..
test..
cost per image: 0.1183s
============================================================
model: mobilenet, dtype: float32
warm up..
test..
cost per image: 0.0767s
============================================================
model: mobilenet, dtype: float16
warm up..
test..
cost per image: 0.0479s

MXNet + OpenBLAS

============================================================
model: vgg16, dtype: float32
warm up...
test..
cost per image: 3.0250s
============================================================
model: resnet18, dtype: float32
warm up...
test..
cost per image: 0.3977s
============================================================
model: mobilenet, dtype: float32
warm up...
test..
cost per image: 0.2914s

ACL

backend: cl    model: vgg16      conv_method: gemm     dtype: float32   cost: 1.64456
backend: cl    model: vgg16      conv_method: gemm     dtype: float16   cost: 0.969372
backend: cl    model: vgg16      conv_method: direct   dtype: float32   cost: 3.90031
backend: cl    model: vgg16      conv_method: direct   dtype: float16   cost: 1.61179
backend: cl    model: mobilenet  conv_method: gemm     dtype: float32   cost: 0.170934
backend: cl    model: mobilenet  conv_method: direct   dtype: float32   cost: 0.173883
backend: neon  model: vgg16      conv_method: gemm     dtype: float32   cost: 4.10269
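A quick comparison of the float32 numbers above, computed directly from the pasted results (a small convenience script, not part of the repo):

# Speedup of TVM/NNVM float32 over the two baselines, from the numbers above.
tvm_s   = {'vgg16': 1.2926, 'resnet18': 0.2041, 'mobilenet': 0.0767}
mxnet_s = {'vgg16': 3.0250, 'resnet18': 0.3977, 'mobilenet': 0.2914}
acl_s   = {'vgg16': 1.64456, 'mobilenet': 0.170934}  # best OpenCL float32 runs

for m in tvm_s:
    line = '%-10s %.2fx vs MXNet+OpenBLAS' % (m, mxnet_s[m] / tvm_s[m])
    if m in acl_s:
        line += ', %.2fx vs ACL' % (acl_s[m] / tvm_s[m])
    print(line)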

tvm-mali's People

Contributors

merrymercy

tvm-mali's Issues

Does TVM support RK3288?

Hi, I saw you ran the benchmark on the RK3399. The RK3288 has a Mali-T764 GPU. We tried it, but OpenCL has some problem recognizing the GPU. Do you have any idea whether TVM supports the RK3288?

about winograd transform matrix

Hi @merrymercy,

In conv2d.py, the Winograd algorithm applies the G and B transforms as normal matrix multiplications, so it does not replace multiplications with additions/subtractions. Is this understanding correct?

Thanks
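For background on the question above: in the standard F(2x2, 3x3) Winograd matrices of Lavin and Gray, B^T contains only 0 and ±1 entries, so the input transform B^T d B can in principle be computed with additions and subtractions alone; performing it as a generic matmul forgoes that saving. A small illustration (not from the repo):

import numpy as np

# Standard F(2x2, 3x3) input transform matrix (Lavin & Gray); entries are
# only 0 and +/-1, so applying it needs no true multiplications.
Bt = np.array([[1,  0, -1,  0],
               [0,  1,  1,  0],
               [0, -1,  1,  0],
               [0,  1,  0, -1]], dtype=np.float32)

d = np.random.rand(4, 4).astype(np.float32)  # one 4x4 input tile
V = Bt @ d @ Bt.T                            # B^T d B, the input transform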

about tune

@merrymercy

Is there any tuning guide?

Which parameters can be tuned? Why is num_thread set to 8?

Thanks

Encounter an error with different network model

I imported YOLOv2_tiny.onnx from the link below using Mali on a Firefly-RK3399 and got an error in nnvm.compiler.build. It looks like something is wrong with the conv2d variants that the Mali backend supports. Any idea?

https://github.com/tkat0/chainer-nnvm-example

Traceback (most recent call last):
  File "mali_imagenet_bench.py", line 102, in <module>
    run_case('tinyYolo2', 'float32')
  File "mali_imagenet_bench.py", line 42, in run_case
    graph, lib, params = nnvm.compiler.build(net, tvm.target.mali(), shape={input_name: data_shape}, params=params, dtype=dtype, target_host=args.target_host)
  File "/usr/local/lib/python2.7/dist-packages/nnvm-0.8.0-py2.7.egg/nnvm/compiler/build_module.py", line 251, in build
    graph = graph.apply("GraphFusePartition").apply("GraphFuseCompile")
  File "/usr/local/lib/python2.7/dist-packages/nnvm-0.8.0-py2.7.egg/nnvm/graph.py", line 235, in apply
    check_call(_LIB.NNGraphApplyPasses(self.handle, npass, cpass, ctypes.byref(ghandle)))
  File "/usr/local/lib/python2.7/dist-packages/nnvm-0.8.0-py2.7.egg/nnvm/_base.py", line 72, in check_call
    raise NNVMError(py_str(_LIB.NNGetLastError()))
nnvm._base.NNVMError: TVMCall CFunc Error:
Traceback (most recent call last):
  File "tvm/_ffi/_cython/function.pxi", line 39, in core.tvm_callback (tvm/_ffi/_cython/core.cpp:3206)
  File "/usr/local/lib/python2.7/dist-packages/nnvm-0.8.0-py2.7.egg/nnvm/top/nn.py", line 123, in compute_conv2d
    out = topi.nn.conv2d(inputs[0], inputs[1], strides, padding)
  File "<string>", line 2, in conv2d
  File "/usr/local/lib/python2.7/dist-packages/tvm-0.1.0-py2.7-linux-x86_64.egg/tvm/target.py", line 222, in dispatch_func
    return dispatch_dict[k](*args, **kwargs)
  File "build/bdist.linux-x86_64/egg/topi/mali/conv2d.py", line 100, in decl_conv2d
    return _decl_direct(data, kernel, stride, padding, layout, out_dtype)
  File "build/bdist.linux-x86_64/egg/topi/mali/conv2d.py", line 184, in _decl_direct
    assert OW % VW == 0, "OW: %d VW : %d" % (OW, VW)
AssertionError: OW: 22 VW : 4

Thanks,

about winograd batched MM performance

Hi @merrymercy,
I am working on Winograd on CUDA.
I found that the batched MM in your Winograd schedule is slow on the NVIDIA architecture. I guess this is because, when C is large, it cannot exploit the parallelism of the GPU.

Do you have any idea about this part?

Thanks
