
Comments (146)

naibaf7 avatar naibaf7 commented on May 26, 2024 2

@sh1r0
It should be possible to get BVLC/caffe#2610 to work on Android. It could probably be done by replacing the Caffe used in this project with the https://github.com/naibaf7/caffe branch, plus some minor adaptations/fixes.

from caffe-android-lib.

zif520 avatar zif520 commented on May 26, 2024 1

@sambookhon sorry for the wait :)
Use clBLAS 2.4; it supports OpenCL 1.1.
1. First build it on Ubuntu with "cmake ..; make; make install". Set BUILD_TEST, BUILD_PERFORMANCE, BUILD_SAMPLE, BUILD_CLIENT and BUILD_KTEST to OFF so those targets are not built, and set( Boost_USE_MULTITHREADED OFF ), set( Boost_* OFF ), i.e. all Boost options OFF.
2. Then build it with the NDK, removing -m${TARGET_PLATFORM}.
3. Delete set(TIME_LIBRARY "rt") in \clblas\library\tools\tone\cmakelists.txt, since Android doesn't support -lrt.

That's all; then it should build :) Good luck.

from caffe-android-lib.

sh1r0 avatar sh1r0 commented on May 26, 2024

I would say that it's possible, but I'm not sure when. Currently, if you are interested in caffe w/ OpenCL support, you can refer to BVLC/caffe#2610.

from caffe-android-lib.

sh1r0 avatar sh1r0 commented on May 26, 2024

@naibaf7
👍 But I took a look at your branch and found there are too many commits for it to be applied like a patch on top of the upstream master branch. Would you be willing to rebase your branch?

from caffe-android-lib.

naibaf7 avatar naibaf7 commented on May 26, 2024

@sh1r0
Yes, and I guess you would need to use 32-bit indexing (pointers) instead of 64-bit indexing for Android devices.
What requirements would you have to be able to integrate this?

from caffe-android-lib.

sh1r0 avatar sh1r0 commented on May 26, 2024

@naibaf7
Yes, I guess so. 😛
I think a branch rebased onto the latest master (BVLC/caffe@03a84bf) should be enough for me to run some trials.
Thanks.

from caffe-android-lib.

zif520 avatar zif520 commented on May 26, 2024

@naibaf7 @sh1r0
Hi, I've got it running by porting naibaf7/caffe into caffe-android-lib. It works well on the CPU using EIGEN,
but it fails in greentea_memset() in GPU mode (Mali T880, OpenCL 1.1, 32-bit indexing).
The failure happens when viennacl::ocl::enqueue() is called. I am not familiar with OpenCL, so I will learn some of it and try to fix the problem later.
Could you give me some suggestions? Thanks!

from caffe-android-lib.

naibaf7 avatar naibaf7 commented on May 26, 2024

@zif520
Did you change both int_tp and int_tpc to 32-bit types, in both the OpenCL and the C++ parts of the code?

https://github.com/naibaf7/caffe/blob/master/include/caffe/definitions.hpp
and
https://github.com/naibaf7/caffe/blob/master/src/caffe/greentea/cl_headers/header.cl

However, it might break if you just change it, so I'll verify and fix that.
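For reference, a minimal sketch of what the 32-bit switch might look like; the exact contents of the two linked files may differ, so treat the type names here as assumptions to check against the branch:

    // include/caffe/definitions.hpp (hypothetical excerpt):
    // switch the host-side index types from 64-bit to 32-bit.
    #include <cstdint>
    typedef int32_t int_tp;     // assumed original: typedef int64_t int_tp;
    typedef uint32_t uint_tp;   // assumed original: typedef uint64_t uint_tp;

    // src/caffe/greentea/cl_headers/header.cl (hypothetical excerpt):
    // the device-side kernel types must match the host-side ones, e.g.
    // #define int_tp  int    (instead of long)
    // #define int_tpc int    (instead of long)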

I have a phone with an Adreno 330 GPU that should also be OpenCL ready... might try to fix it up myself :)... the great news is that OpenCL-Z (from PlayStore) reports a valid libOpenCL.so version 1.2 on that one!

from caffe-android-lib.

zif520 avatar zif520 commented on May 26, 2024

@naibaf7
There are still some troubles in it; I will spend a few days fixing them.
Also, amd/OpenCL-caffe#17 says clBLAS 2.4 supports OpenCL 1.1, so I will try that as well.
OpenCL-Z reports that my phone only supports OpenCL 1.1.

from caffe-android-lib.

naibaf7 avatar naibaf7 commented on May 26, 2024

@zif520
I am currently making my branch ready for 32-bit indexing again, so that both 64-bit and 32-bit indexing work. Then it should be able to compile and run on Android devices with OpenCL 1.1.

It is not necessary to compile and use clBLAS with my branch; ViennaCL comes with a built-in BLAS that should work on mobile devices with OpenCL 1.1.

Can you share what you have done so far (adaptations, code, ...)? That would speed up the process.

from caffe-android-lib.

zif520 avatar zif520 commented on May 26, 2024

@naibaf7
Yes! I will share it when it is completed. Running Caffe (MXNet and so on) on phones is popular; many people want to do that.

from caffe-android-lib.

naibaf7 avatar naibaf7 commented on May 26, 2024

@sh1r0
I currently don't have the time for a complete rebase - this has to wait a bit.

@zif520
What's the progress? Is it working with my latest updates?

from caffe-android-lib.

zif520 avatar zif520 commented on May 26, 2024

@naibaf7
I am sorry, I've gone home for the New Year; I will be back on 2016-01-04.

from caffe-android-lib.

sh1r0 avatar sh1r0 commented on May 26, 2024

@naibaf7
OK, that's fine. I tried to merge my branch onto yours for some early-stage trials. To see my progress, you can take a look at opencl_dev.

And there are some issues I found in my tests:

  • The CPU does not work as an OpenCL device (runtime error)
  • Running on the GPU is about 5 times slower than pure CPU mode (CPU_ONLY with OpenBLAS)

Note: my test device has a Qualcomm Snapdragon 801 (Krait 400 CPU and Adreno 330 GPU) with OpenCL 1.2 support.

I'm not quite sure if I'm missing anything I need to take care of, as I'm not familiar with OpenCL. :p

Thanks.

from caffe-android-lib.

bhack avatar bhack commented on May 26, 2024

@sh1r0 I don't know how well the AMD clBLAS or ViennaCL backends are optimized for this kind of device. Qualcomm has its own Snapdragon-optimized BLAS implementation, but it is still CPU-only.

from caffe-android-lib.

naibaf7 avatar naibaf7 commented on May 26, 2024

@sh1r0
Ok cool, at least you got it working!

Now, what is the runtime error you get when using the CPU as an OpenCL device? I use a CPU BLAS with CPU devices instead of ViennaCL-BLAS or clBLAS, so that might cause issues here.

As for performance, it should definitely not be that slow. But to identify the culprit, I'd need some layer-wise timings to see what exactly runs slowly. That may be something I can also look at, as I have an Adreno 330 as well.
Do you know how to do that quickly?

When you enumerate OpenCL devices, is the order the same as in OpenCL-Z?
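For reference, a minimal standalone sketch that enumerates OpenCL platforms and devices with the plain OpenCL C API (independent of Caffe); its output can be compared against what OpenCL-Z shows:

    // Minimal OpenCL platform/device enumeration (plain OpenCL C API).
    // On Android, build against the vendor's libOpenCL.so and link -lOpenCL.
    #include <CL/cl.h>
    #include <cstdio>
    #include <vector>

    int main() {
      cl_uint num_platforms = 0;
      clGetPlatformIDs(0, nullptr, &num_platforms);
      std::vector<cl_platform_id> platforms(num_platforms);
      clGetPlatformIDs(num_platforms, platforms.data(), nullptr);

      for (cl_uint p = 0; p < num_platforms; ++p) {
        char pname[256] = {0};
        clGetPlatformInfo(platforms[p], CL_PLATFORM_NAME, sizeof(pname), pname, nullptr);
        cl_uint num_devices = 0;
        clGetDeviceIDs(platforms[p], CL_DEVICE_TYPE_ALL, 0, nullptr, &num_devices);
        std::vector<cl_device_id> devices(num_devices);
        clGetDeviceIDs(platforms[p], CL_DEVICE_TYPE_ALL, num_devices, devices.data(), nullptr);
        for (cl_uint d = 0; d < num_devices; ++d) {
          char dname[256] = {0};
          clGetDeviceInfo(devices[d], CL_DEVICE_NAME, sizeof(dname), dname, nullptr);
          std::printf("platform %u (%s), device %u: %s\n", p, pname, d, dname);
        }
      }
      return 0;
    }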

from caffe-android-lib.

sh1r0 avatar sh1r0 commented on May 26, 2024

@naibaf7
Yes, it's really cool to have OpenCL working.

Sorry, I'm not sure what the problem might be, as I just got a segmentation fault when specifying CPU as the target device.

To get layer-wise timings, I think tools/caffe time is a good choice. However, with the OpenCL build, I failed to get any executable to run successfully on Android. I got ViennaCL: FATAL ERROR: Could not find kernel 'fillbuffer_float' from program '' for classification (cpp_classification), and CANNOT LINK EXECUTABLE: cannot locate symbol "_ZN5caffe3NetIfEC1ERKSsNS_5PhaseEPKS1_" referenced by "./caffe"... for caffe. That's weird.
EDIT: For caffe, I got a segmentation fault.

Yes, the order is consistent with that in OpenCL-Z.

from caffe-android-lib.

naibaf7 avatar naibaf7 commented on May 26, 2024

@sh1r0
Ok thanks, I'll try to work out what's going wrong.

Might it be that the binaries do not call set_mode and SetDevice properly? ViennaCL: FATAL ERROR: Could not find kernel 'fillbuffer_float' from program '' would imply the OpenCL kernels weren't compiled.

from caffe-android-lib.

zif520 avatar zif520 commented on May 26, 2024

@sh1r0 You can refer to @naibaf7's code in caffe.cpp:test(); adding SetDevices() will fix it. You have to initialize the devices first.

from caffe-android-lib.

naibaf7 avatar naibaf7 commented on May 26, 2024

@zif520
Yes. Here it is also important to mention that the device must be set before any solver or network is loaded. Knowledge of which device will be used is ultimately required to compile kernels, allocate memory and dispatch network/solver initialization.

It is even possible to have multiple devices work on multiple networks in parallel, but then the rules are as follows (see the sketch after the list):

  • Caffe must be initialized with SetDevices on the main thread, providing a complete list of the devices to be used.
  • SelectDevice can be used to switch the device. When initializing networks on the main thread, select the device before creating a network or solver on that device.
  • The networks can be trained in parallel by using multiple host threads. In every thread, SelectDevice can switch to a different device. This selection will be thread local.
  • This threading feature should also work when being used in Android AsyncTasks, Java Threads or in Python Multithreading (without getting into GIL locks), making it very convenient to use.
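A minimal sketch of that initialization order, using the call names mentioned in this thread (SetDevices, set_mode, SelectDevice); the exact headers and signatures in the opencl branch may differ slightly, so treat this as an outline rather than verbatim API:

    // Hypothetical outline of the rules above; signatures may differ slightly.
    #include <memory>
    #include <string>
    #include <vector>
    #include "caffe/caffe.hpp"

    void init_and_train(const std::string& solver_proto) {
      // 1) On the main thread: register ALL devices that will ever be used.
      std::vector<int> devices = {0, 1};
      caffe::Caffe::set_mode(caffe::Caffe::GPU);
      caffe::Caffe::SetDevices(devices);

      // 2) Select the device BEFORE constructing any Net or Solver on it.
      //    The selection is thread-local, so worker threads can pick others.
      caffe::Caffe::SelectDevice(devices[0], false);

      // 3) Only now load the solver/network: kernel compilation and memory
      //    allocation happen on the currently selected device.
      caffe::SolverParameter param;
      caffe::ReadSolverParamsFromTextFileOrDie(solver_proto, &param);
      std::shared_ptr<caffe::Solver<float> > solver(
          caffe::SolverRegistry<float>::CreateSolver(param));
      solver->Solve();
    }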

from caffe-android-lib.

sh1r0 avatar sh1r0 commented on May 26, 2024

@zif520
Thanks, I've got the CPU working as an OpenCL device. (I used SetDevice only before.)
But there might be other issues in tools/caffe, as it still does not work.

@naibaf7
I got some benchmark results; please refer to the link. time.cpp is basically caffe time. The number of iterations is 10 for CPU mode and 1 for GPU mode (as it takes ~6 minutes for a single iteration).
I found that there is little difference between using the CPU and the GPU as the OpenCL device. As for forward timings, GPU mode (OpenCL) is ~70x slower than CPU mode.

from caffe-android-lib.

naibaf7 avatar naibaf7 commented on May 26, 2024

@sh1r0
I think now you benchmarked the OpenCL GPU twice:

    Caffe::SetDevices(gpus);
    Caffe::set_mode(Caffe::GPU);
    Caffe::SetDevice(gpus[0]);

should be either:

    Caffe::set_mode(Caffe::GPU);
    Caffe::SetDevice(gpus[0]);

or:

    Caffe::set_mode(Caffe::GPU);
    Caffe::SetDevices(gpus);
    Caffe::SelectDevice(gpus[0], false);

Besides, the ViennaCL GEMM used for convolution seems really unsuitable for the Adreno GPU. I don't know of any BLAS that is optimized for mobile GPUs. Better performance could probably be reached by implementing a simple direct convolution instead of using an explicit GEMM at all.
Maybe @karlrupp has an idea on this.

from caffe-android-lib.

bhack avatar bhack commented on May 26, 2024

@naibaf7 A tuning issue on Adreno was opened at clMathLibraries/clBLAS#136

from caffe-android-lib.

naibaf7 avatar naibaf7 commented on May 26, 2024

@bhack
Thanks, good to know. However, ViennaCL-BLAS (which is what we are currently using in this Android OpenCL experiment) seems to have optimization/tuning issues here as well.
It is a bit unfortunate, since NVIDIA has a well-optimized cuBLAS for most devices, while other vendors have basically nothing to offer (yet).

from caffe-android-lib.

bhack avatar bhack commented on May 26, 2024

@naibaf7 Have you experimented with https://github.com/ptillet/isaac? It could probably be an alternative path if clBLAS continues to attract no contributions from other vendors. /cc @ptillet

from caffe-android-lib.

bhack avatar bhack commented on May 26, 2024

Also, Google's https://github.com/halide/Halide/tree/master/apps/linear_algebra could be benchmarked on Android.

from caffe-android-lib.

naibaf7 avatar naibaf7 commented on May 26, 2024

@bhack @zif520 @sh1r0
I added ISAAC compile support to the CMake and GNU Make builds on my branch, if anyone fancies a try. It did not give a speedup on my GT650 or Intel HD4000. Maybe it can work on mobile.

from caffe-android-lib.

bhack avatar bhack commented on May 26, 2024

@ptillet What is the status?

from caffe-android-lib.

karlrupp avatar karlrupp commented on May 26, 2024

@naibaf7 We had a detailed look at mobile GPUs over the summer. Our findings were fairly disappointing: Even with extensive code generation and autotuning we could not get anywhere close to peak (exception: NVIDIA Tegra). Even synthetic FLOP-intensive code did not work well, indicating that OpenCL compiler support still needs much more love.

from caffe-android-lib.

naibaf7 avatar naibaf7 commented on May 26, 2024

@karlrupp
Thanks for the clarification. Even though that is not good news, it indicates that the fault lies with the vendors' compilers/OpenCL libraries and not with our code.

It is very disappointing indeed; I thought by now there would be more effort and interest from hardware vendors in having solutions ready to compete against NVIDIA.

from caffe-android-lib.

sh1r0 avatar sh1r0 commented on May 26, 2024

@naibaf7
Oh, I thought a Net instance runs on the device specified in its constructor, so I did it this way:

    Net<float> caffe_net(FLAGS_model, caffe::TRAIN, Caffe::GetDevice(device_id, true));

I would like to benchmark the CPU (device 1) as an OpenCL device, and I tried both

    Caffe::set_mode(Caffe::GPU);
    Caffe::SetDevice(gpus[1]);

and

    Caffe::set_mode(Caffe::GPU);
    Caffe::SetDevices(gpus);
    Caffe::SelectDevice(gpus[1], false);

with

    Net<float> caffe_net(FLAGS_model, caffe::TRAIN, Caffe::GetDefaultDevice());

I got a segmentation fault at runtime in both cases. :(


To all:
It's really disappointing to know that.

from caffe-android-lib.

zif520 avatar zif520 commented on May 26, 2024

@naibaf7
I have tested your newest code; the 64-bit problem is fixed. My phone is a HUAWEI Mate 8 with a Mali T880.
It spends 1120 ms in GPU mode but 500 ms in CPU mode (with OpenBLAS). I will test it with ISAAC later.

I also found that the first forward pass is very slow.

from caffe-android-lib.

naibaf7 avatar naibaf7 commented on May 26, 2024

@zif520 @sh1r0

Yes, the first pass will be extra slow due to memory allocation and kernel compilation.

from caffe-android-lib.

bhack avatar bhack commented on May 26, 2024

In ISAAC there is a subdirectory tune/android, but from the .ini files it seems to me that it currently covers only Intel on Android.

from caffe-android-lib.

zif520 avatar zif520 commented on May 26, 2024

@naibaf7 @sh1r0
I tested on Ubuntu with clBLAS 2.4; the GPU is an NVIDIA GTX 780, running 1000 iterations on MNIST:
ViennaCL only: 108 s
with clBLAS: 86 s

clBLAS 2.8 only supports OpenCL 1.2, so I tested with clBLAS 2.4 first, and then did what @hughperkins suggested in
amd/OpenCL-caffe#17

I will move it to Android later.

from caffe-android-lib.

mkaskov avatar mkaskov commented on May 26, 2024

Maybe try to use RenderScript? Why OpenCL?

from caffe-android-lib.

bhack avatar bhack commented on May 26, 2024

@mkaskov Have you seen Halide BLAS?

from caffe-android-lib.

naibaf7 avatar naibaf7 commented on May 26, 2024

@mkaskov
Well, that would require writing a new backend.
Also: http://stackoverflow.com/questions/14385843/why-did-google-choose-renderscript-instead-of-opencl

In theory, they should be able to perform equally well, given there is a reasonably optimized BLAS, and a good OpenCL compiler on that platform.

from caffe-android-lib.

zif520 avatar zif520 commented on May 26, 2024

Hi @naibaf7,
you said that "the first pass will be extra slow due to memory allocation and kernel compilation."
We have a Caffe-based app running on my phone; it costs 3.4 s for the first pass and 1.7 s afterwards, using VGGNet.

But our code spends 8 s on the first pass and 800 ms afterwards. Is there any optimization to reduce the first-pass time?
I am not familiar with OpenCL :)

from caffe-android-lib.

naibaf7 avatar naibaf7 commented on May 26, 2024

@zif520
I am a bit confused about what you mean...
In general, no: memory allocation and kernel compilation cannot be done faster.

In test mode, though, the memory footprint can be reduced, which will also speed things up.

from caffe-android-lib.

zif520 avatar zif520 commented on May 26, 2024

@naibaf7
I don't know whether the kernels can be built offline,
as described here:
https://www.fixstars.com/en/opencl/book/OpenCLProgrammingBook/online-offline-compilation/
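For reference, the usual way to get that effect is to build the program once, save the program binary, and reload it with clCreateProgramWithBinary on later runs. A minimal sketch with the plain OpenCL C API (independent of what Caffe/ViennaCL do internally; error handling omitted):

    // Sketch: cache a compiled OpenCL program and reload it later, so the
    // expensive clBuildProgram step only happens on the first run.
    #include <CL/cl.h>
    #include <vector>

    std::vector<unsigned char> build_and_dump(cl_context ctx, cl_device_id dev,
                                              const char* source) {
      cl_program prog = clCreateProgramWithSource(ctx, 1, &source, nullptr, nullptr);
      clBuildProgram(prog, 1, &dev, "", nullptr, nullptr);  // slow, first run only

      size_t size = 0;
      clGetProgramInfo(prog, CL_PROGRAM_BINARY_SIZES, sizeof(size), &size, nullptr);
      std::vector<unsigned char> binary(size);
      unsigned char* ptr = binary.data();
      clGetProgramInfo(prog, CL_PROGRAM_BINARIES, sizeof(ptr), &ptr, nullptr);
      clReleaseProgram(prog);
      return binary;  // persist this to a file on the device
    }

    cl_program load_from_binary(cl_context ctx, cl_device_id dev,
                                const std::vector<unsigned char>& binary) {
      const unsigned char* ptr = binary.data();
      size_t size = binary.size();
      cl_program prog = clCreateProgramWithBinary(ctx, 1, &dev, &size, &ptr,
                                                  nullptr, nullptr);
      clBuildProgram(prog, 1, &dev, "", nullptr, nullptr);  // fast: just finalizes
      return prog;
    }

Note that such binaries are specific to the device and driver version, so a cached binary should be discarded when either changes.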

from caffe-android-lib.

bhack avatar bhack commented on May 26, 2024

It seems that an official RenderScript-optimized version of BLAS was added in Android API level 23.

from caffe-android-lib.

naibaf7 avatar naibaf7 commented on May 26, 2024

We could probably add a RenderScript backend to Caffe once I've completed the device/backend abstraction... replicating the math_functions "middleware" should be fairly quick; then all that would be left to do is add the custom kernels and kernel launch support.

If anyone wants to do that approach, just contact me.

from caffe-android-lib.

bhack avatar bhack commented on May 26, 2024

Investing time in the Caffe core is actually quite risky right now.

from caffe-android-lib.

naibaf7 avatar naibaf7 commented on May 26, 2024

@bhack
Would you elaborate on this?

from caffe-android-lib.

bhack avatar bhack commented on May 26, 2024

@naibaf7 I don't know if you have direct contact with BVLC members, but it seems to me that they are completely out of resources for handling the project and community. What are the prospects? Code fragmentation with hundreds of forks? I think we have waited through enough conference deadlines to see a clear roadmap: BVLC/caffe#2313

from caffe-android-lib.

naibaf7 avatar naibaf7 commented on May 26, 2024

@bhack
I've tried to contact them a few times. It is currently indeed very hard to get everyone who should be collaborating together. Tomorrow I'll have a conference call with Intel's Beignet maintainer/coordinator, and I'm also in contact with AMD's Junli Gu about proceeding on OpenCL. Evan Shelhamer didn't get back to me after his last comments on BVLC/caffe#2610.

However, I'm still not discouraged. If hardware developers want to collaborate and optimize for the devices, I'll collaborate happily. New backends are also welcome.

from caffe-android-lib.

bhack avatar bhack commented on May 26, 2024

@naibaf7 Also, the absence of a reply to https://github.com/Yangqing/caffe2/issues/22 leaves me even more in the fog about the framework's future prospects, and is a further disincentive to invest in Caffe.

from caffe-android-lib.

bhack avatar bhack commented on May 26, 2024

As I supposed: tensorflow/tensorflow#663 (comment)

from caffe-android-lib.

zif520 avatar zif520 commented on May 26, 2024

It is great to hear that; we will try to optimize OpenCL first :)

from caffe-android-lib.

bhack avatar bhack commented on May 26, 2024

@naibaf7 http://www.embedded-vision.com/industry-analysis/technical-articles/caffe-deep-learning-framework-interview-core-developers

from caffe-android-lib.

zif520 avatar zif520 commented on May 26, 2024

@naibaf7 @bhack @sh1r0
clBLAS 2.4 is useful on the phone: it costs only 300 ms on AlexNet, against 800 ms without clBLAS.

from caffe-android-lib.

sambookhon avatar sambookhon commented on May 26, 2024

@zif520
I don't know whether this is the right place to ask, but could you share the install instructions for clBLAS? I cannot get it installed for the Mali-T628. Although I searched the Internet, I didn't find useful information. When running cmake, I encounter:

  1. error: unrecognized command line option '-m32'
  2. CMakeFiles/Makefile2:109: recipe for target 'library/CMakeFiles/clBLAS.dir/all' failed

I plan to run Caffe on an ARM Mali GPU as you did. If you can share some information with me, that would be great. Thanks.

from caffe-android-lib.

naibaf7 avatar naibaf7 commented on May 26, 2024

@zif520 @sambookhon @sh1r0
Please use the following branch for OpenCL from now on:
https://github.com/BVLC/caffe/tree/opencl

from caffe-android-lib.

zif520 avatar zif520 commented on May 26, 2024

@sambookhon
1. Delete the option '-m32'; Android doesn't support it.
2. I did not encounter that error; could you share more information?

I'm going home now for Chinese New Year; I will get back to you with my cmake details. :)

from caffe-android-lib.

zif520 avatar zif520 commented on May 26, 2024

@naibaf7
It is great to hear that! 👍
I am learning some OpenCL, from "OpenCL in Action" and so on; perhaps I can also help out with that branch later :)

from caffe-android-lib.

sambookhon avatar sambookhon commented on May 26, 2024

@zif520
Thanks for sharing. Should I post my information here (I am afraid of cluttering up this thread) or send it by email (my email is "fishfrank23" "@" "gmail.com")? Happy Chinese New Year.

from caffe-android-lib.

bhack avatar bhack commented on May 26, 2024

/cc @krikru

from caffe-android-lib.

sambookhon avatar sambookhon commented on May 26, 2024

@zif520
Sorry to bother you again. Could you provide your cmake details? Thanks.

from caffe-android-lib.

strin avatar strin commented on May 26, 2024

This thread is very interesting. I've been trying to get Caffe to work on Android. The results seem surprising: Caffe running on the Mali GPU seems to be 2-3x slower than the CPU, but about 4-5x more energy efficient. The test was run on a Galaxy S6 (Mali T760, peak performance 200 GFLOPS).

Since GEMM is the core of convolution in Caffe, I decided to profile its performance on Android. It seems that ViennaCL is not as efficient as some simple hand-written kernels. Now I am able to get the GPU to run as fast as the CPU for large matrices (2k x 2k). This is still counter-intuitive, since normally we expect GPUs to be much faster.

See:
https://github.com/strin/mocha-profile

The kernel implementations can be found here:

OpenCL kernels for GEMM: https://github.com/strin/gemm-android
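For a sense of what a "simple kernel" baseline looks like, here is a minimal naive GEMM kernel sketch (one work-item per output element, row-major, host setup omitted); this is not the kernel from strin/gemm-android, just a reference point to profile vendor BLAS and ViennaCL against:

    // Naive OpenCL GEMM baseline: C = A * B, row-major, no tiling or
    // local-memory blocking. Launch with a 2-D global size of at least M x N.
    static const char* kNaiveGemmSource = R"CLC(
    __kernel void gemm_naive(const int M, const int N, const int K,
                             __global const float* A,
                             __global const float* B,
                             __global float* C) {
      const int row = get_global_id(0);   // 0 .. M-1
      const int col = get_global_id(1);   // 0 .. N-1
      if (row >= M || col >= N) return;
      float acc = 0.0f;
      for (int k = 0; k < K; ++k) {
        acc += A[row * K + k] * B[k * N + col];
      }
      C[row * N + col] = acc;
    }
    )CLC";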

Any thoughts?

from caffe-android-lib.

jainanshul avatar jainanshul commented on May 26, 2024

@sh1r0 did you get a chance to integrate the code from https://github.com/BVLC/caffe/tree/opencl into your opencl_dev branch? Also, I saw some references to the OpenCL port being slow, and I'm wondering whether it is slower than CPU-only, or slower compared to CUDA?

from caffe-android-lib.

naibaf7 avatar naibaf7 commented on May 26, 2024

@jainanshul
I'm working on my own implementation of convolution for OpenCL to make it faster while also reducing memory usage. I think this should also help on ARM/Android devices.

from caffe-android-lib.

jainanshul avatar jainanshul commented on May 26, 2024

@naibaf7 would it be able to use the existing caffe models?

from caffe-android-lib.

naibaf7 avatar naibaf7 commented on May 26, 2024

@jainanshul
Yes. https://github.com/BVLC/caffe/tree/opencl is fully Caffe compatible with all existing models.

from caffe-android-lib.

jainanshul avatar jainanshul commented on May 26, 2024

Ah, you were talking about https://github.com/BVLC/caffe/tree/opencl, but at the moment it requires CUDA to be installed in order to use OpenCL. My Android device doesn't have an NVIDIA GPU, so no CUDA is available. Is there any way to try OpenCL without requiring CUDA?

from caffe-android-lib.

naibaf7 avatar naibaf7 commented on May 26, 2024

@jainanshul
It should be possible to disable CUDA and cuDNN in Makefile.config (or in the CMake configuration).
If that is not the case, please raise an issue so that I can fix the offending code.

from caffe-android-lib.

krikru avatar krikru commented on May 26, 2024

@naibaf7 Do you know if there is any equivalent of cuDNN for OpenCL? cuDNN is basically a library of primitives for DNNs that utilizes CUDA, and as such can be used from any deep learning framework. If there were an equivalent for OpenCL, we wouldn't need so many different implementations of basically the same functionality (one per framework), but could just use the same implementation for all frameworks. I believe this would also make the implementation faster, as it would unite people from different projects to work on the same OpenCL implementation.

from caffe-android-lib.

bhack avatar bhack commented on May 26, 2024

We are looking at something like what @krikru described for this year's OpenCV GSoC (if the OpenCV organization is accepted again).
@naibaf7 Are you still eligible for GSoC, and interested?

from caffe-android-lib.

naibaf7 avatar naibaf7 commented on May 26, 2024

@bhack
I unfortunately don't have time for this. However, I am working on a fast, flexible cuDNN replacement for OpenCL at the moment. The forward pass is implemented; now I'm writing the autotuning and the backward function.

from caffe-android-lib.

jainanshul avatar jainanshul commented on May 26, 2024

@naibaf7 would the above work be part of your opencl branch, and what timeline are you looking at?

from caffe-android-lib.

bhack avatar bhack commented on May 26, 2024

@naibaf7 Nice! If it is generic enough and BSD-license compatible, we could consider putting a student on integrating it.

from caffe-android-lib.

naibaf7 avatar naibaf7 commented on May 26, 2024

@bhack
I think it could be :) It also supports grouping, dilation, stride, padding, and N-dimensional convolutions. There's still quite some work left to do before I dare release it to the public, though :)

from caffe-android-lib.

xianyi avatar xianyi commented on May 26, 2024

Interesting thread, I've learned a lot.

I am interested in providing a BLAS implementation for mobile GPUs. (For the CPU, I suggest OpenBLAS, haha.)

As @bhack mentioned, Google released ScriptIntrinsicBLAS for RenderScript. Is it a good idea to use OpenCL on Android?

from caffe-android-lib.

naibaf7 avatar naibaf7 commented on May 26, 2024

@xianyi
It would be possible to use RenderScript if someone is willing to write a whole backend for it. OpenCL could be just as fast, but sometimes the provided implementations by the vendors are lacking. @karlrupp tested and knows a lot about that.
He mentioned that even simple synthetic OpenCL scripts do not reach the peak performance of those mobile chips.

from caffe-android-lib.

krikru avatar krikru commented on May 26, 2024

@naibaf7 Nice! I believe that would be really valuable. The most important thing to get right before you release it is the interface, because it will be hard to change afterwards; the rest can be improved later. I'm looking forward to seeing it released.

from caffe-android-lib.

bhack avatar bhack commented on May 26, 2024

@naibaf7 @xianyi We need to keep in mind that Vulkan is released now, so SPIR-V is also a strategic target for Android in the very near future. SYCL proposes single-source programming, and I think it should be taken into serious consideration.

from caffe-android-lib.

zif520 avatar zif520 commented on May 26, 2024

@strin
"(Mali T760, Peak Performance 200 GFlops)."
i test it only 74GFlops with opencl x
we test huawei mate8 with T880MP4 only 72GFlops,but half float will faster;
alexnet forward will cost 300ms,but vgg forward cost 3s ,use clblas2.4
s7 will supply T880MP14,it is powerful !!!

from caffe-android-lib.

xianyi avatar xianyi commented on May 26, 2024

@zif520 Thank you for the data. Is there room for improvement for a BLAS or DNN library on mobile GPUs?

from caffe-android-lib.

zif520 avatar zif520 commented on May 26, 2024

@xianyi
AlexNet:
(1) ViennaCL costs 800 ms.
(2) clBLAS 2.4 costs 300 ms. I also tested clBLAS 2.6 (with the OpenCL 1.2 functions removed; GEMM does not use them), and it also costs 300 ms.
clBLAS 2.10 offers an approach called AutoGemm (https://github.com/clMathLibraries/clBLAS/wiki/AutoGemm), but it uses Python, so I can't run it on the phone.
(3) Half float (16-bit float) is useful; see the sketch at the end of this comment.

But I am not familiar with BLAS and OpenCL :)

PS: we tested OpenBLAS on the mobile CPU; it is faster than Eigen. The CPU with 4 cores costs 250 ms, faster than the GPU for now.
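As for point (3), a minimal sketch of what 16-bit floats look like in an OpenCL kernel; whether the cl_khr_fp16 extension is actually exposed depends on the Mali driver, so this is an assumption to verify on the device:

    // Half-precision sketch: half arithmetic needs cl_khr_fp16; half storage
    // via vload_half/vstore_half works even without the extension (the math
    // is then done in float).
    static const char* kHalfAxpySource = R"CLC(
    #pragma OPENCL EXTENSION cl_khr_fp16 : enable
    __kernel void axpy_half(const int n, const float alpha,
                            __global const half* x, __global half* y) {
      const int i = get_global_id(0);
      if (i < n) {
        y[i] = (half)alpha * x[i] + y[i];   // computed in half precision
      }
    }
    )CLC";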

from caffe-android-lib.

xianyi avatar xianyi commented on May 26, 2024

@zif520 I think AutoGemm is aimed at AMD GPUs and may not be suitable for mobile GPUs.

I think OpenBLAS with the ARMv7 kernels is not fully optimized for your testbed. We just released OpenBLAS Cortex-A57 kernels for AArch64. Meanwhile, I want to introduce the OpenVML project (https://github.com/xianyi/OpenVML); we implement vectorized powx and exp using ARM NEON instructions.

from caffe-android-lib.

zif520 avatar zif520 commented on May 26, 2024

@xianyi
Does it support Cortex-A57 only? Can I try it on the A72?

from caffe-android-lib.

xianyi avatar xianyi commented on May 26, 2024

You can try it on the A72 with make TARGET=CORTEXA57.

from caffe-android-lib.

strin avatar strin commented on May 26, 2024

@zif520 I think the Galaxy S6 comes with a Mali T760 MP8. According to http://kyokojap.myweb.hinet.net/gpu_gflops/, the peak is 200 GFLOPS. I also ran a benchmark and got something close to ~74 GFLOPS.

from caffe-android-lib.

zif520 avatar zif520 commented on May 26, 2024

@strin Perhaps the benchmark only uses 4 cores;
http://kyokojap.myweb.hinet.net/gpu_gflops/ is right for the Kirin 950.

from caffe-android-lib.

zif520 avatar zif520 commented on May 26, 2024

@xianyi
I tested on CaffeNet; it costs 300 ms with 4 cores, target=A57 and OpenMP,
but Eigen costs only 200 ms with 4 cores.

from caffe-android-lib.

xianyi avatar xianyi commented on May 26, 2024

@zif520, thank you for the testing.

from caffe-android-lib.

jainanshul avatar jainanshul commented on May 26, 2024

@zif520 is the above result with OpenCL on ARM?

from caffe-android-lib.

edgarriba avatar edgarriba commented on May 26, 2024

@bhack I'm interested in the gsoc thing

from caffe-android-lib.

bhack avatar bhack commented on May 26, 2024

@edgarriba Try asking @naibaf7 whether you can contribute something in the meantime.

from caffe-android-lib.

edgarriba avatar edgarriba commented on May 26, 2024

@bhack @naibaf7 OK! But what you are suggesting is an update for Caffe with OpenCL, right? I'm not sure how it will fit into OpenCV, since as I understand it, OpenCV has its own implementation of Caffe. I'm also interested in the training part at the same abstraction level as Keras, Lasagne and others. Not sure if it will work. If you want, we can discuss that in the forum.

from caffe-android-lib.

bhack avatar bhack commented on May 26, 2024

@edgarriba I've added @naibaf7 to the group. Please ask there.

from caffe-android-lib.

zif520 avatar zif520 commented on May 26, 2024

@jainanshul Does ARM support OpenCL? I had seen that ARM will support it in the future; my test is based on NEON.

from caffe-android-lib.

jainanshul avatar jainanshul commented on May 26, 2024

@zif520 some ARM vendors do provide OpenCL implementations.

from caffe-android-lib.

zif520 avatar zif520 commented on May 26, 2024

@jainanshul
Can you give some examples for that?

from caffe-android-lib.

edgarriba avatar edgarriba commented on May 26, 2024

@bhack @naibaf7 nice! just posted there

from caffe-android-lib.

jainanshul avatar jainanshul commented on May 26, 2024

@zif520 newer Qualcomm Snapdragons support OpenCL on ARM.

from caffe-android-lib.

jainanshul avatar jainanshul commented on May 26, 2024

@naibaf7

I'm working on my own implementation of convolution for OpenCL to make it faster while also reducing memory usage. I think this should also help on ARM/Android devices.

Is there any update on this? From what I have seen in this thread, it seems like the OpenCL GPU runs slower than the CPU. I will be experimenting with OpenCL Caffe on an Android device and will post the results in this thread. In the meantime, if you have made any progress towards optimizing performance for mobile devices, please let me know.

from caffe-android-lib.

naibaf7 avatar naibaf7 commented on May 26, 2024

@jainanshul
I have a forward kernel that could be optimized / tested on mobile if you are interested.
The backward kernel is a bit more complicated and I'm still working on that, with a planned initial release within 2-3 weeks. Let me know if you feel like experimenting with it, then I can send you a pre-release of the forward kernel code and verification tests.

from caffe-android-lib.

zif520 avatar zif520 commented on May 26, 2024

@jainanshul
Is it the 820? The 820's DSP is strong, with 1024-bit SIMD, while the Kirin 950's NEON is 128-bit. We don't have a phone with an 820; could you share your results with us?

I had also heard that a forward pass on the 820 in CPU mode costs only 50 ms with AlexNet.

@naibaf7 could you share the forward kernel for us to test?
I have not made much progress on OpenCL.

from caffe-android-lib.

jainanshul avatar jainanshul commented on May 26, 2024

@zif520 the chip I am using has an Adreno 510 GPU. I will share the results in a few days. @naibaf7 please share your experimental code when you can, and I would be happy to try it.

from caffe-android-lib.
