zheqiushui / clip-onnx-ax650-cpp

A C++ implementation of CLIP inference. The models are slightly modified, but not by much; the changes and the model-export code can be found in this README. All model files, including the AX650 models, are in the Releases. ChineseCLIP is now supported as well.

CMake 21.65% C++ 76.43% C 1.92%


CLIP

Demo video: zh_clip-2023-10-20_10.19.10.mp4

Another interesting project: SAM-ONNX-AX650-CPP

Build

mkdir build
cd build

For x86 with onnxruntime:

cmake -DONNXRUNTIME_DIR=${onnxruntime_dir} -DOpenCV_DIR=${opencv_cmake_file_dir} ..
make -j4

For AX650:

cmake -DONNXRUNTIME_DIR=${onnxruntime_dir} -DOpenCV_DIR=${opencv_cmake_file_dir} -DBSP_MSP_DIR=${msp_out_dir} -DBUILD_WITH_AX650=ON ..
make -j4

Required aarch64-none-gnu libraries (for the AX650 build):
onnxruntime
opencv

Resource

Google Drive

ONNX

Export ONNX

ZHEQIUSHUI/CLIP
ZHEQIUSHUI/Chinese-CLIP

Get the original models

Export the ONNX models yourself:

# Original Clip
git clone https://github.com/ZHEQIUSHUI/CLIP.git
cd CLIP
python onnx_export.py

# Chinese Clip
git clone https://github.com/ZHEQIUSHUI/Chinese-CLIP.git
cd Chinese-CLIP
git checkout ax650

# download weights
cd weights
./downloads.sh

# get onnx model
cd ..
./convert.sh

# onnxsim model
cd ax650
./onnxsim.sh

Or download the models directly from the Releases:

# Chinese Clip model
wget https://github.com/ZHEQIUSHUI/CLIP-ONNX-AX650-CPP/releases/download/cnclip/cnclip_vitb16.axmodel
wget https://github.com/ZHEQIUSHUI/CLIP-ONNX-AX650-CPP/releases/download/cnclip/cnclip_vitb16.img.fp32.onnx
wget https://github.com/ZHEQIUSHUI/CLIP-ONNX-AX650-CPP/releases/download/cnclip/cnclip_vitb16.txt.fp32.onnx

# feature matmul model
wget https://github.com/ZHEQIUSHUI/CLIP-ONNX-AX650-CPP/releases/download/3models/feature_matmul.onnx

# Original Clip model
wget https://github.com/ZHEQIUSHUI/CLIP-ONNX-AX650-CPP/releases/download/3models/image_encoder.onnx
wget https://github.com/ZHEQIUSHUI/CLIP-ONNX-AX650-CPP/releases/download/3models/image_encoder.axmodel
wget https://github.com/ZHEQIUSHUI/CLIP-ONNX-AX650-CPP/releases/download/3models/text_encoder.onnx

Run on x86 with onnxruntime

English

./main --ienc image_encoder.onnx --tenc text_encoder.onnx --dec feature_matmul.onnx -v ../vocab.txt -i ../images/ -t ../text.txt 

inputs: 
              images: 1 x 3 x 224 x 224
output: 
      image_features: 1 x 512
decode Inference Cost time : 0.00040005s

per image:
                 image path\text|                            bird|                             cat|                             dog|
              ../images/bird.jpg|                            1.00|                            0.00|                            0.00|
               ../images/cat.jpg|                            0.00|                            0.99|                            0.01|
         ../images/dog-chai.jpeg|                            0.00|                            0.00|                            1.00|


per text:
                 text\image path|              ../images/bird.jpg|               ../images/cat.jpg|         ../images/dog-chai.jpeg|
                            bird|                            0.87|                            0.01|                            0.12|
                             cat|                            0.00|                            0.98|                            0.02|
                             dog|                            0.00|                            0.00|                            1.00|
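Each per-image row above sums to 1 because CLIP turns the scaled image-text similarities into probabilities with a softmax. As a rough illustration of what the feature-matmul decode step presumably computes (plain C++ with illustrative names, no onnxruntime dependency):

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Softmax over one image's logits against every text prompt.
// Subtracting the max logit first keeps exp() numerically stable.
std::vector<float> softmax(const std::vector<float>& logits) {
    float max_logit = *std::max_element(logits.begin(), logits.end());
    std::vector<float> probs(logits.size());
    float sum = 0.f;
    for (std::size_t i = 0; i < logits.size(); ++i) {
        probs[i] = std::exp(logits[i] - max_logit);
        sum += probs[i];
    }
    for (float& p : probs) p /= sum;
    return probs;
}

// logit[i] = scale * dot(image_feature, text_feature[i]).
// CLIP multiplies by a learned logit scale, typically around 100.
std::vector<float> clip_logits(const std::vector<float>& image_feat,
                               const std::vector<std::vector<float>>& text_feats,
                               float scale = 100.f) {
    std::vector<float> logits;
    logits.reserve(text_feats.size());
    for (const auto& t : text_feats) {
        float dot = 0.f;
        for (std::size_t i = 0; i < image_feat.size(); ++i) dot += image_feat[i] * t[i];
        logits.push_back(scale * dot);
    }
    return logits;
}
```

For example, an image feature of {1, 0} scored against text features {1, 0} and {0, 1} yields probabilities of roughly 1.00 and 0.00, matching the shape of the tables above.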

Chinese

./main -l 1 -v ../cn_vocab.txt -t ../cn_text.txt -i ../images/ --ienc ../onnx_models/vitb16.img.fp32.onnx --tenc ../onnx_models/vitb16.txt.fp32.onnx -d ../onnx_models/feature_matmul.onnx 

inputs: 
               image: 1 x 3 x 224 x 224
output: 
unnorm_image_features: 1 x 512
[I][              load_image_encoder][  20]: image feature len 512
[I][               load_text_encoder][ 101]: text feature len 512
[I][                  load_tokenizer][  75]: text token len 52
encode text Inference Cost time : 0.0926369s
matmul Inference Cost time : 0.00045888s

per image:
                 image path\text|                          小鸟|                          猫咪|                          狗子|
              ../images/bird.jpg|                            1.00|                            0.00|                            0.00|
               ../images/cat.jpg|                            0.00|                            0.99|                            0.01|
         ../images/dog-chai.jpeg|                            0.00|                            0.00|                            1.00|


per text:
                 text\image path|              ../images/bird.jpg|               ../images/cat.jpg|         ../images/dog-chai.jpeg|
                          小鸟|                            0.77|                            0.22|                            0.01|
                          猫咪|                            0.00|                            1.00|                            0.00|
                          狗子|                            0.00|                            0.00|                            1.00|

Mixed Chinese and English

./main -l 1 -v ../cn_vocab.txt -t ../cn_text_mix.txt -i ../images/ --ienc ../onnx_models/vitb16.img.fp32.onnx --tenc ../onnx_models/vitb16.txt.fp32.onnx -d ../onnx_models/feature_matmul.onnx 

inputs: 
               image: 1 x 3 x 224 x 224
output: 
unnorm_image_features: 1 x 512
[I][              load_image_encoder][  20]: image feature len 512
[I][               load_text_encoder][ 101]: text feature len 512
[I][                  load_tokenizer][  75]: text token len 52
encode text Inference Cost time : 0.106218s
matmul Inference Cost time : 0.000361136s

per image:
                 image path\text|                        小 bird|                         cat 咪|                     小 dog 子|
              ../images/bird.jpg|                           1.00|                           0.00|                         0.00|
               ../images/cat.jpg|                           0.00|                           0.95|                         0.05|
         ../images/dog-chai.jpeg|                           0.00|                           0.01|                         0.99|


per text:
                 text\image path|              ../images/bird.jpg|               ../images/cat.jpg|         ../images/dog-chai.jpeg|
                         小 bird|                            0.96|                            0.03|                            0.00|
                          cat 咪|                            0.00|                            0.93|                            0.07|
                       小 dog 子|                            0.00|                            0.01|                            0.99|

AX650

Run on the AXERA AX650 chip

English

./main --ienc image_encoder.axmodel --tenc text_encoder.onnx -d feature_matmul.onnx  -v vocab.txt -t text.txt -i images/
Engine creating handle is done.
Engine creating context is done.
Engine get io info is done.
Engine alloc io is done.
[I][                            init][ 275]: RGB MODEL
decode Inference Cost time : 0.000754583s

per image:
                 image path\text|                            bird|                             cat|                             dog|
                 images/bird.jpg|                            1.00|                            0.00|                            0.00|
                  images/cat.jpg|                            0.01|                            0.98|                            0.01|
            images/dog-chai.jpeg|                            0.00|                            0.00|                            1.00|


per text:
                 text\image path|                 images/bird.jpg|                  images/cat.jpg|            images/dog-chai.jpeg|
                            bird|                            1.00|                            0.00|                            0.00|
                             cat|                            0.00|                            0.99|                            0.01|
                             dog|                            0.00|                            0.00|                            1.00|

Chinese

./main -l 1 -v cn_vocab.txt -t cn_text.txt  -i images/ --ienc cn_clip_vitb16.axmodel --tenc vitb16.txt.fp32.onnx -d feature_matmul.onnx
Engine creating handle is done.
Engine creating context is done.
Engine get io info is done.
Engine alloc io is done.
[I][                            init][ 275]: RGB MODEL
[I][              load_image_encoder][  19]: image feature len 512
[I][               load_text_encoder][ 101]: text feature len 512
[I][                  load_tokenizer][  75]: text token len 52
encode text Inference Cost time : 0.762541s
matmul Inference Cost time : 0.0007695s

per image:
                 image path\text|                            小鸟|                             猫咪|                            狗子|
                 images/bird.jpg|                            0.99|                            0.00|                            0.01|
                  images/cat.jpg|                            0.00|                            0.98|                            0.02|
            images/dog-chai.jpeg|                            0.00|                            0.00|                            1.00|


per text:
                 text\image path|                 images/bird.jpg|                  images/cat.jpg|            images/dog-chai.jpeg|
                           小鸟|                             0.43|                            0.57|                            0.00|
                           猫咪|                             0.00|                            1.00|                            0.00|
                           狗子|                             0.00|                            0.14|                            0.86|

Mixed Chinese and English

./main -l 1 -v cn_vocab.txt -t cn_text_mix.txt  -i images/ --ienc cn_clip_vitb16.axmodel --tenc vitb16.txt.fp32.onnx -d feature_matmul.onnx
Engine creating handle is done.
Engine creating context is done.
Engine get io info is done.
Engine alloc io is done.
[I][                            init][ 275]: RGB MODEL
[I][              load_image_encoder][  19]: image feature len 512
[I][               load_text_encoder][ 101]: text feature len 512
[I][                  load_tokenizer][  75]: text token len 52
encode text Inference Cost time : 0.75124s
matmul Inference Cost time : 0.000727667s

per image:
                 image path\text|                         小 bird|                          cat 咪|                        小 dog 子|
                 images/bird.jpg|                            0.99|                            0.01|                            0.00|
                  images/cat.jpg|                            0.00|                            0.94|                            0.06|
            images/dog-chai.jpeg|                            0.00|                            0.00|                            1.00|


per text:
                 text\image path|                 images/bird.jpg|                  images/cat.jpg|            images/dog-chai.jpeg|
                        小 bird|                             0.92|                            0.08|                            0.00|
                         cat 咪|                             0.00|                            1.00|                            0.00|
                      小 dog 子|                             0.00|                            0.10|                            0.90|

Reference

CLIP
Chinese-CLIP
CLIP-ImageSearch-NCNN


Issues

Why does CLIP have a decoder here?

Thanks for contributing this code. After reading it, I noticed there is decode-related content. The CLIP paper describes only an image encoder and a text encoder, so why does your code have an extra decoder? Is this "decoder" just the final dot product between the image features and the text features?

Also, sorry to impose, but could you share the models used in your demo? Baidu Cloud or any similar link would be fine.
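For what it's worth, judging by the model name feature_matmul.onnx, the "decoder" here is most likely exactly that final step: both feature vectors are L2-normalized and their dot product gives the cosine similarity. A minimal sketch with hypothetical helper names (plain C++):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// L2-normalize a feature vector in place (divide by its Euclidean norm).
void l2_normalize(std::vector<float>& v) {
    float norm = 0.f;
    for (float x : v) norm += x * x;
    norm = std::sqrt(norm);
    for (float& x : v) x /= norm;
}

// Dot product of two already-normalized features = cosine similarity.
float dot(const std::vector<float>& a, const std::vector<float>& b) {
    float s = 0.f;
    for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}
```

Normalizing {3, 4} gives {0.6, 0.8}, and its dot product with itself is 1.0, i.e. the cosine similarity of identical features.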

CUDA support enabled, but the GPU is never used

Hi, I have finished the inference tests on CPU, but when I try to run on the GPU it has no effect.
All I did was uncomment the line in OnnxWarpper.hpp:
OrtSessionOptionsAppendExecutionProvider_CUDA

Besides uncommenting that line, is any other configuration needed? My machine has CUDA 10.1 with driver 418.87.
