ainisa20 / recognize-anything

This project forked from xinyu1205/recognize-anything

Code for the Recognize Anything Model (RAM) and Tag2Text Model

Home Page: https://recognize-anything.github.io/

License: Apache License 2.0

Python 1.76% Jupyter Notebook 98.24%

recognize-anything's Introduction

🏷️ Recognize Anything & Tag2Text

Official PyTorch Implementation of Recognize Anything: A Strong Image Tagging Model and Tag2Text: Guiding Vision-Language Model via Image Tagging.

  • Recognize Anything Model (RAM) is an image tagging model that can recognize any common category with high accuracy.
  • Tag2Text is a vision-language model guided by tagging that supports captioning, retrieval, and tagging.

Both Tag2Text and RAM exhibit strong recognition ability. We have combined Tag2Text and RAM with localization models (Grounding-DINO and SAM) to build a strong visual semantic analysis pipeline in the Grounded-SAM project.

💡 Highlight of RAM

RAM is a strong image tagging model, which can recognize any common category with high accuracy.

  • Strong and general. RAM exhibits exceptional image tagging capabilities with powerful zero-shot generalization:
    • RAM showcases impressive zero-shot performance, significantly outperforming CLIP and BLIP.
    • RAM even surpasses fully supervised methods (ML-Decoder).
    • RAM exhibits competitive performance with the Google tagging API.
  • Reproducible and affordable. RAM has a low reproduction cost, relying on open-source, annotation-free datasets.
  • Flexible and versatile. RAM offers remarkable flexibility, catering to various application scenarios.

(Green color means fully supervised learning and Blue color means zero-shot performance.)

Built on the Tag2Text framework, RAM significantly improves tagging ability.

  • Accuracy. RAM uses a data engine to generate additional annotations and clean incorrect ones, achieving higher accuracy than Tag2Text.
  • Scope. RAM increases the number of fixed tags from 3,400+ to 6,400+ (4,500+ semantically distinct tags after merging synonyms), covering more valuable categories. Moreover, RAM has open-set capability, so it can recognize tags not seen during training.
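
The synonym reduction mentioned in the Scope item can be pictured with a small sketch. The synonym table and the helper name `merge_synonyms` are hypothetical and exist only for illustration; they are not taken from RAM's tag list or code.

```python
def merge_synonyms(tags, synonym_to_canonical):
    """Map each tag to its canonical form and drop duplicates, keeping first-seen order."""
    merged = []
    seen = set()
    for tag in tags:
        canonical = synonym_to_canonical.get(tag, tag)
        if canonical not in seen:
            seen.add(canonical)
            merged.append(canonical)
    return merged


# Hypothetical synonym table, not taken from the RAM tag list.
SYNONYMS = {"sofa": "couch", "automobile": "car", "motorbike": "motorcycle"}

raw_tags = ["couch", "sofa", "car", "automobile", "tree"]
print(merge_synonyms(raw_tags, SYNONYMS))  # ['couch', 'car', 'tree']
```

Five raw tags collapse to three distinct ones here, which is the same kind of reduction as going from 6,400+ fixed tags to 4,500+ semantic tags.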

🌅 Highlight of Tag2Text

Tag2Text is an efficient and controllable vision-language model with tagging guidance.

  • Tagging. Tag2Text recognizes 3,400+ categories commonly used by humans, without manual annotations.
  • Captioning. Tag2Text integrates tag information into text generation as guiding elements, resulting in more controllable and comprehensive descriptions.
  • Retrieval. Tag2Text provides tags as additional visible alignment indicators for image-text retrieval.

✍️ TODO

  • Release Tag2Text demo.
  • Release checkpoints.
  • Release inference code.
  • Release RAM demo and checkpoints.
  • Release training code (by July 8th at the latest).
  • Release training datasets (by July 15th at the latest).

🧰 Checkpoints

#  | Name         | Backbone   | Data                         | Illustration                                  | Checkpoint
1  | RAM-14M      | Swin-Large | COCO, VG, SBU, CC-3M, CC-12M | Provide strong image tagging ability.         | Download link
2  | Tag2Text-14M | Swin-Base  | COCO, VG, SBU, CC-3M, CC-12M | Support comprehensive captioning and tagging. | Download link

🏃 Model Inference

Setting Up

  1. Install the dependencies:

pip install -r requirements.txt

  2. Download the RAM pretrained checkpoints.

  3. (Optional) To use RAM and Tag2Text in other projects, it is better to install recognize-anything as a package:

pip install -e .

The RAM and Tag2Text models can then be imported in other projects:

from ram.models import ram, tag2text_caption
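
Before running the inference commands, it can help to confirm that the downloaded checkpoints are where the commands expect them. The following is a minimal sketch, assuming the checkpoints are stored in a `pretrained/` directory under the file names used by the commands in this README; the helper `missing_checkpoints` is hypothetical and not part of the repository.

```python
from pathlib import Path

# File names taken from the inference commands in this README.
EXPECTED_CHECKPOINTS = ("ram_swin_large_14m.pth", "tag2text_swin_14m.pth")


def missing_checkpoints(pretrained_dir="pretrained", names=EXPECTED_CHECKPOINTS):
    """Return the expected checkpoint paths that do not exist yet."""
    root = Path(pretrained_dir)
    return [root / name for name in names if not (root / name).is_file()]


if __name__ == "__main__":
    for path in missing_checkpoints():
        print(f"missing checkpoint: {path}")
```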

RAM Inference

Get the English and Chinese outputs for an image:

python inference_ram.py --image images/demo/demo1.jpg \
  --pretrained pretrained/ram_swin_large_14m.pth

RAM Inference on Unseen Categories (Open-Set)

First, customize the recognition categories in build_openset_label_embedding, then get the tags for the image:

python inference_ram_openset.py --image images/openset_example.jpg \
  --pretrained pretrained/ram_swin_large_14m.pth

Tag2Text Inference

Get the tagging and captioning results:

python inference_tag2text.py --image images/demo/demo1.jpg \
  --pretrained pretrained/tag2text_swin_14m.pth

Or get the tagging results and a caption guided by specified tags (optional):

python inference_tag2text.py --image images/demo/demo1.jpg \
  --pretrained pretrained/tag2text_swin_14m.pth \
  --specified-tags "cloud,sky"
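
When the Tag2Text script is called from another program, the command line can be assembled in Python. The sketch below uses only the flags shown above (`--image`, `--pretrained`, `--specified-tags`); the helper name `build_tag2text_command` is hypothetical, and the script is assumed to be run from the repository root.

```python
import sys


def build_tag2text_command(image, pretrained, specified_tags=None):
    """Build the argv list for inference_tag2text.py using only flags shown in this README."""
    cmd = [sys.executable, "inference_tag2text.py",
           "--image", image,
           "--pretrained", pretrained]
    if specified_tags:
        # The script takes the tags as one comma-separated string, e.g. "cloud,sky".
        cmd += ["--specified-tags", ",".join(specified_tags)]
    return cmd


cmd = build_tag2text_command(
    "images/demo/demo1.jpg",
    "pretrained/tag2text_swin_14m.pth",
    specified_tags=["cloud", "sky"],
)
print(" ".join(cmd[1:]))
# To actually run it (requires the checkpoint and the installed dependencies):
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems with the comma-separated tag string.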

Batch Inference and Evaluation

We release two datasets: OpenImages-common (214 seen classes) and OpenImages-rare (200 unseen classes). Copy or symlink the test images of OpenImages v6 to datasets/openimages_common_214/imgs/ and datasets/openimages_rare_200/imgs/.
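
The symlinking step can be scripted. This is a minimal sketch, assuming the OpenImages v6 test images sit in a single flat directory and that linking the whole directory is acceptable; the helper `link_images`, the list of image extensions, and the source path in the usage comment are placeholders rather than part of the repository.

```python
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}  # assumed image extensions


def link_images(source_dir, target_dir):
    """Symlink every image in source_dir into target_dir; return how many links were created."""
    source = Path(source_dir).resolve()
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    created = 0
    for image in sorted(source.iterdir()):
        if not image.is_file() or image.suffix.lower() not in IMAGE_SUFFIXES:
            continue
        link = target / image.name
        if link.exists() or link.is_symlink():
            continue  # already linked or copied on a previous run
        link.symlink_to(image)
        created += 1
    return created


# Example (the source path is a placeholder):
# link_images("/path/to/openimages_v6/test", "datasets/openimages_common_214/imgs")
```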

To evaluate RAM on OpenImages-common:

python batch_inference.py \
  --model-type ram \
  --checkpoint pretrained/ram_swin_large_14m.pth \
  --dataset openimages_common_214 \
  --output-dir outputs/ram

To evaluate RAM open-set capability on OpenImages-rare:

python batch_inference.py \
  --model-type ram \
  --checkpoint pretrained/ram_swin_large_14m.pth \
  --open-set \
  --dataset openimages_rare_200 \
  --output-dir outputs/ram_openset

To evaluate Tag2Text on OpenImages-common:

python batch_inference.py \
  --model-type tag2text \
  --checkpoint pretrained/tag2text_swin_14m.pth \
  --dataset openimages_common_214 \
  --output-dir outputs/tag2text

Please refer to batch_inference.py for more options. To reproduce the precision/recall (P/R) in Table 3 of our paper, pass --threshold=0.86 for RAM and --threshold=0.68 for Tag2Text.
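
To make the role of the threshold concrete, here is a generic sketch of how a score threshold turns per-tag scores into predicted tags, and how precision and recall follow from them. It is an illustration only: the scores are made up, the helper names are hypothetical, and batch_inference.py may compare and average differently.

```python
def tags_above_threshold(scores, threshold):
    """Return the set of tags whose score is at least the threshold."""
    return {tag for tag, score in scores.items() if score >= threshold}


def precision_recall(predicted, ground_truth):
    """Precision and recall of a predicted tag set against a ground-truth tag set."""
    true_positives = len(predicted & ground_truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    return precision, recall


# Made-up scores for one image, for illustration only.
scores = {"dog": 0.95, "grass": 0.90, "ball": 0.70, "cat": 0.88}
ground_truth = {"dog", "grass", "ball"}

predicted = tags_above_threshold(scores, threshold=0.86)
print(sorted(predicted))  # ['cat', 'dog', 'grass']
print(precision_recall(predicted, ground_truth))
```

Raising the threshold generally trades recall for precision, which is why the reported P/R depends on the value passed.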

To run batch inference on custom images, set up your own dataset following the layout of the two provided datasets.

✒️ Citation

If you find our work to be useful for your research, please consider citing.

@article{zhang2023recognize,
  title={Recognize Anything: A Strong Image Tagging Model},
  author={Zhang, Youcai and Huang, Xinyu and Ma, Jinyu and Li, Zhaoyang and Luo, Zhaochuan and Xie, Yanchun and Qin, Yuzhuo and Luo, Tong and Li, Yaqian and Liu, Shilong and others},
  journal={arXiv preprint arXiv:2306.03514},
  year={2023}
}

@article{huang2023tag2text,
  title={Tag2Text: Guiding Vision-Language Model via Image Tagging},
  author={Huang, Xinyu and Zhang, Youcai and Ma, Jinyu and Tian, Weiwei and Feng, Rui and Zhang, Yuejie and Li, Yaqian and Guo, Yandong and Zhang, Lei},
  journal={arXiv preprint arXiv:2303.05657},
  year={2023}
}

♥️ Acknowledgements

This work builds on the amazing BLIP code base; thanks very much!

We want to thank @Cheng Rui, @Shilong Liu, and @Ren Tianhe for their help in marrying RAM/Tag2Text with Grounded-SAM.

We also want to thank Ask-Anything and Prompt-can-anything for integrating RAM/Tag2Text, which greatly expands the application boundaries of RAM/Tag2Text.
