jinhaolee / wca

[ICML 2024] Visual-Text Cross Alignment: Refining the Similarity Score in Vision-Language Models

License: MIT License

Python 100.00%
vision-language-model deep-learning image-text-matching large-language-models similarity-score visual-prompting zero-shot-classification visual-text-alignment textual-prompting

wca's Introduction

Authors: Jinhao Li, Haopeng Li, Sarah M. Erfani, Lei Feng, James Bailey, Feng Liu

Introduction

Recent research shows that using a pre-trained vision-language model (VLM) such as CLIP to align a query image with detailed text descriptions generated by large language models (LLMs) can improve zero-shot classification. This paper finds that these descriptions align better with local areas of the image than with the whole image. We introduce weighted visual-text cross alignment (WCA), a method that uses localized visual prompting and a visual-text similarity matrix to improve alignment and classification.

Methodology

Overview of weighted visual-text cross alignment (WCA). The process begins with localized visual prompting, where the input image $x$ is divided into localized patches, such as $\{x_1, x_2, x_3\}$. These patches are encoded by an image encoder to produce visual features. The text prompting stage uses a large language model to generate detailed textual descriptions $\{y_1, y_2, y_3\}$ for a given class label $y$ (e.g., "woodpecker"), which are encoded into textual features. WCA then computes alignment scores between the visual features and the textual features, using patch weights $\{w_1, w_2, w_3\}$ and text weights $\{v_1, v_2, v_3\}$. The final score is computed by summing the visual-text similarity matrix, weighted by these patch and text weights.
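
The scoring step described above can be sketched in a few lines. This is a minimal illustration and not the repository's implementation: it assumes cosine similarity between features, takes the patch weights and text weights as given, and reads the final score as $\sum_i \sum_j w_i v_j S_{ij}$, where $S_{ij}$ is the similarity between patch $x_i$ and description $y_j$. The function names (`cosine`, `similarity_matrix`, `wca_score`) and the toy 2-D features are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity_matrix(patch_feats, text_feats):
    """Rows index image patches x_i, columns index descriptions y_j."""
    return [[cosine(p, t) for t in text_feats] for p in patch_feats]


def wca_score(patch_feats, text_feats, patch_weights, text_weights):
    """Weighted sum of the similarity matrix: sum_i sum_j w_i * v_j * S[i][j]."""
    sim = similarity_matrix(patch_feats, text_feats)
    return sum(
        w * v * sim[i][j]
        for i, w in enumerate(patch_weights)
        for j, v in enumerate(text_weights)
    )


# Toy 2-D features standing in for encoder outputs (made-up numbers).
patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
texts = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w = [0.5, 0.25, 0.25]      # patch weights, assumed to sum to 1
v = [1 / 3, 1 / 3, 1 / 3]  # text weights, assumed to sum to 1
score = wca_score(patches, texts, w, v)
print(f"WCA score for this class: {score:.4f}")
```

In a zero-shot setting, such a score would be computed once per candidate class label, and the class with the highest score would be predicted.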

Experiment Results

We evaluate our method on various datasets and compare it with state-of-the-art methods. The results show that our method outperforms existing methods in zero-shot classification.

Prediction Explanation

We compare the predictions and explanations of our method and CLIP-D on an image of a gas mask or respirator. Each method predicts the image's category, and the similarity scores between the image and the various descriptions are plotted for each method.
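
One way to obtain per-description scores like those plotted is to split the weighted sum by description. The sketch below is an assumption about how such an explanation could be computed, not the code used to produce the figure: it takes a patch-by-description similarity matrix and the two weight vectors as given, and the function name `description_contributions`, the numbers, and the example descriptions are all made up for illustration.

```python
def description_contributions(sim, patch_weights, text_weights, descriptions):
    """Return (description, contribution) pairs, largest contribution first.

    The contribution of description j is v_j * sum_i w_i * sim[i][j], so the
    contributions add up to the overall weighted score.
    """
    contribs = []
    for j, (desc, v) in enumerate(zip(descriptions, text_weights)):
        column = sum(w * row[j] for w, row in zip(patch_weights, sim))
        contribs.append((desc, v * column))
    return sorted(contribs, key=lambda pair: pair[1], reverse=True)


# Made-up similarity matrix: 2 patches (rows) x 3 descriptions (columns).
sim = [
    [0.30, 0.10, 0.20],
    [0.10, 0.40, 0.20],
]
ranked = description_contributions(
    sim,
    patch_weights=[0.5, 0.5],
    text_weights=[0.5, 0.25, 0.25],
    descriptions=["a mask covering the face", "a filter canister", "rubber straps"],
)
for desc, value in ranked:
    print(f"{value:.4f}  {desc}")
```

Sorting by contribution makes it easy to see which descriptions drive a prediction for a given image.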

Getting Started

These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.

Installation

You need to have Python installed on your machine. The project uses requirements.txt to manage dependencies. To install the dependencies, you can use a package manager like conda:

conda create --name <env> --file requirements.txt

Dataset

Please download the dataset, place it in any folder, and then set data_path in the corresponding configuration file in the cfgs folder. Each config file in the cfgs folder corresponds to one dataset and holds that dataset's hyperparameters.

Running the Script

The main script of the project is main.py. It can be run from the command line using the following command:

python main.py --dataset_name imagenet --num_workers 8 --seed 1

Here, dataset_name specifies the dataset on which the script runs inference, num_workers specifies the number of workers used for data loading, and seed specifies the random seed used for reproducibility.
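
For readers who want to see how these three flags fit together, here is a minimal sketch of a command-line parser that accepts them. It is a hypothetical stand-in and not the contents of main.py: only the flag names come from the command above, while the defaults, help strings, and the `build_parser` and `set_seed` helpers are assumptions.

```python
import argparse
import random


def build_parser():
    """Parser mirroring the flags shown in the command above (defaults are assumed)."""
    parser = argparse.ArgumentParser(description="Hypothetical stand-in for the main.py CLI")
    parser.add_argument("--dataset_name", type=str, default="imagenet",
                        help="dataset to run inference on")
    parser.add_argument("--num_workers", type=int, default=8,
                        help="number of data-loading workers")
    parser.add_argument("--seed", type=int, default=1,
                        help="random seed for reproducibility")
    return parser


def set_seed(seed):
    """Seed Python's RNG; a real run would likely also seed NumPy and PyTorch."""
    random.seed(seed)


# Parse the same arguments as the example command instead of reading sys.argv.
args = build_parser().parse_args(
    ["--dataset_name", "imagenet", "--num_workers", "8", "--seed", "1"]
)
set_seed(args.seed)
print(args.dataset_name, args.num_workers, args.seed)
```

Passing an explicit argument list to `parse_args` keeps the sketch runnable on its own; in a script, calling `parse_args()` with no arguments reads them from the command line.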

License

This project is licensed under the MIT License - see the LICENSE.md file for details.

Acknowledgments

We thank the following papers for their open-source code and models:

  • Learning Transferable Visual Models From Natural Language Supervision [ICML 2021]
  • Visual Classification via Description from Large Language Models [ICLR 2023 Oral]
  • What does a platypus look like? Generating customized prompts for zero-shot image classification [ICCV 2023]
  • Waffling around for Performance: Visual Classification with Random Words and Broad Concepts [ICCV 2023]

Citation

If this repository is helpful in your research, please consider citing our paper.

@inproceedings{livisual,
  title={Visual-Text Cross Alignment: Refining the Similarity Score in Vision-Language Models},
  author={Li, Jinhao and Li, Haopeng and Erfani, Sarah Monazam and Feng, Lei and Bailey, James and Liu, Feng},
  booktitle={Forty-first International Conference on Machine Learning}
}

wca's Issues

The results I obtained differ from those reported in the article.

Thank you for your outstanding work. After running your released code (with default parameter settings), I found significant discrepancies between my results and those reported in the paper. I tried several different random seeds, but the discrepancies persisted. For instance, for the ViT-B/32 model on the ImageNet dataset, the accuracies for the different methods are as follows: {'clip': 60.83, 'clip-e': 62.20, 'clip-d': 61.88, 'waffe': 61.95, 'cupl': 63.25, 'ours': 61.21}. I'm unsure whether the issue lies with the code or with my execution.

link error in data/README.md

Your research work is impressive! However, is there something wrong with the ImageNet-R link? It appears to be the same as the link for oxford-iiit-pet.
