[NeurIPS2023] Exploring Diverse In-Context Configurations for Image Captioning


This repository contains the PyTorch implementation for the NeurIPS 2023 Paper "Exploring Diverse In-Context Configurations for Image Captioning" by Xu Yang, Yongliang Wu, Mingzhuo Yang, Haokun Chen and Xin Geng.

If you have any questions about this repository or the related paper, feel free to open an issue.


Introduction

After the discovery that Language Models (LMs) can be good in-context few-shot learners, numerous strategies have been proposed to optimize in-context sequence configurations. Recently, researchers in the Vision-Language (VL) domain have also developed few-shot learners, but they configure in-context image-text pairs in only the simplest way, e.g., by random sampling. To explore the effects of varying configurations on VL in-context learning, we devised four strategies for image selection and four for caption assignment to configure in-context image-text pairs for image captioning. Image captioning serves as the case study because it can be viewed as a visually conditioned LM. Our comprehensive experiments yield two counter-intuitive but valuable insights, highlighting the distinct characteristics of VL in-context learning that arise from multi-modal synergy, as compared to the NLP case.


Figure: The distinction between LMs and VLMs as few-shot learners. LMs generally excel with examples akin to the test case (blue blocks in (a)). For VLMs, in contrast, performance is not strictly correlated with image similarity but depends heavily on caption quality. For instance, when low-quality captions are used, similar images (d) lead to worse performance than dissimilar ones (f), since VLMs may take a shortcut by reusing in-context captions without looking at the given images.
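
To make the image-selection side concrete, below is a minimal sketch of similarity-based in-context image selection using CLIP embeddings. It is an illustration under assumptions (the open_clip package with a ViT-B-32 model, PIL images as input), not the repository's exact implementation.

import torch
import open_clip  # assumed dependency; any CLIP implementation would do

# Assumed model choice, for illustration only
model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-32", pretrained="openai")
model.eval()

@torch.no_grad()
def embed_images(pil_images):
    # Return L2-normalized CLIP image embeddings
    batch = torch.stack([preprocess(img) for img in pil_images])
    feats = model.encode_image(batch)
    return feats / feats.norm(dim=-1, keepdim=True)

@torch.no_grad()
def select_similar(query_img, candidate_imgs, shots=4):
    # Pick the `shots` candidates most similar to the query image
    q = embed_images([query_img])      # (1, d)
    c = embed_images(candidate_imgs)   # (N, d)
    sims = (q @ c.T).squeeze(0)        # cosine similarities
    return [candidate_imgs[i] for i in sims.topk(shots).indices.tolist()]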

Getting Started

To create a conda environment for running the scripts and install the dependencies, run:

conda create -n of python=3.9
conda activate of
pip install -r requirements.txt
pip install -e .

Download the OpenFlamingo v1 9B model from link and then download the LLaMA model from link.
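
If you prefer to script the download, a sketch using huggingface_hub is shown below; the repo id and filename are assumptions, so treat the links above as authoritative.

# Hedged sketch: the repo id and filename below are assumptions
from huggingface_hub import hf_hub_download

checkpoint_path = hf_hub_download(
    repo_id="openflamingo/OpenFlamingo-9B",  # assumed OpenFlamingo v1 9B repository
    filename="checkpoint.pt",                # assumed checkpoint filename
)
print(checkpoint_path)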

You can run the following command to validate the model. See run_eval.sh for more details.

python open_flamingo/eval/evaluate.py \
    --lm_path $LM_PATH \
    --lm_tokenizer_path $LM_TOKENIZER_PATH \
    --checkpoint_path $CKPT_PATH \
    --device $DEVICE \
    --coco_image_dir_path $COCO_IMG_PATH \
    --coco_annotations_json_path $COCO_ANNO_PATH \
    --mgc_path "MGC/wc_vis_135.json" \
    --mgca_path "MGCA-idx/best_gt_WC(135).json" \
    --clip_ids_path "train_set_clip.json" \
    --results_file $RESULTS_FILE \
    --num_samples 5000 --shots 4 8 16 32 --num_trials 1 --seed 5 --batch_size 8 \
    --cross_attn_every_n_layers 4 \
    --eval_coco

Datasets

MSCOCO

COCO is a large-scale object detection, segmentation, and captioning dataset. For the image captioning task, it provides 5 captions per image. You can download the dataset from link.
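
For reference, here is a minimal sketch of reading the per-image captions with the pycocotools API; the annotation path is a placeholder.

from pycocotools.coco import COCO

coco = COCO("annotations/captions_train2014.json")  # placeholder path
img_id = coco.getImgIds()[0]                        # first image in the split
ann_ids = coco.getAnnIds(imgIds=[img_id])
captions = [ann["caption"] for ann in coco.loadAnns(ann_ids)]
print(captions)  # typically 5 human-written captions per image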

Citation

Please cite our paper if it is helpful to your work:

@article{yang2023exploring,
  title={Exploring Diverse In-Context Configurations for Image Captioning},
  author={Yang, Xu and Wu, Yongliang and Yang, Mingzhuo and Chen, Haokun and Geng, Xin},
  journal={arXiv preprint arXiv:2305.14800},
  year={2023}
}

TODO

  1. Add the implementation of Model-Generated Captions
  2. Add the implementation of Model-Generated Captions as Anchors
  3. Add the implementation of Similarity-based Image-Caption Retrieval and Diversity-based Image-Image Retrieval (a sketch of a diversity-based selection heuristic follows this list)
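
As a rough illustration of the third item, the sketch below implements a generic greedy farthest-point heuristic for diversity-based image selection over precomputed, L2-normalized embeddings; it is an assumption-laden stand-in, not the planned release.

import torch

def diverse_subset(embeddings, shots=4):
    # Greedily pick `shots` indices that maximize mutual dissimilarity
    # (farthest-point heuristic over cosine similarities).
    sims = embeddings @ embeddings.T    # (N, N) cosine similarities
    chosen = [0]                        # arbitrary seed index
    while len(chosen) < shots:
        # Each candidate's similarity to its closest already-chosen item
        max_sim = sims[:, chosen].max(dim=1).values
        max_sim[chosen] = float("inf")  # exclude already-chosen indices
        chosen.append(int(max_sim.argmin()))
    return chosen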

Acknowledgements

Our implementation uses source code from the OpenFlamingo repository.
