[NeurIPS2023] Exploring Diverse In-Context Configurations for Image Captioning


This repository contains the PyTorch implementation for the NeurIPS 2023 Paper "Exploring Diverse In-Context Configurations for Image Captioning" by Xu Yang, Yongliang Wu, Mingzhuo Yang, Haokun Chen and Xin Geng.

If you have any questions about this repository or the related paper, feel free to open an issue.


Introduction

After the discovery that Language Models (LMs) can be good in-context few-shot learners, numerous strategies have been proposed to optimize in-context sequence configurations. Recently, researchers in the Vision-Language (VL) domain have also developed few-shot learners, but they configure in-context image-text pairs in only the simplest way, e.g., by random sampling. To explore the effects of varying configurations on VL in-context learning, we devised four strategies for image selection and four for caption assignment to configure in-context image-text pairs for image captioning. Image captioning serves as the case study because it can be viewed as a visually conditioned LM. Our comprehensive experiments yield two counter-intuitive but valuable insights, highlighting the distinct characteristics of VL in-context learning that arise from multi-modal synergy, as compared to the NLP case.


Figure: The distinction between LMs and VLMs as few-shot learners. LMs generally excel with examples akin to the test case (blue blocks in (a)). For VLMs, in contrast, performance is not strictly correlated with image similarity but depends heavily on caption quality. For instance, when low-quality captions are used, similar images (d) lead to worse performance than dissimilar ones (f), since VLMs may take a shortcut by reusing in-context captions without looking at the given images.
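
To make the image-selection side concrete, below is a minimal sketch of similarity-based in-context image selection using CLIP embeddings. It is an illustration under assumptions (the open_clip package with a ViT-B-32 model, PIL images as input), not the repository's exact implementation.

import torch
import open_clip  # assumed dependency; any CLIP implementation would do

# Assumed model choice, for illustration only
model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-32", pretrained="openai")
model.eval()

@torch.no_grad()
def embed_images(pil_images):
    # Return L2-normalized CLIP image embeddings
    batch = torch.stack([preprocess(img) for img in pil_images])
    feats = model.encode_image(batch)
    return feats / feats.norm(dim=-1, keepdim=True)

@torch.no_grad()
def select_similar(query_img, candidate_imgs, shots=4):
    # Pick the `shots` candidates most similar to the query image
    q = embed_images([query_img])      # (1, d)
    c = embed_images(candidate_imgs)   # (N, d)
    sims = (q @ c.T).squeeze(0)        # cosine similarities
    return [candidate_imgs[i] for i in sims.topk(shots).indices.tolist()]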

Getting Started

To create a conda environment for running the scripts and install the dependencies, run:

conda create -n of python=3.9
conda activate of
pip install -r requirements.txt
pip install -e .

Download the OpenFlamingo v1 9B model from link and then download the LLaMA model from link.
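
If you prefer to script the download, a sketch using huggingface_hub is shown below; the repo id and filename are assumptions, so treat the links above as authoritative.

# Hedged sketch: the repo id and filename below are assumptions
from huggingface_hub import hf_hub_download

checkpoint_path = hf_hub_download(
    repo_id="openflamingo/OpenFlamingo-9B",  # assumed OpenFlamingo v1 9B repository
    filename="checkpoint.pt",                # assumed checkpoint filename
)
print(checkpoint_path)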

You can run the following command to validate the model. See run_eval.sh for more details.

python open_flamingo/eval/evaluate.py \
    --lm_path $LM_PATH \
    --lm_tokenizer_path $LM_TOKENIZER_PATH \
    --checkpoint_path $CKPT_PATH \
    --device $DEVICE \
    --coco_image_dir_path $COCO_IMG_PATH \
    --coco_annotations_json_path $COCO_ANNO_PATH \
    --mgc_path "MGC/wc_vis_135.json" \
    --mgca_path "MGCA-idx/best_gt_WC(135).json" \
    --clip_ids_path "train_set_clip.json" \
    --results_file $RESULTS_FILE \
    --num_samples 5000 --shots 4 8 16 32 --num_trials 1 --seed 5 --batch_size 8 \
    --cross_attn_every_n_layers 4 \
    --eval_coco

Datasets

MSCOCO

COCO is a large-scale object detection, segmentation, and captioning dataset. For the image captioning task, it provides 5 captions per image. You can download the dataset from link.
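
For reference, here is a minimal sketch of reading the per-image captions with the pycocotools API; the annotation path is a placeholder.

from pycocotools.coco import COCO

coco = COCO("annotations/captions_train2014.json")  # placeholder path
img_id = coco.getImgIds()[0]                        # first image in the split
ann_ids = coco.getAnnIds(imgIds=[img_id])
captions = [ann["caption"] for ann in coco.loadAnns(ann_ids)]
print(captions)  # typically 5 human-written captions per image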

Citation

Please cite our paper if it is helpful to your work:

@article{yang2023exploring,
  title={Exploring Diverse In-Context Configurations for Image Captioning},
  author={Yang, Xu and Wu, Yongliang and Yang, Mingzhuo and Chen, Haokun and Geng, Xin},
  journal={arXiv preprint arXiv:2305.14800},
  year={2023}
}

TODO

  1. Add the implementation of Model-Generated Captions
  2. Add the implementation of Model-Generated Captions as Anchors
  3. Add the implementation of Similarity-based Image-Caption Retrieval and Diversity-based Image-Image Retrieval (a sketch of a diversity-based selection heuristic follows this list)
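
As a rough illustration of the third item, the sketch below implements a generic greedy farthest-point heuristic for diversity-based image selection over precomputed, L2-normalized embeddings; it is an assumption-laden stand-in, not the planned release.

import torch

def diverse_subset(embeddings, shots=4):
    # Greedily pick `shots` indices that maximize mutual dissimilarity
    # (farthest-point heuristic over cosine similarities).
    sims = embeddings @ embeddings.T    # (N, N) cosine similarities
    chosen = [0]                        # arbitrary seed index
    while len(chosen) < shots:
        # Each candidate's similarity to its closest already-chosen item
        max_sim = sims[:, chosen].max(dim=1).values
        max_sim[chosen] = float("inf")  # exclude already-chosen indices
        chosen.append(int(max_sim.argmin()))
    return chosen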

Acknowledgements

Our implementation uses source code from the OpenFlamingo repository.
