
CGG (ICCV 2023)

This repository contains the official implementation of the following paper:

Betrayed by Captions: Joint Caption Grounding and Generation for Open Vocabulary Instance Segmentation
Jianzong Wu*, Xiangtai Li*, Henghui Ding, Xia Li, Guangliang Cheng, Yunhai Tong, Chen Change Loy
IEEE/CVF International Conference on Computer Vision (ICCV), 2023

[Paper] [Project]

⭐ News

  • 2023.7.19: Our code is publicly available.

Short Introduction

In this work, we focus on instance-level open vocabulary segmentation, intending to expand a segmenter for instance-wise novel categories without mask annotations. We investigate a simple yet effective framework with the help of image captions, focusing on exploiting thousands of object nouns in captions to discover instances of novel classes. Rather than adopting pretrained caption models or using massive caption datasets with complex pipelines, we propose an end-to-end solution from two aspects: caption grounding and caption generation. In particular, we devise a joint Caption Grounding and Generation (CGG) framework based on a Mask Transformer baseline. The framework has a novel grounding loss that performs explicit and implicit multi-modal feature alignments. Moreover, we design a lightweight caption generation head to allow for additional caption supervision. We find that grounding and generation complement each other, significantly enhancing the segmentation performance for novel categories. We conduct extensive experiments on the COCO dataset with two settings: Open Vocabulary Instance Segmentation (OVIS) and Open Set Panoptic Segmentation (OSPS). The results demonstrate the superiority of our CGG framework over previous OVIS methods, achieving a large improvement of 6.8% mAP on novel classes without extra caption data. Our method also achieves over 15% PQ improvements for novel classes on the OSPS benchmark under various settings.
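
As a rough, illustrative sketch only (not the repository's actual implementation), the grounding idea can be thought of as aligning mask-query embeddings with BERT embeddings of the object nouns parsed from a caption; every name and shape below is an assumption made for illustration.

# Illustrative sketch -- NOT the actual CGG grounding loss.
# Assumes query_embs (N, D) come from the mask transformer decoder and
# noun_embs (K, D) are BERT embeddings of object nouns parsed from the caption,
# both projected to a shared dimension D.
import torch
import torch.nn.functional as F

def grounding_alignment_sketch(query_embs, noun_embs):
    q = F.normalize(query_embs, dim=-1)   # (N, D)
    n = F.normalize(noun_embs, dim=-1)    # (K, D)
    sim = n @ q.t()                       # (K, N) noun-to-query cosine similarity
    # Ground each caption noun to its best-matching query; minimizing the
    # negative mean of these best similarities encourages alignment.
    return -sim.max(dim=-1).values.mean()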

teaser

Demo

demo

Overview

overall_structure

🚀 Highlights:

  • SOTA performance: The proposed CGG achieves significant improvements on both open vocabulary instance segmentation and open-set panoptic segmentation compared with previous SOTA methods.
  • Data/memory efficiency: Our method achieves SOTA performance without training on large-scale image-text pairs such as CC3M. Besides, we do not use vision-language models (VLMs) like CLIP to extract language features; we only use BERT embeddings for text features (see the sketch below). As a result, our method is more data- and memory-efficient than previous SOTA methods.
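
As a reference for how class-name text features can be obtained from BERT, here is a minimal sketch using the Hugging Face transformers package; the exact BERT variant and pooling used in this repository are assumptions here, not confirmed details.

# Minimal sketch: BERT embeddings for category names (variant and pooling assumed).
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert = BertModel.from_pretrained('bert-base-uncased').eval()

class_names = ['person', 'dog', 'frisbee']           # example category names
with torch.no_grad():
    tokens = tokenizer(class_names, padding=True, return_tensors='pt')
    hidden = bert(**tokens).last_hidden_state          # (B, L, 768)
    mask = tokens['attention_mask'].unsqueeze(-1)      # ignore padding tokens
    text_feats = (hidden * mask).sum(1) / mask.sum(1)  # mean-pooled (B, 768)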

Dependencies and Installation

  1. Clone Repo

    git clone https://github.com/jianzongwu/betrayed-by-captions.git
    cd betrayed-by-captions
  2. Create Conda Environment and Install Dependencies

     conda create -n cgg python=3.8
     conda activate cgg
    
     # install pytorch (according to your local GPU and cuda version)
     conda install pytorch==1.12.1 torchvision==0.13.1 torchaudio==0.12.1 cudatoolkit=11.6 -c pytorch -c conda-forge
    
     # build mmcv-full from source
     # This repo uses mmcv-full-1.7.1
     mkdir lib
     cd lib
     git clone git@github.com:open-mmlab/mmcv.git
     cd mmcv
     pip install -r requirements/optional.txt
     MMCV_WITH_OPS=1 pip install -e . -v
    
     # build mmdetection from source
     # This repo uses mmdet-2.28.2
     cd ..
     git clone git@github.com:open-mmlab/mmdetection.git
     cd mmdetection
     pip install -v -e .
    
     # build panopticapi from source
     cd ..
     git clone git@github.com:cocodataset/panopticapi.git
     cd panopticapi
     pip install -v -e .
    
     # install other dependencies
     cd ../..
     pip install -r requirements.txt
    
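After installation, you can optionally verify the environment with a quick Python check (the version numbers in the comments follow this README):

# Quick sanity check of the installed toolchain (run inside the cgg environment).
import torch, mmcv, mmdet

print('torch:', torch.__version__, '| CUDA available:', torch.cuda.is_available())
print('mmcv-full:', mmcv.__version__)   # this repo uses 1.7.1
print('mmdet:', mmdet.__version__)      # this repo uses 2.28.2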

Get Started

Prepare pretrained models

Before performing the following steps, please download our pretrained models.

We release the models for open vocabulary instance segmentation (OVIS), open vocabulary object detection (OVOD), and open-set panoptic segmentation (OSPS). For the details of OSPS, please refer to this paper.

Model 🔗 Download Links Task
CGG-COCO-Instances [Google Drive] [Baidu Disk] OVIS & OVOD
CGG-COCO-Panoptic [Google Drive] [Baidu Disk] OSPS

Then, place the models in the checkpoints directory.

The directory structure will be arranged as:

checkpoints
   |- README.md
   |- coco_instance_ag3x_1x.pth
   |- coco_panoptic_p20.pth

Quick inference

We provide a Jupyter notebook for running inference with our model on both OVIS and OSPS. Feel free to upload your own images to test our model in various scenarios!
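
If you prefer a plain script over the notebook, inference can be sketched as follows; the with_caption and logging keyword arguments follow this repository's extended inference_detector (as used in the notebook), while the image path is just a placeholder.

# Sketch of notebook-style inference from a script; the image path is a placeholder.
from mmdet.apis import init_detector, inference_detector

config = 'configs/instance/coco_b48n17.py'
checkpoint = 'checkpoints/coco_instance_ag3x_1x.pth'
model = init_detector(config, checkpoint, device='cuda:0')

# with_caption/logging are this repository's extensions to inference_detector;
# result should contain the segmentation output (and, with with_caption=True, a generated caption).
result = inference_detector(model, 'demo.jpg', with_caption=True, logging=True)[0]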

Prepare datasets

Dataset   | Details                      | Download
COCO      | For training and evaluation  | Official Link
ADE20K    | For evaluation               | ADEChallengeData2016

For the COCO dataset, we use the 2017 version of the images and annotations. Please download the train2017 and val2017 images. For annotations, we use captions_train2017.json, instances_train/val2017.json, and panoptic_train/val2017.json.

For the evaluation on the ADE20K dataset, we use the MIT Scene Parsing Benchmark validation set, which contains 100 classes. Please download the converted COCO-format annotation file from here and put it in the annotations folder.

Please put all the datasets in the data directory. The data directory structure will be arranged as follows (a quick sanity check of the annotations is sketched after the directory tree):

data
    |- ade20k
        |- ADEChallengeData2016
            |- annotations
                |- train
                |- validation
                |- ade20k_instances_val.json
            |- images
                |- train
                |- validation
            |- objectsInfo150.txt
            |- sceneCategories.txt
    |- coco
        |- annotations
            |- captions_train2017.json
            |- instances_train2017.json
            |- instances_val2017.json
            |- panoptic_train2017.json
            |- panoptic_val2017.json
        |- train2017
        |- val2017
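
As a quick, optional sanity check of the layout above (assuming pycocotools is available, which the dependencies above install), you can load one of the annotation files and print basic statistics:

# Optional sanity check of the COCO annotation layout (assumes pycocotools).
from pycocotools.coco import COCO

coco = COCO('data/coco/annotations/instances_val2017.json')
print('categories:', len(coco.getCatIds()))   # expect 80 for COCO instances
print('images:', len(coco.getImgIds()))       # expect 5000 for val2017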

Evaluation

We provide evaluation code for the COCO dataset.

Run the following commands for evaluation on OVIS.

Run on single GPU:

python tools/test.py \
    configs/instance/coco_b48n17.py \
    checkpoints/coco_instance_ag3x_1x.pth \
    --eval bbox segm

Run on multiple GPUs:

bash ./tools/dist_test.sh \
    configs/instance/coco_b48n17.py \
    checkpoints/coco_instance_ag3x_1x.pth \
    8 \
    --eval bbox segm

Run the following commands for evaluation on OSPS.

Run on single GPU:

python tools/test.py \
    configs/openset_panoptic/coco_panoptic_p20.py \
    checkpoints/coco_panoptic_p20.pth \
    --eval bbox segm

Run on multiple GPUs:

bash ./tools/dist_test.sh \
    configs/openset_panoptic/coco_panoptic_p20.py \
    checkpoints/coco_panoptic_p20.pth \
    8 \
    --eval bbox segm

You should obtain the scores reported in the paper. The output will also be saved in work_dirs/{config_name}.

Training

Our model is first pre-trained in a class-agnostic manner. The pre-training configs are provided in configs/instance/coco_ag_pretrain_3x (for OVIS) and configs/openset_panoptic/p{5/10/20}_ag_pretrain (for OSPS).

Run the following commands for class-agnostic pre-training.

Run on single GPU:

# OVIS
python tools/train.py \
    configs/instance/coco_ag_pretrain_3x.py
# OSPS
python tools/train.py \
    configs/openset_panoptic/p20_ag_pretrain.py

Run on multiple GPUs:

# OVIS
bash ./tools/dist_train.sh \
    configs/instance/coco_ag_pretrain_3x.py \
    8
# OSPS
bash ./tools/dist_train.sh \
    configs/openset_panoptic/p20_ag_pretrain.py \
    8

The pre-training for OVIS takes 36 epochs and may take a long time. Here we provide downloads for the class-agnostic pre-trained models.

Model 🔗 Download Links Task
CGG-instance-pretrain [Google Drive] [Baidu Disk] OVIS & OVOD
CGG-panoptic-pretrain [Google Drive] [Baidu Disk] OSPS

The directory structure will be arranged as:

pretrained
   |- README.md
   |- class_ag_pretrained_3x.pth
   |- panoptic_p20_ag_pretrain.pth

If you perform the class-agnostic pre-training yourself, please rename the pre-trained models saved in work_dirs and move them into the pretrained folder following the directory structure above. The training configs will load the pre-trained weights.
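
For example, a minimal way to do this (the work_dirs path and latest.pth name below assume MMDetection's default checkpoint naming for the coco_ag_pretrain_3x config; adjust them to your actual output):

# Example: place a self-trained class-agnostic checkpoint (source path assumed).
import os, shutil

os.makedirs('pretrained', exist_ok=True)
# latest.pth is MMDetection's default pointer to the most recent checkpoint;
# change the source path to whatever your pre-training run actually produced.
shutil.copy('work_dirs/coco_ag_pretrain_3x/latest.pth',
            'pretrained/class_ag_pretrained_3x.pth')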

After pre-training, the open vocabulary training configs are provided in configs/instance/coco_b48n17 (for OVIS) and configs/openset_panoptic/coco_panoptic_p{5/10/20} (for OSPS).

Run one of the following commands for training.

Run on single GPU:

# OVIS
python tools/train.py \
    configs/instance/coco_b48n17.py
# OSPS
python tools/train.py \
    configs/openset_panoptic/coco_panoptic_p20.py

Run on multiple GPUs:

# OVIS
bash ./tools/dist_train.sh \
    configs/instance/coco_b48n17.py \
    8
# OSPS
bash ./tools/dist_train.sh \
    configs/openset_panoptic/coco_panoptic_p20.py \
    8

The output will be saved in work_dirs/{config_name}.

Results

Quantitative results

Results on OVIS:

result-OVIS

Results on OVOD:

result-OVOD

Results on OSPS:

result-OSPS

Citation

If you find our repo useful for your research, please consider citing our paper:

@article{wu2023betrayed,
   title={Betrayed by captions: Joint caption grounding and generation for open vocabulary instance segmentation},
   author={Wu, Jianzong and Li, Xiangtai and Ding, Henghui and Li, Xia and Cheng, Guangliang and Tong, Yunhai and Loy, Chen Change},
   journal={arXiv preprint arXiv:2301.00805},
   year={2023}
 }

Contact

If you have any questions, please feel free to contact us via [email protected] or [email protected].

License

This project is licensed under a Creative Commons Attribution-NonCommercial 4.0 International license for non-commercial use only. Any commercial use requires formal permission first.

Acknowledgement

This repository is maintained by Jianzong Wu and Xiangtai Li.

This code is based on MMDetection.


betrayed-by-captions's Issues

KeyError: 'metric bbox is not supported'

The coco_panoptic_open.py config provided in open_set does not support the bbox metric:

        allowed_metrics = ['PQ']
        for metric in metrics:
            if metric not in allowed_metrics:
                raise KeyError(f'metric {metric} is not supported')

Demo notebook fails at inference_detector

The demo notebook fails with a CUDA error, possibly relating to incorrect matrix shapes during multiplication. I'm using the exact same torch/cuda/mmcv/mmdet versions as listed in the README, on a Quadro RTX 8000 GPU.

The stack trace I get after running the cell

result = inference_detector(model, img, with_caption=True, logging=True)[0]

is

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[8], line 2
      1 # Predict segmentation results, as well as image captions
----> 2 result = inference_detector(model, img, with_caption=True, logging=True)[0]
...
File ~/miniforge3/envs/cgg/lib/python3.8/site-packages/torch/functional.py:360
--> 360 return _VF.einsum(equation, operands)

RuntimeError: CUDA error: CUBLAS_STATUS_INVALID_VALUE when calling `cublasSgemmStridedBatched( handle, opa, opb, m, n, k, &alpha, a, lda, stridea, b, ldb, strideb, &beta, c, ldc, stridec, num_batches)`

Have you encountered this before? Is there any way to fix it?

Reproduce paper's scores

Hi everyone,

Has anyone been able to reproduce the accuracy scores reported in the paper?
I tried training with 2 NVIDIA GeForce RTX 3090 GPUs, but the scores dropped by 5%.

How can I achieve the paper's accuracy?

Thanks.

checkpoint

The Baidu Disk sharing link for the checkpoints has expired, and the Google Drive link cannot be opened. Could you share them again?

Request for open_set/datasets/build_dataloader.py

Hi, thanks for releasing your great work. However, when I run "python tools/train.py" to try to reproduce the results, the following error appears:
ImportError: cannot import name 'build_dataloader' from 'open_set.datasets'
It seems that the build_dataloader.py file is missing. Could you kindly provide it? Thanks a lot.
