magic-research / bubogpt

BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs

Home Page: https://bubo-gpt.github.io/

License: BSD 3-Clause "New" or "Revised" License

Python 99.64% Shell 0.36%

bubogpt's Introduction

BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs

A multi-modal LLM capable of jointly understanding text, vision, and audio, and grounding knowledge into visual objects.

[Project Page] [Arxiv] [Demo Video] [Gradio] [Data] [Model]

bubogpt_framework

BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs
Yang Zhao*, Zhijie Lin*, Daquan Zhou, Zilong Huang, Jiashi Feng and Bingyi Kang† (*Equal Contribution, †Project Lead)
Bytedance Inc.

HuggingFace space

News🔥

2023/07/21 - Huggingface demo released!

Setup

Clone this repository and navigate to the current folder.

Environment

Our code is based on Python 3.9, CUDA 11.7 and PyTorch 2.0.1.

pip3 install -r pre-requirements.txt
pip3 install -r requirements.txt

Models

Follow the instructions to prepare the pretrained Vicuna weights, and update the llama_model entry in bubogpt/configs/models/mmgpt4.yaml.
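
For reference, the entry to update might look like the following sketch (the path is a placeholder; substitute the directory of your own prepared Vicuna weights):

```yaml
# in bubogpt/configs/models/mmgpt4.yaml
# Vicuna
llama_model: "/path/to/vicuna-7b"
```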

## get pre-trained checkpoints
mkdir checkpoints && cd checkpoints;
wget https://huggingface.co/spaces/Vision-CAIR/minigpt4/resolve/main/blip2_pretrained_flant5xxl.pth;
wget https://huggingface.co/spaces/xinyu1205/recognize-anything/resolve/main/ram_swin_large_14m.pth;
wget https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha/groundingdino_swint_ogc.pth;
wget https://huggingface.co/spaces/abhishek/StableSAM/resolve/main/sam_vit_h_4b8939.pth;
wget https://huggingface.co/magicr/BuboGPT-ckpt/resolve/main/bubogpt_7b.pth
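
Before launching, you can sanity-check that every file landed in checkpoints/. This is a small stdlib-only sketch, not part of the repo:

```python
from pathlib import Path

# Checkpoints fetched by the wget commands above.
EXPECTED = [
    "blip2_pretrained_flant5xxl.pth",
    "ram_swin_large_14m.pth",
    "groundingdino_swint_ogc.pth",
    "sam_vit_h_4b8939.pth",
    "bubogpt_7b.pth",
]

def missing_checkpoints(ckpt_dir: str = "checkpoints") -> list[str]:
    """Return the names of expected checkpoint files not found in ckpt_dir."""
    d = Path(ckpt_dir)
    return [name for name in EXPECTED if not (d / name).is_file()]

if __name__ == "__main__":
    missing = missing_checkpoints()
    print("all checkpoints present" if not missing else f"missing: {missing}")
```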

For training, download the MiniGPT-4 checkpoint to checkpoints.

Data

Stage1

Stage2

Usage

Gradio demo

Run gradio demo with:

python3 app.py --cfg-path eval_configs/mmgpt4_eval.yaml --gpu-id 0

Training

Browse the dataset config folder, and replace the storage item with path/to/your/data for each dataset.
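
This edit can also be scripted across configs; the helper below is a hypothetical stdlib-only sketch (set_storage is not part of the repo) that rewrites each storage: entry while preserving indentation:

```python
def set_storage(cfg_text: str, data_path: str) -> str:
    """Replace the value of every `storage:` key in a dataset config's text,
    preserving indentation. Hypothetical helper, not part of the repo."""
    out = []
    for line in cfg_text.splitlines():
        stripped = line.lstrip()
        if stripped.startswith("storage:"):
            indent = line[: len(line) - len(stripped)]
            line = f"{indent}storage: {data_path}"
        out.append(line)
    return "\n".join(out)

# Example: point a (made-up) dataset entry at a local data directory.
example = "datasets:\n  cc_sbu:\n    storage: path/to/your/data"
print(set_storage(example, "/data/cc_sbu"))
```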

Stage 1: Audio pre-training

bash dist_train.sh train_configs/mmgpt4_stage1_audio.yaml

Stage 2: Multi-modal instruct tuning

bash dist_train.sh train_configs/mmgpt4_stage2_mm.yaml

Demo

1. Image Understanding with Grounding

2. Audio Understanding

3. Aligned Audio-Image Understanding

4. Arbitrary Audio-Image Understanding

For more demonstrations, please refer to the examples.

Acknowledgement

This codebase is mainly built on the following repos:

bubogpt's People

Contributors

awalkzy, ikuinen, magicfilm


bubogpt's Issues

About the bubogpt checkpoint that only completed the first stage of training

Thanks to the authors for their outstanding contribution to the open source community; this is great work! You currently provide a complete BuboGPT checkpoint that includes both the first and second stages of training. Could you also provide a checkpoint that has only completed the first stage of training? Thanks again for your contributions to the open source community!

Question about magicr/vicuna-7b

Thank you for your excellent work. The 'magicr/vicuna-7b' repository seems to be private. I would like to know if it is different from other Vicuna models. Thanks!

When loading ImageBind, EOFError, ran out of input

This is my mmgpt4.yaml file:

model:
  arch: mm_gpt4

  # Imagebind
  freeze_imagebind: True

  # Q-Former
  freeze_qformer: True
  q_former_model: "checkpoints/blip2_pretrained_flant5xxl.pth"
  num_query_token: 32

  # Vicuna
  llama_model: "saved_weight/tokenizer.model"

  # generation configs
  prompt: ""

preprocess:
    vis_processor:
        train:
          name: "imagebind_vision_train"
          image_size: 224
        eval:
          name: "imagebind_vision_eval"
          image_size: 224
    text_processor:
        train:
          name: "imagebind_caption"
        eval:
          name: "imagebind_caption"

Extending for Video

Do you have any plans on extending the current work for videos too?

I tried to modify it but it seems there are lots of things to be modified in between😅

Command-line script

I am deploying on a Linux server where running the demo with Gradio is not supported. Is there a command-line script to run the model?

How do you get the bubo icon?

Dear authors,

Thank you for your wonderful work! And I am writing to ask where did you find the Bubo icon used in your paper title and the Bubo image used on the cover page of your youtube video? Did you generate the images or download them?

Look forward to your reply.

Thanks,
Hiusam

No module named 'constants.constant'; 'constants' is not a package

Hi,

Installing requirements.txt went well; however, I am getting the error below. Even after running pip install constants, the error is still there:

C:\Users\User1\Downloads\bubogpt-main\bubogpt-main>python eval_scripts/qualitative_eval.py --cfg-path eval_configs/mmgpt4_eval.yaml --gpu-id 0
Traceback (most recent call last):
  File "C:\Users\User1\Downloads\bubogpt-main\bubogpt-main\eval_scripts\qualitative_eval.py", line 15, in <module>
    from constants.constant import LIGHTER_COLOR_MAP_HEX
ModuleNotFoundError: No module named 'constants.constant'; 'constants' is not a package

Can't install requirements.txt


Hello, I get an error when I try to install requirements.txt

ERROR: Could not find a version that satisfies the requirement torch==2.0.0+cu117 (from versions: 1.11.0, 1.12.0, 1.12.1, 1.13.0, 1.13.1, 2.0.0, 2.0.1)
ERROR: No matching distribution found for torch==2.0.0+cu117

What is the difference from MiniGPT-4?

As the title says.
From the paper, compared with MiniGPT-4, it adds audio as a supported modality, and it appends a pipeline after the LLM (Vicuna) output to align entities with their locations in the image.

About GPT-4 in match.py

I notice that you directly use OpenAI's GPT-4 to match caption and grounded entity. Why not train a custom model by leveraging existing datasets like the ones used in KOSMOS-2 or Shikra?

Loading ImageBind got Killed

When running

python3 app.py --cfg-path eval_configs/mmgpt4_eval.yaml --gpu-id 0

It gets this far before it is killed:
Initializing Chat
Loading ImageBind
Killed

Do you know how I can solve this?
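
Not an answer from this thread, but `Killed` during checkpoint loading typically means the Linux out-of-memory killer stopped the process. A stdlib-only, Linux-specific sketch for checking available RAM before launching (`available_gib` is a hypothetical helper):

```python
def available_gib(meminfo_path: str = "/proc/meminfo") -> float:
    """Return available system memory in GiB, parsed from /proc/meminfo (Linux only)."""
    with open(meminfo_path) as f:
        for line in f:
            if line.startswith("MemAvailable:"):
                # Value is reported in kB; convert kB -> GiB.
                return int(line.split()[1]) / (1024 ** 2)
    raise RuntimeError("MemAvailable not found in meminfo")

if __name__ == "__main__":
    print(f"{available_gib():.1f} GiB available")
```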
