csuhan / onellm

[CVPR 2024] OneLLM: One Framework to Align All Modalities with Language

License: Other

Python 86.50% C++ 3.59% Cuda 7.71% C 0.92% Shell 1.28%

onellm's Introduction

OneLLM: One Framework to Align All Modalities with Language

[Project Page] [Paper] [HF Demo🤗] [Modelscope Demo🤖] [Model🤗] [Data]

News

  • 2024.02.27 OneLLM is accepted by CVPR 2024!🎉
  • 2023.12.01 Release model weights and inference code.

Contents

Install

  1. Clone the repo into a local folder.
git clone https://github.com/csuhan/OneLLM

cd OneLLM
  2. Install packages.
conda create -n onellm python=3.9 -y
conda activate onellm

pip install -r requirements.txt

# install pointnet
cd model/lib/pointnet2
python setup.py install
  3. Install Apex (optional); see the CUDA sanity check after these commands if the build fails.
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./
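Both the pointnet2 op and Apex compile CUDA extensions, so if either build fails it is worth confirming that the environment actually sees CUDA. A minimal check (nothing OneLLM-specific):

# Quick environment check before/after building the pointnet2 and Apex CUDA extensions.
import torch

print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("CUDA version torch was built with:", torch.version.cuda)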

Models

We provide a preview model on Hugging Face at csuhan/OneLLM-7B.
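If you prefer to script the download, a minimal sketch using huggingface_hub (the local directory name is an arbitrary choice):

# Sketch: fetch the released OneLLM-7B weights from the Hugging Face Hub.
from huggingface_hub import snapshot_download

weights_dir = snapshot_download(repo_id="csuhan/OneLLM-7B", local_dir="OneLLM-7B")
print(weights_dir)  # use this path as ${WEIGHTS_DIR} below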

Demo

Hugging Face Demo: csuhan/OneLLM.

Local Demo: Assuming you have downloaded the weights to ${WEIGHTS_DIR}, run the following command to start a Gradio demo locally.

python demos/multi_turn_mm.py --gpu_ids 0 --tokenizer_path config/llama2/tokenizer.model --llama_config config/llama2/7B.json --pretrained_path ${WEIGHTS_DIR}/consolidated.00-of-01.pth

CLI Demo:

python demos/cli.py --image_path ${IMAGE_PATH} --gpu_ids 0 --tokenizer_path config/llama2/tokenizer.model --llama_config config/llama2/7B.json --pretrained_path ${WEIGHTS_DIR}/consolidated.00-of-01.pth

Data

Please check Data.md for more detail.

Evaluation

Please check Evaluation.md for more detail.

Training

Image-Text Pretraining

Single Node 8-GPU Training: exps/image_text_pretrain_8gpu.sh

torchrun --nproc_per_node=8 main_pretrain.py \
--epochs 1 --dataset image \
--batch_size 40 --accum_iter 16 \
--model_parallel_size 1 \
--data_parallel sdp \
--save_consolidated \
--llama_type onellm \
--llama_ckpt_dir ${LLAMA_7B_PATH} \
--llama_config config/llama2/7B.json \
--tokenizer_path config/llama2/tokenizer.model \
--auto_resume \
--weight_decay 0.1 --output_dir ${OUTPUT_DIR} \
--warmup_iters 2000 --lr_decay_iters 200000 --lr 5e-5 --min_lr 5e-6 --clip_grad 2 \
--save_freq 1000 \
2>&1 | tee -a ${OUTPUT_DIR}/output.log
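For reference, the effective global batch size implied by the flags above can be worked out as below; this is a back-of-the-envelope sketch, not something the script prints.

# Effective global batch size for the single-node run above.
per_gpu_batch = 40   # --batch_size
accum_iter = 16      # --accum_iter
num_gpus = 8         # --nproc_per_node
print(per_gpu_batch * accum_iter * num_gpus)  # 5120 samples per optimizer step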

Multi Nodes DDP Training:

Run the script on each of the N nodes at the same time to launch multi-node DDP training. The following is an example script for one node:

MASTER_ADDR=IP_ADDRESS_OF_NODE_1
NNODES=N
MASTER_PORT=29500
NPROC_PER_NODE=8

RANK=0  # node rank for this machine: 0, 1, ..., N-1

torchrun \
--nnodes=$NNODES \
--nproc_per_node=$NPROC_PER_NODE \
--node_rank=$RANK \
--master_port=$MASTER_PORT \
--master_addr=$MASTER_ADDR \
main_pretrain.py \
--epochs 1 --dataset image \
--batch_size 40 --accum_iter 16 \
--model_parallel_size 1 \
--data_parallel sdp \
--save_consolidated \
--llama_type onellm \
--llama_ckpt_dir ${LLAMA_7B_PATH} \
--llama_config config/llama2/7B.json \
--tokenizer_path config/llama2/tokenizer.model \
--auto_resume \
--weight_decay 0.1 --output_dir ${OUTPUT_DIR} \
--warmup_iters 2000 --lr_decay_iters 200000 --lr 5e-5 --min_lr 5e-6 --clip_grad 2 \
--save_freq 1000 \
2>&1 | tee -a ${OUTPUT_DIR}/output.log

Multi Node SLURM Training: exps/image_text_pretrain_slurm.sh

#!/bin/bash
#SBATCH --gres=gpu:8
#SBATCH -n 16
#SBATCH -N 2
#SBATCH --cpus-per-task=16

srun python -u main_pretrain.py \
--epochs 1 --dataset image \
--batch_size 40 --accum_iter 8 \
--model_parallel_size 1 \
--data_parallel sdp \
--save_consolidated \
--llama_type onellm \
--llama_ckpt_dir ${LLAMA_7B_PATH} \
--llama_config config/llama2/7B.json \
--tokenizer_path config/llama2/tokenizer.model \
--auto_resume \
--weight_decay 0.1 --output_dir ${OUTPUT_DIR} \
--warmup_iters 2000 --lr_decay_iters 200000 --lr 5e-5 --min_lr 5e-6 --clip_grad 2 \
--save_freq 1000 \
2>&1 | tee -a ${OUTPUT_DIR}/output.log

Multimodal-Text Pretraining

Stage II Pretraining: Assume we have the pretrained ${IMAGE_TEXT_MODEL}, run exps/multimodal_text_pretrain_stage2.sh for video-audio-point-text pretraining.

Stage III Pretraining: Assume we have the pretrained ${STAGE2_MODEL}, run exps/multimodal_text_pretrain_stage3.sh for depth-normal-imu-fmri-text pretraining.

Instruction Tuning

Assume we have the pretrained ${STAGE3_MODEL}, run exps/multimodal_text_finetune.sh for multimodal instruction tuning.

Citation

@InProceedings{han2023onellm,
  title={OneLLM: One Framework to Align All Modalities with Language},
  author={Han, Jiaming and Gong, Kaixiong and Zhang, Yiyuan and Wang, Jiaqi and Zhang, Kaipeng and Lin, Dahua and Qiao, Yu and Gao, Peng and Yue, Xiangyu},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2024}
}

Acknowledgement

LLaMA, LLaMA-Adapter, LLaMA2-Accessory, Meta-Transformer, ChatBridge

License

This project is developed based on Llama 2. Please refer to the LLAMA 2 Community License.

onellm's People

Contributors

csuhan, eltociear, invictus717, kxgong


onellm's Issues

Distributed inference/demo checkpoints

Hi! Thank you for the great contribution. I am trying to run the demo; my setup is 4x 2080Ti 12GB GPUs, so I cannot run the model on a single card (it takes ~16GB as far as I know). The checkpoint is not distributed, but the model class uses fairscale distributed modules, so I haven't found a way to load the state dict on more than one GPU. Am I missing something? If not, would you release distributed checkpoints and/or distributed inference scripts? Thanks!
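For context on the memory question, the consolidated checkpoint can be inspected on CPU to estimate its footprint before attempting any multi-GPU loading; a rough sketch (whether the weights are nested under a "model" key is an assumption):

# Rough sketch: estimate the checkpoint's memory footprint without touching the GPU.
import torch

ckpt = torch.load("consolidated.00-of-01.pth", map_location="cpu")
state = ckpt.get("model", ckpt)  # assumption: weights may be nested under "model"
total = sum(t.numel() * t.element_size() for t in state.values() if torch.is_tensor(t))
print(f"{total / 1024**3:.1f} GiB across {len(state)} tensors")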

Some confusion about the modalities of depth/normal maps.

Thank you for your outstanding work.

I noticed that when running the demo you provided, QA inference in the depth/normal-map modalities seems to require providing both the RGB image and the depth/normal map together to obtain accurate answers. If only the depth/normal information is provided, the system appears unable to answer questions.

Could you clarify whether the intended behavior in the depth/normal mode aligns with the paper, which suggests that QA inference can be accomplished from depth/normal information alone?

(Screenshot attached.)

fmri data

Hi, is there any reference for the input fMRI format or how to process the data?

License

Please include a license for this to be used. Thank you!

Are the embeddings generated from different modalities comparable?

Just like CLIP, are the embeddings generated by the universal encoder comparable across modalities? If so, we could perform search and matching based on embedding similarity for data from different modalities. Could you provide the encoder part of the model separately for testing? The full 15GB model is too large at the moment.
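As background to the question: once per-modality features are exposed, checking comparability only needs cosine similarity. A hypothetical sketch (the model.encode call is invented for illustration; OneLLM does not necessarily expose such an API):

# Hypothetical sketch: compare embeddings from two modalities with cosine similarity.
import torch
import torch.nn.functional as F

def cosine_sim(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    # a, b: (N, D) batches of embeddings
    return F.cosine_similarity(a, b, dim=-1)

# img_emb = model.encode(image_batch, modal=["image"])   # hypothetical API
# aud_emb = model.encode(audio_batch, modal=["audio"])   # hypothetical API
# print(cosine_sim(img_emb, aud_emb))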

freezing of LLM during pretrain stage

Hi, Thanks for the awesome contribution to the community!

There's something that has been bugging me for hours. The paper mentions that the LLM is frozen during the training of the projection modules, but I couldn't pinpoint the code responsible for that in the released code. Is the paper just a guideline, or has the relevant part not been released yet? Or have I simply missed the code responsible for this behavior?

Eugene
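For reference, a common way to freeze an LLM backbone while training newly added projection modules looks like the generic sketch below; this is not necessarily how OneLLM implements it, and "projector" is a placeholder prefix.

# Generic sketch: freeze everything except modules whose names start with a given prefix.
def freeze_backbone(model, trainable_prefixes=("projector",)):
    for name, param in model.named_parameters():
        param.requires_grad = name.startswith(trainable_prefixes)
    n_train = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f"trainable parameters: {n_train / 1e6:.1f}M")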

How to install petrel_client

In pretrain_dataset.py you use Client from petrel_client, but it is not a package covered in requirements.txt, and installing it directly with pip install petrel_client doesn't work.
Where can I find and install this package?
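petrel_client appears to be an internal object-storage client and is not on PyPI. If pretrain_dataset.py only calls Client(...).get(path) (an assumption worth checking against the code), a minimal local-filesystem stand-in could look like this:

# Workaround sketch: a drop-in stub for petrel_client's Client that reads from local disk.
class Client:
    def __init__(self, *args, **kwargs):
        pass  # the real client takes a config path; ignored here

    def get(self, path: str) -> bytes:
        # Read from the local filesystem instead of the Petrel/Ceph backend.
        with open(path, "rb") as f:
            return f.read()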

Mistral?

Can this be used with Mistral?

Training code

Hello! Your work is excellent and I am also very interested, I wonder when you can open source the training code or give some examples, thanks!

Model not producing accurate captions

Hi! I have been having some trouble getting the repo and models working. Specifically, I tried to run the evaluation scripts (COCO captioning) as reported in the README, using the checkpoint available on the Hugging Face Hub (https://huggingface.co/csuhan/OneLLM-7B). I'm using an A500 24GB GPU for inference.

The CIDEr score I get is 0.02, much lower than expected given that the model is trained on MS COCO data. The captions are not accurate and lack variability (I pasted some examples below). Moreover, they consistently describe the images as black and white. I double-checked that the images are downloaded properly, and I used the code as-is after only adapting the paths. Is the checkpoint ready to use and adequate for finetuning on additional tasks? Is there any step missing from the repo docs that I should be doing?

Please feel free to request additional information about my setup that might be relevant to the problem.

Thanks!

    {
        "image_id": 184613,
        "caption": "A close up of a black and white photo of a cat."
    },
    {
        "image_id": 403013,
        "caption": "A black and white photo of a long object."
    },
    {
        "image_id": 562150,
        "caption": "A black and white photo of a long object."
    },
    {
        "image_id": 360772,
        "caption": "A black and white photo of a long thin object."
    },
    {
        "image_id": 340559,
        "caption": "A black and white photo of a long object."
    },
    {
        "image_id": 321107,
        "caption": "A black and white photo of a black object."
    },
    {
        "image_id": 129001,
        "caption": "A black and white photo of a long object."
    },
    {
        "image_id": 556616,
        "caption": "A black and white photo of a long object."
    },
    {
        "image_id": 472621,
        "caption": "A black and white photo of a blurry object."
    },
    {
        "image_id": 364521,
        "caption": "A black and white photo of a black and white object."
    },
    {
        "image_id": 310391,
        "caption": "A black and white photo of a blank screen."
    },

Images and videos with high resolution

Thank you for releasing the model & code. Can the model work with images and videos of high resolution like 720x1280, without having to resize them to 224x224?
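For context, the released encoder works on fixed 224x224 inputs (CLIP-ViT style), so high-resolution frames are normally resized and center-cropped first. A standard preprocessing sketch with torchvision; the normalization constants are the usual CLIP values and are an assumption, not taken from this repo:

# Standard CLIP-style 224x224 preprocessing sketch (mean/std are assumed CLIP defaults).
from PIL import Image
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize(224, interpolation=transforms.InterpolationMode.BICUBIC),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=(0.48145466, 0.4578275, 0.40821073),
                         std=(0.26862954, 0.26130258, 0.27577711)),
])

image = preprocess(Image.open("frame_720x1280.jpg").convert("RGB"))  # (3, 224, 224)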

Can't reproduce that Page 6, Table 5, Evaluation on Point Cloud-Text Tasks' Bleu, METEOR and ROUGE_L numbers

I followed https://github.com/csuhan/OneLLM/blob/main/docs/Evaluation.md:

Point-Text Evaluation
PointLLM Caption
Download PointLLM data from this link
Fill pretrained_path in eval/point_cap_pointllm.py and run: python eval/point_cap_pointllm.py.
Evaluate with eval/caption_eval.py. The annotation file is at datasets/Eval/point/pointllm_test_cococap.json

Several of my team members and I all got similarly low BLEU, METEOR and ROUGE_L numbers when trying to reproduce the OneLLM row of Table 5, and CIDEr is zero (see below). Can you please double-check? We believe we are using the same point cloud files, scripts and model. Thank you. Rob

Bleu_1: 0.104
Bleu_2: 0.065
Bleu_3: 0.045
Bleu_4: 0.034
METEOR: 0.131
ROUGE_L: 0.175
CIDEr: 0.000
SPICE: 0.094

From https://arxiv.org/pdf/2312.03700, Page 6, Table 5, Evaluation on Point Cloud-Text Tasks: "The evaluation dataset is from Objaverse [16], following the data split in PointLLM [92]. InstructBLIP takes single-view image as input, while PointLLM and OneLLM take point cloud as input. GPT4-Acc.: GPT4 as the accuracy evaluator [92]."

| Model | BLEU-1 (Captioning) | ROUGE-L (Captioning) | METEOR (Captioning) | GPT4-Acc. (Classification) |
| --- | --- | --- | --- | --- |
| InstructBLIP-7B [15] | 11.2 | 13.9 | 14.9 | 38.5 |
| InstructBLIP-13B [15] | 12.6 | 15.0 | 16.0 | 35.5 |
| PointLLM-7B [92] | 8.0 | 11.1 | 15.2 | 47.5 |
| PointLLM-13B [92] | 9.7 | 12.8 | 15.3 | 45.0 |
| One-LLM-7B (Ours) | 42.2 | 45.3 | 20.3 | 44.5 |
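For anyone re-checking these numbers, the captioning metrics can be recomputed locally with pycocoevalcap in the standard COCO-caption style; a minimal sketch (this mirrors common usage, not the repo's exact eval/caption_eval.py):

# Minimal sketch: recompute caption metrics with pycocoevalcap (standard COCO-style usage).
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.cider.cider import Cider
from pycocoevalcap.meteor.meteor import Meteor
from pycocoevalcap.rouge.rouge import Rouge

gts = {"obj_001": ["a small wooden chair with four legs"]}  # reference captions
res = {"obj_001": ["a wooden chair"]}                       # model captions

for name, scorer in [("Bleu", Bleu(4)), ("METEOR", Meteor()),
                     ("ROUGE_L", Rouge()), ("CIDEr", Cider())]:
    score, _ = scorer.compute_score(gts, res)
    print(name, score)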

Vague output for audio

I slightly modified the audio eval code to run on my own dataset; however, the outputs are vague even when the audio is speech.
They are all like the ones below:

  1. A device is beeping and it gets louder and louder.
  2. A machine is running and making a high pitched sound.
  3. A machine is running and then stops suddenly.

I attach my code below

import torch

# conv_templates and make_audio_features are helpers from the OneLLM repo;
# adjust the import paths below to match where they live in the codebase.
from data.conversation_lib import conv_templates
from data.data_utils import make_audio_features

def inference_onellm(model, target_dtype, images, modal=['image']):
    # Pick a prompt per modality (the last matching branch wins if several are given).
    if 'imu' in modal:
        inps = ['Describe the motion.'] * len(images)
    if 'audio' in modal:
        inps = ['Provide a one-sentence caption for the provided audio.'] * len(images)
        # inps = ['Provide a one-sentence action description for the provided audio.'] * len(images)
    if 'image' in modal:
        inps = ['Describe the scene.'] * len(images)
    images = images.cuda().to(target_dtype)
    # Wrap each instruction in the "v1" conversation template.
    prompts = []
    for inp in inps:
        conv = conv_templates["v1"].copy()
        conv.append_message(conv.roles[0], inp)
        conv.append_message(conv.roles[1], None)
        prompts.append(conv.get_prompt())

    with torch.cuda.amp.autocast(dtype=target_dtype):
        responses = model.generate(prompts, images, 128, temperature=0.1, top_p=0.75, modal=modal)
        outputs = []
        for response, prompt in zip(responses, prompts):
            # Strip the echoed prompt and keep the text before the '###' separator.
            response = response[len(prompt):].split('###')[0]
            response = response.strip()
            outputs.append(response)
    return outputs

# Compute mel features for the wav file and run audio captioning.
audio = torch.tensor(make_audio_features('tmp_onellm.wav', mel_bins=128).transpose(0, 1)[None, None])
result_audio = inference_onellm(model, target_dtype, audio, modal=['audio'])

About the role of the different experts

Hello, thank you for this very inspiring work.

Could you give a rough description of what the different experts are responsible for? It seems that the different experts are not dedicated to different modalities, but rather capture different aspects of the image modality, which would explain why image and image-like modalities such as video are more sensitive to the number of experts.

Also, could this be related to using a frozen image encoder in the encoding stage, which limits the learning of the other modalities? Or is this a kind of soft alignment, aligning the other modalities with images?
