
quilt-llava's Introduction

Quilt-LLaVA: Visual Instruction Tuning by Extracting Localized Narratives from Open-Source Histopathology Videos (CVPR24)

We generated spatially grounded visual instruction tuning data from educational YouTube videos to train a large language and vision assistant for histopathology that can localize prominent medical regions and reason towards a diagnosis.

[Paper, Arxiv], [QUILT-LLAVA HF], [QUILT-Instruct], [QUILT-VQA], [QUILT-VQA-RED].

Mehmet Saygin Seyfioglu*, Wisdom Ikezogwo*, Fatemeh Ghezloo*, Ranjay Krishna, Linda Shapiro (*Equal Contribution)



Quilt-LLaVA was initialized with the general-domain LLaVA and then continuously trained in a curriculum learning fashion (first biomedical concept alignment, then full-blown instruction tuning). We evaluated Quilt-LLaVA on standard visual conversation and question answering tasks. We release both the stage 1 (Quilt) and stage 2 (Quilt-Instruct) training sets, as well as our evaluation dataset Quilt-VQA.

Release

  • Quilt-LLaVA is open-sourced under a release policy that does not allow any commercial use (the model is available under the CC BY-NC 3.0 license; see the Usage and License Notices below). Check out the paper.
  • Alongside Quilt-LLaVA, we also release Quilt-Instruct, our instruction-tuning data generated from educational videos. It is protected by the CC BY-NC-ND 3.0 license with an additional Data Use Agreement (DUA).
  • We also release Quilt-VQA, an evaluation dataset for evaluating generative multimodal histopathology models.


We have created a grounded image-text dataset from educational histopathology videos on YouTube. The bottom row displays an illustrative example. First, we detect frames that have a stable background. Then we extract the narrators' mouse cursors and perform spatio-temporal clustering on the cursor locations to obtain dense visual groundings for the narrators' speech. Using this method, we create a grounded image-text dataset, from which we generate Quilt-Instruct to train our vision-language assistant, Quilt-LLaVA.
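The released pipeline lives in the quilt-instruct folder; as a rough, self-contained sketch of the clustering idea only (the point values, time scaling, and DBSCAN parameters below are illustrative assumptions, not the released implementation):

import numpy as np
from sklearn.cluster import DBSCAN

# Illustrative only: cursor_points holds (x, y, t) triples for one narrated video chunk,
# e.g. extracted from the released cursor data (values here are made up).
cursor_points = np.array([
    [120, 340, 0.0], [125, 338, 0.2], [130, 341, 0.4],   # cursor dwelling near one region
    [560, 110, 5.0], [558, 115, 5.2],                    # later, a second region
])

# Scale time relative to pixels so temporally distant points do not merge into one cluster.
time_scale = 50.0
features = cursor_points * np.array([1.0, 1.0, time_scale])

# eps / min_samples are illustrative; the paper's pipeline uses its own clustering settings.
labels = DBSCAN(eps=60.0, min_samples=2).fit_predict(features)

# Each cluster yields a rough bounding box that grounds the narrator's concurrent speech.
for label in set(labels) - {-1}:
    pts = cursor_points[labels == label, :2]
    x0, y0 = pts.min(axis=0)
    x1, y1 = pts.max(axis=0)
    print(f"cluster {label}: bbox=({x0:.0f}, {y0:.0f}, {x1:.0f}, {y1:.0f})")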

Contents

Data Download

Instruction-Tuning data   Size
Quilt-Instruct            189 MiB

Evaluation files          Size
Quilt-VQA                 305 MiB
Quilt-VQA Red Circle      95.8 MiB

Raw Mouse Cursor Data     Filename         Size
Cursors                   cursor.parquet   333 MiB

Image URLs                Filename             Size
Images                    quilt_instruct.zip   25 GiB
(Please click "request time-limited access" and sign a quick Data Use Agreement (DUA).)

Data Generation

In case you want to generate the instruction-tuning data from scratch, please see the quilt-instruct folder.

See the quilt-VQA folder for the prompt and helper code used to generate the Quilt-VQA evaluation data.

Install

If you are using Windows, do NOT proceed; see the instructions here.

  1. Clone this repository and navigate to the quilt-llava folder
git clone https://github.com/aldraus/quilt-llava.git
cd quilt-llava
  2. Install Package
conda create -n qllava python=3.10 -y
conda activate qllava
pip install --upgrade pip  # enable PEP 660 support
pip install -e .
  3. Install additional packages for training cases
pip install -e ".[train]"
pip install flash-attn --no-build-isolation

CLI Inference

Chat about images using Quilt-LLaVA without the need for a Gradio interface. It also supports multiple GPUs, and 4-bit and 8-bit quantized inference. With 4-bit quantization, our Quilt-Llava-v1.5-7b uses less than 8GB of VRAM on a single GPU. Ignore the LlavaLlamaForCausalLM initialization warnings for the vision tower.

python -m llava.serve.cli \
    --model-path wisdomik/Quilt-Llava-v1.5-7b \
    --image-file "https://wisdomikezogwo.github.io/images/eval_example_3_.jpg" \
    --load-4bit

For inference on multiple images in a single run, use cli_inference and follow the interactive prompts:

python -m llava.serve.cli_inference \
    --model-path wisdomik/Quilt-Llava-v1.5-7b \
    --load-8bit
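If you prefer to call the model from Python instead of the CLI (for example, to caption many images in a loop), a minimal sketch is shown below. It assumes the LLaVA-style Python API this codebase inherits (load_pretrained_model, tokenizer_image_token, etc.), a hypothetical local image path, and the vicuna_v1 conversation template; check the modules under llava/ for the exact signatures before relying on it.

import torch
from PIL import Image

from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN
from llava.conversation import conv_templates
from llava.mm_utils import get_model_name_from_path, process_images, tokenizer_image_token
from llava.model.builder import load_pretrained_model

# Load the released checkpoint (model_base=None because the weights are fully merged).
model_path = "wisdomik/Quilt-Llava-v1.5-7b"
tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path, None, get_model_name_from_path(model_path)
)

# Build a single-turn conversation containing the image placeholder token.
conv = conv_templates["vicuna_v1"].copy()  # assumed conversation template
conv.append_message(conv.roles[0], DEFAULT_IMAGE_TOKEN + "\nWhat does this histopathology image show?")
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()

image = Image.open("example_patch.jpg").convert("RGB")  # hypothetical local image path
image_tensor = process_images([image], image_processor, model.config).to(model.device, dtype=torch.float16)
input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(model.device)

with torch.inference_mode():
    output_ids = model.generate(input_ids, images=image_tensor, do_sample=False, max_new_tokens=512)

# In the pinned codebase, generate() returns the prompt plus the new tokens, so slice off the prompt.
print(tokenizer.decode(output_ids[0, input_ids.shape[1]:], skip_special_tokens=True).strip())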

Train

Quilt-LLaVA training consists of two stages: (1) feature alignment stage: use our 723K filtered image-text pairs from QUILT-1M to connect a frozen pretrained vision encoder to a frozen LLM; (2) visual instruction tuning stage: use 107K GPT-generated multimodal instruction-following data from QUILT-Instruct to teach the model to follow multimodal instructions.

Quilt-LLaVA is trained on 4 A100 GPUs with 80GB memory. To train on fewer GPUs, you can reduce the per_device_train_batch_size and increase the gradient_accumulation_steps accordingly. Always keep the global batch size the same: per_device_train_batch_size x gradient_accumulation_steps x num_gpus.
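As a concrete illustration of keeping the global batch size fixed (the per-device sizes and GPU counts below are made-up values, not the released training configuration):

# Keep per_device_train_batch_size * gradient_accumulation_steps * num_gpus constant.
TARGET_GLOBAL_BATCH_SIZE = 128  # finetuning global batch size from the table below

def grad_accum_steps(per_device_batch_size: int, num_gpus: int) -> int:
    assert TARGET_GLOBAL_BATCH_SIZE % (per_device_batch_size * num_gpus) == 0
    return TARGET_GLOBAL_BATCH_SIZE // (per_device_batch_size * num_gpus)

print(grad_accum_steps(16, 4))  # 4 GPUs x batch 16 -> accumulate 2 steps
print(grad_accum_steps(16, 2))  # 2 GPUs x batch 16 -> accumulate 4 steps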

Hyperparameters

We use a similar set of hyperparameters as Vicuna in finetuning. The hyperparameters used in both pretraining and finetuning are provided below.

  1. Pretraining

     Model                 Global Batch Size   Learning rate   Epochs   Max length   Weight decay
     Quilt-LLaVA-v1.5-7B   256                 1e-3            1        2048         0

  2. Finetuning

     Model                 Global Batch Size   Learning rate   Epochs   Max length   Weight decay
     Quilt-LLaVA-v1.5-7B   128                 2e-5            1        2048         0

Download Vicuna checkpoints (automatically)

Our base model Vicuna v1.5, which is an instruction-tuned chatbot, will be downloaded automatically when you run our provided training scripts. No action is needed.

Pretrain (feature alignment)

Please download the 723K filtered subset of image-text pairs from the QUILT-1M dataset, reformatted into the QA style we use in the paper, here.

Pretraining takes around 10 hours for Quilt-LLaVA-v1.5-7B on 4x A100 (80G).

Training script with DeepSpeed ZeRO-2: pretrain.sh.

  • --mm_projector_type mlp2x_gelu: the two-layer MLP vision-language connector (see the sketch below).
  • --vision_tower wisdomik/QuiltNet-B-32: CLIP ViT-B/32 224px.
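For intuition, --mm_projector_type mlp2x_gelu corresponds to a two-layer MLP with a GELU in between that maps vision-encoder features into the LLM embedding space. A minimal sketch, with hidden sizes assumed for illustration rather than read from the released config:

import torch.nn as nn

# Illustrative dimensions: vision feature width -> LLM hidden width (assumed values).
mm_hidden_size = 768     # e.g. a ViT-B/32 hidden size
llm_hidden_size = 4096   # e.g. a 7B LLaMA/Vicuna hidden size

# What "mlp2x_gelu" amounts to: Linear -> GELU -> Linear.
mm_projector = nn.Sequential(
    nn.Linear(mm_hidden_size, llm_hidden_size),
    nn.GELU(),
    nn.Linear(llm_hidden_size, llm_hidden_size),
)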

Visual Instruction Tuning

  1. Prepare data

Please download the annotations for our instruction-tuning data, quilt_instruct_107k.json, and download the images from the QUILT-1M dataset:

  • (Rescaled) On Zenodo you can access the dataset with all images resized to 512x512 px (36 GB).
  • (Full) To access the dataset with full-sized images via Google Drive, please request time-limited access through this Google form (110 GB).

After downloading all of them, organize the data as follows in ./playground/data:

├── Quilt-LLaVA-Pretrain
│   ├── quilt_1m/
│   │   ├── xxxxxxx.jpg
│   │   ├── ...
│   │   └── yyyyyyy.jpg
│   └── quilt_pretrain.json
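Both quilt_pretrain.json and quilt_instruct_107k.json are expected to follow the standard LLaVA conversation format (an assumption based on the LLaVA codebase this repository builds on). A made-up example entry, for illustration only:

# A hypothetical annotation entry in the LLaVA conversation format (illustrative only).
example_entry = {
    "id": "xxxxxxx",
    "image": "xxxxxxx.jpg",
    "conversations": [
        {"from": "human", "value": "<image>\nProvide a brief description of the given histopathology image."},
        {"from": "gpt", "value": "Sections show a dense lymphocytic infiltrate ..."},
    ],
}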
  2. Start training!

You may download our pretrained projectors from Quilt-Llava-v1.5-7b. We do not recommend using legacy projectors, as they may have been trained with a different version of the codebase; if any option is off, the model will not function or train as expected.

Visual instruction tuning takes around 15 hours for Quilt-LLaVA-v1.5-7B on 4x A100 (80G).

Training script with DeepSpeed ZeRO-3: finetune.sh.

If you do not have enough GPU memory:

  • Use LoRA: finetune_lora.sh. Make sure per_device_train_batch_size*gradient_accumulation_steps is the same as the provided script for best reproducibility.
  • Replace zero3.json with zero3_offload.json which offloads some parameters to CPU RAM. This slows down the training speed.

If you are interested in finetuning the model on your own task/data, please check out Finetune_Custom_Data.md.

New options to note:

  • --mm_projector_type mlp2x_gelu: the two-layer MLP vision-language connector.
  • --vision_tower openai/clip-vit-large-patch14-336: CLIP ViT-L/14 336px.
  • --image_aspect_ratio pad: this pads non-square images to square, instead of cropping them; it slightly reduces hallucination (see the sketch after this list).
  • --group_by_modality_length False: this should only be changed to True when your instruction-tuning dataset contains both language data and multimodal data (e.g. Quilt-LLaVA-Instruct). It makes the training sampler only sample a single modality (either image or language) during training, which we observe speeds up training by ~25% and does not affect the final outcome.
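As a rough illustration of what --image_aspect_ratio pad does, here is a pad-to-square helper in the spirit of the LLaVA preprocessing code; the background color and image path are illustrative assumptions, and the exact helper in this repo may differ:

from PIL import Image

def expand2square(img: Image.Image, background_color=(122, 116, 104)) -> Image.Image:
    """Pad a non-square image onto a square canvas instead of center-cropping it."""
    width, height = img.size
    if width == height:
        return img
    side = max(width, height)
    canvas = Image.new(img.mode, (side, side), background_color)
    # Center the original image on the square canvas.
    canvas.paste(img, ((side - width) // 2, (side - height) // 2))
    return canvas

padded = expand2square(Image.open("example_patch.jpg").convert("RGB"))  # hypothetical path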

Evaluation

We evaluate models on a diverse set of 4 benchmarks. To ensure reproducibility, we evaluate the models with greedy decoding. We do not evaluate using beam search, to keep the inference process consistent with the real-time outputs of the chat demo.

See Evaluation.md.

GPT-assisted Evaluation

Our GPT-assisted evaluation pipeline for multimodal modeling is provided for a comprehensive understanding of the capabilities of vision-language models. Please see our paper for more details.

  1. Generate the model's responses to the evaluation questions.

python model_vqa.py \
    --model-path wisdomik/Quilt-Llava-v1.5-7b \
    --question-file ./playground/data/quilt_gpt/quilt_gpt_questions.jsonl \
    --image-folder ./playground/data/eval/quiltvqa/images \
    --answers-file /path/to/answer-file-our.jsonl
  2. Evaluate the generated responses. In our case, answer-file-ref.jsonl is the response generated by text-only GPT-4 (0314), with the context captions/boxes provided.
export OPENAI_API_KEY="sk-***********************************"

python llava/eval/quilt_gpt_eval.py \
    --question ./playground/data/quilt_gpt/quilt_gpt_questions.jsonl \
    --context ./playground/data/quilt_gpt/quilt_gpt_captions.jsonl \
    --answer-list \
    /path/to/answer-file-ref.jsonl \
    /path/to/answer-file-our.jsonl \
    --output /path/to/review.json
  3. Summarize the evaluation results
python llava/eval/quilt_gpt_summarize.py \
    --dir /path/to/review/

Citation

If you find Quilt-LLaVA useful for your research and applications, please cite using this BibTeX:

@article{saygin2023quilt,
  title={Quilt-LLaVA: Visual Instruction Tuning by Extracting Localized Narratives from Open-Source Histopathology Videos},
  author={Saygin Seyfioglu, Mehmet and Ikezogwo, Wisdom O and Ghezloo, Fatemeh and Krishna, Ranjay and Shapiro, Linda},
  journal={arXiv e-prints},
  pages={arXiv--2312},
  year={2023}
}

@article{ikezogwo2023quilt,
  title={Quilt-1M: One Million Image-Text Pairs for Histopathology},
  author={Ikezogwo, Wisdom Oluchi and Seyfioglu, Mehmet Saygin and Ghezloo, Fatemeh and Geva, Dylan Stefan Chan and Mohammed, Fatwir Sheikh and Anand, Pavan Kumar and Krishna, Ranjay and Shapiro, Linda},
  journal={arXiv preprint arXiv:2306.11207},
  year={2023}
}

Related Projects

Usage and License Notices: The data, code, and model checkpoints are intended and licensed for research use only. They are also subject to additional restrictions dictated by the Terms of Use of QUILT-1M, LLaMA, Vicuna, and GPT-4, respectively. The model is made available under the CC BY-NC 3.0 license, and the data and code under CC BY-NC-ND 3.0 with an additional Data Use Agreement (DUA). The data, code, and model checkpoints may be used for non-commercial purposes only, and any models trained using the dataset should be used only for research purposes. It is expressly prohibited to use models trained on this data in clinical care or for any clinical decision-making purposes.


quilt-llava's Issues

How to evaluate Quilt_VQA

Hello! Thank you for your outstanding work.

I tried to reproduce the code that evaluates Quilt_VQA, but I found that there are two versions of the code. The first is quilt-llava\scripts\v1_5\eval\quilt_vqa.sh, and the second is what the repo's README says. Which one did you use to evaluate Quilt_VQA? If it is the first one, I could not find the JSON files needed in the arguments, such as quiltvqa_test_wo_ans.jsonl and quiltvqa_test_w_ans.json.

Looking forward to your reply, thank you!

The first: (screenshot attached to the issue)

The second: (screenshot attached to the issue)

Inference Code

Thank you for the excellent work! I am curious if Quilt-LLaVA is exclusively available for CLI inference. Is it possible to perform Quilt-LLaVA inference using code without relying on the CLI? The reason for my inquiry is that using CLI inference is not convenient for generating text for a large number of images for research purposes. Thank you!

Cannot fully reproduce the test results of the open-source weights

Hello, thanks for your outstanding work!

I tested the open-source weights wisdomik/Quilt-Llava-v1.5-7b. Based on my test results, I guess the weights were trained from the LLaVA checkpoint with the 7B language model, with stage 1 trained for 0 epochs and stage 2 trained for 3 epochs. Unfortunately, one test metric is quite different from what you documented in your paper: the result on the closed set of Quilt-VQA w/ red circle. My test result was 71.3, while you reported 77.78.

I am looking forward to your reply! Thank you a million!

Quilt-Llava is incompatible with up-to-date transformers versions

First of all: Thanks a lot for open-sourcing your work!

What this issue is about

While trying to reproduce some of the results mentioned in your paper, I noticed that the pinned version of the transformers package is quite outdated. Unfortunately, the code seems to be incompatible with up-to-date versions in its current state due to the AutoModel and AutoConfig classes encountering a name conflict with LLaVA, and some model-specific masking utilities for BLOOM and OPT.

Suggested changes

  • rename the AutoModel model_type
  • use model-independent causal attention mask utility functions

Which is the official data?

Hello! thank you for your outstanding work!
Do you use 25G quilt_instruct.zip for both the pre-training and fine-tuning datasets? Is the data used in your experiment full size or rescaled?
Thank you very much for your reply!

Questions about LLaVA checkpoints

Hello! Thank you for your valuable work!

As shown in the figure, your paper mentions the LLaVA checkpoint with a 7B language model, and I would like to ask you a few questions:

  1. Since LLaVA open-sources many weights, is the weight you use LLaVA-1.5-7b full_ft-1e?

  2. Do you use full fine-tuning or LoRA fine-tuning for the two stages of training?

  3. How do you load the initial weights of LLaVA? Do you pass the path of the LLaVA weights to both model_name_or_path and vision_tower in pretrain.sh?

I'm looking forward to your reply! Thank you very much!


Multichoice questions issue

Hi,

I have tried asking the quilt-llava model to give an assessment of whether a WSI is tumor or normal, but it will only give a random guess. Can this model do any WSI classification task?

For example:
Q1: Based on the image from the Breast. What is the most likely diagnosis?. Give me your choice: B. Normal breast tissue, A. Breast carcinoma;.
A1: B. Normal breast tissue

Q2: Based on the image from the Breast. What is the most likely diagnosis?. Give me your choice: A. Breast carcinoma, B. Normal breast tissue,;.
A2: A. Breast carcinoma

The two questions are the same, but it will output two different answers. I have tried many other images, the results are the same.

How to use 1 GPU when having multiple GPUs

First of all, thank you for sharing your wonderful work.

I have two 4090 GPUs. When running llava.serve.cli_inference, I only want to test it on one GPU; however, I cannot set it to --device cuda:0. It always runs on multiple GPUs. How could I fix this?

Thank you so much.

PMC-VQA-Subset

Hello, thank you for your outstanding work!

You used a PMC-VQA-Subset with 2469 VQA pairs, could you please open source the subset?

Thank you very much!

failed to unzip quilt_instruct.zip

Hello!
I failed while unzipping quilt_instruct.zip on Linux. When I use unzip quilt_instruct.zip, an error is reported as shown in Figure 1. When I use jar xvf quilt_instruct.zip, an error is reported as shown in Figure 2.
Looking forward to your reply, thanks!

Figure 1: (screenshot of the unzip error)

Figure 2: (screenshot of the jar xvf error)

Cannot import name 'BUFSIZE' from 'numpy'

Hi,
I followed the steps to install the model on a Linux machine today.
But at inference time, I have this error message: "Cannot import name 'BUFSIZE' from 'numpy'"

Has anybody installed the model recently and got this error?
Would this be related to the new version of Numpy 2.0 just released a few days ago?

Thanks for the help.

model_vqa_science.py not found

Hi! Thank you for your very helpful work!

When I run the pmc_vqa.sh, llava.eval.model_vqa_science.py is missing.

Looking forward to your reply! Thank you!

Model Refresh

Hi, how can the model be refreshed so that it is unaffected by the previous text for that conversation?

Where to get the Quilt-VQA json file?

Thanks for your excellent work!
I'm wondering where to get "quiltvqa_test_w_ans.json", "quiltvqa_test_wo_ans.jsonl", "quiltvqa_red_test_wo_ans.jsonl", and "quiltvqa_red_test_w_ans.json" for evaluation. I can only find "quiltvqa_nored_test_wo_ans.jsonl" on Hugging Face; the other files are missing.

Request to Enable Inference API for Quilt-LLAVA on Hugging Face

Hi, I have managed to run the code and perform VQA successfully. However, I noticed on Hugging Face that the Inference API (serverless) has been turned off for this model link. Could you turn it on to make integrating Quilt-LLAVA easier and more flexible for research purposes? Thanks!
