
License: Apache License 2.0


๐ŸŒ PALO: A Polyglot Large Multimodal Model for 5B People (WACV 2025)


Vision-language conversation in 10 languages including English, Chinese, French, Spanish, Russian, Japanese, Arabic, Hindi, Bengali and Urdu

Demo · Paper · Dataset


📢 Latest Updates

  • Aug-30-24: PALO has been accepted at WACV 2025. 🔥🔥
  • Mar-25-24: PALO training and evaluation code and pretrained checkpoints are released. 🔥🔥
  • Mar-03-24: The PALO multi-lingual evaluation dataset is released. Check it out at MBZUAI/multilingual-llava-bench-in-the-wild. 🔥🔥
  • Feb-27-24: The PALO multi-lingual training dataset is released. Check it out at MBZUAI/palo_multilingual_dataset. 🔥🔥
  • Feb-23-24: The PALO paper and online demo are released. Code, pretrained models, and training/evaluation scripts are coming soon!

Overview

In pursuit of more inclusive Vision-Language Models (VLMs), this study introduces a Large Multilingual Multimodal Model called PALO. PALO offers visual reasoning capabilities in 10 major languages, including English, Chinese, Hindi, Spanish, French, Arabic, Bengali, Russian, Urdu, and Japanese, which together span ~5B people (65% of the world population).

(Figure: PALO results)

๐Ÿ† Contributions

  1. We develop Palo: the first multilingual Large Multimodal Model (LMM), capable of generating responses in 10 languages.
  2. We create an extensive multilingual instruction-tuning dataset (~2.1M instructions) by translating LLaVA-Instruct-150K.
  3. We train models across three distinct scales, i.e., 1.7B, 7B, and 13B parameters, to demonstrate the scalability of our training pipeline. The models show good performance on low-resource languages, e.g., Hindi, Arabic, Bengali, and Urdu, without compromising their high performance on high-resource languages, e.g., English, Chinese, French, and Spanish.

📂 PALO Multi-Lingual Dataset Access

We develop a diverse instruction set (~2.1M instructions) comprising conversations in ten languages. Specifically, 665K instructions from LLaVA-Instruct-665K are used for English, and approximately 150K conversations from LLaVA-Instruct-150K are translated into Chinese, French, Spanish, Russian, Japanese, Arabic, Hindi, Bengali, and Urdu using our proposed semi-automated translation pipeline.

📥 Download the Training Dataset: Access our multi-lingual dataset on Hugging Face: MBZUAI/palo_multilingual_dataset.

We also develop a multi-lingual evaluation set to conduct a comprehensive evaluation across various languages. This set is constructed by translating the LLaVA-Bench into all target languages using GPT-4-Turbo, with particular attention to preserving linguistic authenticity and mitigating common issues of automated translations through careful human correction.

📥 Download the Evaluation Dataset: Access our multi-lingual evaluation dataset on Hugging Face: MBZUAI/multilingual-llava-bench-in-the-wild.
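
If you prefer to script the downloads, below is a minimal Python sketch using huggingface_hub (installed alongside transformers). The local_dir targets simply mirror the directory layouts used in the Training and Quantitative Evaluation sections and are otherwise an arbitrary choice.

# Minimal sketch: fetch both PALO datasets from Hugging Face.
# The local_dir values are arbitrary choices, not paths required by the scripts.
from huggingface_hub import snapshot_download

# Training annotations (~2.1M multilingual instructions)
snapshot_download(
    repo_id="MBZUAI/palo_multilingual_dataset",
    repo_type="dataset",
    local_dir="playground/data/palo_multilingual_dataset",
)

# Multilingual LLaVA-Bench (In-the-Wild) evaluation set
snapshot_download(
    repo_id="MBZUAI/multilingual-llava-bench-in-the-wild",
    repo_type="dataset",
    local_dir="data/multilingual-llava-bench-in-the-wild",
)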

🧠 Model Zoo

Model Name       HuggingFace Link
MobilePALO-1.7B  MBZUAI/MobilePALO-1.7B
PALO-7B          MBZUAI/PALO-7B
PALO-13B         MBZUAI/PALO-13B
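
To fetch a checkpoint programmatically, a minimal sketch with huggingface_hub is shown below. Note that the "palo" model type is not registered in stock transformers (see the issues section below), so load the downloaded weights with this repository's own code rather than AutoModel; the local_dir is an arbitrary choice.

# Minimal sketch: download a PALO checkpoint from Hugging Face.
from huggingface_hub import snapshot_download

ckpt_dir = snapshot_download(
    repo_id="MBZUAI/PALO-13B",          # or MBZUAI/PALO-7B, MBZUAI/MobilePALO-1.7B
    local_dir="checkpoints/PALO-13B",   # arbitrary local path
)
print(f"Checkpoint downloaded to: {ckpt_dir}")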

🔧 Installation

We recommend setting up a conda environment for the project:

conda create --name=palo python=3.10
conda activate palo

git clone https://github.com/mbzuai-oryx/PALO
cd PALO

pip install -r requirements.txt
pip install flash-attn==2.3.2

export PYTHONPATH="./:$PYTHONPATH"

💿 Running Demo Offline

Please follow the instructions below to run the PALO demo on your local GPU machine.

1. Launch a controller

python palo/serve/controller.py --host 0.0.0.0 --port 10000

2. Launch a Gradio web server

python palo/serve/gradio_web_server.py --controller http://localhost:10000 --model-list-mode reload

3. Launch a model worker

python palo/serve/model_worker.py --host 0.0.0.0 --controller http://localhost:10000 --port 40000 --worker http://localhost:40000 --model-path MBZUAI/PALO-13B

You can launch as many workers as you want and compare different model checkpoints in the same Gradio interface. Please keep the --controller address the same, and set --port and --worker to a different port number for each worker.

🚋 Training

1. Prepare data

Please download the annotations from MBZUAI/palo_multilingual_dataset, along with all images from the source datasets used below (COCO train2017, GQA, OCR-VQA, TextVQA, and Visual Genome).

After downloading all of them, organize the data as follows in ./playground/data,

data
    ├── coco
    │   └── train2017
    ├── gqa
    │   └── images
    ├── ocr_vqa
    │   └── images
    ├── textvqa
    │   └── train_images
    ├── vg
    │   ├── VG_100K
    │   └── VG_100K_2
    └── palo_multilingual_dataset
        └── palo_multilingual_dataset.json

Please note that all images should be in the .jpg format.
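
As a quick sanity check (not part of the official PALO scripts), the short Python sketch below lists any image files under ./playground/data that are not .jpg, so they can be converted before training.

# Minimal sketch: flag images that are not in .jpg format.
from pathlib import Path

OTHER_IMAGE_EXTS = {".png", ".jpeg", ".gif", ".bmp", ".webp"}

for path in Path("playground/data").rglob("*"):
    if path.suffix.lower() in OTHER_IMAGE_EXTS:
        print(f"Non-.jpg image found: {path}")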

2. Download Pretrained Projection Weights

Model Name       Projector Weights
MobilePALO-1.7B  MBZUAI/palo_1.7B_stage1_mm_projector
PALO-7B          liuhaotian/llava-v1.5-mlp2x-336px-pretrain-vicuna-7b-v1.5
PALO-13B         liuhaotian/llava-v1.5-mlp2x-336px-pretrain-vicuna-13b-v1.5
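
The training commands below expect a local path to the projector .bin file. A minimal Python sketch for fetching the weights is shown here; the exact .bin filename inside each repository is not listed in this README, so inspect the downloaded directory and pass the matching path to the script.

# Minimal sketch: download projector weights and locate the .bin file.
from pathlib import Path
from huggingface_hub import snapshot_download

proj_dir = snapshot_download(
    repo_id="MBZUAI/palo_1.7B_stage1_mm_projector",          # or the liuhaotian repos above
    local_dir="checkpoints/palo_1.7B_stage1_mm_projector",   # arbitrary local path
)
print(list(Path(proj_dir).glob("*.bin")))  # pass one of these paths to the training script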

3. Run Training

# For MobilePALO-1.7B
bash scripts/train/finetune_palo.sh "mtgv/MobileLLaMA-1.4B-Chat" "data/palo_multilingual_dataset/palo_multilingual_dataset.json" <path to palo_1.7B_stage1_mm_projector.bin> "ldpnet" "results/PALO-1.7B" "2" "2e-5"

# For PALO-7B
bash scripts/train/finetune_lora_palo.sh "lmsys/vicuna-7b-v1.5" "data/palo_multilingual_dataset/palo_multilingual_dataset.json" <path to llava-v1.5-mlp2x-336px-pretrain-vicuna-7b-v1.5.bin> "mlp2x_gelu" "results/PALO-7B" "3" "2e-4"

# For PALO-13B
bash scripts/train/finetune_lora_palo.sh "lmsys/vicuna-13b-v1.5" "data/palo_multilingual_dataset/palo_multilingual_dataset.json" <path to llava-v1.5-mlp2x-336px-pretrain-vicuna-13b-v1.5.bin> "mlp2x_gelu" "results/PALO-13B" "3" "2e-4"

📊 Quantitative Evaluation

Please download the PALO multi-lingual evaluation data from MBZUAI/multilingual-llava-bench-in-the-wild and arrange it as follows,

data
    ├── multilingual-llava-bench-in-the-wild
        ├── arabic
            ├── question.jsonl
            ├── answers.jsonl
            ├── context.jsonl
        ├── bengali
            ├── question.jsonl
            ├── answers.jsonl
            ├── context.jsonl
        ...
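
Assuming the layout above, the short Python sketch below reads one language's question.jsonl; the exact keys inside each record are not documented here, so it simply prints the raw JSON objects.

# Minimal sketch: inspect the evaluation questions for one language.
import json
from pathlib import Path

questions_path = Path("data/multilingual-llava-bench-in-the-wild/arabic/question.jsonl")
with questions_path.open(encoding="utf-8") as f:
    for line in f:
        print(json.loads(line))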

Use the following script to run the evaluation,

bash scripts/eval/eval_all_languages.sh <path to the trained model> <Output file name> <OpenAI API Key>

(Figure: PALO evaluation results)

📚 Qualitative Examples of Multilingual Capabilities

(Figures: qualitative multilingual conversation samples)

📜 Citation

    @inproceedings{PALO,
        title={Palo: A Large Multilingual Multimodal Language Model},
        author={Rasheed, Hanoona and Maaz, Muhammad and Shaker, Abdelrahman and Khan, Salman and Cholakal, Hisham and Anwer, Rao M. and Baldwin, Tim and Felsberg, Michael and Khan, Fahad S.},
        booktitle={Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV 2025)},
        year={2025}
    }

palo's People

Contributors

amshaker · hanoonar · ival-mbzuai · mmaaz60


palo's Issues

loading pretrained models using transformers library

Hi,

I'm really excited to try your pretrained models, but it seems they haven't been integrated into the transformers library yet.
I tried loading MBZUAI/PALO-7B with both the "image-to-text" pipeline and AutoModel using the latest version of transformers (v4.39.3), and got this error:
The checkpoint you are trying to load has model type palo but Transformers does not recognize this architecture

Am I missing something?
Thanks!

About evaluation result

Hi @mmaaz60,
Thanks for your great work and open sourcing!
I am trying to evaluate PALO-7B (loaded via transformers) on multilingual-llava-bench-in-the-wild, but I find the performance is much lower than the reported numbers. Here are the results I got:

Model                 English  Chinese
PALO-7B (paper)       64.2     55.7
PALO-7B (my results)  54.0     43.0

Here are the generated content files:
PALO-7B_English_content.json
PALO-7B_Chinese_content.json

Here are the evaluation files with scores:
PALO-7B_English.json
PALO-7B_Chinese.json

Summaries produced by palo/eval/summarize_gpt_review.py

PALO-7B_English
all 54.0 85.2 46.0
llava_bench_complex 62.8 82.5 51.8
llava_bench_conv 52.4 86.5 45.3
llava_bench_detail 40.6 88.7 36.0

PALO-7B_Chinese
all 43.0 86.0 37.0
llava_bench_complex 55.6 82.9 46.1
llava_bench_conv 27.2 88.8 24.1
llava_bench_detail 39.1 88.7 34.7

Is there a significant discrepancy between the content I generated and yours, or are there issues in the evaluation? Do you have any idea about this, or could you share your generated result files with me?

Dataset release

Hi,
Will you be releasing the dataset? I am especially looking for the Bengali one.

Plan for open-sourcing models and datasets

Dear authors,
Thanks for your great work! I am very interested in the multilingual abilities of LMMs. Do you have any plans to release the dataset and checkpoints from the paper? They would be really helpful to me!
