
AICITY2024_Track2_AliOpenTrek_CityLLaVA

πŸ† The 1st Place Solution to The 8th NVIDIA AI City Challenge (CVPR 2024 workshop) Track 2: CityLLaVA: Efficient Fine-Tuning for VLMs in City Scenario.


Leaderboard

Team Name            MRR Score   Rank
AliOpenTrek (Ours)   33.4308     1
AIO_ISC              32.8877     2
Lighthouse           32.3006     3

Prepare

  1. Install Package
conda create -n cityllava python=3.10 -y
conda activate cityllava
cd AICITY2024_Track2_AliOpenTrek_CityLLaVA/
pip install --upgrade pip  # enable PEP 660 support
pip install -e .
pip install flash-attn --no-build-isolation


Data Preparation

First, change to the data_preprocess directory and create the data directory.

cd data_preprocess
mkdir ./data

Please download the wts-dataset and put the datasets under ./data. After unzipping the datasets, the directory structure should look like this:

.
├── data
│   ├── BDD_PC_5k
│   │   ├── annotations
│   │   │   ├── bbox_annotated
│   │   │   ├── bbox_generated
│   │   │   └── caption
│   │   └── videos
│   ├── WTS
│   │   ├── annotations
│   │   │   ├── bbox_annotated
│   │   │   ├── bbox_generated
│   │   │   └── caption
│   │   └── videos
│   └── test_part
│       ├── view_used_as_main_reference_for_multiview_scenario.csv
│       ├── WTS_DATASET_PUBLIC_TEST
│       └── WTS_DATASET_PUBLIC_TEST_BBOX
└── ... # python and shell scripts

Then run the following script to process the test data:

bash prepare_data_test.sh

After this script is executed, all the test data is prepared. You can download the fine-tuned model and run the inference step directly.

Run the following script to process the training data:

bash prepare_data_train.sh

Note that an OpenAI or Qwen API key is required by "prepare_data_train.sh"; modify the API_KEY in this script accordingly.

After the execution, the folder structure should be like this:

.
├── data
│   ├── BDD_PC_5k
│   │   ├── annotations
│   │   │   ├── bbox_annotated
│   │   │   ├── bbox_generated
│   │   │   └── caption
│   │   ├── bbox_global # BDD global views
│   │   │   ├── train
│   │   │   └── val
│   │   ├── bbox_local # BDD local views
│   │   │   ├── train
│   │   │   └── val
│   │   └── videos
│   ├── WTS
│   │   ├── annotations
│   │   │   ├── bbox_annotated
│   │   │   ├── bbox_generated
│   │   │   └── caption
│   │   ├── bbox_global # WTS global views
│   │   │   ├── train
│   │   │   └── val
│   │   ├── bbox_local # WTS local views
│   │   │   ├── train
│   │   │   └── val
│   │   └── videos
│   └── test_part
│       ├── view_used_as_main_reference_for_multiview_scenario.csv
│       ├── WTS_DATASET_PUBLIC_TEST
│       │   ├── bbox_global/test/public # WTS test images
│       │   ├── bbox_local/test/public
│       │   └── external/BDD_PC_5K
│       │       ├── bbox_global/test/public # BDD test images
│       │       └── bbox_local/test/public
│       └── WTS_DATASET_PUBLIC_TEST_BBOX
├── processed_anno
│   ├── frame_bbox_anno
│   │   ├── bdd_test_all_video_with_bbox_anno_first_frame.json
│   │   ├── bdd_train_all_video_with_bbox_anno_first_frame.json
│   │   ├── bdd_val_all_video_with_bbox_anno_first_frame.json
│   │   ├── wts_test_all_video_with_bbox_anno_first_frame.json
│   │   ├── wts_train_all_video_with_bbox_anno_first_frame.json
│   │   └── wts_val_all_video_with_bbox_anno_first_frame.json
│   ├── llava_format
│   │   ├── wts_bdd_train.json
│   │   └── wts_bdd_val.json
│   ├── best_view_for_test.json
│   └── perspective_test_images.json
└── ... # python and shell scripts

The processed annotations can then be found under ./processed_anno, and the training json is:

'./data/processed_anno/llava_format/wts_bdd_llava_qa_train_stage_filted_checked.json'
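
For reference, each entry in the llava_format jsons is expected to follow the upstream LLaVA fine-tuning schema (as the folder name suggests); the snippet below only illustrates that layout, and every field value is a placeholder rather than a real sample.

# Illustrative record layout, assuming the standard LLaVA conversation format;
# paths and texts below are placeholders, not actual dataset content.
import json

sample = {
    "id": "example_0",                                  # unique sample id (placeholder)
    "image": "WTS/bbox_global/train/example.jpg",       # cropped frame path (placeholder)
    "conversations": [
        {"from": "human", "value": "<image>\n<question or captioning prompt>"},
        {"from": "gpt", "value": "<target caption / answer>"},
    ],
}
print(json.dumps(sample, indent=2))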

Block-Expansion

We use block expansion to fine-tune the VLMs; 8~16 blocks are suggested to balance performance and efficiency. We add 12 blocks to the original llava-1.6-34b. The llava-1.6-34b-12block model can be created in these steps (a minimal sketch of the expansion idea follows the list):

  1. Download the llava-1.6-34b model to ./models, and add blocks with this script:
   python block_expansion_llava_1_6.py
  2. Copy the *.json and tokenizer.model from ./models/llava-v1.6-34b to ./models/llava-v1.6-34b-12block;
  3. Modify num_hidden_layers=72 (new_layer_nums = original_layer_nums + block_layer_nums) in config.json of the llava-1.6-34b-12block model.
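
For intuition, here is a minimal sketch of LLaMA-Pro-style block expansion applied to a decoder state dict. It is illustrative only: the layer count, parameter names, and the choice of zero-initialised projections are assumptions about what block_expansion_llava_1_6.py does, not a copy of it.

# Illustrative sketch of LLaMA-Pro-style block expansion; NOT the real
# block_expansion_llava_1_6.py. Assumes a 60-layer decoder (60 + 12 = 72)
# and HuggingFace LLaMA-style parameter names.
import re
import torch

ORIG_LAYERS = 60                      # assumed depth of llava-v1.6-34b's language model
NEW_BLOCKS = 12                       # blocks to insert -> 72 layers total
STRIDE = ORIG_LAYERS // NEW_BLOCKS    # duplicate one layer after every 5 originals

def expand_state_dict(sd):
    """Duplicate every STRIDE-th decoder layer; zero the copies' output
    projections so each new block starts as an identity mapping."""
    layer_re = re.compile(r"model\.layers\.(\d+)\.(.+)")
    new_sd, new_idx = {}, 0
    for i in range(ORIG_LAYERS):
        layer = {m.group(2): v for k, v in sd.items()
                 if (m := layer_re.match(k)) and int(m.group(1)) == i}
        # keep the original layer under its shifted index
        for suffix, v in layer.items():
            new_sd[f"model.layers.{new_idx}.{suffix}"] = v
        new_idx += 1
        if (i + 1) % STRIDE == 0:
            # insert a copy whose output projections are zeroed (identity block)
            for suffix, v in layer.items():
                w = v.clone()
                if suffix.endswith(("self_attn.o_proj.weight", "mlp.down_proj.weight")):
                    w = torch.zeros_like(w)
                new_sd[f"model.layers.{new_idx}.{suffix}"] = w
            new_idx += 1
    # carry over everything that is not a decoder layer (embeddings, norms, vision tower, ...)
    new_sd.update({k: v for k, v in sd.items() if not layer_re.match(k)})
    return new_sd

In LLaMA-Pro, the copied blocks start as identity functions (their output projections are zeroed), so the expanded model initially behaves exactly like the original and fine-tuning can focus on the newly inserted blocks.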

Train

We use 8xA100 GPUs for fine-tuning. Training takes approximately 8 hours with this script:

bash scripts/finetune_block_bigsmall.sh

The fine-tuned model can be downloaded here.

Inference

First, check the parameters defined in ./scripts/inference.sh and ensure that all essential files and the model exist.
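
A quick, purely illustrative pre-flight check is sketched below; the authoritative paths are the variables set inside scripts/inference.sh, and the paths used here are assumptions based on the folder layout above.

# Hypothetical sanity check; adjust the paths to match scripts/inference.sh.
import os

expected = [
    "./models/llava-v1.6-34b-12block",                                # expanded / fine-tuned model directory
    "./data_preprocess/processed_anno/best_view_for_test.json",       # best camera view per test scenario
    "./data_preprocess/processed_anno/perspective_test_images.json",  # processed test-image index
]
for path in expected:
    print(("OK      " if os.path.exists(path) else "MISSING ") + path)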

Now you can do inference on WTS_TEST_SET:

bash scripts/inference.sh

Evaluation

We use the wts-dataset for evaluation.

Citation

If you find CityLLaVA useful for your research and applications, please cite using this BibTeX:

@misc{duan2024cityllava,
    title={CityLLaVA: Efficient Fine-Tuning for VLMs in City Scenario},
    url={https://github.com/qingchunlizhi/AICITY2024_Track2_AliOpenTrek_CityLLaVA},
    author={Zhizhao Duan and Hao Cheng and Duo Xu and Xi Wu and Xiangxie Zhang and Xi Ye and Zhen Xie},
    year={2024},
    eprint={2405.03194},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}

Acknowledgement

  • CityLLaVA is built with reference to the code of the following projects: LLaVA and LLaMA-Pro. Thanks for their awesome work!

