Cross-modal Prompts: Adapting Large Pre-trained Models for Audio-Visual Downstream Tasks, NeurIPS 2023

This is the PyTorch implementation of our paper:

Cross-modal Prompts: Adapting Large Pre-trained Models for Audio-Visual Downstream Tasks

[Paper] [arXiv] [Video] [Poster] [Slides]

Haoyi Duan, Yan Xia, Mingze Zhou, Li Tang, Jieming Zhu, Zhou Zhao

In NeurIPS 2023

📝Requirements and Installation

Getting Started

```
git clone https://github.com/haoyi-duan/DG-SCT
cd DG-SCT
pip install -r requirements.txt
```

Download HTS-AT Backbone

Download checkpoints.zip from Google Drive or Baidu Disk (pwd: 2023), and extract it into the directory ./DG-SCT/.

AVE

Download Data

Download frames.zip from Google Drive or Baidu Disk (pwd: 2023), wave.zip from Google Drive or Baidu Disk (pwd: 2023), and extract them into the directory ./data/AVE/.
Usage

Go to AVE task directory.
```
cd DG-SCT/AVE
```
- Training
```
bash train.sh
```
- Testing
  
  ./models/best_82.18.pt: Google Drive or Baidu Disk (pwd: 2023)
```
bash test.sh
```
Results

AVS

Download Data
- Download Dataset
  
  The updated AVSBench dataset is available here (AVSBench-object). You may request the dataset by filling the Google Form.
  
  Place the downloaded data in the directory ./data/.
- Download Wave
  
  Download the wave files for task S4 (Google Drive or Baidu Disk (pwd: 2023)) and task MS3 (Google Drive or Baidu Disk (pwd: 2023)), and extract them into the directories ./data/AVSBench_data/Single-source/s4_data/ and ./data/AVSBench_data/Multi-sources/ms3_data/, respectively.
Download pretrained backbones

The pretrained ResNet50/PVT-v2-b5 (vision) and VGGish (audio) backbones can be downloaded from here and should be placed in the directory ./DG-SCT/AVS/pretrained_backbones/.
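
Before training, it can help to confirm that the AVS directories named above are in place. The following is a minimal sketch, assuming all paths in this README are resolved from a single root directory; the `check_dirs` helper and the `ROOT` variable are hypothetical and not part of the repository.

```shell
#!/usr/bin/env sh
# Hypothetical helper (not part of the repository): report which of the
# given directories are missing under a root directory.
# Prints one "missing: <path>" line per absent directory and returns
# non-zero if any directory is absent.
check_dirs() {
  root="$1"
  shift
  status=0
  for dir in "$@"; do
    if [ ! -d "$root/$dir" ]; then
      echo "missing: $root/$dir"
      status=1
    fi
  done
  return "$status"
}

# Assumption: the README paths are relative to the current directory
# unless ROOT is set.
ROOT="${ROOT:-.}"
check_dirs "$ROOT" \
  data/AVSBench_data/Single-source/s4_data \
  data/AVSBench_data/Multi-sources/ms3_data \
  DG-SCT/AVS/pretrained_backbones \
  || echo "Some AVS inputs are not in place yet."
```

The check only tests that the directories exist; it does not validate their contents.
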
Usage

Go to AVS task directory.
```
# for S4 task:
cd DG-SCT/AVS/avs_scripts/avs_s4

# for MS3 task:
cd DG-SCT/AVS/avs_scripts/avs_ms3
```
- Training
```
bash train.sh
```
- Testing
  
  Checkpoint for the S4 task (./DG-SCT/AVS/avs_scripts/avs_s4/train_logs): Google Drive or Baidu Disk (pwd: 2023)
  
  Checkpoint for the MS3 task (./DG-SCT/AVS/avs_scripts/avs_ms3/train_logs): Google Drive or Baidu Disk (pwd: 2023)
```
bash test.sh
```
Results

AVVP

Download Data

Download the extracted features, frames, and wave files of the LLP dataset from Baidu Disk (pwd: 2023), and extract them into the directory ./data/AVVP/.
Usage

Go to AVVP task directory:
```
cd DG-SCT/AVVP
```
- Training
```
bash train.sh
```
- Testing
  
  ./models/MGN_Net.pt: Google Drive or Baidu Disk (pwd:2023)
```
bash test.sh
```
Results

AVQA

Download Data

Download frames.zip from Google Drive or Baidu Disk (pwd: 2023), audio_wave.zip from Google Drive or Baidu Disk (pwd: 2023), and extract them into the directory ./data/AVQA/.
Usage

Go to AVQA task directory.
```
cd DG-SCT/AVQA
```
- Audio-Visual Grounding Generation
```
python grounding_gen/main_grd_gen.py
```
  To skip the Audio-Visual Grounding Generation step, download ./grounding_gen/models_grounding_gen/lavish_grounding_gen_best.pt from Google Drive or Baidu Disk (pwd: 2023).
- Training
```
bash train.sh
```
- Testing
  
  ./net_grd_avst/avst_models/avst.pt: Google Drive or Baidu Disk (pwd: 2023)
```
bash test.sh
```
Results

Few-shot/Zero-shot

We use two audio-text backbones from CLAP: 630k-audioset-fusion-best.pt and 630k-fusion-best.pt. Download them and place them in the directory ./pretrain/models/.

Few-shot

Go to Few-shot Directory
```
cd few-shot
```

AVE

1 shot

```
python main_AVE.py --dataset_name AVE --shot 1 --alpha 0.2 --beta 0.05 --gamma 0.01 --weak 0 --classification 0
```

2 shots

```
python main_AVE.py --dataset_name AVE --shot 2 --alpha 0.2 --beta 0.05 --gamma 0.01 --weak 0 --classification 0
```

4 shots

```
python main_AVE.py --dataset_name AVE --shot 4 --alpha 0.2 --beta 0.05 --gamma 0.01 --weak 0 --classification 0
```

8 shots

```
python main_AVE.py --dataset_name AVE --shot 8 --alpha 0.2 --beta 0.05 --gamma 0.01 --weak 0 --classification 0
```

16 shots

```
python main_AVE.py --dataset_name AVE --shot 16 --alpha 0.2 --beta 0.05 --gamma 0.01 --weak 0 --classification 0
```

AVE Classification

1 shot

```
python main_AVE_class.py --dataset_name AVE --shot 1 --alpha 0.2 --beta 0.05 --gamma 0.05 --weak 1 --classification 1
```

2 shots

```
python main_AVE_class.py --dataset_name AVE --shot 2 --alpha 0.2 --beta 0.05 --gamma 0.05 --weak 1 --classification 1
```

4 shots

```
python main_AVE_class.py --dataset_name AVE --shot 4 --alpha 0.2 --beta 0.05 --gamma 0.05 --weak 1 --classification 1
```

8 shots

```
python main_AVE_class.py --dataset_name AVE --shot 8 --alpha 0.2 --beta 0.05 --gamma 0.05 --weak 1 --classification 1
```

16 shots

```
python main_AVE_class.py --dataset_name AVE --shot 16 --alpha 0.2 --beta 0.05 --gamma 0.05 --weak 1 --classification 1
```

LLP Classification

1 shot

```
python main_LLP_class.py --dataset_name LLP --shot 1 --alpha 0.2 --beta 0.05 --gamma 0.05 --weak 1 --classification 1
```

2 shots

```
python main_LLP_class.py --dataset_name LLP --shot 2 --alpha 0.2 --beta 0.05 --gamma 0.05 --weak 1 --classification 1
```

4 shots

```
python main_LLP_class.py --dataset_name LLP --shot 4 --alpha 0.2 --beta 0.05 --gamma 0.05 --weak 1 --classification 1
```

8 shots

```
python main_LLP_class.py --dataset_name LLP --shot 8 --alpha 0.2 --beta 0.05 --gamma 0.05 --weak 1 --classification 1
```

16 shots

```
python main_LLP_class.py --dataset_name LLP --shot 16 --alpha 0.2 --beta 0.05 --gamma 0.05 --weak 1 --classification 1
```
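
The few-shot commands above differ only in the script name, dataset, shot count, and a few flag values, so they can be generated in a loop. The following is a minimal sketch, assuming it is run from inside the few-shot directory; the `few_shot_cmd` helper and the `DRY_RUN` switch are hypothetical conveniences and not part of the repository. The flag values are copied from the commands listed above.

```shell
#!/usr/bin/env sh
# Hypothetical helper (not part of the repository): build the few-shot
# command line for one script/dataset/shot combination.
few_shot_cmd() {
  script="$1"; dataset="$2"; shot="$3"; gamma="$4"; weak="$5"; cls="$6"
  echo "python $script --dataset_name $dataset --shot $shot --alpha 0.2 --beta 0.05 --gamma $gamma --weak $weak --classification $cls"
}

# DRY_RUN is an assumed switch: 1 (default) only prints the commands,
# 0 executes them one after another.
DRY_RUN="${DRY_RUN:-1}"
for shot in 1 2 4 8 16; do
  for spec in \
    "main_AVE.py AVE 0.01 0 0" \
    "main_AVE_class.py AVE 0.05 1 1" \
    "main_LLP_class.py LLP 0.05 1 1"; do
    # Split the spec into: script, dataset, gamma, weak, classification.
    set -- $spec
    cmd="$(few_shot_cmd "$1" "$2" "$shot" "$3" "$4" "$5")"
    if [ "$DRY_RUN" = "1" ]; then
      echo "$cmd"
    else
      $cmd
    fi
  done
done
```

With the default `DRY_RUN=1`, the script prints the fifteen commands without running them, which makes it easy to compare them against the list above before launching a full sweep with `DRY_RUN=0`.
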

Zero-shot

Download Data

Download VGG-Sound(40K) from Baidu Disk (pwd: 2023), and extract it into the directory ./data/.

Usage

Pretrain on VGG-Sound(40K)
```
cd pretrain
bash train.sh
```
The pretrained model will be saved in pretrain/models/.

Zero-shot

```
MODEL_NAME="name of the pretrained model in pretrain/models/."
# AVE
python zero_shot.py --test_dataset_name AVE --backbone $MODEL_NAME --is_event_score 1

# AVE classification
python zero_shot.py --test_dataset_name AVE --backbone $MODEL_NAME --is_event_score 0

# LLP classification
python zero_shot.py --test_dataset_name LLP --backbone $MODEL_NAME --is_event_score 0
```
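
If the checkpoint name is not known in advance, one option is to pick the most recently modified file in pretrain/models/. The following is a minimal sketch under two assumptions: that the current directory contains pretrain/, and that `--backbone` takes the bare file name, as the `MODEL_NAME` placeholder suggests. The `latest_model` helper is hypothetical and not part of the repository. Note that pretrain/models/ also holds the CLAP backbones, so the result is only meaningful after pretraining has written a newer file.

```shell
#!/usr/bin/env sh
# Hypothetical helper (not part of the repository): print the name of the
# most recently modified entry in a directory, or nothing if the directory
# is absent or empty.
latest_model() {
  [ -d "$1" ] || return 0
  ls -t "$1" | head -n 1
}

# Assumption: pretrain/models is reachable from the current directory.
MODEL_NAME="$(latest_model pretrain/models)"
if [ -n "$MODEL_NAME" ]; then
  # Print the command instead of running it, so it can be reviewed first.
  echo "python zero_shot.py --test_dataset_name AVE --backbone $MODEL_NAME --is_event_score 1"
else
  echo "No checkpoint found in pretrain/models/."
fi
```
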

Results

🎓Cite

If you find this work useful, please consider citing it.

```
@inproceedings{duan2023cross,
  title={Cross-modal Prompts: Adapting Large Pre-trained Models for Audio-Visual Downstream Tasks},
  author={Duan, Haoyi and Xia, Yan and Zhou, Mingze and Tang, Li and Zhu, Jieming and Zhao, Zhou},
  booktitle={Thirty-seventh Conference on Neural Information Processing Systems},
  year={2023}
}
```

👍Acknowledgments

Our code is based on CMBS, AVSBench, MGN, MUSIC-AVQA, and LAVisH.

✏Model Checkpoints

| Tasks | Checkpoints |
| --- | --- |
| AVE | Google Drive or Baidu Disk (pwd: 2023) |
| AVS_S4 | Google Drive or Baidu Disk (pwd: 2023) |
| AVS_MS3 | Google Drive or Baidu Disk (pwd: 2023) |
| AVVP | Google Drive or Baidu Disk (pwd: 2023) |
| AVQA | Google Drive or Baidu Disk (pwd: 2023) |
