Cross-modal Prompts: Adapting Large Pre-trained Models for Audio-Visual Downstream Tasks, NeurIPS 2023

[Figure: model]

This is the PyTorch implementation of our paper:

Cross-modal Prompts: Adapting Large Pre-trained Models for Audio-Visual Downstream Tasks

[Paper] [arXiv] [Video] [Poster] [Slides]

Haoyi Duan, Yan Xia, Mingze Zhou, Li Tang, Jieming Zhu, Zhou Zhao

In NeurIPS 2023


📝Requirements and Installation

  • Getting Started
git clone https://github.com/haoyi-duan/DG-SCT
cd DG-SCT
pip install -r requirements.txt
  • Download HTS-AT Backbone

    Download checkpoints.zip from Google Drive or Baidu Disk (pwd: 2023), and extract it into the directory ./DG-SCT/.
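The extraction step can also be scripted. The following is a minimal sketch, assuming checkpoints.zip has already been downloaded by hand; the helper name `extract_archive` is hypothetical and not part of this repository, and the internal layout of the archive is not assumed.

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path, dest_dir):
    """Extract a downloaded .zip archive into dest_dir.

    Returns the sorted top-level entry names so the resulting layout
    can be checked by eye. (Hypothetical helper, not part of DG-SCT.)
    """
    zip_path = Path(zip_path)
    dest_dir = Path(dest_dir)
    if not zip_path.is_file():
        raise FileNotFoundError(f"archive not found: {zip_path}")
    # Create the target directory if it does not exist yet.
    dest_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest_dir)
        top_level = sorted({name.split("/")[0] for name in archive.namelist()})
    return top_level
```

For example, `extract_archive("checkpoints.zip", "./DG-SCT/")` would unpack the backbone checkpoints into the directory named above.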

AVE

  • Download Data

    Download frames.zip (Google Drive or Baidu Disk, pwd: 2023) and wave.zip (Google Drive or Baidu Disk, pwd: 2023), then extract both into the directory ./data/AVE/.

  • Usage

    Go to the AVE task directory.

    cd DG-SCT/AVE
    
  • Results

    [Figure: AVE results]

AVS

  • Download Data
    • Download Dataset

      The updated AVSBench dataset is available here (AVSBench-object). You may request the dataset by filling out the Google Form.

      Place the downloaded data in the directory ./data/.

    • Download Wave

      Download the wave files for the S4 task (Google Drive or Baidu Disk, pwd: 2023) and the MS3 task (Google Drive or Baidu Disk, pwd: 2023), and extract them into the directories ./data/AVSBench_data/Single-source/s4_data/ and ./data/AVSBench_data/Multi-sources/ms3_data/, respectively.

  • Download pretrained backbones

    The pretrained ResNet50/PVT-v2-b5 (vision) and VGGish (audio) backbones can be downloaded from here and should be placed in the directory ./DG-SCT/AVS/pretrained_backbones/.

  • Usage

    Go to the AVS task directory.

    # for S4 task:
    cd DG-SCT/AVS/avs_scripts/avs_s4
    
    # for MS3 task:
    cd DG-SCT/AVS/avs_scripts/avs_ms3
    • Training

      bash train.sh
    • Testing

      Checkpoint for the S4 task (./DG-SCT/AVS/avs_scripts/avs_s4/train_logs): Google Drive or Baidu Disk (pwd: 2023)

      Checkpoint for the MS3 task (./DG-SCT/AVS/avs_scripts/avs_ms3/train_logs): Google Drive or Baidu Disk (pwd: 2023)

      bash test.sh
      
  • Results

    [Figure: AVS results]
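The S4 and MS3 usage steps above differ only in the script directory. A small sketch of that mapping follows, assuming the commands are launched from the directory that contains DG-SCT/; the function name `avs_step` is hypothetical and not part of this repository.

```python
# Script directories for the two AVS tasks, as listed in the usage steps above.
AVS_TASK_DIRS = {
    "S4": "DG-SCT/AVS/avs_scripts/avs_s4",
    "MS3": "DG-SCT/AVS/avs_scripts/avs_ms3",
}


def avs_step(task, mode):
    """Return (working_directory, command) for an AVS task.

    task is "S4" or "MS3"; mode is "train" or "test".
    (Hypothetical helper; it only builds the command, it does not run it.)
    """
    if task not in AVS_TASK_DIRS:
        raise ValueError(f"unknown AVS task: {task!r}")
    if mode not in ("train", "test"):
        raise ValueError(f"unknown mode: {mode!r}")
    return AVS_TASK_DIRS[task], ["bash", f"{mode}.sh"]


if __name__ == "__main__":
    # Dry run: print what would be executed for each task.
    for task in AVS_TASK_DIRS:
        for mode in ("train", "test"):
            cwd, cmd = avs_step(task, mode)
            print(f"(cd {cwd} && {' '.join(cmd)})")
```

Each pair can be passed to `subprocess.run(cmd, cwd=cwd, check=True)` once the data, backbones, and (for testing) checkpoints are in place.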

AVVP

  • Download Data

    Download the extracted features, frames, and wave files of the LLP dataset from Baidu Disk (pwd: 2023), and extract them into the directory ./data/AVVP/.

  • Usage

    Go to the AVVP task directory.

    cd DG-SCT/AVVP
    
  • Results

    [Figure: AVVP results]

AVQA

  • Download Data

    Download frames.zip (Google Drive or Baidu Disk, pwd: 2023) and audio_wave.zip (Google Drive or Baidu Disk, pwd: 2023), then extract both into the directory ./data/AVQA/.

  • Usage

    Go to the AVQA task directory.

    cd DG-SCT/AVQA
    
    • Audio-Visual Grounding Generation

      python grounding_gen/main_grd_gen.py

      To skip the Audio-Visual Grounding Generation process, download ./grounding_gen/models_grounding_gen/lavish_grounding_gen_best.pt from Google Drive or Baidu Disk (pwd: 2023).

    • Training

      bash train.sh
      
    • Testing

      Checkpoint ./net_grd_avst/avst_models/avst.pt: Google Drive or Baidu Disk (pwd: 2023)

      bash test.sh
      
  • Results

    [Figure: AVQA results]
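The AVQA usage steps above can be sketched as a small planner that skips grounding generation when the grounding checkpoint is already present. This is an illustration only: the function name `plan_avqa_steps` is hypothetical, and it assumes the checkpoint path is relative to DG-SCT/AVQA, which is where the commands above are run.

```python
from pathlib import Path

# Checkpoint path as given in the usage steps, taken relative to the AVQA task directory.
GROUNDING_CKPT = Path("grounding_gen/models_grounding_gen/lavish_grounding_gen_best.pt")


def plan_avqa_steps(avqa_dir="DG-SCT/AVQA", run_test=True):
    """Return the list of commands to run inside avqa_dir.

    Grounding generation is included only if the grounding checkpoint is
    missing. (Hypothetical helper; it builds commands but does not run them.)
    """
    avqa_dir = Path(avqa_dir)
    steps = []
    if not (avqa_dir / GROUNDING_CKPT).is_file():
        steps.append(["python", "grounding_gen/main_grd_gen.py"])
    steps.append(["bash", "train.sh"])
    if run_test:
        steps.append(["bash", "test.sh"])
    return steps


if __name__ == "__main__":
    # Dry run: show the planned commands without executing anything.
    for cmd in plan_avqa_steps():
        print(" ".join(cmd))
```

Each command can then be executed with `subprocess.run(cmd, cwd="DG-SCT/AVQA", check=True)`.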

Few-shot/Zero-shot

We use the audio-text backbones from CLAP: 630k-audioset-fusion-best.pt and 630k-fusion-best.pt. Please download them and place them in the directory ./pretrain/models/.
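Before running the few-shot or zero-shot scripts, it can help to confirm that both CLAP checkpoints are in place. A minimal sketch, assuming it is run from the repository root; the function name `missing_clap_backbones` is hypothetical and not part of this repository.

```python
from pathlib import Path

# File names exactly as listed in this README.
CLAP_BACKBONES = ("630k-audioset-fusion-best.pt", "630k-fusion-best.pt")


def missing_clap_backbones(models_dir="./pretrain/models/"):
    """Return the CLAP checkpoint file names that are not present in models_dir."""
    models_dir = Path(models_dir)
    return [name for name in CLAP_BACKBONES if not (models_dir / name).is_file()]


if __name__ == "__main__":
    missing = missing_clap_backbones()
    if missing:
        print("Missing CLAP backbones:", ", ".join(missing))
    else:
        print("Both CLAP backbones found.")
```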

  • Few-shot

    • Go to Few-shot Directory
      cd few-shot
      
    • AVE
      • 1 shot

        python main_AVE.py --dataset_name AVE --shot 1 --alpha 0.2 --beta 0.05 --gamma 0.01 --weak 0 --classification 0
      • 2 shots

        python main_AVE.py --dataset_name AVE --shot 2 --alpha 0.2 --beta 0.05 --gamma 0.01 --weak 0 --classification 0
      • 4 shots

        python main_AVE.py --dataset_name AVE --shot 4 --alpha 0.2 --beta 0.05 --gamma 0.01 --weak 0 --classification 0
      • 8 shots

        python main_AVE.py --dataset_name AVE --shot 8 --alpha 0.2 --beta 0.05 --gamma 0.01 --weak 0 --classification 0
      • 16 shots

        python main_AVE.py --dataset_name AVE --shot 16 --alpha 0.2 --beta 0.05 --gamma 0.01 --weak 0 --classification 0
    • AVE Classification
      • 1 shot

        python main_AVE_class.py --dataset_name AVE --shot 1 --alpha 0.2 --beta 0.05 --gamma 0.05 --weak 1 --classification 1
      • 2 shots

        python main_AVE_class.py --dataset_name AVE --shot 2 --alpha 0.2 --beta 0.05 --gamma 0.05 --weak 1 --classification 1
      • 4 shots

        python main_AVE_class.py --dataset_name AVE --shot 4 --alpha 0.2 --beta 0.05 --gamma 0.05 --weak 1 --classification 1
      • 8 shots

        python main_AVE_class.py --dataset_name AVE --shot 8 --alpha 0.2 --beta 0.05 --gamma 0.05 --weak 1 --classification 1
      • 16 shots

        python main_AVE_class.py --dataset_name AVE --shot 16 --alpha 0.2 --beta 0.05 --gamma 0.05 --weak 1 --classification 1
    • LLP Classification
      • 1 shot

        python main_LLP_class.py --dataset_name LLP --shot 1 --alpha 0.2 --beta 0.05 --gamma 0.05 --weak 1 --classification 1
      • 2 shots

        python main_LLP_class.py --dataset_name LLP --shot 2 --alpha 0.2 --beta 0.05 --gamma 0.05 --weak 1 --classification 1
      • 4 shots

        python main_LLP_class.py --dataset_name LLP --shot 4 --alpha 0.2 --beta 0.05 --gamma 0.05 --weak 1 --classification 1
      • 8 shots

        python main_LLP_class.py --dataset_name LLP --shot 8 --alpha 0.2 --beta 0.05 --gamma 0.05 --weak 1 --classification 1
      • 16 shots

        python main_LLP_class.py --dataset_name LLP --shot 16 --alpha 0.2 --beta 0.05 --gamma 0.05 --weak 1 --classification 1
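The fifteen few-shot commands above differ only in the script, the dataset name, the shot count, and three flags. The following sketch generates them from a small table; the flag values are copied from the commands above, while the names `FEW_SHOT_SETTINGS`, `few_shot_command`, and `all_few_shot_commands` are hypothetical and not part of this repository.

```python
import shlex

SHOTS = (1, 2, 4, 8, 16)

# script -> (dataset_name, gamma, weak, classification), copied from the commands above.
FEW_SHOT_SETTINGS = {
    "main_AVE.py": ("AVE", "0.01", 0, 0),
    "main_AVE_class.py": ("AVE", "0.05", 1, 1),
    "main_LLP_class.py": ("LLP", "0.05", 1, 1),
}


def few_shot_command(script, shot):
    """Build the argument list for one few-shot run (does not execute it)."""
    dataset, gamma, weak, classification = FEW_SHOT_SETTINGS[script]
    return [
        "python", script,
        "--dataset_name", dataset,
        "--shot", str(shot),
        "--alpha", "0.2",
        "--beta", "0.05",
        "--gamma", gamma,
        "--weak", str(weak),
        "--classification", str(classification),
    ]


def all_few_shot_commands():
    """Return every script and shot-count combination, in the order listed above."""
    return [few_shot_command(script, shot) for script in FEW_SHOT_SETTINGS for shot in SHOTS]


if __name__ == "__main__":
    # Dry run: print the commands instead of launching them.
    for cmd in all_few_shot_commands():
        print(shlex.join(cmd))
```

To launch a run rather than print it, pass an argument list to `subprocess.run(cmd, cwd="few-shot", check=True)`.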
  • Zero-shot

    • Download Data

      Download VGG-Sound(40K) from Baidu Disk (pwd: 2023), and extract it into the directory ./data/.

    • Usage
      • Pretrain on VGG-Sound(40K)

        cd pretrain
        bash train.sh
        

        The pretrained model will be saved to pretrain/models/.

      • Zero-shot

        MODEL_NAME="name of the pretrained model in pretrain/models/."
        # AVE
        python zero_shot.py --test_dataset_name AVE --backbone $MODEL_NAME --is_event_score 1
        
        # AVE classification
        python zero_shot.py --test_dataset_name AVE --backbone $MODEL_NAME --is_event_score 0
        
        # LLP classification
        python zero_shot.py --test_dataset_name LLP --backbone $MODEL_NAME --is_event_score 0
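The three zero-shot evaluations above can be generated from one table. This is a sketch only: the flags are copied from the commands above, the names `ZERO_SHOT_RUNS` and `zero_shot_commands` are hypothetical, and the directory from which zero_shot.py should be run is not stated here, so it is left to the caller.

```python
# (label, test_dataset_name, is_event_score), copied from the commands above.
ZERO_SHOT_RUNS = (
    ("AVE", "AVE", 1),
    ("AVE classification", "AVE", 0),
    ("LLP classification", "LLP", 0),
)


def zero_shot_commands(model_name):
    """Build the argument lists for the three zero-shot runs (does not execute them).

    model_name is the name of the pretrained model in pretrain/models/.
    """
    return [
        [
            "python", "zero_shot.py",
            "--test_dataset_name", dataset,
            "--backbone", model_name,
            "--is_event_score", str(is_event_score),
        ]
        for _label, dataset, is_event_score in ZERO_SHOT_RUNS
    ]


if __name__ == "__main__":
    # Dry run with a placeholder model name; replace it with a real file name.
    for (label, _, _), cmd in zip(ZERO_SHOT_RUNS, zero_shot_commands("MODEL_NAME")):
        print(f"# {label}")
        print(" ".join(cmd))
```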
  • Results

    [Figure: few-shot/zero-shot results]

🎓Cite

If you find this work useful, please consider citing it.

@inproceedings{duan2023cross,
  title={Cross-modal Prompts: Adapting Large Pre-trained Models for Audio-Visual Downstream Tasks},
  author={Duan, Haoyi and Xia, Yan and Zhou, Mingze and Tang, Li and Zhu, Jieming and Zhao, Zhou},
  booktitle={Thirty-seventh Conference on Neural Information Processing Systems},
  year={2023}
}

👍Acknowledgments

Our code is based on CMBS, AVSBench, MGN, MUSIC-AVQA, and LAVisH.

✏Model Checkpoints

Tasks     Checkpoints
AVE       Google Drive or Baidu Disk (pwd: 2023)
AVS_S4    Google Drive or Baidu Disk (pwd: 2023)
AVS_MS3   Google Drive or Baidu Disk (pwd: 2023)
AVVP      Google Drive or Baidu Disk (pwd: 2023)
AVQA      Google Drive or Baidu Disk (pwd: 2023)
