
[AAAI 2024] V2A-Mapper: A Lightweight Solution for Vision-to-Audio Generation by Connecting Foundation Models

Home Page: https://v2a-mapper.github.io/

License: Other

audio audio-generation image-to-audio video-to-audio vision-to-audio aaai2024


For benchmarking purposes, this repo hosts the generated test samples from "V2A-Mapper: A Lightweight Solution for Vision-to-Audio Generation by Connecting Foundation Models", AAAI 2024. ([arXiv] [project])

Authors: Heng Wang, Jianbo Ma, Santiago Pascual, Richard Cartwright, and Weidong Cai from the University of Sydney and Dolby Laboratories.


Main Results

Compared to the previous methods Im2Wav and CLIPSonic, our V2A-Mapper is trained with 86% fewer parameters yet achieves 53% and 19% improvements in Fréchet Distance (FD, fidelity) and CLIP-score (CS, relevance), respectively.
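For readers unfamiliar with the fidelity metric, FD compares the Gaussian statistics (mean and covariance) of embedding sets from real and generated audio. A minimal sketch, assuming the audio clips have already been mapped to fixed-size feature vectors by some audio embedding network (the encoder itself is not shown here):

```python
import numpy as np
from scipy import linalg

def frechet_distance(feats_real: np.ndarray, feats_gen: np.ndarray) -> float:
    """Fréchet Distance between Gaussians fitted to two embedding sets.

    Each input is an (n_samples, dim) array of audio embeddings.
    FD = ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 * sqrtm(S1 @ S2))
    """
    mu1, mu2 = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    s1 = np.cov(feats_real, rowvar=False)
    s2 = np.cov(feats_gen, rowvar=False)
    # Matrix square root of the covariance product; numerical noise can
    # introduce a tiny imaginary component, which we discard.
    covmean = linalg.sqrtm(s1 @ s2)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(s1 + s2 - 2.0 * covmean))
```

Lower is better: identical embedding sets give an FD near zero, and any shift in mean or covariance increases it.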

VGGSound

VGGSound contains 199,176 10-second video clips extracted from YouTube videos with audio-visual correspondence. Following the original train/test split, we evaluate performance on the 15,446 test samples. Our generated test samples (~5 GB) for VGGSound can be downloaded from here.

ImageHear

To test the generalization ability of our V2A-Mapper, we also evaluate on the out-of-distribution dataset ImageHear, which contains 101 images from 30 visual classes (2-8 images per class). Our generated test samples (~33 MB) for ImageHear can be downloaded from here.
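The relevance metric (CS) reported above reduces to a cosine similarity between paired vision and audio embeddings in a shared space. A minimal sketch, assuming the image and audio embeddings have already been produced by contrastively aligned encoders (e.g. a CLIP-style image encoder and a matching audio encoder; the encoder calls are omitted as they are not part of this repo):

```python
import numpy as np

def clip_score(image_embs: np.ndarray, audio_embs: np.ndarray) -> float:
    """Mean cosine similarity over paired (image, audio) embeddings.

    Both inputs are (n_pairs, dim) arrays; row i of each array is
    assumed to describe the same test sample.
    """
    a = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    b = audio_embs / np.linalg.norm(audio_embs, axis=1, keepdims=True)
    # Row-wise dot product of unit vectors = cosine similarity per pair.
    return float((a * b).sum(axis=1).mean())
```

Higher is better: perfectly aligned pairs score 1.0, unrelated pairs hover near 0.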

Custom Datasets

If you need sample results from V2A-Mapper for your own datasets, we are happy to generate them for you. Please send a request to [email protected] and [email protected].

Citation

If you find our work helpful in your research, please cite our paper:

@inproceedings{v2a-mapper,
  title     = {V2A-Mapper: A Lightweight Solution for Vision-to-Audio Generation by Connecting Foundation Models},
  author    = {Wang, Heng and Ma, Jianbo and Pascual, Santiago and Cartwright, Richard and Cai, Weidong},
  booktitle = {Proceedings of the AAAI Conference on Artificial Intelligence},
  year      = {2024},
}

Contact

If you have any questions or suggestions about this repo, please feel free to contact me at [email protected].

