
MMCosine_ICASSP23

This is the code release for the ICASSP 2023 paper "MMCosine: Multi-Modal Cosine Loss Towards Balanced Audio-Visual Fine-Grained Learning", implemented in PyTorch.

Title: MMCosine: Multi-Modal Cosine Loss Towards Balanced Audio-Visual Fine-Grained Learning

Authors: Ruize Xu, Ruoxuan Feng, Shi-xiong Zhang, Di Hu

🚀 Project page here: Project Page

📄 Paper here: Paper

🔍 Supplementary material: Supplementary

Overview

Recent studies show that the imbalanced optimization of uni-modal encoders in a joint-learning model is a bottleneck to enhancing the model's performance. We further find that up-to-date imbalance-mitigating methods fail on some audio-visual fine-grained tasks, which demand a more distinguishable feature distribution. Motivated by the success of cosine loss, which builds hyperspherical feature spaces and achieves lower intra-class angular variability, this paper proposes Multi-Modal Cosine loss (MMCosine). It performs a modality-wise $L_2$ normalization on features and weights towards balanced and better multi-modal fine-grained learning.
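The core operation can be sketched as follows. This is a minimal illustration with hypothetical names (`MMCosineHead` and its arguments are not the authors' exact implementation): each modality's features and its classifier weights are L2-normalized per modality, the resulting cosine logits are summed across modalities, and a scaling factor restores a usable logit range before cross-entropy.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MMCosineHead(nn.Module):
    """Sketch of the MMCosine idea: modality-wise L2 normalization of
    features and classifier weights, summed scaled cosine logits."""

    def __init__(self, dim_a, dim_v, num_classes, scaling=10.0):
        super().__init__()
        # one classifier weight matrix per modality (audio / visual)
        self.w_a = nn.Parameter(torch.randn(num_classes, dim_a))
        self.w_v = nn.Parameter(torch.randn(num_classes, dim_v))
        self.scaling = scaling

    def forward(self, feat_a, feat_v):
        # cosine logits per modality: <x / ||x||, w / ||w||>
        logits_a = F.linear(F.normalize(feat_a, dim=1),
                            F.normalize(self.w_a, dim=1))
        logits_v = F.linear(F.normalize(feat_v, dim=1),
                            F.normalize(self.w_v, dim=1))
        # sum across modalities, then rescale (cf. the --scaling argument)
        return self.scaling * (logits_a + logits_v)
```

Because each per-modality logit is a cosine similarity in $[-1, 1]$, neither modality can dominate the fused logits through a larger feature or weight norm, which is the balancing effect the paper targets.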

Data Preparation

  • Download the original datasets: CREMAD, SSW60, Voxceleb1&2, and UCF101 (supplementary).

  • Preprocessing:

    • CREMAD: Refer to OGM-GE for video processing.
    • SSW60: Refer to the original repo for details.
    • Voxceleb1&2: After extracting frames (2 fps) from the raw videos, we use RetinaFace to extract and align faces. The official pipeline trains on Voxceleb2 and tests on the Voxceleb1 test set; we additionally validate on a manually built Voxceleb2 test set. The annotations are in the /data folder.
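The 2 fps frame-extraction step above can be sketched with a small helper. The function name and file layout are hypothetical, and it assumes ffmpeg is on the PATH; the authors' exact pipeline (including the RetinaFace alignment step) is not reproduced here.

```python
import subprocess
from pathlib import Path

def extract_frames(video_path, out_dir, fps=2):
    """Build an ffmpeg command that dumps frames at `fps` frames per second.

    Returns the command as a list; run it with subprocess.run(cmd, check=True).
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    cmd = [
        "ffmpeg", "-i", str(video_path),
        "-vf", f"fps={fps}",                 # sample at the requested rate
        str(out_dir / "frame_%06d.jpg"),     # zero-padded frame filenames
    ]
    return cmd
```

Face detection and alignment with RetinaFace would then run over the extracted frames before training.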

Main Dependencies

  • ubuntu 18.04
  • CUDA Version: 11.6
  • Python: 3.9.7
  • torch: 1.10.1
  • torchaudio: 0.10.1
  • torchvision: 0.11.2

Run

You can train your model on the provided datasets (e.g. CREMAD) simply by running:

```shell
python main_CD.py --train --fusion_method gated --mmcosine True --scaling 10
```

Apart from the fusion method and scaling parameter, you can also adjust settings such as batch_size, lr_decay, and epochs.

You can also record intermediate variables with TensorBoard by setting use_tensorboard and specifying tensorboard_path for saving logs.

Bibtex

If you find this work useful, please consider citing it.

@inproceedings{xu2023mmcosine,
  title={MMCosine: Multi-Modal Cosine Loss Towards Balanced Audio-Visual Fine-Grained Learning},
  author={Xu, Ruize and Feng, Ruoxuan and Zhang, Shi-Xiong and Hu, Di},
  booktitle={ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  pages={1--5},
  year={2023},
  organization={IEEE}
}

Acknowledgement

This research was supported by Public Computing Cloud, Renmin University of China.

Contact us

If you have any detailed questions or suggestions, you can email us: [email protected]

