
EAT: Self-Supervised Pre-Training with Efficient Audio Transformer



News 🔥

  • We have released EAT-large (20 epochs), which achieves SOTA performance on AS-2M, AS-20K, ESC-50, and SPC-2.
  • We have updated the checkpoints and code; EAT now seamlessly supports variable-length audio throughout the training, feature extraction, inference, and evaluation phases.

Introduction

EAT is an audio SSL model that achieves both high effectiveness and high efficiency during self-supervised pre-training. You can find the details in the paper EAT: Self-Supervised Pre-Training with Efficient Audio Transformer.

Requirements and Installation

The minimum environment requirements are Python >= 3.8 and PyTorch >= 1.13. You can find the versions of the other dependencies we use in requirements.txt.

# install fairseq in editable mode
git clone https://github.com/pytorch/fairseq
cd fairseq
pip install --editable ./

# clone the EAT repository
git clone https://github.com/cwx-worst-one/EAT
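
After installation, you can optionally run a quick import check to confirm that the environment meets the stated requirements (a minimal sketch, not part of the repository):

# optional sanity check: fairseq should be importable and PyTorch should be >= 1.13
import torch
import fairseq

print("torch:", torch.__version__)
print("fairseq:", fairseq.__version__)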

Model Checkpoints

You can download the EAT-base (10 epochs) checkpoint from Google Drive.

⚠️ Because the amount of AudioSet data we possess is limited compared to what other models use, we highly recommend pre-training the EAT model on your own data, which will likely perform better than the provided checkpoint.

Update 🆕 (Recommended)
We have introduced two new variants of the EAT pre-trained model, along with their fine-tuned versions. Each is designed to improve performance through either extended pre-training epochs or a larger model size.

Links for model checkpoints:

Performance metrics:

Model      Backbone  Parameters  Pre-training Epochs  AS-20K mAP (%)  AS-2M mAP (%)
EAT-base   ViT-B     88M         10                   40.3            48.6
EAT-base   ViT-B     88M         30                   41.3            48.9
EAT-large  ViT-L     309M        20                   42.0            49.5

Feature Extraction

We provide a script for extracting audio features from the last layer of the EAT encoder. The features are stored in .npy format, and the frame rate of the extracted features is ~50 Hz. EAT provides both frame-level features and utterance-level features (represented by the CLS token).
To extract latent representations from audio clips, you can use our pre-trained checkpoint, a fine-tuned checkpoint, or your own, then run the script feature_extract.sh:

bash EAT/scripts/feature_extract.sh 
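
The extracted features can be loaded directly with NumPy, as in the sketch below (the file name is hypothetical; actual output paths depend on how you configure feature_extract.sh):

# minimal sketch: load one extracted feature file (file name is hypothetical)
import numpy as np

feats = np.load("features/sample.npy")
print(feats.shape)  # frame-level features: (num_frames, dim) at a ~50 Hz frame rate;
                    # utterance-level (CLS) features: a single (dim,) vector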

Data Preparation

The main dataset in our experiments is AudioSet. Regrettably, we are unable to release the data due to copyright restrictions. The data manifest is available here. We follow the file format used in wav2vec and data2vec, where the .tsv file serves as the index, while the .lbl and .csv files are specific to the classification task. You can modify these files for your own dataset.
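
For reference, a wav2vec-style .tsv index lists the dataset root on its first line, followed by one tab-separated line per clip containing the relative path and the number of samples. The sketch below builds such an index; it is illustrative only, and the label-file layout EAT expects may differ:

# illustrative sketch: build a wav2vec-style .tsv index for a folder of .wav files
import os
import soundfile as sf

root = "/path/to/audio"  # hypothetical dataset root
with open("train.tsv", "w") as f:
    f.write(root + "\n")  # first line: dataset root directory
    for name in sorted(os.listdir(root)):
        if name.endswith(".wav"):
            frames = sf.info(os.path.join(root, name)).frames
            f.write(f"{name}\t{frames}\n")  # relative path and sample count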

Pre-Training

Our code is adapted from Audio-MAE and data2vec. We use pretraining_AS2M.yaml as our default pre-training config. To pre-train the EAT model on AudioSet, run the script pretraining_AS2M.sh:

bash EAT/scripts/pretraining_AS2M.sh 

If you need to pre-train the EAT model on other datasets where audio lengths are not fixed at 10 seconds, you can refer to the instructions in feature_extract/readme.md.

Fine-Tuning

We use finetuning.yaml as our default fine-tuning config. To fine-tune the EAT model on different downstream tasks, run the script finetuning_{task}.sh, where {task} is one of AS20K, AS2M, ESC50, or SPCv2. For example, you can fine-tune EAT on AS20K by executing:

bash EAT/scripts/finetuning_AS20K.sh

Inference and Evaluation

For inference on a single AudioSet audio clip with a fine-tuned model, you can use our EAT checkpoints fine-tuned on AS-2M (recommended) or AS-20K and run the script inference.sh:

bash EAT/scripts/inference.sh 

An example output is as follows:

# top_k_prediction = 12
************ Acoustic Event Inference ************
LABEL                          PREDICTION
Percussion                     0.523
Drum kit                       0.437
Vibraphone                     0.420
Drum                           0.316
Music                          0.303
Snare drum                     0.277
Glockenspiel                   0.225
Marimba, xylophone             0.223
Cymbal                         0.213
Bass drum                      0.207
Hi-hat                         0.196
Mallet percussion              0.170
**************************************************

For a comprehensive evaluation on the entire AudioSet eval set with fine-tuned EAT models, you can run the evaluation script eval.sh:

bash EAT/scripts/eval.sh 

This script reports the mAP on the AudioSet eval set. Per-class AP values can be found at ./EAT/ap_log.txt. You can also refer to the results of our fine-tuned EAT models on the AudioSet evaluation set under ./EAT/results.
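
Here mAP is the mean of the per-class average precision over the multi-label targets. The snippet below is an illustrative computation with scikit-learn, not the repository's evaluation code, and the file names are hypothetical:

# illustrative mAP computation over multi-label predictions (not the repo's eval code)
import numpy as np
from sklearn.metrics import average_precision_score

targets = np.load("targets.npy")  # (num_clips, num_classes) multi-hot labels (hypothetical file)
scores = np.load("scores.npy")    # (num_clips, num_classes) sigmoid outputs (hypothetical file)

ap_per_class = average_precision_score(targets, scores, average=None)
print("mAP: {:.4f}".format(np.mean(ap_per_class)))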

Performance

Pre-trained on AS-2M, EAT achieves state-of-the-art (SOTA) performance on several audio and speech classification datasets, including AS-20K, AS-2M, ESC-50, and SPC-2.

Efficiency

EAT reduces total pre-training time by ~15x compared to BEATs and ~10x compared to Audio-MAE, requiring only 10 epochs of pre-training on AS-2M.

Experiment Logs

We track our experiments with wandb. We have published a short WandB report detailing the training process and performance metrics of the EAT model; you can view it here.

TODO

  • Release the final EAT-large model
  • Update code and checkpoints for easier usage
  • Release the Docker image

Citation

If you find our EAT code and models useful, please cite the following paper:

@article{chen2024eat,
  title={EAT: Self-Supervised Pre-Training with Efficient Audio Transformer},
  author={Chen, Wenxi and Liang, Yuzhe and Ma, Ziyang and Zheng, Zhisheng and Chen, Xie},
  journal={arXiv preprint arXiv:2401.03497},
  year={2024}
}

Reference and Acknowledgement

Our codebase is based on the awesome Audio-MAE and data2vec repositories.
