
EAT: Self-Supervised Pre-Training with Efficient Audio Transformer



News 🔥

  • We have released EAT-large (20 epochs), which achieves SOTA performance on AS-2M, AS-20K, ESC-50, and SPC-2.
  • We have updated the checkpoints and code; EAT now seamlessly supports variable-length audio throughout the training, feature extraction, inference, and evaluation phases.

Introduction

EAT is an audio SSL model that achieves both high effectiveness and high efficiency during self-supervised pre-training. You can find the details in the paper EAT: Self-Supervised Pre-Training with Efficient Audio Transformer.

Requirements and Installation

The minimum environment requirements are Python >= 3.8 and PyTorch >= 1.13. You can find the versions of the other dependencies we use in requirements.txt.

# install fairseq in editable mode
git clone https://github.com/pytorch/fairseq
cd fairseq
pip install --editable ./

# clone the EAT repository
git clone https://github.com/cwx-worst-one/EAT
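
After installation, you can optionally run a quick import check to confirm that the environment meets the stated requirements (a minimal sketch, not part of the repository):

# optional sanity check: fairseq should be importable and PyTorch should be >= 1.13
import torch
import fairseq

print("torch:", torch.__version__)
print("fairseq:", fairseq.__version__)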

Model Checkpoints

You can download the EAT-base (10 epochs) checkpoint from Google Drive.

⚠️ Because the amount of AudioSet data we possess is limited compared to what other models use, we highly recommend pre-training the EAT model on your own data, which will likely perform better than the provided checkpoint.

Update 🆕 (Recommended)
We have introduced two new variants of the EAT pre-trained model, along with their fine-tuned versions. Each is designed to improve performance through either extended pre-training epochs or a larger model size.

Links for model checkpoints:

Performance metrics:

Model      Backbone  Parameters  Pre-training Epochs  AS-20K mAP (%)  AS-2M mAP (%)
EAT-base   ViT-B     88M         10                   40.3            48.6
EAT-base   ViT-B     88M         30                   41.3            48.9
EAT-large  ViT-L     309M        20                   42.0            49.5

Feature Extraction

We provide a script for extracting audio features from the last layer of the EAT encoder. The features are stored in .npy format, and the frame rate of the extracted features is ~50 Hz. EAT provides both frame-level features and utterance-level features (represented by the CLS token).
To extract latent representations from audio clips, you can use our pre-trained checkpoint, a fine-tuned checkpoint, or your own, then run the script feature_extract.sh:

bash EAT/scripts/feature_extract.sh 
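
The extracted features can be loaded directly with NumPy, as in the sketch below (the file name is hypothetical; actual output paths depend on how you configure feature_extract.sh):

# minimal sketch: load one extracted feature file (file name is hypothetical)
import numpy as np

feats = np.load("features/sample.npy")
print(feats.shape)  # frame-level features: (num_frames, dim) at a ~50 Hz frame rate;
                    # utterance-level (CLS) features: a single (dim,) vector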

Data Preparation

The main dataset in our experiments is AudioSet. Regrettably, we are unable to release the data due to copyright restrictions. The data manifest is available here. We follow the file format used in wav2vec and data2vec, where the .tsv file serves as the index, while the .lbl and .csv files are specific to the classification task. You can modify these files for your own dataset.
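
For reference, a wav2vec-style .tsv index lists the dataset root on its first line, followed by one tab-separated line per clip containing the relative path and the number of samples. The sketch below builds such an index; it is illustrative only, and the label-file layout EAT expects may differ:

# illustrative sketch: build a wav2vec-style .tsv index for a folder of .wav files
import os
import soundfile as sf

root = "/path/to/audio"  # hypothetical dataset root
with open("train.tsv", "w") as f:
    f.write(root + "\n")  # first line: dataset root directory
    for name in sorted(os.listdir(root)):
        if name.endswith(".wav"):
            frames = sf.info(os.path.join(root, name)).frames
            f.write(f"{name}\t{frames}\n")  # relative path and sample count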

Pre-Training

Our code is adapted from Audio-MAE and data2vec. We use pretraining_AS2M.yaml as our default pre-training config. To pre-train the EAT model on AudioSet, run the script pretraining_AS2M.sh:

bash EAT/scripts/pretraining_AS2M.sh 

If you need to pre-train the EAT model on other datasets where audio lengths are not fixed at 10 seconds, you can refer to the instructions in feature_extract/readme.md.

Fine-Tuning

We use finetuning.yaml as our default fine-tuning config. To fine-tune the EAT model on different downstream tasks, run the script finetuning_{task}.sh, where {task} is one of AS20K, AS2M, ESC50, or SPCv2. For example, you can fine-tune EAT on AS20K by executing:

bash EAT/scripts/finetuning_AS20K.sh

Inference and Evaluation

For inference on a single AudioSet audio clip with a fine-tuned model, you can use our EAT checkpoints fine-tuned on AS-2M (recommended) or AS-20K and run the script inference.sh:

bash EAT/scripts/inference.sh 

An example output is as follows:

# top_k_prediction = 12
************ Acoustic Event Inference ************
LABEL                          PREDICTION
Percussion                     0.523
Drum kit                       0.437
Vibraphone                     0.420
Drum                           0.316
Music                          0.303
Snare drum                     0.277
Glockenspiel                   0.225
Marimba, xylophone             0.223
Cymbal                         0.213
Bass drum                      0.207
Hi-hat                         0.196
Mallet percussion              0.170
**************************************************

For a comprehensive evaluation on the entire AudioSet eval set with fine-tuned EAT models, you can run the evaluation script eval.sh:

bash EAT/scripts/eval.sh 

This script reports the mAP on the AudioSet eval set. Per-class AP values can be found at ./EAT/ap_log.txt. You can also refer to the results of our fine-tuned EAT models on the AudioSet evaluation set under ./EAT/results.
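
Here mAP is the mean of the per-class average precision over the multi-label targets. The snippet below is an illustrative computation with scikit-learn, not the repository's evaluation code, and the file names are hypothetical:

# illustrative mAP computation over multi-label predictions (not the repo's eval code)
import numpy as np
from sklearn.metrics import average_precision_score

targets = np.load("targets.npy")  # (num_clips, num_classes) multi-hot labels (hypothetical file)
scores = np.load("scores.npy")    # (num_clips, num_classes) sigmoid outputs (hypothetical file)

ap_per_class = average_precision_score(targets, scores, average=None)
print("mAP: {:.4f}".format(np.mean(ap_per_class)))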

Performance

Pre-trained on AS-2M, EAT achieves state-of-the-art (SOTA) performance on several audio and speech classification datasets, including AS-20K, AS-2M, ESC-50, and SPC-2.

Efficiency

EAT reduces total pre-training time by ~15x compared to BEATs and ~10x compared to Audio-MAE, requiring only 10 epochs of pre-training on AS-2M.

Experiment Logs

We track our experiments with wandb. We have published a short WandB report detailing the training process and performance metrics of the EAT model; you can view it here.

TODO

  • Release the final EAT-large model
  • Update code and checkpoints for easier usage
  • Release the Docker image

Citation

If you find our EAT code and models useful, please cite the following paper:

@article{chen2024eat,
  title={EAT: Self-Supervised Pre-Training with Efficient Audio Transformer},
  author={Chen, Wenxi and Liang, Yuzhe and Ma, Ziyang and Zheng, Zhisheng and Chen, Xie},
  journal={arXiv preprint arXiv:2401.03497},
  year={2024}
}

Reference and Acknowledgement

Our codebase is based on the awesome Audio-MAE and data2vec repositories.
