yuangongnd / ast

Code for the Interspeech 2021 paper "AST: Audio Spectrogram Transformer".

License: BSD 3-Clause "New" or "Revised" License

Languages: Python 10.73%, Shell 0.92%, Jupyter Notebook 88.35%
Topics: pytorch, audio-classification, deep-learning, audio, representation-learning, keyword-spotting, speech-commands, speech-classification

ast's People

Contributors

jeffc0628, yuangongnd


ast's Issues

Inference on CPU ?

Hello,
I tried to run inference on CPU, but the following error arose:
RuntimeError: module must have its parameters and buffers on device cuda:0
What steps should be taken?
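For anyone hitting the same thing: the error usually comes from the nn.DataParallel wrapper, which requires CUDA. Below is a minimal sketch of a CPU-only workaround, assuming the checkpoint was saved from a DataParallel-wrapped ASTModel; the path and model arguments mirror the repo's AudioSet example and may need adjusting.

import torch
from src.models import ASTModel

# Illustrative checkpoint path; adjust to your own file.
checkpoint_path = './pretrained_models/audioset_10_10_0.4593.pth'

# Load on CPU and strip the 'module.' prefix that nn.DataParallel adds,
# so the bare (non-DataParallel) model can run without CUDA.
sd = torch.load(checkpoint_path, map_location='cpu')
sd = {k.replace('module.', '', 1): v for k, v in sd.items()}

model = ASTModel(label_dim=527, fstride=10, tstride=10, input_fdim=128, input_tdim=1024,
                 imagenet_pretrain=False, audioset_pretrain=False, model_size='base384')
model.load_state_dict(sd)
model.eval()

with torch.no_grad():
    dummy = torch.randn(1, 1024, 128)   # [batch, time frames, mel bins]
    logits = model(dummy)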

MixUp Waveform Length Matching

When specifying mixup>0, the code tries to load 2 audio files and, if they are not the same length, pads or cuts waveform2 to the same shape as waveform1. There is a minor bug in the code that does this:

 if waveform1.shape[1] != waveform2.shape[1]:
        if waveform1.shape[1] > waveform2.shape[1]:
            # padding
            temp_wav = torch.zeros((1,waveform1.shape[1]))
            temp_wav[0, 0:waveform2.shape[1]] = waveform2
            waveform2 = temp_wav
        else:
            # cutting
            waveform2 = waveform2[0, 0:waveform1.shape[1]]

In the snippet above, lines 4, 5, and 9 (the torch.zeros allocation and the channel-0 indexing) don't work when the first dimension of the waveforms is > 1.
The following minor tweaks should help:

if waveform1.shape[1] != waveform2.shape[1]:
      if waveform1.shape[1] > waveform2.shape[1]:
          # padding
          temp_wav = torch.zeros(waveform1.shape)
          temp_wav[:, 0:waveform2.shape[1]] = waveform2
          waveform2 = temp_wav
      else:
          # cutting
          waveform2 = waveform2[:, 0:waveform1.shape[1]]

Clarification on the Parameters

Hey,

I'm fairly new to working with audio data for classification, so could you give some insight into the parameters and statistics mentioned in steps 2-4 of the "Use Pretrained Model For Downstream Tasks" section? Specifically, a bit more clarification on how to obtain the normalization stats, how the parameters in step 2 (SpecAugment and mixup rate) and step 4 need to be changed for different kinds of input, and how they affect the model.
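For context, here is a minimal sketch of how freqm/timem-style SpecAugment masking can be applied to an fbank with torchaudio; the mask widths are illustrative placeholders, not the recipe's recommended values.

import torch
import torchaudio

freqm, timem = 48, 192                   # illustrative maximum mask widths
freq_mask = torchaudio.transforms.FrequencyMasking(freqm)
time_mask = torchaudio.transforms.TimeMasking(timem)

fbank = torch.randn(1024, 128)           # [time frames, mel bins]
x = fbank.transpose(0, 1).unsqueeze(0)   # -> [1, mel bins, time frames]
x = time_mask(freq_mask(x))
fbank = x.squeeze(0).transpose(0, 1)     # back to [time frames, mel bins]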

results.csv and getting labels per audio file

Hi Yuan,

Thank you for posting your project and providing ample information about its elements!

I am running the ESC-50 recipe and I've been struggling to output results. Could you point me to where the result.csv files get created in the scripts? Moreover, do you know how I could pull out per-file labels for the sound files from the results of that recipe? I am trying to use this recipe for avian call recognition and am struggling with gathering the results.

Thank you, I appreciate any insight you can offer.
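In case it helps while this is open, here is a rough sketch of mapping per-file model outputs back to label names via the recipe's class CSV; the CSV is assumed to have the same index/mid/display_name columns as the AudioSet one, and paths are placeholders.

import csv
import numpy as np
import torch

def load_labels(label_csv):
    # assumes columns: index, mid, display_name
    with open(label_csv, 'r') as f:
        rows = list(csv.reader(f, delimiter=','))
    return [row[2] for row in rows[1:]]

labels = load_labels('./data/esc_class_labels_indices.csv')   # placeholder path

logits = torch.randn(1, 50)                 # stand-in for the model output for one file
probs = torch.sigmoid(logits)[0].numpy()
for k in np.argsort(probs)[::-1][:5]:
    print(f'{labels[k]}: {probs[k]:.4f}')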

Error reshaping positional embedding for AudioSet pretrained model

This error only occurs when using the AudioSet pretrained model; it does not occur when using only the ImageNet pretrained model. Audio is resampled to 16 kHz. The error occurs in src/models/ast_models.py: since t_dim > 101, the else block on line 139 is triggered.

Traceback (most recent call last):
  File "train.py", line 73, in <module>
    model = VTN(**vars(cfg))
[REDACTED - model call internally]
  File "/[REDACTED]/ast_models.py", line 141, in __init__
    new_pos_embed = new_pos_embed.reshape(1, 768, num_patches).transpose(1, 2)
RuntimeError: shape '[1, 768, 120]' is invalid for input of size 221184

Parameters passed to the ASTModel instantiation:

label_dim: 400
input_tdim: 251
input_fdim: 64
audioset_pretrain: True
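For what it's worth, the numbers in the traceback look consistent with a frequency-dimension mismatch rather than a time one; a rough check, assuming 16x16 patches and the f_dim/t_dim formula from get_shape:

# rough shape check (assumes 16x16 patches and fstride = tstride = 10)
def dim(size, stride, patch=16):
    return (size - patch) // stride + 1

f_dim = dim(64, 10)              # 5, from input_fdim=64
t_dim = dim(251, 10)             # 24, from input_tdim=251
print(f_dim * t_dim)             # 120 -> the reshape target [1, 768, 120]

# the AudioSet positional embedding, however, was built for 128 mel bins:
print(dim(128, 10) * t_dim)      # 12 * 24 = 288, and 768 * 288 = 221184,
                                 # the "input of size 221184" in the error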

Validation loss vs Training loss in AudioSet training

Hi!

First of all, I would like to thank you for sharing your amazing work with everyone. Truly inspiring and fascinating.

I have a question regarding the difference between the training loss and the validation loss. The validation loss is much higher than the training loss; does that make sense, or is it overfitting?

I also tried to fine-tune the AudioSet-trained model on my data and it showed the same gap (with and without augmentations).

Here is an example from the logs: test-full-f10-t10-pTrue-b12-lr1e-5/log_2090852.txt:

train_loss: 0.011128
valid_loss: 0.693989

I'm still new to deep learning so maybe I'm missing something.

Thank you!

Positional embedding

The paper https://arxiv.org/pdf/2012.12877v2.pdf says "We therefore cut the first dimension and interpolate the second dimension of the 24 × 24 ViT positional embedding to 12 × 100 and use it as the positional embedding for the AST."

Does "cut" mean taking the first 12 dimensions? In my understanding, nn.functional.interpolate always interpolates:

import torch
from torch import nn


h, w = 4, 3
pos_embed = torch.randn((1, 1, h, w))

a = nn.functional.interpolate(pos_embed, scale_factor=(2 / h, 3 / w), mode='bilinear')
print("position embedding:\n", pos_embed)
print("{},{}->{},{}:\n".format(h, w, 2, 3), a)

---------------------------------------------
position embedding:
 tensor([[[[-0.5638,  0.0127, -2.4190],
          [ 0.2434,  0.3804, -0.2128],
          [ 0.2813, -0.7966, -0.3580],
          [-1.2754, -0.2837,  1.6149]]]])
4,3->2,3:
 tensor([[[[-0.1602,  0.1966, -1.3159],
          [-0.4971, -0.5402,  0.6284]]]])

The accuracy following esc50 Recipe is very low

There must be some mistake on my side. Can someone help me identify it? [screenshot of training results attached]
This is how I'm training:
!python -W ignore /content/ast/src/run.py --model ast --dataset esc50 \
  --data-train /content/data/datafiles/esc_train_data_1.json \
  --data-val /content/data/datafiles/esc_eval_data_1.json \
  --exp-dir /content/expdir/fold1 \
  --label-csv /content/ast/egs/esc50/data/esc_class_labels_indices.csv --n_class 50 \
  --lr 1e-5 --n-epochs 25 --batch-size 12 --save_model False \
  --freqm 24 --timem 96 --mixup 8 --bal None \
  --tstride 10 --fstride 10 --imagenet_pretrain True --audioset_pretrain True

missing or corrupt files when training esc-50 model

Upon trying to run the ESC-50 recipe, I come across the following error:

formats: can't open input file `/project/ast/data/ESC-50-master/audio/1-31836-A-4.wav': Input/output error
Epoch: [1][100/134]     Per Sample Total Time 0.07304   Per Sample Data Time 0.00930    Per Sample DNN Time 0.06374     Train Loss 2.7119
Traceback (most recent call last):
  File "../../src/run.py", line 99, in <module>
    train(audio_model, train_loader, val_loader, args)
  File "/project/ai-audio-classification-models/06-audio-spectrogram-transformer/src/traintest.py", line 100, in train
    for i, (audio_input, labels) in enumerate(train_loader):
  File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 521, in __next__
    data = self._next_data()
  File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1203, in _next_data
    return self._process_data(data)
  File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1229, in _process_data
    data.reraise()
  File "/opt/conda/lib/python3.7/site-packages/torch/_utils.py", line 434, in reraise
    raise exception
RuntimeError: Caught RuntimeError in DataLoader worker process 28.
Original Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 287, in _worker_loop
    data = fetcher.fetch(index)
  File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 49, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 49, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/project/ai-audio-classification-models/06-audio-spectrogram-transformer/src/dataloader.py", line 180, in __getitem__
  File "/project/ai-audio-classification-models/06-audio-spectrogram-transformer/src/dataloader.py", line 101, in _wav2fbank
  File "/opt/conda/lib/python3.7/site-packages/torchaudio/backend/sox_io_backend.py", line 153, in load
    filepath, frame_offset, num_frames, normalize, channels_first, format)
RuntimeError: Error loading audio file: failed to open file /project/ast/data/ESC-50-master/audio/1-31836-A-4.wav

It's quite unclear to me how this could happen; maybe the sox command fails and the file is therefore not created?
This happens in almost every fold.
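A quick way to find missing or unreadable files up front is to try loading every path listed in the recipe's datafile before training; a rough sketch, assuming the datafile JSON uses the recipe's usual "data" list with "wav" entries:

import json
import torchaudio

datafile = './data/datafiles/esc_train_data_1.json'   # placeholder path
with open(datafile) as f:
    entries = json.load(f)['data']

bad = []
for e in entries:
    try:
        torchaudio.load(e['wav'])
    except Exception as err:
        bad.append((e['wav'], str(err)))

print(f'{len(bad)} of {len(entries)} files failed to load')
for path, err in bad[:10]:
    print(path, '->', err)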

Single audio inference for ast_model

Hi Yuan,
I have written a fairly simple script to predict the tags for a single wav file, but the result does not seem right. Could you help point out the mistake?

  import os
  import sys
  import csv
  
  import numpy as np
  import torch
  import torchaudio
  from src.models import ASTModel
  torchaudio.set_audio_backend("soundfile")       # switch backend
  basepath = os.path.dirname(os.path.dirname(sys.path[0]))
  sys.path.append(basepath)
  
  # download pretrained model in this directory
  os.environ['TORCH_HOME'] = '../pretrained_models'
  
  
  def make_features(wav_name, mel_bins, target_length=1024):
      waveform, sr = torchaudio.load(wav_name)
  
      fbank = torchaudio.compliance.kaldi.fbank(
          waveform, htk_compat=True, sample_frequency=sr, use_energy=False,
          window_type='hanning', num_mel_bins=mel_bins, dither=0.0,
          frame_shift=10)
  
      n_frames = fbank.shape[0]
      p = target_length - n_frames
      # cut and pad
      if p > 0:
          m = torch.nn.ZeroPad2d((0, 0, 0, p))
          fbank = m(fbank)
      elif p < 0:
          fbank = fbank[0:target_length, :]
  
      return fbank
  
  
  def load_label(label_csv):
      # Load label
      with open(label_csv, 'r') as f:
          reader = csv.reader(f, delimiter=',')
          lines = list(reader)
  
      labels = []
      ids = []  # Each label has a unique id such as "/m/068hy"
      for i1 in range(1, len(lines)):
          id = lines[i1][1]
          label = lines[i1][2]
          ids.append(id)
          labels.append(label)
      return labels
  
  
  if __name__ == '__main__':
  
      label_csv = './ast/egs/audioset/data/class_labels_indices.csv'
  
      # 1. make feature for predict
      wav_name = './ast/egs/audioset/data/0OxlgIitVig.wav'
      feats = make_features(wav_name, mel_bins=128)           # shape(1024, 128)
  
      # the input spectrogram here has 1024 time frames (the default target_length)
      input_tdim = feats.shape[0]
  
      # 2. load the best model and the weights
      checkpoint_path = './ast/pretrained_models/audioset_10_10_0.4593.pth'
      ast_mdl = ASTModel(label_dim=527, input_tdim=input_tdim, imagenet_pretrain=False, audioset_pretrain=False)
      print(f'[*INFO] load checkpoint: {checkpoint_path}')
      checkpoint = torch.load(checkpoint_path, map_location='cuda')
      audio_model = torch.nn.DataParallel(ast_mdl, device_ids=[0])
      audio_model.load_state_dict(checkpoint)
  
      audio_model = audio_model.to(torch.device("cuda:0"))
  
      # 3. feed the data feature to model
      feats_data = feats.expand(1, input_tdim, 128)           # reshape the feature
  
      audio_model.eval()                                      # set the eval model
      with torch.no_grad():
          output = audio_model.forward(feats_data)
          output = torch.sigmoid(output)
      result_output = output.data.cpu().numpy()[0]
  
      # 4. map the post-prob to label
      labels = load_label(label_csv)
  
      sorted_indexes = np.argsort(result_output)[::-1]
  
      # Print audio tagging top probabilities
      for k in range(10):
          print('{}: {:.4f}'.format(np.array(labels)[sorted_indexes[k]],
                                    result_output[sorted_indexes[k]]))
  
      # output should be in shape [10, 527], i.e., 10 samples, each with prediction of 527 classes.
      # print(result_output.shape)

and the output:
Speech: 0.1906
Music: 0.0481
Inside, small room: 0.0245
Musical instrument: 0.0100
Silence: 0.0088
Sound effect: 0.0074
Outside, rural or natural: 0.0064
Animal: 0.0058
Outside, urban or manmade: 0.0045
Inside, large room or hall: 0.0041

Incorrect balance variable

Hi, thanks for this great resource.

I noticed a potential typo in the wrapper script for the AudioSet pipeline: lines 26 and 31 may need to be swapped.

ast/egs/audioset/run.sh

Lines 24 to 35 in e038086

if [ $set == balanced ]
then
bal=none
lr=5e-5
epoch=25
tr_data=/data/sls/scratch/yuangong/aed-pc/src/enhance_label/datafiles_local/balanced_train_data_type1_2_mean.json
else
bal=bal
lr=1e-5
epoch=5
tr_data=/data/sls/scratch/yuangong/aed-pc/src/enhance_label/datafiles_local/whole_train_data.json
fi

load a trained model only for evaluation

I have a case where I want to load a model that I recently trained with target_length=512 and a 48 kHz sample rate, using the following code:

  sd = torch.load(model_path, map_location="cuda")
  audio_model = ASTModel(label_dim=84, fstride=10, tstride=10, input_fdim=128, input_tdim=512, imagenet_pretrain=False, audioset_pretrain=False, model_size='base384', verbose=False)
  audio_model = torch.nn.DataParallel(audio_model)
  audio_model.load_state_dict(sd, strict=False)

load_state_dict reports that all keys matched successfully, but the evaluation gives an essentially random mean average precision, which doesn't match the value observed during training.
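One thing that can mask a problem here is strict=False, which silently ignores mismatched keys. A small sketch that rebuilds the same model and prints the report load_state_dict returns (the path is a placeholder):

import torch
from src.models import ASTModel

model_path = './exp/my_model.pth'                     # placeholder path
sd = torch.load(model_path, map_location='cpu')

audio_model = ASTModel(label_dim=84, fstride=10, tstride=10, input_fdim=128,
                       input_tdim=512, imagenet_pretrain=False,
                       audioset_pretrain=False, model_size='base384', verbose=False)
audio_model = torch.nn.DataParallel(audio_model)

# strict=True fails loudly on any key or shape mismatch; with strict=False,
# at least inspect the missing/unexpected keys before trusting the mAP numbers.
report = audio_model.load_state_dict(sd, strict=False)
print('missing keys:   ', report.missing_keys)
print('unexpected keys:', report.unexpected_keys)
audio_model.eval()                                    # make sure eval mode is set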

Normalizing the train and test data

You have mentioned that if we want to use your pre-trained model, we need to take care of input normalization. In your code, I observed that you manually added the mean and std for each of the datasets you used. How are we supposed to calculate the mean and std of our own dataset? Do we calculate them after computing the fbank for each audio signal, or from the raw audio? It would be great if you could provide some clarity on this.
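For anyone with the same question, here is a rough sketch of computing the stats over the fbank features (not the raw audio) by iterating a dataloader with SpecAugment, mixup, and normalization disabled; the per-batch averaging is an approximation and may differ from the exact procedure used in the repo.

import numpy as np

# train_loader is your own DataLoader that yields (fbank, label) batches,
# with freqm=0, timem=0, mixup=0 and no normalization applied yet.
means, stds = [], []
for fbank, _ in train_loader:
    means.append(fbank.mean().item())
    stds.append(fbank.std().item())

print('dataset mean:', np.mean(means))
print('dataset std: ', np.mean(stds))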

Question regarding fbank for fine tuning

Hi Yuan,

Thank you for this great work! I am currently fine-tuning the models you produced for a project I am working on and really appreciate the opportunity you have created. I had a question regarding the spectrograms (or fbanks) produced by the _wav2fbank function.

Currently, I am trying to prepare a dataset to match the requirements of the model, but I have stumbled upon something that grabs my attention: you mention in the paper that the model accepts variable-length inputs. Taking a closer look, I found that this is due to the padding added below the fbank, which is done to fix the input dimensions of the model. However, when I applied this to my own data, I saw that the padding was of different colors depending on the image when I converted them. Here are two examples:
[two example spectrogram images: d4-2.wav, d10-2.wav]
Although I am aware that the values of the solid coloured areas are zeros, I worry that the same colour may be mapped to different values in different spectrograms, and about how that would impact the model's understanding of colour.

My second question is about the use of padding specifically. In the ViT paper as well as AST, images are fed through as a collection of patches for learning. Any patches that are fully blank naturally would not add much information to the model. However, for the patches that overlap both the fbank and the padding, is there no effect on learning? Also, if a specific category is relatively shorter in length than another, does the model include that audio file length in its representation of that class?

Any insight on the above would be deeply appreciated.
Thanks again

How to change the interpolation method?

Hi Yuan,
In AST, for the part of the ablation experiment comparing different interpolation methods, one of the entries is called "Reinitialize". How is this reflected in the code?
Best Regards.

Process Terminated during Finetuning

I was trying to use the AudioSet pretrained model for fine-tuning on a very small dataset to test things out. At first the process would simply be killed with "Out of memory" in the log, but when I moved to a larger system, the process ran for longer before failing with this error:

Traceback (most recent call last):
  File "/home/ubuntu/ast_conv/venvast/lib/python3.7/site-packages/torch/serialization.py", line 379, in save
    _save(obj, opened_zipfile, pickle_module, pickle_protocol)
  File "/home/ubuntu/ast_conv/venvast/lib/python3.7/site-packages/torch/serialization.py", line 499, in _save
    zip_file.write_record(name, storage.data_ptr(), num_bytes)
OSError: [Errno 28] No space left on device

Traceback (most recent call last):
  File "../../src/run.py", line 99, in <module>
    train(audio_model, train_loader, val_loader, args)
  File "/home/ubuntu/ast_conv/src/traintest.py", line 220, in train
    torch.save(audio_model.state_dict(), "%s/models/audio_model.%d.pth" % (exp_dir, epoch))
  File "/home/ubuntu/ast_conv/venvast/lib/python3.7/site-packages/torch/serialization.py", line 380, in save
    return
  File "/home/ubuntu/ast_conv/venvast/lib/python3.7/site-packages/torch/serialization.py", line 259, in __exit__
    self.file_like.write_end_of_file()
RuntimeError: [enforce fail at inline_container.cc:298] . unexpected pos 322619584 vs 322619472
terminate called after throwing an instance of 'c10::Error'
  what():  [enforce fail at inline_container.cc:298] . unexpected pos 322619584 vs 322619472
frame #0: c10::ThrowEnforceNotMet(char const*, int, char const*, std::string const&, void const*) + 0x47 (0x7f20ac5b47a7 in /home/ubuntu/ast_conv/venvast/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x24e10c0 (0x7f20f14190c0 in /home/ubuntu/ast_conv/venvast/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #2: <unknown function> + 0x24dc69c (0x7f20f141469c in /home/ubuntu/ast_conv/venvast/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #3: caffe2::serialize::PyTorchStreamWriter::writeRecord(std::string const&, void const*, unsigned long, bool) + 0x9a (0x7f20f1419afa in /home/ubuntu/ast_conv/venvast/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #4: caffe2::serialize::PyTorchStreamWriter::writeEndOfFile() + 0x173 (0x7f20f1419d83 in /home/ubuntu/ast_conv/venvast/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #5: caffe2::serialize::PyTorchStreamWriter::~PyTorchStreamWriter() + 0x1a5 (0x7f20f141a075 in /home/ubuntu/ast_conv/venvast/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #6: <unknown function> + 0xa7ffe3 (0x7f2103160fe3 in /home/ubuntu/ast_conv/venvast/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #7: <unknown function> + 0x4ff188 (0x7f2102be0188 in /home/ubuntu/ast_conv/venvast/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #8: <unknown function> + 0x50048e (0x7f2102be148e in /home/ubuntu/ast_conv/venvast/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #9: python() [0x5cf938]
frame #10: python() [0x52cae8]
frame #11: python() [0x52cb32]
frame #12: python() [0x52cb32]
<omitting python frames>
frame #17: python() [0x654354]
frame #19: __libc_start_main + 0xe7 (0x7f21079dcbf7 in /lib/x86_64-linux-gnu/libc.so.6)

run.sh: line 46:  1703 Aborted                 (core dumped) CUDA_CACHE_DISABLE=1 python -W ignore ../../src/run.py --model ${model} --dataset ${dataset} --data-train ${tr_data} --data-val ${val_data} --exp-dir $exp_dir --label-csv ./data/class_labels_indices.csv --n_class 3 --lr $lr --n-epochs ${epoch} --batch-size $batch_size --save_model True --freqm $freqm --timem $timem --mixup ${mixup} --bal ${bal} --tstride $tstride --fstride $fstride --imagenet_pretrain $imagenetpretrain --audioset_pretrain $audiosetpretrain > $exp_dir/log.txt

As far as I can tell, that OSError indicates that the disk (or a file-size limit) was exhausted while saving the checkpoint, not just that memory is overflowing. I haven't changed traintest.py except for adding an elif condition for the fine-tuning dataset. Did you run into this error while fine-tuning, or does it seem like something you understand the cause of?

ImageNet classifier head is not removed in AudioSet pretrained models

Hi Yuan Gong, thank you for sharing your work. It is clear and easy to run.
I am wondering about the ImageNet classifier weights: they still exist in the AudioSet pretrained models.
Do you train them?
Here is the last displayed part of the pretrained "audioset_10_10_0.4593.pth":

module.v.head.weight        torch.Size([1000, 768])
module.v.head.bias          torch.Size([1000])
module.v.head_dist.weight   torch.Size([1000, 768])
module.v.head_dist.bias     torch.Size([1000])
module.mlp_head.0.weight    torch.Size([768])
module.mlp_head.0.bias      torch.Size([768])
module.mlp_head.1.weight    torch.Size([527, 768])
module.mlp_head.1.bias      torch.Size([527])

They can be skipped by
self.v.head = nn.Identity()
self.v.head_dist = nn.Identity()

Now I want to use the pretrained AudioSet model for another task, but I am worried that eliminating this part will affect the performance, although I think these heads are not connected to the final 527-class AudioSet classifier.

Thank you again

computing the normalization stats

Hi, thank you for your great work!

I have a question regarding the different values of the 'freqm' parameter. When computing the normalization stats (mean and std), the value is 24, but during model training it is 48. Why are the values different in these two processes?

No such file or directory: './data/datafiles/esc_train_data_1.json'

Hi Yuan,
I downloaded the model and tried to test it with the ESC-50 data. I tried to run run_esc.sh, but got an error about the missing file. I downloaded master.zip, unzipped it, and put it in ./data/ESC-50-master/. I checked the run and prep scripts and haven't found any code that creates the directory or downloads files for ./data/datafiles/.
Is it a file or data I am supposed to download, or is the file generated automatically?

Ningkun

Convert mel filterbanks to wav again?

First of all, thanks for this wonderful repo! I am just curious whether it is possible to convert the mel input back to a wav again. I am trying out a model that uses the same concept as yours as a transformer decoder input, but I am not sure if the predicted output (also in mel form) can be converted back to audio. Thank you very much in advance!
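Not something the repo provides, but an approximate reconstruction is possible with torchaudio's InverseMelScale and GriffinLim. A rough sketch, assuming 16 kHz audio with a 25 ms window and 10 ms shift; the Kaldi-style log-mel fbank is only approximately invertible, so expect audible artifacts.

import torch
import torchaudio

sr, n_fft, hop, win = 16000, 400, 160, 400    # assumed 25 ms window, 10 ms shift
n_mels = 128

fbank = torch.randn(1024, n_mels)             # stand-in for a log-mel fbank [time, mel]
mel = fbank.exp().transpose(0, 1)             # undo the log -> [mel bins, time frames]

inv_mel = torchaudio.transforms.InverseMelScale(n_stft=n_fft // 2 + 1,
                                                n_mels=n_mels, sample_rate=sr)
griffin_lim = torchaudio.transforms.GriffinLim(n_fft=n_fft, hop_length=hop,
                                               win_length=win)

spec = inv_mel(mel)                           # approximate linear-frequency spectrogram
wav = griffin_lim(spec)                       # approximate waveform (phase is estimated)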

training with custom data

Thanks a lot for your amazing work and for sharing the code. I had a small question: I have a video dataset and want to use it by extracting the audio from the videos. Do you recommend any recipe for processing the audio from the videos, or would any raw mp3 work?

Thanks again
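Not an official recommendation, but once the audio track has been extracted from each video (e.g. to wav or mp3), downmixing to mono and resampling to 16 kHz with torchaudio is straightforward; the file names below are placeholders.

import torchaudio
import torchaudio.transforms as T

wav, sr = torchaudio.load('clip_audio.wav')        # audio track extracted from a video
wav = wav.mean(dim=0, keepdim=True)                # downmix to mono
if sr != 16000:
    wav = T.Resample(orig_freq=sr, new_freq=16000)(wav)
torchaudio.save('clip_16k_mono.wav', wav, 16000)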

Parameters for tuning

Hello @YuanGongND. I am trying to train AST on a dataset which is very similar to Speech Commands, but:

  • the max length of the WAVs is 64000 samples (vs 16000 in SC)
  • the test part contains very noisy samples
  1. Could you advise which params I should change? (a rough target_length calculation is sketched after this list)
  2. I have enough resources and I want to increase the accuracy. How can I do this?
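Not an answer, but a rough way to pick the target_length / input_tdim for longer clips, assuming 16 kHz audio and the 10 ms frame shift used in the recipes (so roughly 100 frames per second of audio):

sample_rate = 16000
num_samples = 64000                              # longest clip in the dataset
duration_s = num_samples / sample_rate           # 4.0 s
target_length = int(duration_s * 100)            # ~400 frames for a 10 ms frame shift
print(target_length)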

Binarizing output for each audio label in AudioSet(527 classes)

Hi Yuan ,

First of all, I would like to say a huge thanks for your great work!

It would be great if you could share more details about the output values in the README.md.

I ran demo.py and got the raw output values (positive and negative). I would like to know the best way to binarize those output values (0: audio label is absent, 1: audio label is present).

Anar Sultani
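In case it is useful while this is open, one common approach (not necessarily the authors' recommendation) is to pass the raw outputs through a sigmoid and threshold each class independently, since AudioSet tagging is multi-label; a minimal sketch with an illustrative global threshold:

import torch

logits = torch.randn(1, 527)             # stand-in for the raw demo.py output
probs = torch.sigmoid(logits)            # map each class score to [0, 1]

threshold = 0.5                          # per-class thresholds tuned on a
binary = (probs >= threshold).int()      # validation set usually work better
print(binary.sum().item(), 'labels predicted present')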

Prediction always wrong using esc50 recipe with 0.95+ accuracy after training

Thank you for this paper; it is very well written and documented. Sorry for the confusing title. I ran the ESC-50 recipe and it worked as expected. This is the accuracy obtained per fold:

9.50E-01
9.83E-01
9.35E-01
9.70E-01
9.43E-01
9.56E-01

I am trying to use the best model produced to manually classify some audio files (later I want to use the model on my own dataset). This is the code I am running:

torch.cuda.set_device('cuda:0')
device = torch.device("cuda:0")

pretrained_mdl_path ="/home/habashyk/virtualEnvs/ast/egs/esc50/exp/test-esc50-f10-t10-impTrue-aspTrue-b48-lr1e-5/fold2/models/best_optim_state.pth"
sd = torch.load(pretrained_mdl_path, map_location=device)

ast_mdl = ASTModel(label_dim=50,
                   fstride=10,
                   tstride=10,
                   input_fdim=audio_conf_AS["num_mel_bins"],
                   input_tdim=audio_conf_AS["target_length"],
                   imagenet_pretrain=True,
                   model_size='base384')
ast_mdl = torch.nn.DataParallel(ast_mdl)
ast_mdl.load_state_dict(sd, strict=False)
ast_mdl.cuda()
ast_mdl.eval()

Unfortunately, when I use this model, the predictions are never accurate (but come with very high probabilities).

Top 3 labels and their associated probabilities for each prediction
THIS IS BATCH 0
Wav 0: Ground truth:  dog
Label:  Cough 	Prob:  0.80712890625
Label:  Female speech, woman speaking 	Prob:  0.74560546875
Label:  Throat clearing 	Prob:  0.7314453125
Wav 1: Ground truth:  chirping_birds
Label:  Child singing 	Prob:  0.78369140625
Label:  Cough 	Prob:  0.73779296875
Label:  Sneeze 	Prob:  0.720703125
Wav 2: Ground truth:  vacuum_cleaner
Label:  Narration, monologue 	Prob:  0.67578125
Label:  Children shouting 	Prob:  0.6611328125
Label:  Baby laughter 	Prob:  0.66015625

I have used the same code with the AudioSet model and its associated .pth weights and it works fine. Any insight on this would be greatly appreciated. Please let me know if there is anything else I can provide.

Also, using audioset_pretrain=True gives the same result: high probabilities with incorrect classes.

Thank you!

Size mismatch error at inference time?

Hello,
I trained a base384-sized AST model on my own dataset. There were no errors during training, but when I tried to run inference and load from the checkpoint, this error arose:

RuntimeError: Error(s) in loading state_dict for DataParallel:
size mismatch for module.v.pos_embed: copying a param with shape torch.Size([1, 602, 768]) from checkpoint, the shape in current model is torch.Size([1, 1214, 768]).

What could be causing this error?

data preparation

Hi Yuan, would it be better to elaborate on how to ensure that the flac audios are single-channel?

Real-time microphone testing

Hi, I've been using your model for classification and audio analysis and it works great.
I have trained my own model and was wondering if there is a way to test it in real time with a microphone rather than an audio file. If you could provide a way forward, it would be great.
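Not part of the repo, but here is a rough sketch of capturing a fixed-length clip from the microphone with the sounddevice package and turning it into the fbank the model expects; the package choice and parameters are my own assumptions.

import sounddevice as sd
import torch
import torchaudio

sr, seconds = 16000, 5
audio = sd.rec(int(seconds * sr), samplerate=sr, channels=1, dtype='float32')
sd.wait()                                          # block until the recording finishes

waveform = torch.from_numpy(audio.T)               # -> [1, num_samples]
fbank = torchaudio.compliance.kaldi.fbank(
    waveform, htk_compat=True, sample_frequency=sr, use_energy=False,
    window_type='hanning', num_mel_bins=128, dither=0.0, frame_shift=10)
# pad/cut fbank to the model's target_length and normalize before running inference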

PSLA code

Hi,

Can you provide code for the model architecture in Figure 2?

Random inference result

Hello, Dr. Yuan.

Thank you for your great work, and sorry for my very elementary question; I'm very new to audio classification.
My inference script outputs a random result (the output changes at every execution). Could you tell me what is wrong?

I checked #19 and added fbank = (fbank + 4.26) / (4.57 * 2), but the result does not change.

This is my Colab page and I added you as an editor (if the runtime times out, cloning and pip install are needed, which take about 10 min).

source:

############ Load
import librosa.display
import os
import scipy
import numpy as np
import matplotlib.pyplot as plt
import torchaudio
import torch
import IPython.display as ipd

sample_freq = 16000

# Load fragment from 70s to 80s
filename = "/content/zzNdwF40ID8_short.wav"
y, sr = librosa.load(filename, sr=sample_freq, offset=70.0, duration=10.0)


print(f"Input sound shape is {y.shape}, {sr} Hz")
librosa.display.waveplot(y=y, sr=sr)
ipd.Audio(y, rate=sr, autoplay=True)


################# Show
# n_mels is number of Mel bands to generate
n_mels=128
interval = 10e-3 #ms    from https://arxiv.org/pdf/2104.01778.pdf
win_length = 25e-3 #ms  from https://arxiv.org/pdf/2104.01778.pdf but not used


# # hop_length is number of samples between successive frames.
hop_length=int(sample_freq * interval)


### generate fbank https://github.com/YuanGongND/ast/blob/102f0477099f83e04f6f2b30a498464b78bbaf46/src/dataloader.py#L123
waveform = torch.from_numpy( y.reshape(1, -1).astype(np.float32) ).clone().cpu() # to torch
fbank = torchaudio.compliance.kaldi.fbank(waveform, htk_compat=True, sample_frequency=sr, use_energy=False,
                                          window_type='hanning', num_mel_bins=n_mels, dither=0.0, frame_shift=10)

# normalize with dataset mean and std from https://github.com/YuanGongND/ast#use-pretrained-model-for-downstream-tasks
fbank = (fbank + 4.26) / (4.57 * 2)


# align to target_length
target_length = int((y.size/sr)/interval)
n_frames = fbank.shape[0]
p = target_length - n_frames

# cut and pad
if p > 0:
    m = torch.nn.ZeroPad2d((0, 0, 0, p))
    fbank = m(fbank)
elif p < 0:
    fbank = fbank[0:target_length, :]


plt.figure(figsize=(12, 4))
librosa.display.specshow(data=fbank.transpose(1, 0).to('cpu').detach().numpy().copy(), sr=sr, hop_length=hop_length, x_axis='time', y_axis='mel')
plt.colorbar(format='%+2.0f dB')
plt.title('Mel spectrogram')
plt.tight_layout()


# reshape
fbank = fbank.reshape(1, -1, 128)

# print
print(f"fbank shape is {fbank.shape}, mean: {fbank.mean()} std:{fbank.std()}")


######################### Infer
import os
import torch
import sys
import csv

sys.path.append(os.path.join('./ast/src'))
import models


# download pretrained model in this directory
os.environ['TORCH_HOME'] = './ast/pretrained_models'

# assume the task has 527 classes
label_dim = 527

# create a input
test_input = fbank
input_tdim = fbank.shape[1]
print(f"Input size: {test_input.shape}\n", test_input.device)

# create an AST model and infer
ast_mdl = models.ASTModel(label_dim=label_dim, input_tdim=input_tdim, imagenet_pretrain=True, audioset_pretrain=True)
ast_mdl.eval()                          
with torch.no_grad():
    test_output = ast_mdl.forward(test_input)
    test_output = torch.sigmoid(test_output)


# output should be in shape [1, 527], i.e., 1 sample, each with prediction of 527 classes.
print(f"\noutput shape is {test_output.shape}, argmax is {test_output.argmax(axis=1)}")

# open labels
if not "labels" in vars(sys.modules[__name__]):
  with open('audioset_label.txt') as f:
      reader = csv.reader(f)
      labels = [row[0] for row in reader]

# argmax
result_output = test_output.data.cpu().numpy()[0]
sorted_indexes = np.argsort(result_output)[::-1]


# Print audio tagging top probabilities
print("\nTop probabilities. Should Music, Sonar\n-------")
for k in range(10):
    print('{}: {:.4f}'.format(np.array(labels)[sorted_indexes[k]],
                              result_output[sorted_indexes[k]]))

How to change the kernel size?

Hello @YuanGongND, I'm sorry to bother you again.

I would like to ask you a question: how do I change the kernel size in order to change the number of patches? I USE the ImageNet pretrained model and do NOT USE the AudioSet pretrained model, but I get this problem:

x1 torch.Size([64, 149, 768])
self.v.pos_embed torch.Size([1, 202, 768])
Traceback (most recent call last):
  File "train.py", line 394, in <module>
    main()
  File "train.py", line 161, in main
    train_loss,train_acc = train(train_loader, model, criterion, optimizer, args.use_cuda, epoch)
  File "train.py", line 296, in train
    output = model(inputs)
  File "/root/miniconda3/envs/deepAST/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/root/miniconda3/envs/deepAST/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 159, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/root/miniconda3/envs/deepAST/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/root/miniconda3/envs/deepAST/lib/python3.7/site-packages/torch/cuda/amp/autocast_mode.py", line 135, in decorate_autocast
    return func(*args, **kwargs)
  File "/data/source/deepAST_exp/model/ASTConcat.py", line 180, in forward
    x1 = x1 + self.v.pos_embed
RuntimeError: The size of tensor a (149) must match the size of tensor b (202) at non-singleton dimension 1

I only changed the get_shape function, like this:

def get_shape(self, fstride, tstride, input_fdim=128, input_tdim=1024, kernel_size=(8,8)):
        test_input = torch.randn(1, 1, input_fdim, input_tdim)
        test_proj = nn.Conv2d(1, self.original_embedding_dim, kernel_size=kernel_size, stride=(fstride, tstride))
        test_out = test_proj(test_input)
        f_dim = test_out.shape[2]
        t_dim = test_out.shape[3]
        return f_dim, t_dim

So, what is the correct way to do this? Looking forward to your answer.
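For context, the mismatch in the traceback (149 tokens vs a 202-entry positional embedding) is what you would expect if the positional embedding is still sized for the old patch grid. Changing get_shape alone is not enough: the patch-embedding Conv2d (self.v.patch_embed.proj in the timm ViT, if I read the code correctly) and the interpolation of self.v.pos_embed in __init__ also need to be built with the new kernel_size. A rough sketch of the token-count relationship, with illustrative dimensions:

# number of transformer tokens = f_dim * t_dim patches + 2 (cls and distillation tokens),
# so the positional embedding must have exactly that many entries
def dim(size, stride, kernel):
    return (size - kernel) // stride + 1

for kernel in (16, 8):                   # illustrative: 128 mel bins, 256 frames, stride 10
    f_dim = dim(128, 10, kernel)
    t_dim = dim(256, 10, kernel)
    print(kernel, f_dim * t_dim + 2)     # tokens the positional embedding must cover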

Question about pre-training on a new dataset.

Hi,
I am trying to use the pre-trained model on my own dataset and in my own pipeline.
As recommended, I am using audioset_pretrain=True and imagenet_pretrain=True.
In the code I noticed that ASTModel is called again, which results in an infinite loop (line 129 in ast_models.py).
Below is the snippet that I am referring to.
Is this a bug or an oversight on my part? Can you please take a look?
I am really looking forward to trying AST in my pipeline.

# now load a model that is pretrained on both ImageNet and AudioSet
        elif audioset_pretrain == True:
            if audioset_pretrain == True and imagenet_pretrain == False:
                raise ValueError('currently model pretrained on only audioset is not supported, please set imagenet_pretrain = True to use audioset pretrained model.')
            if model_size != 'base384':
                raise ValueError('currently only has base384 AudioSet pretrained model.')
            device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
            if os.path.exists('../../pretrained_models/audioset_10_10_0.4593.pth') == False:
                # this model performs 0.4593 mAP on the audioset eval set
                audioset_mdl_url = 'https://www.dropbox.com/s/cv4knew8mvbrnvq/audioset_0.4593.pth?dl=1'
                wget.download(audioset_mdl_url, out='../../pretrained_models/audioset_10_10_0.4593.pth')
            sd = torch.load('../../pretrained_models/audioset_10_10_0.4593.pth', map_location=device)
            audio_model = ASTModel(label_dim=527, fstride=10, tstride=10, input_fdim=128, input_tdim=1024, imagenet_pretrain=False, audioset_pretrain=False, model_size='base384', verbose=False)

Thanks in advance for your time.

Use librosa for inference.py instead of torchaudio

Hi, I was going through the inference pipeline and wanted to know if there is a way to replace the Kaldi fbank implementation with the librosa library. I am hoping to run it on my Jetson device, and Kaldi uses the MKL library, which is not suitable for ARM architectures.

I've tried multiple methods, but the results are not the same as Kaldi's fbank implementation. Any help would be appreciated. Thank you.

@JeffC0628 @YuanGongND

Running AST on a downstream task.

Dear Yuan,

Thank you for creating this SOTA model for audio processing.

I want to run AST on an audio dataset. I have prepared the data in a similar manner to the data prepared for the ESC-50 dataset. I wanted to run the model, but then I noticed that you use dataset-specific mean and std values to normalize the dataset. Can you please share the method you used to find these two statistics?

Regards
Saif

Wrong .pth name?

Hi, thanks for the awesome contribution!
I have prepared my data using your pipeline. When running the experiments, I get:

ImageNet pretraining: True, AudioSet pretraining: True
Traceback (most recent call last):
  File "../../src/run.py", line 99, in <module>
    audioset_pretrain=args.audioset_pretrain, model_size='base384')
  File "/home/user/PycharmProjects/ast/src/models/ast_models.py", line 143, in __init__
    sd = torch.load('../../pretrained_models/ast_audioset.pth', map_location=device)
  File "/home/user/anaconda3/envs/ast/lib/python3.7/site-packages/torch/serialization.py", line 579, in load
    with _open_file_like(f, 'rb') as opened_file:
  File "/home/user/anaconda3/envs/ast/lib/python3.7/site-packages/torch/serialization.py", line 230, in _open_file_like
    return _open_file(name_or_buffer, mode)
  File "/home/user/anaconda3/envs/ast/lib/python3.7/site-packages/torch/serialization.py", line 211, in __init__
    super(_open_file, self).__init__(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: '../../pretrained_models/ast_audioset.pth'

Looking at the ast_models.py file at line 143, the name of the model being loaded is different from the name of the downloaded model:

wget.download(audioset_mdl_url, out='../../pretrained_models/audioset_10_10_0.4593.pth')
sd = torch.load('../../pretrained_models/ast_audioset.pth', map_location=device)

Changing the file name from "ast_audioset.pth" to "audioset_10_10_0.4593.pth" fixed the missing-file error.
Posting this in case someone needs it.

Temporal organization of tokens

Hi!

To start with - great work with the model and thanks for sharing!

I already ran it for standard classification cases and it worked as expected. However, now I want to treat the network's outputs as a sequence organized along the time dimension. I have a few points / questions related to that:

  1. I noticed in your paper that you tried the 128 x 2 input patches in your ablation studies. Do you have the weights saved, and would you be willing to share them? Despite the worse results, they might be useful in my case.
  2. You mentioned that the 128 x 2 patches trained better when training purely on AudioSet. Have you considered also pretraining on ImageNet with those parameters? Was the reason for not checking this computational complexity, or something else (e.g. you believe it wouldn't train well on ordinary image data)?
  3. Do you see a way to use the majority of the network as it currently is (with 16 x 16 input patches) and add some layer (e.g. conv1D) on top to make it combine outputs corresponding to specific time frames? How would you approach this?

Thanks!
Michał

Wonderful work! questions about feature size

Hi there,
Thank you for open-sourcing this implementation!
It is very inspiring to see timm work in the audio setting.

Q: I tried the pipeline with a smaller feature size, e.g. 64x400, which ends up with 39x5 patches, and AST gets stuck at 0.01 mAP.
Upsampling to your feature size of 128x1024 brought it up to 0.10 mAP. I guess your intuition is to "take advantage of" the 384x384 positions (originally 576 patches), so 1212 patches would be roughly 2x the 576 patches. I am still curious whether there is a way to do this with a smaller feature dimension.
