qiuqiangkong / audioset_tagging_cnn

License: MIT License

Python 97.99% Shell 2.01%

audioset_tagging_cnn's Issues

Confusion about Finetune

Hi bro,
When I fine-tune the model on a gun-type classification task, the results are poor, even after adjusting the learning rate and the number of epochs. A simple VGG16 architecture, by contrast, achieves good results. Could you help me understand why? Many thanks!

How can I download the dataset?

Hi, when I run runme.sh, I get many errors like this: sh: 1: youtube-dl: not found

Could you tell me another way to download this large dataset?

Thank you!

Assertion error and low MAP on bal/eval set

I am getting the assertion error while running your script to create hdf5 files. It occurs in float32_to_int16() conversion. Here is a simplified version.

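import librosa
import numpy as np

def float32_to_int16(x):
    assert np.max(np.abs(x)) <= 1.
    return (x * 32767.).astype(np.int16)

# wav_files is the list of paths to the downloaded clips
aud, sr = librosa.core.load(wav_files[0], sr=32000, mono=True)

print(np.max(np.abs(aud)))   # peak of the float waveform, before conversion
>>> 1.0048816

aud = float32_to_int16(aud)  # the assertion fails here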

Some of my audio files are out of range. If I comment out the assertion then everything works. Will it be correct to remove the assertion?
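
One way to keep the conversion safe without deleting the check is to bring out-of-range audio back into [-1, 1] first. The sketch below is an assumption, not the repository's code: `to_int16_safe` is a hypothetical helper, and it uses plain Python lists so that it runs without NumPy (with arrays, the same idea is dividing by `np.max(np.abs(x))` before the multiply). It rescales a clip only when its peak exceeds 1, which avoids int16 wrap-around.

```python
def to_int16_safe(samples):
    """Convert float samples to the int16 range, rescaling only if the peak exceeds 1.0.

    Hypothetical helper for illustration; not part of audioset_tagging_cnn.
    """
    peak = max((abs(s) for s in samples), default=0.0)
    # Resampling can overshoot slightly (e.g. a peak of 1.0048816), so
    # normalise by the peak instead of letting values wrap around in int16.
    scale = 1.0 / peak if peak > 1.0 else 1.0
    # int() truncates toward zero, like astype(np.int16) on in-range floats.
    return [int(s * scale * 32767.0) for s in samples]


clip = [0.5, -1.0048816, 0.25]
print(to_int16_safe(clip))
```

Whether rescaling or clipping is the better choice depends on the data; rescaling keeps the waveform shape, while clipping keeps the level of the in-range samples.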

I am also getting low mAP scores on the balanced set and the evaluation set when using your trained models.

The ResNet38 
bal set :: 0.52
eval set :: 0.37

CNN10 
bal set :: 0.48
eval set :: 0.32

Do you think the above issue has anything to do with it? I prepared the data with the assertion commented out.

ERROR: -0BIyqJj9ZU: YouTube said: Invalid parameters.

Hi,

While downloading the waveforms, I am getting the following errors:

ERROR: -0BIyqJj9ZU: YouTube said: Invalid parameters.
root : INFO 5 -0CamVQdP_Y start_time: 0.0, end_time: 6.0

Kindly help.

code to plot "log spectrogram"+class probabilities

Hi I wanted to ask if you could please provide me with the code for your visualization. I would really like to reproduce your plot with other audios.

In detail: The visualization of sound event detection with the log spectrogram on the top and the class probabilities in the bottom. The image can be found in resources/sed_R9_ZSCveAHg_7s.png

That would be really great!
Thanks in advance.
Lydia

Can you provide DEMO for iOS and Android mobile devices?

I don't know yet how to rewrite this code (https://github.com/pytorch/ios-demo-app) to realize the recognition of sound events.

Why transpose(1, 3) before BatchNorm?

May I ask why you apply transpose(1, 3) before BN? Is it intended to perform batch normalization separately for each frequency bin, and if so, what is the advantage? Thanks.

x = x.transpose(1, 3)
x = self.bn0(x)
x = x.transpose(1, 3)
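
The effect of the two transposes can be read off from the shapes alone. The sketch below is a shape-only illustration in plain Python (no PyTorch needed). It assumes the log-mel tensor is laid out as (batch, 1, time_steps, mel_bins) and that `bn0` is a `BatchNorm2d` over 64 channels; neither is stated in this issue, and `swap_dims` is a hypothetical helper.

```python
def swap_dims(shape, a, b):
    """Return the shape that tensor.transpose(a, b) would produce."""
    out = list(shape)
    out[a], out[b] = out[b], out[a]
    return tuple(out)


logmel_shape = (8, 1, 1001, 64)            # (batch, channel, time, mel) -- assumed layout
bn_input = swap_dims(logmel_shape, 1, 3)   # mel bins moved into dimension 1
restored = swap_dims(bn_input, 1, 3)       # second transpose undoes the first

# BatchNorm2d keeps one mean/variance per entry of dimension 1, so after the
# swap that dimension holds the mel bins: each bin gets its own statistics.
print(bn_input, restored)
```

Under these assumptions the layer normalizes each mel bin with its own statistics, computed over the batch and time axes.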

Get embedding not classification

Is there an implementation anywhere that outputs audio embeddings from any of the pretrained models, rather than classifications, so that we could train our own classifiers (e.g. random forests) on these embeddings? This would be similar to how you can easily get a 128-dimensional embedding using VGGish.

Is it possible to put the spec_augmenter in front of the nn.BatchNorm2d

Hi,
SpecAugmentation masks a block of consecutive time steps or mel frequency channels. Why is the order input, BatchNorm, spec_augmenter? Is there a reason for it? Can I change the order to input, spec_augmenter, BatchNorm?

Thanks

Change License to Reflect Proper Authors

Hello, I am interested in leveraging the great work you folks have done here. However, the current MIT License appears to just be a copy of the one used for the AngularJS project and thus doesn't reflect that the copyright holders are the authors of the associated paper "Qiuqiang Kong, Yin Cao, Turab Iqbal, Yuxuan Wang, Wenwu Wang, Mark D. Plumbley". Updating this would be greatly appreciated!

Other Pretrained Models

Really amazing stuff here.
Could you also provide other pretrained models, such as MobileNets, for audio tagging?
Thank you

Shape doesn't match when inferencing Cnn14_16k model

Great work, and thanks for sharing!

When I run this code according to readme:

python pytorch/inference.py audio_tagging --sample_rate=16000 --window_size=512 --hop_size=160 --mel_bins=64 --fmin=50 --fmax=8000 --model_type="Cnn14_16k" --checkpoint_path="Cnn14_16k_mAP=0.438.pth" --audio_path='resources/R9_ZSCveAHg_7s.mp3'

it raises this error:

Traceback (most recent call last):
  File "pytorch/inference.py", line 201, in <module>
    audio_tagging(args)
  File "pytorch/inference.py", line 42, in audio_tagging
    model.load_state_dict(checkpoint['model'])
  File "/home/zongbowen/anaconda2/envs/tensorflow/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1045, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for Cnn14_16k:
	size mismatch for spectrogram_extractor.stft.conv_real.weight: copying a param with shape torch.Size([257, 1, 512]) from checkpoint, the shape in current model is torch.Size([129, 1, 256]).
	size mismatch for spectrogram_extractor.stft.conv_imag.weight: copying a param with shape torch.Size([257, 1, 512]) from checkpoint, the shape in current model is torch.Size([129, 1, 256]).
	size mismatch for logmel_extractor.melW: copying a param with shape torch.Size([257, 64]) from checkpoint, the shape in current model is torch.Size([129, 64]).
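
The shapes in the error message follow from the STFT window size: a one-sided STFT with FFT size n_fft yields n_fft // 2 + 1 frequency bins. The short check below (plain Python, with a hypothetical helper name) shows that the checkpoint's 257 bins match --window_size=512, while 129 bins correspond to a window of 256. This suggests the model object was built with a different window size than the checkpoint; which line of the code sets that value is not shown in the issue.

```python
def stft_bins(n_fft):
    """Number of frequency bins of a one-sided STFT with FFT size n_fft."""
    return n_fft // 2 + 1


for n_fft in (512, 256):
    # Same (bins, 1, n_fft) layout as the weight shapes printed in the traceback.
    print(n_fft, (stft_bins(n_fft), 1, n_fft))
```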

The sample rate of CNN14_emb128_mAP=0.412.pth

Thanks for your work. I would like to know the sample rate of CNN14_emb128_mAP=0.412.pth.

What's the input size of CNN

Hello,

I tried to print the input size of each layer. Taking the Cnn14 model code as an example:

Audio loaded with librosa.load: [1, 32000]
spectrogram_extractor: [1, 1, 1001, 513]
logmel_extractor: [1, 1, 1001, 64]

I have three questions:

1. Different audio clips have different lengths; for example, some may be [1, 32000] and others [1, 294198], so they have different sizes after spectrogram_extractor. How can tensors of different sizes be fed into the CNN? Or do you reshape them to the same size?
2. How do you feed a (1001, 64) input, whose width and height differ, into the CNN?
3. I tested your model and the accuracy is really high. I tried extracting MFCC features and training a VGGNet on AudioSet, but the accuracy is only about 50%. How did you improve your model's accuracy?

Looking forward to your reply. Thank you.
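
The shapes quoted in this issue can be reproduced by hand. The sketch below assumes a centered STFT with window size 1024 and hop size 320; these values are inferred from the 513 bins and 1001 frames reported above and are not stated in the issue. Under that assumption, 1001 frames correspond to a 10-second clip of 320000 samples rather than 32000. `logmel_shape` is a hypothetical helper.

```python
def logmel_shape(num_samples, window_size=1024, hop_size=320, mel_bins=64):
    """Shapes after a centered STFT and a mel filter bank (batch size 1)."""
    frames = num_samples // hop_size + 1   # a centered STFT pads both ends
    bins = window_size // 2 + 1
    return (1, 1, frames, bins), (1, 1, frames, mel_bins)


for n in (32000, 294198, 320000):
    spec, logmel = logmel_shape(n)
    print(n, spec, logmel)
```

Only the time dimension changes with clip length. A CNN can accept such inputs if it pools over the time axis before its fully connected layers, which is an assumption about the architecture here rather than something the issue confirms.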

please create an ipynb file

It would be awesome to see the flow of the pre-trained model in an IPython notebook.

object of type 'NoneType' has no len()

How to Resolve this issue

Is it fully compatible with mixed precision ?

Hello,

Thank you for sharing the weights and experiments from your paper; it is very good and very helpful work.

I am experimenting with your Wavegram_Logmel_Cnn14 model on a custom dataset, and I have run into an issue with the LogmelFilterBank layer when using mixed precision in PyTorch 1.6. The forward output of this layer sometimes contains NaN values, which then produce NaN in the loss function.
Do you have any idea why? I do not have this issue without mixed precision.
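
One possible cause, offered as an assumption rather than a confirmed diagnosis: log-mel layers typically clamp the power spectrogram to a small floor such as 1e-10 before taking the logarithm, and that floor cannot be represented in float16. The standard-library check below shows how such a value behaves once rounded to half precision; a floor that rounds to zero would make the subsequent logarithm non-finite. The floor value and the helper name are illustrative.

```python
import struct


def to_float16(value):
    """Round a Python float to IEEE half precision and back."""
    return struct.unpack('e', struct.pack('e', value))[0]


amin = 1e-10                  # assumed clamp floor before the logarithm
print(to_float16(amin))       # the floor as float16 sees it
print(to_float16(6.2e-5))     # a small value that float16 can still hold
```

If this is the cause, a common workaround is to run the feature extraction in float32, outside the mixed-precision region.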

First float32_to_int16, and then int16_to_float32?

During training, you convert the waveform from float32 to int16 and then back to float32. Could you tell me why?

In pytorch/inference.py, however, you don't do this. Could you tell me why?

Provide code for metrics calculation mAP and mAUC

Could you please provide the code for the metrics used in the paper?
Thank you
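
As a reference for what the metric means, average precision for one class can be computed by ranking clips by score and averaging the precision at each true positive; mAP is the mean of that value over classes. The sketch below is a plain-Python illustration of that definition, not the code used for the paper. It does not handle tied scores, and the function names are hypothetical. In practice, library routines such as scikit-learn's `average_precision_score` and `roc_auc_score` are the usual choice.

```python
def average_precision(y_true, y_score):
    """Average precision for one class (ties between scores are not handled)."""
    ranked = sorted(zip(y_score, y_true), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precisions = []
    for rank, (_, is_positive) in enumerate(ranked, start=1):
        if is_positive:
            hits += 1
            precisions.append(hits / rank)   # precision at this true positive
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(targets, scores):
    """Mean AP over classes; targets and scores are lists of per-clip rows."""
    classes_num = len(targets[0])
    aps = [average_precision([row[k] for row in targets],
                             [row[k] for row in scores])
           for k in range(classes_num)]
    return sum(aps) / classes_num
```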

Have you ever tried a transformer-based model?

Several recent works on audio classification and recognition use transformer-based models and perform well. Have you ever tried a transformer?

The procedure

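panns_transfer_to_gtzan dataset link is broken

Hello, thank you for your excellent work. When I tried to reproduce your project, I found that the dataset download link http://marsyas.info/downloads/datasets.html no longer works. Could you share it again?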

Transfer learning for a few classes

Hey, thanks for the great work.

I want to fine-tune your pre-trained models for fewer than 527 classes.
Can you please guide me?

I have run finetune_template.

GPU number: 1 Load pretrained model successfully! Process finished with exit code 0

That's the only output.

I also tried to train from scratch with just 2 classes,
but I got several indexing errors.
I just followed runme.sh for training from scratch.

Thx

IndexError: index 0 is out of bounds for axis 0 with size 0

I am getting this error when running keras_main.py.
The error occurs at the line [for batch_data_dict in train_loader: ].
Do you have any suggestions?

Can the model be used on Android mobile terminal?

Hi,

I would like to know if the model can be transferred to mobile terminal?

Could not find file "unbalanced_train_segments_part38_partial.z01" in dataset

I have recently been trying to reproduce your paper.
After downloading your dataset, I found that this file is missing:

"unbalanced_train_segments_part38_partial.z01"

It seems that this file has not been uploaded. Could you please upload this missing file?

Literature pointers for better understanding the `Cnn14_DecisionLevelAtt` model

Hello, Thanks for the awesome repo.

I am new to the audio and SED domain. I have been using your architecture for a recent Kaggle competition and getting decent results. Therefore, I would like to better understand the details of Cnn14_DecisionLevelAtt.

I have read the PANNs paper, but it mostly focuses on the CNN feature extractor. I am interested in understanding why things are done the way they are in the Cnn14_DecisionLevelAtt model (basically everything besides the CNN feature extractor). Can you point me to some write-ups that explain this?

Thanks

The procedure

Hi, I was confused about the procedure of the experiment although I had looked through the README.md. Could you list out the steps of the experiment? Thanks a lot.

KeyError: 'framewise_output'

Hi,

When I run
CUDA_VISIBLE_DEVICES=0 python3 pytorch/inference.py sound_event_detection --model_type=$MODEL_TYPE --checkpoint_path=$CHECKPOINT_PATH --audio_path="resources/7061-6-0-0.wav" --cuda
I got an error saying

Traceback (most recent call last):
  File "pytorch/inference.py", line 202, in <module>
    sound_event_detection(args)
  File "pytorch/inference.py", line 132, in sound_event_detection
    framewise_output = batch_output_dict['framewise_output'].data.cpu().numpy()[0]

Then if I print batch_output_dict I see that the keys are: dict_keys(['clipwise_output', 'embedding']). Am I missing something ?

Thanks
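
Until the cause is clear, the failing lookup can be guarded so that the script reports which outputs a model actually provides. A minimal sketch, assuming the model returns a plain dict as printed above; `get_framewise_output` is a hypothetical helper, not a function in the repository.

```python
def get_framewise_output(batch_output_dict):
    """Return the frame-level predictions, or explain why they are missing."""
    if 'framewise_output' not in batch_output_dict:
        available = ', '.join(sorted(batch_output_dict))
        raise ValueError(
            'This model returns no framewise_output (available keys: '
            + available + '); sound_event_detection needs a model type '
            'that produces frame-level predictions.')
    return batch_output_dict['framewise_output']
```

With the dict from this issue, the message lists clipwise_output and embedding, which makes it clear that the chosen model type only produces clip-level outputs.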

Convert the sound event detection plot into CSV (pandas format)

def plot_sound_event_detection_result(framewise_output):
    """Visualization of sound event detection result.

    Args:
      framewise_output: (time_steps, classes_num)
    """
    out_fig_path = 'results/sed_result.png'
    os.makedirs(os.path.dirname(out_fig_path), exist_ok=True)

    classwise_output = np.max(framewise_output, axis=0) # (classes_num,)

    idxes = np.argsort(classwise_output)[::-1]
    idxes = idxes[0:5]

    ix_to_lb = {i : label for i, label in enumerate(labels)}
    lines = []
    for idx in idxes:
        line, = plt.plot(framewise_output[:, idx], label=ix_to_lb[idx])
        lines.append(line)

    plt.legend(handles=lines)
    plt.xlabel('Frames')
    plt.ylabel('Probability')
    plt.ylim(0, 1.)
    plt.savefig(out_fig_path)
    print('Save fig to {}'.format(out_fig_path))

How can I convert this into a pandas table of (timestamp, class_name), showing which classes are predicted at which time?
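
A possible way to get a (timestamp, class_name) table out of framewise_output instead of a plot is sketched below. It rests on several assumptions: frames_per_second=100 assumes a 32 kHz sample rate with a hop size of 320, the 0.5 threshold is arbitrary, `framewise_to_rows` is a hypothetical helper, and plain lists are used so the snippet runs without NumPy or pandas.

```python
def framewise_to_rows(framewise_output, labels, frames_per_second=100, threshold=0.5):
    """List (timestamp_in_seconds, class_name) for every frame/class above threshold.

    framewise_output: sequence of frames, each a sequence of class probabilities.
    """
    rows = []
    for frame_index, frame in enumerate(framewise_output):
        timestamp = frame_index / frames_per_second
        for class_index, probability in enumerate(frame):
            if probability >= threshold:
                rows.append((timestamp, labels[class_index]))
    return rows


labels = ['Speech', 'Music']                      # toy label list
framewise = [[0.9, 0.1], [0.8, 0.6], [0.2, 0.7]]  # (time_steps, classes_num)
rows = framewise_to_rows(framewise, labels)
print(rows)
```

The resulting rows can be passed to `pandas.DataFrame(rows, columns=['timestamp', 'class_name'])` and written out with `to_csv`.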

Pretrained Cnn14 16kHz wrong shape errors

After downloading Cnn14_16k_mAP=0.438.pth and following these instructions:

MODEL_TYPE="Cnn14_16k"
CHECKPOINT_PATH="Cnn14_16k_mAP=0.438.pth"   # Trained by a later version of code, achieves higher mAP than the paper.
CUDA_VISIBLE_DEVICES=0 python3 pytorch/inference.py audio_tagging --sample_rate=16000 --window_size=512 --hop_size=160 --mel_bins=64 --fmin=50 --fmax=8000 --model_type=$MODEL_TYPE --checkpoint_path=$CHECKPOINT_PATH --audio_path='resources/R9_ZSCveAHg_7s.wav' --cuda

I get the following error:

Traceback (most recent call last):
  File "pytorch/inference.py", line 201, in <module>
    audio_tagging(args)
  File "pytorch/inference.py", line 42, in audio_tagging
    model.load_state_dict(checkpoint['model'])
  File "/home/*user*/anaconda3/envs/onseilake/lib/python3.7/site-packages/torch/nn/modules/module.py", line 847, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for Cnn14_16k:
	size mismatch for spectrogram_extractor.stft.conv_real.weight: copying a param with shape torch.Size([257, 1, 512]) from checkpoint, the shape in current model is torch.Size([129, 1, 256]).
	size mismatch for spectrogram_extractor.stft.conv_imag.weight: copying a param with shape torch.Size([257, 1, 512]) from checkpoint, the shape in current model is torch.Size([129, 1, 256]).
	size mismatch for logmel_extractor.melW: copying a param with shape torch.Size([257, 64]) from checkpoint, the shape in current model is torch.Size([129, 64]).

Thank you for open sourcing everything!

Input wav's time length for model "Wavegram_Logmel_Cnn14"?

Is there a requirement for the input wav's duration? 4 s, 2 s, or any length?

I think the line "x = torch.cat((x, a1), dim=1)" means the duration must be a specific value, right?

ERROR - code is too big

When I run panns-reference with CPU, it shows "ERROR - code is too big".

Is panns-reference only available on GPU? Why does this error occur when using the CPU?

Last dropout is disconnected from fc_audioset layer

It appears that, in all CNN models, the last dropout, i.e., embedding = F.dropout(x, p=0.5, training=self.training), is actually disconnected from the output linear layer, i.e., self.fc_audioset(x).
Indeed, the forward method of these models reads:

x = F.relu_(self.fc1(x))
embedding = F.dropout(x, p=0.5, training=self.training)
clipwise_output = torch.sigmoid(self.fc_audioset(x))

By reading the arXiv paper, it seems that the last dropout should have instead connected the 2048-embedding layer to the 527-output layer. Indeed, the paper reads:

"Dropout [38] is applied after each downsampling operation and fully connected layers to prevent systems from overfitting."

Therefore, I expected to see the following:

x = F.relu_(self.fc1(x))
embedding = F.dropout(x, p=0.5, training=self.training)
clipwise_output = torch.sigmoid(self.fc_audioset(embedding))

Am I missing something?

Thank you,
Alessandro

DecisionLevelMax with mobilenet2

Is there a DecisionLevelMax-type model for MobileNetV2?

batchnorm1d doesn't seem to be used in attention block

In class AttBlock(nn.Module) the __init__ has

self.bn_att = nn.BatchNorm1d(n_out)

but the forward doesn't seem to be using it.

Also, temperature variable does not seem to be used.

Can these be removed without affecting the learning?

Could i change the input size of wav files?

Hi, there.
I have a question about the input size of wav files.
So, I'm doing some work on a transfer learning task based on your pretrained model.
In config.py, you set

sample_rate = 32000
clip_samples = sample_rate * 10     # Audio clips are 10-second

I'm wondering: could I change these two numbers?
If I change them, does it mean I can't use your pretrained model for the next steps?

class_labels_indices.csv is missing

Hey guys!

Thanks for sharing the code, but running the inference, this error pops up:

Traceback (most recent call last):
  File "pytorch/inference_template.py", line 26, in <module>
    import config
  File "/Users/admin/Desktop/audioset_tagging_cnn/pytorch/../utils/config.py", line 8, in <module>
    with open('metadata/class_labels_indices.csv', 'r') as f:
FileNotFoundError: [Errno 2] No such file or directory: 'metadata/class_labels_indices.csv'

Could you pls. add that file?

Thx a lot,
Max

numba.decorators ModuleNotFoundError

Hi. I got this error when trying to run pytorch/inference.py .
I installed the required packages by running pip install -r requirements.txt

Below is the traceback:

Traceback (most recent call last):
  File "pytorch/inference.py", line 6, in <module>
    import librosa
  File "/usr/local/lib/python3.7/dist-packages/librosa/__init__.py", line 12, in <module>
    from . import core
  File "/usr/local/lib/python3.7/dist-packages/librosa/core/__init__.py", line 109, in <module>
    from .time_frequency import * # pylint: disable=wildcard-import
  File "/usr/local/lib/python3.7/dist-packages/librosa/core/time_frequency.py", line 10, in <module>
    from ..util.exceptions import ParameterError
  File "/usr/local/lib/python3.7/dist-packages/librosa/util/__init__.py", line 71, in <module>
    from . import decorators
  File "/usr/local/lib/python3.7/dist-packages/librosa/util/decorators.py", line 9, in <module>
    from numba.decorators import jit as optional_jit
ModuleNotFoundError: No module named 'numba.decorators'

Can't download the AudioSet

When using youtube-dl to download AudioSet, it raises an exception:
OSError: ERROR: Unable to download webpage: HTTP Error 429: Too Many Requests (caused by <HTTPError 429: 'Too Many Requests'>); please report this issue on https://yt-dl.org/bug . Make sure you are using the latest version; see https://yt-dl.org/update on how to update. Be sure to call youtube-dl with the --verbose flag and include its complete output.

It seems that YouTube has already banned my IP. Do you have any suggestions?

Balanced/unbalanced training

Are the models trained on balanced trainset? Did you use unbalanced data for training?

Cannot reproduce the MobileNetV1 audio tagging result from the PANNs paper; are there any training tricks?

Hello, thanks for providing the source code and training data.
I downloaded the AudioSet dataset from the Baidu network disk you provided and trained the MobileNetV1 model from scratch, following the steps in "Train PANNs from scratch". However, I cannot reproduce the training result you provided (MobileNetV1_mAP=0.389.pth).
At training iteration 234000, the loss is still 1.1358, and both the validation bal mAP and the validation test mAP are 0.005. The two mAP values never change, and the model does not seem to converge.
Could you please give me some guidance? Are there any tricks for training the model?

Looking forward to your reply. Thank you.

Author's age

Secret?

framewise output loss

Can you show me framewise_output loss?

[asdf problem] Hello, is this typo in "utils/plot_statistics.py " ?

Hello,

first thank you for the good reference

please check your code "utils/plot_statistics.py "

line 1961, there is text "asdf"

thank you

SED is unavailable for some models

sound_event_detection works with the model "Cnn14_DecisionLevelMax_mAP=0.385.pth" using the command:
python pytorch/inference.py sound_event_detection --model_type="Cnn14_DecisionLevelMax" --checkpoint_path="models\Cnn14_DecisionLevelMax_mAP=0.385.pth" --audio_path="examples/R9_ZSCveAHg_7s.wav" --cuda

However, the models "MobileNetV1_mAP=0.389.pth" and "Wavegram_Logmel_Cnn14_mAP=0.439.pth" do not work with the command:
python pytorch/inference.py sound_event_detection --model_type="Wavegram_Logmel_Cnn14" --checkpoint_path="models\Wavegram_Logmel_Cnn14_mAP=0.439.pth" --audio_path="examples/R9_ZSCveAHg_7s.wav" --cuda

The model does not return 'framewise_output', which raises the error:
Traceback (most recent call last):
  File "pytorch/inference.py", line 202, in <module>
    sound_event_detection(args)
  File "pytorch/inference.py", line 132, in sound_event_detection
    framewise_output = batch_output_dict['framewise_output'].data.cpu().numpy()[0]
KeyError: 'framewise_output'

Binarizing output values

Hi Qiuqiang,

I would like to know the best way to binarize the linear predicted probabilities so that:

0: audio label is absent
1: audio label is present

Any suggestions on binarization would be great.

One more question about clipwise_output: as I understand from the paper, the probability for each label indicates whether that label is present in the input audio, and it does not depend on how long the event lasts, whether very short or long. Am I right?

I would be grateful for your answers to these questions.

Anar Sultani
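
A common starting point for binarization, given here as a sketch rather than a recommendation from the authors, is to threshold each class probability, optionally with a separate threshold per class tuned on validation data. `binarize` and the threshold values below are hypothetical.

```python
def binarize(clipwise_output, thresholds):
    """Map probabilities to 0 (label absent) or 1 (label present).

    thresholds is either one float for all classes or one float per class.
    """
    if isinstance(thresholds, (int, float)):
        thresholds = [thresholds] * len(clipwise_output)
    return [1 if p >= t else 0 for p, t in zip(clipwise_output, thresholds)]


probabilities = [0.81, 0.30, 0.04]
print(binarize(probabilities, 0.5))
print(binarize(probabilities, [0.5, 0.25, 0.1]))
```

Per-class thresholds are often useful when classes are imbalanced, because rare classes tend to receive lower scores than frequent ones.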

Feeding long audio data vs second-by-second or smaller chunks

Dear authors,

Thanks for the great work!

I would like to ask whether there is any difference between feeding the model audio that is typically 20-90 seconds long and slicing it into chunks or running second-by-second predictions. I fed the CNN14 model audio that is typically 20-90 seconds long, and when I checked feature importance on the linear predicted probabilities, it was close to 0 for all audio labels.
After binarizing them with threshold=0.3, it was clear that support was extremely low for 525 of the 527 labels (all except Speech and Music).

Now I am thinking that feeding the model second-by-second audio may increase accuracy, because each one-second instance is more likely to be monophonic, which may lead to better results.

I would like to know your opinion on these thoughts, if possible.

Best Regards
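
For comparing the two approaches, a long recording can be cut into fixed-length, optionally overlapping windows, each of which is then passed to the model separately. A minimal sketch, assuming a 32 kHz waveform held in a list; `split_into_chunks`, the 10-second window, and the zero-padding of the last chunk are illustrative choices, not the authors' procedure.

```python
def split_into_chunks(waveform, chunk_samples, hop_samples=None):
    """Cut a 1-D waveform into equal-length chunks, zero-padding the last one."""
    hop_samples = hop_samples or chunk_samples
    chunks = []
    for start in range(0, max(len(waveform), 1), hop_samples):
        chunk = list(waveform[start:start + chunk_samples])
        chunk += [0.0] * (chunk_samples - len(chunk))  # pad the tail with silence
        chunks.append(chunk)
        if start + chunk_samples >= len(waveform):
            break  # this chunk already reaches the end of the recording
    return chunks


sample_rate = 32000
audio = [0.0] * (25 * sample_rate)            # stand-in for a 25-second recording
chunks = split_into_chunks(audio, 10 * sample_rate)
print(len(chunks), len(chunks[-1]))
```

Per-chunk predictions can then be compared with the single prediction for the whole recording, or aggregated (for example by taking the maximum per class) to obtain a recording-level result.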

Variable Length Sequences

Hi,
How to use your CNN14 network with batches of input audio sequences of variable lengths? Also, is there a recommended length for audio input to the pretrained Cnn14_16k_mAP=0.438.pth?

Can the 32k MobileNetV2 model be used to fine-tune a 16k MobileNetV2 model?

It seems that no 16k MobileNetV2 pretrained model is provided. Thank you.
