qiuqiangkong / audioset_tagging_cnn
License: MIT License
Hi bro,
When I fine-tune the model on a gun-type classification task, the results are bad, even after changing the learning rate and the number of epochs. A simple VGG16 structure, on the other hand, achieves good results. Could you help me understand why? Many thanks!
Hi, when I run runme.sh, I get many errors like this: sh: 1: youtube-dl: not found
Could you tell me another way to download this large dataset?
Thank you!
import numpy as np
import librosa

def float32_to_int16(x):
    # Fails when any sample lies outside [-1, 1].
    assert np.max(np.abs(x)) <= 1.
    return (x * 32767.).astype(np.int16)

aud, sr = librosa.core.load(wav_files[0], sr=32000, mono=True)
print(np.max(np.abs(aud)))
>>> 1.0048816
aud = float32_to_int16(aud)
Some of my audio files are out of range. If I comment out the assertion then everything works. Will it be correct to remove the assertion?
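A minimal sketch of one alternative to removing the assertion: clip the waveform to [-1, 1] before scaling, so that no scaled sample exceeds 32767, the largest value int16 can represent. The function name float32_to_int16_clipped is hypothetical and is not part of the repository; whether clipping or peak normalization suits a given dataset is a judgment call.

```python
import numpy as np

def float32_to_int16_clipped(x):
    """Hypothetical variant: clip to [-1, 1] first, then scale to int16."""
    x = np.clip(x, -1.0, 1.0)
    return (x * 32767.0).astype(np.int16)

# Toy waveform containing one sample slightly above full scale.
aud = np.array([0.5, -1.0048816, 1.0, 0.0], dtype=np.float32)
aud_int16 = float32_to_int16_clipped(aud)
print(aud_int16)
```

Without the clip, a sample such as 1.0048816 would scale to roughly 32927, which does not fit in int16.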
The ResNet38
bal set :: 0.52
eval set :: 0.37
CNN10
bal set :: 0.48
eval set :: 0.32
Do you think the above issue has anything to do with it? I mean I prepare the data by commenting out the assertion.
Hi,
While downloading the waveforms, I am getting the following errors:
ERROR: -0BIyqJj9ZU: YouTube said: Invalid parameters.
root : INFO 5 -0CamVQdP_Y start_time: 0.0, end_time: 6.0
Kindly help.
Hi I wanted to ask if you could please provide me with the code for your visualization. I would really like to reproduce your plot with other audios.
In detail: The visualization of sound event detection with the log spectrogram on the top and the class probabilities in the bottom. The image can be found in resources/sed_R9_ZSCveAHg_7s.png
That would be really great!
Thanks in advance.
Lydia
I don't know yet how to adapt this code (https://github.com/pytorch/ios-demo-app) to perform sound event recognition.
May I ask why you apply transpose(1, 3) before BN? Is it intended to perform batch normalization for each frequency bin? If so, what is the advantage of this? Thanks.
x = x.transpose(1, 3)
x = self.bn0(x)
x = x.transpose(1, 3)
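To illustrate what the three lines above do, here is a small sketch. It assumes the input has shape (batch, 1, time_steps, mel_bins) and that bn0 is a BatchNorm2d over mel_bins channels; both are assumptions made for this example. Because BatchNorm2d normalizes per channel, moving the mel axis into the channel position gives each mel bin its own statistics.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
mel_bins = 64
bn0 = nn.BatchNorm2d(mel_bins)  # one mean/variance pair per mel bin

# Assumed layout: (batch, channels=1, time_steps, mel_bins)
x = torch.randn(8, 1, 101, mel_bins) * 3.0 + 5.0

x = x.transpose(1, 3)   # (batch, mel_bins, time_steps, 1)
x = bn0(x)              # normalizes over batch and time for each mel bin
x = x.transpose(1, 3)   # back to (batch, 1, time_steps, mel_bins)

per_bin_mean = x.mean(dim=(0, 1, 2))  # close to zero for every mel bin
```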
Is there an implementation of this anywhere that can be used to output embeddings of audio using any of the pretrained models, rather than classifications, so we could use these embeddings to train our own classifiers (e.g. random forests)? Similar to how you can easily get a 128-dimensional embedding using VGGish.
Hi,
SpecAugmentation masks a block of consecutive time steps or mel frequency channels. But why is the order input, BatchNorm, spec_augmenter? Is there any reason for it? Can I change the order to input, spec_augmenter, BatchNorm?
Thanks
Hello, I am interested in leveraging the great work you folks have done here. However, the current MIT License appears to just be a copy of the one used for the AngularJS project and thus doesn't reflect that the copyright holders are the authors of the associated paper "Qiuqiang Kong, Yin Cao, Turab Iqbal, Yuxuan Wang, Wenwu Wang, Mark D. Plumbley". Updating this would be greatly appreciated!
Really Amazing stuff there
Can you also provide other pretrained models, such as MobileNets, for audio tagging?
Thank You
Great work, and thanks for sharing!
When I run this code according to readme:
python pytorch/inference.py audio_tagging --sample_rate=16000 --window_size=512 --hop_size=160 --mel_bins=64 --fmin=50 --fmax=8000 --model_type="Cnn14_16k" --checkpoint_path="Cnn14_16k_mAP=0.438.pth" --audio_path='resources/R9_ZSCveAHg_7s.mp3'
it raises an error:
`
Traceback (most recent call last):
  File "pytorch/inference.py", line 201, in <module>
    audio_tagging(args)
  File "pytorch/inference.py", line 42, in audio_tagging
    model.load_state_dict(checkpoint['model'])
  File "/home/zongbowen/anaconda2/envs/tensorflow/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1045, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for Cnn14_16k:
    size mismatch for spectrogram_extractor.stft.conv_real.weight: copying a param with shape torch.Size([257, 1, 512]) from checkpoint, the shape in current model is torch.Size([129, 1, 256]).
    size mismatch for spectrogram_extractor.stft.conv_imag.weight: copying a param with shape torch.Size([257, 1, 512]) from checkpoint, the shape in current model is torch.Size([129, 1, 256]).
    size mismatch for logmel_extractor.melW: copying a param with shape torch.Size([257, 64]) from checkpoint, the shape in current model is torch.Size([129, 64]).
`
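The shapes in this error are consistent with the checkpoint having been saved with an FFT size of 512 and the instantiated model having been built with an FFT size of 256, since an FFT of size n_fft yields n_fft // 2 + 1 frequency bins. The report does not say why the model ends up with 256. The helper stft_shapes below is hypothetical and only reproduces the arithmetic.

```python
def stft_shapes(n_fft, mel_bins=64):
    """Hypothetical helper: shapes implied by an FFT size."""
    freq_bins = n_fft // 2 + 1
    return {
        "conv_weight": (freq_bins, 1, n_fft),  # conv_real / conv_imag
        "melW": (freq_bins, mel_bins),         # mel filter bank matrix
    }

checkpoint_shapes = stft_shapes(512)  # matches the shapes stored in the checkpoint
current_shapes = stft_shapes(256)     # matches the shapes of the current model
print(checkpoint_shapes, current_shapes)
```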
Thanks for your work. What is the sample rate of CNN14_emb128_mAP=0.412.pth?
Hello,
I try to print the input size of each layer, take the Cnn14 model code for example:
I have three questions:
Looking forward to your reply. Thank you.
It would be awesome to see the flow of the pre-trained model in an Ipython notebook
How to Resolve this issue
Hello,
Thank you for sharing the weights and experiments from your papers; it is very good and very helpful work.
I am experimenting with your Wavegram_Logmel_Cnn14 model on a custom dataset, and I have run into an issue when using mixed precision in PyTorch 1.6 with the LogmelFilterBank layer. I sometimes get NaN values in the forward output of this layer, which then produce NaN values in the loss function.
Do you have any idea why? I do not have this issue when I am not using mixed precision.
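One plausible cause, offered as an assumption rather than a confirmed diagnosis, is that a small clamping floor used before taking the logarithm is not representable in float16. The value 1e-10 below is assumed for illustration. The smallest positive float16 value is about 6e-8, so such a floor rounds to zero and the logarithm of a silent frame becomes -inf, which later operations can turn into NaN. The sketch reproduces this with NumPy only.

```python
import numpy as np

amin = 1e-10  # assumed clamping floor, for illustration only
floor_fp16 = np.float16(amin)  # rounds to 0.0 in half precision

power = np.array([0.0, 1e-9, 1.0])

# Half precision: the floor is lost, so log10(0) yields -inf.
power16 = power.astype(np.float16)
with np.errstate(divide="ignore"):
    log16 = 10.0 * np.log10(np.maximum(power16, floor_fp16))

# Single precision: the floor survives and every value stays finite.
power32 = power.astype(np.float32)
log32 = 10.0 * np.log10(np.maximum(power32, np.float32(amin)))

print(floor_fp16, log16, log32)
```

If this is indeed the cause, keeping the log-mel front end in float32 while the rest of the network runs in mixed precision may avoid the problem.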
During training, you convert the waveform from float32 to int16 and then back to float32. Could you tell me why?
In pytorch/inference.py, however, you don't do this. Could you tell me why?
Could you please provide a code for the metrics used in the paper?
Thank you
Recently, several works on audio classification and recognition tasks have been based on transformer models and perform well. Have you ever tried a transformer?
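Hello, thank you for your excellent work. When I tried to reproduce your project, I found that the dataset download link http://marsyas.info/downloads/datasets.html no longer works. Could you share it again?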
Hey, thanks for the great work.
I want to fine-tune your pre-trained models for fewer than 527 classes.
Can you please guide me?
I have run finetune_template.
GPU number: 1 Load pretrained model successfully! Process finished with exit code 0
That's the only output.
I also tried to train from scratch with just 2 classes, but I got several errors because of indexing.
I just followed runme.sh for training from scratch.
Thx
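A common pattern for fine-tuning on fewer classes is to keep the pretrained network as a feature extractor and attach a new output layer on top of its embedding. The sketch below is a self-contained illustration under assumptions: TinyBackbone is a hypothetical stand-in for a pretrained model that returns a dict with an 'embedding' entry of size 2048, and the names Transfer and fc_transfer are chosen for this example.

```python
import torch
import torch.nn as nn

class TinyBackbone(nn.Module):
    """Hypothetical stand-in for a pretrained model with a 527-class head."""
    def __init__(self, classes_num=527):
        super().__init__()
        self.fc1 = nn.Linear(64, 2048)
        self.fc_audioset = nn.Linear(2048, classes_num)

    def forward(self, x):
        embedding = torch.relu(self.fc1(x))
        clipwise_output = torch.sigmoid(self.fc_audioset(embedding))
        return {"clipwise_output": clipwise_output, "embedding": embedding}

class Transfer(nn.Module):
    """New classifier on top of the backbone's embedding."""
    def __init__(self, base, classes_num, freeze_base=True):
        super().__init__()
        self.base = base
        self.fc_transfer = nn.Linear(2048, classes_num)
        if freeze_base:
            # Train only the new head; the backbone keeps its weights.
            for param in self.base.parameters():
                param.requires_grad = False

    def forward(self, x):
        embedding = self.base(x)["embedding"]
        return torch.log_softmax(self.fc_transfer(embedding), dim=-1)

torch.manual_seed(0)
model = Transfer(TinyBackbone(), classes_num=2)
output = model(torch.randn(3, 64))  # (batch, classes_num) log-probabilities
trainable = [n for n, p in model.named_parameters() if p.requires_grad]
```

The same 'embedding' entry can also be extracted and fed to a separate classifier if end-to-end fine-tuning is not needed.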
I am getting an error when running keras_main.py.
The error occurs at the line [for batch_data_dict in train_loader: ].
Do you have any suggestions?
Hi,
I would like to know whether the model can be deployed on mobile devices.
I am reproducing your paper recently.
But after downloading your dataset, I found that the dataset is missing this file.
"unbalanced_train_segments_part38_partial.z01"
It seems that this file has not been uploaded. Could you please upload this missing file?
Hello, Thanks for the awesome repo.
I am new to the audio and SED domain. I have been using your architecture for a recent Kaggle competition and getting decent results. Therefore, I would like to better understand the details of Cnn14_DecisionLevelAtt.
I have read the PANNs paper, but it mostly focuses on the CNN feature extractor part. I am interested in understanding why things are done the way they are in the Cnn14_DecisionLevelAtt model (basically everything besides the CNN feature extractor). Can you point me to some write-ups that explain this?
Thanks
Hi, I was confused about the procedure of the experiment although I had looked through the README.md. Could you list out the steps of the experiment? Thanks a lot.
Hi,
When I run
CUDA_VISIBLE_DEVICES=0 python3 pytorch/inference.py sound_event_detection --model_type=$MODEL_TYPE --checkpoint_path=$CHECKPOINT_PATH --audio_path="resources/7061-6-0-0.wav" --cuda
I got an error saying
Traceback (most recent call last):
File "pytorch/inference.py", line 202, in <module>
sound_event_detection(args)
File "pytorch/inference.py", line 132, in sound_event_detection
framewise_output = batch_output_dict['framewise_output'].data.cpu().numpy()[0]
Then if I print batch_output_dict I see that the keys are: dict_keys(['clipwise_output', 'embedding']). Am I missing something?
Thanks
`import os
import numpy as np
import matplotlib.pyplot as plt

def plot_sound_event_detection_result(framewise_output):
    """Visualization of sound event detection result.

    Args:
      framewise_output: (time_steps, classes_num)
    """
    out_fig_path = 'results/sed_result.png'
    os.makedirs(os.path.dirname(out_fig_path), exist_ok=True)

    # Peak probability of each class over time, used to pick the top 5 classes.
    classwise_output = np.max(framewise_output, axis=0)  # (classes_num,)
    idxes = np.argsort(classwise_output)[::-1]
    idxes = idxes[0:5]

    # `labels` is assumed to be the list of class names defined elsewhere.
    ix_to_lb = {i : label for i, label in enumerate(labels)}
    lines = []
    for idx in idxes:
        line, = plt.plot(framewise_output[:, idx], label=ix_to_lb[idx])
        lines.append(line)

    plt.legend(handles=lines)
    plt.xlabel('Frames')
    plt.ylabel('Probability')
    plt.ylim(0, 1.)
    plt.savefig(out_fig_path)
    print('Save fig to {}'.format(out_fig_path))
`
How can I convert this into a pandas format (timestamp, class_name) that shows which classes are predicted at which time?
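A minimal sketch of such a conversion, assuming framewise_output has shape (time_steps, classes_num) as in the docstring above. The function name framewise_to_dataframe, the threshold of 0.5, and the frame rate of 100 frames per second are assumptions made for this example; the true frame rate depends on the sample rate and hop size used by the model.

```python
import numpy as np
import pandas as pd

def framewise_to_dataframe(framewise_output, labels,
                           frames_per_second=100, threshold=0.5):
    """Hypothetical helper: one row per (frame, class) above the threshold."""
    frame_idx, class_idx = np.nonzero(framewise_output >= threshold)
    return pd.DataFrame({
        "timestamp": frame_idx / frames_per_second,  # seconds
        "class_name": [labels[i] for i in class_idx],
        "probability": framewise_output[frame_idx, class_idx],
    })

# Toy example: 3 frames, 2 classes, 2 frames per second.
toy_output = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.7]])
df = framewise_to_dataframe(toy_output, ["Speech", "Music"], frames_per_second=2)
print(df)
```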
After downloading Cnn14_16k_mAP=0.438.pth
and following these instructions:
MODEL_TYPE="Cnn14_16k"
CHECKPOINT_PATH="Cnn14_16k_mAP=0.438.pth" # Trained by a later version of code, achieves higher mAP than the paper.
CUDA_VISIBLE_DEVICES=0 python3 pytorch/inference.py audio_tagging --sample_rate=16000 --window_size=512 --hop_size=160 --mel_bins=64 --fmin=50 --fmax=8000 --model_type=$MODEL_TYPE --checkpoint_path=$CHECKPOINT_PATH --audio_path='resources/R9_ZSCveAHg_7s.wav' --cuda
I get the following error:
Traceback (most recent call last):
File "pytorch/inference.py", line 201, in <module>
audio_tagging(args)
File "pytorch/inference.py", line 42, in audio_tagging
model.load_state_dict(checkpoint['model'])
File "/home/*user*/anaconda3/envs/onseilake/lib/python3.7/site-packages/torch/nn/modules/module.py", line 847, in load_state_dict
self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for Cnn14_16k:
size mismatch for spectrogram_extractor.stft.conv_real.weight: copying a param with shape torch.Size([257, 1, 512]) from checkpoint, the shape in current model is torch.Size([129, 1, 256]).
size mismatch for spectrogram_extractor.stft.conv_imag.weight: copying a param with shape torch.Size([257, 1, 512]) from checkpoint, the shape in current model is torch.Size([129, 1, 256]).
size mismatch for logmel_extractor.melW: copying a param with shape torch.Size([257, 64]) from checkpoint, the shape in current model is torch.Size([129, 64]).
Thank you for open sourcing everything!
Is there a requirement on the duration of the input wav? Should it be 4 s, 2 s, or can it be any length?
I think the line "x = torch.cat((x, a1), dim=1)" means the duration has to be a certain value, right?
When I run panns-reference with CPU, it shows "ERROR - code is too big".
Is panns-reference only available on GPU? Why does this error occur when using the CPU?
It appears that, in all CNN models, the last dropout, i.e., embedding = F.dropout(x, p=0.5, training=self.training), is actually disconnected from the output linear layer, i.e., self.fc_audioset(x).
Indeed, the forward method of these models reads:
x = F.relu_(self.fc1(x))
embedding = F.dropout(x, p=0.5, training=self.training)
clipwise_output = torch.sigmoid(self.fc_audioset(x))
By reading the arXiv paper, it seems that the last dropout should have instead connected the 2048-embedding layer to the 527-output layer. Indeed, the paper reads:
"Dropout [38] is applied after each downsampling operation and fully connected layers to prevent systems from overfitting."
Therefore, I expected to see the following:
x = F.relu_(self.fc1(x))
embedding = F.dropout(x, p=0.5, training=self.training)
clipwise_output = torch.sigmoid(self.fc_audioset(embedding))
Am I missing something?
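The difference between the two variants can be checked on a toy network. In the sketch below the layer sizes are arbitrary and the function forward is hypothetical; it only mirrors the three lines quoted above. When the output layer reads x, two forward passes in training mode give identical results, because the dropout mask never reaches the output. When it reads embedding, the results differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
fc1 = nn.Linear(16, 32)
fc_out = nn.Linear(32, 4)
inp = torch.randn(2, 16)

def forward(connect_dropout):
    x = F.relu(fc1(inp))
    embedding = F.dropout(x, p=0.5, training=True)
    # Either feed the dropped-out embedding or the untouched x to the head.
    head_input = embedding if connect_dropout else x
    return torch.sigmoid(fc_out(head_input))

same_when_disconnected = torch.equal(forward(False), forward(False))
same_when_connected = torch.equal(forward(True), forward(True))
print(same_when_disconnected, same_when_connected)
```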
Thank you,
Alessandro
Is there a DecisionLevelMax type model for MobileNetV2?
In class AttBlock(nn.Module), the __init__ has
self.bn_att = nn.BatchNorm1d(n_out)
but the forward doesn't seem to be using it.
Also, the temperature variable does not seem to be used.
Can these be removed without affecting the learning?
Hi, there.
I have a question about the input size of wav files.
So, I'm doing some work on a transfer learning task based on your pretrained model.
In config.py, you set
sample_rate = 32000
clip_samples = sample_rate * 10 # Audio clips are 10-second
I'm wondering: could I change these two numbers?
If I change them, does it mean I can't use your pretrained model for the next steps?
Hey guys!
Thanks for sharing the code, but running the inference, this error pops up:
Traceback (most recent call last):
  File "pytorch/inference_template.py", line 26, in <module>
    import config
  File "/Users/admin/Desktop/audioset_tagging_cnn/pytorch/../utils/config.py", line 8, in <module>
    with open('metadata/class_labels_indices.csv', 'r') as f:
FileNotFoundError: [Errno 2] No such file or directory: 'metadata/class_labels_indices.csv'
Could you pls. add that file?
Thx a lot,
Max
Hi. I got this error when trying to run pytorch/inference.py.
I installed the required packages by running pip install -r requirements.txt
Below is the traceback:
Traceback (most recent call last):
  File "pytorch/inference.py", line 6, in <module>
    import librosa
  File "/usr/local/lib/python3.7/dist-packages/librosa/__init__.py", line 12, in <module>
    from . import core
  File "/usr/local/lib/python3.7/dist-packages/librosa/core/__init__.py", line 109, in <module>
    from .time_frequency import * # pylint: disable=wildcard-import
  File "/usr/local/lib/python3.7/dist-packages/librosa/core/time_frequency.py", line 10, in <module>
    from ..util.exceptions import ParameterError
  File "/usr/local/lib/python3.7/dist-packages/librosa/util/__init__.py", line 71, in <module>
    from . import decorators
  File "/usr/local/lib/python3.7/dist-packages/librosa/util/decorators.py", line 9, in <module>
    from numba.decorators import jit as optional_jit
ModuleNotFoundError: No module named 'numba.decorators'
When using youtube-dl to download AudioSet, it raises an exception:
OSError: ERROR: Unable to download webpage: HTTP Error 429: Too Many Requests (caused by <HTTPError 429: 'Too Many Requests'>); please report this issue on https://yt-dl.org/bug . Make sure you are using the latest version; see https://yt-dl.org/update on how to update. Be sure to call youtube-dl with the --verbose flag and include its complete output.
It seems that YouTube has already banned my IP. Do you have any suggestions?
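HTTP 429 indicates that requests are arriving faster than the server allows. A generic mitigation is to wait progressively longer between retries. The sketch below assumes nothing about youtube-dl itself: retry_with_backoff is a hypothetical helper that wraps any callable raising OSError, and it does not guarantee that an already blocked IP will be unblocked.

```python
import time

def retry_with_backoff(fn, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Hypothetical helper: call fn, doubling the wait after each OSError."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except OSError:
            if attempt == max_attempts - 1:
                raise  # give up after the last attempt
            sleep(base_delay * 2 ** attempt)  # 1 s, 2 s, 4 s, ...
```

The sleep argument is injectable so that the waiting strategy can be replaced, for example in tests.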
Are the models trained on balanced trainset? Did you use unbalanced data for training?
Hello, thanks for providing the source code and training data.
I have downloaded the AudioSet dataset from the Baidu network disk you provided and trained the MobileNetV1 model from scratch, following the steps in "Train PANNs from scratch". However, I cannot reproduce the training result you provided (MobileNetV1_mAP=0.389.pth).
When training reaches iteration 234000, the loss is still 1.1358, the validation bal mAP is 0.005, and the validation test mAP is 0.005. The two mAP values never seem to change, and the model does not converge.
Would you please give me some guidance? Are there any tricks for training the model?
Looking forward to your reply. Thank you!
secret?
Can you show me framewise_output loss?
Hello,
first thank you for the good reference
please check your code "utils/plot_statistics.py "
line 1961, there is text "asdf"
thank you
We can use sound_event_detection with the model "Cnn14_DecisionLevelMax_mAP=0.385.pth" using the command:
python pytorch/inference.py sound_event_detection --model_type="Cnn14_DecisionLevelMax" --checkpoint_path="models\Cnn14_DecisionLevelMax_mAP=0.385.pth" --audio_path="examples/R9_ZSCveAHg_7s.wav" --cuda
However, the models "MobileNetV1_mAP=0.389.pth" and "Wavegram_Logmel_Cnn14_mAP=0.439.pth" do not work with the command:
python pytorch/inference.py sound_event_detection --model_type="Wavegram_Logmel_Cnn14" --checkpoint_path="models\Wavegram_Logmel_Cnn14_mAP=0.439.pth" --audio_path="examples/R9_ZSCveAHg_7s.wav" --cuda
Indeed, the model does not return 'framewise_output', which raises the error:
Traceback (most recent call last):
  File "pytorch/inference.py", line 202, in <module>
    sound_event_detection(args)
  File "pytorch/inference.py", line 132, in sound_event_detection
    framewise_output = batch_output_dict['framewise_output'].data.cpu().numpy()[0]
KeyError: 'framewise_output'
Hi Qiuqiang,
I would like to know what is the best way to binarize the linear predicted probabilities in a way that :
If you have any suggestion for binarization issue , it would be great to know it.
One more question about clipwise_output: as I understood from the paper, the linear probability value for each label indicates the presence of that audio label in the input audio, and it does not depend on how long the labeled event lasts, whether it occurs for a very short or a long duration. Am I right?
It would be great for me to get your answers for above mentioned questions.
Anar Sultani
Dear authors,
Thanks for the great work!
I would like to ask whether there is any potential difference between feeding the model audio that is typically 20-90 seconds long and slicing it into chunks or running second-by-second predictions. I fed the CNN14 model audio that is typically 20-90 seconds long, and after getting the linear predicted probabilities I checked feature importance; it was close to 0 for all the audio labels.
After binarizing them with threshold=0.3, it was clear that support was extremely low for 525/527 labels (all except Speech and Music).
Now I am thinking that feeding the model second-by-second audio may increase the accuracy, because each one-second instance has a chance of being monophonic, which may lead to better results.
I would like to know your opinion about the above-mentioned thoughts if possible.
Best Regards
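For experimenting with the chunked approach described in this post, a long recording can be split into fixed-length segments before inference. The sketch below is an illustration under assumptions: split_into_chunks is a hypothetical helper, the defaults of 32000 Hz and 10 seconds are taken from the config values quoted elsewhere on this page, and zero-padding the last chunk is one choice among several.

```python
import numpy as np

def split_into_chunks(waveform, sample_rate=32000, chunk_seconds=10.0):
    """Hypothetical helper: split a 1-D waveform into equal, zero-padded chunks."""
    chunk_len = int(sample_rate * chunk_seconds)
    n_chunks = int(np.ceil(len(waveform) / chunk_len))
    padded = np.zeros(n_chunks * chunk_len, dtype=waveform.dtype)
    padded[:len(waveform)] = waveform
    return padded.reshape(n_chunks, chunk_len)  # (n_chunks, chunk_len)

# Toy example: 25 samples at 1 Hz split into 10-second chunks.
toy = np.arange(25, dtype=np.float32)
chunks = split_into_chunks(toy, sample_rate=1, chunk_seconds=10.0)
print(chunks.shape)
```

Each row can then be passed to the model as one batch element, and the per-chunk probabilities can be combined afterwards, for example by taking the maximum per class.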
Hi,
How can I use your CNN14 network with batches of input audio sequences of variable lengths? Also, is there a recommended length for audio input to the pretrained Cnn14_16k_mAP=0.438.pth?
It seems that no 16k MobileNetV2 pretrained model is provided. Thank you.