drscotthawley / panotti


A multi-channel neural network audio classifier using Keras

License: MIT License

Python 86.70% Shell 0.32% HTML 11.49% CSS 1.48%

keras audio-classification neural-network convolutional-neural-networks tensorflow music-tagging

panotti's Introduction

Panotti: A Convolutional Neural Network classifier for multichannel audio waveforms

(Image of large-eared Panotti people, Wikipedia)

This is a version of the audio-classifier-keras-cnn repo (which is a hack of @keunwoochoi's compact_cnn code). The difference is that Panotti has been generalized beyond mono audio to include stereo or even more "channels," and it has undergone many refinements.

NOTE: The majority of issues people seem to have in using this utility stem from inconsistencies in their audio datasets, to the point where I hesitate to delve into such reports. I suggest trying the binaural audio example first to see whether the same problems arise. -SH

Installation

UPDATE June 9, 2020: There is an updated version of Panotti that works with TensorFlow 2, currently in the panotti branch called 'tf2'. I'm not ready to merge that branch with master until Vibrary is also updated for TF2.

Preface: Requirements

Probably Mac OS X or Linux. (Windows users: I have no experience to offer you.) Not everything is required; here's an overview:

Required:
- Python 3.5
- numpy
- keras
- tensorflow
- librosa
- matplotlib
- h5py
Optional:
- sox ("Sound eXchange": command-line utility for examples/binaural. Install via "apt-get install sox")
- pygame (for examples/headgames.py)
- For sorting-hat: flask, kivy, kivy-garden

Note that installing via the requirements.txt file will try to install both the required and the optional packages.

Installation:

git clone https://github.com/drscotthawley/panotti.git

cd panotti

pip install -r requirements.txt

Demo

I'm not shipping this with any audio, but you can generate some for the 'fake binaural' example (requires sox):

cd examples
./binaural_setup.sh
cd binaural
../../preprocess_data.py --dur=2 --clean
../../train_network.py

Quick Start

- Make a folder called Samples/ and inside it create sub-folders named after each category you want to train on. Place your audio files in these sub-folders accordingly.
- Run python preprocess_data.py
- Run python train_network.py
- Run python eval_network.py. This applies the trained network to the testing dataset and gives you accuracy reports.

Data Preparation

Data organization:

Sound files should go into a directory called Samples/, located in the directory from which the scripts are run. Within Samples/, you should have one subdirectory per class.

Example: for the IDMT-SMT-Audio-Effects database, using their monophonic guitar audio clips...

$ ls -F Samples/
Chorus/  Distortion/  EQ/  FeedbackDelay/  Flanger/   NoFX/  Overdrive/  Phaser/  Reverb/  SlapbackDelay/
Tremolo/  Vibrato/
$

(Within each subdirectory of Samples, there are loads of .wav or .mp3 files that correspond to each of those classes.)

"Is there any sample data that comes with this repo?" Not the data itself, but check out the examples/ directory. ;-)
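Since most reported problems stem from inconsistent audio datasets, it can help to audit Samples/ before preprocessing. The sketch below is a hypothetical helper, not part of Panotti. It assumes uncompressed PCM .wav files (Python's standard wave module cannot read .mp3) and reports the number of files per class plus every distinct (channels, sample rate, frame count) combination it finds.

```python
import wave
from collections import Counter
from pathlib import Path


def audit_samples(root="Samples"):
    """Count .wav files per class and tally their (channels, rate, frames) formats.

    Hypothetical helper for illustration; files with other extensions
    (e.g. .mp3) are ignored because the wave module cannot read them.
    """
    counts = {}
    formats = Counter()
    for class_dir in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        wavs = sorted(class_dir.glob("*.wav"))
        counts[class_dir.name] = len(wavs)
        for path in wavs:
            with wave.open(str(path), "rb") as w:
                formats[(w.getnchannels(), w.getframerate(), w.getnframes())] += 1
    return counts, formats
```

Calling counts, formats = audit_samples("Samples") and printing both gives a quick overview. Differing frame counts are expected (preprocessing pads them), but mixed sample rates or channel counts are worth a second look, and augmentation assumes equal length and sample rate.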

Data augmentation & preprocessing:

(Optional) Augmentation:

The "augmentation" will vary the speed, pitch, dynamics, etc. of the sound files ("data") to try to "bootstrap" some extra data with which to train. If you want to augment, then you'll run it as

$ python augment_data.py <N> Samples/*/*

where N is how many augmented copies of each file you want it to create. It will place all of these in the Samples/ directory with some kind of "_augX" appended to the filename (where X numbers the augmented copies). Augmentation assumes that all data files have the same length & sample rate.
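To make the idea concrete, here is a minimal numpy-only sketch of two of the listed transformations. It is not the code in augment_data.py: speed is changed by linear-interpolation resampling (which also shifts pitch, like playing a tape faster), dynamics by a random gain in decibels, and the result is trimmed or zero-padded back to the original length so all copies stay the same length. The function name, parameters and ranges are hypothetical. The input is copied explicitly so the original array is never modified in place.

```python
import numpy as np


def augment_once(y, rng, max_speed_change=0.1, max_gain_db=6.0):
    """Return one augmented copy of a mono signal y (1-D array).

    Hypothetical helper; parameter names and ranges are assumptions.
    """
    y_mod = np.asarray(y, dtype=float).copy()  # work on a copy, never on y itself
    n = y_mod.size
    # Speed change: resample by linear interpolation (this also shifts pitch).
    speed = 1.0 + rng.uniform(-max_speed_change, max_speed_change)
    new_len = max(1, int(round(n / speed)))
    y_mod = np.interp(np.linspace(0, n - 1, new_len), np.arange(n), y_mod)
    # Dynamics: apply a random gain in decibels.
    gain_db = rng.uniform(-max_gain_db, max_gain_db)
    y_mod *= 10.0 ** (gain_db / 20.0)
    # Keep the original length: trim, or pad the end with silence.
    if y_mod.size >= n:
        return y_mod[:n]
    return np.pad(y_mod, (0, n - y_mod.size))
```

For N copies: rng = np.random.default_rng(0), then copies = [augment_once(y, rng) for _ in range(N)].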

(Required) Preprocessing:

Preprocessing makes data loading much faster (e.g., 100 times faster) whenever you subsequently train the network. So, preprocess.

Preprocessing will pad the files with silence so that every file matches the length of the longest file and the channel count of the file with the most channels. It will then generate mel-spectrograms of all data files and create a "new version" of Samples/ called Preproc/.
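The padding step can be sketched in numpy. This assumes each signal is held as a (channels, samples) array, which is an assumption about layout rather than a description of preprocess_data.py. Missing channels and missing samples are both filled with zeros (silence).

```python
import numpy as np


def pad_all(signals):
    """Zero-pad a list of (channels, samples) arrays to one common shape.

    The target shape is (max channels, max samples) over the whole list.
    Hypothetical helper for illustration.
    """
    max_ch = max(s.shape[0] for s in signals)
    max_len = max(s.shape[1] for s in signals)
    padded = []
    for s in signals:
        out = np.zeros((max_ch, max_len), dtype=s.dtype)  # silence everywhere
        out[: s.shape[0], : s.shape[1]] = s               # copy the real audio in
        padded.append(out)
    return padded
```

For example, a mono clip of shape (1, 44100) and a stereo clip of shape (2, 22050) both come out as (2, 44100).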

It will do an 80-20 split of the dataset, so within Preproc/ will be the subdirectories Train/ and Test/. These will have the same subdirectory names as Samples/, but all the .wav and .mp3 files will have ".npy" on the end now. Datafiles will be randomly assigned to Train/ or Test/, and there they shall remain.
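The directory mapping and random 80-20 assignment can be sketched with pathlib. This is an illustration under assumptions (a fixed seed, a split done within each class, .wav/.mp3 inputs), not the logic of preprocess_data.py, which also computes and saves the spectrograms. The function name plan_split is hypothetical.

```python
import random
from pathlib import Path


def plan_split(samples_dir="Samples", preproc_dir="Preproc", train_frac=0.8, seed=0):
    """Map each audio file to Preproc/Train/<class>/ or Preproc/Test/<class>/.

    Returns (source_path, destination_path) pairs; the destination keeps the
    original file name with ".npy" appended. Hypothetical helper.
    """
    rng = random.Random(seed)
    plan = []
    for class_dir in sorted(p for p in Path(samples_dir).iterdir() if p.is_dir()):
        files = sorted(f for f in class_dir.iterdir()
                       if f.suffix.lower() in (".wav", ".mp3"))
        rng.shuffle(files)  # random, but reproducible for a given seed
        n_train = int(round(train_frac * len(files)))
        for i, f in enumerate(files):
            split = "Train" if i < n_train else "Test"
            plan.append((f, Path(preproc_dir) / split / class_dir.name / (f.name + ".npy")))
    return plan
```

Fixing the seed means the assignment can be reproduced, which matches the README's point that files stay in Train/ or Test/ once assigned.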

To do the preprocessing you just run

$ python preprocess_data.py

Training & Evaluating the Network

$ python train_network.py

That's all you need. (I should add command-line arguments to adjust the layer size and number of layers...later.)

It will perform an 80-20 split of training vs. testing data, and give you some validation scores along the way.

It's set to run for 2000 epochs; feel free to shorten that or just ^C out at some point. It checkpoints automatically by saving (and loading) the network weights in a file called weights.hdf5, so you can interrupt & resume training if you need to.

After training, more diagnostics -- ROC curves, AUC -- can be obtained by running

$ python eval_network.py

(Changing the batch_size variable between training and evaluation may not be a good idea. It will probably screw up the Batch Normalization...but maybe you'll get lucky.)

Results

On the IDMT Audio Effects Database using the 20,000 monophonic guitar samples across 12 effects classes, this code achieved 99.7% accuracy and an AUC of 0.9999. Specifically, 11 mistakes were made out of about 4000 testing examples; 6 of those were for the 'Phaser' effect, 3 were for EQ, a couple elsewhere, and most of the classes had zero mistakes. (No augmentation was used.)

This accuracy is comparable to the original 2010 study by Stein et al., who used a Support Vector Machine.

This was achieved by running for 10 hours on our workstation with an NVIDIA GTX1080 GPU.

Extra Tricks

- Multi-GPU training is supported. Because of how saving & loading work with it, Keras prints warning messages; ignore those. They arise because compiling both the parallel model and its serial counterpart breaks things, so we leave the serial model uncompiled, and that's the one we have to save. I regard this problem as a 'bug' in the Keras multi-GPU protocols.
- Speaking of saving & loading, we encode the names of the output classes in the weights.hdf5 file using an HDF5 attribute 'class_names'.

-- [@drscotthawley](https://drscotthawley.github.io)


panotti's Issues

Some doubts about function build_dataset

In panotti/datautils.py, in the function build_dataset:

X[load_count,:,0:use_len] = melgram[:,:,0:use_len]

Maybe you want:

X[load_count,:,:,0:use_len] = melgram[:,:,0:use_len]

[New Feature Request] tracking where in an audio a given sound exists

Dear Professor,
Thanks for making this amazing project available!

I was wondering if/how this project may be extended beyond its current offering to "track a particular class of sound in a test audio".

Say we have 5 folders with 5 types of sounds, and the model is trained on them.
Then a new audio file (say, 3 minutes long) is given as the test input.
_I would like to track where in that audio file a particular class of sound occurs._
Example:
The sounds are: rings, clicks, drums, guitar and whistling.
The test audio is a 3-minute .mp3 made of non-overlapping instances of the above sounds.

If I'd like to track the class whistling, the model should tell me where in the test audio the whistling occurs, if it is present at all.

Thank you.

Modifying original data in augment_audio.py

You might want to change line 34 from y_mod = y to y_mod = y.copy() or something similar; otherwise, some of the operations end up modifying the original data.

Very useful script by the way, thank you!

Training Shape Error: Shape must be rank 1 but is rank 4 for 'batch_normalization

Latest Keras 2.2.0 with TensorFlow 1.8.0 backend:
Preprocessing succeeds, but training then fails with a shape error.

h 10830 MB memory) -> physical GPU (device: 0, name: TITAN V, pci bus id: 0000:01:00.0, compute capability: 7.0)
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1567, in _create_c_op
    c_op = c_api.TF_FinishOperation(op_desc)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Shape must be rank 1 but is rank 4 for 'batch_normalization_1/cond/FusedBatchNorm' (op: 'FusedBatchNorm') with input shapes: [?,94,21009,32], [1,94,1,1], [1,94,1,1], [1,94,1,1], [1,94,1,1].

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "train_network.py", line 63, in <module>
    val_split=args.val, tile=args.tile)
  File "train_network.py", line 32, in train_network
    model, serial_model = setup_model(X_train, class_names, weights_file=weights_file)
  File "/data/code/panotti/panotti/models.py", line 210, in setup_model
    serial_model = MyCNN_Keras2(X.shape, nb_classes=len(class_names), nb_layers=nb_layers)
  File "/data/code/panotti/panotti/models.py", line 47, in MyCNN_Keras2
    model.add(BatchNormalization(axis=1))
  File "/src/keras/engine/sequential.py", line 187, in add
    output_tensor = layer(self.outputs[0])
  File "/src/keras/engine/base_layer.py", line 460, in __call__
    output = self.call(inputs, **kwargs)
  File "/src/keras/layers/normalization.py", line 204, in call
    training=training)
  File "/src/keras/backend/tensorflow_backend.py", line 3069, in in_train_phase
    x = switch(training, x, alt)
  File "/src/keras/backend/tensorflow_backend.py", line 3004, in switch
    else_expression_fn)
  File "/opt/conda/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 432, in new_func
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_ops.py", line 2072, in cond
    orig_res_f, res_f = context_f.BuildCondBranch(false_fn)
  File "/opt/conda/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_ops.py", line 1913, in BuildCondBranch
    original_result = fn()
  File "/src/keras/layers/normalization.py", line 165, in normalize_inference
    epsilon=self.epsilon)
  File "/src/keras/backend/tensorflow_backend.py", line 1894, in batch_normalization
    is_training=False
  File "/opt/conda/lib/python3.6/site-packages/tensorflow/python/ops/nn_impl.py", line 904, in fused_batch_norm
    name=name)
  File "/opt/conda/lib/python3.6/site-packages/tensorflow/python/ops/gen_nn_ops.py", line 3429, in _fused_batch_norm
    is_training=is_training, name=name)
  File "/opt/conda/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/opt/conda/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3392, in create_op
    op_def=op_def)
  File "/opt/conda/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1734, in __init__
    control_input_ops)
  File "/opt/conda/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1570, in _create_c_op
    raise ValueError(str(e))
ValueError: Shape must be rank 1 but is rank 4 for 'batch_normalization_1/cond/FusedBatchNorm' (op: 'FusedBatchNorm') with input shapes: [?,94,21009,32], [1,94,1,1], [1,94,1,1], [1,94,1,1], [1,94,1,1].

It seems a reshape step is missing. Or did I make a mistake by adding --phase during preprocessing?
P.S. Impressive library, please keep going.

pooling

I want to use gated pooling instead of max pooling, but I do not know how to tune the pooling gate function with the code. Any help, please?

import tensorflow as tf  # TensorFlow 1.x APIs (tf.contrib, tf.truncated_normal)

def gated_pooling(inputs, filter, size=2, learn_option='l/c'):
    """Gated pooling operation, responsive.
    Combine max pooling and average pooling in a mixing proportion,
    which is obtained from the inner product between the gating mask and the region being
    pooled and then fed through a sigmoid:
    fgate(x) = sigmoid(wx) fmax(x) + (1-sigmoid(wx)) favg(x)

    arguments:
      inputs: input of shape [batch size, height, width, channels]
      filter: filter size of the input layer, used to initialize gating mask
      size: an integer, width and height of the pooling filter
      learn_option: learning options of gated pooling, include:
                     'l/c': learn a mask per layer/channel
                     'l/r/c': learn a mask per layer/pooling region/channel combined
    return:
      outputs: tensor with the shape of [batch_size, height//size, width//size, channels]
    """
    if learn_option == 'l':
        gating_mask = all_channel_connected2d(inputs)  # helper not defined in this snippet
    elif learn_option == 'l/c':
        w_gated = tf.Variable(tf.truncated_normal([size, size, filter, filter],
                                                  stddev=2 / (size * size * filter * 2) ** 0.5))
        gating_mask = tf.nn.conv2d(inputs, w_gated, strides=[1, size, size, 1], padding='VALID')
    elif learn_option == 'l/r/c':
        gating_mask = locally_connected2d(inputs)  # helper not defined in this snippet
    else:
        raise ValueError("unknown learn_option: %r" % learn_option)

    alpha = tf.sigmoid(gating_mask)

    # stride=size (not a hard-coded 2) so the output really is height//size x width//size
    x1 = tf.contrib.layers.max_pool2d(inputs=inputs, kernel_size=[size, size], stride=size, padding='VALID')
    x2 = tf.contrib.layers.avg_pool2d(inputs=inputs, kernel_size=[size, size], stride=size, padding='VALID')
    outputs = tf.add(tf.multiply(x1, alpha), tf.multiply(x2, 1 - alpha))
    return outputs

Please give me advice on how to get better accuracy

Hi, I'm using your CNN and everything works very well for me. I use it for sound recognition on the ESC-50 dataset (https://github.com/karoldvl/ESC-50). I chose 10 classes from this dataset, which is 400 sound files (40 per class). I use 80% of the data (320 sound files) for training and 20% (80 sound files) for testing. My accuracy is approximately 57%. How can I get better accuracy?

I ran training with these settings: python train_network.py --batch_size 30 --epochs 300

My accuracy after this training is 57%, but I need more; ideally ≈75%.

Please help, and thanks for any reply.

Error after running preprocess_data.py

Hi, I tried to run this neural network, but at the first step, when I run the script preprocess_data.py, I get the error: ValueError: not enough values to unpack (expected 13, got 0)

I also tried the previous version, https://github.com/drscotthawley/audio-classifier-keras-cnn, and that code works correctly.

I use that previous version on Windows 10 and trained the neural network on my GTX950m graphics card.

I can use the previous version, but I don't know how to change the number of layers and other parameters of that network to make it work.

Here is an image of my error: https://imgur.com/a/NdjFV

Sorry for my bad English, and thanks for any reply.

Error: "Only one class present in y_true. ROC AUC score is not defined in that case"

Running the eval_network script I got the following error that arises after the predict phase. What could possibly cause this error? Thanks in advance!

Running predict...
Counting mistakes
Found 1846 total mistakes out of 16960 attempts
Mistakes by class:
...
/* details of the mistakes follow */
...
Measuring ROC...

/apps/K80/PYTHON/3.6.3_ML/lib/python3.6/site-packages/sklearn/metrics/ranking.py:571: UndefinedMetricWarning: No positive samples in y_true, true positive value should be meaningless
UndefinedMetricWarning)
Traceback (most recent call last):
File "./eval_network.py", line 135, in <module>
eval_network(weights_file=args.weights, classpath=args.classpath, batch_size=args.batch_size)
File "./eval_network.py", line 88, in eval_network
auc_score = roc_auc_score(Y_test, y_scores)
File "/apps/K80/PYTHON/3.6.3_ML/lib/python3.6/site-packages/sklearn/metrics/ranking.py", line 277, in roc_auc_score
sample_weight=sample_weight)
File "/apps/K80/PYTHON/3.6.3_ML/lib/python3.6/site-packages/sklearn/metrics/base.py", line 118, in _average_binary_score
sample_weight=score_weight)
File "/apps/K80/PYTHON/3.6.3_ML/lib/python3.6/site-packages/sklearn/metrics/ranking.py", line 268, in _binary_roc_auc_score
raise ValueError("Only one class present in y_true. ROC AUC score "
ValueError: Only one class present in y_true. ROC AUC score is not defined in that case.
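scikit-learn raises this when some class has no positive (or no negative) examples in Y_test, so that class's ROC AUC is undefined; the preceding warning, "No positive samples in y_true", points the same way. One workaround, sketched here with numpy only and not taken from eval_network.py, is to compute the AUC per class and skip the classes for which it is undefined. The AUC below uses the rank-based Mann-Whitney formulation, which equals the area under the ROC curve; binary_auc and per_class_auc are hypothetical names.

```python
import numpy as np


def binary_auc(y_true, y_score):
    """ROC AUC via the Mann-Whitney U statistic (tied scores get average ranks)."""
    y_true = np.asarray(y_true).astype(bool)
    y_score = np.asarray(y_score, dtype=float)
    n_pos = int(y_true.sum())
    n_neg = y_true.size - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC undefined: only one class present")
    order = np.argsort(y_score, kind="mergesort")
    sorted_scores = y_score[order]
    ranks = np.empty(y_score.size, dtype=float)
    i = 0
    while i < sorted_scores.size:
        j = i
        while j + 1 < sorted_scores.size and sorted_scores[j + 1] == sorted_scores[i]:
            j += 1
        ranks[order[i:j + 1]] = (i + j) / 2.0 + 1.0  # average 1-based rank of the tie group
        i = j + 1
    u = ranks[y_true].sum() - n_pos * (n_pos + 1) / 2.0
    return u / (n_pos * n_neg)


def per_class_auc(Y_true, Y_score):
    """AUC for each column of one-hot labels; single-label columns are skipped."""
    Y_true = np.asarray(Y_true)
    Y_score = np.asarray(Y_score)
    scores = {}
    for k in range(Y_true.shape[1]):
        col = Y_true[:, k].astype(bool)
        if col.all() or not col.any():
            continue  # AUC is undefined for this class
        scores[k] = binary_auc(col, Y_score[:, k])
    return scores
```

Classes missing from the returned dict have no test examples (or only test examples), which usually means the test split is too small or unbalanced for those classes.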

Output

Good night.

Do the results on the test set show the accuracy for each of the different wav files?
I want to create a model that classifies laughing, crying and other emotions.
Do the results appear on the command line?

Thanks in advance.

Problem loading class data in train_network

I'm trying to run a test with ~500 classes and ~80K files, and right at the end of the first step, loading the input data for the classes, I get the following error:

File "panotti/panotti/datautils.py", line 235, in build_dataset
X[load_count,:,:] = melgram
IndexError: index 64837 is out of bounds for axis 0 with size 64837

At the beginning, the output shows that this is the total number of files that are going to be loaded:

going to load total_load = 64837

I was able to reproduce the problem with a smaller input (~700 files). This is the full output:

Using TensorFlow backend.
class_names = ['1', '2', '3', '4', '5']
total files = 780 , going to load total_load = 760
total files = 780 , going to load total_load = 760
get_sample_dimensions: 1-140000-145000-44100-2-348093.wav.npz: melgram.shape = (1, 96, 431, 2)
melgram dimensions: (1, 96, 431, 2)

Loading class 1/5: '1', File 275/275: Preproc/Train/1/1-55500-60500-44100-2-348093.wav.npz
Loading class 2/5: '2', File 216/216: Preproc/Train/2/2-105000-110000-44100-2-275506.wav.npz
Loading class 3/5: '3', File 108/108: Preproc/Train/3/3-54000-59000-44100-2-139023.wav.npz
Loading class 4/5: '4', File 101/162: Preproc/Train/4/4-49500-54500-44100-2-166661.wav.npz
Loading class 5/5: '5', File 1/19: Preproc/Train/5/5-10000-15000-44100-2-28040.wav.npz
Traceback (most recent call last):
File "train_network.py", line 63, in <module>
val_split=args.val, tile=args.tile)
File "train_network.py", line 29, in train_network
X_train, Y_train, paths_train, class_names = build_dataset(path=classpath, batch_size=batch_size, tile=tile)
File "panotti/panotti/datautils.py", line 235, in build_dataset
X[load_count,:,:] = melgram
IndexError: index 760 is out of bounds for axis 0 with size 760

Looking into panotti/datautils.py line 235, the problem seems to be that the load_count counter reaches total_load before the last class has been processed. The break statement at line 240 then stops the inner loop (which iterates over the files in a class) but not the outer loop (which iterates over the classes), so the code still tries to process the next class's files and crashes when indexing the array X out of bounds, because load_count is already at the limit.
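The fix this analysis points to is to leave both loops once the preallocated array is full. A minimal sketch of that control flow, with hypothetical names (fill_dataset, files_by_class, load_melgram) standing in for whatever build_dataset actually uses. The same guard also means total_load = 0 yields an empty array instead of an IndexError.

```python
import numpy as np


def fill_dataset(files_by_class, load_melgram, total_load, sample_shape):
    """Load at most total_load samples, stopping across *all* classes once full.

    files_by_class: list of file lists, one per class (hypothetical layout).
    load_melgram:   callable returning an array of shape sample_shape.
    """
    X = np.zeros((total_load, *sample_shape), dtype=np.float32)
    Y = np.zeros((total_load, len(files_by_class)), dtype=np.float32)
    load_count = 0
    for class_idx, files in enumerate(files_by_class):
        for path in files:
            if load_count >= total_load:  # array is full: leave the inner loop...
                break
            X[load_count] = load_melgram(path)
            Y[load_count, class_idx] = 1.0  # one-hot label
            load_count += 1
        if load_count >= total_load:  # ...and the outer loop too
            break
    return X, Y, load_count
```
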

Confused about measuring the accuracy of the model using cross-validation

How do I check the accuracy of each fold?
Is the accuracy the mean of the accuracies of the individual folds?
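In k-fold cross-validation each fold is held out once while the model trains on the remaining folds, which gives one accuracy per fold; the usual summary is the mean of those per-fold accuracies (often with their spread). A numpy-only sketch, where fit_predict is a hypothetical callable standing in for training the network and predicting labels (this is not an existing Panotti function):

```python
import numpy as np


def kfold_accuracies(X, y, fit_predict, k=5, seed=0):
    """Return the per-fold accuracies and their mean for k-fold cross-validation.

    fit_predict(X_train, y_train, X_test) -> predicted labels for X_test
    (hypothetical interface; plug in your own training code).
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, k)
    accs = []
    for i, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        y_pred = fit_predict(X[train_idx], y[train_idx], X[test_idx])
        accs.append(float(np.mean(y_pred == y[test_idx])))
    return accs, float(np.mean(accs))
```

When all folds have the same size, this mean equals the accuracy pooled over every held-out prediction.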

error in train_network.py

C:\Users\HHH\Desktop\audio\panotti-master\venv\Scripts\python.exe C:/Users/HHH/Desktop/audio/panotti-master/train_network.py
['C:\Users\HHH\Desktop\audio\panotti-master', 'C:\Users\HHH\Desktop\audio\panotti-master', 'C:\Users\HHH\AppData\Local\Programs\Python\Python38\python38.zip', 'C:\Users\HHH\AppData\Local\Programs\Python\Python38\DLLs', 'C:\Users\HHH\AppData\Local\Programs\Python\Python38\lib', 'C:\Users\HHH\AppData\Local\Programs\Python\Python38', 'C:\Users\HHH\Desktop\audio\panotti-master\venv', 'C:\Users\HHH\Desktop\audio\panotti-master\venv\lib\site-packages']
3.8.8 (tags/v3.8.8:024d805, Feb 19 2021, 13:18:16) [MSC v.1928 64 bit (AMD64)]
2021-08-13 22:19:49.157866: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'cudart64_110.dll'; dlerror: cudart64_110.dll not found
2021-08-13 22:19:49.158224: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
Traceback (most recent call last):
File "C:/Users/HHH/Desktop/audio/panotti-master/train_network.py", line 17, in <module>
from panotti.models import *
File "C:\Users\HHH\Desktop\audio\panotti-master\panotti\models.py", line 21, in <module>
from keras.optimizers import SGD, Adam
ImportError: cannot import name 'SGD' from 'keras.optimizers' (C:\Users\HHH\Desktop\audio\panotti-master\venv\lib\site-packages\keras\optimizers.py)

Process finished with exit code 1

eval_network.py error

Using TensorFlow backend.
class_names = ['music_wav', 'speech_wav']
total files = 26 , going to load total_load = 0
total files = 26 , going to load total_load = 0
get_sample_dimensions: bagpipe.wav.npz: melgram.shape = (1, 96, 2584, 1)
melgram dimensions: (1, 96, 2584, 1)

Loading class 1/2: 'music_wav', File 1/13: Preproc/Test/music_wav/bagpipe.wav.npz
Traceback (most recent call last):
File "eval_network.py", line 135, in <module>
eval_network(weights_file=args.weights, classpath=args.classpath, batch_size=args.batch_size)
File "eval_network.py", line 63, in eval_network
X_test, Y_test, paths_test, class_names = build_dataset(path=classpath, batch_size=batch_size)
File "panotti\datautils.py", line 235, in build_dataset
X[load_count,:,:] = melgram
IndexError: index 0 is out of bounds for axis 0 with size 0

How do I iterate over an audio file to not miss a part that fits a class?

I have an audio file that contains a part that matches a class I trained, for instance the letter R in a speech.

My idea is to pick an arbitrary length, like 20 ms, split the audio file into 20 ms intervals, send each one to predict_class, and take the part where the probability for my class is the highest. Yet with this method the wanted sound could fall right on the boundary between two intervals, it could be stretched (longer than in the original file), etc.

How do I iterate over the audio file to not miss it?
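A common way to avoid a target sound straddling a window boundary is to use overlapping windows: when the hop between window starts is smaller than the window length, any event no longer than (window minus hop) lies entirely inside at least one window. A numpy sketch; sliding_windows and the window/hop values are illustrative, and each frame would still need to be turned into a spectrogram and passed to the classifier (not shown).

```python
import numpy as np


def sliding_windows(y, sr, win_s=0.5, hop_s=0.1):
    """Split a 1-D signal into overlapping windows.

    Returns (frames, start_times): frames has shape (n_windows, win_len);
    the last window is zero-padded so the end of the file is covered.
    """
    win = int(round(win_s * sr))
    hop = int(round(hop_s * sr))
    if len(y) <= win:
        n_windows = 1
    else:
        n_windows = 1 + int(np.ceil((len(y) - win) / hop))
    padded = np.zeros((n_windows - 1) * hop + win, dtype=float)
    padded[: len(y)] = y
    frames = np.stack([padded[i * hop: i * hop + win] for i in range(n_windows)])
    starts = np.arange(n_windows) * hop / sr
    return frames, starts
```

A smaller hop gives finer time resolution at the cost of more predictions. Events that are stretched relative to the training clips are not fixed by windowing; that is usually addressed at training time, e.g. with speed-changing augmentation.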

Accuracy too good to be true?

Hi.

I've been recording two different R/C boats while driving them around. That's two classes, plus one more class that is just silence. I removed the parts of the original boat recordings that were basically silent, chopped the rest into 1-second segments, and made 4 augmentations of each file.

Each class consists of about 3000 1-second clips (including the augmentations), and when training with a batch size of 20 for 2000 epochs, I get 99.3% accuracy, with 14 mistakes out of 2160 attempts.

That doesn't seem right to me, and I would love feedback from anyone with some experience, since I'm totally new to this.

Thank you, Scott, for creating panotti.
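One thing worth checking when accuracy looks too good: if augmented copies of a clip (or neighbouring 1-second chunks of the same recording) land in both Train/ and Test/, the test set is no longer independent, because the network has effectively already seen near-duplicates of the test clips. Below is a sketch of a group-aware split that keeps every augmented copy with its source file. It assumes augmented names differ from the original only by an "_aug<N>" marker (the README only says "some kind of _augX" is appended, so adjust the pattern to your files); chunks cut from one recording would need similar grouping by recording. The function names are hypothetical.

```python
import random
import re
from collections import defaultdict


def group_key(filename):
    """Source-file key: drop any '_aug<N>' marker (assumed naming convention)."""
    return re.sub(r"_aug\d+", "", filename)


def group_split(filenames, test_frac=0.2, seed=0):
    """Split file names so all augmented copies of a source file share one side."""
    groups = defaultdict(list)
    for name in filenames:
        groups[group_key(name)].append(name)
    keys = sorted(groups)
    random.Random(seed).shuffle(keys)
    n_test = max(1, int(round(test_frac * len(keys))))
    test_keys = set(keys[:n_test])
    train = [n for k in keys if k not in test_keys for n in groups[k]]
    test = [n for k in keys if k in test_keys for n in groups[k]]
    return train, test
```
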

eval_network error

File "D:/Anaconda3/Lib/site-packages/panotti-master/eval_network.py", line 141, in <module>
eval_network(weights_file=args.weights, classpath=args.classpath, batch_size=args.batch_size)

File "D:/Anaconda3/Lib/site-packages/panotti-master/eval_network.py", line 91, in eval_network
fpr[i], tpr[i], _ = roc_curve(Y_test[:, i], y_scores[:, i])

File "D:\Anaconda3\lib\site-packages\sklearn\metrics\ranking.py", line 622, in roc_curve
y_true, y_score, pos_label=pos_label, sample_weight=sample_weight)

File "D:\Anaconda3\lib\site-packages\sklearn\metrics\ranking.py", line 396, in _binary_clf_curve
raise ValueError("{0} format is not supported".format(y_type))

ValueError: continuous format is not supported

AttributeError: 'module' object has no attribute 'cpu_count' when running preprocess_data.py

Thank you for the nice project, Professor.
I get the following error when running preprocess_data.py:
Traceback (most recent call last):
File "preprocess_data.py", line 152, in <module>
preprocess_dataset(resample=44100, already_split=args.already, sequential=args.sequential, mono=args.mono)
File "preprocess_data.py", line 109, in preprocess_dataset
pool = Pool(os.cpu_count())
AttributeError: 'module' object has no attribute 'cpu_count'

I googled and found tensorflow/tensorflow#6513,
but nothing changed even after trying:

pip install --user --upgrade psutil
Collecting psutil
Installing collected packages: psutil
Successfully installed psutil-5.4.3
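os.cpu_count() was added in Python 3.4, so this error means the script is running under an older Python (the old-style "'module' object has no attribute" wording suggests Python 2). Upgrading psutil cannot change that; running the script with a Python 3 interpreter (the README lists Python 3.5) should. If a fallback is wanted anyway, multiprocessing.cpu_count() exists on both Python 2 and 3. A sketch with a hypothetical worker_count helper:

```python
import multiprocessing
import os


def worker_count():
    """Number of worker processes to use; works on Python 2 and 3."""
    count = getattr(os, "cpu_count", None)   # os.cpu_count only exists on Python >= 3.4
    n = count() if count is not None else None
    return n or multiprocessing.cpu_count()  # os.cpu_count() may also return None
```

The call in preprocess_data.py would then read pool = Pool(worker_count()).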

Different training accuracy every time I run the code: how do I decide on the best accuracy? How many times should I run the code?
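Some run-to-run variation is expected: network weights are initialized randomly and training batches are typically shuffled, so each run ends somewhere slightly different. Rather than keeping the best run, which tends to overstate how the model will do on new data, a common practice is to train a few times with different seeds and report the mean and spread. A sketch, where train_and_evaluate(seed) is a hypothetical callable that trains once and returns the test accuracy:

```python
import statistics


def summarize_runs(train_and_evaluate, seeds=(0, 1, 2, 3, 4)):
    """Train once per seed and return (accuracies, mean, sample standard deviation)."""
    accs = [train_and_evaluate(seed) for seed in seeds]
    spread = statistics.stdev(accs) if len(accs) > 1 else 0.0
    return accs, statistics.mean(accs), spread
```
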

eval_network.py gets classes muddled up

Hello again! I have made some progress on a network that distinguishes snares from kicks. It seems to be doing pretty well after about 300 epochs of training.

I am a bit confused by this output of eval_network.py, though, as some of the 13 kick-class "mistakes" are clearly snares that, to me, look like they have been correctly identified as snares. Any idea what's going on there? I have zipped up the weights and the Preproc/Test directory if you want to have a go yourself.

Counting mistakes 
    Found 24 total mistakes out of 1301 attempts
      Mistakes by class: 
          class 'kick': 13
                   Preproc/Test/kick/Elektron__Elektron_MachineDrum_SPS1_MKII__Kicks_0024_padded.wav.npy: should be kick but came out as snare
                   Preproc/Test/kick/Alesis__Alesis_Performance-Pad__Kicks__Kick_29_padded.wav.npy      : should be kick but came out as snare
                   Preproc/Test/kick/Roland__Roland_MC-505__BassDrum__BassDrum_026_padded.wav.npy       : should be kick but came out as snare
                   Preproc/Test/snare/EKO__EKO_Ritmo-12__SnareDrum1_padded.wav.npy                      : should be kick but came out as snare
                   Preproc/Test/snare/Alesis__Alesis_Performance-Pad__Snares__Snare_11_padded.wav.npy   : should be kick but came out as snare
                   Preproc/Test/snare/Yamaha__Yamaha_RY-10__SnareDrum_40_padded.wav.npy                 : should be kick but came out as snare
                   Preproc/Test/kick/Kawai__Kawai_R-50e__BassDrum1_Elec_padded.wav.npy                  : should be kick but came out as snare
                   Preproc/Test/kick/Novation__Novation_Nova__BassDrum2_padded.wav.npy                  : should be kick but came out as snare
                   Preproc/Test/kick/SoundMaster__SoundMaster_SM-8__Kick_08_padded.wav.npy              : should be kick but came out as snare
                   Preproc/Test/snare/Yamaha__Yamaha_SU700__Snare__SnareDrum_26_padded.wav.npy          : should be kick but came out as snare
                   Preproc/Test/snare/Alesis__Alesis_Performance-Pad__Snares__Snare_11_padded.wav.npy   : should be kick but came out as snare
                   Preproc/Test/snare/Korg__Korg_Wavestation__Snare_2_padded.wav.npy                    : should be kick but came out as snare
                   Preproc/Test/snare/Kay__Kay_DRM1__SnareDrum6_padded.wav.npy                                : should be kick but came out as snare
          class 'snare': 11
                   Preproc/Test/snare/Vermona__Vermona_DRM1-MK2__Snare__Snare2_padded.wav.npy: should be snare but came out as kick
                   Preproc/Test/snare/Alesis__Alesis_Performance-Pad__Snares__Snare_75_padded.wav.npy   : should be snare but came out as kick
                   Preproc/Test/snare/Pearl__Pearl_Drum-X__Snare_15_padded.wav.npy                      : should be snare but came out as kick
                   Preproc/Test/snare/Yamaha__Yamaha_PS-1__SnareDrum2_padded.wav.npy                    : should be snare but came out as kick
                   Preproc/Test/snare/Roland__Roland_MC-909__Snare30_padded.wav.npy                     : should be snare but came out as kick
                   Preproc/Test/snare/Yamaha__Yamaha_CS15D__Snare_2_padded.wav.npy                      : should be snare but came out as kick
                   Preproc/Test/snare/Akai__Akai_XR-20__Snare__Snare_180_padded.wav.npy                 : should be snare but came out as kick
                   Preproc/Test/snare/Roland__Roland_MC-909__Snare31_padded.wav.npy                     : should be snare but came out as kick
                   Preproc/Test/snare/Roland__Roland_TR-909__Set2__SNARE26_padded.wav.npy               : should be snare but came out as kick
                   Preproc/Test/snare/Korg__Korg_DS-10__Snare5_padded.wav.npy                           : should be snare but came out as kick
                   Preproc/Test/snare/Electro-Harmonix__Electro-Harmonix_DRM-15__SnareDrum_rim2_padded.wav.npy: should be snare but came out as kick
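In this listing, several files under Preproc/Test/snare/ are reported as "should be kick", which suggests the printed paths are out of step with the labels rather than the network mixing up the classes. One common cause of that symptom (not confirmed here for eval_network.py) is shuffling X and Y with one permutation while leaving the list of file paths in its original order. A sketch of shuffling all three together, with a hypothetical shuffle_together helper:

```python
import numpy as np


def shuffle_together(X, Y, paths, seed=0):
    """Shuffle samples, labels and file paths with the *same* permutation."""
    perm = np.random.default_rng(seed).permutation(len(paths))
    return X[perm], Y[perm], [paths[i] for i in perm]
```

A quick sanity check after any shuffling is that each path's parent directory matches the true class reported for it.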

Error after running train_network.py, but not always

Hi, I get an error after running train_network.py, but not every time.
I trained on my graphics card.
Each class always has 40 npy files.
If I use 2 classes (80 npy files), everything works correctly.
If I use 4 classes (160 npy files), I get an error (I think it's a memory problem, but I'm not sure).
This is errorlist: https://pastebin.com/iFTNVEQn

My laptop specification: OS: Windows 10
Graphics card: Nvidia GTX 950m 2GB
RAM: 8GB DDR4

Ultimately I want to use 10 classes, but at the moment only 2 classes work.

Please help me.

Negative dimension size error with Conv2D when using KB6 drum samples

Hey, thanks very much for this work. I am just dabbling in machine learning a bit so may be doing something very stupid.

I downloaded drum samples from KB6 and am trying to classify them into snare or kick.
I padded them with silence so they are all the same length, and made them all mono
I ran ./preprocess_data.py
When I run ./train_network.py I run into this error:

tensorflow.python.framework.errors_impl.InvalidArgumentError: Negative dimension size caused by subtracting 3 from 1 for 'Conv2D' (op: 'Conv2D') with input shapes: [?,1,96,924], [3,3,924,32].

Using TensorFlow backend.
class_names =  ['kick', 'snare']
total files =  16 , going to load total_load =  16
   get_sample_dimensions: Access__Access_Virus_-_TI__Access_Virus_TI_-_Cyberworm__BassDrum_01_padded.wav.npy: melgram.shape =  (1, 1, 96, 924)
 melgram dimensions:  (1, 1, 96, 924)

 Loading class 1/2: 'kick', File 1/8: Preproc/Train/kick/Access__Access_Virus_-_TI__ Loading class 1/2: 'kick', File 8/8: Preproc/Train/kick/Access__Access_Virus_-_B__BassDrum_01_padded.wav.npy                  
 Loading class 2/2: 'snare', File 1/8: Preproc/Train/snare/Akai__Akai_MPC500__4._Too Loading class 2/2: 'snare', File 8/8: Preproc/Train/snare/Access__Access_Virus_-_TI__Access_Virus_TI_-_Cyberworm__Snare01_padded.wav.npy                  
Looking for previous weights...
No weights file detected, so starting from scratch.
Making Keras 1 version of model
 MyCNN: X.shape =  (16, 1, 96, 924) , channels =  1
Traceback (most recent call last):
  File "/home/kaspar.emanuel/drums/panotti/venv3/lib/python3.5/site-packages/tensorflow/python/framework/common_shapes.py", line 686, in _call_cpp_shape_fn_impl
    input_tensors_as_shapes, status)
  File "/home/kaspar.emanuel/drums/panotti/venv3/lib/python3.5/site-packages/tensorflow/python/framework/errors_impl.py", line 473, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InvalidArgumentError: Negative dimension size caused by subtracting 3 from 1 for 'Conv2D' (op: 'Conv2D') with input shapes: [?,1,96,924], [3,3,924,32].

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "./train_network.py", line 59, in <module>
    train_network(weights_file=args.weights, classpath=args.classpath)
  File "./train_network.py", line 31, in train_network
    model = make_model(X_train, class_names, no_cp_fatal=False, weights_file=weights_file)
  File "/home/kaspar.emanuel/drums/panotti/panotti/models.py", line 125, in make_model
    model = MyCNN(X, nb_classes=len(class_names), nb_layers=nb_layers)
  File "/home/kaspar.emanuel/drums/panotti/panotti/models.py", line 44, in MyCNN
    border_mode='valid', input_shape=input_shape))
  File "/home/kaspar.emanuel/drums/panotti/venv3/lib/python3.5/site-packages/keras/models.py", line 299, in add
    layer.create_input_layer(batch_input_shape, input_dtype)
  File "/home/kaspar.emanuel/drums/panotti/venv3/lib/python3.5/site-packages/keras/engine/topology.py", line 401, in create_input_layer
    self(x)
  File "/home/kaspar.emanuel/drums/panotti/venv3/lib/python3.5/site-packages/keras/engine/topology.py", line 572, in __call__
    self.add_inbound_node(inbound_layers, node_indices, tensor_indices)
  File "/home/kaspar.emanuel/drums/panotti/venv3/lib/python3.5/site-packages/keras/engine/topology.py", line 635, in add_inbound_node
    Node.create_node(self, inbound_layers, node_indices, tensor_indices)
  File "/home/kaspar.emanuel/drums/panotti/venv3/lib/python3.5/site-packages/keras/engine/topology.py", line 166, in create_node
    output_tensors = to_list(outbound_layer.call(input_tensors[0], mask=input_masks[0]))
  File "/home/kaspar.emanuel/drums/panotti/venv3/lib/python3.5/site-packages/keras/layers/convolutional.py", line 475, in call
    filter_shape=self.W_shape)
  File "/home/kaspar.emanuel/drums/panotti/venv3/lib/python3.5/site-packages/keras/backend/tensorflow_backend.py", line 2691, in conv2d
    x = tf.nn.conv2d(x, kernel, strides, padding=padding)
  File "/home/kaspar.emanuel/drums/panotti/venv3/lib/python3.5/site-packages/tensorflow/python/ops/gen_nn_ops.py", line 631, in conv2d
    data_format=data_format, name=name)
  File "/home/kaspar.emanuel/drums/panotti/venv3/lib/python3.5/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/home/kaspar.emanuel/drums/panotti/venv3/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 2958, in create_op
    set_shapes_for_outputs(ret)
  File "/home/kaspar.emanuel/drums/panotti/venv3/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 2209, in set_shapes_for_outputs
    shapes = shape_func(op)
  File "/home/kaspar.emanuel/drums/panotti/venv3/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 2159, in call_with_requiring
    return call_cpp_shape_fn(op, require_shape_fn=True)
  File "/home/kaspar.emanuel/drums/panotti/venv3/lib/python3.5/site-packages/tensorflow/python/framework/common_shapes.py", line 627, in call_cpp_shape_fn
    require_shape_fn)
  File "/home/kaspar.emanuel/drums/panotti/venv3/lib/python3.5/site-packages/tensorflow/python/framework/common_shapes.py", line 691, in _call_cpp_shape_fn_impl
    raise ValueError(err.message)
ValueError: Negative dimension size caused by subtracting 3 from 1 for 'Conv2D' (op: 'Conv2D') with input shapes: [?,1,96,924], [3,3,924,32].
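The kernel shape in the error, [3,3,924,32], suggests TensorFlow read the channels-first batch (16, 1, 96, 924) as channels-last. It treated 924 as the channel count and (1, 96) as the spatial dimensions, so a 3x3 'valid' convolution would need an output height of 1 - 3 + 1 = -1. The sketch below reproduces that arithmetic with NumPy only (no Keras) and shows one possible fix: moving the channel axis to the end. The other option would be to tell Keras to expect channels-first data (in Keras 1, `"image_dim_ordering": "th"` in `~/.keras/keras.json`). Which fix fits Panotti's code is an assumption to verify.

```python
import numpy as np

def valid_conv_out(n, k):
    """Output length of one spatial axis for a 'valid' (unpadded) convolution."""
    return n - k + 1

# Channels-first batch as logged by MyCNN: (batch, channels, mel_bins, frames)
X = np.zeros((16, 1, 96, 924), dtype=np.float32)

# Read as channels-last, the spatial axes would be (1, 96) and 924 the channel count
height_if_channels_last = valid_conv_out(X.shape[1], 3)  # 1 - 3 + 1 = -1, hence the error

# Possible fix: move the channel axis to the end -> (batch, mel_bins, frames, channels)
X_last = np.moveaxis(X, 1, -1)
out_h = valid_conv_out(X_last.shape[1], 3)  # 96 -> 94
out_w = valid_conv_out(X_last.shape[2], 3)  # 924 -> 922

print(height_if_channels_last, X_last.shape, (out_h, out_w))
```

If you transpose the data, the model's `input_shape` must be built from the transposed array too, so both agree on where the channel axis is.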

Here is a sample set of just 10 samples for each category if you would like to try and reproduce the error quickly.

I tried it with audio-classifier-keras-cnn and Python 2 as well, but ran into the same error. I am now running the Physionet example to see what that does, but the preprocessing will take a while (I didn't want to change anything in case that causes errors).

Cross Validation

Excuse me, how do I use cross-validation?
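One common way to do k-fold cross-validation is to split the sample indices into k folds, train a fresh model on k - 1 folds, evaluate it on the held-out fold, and average the k scores. The sketch below only generates the index splits with NumPy. It assumes `X` and `Y` are NumPy arrays with samples on the first axis and does not use any Panotti function, so building and fitting the model for each fold is left to your own training code.

```python
import numpy as np

def kfold_indices(n_samples, n_folds=5, seed=0):
    """Yield (train_idx, val_idx) pairs for shuffled k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)          # shuffle once so folds are random
    folds = np.array_split(order, n_folds)      # near-equal folds even if n_samples % n_folds != 0
    for k in range(n_folds):
        val_idx = folds[k]
        train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        yield train_idx, val_idx

if __name__ == "__main__":
    X = np.random.rand(10, 96, 32, 1)  # hypothetical stand-in for the mel-spectrogram dataset
    for fold, (train_idx, val_idx) in enumerate(kfold_indices(len(X), n_folds=5)):
        # Here you would build a fresh model, fit it on X[train_idx], and evaluate on X[val_idx]
        print(fold, len(train_idx), len(val_idx))
```

With strongly unbalanced classes, a stratified split (drawing each fold per class) keeps the class proportions similar across folds.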

iOS implementation

Hi there,

First off, thanks for the classification model! I was wondering if you had any ideas about how to implement this in Swift. I believe I successfully converted the HDF5 file to an MLModel, but it expects an MLMultiArray as input. Any ideas about how to convert an audio stream to such an array? Thanks in advance!

Training and Validation accuracy

Is it fine if the training and validation loss decrease, then increase, then decrease again?
How can I tell whether my model is overfitting or underfitting?
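Some epoch-to-epoch fluctuation in validation loss is normal, especially with small validation sets. What usually signals overfitting is a sustained pattern: training loss keeps falling while validation loss stays above its best value for several epochs. Underfitting usually shows as both losses staying high. The sketch below encodes that rule of thumb with a hypothetical `patience` threshold; it is a heuristic, not a statistical test. The loss lists would come from whatever your training run records (with Keras, `history.history['loss']` and `history.history['val_loss']` from the object returned by `model.fit`).

```python
def diagnose(train_loss, val_loss, patience=2):
    """Rule-of-thumb overfitting check from per-epoch loss curves."""
    best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__)
    epochs_since_best = len(val_loss) - 1 - best_epoch
    # Is training loss still improving after the validation optimum?
    train_still_falling = train_loss[-1] < train_loss[best_epoch]
    overfitting = epochs_since_best >= patience and train_still_falling
    return best_epoch, overfitting

# Sustained divergence: validation loss rises for two epochs while training loss keeps falling
print(diagnose([1.0, 0.8, 0.6, 0.4, 0.3], [1.0, 0.85, 0.8, 0.9, 1.0]))
# Down-up-down: a single bump, and validation loss ends at its best value
print(diagnose([1.0, 0.8, 0.7, 0.6, 0.5], [1.0, 0.9, 0.95, 0.85, 0.8]))
```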

Librosa not installing due to llvmlite issue

Hi, I'd like your suggestions on how to resolve this issue:
Building wheels for collected packages: llvmlite
Building wheel for llvmlite (setup.py) ... error

Running setup.py clean for llvmlite
Failed to build llvmlite

Attempting uninstall: llvmlite
Found existing installation: llvmlite 0.23.0
Uninstalling llvmlite-0.23.0:
Successfully uninstalled llvmlite-0.23.0
Running setup.py install for llvmlite ... error

AttributeError: 'MultiGPUModelCheckpoint' object has no attribute 'on_train_batch_begin'

How can I solve this error?
Thanks in advance!
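This error often appears when a callback written against standalone Keras is handed to `tf.keras` training, which calls batch-level hooks such as `on_train_batch_begin` that older `keras.callbacks.Callback` classes did not define. The usual remedy is to make the callback subclass `Callback`/`ModelCheckpoint` from the same Keras package the model uses (all `tensorflow.keras`, or all standalone `keras`). As a stopgap, a mixin like the hypothetical `BatchHookShim` below adds the missing hooks and forwards the training ones to the legacy `on_batch_begin`/`on_batch_end`. It is plain Python with stand-in classes, and whether Panotti's `MultiGPUModelCheckpoint` needs anything beyond these hooks is untested.

```python
class BatchHookShim:
    """Hypothetical mixin adding batch hooks that newer tf.keras calls on every callback."""

    def _forward(self, name, batch, logs):
        # Call the legacy hook if the wrapped callback defines one
        hook = getattr(self, name, None)
        if callable(hook):
            hook(batch, logs)

    def on_train_batch_begin(self, batch, logs=None):
        self._forward("on_batch_begin", batch, logs)

    def on_train_batch_end(self, batch, logs=None):
        self._forward("on_batch_end", batch, logs)

    # Evaluation/prediction batch hooks have no legacy equivalent, so they are no-ops
    def on_test_batch_begin(self, batch, logs=None):
        pass

    def on_test_batch_end(self, batch, logs=None):
        pass

    def on_predict_batch_begin(self, batch, logs=None):
        pass

    def on_predict_batch_end(self, batch, logs=None):
        pass


class LegacyCheckpoint:
    """Stand-in for an old-style callback that only defines on_batch_begin/on_batch_end."""

    def __init__(self):
        self.calls = []

    def on_batch_begin(self, batch, logs=None):
        self.calls.append(("begin", batch))

    def on_batch_end(self, batch, logs=None):
        self.calls.append(("end", batch))


class PatchedCheckpoint(BatchHookShim, LegacyCheckpoint):
    """The shim comes first in the MRO so its hooks are found."""


cb = PatchedCheckpoint()
cb.on_train_batch_begin(0)
cb.on_train_batch_end(0)
cb.on_test_batch_begin(0)
print(cb.calls)
```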

Dimensions of feature

Hi,
Could anyone explain this?
I am unable to understand the dimensions of the feature vector.
It is 94x31x32.
I believe the 32 value is one channel, since we have 96 Mel bins.
I chose audio length = 1 sec
Sampling frequency = 16 kHz
Mel bins = 96
I didn't change the FFT value. Is it 256 by default?

Thanks for your time.
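A likely reading, assuming 94x31x32 is the output shape of the first 3x3 'valid' Conv2D layer: the last axis (32) is the number of convolution filters (the traceback earlier on this page shows a kernel of shape [3,3,924,32], i.e. 32 filters), the first axis is 96 Mel bins - 3 + 1 = 94, and the middle axis comes from the number of STFT time frames minus 2. The sketch below computes those numbers under librosa's default centered framing, which yields 1 + n_samples // hop_length frames regardless of the FFT size. The hop lengths tried are hypothetical; the one Panotti's preprocessing actually uses may differ, so the time-axis value may not match exactly.

```python
def librosa_frames(n_samples, hop_length):
    """Frame count for a centered STFT (librosa's default center=True)."""
    return 1 + n_samples // hop_length

def valid_conv_shape(height, width, kernel=3, nb_filters=32):
    """(height, width, filters) after one unpadded square convolution."""
    return (height - kernel + 1, width - kernel + 1, nb_filters)

sr = 16000           # sampling rate from the question
n_samples = sr * 1   # 1-second clips
for hop in (256, 512):  # hypothetical hop lengths; check Panotti's preprocessing for the real one
    frames = librosa_frames(n_samples, hop)
    print(hop, frames, valid_conv_shape(96, frames))
```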

How do I use augment_data?

How should I handle a sample whose melgram is smaller than the others, so that it can't be broadcast into the dataset array?

ERROR: mel_dims = (1, 1, 96, 156) , melgram.shape = (1, 1, 96, 147)

I can't pad it to fix this. Do you have a utility function to fix this?
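One way to make every melgram fit the dataset array is to pad short ones (and crop long ones) along the time axis to the target frame count from `mel_dims`. The sketch below is a hypothetical helper, not a Panotti utility, and assumes time is the last axis, as in the shapes in the error message. librosa also ships `librosa.util.fix_length`, which pads or trims along one axis in a similar way.

```python
import numpy as np

def fit_time_axis(melgram, target_frames, axis=-1, pad_value=0.0):
    """Pad or crop `melgram` along `axis` so it has exactly `target_frames` entries."""
    n = melgram.shape[axis]
    if n >= target_frames:
        # Too long (or already right): keep the first target_frames frames
        return np.take(melgram, np.arange(target_frames), axis=axis)
    pad_width = [(0, 0)] * melgram.ndim
    pad_width[axis] = (0, target_frames - n)  # pad only at the end of the time axis
    return np.pad(melgram, pad_width, mode="constant", constant_values=pad_value)

mel_dims = (1, 1, 96, 156)                   # target shape from the error message
melgram = np.random.rand(1, 1, 96, 147)      # the short sample
fixed = fit_time_axis(melgram, mel_dims[-1])
print(fixed.shape)
```

If your melgrams are log-scaled (dB), padding with their minimum value (`pad_value=melgram.min()`) may represent silence better than zeros.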

GTZAN dataset

I am trying to train on the GTZAN dataset. The accuracy is around 50%, which seems very low. I have a question: the mel spectrogram has dimensions of 96x2584 for each audio sample. Should I use the whole sample as one "image" for the CNN, or do I need to divide the audio file into chunks of, e.g., 2048 and run the CNN on those?
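Splitting each track's spectrogram into shorter fixed-length windows is a common approach for 30-second GTZAN clips. Each window inherits the track's label, which multiplies the number of training examples, and at test time the per-window predictions can be averaged or voted per track. The sketch below slices a (mel_bins, frames) array into windows along the time axis; the 128-frame window is an arbitrary illustrative choice, not a recommendation from Panotti.

```python
import numpy as np

def chunk_spectrogram(melgram, chunk_frames, hop_frames=None):
    """Slice a (..., frames) spectrogram into windows of chunk_frames along the last axis."""
    hop = hop_frames or chunk_frames  # non-overlapping windows by default
    n = melgram.shape[-1]
    if n < chunk_frames:
        raise ValueError(f"spectrogram has {n} frames, fewer than chunk_frames={chunk_frames}")
    starts = range(0, n - chunk_frames + 1, hop)
    # Leftover frames at the end that don't fill a whole window are dropped
    return np.stack([melgram[..., s:s + chunk_frames] for s in starts])

melgram = np.random.rand(96, 2584)   # shape from the question
chunks = chunk_spectrogram(melgram, 128)
print(chunks.shape)
```

Split training and validation data by track rather than by window; otherwise windows from the same song end up on both sides and inflate validation accuracy.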

Not a bug: feedback from using Panotti on a Windows machine, for others' benefit

Hi, I discovered this repo a few days ago; it looks like very nice work!

I work on Windows and am not a great Python programmer, but I'm very interested in voice.

So far, the main issues I've found are related to paths. Python on Windows doesn't handle relative paths well, so for now I'm trying to "make it work" by rewriting paths as absolute rather than relative.
I also had to download ffmpeg and add its location to the PATH environment variable for librosa to work with MP3 files.
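For the path problems, one portable approach is to build paths with `pathlib` and resolve them to absolute paths before passing them on, and to check up front that `ffmpeg` is on `PATH`. The helper names and the folder name below are hypothetical, not part of Panotti; `Path.resolve()` and `shutil.which()` are standard-library calls.

```python
import shutil
from pathlib import Path

def absolute_path(relative, base=None):
    """Resolve `relative` against `base` (default: current directory) to an absolute path."""
    base_dir = Path(base) if base is not None else Path.cwd()
    return (base_dir / relative).resolve()

def ffmpeg_available():
    """True if an ffmpeg executable can be found on PATH."""
    # The poster needed ffmpeg on PATH for librosa to read MP3 files
    return shutil.which("ffmpeg") is not None

data_dir = absolute_path(Path("Preproc") / "Train")  # hypothetical folder name
print(data_dir, data_dir.is_absolute(), ffmpeg_available())
```

`pathlib` joins paths with the correct separator on each OS, which avoids hand-written `/` or `\` strings.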

Memory Error during train

Hi, thanks for your amazing work!

I'm facing an issue when trying to train.
I'm training on 99 classes with a total of 18000 samples, but I'm running into a MemoryError.
Here is the traceback:
Using TensorFlow backend.
class_names = ['Acoustic_guitar', 'Airplane', 'Applause', 'Bark', 'Bass_drum', 'Bass_guitar', 'Breathing', 'Brushing_teeth', 'Burping_or_eructation', 'Bus', 'Can_opening', 'Car_horn', 'Cat', 'Cello', 'Chainsaw', 'Chime', 'Chirping_birds', 'Church_bells', 'Clapping', 'Clarinet', 'Clock_alarm', 'Clock_tick', 'Coin_(dropping)', 'Computer_keyboard', 'Cough', 'Coughing', 'Cow', 'Cowbell', 'Crackling_fire', 'Crash_cymbal', 'Crickets', 'Crow', 'Crying_baby', 'Dishes_and_pots_and_pans', 'Dog', 'Door_wood_creaks', 'Door_wood_knock', 'Double_bass', 'Drawer_open_or_close', 'Drinking_sipping', 'Electric_piano', 'Engine', 'Fart', 'Finger_snapping', 'Fire', 'Fireworks', 'Flute', 'Footsteps', 'Frog', 'Glass', 'Glass_breaking', 'Glockenspiel', 'Gong', 'Gunshot', 'Hand_saw', 'Harmonica', 'Helicopter', 'Hen', 'Hi-hat', 'Insects', 'Keyboard_typing', 'Keys_jangling', 'Knock', 'Laughing', 'Laughter', 'Meow', 'Microwave_oven', 'Mouse_click', 'Oboe', 'Piano', 'Pig', 'Pouring_water', 'Rain', 'Rooster', 'Saxophone', 'Scissors', 'Sea_waves', 'Shatter', 'Sheep', 'Siren', 'Slam', 'Snare_drum', 'Sneezing', 'Snoring', 'Squeak', 'Tambourine', 'Tearing', 'Telephone', 'Thunderstorm', 'Toilet_flush', 'Train', 'Trumpet', 'Vacuum_cleaner', 'Violin_or_fiddle', 'Walk_or_footsteps', 'Washing_machine', 'Water_drops', 'Wind', 'Writing']
total files = 18001 , going to load total_load = 18000
total files = 18001 , going to load total_load = 18000
get_sample_dimensions: 41514.wav.npz: melgram.shape = (1, 96, 2586, 2)
melgram dimensions: (1, 96, 2586, 2)
Traceback (most recent call last):
  File "/home/guardian/.vscode/extensions/ms-python.python-2019.5.18875/pythonFiles/ptvsd_launcher.py", line 43, in <module>
    main(ptvsdArgs)
  File "/home/guardian/.vscode/extensions/ms-python.python-2019.5.18875/pythonFiles/lib/python/ptvsd/__main__.py", line 434, in main
    run()
  File "/home/guardian/.vscode/extensions/ms-python.python-2019.5.18875/pythonFiles/lib/python/ptvsd/__main__.py", line 312, in run_file
    runpy.run_path(target, run_name='__main__')
  File "/usr/lib/python3.6/runpy.py", line 263, in run_path
    pkg_name=pkg_name, script_name=fname)
  File "/usr/lib/python3.6/runpy.py", line 96, in _run_module_code
    mod_name, mod_spec, pkg_name, script_name)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/guardian/Desktop/audio-classification/panotti/train_network.py", line 62, in <module>
    val_split=args.val, tile=args.tile)
  File "/home/guardian/Desktop/audio-classification/panotti/train_network.py", line 29, in train_network
    X_train, Y_train, paths_train, class_names = build_dataset(path=classpath, batch_size=batch_size, tile=tile)
  File "/home/guardian/Desktop/audio-classification/panotti/panotti/datautils.py", line 217, in build_dataset
    X = np.zeros((total_load, mel_dims[1], mel_dims[2], mel_dims[3]))
MemoryError

Do samples have to be of a maximum length (e.g., 5 seconds)?
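The failing line allocates one dense array for the whole dataset, and `np.zeros` defaults to float64 (8 bytes per value). Plugging in the shape from the log, (total_load, 96, 2586, 2) with total_load = 18000, shows why it fails on most machines. The sketch below only does that arithmetic; it assumes the array really is allocated at full size as the traceback line suggests.

```python
import numpy as np

def array_bytes(shape, dtype=np.float64):
    """Bytes needed for a dense NumPy array of this shape and dtype."""
    n = 1
    for dim in shape:
        n *= dim
    return n * np.dtype(dtype).itemsize

# Shape that build_dataset tries to allocate, from the log: (total_load, 96, 2586, 2)
shape = (18000, 96, 2586, 2)
gb64 = array_bytes(shape) / 1e9                  # np.zeros defaults to float64
gb32 = array_bytes(shape, np.float32) / 1e9      # half as much with float32
print(f"float64: {gb64:.1f} GB, float32: {gb32:.1f} GB")
```

This prints roughly 71.5 GB for float64 and 35.7 GB for float32. The requirement scales linearly with the number of frames per clip (2586 here), so shorter clips, fewer samples loaded at once, or float32 storage all reduce it proportionally.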

IndexError: index 0 is out of bounds for axis 0 with size 0

Hi Scotthawley,

I have been trying to run train_network.py for a few days, but I still get the error message "IndexError: index 0 is out of bounds for axis 0 with size 0".
It happens in panotti\datautils.py at lines 237 and 238, and I haven't managed to solve it even after debugging many times:
X[load_count,:,:] = melgram
Y[load_count,:] = this_Y
I used datasets of drums and guitars; I'm not sure whether that affects the training process. I'd appreciate any advice.

Best regards
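NumPy raises exactly this message when you assign to row 0 of an array whose first axis has length 0, so a likely (unconfirmed) cause is that `X` and `Y` were allocated with `total_load == 0`, for example because the class path pointed at a folder with no preprocessed files. The sketch below reproduces the error and shows a hypothetical guard that fails early with a clearer message; `allocate_dataset` is not a Panotti function.

```python
import numpy as np

# Reproduce: an array with zero rows cannot be assigned at index 0
X = np.zeros((0, 96, 32))
try:
    X[0, :, :] = np.ones((96, 32))
except IndexError as err:
    message = str(err)   # "index 0 is out of bounds for axis 0 with size 0"

def allocate_dataset(file_paths, mel_dims, nb_classes):
    """Hypothetical guard: refuse to allocate when no input files were found."""
    if len(file_paths) == 0:
        raise ValueError("No preprocessed files found; check the path passed to training.")
    X = np.zeros((len(file_paths),) + tuple(mel_dims[1:]), dtype=np.float32)
    Y = np.zeros((len(file_paths), nb_classes), dtype=np.float32)
    return X, Y
```

Printing the number of files found for each class before allocation is a quick way to confirm or rule out this cause.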

preprocess_data out of memory

Hi, I'm trying to preprocess 12 classes with approximately 400 samples in each, ~4500 samples in total. Each is a mono WAV file, 2 seconds long. When I do this, I run out of memory and Python crashes.

I tried increasing the page file from 5 GB to 10 GB, but it still crashes.
Running on Windows 10, i5-8250U, 8 GB RAM.
Any help appreciated :)

Logfile: https://pastebin.com/PAEvzUpP

Filters coefficients

I want to ask you: what kind of filters can we use for Conv2D? I see only 3x3 (that's the dimension), but I don't know what parameters or coefficients this matrix has. Or, when I just give Conv2D the number of filters and the filter dimensions, does it choose random coefficients for this matrix? What I've done so far: generate spectrograms and save them as data (matrices). But I'm confused about these filters; it's the only thing I cannot understand. Can you give me some information about it?
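When you give Keras's `Conv2D` only the number of filters and the kernel size, the kernel coefficients start out random (the Keras 2 default `kernel_initializer` is `'glorot_uniform'`) and are then learned by backpropagation during training; you do not design them by hand. The NumPy sketch below mimics that initialization for one 3x3 layer with 32 filters and applies a single filter to a single-channel input with 'valid' padding, which is the sliding dot product (cross-correlation) a Conv2D layer computes per filter. It illustrates the idea and is not Keras's implementation.

```python
import numpy as np

def glorot_uniform_kernel(kh, kw, in_ch, out_ch, rng):
    """Random kernel drawn from U(-limit, limit), limit = sqrt(6 / (fan_in + fan_out))."""
    fan_in = kh * kw * in_ch
    fan_out = kh * kw * out_ch
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(kh, kw, in_ch, out_ch))

def conv2d_valid(image, kernel2d):
    """Slide one 2D filter over a single-channel image without padding."""
    kh, kw = kernel2d.shape
    H, W = image.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Weighted sum of the 3x3 patch under the filter
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel2d)
    return out

rng = np.random.default_rng(0)
kernel = glorot_uniform_kernel(3, 3, 1, 32, rng)   # 32 filters of size 3x3 on 1 input channel
spectrogram = rng.random((96, 32))                 # hypothetical 96-bin, 32-frame melgram
feature_map = conv2d_valid(spectrogram, kernel[:, :, 0, 0])  # output of filter 0
print(kernel.shape, feature_map.shape)
```

During training, the optimizer updates all 3 x 3 x 1 x 32 kernel values (plus biases) so that the resulting feature maps help the classifier; the initial random values only break symmetry between filters.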