mdenil / dropout
A Theano implementation of Hinton's dropout.
License: MIT License
Theano implementation of dropout. See: http://arxiv.org/abs/1207.0580

Run with ./mlp.py dropout for dropout, or ./mlp.py backprop for regular backprop with no dropout.

Use ./plot_results.sh results.png to visualize the results.

Based on code from:
- http://deeplearning.net/tutorial/mlp.html
- http://deeplearning.net/tutorial/logreg.html

Use the data here to make the units of the results comparable to Hinton's paper:
- http://www.cs.ox.ac.uk/people/misha.denil/hidden/mnist_batches.npz
I didn't find the code that scales the model parameters by p during testing, as the original paper suggests. Is this correct, or am I missing something?
Hi, I am just a little confused about the setup. In Hinton's paper you need to resample the dropout units on each presentation, but this code seems to fix the dropout units at the beginning and never resample them. Am I missing something here? Thanks in advance.
-Lin
Great initiative implementing the dropout technique in Theano! But is there any reason why the mlp.py code doesn't use biases for its neurons?
I'm not clear on the difference between the dropout and backprop modes. Even in backprop mode, are dropout layers used? What exactly is the difference between the two, and which performs better?
Hi,
Thanks for this code.
My question about the code is similar to @droid666's: why is W set by the formula W = layer.W / (1 - dropout_rates[layer_counter]) at test time, rather than W = layer.W?
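For what it's worth, here is a plain-numpy sketch (variable names are mine, not from mlp.py) of why the retain probability shows up at test time: averaging the masked pre-activation over many dropout masks matches scaling the weights by the retain probability (1 - p), which is what the test-time network approximates.

```python
import numpy as np

rng = np.random.default_rng(0)
p_drop = 0.5                        # probability of dropping a unit
W = rng.normal(size=(4, 3))         # toy weight matrix (4 inputs, 3 units)
x = rng.normal(size=4)              # toy input

# Monte Carlo: average the masked pre-activation over many dropout masks.
masks = (rng.random((200000, 4)) > p_drop).astype(float)
mc_mean = ((x * masks) @ W).mean(axis=0)

# Test-time approximation: scale the weights by the retain probability.
test_scaled = x @ (W * (1.0 - p_drop))

# The two agree up to Monte Carlo noise.
assert np.allclose(mc_mean, test_scaled, atol=0.02)
```

Whether a given implementation multiplies by (1 - p) at test time or divides by (1 - p) at training time ("inverted dropout") is a convention; the expected activations come out the same.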
Regardless of the width of the hidden layers, it seems that if I have more than 3 hidden layers, dropout training does not work. I wonder if a bug is causing this.
4-layer with backprop (no dropout) works
$ python mlp.py backprop
... building the model: hidden layers [600, 200, 100, 100], dropout: False [0.0, 0.0, 0.0, 0.0, 0.0]
... training
epoch 1, test error 0.375 (300), learning_rate=1.0 (patience: 7448 iter 930) **
epoch 2, test error 0.375 (300), learning_rate=0.998 (patience: 7448 iter 1861)
epoch 3, test error 0.375 (300), learning_rate=0.996004 (patience: 7448 iter 2792)
epoch 4, test error 0.375 (300), learning_rate=0.994011992 (patience: 7448 iter 3723)
epoch 5, test error 0.375 (300), learning_rate=0.992023968016 (patience: 7448 iter 4654)
epoch 6, test error 0.3625 (290), learning_rate=0.99003992008 (patience: 7448 iter 5585) **
epoch 7, test error 0.33875 (271), learning_rate=0.98805984024 (patience: 22340.0 iter 6516) **
epoch 8, test error 0.3175 (254), learning_rate=0.986083720559 (patience: 26064.0 iter 7447) **
epoch 9, test error 0.32375 (259), learning_rate=0.984111553118 (patience: 29788.0 iter 8378)
epoch 10, test error 0.325 (260), learning_rate=0.982143330012 (patience: 29788.0 iter 9309)
4-layer with dropout does not work
$ python mlp.py dropout
... building the model: hidden layers [600, 200, 100, 100], dropout: [0.0, 0.5, 0.5, 0.5, 0.5]
... training
epoch 1, test error 0.375 (300), learning_rate=1.0 (patience: 27930 / 930) **
epoch 2, test error 0.375 (300), learning_rate=0.998 (patience: 27930 / 1861)
epoch 3, test error 0.375 (300), learning_rate=0.996004 (patience: 27930 / 2792)
epoch 4, test error 0.375 (300), learning_rate=0.994011992 (patience: 27930 / 3723)
epoch 5, test error 0.375 (300), learning_rate=0.992023968016 (patience: 27930 / 4654)
...
epoch 29, test error 0.375 (300), learning_rate=0.945486116479 (patience: 27930 / 26998)
epoch 30, test error 0.375 (300), learning_rate=0.943595144246 (patience: 27930 / 27929)
epoch 31, test error 0.375 (300), learning_rate=0.941707953958 (patience: 27930 / 28860)
3-layer with dropout works
$ python mlp.py dropout
... building the model: hidden layers [600, 200, 100], dropout: [0.0, 0.5, 0.5, 0.5]
... training
epoch 1, test error 0.375 (300), learning_rate=1.0 (patience: 27930 / 930) **
epoch 2, test error 0.375 (300), learning_rate=0.998 (patience: 27930 / 1861)
epoch 3, test error 0.375 (300), learning_rate=0.996004 (patience: 27930 / 2792)
epoch 4, test error 0.375 (300), learning_rate=0.994011992 (patience: 27930 / 3723)
epoch 5, test error 0.375 (300), learning_rate=0.992023968016 (patience: 27930 / 4654)
epoch 6, test error 0.365 (292), learning_rate=0.99003992008 (patience: 27930 / 5585) **
epoch 7, test error 0.3625 (290), learning_rate=0.98805984024 (patience: 27930 / 6516) **
epoch 8, test error 0.3375 (270), learning_rate=0.986083720559 (patience: 27930 / 7447) **
epoch 9, test error 0.3275 (262), learning_rate=0.984111553118 (patience: 29788 / 8378) **
epoch 10, test error 0.33375 (267), learning_rate=0.982143330012 (patience: 33512 / 9309)
epoch 11, test error 0.32875 (263), learning_rate=0.980179043352 (patience: 33512 / 10240)
epoch 12, test error 0.315 (252), learning_rate=0.978218685265 (patience: 33512 / 11171) **
This block of code constrains the norms of the rows of the weight matrix:
https://github.com/mdenil/dropout/blob/master/mlp.py#L245-L254
It should constrain the norms of the columns as described in the original paper:
Instead of penalizing the squared length (L2 norm) of the whole weight vector, we set an upper bound on the L2 norm of the incoming weight vector for each individual hidden unit. If a weight-update violates this constraint, we renormalize the weights of the hidden unit by division.
The matrix orientation in the code means that each column corresponds to a hidden unit (see: https://github.com/mdenil/dropout/blob/master/mlp.py#L38-L41), so it is the columns and not the rows that should be constrained.
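For reference, a plain-numpy sketch of the column-wise max-norm renormalization the paper describes (the function name and max_norm value are illustrative, not taken from mlp.py):

```python
import numpy as np

def max_norm_columns(W, max_norm=3.0):
    """Renormalize each column of W (the incoming weights of one hidden
    unit) whose L2 norm exceeds max_norm, by scaling it back down."""
    norms = np.linalg.norm(W, axis=0)                      # one norm per column
    scale = np.minimum(1.0, max_norm / np.maximum(norms, 1e-12))
    return W * scale                                       # broadcasts over rows

W = np.random.default_rng(1).normal(size=(784, 100)) * 0.5
W_clipped = max_norm_columns(W, max_norm=3.0)

# Every hidden unit's incoming weight vector now satisfies the constraint.
assert np.all(np.linalg.norm(W_clipped, axis=0) <= 3.0 + 1e-9)
```

Applying the same scaling with axis=1 would constrain rows instead, which is the bug being reported here.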
Reported by Ian Goodfellow:
Hi Misha,
I think I found a bug in the momentum for your dropout demo. This came
up when someone suggested adding some code that was partially copied
from your demo to pylearn2.
The bug is with these lines:
for gparam_mom, gparam in zip(gparams_mom, gparams):
    updates[gparam_mom] = mom * gparam_mom + (1. - mom) * gparam

# ... and take a step along that direction
for param, gparam_mom in zip(classifier.params, gparams_mom):
    stepped_param = param - (1. - mom) * learning_rate * gparam_mom
There are two things I think are wrong here:
Hope that helps,
Ian
In Hinton's paper, it is said that "On each presentation of each training case, each hidden unit is randomly omitted from the network with a probability of 0.5".
To my understanding, in function _dropout_from_layer,
srng = theano.tensor.shared_randomstreams.RandomStreams(
    rng.randint(999999))
srng seems fixed, and thus the random dropout units appear to be determined (before seeing the samples) and the same for every epoch and minibatch. In that case it would be far from real dropout. But I might be wrong, since I am not that familiar with Theano. Can someone clarify this for me?
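Not Theano, but a numpy analogue may help clarify the question (names are mine): seeding a random generator once at construction time fixes the *sequence* of samples, not the samples themselves, so each call still draws a fresh mask. Theano's RandomStreams behaves the same way inside a compiled function: a new mask is drawn per call, i.e. per minibatch.

```python
import numpy as np

# Analogue of _dropout_from_layer: the generator is seeded once when the
# "graph" is built, but every call to it draws a fresh dropout mask.
rng = np.random.default_rng(999999)

def dropout_mask(shape, p=0.5):
    """Draw a fresh binary retain-mask (1 = keep the unit)."""
    return (rng.random(shape) > p).astype(float)

m1 = dropout_mask((1, 64))
m2 = dropout_mask((1, 64))

# Consecutive minibatches get different masks, not one fixed mask.
assert not np.array_equal(m1, m2)
```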
Regards,
Would you please add a LICENSE file?
Hi,
Thanks for this code.
From looking at it, I think the dropout rate should be set to 0 when not using dropout, since the code appears to adapt the weights W based on that rate regardless of whether dropout is used.
Hi,
I think you've got a bug in your implementation: you're applying the dropout mask to output units rather than elements of your weight matrices, which is what the original version of dropout is intended to do. This means that you're dropping out bias units randomly, which might disrupt the model averaging interpretation of dropout.
I'm not sure if you intended to do this, but if not, you should reconsider having _dropout_from_layer apply a mask directly to the Ws, and then computing the layer output (see eqns. 2.3-2.6 of Nitish's thesis).
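One way to see how the two views relate (numpy sketch, names mine; the bias question raised above is separate): masking a layer's outputs is algebraically identical to zeroing the corresponding rows of the next layer's weight matrix, so for the weights themselves the two formulations coincide.

```python
import numpy as np

rng = np.random.default_rng(2)
h = rng.normal(size=5)                  # hidden-unit activations
W = rng.normal(size=(5, 3))             # weights into the next layer
mask = (rng.random(5) > 0.5).astype(float)

# Masking the unit outputs ...
out_units = (h * mask) @ W

# ... equals zeroing the rows of W belonging to the dropped units.
out_weights = h @ (W * mask[:, None])

assert np.allclose(out_units, out_weights)
```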
I use the momentum and gradient update rules to train my network. However, the error decreases for the first few epochs and then flattens out.
I checked those update rules against Hinton's dropout paper (and the ImageNet paper), and they seem different.
The current update rules implemented are as follow:
updates = OrderedDict()
for gparam_mom, gparam in zip(gparams_mom, gparams):
    updates[gparam_mom] = mom * gparam_mom + (1. - mom) * gparam
for param, gparam_mom in zip(classifier.params, gparams_mom):
    stepped_param = param - learning_rate * updates[gparam_mom]
According to my understanding of Appendix A.1 in the dropout paper, learning_rate is NOT multiplied into mom * gparam_mom, but the above code does multiply it in. According to the formula therein, we should have:
updates = OrderedDict()
for gparam_mom, gparam in zip(gparams_mom, gparams):
    updates[gparam_mom] = mom * gparam_mom - (1. - mom) * learning_rate * gparam
for param, gparam_mom in zip(classifier.params, gparams_mom):
    stepped_param = param + updates[gparam_mom]
Am I right? Or is the update rule currently implemented actually used somewhere that I am not aware of? In that case, I'd love some pointers.
As another note, since learning_rate is multiplied by (1. - mom), a large learning_rate is expected to give good results (no wonder Hinton uses 10 here...).
However, in their ImageNet paper, they use a slightly different rule for updating the momentum, which includes weight decay. Also, learning_rate is no longer multiplied by (1. - mom). In that case, we should expect a small learning_rate. In code, the ImageNet update rule is:
updates = OrderedDict()
for gparam_mom, gparam, param in zip(gparams_mom, gparams, params):
    updates[gparam_mom] = mom * gparam_mom - learning_rate * (weight_decay * param + gparam)
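For comparison, here is a plain-numpy sketch of one step under each rule (function names and the numeric values are mine; weight_decay = 5e-4 is the value used in the AlexNet paper). It makes the different effective step sizes concrete: with mom = 0.9, the Appendix A.1 rule scales the gradient by (1 - mom) * lr, so it needs a learning rate roughly 10x larger to take a comparable step.

```python
import numpy as np

def step_dropout_paper(param, mom_buf, grad, lr, mom):
    """Appendix A.1 rule: the learning rate sits inside the momentum update,
    scaled by (1 - mom)."""
    mom_buf = mom * mom_buf - (1.0 - mom) * lr * grad
    return param + mom_buf, mom_buf

def step_imagenet(param, mom_buf, grad, lr, mom, weight_decay=5e-4):
    """AlexNet-style rule: plain momentum plus weight decay; lr multiplies
    the whole (decayed) gradient."""
    mom_buf = mom * mom_buf - lr * (weight_decay * param + grad)
    return param + mom_buf, mom_buf

# One step from the same state, with the learning rates each rule expects.
p, v, g = 1.0, 0.0, 0.5
p1, _ = step_dropout_paper(p, v, g, lr=10.0, mom=0.9)   # step = 0.1 * 10 * 0.5
p2, _ = step_imagenet(p, v, g, lr=0.01, mom=0.9)        # step = 0.01 * 0.5005

assert abs(p1 - 0.5) < 1e-9
assert abs(p2 - 0.994995) < 1e-9
```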
Regards,
The input is dropped out with probability 0.2, which means the weight matrix should be multiplied by 0.8, but the code multiplies it by 0.5.
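A quick numpy check of the expected scale (toy sizes, names mine): averaging over masks with drop probability 0.2 recovers 0.8 times the clean dot product, so the test-time weights for that layer should indeed be scaled by the layer's own retain probability, not a blanket 0.5.

```python
import numpy as np

rng = np.random.default_rng(3)
p_drop = 0.2                           # input-layer dropout probability
x = rng.normal(size=6)
w = rng.normal(size=6)

# Monte Carlo average of the masked dot product over many input masks.
masks = (rng.random((200000, 6)) > p_drop).astype(float)
mc_mean = ((x * masks) @ w).mean()

# The expectation is (1 - 0.2) = 0.8 times the clean dot product.
assert np.isclose(mc_mean, 0.8 * (x @ w), atol=0.02)
```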