mdenil / dropout
A Theano implementation of Hinton's dropout.
License: MIT License
Theano implementation of dropout. See: http://arxiv.org/abs/1207.0580

Run with ./mlp.py dropout for dropout, or ./mlp.py backprop for regular backprop with no dropout.

Use ./plot_results.sh results.png to visualize the results.

Based on code from:
- http://deeplearning.net/tutorial/mlp.html
- http://deeplearning.net/tutorial/logreg.html

Use the data here to make the units of the results comparable to Hinton's paper:
- http://www.cs.ox.ac.uk/people/misha.denil/hidden/mnist_batches.npz
I didn't find the code that scales the model parameters by p during testing, as the original paper suggests. Is this correct, or am I missing something?
Hi, I am just a little confused about the setup. In Hinton's paper you need to resample the dropout units on each presentation, but this code seems to fix the dropout units at the beginning and never resample them. Am I missing something here? Thanks in advance.
-Lin
Great initiative implementing the dropout technique in Theano! But is there any reason why the mlp.py code doesn't use biases for its neurons?
I'm not clear on the difference between the dropout and backprop modes. Even in backprop mode, are dropout layers used? What exactly is the difference between the two, and which performs better?
Hi,
Thanks for this code.
My question about the code is similar to @droid666's: why is W set by the formula W = layer.W / (1 - dropout_rates[layer_counter]) at test time, rather than W = layer.W?
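For what it's worth, here is a plain-numpy sketch (variable names are mine, not from mlp.py) of why the retain probability shows up at test time: averaging the masked pre-activation over many dropout masks matches scaling the weights by the retain probability (1 - p), which is what the test-time network approximates.

```python
import numpy as np

rng = np.random.default_rng(0)
p_drop = 0.5                        # probability of dropping a unit
W = rng.normal(size=(4, 3))         # toy weight matrix (4 inputs, 3 units)
x = rng.normal(size=4)              # toy input

# Monte Carlo: average the masked pre-activation over many dropout masks.
masks = (rng.random((200000, 4)) > p_drop).astype(float)
mc_mean = ((x * masks) @ W).mean(axis=0)

# Test-time approximation: scale the weights by the retain probability.
test_scaled = x @ (W * (1.0 - p_drop))

# The two agree up to Monte Carlo noise.
assert np.allclose(mc_mean, test_scaled, atol=0.02)
```

Whether a given implementation multiplies by (1 - p) at test time or divides by (1 - p) at training time ("inverted dropout") is a convention; the expected activations come out the same.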
Regardless of the width of the hidden layers, it seems that if I have more than 3 hidden layers, dropout training does not work. I wonder if a bug is causing this.
4-layer with backprop (no dropout) works
$ python mlp.py backprop
... building the model: hidden layers [600, 200, 100, 100], dropout: False [0.0, 0.0, 0.0, 0.0, 0.0]
... training
epoch 1, test error 0.375 (300), learning_rate=1.0 (patience: 7448 iter 930) **
epoch 2, test error 0.375 (300), learning_rate=0.998 (patience: 7448 iter 1861)
epoch 3, test error 0.375 (300), learning_rate=0.996004 (patience: 7448 iter 2792)
epoch 4, test error 0.375 (300), learning_rate=0.994011992 (patience: 7448 iter 3723)
epoch 5, test error 0.375 (300), learning_rate=0.992023968016 (patience: 7448 iter 4654)
epoch 6, test error 0.3625 (290), learning_rate=0.99003992008 (patience: 7448 iter 5585) **
epoch 7, test error 0.33875 (271), learning_rate=0.98805984024 (patience: 22340.0 iter 6516) **
epoch 8, test error 0.3175 (254), learning_rate=0.986083720559 (patience: 26064.0 iter 7447) **
epoch 9, test error 0.32375 (259), learning_rate=0.984111553118 (patience: 29788.0 iter 8378)
epoch 10, test error 0.325 (260), learning_rate=0.982143330012 (patience: 29788.0 iter 9309)
4-layer with dropout does not work
$ python mlp.py dropout
... building the model: hidden layers [600, 200, 100, 100], dropout: [0.0, 0.5, 0.5, 0.5, 0.5]
... training
epoch 1, test error 0.375 (300), learning_rate=1.0 (patience: 27930 / 930) **
epoch 2, test error 0.375 (300), learning_rate=0.998 (patience: 27930 / 1861)
epoch 3, test error 0.375 (300), learning_rate=0.996004 (patience: 27930 / 2792)
epoch 4, test error 0.375 (300), learning_rate=0.994011992 (patience: 27930 / 3723)
epoch 5, test error 0.375 (300), learning_rate=0.992023968016 (patience: 27930 / 4654)
...
epoch 29, test error 0.375 (300), learning_rate=0.945486116479 (patience: 27930 / 26998)
epoch 30, test error 0.375 (300), learning_rate=0.943595144246 (patience: 27930 / 27929)
epoch 31, test error 0.375 (300), learning_rate=0.941707953958 (patience: 27930 / 28860)
3-layer with dropout works
$ python mlp.py dropout
... building the model: hidden layers [600, 200, 100], dropout: [0.0, 0.5, 0.5, 0.5]
... training
epoch 1, test error 0.375 (300), learning_rate=1.0 (patience: 27930 / 930) **
epoch 2, test error 0.375 (300), learning_rate=0.998 (patience: 27930 / 1861)
epoch 3, test error 0.375 (300), learning_rate=0.996004 (patience: 27930 / 2792)
epoch 4, test error 0.375 (300), learning_rate=0.994011992 (patience: 27930 / 3723)
epoch 5, test error 0.375 (300), learning_rate=0.992023968016 (patience: 27930 / 4654)
epoch 6, test error 0.365 (292), learning_rate=0.99003992008 (patience: 27930 / 5585) **
epoch 7, test error 0.3625 (290), learning_rate=0.98805984024 (patience: 27930 / 6516) **
epoch 8, test error 0.3375 (270), learning_rate=0.986083720559 (patience: 27930 / 7447) **
epoch 9, test error 0.3275 (262), learning_rate=0.984111553118 (patience: 29788 / 8378) **
epoch 10, test error 0.33375 (267), learning_rate=0.982143330012 (patience: 33512 / 9309)
epoch 11, test error 0.32875 (263), learning_rate=0.980179043352 (patience: 33512 / 10240)
epoch 12, test error 0.315 (252), learning_rate=0.978218685265 (patience: 33512 / 11171) **
This block of code constrains the norms of the rows of the weight matrix:
https://github.com/mdenil/dropout/blob/master/mlp.py#L245-L254
It should constrain the norms of the columns as described in the original paper:
Instead of penalizing the squared length (L2 norm) of the whole weight vector, we set an upper bound on the L2 norm of the incoming weight vector for each individual hidden unit. If a weight-update violates this constraint, we renormalize the weights of the hidden unit by division.
The matrix orientation in the code means that each column corresponds to a hidden unit (see: https://github.com/mdenil/dropout/blob/master/mlp.py#L38-L41), so it is the columns and not the rows that should be constrained.
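For reference, a plain-numpy sketch of the column-wise max-norm renormalization the paper describes (the function name and max_norm value are illustrative, not taken from mlp.py):

```python
import numpy as np

def max_norm_columns(W, max_norm=3.0):
    """Renormalize each column of W (the incoming weights of one hidden
    unit) whose L2 norm exceeds max_norm, by scaling it back down."""
    norms = np.linalg.norm(W, axis=0)                      # one norm per column
    scale = np.minimum(1.0, max_norm / np.maximum(norms, 1e-12))
    return W * scale                                       # broadcasts over rows

W = np.random.default_rng(1).normal(size=(784, 100)) * 0.5
W_clipped = max_norm_columns(W, max_norm=3.0)

# Every hidden unit's incoming weight vector now satisfies the constraint.
assert np.all(np.linalg.norm(W_clipped, axis=0) <= 3.0 + 1e-9)
```

Applying the same scaling with axis=1 would constrain rows instead, which is the bug being reported here.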
Reported by Ian Goodfellow:
Hi Misha,
I think I found a bug in the momentum for your dropout demo. This came
up when someone suggested adding some code that was partially copied
from your demo to pylearn2.
The bug is with these lines:
for gparam_mom, gparam in zip(gparams_mom, gparams):
    updates[gparam_mom] = mom * gparam_mom + (1. - mom) * gparam

# ... and take a step along that direction
for param, gparam_mom in zip(classifier.params, gparams_mom):
    stepped_param = param - (1. - mom) * learning_rate * gparam_mom
There are two things I think are wrong here:
Hope that helps,
Ian
In Hinton's paper, it is said that "On each presentation of each training case, each hidden unit is randomly omitted from the network with a probability of 0.5".
To my understanding, in function _dropout_from_layer,
srng = theano.tensor.shared_randomstreams.RandomStreams(
    rng.randint(999999))
srng seems fixed, and thus the random dropout units appear to be determined (before seeing the samples) and the same for every epoch and minibatch. In that case it would be far from real dropout. But I might be wrong, since I am not that familiar with Theano. Can someone clarify this for me?
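Not Theano, but a numpy analogue may help clarify the question (names are mine): seeding a random generator once at construction time fixes the *sequence* of samples, not the samples themselves, so each call still draws a fresh mask. Theano's RandomStreams behaves the same way inside a compiled function: a new mask is drawn per call, i.e. per minibatch.

```python
import numpy as np

# Analogue of _dropout_from_layer: the generator is seeded once when the
# "graph" is built, but every call to it draws a fresh dropout mask.
rng = np.random.default_rng(999999)

def dropout_mask(shape, p=0.5):
    """Draw a fresh binary retain-mask (1 = keep the unit)."""
    return (rng.random(shape) > p).astype(float)

m1 = dropout_mask((1, 64))
m2 = dropout_mask((1, 64))

# Consecutive minibatches get different masks, not one fixed mask.
assert not np.array_equal(m1, m2)
```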
Regards,
Would you please add a LICENSE file?
Hi,
Thanks for this code.
From looking at it, I think the dropout rate should be set to 0 when not using dropout, since the code appears to adapt the weights W based on that rate regardless of whether dropout is used.
Hi,
I think you've got a bug in your implementation: you're applying the dropout mask to output units rather than elements of your weight matrices, which is what the original version of dropout is intended to do. This means that you're dropping out bias units randomly, which might disrupt the model averaging interpretation of dropout.
I'm not sure if you intended to do this, but if not, you should reconsider having _dropout_from_layer apply a mask directly to the Ws, and then computing the layer output (see eqns. 2.3-2.6 of Nitish's thesis).
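One way to see how the two views relate (numpy sketch, names mine; the bias question raised above is separate): masking a layer's outputs is algebraically identical to zeroing the corresponding rows of the next layer's weight matrix, so for the weights themselves the two formulations coincide.

```python
import numpy as np

rng = np.random.default_rng(2)
h = rng.normal(size=5)                  # hidden-unit activations
W = rng.normal(size=(5, 3))             # weights into the next layer
mask = (rng.random(5) > 0.5).astype(float)

# Masking the unit outputs ...
out_units = (h * mask) @ W

# ... equals zeroing the rows of W belonging to the dropped units.
out_weights = h @ (W * mask[:, None])

assert np.allclose(out_units, out_weights)
```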
I use the momentum and gradient update rules to train my network. However, the error decreases for the first few epochs and then flattens out.
I checked those update rules against Hinton's dropout paper (and the ImageNet paper), and they seem different.
The current update rules implemented are as follow:
updates = OrderedDict()
for gparam_mom, gparam in zip(gparams_mom, gparams):
    updates[gparam_mom] = mom * gparam_mom + (1. - mom) * gparam
for param, gparam_mom in zip(classifier.params, gparams_mom):
    stepped_param = param - learning_rate * updates[gparam_mom]
According to my understanding of Appendix A.1 in the dropout paper, learning_rate is NOT multiplied into mom * gparam_mom, but the above code does multiply it in. According to the formula therein, we should have:
updates = OrderedDict()
for gparam_mom, gparam in zip(gparams_mom, gparams):
    updates[gparam_mom] = mom * gparam_mom - (1. - mom) * learning_rate * gparam
for param, gparam_mom in zip(classifier.params, gparams_mom):
    stepped_param = param + updates[gparam_mom]
Am I right? Or is the update rule currently implemented actually used somewhere that I am not aware of? In that case, I'd love some pointers.
As another note, since learning_rate is multiplied by (1. - mom), a large learning_rate is expected to give good results (no wonder Hinton uses 10 here...).
However, in their ImageNet paper, they use a slightly different rule for updating the momentum, which includes weight decay. Also, learning_rate is no longer multiplied by (1. - mom). In that case, we should expect a small learning_rate. In code, the ImageNet update rule is:
updates = OrderedDict()
for gparam_mom, gparam, param in zip(gparams_mom, gparams, params):
    updates[gparam_mom] = mom * gparam_mom - learning_rate * (weight_decay * param + gparam)
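For comparison, here is a plain-numpy sketch of one step under each rule (function names and the numeric values are mine; weight_decay = 5e-4 is the value used in the AlexNet paper). It makes the different effective step sizes concrete: with mom = 0.9, the Appendix A.1 rule scales the gradient by (1 - mom) * lr, so it needs a learning rate roughly 10x larger to take a comparable step.

```python
import numpy as np

def step_dropout_paper(param, mom_buf, grad, lr, mom):
    """Appendix A.1 rule: the learning rate sits inside the momentum update,
    scaled by (1 - mom)."""
    mom_buf = mom * mom_buf - (1.0 - mom) * lr * grad
    return param + mom_buf, mom_buf

def step_imagenet(param, mom_buf, grad, lr, mom, weight_decay=5e-4):
    """AlexNet-style rule: plain momentum plus weight decay; lr multiplies
    the whole (decayed) gradient."""
    mom_buf = mom * mom_buf - lr * (weight_decay * param + grad)
    return param + mom_buf, mom_buf

# One step from the same state, with the learning rates each rule expects.
p, v, g = 1.0, 0.0, 0.5
p1, _ = step_dropout_paper(p, v, g, lr=10.0, mom=0.9)   # step = 0.1 * 10 * 0.5
p2, _ = step_imagenet(p, v, g, lr=0.01, mom=0.9)        # step = 0.01 * 0.5005

assert abs(p1 - 0.5) < 1e-9
assert abs(p2 - 0.994995) < 1e-9
```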
Regards,
The input is dropped out with probability 0.2, which means the weight matrix should be multiplied by 0.8, but the code multiplies it by 0.5.
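A quick numpy check of the expected scale (toy sizes, names mine): averaging over masks with drop probability 0.2 recovers 0.8 times the clean dot product, so the test-time weights for that layer should indeed be scaled by the layer's own retain probability, not a blanket 0.5.

```python
import numpy as np

rng = np.random.default_rng(3)
p_drop = 0.2                           # input-layer dropout probability
x = rng.normal(size=6)
w = rng.normal(size=6)

# Monte Carlo average of the masked dot product over many input masks.
masks = (rng.random((200000, 6)) > p_drop).astype(float)
mc_mean = ((x * masks) @ w).mean()

# The expectation is (1 - 0.2) = 0.8 times the clean dot product.
assert np.isclose(mc_mean, 0.8 * (x @ w), atol=0.02)
```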