Exactly like the stupid single-hidden-layer MLP, but... deeper. Also, of course, sparsification now happens on all layers, and the order in which low-magnitude params are killed off is defined over all the layers at once, too. The Sutskever et al. smart initialization was too rich for my blood, so each layer is just drawn from a Gaussian whose standard deviation increases exponentially with depth. You can get away with even less burn-in and still get that fat tail on the weight histogram by cranking the learning rate way up on just those burn-in steps.
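A minimal numpy sketch of those three tricks (global magnitude pruning, depth-growing Gaussian init, hot burn-in learning rate); every name and default here (init_layers, global_magnitude_prune, burn_in_lr, the stds, the sparsity level) is made up for illustration, not this repo's actual code.

```python
import numpy as np

def init_layers(sizes, base_std=0.01, growth=2.0, rng=None):
    """Draw each weight matrix from a Gaussian whose standard deviation
    grows exponentially with depth: layer k gets std = base_std * growth**k."""
    if rng is None:
        rng = np.random.default_rng(0)
    return [rng.normal(0.0, base_std * growth**k, size=(m, n))
            for k, (m, n) in enumerate(zip(sizes[:-1], sizes[1:]))]

def global_magnitude_prune(weights, sparsity=0.9):
    """Zero out the lowest-magnitude params, ranked over ALL layers at
    once rather than per layer: one global threshold, applied everywhere."""
    all_mags = np.concatenate([np.abs(w).ravel() for w in weights])
    threshold = np.quantile(all_mags, sparsity)
    return [w * (np.abs(w) >= threshold) for w in weights]

def burn_in_lr(step, burn_in_steps=100, hot_lr=1.0, lr=0.01):
    """Cranked-up learning rate on the burn-in steps only, to fatten the
    tail of the weight histogram faster; normal rate afterwards."""
    return hot_lr if step < burn_in_steps else lr

weights = init_layers([784, 256, 256, 10])
weights = global_magnitude_prune(weights, sparsity=0.9)
print([f"{(w != 0).mean():.2f}" for w in weights])  # surviving fraction per layer
```

One consequence of the global ranking plus the depth-growing stds: the early, small-std layers end up far sparser than the late ones, since a single threshold hits them hardest.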
lebo124/stupiddnn, forked from howonlee/stupiddnn.
Stupid greedy layerwise DNN