
xgastaldi / shake-shake
295 stars, 32 forks, 30 KB

2.86% and 15.85% on CIFAR-10 and CIFAR-100

Home Page: https://arxiv.org/abs/1705.07485

License: BSD 3-Clause "New" or "Revised" License

Language: Lua (100.00%)
Topics: regularization, resnet, torch7

shake-shake's People

Contributors: xgastaldi

shake-shake's Issues

CIFAR-100 training too slow

Hi, I am trying to reproduce the CIFAR-100 result.

I am using the command from the README:

CUDA_VISIBLE_DEVICES=0,1 th main.lua -dataset cifar100 -depth 29 -baseWidth 64 -groups 4 -weightDecay 5e-4 -batchSize 32 -netType shakeshake -nGPU 2 -LR 0.025 -nThreads 8 -shareGradInput true -nEpochs 1800 -lrShape cosine -forwardShake true -backwardShake false -shakeImage true

I checked and the top-1 error is still 99 at epoch 111, and I'm afraid it will never converge.

I am running an up-to-date Torch built from source, with the necessary files copied from fb.resnet.torch, on Ubuntu 14.04 / CUDA 8 / cuDNN v4.

Should I just wait patiently, or are there any extra tricks to get faster convergence?
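
As a sanity check on the schedule: assuming the standard cosine annealing formula lr(t) = 0.5 * lr0 * (1 + cos(pi * t / T)) (my assumption; I have not read it out of this repo's code), the learning rate at epoch 111 of 1800 has barely decayed, so slow early progress might be expected:

 -- Assumed standard cosine annealing, matching -lrShape cosine in spirit.
 local baseLR, nEpochs = 0.025, 1800
 local function cosineLR(epoch)
    return 0.5 * baseLR * (1 + math.cos(math.pi * epoch / nEpochs))
 end
 print(cosineLR(111))   -- ~0.0248, barely below the initial 0.025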

Error rates

In the README.md, you wrote: "Table 1: Error rates (%) on CIFAR-10 (Top 1 of the last epoch)".

However, the best error rate does not necessarily occur in the last epoch. For example, with parameter "32296", the top-1 error of the last epoch was 2.86, while the sixth epoch from the end reached 2.79.

Can I consider the best top-1 error to be 2.79?
How should I select the best top-1 error?
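
For what it's worth, this is roughly how I extract both numbers from a training log (the log lines below are made up for illustration; the repo's real format may differ):

 -- Hypothetical log lines; the real log format may differ.
 local log = {
    '| Epoch 1795 | test top1 2.91 |',
    '| Epoch 1796 | test top1 2.79 |',
    '| Epoch 1800 | test top1 2.86 |',
 }
 local best, last = math.huge, nil
 for _, line in ipairs(log) do
    local top1 = tonumber(line:match('top1%s+([%d%.]+)'))
    if top1 then
       last = top1                            -- last-epoch error
       if top1 < best then best = top1 end    -- best error over all epochs
    end
 end
 print(('last-epoch top-1: %.2f   best top-1: %.2f'):format(last, best))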

CUDA out of memory when checkpointing

Hi,

When running the code, the model trains for one epoch without running out of memory, but then runs out of memory during the first checkpoint, because the checkpoint copy is made on the GPU (not in CPU mode). I guess there is a reason for changing the default checkpointing done in fb.resnet.torch. Can you please explain the reason?
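
For comparison, this is the kind of CPU-side checkpointing I expected (a minimal sketch with a hypothetical saveCheckpoint helper, assuming a CUDA-enabled install with cunn; not the repo's actual checkpoint code):

 require 'torch'
 require 'nn'
 require 'cunn'   -- assumes a CUDA-enabled Torch install

 -- saveCheckpoint is a hypothetical helper, not a function from this repo.
 local function saveCheckpoint(model, path)
    model:clearState()   -- drop cached activations before serializing
    model:float()        -- convert parameters to host memory in place
    torch.save(path, model)
    model:cuda()         -- move back to the GPU to keep training
 end

 saveCheckpoint(nn.Linear(4, 2):cuda(), 'checkpoint.t7')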

Thanks

Two approaches to improve accuracy

I think there are two approaches that could improve accuracy. Are these methods feasible?

The first method:
Split a validation set from the training set, then adjust the learning rate automatically based on the validation accuracy.

The second method:
During training, once the test top-1 error drops below a fixed value, set the learning rate of all subsequent epochs to zero.

For example, running the code with -nEpochs 400 produced the following log:

epoch   test top-1   learning rate
370     3.62         0.02
...     ...          ...
400     3.82         0.00

In the log, the best test top-1 was 3.62, but it was produced at epoch 370.
If the learning rate between epoch 371 and 400 were set to zero, shouldn't the test top-1 of every epoch between 371 and 400 stay at 3.62?
I experimented with this method and found that the test top-1 after epoch 371 still changes slightly.
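
To make the second method concrete, here is a sketch of the threshold rule layered on the cosine schedule (the names frozen and threshold are mine, not the repo's):

 -- Sketch of method 2: once the test top-1 error falls below `threshold`,
 -- return 0 for every later epoch; otherwise follow the cosine schedule.
 local frozen = false
 local function learningRate(baseLR, epoch, nEpochs, testTop1, threshold)
    if frozen or testTop1 < threshold then
       frozen = true
       return 0
    end
    return 0.5 * baseLR * (1 + math.cos(math.pi * epoch / nEpochs))
 end

 print(learningRate(0.2, 370, 400, 3.62, 3.70))   -- 0: threshold crossed

Note that even with a zero learning rate, the batch-normalization running statistics keep updating during training-mode forward passes, which may explain why the test top-1 still moves slightly after epoch 371.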

Can you give me some suggestions about the two methods above?
Did you compare adaptive learning-rate methods such as RMSprop and Adadelta with SGD?
Thank you very much!

Why is the test top-1 different?

I changed shakeshakeblock.lua, then ran the code for 400 epochs.
[screenshot of the training log during this run]

After training for 400 epochs, I ran:

CUDA_VISIBLE_DEVICES=0,1,2,3 th main.lua -dataset cifar10 -nGPU 4 -testOnly true -retrain ./checkpoints/model_best.t7

The result was: Results top1: 3.670 top5: 0.020

Question 1: Why was the test top-1 from "testOnly" lower than the test top-1 logged during training?
Question 2: What is the difference between the best test top-1, the last epoch's test top-1, and the "testOnly" top-1?
Question 3: Since the top-1 of 3.67 was produced by the network's model, can I say my model achieves a top-1 error of 3.67?
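
My current guess (an assumption about fb.resnet.torch-style training, not something I have verified in this repo): model_best.t7 is written at the epoch with the lowest test top-1 seen so far, not at the last epoch, along these lines:

 -- Illustrative per-epoch errors; real values come from the training log.
 local testTop1 = {4.10, 3.67, 3.82}
 local bestTop1, bestEpoch = math.huge, 0
 for epoch, err in ipairs(testTop1) do
    if err < bestTop1 then
       bestTop1, bestEpoch = err, epoch
       -- model_best.t7 would be (over)written at this point
    end
 end
 print(('best epoch %d, top-1 %.2f'):format(bestEpoch, bestTop1))

If so, running -testOnly on model_best.t7 can beat the last epoch's logged number.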

Question regarding Pooling

I have a small question (I do not know Torch).

In this code, where you build the skip connection when decreasing the resolution:

 -- Skip path #1
 s1 = nn.Sequential()
 s1:add(nn.SpatialAveragePooling(1, 1, stride, stride))
 s1:add(Convolution(nInputPlane, nOutputPlane/2, 1,1, 1,1, 0,0))

 -- Skip path #2
 s2 = nn.Sequential()
 -- Shift the tensor by one pixel right and one pixel down (to make the 2nd path "see" different pixels)
 s2:add(nn.SpatialZeroPadding(1, -1, 1, -1))
 s2:add(nn.SpatialAveragePooling(1, 1, stride, stride))

Skip path #1 will take the 'top left' pixel of each 2x2 square, if I understand it correctly.

Skip path #2 will take the 'bottom right' pixel of each 2x2 square. Here is the point I do not understand: if this code first pads zeros onto the top and left of the feature map, then after downsampling it will have a lot of pixels with value '0'. It would be better to pad zeros on the right and bottom and remove the first row and first column, i.e. use s2:add(nn.SpatialZeroPadding(-1, 1, -1, 1)) instead of s2:add(nn.SpatialZeroPadding(1, -1, 1, -1)). A small probe comparing both variants is sketched below.

Tell me if it works right now as I described; I might be wrong because, as I said, I don't know Torch too well.
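
To check concretely which pixels each variant keeps, here is a minimal probe (assuming Torch7 with the nn package; this is my test script, not code from the repo):

 require 'nn'

 -- A 4x4 ramp makes it obvious which input pixels each variant keeps.
 local x = torch.range(1, 16):view(1, 1, 4, 4)

 for _, pad in ipairs({{1, -1, 1, -1}, {-1, 1, -1, 1}}) do
    local s2 = nn.Sequential()
    s2:add(nn.SpatialZeroPadding(pad[1], pad[2], pad[3], pad[4]))
    s2:add(nn.SpatialAveragePooling(1, 1, 2, 2))   -- 1x1 kernel, stride 2
    print(unpack(pad))
    print(s2:forward(x))
 end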

A question about the epoch at which training converges

Using the same hyperparameter (22632), training converges at around epoch 1850.

After several training runs with other hyperparameters, however, training with the same hyperparameter (22632) does not converge until beyond epoch 3500.

The two experiments have the same hyperparameters and the same code.
The only difference in the second experiment relative to the first is that we had already trained several times before it.
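
One thing I am now checking (assuming "22632" is the manual seed): whether every RNG source is reseeded at the start of each run, since otherwise consecutive runs are not comparable. Assuming torch and cutorch:

 require 'torch'
 require 'cutorch'   -- assumes a CUDA-enabled Torch install

 torch.manualSeed(22632)        -- seed the CPU RNG
 cutorch.manualSeedAll(22632)   -- seed the RNG on every GPU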
