xgastaldi / shake-shake
2.86% and 15.85% on CIFAR-10 and CIFAR-100
Home Page: https://arxiv.org/abs/1705.07485
License: BSD 3-Clause "New" or "Revised" License
Hi, I am trying to reproduce the CIFAR-100 result.
I am using the script from the README:
CUDA_VISIBLE_DEVICES=0,1 th main.lua -dataset cifar100 -depth 29 -baseWidth 64 -groups 4 -weightDecay 5e-4 -batchSize 32 -netType shakeshake -nGPU 2 -LR 0.025 -nThreads 8 -shareGradInput true -nEpochs 1800 -lrShape cosine -forwardShake true -backwardShake false -shakeImage true
I see that the top-1 error is still 99 at epoch 111, and I'm afraid it will not converge.
I am running an up-to-date Torch (built from source) with the necessary files copied from fb.resnet.torch, on Ubuntu 14.04 / CUDA 8 / cuDNN v4.
Should I just wait patiently, or are there any extra tricks to get faster convergence?
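For context, with `-lrShape cosine` the learning rate typically follows a half-cosine decay over the whole run, so at epoch 111 of 1800 it is still close to the initial 0.025 and early progress is naturally slow. A minimal sketch, assuming the standard annealing formula (`cosine_lr` is my own name, not from the repo):

```python
import math

def cosine_lr(base_lr, epoch, n_epochs):
    """Half-cosine decay from base_lr down to 0 over n_epochs
    (the usual formula behind options like -lrShape cosine)."""
    return 0.5 * base_lr * (1 + math.cos(math.pi * epoch / n_epochs))

# Early in an 1800-epoch run the rate has barely decayed:
print(round(cosine_lr(0.025, 111, 1800), 5))   # 0.02477
print(round(cosine_lr(0.025, 1700, 1800), 5))  # 0.00019
```

Most of the decay happens in the second half of training, which is why very long cosine runs look flat for hundreds of epochs before the error drops.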
Thank you for sharing your code!
Your result is the best one I have seen. I am now trying to improve the test error based on your code. Could you give me some suggestions on possible ways to improve test accuracy?
You can contact me at my private email: [email protected].
Thanks.
In the README.md you wrote: "Table 1: Error rates (%) on CIFAR-10 (Top 1 of the last epoch)".
However, the best error rate may not occur in the last epoch. For example, when the parameter "32296" was used, the top-1 error of the last epoch was 2.86, while the sixth epoch from the end reached 2.79.
Can I consider the best top-1 error to be 2.79?
How should I select the best top-1 error?
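One simple convention is to scan the per-epoch test errors and report the minimum together with the epoch that produced it. A hypothetical sketch (the function name and the list-of-errors log format are my own, not from the repo):

```python
def best_top1(errors):
    """Return (best error, 1-based epoch) over a list of
    per-epoch test top-1 errors."""
    best = min(errors)
    return best, errors.index(best) + 1

# Toy example: the best error need not occur in the last epoch.
errs = [5.10, 3.02, 2.79, 2.91, 2.86]
print(best_top1(errs))  # (2.79, 3)
```

Whether the paper-style "last epoch" number or this "best epoch" number is the right one to report is exactly the question being asked here.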
Hi,
When running the code, the model trains for one epoch without running out of memory, but while checkpointing for the first time it tries to make a copy on the GPU (not in CPU mode). I guess there is a reason for changing the default checkpointing from fb.resnet.torch. Can you please explain it?
Thanks
I think there are two approaches that might improve accuracy. Are these methods feasible?
The first method:
Split a validation set from the training set, then adjust the learning rate automatically based on validation accuracy.
The second method:
During training, once the test top-1 error drops below a fixed value, set the learning rate of all subsequent epochs to zero.
For example, running the code with -nEpochs 400, the log file looks like this:
epoch   test top 1   learning rate
370     3.62         0.02
...     ...          ...
400     3.82         0.00
In this log the best test top-1 was 3.62, reached at epoch 370.
If the learning rate between epochs 371 and 400 were set to zero, shouldn't the test top-1 of every epoch between 371 and 400 also be 3.62?
I experimented with this method and found that the test top-1 after epoch 371 still fluctuates slightly.
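The second method above can be sketched as a small scheduler. Everything here is hypothetical (class name, target value, and base schedule are my own); one possible reason the test error still drifts with a zero learning rate is that batch-norm running statistics continue to update during training-mode forward passes regardless of the optimizer's step size:

```python
class FreezeOnTarget:
    """Hypothetical scheduler for 'method 2': follow a base schedule until
    the test top-1 error first drops below `target`, then hold the LR at 0.
    Note: even at LR 0, batch-norm running statistics still update during
    training-mode forward passes, so the test error can keep drifting."""
    def __init__(self, base_schedule, target):
        self.base_schedule = base_schedule  # callable: epoch -> lr
        self.target = target
        self.frozen = False

    def lr(self, epoch, last_test_top1):
        if last_test_top1 is not None and last_test_top1 < self.target:
            self.frozen = True
        return 0.0 if self.frozen else self.base_schedule(epoch)

sched = FreezeOnTarget(lambda e: 0.02, target=3.65)
print(sched.lr(370, 3.70))  # 0.02 -- target not reached yet
print(sched.lr(371, 3.62))  # 0.0  -- 3.62 < 3.65, LR frozen
print(sched.lr(372, 3.80))  # 0.0  -- stays frozen afterwards
```

This only freezes the step size; it does not freeze the model, which may explain the residual fluctuation observed above.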
Can you give me some suggestions about the two methods above?
Have you compared adaptive learning-rate methods such as RMSprop or AdaDelta with SGD?
Thank you very much!
Could you please provide train logs for reference?
The column headers are listed as
Forward Backward Level 26 2x32d 26 2x64d 26 2x32d
in the README, but as
Forward Backward Level 26 2x32d 26 2x64d 26 2x96d
in the paper.
I changed shakeshakeblock.lua, then ran the code for 400 epochs.
The picture below shows the training log.
After the 400-epoch training, I ran:
CUDA_VISIBLE_DEVICES=0,1,2,3 th main.lua -dataset cifar10 -nGPU 4 -testOnly true -retrain ./checkpoints/model_best.t7
The result was "Results top1: 3.670 top5: 0.020".
Question 1: Why was the "testOnly" test top-1 lower than the test top-1 during training?
Question 2: What is the difference between the best test top-1, the last epoch's test top-1, and the "testOnly" top-1?
Question 3: Since the top-1 of 3.67 was produced by the saved model, can I consider that my model achieves a top-1 error of 3.67?
I have a small question (I do not know Torch).
In this code, when you make the skip connection while decreasing resolution:
-- Skip path #1
s1 = nn.Sequential()
s1:add(nn.SpatialAveragePooling(1, 1, stride, stride))
s1:add(Convolution(nInputPlane, nOutputPlane/2, 1,1, 1,1, 0,0))
-- Skip path #2
s2 = nn.Sequential()
-- Shift the tensor by one pixel right and one pixel down (to make the 2nd path "see" different pixels)
s2:add(nn.SpatialZeroPadding(1, -1, 1, -1))
s2:add(nn.SpatialAveragePooling(1, 1, stride, stride))
Skip path #1 will take the 'top left' pixel of each 2x2 square, if I understand correctly.
Skip path #2 will take the 'bottom right' pixel of each 2x2 square. Here is the point I do not understand.
If this code first prepends zeros to the top and left of the feature map, then after downsampling it will have many pixels with value 0. It would be better to append zeros to the right and bottom and remove the first row and first column, i.e. use s2:add(nn.SpatialZeroPadding(-1, 1, -1, 1)) instead of s2:add(nn.SpatialZeroPadding(1, -1, 1, -1)).
Tell me whether it currently works as I described; I might be wrong because, as I said, I don't know Torch too well.
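This can be checked without Torch by imitating the two padding variants in NumPy. Here `zero_pad_shift` is my own rough stand-in for nn.SpatialZeroPadding (positive arguments pad with zeros, negative arguments crop), and the `[::2, ::2]` slice stands in for the 1x1 average pooling with stride 2:

```python
import numpy as np

def zero_pad_shift(x, left, right, top, bottom):
    """Rough NumPy stand-in for nn.SpatialZeroPadding:
    positive values pad with zeros, negative values crop."""
    h, w = x.shape
    out = np.zeros((h + top + bottom, w + left + right))
    ys, xs = max(top, 0), max(left, 0)    # paste offset in the output
    yi, xi = max(-top, 0), max(-left, 0)  # crop offset in the input
    hh = min(h - yi, out.shape[0] - ys)
    ww = min(w - xi, out.shape[1] - xs)
    out[ys:ys + hh, xs:xs + ww] = x[yi:yi + hh, xi:xi + ww]
    return out

x = np.arange(16).reshape(4, 4) + 1  # a 4x4 "feature map": 1..16

# Current code: pad top/left, crop bottom/right, then stride-2 sampling.
# The first output row and column come out as zeros.
print(zero_pad_shift(x, 1, -1, 1, -1)[::2, ::2])

# Suggested variant: crop top/left, pad bottom/right.
print(zero_pad_shift(x, -1, 1, -1, 1)[::2, ::2])
```

On this toy input the current variant yields [[0, 0], [0, 6]] (a zeroed first row and column), while the suggested variant yields [[6, 8], [14, 16]], i.e. the bottom-right pixel of every 2x2 square with no zeros, which matches the concern raised above.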
Using the same hyper-parameter (22632), training converged at about epoch 1850.
After training several more times with other hyper-parameters, however, a run with the same hyper-parameter (22632) did not converge until beyond epoch 3500.
The two experiments used the same hyper-parameters and the same code.
The only difference in the second experiment, relative to the first, is that we had already trained several times beforehand.