xgastaldi / shake-shake
2.86% and 15.85% on CIFAR-10 and CIFAR-100
Home Page: https://arxiv.org/abs/1705.07485
License: BSD 3-Clause "New" or "Revised" License
Hi, I am trying to reproduce the CIFAR-100 result.
I am using the script from the README:
CUDA_VISIBLE_DEVICES=0,1 th main.lua -dataset cifar100 -depth 29 -baseWidth 64 -groups 4 -weightDecay 5e-4 -batchSize 32 -netType shakeshake -nGPU 2 -LR 0.025 -nThreads 8 -shareGradInput true -nEpochs 1800 -lrShape cosine -forwardShake true -backwardShake false -shakeImage true
I see that the top-1 error is still 99 at epoch 111, and I'm afraid it will not converge.
I am running an up-to-date Torch (built from source) with the necessary files copied from fb.resnet.torch, on Ubuntu 14.04 / CUDA 8 / cuDNN v4.
Should I just wait patiently, or are there any extra tricks to get faster convergence?
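For context, with `-lrShape cosine` the learning rate typically follows a half-cosine decay over the whole run, so at epoch 111 of 1800 it is still close to the initial 0.025 and early progress is naturally slow. A minimal sketch, assuming the standard annealing formula (`cosine_lr` is my own name, not from the repo):

```python
import math

def cosine_lr(base_lr, epoch, n_epochs):
    """Half-cosine decay from base_lr down to 0 over n_epochs
    (the usual formula behind options like -lrShape cosine)."""
    return 0.5 * base_lr * (1 + math.cos(math.pi * epoch / n_epochs))

# Early in an 1800-epoch run the rate has barely decayed:
print(round(cosine_lr(0.025, 111, 1800), 5))   # 0.02477
print(round(cosine_lr(0.025, 1700, 1800), 5))  # 0.00019
```

Most of the decay happens in the second half of training, which is why very long cosine runs look flat for hundreds of epochs before the error drops.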
Thank you for sharing your code!
Your result is the best one I have seen. I am now trying to improve the test error based on your code. Could you give me some suggestions on possible ways to improve test accuracy?
You can contact me at my private email: [email protected].
Thanks.
In the README.md you wrote: "Table 1: Error rates (%) on CIFAR-10 (Top 1 of the last epoch)".
However, the best error rate may not occur in the last epoch. For example, when the parameter "32296" was used, the top-1 error of the last epoch was 2.86, while the sixth epoch from the end reached 2.79.
Can I consider the best top-1 error to be 2.79?
How should I select the best top-1 error?
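One simple convention is to scan the per-epoch test errors and report the minimum together with the epoch that produced it. A hypothetical sketch (the function name and the list-of-errors log format are my own, not from the repo):

```python
def best_top1(errors):
    """Return (best error, 1-based epoch) over a list of
    per-epoch test top-1 errors."""
    best = min(errors)
    return best, errors.index(best) + 1

# Toy example: the best error need not occur in the last epoch.
errs = [5.10, 3.02, 2.79, 2.91, 2.86]
print(best_top1(errs))  # (2.79, 3)
```

Whether the paper-style "last epoch" number or this "best epoch" number is the right one to report is exactly the question being asked here.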
Hi,
When running the code, the model trains for one epoch without running out of memory, but while checkpointing for the first time it tries to make a copy on the GPU (not in CPU mode). I guess there is a reason for changing the default checkpointing from fb.resnet.torch. Can you please explain it?
Thanks
I think there are two approaches that might improve accuracy. Are these methods feasible?
The first method:
Split a validation set from the training set, then adjust the learning rate automatically based on validation accuracy.
The second method:
During training, once the test top-1 error drops below a fixed value, set the learning rate of all subsequent epochs to zero.
For example, running the code with -nEpochs 400, the log file looks like this:
epoch   test top 1   learning rate
370     3.62         0.02
...     ...          ...
400     3.82         0.00
In this log the best test top-1 was 3.62, reached at epoch 370.
If the learning rate between epochs 371 and 400 were set to zero, shouldn't the test top-1 of every epoch between 371 and 400 also be 3.62?
I experimented with this method and found that the test top-1 after epoch 371 still fluctuates slightly.
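The second method above can be sketched as a small scheduler. Everything here is hypothetical (class name, target value, and base schedule are my own); one possible reason the test error still drifts with a zero learning rate is that batch-norm running statistics continue to update during training-mode forward passes regardless of the optimizer's step size:

```python
class FreezeOnTarget:
    """Hypothetical scheduler for 'method 2': follow a base schedule until
    the test top-1 error first drops below `target`, then hold the LR at 0.
    Note: even at LR 0, batch-norm running statistics still update during
    training-mode forward passes, so the test error can keep drifting."""
    def __init__(self, base_schedule, target):
        self.base_schedule = base_schedule  # callable: epoch -> lr
        self.target = target
        self.frozen = False

    def lr(self, epoch, last_test_top1):
        if last_test_top1 is not None and last_test_top1 < self.target:
            self.frozen = True
        return 0.0 if self.frozen else self.base_schedule(epoch)

sched = FreezeOnTarget(lambda e: 0.02, target=3.65)
print(sched.lr(370, 3.70))  # 0.02 -- target not reached yet
print(sched.lr(371, 3.62))  # 0.0  -- 3.62 < 3.65, LR frozen
print(sched.lr(372, 3.80))  # 0.0  -- stays frozen afterwards
```

This only freezes the step size; it does not freeze the model, which may explain the residual fluctuation observed above.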
Can you give me some suggestions about the two methods above?
Have you compared adaptive learning-rate methods such as RMSprop or AdaDelta with SGD?
Thank you very much!
Could you please provide train logs for reference?
The column headers are listed as
Forward Backward Level 26 2x32d 26 2x64d 26 2x32d
in the README, but as
Forward Backward Level 26 2x32d 26 2x64d 26 2x96d
in the paper.
I changed shakeshakeblock.lua, then ran the code for 400 epochs.
The picture below shows the training log.
After the 400-epoch training, I ran:
CUDA_VISIBLE_DEVICES=0,1,2,3 th main.lua -dataset cifar10 -nGPU 4 -testOnly true -retrain ./checkpoints/model_best.t7
The result was "Results top1: 3.670 top5: 0.020".
Question 1: Why was the "testOnly" test top-1 lower than the test top-1 during training?
Question 2: What is the difference between the best test top-1, the last epoch's test top-1, and the "testOnly" top-1?
Question 3: Since the top-1 of 3.67 was produced by the saved model, can I consider that my model achieves a top-1 error of 3.67?
I have a small question (I do not know Torch).
In this code, when you make the skip connection while decreasing resolution:
-- Skip path #1
s1 = nn.Sequential()
s1:add(nn.SpatialAveragePooling(1, 1, stride, stride))
s1:add(Convolution(nInputPlane, nOutputPlane/2, 1,1, 1,1, 0,0))
-- Skip path #2
s2 = nn.Sequential()
-- Shift the tensor by one pixel right and one pixel down (to make the 2nd path "see" different pixels)
s2:add(nn.SpatialZeroPadding(1, -1, 1, -1))
s2:add(nn.SpatialAveragePooling(1, 1, stride, stride))
Skip path #1 will take the 'top left' pixel of each 2x2 square, if I understand correctly.
Skip path #2 will take the 'bottom right' pixel of each 2x2 square. Here is the point I do not understand.
If this code first prepends zeros to the top and left of the feature map, then after downsampling it will have many pixels with value 0. It would be better to append zeros to the right and bottom and remove the first row and first column, i.e. use s2:add(nn.SpatialZeroPadding(-1, 1, -1, 1)) instead of s2:add(nn.SpatialZeroPadding(1, -1, 1, -1)).
Tell me whether it currently works as I described; I might be wrong because, as I said, I don't know Torch too well.
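This can be checked without Torch by imitating the two padding variants in NumPy. Here `zero_pad_shift` is my own rough stand-in for nn.SpatialZeroPadding (positive arguments pad with zeros, negative arguments crop), and the `[::2, ::2]` slice stands in for the 1x1 average pooling with stride 2:

```python
import numpy as np

def zero_pad_shift(x, left, right, top, bottom):
    """Rough NumPy stand-in for nn.SpatialZeroPadding:
    positive values pad with zeros, negative values crop."""
    h, w = x.shape
    out = np.zeros((h + top + bottom, w + left + right))
    ys, xs = max(top, 0), max(left, 0)    # paste offset in the output
    yi, xi = max(-top, 0), max(-left, 0)  # crop offset in the input
    hh = min(h - yi, out.shape[0] - ys)
    ww = min(w - xi, out.shape[1] - xs)
    out[ys:ys + hh, xs:xs + ww] = x[yi:yi + hh, xi:xi + ww]
    return out

x = np.arange(16).reshape(4, 4) + 1  # a 4x4 "feature map": 1..16

# Current code: pad top/left, crop bottom/right, then stride-2 sampling.
# The first output row and column come out as zeros.
print(zero_pad_shift(x, 1, -1, 1, -1)[::2, ::2])

# Suggested variant: crop top/left, pad bottom/right.
print(zero_pad_shift(x, -1, 1, -1, 1)[::2, ::2])
```

On this toy input the current variant yields [[0, 0], [0, 6]] (a zeroed first row and column), while the suggested variant yields [[6, 8], [14, 16]], i.e. the bottom-right pixel of every 2x2 square with no zeros, which matches the concern raised above.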
Using the same hyper-parameter (22632), training converged at about epoch 1850.
After training several more times with other hyper-parameters, however, a run with the same hyper-parameter (22632) did not converge until beyond epoch 3500.
The two experiments used the same hyper-parameters and the same code.
The only difference in the second experiment, relative to the first, is that we had already trained several times beforehand.