
imagenet-multigpu.torch's People

Contributors

anuragranj, atcold, bobbens, jhjin, leetaewoo, ljk628, lukacf, mfigurnov, ppwwyyxx, smerity, soumith, szagoruyko, tomsercu, windscope


imagenet-multigpu.torch's Issues

Multi Label Example

Hi,
This example framework has been super useful! Is it possible to showcase a classifier where each photo has one or more appropriate labels (not per pixel, but overall)?

It would be a departure from the 1 folder = 1 class approach; perhaps instead a text file could be supplied, in the form:
imageFileName 1 2 3 4 5 ... where each number indicates an appropriate class.
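
A minimal sketch of how such a file could be parsed into multi-hot targets (the file name, nClasses, and the exact format are assumptions based on the proposal above):

local nClasses = 1000  -- assumption: number of classes in your dataset
local targets = {}
for line in io.lines('labels.txt') do   -- hypothetical label file
   local iter = line:gmatch('%S+')
   local file = iter()                  -- first field: image file name
   local t = torch.zeros(nClasses)
   for c in iter do                     -- remaining fields: class indices
      t[tonumber(c)] = 1
   end
   targets[file] = t
end
-- multi-hot targets like these pair naturally with a Sigmoid output layer
-- and nn.BCECriterion instead of LogSoftMax + ClassNLLCriterion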

Thanks again and keep up the good work!

Different performance when model is reloaded

Hi
I am training AlexNet on my own image data: approximately 10,000 images and 3 classes. I ran the training procedure and saved the model at the end of every epoch. The train and test log files show accuracies of > 90%, but when I load the model and test it on the same training and testing data I get very poor results.

img = trainHook(base_dir .. file)
preds = model:forward(img:cuda())
_, pred_sorted = preds:sort(true)
predictions[file] = pred_sorted[1]

trainHook takes care of cropping and mean/std normalization.
Do you have any idea why this might happen? I can provide more information if anything is unclear.
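
A possible cause (an assumption, not stated in the post): if the saved model contains dropout or batch normalization, it must be switched to evaluation mode before testing, and forward expects a batch dimension. A minimal sketch, with a hypothetical checkpoint name:

model = torch.load('model_9.t7')  -- hypothetical checkpoint file
model:evaluate()                  -- put dropout/batchnorm into inference mode
img = trainHook(base_dir .. file)
if img:dim() == 3 then
   img = img:view(1, img:size(1), img:size(2), img:size(3))  -- add a batch dimension
end
preds = model:forward(img:cuda())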

low accuracy on alexnetowtbn

I trained the AlexNet model with batch normalization (alexnetowtbn) with 4 GPUs and batchSize 256. After 50 epochs my top-1 accuracy is 45%. I couldn't find any published results for AlexNet trained with batch normalization. Is this number OK? It seems much lower than the 57% reported in Caffe.

confused about GPU-util

Hi,
I tried to run googlenet_cudnn and found that the GPU utilization of my 4 K20s drops to 0% sometimes.

At first, I thought it might be due to slow I/O and I transferred the whole imagenet dataset to SSD and increased the number of threads for data loading with -nDonkeys 10.

But the problem still happens. However, GPU utilization for VGG-A is quite stable, as is AlexNet's.

Any thoughts? Thank you very much!

When do we need to use cutorch.synchronize

@soumith, I noticed that cutorch.synchronize is called in many places in train.lua. Is it necessary to call cutorch.synchronize in all of these places? The DataParallelTable module already calls cutorch.synchronize to handle synchronization, e.g. in its updateOutput method. Can you give me some tips about this? Much appreciated.
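
For reference, one case where an explicit synchronize is genuinely needed is timing, since CUDA kernel launches are asynchronous; a minimal sketch (model and input are assumed to already live on the GPU):

cutorch.synchronize()         -- drain previously queued kernels
local timer = torch.Timer()
model:forward(input)
cutorch.synchronize()         -- make sure the forward pass has actually finished
print(timer:time().real)      -- wall-clock seconds for the forward pass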

Which Class Labels Map To Which Indices In the Output Tensor?

I have a map_clsloc.txt file from ILSVRC2015 and was wondering whether each index here matches the index in the 1000-dimensional output tensor:

n02110185 3 Siberian_husky
n02096294 4 Australian_terrier
n02102040 5 English_springer
n02066245 6 grey_whale
n02509815 7 lesser_panda
n02124075 8 Egyptian_cat
n02417914 9 ibex

Thanks!

Too much time per epoch

Hey Soumith,
I am training AlexNet on a single GPU (with the nn backend). A single epoch takes around 7 hours. Is this much time per epoch normal? Also, would switching to the cudnn backend make much of a difference with respect to time per epoch?

A bug in vggbn_cudnn.lua

In lines 43 and 46:

features:add(nn.BatchNormalization(4096, 1e-3))
features:add(nn.BatchNormalization(4096, 1e-3))

I think it should be classifier rather than features, because the whole fully-connected block lives on GPU 1 (or the preferred one).
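
In other words, the proposed fix would read (a sketch of the suggestion above, not a tested patch):

classifier:add(nn.BatchNormalization(4096, 1e-3))
classifier:add(nn.BatchNormalization(4096, 1e-3))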

random loading data sequence

Technically this is not an issue, but I would feel better if this were fixed.
The possible problem lies in dataset.lua, around line 105.
The function dir.getdirectories(path) seems to return the list of directories ordered by modification time.
I would prefer to sort the names so that the list of classes and their indices are fixed.
This is especially important if you want to load a pre-trained model, for example VGG-16, because most of them arrange the classes and their indices in alphabetical order.
Doing a sort before line 107 should solve the problem.
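
A minimal sketch of the suggested sort (the variable name is an assumption about dataset.lua's internals):

local dirs = dir.getdirectories(path)
table.sort(dirs)  -- fix the class-to-index mapping regardless of modification times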

Some help debugging?

After some time I get this:

Epoch: [6][4563/10000]  Time 0.425 Err 6.8380 Top1-%: 0.39 LR 1e-02 DataLoadingTime 0.006
Epoch: [6][4564/10000]  Time 0.440 Err 6.8348 Top1-%: 0.78 LR 1e-02 DataLoadingTime 0.007
Epoch: [6][4565/10000]  Time 0.438 Err 6.7681 Top1-%: 1.56 LR 1e-02 DataLoadingTime 0.007
Epoch: [6][4566/10000]  Time 0.426 Err 6.7398 Top1-%: 1.56 LR 1e-02 DataLoadingTime 0.005
Epoch: [6][4567/10000]  Time 0.423 Err 6.7808 Top1-%: 1.17 LR 1e-02 DataLoadingTime 0.004
/usr/local/bin/luajit: /usr/local/share/lua/5.1/threads/threads.lua:255: 
[thread 22 callback] bad argument #2 to '?' (out of range at /tmp/luarocks_torch-scm-1-9679/torch7/generic/Tensor.c:880)
stack traceback:
        [C]: in function 'error'
        /usr/local/share/lua/5.1/threads/threads.lua:255: in function 'synchronize'
        /usr/local/share/lua/5.1/threads/threads.lua:196: in function 'addjob'
        /home/atcold/Work/GitHub/multiGPU-train/train.lua:99: in function 'train'
        main.lua:38: in main chunk
        [C]: in function 'dofile'
        /usr/local/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
        [C]: at 0x00406260

Then, after restarting,

Loading model from file: results-5k10k/20151104-162101-alexnetowtbn,batchSize=256,nDonkeys=24,nGPU=4,netType=alexnetowtbn,normalize=f/model_5.t7
==> Converting model to CUDA
Loading optimState from file: results-5k10k/20151104-162101-alexnetowtbn,batchSize=256,nDonkeys=24,nGPU=4,netType=alexnetowtbn,normalize=f/optimState_5.t7
==> doing epoch on training data:
==> online epoch # 6
/usr/local/bin/luajit: /usr/local/share/lua/5.1/threads/threads.lua:255: 
[thread 16 endcallback] /usr/local/share/lua/5.1/nn/Module.lua:70: Assertion `THCudaTensor_checkGPU(state, 1, self_)' failed.  at /tmp/luarocks_cutorch-scm-1-4237/cutorch/lib/THC/THCTensorMath.cu:30
[thread 8 endcallback] /usr/local/share/lua/5.1/nn/Module.lua:70: Assertion `THCudaTensor_checkGPU(state, 1, self_)' failed.  at /tmp/luarocks_cutorch-scm-1-4237/cutorch/lib/THC/THCTensorMath.cu:30
[thread 1 endcallback] /usr/local/share/lua/5.1/nn/Module.lua:70: Assertion `THCudaTensor_checkGPU(state, 1, self_)' failed.  at /tmp/luarocks_cutorch-scm-1-4237/cutorch/lib/THC/THCTensorMath.cu:30
[thread 6 endcallback] /usr/local/share/lua/5.1/nn/Module.lua:70: Assertion `THCudaTensor_checkGPU(state, 1, self_)' failed.  at /tmp/luarocks_cutorch-scm-1-4237/cutorch/lib/THC/THCTensorMath.cu:30
[thread 13 endcallback] /usr/local/share/lua/5.1/nn/Module.lua:70: Assertion `THCudaTensor_checkGPU(state, 1, self_)' failed.  at /tmp/luarocks_cutorch-scm-1-4237/cutorch/lib/THC/THCTensorMath.cu:30
[thread 2 endcallback] /usr/local/share/lua/5.1/nn/Module.lua:70: Assertion `THCudaTensor_checkGPU(state, 1, self_)' failed.  at /tmp/luarocks_cutorch-scm-1-4237/cutorch/lib/THC/THCTensorMath.cu:30

and so on.

Newbie question: Is my model making any progress

Sorry if this is a stupid question; this is my first time training on ImageNet. In the past I only worked on smaller datasets such as CIFAR-10, where a few iterations through the dataset got to 60% accuracy.

After 10 epochs I still see that the top-1 accuracy is close to 0 for most batches (see log below), and the loss is still at 6.9 and not really going down. Is my model actually learning, and just slow because ImageNet is so huge, or is something wrong with my setup? Any help is appreciated.

Epoch: [11][1383/10000] Time 0.675 Err 6.9076 Top1-%: 0.00 LR 1e-02 DataLoadingTime 0.015
Epoch: [11][1384/10000] Time 0.675 Err 6.9091 Top1-%: 0.00 LR 1e-02 DataLoadingTime 0.014
Epoch: [11][1385/10000] Time 0.672 Err 6.9048 Top1-%: 1.56 LR 1e-02 DataLoadingTime 0.016
Epoch: [11][1386/10000] Time 0.678 Err 6.9069 Top1-%: 0.00 LR 1e-02 DataLoadingTime 0.014
Epoch: [11][1387/10000] Time 0.674 Err 6.9074 Top1-%: 0.00 LR 1e-02 DataLoadingTime 0.016
Epoch: [11][1388/10000] Time 0.672 Err 6.9080 Top1-%: 0.00 LR 1e-02 DataLoadingTime 0.014
Epoch: [11][1389/10000] Time 0.674 Err 6.9067 Top1-%: 1.56 LR 1e-02 DataLoadingTime 0.015
Epoch: [11][1390/10000] Time 0.675 Err 6.9071 Top1-%: 0.00 LR 1e-02 DataLoadingTime 0.015
Epoch: [11][1391/10000] Time 0.674 Err 6.9045 Top1-%: 0.00 LR 1e-02 DataLoadingTime 0.015
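
Note that for 1000 classes a cross-entropy loss of ~6.9 is exactly chance level, since -ln(1/1000) ≈ 6.91, which can be checked directly; a loss pinned at that value means the network has not yet learned anything beyond chance:

th> print(math.log(1000))
6.9077552789821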

Wrong assertion in AbstractParallel.lua

I think in AbstractParallel.lua, lines 140 to 141, the assertion should be

assert(self.input_gpu[gpuid]:getDevice() == gpuid)

instead of

assert(self.input_gpu[gpuid]:getDevice() == self.gpu_assignments[gpuid])

training on multiple GPUs gives nan

Hi,
I set up the code on my computer.
When I use one GPU it seems fine and there is no problem:

th main.lua

==> doing epoch on training data:
==> online epoch # 1
Epoch: [1][1/5636] Time 2.380 Err 3.9442 Top1-%: 2.34 LR 1e-02 DataLoadingTime 1.595
Epoch: [1][2/5636] Time 0.470 Err 3.6877 Top1-%: 21.09 LR 1e-02 DataLoadingTime 0.366
Epoch: [1][3/5636] Time 0.454 Err 3.2735 Top1-%: 28.12 LR 1e-02 DataLoadingTime 0.365
Epoch: [1][4/5636] Time 0.451 Err 3.2096 Top1-%: 25.78 LR 1e-02 DataLoadingTime 0.365
Epoch: [1][5/5636] Time 0.442 Err 2.8022 Top1-%: 33.59 LR 1e-02 DataLoadingTime 0.368
Epoch: [1][6/5636] Time 0.448 Err 3.0409 Top1-%: 24.22 LR 1e-02 DataLoadingTime 0.368
Epoch: [1][7/5636] Time 0.446 Err 2.7138 Top1-%: 32.81 LR 1e-02 DataLoadingTime 0.365
Epoch: [1][8/5636] Time 0.449 Err 2.7420 Top1-%: 25.78 LR 1e-02 DataLoadingTime 0.366
Epoch: [1][9/5636] Time 0.449 Err 2.7148 Top1-%: 22.66 LR 1e-02 DataLoadingTime 0.366
Epoch: [1][10/5636] Time 0.436 Err 2.6244 Top1-%: 19.53 LR 1e-02 DataLoadingTime 0.367
Epoch: [1][11/5636] Time 0.444 Err 2.7249 Top1-%: 17.19 LR 1e-02 DataLoadingTime 0.368
Epoch: [1][12/5636] Time 0.443 Err 2.5064 Top1-%: 27.34 LR 1e-02 DataLoadingTime 0.366

but when I try to use more than one GPU the loss becomes nan, and the runtime gets worse:

th main.lua -nGPU 2

Epoch: [1][1/5636] Time 2.001 Err nan Top1-%: 2.34 LR 1e-02 DataLoadingTime 2.173
Epoch: [1][2/5636] Time 2.416 Err nan Top1-%: 2.34 LR 1e-02 DataLoadingTime 0.366
Epoch: [1][3/5636] Time 2.416 Err nan Top1-%: 0.78 LR 1e-02 DataLoadingTime 0.368
Epoch: [1][4/5636] Time 2.416 Err nan Top1-%: 3.12 LR 1e-02 DataLoadingTime 0.367
Epoch: [1][5/5636] Time 2.416 Err nan Top1-%: 2.34 LR 1e-02 DataLoadingTime 0.369

Changing GPU device id to >1 gives an error.

Getting this bug when changing the GPU device id from 1 to 2 or 3 (I have 3 GPUs on the same machine): https://github.com/soumith/imagenet-multiGPU.torch/blob/master/opts.lua#L28
Leaving it at 1 works fine.

(with GPU=2, nGPU=2)
==> doing epoch on training data:
==> online epoch # 1
Debugging session completed (traced 3 instructions).
/home/mf/Toolkits/torch/install/bin/luajit: /home/mf/Toolkits/torch/install/share/lua/5.1/nn/Module.lua:70: Assertion `THCudaTensor_checkGPU(state, 1, self_)' failed. at /home/mf/Toolkits/torch/extra/cutorch/lib/THC/THCTensorMath.cu:30
stack traceback:
[C]: in function 'zero'
/home/mf/Toolkits/torch/install/share/lua/5.1/nn/Module.lua:70: in function 'zeroGradParameters'
...mf/Toolkits/torch/install/share/lua/5.1/nn/Container.lua:30: in function 'func'
...mf/Toolkits/torch/install/share/lua/5.1/nn/Container.lua:25: in function 'applyToModules'
...mf/Toolkits/torch/install/share/lua/5.1/nn/Container.lua:30: in function 'zeroGradParameters'
...mf/Toolkits/torch/install/share/lua/5.1/nn/Container.lua:30: in function 'func'
...mf/Toolkits/torch/install/share/lua/5.1/nn/Container.lua:25: in function 'applyToModules'
...mf/Toolkits/torch/install/share/lua/5.1/nn/Container.lua:30: in function 'zeroGradParameters'
...s/torch/install/share/lua/5.1/cunn/DataParallelTable.lua:458: in function 'zeroGradParameters'
...mf/Toolkits/torch/install/share/lua/5.1/nn/Container.lua:30: in function 'func'
...mf/Toolkits/torch/install/share/lua/5.1/nn/Container.lua:25: in function 'applyToModules'
...mf/Toolkits/torch/install/share/lua/5.1/nn/Container.lua:30: in function 'zeroGradParameters'
/home/mf/Toolkits/Codigo/imagenet-multiGPU.torch/train.lua:167: in function 'opfunc'
/home/mf/Toolkits/torch/install/share/lua/5.1/optim/sgd.lua:44: in function 'sgd'
/home/mf/Toolkits/Codigo/imagenet-multiGPU.torch/train.lua:174: in function 'f2'
/home/mf/Toolkits/Codigo/imagenet-multiGPU.torch/data.lua:36: in function 'addjob'
/home/mf/Toolkits/Codigo/imagenet-multiGPU.torch/train.lua:97: in function 'train'
/home/mf/Toolkits/Codigo/imagenet-multiGPU.torch/main.lua:45: in main chunk
[C]: in function 'dofile'
...kits/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
[C]: at 0x00406670
Program completed in 28.99 seconds (pid: 6719).

(with GPU=2, nGPU=1)
==> doing epoch on training data:
==> online epoch # 1
/home/mf/Toolkits/torch/install/bin/luajit: /home/mf/Toolkits/torch/install/share/lua/5.1/nn/THNN.lua:177: Assertion `THCudaTensor_checkGPU( state, 4, input, target, output, total_weight )' failed. at /home/mf/Toolkits/torch/extra/cunn/lib/THCUNN/ClassNLLCriterion.cu:123
stack traceback:
[C]: in function 'v'
/home/mf/Toolkits/torch/install/share/lua/5.1/nn/THNN.lua:177: in function 'ClassNLLCriterion_updateOutput'
...its/torch/install/share/lua/5.1/nn/ClassNLLCriterion.lua:41: in function 'forward'
/home/mf/Toolkits/Codigo/imagenet-multiGPU.torch/train.lua:169: in function 'opfunc'
/home/mf/Toolkits/torch/install/share/lua/5.1/optim/sgd.lua:44: in function 'sgd'
/home/mf/Toolkits/Codigo/imagenet-multiGPU.torch/train.lua:174: in function 'f2'
/home/mf/Toolkits/Codigo/imagenet-multiGPU.torch/data.lua:36: in function 'addjob'
/home/mf/Toolkits/Codigo/imagenet-multiGPU.torch/train.lua:97: in function 'train'
/home/mf/Toolkits/Codigo/imagenet-multiGPU.torch/main.lua:45: in main chunk
[C]: in function 'dofile'
...kits/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
[C]: at 0x00406670
Program completed in 4.10 seconds (pid: 11801).

NOTE: Torch is up to date, and the repo is a fresh clone of the master branch.

Data loading time

One question: this is what my log looks like.

=> Criterion
nn.ClassNLLCriterion
==> Converting model to CUDA
==> doing epoch on training data:
==> online epoch # 1
Epoch: [1][1/10000]     Time 6.079 Err 8.5379 Top1-%: 0.20 LR 1e-02 DataLoadingTime 67.312
Epoch: [1][2/10000]     Time 1.972 Err 8.5480 Top1-%: 0.00 LR 1e-02 DataLoadingTime 0.043
Epoch: [1][3/10000]     Time 1.968 Err 8.5506 Top1-%: 0.00 LR 1e-02 DataLoadingTime 0.054
Epoch: [1][4/10000]     Time 1.957 Err 8.5445 Top1-%: 0.00 LR 1e-02 DataLoadingTime 0.055
Epoch: [1][5/10000]     Time 1.979 Err 8.5556 Top1-%: 0.00 LR 1e-02 DataLoadingTime 0.040
Epoch: [1][6/10000]     Time 1.932 Err 8.5436 Top1-%: 0.20 LR 1e-02 DataLoadingTime 0.046
Epoch: [1][7/10000]     Time 1.973 Err 8.5321 Top1-%: 0.00 LR 1e-02 DataLoadingTime 0.041
Epoch: [1][8/10000]     Time 1.955 Err 8.5400 Top1-%: 0.10 LR 1e-02 DataLoadingTime 0.033
Epoch: [1][9/10000]     Time 1.937 Err 8.5451 Top1-%: 0.00 LR 1e-02 DataLoadingTime 0.033
Epoch: [1][10/10000]    Time 1.963 Err 8.5365 Top1-%: 0.10 LR 1e-02 DataLoadingTime 0.029
Epoch: [1][11/10000]    Time 2.195 Err 8.5423 Top1-%: 0.00 LR 1e-02 DataLoadingTime 21.177
Epoch: [1][12/10000]    Time 1.960 Err 8.5410 Top1-%: 0.00 LR 1e-02 DataLoadingTime 0.048
Epoch: [1][13/10000]    Time 1.939 Err 8.5555 Top1-%: 0.00 LR 1e-02 DataLoadingTime 0.036
Epoch: [1][14/10000]    Time 1.969 Err 8.5508 Top1-%: 0.00 LR 1e-02 DataLoadingTime 0.037
Epoch: [1][15/10000]    Time 1.976 Err 8.5580 Top1-%: 0.00 LR 1e-02 DataLoadingTime 0.046
Epoch: [1][16/10000]    Time 1.947 Err 8.5506 Top1-%: 0.00 LR 1e-02 DataLoadingTime 0.036
Epoch: [1][17/10000]    Time 2.032 Err 8.5355 Top1-%: 0.00 LR 1e-02 DataLoadingTime 0.034
Epoch: [1][18/10000]    Time 2.012 Err 8.5335 Top1-%: 0.10 LR 1e-02 DataLoadingTime 0.050
Epoch: [1][19/10000]    Time 1.990 Err 8.5225 Top1-%: 0.10 LR 1e-02 DataLoadingTime 0.043
Epoch: [1][20/10000]    Time 1.983 Err 8.5323 Top1-%: 0.10 LR 1e-02 DataLoadingTime 0.048
Epoch: [1][21/10000]    Time 2.193 Err 8.5370 Top1-%: 0.10 LR 1e-02 DataLoadingTime 24.326
Epoch: [1][22/10000]    Time 2.027 Err 8.5282 Top1-%: 0.20 LR 1e-02 DataLoadingTime 0.049
Epoch: [1][23/10000]    Time 1.904 Err 8.5333 Top1-%: 0.00 LR 1e-02 DataLoadingTime 0.041
Epoch: [1][24/10000]    Time 1.983 Err 8.5305 Top1-%: 0.10 LR 1e-02 DataLoadingTime 0.065
Epoch: [1][25/10000]    Time 1.954 Err 8.5322 Top1-%: 0.00 LR 1e-02 DataLoadingTime 0.056
Epoch: [1][26/10000]    Time 1.947 Err 8.5401 Top1-%: 0.00 LR 1e-02 DataLoadingTime 0.037
Epoch: [1][27/10000]    Time 1.964 Err 8.5405 Top1-%: 0.00 LR 1e-02 DataLoadingTime 0.043
Epoch: [1][28/10000]    Time 2.014 Err 8.5287 Top1-%: 0.00 LR 1e-02 DataLoadingTime 0.046
Epoch: [1][29/10000]    Time 1.997 Err 8.5278 Top1-%: 0.00 LR 1e-02 DataLoadingTime 0.050
Epoch: [1][30/10000]    Time 2.017 Err 8.5354 Top1-%: 0.00 LR 1e-02 DataLoadingTime 0.032
Epoch: [1][31/10000]    Time 2.215 Err 8.5288 Top1-%: 0.00 LR 1e-02 DataLoadingTime 25.476

There's a ~20 s loading time every 10 lines (I'm using 10 donkeys). Is this 'OK'?
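
This pattern is consistent with data loading being the bottleneck: with 10 donkeys, every loader thread hands over one pre-loaded batch quickly, and roughly every 10th iteration the main loop has to wait for the next round of batches. If so, faster storage or more loader threads would be expected to smooth it out, e.g. (the thread count here is just an illustration):

th main.lua -nDonkeys 16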

Changing data path at the time of re-training doesn't make any difference !!!

I started my training with data path /home/praveer/raid/imagenet and ran it for 9 epochs. Then I stopped and restarted, this time with the path /home/praveer/ssd/imagenet. For the first few epochs I still had the raid folder mounted, so I didn't encounter any problem, but as soon as I unmounted the raid it could not find the imagenet folder. I have re-checked the data path (given as -data on the command line) and it seems fine. I initially thought that working with the SSD was simply as slow as the HDD, but in fact the data loader had kept reading from the HDD instead of switching to the SSD.
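
A plausible explanation (an assumption, not confirmed in the thread): the dataset metadata, including absolute image paths, is cached (note the "Loading train metadata from cache" messages elsewhere in these issues), so changing -data alone does not re-index the images. Pointing -cache at a fresh directory (the path below is hypothetical) should force a re-scan:

th main.lua -data /home/praveer/ssd/imagenet -cache /home/praveer/new_cache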

png file in ILSVRC12

The n02105855_2933.JPEG in ILSVRC12_img_train.tar is actually a PNG file. This is also reported here: https://github.com/cytsai/ilsvrc-cmyk-image-list

This crashes the image loader, because image.load uses the file extension to determine the format.

Did you manually convert it to JPEG? Maybe add a note to the README to help others avoid this.
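
For anyone hitting this, re-encoding the file in place with ImageMagick should work, since convert detects the real format from the file contents and writes the format implied by the output extension (an illustration, not tested against this exact file):

convert n02105855_2933.JPEG n02105855_2933.JPEG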

CPU version

Is there a CPU backend for this ImageNet code in Torch?

GoogLeNet not training

Hi guys,
I'm having a very weird issue. AlexNet trains fine, but when I change the model to GoogLeNet, the network never seems to train.
I originally tried with a small dataset I already had, and even after many epochs the classification results were still random.
Currently I am testing with ImageNet and see the same problem. I am using batch size 60 and epoch size 10000. Of course I should probably wait a few more epochs, but after 1-2 epochs the loss has not moved away from 6.9 and the accuracy is virtually 0.
I have also tried several other implementations of GoogLeNet, in case there is a problem with this particular one, but to no avail.
Any idea what the problem might be? Could it be that data augmentation is a prerequisite or something?

std calculation

I don't think this matters much, but you're calculating the per-channel std without the global mean you precalculated (donkey.lua, line 193). This is incorrect; but assuming you wanted to do it like this, you could collapse the mean and std loops into a single one instead of doing two. The proper way would be to calculate the std using the previously calculated mean. However, I don't think this will change much in practice.
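
A minimal sketch of the proper computation (the names are assumptions: imgs is a table of sampled 3xHxW float tensors and mean[i] the precomputed per-channel mean):

local std = {}
for i = 1, 3 do
   local sqsum, n = 0, 0
   for _, img in ipairs(imgs) do
      local d = img[i]:clone():add(-mean[i])  -- deviation from the global mean
      sqsum = sqsum + d:pow(2):sum()
      n = n + d:nElement()
   end
   std[i] = math.sqrt(sqsum / n)
end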

vggbn training's validation accuracy is very low

I was trying to train vggbn (16-layer, model 'D') using batchSize = 24 with everything else at the default settings, but training does not converge at all; at the 5th iteration the training loss is still ~6.9.

Then I tried adding nn.SpatialBatchNormalization after each convolutional layer and restarted training; the loss finally started to move. However, the validation accuracies are very low. These are my results for the first 8 iterations:

top-1 accuracy % (test set, center crop)    avg loss (test set)
2.8140e+00 5.9005e+00
8.1700e+00 4.9847e+00
1.1460e+01 4.6748e+00
1.4304e+01 4.3753e+00
1.8798e+01 3.9825e+00
2.2012e+01 3.7526e+00
2.2900e+01 3.7133e+00

Any thoughts?

Train Alexnet with 4 GPUs seems slower than one

Hi.

I have 4 GPUs on my machine, connected via PCIe x16. When I train the AlexNetOWT model with nGPU=4, one batch takes 2.3 s; but with nGPU=1, one batch takes only 0.48 s. Does this look correct?

The backend is cudnn, and nDonkeys=32 since there are 32 CPU cores; everything else uses the default parameters.

Thanks

Class has zero samples

Hello,
I am getting this error when I try to run the program. I have the training data and validation data in place in the appropriate folders. Can you help me resolve this problem?

opts.backend not supported, breaks model.lua

In opts.lua:

cmd:option('-backend',     'cudnn', 'Options: cudnn | ccn2 | cunn')

selecting any option other than cudnn breaks in model.lua:

if opt.backend == 'cudnn' then
   require 'cudnn'
   cudnn.convert(model, cudnn)
elseif opt.backend ~= 'nn' then
   error'Unsupported backend'
end

Since this check only accepts 'cudnn' (or 'nn'), the 'ccn2' and 'cunn' options fail. We should consider updating either opts.lua or model.lua.

Predict bug: attempt to call global 'testHook' (a nil value)

I tried this bit of code from the README and got attempt to call global 'testHook' (a nil value). So I changed local testHook = function(self, path) to testHook = function(self, path) and it works. Just posting here to report that the README example might be wrong.

dofile('donkey.lua')
img = testHook({loadSize}, 'test.jpg')
model = torch.load('model_10.t7')
if img:dim() == 3 then
  img = img:view(1, img:size(1), img:size(2), img:size(3))
end
predictions = model:forward(img:cuda())

Code does not resize images to 256x256

Hi,
I want to train AlexNet on ImageNet dataset. I preprocessed the original images with the following command given in the Readme file.

find . -name "*.JPEG" | xargs -I {} convert {} -resize "256^>" {}

Then I got images with their smaller side resized to 256. I was expecting the images to then be resized to 256x256 and cropped to 224x224, as written in the AlexNet paper. But the resize to 256x256 does not take place in the loadImage() function in donkey.lua: image.scale() keeps the aspect ratio, and since the images were already pre-resized, its output has the same size as its input. So training is done on 224x224 crops taken from 256xN images, where N >= 256.

Is this a bug, or is it intended to be this way?
I have included the loadImage function at the end of this message for reference.

Best,
Serdar

opt.imageSize = 256
opt.cropSize = 224

local loadSize = {3, opt.imageSize, opt.imageSize}
local sampleSize = {3, opt.cropSize, opt.cropSize}

local function loadImage(path)
   local input = image.load(path, 3, 'float')
   -- find the smaller dimension, and resize it to loadSize (while keeping aspect ratio)
   if input:size(3) < input:size(2) then
      input = image.scale(input, loadSize[2], loadSize[3] * input:size(2) / input:size(3))
   else
      input = image.scale(input, loadSize[2] * input:size(3) / input:size(2), loadSize[3])
   end
   return input
end

Vggbn

Hi,
Thanks for the great script.
There is a problem in the vggbn model, in the classifier part. Shouldn't

classifier:add(nn.Linear(4096, 1000))

be replaced with:

classifier:add(nn.Linear(4096, nrclasses))

memory usage of GPU0 is doubled when using multi-GPU training

Hi,
I am new to Torch, and I tried to train a VGG model using models/vgg_cudnn.lua. I used 4 K20s and found that the memory usage of GPU0 is about double (4199 MiB) compared with the others (2495 MiB each). And, as the comment in models/vgg_cudnn.lua warns, I did run out of memory using VGG-D.
Can you give me any advice on this?
Thank you very much in advance.

Threaded data loading

Looking at the addjob call here, it seems that each thread loads a batch, which is then sent to the main thread to be passed through the model.
However, this happens in a multi-GPU setting, with data and model parallelism.
In a single-GPU setting, if I want one thread to load and augment a batch while another passes the previous batch through the model, how do I go about implementing that?
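
One way to get that producer/consumer overlap with the threads package is a single loader thread whose endcallback runs the forward pass on the main thread; a minimal sketch (loadAndAugmentBatch is a hypothetical function you would supply, and model is your already-built network):

local threads = require 'threads'
local pool = threads.Threads(1, function() require 'torch' end)  -- one loader thread

local nBatches = 100  -- assumption: batches per epoch
for i = 1, nBatches do
   pool:addjob(
      function() return loadAndAugmentBatch(i) end,  -- runs in the loader thread
      function(batch)                                -- runs in the main thread
         model:forward(batch:cuda())
      end
   )
end
pool:synchronize()  -- wait for all jobs and endcallbacks to finish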

nGPU=4 is slower than nGPU=1

Hi,

Do you know why nGPU=4 is still slower than nGPU=1 in the results below?

-nGPU 1 -batchSize 128
Epoch: [1][2/10000] Time 0.835 Err 6.9080 Top1-%: 0.00 LR 1e-02 DataLoadingTime 0.005
Epoch: [1][3/10000] Time 0.834 Err 6.9073 Top1-%: 0.00 LR 1e-02 DataLoadingTime 3.233
Epoch: [1][4/10000] Time 0.834 Err 6.9104 Top1-%: 0.00 LR 1e-02 DataLoadingTime 0.003
Epoch: [1][5/10000] Time 0.834 Err 6.9075 Top1-%: 0.00 LR 1e-02 DataLoadingTime 2.871
Epoch: [1][6/10000] Time 0.836 Err 6.9064 Top1-%: 0.00 LR 1e-02 DataLoadingTime 0.003
Epoch: [1][7/10000] Time 0.833 Err 6.9077 Top1-%: 0.00 LR 1e-02 DataLoadingTime 2.776

-nGPU 4 -batchSize 512
Epoch: [1][2/10000] Time 3.915 Err 6.9070 Top1-%: 0.00 LR 1e-02 DataLoadingTime 0.003
Epoch: [1][3/10000] Time 4.449 Err 6.9081 Top1-%: 0.00 LR 1e-02 DataLoadingTime 11.843
Epoch: [1][4/10000] Time 3.906 Err 6.9079 Top1-%: 0.20 LR 1e-02 DataLoadingTime 0.005
Epoch: [1][5/10000] Time 3.898 Err 6.9078 Top1-%: 0.20 LR 1e-02 DataLoadingTime 7.108
Epoch: [1][6/10000] Time 3.902 Err 6.9079 Top1-%: 0.20 LR 1e-02 DataLoadingTime 0.005

-nGPU 8 -batchSize 1024
Epoch: [1][2/10000] Time 7.186 Err 6.9080 Top1-%: 0.20 LR 1e-02 DataLoadingTime 0.006
Epoch: [1][3/10000] Time 6.947 Err 6.9079 Top1-%: 0.10 LR 1e-02 DataLoadingTime 24.149
Epoch: [1][4/10000] Time 6.724 Err 6.9080 Top1-%: 0.00 LR 1e-02 DataLoadingTime 0.007
Epoch: [1][5/10000] Time 7.773 Err 6.9080 Top1-%: 0.10 LR 1e-02 DataLoadingTime 16.892
Epoch: [1][6/10000] Time 6.731 Err 6.9081 Top1-%: 0.00 LR 1e-02 DataLoadingTime 0.007

Thank you.

Chien-Lin

mattorch.load with multi threading.

Hi,
I am using mattorch.load to read '.mat' data files saved with '-v7.3' in MATLAB. It works fine with -nDonkeys 0 or 1. However, when I run it with -nDonkeys 2 or more it gives a segmentation fault (core dumped), and sometimes a bus error (core dumped).
Thank you in advance.
Thank you in advance.

A question about predicting individual images

Good morning!

Thank you for sharing your work. It is really helpful.

May I ask a question about predicting individual images? I tried to do it by following the code in the README, but I ran into some errors. My full code is:

require 'torch'
require 'cutorch'
require 'paths'
require 'xlua'
require 'optim'
require 'nn'
require 'cudnn'
require 'cunn'

torch.setdefaulttensortype('torch.FloatTensor')
local opts = paths.dofile('opts.lua')
opt = opts.parse(arg)

paths.dofile('donkey.lua')
img = testHook({loadSize}, './cr-test.jpg')
model = torch.load('./model_1.t7')
model:evaluate()
if img:dim() == 3 then
    img = img:view(1, img:size(1), img:size(2), img:size(3))
end

-- the next line causes error
predictions = model:forward(img:cuda())

and the error message is:

-- ignore option data   
-- ignore option optimState 
-- ignore option cache  
-- ignore option netType    
-- ignore option retrain    
Loading train metadata from cache   
Loading test metadata from cache    
Loaded mean and std from cache. 
/home/jaewoo/programs/torch/install/bin/luajit: ...oo/programs/torch/install/share/lua/5.1/nn/Container.lua:67: 
In 2 module of nn.Sequential:
In 4 module of nn.Sequential:
...torch/install/share/lua/5.1/cudnn/BatchNormalization.lua:44: assertion failed!
stack traceback:
    [C]: in function 'assert'
    ...torch/install/share/lua/5.1/cudnn/BatchNormalization.lua:44: in function 'createIODescriptors'
    ...torch/install/share/lua/5.1/cudnn/BatchNormalization.lua:64: in function <...torch/install/share/lua/5.1/cudnn/BatchNormalization.lua:63>
    [C]: in function 'xpcall'
    ...oo/programs/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
    ...o/programs/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function <...o/programs/torch/install/share/lua/5.1/nn/Sequential.lua:41>
    [C]: in function 'xpcall'
    ...oo/programs/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
    ...o/programs/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function 'forward'
    main-temp.lua:32: in main chunk
    [C]: in function 'dofile'
    ...rams/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
    [C]: at 0x00406670

WARNING: If you see a stack trace below, it doesn't point to the place where this error occured. Please use only the one above.
stack traceback:
    [C]: in function 'error'
    ...oo/programs/torch/install/share/lua/5.1/nn/Container.lua:67: in function 'rethrowErrors'
    ...o/programs/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function 'forward'
    main-temp.lua:32: in main chunk
    [C]: in function 'dofile'
    ...rams/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
    [C]: at 0x00406670

Would you help me? Thank you for reading!

Torch.save on dataset class error with gm package

I was trying to save the dataset class (as described in dataset.lua) but kept running into the following error:

/home/vrkrishn/torch/install/share/lua/5.1/torch/File.lua:107: Unwritable object
stack traceback:
[C]: in function 'error'
/home/vrkrishn/torch/install/share/lua/5.1/torch/File.lua:107: in function 'writeObject'

As shown in the error message, the failure is in Torch's writeObject function. When I printed out the key that caused the error, I found that it originated from the parseExif function (specifically the fromBlob function) that is part of gm.Image(). I deduced that the dataset module could not properly serialize the defaultSampleHook field.

Therefore, I removed defaultSampleHook from dataset.lua and the save worked.

I am using Lua 5.1 and the latest version of Torch/gm.

opt was expecting

sampleSize was using opt.cropSize while main was actually setting opt.imageCrop... Sorry. =S

PS: The error is corrected in my last commit.

issue with clearState

Following this simple code, I get an error:

mm = nn.Sequential():add(cudnn.SpatialConvolution(3, 96, 11,11,4,4,2,2)):cuda()
x = torch.rand(2,3,224,224):cuda()
mm:forward(x);
mm:clearState();
mm:forward(x);

...torch/install/share/lua/5.1/cudnn/SpatialConvolution.lua:386: attempt to perform arithmetic on a nil value
stack traceback:
...torch/install/share/lua/5.1/cudnn/SpatialConvolution.lua:386: in function 'updateOutput'
.../mohammadr/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function 'forward'
[string "mm:forward(x);"]:1: in main chunk
[C]: in function 'xpcall'
/home/mohammadr/torch/install/share/lua/5.1/trepl/init.lua:669: in function 'repl'
...madr/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:199: in main chunk
[C]: at 0x00406670

Error: attempt to index field 'THNN' (a nil value)

I am running this code on my own dataset with 826 classes, using the following command:
th main.lua -data myDataset -netType vgg -batchSize 12 -nClasses 826
The type of VGG network I'm using is 'E'. After the first epoch I decided to try out for myself how the model is performing, using this code:

img = testHook(224, "myImage.jpg")
model = torch.load("model_1.t7")
if img:dim() == 3 then
   img = img:view(1, img:size(1), img:size(2), img:size(3))
end
predictions = model:forward(img:cuda())

I'm getting this error

Error attempt to index field 'THNN' (a nil value)

I do not know where this error is coming from...

Is there any way to fix it?
Is it an error from some luarocks installation?

Thanks in advance
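
A possible cause (an assumption, not confirmed in the thread): the test script never loads the CUDA backends, so the deserialized model's layers cannot find their THNN/THCUNN implementations. Requiring them before torch.load, as the prediction example elsewhere in these issues does, may fix it:

require 'cunn'
require 'cudnn'  -- if the model was trained with the cudnn backend
model = torch.load('model_1.t7')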

I got an error (attempt to call method 'for_each' (a nil value)).

I ran it with the following command:

th main.lua -data /devData/BenchmarkData/ILSVRC2012_resize  -netType overfeat

But I got the following error messages. Any suggestions? Thanks.

/home/jwh/torch/install/bin/luajit: ...est_torch/imagenet-multiGPU.torch/fbcunn_files/Optim.lua:61: attempt to call method 'for_each' (a nil value)
stack traceback:
    ...est_torch/imagenet-multiGPU.torch/fbcunn_files/Optim.lua:61: in function '__init'
    /home/jwh/torch/install/share/lua/5.1/torch/init.lua:54: in function </home/jwh/torch/install/share/lua/5.1/torch/init.lua:50>
    [C]: in function 'Optim'
    /home/jwh/test_torch/imagenet-multiGPU.torch/train.lua:35: in main chunk
    [C]: in function 'dofile'
    main.lua:35: in main chunk
    [C]: in function 'dofile'
    .../jwh/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:131: in main chunk
    [C]: at 0x00406670

Cannot train googlenet

I tried to train googlenet with the command:

 th main.lua -data ./imagenet -netType googlenet -nGPU 1 -nDonkeys 4

And it throws a lot of exceptions at the very beginning:

==> doing epoch on training data:
==> online epoch # 1
/home/wyx/torch/install/bin/luajit: /home/wyx/torch/install/share/lua/5.1/threads/threads.lua:264: 
[thread 3 endcallback] ...wyx/torch/install/share/lua/5.1/cudnn/SpatialSoftMax.lua:71: assertion failed!
stack traceback:
        [C]: in function 'assert'
        ...wyx/torch/install/share/lua/5.1/cudnn/SpatialSoftMax.lua:71: in function 'updateGradInput'
        /home/wyx/torch/install/share/lua/5.1/nn/Module.lua:30: in function 'backward'
        /home/wyx/torch/install/share/lua/5.1/nn/Sequential.lua:84: in function 'backward'
        /home/wyx/torch/install/share/lua/5.1/nn/Concat.lua:70: in function 'backward'
        /home/wyx/torch/install/share/lua/5.1/nn/Sequential.lua:84: in function 'backward'
        /home/wyx/imagenet-multiGPU.torch/train.lua:171: in function 'opfunc'
        /home/wyx/torch/install/share/lua/5.1/optim/sgd.lua:44: in function 'sgd'
        /home/wyx/imagenet-multiGPU.torch/train.lua:174: in function </home/wyx/imagenet-multiGPU.torch/train.lua:155>
        [C]: in function 'xpcall'
        /home/wyx/torch/install/share/lua/5.1/threads/threads.lua:173: in function 'dojob'
        /home/wyx/torch/install/share/lua/5.1/threads/threads.lua:220: in function 'addjob'
        /home/wyx/imagenet-multiGPU.torch/train.lua:97: in function 'train'
        main.lua:44: in main chunk
        [C]: in function 'dofile'
        .../wyx/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
        [C]: at 0x00405d30

The assertion at SpatialSoftMax.lua:71 is:

assert(gradOutput:isContiguous());

However, if I change googlenet to vgg in the command, training starts fine.

License for Repo

Hi Soumith
Nice work! I was wondering if you could add a license for this?
Thanks!

Wrong model definition

googlenet.lua does not reflect the GoogLeNet architecture (defined in Going Deeper with Convolutions).

In GoogLeNet, inception modules have only 2 layers, with 3x3 and 5x5 convolutions and 3x3 max-pooling, plus 1x1 dimension-matching convolutions.

What is googlenet.lua implementing?

I think there is a little confusion between the models. From Szegedy we have:

  • GoogLeNet ❌
  • BN-GoogLeNet
  • BN-Inception
  • Inception-v2
    • Vanilla
    • Label Smoothing
    • Factorized 7 × 7
    • BN-auxiliary
  • Inception-v3

and GoogLeNet is not the implemented architecture.

utils.lua nn.DataParallelTable requires 'cunn'

With nGPU > 1, the following error occurs when running th main.lua:

[string "model = nn.DataParallelTable(1)"]:1: attempt to call field 'DataParallelTable' (a nil value)
stack traceback:
[string "model = nn.DataParallelTable(1)"]:1: in main chunk
[C]: in function 'xpcall'
/home/aranjan/torch/install/share/lua/5.1/trepl/init.lua:669: in function 'repl'
...njan/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:199: in main chunk
[C]: at 0x00406670  

This is because nn.DataParallelTable requires cunn. The following test summarizes the issue and a possible fix.

th> model = nn.DataParallelTable(1)
[string "model = nn.DataParallelTable(1)"]:1: attempt to call field 'DataParallelTable' (a nil value)
stack traceback:
[string "model = nn.DataParallelTable(1)"]:1: in main chunk
[C]: in function 'xpcall'
/home/aranjan/torch/install/share/lua/5.1/trepl/init.lua:669: in function 'repl'
...njan/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:199: in main chunk
[C]: at 0x00406670  

th> require 'cunn'
true    
                                                                  [0.0397s] 
th> model = nn.DataParallelTable(1)
                                                                  [0.0001s] 

Getting not supported error for DataParallelTable type

When I try this on multiple GPUs, I get this error:

th main.lua -netType overfeat -data /lscratch/prakash/Torch-Imagenet/ -backend cunn -nGPU 2

it gives this error:

usr6/prakash/DNN/Torch/luajit-rocks/bin/luajit: ...ch/luajit-rocks/share/lua/5.1/cunn/DataParallelTable.lua:414: type() not supported for DataParallelTable.
stack traceback:
[C]: in function 'error'
...ch/luajit-rocks/share/lua/5.1/cunn/DataParallelTable.lua:414: in function 'type'
...rakash/DNN/Torch/luajit-rocks/share/lua/5.1/nn/utils.lua:45: in function 'recursiveType'
...rakash/DNN/Torch/luajit-rocks/share/lua/5.1/nn/utils.lua:41: in function 'recursiveType'
...akash/DNN/Torch/luajit-rocks/share/lua/5.1/nn/Module.lua:123: in function 'cuda'
...ch/demos-master/imagenet-multiGPU.torch-master/model.lua:44: in main chunk

Am I missing something?

PS: It runs correctly with nGPU 1, though.

Prakash
