facebookarchive / fb.resnet.torch

Torch implementation of ResNet from http://arxiv.org/abs/1512.03385 and training scripts

License: Other

Lua 100.00%

fb.resnet.torch's Introduction

Archived

This project is no longer maintained. Consider using ImageNet training in PyTorch instead.

ResNet training in Torch

This implements training of residual networks from Deep Residual Learning for Image Recognition by Kaiming He et al.

We wrote a more detailed blog post discussing this code and ResNets in general.

Requirements

See the installation instructions for a step-by-step guide.

If you already have Torch installed, update nn, cunn, and cudnn.

Training

See the training recipes for additional examples.

The training scripts come with several options, which can be listed with the --help flag.

th main.lua --help

To run the training, simply run main.lua. By default, the script runs ResNet-34 on ImageNet with 1 GPU and 2 data-loader threads.

th main.lua -data [imagenet-folder with train and val folders]

To train ResNet-50 on 4 GPUs:

th main.lua -depth 50 -batchSize 256 -nGPU 4 -nThreads 8 -shareGradInput true -data [imagenet-folder]

Trained models

Trained ResNet 18, 34, 50, 101, 152, and 200 models are available for download. We include instructions for using a custom dataset, classifying an image and getting the model's top5 predictions, and for extracting image features using a pre-trained model.

The trained models achieve lower error rates than the original ResNet models.

Single-crop (224x224) validation error rate (%)

Network       Top-1 error   Top-5 error
ResNet-18     30.43         10.76
ResNet-34     26.73         8.74
ResNet-50     24.01         7.02
ResNet-101    22.44         6.21
ResNet-152    22.16         6.16
ResNet-200    21.66         5.79

Notes

This implementation differs from the ResNet paper in a few ways:

Scale augmentation: We use the scale and aspect ratio augmentation from Going Deeper with Convolutions, instead of the scale augmentation used in the ResNet paper. We find this gives a better validation error.

Color augmentation: We use the photometric distortions from Andrew Howard in addition to the AlexNet-style color augmentation used in the ResNet paper.

Weight decay: We apply weight decay to all weights and biases instead of just the weights of the convolution layers.

Strided convolution: When using the bottleneck architecture, we use stride 2 in the 3x3 convolution, instead of the first 1x1 convolution.
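
As a rough illustration of the scale and aspect ratio augmentation named in the notes above, the following Python sketch samples a crop rectangle by first drawing a target area and an aspect ratio. The area range (8% to 100% of the image) and the aspect ratio range (3/4 to 4/3) are assumptions based on how this augmentation is commonly described, not values confirmed on this page, and `random_sized_crop` is a hypothetical name. The repository itself is written in Lua; Python is used here only for illustration.

```python
import math
import random


def random_sized_crop(width, height, rng, attempts=10):
    """Sample a crop (x, y, w, h) with random area and aspect ratio.

    The ranges below are assumptions, not values taken from the repository.
    """
    area = width * height
    for _ in range(attempts):
        target_area = rng.uniform(0.08, 1.0) * area   # assumed area range
        aspect = rng.uniform(3.0 / 4.0, 4.0 / 3.0)    # assumed ratio range
        w = int(round(math.sqrt(target_area * aspect)))
        h = int(round(math.sqrt(target_area / aspect)))
        if rng.random() < 0.5:
            w, h = h, w                               # randomly swap orientation
        if 0 < w <= width and 0 < h <= height:
            x = rng.randint(0, width - w)
            y = rng.randint(0, height - h)
            return x, y, w, h
    # Fallback when no sample fits: central square crop of the shorter side.
    side = min(width, height)
    return (width - side) // 2, (height - side) // 2, side, side


rng = random.Random(0)
print(random_sized_crop(500, 375, rng))
```

The sampled rectangle would then be resized to the network input size (224x224 in the table above).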

fb.resnet.torch's People

Contributors

arunmallya, aychang95, colesbury, cubbee, cysu, d-x-y, facebook-github-bot, ffmpbgrnn, gchanan, iamaaditya, iassael, ltrottier, lvdmaaten, maraoz, soumith, szagoruyko, tornadomeet

fb.resnet.torch's Issues

SpatialBatchNormalization not working

Loading model from file: resnet-50.t7
=> Replacing classifier with 2-way classifier
=> Training epoch # 1
/home/.../torch/install/bin/luajit: ...h/install/share/lua/5.1/nn/SpatialBatchNormalization.lua:83: attempt to call field 'SpatialBatchNormalization_updateOutput' (a nil value)
stack traceback:
...h/install/share/lua/5.1/nn/SpatialBatchNormalization.lua:83: in function 'updateOutput'
..._Workspace/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function 'forward'
./train.lua:56: in function 'train'
main.lua:49: in main chunk
[C]: in function 'dofile'
...pace/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
[C]: at 0x00406670

Reproducing resnet-200 results

I downloaded the ResNet-200 model and tested it on the ILSVRC2012 validation set. I got similar top-1 and top-5 errors at the 224x224 scale, but when I changed the scale to 320, I got 20.5 (top-1) and 5.2 (top-5).

The main problem should be the size of feature map. I wonder how to process the feature map right after the global average pooling layer, because 7x7 pooling will get larger feature maps (2048x4x4) for 320x320 input, is it right to take the average of a 4x4 map and turn it into 1x1?
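
The arithmetic behind this question can be sketched in Python. The sketch assumes a total downsampling factor of 32 before the pooling layer and a 7x7 average pool with stride 1, as the question describes; `pooled_size` and `window_coverage` are hypothetical helper names. It also counts how often each position of the 10x10 map falls inside a pooling window, which hints that averaging the 4x4 output is not quite the same as a plain global average, since central positions are counted more often.

```python
def pooled_size(size, kernel, stride=1):
    """Output length of an unpadded pooling window along one axis."""
    return (size - kernel) // stride + 1


def window_coverage(size, kernel, stride=1):
    """Number of pooling windows that include each input position (one axis)."""
    counts = [0] * size
    for start in range(0, size - kernel + 1, stride):
        for i in range(start, start + kernel):
            counts[i] += 1
    return counts


final_map_224 = 224 // 32   # assumed total stride of 32
final_map_320 = 320 // 32

print(pooled_size(final_map_224, 7))      # side length after pooling, 224 input
print(pooled_size(final_map_320, 7))      # side length after pooling, 320 input
print(window_coverage(final_map_320, 7))  # per-position window counts
```

Whether the weighted or the uniform average matches the reported 320x320 numbers better is not something this page settles.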

Converting models to Caffe.

Hi,

I am trying to convert models to Caffe using https://github.com/facebook/fb-caffe-exts.

While folding batch-norm layers into their preceding conv layers, that repo looks for the variable bn_layer.running_std in a batch-norm layer, whereas models trained using the code here have bn_layer.running_var.

Can you help with these conversions? Should I just convert var into std?

I see that running_var = running_std:pow(-2):add(-self.eps) should I just reverse this to get the std and go ahead with the conversions?
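
If the formula quoted above is taken at face value, reversing it is a one-line computation. The Python sketch below shows both directions on plain lists; the default `eps` of 1e-5 is an assumption, and the function names are hypothetical. Note that under this formula the stored `running_std` is really the inverse standard deviation, 1 / sqrt(var + eps).

```python
import math


def running_std_from_var(running_var, eps=1e-5):
    """Invert running_var = running_std^(-2) - eps (formula quoted in the issue)."""
    return [1.0 / math.sqrt(v + eps) for v in running_var]


def running_var_from_std(running_std, eps=1e-5):
    """Forward direction, as quoted: running_std:pow(-2):add(-eps)."""
    return [s ** -2 - eps for s in running_std]


variances = [0.25, 1.0, 4.0]
stds = running_std_from_var(variances)
print(stds)
print(running_var_from_std(stds))  # should be close to the original variances
```

Whether the Caffe conversion tool expects exactly this quantity is not confirmed here and would need to be checked against that tool's source.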

the generated model is in a big size

I retrained resnet-50.t7 on my own data, but the resulting model is about 6.7 GB. That seems too big; is this normal? How can I reduce the size?

Converting pretrained model to nngraph fails

I am trying to convert the pretrained model to a graph module, but it fails at backpropagation.
Here is the code:

local inp = nn.Identity()()
local m = nn.Sequential()
m:add(nn.Identity())
m:add(pretrained_resnet_model)
net= m(inp)
local model = nn.gModule({inp}, {net})

The error happens when I call

self.model:backward(self.input,self.criterion.gradInput)

with the same error in the gmodule.lua file:

assert(#self.innode.data.gradOutput == 1, "expecting the innode to be used only once")

I checked the value of #self.innode.data.gradOutput and it's 0.
Is there any reason why this doesn't work for resnet?

8.5 GB models

Hi,
Is there a way to "clean" models before storing them? Models (resnet-34) are currently taking 8.5 GB after each epoch of training.

cudnn error

I installed cudnn 4 and installed the cudnn torch bindings.

When I ran
th main.lua -dataset cifar10 -nGPU 1 -depth 20
I got the following error

=> Creating model from file: models/resnet.lua
 | ResNet-20 CIFAR-10
=> Training epoch # 1
/home/zichaoy/hdd/zichaoy/torch/install/bin/luajit: ...hdd/zichaoy/torch/install/share/lua/5.1/nn/Container.lua:67:
In 1 module of nn.Sequential:
...y/hdd/zichaoy/torch/install/share/lua/5.1/cudnn/init.lua:58: Error in CuDNN: CUDNN_STATUS_NOT_SUPPORTED (cudnnSetFilterNdDescriptor)
stack traceback:
    [C]: in function 'error'
    ...y/hdd/zichaoy/torch/install/share/lua/5.1/cudnn/init.lua:58: in function 'errcheck'
    ...torch/install/share/lua/5.1/cudnn/SpatialConvolution.lua:45: in function 'resetWeightDescriptors'
    ...torch/install/share/lua/5.1/cudnn/SpatialConvolution.lua:358: in function <...torch/install/share/lua/5.1/cudnn/SpatialConvolution.lua:357>
    [C]: in function 'xpcall'
    ...hdd/zichaoy/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
    ...dd/zichaoy/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function 'forward'
    ./train.lua:56: in function 'train'
    main.lua:49: in main chunk
    [C]: in function 'dofile'
    ...haoy/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
    [C]: at 0x00406670

WARNING: If you see a stack trace below, it doesn't point to the place where this error occured. Please use only the one above.
stack traceback:
    [C]: in function 'error'
    ...hdd/zichaoy/torch/install/share/lua/5.1/nn/Container.lua:67: in function 'rethrowErrors'
    ...dd/zichaoy/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function 'forward'
    ./train.lua:56: in function 'train'
    main.lua:49: in main chunk
    [C]: in function 'dofile'
    ...haoy/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
    [C]: at 0x00406670

Any suggestions on how to fix this? Many thanks!

Retrained models really big

Hi,

I have a question about the size of the retrained models. The t7 files I get after retraining are really big: the smallest ones are 1.6 GB for resnet-18. Is there a reason why this is the case? Is there a way to reduce these files? Is it possible to reduce the size of the files while training?

Thanks,

Bojan

Unable to install cudnn bindings on Ubuntu 14.04 LTS

Hi there,

I've been following the installation process well until I get to the installation of cudnn bindings. When I run

sudo luarocks make

I get the following error message:

Missing dependencies for cudnn:
torch >= 7.0
cutorch
Error: Could not satisfy dependency: torch >= 7.0

I have run both "luarocks torch" and "luarocks cutorch" commands successfully (although the last one gave me a bunch of warnings). I am using Ubuntu 14.04 LTS.

Any ideas what might be causing this?

Thanks,

Bojan

How to implement the random lighting noise in datasets/transforms.lua?

Hi everybody, I was curious how to implement the random lighting noise (AlexNet style) in datasets/transforms.lua. Specifically, how do I get the PCA eigenvalues and eigenvectors for another dataset? For example, what are the dimensions of the input matrix that leads to a 3x3 covariance matrix? Thanks. Any details would help.
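
One way to read the dimension question: if every pixel is treated as one observation of three variables (R, G, B), the input matrix is N x 3, where N is the number of pixels sampled from the training images, and its covariance is 3x3. The Python sketch below computes that covariance with the standard library only; `rgb_covariance` is a hypothetical name, and whether the repository's constants were derived exactly this way is an assumption.

```python
def rgb_covariance(pixels):
    """Sample covariance (3x3, nested lists) of a list of (r, g, b) tuples."""
    n = len(pixels)
    mean = [sum(p[c] for p in pixels) / n for c in range(3)]
    cov = [[0.0] * 3 for _ in range(3)]
    for p in pixels:
        d = [p[c] - mean[c] for c in range(3)]   # deviation from channel means
        for i in range(3):
            for j in range(3):
                cov[i][j] += d[i] * d[j]
    return [[v / (n - 1) for v in row] for row in cov]


# Tiny made-up sample; a real run would use pixels from the whole training set.
pixels = [(0.0, 0.0, 0.0), (1.0, 2.0, 3.0), (2.0, 4.0, 6.0)]
for row in rgb_covariance(pixels):
    print(row)
```

The eigenvalues and eigenvectors of the resulting matrix can then be obtained with any symmetric eigensolver, for example `numpy.linalg.eigh`.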

Inconsistent model description and pre-trained model

The pretrained models (ResNet-18/34/50) have no BN layer in the shortcut connection where the number of feature maps changes, which is inconsistent with the model description here.

shortcut in pretrained model:

  (1): nn.ConcatTable {
    input
      |`-> (1): nn.Sequential {
      |      [input -> (1) -> (2) -> (3) -> (4) -> (5) -> output]
      |      (1): cudnn.SpatialConvolution(64 -> 128, 3x3, 2,2, 1,1)
      |      (2): nn.SpatialBatchNormalization
      |      (3): cudnn.ReLU
      |      (4): cudnn.SpatialConvolution(128 -> 128, 3x3, 1,1, 1,1)
      |      (5): nn.SpatialBatchNormalization
      |    }
      |`-> (2): cudnn.SpatialConvolution(64 -> 128, 1x1, 2,2)
       ... -> output
  }

shortcut in model definition:

  (1): nn.ConcatTable {
    input
      |`-> (1): nn.Sequential {
      |      [input -> (1) -> (2) -> (3) -> (4) -> (5) -> output]
      |      (1): cudnn.SpatialConvolution(64 -> 128, 3x3, 2,2, 1,1) without bias
      |      (2): nn.SpatialBatchNormalization
      |      (3): cudnn.ReLU
      |      (4): cudnn.SpatialConvolution(128 -> 128, 3x3, 1,1, 1,1) without bias
      |      (5): nn.SpatialBatchNormalization
      |    }
      |`-> (2): nn.Sequential {
      |      [input -> (1) -> (2) -> output]
      |      (1): cudnn.SpatialConvolution(64 -> 128, 1x1, 2,2) without bias
      |      (2): nn.SpatialBatchNormalization
      |    }
       ... -> output
  }

Bounding box for ImageNet data

I am able to generate probabilities and classes on ImageNet datasets using the ResNet models but I was wondering if there is a way to generate bounding boxes for these images on the object detection dataset.

Training from scratch does not work

I tried to train ResNet-18 from scratch but could not reach the same accuracy. My top-1 is (val 58.6%, train 54.7%). I am using a Titan X; does that matter? I also noticed that the iteration time varies a lot and becomes slower as training proceeds, mostly in data loading time.

Issues reproducing results when training from scratch

I'm attempting to use your code to evaluate a new method (and it's great to have such accessible code for training state of the art models!), but I'm having issues reproducing baseline results with your unmodified code anywhere close to the validation accuracy you have listed for pre-trained models when training from scratch. In particular, when training from scratch ResNet-101, I get:

Finished top1: 22.942 top5: 6.814

and for your more recent ResNet-200 I get:

Finished top1: 21.681 top5: 6.142

The training script calls for each of these:
th main.lua -depth 101 -batchSize 256 -nGPU 8 -nThreads 8 -shareGradInput true -data ${IMAGENET_DIR}
th main.lua -depth 200 -batchSize 256 -nGPU 8 -nThreads 8 -netType preresnet -shareGradInput true -data ${IMAGENET_DIR}

The ResNet 200 experiment was run on commit a446597

The training log for resnet-200: resnet-200.zip

The machine used for training has 8x Titan X, and nvidia driver 352.63, cuda 7.5 and cudnn 4.0.7

Did this project support on cuDNN 3

Can I train this code with cuDNN 2? My driver is CUDA 6.5 and I do not have permission to update it. I found the following code in models/resnet.lua, line 154:

if cudnn.version >= 4000 then
    v.bias = nil
    v.gradBias = nil
else
    v.bias:zero()
end
  1. Why are v.bias and v.gradBias set to nil with cuDNN 4?
  2. Also, can I train it with cunn?

finetune the model for extrating feature

I want to use my own data to fine-tune the model for feature extraction, but the fine-tuning procedure only changes the Linear layer (in the file init.lua). I want to change more layers further back as well. How can I do that? Looking forward to your reply.

Licence?

What's the licence for the pre-trained models?

What's the meaning of following statements?

I noticed these statements in the models/resnet.lua file. I know that they are used for initializing the
respective layers. Can anyone please tell me the individual significance of the following statements?
It seems that only one of them gets executed at any given time.

ConvInit('cudnn.SpatialConvolution')
ConvInit('nn.SpatialConvolution')
BNInit('fbnn.SpatialBatchNormalization')
BNInit('cudnn.SpatialBatchNormalization')
BNInit('nn.SpatialBatchNormalization')

Thanks.

Sizings, questions, possible error in preresnet

Hi, first thank you very much for maintaining this excellent repo. I had a few questions:

  • The sizing patterns are all wrong and suspicious, though I think this is inherited from the ResNet paper. If the input is 224x224 then 7x7 conv with stride 2 pad 3 does not "work out". Does this effectively chop off one pixel from one side? This results in 112x112 maps that then undergo 3x3 pool stride 2, which again does not work out. In this case I assume the pool implementation effectively adds a padding of 1 of zeros? Any comments on these asymmetries and if they are a problem?
  • The new preresnet appears to ignore the new type variable. Shouldn't the declaration of layer read layer(block, features, count, stride, type) with the added type, which should then be passed along to children blocks? It's also a little more subtle because you only want 'first' to apply on the very first block of very first layer, so the layer function has to be careful to only pass it in when i==1.
  • May I ask what the purpose of the additional model:add(nn.Copy(nil, nil, true)) is on L132?
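
The sizes in the first bullet can be checked with the usual floor-rounded output-size formula. This Python sketch assumes floor rounding and a padding of 1 on the 3x3 pooling layer, which is what the question itself guesses; `conv_out` is a hypothetical helper name. The remainder shows that one padded row and column goes unused at each of the two layers, rather than a real input pixel being dropped.

```python
def conv_out(size, kernel, stride, pad):
    """Output length along one axis with floor rounding (assumed)."""
    return (size + 2 * pad - kernel) // stride + 1


def unused_border(size, kernel, stride, pad):
    """Padded positions past the last window along one axis."""
    return (size + 2 * pad - kernel) % stride


after_conv = conv_out(224, kernel=7, stride=2, pad=3)
after_pool = conv_out(after_conv, kernel=3, stride=2, pad=1)  # pad=1 is assumed

print(after_conv, unused_border(224, 7, 2, 3))          # 112 1
print(after_pool, unused_border(after_conv, 3, 2, 1))   # 56 1
```

Under these assumptions the asymmetry amounts to an effective padding of 3 on one side and 2 on the other for the first convolution.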

1202 Layer Net

Hi,

Have you ever tried to train the 1202 layer net on CIFAR-10?
Using the current code, this large net (with batch size 128 on 2 GPUs) doesn't seem to converge. Even if I lower the learning rate to 0.01 for the first 5 epochs, it doesn't seem to converge.
Would really appreciate anyone's input on this! Thanks!

Yu

threads dojob waiting forever

Hi, I have a problem when I run the code twice. The first run works perfectly, but when I start another run, the dojob call in dataloader.lua never finishes and the whole system waits. Any idea why this happens?

features almost the same

hi~
I used extract-features.lua to extract features from images of different classes (5000 images), but the features are almost the same: the L2 distance is almost 1. Is this normal? Looking forward to your reply. Thanks in advance!

The classifier result of ResNet-200

Hi, Sam! Thanks for sharing your pretrained ResNet-200 model. When I used the model to classify ILSVRC2012_val_00000001.JPEG (ground truth is sea snake), the result seems to be wrong, as shown below:

0.3930911719799 reflex camera
0.30106797814369 marimba, xylophone
0.10095628350973 convertible
0.076501801609993 ruddy turnstone, Arenaria interpres
0.067097567021847 giant panda, panda, panda bear, coon bear, Ailuropoda melanoleuca

I have also used resnet-101.t7 before, the classification is just right:

0.52688080072403 sea snake
0.2866522371769 rock python, rock snake, Python sebae
0.073128215968609 hognose snake, puff adder, sand viper
0.048126801848412 water snake
0.021316176280379 Indian cobra, Naja naja

So I don't know why the model resnet-101.t7 will have a right result for classification while resnet-200.t7 not. Would you please tell me if there are some differences for preprocessing the dataset?

File write issue

write error: wrote 2854962 blocks instead of 3211264 at /home/.../torch/pkg/torch/lib/TH/THDiskFile.c:331
stack traceback:
[C]: in function 'write'
...ush_Workspace/torch/install/share/lua/5.1/torch/File.lua:201: in function <...ush_Workspace/torch/install/share/lua/5.1/torch/File.lua:107>
[C]: in function 'write'
...ush_Workspace/torch/install/share/lua/5.1/torch/File.lua:201: in function 'writeObject'
...ush_Workspace/torch/install/share/lua/5.1/torch/File.lua:226: in function 'writeObject'
...orkspace/torch/install/share/lua/5.1/cudnn/Pointwise.lua:67: in function 'write'
...ush_Workspace/torch/install/share/lua/5.1/torch/File.lua:201: in function 'writeObject'
...ush_Workspace/torch/install/share/lua/5.1/torch/File.lua:226: in function 'writeObject'
...ush_Workspace/torch/install/share/lua/5.1/torch/File.lua:226: in function 'writeObject'
...rush_Workspace/torch/install/share/lua/5.1/nn/Module.lua:150: in function 'write'
...ush_Workspace/torch/install/share/lua/5.1/torch/File.lua:201: in function 'writeObject'
...
...ush_Workspace/torch/install/share/lua/5.1/torch/File.lua:226: in function 'writeObject'
...ush_Workspace/torch/install/share/lua/5.1/torch/File.lua:226: in function 'writeObject'
...rush_Workspace/torch/install/share/lua/5.1/nn/Module.lua:150: in function 'write'
...ush_Workspace/torch/install/share/lua/5.1/torch/File.lua:201: in function 'writeObject'
...ush_Workspace/torch/install/share/lua/5.1/torch/File.lua:379: in function 'save'
./checkpoints.lua:37: in function 'save'
main.lua:62: in main chunk
[C]: in function 'dofile'
...pace/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
[C]: at 0x00406670

main.lua is owned by my user.

resume error

When I try to resume my model, I get the error 'cannot open <optimState_64.t7> in mode r at /home/tools/torch/pkg/torch/lib/TH/THDiskFile.c:649'. What should I do to solve this problem? Thanks.

multi-gpu work issue

Hello,

Great job.

I am able to run the examples (tried only cifar10) fine with single gpu. But when using multi-gpu i get the following error:

th main.lua -dataset cifar10 -batchSize 256 -nGPU 2 -nThreads 8 -backend cudnn -nEpochs 2 -depth 32

/usr/local/torch/install/bin/luajit: ...l/torch/install/share/lua/5.1/cunn/DataParallelTable.lua:217: attempt to compare number with table
stack traceback:
...l/torch/install/share/lua/5.1/cunn/DataParallelTable.lua:217: in function 'add'
./models/init.lua:90: in function 'setup'
main.lua:26: in main chunk
[C]: in function 'dofile'
...ocal/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:131: in main chunk
[C]: at 0x004064d0

Not really sure why though.

I have trained fine on multiple GPUs before, on Tesla K40s, with https://github.com/soumith/imagenet-multiGPU.torch

Any help would be much appreciated !! Thanks,

Your Batch Size?

Hey Sam,

We are trying to use your code from https://github.com/facebook/fb.resnet.torch to train a 152-layer model on 4 Titan X GPUs. With a batch size of 256, we seem to run out of memory with the 152-layer model, and with a batch size of 128 it just fits, with each GPU using up to 11.9 GB. I thought your code would be able to do batch size 256 with 4 GPUs?

We’d really appreciate your help!

Yu

error while extracting features via pretrained resnet-200

When I extract features by executing:

th extract-features.lua resnet-200.t7 /path/to/my/image

I got:

In 5 module of nn.Sequential:
In 1 module of nn.Sequential:
In 1 module of nn.Sequential:
In 1 module of nn.ConcatTable:
In 7 module of nn.Sequential:
...09/torch/install/share/lua/5.1/nn/BatchNormalization.lua:80: got 64-feature tensor, expected 256

But resnet-152.t7 is OK, so what's the problem?

Problem while initializing the network with same set of weights?

Hi,

I am doing these runs, where I have to initialize the network with same set of weights.
In this case, the training and testing errors should be similar if not the same.
But, I am getting different training and testing errors in each and every run.

IMO, this disparity can be caused by following two things:
a) Initializing the weights with different values on every run.
-- To overcome it, I have written the weights in a file, which I am loading on every run.
b) Randomness in the way we present the data to the network.
-- Data Shuffling: I have disabled it by providing the manualSeed.
-- Data preprocessing: I am only doing the color normalization.
-- I have also disabled the randomCrop which is by default 'false'.

Here is the code snippet, I have added to the 'models/resnet.lua' to write and load the weights initialized.

--a = v.weight:normal(0,math.sqrt(2/n))
filename = 'randwts/' .. name .. '_' .. k .. '_init_wts.t7'
--torch.save(filename, a)
--a = torch.load(filename)
v.weight = torch.load(filename)

What's the step which I am missing here?

Thanks.

Feature extraction error

Hi,

When I tried to run feature extraction code as you described, I got an error like below.

> th extract-features.lua ../torch_models/resnet-101.t7 ../dataset/attribute/apascal_images/2007_007277.jpg
/home/nine/.torch/install/bin/luajit: ...h/install/share/lua/5.1/nn/SpatialBatchNormalization.lua:82: attempt to index field 'running_std' (a nil value)
stack traceback:
        ...h/install/share/lua/5.1/nn/SpatialBatchNormalization.lua:82: in function 'updateOutput'
        /home/nine/.torch/install/share/lua/5.1/nn/Sequential.lua:44: in function 'forward'
        extract-features.lua:63: in main chunk
        [C]: in function 'dofile'
        ...ine/.torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
        [C]: at 0x00406670

I confirmed that the provided pretrained model has running_var parameters, which nn.SpatialBatchNormalization requires, but it doesn't have running_std. I also checked for the save_mean and save_std parameters, but they don't exist either.

I installed cudnn v4, and Torch bindings@R4.

Any ideas?

bad argument #3 to 'narrow'

th main.lua -retrain resnet-50.t7 -data /data/resnetdata/ -resetClassifier true -nClasses 2

Loading model from file: resnet-50.t7
=> Replacing classifier with 2-way classifier
=> Training epoch # 1
/home/.../torch/install/bin/luajit: ./train.lua:142: bad argument #3 to 'narrow' (out of range at /home/.../torch/pkg/torch/lib/TH/generic/THTensor.c:351)
stack traceback:
[C]: in function 'narrow'
./train.lua:142: in function 'computeScore'
./train.lua:65: in function 'train'
main.lua:49: in main chunk
[C]: in function 'dofile'
...pace/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
[C]: at 0x00406670
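
One plausible reading of this trace, offered as an assumption since train.lua is not shown on this page, is that computeScore asks for the top 5 predictions while the retrained classifier has only 2 outputs, so the requested range exceeds the number of classes. The Python sketch below shows the general idea of clamping k to the class count; `topk_error` is a hypothetical function, not the repository's code.

```python
def topk_error(scores, target, k=5):
    """Return 0 if target is among the k highest scores, else 1.

    k is clamped to the number of classes so a 2-way classifier
    never asks for more entries than exist.
    """
    k = min(k, len(scores))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return 0 if target in ranked[:k] else 1


print(topk_error([0.9, 0.1], target=1))              # k is clamped to 2
print(topk_error([0.1, 0.2, 0.7], target=0, k=1))    # plain top-1 check
```

With fewer than five classes the clamped top-5 error is always zero, so only the top-1 number is informative in that setting.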

Issue with Finetunning

I was looking the code for fine-tuning the ResNet, exactly:

linear.weight:zero() -- only trains last layer to start
linear.bias:zero()

These lines set all the weights to zero. But doesn't that mean the weight updates in that layer will always be zero (the gradient will be 0)?

I ran a small test using these lines while fine-tuning my net, and the net did not learn anything. Without these lines everything was fine.

Could you explain why these lines are here? I think they are incorrect and should be deleted or replaced (by any weight initialization method).
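
The gradient question can be examined on a toy linear layer y = W x + b. In the Python sketch below (plain lists, hypothetical function name `linear_backward`), the gradient with respect to W is the outer product of the output gradient and the input, so it does not depend on W and is generally non-zero even when W is zero. What is zero at that point is the gradient passed back to the input, which is consistent with the code comment "only trains last layer to start".

```python
def linear_backward(weight, x, grad_output):
    """Gradients of y = W x + b for a single sample.

    weight: list of rows (out x in); x: list (in); grad_output: list (out).
    """
    grad_weight = [[g * xi for xi in x] for g in grad_output]   # outer product
    grad_bias = list(grad_output)
    grad_input = [
        sum(weight[o][i] * grad_output[o] for o in range(len(weight)))
        for i in range(len(x))
    ]                                                           # W^T * grad_output
    return grad_weight, grad_bias, grad_input


weight = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]   # zero-initialized classifier
x = [0.5, -1.0, 2.0]                          # made-up input features
grad_output = [0.3, -0.3]                     # made-up gradient from the loss

grad_weight, grad_bias, grad_input = linear_backward(weight, x, grad_output)
print(grad_weight)   # non-zero: the zeroed layer can still be updated
print(grad_input)    # all zeros: earlier layers get no signal at first
```

This toy case does not explain why the reporter's network failed to learn; it only shows that zero weights in a single linear layer do not by themselves force a zero weight gradient.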

ResNet training hangs at random iter

On a screen session, I run:

th main.lua -depth 18 -batchSize 128 -nGPU 1 -nThreads 1 -data ../imagenet/

and then detach. Everything is going well until the training just pauses at some random iter.

| Epoch: [1][677/10010] Time 3.408 Data 3.266 Err 6.4433 top1 100.000 top5 96.875
| Epoch: [1][678/10010] Time 3.152 Data 3.008 Err 6.5600 top1 98.438 top5 97.656
| Epoch: [1][679/10010] Time 3.664 Data 3.518 Err 6.5425 top1 100.000 top5 97.656
| Epoch: [1][680/10010] Time 3.195 Data 3.051 Err 6.5057 top1 100.000 top5 96.875
| Epoch: [1][681/10010] Time 4.066 Data 3.923 Err 6.3842 top1 98.438 top5 96.875
| Epoch: [1][682/10010] Time 3.740 Data 3.595 Err 6.5573 top1 100.000 top5 98.438
| Epoch: [1][683/10010] Time 3.562 Data 3.418 Err 6.4775 top1 100.000 top5 96.875
| Epoch: [1][684/10010] Time 3.370 Data 3.228 Err 6.5334 top1 98.438 top5 96.094
| Epoch: [1][685/10010] Time 3.494 Data 3.349 Err 6.2895 top1 98.438 top5 94.531
| Epoch: [1][686/10010] Time 4.154 Data 3.987 Err 6.4592 top1 98.438 top5 96.875
| Epoch: [1][687/10010] Time 3.143 Data 2.985 Err 6.4096 top1 98.438 top5 95.312
| Epoch: [1][688/10010] Time 3.350 Data 3.187 Err 6.5375 top1 100.000 top5 98.438
| Epoch: [1][689/10010] Time 3.549 Data 3.388 Err 6.5928 top1 100.000 top5 95.312
| Epoch: [1][690/10010] Time 3.538 Data 3.396 Err 6.4876 top1 98.438 top5 96.875
| Epoch: [1][691/10010] Time 3.228 Data 3.079 Err 6.4927 top1 99.219 top5 95.312
| Epoch: [1][692/10010] Time 3.842 Data 3.687 Err 6.4195 top1 98.438 top5 96.094
| Epoch: [1][693/10010] Time 3.866 Data 3.723 Err 6.5156 top1 98.438 top5 97.656
| Epoch: [1][694/10010] Time 3.428 Data 3.279 Err 6.3716 top1 96.875 top5 92.969
| Epoch: [1][695/10010] Time 3.550 Data 3.408 Err 6.4982 top1 100.000 top5 97.656
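
Log lines like the ones above can be summarized with a short Python sketch. It assumes that "Time" is the total time per iteration and "Data" is the portion spent waiting for the data loader; `data_fraction` is a hypothetical helper name. A fraction close to 1 would suggest that data loading, not GPU compute, dominates each iteration.

```python
import re

# Matches the "Time <seconds> Data <seconds>" part of a training log line.
LINE_RE = re.compile(r"Time\s+([\d.]+)\s+Data\s+([\d.]+)")


def data_fraction(log_lines):
    """Share of total iteration time attributed to data loading."""
    total_time = 0.0
    total_data = 0.0
    for line in log_lines:
        match = LINE_RE.search(line)
        if match:
            total_time += float(match.group(1))
            total_data += float(match.group(2))
    return total_data / total_time if total_time else 0.0


sample = [
    "| Epoch: [1][677/10010] Time 3.408 Data 3.266 Err 6.4433 top1 100.000 top5 96.875",
    "| Epoch: [1][678/10010] Time 3.152 Data 3.008 Err 6.5600 top1 98.438 top5 97.656",
]
print(round(data_fraction(sample), 3))
```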

This has happened 4 times now at different iters. Tried varying the number of threads / gpu / batchsize / etc. For example:

th main.lua -depth 18 -batchSize 256 -nGPU 3 -nThreads 8 -shareGradInput true -data /mnt/hdd1/resnet_torch/imagenet/

stopped at Epoch: [2][113/505]

Cuda version:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2015 NVIDIA Corporation
Built on Tue_Aug_11_14:27:32_CDT_2015
Cuda compilation tools, release 7.5, V7.5.17

cudnn version:
cat /usr/local/cuda/include/cudnn.h | grep CUDNN_MAJOR -A 2

#define CUDNN_MAJOR 4
#define CUDNN_MINOR 0
#define CUDNN_PATCHLEVEL 7

Ubuntu version:
Distributor ID: Ubuntu
Description: Ubuntu 14.04.3 LTS
Release: 14.04
Codename: trusty

and I have 3 titan X's without anything else running on them that are correctly setup (have run many caffe/torch jobs before).

Suggestions on nThreads

Do you have any insight on how to select the parameter nThreads?
I'm using a server with two 10-core CPUs and 4 Titan Xs that would be dedicated to running this code.

Thanks!

Indicate GPU_id when training in Multi-gpu mode

Hi, say there are 16 cards in one machine, and I want to use 4 cards specifically, gpu_id=9,10,11,12. (1-indexed) How to target them? tried several solutions, all failed.

  • add CUDDA_VISIBLE_DEVICES=9,10,11,12 th xxx.lua, failed.
  • add a new flag opt.GPUStartId in opts.lua and change Line 90 in models/init.lua to local gpus = torch.range(opt.GPUStartId, opt.nGPU+opt.GPUStartId-1):totable(), and the following error occurs:

=> Training epoch # 1
THCudaCheck FAIL file=/tmp/luarocks_cutorch-scm-1-5821/cutorch/lib/THC/generic/THCStorage.cu line=48 error=77 : an illegal memory access was encountered

I am guessing I need to write some cutorch:setDevice() around Line 94????
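
For reference, the environment variable is spelled CUDA_VISIBLE_DEVICES, it takes 0-based device ids, and the devices it exposes are renumbered from the start inside the process. The Python sketch below only illustrates that index mapping for the 1-indexed ids in the question; the helper names are hypothetical, and whether this resolves the reported failure is not confirmed on this page.

```python
def visible_devices_env(one_indexed_ids):
    """Map 1-indexed GPU ids to a CUDA_VISIBLE_DEVICES value (0-indexed)."""
    return ",".join(str(i - 1) for i in one_indexed_ids)


def ids_inside_process(one_indexed_ids):
    """1-indexed ids the process would see once the mask is applied."""
    return list(range(1, len(one_indexed_ids) + 1))


wanted = [9, 10, 11, 12]
print("CUDA_VISIBLE_DEVICES=" + visible_devices_env(wanted))
print(ids_inside_process(wanted))
```

Under this mapping the script would be started with the mask set and an unchanged `-nGPU 4`, with no edits to models/init.lua.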

Thanks so much!

backend for BN and MaxPooling

Hi, is there any reason why the backend for Max and SBatchNorm is not cudnn?

local Max = nn.SpatialMaxPooling
local SBatchNorm = nn.SpatialBatchNormalization

Is that because the cudnn layers are not stable yet? It seems to me changing above modules to cudnn version has a 1.27x speedup over nn version for a minibatch of 128 on CIFAR10 and gets almost the same accuracy.

shareGradInput Problem

Hi,

Thanks a lot for sharing your code, first of all!

We found that using shareGradInput vs. not using it yields quite different results on cifar10. shareGradInput tends to do significantly worse (and worse than the results we would expect). Here is a plot from running your original code with the default settings on cifar10 with 4 GPUs, with shareGradInput=true vs. shareGradInput=false. This trend is confirmed with the 110-layer net: true -> 9.61%, false -> 6.84%.

[Plot: cifar10 results with shareGradInput=true vs. shareGradInput=false; image not preserved]

We suspect that there's a bug in shareGradInput, would you please take a look?

ResNet 152

Hi, it has been almost a month since the blog post was written, has the 152-layer model finished training?
Most importantly, would it be possible to post the convergence plot (val. err. vs. epoch)? That would be very helpful for those who'd like to compare the convergence of your code vs. Kaiming's.

Thanks!
