jiasenlu / hiecoattenvqa

Python 4.93% Lua 13.28% Jupyter Notebook 81.79%

hiecoattenvqa's Issues

Error when changing hidden_size

I was able to get training to work with default hyperparameters on VQA, but the accuracy was lower than reported in the paper (I got about 50%). I saw that in the paper, a hidden size of 1024 was used for VQA, instead of the default of 512. When I set -hidden_size 1024 when running train.lua, I got the following error:

/user/torch/install/bin/luajit: /user/torch/install/share/lua/5.1/nn/CAddTable.lua:16: bad argument #2 to 'add' (sizes do not match at /user/torch/extra/cutorch/lib/THC/generated/../generic/THCTensorMathPointwise.cu:217)
stack traceback:
[C]: in function 'add'
/user/torch/install/share/lua/5.1/nn/CAddTable.lua:16: in function 'func'
/user/torch/install/share/lua/5.1/nngraph/gmodule.lua:345: in function 'neteval'
/user/torch/install/share/lua/5.1/nngraph/gmodule.lua:380: in function 'forward'
./misc/ques_level.lua:119: in function 'forward'
train.lua:258: in function 'lossFun'
train.lua:311: in main chunk
[C]: in function 'dofile'
/user/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
[C]: at 0x00406670

Is attention visualization available?

Error in training

torch/install/share/lua/5.1/hdf5/group.lua:312: HDF5Group:read() - no such child 'ques_train' for [HDF5Group 144115188075855879LL /]

I get the above error during the training phase. Can someone help me figure out what I am missing?

VQA preprocess

The test set does not have an 'ans' field.

How to preprocess open-ended question and annotation files

Torch: not enough memory: you tried to allocate 30GB.

$ th prepro_img_vgg.lua -input_json ../data/vqa_data_prepro.json -image_root /home/jiasenlu/data/ -cnn_proto ../image_model/VGG_ILSVRC_19_layers_deploy.prototxt -cnn_model ../image_model/VGG_ILSVRC_19_layers.caffemodel

Torch: not enough memory: you tried to allocate 30GB.

How can I solve this memory problem?

Clarify using num_layers as n in LSTM implementation

Hello,

I am trying to re-implement your paper in Keras, and I am struggling with your LSTM implementation.

You pass num_layers as n when initializing the LSTM. num_layers should be the depth of the LSTM, yet in the LSTM implementation n seems to be used as the number of timesteps L. Is this true?

HieCoAttenVQA/misc/ques_level.lua, line 18 in 82b0bb0:

self.core = LSTM.lstm(self.rnn_size, self.rnn_size, self.num_layers, dropout)

HieCoAttenVQA/misc/LSTM.lua, line 18 in 82b0bb0:

for L = 1,n do

Furthermore, createClones seems to create a separate set of weights for each timestep. Is this intended, or is it a bug? An LSTM should share the same weights across time.

HieCoAttenVQA/misc/ques_level.lua, line 52 in 82b0bb0:

for t=1,self.seq_length do
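
The distinction this question draws, between stacking depth and unrolling over time, can be illustrated with a toy sketch. This is not the repository's LSTM: it assumes a scalar tanh recurrent cell, and `make_params`, `step`, and `run` are hypothetical names. The inner loop walks the layers within one timestep (the role `n` plays if it denotes depth), while the outer loop walks the timesteps and reuses the same parameters at every step.

```python
import math


def make_params(num_layers):
    # One (w_x, w_h, b) triple per layer; toy scalar weights, not trained values.
    return [(0.5, 0.3, 0.0) for _ in range(num_layers)]


def step(params, x, h_prev):
    """Run ONE timestep through all layers and return the new hidden states."""
    h_new = []
    layer_input = x
    for (w_x, w_h, b), h in zip(params, h_prev):  # loop over depth
        out = math.tanh(w_x * layer_input + w_h * h + b)
        h_new.append(out)
        layer_input = out  # the next layer consumes this layer's output
    return h_new


def run(params, sequence):
    """Unroll over time, reusing the same params at every timestep."""
    h = [0.0] * len(params)
    for x in sequence:  # loop over time
        h = step(params, x, h)
    return h


params = make_params(num_layers=2)
final_h = run(params, [1.0, 0.5, -0.2])
print(len(params), len(final_h))
```

In this sketch the number of parameters depends only on `num_layers`, never on the sequence length. In Torch code bases, per-timestep clones are commonly created so that they share one set of weights, but whether createClones does so here should be checked in the source.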

Preprocessing images with ResNet

$ Torch: not enough memory: you tried to allocate 123GB.
Happens here prepro_img_residule.lua:98: in main chunk, i.e.

local feat_train=torch.FloatTensor(sz, 14, 14, 2048) --ndims)

Do you really have 123GB+ RAM? 😃

Anyway, why 14x14x2048? Shouldn't it be 7x7x2048?
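
The requested allocation can be checked with simple arithmetic. The sketch below is a rough estimate, assuming 4-byte floats and an image count of 82460 (the figure printed by a VGG preprocessing log elsewhere on this page, not confirmed for this run); `tensor_gib` is a hypothetical helper.

```python
def tensor_gib(num_images, height, width, channels, bytes_per_value=4):
    """Size in GiB of a dense float32 tensor of shape (N, H, W, C)."""
    total_bytes = num_images * height * width * channels * bytes_per_value
    return total_bytes / 2**30


NUM_IMAGES = 82460  # assumption: image count taken from a log on this page

resnet_14 = tensor_gib(NUM_IMAGES, 14, 14, 2048)
resnet_7 = tensor_gib(NUM_IMAGES, 7, 7, 2048)
print(f"14x14x2048: {resnet_14:.1f} GiB")
print(f"7x7x2048:   {resnet_7:.1f} GiB")
```

Under these assumptions the 14x14 grid comes to roughly 123 GiB, in line with the error message, and a 7x7 grid would need a quarter of that. Since torch.FloatTensor is a host-side tensor type, the allocation is made in CPU RAM rather than GPU memory.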

Fixes needed in README

python vqa_preprocessing.py --download True --split 1 should be changed to python vqa_preprocess.py --download 1 --split 1

Similar for the coco script.

For COCO-QA

$ python vqa_preprocess.py --download 1

I think it should be $ python cocoqa_preprocess.py --download 1, a small mistake.

not enough memory: you tried to allocate 15GB. Buy new RAM! ----is this about CPU RAM or GPU MEM?

I have 32 GB of CPU RAM and 2 GPUs (GTX 1080, 8 GB each) on my machine.
Why can't it allocate 15 GB of memory?

rzai@rzai00:/prj/HieCoAttenVQA/prepro$ CUDA_VISIBLE_DEVICES=1 th prepro_img_vgg.lua -input_json ../data/vqa_data_prepro.json -image_root /home/rzai/mscoco.org-visualqa.org/ -cnn_proto /home/rzai/VGG_ILSVRC_19_layers_deploy.prototxt -cnn_model /home/rzai/VGG_ILSVRC_19_layers.caffemodel
{
batch_size : 20
gpuid : 6
out_name_train : "../data/vqa_data_img_vgg_train.h5"
out_name_test : "../data/vqa_data_img_vgg_test.h5"
cnn_proto : "/home/rzai/VGG_ILSVRC_19_layers_deploy.prototxt"
cnn_model : "/home/rzai/VGG_ILSVRC_19_layers.caffemodel"
backend : "cudnn"
image_root : "/home/rzai/mscoco.org-visualqa.org/"
input_json : "../data/vqa_data_prepro.json"
}
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:505] Reading dangerously large protocol message. If the message turns out to be larger than 1073741824 bytes, parsing will be halted for security reasons. To increase the limit (or to disable these warnings), see CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:78] The total number of bytes read was 574671192
Successfully loaded /home/rzai/VGG_ILSVRC_19_layers.caffemodel
conv1_1: 64 3 3 3
conv1_2: 64 64 3 3
conv2_1: 128 64 3 3
conv2_2: 128 128 3 3
conv3_1: 256 128 3 3
conv3_2: 256 256 3 3
conv3_3: 256 256 3 3
conv3_4: 256 256 3 3
conv4_1: 512 256 3 3
conv4_2: 512 512 3 3
conv4_3: 512 512 3 3
conv4_4: 512 512 3 3
conv5_1: 512 512 3 3
conv5_2: 512 512 3 3
conv5_3: 512 512 3 3
conv5_4: 512 512 3 3
fc6: 1 1 25088 4096
fc7: 1 1 4096 4096
fc8: 1 1 4096 1000
nn.Sequential {
[input -> (1) -> (2) -> (3) -> (4) -> (5) -> (6) -> (7) -> (8) -> (9) -> (10) -> (11) -> (12) -> (13) -> (14) -> (15) -> (16) -> (17) -> (18) -> (19) -> (20) -> (21) -> (22) -> (23) -> (24) -> (25) -> (26) -> (27) -> (28) -> (29) -> (30) -> (31) -> (32) -> (33) -> (34) -> (35) -> (36) -> (37) -> (38) -> (39) -> (40) -> (41) -> (42) -> (43) -> (44) -> (45) -> (46) -> output]
(1): cudnn.SpatialConvolution(3 -> 64, 3x3, 1,1, 1,1)
(2): cudnn.ReLU
(3): cudnn.SpatialConvolution(64 -> 64, 3x3, 1,1, 1,1)
(4): cudnn.ReLU
(5): cudnn.SpatialMaxPooling(2x2, 2,2)
(6): cudnn.SpatialConvolution(64 -> 128, 3x3, 1,1, 1,1)
(7): cudnn.ReLU
(8): cudnn.SpatialConvolution(128 -> 128, 3x3, 1,1, 1,1)
(9): cudnn.ReLU
(10): cudnn.SpatialMaxPooling(2x2, 2,2)
(11): cudnn.SpatialConvolution(128 -> 256, 3x3, 1,1, 1,1)
(12): cudnn.ReLU
(13): cudnn.SpatialConvolution(256 -> 256, 3x3, 1,1, 1,1)
(14): cudnn.ReLU
(15): cudnn.SpatialConvolution(256 -> 256, 3x3, 1,1, 1,1)
(16): cudnn.ReLU
(17): cudnn.SpatialConvolution(256 -> 256, 3x3, 1,1, 1,1)
(18): cudnn.ReLU
(19): cudnn.SpatialMaxPooling(2x2, 2,2)
(20): cudnn.SpatialConvolution(256 -> 512, 3x3, 1,1, 1,1)
(21): cudnn.ReLU
(22): cudnn.SpatialConvolution(512 -> 512, 3x3, 1,1, 1,1)
(23): cudnn.ReLU
(24): cudnn.SpatialConvolution(512 -> 512, 3x3, 1,1, 1,1)
(25): cudnn.ReLU
(26): cudnn.SpatialConvolution(512 -> 512, 3x3, 1,1, 1,1)
(27): cudnn.ReLU
(28): cudnn.SpatialMaxPooling(2x2, 2,2)
(29): cudnn.SpatialConvolution(512 -> 512, 3x3, 1,1, 1,1)
(30): cudnn.ReLU
(31): cudnn.SpatialConvolution(512 -> 512, 3x3, 1,1, 1,1)
(32): cudnn.ReLU
(33): cudnn.SpatialConvolution(512 -> 512, 3x3, 1,1, 1,1)
(34): cudnn.ReLU
(35): cudnn.SpatialConvolution(512 -> 512, 3x3, 1,1, 1,1)
(36): cudnn.ReLU
(37): cudnn.SpatialMaxPooling(2x2, 2,2)
(38): nn.View(-1)
(39): nn.Linear(25088 -> 4096)
(40): cudnn.ReLU
(41): nn.Dropout(0.500000)
(42): nn.Linear(4096 -> 4096)
(43): cudnn.ReLU
(44): nn.Dropout(0.500000)
(45): nn.Linear(4096 -> 1000)
(46): cudnn.SoftMax
}
nn.Sequential {
[input -> (1) -> (2) -> (3) -> (4) -> (5) -> (6) -> (7) -> (8) -> (9) -> (10) -> (11) -> (12) -> (13) -> (14) -> (15) -> (16) -> (17) -> (18) -> (19) -> (20) -> (21) -> (22) -> (23) -> (24) -> (25) -> (26) -> (27) -> (28) -> (29) -> (30) -> (31) -> (32) -> (33) -> (34) -> (35) -> (36) -> (37) -> output]
(1): cudnn.SpatialConvolution(3 -> 64, 3x3, 1,1, 1,1)
(2): cudnn.ReLU
(3): cudnn.SpatialConvolution(64 -> 64, 3x3, 1,1, 1,1)
(4): cudnn.ReLU
(5): cudnn.SpatialMaxPooling(2x2, 2,2)
(6): cudnn.SpatialConvolution(64 -> 128, 3x3, 1,1, 1,1)
(7): cudnn.ReLU
(8): cudnn.SpatialConvolution(128 -> 128, 3x3, 1,1, 1,1)
(9): cudnn.ReLU
(10): cudnn.SpatialMaxPooling(2x2, 2,2)
(11): cudnn.SpatialConvolution(128 -> 256, 3x3, 1,1, 1,1)
(12): cudnn.ReLU
(13): cudnn.SpatialConvolution(256 -> 256, 3x3, 1,1, 1,1)
(14): cudnn.ReLU
(15): cudnn.SpatialConvolution(256 -> 256, 3x3, 1,1, 1,1)
(16): cudnn.ReLU
(17): cudnn.SpatialConvolution(256 -> 256, 3x3, 1,1, 1,1)
(18): cudnn.ReLU
(19): cudnn.SpatialMaxPooling(2x2, 2,2)
(20): cudnn.SpatialConvolution(256 -> 512, 3x3, 1,1, 1,1)
(21): cudnn.ReLU
(22): cudnn.SpatialConvolution(512 -> 512, 3x3, 1,1, 1,1)
(23): cudnn.ReLU
(24): cudnn.SpatialConvolution(512 -> 512, 3x3, 1,1, 1,1)
(25): cudnn.ReLU
(26): cudnn.SpatialConvolution(512 -> 512, 3x3, 1,1, 1,1)
(27): cudnn.ReLU
(28): cudnn.SpatialMaxPooling(2x2, 2,2)
(29): cudnn.SpatialConvolution(512 -> 512, 3x3, 1,1, 1,1)
(30): cudnn.ReLU
(31): cudnn.SpatialConvolution(512 -> 512, 3x3, 1,1, 1,1)
(32): cudnn.ReLU
(33): cudnn.SpatialConvolution(512 -> 512, 3x3, 1,1, 1,1)
(34): cudnn.ReLU
(35): cudnn.SpatialConvolution(512 -> 512, 3x3, 1,1, 1,1)
(36): cudnn.ReLU
(37): cudnn.SpatialMaxPooling(2x2, 2,2)
}
processing 82460 images...
/home/rzai/torch/install/bin/luajit: $ Torch: not enough memory: you tried to allocate 15GB. Buy new RAM! at /home/rzai/torch/pkg/torch/lib/TH/THGeneral.c:270
stack traceback:
[C]: at 0x7f1d81308e80
[C]: in function 'FloatTensor'
prepro_img_vgg.lua:120: in main chunk
[C]: in function 'dofile'
...rzai/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
[C]: at 0x00406670
rzai@rzai00:/prj/HieCoAttenVQA/prepro$ vim /home/rzai/torch/pkg/torch/lib/TH/THGeneral.c
rzai@rzai00:~/prj/HieCoAttenVQA/prepro$

How to process the multiple choice answer

Hi,
I am confused about how to use the multiple-choice answers in the multiple-choice task when training and evaluating the model.

Can we process the multiple choice answer the same as the open-ended task?

Thanks.

Fail to run train.lua

This may be a trivial question, but I'm completely new to Torch. I searched on Google with no luck. I'm working on an Ubuntu 14.04 machine with CUDA 7.0 and cuDNN R4. I prepared all the training files, and when I run train.lua it gives me this error:

{
input_img_train_h5 : "data/vqa_data_img_vgg_train.h5"
learning_rate_decay_every : 300
optim : "rmsprop"
hidden_size : 512
optim_epsilon : 1e-08
output_size : 1000
rnn_layers : 2
input_img_test_h5 : "data/vqa_data_img_vgg_test.h5"
losses_log_every : 600
id : "0"
input_ques_h5 : "data/vqa_data_prepro.h5"
learning_rate_decay_start : 0
start_from : ""
gpuid : 6
seed : 123
input_json : "data/vqa_data_prepro.json"
optim_beta : 0.995
batch_size : 20
iterPerEpoch : 1200
rnn_size : 512
max_iters : -1
checkpoint_path : "save/train_vgg"
save_checkpoint_every : 6000
learning_rate : 0.0004
co_atten_type : "Alternating"
feature_type : "VGG"
backend : "cudnn"
optim_alpha : 0.99
}
DataLoader loading h5 image file: data/vqa_data_img_vgg_train.h5
DataLoader loading h5 image file: data/vqa_data_img_vgg_test.h5
DataLoader loading h5 question file: data/vqa_data_prepro.h5
DataLoader loading json file: data/vqa_data_prepro.json
assigned 215375 images to split 0
assigned 121512 images to split 2
Building the model...
total number of parameters in word_level: 8031747
total number of parameters in phrase_level: 2889219
total number of parameters in ques_level: 5517315
constructing clones inside the ques_level
total number of parameters in recursive_attention: 2862056
/home/raamac/torch/install/bin/luajit: ./misc/word_level.lua:86: the class torch.CudaByteTensor cannot be indexed
stack traceback:
[C]: in function '__newindex'
./misc/word_level.lua:86: in function 'forward'
train.lua:253: in function 'lossFun'
train.lua:310: in main chunk
[C]: in function 'dofile'
...amac/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
[C]: at 0x00406670

maskedFill , invalid arguments

Problem statement: I am getting the following error

qlua: ./misc/maskSoftmax.lua:31: invalid arguments: CudaTensor CudaTensor number
expected arguments: CudaTensor CudaByteTensor float
stack traceback:
[C]: at 0x7fd8cfc39b60
[C]: in function 'maskedFill'
./misc/maskSoftmax.lua:31: in function 'func'
/home/cse/torch/install/share/lua/5.1/nngraph/gmodule.lua:345: in function 'neteval'
/home/cse/torch/install/share/lua/5.1/nngraph/gmodule.lua:380: in function 'forward'
./misc/word_level.lua:92: in function 'forward'

predict.lua:142: in main chunk

I didn't change anything; I am using your code as it is.
Please let me know how to fix this.

Setup:

predict.ipynb was converted to predict.lua by replacing itorch.image(img) with image.display(img).
I have downloaded all the pretrained models for the image network and VQA.

Execution report:
VQA/HieCoAttenVQA-master$ qlua predict.lua
image_model/VGG_ILSVRC_19_layers_deploy.prototxt image_model/VGG_ILSVRC_19_layers.caffemodel cudnn
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:505] Reading dangerously large protocol message. If the message turns out to be larger than 1073741824 bytes, parsing will be halted for security reasons. To increase the limit (or to disable these warnings), see CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:78] The total number of bytes read was 574671192
Successfully loaded image_model/VGG_ILSVRC_19_layers.caffemodel
conv1_1: 64 3 3 3
conv1_2: 64 64 3 3
conv2_1: 128 64 3 3
conv2_2: 128 128 3 3
conv3_1: 256 128 3 3
conv3_2: 256 256 3 3
conv3_3: 256 256 3 3
conv3_4: 256 256 3 3
conv4_1: 512 256 3 3
conv4_2: 512 512 3 3
conv4_3: 512 512 3 3
conv4_4: 512 512 3 3
conv5_1: 512 512 3 3
conv5_2: 512 512 3 3
conv5_3: 512 512 3 3
conv5_4: 512 512 3 3
fc6: 1 1 25088 4096
fc7: 1 1 4096 4096
fc8: 1 1 4096 1000
Load the weight...
total number of parameters in cnn_model: 20024384
total number of parameters in word_level: 8031747
total number of parameters in phrase_level: 2889219
total number of parameters in ques_level: 5517315
constructing clones inside the ques_level
total number of parameters in recursive_attention: 2862056
qlua: ./misc/maskSoftmax.lua:31: invalid arguments: CudaTensor CudaTensor number
expected arguments: CudaTensor CudaByteTensor float
stack traceback:
[C]: at 0x7fd8cfc39b60
[C]: in function 'maskedFill'
./misc/maskSoftmax.lua:31: in function 'func'
/home/cse/torch/install/share/lua/5.1/nngraph/gmodule.lua:345: in function 'neteval'
/home/cse/torch/install/share/lua/5.1/nngraph/gmodule.lua:380: in function 'forward'
./misc/word_level.lua:92: in function 'forward'
predict.lua:142: in main chunk

THCudaCheck FAIL file=/tmp/luarocks_cutorch-scm-1-7838/cutorch/lib/THC/generic/THCStorage.c line=32 error=59 : device-side assert triggered in train.lua

I got the following error when running train.lua:
/tmp/luarocks_cunn-scm-1-9864/cunn/lib/THCUNN/ClassNLLCriterion.cu:25: void cunn_ClassNLLCriterion_updateOutput_kernel1(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int) [with Dtype = float]: block: [0,0,0], thread: [0,0,0] Assertion t >= 0 && t < n_classes failed.
THCudaCheck FAIL file=/tmp/luarocks_cutorch-scm-1-7838/cutorch/lib/THC/generic/THCStorage.c line=32 error=59 : device-side assert triggered
/data/home/suzhou/torch/install/bin/luajit: cuda runtime error (59) : device-side assert triggered at /tmp/luarocks_cutorch-scm-1-7838/cutorch/lib/THC/generic/THCStorage.c:32
stack traceback:
[C]: at 0x7f3ad501a130
[C]: in function '__index'
...hou/torch/install/share/lua/5.1/nn/ClassNLLCriterion.lua:52: in function 'updateOutput'
...torch/install/share/lua/5.1/nn/CrossEntropyCriterion.lua:13: in function 'forward'
train.lua:208: in function 'eval_split'
train.lua:334: in main chunk
[C]: in function 'dofile'
...zhou/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
[C]: at 0x00405d50

About error from executing "train.lua"

I followed the steps in the README but got the error below.
I used the VGG image features.
I don't know how to solve it.

-------------------the contents of the error message ---------------------------

iter 0: 6.952219, 0.011587, 0.000400, 0.397509
validation loss: =======6.9374270200729=accuracy =======0====== 5120/5000 =========] Tot: 3s108ms | Step: 0ms
wrote json checkpoint to save/train_vgg_Alternating/checkpoint.json.json
/home/user/vqainstall/distro-cl/install/bin/luajit: ...o-cl/install/share/lua/5.1/cudnn/TemporalConvolution.lua:92: bad argument #1 to 'size' (out of range)
stack traceback:
[C]: in function 'size'
...o-cl/install/share/lua/5.1/cudnn/TemporalConvolution.lua:92: in function 'updateGradInput'
...tall/distro-cl/install/share/lua/5.1/nngraph/gmodule.lua:350: in function 'neteval'
...tall/distro-cl/install/share/lua/5.1/nngraph/gmodule.lua:384: in function 'updateGradInput'
...vqainstall/distro-cl/install/share/lua/5.1/nn/Module.lua:30: in function 'backward'
./misc/phrase_level.lua:85: in function 'updateGradInput'
...vqainstall/distro-cl/install/share/lua/5.1/nn/Module.lua:30: in function 'backward'
train.lua:278: in function 'lossFun'
train.lua:312: in main chunk
[C]: in function 'dofile'
.../distro-cl/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
[C]: at 0x00405e90

Wrong Answer index is -1. Error at evaluation

In the preprocessing script, an example whose answer is not among the top 1000 answers is removed from the training set, but it is kept in the test set and encoded as -1. At evaluation time this triggers an assertion error in the criterion, which needs targets in the range 1 to the number of classes. Is this handled anywhere in the code? I was not able to find where.
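
One way to avoid the assertion, sketched under assumptions: keep only the examples whose encoded answer lies in the range the criterion accepts, and leave the rest out of the loss. `split_valid_targets` is a hypothetical helper, not code from this repository, and it assumes the stored indices are 1-based, which should be checked against the preprocessing output.

```python
def split_valid_targets(targets, n_classes):
    """Return the positions whose target lies in 1..n_classes, the range
    a 1-indexed classification criterion accepts, plus the skipped count."""
    valid = [i for i, t in enumerate(targets) if 1 <= t <= n_classes]
    skipped = len(targets) - len(valid)
    return valid, skipped


# Toy batch: -1 marks an answer outside the top-answer vocabulary.
batch_targets = [3, -1, 1000, 17, -1]
valid_idx, skipped = split_valid_targets(batch_targets, n_classes=1000)
print(valid_idx, skipped)
```

Only the rows listed in `valid_idx` would be passed to the loss; the skipped rows can still be counted as wrong when computing accuracy.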

Error in 'train.lua'

I also followed the steps in the README, and I got this error.
It looks similar to the closed issue, but I'm using the most recent version of Torch.
Can someone help me solve this problem?
By the way, this is the train.lua step.
---------------------------------------error--------------------------------
~
constructing clones inside the ques_level
total number of parameters in recursive_attention: 2862056

/home/user/torch/install/bin/luajit: /home/user/torch/install/share/lua/5.1/nn/THNN.lua:110: input and gradOutput have different number of elements: input[20 x 26] has 520 elements, while gradOutput[26] has 26 elements at /home/user/torch/extra/cunn/lib/THCUNN/generic/SoftMax.cu:84
stack traceback:
[C]: in function 'v'
/home/user/torch/install/share/lua/5.1/nn/THNN.lua:110: in function 'SoftMax_updateGradInput'
./misc/maskSoftmax.lua:33: in function 'updateGradInput'
.../user/torch/install/share/lua/5.1/nngraph/gmodule.lua:420: in function 'neteval'
.../user/torch/install/share/lua/5.1/nngraph/gmodule.lua:454: in function 'updateGradInput'
/home/user/torch/install/share/lua/5.1/nn/Module.lua:31: in function 'backward'
./misc/ques_level.lua:143: in function 'updateGradInput'
/home/user/torch/install/share/lua/5.1/nn/Module.lua:31: in function 'backward'
train.lua:274: in function 'lossFun'
train.lua:313: in main chunk
[C]: in function 'dofile'
...usr/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
[C]: at 0x00406670

Bad performance on training

When I train this model (split=1) on a GPU (M40), training is so slow that I can hardly wait for the result. The speed is about 3789 sec per 600 iterations with a batch size of 20, so one epoch (121512 images) is about 6075 iterations. Training the full 250 epochs would take about 3121 hours.

Is this a normal training speed?

The training log is as follows :
iter 600: 2.853175, 3.237587, 0.000398, 3986.911091
iter 1200: 2.764053, 2.890393, 0.000397, 7775.439858
iter 1800: 4.232092, 2.867312, 0.000395, 11630.438926
iter 2400: 3.718163, 2.826459, 0.000394, 15476.820226
iter 3000: 2.618628, 2.725287, 0.000392, 19317.520915
iter 3600: 1.765032, 2.695489, 0.000391, 23153.146176
iter 4200: 2.188749, 2.651189, 0.000389, 26987.516512
iter 4800: 2.323560, 2.658700, 0.000388, 30819.787844
iter 5400: 2.473269, 2.561281, 0.000386, 34647.776586
iter 6000: 1.226942, 2.620815, 0.000385, 38466.522832
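
The estimate can be re-derived from the figures quoted in this post. The sketch below is only arithmetic; `estimate_hours` is a hypothetical helper, and the inputs (3789 s per 600 iterations, 6075 iterations per epoch, 250 epochs) are the post's own numbers.

```python
def estimate_hours(seconds_per_block, iters_per_block, iters_per_epoch, epochs):
    """Total wall-clock hours for a run at a constant per-iteration speed."""
    seconds_per_iter = seconds_per_block / iters_per_block
    total_seconds = seconds_per_iter * iters_per_epoch * epochs
    return total_seconds / 3600


hours = estimate_hours(3789, 600, 6075, 250)
print(f"{hours:.0f} hours")
```

With these inputs the result is about 2664 hours, somewhat below the 3121 hours stated above; either way, the full run would take months at this speed.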

When I train the model from this GitHub, training is very fast, and I can reach the target accuracy in 2 hours.

Sequential.lua:29: index out of range stack traceback:

rzai@rzai00:/prj/HieCoAttenVQA/prepro$
rzai@rzai00:/prj/HieCoAttenVQA/prepro$ CUDA_VISIBLE_DEVICES=1 th prepro_img_vgg.lua -input_json ../data/vqa_data_prepro.json -image_root /media/rzai/ai_data/VQA-ALL/mscoco.org-visualqa.org/train2014 -cnn_proto ../image_model/VGG_ILSVRC_19_layers_deploy.prototxt -cnn_model ../image_model/VGG_ILSVRC_19_layers.caffemodel
{
batch_size : 20
gpuid : 6
out_name_train : "../data/vqa_data_img_vgg_train.h5"
out_name_test : "../data/vqa_data_img_vgg_test.h5"
cnn_proto : "../image_model/VGG_ILSVRC_19_layers_deploy.prototxt"
cnn_model : "../image_model/VGG_ILSVRC_19_layers.caffemodel"
backend : "cudnn"
image_root : "/media/rzai/ai_data/VQA-ALL/mscoco.org-visualqa.org/train2014"
input_json : "../data/vqa_data_prepro.json"
}
Successfully loaded ../image_model/VGG_ILSVRC_19_layers.caffemodel
nn.Sequential {
[input -> output]
}
/home/rzai/torch/install/bin/luajit: /home/rzai/torch/install/share/lua/5.1/nn/Sequential.lua:29: index out of range
stack traceback:
[C]: in function 'error'
/home/rzai/torch/install/share/lua/5.1/nn/Sequential.lua:29: in function 'remove'
prepro_img_vgg.lua:41: in main chunk
[C]: in function 'dofile'
...rzai/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
[C]: at 0x00406670
rzai@rzai00:~/prj/HieCoAttenVQA/prepro$

I got an error

centos7.0+cuda8.0
[hbliu@bogon HieCoAttenVQA-master]$ th train.lua -input_img_train_h5 data/vqa_data_img_vgg_train.h5 -input_img_test_h5 data/vqa_data_img_vgg_test.h5 -input_ques_h5 data/vqa_data_prepro.h5 -input_json data/vqa_data_prepro.json -co_atten_type Alternating -feature_type VGG
{
input_img_train_h5 : "data/vqa_data_img_vgg_train.h5"
learning_rate_decay_every : 300
optim : "rmsprop"
hidden_size : 512
optim_epsilon : 1e-08
output_size : 1000
rnn_layers : 2
input_img_test_h5 : "data/vqa_data_img_vgg_test.h5"
losses_log_every : 600
id : "0"
input_ques_h5 : "data/vqa_data_prepro.h5"
learning_rate_decay_start : 0
start_from : ""
gpuid : 0
seed : 123
input_json : "data/vqa_data_prepro.json"
optim_beta : 0.995
batch_size : 20
iterPerEpoch : 1200
rnn_size : 512
max_iters : -1
checkpoint_path : "save/train_vgg"
save_checkpoint_every : 6000
learning_rate : 0.0004
co_atten_type : "Alternating"
feature_type : "VGG"
backend : "cudnn"
optim_alpha : 0.99
}
Use GPU0
DataLoader loading h5 image file: data/vqa_data_img_vgg_train.h5
DataLoader loading h5 image file: data/vqa_data_img_vgg_test.h5
DataLoader loading h5 question file: data/vqa_data_prepro.h5
DataLoader loading json file: data/vqa_data_prepro.json
assigned 215375 images to split 0
assigned 121512 images to split 2
Building the model...
total number of parameters in word_level: 8031747
total number of parameters in phrase_level: 2889219
total number of parameters in ques_level: 5517315
constructing clones inside the ques_level
total number of parameters in recursive_attention: 2862056
Mask is a nil
/usr/local/torch7/install/bin/luajit: ./misc/word_level.lua:94: the class torch.CudaByteTensor cannot be indexed
stack traceback:
[C]: in function '__newindex'
./misc/word_level.lua:94: in function 'forward'
train.lua:254: in function 'lossFun'
train.lua:311: in main chunk
[C]: in function 'dofile'
...cal/torch7/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
[C]: at 0x004064d0

/usr/local/torch7/install/bin/luajit: ./misc/word_level.lua:94: the class torch.CudaByteTensor cannot be indexed

The accuracy is weird

Hi all, I ran this code with alternating attention and VGG features, and the output accuracy is weird.
Here is what it looks like

It is supposed to be 60.5 according to the paper.
By the way, in the step for downloading the image model, I didn't see an image_model folder.

Not able to extract Image Features

Hello, I am trying to replicate the results using Keras 2.0.
I have understood the part where the question features are extracted.
I am stuck on the image feature extraction part.
It would be great if I could get a brief outline of how the image feature extraction works.

data preprocess for open-ended task

I see that there is only code for the multiple-choice task data. Is there any code for open-ended answers?
And do you choose just one answer, rather than all 10 answers, as the training label for each question?
Thanks.

Getting 'attempt to index a nil value'

Hello
when I run this command
th prepro_img_vgg.lua -input_json ../data/cocoqa_data_prepro.json -image_root /home/jiasenlu/data/ -cnn_proto ../image_model/VGG_ILSVRC_19_layers_deploy.prototxt -cnn_model ../image_model/VGG_ILSVRC_19_layers.caffemodel

I get this error message
Successfully loaded ../image_model/VGG_ILSVRC_19_layers.caffemodel
/home/hadjer/torch/install/bin/luajit: ../image_model/VGG_ILSVRC_19_layers_deploy.prototxt.lua:3: attempt to index global 'ccn2' (a nil value)

Error while executing train

Getting below error while executing
command:
th train.lua

Error:
/home/ubuntu/src/torch/install/bin/luajit: /home/ubuntu/src/torch/install/share/lua/5.1/nn/THNN.lua:110: cuda runtime error (59) : device-side assert triggered at /home/ubuntu/src/torch/extra/cunn/lib/THCUNN/generic/ClassNLLCriterion.cu:87

Entire stack trace:
stack traceback:
[C]: in function 'v'
/home/ubuntu/src/torch/install/share/lua/5.1/nn/THNN.lua:110: in function 'ClassNLLCriterion_updateOutput'
...src/torch/install/share/lua/5.1/nn/ClassNLLCriterion.lua:41: in function 'updateOutput'
...torch/install/share/lua/5.1/nn/CrossEntropyCriterion.lua:20: in function 'forward'
train.lua:205: in function 'eval_split'
train.lua:331: in main chunk
[C]: in function 'dofile'
.../src/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
[C]: at 0x00405d50

Evaluation file eval.lua

Regarding https://github.com/jiasenlu/HieCoAttenVQA/blob/master/eval.lua, is it possible to use it for VQA evaluation?

thank you.

run prepro_vqa.py error when split=2

I get this error with split=2, while split=1 works fine.
The commands are:
python vqa_preprocess.py --download 1 --split 2
python prepro_vqa.py --input_train_json ../data/vqa_raw_train.json --input_test_json ../data/vqa_raw_test.json --num_ans 1000

The error is:
top words and their counts:9.88% done)
(320161, '?')
(225976, 'the')
(200545, 'is')
(118203, 'what')
(76624, 'are')
(64512, 'this')
(49209, 'in')
(45681, 'a')
(41629, 'on')
(40158, 'how')
(38230, 'many')
(37322, 'color')
(37023, 'of')
(29182, 'there')
(18392, 'man')
(14668, 'does')
(13492, 'people')
(12518, 'picture')
(11779, "'s")
(11758, 'to')
total words: 2284620
number of bad words: 0/14770 = 0.00%
number of words in vocab would be 14770
number of UNKs: 0/2284620 = 0.00%
inserting the special UNK token
Traceback (most recent call last):
File "prepro_vqa.py", line 292, in <module>
main(params)
File "prepro_vqa.py", line 217, in main
ans_test = encode_answer(imgs_test, atoi)
File "prepro_vqa.py", line 128, in encode_answer
ans_arrays[i] = atoi.get(img['ans'], -1) # -1 means wrong answer.
KeyError: 'ans'
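
The traceback shows encode_answer reading img['ans'] for every test item, which fails when the test split carries no answers. A minimal sketch of a tolerant variant follows. It assumes each item is a dict, returns a plain list where the original presumably fills an array, and is not the repository's actual fix.

```python
def encode_answer(imgs, atoi, missing=-1):
    """Map each item's answer string to its index.

    Items without an 'ans' key, or with an answer outside the
    vocabulary, are encoded as `missing` instead of raising KeyError.
    """
    encoded = []
    for img in imgs:
        answer = img.get('ans')  # None when the split has no answers
        encoded.append(atoi.get(answer, missing))
    return encoded


atoi = {'yes': 1, 'no': 2}  # toy answer-to-index table
imgs_test = [{'question': 'is it sunny ?'}, {'ans': 'no'}, {'ans': 'maybe'}]
print(encode_answer(imgs_test, atoi))
```

Any entries encoded as -1 would then have to be excluded from the loss at evaluation time, since a classification criterion rejects targets outside its class range.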

Connections with transformers?

I just came across your paper and found that the formulation of co-attention is quite similar to transformers:

Especially, a few (but not all) major ingredients, i.e., Q, V projections, attention computed with softmax after dot-product, also appear in transformers.

Considering your work was earlier than the transformer paper, do you think that it may have inspired transformers? Thanks.
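
For readers comparing the two formulations, the shared ingredient mentioned in this post, a softmax over dot-product scores used to weight a set of value vectors, can be sketched as follows. This is a generic illustration with hypothetical function names; it is neither the paper's co-attention (which, per the post, shares only some of these ingredients) nor a full transformer layer.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(query, keys, values):
    """Weight each value by softmax(query . key) and return (weights, output)."""
    weights = softmax([dot(query, k) for k in keys])
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, output


weights, output = attend([1.0, 0.0],
                         keys=[[1.0, 0.0], [0.0, 1.0]],
                         values=[[10.0, 0.0], [0.0, 10.0]])
print(weights, output)
```

The key whose dot product with the query is largest receives the largest weight, so the output leans toward its value vector.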

THNN.lua:110: input and gradOutput have different number of elements: input[20 x 26] has 520 elements, while gradOutput[26] has 26

rzai@rzai00:/prj/HieCoAttenVQA$ th train.lua -input_img_train_h5 data/vqa_data_img_vgg_train.h5 -input_img_test_h5 data/vqa_data_img_vgg_test.h5 -input_ques_h5 data/vqa_data_prepro.h5 -input_json data/vqa_data_prepro.json -co_atten_type Alternating -feature_type VGG
{
input_img_train_h5 : "data/vqa_data_img_vgg_train.h5"
learning_rate_decay_every : 300
optim : "rmsprop"
hidden_size : 512
optim_epsilon : 1e-08
output_size : 1000
rnn_layers : 2
input_img_test_h5 : "data/vqa_data_img_vgg_test.h5"
losses_log_every : 600
id : "0"
input_ques_h5 : "data/vqa_data_prepro.h5"
learning_rate_decay_start : 0
start_from : ""
gpuid : 6
seed : 123
input_json : "data/vqa_data_prepro.json"
optim_beta : 0.995
batch_size : 20
iterPerEpoch : 1200
rnn_size : 512
max_iters : -1
checkpoint_path : "save/train_vgg"
save_checkpoint_every : 6000
learning_rate : 0.0004
co_atten_type : "Alternating"
feature_type : "VGG"
backend : "cudnn"
optim_alpha : 0.99
}
DataLoader loading h5 image file: data/vqa_data_img_vgg_train.h5
DataLoader loading h5 image file: data/vqa_data_img_vgg_test.h5
DataLoader loading h5 question file: data/vqa_data_prepro.h5
DataLoader loading json file: data/vqa_data_prepro.json
assigned 215375 images to split 0
assigned 121512 images to split 2
Building the model...
total number of parameters in word_level: 8031747
total number of parameters in phrase_level: 2889219
total number of parameters in ques_level: 5517315
constructing clones inside the ques_level
total number of parameters in recursive_attention: 2862056
/home/rzai/torch/install/bin/luajit: /home/rzai/torch/install/share/lua/5.1/nn/THNN.lua:110: input and gradOutput have different number of elements: input[20 x 26] has 520 elements, while gradOutput[26] has 26 elements at /home/rzai/torch/extra/cunn/lib/THCUNN/generic/SoftMax.cu:84
stack traceback:
[C]: in function 'v'
/home/rzai/torch/install/share/lua/5.1/nn/THNN.lua:110: in function 'SoftMax_updateGradInput'
./misc/maskSoftmax.lua:33: in function 'updateGradInput'
/home/rzai/torch/install/share/lua/5.1/nngraph/gmodule.lua:420: in function 'neteval'
/home/rzai/torch/install/share/lua/5.1/nngraph/gmodule.lua:454: in function 'updateGradInput'
/home/rzai/torch/install/share/lua/5.1/nn/Module.lua:31: in function 'backward'
./misc/ques_level.lua:143: in function 'updateGradInput'
/home/rzai/torch/install/share/lua/5.1/nn/Module.lua:31: in function 'backward'
train.lua:272: in function 'lossFun'
train.lua:310: in main chunk
[C]: in function 'dofile'
...rzai/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
[C]: at 0x00406670
rzai@rzai00:/prj/HieCoAttenVQA$

Assertion `t >= 0 && t < n_classes` failed

I followed the tutorial up to th train.lua, where I hit this error:

libraries/torch/extra/cunn/lib/THCUNN/ClassNLLCriterion.cu:52: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [2,0,0] Assertion t >= 0 && t < n_classes failed.

Wish for your help.

image_model folder

Hi,
When will you release the code in the image_model folder? We need it to extract both VGG and residual features.
I have been waiting for it since the previous issue.
For example, prepro_img_residule.lua requires transforms.lua from the image_model folder, which is missing:

local t = require 'image_model.transforms'

Hope it will come out soon.

which folder for -image_root ?

Should I set image_root to the folder containing the unzipped train2014.zip?

HDF5-DIAG: Error detected in HDF5 (1.8.13) thread 139982594582336:

Hi all,
I am getting this error after about 25000 iterations.
Could anyone please suggest how to solve it?

iter 25800: 1.169770, 1.267317, 0.000339, 27442.332723
HDF5-DIAG: Error detected in HDF5 (1.8.13) thread 139982594582336:
#000: H5Tnative.c line 122 in H5Tget_native_type(): unable to register data type
major: Datatype
minor: Unable to register new atom
#001: H5I.c line 895 in H5I_register(): can't insert ID node into skip list
major: Object atom
minor: Unable to insert object
#002: H5SL.c line 995 in H5SL_insert(): can't create new skip list node
major: Skip Lists
minor: Unable to insert object
#003: H5SL.c line 687 in H5SL_insert_common(): can't insert duplicate key
major: Skip Lists
minor: Unable to insert object
HDF5-DIAG: Error detected in HDF5 (1.8.13) thread 139982594582336:
#000: H5Dio.c line 173 in H5Dread(): can't read data
major: Dataset
minor: Read failed
#001: H5Dio.c line 420 in H5D__read(): unable to set up type info
major: Dataset
minor: Unable to initialize object
#002: H5Dio.c line 927 in H5D__typeinfo_init(): not a datatype
major: Invalid arguments to routine
minor: Inappropriate type
/home/ubuntu/torch/install/bin/luajit: /home/ubuntu/torch/install/share/lua/5.1/hdf5/dataset.lua:160: HDF5DataSet:partial() - failed reading data from [HDF5DataSet (83886080 /images_train DATASET)]
stack traceback:
[C]: in function 'assert'
/home/ubuntu/torch/install/share/lua/5.1/hdf5/dataset.lua:160: in function 'partial'
./misc/DataLoaderDisk.lua:123: in function 'getBatch'
train.lua:245: in function 'lossFun'
train.lua:310: in main chunk
[C]: in function 'dofile'
...untu/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
[C]: at 0x00405d50
