I get a CUDA out-of-memory runtime error right after iteration 3000 when running the following command:
$ python train.py --config config/keep/detectron_100_resnet_most_data.yaml
>>> torch.__version__
'0.3.1'
i_epoch: 1 i_iter: 2000 val_loss:3.4700 val_acc:0.6148 runtime: 67.63 min
iter: 2100 train_loss: 2.8821 train_score: 0.6225 avg_train_score: 0.6105 val_score: 0.6008 val_loss: 3.4801 time(s): 561.6 s
iter: 2200 train_loss: 2.6174 train_score: 0.6195 avg_train_score: 0.6135 val_score: 0.6803 val_loss: 3.1912 time(s): 218.7 s
iter: 2300 train_loss: 2.7957 train_score: 0.6426 avg_train_score: 0.6190 val_score: 0.6205 val_loss: 3.3420 time(s): 412.5 s
iter: 2400 train_loss: 2.4924 train_score: 0.6453 avg_train_score: 0.6207 val_score: 0.6117 val_loss: 3.4666 time(s): 192.8 s
iter: 2500 train_loss: 2.7591 train_score: 0.6234 avg_train_score: 0.6243 val_score: 0.6293 val_loss: 3.4114 time(s): 190.6 s
iter: 2600 train_loss: 2.9420 train_score: 0.5928 avg_train_score: 0.6237 val_score: 0.6400 val_loss: 3.2718 time(s): 185.9 s
iter: 2700 train_loss: 2.6800 train_score: 0.6441 avg_train_score: 0.6247 val_score: 0.6590 val_loss: 3.0637 time(s): 176.4 s
iter: 2800 train_loss: 2.7028 train_score: 0.6506 avg_train_score: 0.6303 val_score: 0.6828 val_loss: 3.1584 time(s): 189.8 s
iter: 2900 train_loss: 2.6380 train_score: 0.6432 avg_train_score: 0.6326 val_score: 0.6340 val_loss: 3.3097 time(s): 183.0 s
iter: 3000 train_loss: 2.7275 train_score: 0.6227 avg_train_score: 0.6311 val_score: 0.6725 val_loss: 3.1253 time(s): 187.7 s
THCudaCheck FAIL file=/pytorch/torch/lib/THC/generic/THCStorage.cu line=58 error=2 : out of memory
Traceback (most recent call last):
  File "train.py", line 230, in <module>
    scheduler=scheduler,best_val_accuracy=best_accuracy)
  File "/home/rcadene/pythia/train_model/Engineer.py", line 159, in one_stage_train
    data_reader_eval)
  File "/home/rcadene/pythia/train_model/Engineer.py", line 87, in save_a_snapshot
    loss_criterion=loss_criterion)
  File "/home/rcadene/pythia/train_model/Engineer.py", line 204, in one_stage_eval_model
    score, loss, n_sample = compute_a_batch(batch, myModel, eval_mode=True, loss_criterion=loss_criterion)
  File "/home/rcadene/pythia/train_model/Engineer.py", line 191, in compute_a_batch
    logit_res = one_stage_run_model(batch, my_model, add_graph, log_dir, eval_mode)
  File "/home/rcadene/pythia/train_model/Engineer.py", line 249, in one_stage_run_model
    image_feat_variables=image_feat_variables)
  File "/home/rcadene/.conda/envs/pythia/lib/python3.6/site-packages/torch/nn/modules/module.py", line 357, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/rcadene/pythia/top_down_bottom_up/top_down_bottom_up_model.py", line 103, in forward
    question_embedding_total, image_dim_variable_use)
  File "/home/rcadene/.conda/envs/pythia/lib/python3.6/site-packages/torch/nn/modules/module.py", line 357, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/rcadene/pythia/top_down_bottom_up/image_embedding.py", line 39, in forward
    image_feat_variable, question_embedding, image_dims)
  File "/home/rcadene/.conda/envs/pythia/lib/python3.6/site-packages/torch/nn/modules/module.py", line 357, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/rcadene/pythia/top_down_bottom_up/image_attention.py", line 140, in forward
    joint_feature = self.modal_combine(image_feat, question_embedding)
  File "/home/rcadene/.conda/envs/pythia/lib/python3.6/site-packages/torch/nn/modules/module.py", line 357, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/rcadene/pythia/top_down_bottom_up/multi_modal_combine.py", line 142, in forward
    joint_feature = self.dropout(joint_feature)
  File "/home/rcadene/.conda/envs/pythia/lib/python3.6/site-packages/torch/nn/modules/module.py", line 357, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/rcadene/.conda/envs/pythia/lib/python3.6/site-packages/torch/nn/modules/dropout.py", line 46, in forward
    return F.dropout(input, self.p, self.training, self.inplace)
  File "/home/rcadene/.conda/envs/pythia/lib/python3.6/site-packages/torch/nn/functional.py", line 526, in dropout
    return _functions.dropout.Dropout.apply(input, p, training, inplace)
  File "/home/rcadene/.conda/envs/pythia/lib/python3.6/site-packages/torch/nn/_functions/dropout.py", line 32, in forward
    output = input.clone()
RuntimeError: cuda runtime error (2) : out of memory at /pytorch/torch/lib/THC/generic/THCStorage.cu:58
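
The traceback shows the failure inside the evaluation pass (save_a_snapshot -> one_stage_eval_model), so one thing that might be worth checking is whether that pass keeps autograd history alive. This is only a guess at the cause; the traceback does not confirm it. Below is a minimal sketch of an evaluation loop that records no autograd graph. It assumes PyTorch 0.4 or later, where `torch.no_grad()` and `.item()` exist; on 0.3.1, as reported above, the equivalent was to create the input Variables with `volatile=True`. The names `evaluate`, `toy_model` and `toy_batches` are hypothetical stand-ins, not part of the Pythia code.

```python
import torch
import torch.nn as nn


def evaluate(model, batches, loss_fn):
    """Run a validation pass without recording autograd history."""
    was_training = model.training
    model.eval()  # switch dropout and similar layers to inference behaviour
    total_loss, n_samples = 0.0, 0
    with torch.no_grad():  # no graph is built, so activations can be freed early
        for inputs, targets in batches:
            logits = model(inputs)
            # .item() copies the scalar into Python, so no loss tensor is kept alive
            total_loss += loss_fn(logits, targets).item() * inputs.size(0)
            n_samples += inputs.size(0)
    model.train(was_training)  # restore whatever mode the model was in
    return total_loss / max(n_samples, 1)


# Toy stand-ins (hypothetical, not the Pythia model or data reader).
toy_model = nn.Sequential(nn.Linear(8, 16), nn.Dropout(0.5), nn.Linear(16, 4))
toy_batches = [(torch.randn(5, 8), torch.randint(0, 4, (5,))) for _ in range(3)]
val_loss = evaluate(toy_model, toy_batches, nn.CrossEntropyLoss())
print("val_loss: %.4f" % val_loss)
```

If memory still grows between snapshots with a loop like this, the cause is more likely elsewhere, for example tensors accumulated across training iterations.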