Comments (6)

dvornikita commented on June 26, 2024

You are right, there is a mistake in the example run: both flags are missing. You need to run with both --detect and --segment if you want to train for both tasks, or pass only one of them if you want to train for a single task. I have pushed that fix.
Thank you.
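
For illustration, the task selection described above could be sketched as follows. This is a minimal sketch and not the actual argument handling in training.py: only the flag names --detect and --segment come from this thread, while the function name select_tasks, the store_true behaviour, and the error raised when neither flag is given are assumptions.

```python
import argparse


def select_tasks(argv):
    """Return the list of tasks enabled by --detect / --segment.

    Hypothetical helper: the real training.py may define and use
    these flags differently.
    """
    parser = argparse.ArgumentParser(allow_abbrev=False)
    parser.add_argument("--detect", action="store_true")
    parser.add_argument("--segment", action="store_true")
    # Ignore every other option (e.g. --run_name, --batch_size).
    args, _unknown = parser.parse_known_args(argv)

    tasks = []
    if args.detect:
        tasks.append("detection")
    if args.segment:
        tasks.append("segmentation")
    if not tasks:
        # Assumed behaviour: refuse to train when no task is selected.
        raise ValueError("pass --detect, --segment, or both")
    return tasks


if __name__ == "__main__":
    print(select_tasks(["--detect", "--segment"]))
    print(select_tasks(["--run_name=demo", "--segment"]))
```

With both flags the sketch enables both tasks; with a single flag it enables only that task.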

from blitznet.

fastlater commented on June 26, 2024

@DrSleep The correction has been made. I guess you can close this issue.

dvornikita commented on June 26, 2024

Fixed

fastlater commented on June 26, 2024

@DrSleep Did you run the training script to the end?
I am trying:
python training.py --run_name=BlitzNet300_VOC12_Det_Seg --dataset=voc12-train --trunk=resnet50 --x4 --batch_size=1 --optimizer=adam --detect --segment --max_iterations=1001 --lr_decay 1000 1500

@dvornikita
I am using an NVIDIA Quadro K2000, which only has 2 GB of memory, so I reduced the batch size from 32 to 1, and in train.txt I reduced the number of inputs from 1464 images to 500.
After a few steps (fewer than 100), I get the error:
Reshape cannot infer the missing input size for an empty tensor unless all specified input sizes are non-zero.
Could it be an out-of-memory error? I want to be sure that the problem is my GPU and not the code.
Is there a way to test the training process under minimal requirements by adjusting the configuration and other arguments?

2017-09-21 09:56:14.669454: W C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\common_runtime\bfc_allocator.cc:217] Allocator (GPU_0_bfc) ran out of memory trying to allocate 586.13MiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
[INFO]: step 0, loss = 29.05, acc = 0.01, iou=0.000000, lr=5.000 (0.0 examples/sec; 20.544 sec/batch)
[INFO]: step 1, loss = 26.51, acc = 0.01, iou=0.000000, lr=4.000 (0.7 examples/sec; 1.366 sec/batch)
[INFO]: step 2, loss = 25.99, acc = 0.00, iou=0.003891, lr=4.000 (0.7 examples/sec; 1.379 sec/batch)
[INFO]: step 3, loss = 19.77, acc = 0.18, iou=0.008367, lr=4.000 (0.7 examples/sec; 1.354 sec/batch)
[INFO]: step 4, loss = 20.65, acc = 0.00, iou=0.016453, lr=4.000 (0.7 examples/sec; 1.400 sec/batch)
[INFO]: step 5, loss = 20.03, acc = 0.00, iou=0.022892, lr=4.000 (0.7 examples/sec; 1.360 sec/batch)
[INFO]: step 6, loss = 21.74, acc = 0.00, iou=0.027287, lr=4.000 (0.7 examples/sec; 1.388 sec/batch)
[INFO]: step 7, loss = 19.69, acc = 0.00, iou=0.030866, lr=4.000 (0.7 examples/sec; 1.364 sec/batch)
[INFO]: step 8, loss = 21.50, acc = 0.28, iou=0.032887, lr=4.000 (0.7 examples/sec; 1.371 sec/batch)
[INFO]: step 9, loss = 21.19, acc = 0.35, iou=0.033197, lr=4.000 (0.7 examples/sec; 1.390 sec/batch)
[INFO]: step 10, loss = 16.02, acc = 0.52, iou=0.033137, lr=4.000 (0.7 examples/sec; 1.420 sec/batch)
[INFO]: step 11, loss = 18.08, acc = 0.43, iou=0.034473, lr=4.000 (0.7 examples/sec; 1.341 sec/batch)
[INFO]: step 12, loss = 16.75, acc = 0.57, iou=0.035309, lr=4.000 (0.7 examples/sec; 1.340 sec/batch)
[INFO]: step 13, loss = 17.05, acc = 0.61, iou=0.035514, lr=4.000 (0.7 examples/sec; 1.404 sec/batch)
[INFO]: step 14, loss = 21.49, acc = 0.47, iou=0.036379, lr=4.000 (0.7 examples/sec; 1.428 sec/batch)
[INFO]: step 15, loss = 19.92, acc = 0.57, iou=0.035743, lr=4.000 (0.7 examples/sec; 1.376 sec/batch)
[INFO]: step 16, loss = 16.62, acc = 0.46, iou=0.036653, lr=4.000 (0.7 examples/sec; 1.344 sec/batch)
[INFO]: step 17, loss = 12.99, acc = 0.51, iou=0.037424, lr=4.000 (0.7 examples/sec; 1.421 sec/batch)
[INFO]: step 18, loss = 18.08, acc = 0.69, iou=0.037933, lr=4.000 (0.7 examples/sec; 1.353 sec/batch)
[INFO]: step 19, loss = 30.51, acc = 0.75, iou=0.037144, lr=4.000 (0.7 examples/sec; 1.361 sec/batch)
[INFO]: step 20, loss = 16.09, acc = 0.72, iou=0.037762, lr=4.000 (0.7 examples/sec; 1.348 sec/batch)
[INFO]: step 21, loss = 18.75, acc = 0.64, iou=0.038262, lr=4.000 (0.7 examples/sec; 1.369 sec/batch)
[INFO]: step 22, loss = 23.29, acc = 0.75, iou=0.038670, lr=4.000 (0.7 examples/sec; 1.356 sec/batch)
[INFO]: step 23, loss = 16.26, acc = 0.71, iou=0.039054, lr=4.000 (0.7 examples/sec; 1.371 sec/batch)
[INFO]: step 24, loss = 15.54, acc = 0.73, iou=0.038320, lr=4.000 (0.7 examples/sec; 1.355 sec/batch)
[INFO]: step 25, loss = 11.92, acc = 0.74, iou=0.039059, lr=4.000 (0.7 examples/sec; 1.375 sec/batch)
2017-09-21 09:57:03.093948: W C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu
\PY\35\tensorflow\core\framework\op_kernel.cc:1192] Invalid argument: Reshape ca
nnot infer the missing input size for an empty tensor unless all specified input
sizes are non-zero
[[Node: gradients/TopKV2_grad/Reshape = Reshape[T=DT_INT32, Tshape=DT_I
NT32, _device="/job:localhost/replica:0/task:0/gpu:0"](TopKV2/_851, gradients/To
pKV2_grad/stack)]]
2017-09-21 09:57:03.093948: W C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu
\PY\35\tensorflow\core\framework\op_kernel.cc:1192] Invalid argument: Reshape ca
nnot infer the missing input size for an empty tensor unless all specified input
sizes are non-zero
[[Node: gradients/TopKV2_grad/Reshape = Reshape[T=DT_INT32, Tshape=DT_I
NT32, _device="/job:localhost/replica:0/task:0/gpu:0"](TopKV2/_851, gradients/To
pKV2_grad/stack)]]
2017-09-21 09:57:03.093948: W C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu
\PY\35\tensorflow\core\framework\op_kernel.cc:1192] Invalid argument: Reshape ca
nnot infer the missing input size for an empty tensor unless all specified input
sizes are non-zero
[[Node: gradients/TopKV2_grad/Reshape = Reshape[T=DT_INT32, Tshape=DT_I
NT32, _device="/job:localhost/replica:0/task:0/gpu:0"](TopKV2/_851, gradients/To
pKV2_grad/stack)]]
2017-09-21 09:57:03.095198: W C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu
\PY\35\tensorflow\core\framework\op_kernel.cc:1192] Invalid argument: Reshape ca
nnot infer the missing input size for an empty tensor unless all specified input
sizes are non-zero
[[Node: gradients/TopKV2_grad/Reshape = Reshape[T=DT_INT32, Tshape=DT_I
NT32, _device="/job:localhost/replica:0/task:0/gpu:0"](TopKV2/_851, gradients/To
pKV2_grad/stack)]]
2017-09-21 09:57:03.100198: W C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu
\PY\35\tensorflow\core\framework\op_kernel.cc:1192] Invalid argument: Reshape ca
nnot infer the missing input size for an empty tensor unless all specified input
sizes are non-zero
[[Node: gradients/TopKV2_grad/Reshape = Reshape[T=DT_INT32, Tshape=DT_I
NT32, _device="/job:localhost/replica:0/task:0/gpu:0"](TopKV2/_851, gradients/To
pKV2_grad/stack)]]
2017-09-21 09:57:03.101448: W C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu
\PY\35\tensorflow\core\framework\op_kernel.cc:1192] Invalid argument: Reshape ca
nnot infer the missing input size for an empty tensor unless all specified input
sizes are non-zero
[[Node: gradients/TopKV2_grad/Reshape = Reshape[T=DT_INT32, Tshape=DT_I
NT32, _device="/job:localhost/replica:0/task:0/gpu:0"](TopKV2/_851, gradients/To
pKV2_grad/stack)]]
2017-09-21 09:57:03.103948: W C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu
\PY\35\tensorflow\core\framework\op_kernel.cc:1192] Invalid argument: Reshape ca
nnot infer the missing input size for an empty tensor unless all specified input
sizes are non-zero
[[Node: gradients/TopKV2_grad/Reshape = Reshape[T=DT_INT32, Tshape=DT_I
NT32, _device="/job:localhost/replica:0/task:0/gpu:0"](TopKV2/_851, gradients/To
pKV2_grad/stack)]]
2017-09-21 09:57:03.107698: W C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu
\PY\35\tensorflow\core\framework\op_kernel.cc:1192] Invalid argument: Reshape ca
nnot infer the missing input size for an empty tensor unless all specified input
sizes are non-zero
[[Node: gradients/TopKV2_grad/Reshape = Reshape[T=DT_INT32, Tshape=DT_I
NT32, _device="/job:localhost/replica:0/task:0/gpu:0"](TopKV2/_851, gradients/To
pKV2_grad/stack)]]
2017-09-21 09:57:03.108948: W C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu
\PY\35\tensorflow\core\framework\op_kernel.cc:1192] Invalid argument: Reshape ca
nnot infer the missing input size for an empty tensor unless all specified input
sizes are non-zero
[[Node: gradients/TopKV2_grad/Reshape = Reshape[T=DT_INT32, Tshape=DT_I
NT32, _device="/job:localhost/replica:0/task:0/gpu:0"](TopKV2/_851, gradients/To
pKV2_grad/stack)]]
2017-09-21 09:57:03.122698: E C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu
\PY\35\tensorflow\core\kernels\check_numerics_op.cc:157] abnormal_detected_host
@0000000200EF0B00 = {1, 0} LossTensor is inf or nan
2017-09-21 09:57:03.140199: W C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu
\PY\35\tensorflow\core\kernels\queue_base.cc:295] _0_parallel_read/filenames: Sk
ipping cancelled enqueue attempt with queue not closed
2017-09-21 09:57:03.141449: W C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu
\PY\35\tensorflow\core\kernels\queue_base.cc:295] _2_parallel_read/common_queue:
Skipping cancelled enqueue attempt with queue not closed
2017-09-21 09:57:03.142699: W C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu
\PY\35\tensorflow\core\kernels\queue_base.cc:295] _2_parallel_read/common_queue:
Skipping cancelled enqueue attempt with queue not closed
Traceback (most recent call last):
INFO:tensorflow:Error reported to Coordinator: <class 'tensorflow.python.framewo
rk.errors_impl.CancelledError'>, Enqueue operation was cancelled
[[Node: parallel_read/common_queue_enqueue_1 = QueueEnqueueV2[Tcomponen
ts=[DT_STRING, DT_STRING], timeout_ms=-1, _device="/job:localhost/replica:0/task
:0/cpu:0"](parallel_read/common_queue, parallel_read/ReaderReadV2_1, parallel_re
ad/ReaderReadV2_1:1)]] File "C:\Program Files (x86)\Python 3.5.2\lib\site-packa
ges\tensorflow\python\client\session.py", line 1327, in _do_call

[INFO]: Error reported to Coordinator: <class 'tensorflow.python.framework.error
s_impl.CancelledError'>, Enqueue operation was cancelled
[[Node: parallel_read/common_queue_enqueue_1 = QueueEnqueueV2[Tcomponen
ts=[DT_STRING, DT_STRING], timeout_ms=-1, _device="/job:localhost/replica:0/task
:0/cpu:0"](parallel_read/common_queue, parallel_read/ReaderReadV2_1, parallel_re
ad/ReaderReadV2_1:1)]]
return fn(*args)
File "C:\Program Files (x86)\Python 3.5.2\lib\site-packages\tensorflow\python
client\session.py", line 1306, in _run_fn
status, run_metadata)
File "C:\Program Files (x86)\Python 3.5.2\lib\contextlib.py", line 66, in ex
it

next(self.gen)
File "C:\Program Files (x86)\Python 3.5.2\lib\site-packages\tensorflow\python
framework\errors_impl.py", line 466, in raise_exception_on_not_ok_status
pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.InvalidArgumentError: Reshape cannot inf
er the missing input size for an empty tensor unless all specified input sizes a
re non-zero
[[Node: gradients/TopKV2_grad/Reshape = Reshape[T=DT_INT32, Tshape=DT_I
NT32, _device="/job:localhost/replica:0/task:0/gpu:0"](TopKV2/_851, gradients/To
pKV2_grad/stack)]]
[[Node: PiecewiseConstant/case/Assert/AssertGuard/pred_id/_587 = _HostR
ecvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0"
, send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1
, tensor_name="edge_685_PiecewiseConstant/case/Assert/AssertGuard/pred_id", tens
or_type=DT_BOOL, _device="/job:localhost/replica:0/task:0/cpu:0"
]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "training.py", line 333, in
tf.app.run()
File "C:\Program Files (x86)\Python 3.5.2\lib\site-packages\tensorflow\python
platform\app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "training.py", line 312, in main
train(dataset, net, net_config)
File "training.py", line 249, in train
update_mean_iou, learning_rate])
File "C:\Program Files (x86)\Python 3.5.2\lib\site-packages\tensorflow\python
client\session.py", line 895, in run
run_metadata_ptr)
File "C:\Program Files (x86)\Python 3.5.2\lib\site-packages\tensorflow\python
client\session.py", line 1124, in _run
feed_dict_tensor, options, run_metadata)
File "C:\Program Files (x86)\Python 3.5.2\lib\site-packages\tensorflow\python
client\session.py", line 1321, in _do_run
options, run_metadata)
File "C:\Program Files (x86)\Python 3.5.2\lib\site-packages\tensorflow\python
client\session.py", line 1340, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Reshape cannot inf
er the missing input size for an empty tensor unless all specified input sizes a
re non-zero
[[Node: gradients/TopKV2_grad/Reshape = Reshape[T=DT_INT32, Tshape=DT_I
NT32, _device="/job:localhost/replica:0/task:0/gpu:0"](TopKV2/_851, gradients/To
pKV2_grad/stack)]]
[[Node: PiecewiseConstant/case/Assert/AssertGuard/pred_id/_587 = _HostR
ecvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0"
, send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1
, tensor_name="edge_685_PiecewiseConstant/case/Assert/AssertGuard/pred_id", tens
or_type=DT_BOOL, _device="/job:localhost/replica:0/task:0/cpu:0"
]]

Caused by op 'gradients/TopKV2_grad/Reshape', defined at:
File "training.py", line 333, in
tf.app.run()
File "C:\Program Files (x86)\Python 3.5.2\lib\site-packages\tensorflow\python
platform\app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "training.py", line 312, in main
train(dataset, net, net_config)
File "training.py", line 209, in train
summarize_gradients=True)
File "C:\Program Files (x86)\Python 3.5.2\lib\site-packages\tensorflow\contrib
\slim\python\slim\learning.py", line 440, in create_train_op
check_numerics=check_numerics)
File "C:\Program Files (x86)\Python 3.5.2\lib\site-packages\tensorflow\contrib
\training\python\training\training.py", line 439, in create_train_op
colocate_gradients_with_ops=colocate_gradients_with_ops)
File "C:\Program Files (x86)\Python 3.5.2\lib\site-packages\tensorflow\python
training\optimizer.py", line 386, in compute_gradients
colocate_gradients_with_ops=colocate_gradients_with_ops)
File "C:\Program Files (x86)\Python 3.5.2\lib\site-packages\tensorflow\python
ops\gradients_impl.py", line 542, in gradients
grad_scope, op, func_call, lambda: grad_fn(op, *out_grads))
File "C:\Program Files (x86)\Python 3.5.2\lib\site-packages\tensorflow\python
ops\gradients_impl.py", line 348, in _MaybeCompile
return grad_fn() # Exit early
File "C:\Program Files (x86)\Python 3.5.2\lib\site-packages\tensorflow\python
ops\gradients_impl.py", line 542, in
grad_scope, op, func_call, lambda: grad_fn(op, *out_grads))
File "C:\Program Files (x86)\Python 3.5.2\lib\site-packages\tensorflow\python
ops\nn_grad.py", line 707, in _TopKGrad
ind_2d = array_ops.reshape(op.outputs[1], array_ops.stack([-1, ind_lastdim])
)
File "C:\Program Files (x86)\Python 3.5.2\lib\site-packages\tensorflow\python
ops\gen_array_ops.py", line 2619, in reshape
name=name)
File "C:\Program Files (x86)\Python 3.5.2\lib\site-packages\tensorflow\python
framework\op_def_library.py", line 767, in apply_op
op_def=op_def)
File "C:\Program Files (x86)\Python 3.5.2\lib\site-packages\tensorflow\python
framework\ops.py", line 2630, in create_op
original_op=self._default_original_op, op_def=op_def)
File "C:\Program Files (x86)\Python 3.5.2\lib\site-packages\tensorflow\python
framework\ops.py", line 1204, in init
self._traceback = self._graph._extract_stack() # pylint: disable=protected-
access

...which was originally created as op 'TopKV2', defined at:
File "training.py", line 333, in
tf.app.run()
[elided 1 identical lines from previous traceback]
File "training.py", line 312, in main
train(dataset, net, net_config)
File "training.py", line 175, in train
seg_gt, dataset, config)
File "training.py", line 117, in objective
detection_loss(location, confidence, refine_ph, classes_ph, pos_mask)
File "training.py", line 74, in detection_loss
number_of_negatives)
File "C:\Program Files (x86)\Python 3.5.2\lib\site-packages\tensorflow\python
ops\nn_ops.py", line 1949, in top_k
return gen_nn_ops._top_kv2(input, k=k, sorted=sorted, name=name)
File "C:\Program Files (x86)\Python 3.5.2\lib\site-packages\tensorflow\python
ops\gen_nn_ops.py", line 2577, in _top_kv2
name=name)
File "C:\Program Files (x86)\Python 3.5.2\lib\site-packages\tensorflow\python
framework\op_def_library.py", line 767, in apply_op
op_def=op_def)
File "C:\Program Files (x86)\Python 3.5.2\lib\site-packages\tensorflow\python
framework\ops.py", line 2630, in create_op
original_op=self._default_original_op, op_def=op_def)
File "C:\Program Files (x86)\Python 3.5.2\lib\site-packages\tensorflow\python
framework\ops.py", line 1204, in init
self._traceback = self._graph._extract_stack() # pylint: disable=protected-
access

InvalidArgumentError (see above for traceback): Reshape cannot infer the missing
input size for an empty tensor unless all specified input sizes are non-zero
[[Node: gradients/TopKV2_grad/Reshape = Reshape[T=DT_INT32, Tshape=DT_I
NT32, _device="/job:localhost/replica:0/task:0/gpu:0"](TopKV2/_851, gradients/To
pKV2_grad/stack)]]
[[Node: PiecewiseConstant/case/Assert/AssertGuard/pred_id/_587 = _HostR
ecvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0"
, send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1
, tensor_name="edge_685_PiecewiseConstant/case/Assert/AssertGuard/pred_id", tens
or_type=DT_BOOL, _device="/job:localhost/replica:0/task:0/cpu:0"
]]

dvornikita commented on June 26, 2024

@fastlater Regarding the checkpoints, we save them every 1000 iterations no matter what. You can see this in the main training loop in training.py, and you can change this value.
Regarding your error, I guess it is caused by the absence of positive proposals for the single image that you feed. As you know, there is some data augmentation involved in the pipeline, so it can happen that your random crop doesn't contain an object. The probability p of this event is pretty low, and when you have 32 images in your batch, the probability that none of the crops contains an object becomes p^32, so almost zero. Don't forget about batch normalization either: it won't produce anything meaningful with a batch size of one.
TL;DR: Increase your batch size.
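
To make the p^32 argument concrete, here is a rough numeric sketch. The value p = 0.05 is purely hypothetical (the thread gives no number), the crops are assumed to be independent, and the helper names are made up for illustration. The last function assumes an SSD-style hard negative mining rule with a 3:1 negative-to-positive ratio, which may differ from what detection_loss in training.py actually does; it only shows how zero positives could lead to k = 0 in the top_k call seen in the traceback.

```python
def prob_batch_without_positives(p_empty_crop, batch_size):
    """Probability that no crop in the batch contains an object.

    Assumes crops are independent, each empty with probability p_empty_crop.
    """
    return p_empty_crop ** batch_size


def prob_failure_within(p_empty_crop, batch_size, steps):
    """Probability that at least one of `steps` batches has no positives."""
    p_batch = prob_batch_without_positives(p_empty_crop, batch_size)
    return 1.0 - (1.0 - p_batch) ** steps


def num_hard_negatives(num_positives, num_anchors, neg_pos_ratio=3):
    """Assumed SSD-style rule: keep at most neg_pos_ratio negatives per positive."""
    return min(neg_pos_ratio * num_positives, num_anchors - num_positives)


if __name__ == "__main__":
    p = 0.05  # hypothetical chance that a random crop contains no object
    for batch_size in (1, 32):
        per_batch = prob_batch_without_positives(p, batch_size)
        within_100 = prob_failure_within(p, batch_size, steps=100)
        print("batch_size=%d: per-batch %.3g, within 100 steps %.3g"
              % (batch_size, per_batch, within_100))
    # With zero positives the assumed rule asks top_k for k = 0 negatives.
    print("k with no positives:", num_hard_negatives(0, num_anchors=1000))
```

Under these assumptions, a batch size of 1 is very likely to hit an object-free batch within the first 100 steps, while a batch size of 32 makes that event practically impossible.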

fastlater commented on June 26, 2024

@dvornikita Thank you for your reply. About the checkpoint saving method, I found it in training.py. I added a few lines to save the checkpoint when max_iterations is reached:
if step % args.max_iterations == 0 and step > 0: #here summaries and save checkpoints

I was training with batch_size=1 because if I set any number higher than 1, the execution stops due to low memory. I will upgrade my GPU and run it next time with batch_size=32.
Thanks.
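
The two saving rules mentioned in this thread (every 1000 iterations, plus the extra condition on max_iterations) can be combined in a small predicate. This is only a sketch: the function name should_save_checkpoint and its parameters are hypothetical, and the real loop in training.py may be organized differently.

```python
def should_save_checkpoint(step, save_every=1000, max_iterations=None):
    """Decide whether to write a checkpoint at this training step.

    Saves on every multiple of `save_every` and, if `max_iterations`
    is given, also when that final step is reached. Step 0 never saves.
    """
    if step <= 0:
        return False
    if step % save_every == 0:
        return True
    return max_iterations is not None and step % max_iterations == 0


if __name__ == "__main__":
    # With --max_iterations=1001 as in the command above,
    # steps 1000 and 1001 would both trigger a save.
    saved = [s for s in range(0, 1002)
             if should_save_checkpoint(s, 1000, 1001)]
    print(saved)
```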
