I successfully deployed the project on a Windows 10 system with TensorFlow 1.3.0 (CPU only). However, when I deployed it on an Ubuntu system with TensorFlow 1.4 (GeForce GTX 1080 Ti), I ran into the following problem.
sheldon@amax:~/Projects/Deeplab-v2--ResNet-101--Tensorflow$ python3 main.py
2017-12-28 11:41:44.924418: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2017-12-28 11:41:45.623542: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:8a:00.0
totalMemory: 10.91GiB freeMemory: 10.75GiB
2017-12-28 11:41:45.623612: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:8a:00.0, compute capability: 6.1)
-----------build encoder: deeplab pre-trained-----------
after start block: (10, 81, 81, 64)
after block1: (10, 81, 81, 256)
after block2: (10, 41, 41, 512)
after block3: (10, 41, 41, 1024)
after block4: (10, 41, 41, 2048)
-----------build decoder-----------
after aspp block: (10, 41, 41, 21)
Restored model parameters from /data2/deeplab_resnet_init.ckpt
2017-12-28 11:42:09.929249: W tensorflow/core/common_runtime/bfc_allocator.cc:273] Allocator (GPU_0_bfc) ran out of memory trying to allocate 720.00MiB. Current allocation summary follows.
2017-12-28 11:42:09.929406: I tensorflow/core/common_runtime/bfc_allocator.cc:627] Bin (256): Total Chunks: 278, Chunks in use: 216. 69.5KiB allocated for chunks. 54.0KiB in use in bin. 15.6KiB client-requested in use in bin.
2017-12-28 11:42:09.929433: I tensorflow/core/common_runtime/bfc_allocator.cc:627] Bin (512): Total Chunks: 65, Chunks in use: 64. 32.5KiB allocated for chunks. 32.0KiB in use in bin. 32.0KiB client-requested in use in bin.
2017-12-28 11:42:09.929452: I tensorflow/core/common_runtime/bfc_allocator.cc:627] Bin (1024): Total Chunks: 401, Chunks in use: 401. 401.2KiB allocated for chunks. 401.2KiB in use in bin. 401.0KiB client-requested in use in bin.
2017-12-28 11:42:09.929471: I tensorflow/core/common_runtime/bfc_allocator.cc:627] Bin (2048): Total Chunks: 88, Chunks in use: 88. 176.0KiB allocated for chunks. 176.0KiB in use in bin. 176.0KiB client-requested in use in bin.
(There were many more similar lines of allocator output; I have omitted most of them here.)
2017-12-28 11:42:09.950169: I tensorflow/core/common_runtime/bfc_allocator.cc:679] 1 Chunks of size 424673280 totalling 405.00MiB
2017-12-28 11:42:09.950182: I tensorflow/core/common_runtime/bfc_allocator.cc:679] 1 Chunks of size 663552000 totalling 632.81MiB
2017-12-28 11:42:09.950194: I tensorflow/core/common_runtime/bfc_allocator.cc:683] Sum Total of in-use chunks: 9.55GiB
2017-12-28 11:42:09.950211: I tensorflow/core/common_runtime/bfc_allocator.cc:685] Stats:
Limit: 10968825856
InUse: 10253331200
MaxInUse: 10265177344
NumAllocs: 4856
MaxAllocSize: 802160640
2017-12-28 11:42:09.950341: W tensorflow/core/common_runtime/bfc_allocator.cc:277] **********************************************************************************************______
2017-12-28 11:42:09.950375: W tensorflow/core/framework/op_kernel.cc:1192] Resource exhausted: OOM when allocating tensor with shape[5760,4,4,2048]
2017-12-28 11:42:09.973697: W tensorflow/core/common_runtime/bfc_allocator.cc:217] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.00GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
2017-12-28 11:42:09.973763: W tensorflow/core/common_runtime/bfc_allocator.cc:217] Allocator (GPU_0_bfc) ran out of memory trying to allocate 4.00MiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
2017-12-28 11:42:09.973796: W tensorflow/core/common_runtime/bfc_allocator.cc:217] Allocator (GPU_0_bfc) ran out of memory trying to allocate 928.77MiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
2017-12-28 11:42:09.999611: W tensorflow/core/common_runtime/bfc_allocator.cc:217] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.58GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
2017-12-28 11:42:09.999662: W tensorflow/core/common_runtime/bfc_allocator.cc:217] Allocator (GPU_0_bfc) ran out of memory trying to allocate 4.00MiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
2017-12-28 11:42:09.999689: W tensorflow/core/common_runtime/bfc_allocator.cc:217] Allocator (GPU_0_bfc) ran out of memory trying to allocate 415.06MiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
2017-12-28 11:42:10.017460: W tensorflow/core/common_runtime/bfc_allocator.cc:217] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.21GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
2017-12-28 11:42:10.017501: W tensorflow/core/common_runtime/bfc_allocator.cc:217] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.67GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
Traceback (most recent call last):
File "/home/sheldon/.local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1323, in _do_call
return fn(*args)
File "/home/sheldon/.local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1302, in _run_fn
status, run_metadata)
File "/home/sheldon/.local/lib/python3.5/site-packages/tensorflow/python/framework/errors_impl.py", line 473, in __exit__
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[5760,4,4,2048]
[[Node: fc1_voc12_c3/convolution/SpaceToBatchND = SpaceToBatchND[T=DT_FLOAT, Tblock_shape=DT_INT32, Tpaddings=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:0"](res5c_relu, fc1_voc12_c3/convolution/SpaceToBatchND/block_shape, fc1_voc12_c3/convolution/SpaceToBatchND/paddings)]]
[[Node: add/_1131 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_5776_add", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "main.py", line 82, in <module>
tf.app.run()
File "/home/sheldon/.local/lib/python3.5/site-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "main.py", line 76, in main
getattr(model, args.option)()
File "/home/sheldon/Projects/Deeplab-v2--ResNet-101--Tensorflow/model.py", line 60, in train
feed_dict=feed_dict)
File "/home/sheldon/.local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 889, in run
run_metadata_ptr)
File "/home/sheldon/.local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1120, in _run
feed_dict_tensor, options, run_metadata)
File "/home/sheldon/.local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1317, in _do_run
options, run_metadata)
File "/home/sheldon/.local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1336, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[5760,4,4,2048]
[[Node: fc1_voc12_c3/convolution/SpaceToBatchND = SpaceToBatchND[T=DT_FLOAT, Tblock_shape=DT_INT32, Tpaddings=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:0"](res5c_relu, fc1_voc12_c3/convolution/SpaceToBatchND/block_shape, fc1_voc12_c3/convolution/SpaceToBatchND/paddings)]]
[[Node: add/_1131 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_5776_add", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
Caused by op 'fc1_voc12_c3/convolution/SpaceToBatchND', defined at:
File "main.py", line 82, in <module>
tf.app.run()
File "/home/sheldon/.local/lib/python3.5/site-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "main.py", line 76, in main
getattr(model, args.option)()
File "/home/sheldon/Projects/Deeplab-v2--ResNet-101--Tensorflow/model.py", line 36, in train
self.train_setup()
File "/home/sheldon/Projects/Deeplab-v2--ResNet-101--Tensorflow/model.py", line 177, in train_setup
net = Deeplab_v2(self.image_batch, self.conf.num_classes, True)
File "/home/sheldon/Projects/Deeplab-v2--ResNet-101--Tensorflow/network.py", line 34, in __init__
self.build_network()
File "/home/sheldon/Projects/Deeplab-v2--ResNet-101--Tensorflow/network.py", line 38, in build_network
self.outputs = self.build_decoder(self.encoding)
File "/home/sheldon/Projects/Deeplab-v2--ResNet-101--Tensorflow/network.py", line 64, in build_decoder
outputs = self._ASPP(encoding, self.num_classes, [6, 12, 18, 24])
File "/home/sheldon/Projects/Deeplab-v2--ResNet-101--Tensorflow/network.py", line 125, in _ASPP
o.append(self._dilated_conv2d(x, 3, num_o, d, name='fc1_voc12_c%d' % i, biased=True))
File "/home/sheldon/Projects/Deeplab-v2--ResNet-101--Tensorflow/network.py", line 150, in _dilated_conv2d
o = tf.nn.atrous_conv2d(x, w, dilation_factor, padding='SAME')
File "/home/sheldon/.local/lib/python3.5/site-packages/tensorflow/python/ops/nn_ops.py", line 1137, in atrous_conv2d
name=name)
File "/home/sheldon/.local/lib/python3.5/site-packages/tensorflow/python/ops/nn_ops.py", line 751, in convolution
return op(input, filter)
File "/home/sheldon/.local/lib/python3.5/site-packages/tensorflow/python/ops/nn_ops.py", line 835, in __call__
return self.conv_op(inp, filter)
File "/home/sheldon/.local/lib/python3.5/site-packages/tensorflow/python/ops/nn_ops.py", line 499, in __call__
return self.call(inp, filter)
File "/home/sheldon/.local/lib/python3.5/site-packages/tensorflow/python/ops/nn_ops.py", line 490, in _with_space_to_batch_call
paddings=paddings)
File "/home/sheldon/.local/lib/python3.5/site-packages/tensorflow/python/ops/gen_array_ops.py", line 4922, in space_to_batch_nd
paddings=paddings, name=name)
File "/home/sheldon/.local/lib/python3.5/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/home/sheldon/.local/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 2956, in create_op
op_def=op_def)
File "/home/sheldon/.local/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 1470, in __init__
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access
ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[5760,4,4,2048]
[[Node: fc1_voc12_c3/convolution/SpaceToBatchND = SpaceToBatchND[T=DT_FLOAT, Tblock_shape=DT_INT32, Tpaddings=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:0"](res5c_relu, fc1_voc12_c3/convolution/SpaceToBatchND/block_shape, fc1_voc12_c3/convolution/SpaceToBatchND/paddings)]]
[[Node: add/_1131 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_5776_add", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
It seems that the GPU ran out of memory. What should I do to fix this problem?
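In case it helps, here is a minimal sketch of one thing I was considering trying, based on my reading of the TensorFlow docs. I have not verified that it fixes this, and I don't yet know exactly where in model.py the project creates its session, so treat it only as a guess:

import tensorflow as tf

# Let the BFC allocator grow GPU memory on demand instead of reserving it all up front.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
# Optionally cap how much of the GPU TensorFlow may use:
# config.gpu_options.per_process_gpu_memory_fraction = 0.9

sess = tf.Session(config=config)

Alternatively, lowering the batch size (currently 10, judging by the shapes printed above) should shrink the [5760,4,4,2048] tensor that triggers the OOM, but I'm not sure which of these, if any, is the right fix.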