training with tensorflow-gpu about training-styletransfer HOT 14 CLOSED

ml5js commented on June 23, 2024

training with tensorflow-gpu

from training-styletransfer.

Comments (14)

b2renger commented on June 23, 2024 1

@cvalenzuela Victory \O/ !


Training complete. For evaluation:
    `python evaluate.py --checkpoint checkpoints/ ...`
Converting model to ml5js
Writing manifest to models/manifest.json
Done! Checkpoint saved. Visit https://ml5js.org/docs/StyleTransfer for more information

Now I'll test the model and start on documenting the docker process.

cvalenzuela commented on June 23, 2024

It seems to be running out of memory. Are you running any other scripts? And which version of CUDA are you using?

b2renger commented on June 23, 2024

OK, everything should be the latest version, as I am starting from a fresh install. (Sorry, I can't check right now; I'll get back to you with the exact versions once I have access to the machine.)

So I guess CUDA 9.2 on Windows.

On Ubuntu, the drivers for the GTX 1080 are 396.xx, and the rest should be:
cudatoolkit: 9.0-h13b8566_0
cudnn: 7.1.2-cuda9.0_0
cupti: 9.0.176-0

Nothing else is running.

Are there specific versions of cudatoolkit, cudnn, etc. that I should be aware of?

BTW, I noticed that one requirement is missing: moviepy. It is not available via a plain 'conda install moviepy'; users should run 'conda install -c conda-forge moviepy' instead.
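
A missing requirement such as moviepy can be spotted before training starts by probing for each module from Python. This is only a minimal sketch and not part of the repository: the `missing_modules` helper is hypothetical, the module list is an assumption based on the packages named in this thread, and it assumes Python 3.4 or later.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names the import system cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Assumed list for illustration only; the Dockerfile is the real reference.
    wanted = ["tensorflow", "moviepy"]
    missing = missing_modules(wanted)
    if missing:
        print("Missing modules: " + ", ".join(missing))
    else:
        print("All listed modules are importable.")
```

The check only confirms that a module can be located, not that it imports cleanly, so a broken native dependency would still surface later.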

Installing everything is quite tricky. Maybe we should focus on the Docker documentation, especially if we can run on the GPU using Docker? What do you think?

cvalenzuela commented on June 23, 2024

Your specs look right. All the dependencies are installed here: https://github.com/ml5js/training-styletransfer/blob/master/Dockerfile#L20, including moviepy.

Can you try using the Docker image instead? Either build the Dockerfile or pull this image from Docker Hub: cvalenzuelab/styletransfer

b2renger commented on June 23, 2024

Thanks @cvalenzuela, I'll look into it.

cvalenzuela commented on June 23, 2024

lmk how it goes!

b2renger commented on June 23, 2024

@cvalenzuela

I may need some help to actually run the Docker container; if I can get through this, I may be able to write some docs about it. I've been looking at the Docker documentation, and here's what I think I should do:

First, I need to pull the Docker image:
docker pull cvalenzuelab/styletransfer:latest
or I can build it locally:
docker build -t cvalenzuelab/styletransfer -f Dockerfile .

Then I need to run it, and if I am not mistaken it should go like this:
docker run -it -p 8888:8888 -p 6006:6006 -v /Users/:/root/sharedfolder cvalenzuelab/styletransfer bash

After that, I should navigate to the folder where I've cloned this repo and run the scripts (from step 3 of the README). Am I right?

cvalenzuela commented on June 23, 2024

Yes, that's right! It would probably be a good idea to add those instructions as well.

b2renger commented on June 23, 2024

OK, so for future reference: running Docker on Linux with access to the home folder is a bit trickier than the CLI command above (because of users and permissions). It looks like this instead:
sudo docker run -e USER=$USER -e USERID=$UID -v $PWD:$PWD -w=$PWD -it -p 8888:8888 -p 6006:6006 -v ~:/root/sharedfolder cvalenzuelab/styletransfer bash
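
For documentation purposes, the flags in that command can be spelled out one by one. The sketch below assembles the same argument list in Python so each flag carries a comment; the `docker_run_argv` helper and its parameter names are hypothetical, and the `sudo` prefix is left out on the assumption that the caller adds it when needed.

```python
import os
import shlex


def docker_run_argv(image, workdir, ports, shared_host_dir, user, uid):
    """Build the argument list for the `docker run` command quoted above."""
    argv = [
        "docker", "run",
        "-e", "USER={}".format(user),      # pass the host user name into the container
        "-e", "USERID={}".format(uid),     # pass the host user id into the container
        "-v", "{0}:{0}".format(workdir),   # mount the working directory at the same path
        "-w", workdir,                     # start the container in that directory
        "-it",                             # interactive terminal
    ]
    for port in ports:
        argv += ["-p", "{0}:{0}".format(port)]  # publish each port on the same number
    # Mount the shared host folder, then name the image and the command to run.
    argv += ["-v", "{}:/root/sharedfolder".format(shared_host_dir), image, "bash"]
    return argv


if __name__ == "__main__":
    argv = docker_run_argv(
        image="cvalenzuelab/styletransfer",
        workdir=os.getcwd(),
        ports=[8888, 6006],
        shared_host_dir=os.path.expanduser("~"),
        user=os.environ.get("USER", "user"),
        uid=os.getuid(),
    )
    # Print a shell-safe version of the command instead of running it.
    print(" ".join(shlex.quote(arg) for arg in argv))
```

The script only prints the command, so it can be reviewed before being pasted into a terminal.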

to be continued ...

b2renger commented on June 23, 2024

@cvalenzuela
On Ubuntu, inside the container (started with the command above), after running the setup.py script, I get this error when I try to run run.sh or the direct python command for style.py:

root@287cf95c435c:/home/lasseter/git/training-styletransfer# bash run.sh
Traceback (most recent call last):
  File "style.py", line 6, in <module>
    from optimize import optimize
  File "src/optimize.py", line 3, in <module>
    import vgg, pdb, time
  File "src/vgg.py", line 3, in <module>
    import tensorflow as tf
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/__init__.py", line 24, in <module>
    from tensorflow.python import *
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/__init__.py", line 49, in <module>
    from tensorflow.python import pywrap_tensorflow
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/pywrap_tensorflow.py", line 72, in <module>
    raise ImportError(msg)
ImportError: Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/pywrap_tensorflow.py", line 58, in <module>
    from tensorflow.python.pywrap_tensorflow_internal import *
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 28, in <module>
    _pywrap_tensorflow_internal = swig_import_helper()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 24, in swig_import_helper
    _mod = imp.load_module('_pywrap_tensorflow_internal', fp, pathname, description)
ImportError: libcuda.so.1: cannot open shared object file: No such file or directory


Failed to load the native TensorFlow runtime.

cvalenzuela commented on June 23, 2024

There seems to be a problem with your CUDA runtime.
Are you using nvidia-docker?
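
Whether the container can see the NVIDIA driver library can be checked directly, without importing TensorFlow. The following is a minimal sketch under the assumption that the failure is only about locating `libcuda.so.1`, the file named in the ImportError above; the `can_load` helper is hypothetical.

```python
import ctypes


def can_load(library_name):
    """Return True if the dynamic loader can open the named shared library."""
    try:
        ctypes.CDLL(library_name)
    except OSError:
        # Raised when the file is missing or cannot be loaded.
        return False
    return True


if __name__ == "__main__":
    name = "libcuda.so.1"  # the library named in the ImportError above
    if can_load(name):
        print(name + " is visible inside this environment.")
    else:
        print(name + " cannot be opened; the GPU driver is not exposed here.")
```

Run inside the container, a negative result suggests the container was started without GPU driver access, which is consistent with the question about nvidia-docker.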

b2renger commented on June 23, 2024

No, I'm not!

So, @cvalenzuela, I installed nvidia-docker following these instructions: https://github.com/NVIDIA/nvidia-docker

The nvidia-docker --version command returns:
Docker version 18.06.1-ce, build e68fc7a

I run it with:
sudo nvidia-docker run -e USER=$USER -e USERID=$UID -v $PWD:$PWD -w=$PWD -it -p 8888:8888 -p 6006:6006 -v ~/home/lasseter/:/home cvalenzuelab/styletransfer bash

This ends up with a ResourceExhaustedError again. Here is the full log:

ml5.js Style Transfer Training!
Note: This traning will take a couple of hours.
Training is starting!...
Train set has been trimmed slightly..
(1, 1400, 1200, 3)
UID: 32
Traceback (most recent call last):
File "style.py", line 179, in <module>
main()
File "style.py", line 156, in main
for preds, losses, i, epoch in optimize(*args, **kwargs):
File "src/optimize.py", line 114, in optimize
train_step.run(feed_dict=feed_dict)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2042, in run
_run_using_default_session(self, feed_dict, self.graph, session)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 4490, in _run_using_default_session
session.run(operation, feed_dict)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 889, in run
run_metadata_ptr)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1120, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1317, in _do_run
options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1336, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[20,4096,256]
[[Node: gradients/transpose_2_grad/transpose = Transpose[T=DT_FLOAT, Tperm=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:0"](gradients/MatMul_2_grad/tuple/control_dependency, gradients/transpose_grad/InvertPermutation)]]

Caused by op u'gradients/transpose_2_grad/transpose', defined at:
File "style.py", line 179, in <module>
main()
File "style.py", line 156, in main
for preds, losses, i, epoch in optimize(*args, **kwargs):
File "src/optimize.py", line 91, in optimize
train_step = tf.train.AdamOptimizer(learning_rate).minimize(loss)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/optimizer.py", line 343, in minimize
grad_loss=grad_loss)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/optimizer.py", line 414, in compute_gradients
colocate_gradients_with_ops=colocate_gradients_with_ops)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gradients_impl.py", line 581, in gradients
grad_scope, op, func_call, lambda: grad_fn(op, *out_grads))
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gradients_impl.py", line 353, in _MaybeCompile
return grad_fn() # Exit early
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gradients_impl.py", line 581, in <lambda>
grad_scope, op, func_call, lambda: grad_fn(op, *out_grads))
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/array_grad.py", line 505, in _TransposeGrad
return [array_ops.transpose(grad, array_ops.invert_permutation(p)), None]
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/array_ops.py", line 1336, in transpose
ret = gen_array_ops.transpose(a, perm, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_array_ops.py", line 5694, in transpose
"Transpose", x=x, perm=perm, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2956, in create_op
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1470, in __init__
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access

...which was originally created as op u'transpose_2', defined at:
File "style.py", line 179, in <module>
main()
[elided 0 identical lines from previous traceback]
File "style.py", line 156, in main
for preds, losses, i, epoch in optimize(*args, **kwargs):
File "src/optimize.py", line 74, in optimize
feats_T = tf.transpose(feats, perm=[0,2,1])
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/array_ops.py", line 1336, in transpose
ret = gen_array_ops.transpose(a, perm, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_array_ops.py", line 5694, in transpose
"Transpose", x=x, perm=perm, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2956, in create_op
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1470, in __init__
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[20,4096,256]
[[Node: gradients/transpose_2_grad/transpose = Transpose[T=DT_FLOAT, Tperm=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:0"](gradients/MatMul_2_grad/tuple/control_dependency, gradients/transpose_grad/InvertPermutation)]]

b2renger commented on June 23, 2024

FYI, I tried reducing the batch size to 16, and it seems to be working for now. I'll keep you posted in a few hours.
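
A rough size estimate shows why a smaller batch helps. The sketch below computes the size of the single tensor named in the OOM message, shape [20, 4096, 256] with DT_FLOAT (4 bytes per element), and the same tensor with a leading dimension of 16. It assumes the leading dimension is the batch size, which the log does not state explicitly; `tensor_bytes` is a hypothetical helper.

```python
def tensor_bytes(shape, bytes_per_element=4):
    """Return the memory footprint of a dense tensor with the given shape."""
    count = 1
    for dim in shape:
        count *= dim
    return count * bytes_per_element


MIB = 1024 * 1024

if __name__ == "__main__":
    for batch_size in (20, 16):
        size = tensor_bytes((batch_size, 4096, 256))
        print("batch {}: {:.0f} MiB".format(batch_size, size / MIB))
    # prints:
    # batch 20: 80 MiB
    # batch 16: 64 MiB
```

This covers only one tensor. Training holds many activations and gradients at once, and every one that scales with the batch dimension shrinks by the same 20 percent, which is plausibly enough to fit on the card.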

cvalenzuela commented on June 23, 2024

great! glad to hear
