@cvalenzuela Victory \O/ !
Training complete. For evaluation:
`python evaluate.py --checkpoint checkpoints/ ...`
Converting model to ml5js
Writing manifest to models/manifest.json
Done! Checkpoint saved. Visit https://ml5js.org/docs/StyleTransfer for more information
Now I'll test the model and start on documenting the docker process.
Seems to be running out of memory. Are you running any other scripts? And which version of CUDA are you using?
OK, everything should be on the latest version since I am starting from a fresh install. (Sorry, I can't check right now; I'll get back to you with the exact versions of everything when I have access to the machine.)
So I guess CUDA 9.2 on Windows.
The Ubuntu drivers for a GTX 1080 are 396.xx, and the rest should be:
cudatoolkit: 9.0-h13b8566_0
cudnn: 7.1.2-cuda9.0_0
cupti: 9.0.176-0
Nothing else is running.
Are there specific versions of cudatoolkit, cuDNN, etc. that I should be aware of?
BTW, I noticed that a requirement is missing: moviepy. It is not available via a plain `conda install moviepy`; users should run `conda install -c conda-forge moviepy` instead.
Installing everything is quite tricky. Maybe we should focus on the Docker documentation instead, especially if we can run on the GPU using Docker. What do you think?
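Whichever route is taken, a quick way to see which packages an environment is still missing is to probe for them from Python. This is a minimal sketch: only moviepy comes from this thread, and the other names in the hypothetical `REQUIRED_MODULES` list are assumptions about what the training scripts import.

```python
import importlib.util

# Hypothetical list: moviepy is the dependency named above; the others are
# assumptions about what the training scripts import.
REQUIRED_MODULES = ["moviepy", "tensorflow", "numpy", "scipy"]


def missing_modules(names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
        if "moviepy" in missing:
            print("Hint: conda install -c conda-forge moviepy")
    else:
        print("All required modules found.")
```

`find_spec` only locates a module without importing it, so the check stays fast even when TensorFlow is installed.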
Your specs look right. All the dependencies, including moviepy, are installed here: https://github.com/ml5js/training-styletransfer/blob/master/Dockerfile#L20
Can you try using the Docker image instead? You can either build it from the Dockerfile or pull this image from Docker Hub: cvalenzuelab/styletransfer
Thanks @cvalenzuela, I'll look into it.
lmk how it goes!
I may need some help to actually run the Docker container; if I can get through this, I may be able to write some documentation about it. I've been looking at the Docker documentation, and here's what I think I should do.
First, I need to pull the Docker image:
`docker pull cvalenzuelab/styletransfer:latest`
or build it locally:
`docker build -t cvalenzuelab/styletransfer -f Dockerfile .`
Then I need to run it, and if I am not mistaken it should go like this:
`docker run -it -p 8888:8888 -p 6006:6006 -v /Users/:/root/sharedfolder cvalenzuelab/styletransfer bash`
After that, I should navigate to the folder where I've cloned this repo and run the scripts (from step 3 of the README). Am I right?
Yes! That's right. It would probably be a good idea to add those instructions as well.
OK, so for future reference: running Docker on Linux with access to the home folder is a bit trickier than the command above (because of users and permissions). It looks like this instead:
`sudo docker run -e USER=$USER -e USERID=$UID -v $PWD:$PWD -w=$PWD -it -p 8888:8888 -p 6006:6006 -v ~:/root/sharedfolder cvalenzuelab/styletransfer bash`
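Because the Linux command is long and easy to mistype, here is a minimal sketch that assembles the same arguments from Python so they can be printed or handed to `subprocess`. The function `build_run_command` and its parameters are hypothetical; the flags are copied from the command above, and `use_nvidia=True` swaps in the `nvidia-docker` wrapper that comes up later in this thread.

```python
import os
import shlex


def build_run_command(image="cvalenzuelab/styletransfer", use_nvidia=False,
                      user=None, uid=None, cwd=None, home=None):
    """Assemble the Linux `docker run` command from this thread as an argv list.

    Defaults mirror the shell variables $USER, $UID, $PWD and ~.
    """
    user = user if user is not None else os.environ.get("USER", "")
    uid = uid if uid is not None else os.getuid()  # Unix only, like $UID
    cwd = cwd if cwd is not None else os.getcwd()
    home = home if home is not None else os.path.expanduser("~")
    runtime = "nvidia-docker" if use_nvidia else "docker"
    return [
        "sudo", runtime, "run",
        "-e", f"USER={user}",
        "-e", f"USERID={uid}",
        "-v", f"{cwd}:{cwd}",        # mount the current directory at the same path
        f"-w={cwd}",                 # start the shell inside that directory
        "-it",
        "-p", "8888:8888",           # 8888 is Jupyter's default port
        "-p", "6006:6006",           # 6006 is TensorBoard's default port
        "-v", f"{home}:/root/sharedfolder",
        image, "bash",
    ]


if __name__ == "__main__":
    print(shlex.join(build_run_command()))
```

Printing with `shlex.join` gives a shell-ready string to copy; `subprocess.run(build_run_command())` would launch it directly, assuming it is called from an interactive terminal since `-it` and `sudo` both expect one.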
to be continued ...
@cvalenzuela
On Ubuntu, inside the container (started with the command above), after running the setup.py script, I get this error when I try to run run.sh or to call style.py directly with python:
root@287cf95c435c:/home/lasseter/git/training-styletransfer# bash run.sh
Traceback (most recent call last):
File "style.py", line 6, in <module>
from optimize import optimize
File "src/optimize.py", line 3, in <module>
import vgg, pdb, time
File "src/vgg.py", line 3, in <module>
import tensorflow as tf
File "/usr/local/lib/python2.7/dist-packages/tensorflow/__init__.py", line 24, in <module>
from tensorflow.python import *
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/__init__.py", line 49, in <module>
from tensorflow.python import pywrap_tensorflow
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/pywrap_tensorflow.py", line 72, in <module>
raise ImportError(msg)
ImportError: Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/pywrap_tensorflow.py", line 58, in <module>
from tensorflow.python.pywrap_tensorflow_internal import *
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 28, in <module>
_pywrap_tensorflow_internal = swig_import_helper()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 24, in swig_import_helper
_mod = imp.load_module('_pywrap_tensorflow_internal', fp, pathname, description)
ImportError: libcuda.so.1: cannot open shared object file: No such file or directory
Failed to load the native TensorFlow runtime.
There seems to be a problem with your CUDA runtime.
Are you using nvidia-docker?
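For reference, `libcuda.so.1` ships with the NVIDIA driver on the host rather than with the image, so a container started with plain `docker run` usually cannot see it; nvidia-docker exists to expose the driver libraries to the container. A minimal sketch for checking from inside the container whether the dynamic linker can find the library, without importing TensorFlow (the function name `can_load_libcuda` is hypothetical):

```python
import ctypes
import ctypes.util


def can_load_libcuda(name="libcuda.so.1"):
    """Return True if the dynamic linker can load the given CUDA driver library."""
    try:
        ctypes.CDLL(name)
    except OSError:  # raised when the shared object cannot be found or loaded
        return False
    return True


if __name__ == "__main__":
    print("libcuda.so.1 loadable:", can_load_libcuda())
    # find_library searches the usual linker paths and returns None on no match.
    print("find_library('cuda'):", ctypes.util.find_library("cuda"))
```

If this prints `False` inside the container but `True` on the host, the container runtime, not TensorFlow, is the thing to fix.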
No, I'm not!
So @cvalenzuela, I installed nvidia-docker following these instructions: https://github.com/NVIDIA/nvidia-docker
The `nvidia-docker --version` command returns:
Docker version 18.06.1-ce, build e68fc7a
I ran it with:
`sudo nvidia-docker run -e USER=$USER -e USERID=$UID -v $PWD:$PWD -w=$PWD -it -p 8888:8888 -p 6006:6006 -v ~/home/lasseter/:/home cvalenzuelab/styletransfer bash`
This ends up with a ResourceExhaustedError again; here is the full log:
ml5.js Style Transfer Training!
Note: This traning will take a couple of hours.
Training is starting!...
Train set has been trimmed slightly..
(1, 1400, 1200, 3)
UID: 32
Traceback (most recent call last):
File "style.py", line 179, in
main()
File "style.py", line 156, in main
for preds, losses, i, epoch in optimize(*args, **kwargs):
File "src/optimize.py", line 114, in optimize
train_step.run(feed_dict=feed_dict)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2042, in run
_run_using_default_session(self, feed_dict, self.graph, session)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 4490, in _run_using_default_session
session.run(operation, feed_dict)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 889, in run
run_metadata_ptr)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1120, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1317, in _do_run
options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1336, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[20,4096,256]
[[Node: gradients/transpose_2_grad/transpose = Transpose[T=DT_FLOAT, Tperm=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:0"](gradients/MatMul_2_grad/tuple/control_dependency, gradients/transpose_grad/InvertPermutation)]]
Caused by op u'gradients/transpose_2_grad/transpose', defined at:
File "style.py", line 179, in <module>
main()
File "style.py", line 156, in main
for preds, losses, i, epoch in optimize(*args, **kwargs):
File "src/optimize.py", line 91, in optimize
train_step = tf.train.AdamOptimizer(learning_rate).minimize(loss)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/optimizer.py", line 343, in minimize
grad_loss=grad_loss)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/optimizer.py", line 414, in compute_gradients
colocate_gradients_with_ops=colocate_gradients_with_ops)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gradients_impl.py", line 581, in gradients
grad_scope, op, func_call, lambda: grad_fn(op, *out_grads))
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gradients_impl.py", line 353, in _MaybeCompile
return grad_fn() # Exit early
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gradients_impl.py", line 581, in
grad_scope, op, func_call, lambda: grad_fn(op, *out_grads))
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/array_grad.py", line 505, in _TransposeGrad
return [array_ops.transpose(grad, array_ops.invert_permutation(p)), None]
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/array_ops.py", line 1336, in transpose
ret = gen_array_ops.transpose(a, perm, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_array_ops.py", line 5694, in transpose
"Transpose", x=x, perm=perm, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2956, in create_op
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1470, in init
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access...which was originally created as op u'transpose_2', defined at:
File "style.py", line 179, in
main()
[elided 0 identical lines from previous traceback]
File "style.py", line 156, in main
for preds, losses, i, epoch in optimize(*args, **kwargs):
File "src/optimize.py", line 74, in optimize
feats_T = tf.transpose(feats, perm=[0,2,1])
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/array_ops.py", line 1336, in transpose
ret = gen_array_ops.transpose(a, perm, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_array_ops.py", line 5694, in transpose
"Transpose", x=x, perm=perm, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2956, in create_op
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1470, in init
self._traceback = self._graph._extract_stack() # pylint: disable=protected-accessResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[20,4096,256]
[[Node: gradients/transpose_2_grad/transpose = Transpose[T=DT_FLOAT, Tperm=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:0"](gradients/MatMul_2_grad/tuple/control_dependency, gradients/transpose_grad/InvertPermutation)]]
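For a sense of scale, the failing allocation can be sized by hand. The shape [20, 4096, 256] matches the `feats` tensor transposed at src/optimize.py line 74 (`tf.transpose(feats, perm=[0,2,1])`), so its leading dimension is presumably the batch size, and `T=DT_FLOAT` means 4 bytes per element. A minimal sketch of the arithmetic (the helper `tensor_bytes` is hypothetical):

```python
# Back-of-the-envelope size of the tensor named in the OOM message.
MIB = 1024 ** 2
BYTES_PER_FLOAT32 = 4  # T=DT_FLOAT in the log


def tensor_bytes(shape, bytes_per_element=BYTES_PER_FLOAT32):
    """Number of bytes needed to hold a dense tensor of the given shape."""
    count = 1
    for dim in shape:
        count *= dim
    return count * bytes_per_element


if __name__ == "__main__":
    for batch in (20, 16):
        size = tensor_bytes([batch, 4096, 256])
        print(f"batch {batch}: {size} bytes = {size / MIB:.0f} MiB")
    # batch 20: 83886080 bytes = 80 MiB
    # batch 16: 67108864 bytes = 64 MiB
```

A single 80 MiB tensor does not exhaust a GTX 1080 on its own; the OOM comes from the sum of all activations and gradients held for one training step, and that activation memory grows roughly in proportion to the batch size.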
FYI, I tried reducing the batch size to 16, and it seems to be working for now... I'll keep you posted in a few hours.
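Finding a batch size that fits is trial and error, which can be automated. The sketch below is a hypothetical helper, not part of training-styletransfer: `train_fn` stands in for whatever builds the graph and trains at a given batch size, and `oom_errors` defaults to Python's `MemoryError` only so the sketch runs without TensorFlow. Each retry shrinks the batch by about 20%, so a run starting at 20 (the leading dimension in the OOM message) would retry at 16.

```python
def train_with_fallback(train_fn, batch_size, min_batch_size=1,
                        oom_errors=(MemoryError,)):
    """Call train_fn(batch_size), shrinking the batch after each out-of-memory error.

    Returns (batch_size_that_worked, train_fn_result).
    """
    min_batch_size = max(1, min_batch_size)  # a batch of 0 would loop forever
    while batch_size >= min_batch_size:
        try:
            return batch_size, train_fn(batch_size)
        except oom_errors:
            print(f"Out of memory at batch size {batch_size}, retrying smaller")
            batch_size = batch_size * 4 // 5  # 20 -> 16 -> 12 -> 9 -> ...
    raise RuntimeError("Out of memory even at the minimum batch size")
```

With TensorFlow 1.x one would pass `oom_errors=(tf.errors.ResourceExhaustedError,)` and rebuild the graph and session inside `train_fn`, since the batch size is usually baked into the placeholder shapes.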
Great! Glad to hear.