Comments (14)
I encountered the same error. I fixed it by setting the environment variable ML_DATA_PATH to the path where I keep my data. You also need to set ML_LOG_PATH to some other location. Besides the environment variables, I needed to install two packages I didn't have on my system:
conda install pil
pip install sacred
from iaf.
@rfarouni thanks a lot for the suggestions. For future reference for anyone else, I ended up with the following 3 environment variables:
ML_DATA_PATH=/path/to/cifar10
CIFAR10_PATH=/path/to/cifar10
ML_LOG_PATH=/path/to/logs
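For anyone who wants to check these before launching train.py, here is a minimal sketch. The helper name missing_env_vars is my own and not part of the iaf repository; the variable names are the ones mentioned in this thread.

```python
import os

# Variable names taken from this thread; the helper itself is hypothetical.
REQUIRED_VARS = ("ML_DATA_PATH", "CIFAR10_PATH", "ML_LOG_PATH")


def missing_env_vars(names=REQUIRED_VARS, environ=None):
    """Return the names that are unset or empty in the given environment."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("Please set: " + ", ".join(missing))
    else:
        print("All expected variables are set.")
```

Passing a plain dict as environ makes the check easy to try without touching the real environment.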
Then I modified graphy/nodes/conv.py by adding
if 'gpu' in theano.config.device:  # @UndefinedVariable
    from theano.sandbox.cuda.dnn import dnn_conv
    from theano.sandbox.cuda.dnn import dnn_pool
elif 'cuda' in theano.config.device:  # @UndefinedVariable
    from theano.sandbox.gpuarray.dnn import dnn_conv
    from theano.sandbox.gpuarray.dnn import dnn_pool
elif 'cpu' in theano.config.device:
    from theano.tensor.nnet import conv2d as dnn_conv
    from theano.tensor.signal.pool import pool_2d as dnn_pool
else:
    raise Exception()
since I don't have a GPU on my machine. But now I get the following error about the posterior down_iaf2_NL:
[graphy] floatX = float32
INFO - Deep VAE - Running command 'train'
WARNING - Deep VAE - No observers have been added to this run
INFO - Deep VAE - Started
Logpath: /Users/user/iaf/logs//1475861678.05/
CVAE1 with {'depths': [2, 2, 2], 'nl': u'elu', 'n_h2': 64, 'n_z': 32, 'shape_x': [3, 32, 32], 'optim': u'adamax', 'weightsharing': False, 'px': u'logistic', 'kernel_x': [5, 5], 'n_h1': 64, 'prior': u'diag', 'posterior': u'down_iaf2_NL', 'pad_x': 0, 'beta2': 0.001, 'beta1': 0.1, 'depth_ar': 1, 'alpha': 0.002, 'kl_min': 0.25, 'downsample_type': u'nn', 'kernel_h': [3, 3]}
ERROR - Deep VAE - Failed after 0:00:01!
Traceback (most recent calls WITHOUT Sacred internals):
File "train.py", line 185, in train
model = construct_model(data_init)
File "train.py", line 128, in construct_model
model = models.cvae1(**margs)
File "/Users/user/iaf/models.py", line 396, in cvae1
layers[i].append(cvae_layer(name, prior, posterior, n_h1, n_h2, n_z, depth_ar, downsample, nl, kernel_h, False, downsample_type, w))
File "/Users/user/iaf/models.py", line 105, in cvae_layer
raise Exception("Unknown posterior "+posterior)
Exception: Unknown posterior down_iaf2_NL
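As an aside, the device check added to graphy/nodes/conv.py earlier in this comment can be sketched as a small pure function. This is only an illustration: pick_conv_backend is a hypothetical name, it returns import paths as strings rather than importing Theano, and I have not checked that conv2d and pool_2d accept the same arguments as dnn_conv and dnn_pool, so the aliasing may need adapting.

```python
def pick_conv_backend(device):
    """Map a Theano device string to the conv/pool functions to import.

    Mirrors the if/elif chain from the comment; only returns dotted names,
    so it runs without Theano installed.
    """
    if "gpu" in device:
        return {"conv": "theano.sandbox.cuda.dnn.dnn_conv",
                "pool": "theano.sandbox.cuda.dnn.dnn_pool"}
    if "cuda" in device:
        return {"conv": "theano.sandbox.gpuarray.dnn.dnn_conv",
                "pool": "theano.sandbox.gpuarray.dnn.dnn_pool"}
    if "cpu" in device:
        return {"conv": "theano.tensor.nnet.conv2d",
                "pool": "theano.tensor.signal.pool.pool_2d"}
    # Naming the device makes the failure easier to diagnose than a bare Exception().
    raise ValueError("Unsupported device: " + device)
```

Raising an error that names the device is a small change from the bare Exception() in the snippet, chosen here only to make failures easier to read.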
from iaf.
@kirk86 You need to change down_iaf2_NL to down_iaf2_nl here:
python train.py with problem=cifar10 n_z=32 n_h=64 depths=[2,2,2] margs.depth_ar=1 margs.posterior=down_iaf2_NL margs.kl_min=0.25
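The failure looks like a case-sensitive name check. A minimal sketch of a more forgiving lookup is below; resolve_posterior and known_names are hypothetical and not part of models.py, and the only name used in the example is the one confirmed in this thread.

```python
def resolve_posterior(name, known_names):
    """Return the entry of known_names that matches name, ignoring case."""
    by_lower = {known.lower(): known for known in known_names}
    try:
        return by_lower[name.lower()]
    except KeyError:
        raise ValueError("Unknown posterior " + name)


# Hypothetical usage: the caller supplies the list of valid names.
print(resolve_posterior("down_iaf2_NL", ["down_iaf2_nl"]))
```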
from iaf.
@rfarouni thanks. Now I'm getting the following error:
python train.py with problem=cifar10 n_z=32 n_h=64 depths=[2,2,2] margs.depth_ar=1 margs.posterior=down_iaf2_nl margs.kl_min=0.25
[graphy] floatX = float32
INFO - Deep VAE - Running command 'train'
WARNING - Deep VAE - No observers have been added to this run
INFO - Deep VAE - Started
Logpath: /Users/user/iaf/logs//1475865352.93/
CVAE1 with {'depths': [2, 2, 2], 'nl': u'elu', 'n_h2': 64, 'n_z': 32, 'shape_x': [3, 32, 32], 'optim': u'adamax', 'weightsharing': False, 'px': u'logistic', 'kernel_x': [5, 5], 'n_h1': 64, 'prior': u'diag', 'posterior': u'down_iaf2_nl', 'pad_x': 0, 'beta2': 0.001, 'beta1': 0.1, 'depth_ar': 1, 'alpha': 0.002, 'kl_min': 0.25, 'downsample_type': u'nn', 'kernel_h': [3, 3]}
ERROR - Deep VAE - Failed after 0:00:01!
Traceback (most recent calls WITHOUT Sacred internals):
File "train.py", line 185, in train
model = construct_model(data_init)
File "train.py", line 128, in construct_model
model = models.cvae1(**margs)
File "/Users/users/iaf/models.py", line 520, in cvae1
f_encode_decode(w)
File "/Users/users/iaf/models.py", line 416, in f_encode_decode
h = x_enc(_x - .5, w)
File "/Users/users/iaf/graphy/nodes/conv.py", line 196, in f
input_shape = h.tag.test_value.shape[1:]
AttributeError: scratchpad instance has no attribute 'test_value'
from iaf.
@kirk86 In .theanorc, add the line compute_test_value=raise. My file looks like this:
[global]
floatX = float32
device = gpu
compute_test_value=raise
Note: I am using a GPU
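If it helps, here is a rough sketch for checking whether a .theanorc enables test values. It assumes Python 3 (configparser), and it assumes that any compute_test_value setting other than off turns test values on; has_test_values_enabled is a hypothetical helper, not a Theano function.

```python
import configparser

# Settings that, to my understanding, turn test-value computation on.
ENABLED = {"ignore", "warn", "raise", "pdb"}


def has_test_values_enabled(theanorc_text):
    """Return True if the [global] section enables compute_test_value."""
    parser = configparser.ConfigParser()
    parser.read_string(theanorc_text)
    # A missing section or option falls back to "off".
    value = parser.get("global", "compute_test_value", fallback="off")
    return value.strip().lower() in ENABLED


sample = "[global]\nfloatX = float32\ndevice = gpu\ncompute_test_value=raise\n"
print(has_test_values_enabled(sample))
```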
from iaf.
@rfarouni thanks for your patience! I was just about to close this when I saw the printed messages, but then I ran into this NaN error:
ar.conv2d 0_0_posterior_conv1_out_1 (64, 16, 16) (32, 16, 16) [3, 3] True False True valid False True
conv2d 0_0_down_conv2_1 (96, 16, 16) (64, 16, 16) [3, 3] True valid (1, 1) 1
conv2d x_dec (64, 16, 16) (3, 32, 32) [5, 5] True valid (1, 1) 2
AdaMax_Avg alpha: -0.002 beta1: 0.1 beta2: 0.001
Compiling... 212.75 s
[array([ nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan], dtype=float32)]
ERROR - Deep VAE - Failed after 0:11:00!
Traceback (most recent calls WITHOUT Sacred internals):
File "train.py", line 256, in train
result = model.train(data_train, n_batch=n_batch)
File "/Users/user/iaf/models.py", line 538, in newf
return f.cache(*args, **kws)
File "/Users/user/iaf/graphy/function.py", line 110, in func
raise Exception("NaN detected")
Exception: NaN detected
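To narrow down where the NaNs first appear, a check along these lines might help. It is only a sketch in the spirit of the "NaN detected" guard from the traceback; first_nan_index and check_finite are hypothetical names, and the input is assumed to be a flat sequence of floats.

```python
import math


def first_nan_index(values):
    """Return the index of the first NaN in a flat sequence, or None."""
    for index, value in enumerate(values):
        if math.isnan(value):
            return index
    return None


def check_finite(values, label="output"):
    """Raise an error that says which entry of which output is NaN."""
    index = first_nan_index(values)
    if index is not None:
        raise ValueError("NaN detected in %s at index %d" % (label, index))
```

Calling check_finite on intermediate outputs, each with its own label, would show which quantity goes bad first rather than only that something did.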
from iaf.
@kirk86 Although I didn't encounter this error, I got a memory error after a minute or so of run time. I only have a 4 GB GPU, and it seems I need more memory to run the code on this dataset with the parameters that were provided.
from iaf.
@rfarouni, ah, that gives me a hint to try it on a machine with more memory as well, even though I'm not convinced that this NaN error comes from memory issues in my case. It looks to me more like a computation error in the actual code implementation than a memory problem. I'll test it on a bigger machine just in case and report back.
On another note, I like your favorite quotes section 👍. I might steal the idea; I've collected lots of such quotes in my notebook but never posted them...
from iaf.
An update on this issue: I ran the script on a machine with 8 cores and 32 GB of memory, and for two days now it's been like watching a black hole, speaking in memory terms. The only thing running on that machine is this script, and so far it has swallowed 27 GB of memory. I'm not closing this issue until the script finishes. But to be fair, this is kind of insane in terms of memory consumption.
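For anyone who wants to track this kind of growth step by step, here is a rough sketch using tracemalloc. It assumes Python 3.4 or later, and it only sees allocations made through Python's allocators, so memory held by compiled code may not show up. run_with_memory_report and the step callable are hypothetical and stand in for one training iteration.

```python
import tracemalloc


def run_with_memory_report(step, n_steps, report_every=1):
    """Call step() n_steps times; return (step number, current MiB, peak MiB) samples."""
    tracemalloc.start()
    samples = []
    try:
        for i in range(n_steps):
            step()
            if (i + 1) % report_every == 0:
                current, peak = tracemalloc.get_traced_memory()
                samples.append((i + 1, current / 2**20, peak / 2**20))
    finally:
        tracemalloc.stop()
    return samples


if __name__ == "__main__":
    leak = []  # Deliberately grows so the report has something to show.
    for n, current, peak in run_with_memory_report(
            lambda: leak.append(bytearray(1024 * 1024)), 5):
        print("step %d: current %.1f MiB, peak %.1f MiB" % (n, current, peak))
```

A steadily rising "current" column across steps would point to something being retained between iterations.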
from iaf.
Hi Kirk86,
Thanks for bringing this up; it definitely doesn't need that much memory,
probably a bug. I'll look into it.
from iaf.
@dpkingma @kirk86 I also ran into memory problems on the GPU the first time I ran it. The second time, for some unexplained reason, it worked fine, although very slowly. I also tried to run the TensorFlow implementation on one GPU, but I encountered this error:
python tf_train.py --logdir $ML_LOG_PATH --hpconfig depth=1,num_blocks=20,kl_min=0.1,learning_rate=0.002,batch_size=32 --num_gpus 1 --mode train
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcublas.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcudnn.so.5.1.5 locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcufft.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcurand.so.8.0 locally
Num trainable variables: 41557927
starting training
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:925] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_device.cc:951] Found device 0 with properties:
name: GeForce GTX 960
major: 5 minor: 2 memoryClockRate (GHz) 1.304
pciBusID 0000:01:00.0
Total memory: 3.95GiB
Free memory: 3.41GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:972] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] 0: Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 960, pci bus id: 0000:01:00.0)
Starting queue runners
Initializing parameters.
Initialized!
Traceback (most recent call last):
File "tf_train.py", line 392, in <module>
tf.app.run()
File "/home/rick/anaconda3/envs/py27/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 30, in run
sys.exit(main(sys.argv[:1] + flags_passthrough))
File "tf_train.py", line 386, in main
run(hps)
File "tf_train.py", line 265, in run
with sv.managed_session(config=config) as sess:
File "/home/rick/anaconda3/envs/py27/lib/python2.7/contextlib.py", line 17, in __enter__
return self.gen.next()
File "/home/rick/anaconda3/envs/py27/lib/python2.7/site-packages/tensorflow/python/training/supervisor.py", line 974, in managed_session
self.stop(close_summary_writer=close_summary_writer)
File "/home/rick/anaconda3/envs/py27/lib/python2.7/site-packages/tensorflow/python/training/supervisor.py", line 802, in stop
stop_grace_period_secs=self._stop_grace_secs)
File "/home/rick/anaconda3/envs/py27/lib/python2.7/site-packages/tensorflow/python/training/coordinator.py", line 386, in join
six.reraise(*self._exc_info_to_raise)
File "/home/rick/anaconda3/envs/py27/lib/python2.7/site-packages/tensorflow/python/training/supervisor.py", line 963, in managed_session
start_standard_services=start_standard_services)
File "/home/rick/Documents/repos/iaf/tf_utils/common.py", line 222, in prepare_or_wait_for_session
not_ready = self._session_manager._model_not_ready(sess)
AttributeError: 'SessionManager' object has no attribute '_model_not_ready'
from iaf.
@rfarouni I know that this might not be the solution, but did you install tqdm? Regarding the slowness, it's something that I've also experienced. In my case it takes almost one day for each epoch.
from iaf.
@kirk86 sure! conda install tqdm
from iaf.
tqdm is not in conda but is in pip: pip install tqdm
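A quick way to see which of the packages mentioned in this thread are still missing is sketched below. It assumes Python 3.4 or later; missing_modules is a hypothetical helper, and PIL, sacred and tqdm are the import names I would expect for those packages.

```python
import importlib.util


def missing_modules(module_names):
    """Return the top-level module names that the import system cannot find."""
    return [name for name in module_names
            if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Import names for the packages mentioned in this thread.
    print(missing_modules(["PIL", "sacred", "tqdm"]))
```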
from iaf.