
errors upon running the project (iaf)

openai commented on June 12, 2024
errors upon running the project

from iaf.

Comments (14)

rfarouni commented on June 12, 2024

I also encountered the same error. I just set the environment variable ML_DATA_PATH to the path where I keep my data. You would also need to set ML_LOG_PATH to some other location. Besides the environment variables, I needed to install two packages I didn't have on my system:

conda install pil
pip install sacred
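
A minimal sketch of checking these variables before launching train.py. It assumes only the two variable names mentioned above; the helper name missing_env_vars is hypothetical and not part of the iaf code.

```python
import os

# Variable names taken from the comment above; whether the project also
# reads other variables is not confirmed here.
REQUIRED_VARS = ("ML_DATA_PATH", "ML_LOG_PATH")


def missing_env_vars(names=REQUIRED_VARS, environ=None):
    """Return the names that are unset or empty (hypothetical helper)."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("Set these before running train.py: " + ", ".join(missing))
    else:
        print("All required environment variables are set.")
```

Passing a plain dict as environ makes the check easy to try without touching the real environment.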


kirk86 commented on June 12, 2024

@rfarouni thanks a lot for the suggestions. For future reference for anyone else, I ended up with the following 3 environment variables:

ML_DATA_PATH=/path/to/cifar10
CIFAR10_PATH=/path/to/cifar10
ML_LOG_PATH=/path/to/logs

then I modified graphy/nodes/conv.py by adding

if 'gpu' in theano.config.device:  # @UndefinedVariable
    from theano.sandbox.cuda.dnn import dnn_conv
    from theano.sandbox.cuda.dnn import dnn_pool
elif 'cuda' in theano.config.device:  # @UndefinedVariable
    from theano.sandbox.gpuarray.dnn import dnn_conv
    from theano.sandbox.gpuarray.dnn import dnn_pool
elif 'cpu' in theano.config.device:
    from theano.tensor.nnet import conv2d as dnn_conv
    from theano.tensor.signal.pool import pool_2d as dnn_pool
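else:
    # Fail with a message that names the unrecognized device setting
    raise Exception("Unsupported theano.config.device: " + str(theano.config.device))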

since I don't have a GPU on my machine.

But now I get the following error about the posterior down_iaf2_NL:

[graphy] floatX = float32
INFO - Deep VAE - Running command 'train'
WARNING - Deep VAE - No observers have been added to this run
INFO - Deep VAE - Started
Logpath: /Users/user/iaf/logs//1475861678.05/
CVAE1 with  {'depths': [2, 2, 2], 'nl': u'elu', 'n_h2': 64, 'n_z': 32, 'shape_x': [3, 32, 32], 'optim': u'adamax', 'weightsharing': False, 'px': u'logistic', 'kernel_x': [5, 5], 'n_h1': 64, 'prior': u'diag', 'posterior': u'down_iaf2_NL', 'pad_x': 0, 'beta2': 0.001, 'beta1': 0.1, 'depth_ar': 1, 'alpha': 0.002, 'kl_min': 0.25, 'downsample_type': u'nn', 'kernel_h': [3, 3]}
ERROR - Deep VAE - Failed after 0:00:01!
Traceback (most recent calls WITHOUT Sacred internals):
  File "train.py", line 185, in train
    model = construct_model(data_init)
  File "train.py", line 128, in construct_model
    model = models.cvae1(**margs)
  File "/Users/user/iaf/models.py", line 396, in cvae1
    layers[i].append(cvae_layer(name, prior, posterior, n_h1, n_h2, n_z, depth_ar, downsample, nl, kernel_h, False, downsample_type, w))
  File "/Users/user/iaf/models.py", line 105, in cvae_layer
    raise Exception("Unknown posterior "+posterior)
Exception: Unknown posterior down_iaf2_NL


rfarouni commented on June 12, 2024

@kirk86 You need to change down_iaf2_NL to down_iaf2_nl in this command:

python train.py with problem=cifar10 n_z=32 n_h=64 depths=[2,2,2] margs.depth_ar=1 margs.posterior=down_iaf2_NL margs.kl_min=0.25


kirk86 commented on June 12, 2024

@rfarouni thanks. Now I'm getting the following error:

python train.py with problem=cifar10 n_z=32 n_h=64 depths=[2,2,2] margs.depth_ar=1 margs.posterior=down_iaf2_nl margs.kl_min=0.25
[graphy] floatX = float32
INFO - Deep VAE - Running command 'train'
WARNING - Deep VAE - No observers have been added to this run
INFO - Deep VAE - Started
Logpath: /Users/user/iaf/logs//1475865352.93/
CVAE1 with  {'depths': [2, 2, 2], 'nl': u'elu', 'n_h2': 64, 'n_z': 32, 'shape_x': [3, 32, 32], 'optim': u'adamax', 'weightsharing': False, 'px': u'logistic', 'kernel_x': [5, 5], 'n_h1': 64, 'prior': u'diag', 'posterior': u'down_iaf2_nl', 'pad_x': 0, 'beta2': 0.001, 'beta1': 0.1, 'depth_ar': 1, 'alpha': 0.002, 'kl_min': 0.25, 'downsample_type': u'nn', 'kernel_h': [3, 3]}
ERROR - Deep VAE - Failed after 0:00:01!
Traceback (most recent calls WITHOUT Sacred internals):
  File "train.py", line 185, in train
    model = construct_model(data_init)
  File "train.py", line 128, in construct_model
    model = models.cvae1(**margs)
  File "/Users/users/iaf/models.py", line 520, in cvae1
    f_encode_decode(w)
  File "/Users/users/iaf/models.py", line 416, in f_encode_decode
    h = x_enc(_x - .5, w)
  File "/Users/users/iaf/graphy/nodes/conv.py", line 196, in f
    input_shape = h.tag.test_value.shape[1:]
AttributeError: scratchpad instance has no attribute 'test_value'


rfarouni commented on June 12, 2024

@kirk86 In .theanorc, add the line compute_test_value=raise. My file looks like this:

[global]
floatX = float32
device = gpu
compute_test_value=raise

Note: I am using a GPU
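
For a single run, the same settings can also be passed through the THEANO_FLAGS environment variable instead of editing .theanorc. The sketch below builds that string for a CPU-only machine; the helper name theano_flags is hypothetical, and the values simply mirror the file above with device swapped to cpu.

```python
import os


def theano_flags(device="cpu", float_x="float32", compute_test_value="raise"):
    """Build a THEANO_FLAGS string mirroring the .theanorc above (hypothetical helper)."""
    settings = [
        ("floatX", float_x),
        ("device", device),
        ("compute_test_value", compute_test_value),
    ]
    # THEANO_FLAGS is a comma-separated list of key=value pairs.
    return ",".join("{}={}".format(key, value) for key, value in settings)


# This must happen before `import theano`, because Theano reads the
# variable when it is first imported.
os.environ["THEANO_FLAGS"] = theano_flags(device="cpu")
```

Setting the variable in the shell before calling python train.py has the same effect.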


kirk86 commented on June 12, 2024

@rfarouni thanks for your patience! I was just about to close this when I saw the printed messages, but then I ran into this NaN error:

ar.conv2d 0_0_posterior_conv1_out_1 (64, 16, 16) (32, 16, 16) [3, 3] True False True valid False True
conv2d 0_0_down_conv2_1 (96, 16, 16) (64, 16, 16) [3, 3] True valid (1, 1) 1
conv2d x_dec (64, 16, 16) (3, 32, 32) [5, 5] True valid (1, 1) 2
AdaMax_Avg alpha: -0.002 beta1: 0.1 beta2: 0.001
Compiling...  212.75 s
[array([ nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,
        nan,  nan,  nan,  nan,  nan], dtype=float32)]
ERROR - Deep VAE - Failed after 0:11:00!
Traceback (most recent calls WITHOUT Sacred internals):
  File "train.py", line 256, in train
    result = model.train(data_train, n_batch=n_batch)
  File "/Users/user/iaf/models.py", line 538, in newf
    return f.cache(*args, **kws)
  File "/Users/user/iaf/graphy/function.py", line 110, in func
    raise Exception("NaN detected")
Exception: NaN detected
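
The traceback only shows that graphy/function.py raises once a NaN appears; the check itself is not shown in this thread. As a rough illustration of that kind of guard (not the project's actual code), the sketch below reports which output and which position went bad. Both function names are hypothetical.

```python
import math


def first_nan_index(values):
    """Return the index of the first NaN in a flat sequence, or None."""
    for index, value in enumerate(values):
        if math.isnan(value):
            return index
    return None


def check_no_nan(name, values):
    """Raise with the offending position instead of a bare 'NaN detected'."""
    index = first_nan_index(values)
    if index is not None:
        raise ValueError("NaN detected in {} at index {}".format(name, index))
    return values
```

Calling check_no_nan on each output of a training step would show whether every value is NaN from the first batch, as in the log above, or whether the values degrade gradually.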


rfarouni commented on June 12, 2024

@kirk86 Although I didn't encounter this error, I got a memory error after a minute or so of run time. I have only a 4 GB GPU, and it seems I need more memory to run the code on this dataset with the parameters that were provided.


kirk86 commented on June 12, 2024

@rfarouni, ah, that gives me a hint to try it on a machine with more memory as well, even though I'm not confident that this NaN error comes from memory issues in my case. It seems to me more like a computation error in the actual code than a memory problem. I'll test it on a bigger machine just in case and report back.

On another note, I like your favorite quotes section 👍. I might steal the idea; I've collected lots of such quotes in my notebook but never posted them...


kirk86 commented on June 12, 2024

An update on this issue: I ran the script on a machine with 8 cores and 32GB of memory, and after two days it's like I've been watching a black hole, speaking in memory terms. The only thing running on that machine is this script, and so far it has swallowed 27GB of memory. I'm not closing this issue until the script finishes. But to be fair, this is kind of insane in terms of memory consumption.


dpkingma commented on June 12, 2024

Hi Kirk86,

Thanks for bringing this up; it definitely doesn't need that much memory, probably a bug. I'll look into it.

On Fri, Oct 14, 2016 at 1:36 PM, kirk86 [email protected] wrote:

Some update regarding this issue. So I run the script on a machine with 8
cores and 32GB of memory and two days now, it's like I've been watching a
black hole, speaking in memory terms. The only thing running on that
machine is this script and so far it has swallowed 27GB of memory. I'm not
closing this issue yet until the script is over. But to be fair this is
kind of insane in terms of memory consumption.




rfarouni commented on June 12, 2024

@dpkingma @kirk86 I also ran into memory problems on the GPU the first time I ran it. The second time, for some unexplained reason, it worked fine, although very slowly. I also tried to run the TensorFlow implementation on one GPU, but I encountered this error:

python tf_train.py --logdir $ML_LOG_PATH --hpconfig depth=1,num_blocks=20,kl_min=0.1,learning_rate=0.002,batch_size=32 --num_gpus 1 --mode train
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcublas.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcudnn.so.5.1.5 locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcufft.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcurand.so.8.0 locally
Num trainable variables: 41557927
starting training
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:925] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_device.cc:951] Found device 0 with properties: 
name: GeForce GTX 960
major: 5 minor: 2 memoryClockRate (GHz) 1.304
pciBusID 0000:01:00.0
Total memory: 3.95GiB
Free memory: 3.41GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:972] DMA: 0 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] 0:   Y 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 960, pci bus id: 0000:01:00.0)
Starting queue runners
Initializing parameters.
Initialized!
Traceback (most recent call last):
  File "tf_train.py", line 392, in <module>
    tf.app.run()
  File "/home/rick/anaconda3/envs/py27/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 30, in run
    sys.exit(main(sys.argv[:1] + flags_passthrough))
  File "tf_train.py", line 386, in main
    run(hps)
  File "tf_train.py", line 265, in run
    with sv.managed_session(config=config) as sess:
  File "/home/rick/anaconda3/envs/py27/lib/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "/home/rick/anaconda3/envs/py27/lib/python2.7/site-packages/tensorflow/python/training/supervisor.py", line 974, in managed_session
    self.stop(close_summary_writer=close_summary_writer)
  File "/home/rick/anaconda3/envs/py27/lib/python2.7/site-packages/tensorflow/python/training/supervisor.py", line 802, in stop
    stop_grace_period_secs=self._stop_grace_secs)
  File "/home/rick/anaconda3/envs/py27/lib/python2.7/site-packages/tensorflow/python/training/coordinator.py", line 386, in join
    six.reraise(*self._exc_info_to_raise)
  File "/home/rick/anaconda3/envs/py27/lib/python2.7/site-packages/tensorflow/python/training/supervisor.py", line 963, in managed_session
    start_standard_services=start_standard_services)
  File "/home/rick/Documents/repos/iaf/tf_utils/common.py", line 222, in prepare_or_wait_for_session
    not_ready = self._session_manager._model_not_ready(sess)
AttributeError: 'SessionManager' object has no attribute '_model_not_ready'


kirk86 commented on June 12, 2024

@rfarouni I know this might not be the solution, but did you install tqdm? As for the slowness, I've experienced it too. In my case it takes almost one day per epoch.


rfarouni commented on June 12, 2024

@kirk86 sure! conda install tqdm


Mistobaan commented on June 12, 2024

tqdm is not in conda but is in pip: pip install tqdm
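
Since several of the failures in this thread came down to missing packages, a quick import check before training can save a run. This is only a sketch: the helper name missing_modules is hypothetical, and the list covers just the packages mentioned in this thread, not the project's full requirements.

```python
import importlib


def missing_modules(names):
    """Return the module names that fail to import (hypothetical helper)."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


# Import names for the packages mentioned in this thread.
# "PIL" is the import name provided by the pil/Pillow package.
print(missing_modules(["PIL", "sacred", "tqdm"]))
```

An empty list means all three imports succeeded in the current environment.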

