cgt's People

Contributors

hojonathanho, joschu, nickjhay, pcmoritz, robertnishihara, sbos, timleathart, zoemcc, zxie


cgt's Issues

Error on Neural Turing Machine Demo

numpy version: numpy 1.10.0.post2

~/.cgtrc:

debug = False                                                                                                                                       
precision = single                                                                                                                                  
backend = native                                                                                                                                    
cache_dir = ~/.cgt_cache                                                                                                                            
enable_inplace_opt = True                                                                                                                           
enable_simplification = True                                                                                                                        
parallel = False                                                                                                                                                                                                                                                                                      
force_python_impl = False                                                                                                                           
debug_cpp = False                                                                                                                                   
verbose = False

The MNIST and variational autoencoder demos seem to work fine; I'm having issues with the neural Turing machine demo:

(.venv)➜  examples git:(master) ✗ python demo_neural_turing_machine.py 
Traceback (most recent call last):
  File "demo_neural_turing_machine.py", line 469, in <module>
    main()
  File "demo_neural_turing_machine.py", line 415, in main
    ntm = make_ntm(opt)
  File "demo_neural_turing_machine.py", line 199, in make_ntm
    controller = make_ff_controller(opt)
  File "demo_neural_turing_machine.py", line 104, in make_ff_controller
    assert infer_shape(k_bHm) == (b,H,m)
  File "/home/jramapuram/projects/cgt/cgt/core.py", line 707, in infer_shape
    return tuple(x.op.value if isinstance(x.op, Constant) else None for x in  CACHER.simplify(cgt.shape(arr)))
  File "/home/jramapuram/projects/cgt/cgt/core.py", line 2728, in simplify
    for x in xs: self.simplify1(x)
  File "/home/jramapuram/projects/cgt/cgt/core.py", line 2733, in simplify1
    update_simplify_map(x, self.analysis, self.repl)
  File "/home/jramapuram/projects/cgt/cgt/core.py", line 2626, in update_simplify_map
    maybe_pair = process_top_stack_item_and_maybe_get_replacement(stack, analysis, repl)
  File "/home/jramapuram/projects/cgt/cgt/core.py", line 2589, in process_top_stack_item_and_maybe_get_replacement
    newnewnode = maybe_replace(newnode, analysis, repl)
  File "/home/jramapuram/projects/cgt/cgt/core.py", line 2689, in maybe_replace
    out = cgt.constant(py_numeric_apply(node, [p.op.value for p in parents]))
  File "/home/jramapuram/projects/cgt/cgt/core.py", line 2926, in py_numeric_apply
    callable.call(vals, out)
  File "/home/jramapuram/projects/cgt/cgt/core.py", line 794, in call
    return self._func(*args)    
  File "/home/jramapuram/projects/cgt/cgt/core.py", line 1192, in f
    self.info.pyfunc(reads[0], out=write)
  File "/home/jramapuram/projects/cgt/cgt/core.py", line 1096, in _nu_iceil
    np.ceil(x,out)
TypeError: ufunc 'ceil' output (typecode 'd') could not be coerced to provided output parameter (typecode 'l') according to the casting rule ''same_kind''
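
For what it's worth, the failure is reproducible with numpy alone: _nu_iceil passes an integer output array to np.ceil, which refuses the float-to-int cast under the 'same_kind' rule. A minimal illustration:

import numpy as np

x = np.array([1.5, 2.5])
out = np.zeros(2, dtype=np.int64)
np.ceil(x, out)        # raises the same TypeError as above
out[...] = np.ceil(x)  # works: plain assignment casts the float result to int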

Fancy non-flat indexing

Currently, you can do this:

x[0:2,:], x[0:2,1:3]

which uses GetSli, and this:

x[ cgt.arange(0,10), cgt.arange(0,20,2) ]

which uses the GetFlatIndices op after computing the flat indices for the expression.

But you can't do this:

x[ [1,3], : ]

We should have an Op that looks like GetSli but takes an integer vector of indices along some axis; call it, say, GetFancySli.
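
In numpy terms, the proposed GetFancySli would compute the equivalent of np.take along an axis:

import numpy as np

x = np.arange(12).reshape(4, 3)
assert (x[[1, 3], :] == np.take(x, [1, 3], axis=0)).all()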

Ability to run execution graph from an external application

Please add the ability to run a CGT graph compiled with the native backend from a C or C++ application.

Intended pipeline:

  1. Create computation graph with CGT
  2. Build it using native backend
  3. Link result object files with external application
  4. Execute the graph.

Thank you.

Broadcasting elementwise operations

@joschu recently made the following comments regarding broadcasting.

The current system requires a call to broadcast whenever you want to do a broadcasting elementwise binary operation.
As for the decision not to allow implicit broadcasting, there are a few options of how to deal with broadcasting, none of which is fully satisfying.

  1. Broadcasting is fully determined at runtime based on singleton pattern of variables.
  2. The singleton pattern of a variable is part of its type (which is propagated at graph construction time). This is what Theano does.
  3. Broadcasting is fully explicit, i.e., the user must specify which singleton axes should be expanded. This is what CGT currently does.

The downside of 1 is that it reduces our ability to infer shapes of intermediate variables in the graph, and it makes the expressions specifying the shapes more complicated.

The downside of 2 is that sometimes the library can't figure out that your variable has singleton dimensions. With theano, you often have to explicitly set the broadcastable dimensions of variables, which is confusing to non-expert users.

The downside of 3 is extra verbosity, and confusion for numpy/theano users.

I think 1 is also a reasonable option -- it would just require modifying the C++ and CUDA code for ElwiseBinary to do broadcasting based on the singleton pattern that arrives at runtime, and modifying the shp_apply function. I'm open to switching to this option.
On the other hand, I feel like implicit broadcasting causes a lot of bugs, and it's actually a good thing for users to specify explicitly the singleton patterns of their variables -- we all make the mistake of putting singletons in the wrong place.

I agree that lack of implicit broadcasting will be confusing to both Theano and numpy users. I also agree that option one seems the most reasonable.

We're open to further suggestions on this topic before we make a decision.
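
For reference, option 1 is what numpy itself does: the singleton pattern is discovered from the runtime shapes, so nothing about broadcasting appears in the expression itself, whereas CGT's option 3 requires an explicit cgt.broadcast call. A one-liner illustration of option 1:

import numpy as np

a = np.ones((3, 1))
b = np.ones((1, 4))
print (a + b).shape   # (3, 4): singleton axes expanded at runtime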

Berkeley BIDMach and CGT

Hi,

Just wondering how CGT compares to your colleagues' BIDMach.
BIDMach is already a very fast framework for ML and DNNs.

Thanks, regards

Continuous Integration

@zobot told me earlier that she had some trouble building CGT on linux.
Also, we had the unfortunate openblas download problem after the release announcement.
Seems like a good time to set up a CI system.
I just applied for an open-source TeamCity license.

Support for square root operation on Node

I was trying out this code to profile cgt against theano and understand how much better the graph optimisations are in CGT:

x = cgt.scalar(name='x', dtype='float32')
y = cgt.scalar(name='y', dtype='float32')
func = np.sqrt((x**2)*(y/x)*(x**3/y**3) - (x**2)*(y/x)*(x**3/y**3) + (x**2)*(y/x)*(x**3/y**3))

and then I got this error while compiling the function:

AttributeError: 'Node' object has no attribute 'sqrt'

Would be glad to have square root support. Thanks!
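
Until then, a workaround that keeps everything symbolic is to avoid np.sqrt (which falls back to calling .sqrt() on the Node object) and use the power operator, which the snippet above already relies on. A minimal sketch:

import cgt

x = cgt.scalar(name='x', dtype='float32')
y = cgt.scalar(name='y', dtype='float32')
expr = (x**2)*(y/x)*(x**3/y**3)
f = cgt.function([x, y], expr ** 0.5)   # sqrt expressed as a fractional power

(Note: a later issue in this list already uses cgt.sqrt, so a symbolic sqrt may exist or have since been added on master.)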

-1 as an index for the last element in an axis doesn't work

A minimal example to reproduce the bug:

import cgt
import numpy as np

axis = 1
slices = -1
input_var = cgt.tensor3('input', dtype=np.float32)
result = input_var[:, slices, :]

An alternative variant of the indexing:

result = input_var[(slice(None),) * axis + (slices,)]

f = cgt.function([input_var], [result])
input = np.ones((3, 4, 5), dtype=np.float32)
input[0, 3] = 3
print f(input)

output:
[array([[ 0., 0., 0., 0., 0.],
[ 3., 3., 3., 3., 3.],
[ 1., 1., 1., 1., 1.]], dtype=float32)]

If I set slices = 3, which is the last index along axis=1, I get the correct answer:
[array([[ 3., 3., 3., 3., 3.],
[ 1., 1., 1., 1., 1.],
[ 1., 1., 1., 1., 1.]], dtype=float32)]
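
A possible workaround until this is fixed, assuming GetSli accepts a symbolic index (the negative-step issue below suggests slice components may be symbolic): canonicalize the negative index against the axis length before slicing.

import numpy as np
import cgt

axis, slices = 1, -1
input_var = cgt.tensor3('input', dtype=np.float32)
# shift the negative index by the (symbolic) axis length
idx = input_var.shape[axis] + slices if slices < 0 else slices
f = cgt.function([input_var], [input_var[:, idx, :]])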

Bug in graph simplification

I have a baseline parameter in my program which is subtracted from gradients and not used to compute the actual function I'm optimizing. However, changing that parameter affects not just gradient computation, but the returned function value as well.
This behaviour is present only if enable_simplification = True.

The code is here https://gist.github.com/sbos/65ea08d5885660625fd3
The three values it prints should be nearly equal, but setting enable_simplification = True makes the second one close to zero.

Error on tutorial

I get an assertion error (assert newnewnode.typ == orig.typ) on the simplify command in the tutorial code:
cgt.print_expr(cgt.simplify([dLdw])[0]);

I am running on Windows, with only the Python parts installed (no Cython or CUDA).

This is the code I am running:
import cgt
a = cgt.scalar(name='a') # float-valued scalar, with optional name provided
b = cgt.scalar(name='b')
n = cgt.scalar(name='n', dtype='int64') # integer scalar

c = (a**n + b**n)**(1.0/n)

f = cgt.function([a,b,n], c)
print f(8,15,2)

X_nk = cgt.matrix("X")
y_n = cgt.vector("y")
w_k = cgt.vector("w")
b = cgt.scalar("b")
ypred_n = X_nk.dot(w_k) + b
L = cgt.sum(cgt.square(ypred_n - y_n))
print "L = ",
cgt.print_expr(L)
print X_nk.ndim, str(X_nk.shape), X_nk.dtype
grads = dLdw, dLdb = cgt.grad(L, [w_k, b])
print "Loss and gradient objects", dLdw, dLdb
print "Pretty-printed gradient: ",
cgt.print_expr(cgt.simplify([dLdw])[0]);

And this is the error I get:

Traceback (most recent call last):
  File "C:/test_cgt.py", line 23, in <module>
    cgt.print_expr(cgt.simplify([dLdw])[0]);
  File "C:\CGT\cgt\core.py", line 2688, in simplify
    return simplify_and_analyze(xs)[0]
  File "C:\CGT\cgt\core.py", line 2533, in simplify_and_analyze
    for output in outputs: update_simplify_map(output, analysis, repl)
  File "C:\CGT\cgt\core.py", line 2600, in update_simplify_map
    maybe_pair = process_top_stack_item_and_maybe_get_replacement(stack, analysis, repl)
  File "C:\CGT\cgt\core.py", line 2567, in process_top_stack_item_and_maybe_get_replacement
    assert newnewnode.typ == orig.typ
AssertionError

I also put a breakpoint on the assertion line and found:
newnewnode.typ = Tensor(i4,0)
orig.typ = Tensor(i8,0)

Native backend failed to build with Cython 0.20.1

Building the native backend with Cython 0.20.1 results in a Cython compiler crash:

/root/deeprl/cgt/src/cycgt.pyx:109:30: Compiler crash in CreateClosureClasses

ModuleNode.body = StatListNode(cycgt.pyx:1:0)
StatListNode.stats[16] = StatListNode(cycgt.pyx:107:5)
StatListNode.stats[0] = CFuncDefNode(cycgt.pyx:107:5,
 args = [...]/2,
 modifiers = [...]/0,
 needs_closure = True,
 visibility = u'private')
CFuncDefNode.body = StatListNode(cycgt.pyx:108:4,
 is_terminator = True)
StatListNode.stats[0] = ReturnStatNode(cycgt.pyx:109:4,
 is_terminator = True)
ReturnStatNode.value = SimpleCallNode(cycgt.pyx:109:16,
 analysed = True,
 is_temp = 1,
 result_is_used = True,
 use_managed_ref = True)
SimpleCallNode.arg_tuple = TupleNode(cycgt.pyx:109:16,
 is_sequence_constructor = 1,
 is_temp = 1,
 result_is_used = True,
 use_managed_ref = True)
TupleNode.args[0] = GeneratorExpressionNode(cycgt.pyx:109:30,
 genexpr_name = u'genexpr1',
 is_temp = 1,
 name = u'genexpr',
 needs_closure = True,
 needs_self_code = True,
 pymethdef_cname = u'__pyx_mdef_5cycgt_12cgt2py_tuple_1genexpr',
 result_is_used = True,
 use_managed_ref = True)

Compiler crash traceback from this point on:
 File "Visitor.py", line 170, in Cython.Compiler.Visitor.TreeVisitor._visit (Cython/Compiler/Visitor.c:4285)
 File "/usr/lib/python2.7/dist-packages/Cython/Compiler/ParseTreeTransforms.py", line 2364, in visit_LambdaNode
 self.create_class_from_scope(node.def_node, self.module_scope, node)
 File "/usr/lib/python2.7/dist-packages/Cython/Compiler/ParseTreeTransforms.py", line 2342, in create_class_from_scope
 type=cscope.scope_class.type,
AttributeError: 'ClosureScope' object has no attribute 'scope_class'

Maybe we should document a lower bound on the Cython version in the README, or check it in the build tooling?
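
A minimal sketch of such a guard, e.g. near the top of the build script (the exact lower bound is an assumption; 0.20.1 is only known to be bad):

from distutils.version import LooseVersion
import Cython

if LooseVersion(Cython.__version__) < LooseVersion("0.21"):
    raise RuntimeError("cgt's native backend needs a newer Cython; "
                       "0.20.1 crashes while compiling cycgt.pyx")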

Build system

I propose that we get rid of CMake and build with setup.py.
We need to determine all the compiler flags in compilation.py anyway, so it wouldn't be too much extra work to compile cgt.so using this information.
Getting this to work robustly across platforms will be nontrivial and will benefit from CI (issue #7).
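
Roughly, the proposal is for setup.py to own the build, reusing the flags compilation.py already computes. A sketch under those assumptions (source lists and flags are placeholders, not the real build configuration):

from setuptools import setup, Extension
from Cython.Build import cythonize

ext = Extension(
    "cycgt",
    sources=["src/cycgt.pyx"],    # plus the C++ sources
    language="c++",
    extra_compile_args=["-O3"],   # would come from compilation.py
)
setup(name="cgt", packages=["cgt"], ext_modules=cythonize([ext]))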

Memory consumption of native backend on Linux

I observe a high rate of memory consumption when running my code on Linux using the native backend. It's about 20 MB/sec and I quickly run out of memory. I don't see this on Mac OS, or if I use the Python backend (and call gc.collect() from time to time).

It is interesting that running demo_neural_turing_machine.py consumes only a constant amount of memory while demo_char_rnn.py very quickly drains all available memory. I will test more examples soon.

Is there an analog of gc.collect() which I can use to free memory allocated by native backend?

Using intermediate nodes of graph as input to function

Suppose we have function([x], y), where z = f(x) and y = g(z), and we then want to compile function([z], y). Currently there are two issues (the desired usage is sketched below):

  1. z is a Result node, not an Argument node
  2. create_execution_graph processes all ancestors of y (including x) when trying to compile function([z], y); it should stop at the provided input z.
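
The desired usage, spelled out (the second call is what fails today):

import cgt

x = cgt.vector("x")
z = cgt.square(x)            # z = f(x): an intermediate Result node
y = cgt.sum(z)               # y = g(z)
f1 = cgt.function([x], y)    # works
f2 = cgt.function([z], y)    # fails, per issues 1 and 2 above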

CIFAR Demo Error when using GPU

(.venv)➜  examples git:(master) ✗ python demo_cifar.py --devtype=gpu
Traceback (most recent call last):
  File "demo_cifar.py", line 90, in <module>
    main()
  File "demo_cifar.py", line 60, in main
    train = cgt.function(inputs=[X, y], outputs=[loss], updates=updates)
  File "/home/jramapuram/projects/cgt/cgt/compilation.py", line 14, in function
    return _function_listout(inputs, outputs, dbg, updates, givens)
  File "/home/jramapuram/projects/cgt/cgt/compilation.py", line 35, in _function_listout
    interp = run_compilation_pipeline(inputs, outputs, updates, givens)
  File "/home/jramapuram/projects/cgt/cgt/compilation.py", line 354, in run_compilation_pipeline
    inputs, nodes_sorted, analysis["node2shape"], node2memowner, node2dev)
  File "/home/jramapuram/projects/cgt/cgt/compilation.py", line 240, in create_execution_graph
    assert node2dev[node] == node2dev[node2memowner[node]]
AssertionError

Random number generation and modules (and simplification again)

It seems that there are still problems with the simplification of random number generation operations when they are encapsulated in modules.

The following code prints the last two lines equal when simplification is turned on:

import cgt
import cgt.nn as nn

z1 = cgt.randn()
z2 = cgt.randn()

m = nn.Module([cgt.scalar()], [z1, z2])
f = cgt.function([], m([0.]))

print 'A', f()
print 'A', f()

z = cgt.randn(2)
m = nn.Module([cgt.scalar()], [z])

f = cgt.function([], m([0.]) + m([0.]))

print 'B', f()
print 'B', f()

Add type checking to all typ_apply functions

The functions Op.typ_apply should check the input types and throw human-understandable exceptions when necessary. E.g., Mul22 needs to make sure that both of its inputs have ndim=2 and float or complex types.
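
A sketch of the kind of check Mul22's typ_apply could do. The attribute names (ndim, dtype) follow how TensorType is printed elsewhere in these issues; the rest is illustrative, not CGT's actual code:

from cgt import core

class Mul22Checked(object):
    # illustrative only: what Op.typ_apply could verify for Mul22
    def typ_apply(self, input_types):
        a, b = input_types
        if a.ndim != 2 or b.ndim != 2:
            raise TypeError("Mul22 expects two 2-d inputs, got ndim=%s and ndim=%s"
                            % (a.ndim, b.ndim))
        if a.dtype[0] not in "fc" or b.dtype[0] not in "fc":
            raise TypeError("Mul22 expects float or complex inputs, got %s and %s"
                            % (a.dtype, b.dtype))
        return core.TensorType(a.dtype, 2)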

Conditionals (if-else) and Loops

I'm creating this Issue to sketch out some ideas on how to implement conditionals and loops, which have come out of discussions with @nickjhay and @hojonathanho. These ideas will address Issue #23, among other things. This issue is not the top priority at the moment (better error messages, full GPU support, and other issues are higher on the list), but it can't hurt to start brainstorming about it now.

First, let's consider implementing an IfElse operation: if cond then x else y, where cond is boolean.
Right now, we iterate through the nodes of the computation graph in topological order and fire each node after its parent nodes. Hence, x and y would both be computed, which would be problematic if one of them had an out-of-bounds error (as pointed out in the #23 discussion).
This issue was raised early on in Theano development (see https://github.com/Theano/Theano/blob/master/doc/proposals/conditional.txt)
and they implemented "linkers" (like CGT's interpreters) that allow for lazy execution, along with the idea of "thunks".

Implementing lazy evaluation in CGT would require changes to how the execution graph works. One possibility is to (1) create a new IfElse instruction, and (2) make the interpreters work backwards from the final instruction instead of working forwards. Specifically, we'd construct a dependency graph on instructions (as we do in the ParallelInterpreter already), and then work backwards from the final results using a depth-first search. The depth-first search can be implemented with a stack, and for an IfElse instruction, we'll make sure that it gets popped off the stack twice: the first time we evaluate the condition, and the second time we evaluate the appropriate expression. A toy sketch follows.
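
A self-contained toy version of that backwards evaluation (Instr and IfElse here are stand-ins, not CGT's real classes):

class Instr(object):
    def __init__(self, fn, deps=()):
        self.fn, self.deps = fn, deps
    def run(self, vals):
        return self.fn(*[vals[d] for d in self.deps])

class IfElse(object):
    def __init__(self, cond, iftrue, iffalse):
        self.cond, self.iftrue, self.iffalse = cond, iftrue, iffalse

def evaluate(root):
    vals, stack = {}, [root]
    while stack:
        node = stack[-1]
        if node in vals:
            stack.pop()
        elif isinstance(node, IfElse):
            if node.cond not in vals:
                stack.append(node.cond)        # first visit: evaluate the condition
            else:
                branch = node.iftrue if vals[node.cond] else node.iffalse
                if branch in vals:
                    vals[node] = vals[branch]  # second visit: only the taken branch
                    stack.pop()
                else:
                    stack.append(branch)
        else:
            pending = [d for d in node.deps if d not in vals]
            if pending:
                stack.extend(pending)
            else:
                vals[node] = node.run(vals)
                stack.pop()
    return vals[root]

cond = Instr(lambda: True)
good = Instr(lambda: 42)
bad  = Instr(lambda: 1 // 0)              # would raise if evaluated eagerly
print evaluate(IfElse(cond, good, bad))   # 42; `bad` never fires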

Loops present another, more challenging problem. I can think of two use cases for loops: first, to implement an RNN-like model with a variable or very large number of timesteps; second, to implement an optimization algorithm like SGD. The first case requires us to have arrays that grow with the size of the loop. Theano uses Scan, which is a gargantuan piece of code written by Razvan Pascanu and only fully understood by him. It's also not a very convenient user interface. We need to answer a couple of questions: (1) what would a convenient Python interface look like, which would allow one to specify loops and implement RNNs and simple optimizers? (2) what should the underlying representation be, at the level of the "expression graph" in Python and at the level of the execution graph?

@nickjhay and @hojonathanho and I have kicked around a couple ideas about the internal representation. For the python expression graph, we could perhaps use recursive functions in a canonical form, e.g. the primitive recursion formulation here. That said, we may end up falling back to something like Theano's Scan. (But hopefully it can be done more cleanly.)
For the execution graph, @hojonathanho suggested considering the approach taken by lazy functional languages, for example the Spineless Tagless G-Machine. This problem has also been addressed by dataflow computing systems, e.g. Dryad.

Having said all of that, we don't want to re-invent a compiler, and we don't want to build a full-fledged dataflow engine -- we should pick a restricted set of computations that can be performed by a simple system.

Loading pre-trained models, e.g. VGG

It would be nice to be able to load well-known pre-trained models like VGG and GoogLeNet into CGT data-structures.
There's a half-written script for loading Caffe prototxt files, called caffe2cgt.py.
For VGG, all of the operations are already implemented, whereas some of the other models involve "local response normalization" layers that are not yet implemented.

NTM documentation

In the NTM example you mention that

rprev_bhm:          previous vector read from memory. [-1, 1]

but this vector is the result of a linear operation between the memory and the address vector. Since the memory is not bounded, how can the read vector be bounded? Shouldn't it be [-inf, inf] as well?

BLAS name collisions

CGT downloads and installs OpenBLAS. But then when numpy gets imported, the functions from the linked BLAS (e.g. cblas_dgemm) overload some of the functions from OpenBLAS. I noticed this when I found that setting VECLIB_MAXIMUM_THREADS changes the behavior of CGT's matrix multiplication. This behavior doesn't seem to cause any serious bugs, but it partly defeats the purpose of using OpenBLAS, which is to obtain consistent behavior with regard to multithreading and so forth.

Support for ragged arrays and sparse matrix/vectors

Two things which theano doesn't really do and which would be really really useful for sequential data and NLP applications, perhaps enough to make me take the jump :)

In theano, ragged arrays require workarounds with padding and masking, which aside from being quite ugly and making the code less intuitive, can also hurt performance unless you do a bunch of extra preprocessing to lump sequences of similar lengths together in minibatches.

Sparse vectors and matrices are also very useful and something that theano has at best second-class support for. For the common case of neural models with dense weight matrices, sparse-by-dense dot products are probably the most useful thing to have implemented with efficient sparse gradients. Common operations in NLP neural models can be seen as sparse-by-dense dot products, e.g. a lookup table (sparse one-hot vector by dense embedding matrix) or a "continuous bag of words" sum of word embeddings (sparse count vector by dense embedding matrix). Noise-contrastive estimation (useful for large softmax output layers) also relies for its speed advantage on efficiently backpropagating a sparse error vector from the output layer.
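
To make the sparse-by-dense claim concrete in plain numpy (with dense stand-ins for the sparse vectors):

import numpy as np

V, d = 1000, 300
W = np.random.randn(V, d).astype(np.float32)

# lookup table: one-hot (sparse) vector times dense embedding matrix
onehot = np.zeros(V, dtype=np.float32)
onehot[17] = 1.0
assert np.allclose(onehot.dot(W), W[17])

# continuous bag of words: count (sparse) vector times dense matrix
counts = np.zeros(V, dtype=np.float32)
counts[3], counts[7] = 2.0, 1.0
assert np.allclose(counts.dot(W), 2 * W[3] + W[7])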

Simple operation breaks CGT

The following code doesn't work:

import numpy as np
import cgt

x = np.array([31,-2])
K = np.array([[3,5],[7,11],[13,17]])

_x = cgt.vector(fixed_shape=[2], name="x")
_K = cgt.matrix(fixed_shape=[3,2], name="K")

grad_K = cgt.grad(_K.dot(_x)[0], _K)
grad_K_x = cgt.grad(grad_K[0,0], _x)
f_K_x = cgt.function([_x, _K], grad_K_x)
print f_K_x(x,K) # should be [1,0]

Looping over tensor / using symbolic variable.

Hello, if I'm understanding the code correctly, there is currently no way of doing loops with a symbolic number of iterations (or of looping over the leading dimension of a tensor, as with Theano's ScanOp).

Are there plans to add such functionality to CGT? If not, what would be the recommended way of processing variable-length sequences? Is there a Switch operator, so that one could write:

output = init_output()
for t in range(max_num_steps):
   output = cgt.switch(X.shape[0]>t, make_step(X[t], output), output)

But wouldn't this create a huge overhead if the difference between the shortest and longest sequences in the dataset is large?

Hope it makes sense to post this question here instead of the mailing list.

OSX: Build fails with OpenBLAS dependency

brew install openblas works just fine.
Please add an option to use a system-installed OpenBLAS, or, if tight dependency coupling is important, add the relevant logic from the homebrew recipe (including its parallel-build settings) to the CMake setup for OSX.

➜  build git:(master) make
[ 28%] Built target cgt
[ 85%] Built target cycgt
[100%] Generating OpenBLAS/libopenblas.a
already downloaded openblas.tar.gz
mkdir -p /Users/jramapuram/projects/cgt/build/OpenBLAS && tar -xf openblas.tar.gz --directory /Users/jramapuram/projects/cgt/build/OpenBLAS  --strip-components=1
Compiling OpenBLAS...this will take a minute or so
make -j ONLY_CBLAS=1 NO_LAPACK=1 NO_LAPACKE=1 USE_OPENMP=0 NUM_THREADS=8
Undefined symbols for architecture x86_64:
  "_blas_cpu_number", referenced from:
      _goto_set_num_threads in libopenblas_haswellp-r0.2.14.a(blas_server.o)
      _gotoblas_pthread in libopenblas_haswellp-r0.2.14.a(blas_server.o)
  "_blas_get_cpu_number", referenced from:
      _gotoblas_pthread in libopenblas_haswellp-r0.2.14.a(blas_server.o)
  "_blas_num_threads", referenced from:
      _blas_thread_init in libopenblas_haswellp-r0.2.14.a(blas_server.o)
      _exec_blas_async in libopenblas_haswellp-r0.2.14.a(blas_server.o)
      _goto_set_num_threads in libopenblas_haswellp-r0.2.14.a(blas_server.o)
      _blas_thread_shutdown_ in libopenblas_haswellp-r0.2.14.a(blas_server.o)
ld: symbol(s) not found for architecture x86_64
clang: error: linker command failed with exit code 1 (use -v to see invocation)
make[4]: *** [libopenblas_haswellp-r0.2.14.dylib] Error 1
make[3]: *** [shared] Error 2
Traceback (most recent call last):
  File "4build/download_and_build_openblas.py", line 25, in <module>
    call_and_print("make -j ONLY_CBLAS=1 NO_LAPACK=1 NO_LAPACKE=1 USE_OPENMP=0 NUM_THREADS=%i"%multiprocessing.cpu_count())
  File "4build/download_and_build_openblas.py", line 10, in call_and_print
    subprocess.check_call(cmd,shell=True)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 540, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'make -j ONLY_CBLAS=1 NO_LAPACK=1 NO_LAPACKE=1 USE_OPENMP=0 NUM_THREADS=8' returned non-zero exit status 2
make[2]: *** [OpenBLAS/libopenblas.a] Error 1
make[2]: *** Deleting file `OpenBLAS/libopenblas.a'
make[1]: *** [CMakeFiles/openblas.dir/all] Error 2
make: *** [all] Error 2

Product of diagonal elements may be `nan`

I'm trying to implement the product of matrix diagonal elements.
For some reason, my code returns nan when at least one of the elements is negative. I'm using the latest master.

Here is my code:

import cgt
import numpy as np

D = 2

L = cgt.matrix("L")
diag_elements = [np.arange(D, dtype=int), np.arange(D, dtype=int)]
f = cgt.function([L], cgt.prod(L[diag_elements]))

L = np.random.rand(D, D)
print f(L)

L = np.random.randn(D, D)
print f(L)

L = np.array([[ 0.8582886,   0.        ],
 [-0.33441732, -0.45777691]])
print f(L)

The last output is always nan, since one of the diagonal elements is negative; the first is always a number, since all the entries are non-negative.
Is this a bug, or am I doing something wrong?

At the same time, if I pass the vector of diagonal elements to cgt.prod, it works fine:

x = cgt.vector("x")
g = cgt.function([x], cgt.prod(x))

print g(np.diag(L))

So there might be something wrong with indexing.
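
One guess at the cause (purely a hypothesis): a product that is nan exactly when an element is negative is what you get if the prod over an indexed expression is rewritten as exp(sum(log(x))) somewhere in simplification. A numpy illustration of that failure mode:

import numpy as np

x = np.array([0.8582886, -0.45777691])
print np.prod(x)                  # -0.3929..., the right answer
print np.exp(np.sum(np.log(x)))   # nan: log of a negative number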

Parameter indexing leads to an assertion error in module definition

It seems that one cannot define a module that computes a parameter gradient with respect to an expression in which the parameter is indexed. In the code example below, one of two equivalent module definitions fails with an assertion error.

import cgt
import cgt.nn as nn
import numpy as np

#does work
z = cgt.scalar()
theta = nn.parameter(np.random.rand(2) - 0.5)
x = cgt.sqrt(z + cgt.dot(theta, theta))

g = cgt.grad(x, [theta])
m = nn.Module([z], g)

#doesn't work
z = cgt.scalar()
theta = nn.parameter(np.random.rand(2) - 0.5)
x = cgt.sqrt(z + cgt.square(theta[0]) + cgt.square(theta[1]))

g = cgt.grad(x, [theta])
#function can be created successfully
f = cgt.function([z], g)
#assertion error
m = nn.Module([z], g)

Error details:

Traceback (most recent call last):
  File "module.py", line 22, in <module>
    m = nn.Module([z], g)
  File "/Users/sbos/projects/cgt/cgt/nn.py", line 16, in __init__
    self.c = core.Composition(inputs, outputs)
  File "/Users/sbos/projects/cgt/cgt/core.py", line 2358, in __init__
    dio = set(differentiably_influences(outputs))
  File "/Users/sbos/projects/cgt/cgt/core.py", line 615, in differentiably_influences
    for (p,d) in utils.safezip(node.parents, node.get_diff()):
  File "/Users/sbos/projects/cgt/cgt/utils.py", line 43, in safezip
    assert len(x) == len(y)
AssertionError

DType.canon sees 'i8' type (provided by TensorType) as invalid when running mnist or char-rnn examples

When I attempt to run the mnist or char-rnn examples, I get the following error stack:
...
File "C:\Users\triley\workspace\cgt\cgt\core.py", line 1386, in typ_apply
return TensorType('i8',0)
File "C:\Users\triley\workspace\cgt\cgt\core.py", line 65, in init
self.dtype = Dtype.canon(dtype)
File "C:\Users\triley\workspace\cgt\cgt\core.py", line 31, in canon
raise ValueError("Invalid dtype %s"%dt)
ValueError: Invalid dtype int64

The error appears to occur because the dtype check in Dtype.canon (in /cgt/core.py) resolves the numpy dtype of 'i8' into 'q', which isn't part of the 'fdg', 'iulBb?', or 'FDG' types. I tried to add 'q' myself (to the second group, I think this is where it goes) but then I ran into another problem with an assert and neither script ran.

This is on Windows 10, with Python 2.7.10 and running cgt straight from the github repo (no compilation).
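
This is consistent with the platform-dependent character code for 64-bit ints: the membership string 'iulBb?' contains 'l' but not 'q'.

import numpy as np

print np.dtype('i8').char   # 'l' on typical 64-bit Linux, 'q' on Windows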

Test test_par_interp.test_matmuls('native', 'single', 'cpu') fails

On my system the test fails with the following error message:

test_par_interp.test_matmuls('native', 'single', 'cpu') ... BLAS : Program is Terminated. Because you tried to allocate too many memory regions.

I think this is because I'm using my own version of OpenBLAS which was compiled with NUM_THREADS=4 (see #33) instead of USE_THREAD=0. Is there anything I can do to help debug the problem?

Build error: No such file or directory: 'openblas.tar.gz.part'

Super excited to try this, but right now stuck on this build error:

[100%] Built target cycgt
Traceback (most recent call last):
  File "4build/download_and_build_openblas.py", line 19, in <module>
    shutil.move("{fname}.part".format(fname=fname),"{fname}".format(fname=fname))
  File "/Users/delip/anaconda/lib/python2.7/shutil.py", line 302, in move
    copy2(src, real_dst)
  File "/Users/delip/anaconda/lib/python2.7/shutil.py", line 130, in copy2
    copyfile(src, dst)
  File "/Users/delip/anaconda/lib/python2.7/shutil.py", line 82, in copyfile
    with open(src, 'rb') as fsrc:
IOError: [Errno 2] No such file or directory: 'openblas.tar.gz.part'
make[2]: *** [OpenBLAS/libopenblas.a] Error 1
make[1]: *** [CMakeFiles/openblas.dir/all] Error 2
make: *** [all] Error 2

The makefiles were generated using:

cmake .. -DCGT_ENABLE_CUDA=ON

Installation with anaconda

I use Anaconda as the Python environment for deployment on Ubuntu. After installing cgt, when I run "import cycgt" in the Python shell I get: ImportError: /cgt/build/lib/cycgt.so: undefined symbol: PyFPE_jbuf. It looks like the library was actually built against the system Python rather than Anaconda's. How can I modify the CMake configuration so that cgt installs against Anaconda? BTW, my Anaconda path is /anaconda2/.
Thanks a lot!

download_and_build_openblas.py should not use cgt module

Since download_and_build_openblas.py is used during cgt compilation, it shouldn't call any cgt functions; otherwise, compiling cgt for the first time gets a bit tricky. Considering that it only calls cgt.utils.warn, it would be much more convenient to replace that with a simple print call.
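
E.g., the script could carry its own two-line substitute instead of importing cgt:

import sys

def warn(msg):
    sys.stderr.write(msg + "\n")   # drop-in replacement for cgt.utils.warn here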

Dealing with exceptions (eg KeyboardInterrupt) in python functions called via C++

Moving this over from the bitbucket tracker...

Suppose you run the following: CGT_FLAGS=backend=native python examples/demo_mnist.py
and then hit Ctrl-C during execution. Chances are, you'll see something like the following:

^CException KeyboardInterrupt in 'cycgt._pyfunc_byval' ignored
^C[1] 7414 segmentation fault  CGT_FLAGS=backend=native python demo_mnist.py

Some of the operations are still only implemented in Python. Sometimes KeyboardInterrupts will get caught by Python and then ignored by Cython, causing undesirable/unpredictable behavior.
At the very least, we should abort() from cython. Better yet, we'd set some error flag, which would get caught by the C++ interpreter to stop execution, and then finally we'd raise a Python exception.

Anyway, @hojonathanho has a fix implemented that involves introducing some python exception machinery into execution.{cpp/h}. We agree that including this stuff is a bit undesirable, so we're holding off on merging for now. But I'll leave this issue as a placeholder, in case we decide to deal with it later.

Negative step when slicing

Currently you can't do x[::-1] or x[:,::-1]. x[start:stop:step] produces a GetSli Op with those arguments, but some of the logic assumes that step is positive. We have two choices:

  1. generalize GetSli to work with negative step
  2. assume that if a symbolic value for step is provided, it's positive. If it's negative at runtime, throw an exception. If it's non-symbolic and negative, use a Flip Op.
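
For reference, choice 2's Flip Op in numpy terms: a negative unit step is just a reversal along that axis.

import numpy as np

x = np.arange(6).reshape(2, 3)
assert (x[:, ::-1] == x[:, [2, 1, 0]]).all()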

Add support for theano's tensor.inc_subtensor command

In theano there is a great command:
http://deeplearning.net/software/theano/library/tensor/basic.html#theano.tensor.inc_subtensor

I use it in the following way:

embedding_grads = theano.grad(cost, embedding_output)
updates[embedding.W] = T.inc_subtensor(
    embedding.W[T.reshape(input_var, (N_BATCH * MAX_LENGTH,))],
    -LEARNING_RATE * T.reshape(embedding_grads, (N_BATCH * MAX_LENGTH, 300)))

It makes it possible to update only the embedding vectors of words that appear in the current mini-batch.
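
The numpy analogue of inc_subtensor's accumulate-on-duplicates behavior is np.add.at, which is exactly what a sparse embedding update needs:

import numpy as np

W = np.zeros((5, 3), dtype=np.float32)
idx = np.array([1, 1, 4])             # duplicate rows, as in a mini-batch
delta = np.ones((3, 3), dtype=np.float32)
np.add.at(W, idx, delta)              # row 1 accumulates twice, like inc_subtensor
print W[1, 0], W[4, 0]                # 2.0 1.0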

Better error message for cgt.broadcast

Right now, when the dimensions don't match, we see an AssertionError with no error message. This should be made more informative, as I can imagine it being a common error.

Stack only takes scalars

cgt.stack only takes scalar inputs; there is currently no easy way to stack tensors.

The Python backend actually uses np.array to do the stacking, so it will work as-is. The C++ backend would need some modification to handle arbitrary stacking, though.

Support a matrix of indexes to implement usual embedding layer.

I've tried to implement an embedding layer like the one in Lasagne (https://github.com/Lasagne/Lasagne/blob/master/lasagne/layers/embedding.py) and discovered that I can't do this:

import numpy as np
import cgt

def main():
    input_var = cgt.matrix('input', dtype=np.int64)
    w_glove = cgt.shared(np.zeros((1000, 300), dtype=np.float32))
    output = w_glove[input_var]
    f = cgt.function([input_var], [output])

    input = np.ones((3, 3), dtype=np.int32)
    print f(input)

if __name__ == '__main__':
    main()

w_glove[input_var] should return a tensor3. In other words, it should replace sequences of word indices with sequences of word vectors: a usual NLP operation.

As a result, I have to replace output = w_glove[input_var] with:
output = w_glove[cgt.flatten(input_var)]
output = cgt.reshape(output, (3, 3, 300))

i.e., flatten the indices into a flat index list and reshape the result back to a tensor3 afterwards. I believe cgt can do this more efficiently.
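
A hedged generalization of the same workaround that avoids the hard-coded (3, 3), assuming cgt.reshape accepts symbolic shape components:

flat = w_glove[cgt.flatten(input_var)]
output = cgt.reshape(flat, [input_var.shape[0], input_var.shape[1], 300])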


Elementwise op with arbitrary arity

It could be useful to unify and generalize the elementwise unary and binary ops to support arbitrary numbers of arguments. This would allow for some nice graph simplifications; in particular, compositions of elementwise operations (for example, common implementations of ReLU) could be performed with a single CUDA kernel launch. A general elementwise arithmetic op could store a symbolic expression, which could even be simplified with a library like SymPy.
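
To illustrate the last point, the symbolic expression stored by such an op could be simplified before code generation, e.g. with SymPy (reusing the expression from the sqrt-profiling issue above):

import sympy as sp

x, y = sp.symbols('x y')
expr = (x**2) * (y/x) * (x**3 / y**3)
print sp.simplify(expr)   # x**4/y**2: one fused elementwise kernel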
