bolt-project / bolt Goto Github PK

View Code? Open in Web Editor NEW

159.0 16.0 20.0 7.39 MB

Unified interface for local and distributed ndarrays

Home Page: http://bolt-project.github.io/

License: Apache License 2.0

Python 100.00%

bolt's Introduction

bolt

python interface to local and distributed multi-dimensional arrays

The goal of bolt is to support array manipulation and computation whether data are small, medium, or very, very large, through a common and familiar ndarray interface. The core is 100% Python. Currently backed by numpy (local) or spark (distributed) and will expand to others in the future.

View the documentation at bolt-project.github.io

Requirements

Bolt supports Python 2.7+ and Python 3.4+. The core library is 100% Python, the only primary requirement is numpy, and for spark functionality it requires 1.4+ which can be obtained here.

Installation

$ pip install bolt-python

bolt's People

Contributors

Stargazers

Watchers

Forkers

jwittenbach andrewosh gitter-badger yonglehou anirbanniara shyamalschandra steveais boazmohar kr-hansen mindis gdtm86 djoffrey kmader ifarhankhan jhxu0416 beautifulnow1992 feitianyiren hymanhuang

bolt's Issues

Support mixed indexing (lists and slices)

We should be able to use mixed list and slice indexing, for example,

barray = bolt.array(arange(24).reshape(2,3,4), context=sc)
barray[:,[0,2],:]

which currently throws a NotImplementedError

for reference, it should return exactly the same thing as numpy, e.g.

a = barray[:,[0,2],:].toarray()
b = barray.toarray()[:,[0,2],:]
assert allclose(a, b)

thanks to @boazmohar for flagging, cc @jwittenbach

Support negative indices in getitem methods

We should follow the numpy conventions for negative indices in slices exactly here.

Python 3 Compatibility

We must be Python 3 compatible.

Implement reshape and transpose on Spark array

These will both be general, higher-order functions that under the hood call some combination of swap, keys.reshape, keys.transpose, values.reshape, and values.transpose.

For at least some specified reshapes and transposes, it can be done entirely without swap, which will be more efficient. The goal should be finding, for any given specified reshape or transpose, how to compose these operators, to maximize efficiency, under the constraint that the split remains the same.

transpose
reshape

Integration with ibis and dato?

Thanks for the work on this cool project!

Wes just announced a new open source python distributed/ OOC data analysis framework called Ibis

Not sure if it is just an expression engine, but the workflow does have planned support for native python LLVM UDFs in impala with possible other execution engines. This is great because we will have a native python non JVM distributed solution.

Also Dato announced out of memory numpy arrays that work with scikit learn. Not sure of the license.

The landscape is starting to fragment a bit and I wonder if targeting or working with one of these projects (or dask) would help focus efforts. On the other hand, the distribution helps to prototype different approaches.

Other question: Is bolt going to use the Numpy array protocol? It would be great if we can use bolt as a drop in numpy replacement in legacy python code like the dato arrays.

Implement squeeze

Add tests for constructor.py

Opening this issue since #17 was closed before adding tests.

Can't filter by key

filter always wraps the function I give it to see only values:
https://github.com/bolt-project/bolt/blob/master/bolt/spark/array.py#L220-L222

It would be nice to be able to pass a with_keys argument to not have that happen.
@jwittenbach what do you think?

Implement general axis swapping function

This function needs to take axes from the values and move them into the keys (or vice versa).

Make Local and Spark Arrays have identical APIs

For testing and developing functions it would be nice if Local and Spark Arrays had identical functions. This might be easiest by a faking SparkContext and RDD and using BoltArraySpark for everything.

Fail early on out of bounds indexing

Currently, indexing out of bounds along an axis only throws an error upon conversion to a local array, e.g. with the array

barray = bolt.array(arange(24).reshape(2,3,4), context=sc)

this works

barray[0:1,:,:].toarray().shape
>> (1, 3, 4)
barray[1:2,:,:].toarray().shape
>> (1, 3, 4)

but this fails

barray[2:3,:,:].shape
>> (1, 3, 4)
barray[2:3,:,:].toarray().shape
>> ValueError: total size of new array must be unchanged

the initial indexing appears to succeed and shows an array of shape (1,3,4) but after toarray() there is an error because the actual shape and the inferred shape differs. By comparison, in numpy, it results in an empty array

barray.toarray()[2:3,:,:].shape
>> (0, 3, 4)

In this case, I think that unlike numpy we should throw an error immediately upon out-of-bounds indexing.

Thanks for flagging @boazmohar, cc @jwittenbach

Add a newsplit argument to transpose

Currently, transpose is significantly less general than swap because it keeps the the split the same. For example, if we want to go from a (1000) x (500,500) array to a (500,500) x (1000) array, we can do that with a swap via swap((0),(0,1)), but with transpose we can only get to (500) x (500,1000) via transpose(2,1,0).

If transpose supported a newsplit or similar keyword arg, we could get to the same result with transpose, which is often more intuitive than calling swap directly.

cc @jwittenbach

ndarray API necessary to wrap bolt with xray

cc @izaid who's interested in this for dynd

Sparse vectors?

Hi,

Is there support for sparse vectors using this library? For example, I may want to construct a distributed, sparse rating matrix consisting of (num_users, num_items) which represents the set of items that each user has rated.

I would like to perform row-wise operations on this without representing the entire num_items dense vector.

Support dtype in Spark array

The BoltArraySpark needs a dtype, to be specified (or inferred) during construction, and as an attribute that's appropriate propagated with shape etc. Basically, it should behave like the dtype on a numpy array in every way that's relevant.

@andrewgiessel want to take a stab at this?

Add a chunked bolt array class

For some operations likely to be common on bolt arrays, for example, applying some vectorized operation via a map, it would be useful to allow the user to glom the array, concatenating values into groups of values. This is essentially identical to Spark's own glom, but could additionally allow specification of glommed arrays of arbitrary size (with some remainder).

The result of such gloming is not necessarily a bolt array (because its irregular), but we could allow a restricted subset of operations, perhaps even just map, and then allow the user to unglom to work with it further.

Implement map / reduce

Need to add map and reduce on both BoltArraySpark and BoltArrayLocal. In either case, we should allow for an arbitrary specification of axes (e.g. a single number, or a list of numbers), closely following the numpy conventions in the case of reduce, and to some degree inventing our own conventions in the case of map.

map on spark
reduce on spark
map on local
reduce on local

Add a cache

This one should be easy, just call out to the Spark method.

Sorting issues in BoltArraySpark

As a follow up on the discussion with @freeman-lab in thunder-project/thunder#325
I suggest the following changes:

Renaming _ordered to _sorted
Adding a sorted property returning _sorted
Adding a sort() function that would check _sorted and sortByKey the rdd if needed
@jwittenbach what do you think?
If you approve I will try to implement.

Add constructors for ones, zeros, and random

Currently we can only construct a bolt array from an existing array. We should add constructors for options like ones, zeros, and random.

Possibly consider borrowing logic from dask.array for wrapping existing numpy functionality, though in the Spark case it may be better to have the construction lazy, i.e. rather than construct a massive local array and then parallelize, parallelize a function which creates the arrays. Also maybe look at how random array construction happens within Spark + Mllib.

Implement getitem on BoltArraySpark

At a minimum we need to provide support for single integers or slicing along each dimension.

It would also be good to support integer list or boolean array indexing.

basic indexing
advanced indexing

Array size dependent failure in toarray()

I see an array size dependent failure in toarray()
This works as expected:

data = thunder.images.fromtif('/tier2/svoboda/users/Aaron/151124_BMWR30/Run1',
                              start=0,stop=2,nplanes=29, engine=sc)
data.toarray().shape()
out: (439, 36, 72, 29)

data2 = data[:, :, :, :]
data2.toarray().shape
out: (439, 36, 72, 29)

Changing the number of files to read from 2 to 20 makes this fail:

data = thunder.images.fromtif('/tier2/svoboda/users/Aaron/151124_BMWR30/Run1',
                              start=0,stop=20,nplanes=29, engine=sc)
data.toarray().shape()
out: (4645, 36, 72, 29)

data2 = data[:, :, :, :]
data2.toarray().shape <---- error

The error is:

ValueError                                Traceback (most recent call last)
<ipython-input-39-8a17c79c2259> in <module>()
      4     return x.reshape(-1, 1).squeeze()
      5 data3 = data2.map(flat)
----> 6 data3.toarray().shape

/groups/svoboda/home/moharb/thunder/thunder/images/images.pyc in toarray(self)
    159         Return a local array
    160         """
--> 161         out = asarray(self.values)
    162         if out.shape[0] == 1:
    163             out = out.squeeze(axis=0)

/usr/local/python-2.7.6/lib/python2.7/site-packages/numpy/core/numeric.pyc in asarray(a, dtype, order)
    460 
    461     """
--> 462     return array(a, dtype, copy=False, order=order)
    463 
    464 def asanyarray(a, dtype=None, order=None):

/groups/svoboda/home/moharb/bolt/bolt/spark/array.pyc in __array__(self)
     27 
     28     def __array__(self):
---> 29         return self.toarray()
     30 
     31     def cache(self):

/groups/svoboda/home/moharb/bolt/bolt/spark/array.pyc in toarray(self)
    947         """
    948         x = self._rdd.sortByKey().values().collect()
--> 949         return asarray(x).reshape(self.shape)
    950 
    951     def tordd(self):

ValueError: total size of new array must be unchanged

Also: data2.first() and data2.count() work as expected.

I am using the master from jwittenbach/bolt

Thanks!!
Boaz

Change bolt arrray name

Change name of blot array from barray to array

Change constructor to use axes instead of split

We should allow an axis specification that determines which axes end up in the keys, with the default being the first axis. Under the hood, we'll transpose where necessary and then use the current construction.

In the "split" terminology, for a 3D array, a split of 0 would mean an axis of 0, a split of 1 would mean an axis of (0,1), and a split of 2 would mean an axis of (0,1,2), but other options would be allows as well.

Split set incorrectly after filter operation

When calling filter in spark mode, all axes that are to be filtered over are moved to the keys and all other axes are moved to the values. The filter also linearizes all axes that are filtered over. This means that the split in the resulting array should always be 1. However, it is currently set otherwise. While this bug is not fatal when only perform the filter, it does end up causing an error if later operations are performed that rely on the split being set correctly (e.g. map).

Consider changing default axes for map / reduce / filter

The current default for these operations on Spark arrays is axis=(0,), which may incur a swap to distribute along that axis (if it isn't already). The default could instead be axis=None which would mean apply over the distributed axes (whatever they are) and would never incur a swap.

Suggested by @shoyer, thanks!

This generally seems like a more friendly default, the only issue arises not with map but with reduce, when considering sequences of mixed operations. For example, in the following two cases where the map is a no-op,

data = ones((2, 3, 4), sc)
data.map(lambda x: x, axis=(0,)).reduce(add)
data.map(lambda x: x, axis=(0,1)).reduce(add)

if the default for reduce is over the partitioned axes, the answer will be different in the two cases, whereas if the default is over axis=(0,) it will be the same.

I can see an argument that these really should be the same with the default parameters, but curious to get other opinions. Another option is using different defaults for map/filter and reduce.

cc @andrewosh

Referencing properties within other properties

In the BoltArraySpark, many of the properties reference other properties. Right now, they are often doing this by going straight to the under-scored member behind the property. In general, unless there is a good reason for this, it seems to me that we should default to referencing the property itself, in case some piece of functionality there is necessary.

For example, in BoltArraySpark we currently have

@property
def shape(self):
    return self._shape

@property
def size(self):
    return prod(self._shape)

In the definition of size, it would make more sense to me to reference self.shape rather than self._shape. Right now there is no difference, but in the future, the code for shape could add more complicated functionality used in determining the shape, and size would not pick up on this.

I know this is minor, but I thought I would put it down while I was thinking about it.

Finish docstrings

The following modules need significant documentation (in order of importance):

Bug in Shapes.transpose

The transpose functions on the Shapes objects currently use the shape as the "old axes", when really they should be using [0, ..., len(shape)]. This causes incorrect behavior.

Filter fails to wrap resulting keys in a tuple

Thunder 1.0.0 with Bolt fails on a large dataset

In Thunder 0.6 the default reduce was treeReduce (see stats function in rdds\data.py).
This enabled us to use mean on a volume of 100x1024x1024.
In Thunder 1.0.0 using Bolt this now fails:

data = thunder.images.fromrandom((100,1024,1024,100), engine=sc, npartitions=100)
data.mean()

Load a bolt array from a collection of HDF5 files

A direct API for storing and loading from HDF5 files could make sense.

Alternatively, and actually more attractive for my use cases (e.g., hooking into niche file formats like netCDF), would be the ability to create a bolt array directly from an indexable object, similarly to dask.array.from_array.

Add astype

Only needed on Spark bolt array

Display functions broken / in need of improvement

Printing vs Returning

The display method returns a representation of the array as a string that, while useful for printing, is not particularly well-formatted for viewing on it's own. It might be better to actually do the printing and not return anything.

As an example:

import bolt as blt
a = np.arange(24).reshape(2,3,4)
b = blt.barray(a)
b.display()

currently returns

'[[[ 0  1  2  3]\n  [ 4  5  6  7]\n  [ 8  9 10 11]]\n\n [[12 13 14 15]\n  [16 17 18 19]\n  [20 21 22 23]]]'

while

print b.display()

returns the nicer

[[[ 0  1  2  3]
  [ 4  5  6  7]
  [ 8  9 10 11]]

 [[12 13 14 15]
  [16 17 18 19]
  [20 21 22 23]]]

It seems to me that we should just do the printing rather than handing the user this unwieldy string.

Key/Value Size Match Bug

Here the behavior is broken entirely in a special case. Currently, display tries to call take on the underlying RDD, recast this as an ndarray, and then turn it into a string to return. However, in special cases where two of the dimensions have the same size, this coercion fails. I think this is because asarray greedily tries to stack as many dimensions as possible. The following code is an example:

import bolt as blt
a = np.arange(60).reshape(2,3,2,5)
b = blt.barray(a, sc, split=2)
b.display()

This code gives the following error:

In [170]: b.display()
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-170-5bff50e7829e> in <module>()
----> 1 b.display()

/Users/wittenbachj/code/bolt/bolt/spark.py in display(self)
    350
    351     def display(self):
--> 352         return str(asarray(self._rdd.take(10)))

/Users/wittenbachj/anaconda/lib/python2.7/site-packages/numpy/core/numeric.pyc in asarray(a, dtype, order)
    460
    461     """
--> 462     return array(a, dtype, copy=False, order=order)
    463
    464 def asanyarray(a, dtype=None, order=None):

ValueError: could not broadcast input array from shape (2,5) into shape (2)

This appears to be a general problem with NumPy that doesn't have a straight-forward solution: http://numpy-discussion.10968.n7.nabble.com/converting-a-list-of-tuples-into-an-array-of-tuples-td39698.html. It might be better just to write a custom bit of code for displaying to solve this.

Implement clip on Spark Bolt Array

http://bolt-project.org/ down?

I get a "NOTICE: This domain name expired on 7/4/2018 and is pending renewal or deletion." when I go there!

Indexing features

Currently indexing with negative indices only works for integer indices, and even here there are issues with incorrect behavior (thunder-project/thunder#275). We should also support negative indices with slices (start, stop, and step).

Use underscores for multiple words throughout

type of elements in shape

Usually, if a is a BoltArray, then a.shape returns a list with members of type int. However, after chunking and then calling either keys_to_values or values_to_keys, the list returned by a.shape have members of type numpy.int64.

This inconsistency has be a problem if a user tries to serialize this data (such as in Thunder, when writing out the shape as part of metadata), as Python's json package does not know how to handle this case.

Consider a more flexible chunking model for distributing spark arrays

Looking through the source code, it appears that the internal design of BoltArraySpark distributes arrays by splitting them along axes.

This is a similar but less flexible than the distributed data model of dask.array, which allows for partial chunking along axes. The BoltArraySpark model is equivalent to the dask.array chunking model if chunk sizes are restricted to size 1 or the full length of an array dimension.

In some cases, more flexible chunking (e.g., as implemented with the HDF5 file format) can be very convenient. For example, it allows for reasonable worst case performance with both space and time queries on 3D arrays with dimensions (x, y, t). Or with large 2D images (e.g., mosaics of satellite imagery), there really isn't any alternative to chunking along multiple dimensions at once.

Have you thought about this approach for your distributed spark arrays? Is this more flexible model more complex than you want to deal with? Or maybe you think it's unnecessary because of the high performance of Spark for reshuffling data?

Implement filter

Need to add a filter method to both BoltArraySpark and BoltArrayLocal. This should also take an axis argument specifying which axis along which to apply the filtering operation. Should be able to borrow much of the same logic as map.

filter on local
filter on spark