dgasmith / opt_einsum Goto Github PK

View Code? Open in Web Editor NEW

802.0 22.0 64.0 3.79 MB

⚡️Optimizing einsum functions in NumPy, Tensorflow, Dask, and more with contraction order optimization.

Home Page: https://dgasmith.github.io/opt_einsum/

License: MIT License

Python 99.07% TeX 0.72% Makefile 0.21%

python tensor contraction gpu-acceleration performance einsum tensor-contraction

opt_einsum's Introduction

Optimized Einsum

Optimized Einsum: A tensor contraction order optimizer

Optimized einsum can significantly reduce the overall execution time of einsum-like expressions (e.g., np.einsum, dask.array.einsum, pytorch.einsum, tensorflow.einsum, ) by optimizing the expression's contraction order and dispatching many operations to canonical BLAS, cuBLAS, or other specialized routines.

Optimized einsum is agnostic to the backend and can handle NumPy, Dask, PyTorch, Tensorflow, CuPy, Sparse, Theano, JAX, and Autograd arrays as well as potentially any library which conforms to a standard API. See the documentation for more information.

Example usage

The opt_einsum.contract function can often act as a drop-in replacement for einsum functions without further changes to the code while providing superior performance. Here, a tensor contraction is performed with and without optimization:

import numpy as np
from opt_einsum import contract

N = 10
C = np.random.rand(N, N)
I = np.random.rand(N, N, N, N)

%timeit np.einsum('pi,qj,ijkl,rk,sl->pqrs', C, C, I, C, C)
1 loops, best of 3: 934 ms per loop

%timeit contract('pi,qj,ijkl,rk,sl->pqrs', C, C, I, C, C)
1000 loops, best of 3: 324 us per loop

In this particular example, we see a ~3000x performance improvement which is not uncommon when compared against unoptimized contractions. See the backend examples for more information on using other backends.

Features

The algorithms found in this repository often power the einsum optimizations in many of the above projects. For example, the optimization of np.einsum has been passed upstream and most of the same features that can be found in this repository can be enabled with np.einsum(..., optimize=True). However, this repository often has more up to date algorithms for complex contractions.

The following capabilities are enabled by opt_einsum:

Inspect detailed information about the path chosen.
Perform contractions with numerous backends, including on the GPU and with libraries such as TensorFlow and PyTorch.
Generate reusable expressions, potentially with constant tensors, that can be compiled for greater performance.
Use an arbitrary number of indices to find contractions for hundreds or even thousands of tensors.
Share intermediate computations among multiple contractions.
Compute gradients of tensor contractions using autograd or jax

Please see the documentation for more features!

Installation

opt_einsum can either be installed via pip install opt_einsum or from conda conda install opt_einsum -c conda-forge. See the installation documentation for further methods.

Citation

If this code has benefited your research, please support us by citing:

Daniel G. A. Smith and Johnnie Gray, opt_einsum - A Python package for optimizing contraction order for einsum-like expressions. Journal of Open Source Software, 2018, 3(26), 753

DOI: https://doi.org/10.21105/joss.00753

Contributing

All contributions, bug reports, bug fixes, documentation improvements, enhancements, and ideas are welcome.

A detailed overview on how to contribute can be found in the contributing guide.

opt_einsum's People

Contributors

Stargazers

Watchers

opt_einsum's Issues

Support for get_symbol(116) in Python 2

get_symbol(x) currently fails in Python 2 for any x >= 116:

i = 116

    def get_symbol(i):
        """Get the symbol corresponding to int ``i`` - runs through the usual 52
        letters before resorting to unicode characters, starting at ``chr(192)``.

        Examples
        --------
        >>> get_symbol(2)
        'c'

        >>> oe.get_symbol(200)
        'Ŕ'

        >>> oe.get_symbol(20000)
        '京'
        """
        if i < 52:
            return einsum_symbols_base[i]
>       return chr(i + 140)
E       ValueError: chr() arg not in range(256)

Can we fix this by use unichr() and unicode strings in Python 2?

v2.2 patch notes

Gathering thoughts on the 2.1 patches notes and remaining TODO's.

Remaining items:

Unicode support for Python 2.7 #42.
Update changelog before releasing.
Docs passing updating doc strings to always contain a small example.

New features:

(#48) Intermediates can now be shared between contractions, see here for more details.
(#53) Intermediate caching is thread safe.

Enhancements:

(#48) Expressions are now mapped to non-unicode index set so that unicode input is support for all backends.
(#58) Adds tensorflow and theano with shared intermediates.

Bug fixes:

(#41) PyTorch indices are mapped back to a small a-z subset valid for PyTorch's einsum implementation.

Missing path_random in pip3 install

It seems like opt_einsum.path_random is missing from the pip install package.

Version: 2.3.2

>>> import opt_einsum
>>> opt_einsum.__version__
'v2.3.2'
>>> dir(opt_einsum)
['__builtins__', '__cached__', '__doc__', '__file__', '__git_revision__', '__loader__', '__name__', '__package__', '__path__', '__spec__', '__version__', '_version', 'backends', 'blas', 'compat', 'contract', 'contract_expression', 'contract_path', 'get_symbol', 'helpers', 'parser', 'paths', 'shared_intermediates', 'sharing']
>>> opt_einsum.path_random
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: module 'opt_einsum' has no attribute 'path_random'

Allow custom cost objective functions

Experimenting with #60 and #63 made me think that allowing custom cost objective functions would be neat. The current the cost function greedy uses is

size12 - size1 - size2

though as @fritzo notes, just using

size12

made some contractions more deterministic and thus easier to cache.

Generally you might have:

A * size12 - B * size1 - C * size2

with e.g. A, B, C random modifiers for #63. One might even try and learn the values using the contraction complexity / flop cost as a loss function.

Finish out input documentation

Input documentation was started in #68. However, the canonical einsum string and integer-based contractions specifications are not yet documented. The current stub can be found in docs/source/input_format.rst.

Greedy vs optimal scaling miss

Found this in a SO comment. Posting here to take a look at.

M = np.arange(35 * 37 * 59).reshape([35, 37, 59])
A = np.arange(35 * 51 * 59).reshape([35, 51, 59])
B = np.arange(37 * 51 * 51 * 59).reshape([37, 51, 51, 59])
C = np.arange(59 * 27).reshape([59, 27])

>>> path, desc = np.einsum_path('xyf,xtf,ytpf,fr->tpr', M, A, B, C, optimize="greedy");
>>> print(desc)
  Complete contraction:  xyf,xtf,ytpf,fr->tpr
         Naive scaling:  6
     Optimized scaling:  5
      Naive FLOP count:  3.219e+10
  Optimized FLOP count:  4.165e+08
   Theoretical speedup:  77.299
  Largest intermediate:  5.371e+06 elements
--------------------------------------------------------------------------
scaling                  current                                remaining
--------------------------------------------------------------------------
   5              ytpf,xyf->xptf                         xtf,fr,xptf->tpr
   4               xptf,xtf->ptf                              fr,ptf->tpr
   4                 ptf,fr->tpr                                 tpr->tpr

>>> path, desc = np.einsum_path('xyf,xtf,ytpf,fr->tpr', M, A, B, C, optimize="optimal");
>>> print(desc)
  Complete contraction:  xyf,xtf,ytpf,fr->tpr
         Naive scaling:  6
     Optimized scaling:  4
      Naive FLOP count:  3.219e+10
  Optimized FLOP count:  2.744e+07
   Theoretical speedup:  1173.425
  Largest intermediate:  1.535e+05 elements
--------------------------------------------------------------------------
scaling                  current                                remaining
--------------------------------------------------------------------------
   4                xtf,xyf->ytf                         ytpf,fr,ytf->tpr
   4               ytf,ytpf->ptf                              fr,ptf->tpr
   4                 ptf,fr->tpr                                 tpr->tpr

Release v2.1 patch notes

Gathering thoughts on the 2.1 patches notes and remaining TODO's.

Remaining items:

Update README me detailing backend support
Switch setup.py to correctly read README.md with new PyPi markdown support (pypi markdown support).
Add a note about citation to DOCS and end of the README.
auto path option for optimal for n<4.

opt_einsum continues to improve its support for additional backends beyond NumPy with PyTorch

We have also published the opt_einsum package in the Journal of Open Source Software. If you use this package in your work, please consider citing us!

New features:

PyTorch backend support
Tensorflow eager-mode execution backend support

Enhancements:

Intermediate tensordot-like expressions are now ordered to avoid transposes.
CI now uses conda backend to better support GPU and tensor libraries.
Now accepts arbitrary unicode indices rather than a subset.
New auto path option which switches between optimal and greedy at 4 tensors.

Bug fixes:

Fixed issue where broadcast indices were incorrectly locked out of tensordot-like evaluations even after their dimension was broadcast.

FR: separate 'auto' strategy out of contract_path

The 'auto' strategy is really nice, but it requires actual arrays because it is based into contract_path. Sometimes I want to get a path based merely on shapes. It should be easy to factor out the logic from contract_path to a new paths.auto that conforms to the standard shape-based path interface.

reusable einsum expressions

Essentially, I'm using opt_einsum for some tensor network calculations, and a significant proportion of time is spent in contract_path, even for relatively large sized tensors and pre-calculated paths.

Specifically, a huge number of contractions with the same indices and shapes are performed over and over again and I already cache these using the path from contract_path. However, it seems even with the path supplied, a large amount of time is spent parsing the input, in can_blas, parse_einsum_input etc.

My suggestion would be something like:

shapes = [(2, 3), (3, 4), (4, 5)]
my_expr = einsum_expression("ab,bc,cd->ad", *shapes)

for _ in many_repeats:
    ...
    x, y, z = (rand(s) for s in shapes)
    out = my_expr(x, y, z)
    ...

I.e. it would only accept arrays of the same dimensions (and probably an out argument) and would otherwise skip all parsing, using a contraction_list stored within the function.

Anyway, I'm about to knock something up to test for myself, but thought it might be of more general interest.

Cannot provide custom `tensordot` method

As far as I can tell there is no way to provide a custom tensordot method. I tried pretty hard, but the code path seems to fail on examples like contract("abc,ab-> ac").

Error when using PyTorch backend on large contraction

Great project! I'm looking forward to using it in my research.

I came across this when I was trying to optimize a tensor (as in the quimb documentation) and wanted to try my hand at a contraction manually. My machine was using 88GB RAM to perform the contraction, which is simply too much, and I wanted to see which step of the contraction was causing a problem, since none of my tensors are very large.

When I tried to put together a MWE, everything seems to work great until I actually want to do the contraction, using the PyTorch backend. It's important for my application to use PyTorch because I need to do the optimization step (are there other backends that would let me do that?) for a time-evolved state. I get the error: RuntimeError: only lowercase letters a-z allowed as indices.

Now, this is obviously a limitation in PyTorch itself, not opt_einsum, but I'm wondering if there might be a clever workaround that I've missed? I could imagine that it would be possible to take a pre-computed contraction path and

Limit the number of indices in any one step (maybe not, now that I think about it more), and/or
Cleverly re-index so that separate calls to einsum use indices that start over at "a", which might mitigate or eliminate the problem, depending on how many indices are contracted in a given step

If it helps, this is the contraction I was trying, which I generated for a 2-dimensional quimb TensorNetwork:

abc,dbef,gehi,jhkl,mkn,aopq,drpst,gusvw,jxvyz,mAyB,oCDE,rFDGH,uIGJK,xLJMN,AOMP,CQRS,FTRUV,IWUXY,LZXÀÁ,OÂÀÃ,QÄÅ,TÄÆÇ,WÆÈÉ,ZÈÊË,ÂÊÌ,ÍÎcÏ,ÐÎÑfÒ,ÓÑÔiÕ,ÖÔ×lØ,Ù×nÚ,ÍÛÜqÝ,ÐÞÜßtà,Óáßâwã,Öäâåzæ,ÙçåBè,ÛéêEë,ÞìêíHî,áïíðKñ,äòðóNô,çõóPö,é÷øSù,ìúøûVü,ïýûþYÿ,òĀþāÁĂ,õăāÃĄ,÷ąÅĆ,úąćÇĈ,ýćĉÉĊ,ĀĉċËČ,ăċÌč,ĎďÏ,ĐďđÒ,ĒđēÕ,ĔēĕØ,ĖĕÚ,ĎėĘÝ,ĐęĘĚà,ĒěĚĜã,ĔĝĜĞæ,ĖğĞè,ėĠġë,ęĢġģî,ěĤģĥñ,ĝĦĥħô,ğĨħö,ĠĩĪù,ĢīĪĬü,ĤĭĬĮÿ,ĦįĮİĂ,ĨıİĄ,ĩĲĆ,īĲĳĈ,ĭĳĴĊ,įĴĵČ,ıĵč->

Thanks in advance for any help! Depending on the feasibility of the above, I might be able to help out in developing functionality that would allow contractions like this one, since they're important for my project.

`import opt_einsum` fails due to jax.py

Importing opt_einsum 3.0.0 fails for me due to the new jax backend:

In [1]: import opt_einsum
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-1-c4cbb6c93042> in <module>()
----> 1 import opt_einsum

.../opt_einsum/__init__.py in <module>()
      7 from . import paths
      8 from . import path_random
----> 9 from .contract import contract, contract_path, contract_expression
     10 from .parser import get_symbol
     11 from .sharing import shared_intermediates

.../opt_einsum/contract.py in <module>()
      8 import numpy as np
      9 
---> 10 from . import backends
     11 from . import blas
     12 from . import helpers

.../opt_einsum/backends/__init__.py in <module>()
      5 # Backends
      6 from .cupy import to_cupy
----> 7 from .dispatch import (get_func, has_einsum, has_tensordot, build_expression, evaluate_constants, has_backend)
      8 from .tensorflow import to_tensorflow
      9 from .theano import to_theano

.../opt_einsum/backends/dispatch.py in <module>()
     10 
     11 from . import cupy as _cupy
---> 12 from . import jax as _jax
     13 from . import tensorflow as _tensorflow
     14 from . import theano as _theano

.../opt_einsum/backends/jax.pyc in <module>()
     13 
     14     @to_backend_cache_wrap
---> 15     @jax.jit
     16     def to_jax(x):
     17         return x

AttributeError: 'module' object has no attribute 'jit'

I think the import jax at https://github.com/dgasmith/opt_einsum/blob/master/opt_einsum/backends/jax.py#L12 is importing itself since the file is named jax, instead of importing the installed library jax.

Once this is fixed, I suspect there'll be a circular import, since jax imports opt_einsum! https://github.com/google/jax/blob/master/jax/numpy/lax_numpy.py#L38 I'm not sure how best to resolve this -- the best I can come up with is one of us adds a "lazy" import inside a function, instead of at the top of the file, but maybe you can think of something more elegant. Maybe since jax already calls opt_einsum the jax backend isn't necessary?

BTW, glad to see you're interested in jax! :)

dask backend attempts to use non-existent dask einsum function on an einsum broadcasting example

From my reading of the code, broadcasting disables blas, which then defers to the dask einsum backend, which does not exist. This happens on master c235f0c

import dask.array as da
from opt_einsum import contract

# Produce random dask floats
rf = lambda s,c: da.random.random(size=s, chunks=c)

I = rf((10, 10, 10, 10), (5,5,5,5))
C = rf((10, 10), (5,5))

# Example in the docs, this works
result = contract('ea,fb,abcd,gc,hd->efgh', C, C, I, C, C,
                               optimize='greedy', backend='dask')

# dim sizes and their chunks
s = 10
sc = (5,5)
t = 5
tc = (2,3)
a = 7
ac = (3,4)
c = 4
cc = (2,2)

# This fails
result = contract('sta,tac->stac',
    rf((s,t,a), (sc,tc,ac)),
    rf((t,a,c), (tc,ac,cc)),
    backend='dask')

produces

$ python test_opt_einsum.py 
Traceback (most recent call last):
  File "test_opt_einsum.py", line 26, in <module>
    backend='dask')
  File "/home/sperkins/venv/opt_einsum/local/lib/python2.7/site-packages/opt_einsum/contract.py", line 423, in contract
    return _core_contract(operands, contraction_list, backend=backend, **einsum_kwargs)
  File "/home/sperkins/venv/opt_einsum/local/lib/python2.7/site-packages/opt_einsum/contract.py", line 484, in _core_contract
    new_view = _einsum(einsum_str, *tmp_operands, backend=backend, **einsum_kwargs)
  File "/home/sperkins/venv/opt_einsum/local/lib/python2.7/site-packages/opt_einsum/contract.py", line 280, in _einsum
    fn = backends.get_func('einsum', kwargs.pop('backend', 'numpy'))
  File "/home/sperkins/venv/opt_einsum/local/lib/python2.7/site-packages/opt_einsum/backends.py", line 39, in get_func
    fn = _import_func(func, backend)
  File "/home/sperkins/venv/opt_einsum/local/lib/python2.7/site-packages/opt_einsum/backends.py", line 21, in _import_func
    raise AttributeError("{} doesn't seem to provide the function {}".format(backend, func))
AttributeError: dask doesn't seem to provide the function einsum

NumPy Dispatch Mechanisms

It would be a good idea to keep an eye on NEP 18 "A dispatch mechanism for NumPy’s high-level array." Depending on the implementation of the NEP, this could help us avoid extra code or may require us to rework how our backend technology works.

Python2 Deprecation Plan

As Python2 will cease to be supported in 2020 and many Python projects have cut their last release of a Py2 compliant module (NumPy, SciPy, Pandas, iPython, SciKitLearn, etc) we may want to consider doing the same. The cleanup in this project is pretty low when considering the removal of Py2, so I do not see this as very urgent. I would be curious if anyone had thoughts as to this direction.

@fritzo @jcmgray

let contract_expression use interleaved input

Currently, contract_expression only accepts the 'equation' style input. This is an easy fix that I will do shortly. As part of that I can flesh out the docs on input (#70), unless you are already planning that @dgasmith ?

Support more backends in shared_intermediates

Following up on #43 (comment), it would be nice if with shared_intermediates(): supported more backends. This could start by refactoring towards a common to_backend function that is memoized.

Status

Sharing modulo commutativity, associativity, transpose

The shared_intermediates memoization mechanism #43 already implements sharing modulo permutation of tensor operands and alpha-renaming of the equation. @dgasmith and @jcmgray suggested in #43 (comment) that we could further quotient by tensor axis permutation. Note that this breaks into transposing the output (a bit easier), and transposing all of the inputs (a bit tougher).

Status

hashing einsum modulo alpha-renaming of equation
hashing einsum modulo reordering of inputs
hashing tensordot modulo reordering of inputs
hashing einsum modulo transpose of output
hashing tensordot modulo transpose of output
hashing tensordot modulo transpose of inputs
hashing einsum modulo transpose of inputs

Add some github tags to this repository

@dgasmith, adding some tags might aid the discoverability of opt_einsum? e.g tensor, contraction, python come to mind for starters.

Support multi-threading in shared_intermediates

Following up on #43 (comment), the shared_intermediates cache is currently not locked.

One option would be to assign the thread id of a unique owner of the cache, such that at least thread conflicts could be detected and reported, rather than mere silent performance degradation.

optimal contraction path

I tried to use opt_einsum to calculate the following tensor contraction. But the contraction path seems non-optimal.

res1 = opt_einsum.contract("abc,bdef,fghj,cem,mhk,ljk -> adgl",A,B,C,D,E,F, path='optimal')
pathinfo = opt_einsum.contract_path("abc,bdef,fghj,cem,mhk,ljk -> adgl" ,A,B,C,D,E,F, path="optimal")
The "optimal path" printed is 

[(3, 4), (0, 1, 2, 3, 4)]
Complete contraction:  abc,bdef,fghj,cem,mhk,ljk->adgl
Naive scaling:  12
Optimized scaling:  11
Naive FLOP count:  3.075e+09
Optimized FLOP count:  2.050e+08
Theoretical speedup:  15.000
Largest intermediate:  6.250e+02 elements
--------------------------------------------------------------------------------
scaling   BLAS                  current                                remaining
--------------------------------------------------------------------------------
   5      GEMM            mhk,cem->ckeh             abc,bdef,fghj,ljk,ckeh->adgl
  11     False ckeh,ljk,fghj,bdef,abc->adgl                               adgl->adgl

My question is that why the second contraction doesn't contract pairwise. T
I also use tensordot to contract it pairwise. It's much faster.
I guess the reason is that though in the pairwise tensordot, the total FLOPS is larger, but the speed is much faster.

Manual and automatic slicing

Tensor slicing (essentially explicitly doing an outer sum over some indices rather than including them in the pairwise contractions) is a pretty useful way of decreasing the memory cost of performing a contraction (at some computational cost increase) in order to fit it in memory or on the GPU and/or massively parallelizing it.

I thought I'd chuck out the following kind of functionaliy and see if it might be useful to people?

Be able to supply a list of which indices to slice over contract('abm,bcm,cd', ... slice=['m'])
Return an iterator over the slices so that you can perform them in parallel as you wish, maybe contract_sliced
Perform some inference on a path to determine which indices should be sliced (generally the ones that appear on the largest intermediate). You could specific the maximum memory, and/or the minimum number of slices etc.

I proof of principle versions of all of these which could be starting point (for me or someone else...). Here's what a full example might look like:

# find an initial path
path, info = oe.contract_path(eq, *arrays, optimize='random-greedy')

slices = info.suggest_slices(max_memory=2**20, min_num_slices=None)
print(info.slice_statistics(slices))
# slice: ['a', 'd', 'e']
# total size: 64 
# peak memory reduction: 30.45x
# flops increase 2.53x etc.

# perform contraction EDIT: (not sure about this general syntax)
sliced_arrays = oe.gen_sliced_arrays(eq, *arrays, slices=slices)

# each item of sliced_arrays will be the input tensors with some
#    combination of the ['a', 'd', 'e'] dimensions selected
sum(my_parallel_pool.map(
    lambda x: oe.contract(eq, *x, optimize=path),
    sliced_arrays
)

Help required optimizing the use of contract_expression

Hi,

Thank you for this great library - I see a significant improvement in my performance already.

I'm facing a specific case where I got 4 matrices where only one of them is changing (real-time application). I'm successfully using the contract_expression like that:

import numpy as np
import opt_einsum as oe

t = 51
p = 51
r = 27
f = 56
x = 35
y = 39
A = np.arange(x * t * f).reshape([x, t, f])
B = np.arange(p * y * t * f).reshape([p, y, t, f])
C = np.arange(f * r).reshape([f, r])
D_shape = (y, x, f)

expr = oe.contract_expression('yxf,xtf,pytf,fr->tpr',
                            D_shape, A, B, C,
                            constants=[1, 2, 3],
                            optimize='optimal')

# this is called multiple times
result = expr(D)  # D.shape == D_shape

But this doesn't yield performance improvement compared to default np.einsum.

In this scenario I can easily np.transpose A, B, C (and even D) in a way that'll yield the best path. I can change the order of the expression, and basically play with is as I'd like in order to get to the optimal path.

Is there a way to do that using the library? I don't have intuition regarding what will yield the best performance.

Thanks!

Porting new path tech to NumPy

From discussion here I think we need to consider porting the new path technology over to NumPy. The greedy path has been significantly updated since the original port and the total FLOP comparisons should be much more accurate with the new FLOP counter.

I see two options:

Patch up the greedy path as it currently stands in NumPy for now.
Replace current path technology with opt_einsum tech completely. This will be ~500 lines changed.

Option 1 is a bit of a stop gap while option 2 will replace a significant amount of technology and will alter result times slightly. I also worry a bit as the use cases in NumPy proper are enormous, as einsum is one of the most flexible functions available it is not entirely clear to me that we are properly covering all use cases.

@jcmgray are you ok with me adding chunks of your code to numpy/numpy (with you on the author list of course) or would like to take a try at it?

@charris do you have an opinion between option 1/2?

Separate symbolic path optimization from tensor backend?

I really like that opt_einsum provides a pure-python dispatch mechanism that we can essentially use for symbolic computation. I'm opening this issue to discuss how we might further separate the path optimizer (which is purely symbolic) from the backend dispatch mechanism. (They are already mostly independent.)

Examples

Here are a couple cool examples of purely symbolic path optimization, unrelated to tensor contraction:

Kevin Murphy's variable elimination code that uses np.einsum_path as an alternative to the Junction Tree algorithm for inference in graphical model. When these models are limited to discrete random variables, this example is equivalent to tensor contraction; however it generalizes to e.g. Gaussian variable elimination which is distinct from tensor contraction and really wants a different objective function.
@eb8680 and I are similarly using the greedy optimizer to optimize sum-product expressions involving not only sums but integrals, again typically Gaussian integrals.

One interesting aspect of both these examples is that, whereas tensor contraction cost is exponential space in time in the number of dimensions (of say fixed size), Gaussian contraction is only quadratic in space and cubic (or just sub-cubic) in number of dimensions. Thus Gaussian contractions should use a different FLOP computaiton.

Other backends: tensorflow, theano, sparse, dask etc

What do you think about adding other (optional) backends?

I have this branch for gpu support here which was a fairly simple tweak to test using gpus with tensorflow, theano and cupy. It would also be very simple to add more backends since mostly they just need to implement tensordot and transpose (only einsum for non-blas operations).

Anyway here's a brief summary of the possible ndarray libraries:

Library	Tensordot?	Einsum?
tensorflow	yes	yes
theano	yes	no
cupy	yes	yes
dask	yes	not yet?
sparse	yes	not yet?
tblis python	would be easy to implement I think?	yes

tensorflow, theano and I guess dask fit nicely with ContractExpression since they build/compile the graph of all the operations first.

FR: cache hit/miss data

It would be nice to have something like this (partly copied from the document)

# copied
with shared_intermediates() as cache:
    marginals = {output: contract('{}->{}'.format(inputs, output), *factors)
                 for output in 'abcdef'}
# new ideas
print(cache.hit)
print(cache.miss)

If I understand correctly cache is now a simple dictionary and no direct operaetion on the variable is documented, so adding some new attributes won't be very difficult.

Use best of many random greedy paths

In quantum computing simulations particularly, finding the optimal contraction sequence has a huge effect on what is computable - the amount of memory and time required is exponential in how 'bad' the path found is. The greedy approach is generally worse than the optimal approach but it is cheap to compute (and even cheaper with #60!).

If we introduce a random greedy algorithm by tweaking the cost objective function, we get a variety of different paths with varying costs and we can just pick the best one. I had a little play with contracting a 49 qubit depth 20 circuit (661 tensors overall) and randomly adjusting each cost by +- 5%.

The following are 'contraction complexities' (maximum rank of tensor produced - 2^this is the memory required which is also a rough proxy for computation time) on the left and frequency on the right from 100 runs.

[(25, 1),
 (27, 1),
 (28, 3),
 (29, 6),
 (30, 10),
 (31, 16),
 (32, 6),  # <- this is what the current greedy finds
 (33, 9),
 (34, 21),
 (35, 8),
 (36, 8),
 (37, 1),
 (38, 4),
 (39, 2),
 (40, 1),
 (42, 3)]

So the best contraction found vs the current contraction found requires ~128x less memory and computation, and turns the simulation from 'super'-computable to 'laptop'-computable.

TensorNetwork

It looks like google put out a new TensorNetwork framework which is very similar to this project. It may be reaching out to them as their base algorithms are relatively simple:

https://github.com/google/TensorNetwork

Re-use intermediates across various contractions (Common Subexpression Elimination)

Suppose you want to compute two contractions of the same tensors, e.g. contraction strings

['ab,dca,eb,cde', 'ab,cda,eb,cde']

The (globally) optimal way to do this would be to first perform the contractions over indices a,b and e, and then perform the remaining contractions over c and d for the two sets of contractions. The current opt_einsum implementation does not allow for such a global optimization of contraction order and re-use of common intermediates.

I'm still organizing my thoughts on this, and all input would be most welcome. On a side note, I believe implementing such a more general optimization strategy will also fix #7 as a by-product.

Optimization logic

A relevant publication suggesting a concrete algorithm for this optimization problem is

Hartono et al, Identifying Cost-Effective Common Subexpressions to Reduce Operation Count in Tensor Contraction Evaluations

I do not know to what extent the current code can be re-used in this more general setup, but the single-term optimization should be under control with the current implementation.

Interface

Such a multi_contract function could be called with a list of tuples (contractionString, [tensors, ..]), and would return a list of results with the same length as the input list.

Internally, one would have to find out which tensors are actually identical between the various contractions, and then use the above contraction logic. Ideally this information should be deduced automatically and not rely on user input being in the correct form. In the same spirit, dummy indices should be allowed to have arbitrary names, i.e. they should not have to match across different contractions to be correctly identified as a common subexpression.
This may require transforming the input into a 'canonical' form first to make sure that common subexpressions are captured correctly.

In contrast to the setup outlined in Hartono et al, contraction strings should maybe remain in their current 'simple' form and not be generalized to allow for numerical expressions like sums of tensors etc. Such a behavior can be implemented a posteriori with the interface described here by computing the sum of the resulting tensors, e.g.

contract('ab,ab + ab,ba', N,M) --> 'sum'(multi_contract( [('ab,ab', N,M), ('ab,ba', N,M)] ))

Thus, restricting contraction strings to be of the form currently used does not cause loss of generality, the only downside being that it might lead to a slightly increased memory-footprint as the function would return several arrays instead of one.

Other thoughts?

Optimal path with constant arguments

I have an application where there is a large tensor contraction in the inner loop, with some arrays that are fixed and others that are updated between iterations. Operations only involving the constant terms could done ahead of time (outside the loop), but this may or may not actually improve performance.

It could be nice if opt-einsum had a way to mark some arguments as constant, in which case the cost of contracting them them could neglected when computing the cost of paths only involving these arguments.

pip install opt_einsum fails due to versioneer

When I try to pip install opt_einsum I get an error

$ pip install opt_einsum -U
Collecting opt_einsum
  Using cached https://files.pythonhosted.org/packages/30/52/64ed28228334a1124c082809402c01c4219085c5ee0f34991ace24a10dad/opt_einsum-2.1.1.tar.gz
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/private/var/folders/c5/cqn89jt900v5tmm2y7p0tf4r0000gn/T/pip-install-AyuWKs/opt-einsum/setup.py", line 4, in <module>
        import versioneer
    ImportError: No module named versioneer

    ----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /private/var/folders/c5/cqn89jt900v5tmm2y7p0tf4r0000gn/T/pip-install-AyuWKs/opt-einsum/

When I then try to fix the missing dependency via pip install versioneer, I get

$ pip install versioneer
# ...
$ pip install opt_einsum
Collecting opt_einsum
  Using cached https://files.pythonhosted.org/packages/30/52/64ed28228334a1124c082809402c01c4219085c5ee0f34991ace24a10dad/opt_einsum-2.1.1.tar.gz
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/private/var/folders/c5/cqn89jt900v5tmm2y7p0tf4r0000gn/T/pip-install-SZkOZK/opt-einsum/setup.py", line 12, in <module>
        version=versioneer.get_version(),
    AttributeError: 'module' object has no attribute 'get_version'

    ----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /private/var/folders/c5/cqn89jt900v5tmm2y7p0tf4r0000gn/T/pip-install-SZkOZK/opt-einsum/

System info

$ python --version
Python 2.7.14 :: Anaconda, Inc.
$ uname -a
Darwin fritzo-C02V61MRHTDG 17.4.0 Darwin Kernel Version 17.4.0: Sun Dec 17 09:19:54 PST 2017; root:xnu-4570.41.2~1/RELEASE_X86_64 x86_64

Bohrium Backend

Spotted the following library that might be useful to explore: https://bohrium.readthedocs.io

Still not sure exactly how we would integrate it (along the lines of numba/numexpr), but something to look at it.

[BUG] Path flop count computation is nondeterministic in Python 3

This came up in pyro-ppl/pyro#1427

def test_path_flop_count():
    equation = 'bd,db,eac->ace'
    shapes = [(3, 2), (2, 3), (4, 2, 3)]
    operands = list(map(np.random.random, shapes))
    path = [(0, 1), (1, 0), (0,)]
    path, path_info = opt_einsum.contract_path(equation, *operands, path='greedy')
    print(path_info)  # "Optimized FLOP count" is nondeterministic

Admittedly this is kind of a weird path. See #60 (comment). Maybe the solution is to avoid paths like these, but I'm not sure how.

`einsum` alias

Should we alias contract to einsum to allow a more drop-in replacement approach?

from opt_einsum import einsum

einsum(...)

As a note this would not remove the contract function, but simply alias it? We could also consider putting a deprecation warning around contract if we do feel like deprecating it. It could be desirable to be more "NEP18-like".

Make no memory limit the default?

This is more a open question than a concrete suggestion. I'm just wondering if limiting the contraction memory to the largest input is the sensible choice for most use cases? I always turn it off, and need to to see best speedups for my use cases. In the contractions where memory is going to be an issue, limiting the possibly contractions results in exponentially slow progress anyway.

I'm just thinking that maybe a lot of contractions could be made faster by default for users not aware of this.

Anyway, I realise my use-cases are not everyones! And that this also might have also been influenced a bit by keeping maximum compatibility with numpy's einsum.

`Optimal` path not optimal?

First, thanks for the great package!

Was running optimal vs. auto on a 10 element random expression, and while I know optimal is not recommended for larger networks, I found it odd that it gives worse scaling and with a larger intermediate size. Some of the other threads mention memory as a factor in the optimization, but the auto solution seems better on memory as well? All the tensors are 3x3. Is this intentional behaviour?

auto path:

([(2, 3), (4, 8), (0, 4), (0, 4), (4, 5), (0, 2), (2, 3), (1, 2), (0, 1)],
   Complete contraction:  db,cc,fe,fe,aa,ff,fe,cb,ea,ac->d
          Naive scaling:  6
      Optimized scaling:  3
       Naive FLOP count:  7.290e+3
   Optimized FLOP count:  2.700e+2
    Theoretical speedup:  27.000
   Largest intermediate:  9.000e+0 elements
 --------------------------------------------------------------------------------
 scaling        BLAS                current                             remaining
 --------------------------------------------------------------------------------
    2              0              fe,fe->fe         db,cc,aa,ff,fe,cb,ea,ac,fe->d
    2              0              fe,fe->fe            db,cc,aa,ff,cb,ea,ac,fe->d
    3           GEMM              cb,db->cd               cc,aa,ff,ea,ac,fe,cd->d
    2              0              ac,cc->ac                  aa,ff,ea,fe,cd,ac->d
    3           GEMM              ac,cd->ad                     aa,ff,ea,fe,ad->d
    2              0              ea,aa->ea                        ff,fe,ad,ea->d
    3           GEMM              ea,ad->ed                           ff,fe,ed->d
    3           GEMM              ed,fe->df                              ff,df->d

optimal path:

([(1, 7), (0, 8), (0, 2), (0, 1), (0, 5), (0, 3), (0, 3), (0, 2), (0, 1)],
   Complete contraction:  db,cc,fe,fe,aa,ff,fe,cb,ea,ac->d
          Naive scaling:  6
      Optimized scaling:  4
       Naive FLOP count:  7.290e+3
   Optimized FLOP count:  5.130e+2
    Theoretical speedup:  14.211
   Largest intermediate:  2.700e+1 elements
 --------------------------------------------------------------------------------
 scaling        BLAS                current                             remaining
 --------------------------------------------------------------------------------
    2              0              cb,cc->cb         db,fe,fe,aa,ff,fe,ea,ac,cb->d
    3           GEMM              cb,db->cd            fe,fe,aa,ff,fe,ea,ac,cd->d
    3              0             aa,fe->afe              fe,ff,fe,ea,ac,cd,afe->d
    2              0              ff,fe->fe                 fe,ea,ac,cd,afe,fe->d
    2              0              fe,fe->fe                    ea,ac,cd,afe,fe->d
    3              0            afe,ea->afe                       ac,cd,fe,afe->d
    4           GEMM            afe,ac->fec                          cd,fe,fec->d
    4           GEMM            fec,cd->fed                             fe,fed->d
    3           GEMM              fed,fe->d                                  d->d)

too many subscripts in einsum

Here is the bottom of my stack trace:

  File "/home/simon/site-packages/opt_einsum/contract.py", line 731, in __call__
    return self._contract(ops, out, backend, evaluate_constants=evaluate_constants)
  File "/home/simon/site-packages/opt_einsum/contract.py", line 667, in _contract
    **self.einsum_kwargs)
  File "/home/simon/site-packages/opt_einsum/contract.py", line 563, in _core_contract
    new_view = _einsum(einsum_str, *tmp_operands, backend=backend, **einsum_kwargs)
  File "/home/simon/site-packages/opt_einsum/sharing.py", line 161, in cached_einsum
    return einsum(*args, **kwargs)
  File "/home/simon/site-packages/opt_einsum/contract.py", line 360, in _einsum
    return fn(einsum_str, *operands, **kwargs)
ValueError: Internal error while evaluating `ContractExpression`. Note that few checks are performed - the number and rank of the array arguments must match the original expression. The internal error was: '('too many subscripts in einsum',)'

It looks like numpy-einsum is barfing here:

def iterlabels(labelstr, output_labels, ndim_output):
    """
     * Set up the labels for the iterator (output + combined labels).
     * Can just share the output_labels memory, because iter_labels
     * is output_labels with some more labels appended.
    """
    iter_labels = output_labels[:]
    ndim_iter = ndim_output
    for label in range(labelstr.min_label, labelstr.max_label+1):
        if labelstr.counts.get(label, 0) > 0 and label not in output_labels:
            if ndim_iter > 128:
                raise ValueError('too many subscripts in einsum')
            iter_labels.append(label)
            ndim_iter += 1
    # may have 0 in iter_labels
    return iter_labels

This is strange because I get the path_info from contract_path and it tells me
that largest_intermediate is of size 134217728, which seems doable.

I'm wondering how einsum ended up with more that 128 labels (i'm assuming these are tensor indices?) Perhaps there are a bunch of trivial indices being summed over...

Any suggestions ?

Einsum broadcast regression

It was pointed out on the main Numpy repository that the current python-based parsing tech misses a few broadcasting operations and tries to run them through BLAS routines which does not work.

See the issue here and PR to fix it here. This patch needs to be pulled over to opt_einsum before the next release.

Indexing operations in Einsum?

Hi opt-einsum,

Big fan. I have a feature request / question.

We often find ourselves writing complicated indexing operations where we want something like Z[i, j] = X[i, Y[i, j]]. To do this in pytorch/numpy is a bug-prone hassle.

Given that this is just a sparse matrix multiply, I love the idea of writing it in einsum, maybe something like einsum(ik,ijk*->ij, x, y). Where k is the implied dimension of treating Y as a sparse one-hot matrix.

Any ideas on how I might implement a backend like this or notate it? Would want the sparse dimension to become fancy indexing.

memory_limit with 'greedy' contractions

I notice that with path='greedy', memory_limit is locked at the max input size, for 'algorithmic' reasons. Is this still an issue? / any prospects of this being resolved?

I did some brief testing with the limit just removed and all seemed ok (and faster!).

P.S. thanks for the nice project!

Reduce greedy complexity to sub-quadratic?

The greedy strategy can be quite slow when given hundreds of tensors. Is it possible to reduce its complexity to sub-quadratic, possibly by:

using incremental computations when possible
using memoization when possible
maintaining a priority queue rather than recomputing min-cost contraction at each step

cuTENSOR + cupy backend

Nvidia reached out about using cuTENSOR as backend (details here) which is a low level implementation of tensor contraction on Nvidia GPU's.

This doesn't seem to be widely advertised, but cupy has support for this in their development head with some information here. Though the different syntax would require some minimal work to integrate.

I don't have too many use cases personally to give this a try, but if anyone has some fun use cases I can benchmark them on a pair of V100s against other backends.

v3.0 Planning

We are planning a major version bump for the following reasons:

Transition to be backend agnostic (NumPy is no longer the principle backend).
- See #81 which has the (small) potential to break current implementations.

TODO Items:

Refactoring in preparation for NEP18 (#46) should be considered.
Consider adding a Ricci calculus parser contract('A_ij B_jk A_kl', {"A": a, "B": b}).
Run through the backends to check where we can refactor.
Update docs and README to reflect that we are NumPy agnostic.
Consider parsers that use metrics besides raw FLOPS (actually trying out several contraction paths for example).

Are there other items where we might need to do a few (small) compatibility breaks that would be good for future sustainability?

FR: Support multiple output shapes via batch_contract()

It would be nice to support equations like ab,bc,cd->a,b,c,d that produce multiple different output shapes, as e.g. is common in computing marginals of graphical models via variable elimination.

a, b, c, d = 5, 6, 7, 8
x, y, z = np.random.randn(a, b), np.random.randn(b, c), np.random.randn(c, d)
result = batch_contract('ab,bc,cd->a,b,c,d', x, y, z)
assert isinstance(result, tuple) and len(result) == 4
assert result[0].shape == (a,)
assert result[1].shape == (b,)
assert result[2].shape == (c,)
assert result[3].shape == (d,)

Note that this is backwards incompatible with contract, since the result is a tuple of arrays rather than an array. A naive implementation (simplified from the implementation in Pyro) is

def batch_contract(equation, *operands, **kwargs):
    inputs, outputs = equation.split('->')
    with shared_intermediates():
        return tuple(contract(inputs + '->' + output, *operands, **kwargs)
                     for output in outputs.split(','))

While the naive implementation can share some work across output shapes, it can miss sharing opportunities across contract calls because each contraction path is greedily optimized blind to the other paths.

Can we generalize the eager strategy to jointly optimize multiple contraction paths, possibly still in a greedy fashion?

Related work

Hartono et al. (2006) Identifying Cost-Effective Common Subexpressions to
Reduce Operation Count in Tensor Contraction
Evaluations
Kohlas, Wilson (2008) Semiring induced valuation algebras:
Exact and approximate local computation algorithms
Khamis, Ngo, Rudra (2017) FAQ: Questions Asked Frequently

Cython implementation

It would be nice if the contraction algorithms (especially optimal) was written in Cython. This could give really big speed boosts to the search algorithms.

Error with path="optimal"

When running the following example, I get an exception

contract('ia,aj,ka->ijk', A, B, C, path="optimal")

The traceback is

Traceback (most recent call last):
  File "opt_einsum.py", line 477, in <module>
    x, y = contract('ia,aj,ka->ijk', A, B, C, path="optimal")
  File "opt_einsum.py", line 425, in contract
    tmp_inputs.append(input_list.pop(x))
IndexError: pop index out of range

Possible JOSS paper

I have been playing around with the idea of submitting this project as a paper to The Journal of Open Source Software. Once the documentation is fully ported over to ReadTheDocs the process should be relatively straightforward.

I would be curious if any other contributors would be interested in being on the paper as well. I think @jcmgray should be especially for the latest contributions. Any other thoughts here?

Support more subscript symbols

The project now supports more than 52 subscripts, but only with integers, and it can quickly get confusing using so many integers. My suggestion is to support this kind of syntax:

arr1 = np.zeros((2, 3))
sub1 = '[left edge][bond]'
arr2 = np.zeros((3, 2))
sub2 = '[bond][right edge]'
out = '[left edge][right edge]'
opt_einsum.contract(arr1, sub1, arr2, sub2, out)

I know that it's equivalent with:

arr1 = np.zeros((2, 3))
sub1 = [0, 1]
arr2 = np.zeros((3, 2))
sub2 = [1, 2]
out = [0, 2]

However, IMHO, when doing contraction with lots of tensor which have different meanings, it'll be nice to give them a proper name.

I'm more than happy to prepare a PR for this, I think it's not extemely difficult.

Potential bug (outer tensor product)

This works...:

>>> import mpmath
>>> import numpy
>>> m0 = numpy.eye(2)
>>> opt_einsum.contract('bc,ad->abcd', m0, m0)
array([[[[1., 0.], ...)

...but this breaks:

>>> m1 = m0 + mpmath.mpc(0)
>>> opt_einsum.contract('bc,ad->abcd', m1, m1)
Traceback (most recent call last): (...)
TypeError: invalid data type for einsum

However, using opt_einsum with mpmath is otherwise OK in principle:

>>> opt_einsum.contract('bc,cd->bd', m1, m1)
array([[mpc(real='1.0', imag='0.0'), mpc(real='0.0', imag='0.0')],
       [mpc(real='0.0', imag='0.0'), mpc(real='1.0', imag='0.0')]],
      dtype=object)

dgasmith / opt_einsum Goto Github PK

opt_einsum's Introduction

Optimized Einsum

Optimized Einsum: A tensor contraction order optimizer

Example usage

Features

Installation

Citation

Contributing

opt_einsum's People

Contributors

Stargazers

Watchers

Forkers

opt_einsum's Issues

Status

Status

Examples

Optimization logic

Interface

System info

Related work

Recommend Projects

Recommend Topics

Recommend Org

Jobs