Comments (24)
Long term it would be nice to see a scipy.sparse csr class that satisfied the np.ndarray rather than np.matrix interface. Then it could operate nicely with dask array while still having the CSR layout. This is discussed in more depth in scipy/scipy#8162
To be clear, I'm not proposing a fully n-dimensional implementation, just the current implementation where n <= 2, but satisfying ndarray conventions. This is a broader issue for the rest of the community. My understanding is that sklearn devs would also appreciate a scipy.sparse.csr_array class.
from dask-ml.
+1 for scipy/scipy#8162 as the only true long term solution.
On the shorter term it should be possible to fix the most common discrepancies directly in dask. I found one in dask/dask#2842 but I did not find (take) the time to finish investigating the cause of the test failures.
Interesting... It seems a lot of algorithms have already been optimized for CSR/CSC. I'll see how hard it is to implement CSR/CSC (limited to 2-D) with array semantics. Mostly it's just a matter of studying and re-implementing the scipy.sparse code (or maybe just wrapping it would be good enough).
However, since CSR/CSC isn't defined for 1-D, indexing would again have to return COO.
Well, we would still need to be able to do the conversion in any case (e.g. to communicate with scikit-learn, which expects csr_matrix). Can't we start with this and progressively re-implement operations until we reach a final implementation that no longer uses scipy.sparse?
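For reference, the round trip through scipy is already straightforward today; a minimal sketch using scipy only (variable names are illustrative):

```python
import numpy as np
import scipy.sparse as sp

# Build a small COO matrix from (row, col, value) triples, roughly how a
# COO-style container stores its entries internally.
rows = np.array([0, 1, 2])
cols = np.array([1, 0, 2])
data = np.array([3.0, 4.0, 5.0])
coo = sp.coo_matrix((data, (rows, cols)), shape=(3, 3))

# Convert to CSR, e.g. before handing the block to scikit-learn ...
csr = coo.tocsr()

# ... and back to COO once the scipy.sparse operation is done.
back = csr.tocoo()

assert (coo.toarray() == back.toarray()).all()
```

The conversion allocates new index arrays but involves no per-element Python overhead, so it is cheap relative to most downstream computation.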
This is feasible. The only concern would be that we do a lot of work and never reach the functionality of scipy.sparse. It may be that copying the entire implementation over and then tweaking a couple of things to stop using the np.matrix interface is significantly easier to do.
To be clear, people can do whatever they want to do. I have thoughts on how I would or would not spend my time here, but that doesn't stop others from exploring this topic. I doubt that my thoughts on this topic are entirely correct.
For sklearn, both CSR and CSC are useful. CSR is typically optimal for incremental learning algorithms (those with a partial_fit method), while CSC is optimal for column-wise algorithms such as decision trees and linear models fitted with a coordinate descent optimizer (Lasso, ElasticNet, ...).
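The difference is easy to see with plain scipy (a toy sketch; the sizes are arbitrary):

```python
import scipy.sparse as sp

# A random sparse design matrix: 1000 samples, 50 features.
X = sp.random(1000, 50, density=0.05, format="csr", random_state=0)

# CSR: each row's nonzeros are contiguous in memory, so streaming over
# samples (partial_fit-style) is a cheap slice.
row = X[0]

# CSC: each column's nonzeros are contiguous instead, which is what
# coordinate-descent solvers (Lasso, ElasticNet) sweep over.
Xc = X.tocsc()
col = Xc[:, 0]

assert row.shape == (1, 50)
assert col.shape == (1000, 1)
```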
For storage you could write a custom dask function that relies on joblib.dump to efficiently serialize scipy.sparse.csr_matrix blocks in a filesystem folder.
or write your own by using numpy.save / numpy.load on the 3 underlying numpy arrays sparse_data_block.indices, sparse_data_block.data and sparse_data_block.indptr.
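A minimal sketch of that approach (the function names save_csr/load_csr are hypothetical; note the shape also has to be stored, since the three arrays alone don't pin down the number of columns):

```python
import os
import tempfile

import numpy as np
import scipy.sparse as sp

def save_csr(path_prefix, block):
    # Persist the three defining arrays plus the shape.
    np.save(path_prefix + "_data.npy", block.data)
    np.save(path_prefix + "_indices.npy", block.indices)
    np.save(path_prefix + "_indptr.npy", block.indptr)
    np.save(path_prefix + "_shape.npy", np.array(block.shape))

def load_csr(path_prefix):
    data = np.load(path_prefix + "_data.npy")
    indices = np.load(path_prefix + "_indices.npy")
    indptr = np.load(path_prefix + "_indptr.npy")
    shape = tuple(np.load(path_prefix + "_shape.npy"))
    return sp.csr_matrix((data, indices, indptr), shape=shape)

# Round-trip a random block through disk.
X = sp.random(10, 8, density=0.3, format="csr", random_state=0)
with tempfile.TemporaryDirectory() as d:
    prefix = os.path.join(d, "blk")
    save_csr(prefix, X)
    Y = load_csr(prefix)

assert (X.toarray() == Y.toarray()).all()
```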
cc @hameerabbasi, dev on the mrocklin/sparse library
Thanks for the response. The discussion about sparse ndarray in scipy looks quite promising indeed!
For storage you could write a custom dask function that relies on joblib.dump [..] or write your own by using numpy.save
yes, I imagine that wouldn't be too difficult to do.
On the shorter term it should be possible to fix the most common discrepancies directly in dask. I found one in dask/dask#2842 ...
Good to know, thanks.
I'm a bit hesitant to add CSR/CSC in mrocklin/sparse, simply because they limit us to 2-D (N-D CSR/CSC is... murky, to say the least), and I'm worried about growing complexity. However, we can achieve near-CSR level performance if mrocklin/sparse#60 is fixed (O(log n) vs O(1)).
Currently, we are working with C-style ordering of indices. If we, instead, added a Fortran-style mode for ordering indices, and modified indexing to fit, then we could achieve near-CSC level performance as well.
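For COO-style coordinates, the two modes differ only in which axis np.lexsort treats as most significant; a toy illustration:

```python
import numpy as np

# Coordinates of four nonzeros in a 2x2 matrix, in arbitrary order.
rows = np.array([1, 0, 1, 0])
cols = np.array([0, 1, 1, 0])

# C-style ordering: sort by row first, then column. np.lexsort treats its
# LAST key as most significant, so the row key goes last.
c_order = np.lexsort((cols, rows))

# Fortran-style ordering: the column becomes most significant instead.
f_order = np.lexsort((rows, cols))

assert list(zip(rows[c_order], cols[c_order])) == [(0, 0), (0, 1), (1, 0), (1, 1)]
assert list(zip(rows[f_order], cols[f_order])) == [(0, 0), (1, 0), (0, 1), (1, 1)]
```

Row-major sorted coordinates give CSR-like locality (a row is a contiguous run); column-major sorted coordinates give CSC-like locality.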
If you could reduce the bug in dask/dask#2842 to an underlying bug or missing feature in mrocklin/sparse, I'd be happy to look into it. If it's a bug in dask itself, I'm less willing to dive into it at this point, but I will soon, to help get it integrated.
In regards to performance, I think that the main operation where you really need CSR/CSC is tensordot, which is pretty critical in many numeric algorithms; for that, CSR/CSC is probably essential.
If we don't want to host these in mrocklin/sparse then csr_array and csc_array could also exist in scipy.sparse. That would probably get more eyes on them and ensure maintenance, but presents a challenge due to the slow release and development cycle. Alternatively, someone could make a separate standalone package.
At this point I think that downstream library maintainers, like the scikit-learn community, should probably decide what they would prefer to depend upon.
CSR and CSC support would be useful to allow for no-copy behavior when calling Cython code from scikit-learn on sparse data blocks.
Several algorithms in scikit-learn do very efficient, cache-aligned data access when fed CSR or CSC data structures.
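For illustration, here is the access pattern those kernels rely on, written in plain Python over the raw CSR arrays (the Cython versions are essentially this loop without the interpreter overhead):

```python
import numpy as np
import scipy.sparse as sp

X = sp.random(100, 20, density=0.1, format="csr", random_state=0)

# The nonzeros of row i live in the contiguous slice
# data[indptr[i]:indptr[i+1]], so a full row sweep is one linear pass
# over memory -- no copies, no per-element indirection.
row_sums = np.zeros(X.shape[0])
for i in range(X.shape[0]):
    start, stop = X.indptr[i], X.indptr[i + 1]
    row_sums[i] = X.data[start:stop].sum()

assert np.allclose(row_sums, np.asarray(X.sum(axis=1)).ravel())
```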
Note that the scope of CSR/CSC is non-trivial. You might want to look at algorithms in scipy/sparse/linalg/. This would almost certainly require writing low-level code in Cython/Numba/C/C++.
If I were doing this under time constraints I would probably try to find a way to modify the existing solution to support the ndarray interface.
However if I were doing this without serious time constraints then yes, rebuilding that code from scratch might result in significant improvements (and would also be pretty educational I think).
I plan to keep mrocklin/sparse a mostly data-structure-related thing, as discussed in scipy/scipy#8162. Maybe implement operators, etc., but that's about it. I want to go down to Cython at some point... Might as well be soon.
dot/tensordot/linalg.solve etc. would either be in Scipy or a separate package... Or in mrocklin/sparse if we had help. I plan to keep the data/indices/indptr structure so their code can be easily ported.
To be clear, I'd love to see mrocklin/sparse expand, or be merged into Scipy when stable. I just don't think I can do it alone. :)
Yeah, you're not alone here. I think that many people on this issue have demonstrated that they're happy to pitch in when relevant. I also suspect that sparse ndarrays will be of relevance to @jcrist in a couple months.
I plan to keep mrocklin/sparse a mostly data structure related thing as discussed in scipy/scipy#8162. Maybe implement operators, etc, but that's about it. I want to go down to Cython at some point... Might as well be soon.
FWIW I'd rather see a single csr ndarray class somewhere that has all relevant computational operations. Splitting this up between repositories seems painful to me. I'd rather wait until a clear solution is agreed upon instead of launching into this short term.
I wonder if it would be possible to extend CSR/CSC to N-D. It seems that all it should be is an ordering of which axes are prioritized in the sparse representation. Does that seem correct? If so, it seems like it should be possible to implement this in sparse in a way where it remains general purpose. With CSR/CSC, this amounts to one of two orderings involving axes 0 and 1.
Yes, this is possible. There are a combinatorial number of such layouts, but presumably we would store some hierarchy of axes that we cared about. I wouldn't make this choice until some important workloads demonstrated its value (like large sparse iterative factorization algorithms). I also wouldn't want to think about how operations like solve generalize under these conditions.
Short-to-medium term I think that 2d CSR/CSC with the np.ndarray interface is a very valuable and feasible objective. My hope is that going into the scipy.sparse code and making a few modifications is all that's necessary to achieve this; this could be naive though.
I would also be interested to help a bit, once it is decided where and in what way this should be implemented. (Though my bandwidth before summer won't be great).
At this point I think that downstream library maintainers, like the scikit-learn community, should probably decide what they would prefer to depend upon.
I would think it's more a question for dask-ml. scikit-learn itself currently has a minimum supported scipy version of 0.13.3 and works reasonably well around the limitations of the matrix interface. Which means that this development won't impact it in the near future either way.
dot/tensordot/linalg.solve etc., would either be in Scipy or a separate package...
However if I were doing this without serious time constraints then yes, rebuilding that code from scratch might result in significant improvements (and would also be pretty educational I think).
I'd rather wait until a clear solution is agreed upon instead of launching into this short term.
At the same time, in the context of dask-ml, even an implementation that supports basic operations and is able to convert without much cost to/from scipy's csr/csc_matrix might help solve some of the issues?
Short-to-medium term I think that 2d CSR/CSC with the np.ndarray interface is a very valuable and feasible objective. My hope is that going into the scipy.sparse code and making a few modifications is all that's necessary to achieve this;
+1
At the same time, in the context of dask-ml, even an implementation that supports basic operations and is able to convert without much cost to/from scipy's csr/csc_matrix might help solve some of the issues?
Yes, that's true. Short term we've been using the COO implementation in mrocklin/sparse for this. The conversion isn't free, but it's also not that expensive. We could also make a thin wrapper around scipy.sparse that just converted every operation back and forth, but this would probably be a wasted project long-term. There are some short-term-needs vs long-term-goals tradeoffs to make here. I'm not sure what the best path is.
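For what it's worth, such a thin wrapper might look roughly like this (CSRWrapper and its methods are purely hypothetical, not an existing API; the point is converting between ndarray semantics and scipy's matrix semantics at every operation boundary):

```python
import numpy as np
import scipy.sparse as sp

class CSRWrapper:
    """Hypothetical shim: keeps a scipy csr_matrix inside but exposes
    np.ndarray semantics at the surface."""

    def __init__(self, csr):
        self._csr = sp.csr_matrix(csr)

    @property
    def shape(self):
        return self._csr.shape

    def __mul__(self, other):
        # ndarray semantics: '*' is elementwise, unlike np.matrix.
        if isinstance(other, CSRWrapper):
            other = other._csr
        return CSRWrapper(self._csr.multiply(other))

    def dot(self, other):
        # The matrix product stays a method call, as on np.ndarray.
        if isinstance(other, CSRWrapper):
            other = other._csr
        return CSRWrapper(self._csr @ other)

    def todense(self):
        return np.asarray(self._csr.todense())
```

Every operation pays a conversion on the way in and out, which is why it would likely be throwaway work long-term.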
It seems that all it should be is an ordering of which axes are prioritized in the sparse representation.
Yes, that is the most general way to put it. Would CSD (compressed sparse dimensions) make sense, for both CSR/CSC? For CSR, the compressed dimension would be 0, and for CSC it would be 1. We could keep a tuple of compressed_axes pairs.
Then, CSR/CSC would just inherit from this or be special cases. We would implement all operations exactly as in 2-D CSR without any broadcasting. Tensordot would reduce to dot for the appropriate choice of axes.
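A dense numpy sketch of that reduction (the transpose/reshape bookkeeping would be the same for a sparse implementation; the shapes here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.random((2, 3, 4))
b = rng.random((4, 3, 5))

# Contract a's axes (1, 2) against b's axes (1, 0).
expected = np.tensordot(a, b, axes=([1, 2], [1, 0]))

# Manual reduction: a's contracted axes are already trailing, so flatten
# them; b's contracted axes are moved to the front in the matching order
# (axis 1, then axis 0) before flattening. One 2-D dot finishes the job.
a2 = a.reshape(2, 12)                     # (free, contracted)
b2 = b.transpose(1, 0, 2).reshape(12, 5)  # (contracted, free)
manual = a2 @ b2

assert np.allclose(expected, manual)
```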
We could also make a thin wrapper around scipy.sparse that just converted every operation back and forth, but this would probably be a wasted project long-term.
Well, we would still need to be able to do the conversion in any case (e.g. to communicate with scikit-learn, which expects csr_matrix). Can't we start with this and progressively re-implement operations until we reach a final implementation that no longer uses scipy.sparse?
@hameerabbasi's and @jakirkham's thoughts on CSD might be a decent start. This code seems orthogonal to the logic in scipy.sparse (and so would presumably not be thrown away in the future) and might resolve the short-term need.