Comments (24)
Long term it would be nice to see a scipy.sparse csr class that satisfied the np.ndarray rather than np.matrix interface. Then it could operate nicely with dask array while still having the CSR layout. This is discussed in more depth in scipy/scipy#8162
To be clear, I'm not proposing a fully n-dimensional implementation, just the current implementation where n <= 2, but satisfying ndarray conventions. This is a broader issue for the rest of the community. My understanding is that sklearn devs would also appreciate a scipy.sparse.csr_array class.
from dask-ml.
+1 for scipy/scipy#8162 as the only true long term solution.
On the shorter term it should be possible to fix the most common discrepancies directly in dask. I found one in dask/dask#2842 but I did not find (take) the time to finish investigating the cause of the test failures.
Interesting... It seems a lot of algorithms have already been optimized for CSR/CSC. I'll see how hard it is to implement CSR/CSC (limited to 2-D) with array semantics. Mostly it's just a matter of studying and re-implementing the scipy.sparse code (or maybe just wrapping it would be good enough).
However, since CSR/CSC isn't defined for 1-D, indexing would again have to return COO.
Well, we would still need to be able to do the conversion in any case (e.g. to communicate with scikit-learn, which expects csr_matrix). Can't we start with this and progressively re-implement operations until we reach a final implementation that no longer uses scipy.sparse?
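For reference, the round trip through scipy is already straightforward today; a minimal sketch using scipy only (variable names are illustrative):

```python
import numpy as np
import scipy.sparse as sp

# Build a small COO matrix from (row, col, value) triples, roughly how a
# COO-style container stores its entries internally.
rows = np.array([0, 1, 2])
cols = np.array([1, 0, 2])
data = np.array([3.0, 4.0, 5.0])
coo = sp.coo_matrix((data, (rows, cols)), shape=(3, 3))

# Convert to CSR, e.g. before handing the block to scikit-learn ...
csr = coo.tocsr()

# ... and back to COO once the scipy.sparse operation is done.
back = csr.tocoo()

assert (coo.toarray() == back.toarray()).all()
```

The conversion allocates new index arrays but involves no per-element Python overhead, so it is cheap relative to most downstream computation.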
This is feasible. The only concern would be that we do a lot of work and never reach the functionality of scipy.sparse. It may be that copying the entire implementation over and then tweaking a couple of things to stop using the np.matrix interface is significantly easier to do.
To be clear, people can do whatever they want to do. I have thoughts on how I would or would not spend my time here, but that doesn't stop others from exploring this topic. I doubt that my thoughts on this topic are entirely correct.
For sklearn, both CSR and CSC are useful. CSR is typically optimal for incremental learning algorithms (those with a partial_fit method), while CSC is optimal for column-wise algorithms such as decision trees and linear models fitted with a coordinate descent optimizer (Lasso, ElasticNet, ...).
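The difference is easy to see with plain scipy (a toy sketch; the sizes are arbitrary):

```python
import scipy.sparse as sp

# A random sparse design matrix: 1000 samples, 50 features.
X = sp.random(1000, 50, density=0.05, format="csr", random_state=0)

# CSR: each row's nonzeros are contiguous in memory, so streaming over
# samples (partial_fit-style) is a cheap slice.
row = X[0]

# CSC: each column's nonzeros are contiguous instead, which is what
# coordinate-descent solvers (Lasso, ElasticNet) sweep over.
Xc = X.tocsc()
col = Xc[:, 0]

assert row.shape == (1, 50)
assert col.shape == (1000, 1)
```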
For storage you could write a custom dask function that relies on joblib.dump to efficiently serialize scipy.sparse.csr_matrix blocks in a filesystem folder.
or write your own by using numpy.save / numpy.load on the 3 underlying numpy arrays sparse_data_block.indices, sparse_data_block.data and sparse_data_block.indptr.
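A minimal sketch of that approach (the function names save_csr/load_csr are hypothetical; note the shape also has to be stored, since the three arrays alone don't pin down the number of columns):

```python
import os
import tempfile

import numpy as np
import scipy.sparse as sp

def save_csr(path_prefix, block):
    # Persist the three defining arrays plus the shape.
    np.save(path_prefix + "_data.npy", block.data)
    np.save(path_prefix + "_indices.npy", block.indices)
    np.save(path_prefix + "_indptr.npy", block.indptr)
    np.save(path_prefix + "_shape.npy", np.array(block.shape))

def load_csr(path_prefix):
    data = np.load(path_prefix + "_data.npy")
    indices = np.load(path_prefix + "_indices.npy")
    indptr = np.load(path_prefix + "_indptr.npy")
    shape = tuple(np.load(path_prefix + "_shape.npy"))
    return sp.csr_matrix((data, indices, indptr), shape=shape)

# Round-trip a random block through disk.
X = sp.random(10, 8, density=0.3, format="csr", random_state=0)
with tempfile.TemporaryDirectory() as d:
    prefix = os.path.join(d, "blk")
    save_csr(prefix, X)
    Y = load_csr(prefix)

assert (X.toarray() == Y.toarray()).all()
```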
cc @hameerabbasi, dev on the mrocklin/sparse library
Thanks for the response. The discussion about sparse ndarray in scipy looks quite promising indeed!
For storage you could write a custom dask function that relies on joblib.dump [..] or write your own by using numpy.save
yes, I imagine that wouldn't be too difficult to do.
On the shorter term it should be possible to fix the most common discrepancies directly in dask. I found one in dask/dask#2842 ...
Good to know, thanks.
I'm a bit hesitant to add CSR/CSC in mrocklin/sparse, simply because they limit us to 2-D (N-D CSR/CSC is... murky, to say the least), and I'm worried about growing complexity. However, we can achieve near-CSR level performance if mrocklin/sparse#60 is fixed (O(log n) vs O(1)).
Currently, we are working with C-style ordering of indices. If we, instead, added a Fortran-style mode for ordering indices, and modified indexing to fit, then we could achieve near-CSC level performance as well.
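For COO-style coordinates, the two modes differ only in which axis np.lexsort treats as most significant; a toy illustration:

```python
import numpy as np

# Coordinates of four nonzeros in a 2x2 matrix, in arbitrary order.
rows = np.array([1, 0, 1, 0])
cols = np.array([0, 1, 1, 0])

# C-style ordering: sort by row first, then column. np.lexsort treats its
# LAST key as most significant, so the row key goes last.
c_order = np.lexsort((cols, rows))

# Fortran-style ordering: the column becomes most significant instead.
f_order = np.lexsort((rows, cols))

assert list(zip(rows[c_order], cols[c_order])) == [(0, 0), (0, 1), (1, 0), (1, 1)]
assert list(zip(rows[f_order], cols[f_order])) == [(0, 0), (1, 0), (0, 1), (1, 1)]
```

Row-major sorted coordinates give CSR-like locality (a row is a contiguous run); column-major sorted coordinates give CSC-like locality.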
If you could reduce the bug in dask/dask#2842 to an underlying bug or missing feature in mrocklin/sparse, I'd be happy to look into it. If it's a bug in dask itself, I'm less willing to dive into it at this point, but I will soon, to help get it integrated.
In regards to performance, I think that the main operation where you really need CSR/CSC is tensordot, which is pretty critical in many numeric algorithms; for that, CSR/CSC is probably essential.
If we don't want to host these in mrocklin/sparse then csr_array and csc_array could also exist in scipy.sparse. That would probably get more eyes on them and ensure maintenance, but presents a challenge due to the slow release and development cycle. Alternatively, someone could make a separate standalone package.
At this point I think that downstream library maintainers, like the scikit-learn community, should probably decide what they would prefer to depend upon.
CSR and CSC support would be useful to allow for no-copy behavior when calling Cython code from scikit-learn on sparse data blocks.
Several algorithms in scikit-learn do very efficient, cache-aligned data access when fed CSR or CSC data structures.
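For illustration, here is the access pattern those kernels rely on, written in plain Python over the raw CSR arrays (the Cython versions are essentially this loop without the interpreter overhead):

```python
import numpy as np
import scipy.sparse as sp

X = sp.random(100, 20, density=0.1, format="csr", random_state=0)

# The nonzeros of row i live in the contiguous slice
# data[indptr[i]:indptr[i+1]], so a full row sweep is one linear pass
# over memory -- no copies, no per-element indirection.
row_sums = np.zeros(X.shape[0])
for i in range(X.shape[0]):
    start, stop = X.indptr[i], X.indptr[i + 1]
    row_sums[i] = X.data[start:stop].sum()

assert np.allclose(row_sums, np.asarray(X.sum(axis=1)).ravel())
```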
Note that the scope of CSR/CSC is non-trivial. You might want to look at algorithms in scipy/sparse/linalg/. This would almost certainly require writing low-level code in Cython/Numba/C/C++.
If I were doing this under time constraints I would probably try to find a way to modify the existing solution to support the ndarray interface.
However if I were doing this without serious time constraints then yes, rebuilding that code from scratch might result in significant improvements (and would also be pretty educational I think).
I plan to keep mrocklin/sparse a mostly data-structure-related thing, as discussed in scipy/scipy#8162. Maybe implement operators, etc., but that's about it. I want to go down to Cython at some point... Might as well be soon.
dot/tensordot/linalg.solve etc. would either be in Scipy or a separate package... Or in mrocklin/sparse if we had help. I plan to keep the data/indices/indptr structure so their code can be easily ported.
To be clear, I'd love to see mrocklin/sparse expand, or be merged into Scipy when stable. I just don't think I can do it alone. :)
Yeah, you're not alone here. I think that many people on this issue have demonstrated that they're happy to pitch in when relevant. I also suspect that sparse ndarrays will be of relevance to @jcrist in a couple months.
I plan to keep mrocklin/sparse a mostly data structure related thing as discussed in scipy/scipy#8162. Maybe implement operators, etc, but that's about it. I want to go down to Cython at some point... Might as well be soon.
FWIW I'd rather see a single csr ndarray class somewhere that has all relevant computational operations. Splitting this up between repositories seems painful to me. I'd rather wait until a clear solution is agreed upon instead of launching into this short term.
I wonder if it would be possible to extend CSR/CSC to N-D. It seems that all it should be is an ordering of which axes are prioritized in the sparse representation. Does that seem correct? If so, it seems like it should be possible to implement this in sparse in a way where it remains general purpose. With CSR/CSC, this amounts to one of two orderings involving axes 0 and 1.
Yes, this is possible. There are a combinatorial number of such layouts, but presumably we would store some hierarchy of axes that we cared about. I wouldn't make this choice until some important workloads demonstrated its value (like large sparse iterative factorization algorithms). I also wouldn't want to think about how operations like solve generalize under these conditions.
Short-to-medium term I think that 2d CSR/CSC with the np.ndarray interface is a very valuable and feasible objective. My hope is that going into the scipy.sparse code and making a few modifications is all that's necessary to achieve this; this could be naive though.
I would also be interested to help a bit, once it is decided where and in what way this should be implemented. (Though my bandwidth before summer won't be great).
At this point I think that downstream library maintainers, like the scikit-learn community, should probably decide what they would prefer to depend upon.
I would think it's more a question for dask-ml. scikit-learn itself currently has a minimum supported scipy version of 0.13.3 and works reasonably well around the limitations of the matrix interface. Which means that this development won't impact it in the near future either way.
dot/tensordot/linalg.solve etc., would either be in Scipy or a separate package...
However if I were doing this without serious time constraints then yes, rebuilding that code from scratch might result in significant improvements (and would also be pretty educational I think).
I'd rather wait until a clear solution is agreed upon instead of launching into this short term.
At the same time, in the context of dask-ml, even an implementation that supports basic operations and is able to convert without much cost to/from scipy's csr/csc_matrix might help solve some of the issues?
Short-to-medium term I think that 2d CSR/CSC with the np.ndarray interface is a very valuable and feasible objective. My hope is that going into the scipy.sparse code and making a few modifications is all that's necessary to achieve this;
+1
At the same time, in the context of dask-ml, even an implementation that supports basic operations and is able to convert without much cost to/from scipy's csr/csc_matrix might help solve some of the issues?
Yes, that's true. Short term we've been using the COO implementation in mrocklin/sparse for this. The conversion isn't free, but it's also not that expensive. We could also make a thin wrapper around scipy.sparse that just converted every operation back and forth, but this would probably be a wasted project long-term. There are some short-term-needs vs long-term-goals tradeoffs to make here. I'm not sure what the best path is.
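For what it's worth, such a thin wrapper might look roughly like this (CSRWrapper and its methods are purely hypothetical, not an existing API; the point is converting between ndarray semantics and scipy's matrix semantics at every operation boundary):

```python
import numpy as np
import scipy.sparse as sp

class CSRWrapper:
    """Hypothetical shim: keeps a scipy csr_matrix inside but exposes
    np.ndarray semantics at the surface."""

    def __init__(self, csr):
        self._csr = sp.csr_matrix(csr)

    @property
    def shape(self):
        return self._csr.shape

    def __mul__(self, other):
        # ndarray semantics: '*' is elementwise, unlike np.matrix.
        if isinstance(other, CSRWrapper):
            other = other._csr
        return CSRWrapper(self._csr.multiply(other))

    def dot(self, other):
        # The matrix product stays a method call, as on np.ndarray.
        if isinstance(other, CSRWrapper):
            other = other._csr
        return CSRWrapper(self._csr @ other)

    def todense(self):
        return np.asarray(self._csr.todense())
```

Every operation pays a conversion on the way in and out, which is why it would likely be throwaway work long-term.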
It seems that all it should be is an ordering of which axes are prioritized in the sparse representation.
Yes, that is the most general way to put it. Would CSD (compressed sparse dimensions) make sense, for both CSR/CSC? For CSR, the compressed dimension would be 0, and for CSC it would be 1. We could keep a tuple of compressed_axes pairs.
Then, CSR/CSC would just inherit from this or be special cases. We would implement all operations exactly as in 2-D CSR without any broadcasting. Tensordot would reduce to dot for the appropriate choice of axes.
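A dense numpy sketch of that reduction (the transpose/reshape bookkeeping would be the same for a sparse implementation; the shapes here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.random((2, 3, 4))
b = rng.random((4, 3, 5))

# Contract a's axes (1, 2) against b's axes (1, 0).
expected = np.tensordot(a, b, axes=([1, 2], [1, 0]))

# Manual reduction: a's contracted axes are already trailing, so flatten
# them; b's contracted axes are moved to the front in the matching order
# (axis 1, then axis 0) before flattening. One 2-D dot finishes the job.
a2 = a.reshape(2, 12)                     # (free, contracted)
b2 = b.transpose(1, 0, 2).reshape(12, 5)  # (contracted, free)
manual = a2 @ b2

assert np.allclose(expected, manual)
```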
We could also make a thin wrapper around scipy.sparse that just converted every operation back and forth, but this would probably be a wasted project long-term.
Well, we would still need to be able to do the conversion in any case (e.g. to communicate with scikit-learn, which expects csr_matrix). Can't we start with this and progressively re-implement operations until we reach a final implementation that no longer uses scipy.sparse?
@hameerabbasi's and @jakirkham's thoughts on CSD might be a decent start. This code seems orthogonal to the logic in scipy.sparse (and so would presumably not be thrown away in the future) and might resolve the short-term need.