While sparse arrays are supported in dask, this issue aims to open the discussion on how they could be used in the context of dask-ml.
In particular, even if #5 about TF-IDF gets resolved, the estimators downstream in the pipeline would also need to support sparse arrays for this to be of any use. The simplest example of such a pipeline could for instance be #115: a text vectorizer feeding a wrapped out-of-core scikit-learn model (e.g. `PartialMultinomialNB`). A rough sketch of that data flow is below.
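To make the target concrete, here is a minimal sketch of such a pipeline using plain dask.bag plus scikit-learn's stateless `HashingVectorizer` and `MultinomialNB.partial_fit`, rather than any dask-ml wrapper (the wrapper names and APIs in dask-ml may differ; this only illustrates the shape of the pipeline):

```python
import dask.bag as db
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy corpus split into partitions, standing in for a large text dataset.
texts = db.from_sequence(
    ["the quick brown fox", "lorem ipsum dolor", "dask and sparse arrays"] * 100,
    npartitions=4,
)
labels = [0, 1, 0] * 100  # toy labels, aligned with the texts above

# HashingVectorizer is stateless, so it can run independently per partition;
# alternate_sign=False keeps features non-negative for MultinomialNB.
vec = HashingVectorizer(alternate_sign=False)

# Each partition becomes one scipy.sparse CSR block (row-wise chunks).
X_parts = texts.map_partitions(lambda docs: [vec.transform(list(docs))]).compute()

# Incremental fit over the sparse chunks, one partition at a time.
clf = MultinomialNB()
offset = 0
for X in X_parts:
    y = labels[offset:offset + X.shape[0]]
    clf.partial_fit(X, y, classes=[0, 1])
    offset += X.shape[0]
```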
## Potentially relevant estimators

`TruncatedSVD`, text vectorizers, some estimators in `dask_ml.preprocessing`, and wrapped scikit-learn models that natively support incremental learning and sparse arrays.
## Sparse array format

There are several choices here:
- should mrocklin/sparse be added as a hard dependency, as the package whose sparse arrays work with dask out of the box?
- should scipy.sparse matrices be wrapped to make them compatible in some limited fashion (if that is possible at all)?
In particular, as far as I understand, the application at hand has no need for the N-D sparse COO arrays provided by the `sparse` package; 2-D would be enough. Furthermore, scikit-learn mostly uses CSR, and while it's relatively easy to convert between COO and CSR/CSC in the non-distributed case, I'm not sure whether that still holds for dask. Then there is the partitioning strategy (see the next section). A minimal sketch of the `sparse`-backed route is below.
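This sketch mirrors the pattern in the dask documentation for swapping dense numpy blocks for `sparse.COO` blocks; it assumes every block can be converted, and uses `da.where` to make the data mostly zeros:

```python
import dask.array as da
import sparse  # mrocklin/sparse

# Dense random array, then zero out most entries to make it sparse.
x = da.random.random((10000, 1000), chunks=(1000, 1000))
x = da.where(x < 0.95, 0, x)

# Swap the dense numpy blocks for 2-D sparse.COO blocks.
s = x.map_blocks(sparse.COO)

# Reductions work block-wise on the sparse representation.
print(s.sum(axis=0).compute().todense())
```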
## Partitioning strategy
At least as far as text vectorizers and incremental-learning estimators are concerned, I imagine it might be easier to partition the arrays row-wise (each chunk spanning the full column width), which would also fit naturally with the CSR format. A sketch of building such row-wise chunks is below.
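A sketch of what row-wise partitioning could look like, assuming each chunk is produced lazily as a scipy CSR matrix and wrapped as a 2-D `sparse.COO` block (`load_chunk` is a made-up stand-in for whatever loads or vectorizes one partition):

```python
import dask
import dask.array as da
import numpy as np
import scipy.sparse
import sparse

n_rows, n_cols = 1000, 50  # rows per chunk; each chunk spans all columns

def load_chunk(i):
    # Made-up stand-in for loading/vectorizing one partition of the data.
    csr = scipy.sparse.random(n_rows, n_cols, density=0.01, format="csr")
    return sparse.COO.from_scipy_sparse(csr)

blocks = [
    da.from_delayed(
        dask.delayed(load_chunk)(i),
        shape=(n_rows, n_cols),
        dtype=np.float64,
        meta=sparse.COO.from_numpy(np.empty((0, 0))),
    )
    for i in range(4)
]
X = da.concatenate(blocks, axis=0)  # (4000, 50), chunked row-wise
print(X.sum(axis=0).compute())
```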
## File storage format
For instance, once someone manages to compute a distributed TF-IDF, the question arises of how to store it on disk without loading everything into memory at once. At present, there doesn't seem to be a canonical way to do this (dask/dask#2562 (comment)). zarr-developers/zarr-python#152 might be relevant, but as far as I understand it essentially stores the dense format with compression, which I believe would make later computation with the data difficult. One possible workaround is sketched below.
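One possible workaround, sketched under the assumption that the result lives in a row-chunked, `sparse.COO`-backed dask array as above: write each chunk to its own file with `scipy.sparse.save_npz`, so the full matrix never has to be materialized at once (the `tfidf/part-*.npz` layout is made up for illustration):

```python
import os
import dask
import dask.array as da
import scipy.sparse
import sparse

# Small stand-in for a computed distributed TF-IDF matrix.
x = da.random.random((4000, 50), chunks=(1000, 50))
X = da.where(x < 0.99, 0, x).map_blocks(sparse.COO)

os.makedirs("tfidf", exist_ok=True)

def save_block(block, path):
    # Each block is a 2-D sparse.COO chunk; store it as scipy CSR in .npz.
    scipy.sparse.save_npz(path, block.tocsr())
    return path

tasks = [
    dask.delayed(save_block)(blk, os.path.join("tfidf", f"part-{i:05d}.npz"))
    for i, blk in enumerate(X.to_delayed().ravel())
]
dask.compute(*tasks)  # one file per chunk; the whole matrix is never in memory
```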
Just a few general thoughts. I'm not sure what your vision of the project is in this respect, @mrocklin @TomAugspurger, nor how much work this would represent or what might be the easiest place to start.
cc @jorisvandenbossche @ogrisel