While sparse arrays are supported in dask, this issue aims to open the discussion on how they could be used in the context of dask-ml.
In particular, even if #5 about TF-IDF gets resolved, the estimators downstream in the pipeline would also need to support sparse arrays for this to be of any use. The simplest example of such a pipeline could for instance be #115: a text vectorizer feeding a wrapped out-of-core scikit-learn model (e.g. `PartialMultinomialNB`). A rough sketch of that data flow is below.
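To make the target concrete, here is a minimal sketch of such a pipeline using plain dask.bag plus scikit-learn's stateless `HashingVectorizer` and `MultinomialNB.partial_fit`, rather than any dask-ml wrapper (the wrapper names and APIs in dask-ml may differ; this only illustrates the shape of the pipeline):

```python
import dask.bag as db
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy corpus split into partitions, standing in for a large text dataset.
texts = db.from_sequence(
    ["the quick brown fox", "lorem ipsum dolor", "dask and sparse arrays"] * 100,
    npartitions=4,
)
labels = [0, 1, 0] * 100  # toy labels, aligned with the texts above

# HashingVectorizer is stateless, so it can run independently per partition;
# alternate_sign=False keeps features non-negative for MultinomialNB.
vec = HashingVectorizer(alternate_sign=False)

# Each partition becomes one scipy.sparse CSR block (row-wise chunks).
X_parts = texts.map_partitions(lambda docs: [vec.transform(list(docs))]).compute()

# Incremental fit over the sparse chunks, one partition at a time.
clf = MultinomialNB()
offset = 0
for X in X_parts:
    y = labels[offset:offset + X.shape[0]]
    clf.partial_fit(X, y, classes=[0, 1])
    offset += X.shape[0]
```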
## Potentially relevant estimators

`TruncatedSVD`, text vectorizers, some estimators in `dask_ml.preprocessing`, and wrapped scikit-learn models that natively support incremental learning and sparse arrays.
## Sparse array format

There are several choices here:
- should mrocklin/sparse be added as a hard dependency, as the package whose sparse arrays work with dask out of the box?
- should scipy.sparse matrices be wrapped to make them compatible in some limited fashion (if that is possible at all)?
In particular, as far as I understand, the application at hand has no need for the N-D sparse COO arrays provided by the `sparse` package; 2-D would be enough. Furthermore, scikit-learn mostly uses CSR, and while it's relatively easy to convert between COO and CSR/CSC in the non-distributed case, I'm not sure whether that still holds for dask. Then there is the partitioning strategy (see the next section). A minimal sketch of the `sparse`-backed route is below.
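This sketch mirrors the pattern in the dask documentation for swapping dense numpy blocks for `sparse.COO` blocks; it assumes every block can be converted, and uses `da.where` to make the data mostly zeros:

```python
import dask.array as da
import sparse  # mrocklin/sparse

# Dense random array, then zero out most entries to make it sparse.
x = da.random.random((10000, 1000), chunks=(1000, 1000))
x = da.where(x < 0.95, 0, x)

# Swap the dense numpy blocks for 2-D sparse.COO blocks.
s = x.map_blocks(sparse.COO)

# Reductions work block-wise on the sparse representation.
print(s.sum(axis=0).compute().todense())
```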
## Partitioning strategy
At least as far as text vectorizers and incremental-learning estimators are concerned, I imagine it might be easier to partition the arrays row-wise (each chunk spanning the full column width), which would also fit naturally with the CSR format. A sketch of building such row-wise chunks is below.
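A sketch of what row-wise partitioning could look like, assuming each chunk is produced lazily as a scipy CSR matrix and wrapped as a 2-D `sparse.COO` block (`load_chunk` is a made-up stand-in for whatever loads or vectorizes one partition):

```python
import dask
import dask.array as da
import numpy as np
import scipy.sparse
import sparse

n_rows, n_cols = 1000, 50  # rows per chunk; each chunk spans all columns

def load_chunk(i):
    # Made-up stand-in for loading/vectorizing one partition of the data.
    csr = scipy.sparse.random(n_rows, n_cols, density=0.01, format="csr")
    return sparse.COO.from_scipy_sparse(csr)

blocks = [
    da.from_delayed(
        dask.delayed(load_chunk)(i),
        shape=(n_rows, n_cols),
        dtype=np.float64,
        meta=sparse.COO.from_numpy(np.empty((0, 0))),
    )
    for i in range(4)
]
X = da.concatenate(blocks, axis=0)  # (4000, 50), chunked row-wise
print(X.sum(axis=0).compute())
```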
## File storage format
For instance, once someone manages to compute a distributed TF-IDF, the question arises of how to store it on disk without loading everything into memory at once. At present, there doesn't seem to be a canonical way to do this (dask/dask#2562 (comment)). zarr-developers/zarr-python#152 might be relevant, but as far as I understand it essentially stores the dense format with compression, which I believe would make later computation with the data difficult. One possible workaround is sketched below.
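One possible workaround, sketched under the assumption that the result lives in a row-chunked, `sparse.COO`-backed dask array as above: write each chunk to its own file with `scipy.sparse.save_npz`, so the full matrix never has to be materialized at once (the `tfidf/part-*.npz` layout is made up for illustration):

```python
import os
import dask
import dask.array as da
import scipy.sparse
import sparse

# Small stand-in for a computed distributed TF-IDF matrix.
x = da.random.random((4000, 50), chunks=(1000, 50))
X = da.where(x < 0.99, 0, x).map_blocks(sparse.COO)

os.makedirs("tfidf", exist_ok=True)

def save_block(block, path):
    # Each block is a 2-D sparse.COO chunk; store it as scipy CSR in .npz.
    scipy.sparse.save_npz(path, block.tocsr())
    return path

tasks = [
    dask.delayed(save_block)(blk, os.path.join("tfidf", f"part-{i:05d}.npz"))
    for i, blk in enumerate(X.to_delayed().ravel())
]
dask.compute(*tasks)  # one file per chunk; the whole matrix is never in memory
```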
Just a few general thoughts. I'm not sure what your vision of the project is in this respect, @mrocklin @TomAugspurger, nor how much work this would represent or what might be the easiest place to start.
cc @jorisvandenbossche @ogrisel