
Add Lasso · dask-ml · OPEN · 12 comments

dask commented on August 10, 2024
Add Lasso


Comments (12)

TomAugspurger commented on August 10, 2024

I think that

from dask_ml.datasets import make_regression
from dask_glm.regularizers import L1
from dask_glm.estimators import LinearRegression

X, y = make_regression(n_samples=1000, chunks=100)

lr = LinearRegression(regularizer=L1())
lr.fit(X, y)

Is basically correct. I haven't looked at the various options for scikit-learn's Lasso.
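
For reference, the main scikit-learn Lasso options that a wrapper would eventually need to consider look roughly like this (abbreviated; the mapping notes are guesses, not settled design):

from sklearn.linear_model import Lasso

# Key scikit-learn Lasso parameters and how they might map onto dask-glm:
#   alpha          -> regularization strength (dask-glm's `lamduh`)
#   fit_intercept  -> supported by dask-glm estimators
#   max_iter, tol  -> solver controls
#   positive, selection, warm_start -> coordinate-descent details with no obvious dask-glm analogue
lasso = Lasso(alpha=1.0, fit_intercept=True, max_iter=1000, tol=1e-4)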


TomAugspurger commented on August 10, 2024

Note: I think that all the pieces should be in place thanks to dask-glm. This should be a matter of translating the scikit-learn API to a linear regression with dask-glm's L1 regularizer.
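
A minimal sketch of what that translation could look like, assuming the dask-glm estimator accepts regularizer and lamduh keywords as its docs suggest (the class name and the alpha-to-lamduh mapping here are illustrative, not existing dask-ml API):

from dask_glm.estimators import LinearRegression
from dask_glm.regularizers import L1

class Lasso(LinearRegression):
    """Linear regression with an L1 penalty, mirroring sklearn.linear_model.Lasso."""

    def __init__(self, alpha=1.0, **kwargs):
        # dask-glm calls the regularization strength `lamduh`;
        # scikit-learn's Lasso calls it `alpha`.
        super().__init__(regularizer=L1(), lamduh=alpha, **kwargs)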


jakirkham commented on August 10, 2024

Do you have any code snippets that I should look at for trying to do something like this?


jakirkham commented on August 10, 2024

Hmm... so when scikit-learn implements these sorts of things, it seems to support either a vector or a matrix for y. However, it seems that dask-glm only supports a vector for y. Do you know why that is? Would it be possible to change it? If so, how difficult would that be?

Edit: I've migrated this concern to issue #201.


TomAugspurger commented on August 10, 2024


mrocklin commented on August 10, 2024


jakirkham commented on August 10, 2024

Meaning a 2-D ndarray (though it is a fair question). I should add that scikit-learn typically coerces 1-D ndarrays into singleton 2-D ndarrays when 2-D ndarrays are allowed.

Not sure whether squeezing makes sense. More likely, iterating over the 1-D slices and fitting them independently would make sense, which appears to be what scikit-learn is doing. So this should benefit quite nicely from Distributed.
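
A rough sketch of that per-column approach (illustrative only; the stacked second target is fake, and the loop is not existing dask-ml API):

import dask.array as da
from dask_ml.datasets import make_regression
from dask_ml.linear_model import LinearRegression

X, y = make_regression(n_samples=1000, chunks=100)
Y = da.stack([y, 2 * y], axis=1)  # fake a second target column just to get a 2-D y

# Fit one estimator per column of Y; each column is an independent problem,
# which is also how scikit-learn treats multi-output linear models internally.
models = []
for j in range(Y.shape[1]):
    lr = LinearRegression(penalty="l1", C=50)
    lr.fit(X, Y[:, j])
    models.append(lr)

coefs = [m.coef_ for m in models]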


valkmit commented on August 10, 2024

+1, interested in this as well. The provided code

from dask_ml.datasets import make_regression
from dask_glm.regularizers import L1
from dask_glm.estimators import LinearRegression

X, y = make_regression(n_samples=1000, chunks=100)

lr = LinearRegression(regularizer=L1())
lr.fit(X, y)

is missing a way to set the alpha value - the coefficients suggest this is not a proper Lasso regression.

The following example I quickly threw together also doesn't appear to work properly, but it piggybacks on top of Dask GLM's ElasticNet the same way scikit's Lasso runs on top of scikit's ElasticNet.

import dask_glm.algorithms
import dask_glm.families
import dask_glm.regularizers

family = dask_glm.families.Normal()
regularizer = dask_glm.regularizers.ElasticNet(weight=1)  # weight=1 is intended as a pure L1 penalty
b = dask_glm.algorithms.gradient_descent(X=X, y=y, max_iter=100000, family=family,
                                         regularizer=regularizer, alpha=0.01, normalize=False)
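
A hedged guess at why that snippet misbehaves: as far as I can tell, plain gradient descent doesn't use the regularizer and can't handle the non-smooth L1 term anyway. A sketch using one of dask-glm's proximal solvers instead (assuming admm accepts regularizer and lamduh keywords, as the dask-glm docs describe):

import dask_glm.algorithms
import dask_glm.families
from dask_ml.datasets import make_regression

X, y = make_regression(n_samples=1000, chunks=100)

# ADMM handles the non-smooth L1 penalty directly; `lamduh` is dask-glm's
# name for the regularization strength (scikit-learn's alpha).
beta = dask_glm.algorithms.admm(
    X, y,
    regularizer="l1",
    lamduh=0.01,
    family=dask_glm.families.Normal,
    max_iter=100,
)
print(beta)  # fitted coefficients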


stsievert commented on August 10, 2024

Isn't it possible to set the regularization value with the code below?

import numpy as np
from dask_ml.datasets import make_regression
from dask_ml.linear_model import LinearRegression

X, y = make_regression(n_samples=1000, chunks=100)
lr = LinearRegression(penalty="l1", C=1e-6)  # `penalty` selects the regularizer; C is the inverse regularization strength
lr.fit(X, y)
assert np.abs(lr.coef_).max() < 1e-3, "C=1e-6 should produce mostly 0 coefs"

C and alpha/lamduh control the strength of the regularization (but might be inverses of each other).


valkmit commented on August 10, 2024

Isn't it possible to set the regularization value with the code below?


Indeed this is what I was missing, appreciate the pointer!


valkmit commented on August 10, 2024

Given a small C (implying a large alpha), the regression does appear to behave similarly to Lasso. You're also right that C is the inverse of the alpha parameter.

Scikit-learn's documentation says alpha = 1/(2C), where C is the parameter used in its other linear models. So an alpha of 0.01 should correspond to a C of 50.
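
(Spelling out that arithmetic:)

alpha = 0.01
C = 1 / (2 * alpha)  # scikit-learn's Lasso docs: alpha corresponds to 1 / (2C)
print(C)             # 50.0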

However, I compared the outputs of scikit-learn's Lasso and dask-ml's "Lasso" with the following code:

from sklearn.linear_model import Lasso
from dask_ml.datasets import make_regression
from dask_ml.linear_model import LinearRegression

X, y = make_regression(n_samples=1000, chunks=100)
lr = LinearRegression(penalty='l1', C=50, fit_intercept=False)
lr.fit(X, y)

r = Lasso(alpha=0.01, fit_intercept=False)
r.fit(X.compute(), y.compute())

print(lr.coef_)
print(r.coef_)

The coefficients from the dask model fit appear unstable. For very small C, the two sets do look the same.

I'm no ML expert - in fact I'm just slapping some code together - but it seems like there's definitely an inverse relationship, just not one that's 1/2C. Which would be fine, except the performance of dask ml at very small C is several times worse than scikit - about 30x worse, for values of C and alpha that empirically appear to give very similar coefficients.

Is there something else I am missing here? Or is this performance slowdown to be expected?


stsievert commented on August 10, 2024

except the performance of dask ml at very small C is several times worse than scikit - about 30x worse, for values of C and alpha that empirically appear to give very similar coefficients.

What do you mean "30× worse"? I'm not sure I'd expect Dask-ML to give any kind of timing acceleration on a small array.
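
If the 30× figure is wall-clock time, a rough way to measure both fits side by side (illustrative only; results depend heavily on chunk size and scheduler, and scikit-learn will usually win on an array this small):

import time
from sklearn.linear_model import Lasso
from dask_ml.datasets import make_regression
from dask_ml.linear_model import LinearRegression

X, y = make_regression(n_samples=1000, chunks=100)
Xc, yc = X.compute(), y.compute()  # materialize once so scikit-learn isn't charged for it

t0 = time.perf_counter()
LinearRegression(penalty="l1", C=50, fit_intercept=False).fit(X, y)
t_dask = time.perf_counter() - t0

t0 = time.perf_counter()
Lasso(alpha=0.01, fit_intercept=False).fit(Xc, yc)
t_sklearn = time.perf_counter() - t0

print(f"dask-ml: {t_dask:.2f}s  scikit-learn: {t_sklearn:.2f}s")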

C and alpha that empirically appear to give very similar coefficients.

I've verified that C and alpha give very similar coefficients. The two sets of coefficients are very close in relative error, a standard measure in optimization:

import numpy as np
import numpy.linalg as LA

# (continuing from the comparison script above)
rel_error = LA.norm(lr.coef_ - r.coef_) / LA.norm(r.coef_)
print(rel_error)  # 0.00172; very small, so the two vectors are close in Euclidean distance

print(np.abs(r.coef_).max())  # 89.2532; the scikit-learn coefs are large
print(np.abs(lr.coef_ - r.coef_).mean())  # 0.01543; the mean error is small
print(np.abs(lr.coef_ - r.coef_).max())  # 0.10180; the max error is still pretty large
print(np.median(np.abs(lr.coef_ - r.coef_)))  # 0.01077; higher than expected (1e-3 or 1e-4), but fair for debugging

