
Add Lasso · dask-ml · OPEN · 12 comments

dask commented on August 10, 2024
Add Lasso


Comments (12)

TomAugspurger commented on August 10, 2024

I think that

from dask_ml.datasets import make_regression
from dask_glm.regularizers import L1
from dask_glm.estimators import LinearRegression

X, y = make_regression(n_samples=1000, chunks=100)

lr = LinearRegression(regularizer=L1())
lr.fit(X, y)

Is basically correct. I haven't looked at the various options for scikit-learn's Lasso.
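
For reference, the main scikit-learn Lasso options that a wrapper would eventually need to consider look roughly like this (abbreviated; the mapping notes are guesses, not settled design):

from sklearn.linear_model import Lasso

# Key scikit-learn Lasso parameters and how they might map onto dask-glm:
#   alpha          -> regularization strength (dask-glm's `lamduh`)
#   fit_intercept  -> supported by dask-glm estimators
#   max_iter, tol  -> solver controls
#   positive, selection, warm_start -> coordinate-descent details with no obvious dask-glm analogue
lasso = Lasso(alpha=1.0, fit_intercept=True, max_iter=1000, tol=1e-4)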


TomAugspurger commented on August 10, 2024

Note: I think that all the pieces should be in place thanks to dask-glm. This should be a matter of translating the scikit-learn API to a linear regression with dask-glm's L1 regularizer.
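
A minimal sketch of what that translation could look like, assuming the dask-glm estimator accepts regularizer and lamduh keywords as its docs suggest (the class name and the alpha-to-lamduh mapping here are illustrative, not existing dask-ml API):

from dask_glm.estimators import LinearRegression
from dask_glm.regularizers import L1

class Lasso(LinearRegression):
    """Linear regression with an L1 penalty, mirroring sklearn.linear_model.Lasso."""

    def __init__(self, alpha=1.0, **kwargs):
        # dask-glm calls the regularization strength `lamduh`;
        # scikit-learn's Lasso calls it `alpha`.
        super().__init__(regularizer=L1(), lamduh=alpha, **kwargs)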


jakirkham commented on August 10, 2024

Do you have any code snippets that I should look at for trying to do something like this?


jakirkham commented on August 10, 2024

Hmm... so when scikit-learn implements these sorts of things, it seems to support either a vector or a matrix for y. However, it seems that dask-glm only supports a vector for y. Do you know why that is? Would it be possible to change it? If so, how difficult would that be?

Edit: I've migrated this concern to issue #201.


TomAugspurger commented on August 10, 2024


mrocklin commented on August 10, 2024


jakirkham commented on August 10, 2024

Meaning a 2-D ndarray (though it is a fair question). I should add that scikit-learn typically coerces 1-D ndarrays into singleton 2-D ndarrays when 2-D ndarrays are allowed.

Not sure whether squeezing makes sense. More likely, iterating over the 1-D slices and fitting them independently would make sense, which appears to be what scikit-learn is doing. So this should benefit quite nicely from Distributed.
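
A rough sketch of that per-column approach (illustrative only; the stacked second target is fake, and the loop is not existing dask-ml API):

import dask.array as da
from dask_ml.datasets import make_regression
from dask_ml.linear_model import LinearRegression

X, y = make_regression(n_samples=1000, chunks=100)
Y = da.stack([y, 2 * y], axis=1)  # fake a second target column just to get a 2-D y

# Fit one estimator per column of Y; each column is an independent problem,
# which is also how scikit-learn treats multi-output linear models internally.
models = []
for j in range(Y.shape[1]):
    lr = LinearRegression(penalty="l1", C=50)
    lr.fit(X, Y[:, j])
    models.append(lr)

coefs = [m.coef_ for m in models]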


valkmit commented on August 10, 2024

+1, interested in this as well. The provided code

from dask_ml.datasets import make_regression
from dask_glm.regularizers import L1
from dask_glm.estimators import LinearRegression

X, y = make_regression(n_samples=1000, chunks=100)

lr = LinearRegression(regularizer=L1())
lr.fit(X, y)

is missing a way to set the alpha value - the coefficients suggest this is not a proper Lasso regression.

The following example I quickly threw together also doesn't appear to work properly, but it piggybacks on top of Dask GLM's ElasticNet the same way scikit's Lasso runs on top of scikit's ElasticNet.

import dask_glm.algorithms
import dask_glm.families
import dask_glm.regularizers

family = dask_glm.families.Normal()
regularizer = dask_glm.regularizers.ElasticNet(weight=1)  # weight=1 is intended as a pure L1 penalty
b = dask_glm.algorithms.gradient_descent(X=X, y=y, max_iter=100000, family=family,
                                         regularizer=regularizer, alpha=0.01, normalize=False)
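
A hedged guess at why that snippet misbehaves: as far as I can tell, plain gradient descent doesn't use the regularizer and can't handle the non-smooth L1 term anyway. A sketch using one of dask-glm's proximal solvers instead (assuming admm accepts regularizer and lamduh keywords, as the dask-glm docs describe):

import dask_glm.algorithms
import dask_glm.families
from dask_ml.datasets import make_regression

X, y = make_regression(n_samples=1000, chunks=100)

# ADMM handles the non-smooth L1 penalty directly; `lamduh` is dask-glm's
# name for the regularization strength (scikit-learn's alpha).
beta = dask_glm.algorithms.admm(
    X, y,
    regularizer="l1",
    lamduh=0.01,
    family=dask_glm.families.Normal,
    max_iter=100,
)
print(beta)  # fitted coefficients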


stsievert commented on August 10, 2024

Isn't it possible to set the regularization value with the code below?

import numpy as np
from dask_ml.datasets import make_regression
from dask_ml.linear_model import LinearRegression

X, y = make_regression(n_samples=1000, chunks=100)
lr = LinearRegression(penalty="l1", C=1e-6)  # `penalty` selects the regularizer; C is the inverse regularization strength
lr.fit(X, y)
assert np.abs(lr.coef_).max() < 1e-3, "C=1e-6 should produce mostly 0 coefs"

C and alpha/lamduh control the strength of the regularization (but might be inverses of each other).


valkmit commented on August 10, 2024

Isn't it possible to set the regularization value with the code below?


Indeed this is what I was missing, appreciate the pointer!


valkmit commented on August 10, 2024

Given a small C (implying a large alpha), the regression does appear to behave similarly to Lasso. You're also right that C is the inverse of the alpha parameter.

Scikit-learn's documentation says alpha = 1/(2C), where C is the parameter used in its other linear models. So an alpha of 0.01 should correspond to a C of 50.
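
(Spelling out that arithmetic:)

alpha = 0.01
C = 1 / (2 * alpha)  # scikit-learn's Lasso docs: alpha corresponds to 1 / (2C)
print(C)             # 50.0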

However, I compared the outputs of scikit-learn's Lasso and dask-ml's "Lasso" with the following code:

from sklearn.linear_model import Lasso
from dask_ml.datasets import make_regression
from dask_ml.linear_model import LinearRegression

X, y = make_regression(n_samples=1000, chunks=100)
lr = LinearRegression(penalty='l1', C=50, fit_intercept=False)
lr.fit(X, y)

r = Lasso(alpha=0.01, fit_intercept=False)
r.fit(X.compute(), y.compute())

print(lr.coef_)
print(r.coef_)

The coefficients from the dask model fit appear unstable. For very small C, the two sets do look the same.

I'm no ML expert - in fact I'm just slapping some code together - but it seems like there's definitely an inverse relationship, just not one that's 1/2C. Which would be fine, except the performance of dask ml at very small C is several times worse than scikit - about 30x worse, for values of C and alpha that empirically appear to give very similar coefficients.

Is there something else I am missing here? Or is this performance slowdown to be expected?


stsievert commented on August 10, 2024

except the performance of dask ml at very small C is several times worse than scikit - about 30x worse, for values of C and alpha that empirically appear to give very similar coefficients.

What do you mean "30× worse"? I'm not sure I'd expect Dask-ML to give any kind of timing acceleration on a small array.
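
If the 30× figure is wall-clock time, a rough way to measure both fits side by side (illustrative only; results depend heavily on chunk size and scheduler, and scikit-learn will usually win on an array this small):

import time
from sklearn.linear_model import Lasso
from dask_ml.datasets import make_regression
from dask_ml.linear_model import LinearRegression

X, y = make_regression(n_samples=1000, chunks=100)
Xc, yc = X.compute(), y.compute()  # materialize once so scikit-learn isn't charged for it

t0 = time.perf_counter()
LinearRegression(penalty="l1", C=50, fit_intercept=False).fit(X, y)
t_dask = time.perf_counter() - t0

t0 = time.perf_counter()
Lasso(alpha=0.01, fit_intercept=False).fit(Xc, yc)
t_sklearn = time.perf_counter() - t0

print(f"dask-ml: {t_dask:.2f}s  scikit-learn: {t_sklearn:.2f}s")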

C and alpha that empirically appear to give very similar coefficients.

I've verified that C and alpha give very similar coefficients. The two sets of coefficients are very close in relative error, a standard measure in optimization:

import numpy as np
import numpy.linalg as LA

# (continuing from the comparison script above)
rel_error = LA.norm(lr.coef_ - r.coef_) / LA.norm(r.coef_)
print(rel_error)  # 0.00172; very small, so the two vectors are close in Euclidean distance

print(np.abs(r.coef_).max())  # 89.2532; the scikit-learn coefs are large
print(np.abs(lr.coef_ - r.coef_).mean())  # 0.01543; the mean error is small
print(np.abs(lr.coef_ - r.coef_).max())  # 0.10180; the max error is still pretty large
print(np.median(np.abs(lr.coef_ - r.coef_)))  # 0.01077; higher than expected (1e-3 or 1e-4), but fair for debugging

