
scikit-learn-contrib / skglm

Fast and modular sklearn replacement for generalized linear models

Home Page: http://contrib.scikit-learn.org/skglm

License: BSD 3-Clause "New" or "Revised" License

Python 100.00%

skglm's Introduction

scikit-learn-contrib

scikit-learn-contrib is a GitHub organization gathering high-quality scikit-learn-compatible projects. It also provides a template for establishing new scikit-learn-compatible projects.

Vision

With the explosion in the number of machine learning papers, it has become increasingly difficult for users and researchers to implement and compare algorithms. Even when authors release their software, it takes time to learn how to use it and how to apply it to one's own problems. The goal of scikit-learn-contrib is to provide easy-to-install, easy-to-use, high-quality machine learning software. With scikit-learn-contrib, users can install a project with pip install sklearn-contrib-project-name and immediately try it on their data with the usual fit, predict and transform methods. In addition, projects are compatible with scikit-learn tools such as grid search, pipelines, etc.

Projects

If you would like to include your own project in scikit-learn-contrib, take a look at the workflow.

A simple but efficient density-based clustering algorithm that can find clusters of arbitrary sizes, shapes and densities in two dimensions. Higher-dimensional data are first reduced to 2-D using t-SNE. The algorithm relies on a single parameter K, the number of nearest neighbors.

Read The Docs, Read the Paper

Maintained by: Mohamed Abbas

Large-scale linear classification, regression and ranking.

Maintained by Mathieu Blondel and Fabian Pedregosa.

Fast and modular Generalized Linear Models with support for models missing in scikit-learn.

Maintained by Mathurin Massias, Pierre-Antoine Bannier, Quentin Klopfenstein and Quentin Bertrand.

A Python implementation of Jerome Friedman's Multivariate Adaptive Regression Splines.

Maintained by Jason Rudy and Mehdi.

Python module to perform under-sampling and over-sampling with various techniques.

Maintained by Guillaume Lemaitre, Fernando Nogueira, Dayvid Oliveira and Christos Aridas.

Factorization machines and polynomial networks for classification and regression in Python.

Maintained by Vlad Niculae.

Confidence intervals for scikit-learn forest algorithms.

Maintained by Ariel Rokem, Kivan Polimis and Bryna Hazelton.

A high performance implementation of HDBSCAN clustering.

Maintained by Leland McInnes, jc-healy, c-north and Steve Astels.

A library of sklearn-compatible categorical variable encoders.

Maintained by Will McGinnis and Paul Westenthanner.

Python implementations of the Boruta all-relevant feature selection method.

Maintained by Daniel Homola.

Pandas integration with sklearn.

Maintained by Israel Saeta Pérez.

Machine learning with logical rules in Python.

Maintained by Florian Gardin, Ronan Gautier, Nicolas Goix and Jean-Matthieu Schertzer.

A Python implementation of the stability selection feature selection algorithm.

Maintained by Thomas Huijskens.

Metric learning algorithms in Python.

Maintained by CJ Carey, Yuan Tang, William de Vazelhes, Aurélien Bellet and Nathalie Vauquier.

skglm's People

Contributors

badr-moufad, borispf, cool-rr, enlai111, jasperlamm, jjerphan, klopfe, mathurinm, pabannier, pascalcarrivain, qb3, tomaszkacprzak


skglm's Issues

How to control number of cores?

I apologize for opening an issue that isn't a real bug.

I need to control the number of cores used when I fit a model.
I've tried the following things:
(1)

from numba import set_num_threads
set_num_threads(1)

(2)

from numba import config
config.NUMBA_NUM_THREADS = 1

(3) Going through the source code and replacing every instance of @njit with @njit(parallel=False).

No matter what I try, I see my core usage shoot up to 128.
I'm on a cluster of machines connected by a slow interconnect, so this core usage actually makes the model run very slowly.
Does anybody understand how to control this?
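A hedged suggestion (not from the issue thread; the environment variables below are standard Numba/BLAS settings, nothing skglm-specific): thread counts are usually easiest to pin via environment variables set before any import, since both Numba and the BLAS backing NumPy read them at import time.

import os
os.environ["NUMBA_NUM_THREADS"] = "1"     # threads used by Numba's parallel regions
os.environ["OMP_NUM_THREADS"] = "1"       # threads used by OpenMP-based BLAS
os.environ["OPENBLAS_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"

import numpy as np                        # import only after the variables are set
from skglm.estimators import Lasso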

ENH - slow ``gram_cd_solver`` when fitted on sparse dataset

As investigated in #59 through timing and benchmarks, running gram_cd_solver on a sparse dataset comes with a large overhead from computing the Gram matrix, as opposed to the dense case.

[Figure: gram_cd_solver timing, sparse case]

[Figure: gram_cd_solver timing, dense case]

It seems this has more to do with scipy sparse matrix multiplication. More details about it are available (explanation, code snippets).

It would be beneficial to have a more efficient way to compute the Gram matrix in the sparse case, as it would drastically speed up the solver.
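A small timing sketch (assumed data shapes, not the benchmark from #59) illustrating the Gram matrix computation in the dense and sparse cases:

import time
import numpy as np
from scipy import sparse

X_dense = np.random.rand(1_000, 2_000)
X_sparse = sparse.csc_matrix(X_dense * (X_dense > 0.9))   # ~10% non-zero entries

for name, X in [("dense", X_dense), ("sparse", X_sparse)]:
    t0 = time.time()
    gram = X.T @ X                                        # the Gram matrix used by the solver
    print(f"{name}: Gram computation took {time.time() - t0:.3f} s")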

BUG - Ill-conditioned residual matrix in AA can lead to wasted computations

As AA has been refactored into a class, we should make sure to catch a sneaky bug that can make AA fail silently.
I have already stumbled upon cases where the residual matrix is ill-conditioned, which makes the solution of the quadratic problem associated with AA take extremely large positive and extremely negative values that cancel out.
This bug is not caught by the try/except block that watches for LinAlgError, and computational time is wasted.

We should make sure to catch this bug in the AA class.

See https://github.com/scikit-learn-contrib/skglm/blob/main/skglm/solvers/cd_solver.py#L295 for a more detailed explanation.
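A hedged sketch of the kind of guard discussed above (illustrative names, not the skglm implementation): besides catching LinAlgError, check the conditioning of the residuals' Gram matrix and skip the extrapolation when it is numerically unreliable.

import numpy as np

def anderson_coefficients(residuals, cond_threshold=1e12):
    # residuals: (K, n) array holding the last K residuals
    gram = residuals @ residuals.T
    if np.linalg.cond(gram) > cond_threshold:
        return None                                   # caller falls back to the plain iterate
    coefs = np.linalg.solve(gram, np.ones(gram.shape[0]))
    return coefs / coefs.sum()                        # extrapolation weights summing to 1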

Considerable overhead when fitting with ``Lasso``

Description

We observe a considerable overhead when fitting the Lasso estimator (profiling screenshot omitted).

To reproduce, see the benchopt Lasso benchmark repository.

Investigation

After investigation, this overhead turns out to come from the computation of global_lipschitz:

n_features = X.shape[1]
self.global_lipschitz = norm(X, ord=2) ** 2 / len(y)
self.lipschitz = np.zeros(n_features, dtype=X.dtype)

which we introduced when adding the FISTA solver (#91).

Potential fix

The global_lipschitz is only relevant for the FISTA solver. Hence, it should be computed only in this case.
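A minimal sketch of the proposed fix (hypothetical helper name, not the skglm API): only pay for the spectral norm when the solver actually needs global_lipschitz, e.g. FISTA.

import numpy as np
from numpy.linalg import norm

def init_lipschitz(X, y, needs_global=False):
    # per-feature constants are filled later by the datafit, as before
    lipschitz = np.zeros(X.shape[1], dtype=X.dtype)
    # the expensive spectral-norm computation only happens on demand
    global_lipschitz = norm(X, ord=2) ** 2 / len(y) if needs_global else None
    return lipschitz, global_lipschitz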

ENH Create a solver dispatcher

Currently, users are not explicitly warned when an unsupported solver is used with some penalties or datafits. We should write some form of mechanism (a dispatcher?) that raises an exception when the combination (datafit, penalty, solver) is unsupported.

E.g., for #100 we should currently raise an exception that is explicit for the user.
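A rough sketch of what such a dispatcher could look like (the whitelist entries below are illustrative, not an authoritative compatibility table):

SUPPORTED = {
    ("Quadratic", "L1", "cd_solver"),
    ("Quadratic", "WeightedL1", "cd_solver"),
    ("Quadratic", "L1", "fista"),
    # ...
}

def check_supported(datafit, penalty, solver_name):
    key = (type(datafit).__name__, type(penalty).__name__, solver_name)
    if key not in SUPPORTED:
        raise ValueError(
            f"Unsupported combination {key}. "
            f"Supported combinations are: {sorted(SUPPORTED)}"
        )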

Release skglm 0.2

We have made some major API changes and added some cool solvers; it deserves a release.

We need to close first:

and maybe #53

FEAT Add line search for FISTA solver

Currently, FISTA relies on a "global" Lipschitz constant. For non-Lipschitz datafits (e.g. Poisson), this is a problem, hence the need to support line search.
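A minimal backtracking sketch for a proximal gradient (ISTA/FISTA) step, assuming hypothetical datafit.value, datafit.gradient and penalty.prox methods for illustration only:

import numpy as np

def backtracking_prox_step(w, X, y, datafit, penalty, step=1.0, shrink=0.5, max_backtracks=20):
    # one proximal gradient step with a backtracking line search
    Xw = X @ w
    f_w = datafit.value(y, w, Xw)
    grad = datafit.gradient(X, y, Xw)            # gradient of the datafit at w
    for _ in range(max_backtracks):
        z = penalty.prox(w - step * grad, step)  # proximal gradient candidate
        diff = z - w
        # accept the step if the quadratic upper bound holds at z
        if datafit.value(y, z, X @ z) <= f_w + grad @ diff + diff @ diff / (2 * step):
            break
        step *= shrink                           # otherwise shrink and retry
    return z, step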

MNT sklearn 1.2dev breaks estimator fitting

Validation of parameters at fit time seems to be the issue: parameters which should be set by the parent class of, e.g., MCPRegression are not returned by get_params(), so clf._validate_params() fails:

FAILED skglm/tests/test_estimators.py::test_check_estimator[Lasso] - ValueError: The parameter constraints ['alpha', 'fit_intercept', 'normalize', 'precompute', 'max_iter', 'copy_X', 'tol', 'war...
FAILED skglm/tests/test_estimators.py::test_check_estimator[wLasso] - ValueError: The parameter constraints ['alpha', 'fit_intercept', 'normalize', 'precompute', 'max_iter', 'copy_X', 'tol', 'wa...
FAILED skglm/tests/test_estimators.py::test_check_estimator[ElasticNet] - ValueError: The parameter constraints ['alpha', 'l1_ratio', 'fit_intercept', 'normalize', 'precompute', 'max_iter', 'cop...
FAILED skglm/tests/test_estimators.py::test_check_estimator[MCP] - ValueError: The parameter constraints ['alpha', 'fit_intercept', 'normalize', 'precompute', 'max_iter', 'copy_X', 'tol', 'warm_...
FAILED skglm/tests/test_estimators.py::test_estimator[X0-Lasso] - ValueError: The parameter constraints ['alpha', 'fit_intercept', 'normalize', 'precompute', 'max_iter', 'copy_X', 'tol', 'warm_s...
FAILED skglm/tests/test_estimators.py::test_estimator[X0-wLasso] - ValueError: The parameter constraints ['alpha', 'fit_intercept', 'normalize', 'precompute', 'max_iter', 'copy_X', 'tol', 'warm_...
FAILED skglm/tests/test_estimators.py::test_estimator[X0-ElasticNet] - ValueError: The parameter constraints ['alpha', 'l1_ratio', 'fit_intercept', 'normalize', 'precompute', 'max_iter', 'copy_X...
FAILED skglm/tests/test_estimators.py::test_estimator[X0-MCP] - ValueError: The parameter constraints ['alpha', 'fit_intercept', 'normalize', 'precompute', 'max_iter', 'copy_X', 'tol', 'warm_sta...
FAILED skglm/tests/test_estimators.py::test_estimator[X1-Lasso] - ValueError: The parameter constraints ['alpha', 'fit_intercept', 'normalize', 'precompute', 'max_iter', 'copy_X', 'tol', 'warm_s...
FAILED skglm/tests/test_estimators.py::test_estimator[X1-wLasso] - ValueError: The parameter constraints ['alpha', 'fit_intercept', 'normalize', 'precompute', 'max_iter', 'copy_X', 'tol', 'warm_...
FAILED skglm/tests/test_estimators.py::test_estimator[X1-ElasticNet] - ValueError: The parameter constraints ['alpha', 'l1_ratio', 'fit_intercept', 'normalize', 'precompute', 'max_iter', 'copy_X...
FAILED skglm/tests/test_estimators.py::test_estimator[X1-MCP] - ValueError: The parameter constraints ['alpha', 'fit_intercept', 'normalize', 'precompute', 'max_iter', 'copy_X', 'tol', 'warm_sta...
FAILED skglm/tests/test_estimators.py::test_generic_estimator[Quadratic-L1-False-Lasso-pen_args0] - ValueError: The parameter constraints ['alpha', 'fit_intercept', 'normalize', 'precompute', 'm...
FAILED skglm/tests/test_estimators.py::test_generic_estimator[Quadratic-WeightedL1-False-WeightedLasso-pen_args1] - ValueError: The parameter constraints ['alpha', 'fit_intercept', 'normalize', ...
FAILED skglm/tests/test_estimators.py::test_generic_estimator[Quadratic-L1_plus_L2-False-ElasticNet-pen_args2] - ValueError: The parameter constraints ['alpha', 'l1_ratio', 'fit_intercept', 'nor...
FAILED skglm/tests/test_estimators.py::test_generic_estimator[Quadratic-MCPenalty-False-MCPRegression-pen_args3] - ValueError: The parameter constraints ['alpha', 'fit_intercept', 'normalize', '...
FAILED skglm/tests/test_estimators.py::test_grid_search[Lasso] - ValueError: 
FAILED skglm/tests/test_estimators.py::test_grid_search[wLasso] - ValueError: 
FAILED skglm/tests/test_estimators.py::test_grid_search[ElasticNet] - ValueError: 
FAILED skglm/tests/test_estimators.py::test_grid_search[MCP] - ValueError: 
FAILED skglm/tests/test_group.py::test_equivalence_lasso - ValueError: The parameter constraints ['alpha', 'fit_intercept', 'normalize', 'precompute', 'max_iter', 'copy_X', 'tol', 'warm_start', ...
FAILED skglm/tests/test_group.py::test_vs_celer_grouplasso[15-50-True] - ValueError: The parameter constraints ['alpha', 'fit_intercept', 'normalize', 'precompute', 'max_iter', 'copy_X', 'tol', ...
FAILED skglm/tests/test_group.py::test_vs_celer_grouplasso[5-50-False] - ValueError: The parameter constraints ['alpha', 'fit_intercept', 'normalize', 'precompute', 'max_iter', 'copy_X', 'tol', ...
FAILED skglm/tests/test_group.py::test_vs_celer_grouplasso[19-59-False] - ValueError: The parameter constraints ['alpha', 'fit_intercept', 'normalize', 'precompute', 'max_iter', 'copy_X', 'tol',...
================================================================ 24 failed, 112 passed, 1 xfailed, 16 warnings in 114.40s (0:01:54) ================================================

Reproduce by

    clf = MCPRegression(
        alpha=alpha, gamma=np.inf, fit_intercept=False, tol=tol)
    print(clf.get_params())
    clf.fit(X, y)

in the setup of test_estimators.py

ENH Cache Numba compilation for the user

import time
import numpy as np
from numpy.linalg import norm
from sklearn.linear_model import Lasso as Lasso_sk
from skglm.estimators import Lasso

n_samples = 100
n_features = 10_000

X = np.random.normal(0, 1, (n_samples, n_features))
y = np.random.normal(0, 1, (n_samples,))

alpha_max = norm(X.T @ y, ord=np.inf) / n_samples
alpha = alpha_max * 0.1

start = time.time()
clf = Lasso(alpha).fit(X, y)
print("skglm:", time.time() - start)

start = time.time()
clf = Lasso_sk(alpha).fit(X, y)
print("sklearn:", time.time() - start)

This script gives:

skglm: 4.0232319831848145
sklearn: 0.2305459976196289

This is due to the compilation cost. We should cache this compilation once and for all, ideally during install (by pre-building/pre-compiling the IR generated by Numba) or when the user first runs a script, using njit(cache=True).
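A minimal sketch of the njit(cache=True) route: decorating a jitted kernel this way writes the compiled function to disk next to the source, so subsequent runs skip recompilation (this applies to plain @njit functions; caching jitclasses has more caveats). The kernel below is illustrative, not skglm's actual code.

import numpy as np
from numba import njit

@njit(cache=True)
def soft_threshold(x, tau):
    # elementwise soft-thresholding, a typical inner kernel of a CD solver
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)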

ENH support fit_intercept

Fitting an intercept can drastically change the size of the support and the time it takes to fit:

import numpy as np
import time
from libsvmdata import fetch_libsvm
X, y = fetch_libsvm("finance")

from celer import Lasso

for fit_intercept in [False, True]:
    alpha_max = np.max(np.abs(X.T @ (y - fit_intercept*np.mean(y)))) / len(y)
    clf = Lasso(alpha_max / 200, fit_intercept=fit_intercept, verbose=1)

    t0 = time.time()
    clf.fit(X, y)
    t1 = time.time()
    print(f"### fit_intercept={fit_intercept}, time to fit {t1-t0:.2f} s, supp size: {(clf.coef_ != 0).sum()}")

To do this without pain, my idea is to add, at the end of each CD epoch, a gradient step for the intercept in the solver. That is like adding a column full of ones to X, but without copying it.

One needs to add a way to compute the Lipschitz constant of this step; I do not see another solution than to make it an attribute of the datafit (e.g. 1 for Quadratic, 1/4 for Logistic).
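A minimal sketch of the proposed intercept step for the Quadratic datafit (hypothetical function, names for illustration): the intercept behaves like an extra all-ones column of X, whose Lipschitz constant is 1.

import numpy as np

def intercept_update(y, Xw, intercept):
    # gradient of 1/(2 n) * ||y - Xw - intercept||^2 w.r.t. the intercept
    grad = -np.mean(y - Xw - intercept)
    lipschitz = 1.0                        # 1 for Quadratic, 1/4 for Logistic
    return intercept - grad / lipschitz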

ENH add weighted ridge CD/FISTA solver

This probably calls for a non-working-set solver, or we can work something out with is_penalized (which could be renamed yields_sparse_coefs or similar): if it is not true, we should always include the feature in the working set.

MNT - Merge ``Lasso`` and ``WeightedLasso`` estimators

Following #110 (comment),

Lasso and WeightedLasso estimators share the same implementation logic.
How weights=None is handled in the WeightedLasso estimator hints at merging the two estimators by adding a weights argument to Lasso.

To achieve that

  • remove WeightedLasso
  • add a weights argument in Lasso estimator
  • handle the weights argument in the path and fit methods, e.g. as in the sketch below
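A rough sketch (not the actual skglm estimator) of how Lasso could absorb WeightedLasso through an optional weights argument, with None meaning unit weights:

import numpy as np

class Lasso:
    def __init__(self, alpha=1.0, weights=None):
        self.alpha = alpha
        self.weights = weights

    def fit(self, X, y):
        n_features = X.shape[1]
        weights = np.ones(n_features) if self.weights is None else self.weights
        # ... pass `weights` down to the path/solver as per-feature penalty weights ...
        return self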

ENH skglm import time is high

In [1]: import time

In [2]: t0 = time.time()
   ...: import skglm
   ...: 
   ...: print(time.time() - t0)
0.7723352909088135

I thought this was due to Numba compilation, but it is not fixed by #44. Importing Lasso from sklearn.linear_model takes about 1e-5 s, which is strange.

CLN remove seaborn dependency

We use it only to define C_LIST in utils.py; it's not worth having a dependency for that.

For the doc, it can be installed directly on Circle CI together with numpydoc and other doc-only packages.

BUG - code breaks down without throwing any error when calling ``cd_solver``

Script to reproduce

import numpy as np
from skglm.datafits import Quadratic
from skglm.penalties import L1
from skglm.solvers.cd_solver import cd_solver
from skglm.utils import make_correlated_data

n_samples, n_features = 10, 50
X, y, _ = make_correlated_data(n_samples, n_features, random_state=0)
alpha_max = np.linalg.norm(X.T @ y, ord=np.inf) / n_samples

datafit = Quadratic()
penalty = L1(alpha=alpha_max / 10.)

w = np.zeros(n_features)
Xw = np.zeros(n_samples)
cd_solver(X, y, datafit, penalty, w, Xw)

print("This line will never be printed")
print("This one also will never be printed")
print("...")

This is a dangerous behavior as the code breaks without giving any hint on what's wrong (even with a debugger).

source of the bug and a potential fix

After debugging, I found out that the problem stems from passing a non-initialized datafit to cd_solver. In fact, the solver attempts to access the Lipschitz constants of the datafit to compute the gradients, and these do not exist.

A natural way to fix this bug is to make cd_solver private, to prevent it from being called directly from the outside.
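An alternative, sketched as a hypothetical wrapper (attribute names are illustrative): validate the datafit before entering the jitted solver, so the failure becomes an explicit Python exception instead of a silent crash inside Numba.

from skglm.solvers.cd_solver import cd_solver

def checked_cd_solver(X, y, datafit, penalty, w, Xw, **kwargs):
    if not hasattr(datafit, "Xty"):   # attribute expected to be set by datafit.initialize(X, y)
        raise ValueError(
            "The datafit is not initialized; call datafit.initialize(X, y) "
            "before passing it to the solver."
        )
    return cd_solver(X, y, datafit, penalty, w, Xw, **kwargs)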

Reflections

Not having the Lipschitz constants does not explain why the code breaks without throwing any error. I suspect a bug in Numba compilation; here is a small script to reproduce it:

import numpy as np
from numba import float64
from numba.experimental import jitclass

@jitclass([('Xty', float64[:])])
class Quadratic:
    def __init__(self):
        # note: Xty is declared in the spec but never assigned
        pass

    def gradient_scalar(self, X, Xw, j):
        # accessing the never-assigned self.Xty is what triggers the silent crash
        return (X[:, j] @ Xw - self.Xty[j]) / len(Xw)

X = np.ones((2, 5))
Xw = np.zeros(2)

Quadratic().gradient_scalar(X, Xw, 0)

print("This line will never be printed")

ENH Issue a warning when solution found is null

Most non-expert users might not be familiar with the rescaling of alpha by n_samples. Issuing a warning when alpha > alpha_max, giving hints to the user (for instance by printing alpha_max), and informing them of the scaling by n_samples would be a valuable addition.
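A minimal sketch of such a warning, assuming a Lasso-like estimator where alpha_max = ||X^T y||_inf / n_samples is the smallest alpha yielding an all-zero solution (hypothetical helper, to be called inside fit):

import warnings
import numpy as np

def warn_if_null_solution(coef, X, y, alpha):
    if not np.any(coef):
        alpha_max = np.linalg.norm(X.T @ y, ord=np.inf) / len(y)
        warnings.warn(
            f"The solution is exactly 0. alpha={alpha:.3e} may be too large: "
            f"alpha_max={alpha_max:.3e} (note the scaling of alpha by n_samples)."
        )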

ENH/BUG? check_estimator has some very adversarial cases for SVC

To reproduce, run this in test_estimators.py:

    estimator_name = "SVC"

    clf = clone(dict_estimators_ours[estimator_name])
    clf.verbose = 2
    clf.tol = 1e-6  # failure in float32 computation otherwise
    # if isinstance(clf, WeightedLasso):
    #     clf.weights = None
    check_estimator(clf)

square-root LASSO implementation

As suggested by @agramfort, I am opening an issue here (original) to make a feature request for the square-root LASSO.

Describe the workflow you want to enable

The unconstrained LASSO is currently implemented in lasso_path. However, the parameter for unconstrained LASSO is sensitive to the measurement noise. Instead, the square-root LASSO has objective norm(Ax-b) + alpha * norm(x, 1) and its parameter, alpha, is insensitive to the noise level. I would like an sklearn interface (with fit, predict, etc.) coupled with a lasso_path-like function, srlasso_path, which solves the square-root LASSO optimization program.

Describe your proposed solution

I propose an sklearn-friendly wrapper around the cvxpy or cvxopt interface. More ambitiously, I propose an implementation of a fast method for solving the square-root LASSO based on [1] or [2] (either via (accelerated) PGD or ADMM).

[1] "On fast convergence of proximal algorithms for SQRT-LASSO optimization: Don’t worry about its nonsmooth loss function" by Li, Xinguo, Jiang, Haoming, et al. (2020) PMLR.
[2] "The flare package for high dimensional linear regression and precision matrix estimation in R" by Li, Xingguo, Zhao, Tuo, Yuan, Xiaoming and Liu, Han. (2015) JMLR.

Describe alternatives you've considered, if relevant

Currently, statsmodels offers a square-root LASSO solver implemented via the Python package cvxopt; this approach is slow. Using cvxpy with another interior-point solver like MOSEK offers only a 2x speed-up (it is glacially slow compared to lasso_path).

Separately, if the problem is appropriately regularized to avoid over-fitting, some work has shown that it is possible to use accelerated proximal gradient descent to achieve convergence. However, the paper leaves several critical implementation details unspecified (referring to [1] above).

Additional context

Note that the characterization of the square-root LASSO as a second-order cone programming (SOCP) problem is given in [3].

[3] Belloni, A., Chernozhukov, V. and Wang, L. Square-Root LASSO: Pivotal recovery of sparse signals via conic programming https://arxiv.org/pdf/1009.5689.pdf
