
scikit-learn-contrib / skglm

Fast and modular sklearn replacement for generalized linear models

Home Page: http://contrib.scikit-learn.org/skglm

License: BSD 3-Clause "New" or "Revised" License

Python 100.00%

skglm's Introduction

scikit-learn-contrib

scikit-learn-contrib is a GitHub organization gathering high-quality scikit-learn-compatible projects. It also provides a template for establishing new scikit-learn-compatible projects.

Vision

With the explosion in the number of machine learning papers, it has become increasingly difficult for users and researchers to implement and compare algorithms. Even when authors release their software, it takes time to learn how to use it and how to apply it to one's own problems. The goal of scikit-learn-contrib is to provide easy-to-install, easy-to-use, high-quality machine learning software. With scikit-learn-contrib, users can install a project with pip install sklearn-contrib-project-name and immediately try it on their data with the usual fit, predict and transform methods. In addition, projects are compatible with scikit-learn tools such as grid search, pipelines, etc.

Projects

If you would like to include your own project in scikit-learn-contrib, take a look at the workflow.

A simple but efficient density-based clustering algorithm that can find clusters of arbitrary sizes, shapes and densities in two dimensions. Higher-dimensional data are first reduced to 2-D using t-SNE. The algorithm relies on a single parameter K, the number of nearest neighbors.

Read The Docs, Read the Paper

Maintained by: Mohamed Abbas

Large-scale linear classification, regression and ranking.

Maintained by Mathieu Blondel and Fabian Pedregosa.

Fast and modular Generalized Linear Models with support for models missing in scikit-learn.

Maintained by Mathurin Massias, Pierre-Antoine Bannier, Quentin Klopfenstein and Quentin Bertrand.

A Python implementation of Jerome Friedman's Multivariate Adaptive Regression Splines.

Maintained by Jason Rudy and Mehdi.

Python module to perform under-sampling and over-sampling with various techniques.

Maintained by Guillaume Lemaitre, Fernando Nogueira, Dayvid Oliveira and Christos Aridas.

Factorization machines and polynomial networks for classification and regression in Python.

Maintained by Vlad Niculae.

Confidence intervals for scikit-learn forest algorithms.

Maintained by Ariel Rokem, Kivan Polimis and Bryna Hazelton.

A high performance implementation of HDBSCAN clustering.

Maintained by Leland McInnes, jc-healy, c-north and Steve Astels.

A library of sklearn-compatible categorical variable encoders.

Maintained by Will McGinnis and Paul Westenthanner.

Python implementations of the Boruta all-relevant feature selection method.

Maintained by Daniel Homola.

Pandas integration with sklearn.

Maintained by Israel Saeta Pérez.

Machine learning with logical rules in Python.

Maintained by Florian Gardin, Ronan Gautier, Nicolas Goix and Jean-Matthieu Schertzer.

A Python implementation of the stability selection feature selection algorithm.

Maintained by Thomas Huijskens.

Metric learning algorithms in Python.

Maintained by CJ Carey, Yuan Tang, William de Vazelhes, Aurélien Bellet and Nathalie Vauquier.

skglm's People

Contributors

badr-moufad, borispf, cool-rr, enlai111, jasperlamm, jjerphan, klopfe, mathurinm, pabannier, pascalcarrivain, qb3, tomaszkacprzak


skglm's Issues

How to control number of cores?

I apologize for opening an issue that isn't a real bug.

I need to control the number of cores used when I fit a model.
I've tried the following things:
(1)

from numba import set_num_threads
set_num_threads(1)

(2)

from numba import config
config.NUMBA_NUM_THREADS = 1

(3) Going through the source code and replacing every instance of @njit with @njit(parallel=False).

No matter what I try, I see my core usage shoot up to 128.
I'm on a cluster of machines connected by a slow interconnect, so this core usage actually makes the model run very slowly.
Does anybody understand how to control this?
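A hedged suggestion (not from the issue thread; the environment variables below are standard Numba/BLAS settings, nothing skglm-specific): thread counts are usually easiest to pin via environment variables set before any import, since both Numba and the BLAS backing NumPy read them at import time.

import os
os.environ["NUMBA_NUM_THREADS"] = "1"     # threads used by Numba's parallel regions
os.environ["OMP_NUM_THREADS"] = "1"       # threads used by OpenMP-based BLAS
os.environ["OPENBLAS_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"

import numpy as np                        # import only after the variables are set
from skglm.estimators import Lasso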

ENH - slow ``gram_cd_solver`` when fitted on sparse dataset

As investigated in #59 through timing and benchmarks, running gram_cd_solver on a sparse dataset comes with a large overhead from computing the Gram matrix, as opposed to the dense case.

[Figure: gram_cd_solver timing, sparse case]

[Figure: gram_cd_solver timing, dense case]

It seems this has more to do with scipy sparse matrix multiplication. More details about it are available (explanation, code snippets).

It would be beneficial to have a more efficient way to compute the Gram matrix in the sparse case, as it would drastically speed up the solver.
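A small timing sketch (assumed data shapes, not the benchmark from #59) illustrating the Gram matrix computation in the dense and sparse cases:

import time
import numpy as np
from scipy import sparse

X_dense = np.random.rand(1_000, 2_000)
X_sparse = sparse.csc_matrix(X_dense * (X_dense > 0.9))   # ~10% non-zero entries

for name, X in [("dense", X_dense), ("sparse", X_sparse)]:
    t0 = time.time()
    gram = X.T @ X                                        # the Gram matrix used by the solver
    print(f"{name}: Gram computation took {time.time() - t0:.3f} s")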

BUG - Ill-conditioned residual matrix in AA can lead to wasted computations

As AA has been refactored into a class, we should make sure to catch a sneaky bug that can make AA fail silently.
I have already stumbled upon cases where the residual matrix is ill-conditioned, which makes the solution of the quadratic problem associated with AA take extremely large positive and extremely negative values that cancel out.
This bug is not caught by the try/except block that watches for LinAlgError, and computational time is wasted.

We should make sure to catch this bug in the AA class.

See https://github.com/scikit-learn-contrib/skglm/blob/main/skglm/solvers/cd_solver.py#L295 for a more detailed explanation.
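A hedged sketch of the kind of guard discussed above (illustrative names, not the skglm implementation): besides catching LinAlgError, check the conditioning of the residuals' Gram matrix and skip the extrapolation when it is numerically unreliable.

import numpy as np

def anderson_coefficients(residuals, cond_threshold=1e12):
    # residuals: (K, n) array holding the last K residuals
    gram = residuals @ residuals.T
    if np.linalg.cond(gram) > cond_threshold:
        return None                                   # caller falls back to the plain iterate
    coefs = np.linalg.solve(gram, np.ones(gram.shape[0]))
    return coefs / coefs.sum()                        # extrapolation weights summing to 1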

Considerable overhead when fitting with ``Lasso``

Description

We observe a considerable overhead when fitting the Lasso estimator (profiling screenshot omitted).

To reproduce, see the benchopt Lasso benchmark repository.

Investigation

After investigation, this overhead turns out to come from the computation of global_lipschitz:

n_features = X.shape[1]
self.global_lipschitz = norm(X, ord=2) ** 2 / len(y)
self.lipschitz = np.zeros(n_features, dtype=X.dtype)

which we introduced when adding the FISTA solver (#91).

Potential fix

The global_lipschitz is only relevant for the FISTA solver. Hence, it should be computed only in this case.
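A minimal sketch of the proposed fix (hypothetical helper name, not the skglm API): only pay for the spectral norm when the solver actually needs global_lipschitz, e.g. FISTA.

import numpy as np
from numpy.linalg import norm

def init_lipschitz(X, y, needs_global=False):
    # per-feature constants are filled later by the datafit, as before
    lipschitz = np.zeros(X.shape[1], dtype=X.dtype)
    # the expensive spectral-norm computation only happens on demand
    global_lipschitz = norm(X, ord=2) ** 2 / len(y) if needs_global else None
    return lipschitz, global_lipschitz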

ENH Create a solver dispatcher

Currently, users are not explicitly warned when an unsupported solver is used with some penalties or datafits. We should write some form of mechanism (a dispatcher?) that raises an exception when the combination (datafit, penalty, solver) is unsupported.

E.g., for #100 we should currently raise an exception that is explicit for the user.
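A rough sketch of what such a dispatcher could look like (the whitelist entries below are illustrative, not an authoritative compatibility table):

SUPPORTED = {
    ("Quadratic", "L1", "cd_solver"),
    ("Quadratic", "WeightedL1", "cd_solver"),
    ("Quadratic", "L1", "fista"),
    # ...
}

def check_supported(datafit, penalty, solver_name):
    key = (type(datafit).__name__, type(penalty).__name__, solver_name)
    if key not in SUPPORTED:
        raise ValueError(
            f"Unsupported combination {key}. "
            f"Supported combinations are: {sorted(SUPPORTED)}"
        )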

Release skglm 0.2

We have made some major API changes and added some cool solvers; it deserves a release.

We need to close first:

and maybe #53

FEAT Add line search for FISTA solver

Currently, FISTA relies on a "global" Lipschitz constant. For non-Lipschitz datafits (e.g. Poisson), this is a problem, hence the need to support line search.
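A minimal backtracking sketch for a proximal gradient (ISTA/FISTA) step, assuming hypothetical datafit.value, datafit.gradient and penalty.prox methods for illustration only:

import numpy as np

def backtracking_prox_step(w, X, y, datafit, penalty, step=1.0, shrink=0.5, max_backtracks=20):
    # one proximal gradient step with a backtracking line search
    Xw = X @ w
    f_w = datafit.value(y, w, Xw)
    grad = datafit.gradient(X, y, Xw)            # gradient of the datafit at w
    for _ in range(max_backtracks):
        z = penalty.prox(w - step * grad, step)  # proximal gradient candidate
        diff = z - w
        # accept the step if the quadratic upper bound holds at z
        if datafit.value(y, z, X @ z) <= f_w + grad @ diff + diff @ diff / (2 * step):
            break
        step *= shrink                           # otherwise shrink and retry
    return z, step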

MNT sklearn 1.2dev breaks estimator fitting

Validation of parameters at fit time seems to be the issue: parameters which should be set by the parent class of, e.g., MCPRegression are not returned by get_params(), so clf._validate_params() fails:

FAILED skglm/tests/test_estimators.py::test_check_estimator[Lasso] - ValueError: The parameter constraints ['alpha', 'fit_intercept', 'normalize', 'precompute', 'max_iter', 'copy_X', 'tol', 'war...
FAILED skglm/tests/test_estimators.py::test_check_estimator[wLasso] - ValueError: The parameter constraints ['alpha', 'fit_intercept', 'normalize', 'precompute', 'max_iter', 'copy_X', 'tol', 'wa...
FAILED skglm/tests/test_estimators.py::test_check_estimator[ElasticNet] - ValueError: The parameter constraints ['alpha', 'l1_ratio', 'fit_intercept', 'normalize', 'precompute', 'max_iter', 'cop...
FAILED skglm/tests/test_estimators.py::test_check_estimator[MCP] - ValueError: The parameter constraints ['alpha', 'fit_intercept', 'normalize', 'precompute', 'max_iter', 'copy_X', 'tol', 'warm_...
FAILED skglm/tests/test_estimators.py::test_estimator[X0-Lasso] - ValueError: The parameter constraints ['alpha', 'fit_intercept', 'normalize', 'precompute', 'max_iter', 'copy_X', 'tol', 'warm_s...
FAILED skglm/tests/test_estimators.py::test_estimator[X0-wLasso] - ValueError: The parameter constraints ['alpha', 'fit_intercept', 'normalize', 'precompute', 'max_iter', 'copy_X', 'tol', 'warm_...
FAILED skglm/tests/test_estimators.py::test_estimator[X0-ElasticNet] - ValueError: The parameter constraints ['alpha', 'l1_ratio', 'fit_intercept', 'normalize', 'precompute', 'max_iter', 'copy_X...
FAILED skglm/tests/test_estimators.py::test_estimator[X0-MCP] - ValueError: The parameter constraints ['alpha', 'fit_intercept', 'normalize', 'precompute', 'max_iter', 'copy_X', 'tol', 'warm_sta...
FAILED skglm/tests/test_estimators.py::test_estimator[X1-Lasso] - ValueError: The parameter constraints ['alpha', 'fit_intercept', 'normalize', 'precompute', 'max_iter', 'copy_X', 'tol', 'warm_s...
FAILED skglm/tests/test_estimators.py::test_estimator[X1-wLasso] - ValueError: The parameter constraints ['alpha', 'fit_intercept', 'normalize', 'precompute', 'max_iter', 'copy_X', 'tol', 'warm_...
FAILED skglm/tests/test_estimators.py::test_estimator[X1-ElasticNet] - ValueError: The parameter constraints ['alpha', 'l1_ratio', 'fit_intercept', 'normalize', 'precompute', 'max_iter', 'copy_X...
FAILED skglm/tests/test_estimators.py::test_estimator[X1-MCP] - ValueError: The parameter constraints ['alpha', 'fit_intercept', 'normalize', 'precompute', 'max_iter', 'copy_X', 'tol', 'warm_sta...
FAILED skglm/tests/test_estimators.py::test_generic_estimator[Quadratic-L1-False-Lasso-pen_args0] - ValueError: The parameter constraints ['alpha', 'fit_intercept', 'normalize', 'precompute', 'm...
FAILED skglm/tests/test_estimators.py::test_generic_estimator[Quadratic-WeightedL1-False-WeightedLasso-pen_args1] - ValueError: The parameter constraints ['alpha', 'fit_intercept', 'normalize', ...
FAILED skglm/tests/test_estimators.py::test_generic_estimator[Quadratic-L1_plus_L2-False-ElasticNet-pen_args2] - ValueError: The parameter constraints ['alpha', 'l1_ratio', 'fit_intercept', 'nor...
FAILED skglm/tests/test_estimators.py::test_generic_estimator[Quadratic-MCPenalty-False-MCPRegression-pen_args3] - ValueError: The parameter constraints ['alpha', 'fit_intercept', 'normalize', '...
FAILED skglm/tests/test_estimators.py::test_grid_search[Lasso] - ValueError: 
FAILED skglm/tests/test_estimators.py::test_grid_search[wLasso] - ValueError: 
FAILED skglm/tests/test_estimators.py::test_grid_search[ElasticNet] - ValueError: 
FAILED skglm/tests/test_estimators.py::test_grid_search[MCP] - ValueError: 
FAILED skglm/tests/test_group.py::test_equivalence_lasso - ValueError: The parameter constraints ['alpha', 'fit_intercept', 'normalize', 'precompute', 'max_iter', 'copy_X', 'tol', 'warm_start', ...
FAILED skglm/tests/test_group.py::test_vs_celer_grouplasso[15-50-True] - ValueError: The parameter constraints ['alpha', 'fit_intercept', 'normalize', 'precompute', 'max_iter', 'copy_X', 'tol', ...
FAILED skglm/tests/test_group.py::test_vs_celer_grouplasso[5-50-False] - ValueError: The parameter constraints ['alpha', 'fit_intercept', 'normalize', 'precompute', 'max_iter', 'copy_X', 'tol', ...
FAILED skglm/tests/test_group.py::test_vs_celer_grouplasso[19-59-False] - ValueError: The parameter constraints ['alpha', 'fit_intercept', 'normalize', 'precompute', 'max_iter', 'copy_X', 'tol',...
================================================================ 24 failed, 112 passed, 1 xfailed, 16 warnings in 114.40s (0:01:54) ================================================

Reproduce by

    clf = MCPRegression(
        alpha=alpha, gamma=np.inf, fit_intercept=False, tol=tol)
    print(clf.get_params())
    clf.fit(X, y)

in the setup of test_estimators.py

ENH Cache Numba compilation for the user

import time
import numpy as np
from numpy.linalg import norm
from sklearn.linear_model import Lasso as Lasso_sk
from skglm.estimators import Lasso

n_samples = 100
n_features = 10_000

X = np.random.normal(0, 1, (n_samples, n_features))
y = np.random.normal(0, 1, (n_samples,))

alpha_max = norm(X.T @ y, ord=np.inf) / n_samples
alpha = alpha_max * 0.1

start = time.time()
clf = Lasso(alpha).fit(X, y)
print("skglm:", time.time() - start)

start = time.time()
clf = Lasso_sk(alpha).fit(X, y)
print("sklearn:", time.time() - start)

This script gives:

skglm: 4.0232319831848145
sklearn: 0.2305459976196289

This is due to the compilation cost. We should cache this compilation once and for all, ideally during install (by pre-building/pre-compiling the IR generated by Numba) or when the user first runs a script, using njit(cache=True).
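A minimal sketch of the njit(cache=True) route: decorating a jitted kernel this way writes the compiled function to disk next to the source, so subsequent runs skip recompilation (this applies to plain @njit functions; caching jitclasses has more caveats). The kernel below is illustrative, not skglm's actual code.

import numpy as np
from numba import njit

@njit(cache=True)
def soft_threshold(x, tau):
    # elementwise soft-thresholding, a typical inner kernel of a CD solver
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)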

ENH support fit_intercept

Fitting an intercept can drastically change the size of the support and the time it takes to fit:

import numpy as np
import time
from libsvmdata import fetch_libsvm
X, y = fetch_libsvm("finance")

from celer import Lasso

for fit_intercept in [False, True]:
    alpha_max = np.max(np.abs(X.T @ (y - fit_intercept*np.mean(y)))) / len(y)
    clf = Lasso(alpha_max / 200, fit_intercept=fit_intercept, verbose=1)

    t0 = time.time()
    clf.fit(X, y)
    t1 = time.time()
    print(f"### fit_intercept={fit_intercept}, time to fit {t1-t0:.2f} s, supp size: {(clf.coef_ != 0).sum()}")

To do this without pain, my idea is to add, at the end of each CD epoch, a gradient step for the intercept in the solver. That is like adding a column full of ones to X, but without copying it.

One needs to add a way to compute the Lipschitz constant of this step; I do not see another solution than to make it an attribute of the datafit (e.g. 1 for Quadratic, 1/4 for Logistic).
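A minimal sketch of the proposed intercept step for the Quadratic datafit (hypothetical function, names for illustration): the intercept behaves like an extra all-ones column of X, whose Lipschitz constant is 1.

import numpy as np

def intercept_update(y, Xw, intercept):
    # gradient of 1/(2 n) * ||y - Xw - intercept||^2 w.r.t. the intercept
    grad = -np.mean(y - Xw - intercept)
    lipschitz = 1.0                        # 1 for Quadratic, 1/4 for Logistic
    return intercept - grad / lipschitz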

ENH add weighted ridge CD/FISTA solver

This probably calls for a non-working-set solver, or we can work something out with is_penalized (which could be renamed yields_sparse_coefs or similar): if it is not true, we should always include the feature in the working set.

MNT - Merge ``Lasso`` and ``WeightedLasso`` estimators

Following #110 (comment),

Lasso and WeightedLasso estimators share the same implementation logic.
How weights=None is handled in the WeightedLasso estimator hints at merging the two estimators by adding a weights argument to Lasso.

To achieve that

  • remove WeightedLasso
  • add a weights argument in Lasso estimator
  • handle the weights argument in the path and fit methods, e.g. as in the sketch below
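A rough sketch (not the actual skglm estimator) of how Lasso could absorb WeightedLasso through an optional weights argument, with None meaning unit weights:

import numpy as np

class Lasso:
    def __init__(self, alpha=1.0, weights=None):
        self.alpha = alpha
        self.weights = weights

    def fit(self, X, y):
        n_features = X.shape[1]
        weights = np.ones(n_features) if self.weights is None else self.weights
        # ... pass `weights` down to the path/solver as per-feature penalty weights ...
        return self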

ENH skglm import time is high

In [1]: import time

In [2]: t0 = time.time()
   ...: import skglm
   ...: 
   ...: print(time.time() - t0)
0.7723352909088135

I thought this was due to Numba compilation, but it is not fixed by #44. Importing Lasso from sklearn.linear_model takes about 1e-5 s, which is strange.

CLN remove seaborn dependency

We use it only to define C_LIST in utils.py; it's not worth having a dependency for that.

For the doc, it can be installed directly on Circle CI together with numpydoc and other doc-only packages.

BUG - code breaks down without throwing any error when calling ``cd_solver``

Script to reproduce

import numpy as np
from skglm.datafits import Quadratic
from skglm.penalties import L1
from skglm.solvers.cd_solver import cd_solver
from skglm.utils import make_correlated_data

n_samples, n_features = 10, 50
X, y, _ = make_correlated_data(n_samples, n_features, random_state=0)
alpha_max = np.linalg.norm(X.T @ y, ord=np.inf) / n_samples

datafit = Quadratic()
penalty = L1(alpha=alpha_max / 10.)

w = np.zeros(n_features)
Xw = np.zeros(n_samples)
cd_solver(X, y, datafit, penalty, w, Xw)

print("This line will never be printed")
print("This one also will never be printed")
print("...")

This is a dangerous behavior as the code breaks without giving any hint on what's wrong (even with a debugger).

source of the bug and a potential fix

After debugging, I found out that the problem stems from passing a non-initialized datafit to cd_solver. In fact, the solver attempts to access the Lipschitz constants of the datafit to compute the gradients, and these do not exist.

A natural way to fix this bug is to make cd_solver private, to prevent it from being called directly from the outside.
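An alternative, sketched as a hypothetical wrapper (attribute names are illustrative): validate the datafit before entering the jitted solver, so the failure becomes an explicit Python exception instead of a silent crash inside Numba.

from skglm.solvers.cd_solver import cd_solver

def checked_cd_solver(X, y, datafit, penalty, w, Xw, **kwargs):
    if not hasattr(datafit, "Xty"):   # attribute expected to be set by datafit.initialize(X, y)
        raise ValueError(
            "The datafit is not initialized; call datafit.initialize(X, y) "
            "before passing it to the solver."
        )
    return cd_solver(X, y, datafit, penalty, w, Xw, **kwargs)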

Reflections

Not having the Lipschitz constants does not explain why the code breaks without throwing any error. I suspect a bug in Numba compilation; here is a small script to reproduce it:

import numpy as np
from numba import float64
from numba.experimental import jitclass

@jitclass([('Xty', float64[:])])
class Quadratic:
    def __init__(self):
        # note: Xty is declared in the spec but never assigned
        pass

    def gradient_scalar(self, X, Xw, j):
        # accessing the never-assigned self.Xty is what triggers the silent crash
        return (X[:, j] @ Xw - self.Xty[j]) / len(Xw)

X = np.ones((2, 5))
Xw = np.zeros(2)

Quadratic().gradient_scalar(X, Xw, 0)

print("This line will never be printed")

ENH Issue a warning when solution found is null

Most non-expert users might not be familiar with the rescaling of alpha by n_samples. Issuing a warning when alpha > alpha_max, giving hints to the user (for instance by printing alpha_max), and informing them of the scaling by n_samples would be a valuable addition.
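A minimal sketch of such a warning, assuming a Lasso-like estimator where alpha_max = ||X^T y||_inf / n_samples is the smallest alpha yielding an all-zero solution (hypothetical helper, to be called inside fit):

import warnings
import numpy as np

def warn_if_null_solution(coef, X, y, alpha):
    if not np.any(coef):
        alpha_max = np.linalg.norm(X.T @ y, ord=np.inf) / len(y)
        warnings.warn(
            f"The solution is exactly 0. alpha={alpha:.3e} may be too large: "
            f"alpha_max={alpha_max:.3e} (note the scaling of alpha by n_samples)."
        )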

ENH/BUG? check_estimator has some very adversarial cases for SVC

To reproduce, run this in test_estimators.py:

    estimator_name = "SVC"

    clf = clone(dict_estimators_ours[estimator_name])
    clf.verbose = 2
    clf.tol = 1e-6  # failure in float32 computation otherwise
    # if isinstance(clf, WeightedLasso):
    #     clf.weights = None
    check_estimator(clf)

square-root LASSO implementation

As suggested by @agramfort, I am opening an issue here (original) to make a feature request for the square-root LASSO.

Describe the workflow you want to enable

The unconstrained LASSO is currently implemented in lasso_path. However, the parameter for unconstrained LASSO is sensitive to the measurement noise. Instead, the square-root LASSO has objective norm(Ax-b) + alpha * norm(x, 1) and its parameter, alpha, is insensitive to the noise level. I would like an sklearn interface (with fit, predict, etc.) coupled with a lasso_path-like function, srlasso_path, which solves the square-root LASSO optimization program.

Describe your proposed solution

I propose an sklearn-friendly wrapper around the cvxpy or cvxopt interface. More ambitiously, I propose an implementation of a fast method for solving the square-root LASSO based on [1] or [2] (either via (accelerated) PGD or ADMM).

[1] "On fast convergence of proximal algorithms for SQRT-LASSO optimization: Don’t worry about its nonsmooth loss function" by Li, Xinguo, Jiang, Haoming, et al. (2020) PMLR.
[2] "The flare package for high dimensional linear regression and precision matrix estimation in R" by Li, Xingguo, Zhao, Tuo, Yuan, Xiaoming and Liu, Han. (2015) JMLR.

Describe alternatives you've considered, if relevant

Currently, statsmodels offers a square-root LASSO solver implemented via the Python package cvxopt; this approach is slow. Using cvxpy with another interior-point solver like MOSEK offers only a 2x speed-up (it is glacially slow compared to lasso_path).

Separately, if the problem is appropriately regularized to avoid over-fitting, some work has shown that it is possible to use accelerated proximal gradient descent to achieve convergence. However, the paper leaves several critical implementation details unspecified (referring to [1] above).

Additional context

Note that the characterization of the square-root LASSO as a second-order cone programming (SOCP) problem is given in [3].

[3] Belloni, A., Chernozhukov, V. and Wang, L. Square-Root LASSO: Pivotal recovery of sparse signals via conic programming https://arxiv.org/pdf/1009.5689.pdf
