
slopecd's Introduction

Coordinate Descent for SLOPE

This repository provides code to reproduce the experiments in the paper Coordinate Descent for SLOPE.

This repository contains the following items:

  • The code folder contains the code for the solvers, the results produced from our experiments, a few (smaller) experiments, as well as the scripts used to generate the figures in the paper.
  • The tex folder contains the source code for the paper.
  • The benchmark_slope folder is a submodule of the benchopt benchmark located at https://github.com/klopfe/benchmark_slope, which was used to run the main benchmarks in the paper.

Installation

First make sure that you have conda available on your computer. Installation instructions are available in the conda documentation.

Then, start by creating a conda environment within which the benchmarks should be run. Here, we also install two R packages that were used in one of the experiments.

conda create -n slopecd -c conda-forge -y \
  python=3.9 r=4.2 r-slope=0.4 r-glmnet=4.1
conda activate slopecd
pip install benchopt

After this, make sure that you have navigated to the root folder of the extracted archive. Then run

pip install code/

to install the Python module slope.

Finally, to install the benchopt benchmark, run

benchopt install -y benchmark_slope/

Running the Experiments

Some experiments are available in code/expes and can be run simply by calling python expes/<experiment> or Rscript expes/<experiment>, where <experiment> is the name of one of the Python or R files in the folder.

To re-run the main benchmarks from the paper, modify benchmark_slope/config.yml to include or exclude objectives, solvers, and datasets by commenting or uncommenting them. Then call

benchopt run benchmark_slope/ --config benchmark_slope/config.yml

to run the benchmark.

Results

The results used in the paper are stored in the code/results folder.

Figures

The figures can be re-created by calling the Python scripts in code/scripts/figures.

slopecd's People

Contributors

jolars, klopfe, mathurinm, qb3


Forkers

antoinesimoes

slopecd's Issues

Add tests for solvers

Some basic tests, such as checking that the solvers converge on a few simulated problems, are probably enough.
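
A minimal sketch of what such a test could look like, reusing the hybrid_cd call pattern and BH-type lambda sequence from the repro scripts further down this page (the exact return signature and gap scaling may differ between versions):

import numpy as np
from benchopt.datasets import make_correlated_data
from scipy import stats

from slope.solvers import hybrid_cd
from slope.utils import dual_norm_slope


def test_hybrid_cd_converges():
    X, y, _ = make_correlated_data(n_samples=50, n_features=20, random_state=0)

    # BH-type lambda sequence scaled to 10% of lambda_max, as in the repro
    # scripts below
    q = 0.1
    p = X.shape[1]
    alphas_seq = stats.norm.ppf(1 - np.arange(1, p + 1) * q / (2 * p))
    alpha_max = dual_norm_slope(X, y / len(y), alphas_seq)
    alphas = 0.1 * alpha_max * alphas_seq

    tol = 1e-8
    res = hybrid_cd(X, y, alphas, fit_intercept=False, max_epochs=10_000, tol=tol)
    gaps = res[3]  # fourth return value is the duality-gap history in the examples below

    # the solver should drive the (possibly scaled) duality gap below tol
    assert gaps[-1] < tol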

Add screening rule to solvers

We should eventually incorporate screening rules into the solvers for more realistic benchmarking. It will probably be sufficient to just use the strong screening rule.
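
For reference, a rough sketch of the plain-lasso strong rule (Tibshirani et al., 2012), which discards feature j at a new penalty level if |x_j^T r| / n < 2 * lambda_new - lambda_prev. The SLOPE version has to account for the sorted penalty, so this only indicates the shape of the screening step, not the rule we would actually implement:

import numpy as np


def strong_rule_mask(X, residual, lambda_new, lambda_prev):
    # residual is the residual at the previous penalty level lambda_prev;
    # True means the feature stays in the working set
    n_samples = X.shape[0]
    grad = np.abs(X.T @ residual) / n_samples
    return grad >= 2 * lambda_new - lambda_prev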

Convergence issue for Rhee2006 for the hybrid method

The hybrid method fails to converge for this example:

import matplotlib.pyplot as plt
import numpy as np

from slope.solvers import hybrid_cd
from slope.utils import preprocess, lambda_sequence
from slope.data import get_data

X, y = get_data("Rhee2006")

fit_intercept = True
reg = 0.1
q = 0.2

# removes one column with zero variance
X = preprocess(X)

lambdas = lambda_sequence(X, y, fit_intercept, reg, q)

w, intercept, primals, gaps, times, _ = hybrid_cd(
    X, y, lambdas, max_epochs=100, fit_intercept=fit_intercept, verbose=False,
    use_reduced_X=False
)

plt.close("all")
plt.semilogy(primals)
plt.show(block=False)

Issue when using hybrid solver on sparse matrix

The code to replicate the error:

import matplotlib.pyplot as plt
import numpy as np
from benchopt.datasets import make_correlated_data
from scipy import stats

from slope.data import get_data
from slope.solvers import hybrid_cd, oracle_cd, prox_grad, admm, newt_alm
from slope.utils import dual_norm_slope

dataset = "Rhee2006"
if dataset == "simulated":
    X, y, _ = make_correlated_data(n_samples=10, n_features=20, random_state=0)
    # X = csc_matrix(X)
else:
    X, y = get_data(dataset)

fit_intercept = True

randnorm = stats.norm(loc=0, scale=1)
q = 0.1
reg = 0.01

alphas_seq = randnorm.ppf(1 - np.arange(1, X.shape[1] + 1) * q / (2 * X.shape[1]))

alpha_max = dual_norm_slope(X, (y - fit_intercept * np.mean(y)) / len(y), alphas_seq)

alphas = alpha_max * alphas_seq * reg
plt.close("all")

max_epochs = 10000
max_time = 100
verbose = True

tol = 1e-4


beta_cd, intercept_cd, primals_cd, gaps_cd, time_cd = hybrid_cd(
    X,
    y,
    alphas,
    fit_intercept=fit_intercept,
    max_epochs=max_epochs,
    verbose=verbose,
    tol=tol,
    max_time=max_time,
    use_reduced_X=False
)

Consider adding covariance updates

We may want to try to implement "covariance updates" for our solver, in precisely the same fashion as in the Friedman paper on the elastic net: we incrementally build the Gram matrix as predictors become nonzero and use it in our updates. It should help a lot when n > p, I think (and possibly in more instances, since we can cache more things efficiently). I wrote a little bit about it in section 2.1.3. As this only really works for least-squares SLOPE, I think it may be of middling interest, however. What do you think?
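
A minimal sketch of the idea, not tied to the actual solver internals (the class and function names here are made up for illustration): cache Gram-matrix columns lazily as features become active, and compute gradients from the cache instead of touching X on every update.

import numpy as np


class GramCache:
    """Lazily cache columns of X^T X for features that have become active."""

    def __init__(self, X):
        self.X = X
        self.cols = {}  # j -> X^T X[:, j]

    def col(self, j):
        if j not in self.cols:
            self.cols[j] = self.X.T @ self.X[:, j]
        return self.cols[j]


def lsq_gradient(cache, Xty, w, j):
    # gradient of 0.5 / n * ||y - X w||^2 w.r.t. w_j, i.e.
    # (X_j^T X w - X_j^T y) / n, using only the cached Gram columns of the
    # nonzero coefficients
    n_samples = cache.X.shape[0]
    g = -Xty[j]
    for k in np.flatnonzero(w):
        g += cache.col(k)[j] * w[k]
    return g / n_samples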

Oracle solver not converging on Scheetz2006

Now I am seeing weird issues with Scheetz2006 and the oracle solver.

Check out the following example.

import matplotlib.pyplot as plt
import numpy as np
from benchopt.datasets import make_correlated_data
from scipy import stats

from slope.data import get_data
from slope.solvers import hybrid_cd, oracle_cd
from slope.utils import dual_norm_slope

dataset = "Scheetz2006"
if dataset == "simulated":
    X, y, _ = make_correlated_data(n_samples=10, n_features=10, random_state=0)
    # X = csc_matrix(X)
else:
    X, y = get_data(dataset)

fit_intercept = False

randnorm = stats.norm(loc=0, scale=1)
q = 0.1
reg = 0.01

alphas_seq = randnorm.ppf(1 - np.arange(1, X.shape[1] + 1) * q / (2 * X.shape[1]))

alpha_max = dual_norm_slope(X, (y - fit_intercept * np.mean(y)) / len(y), alphas_seq)

alphas = alpha_max * alphas_seq * reg

max_epochs = 10000
max_time = 60
tol = 1e-4

beta_cd, intercept_cd, primals_cd, gaps_cd, time_cd = hybrid_cd(
    X,
    y,
    alphas,
    fit_intercept=fit_intercept,
    max_epochs=max_epochs,
    verbose=True,
    tol=tol,
    max_time=max_time,
    cluster_updates=True,
)

beta_oracle, intercept_oracle, primals_oracle, gaps_oracle, time_oracle = oracle_cd(
    X,
    y,
    alphas,
    fit_intercept=fit_intercept,
    max_epochs=max_epochs,
    verbose=True,
    tol=tol,
    max_time=max_time,
    w_star=beta_cd
)

primals_star = np.min(np.hstack((np.array(primals_cd), np.array(primals_oracle))))

plt.clf()

plt.semilogy(time_cd, primals_cd - primals_star, label="cd")
plt.semilogy(time_oracle, primals_oracle - primals_star, label="cd_oracle")

plt.xlabel("Time (s)")

plt.ylabel("suboptimality")
plt.legend()
plt.title(dataset)
plt.show(block=False)

[Figure: suboptimality versus time for the cd and cd_oracle solvers on Scheetz2006]

Improve performance of `slope_threshold()` function

I believe that the slope_threshold() function is likely the current bottleneck for the hybrid solver (and any CD solver).

https://github.com/QB3/slopecd/blob/27060e603843568fb88c580bf4a92449e0330767/code/slope/utils.py#L61

Here is a line profile of the CD part, which is where 75% of the time is spent for this simulation. Note that this is without numba, since profiling doesn't work with numba.

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     9                                           @profile
    10                                           def block_cd_epoch(w, X, R, alphas, cluster_indices, cluster_ptr, c):
    11       686       2321.0      3.4      0.0      n_samples = X.shape[0]
    12     70792      95061.0      1.3      1.7      for j in range(len(c)):
    13     70106     149992.0      2.1      2.6          if c[j] == 0:
    14      4289       4590.0      1.1      0.1              continue
    15     65817     177943.0      2.7      3.1          cluster = cluster_indices[cluster_ptr[j]:cluster_ptr[j+1]]
    16     65817     199659.0      3.0      3.5          sign_w = np.sign(w[cluster])
    17     65817     681207.0     10.4     11.9          sum_X = X[:, cluster] @ sign_w
    18     65817     314579.0      4.8      5.5          L_j = sum_X.T @ sum_X / n_samples
    19     65817      97729.0      1.5      1.7          c_old = c[j]
    20     65817     310375.0      4.7      5.4          x = c_old + (sum_X.T @ R) / (L_j * n_samples)
    21    131634    2043248.0     15.5     35.7          beta_tilde = slope_threshold(
    22     65817     490107.0      7.4      8.6              x, alphas/L_j, cluster_indices, cluster_ptr, c, j)
    23     65817     249870.0      3.8      4.4          c[j] = np.abs(beta_tilde)
    24     65817     310975.0      4.7      5.4          w[cluster] = beta_tilde * sign_w
    25     65817     103474.0      1.6      1.8          if c_old != beta_tilde:
    26     64833     494068.0      7.6      8.6              R += (c_old - beta_tilde) * sum_X

Currently, slope_threshold() searches from the top down, which is of course inefficient. My guess is that the easiest fix is to start from the current cluster, check whether the coefficient will decrease or increase, and search in that direction.

feat: fit intercept

I think we should fit the intercept for all models since it very rarely makes sense not to do so.

Add reduction caching for sparse X

We could try to figure out some way to cache reductions in the sparse X case, like we now optionally do in the dense case. The performance benefit is only large in some cases with a lot of clustering, however, so again I'm not sure it's worth the work.

feat: experiment setup

This is a meta-issue to discuss and list the simulated and real-data setups for the experiments. These are just some ideas off the top of my head. Let's discuss!

Real Data

Gaussian

I suppose we display time-to-optimality curves for something like $\text{reg} \in \lambda_{\text{max}} \times \{0.1, 0.01, 0.001\}$, depending a little on the relationship between $n$ and $p$.

@mathurinm, do you want to patch libsvm to support these Breheny data sets?

Logistic Regression (if we do it!)

All the usual suspects (rcv1, covtype.binary, news20.binary, gisette, etc.).

Simulated data

  • High-dimensional setup: 200 x 20 000, 20 signals, some type of correlation structure (latent or AR process type)
  • Low-dimensional setup: 20 000 x 200, 40 signals, vary over some type of correlation structure (latent or AR process type)
  • High-dimensional sparse setup: 200 x 2 000 000, 20 signals, binary X, sparsity 0.001, some type of correlation structure (AR and/or block)

Lambda sequence settings

I don't think we need to meddle much with the lambda sequence setup other than to vary the $q$ parameter, maybe something like $q \in \{0.05, 0.1, 0.2\}$.
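
For concreteness, this is how q enters the BH-type lambda sequence used throughout the repro scripts on this page (the helper name is just for illustration):

import numpy as np
from scipy import stats


def bh_lambdas(p, q):
    # lambda_i = Phi^{-1}(1 - i * q / (2 p)), i = 1, ..., p
    return stats.norm.ppf(1 - np.arange(1, p + 1) * q / (2 * p))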

OSCAR

In principle I don't see any reason to include OSCAR, but maybe it's clever to do it anyway, since it might help draw the attention of people who are interested in OSCAR. What do you think?

Competitors

  • Proximal gradient descent
  • Anderson acceleration
  • FISTA acceleration
  • ADMM
  • Oracle
  • Hybrid solver (ours)

Fix NumbaPerformanceWarnings

We're currently producing all these warnings, which it would be nice to get rid of. I'm not actually sure whether there are any performance implications. Maybe those of you who've worked more with numba (@mathurinm, @Klopfe) know if that's the case? (In which case we really need to do something about it.)

/home/gerd-jln/research/slopecd/code/slope/solvers/hybrid.py:71: NumbaPerformanceWarning: '@' is faster on contiguous arrays, called on (array(float64, 1d, A), array(float64, 1d, A))
  L_archive[k] = (X_reduced[:, k].T @ X_reduced[:, k]) / n_samples
/home/gerd-jln/.pyenv/versions/3.10.2/lib/python3.10/site-packages/numba/core/typing/npydecl.py:913: NumbaPerformanceWarning: '@' is faster on contiguous arrays, called on (array(float64, 1d, A), array(float64, 1d, A))
  warnings.warn(NumbaPerformanceWarning(msg))
/home/gerd-jln/research/slopecd/code/slope/solvers/hybrid.py:125: NumbaPerformanceWarning: '@' is faster on contiguous arrays, called on (array(float64, 1d, A), array(float64, 1d, C))
  x = c_old + sum_X @ R / (L_j * n_samples)
/home/gerd-jln/research/slopecd/code/slope/solvers/hybrid.py:142: NumbaPerformanceWarning: '@' is faster on contiguous arrays, called on (array(float64, 1d, A), array(float64, 1d, A))
  n_c = update_cluster(
/home/gerd-jln/.pyenv/versions/3.10.2/lib/python3.10/site-packages/numba/core/typing/npydecl.py:913: NumbaPerformanceWarning: '@' is faster on contiguous arrays, called on (array(float64, 1d, A), array(float64, 1d, C))
  warnings.warn(NumbaPerformanceWarning(msg))

One solution is of course to write our own dot-product for loop, but that seems somewhat crude (and maybe we shoot ourselves in the foot if there are additional optimization tricks for dot products that we fail to replicate).
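
For what it's worth, a hand-rolled loop that would silence the warning could look like this (whether it is actually faster than letting numba dispatch '@' is untested):

from numba import njit


@njit
def dot(a, b):
    # explicit loop over 1d arrays; avoids the non-contiguous '@' warning
    out = 0.0
    for i in range(a.shape[0]):
        out += a[i] * b[i]
    return out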

Oracle not converging on rcv1

The oracle method is not converging on the rcv1 data set for some reason, at least
not with reg = 0.02. Try this example:

import matplotlib.pyplot as plt
import numpy as np
from benchopt.datasets import make_correlated_data
from scipy import stats

from slope.data import get_data
from slope.solvers import hybrid_cd, oracle_cd
from slope.utils import dual_norm_slope

dataset = "rcv1.binary"
if dataset == "simulated":
   X, y, _ = make_correlated_data(n_samples=10, n_features=10, random_state=0)
else:
   X, y = get_data(dataset)

fit_intercept = False

randnorm = stats.norm(loc=0, scale=1)
q = 0.1
reg = 0.02

alphas_seq = randnorm.ppf(1 - np.arange(1, X.shape[1] + 1) * q / (2 * X.shape[1]))

alpha_max = dual_norm_slope(X, (y - fit_intercept * np.mean(y)) / len(y), alphas_seq)

alphas = alpha_max * alphas_seq * reg

max_epochs = 10000
max_time = np.inf
tol = 1e-4

beta_cd, intercept_cd, primals_cd, gaps_cd, time_cd = hybrid_cd(
   X,
   y,
   alphas,
   fit_intercept=fit_intercept,
   max_epochs=max_epochs,
   verbose=True,
   tol=tol,
   max_time=max_time,
   cluster_updates=True,
)

beta_oracle, intercept_oracle, primals_oracle, gaps_oracle, time_oracle = oracle_cd(
   X,
   y,
   alphas,
   fit_intercept=fit_intercept,
   max_epochs=max_epochs,
   verbose=True,
   tol=tol,
   max_time=max_time,
)

primals_star = np.min(np.hstack((np.array(primals_cd), np.array(primals_oracle))))

plt.clf()

plt.semilogy(time_cd, primals_cd - primals_star, label="cd")
plt.semilogy(time_oracle, primals_oracle - primals_star, label="cd_oracle")

plt.xlabel("Time (s)")

# plt.semilogy(np.arange(len(gaps_cd))*10, gaps_cd, label="cd")
# plt.xlabel("Epoch")

plt.ylabel("suboptimality")
plt.legend()
plt.title(dataset)
plt.show(block=False)

Gets me:

Epoch: 9771, loss: 0.18407451740050448, gap: 1.06e-01
Epoch: 9781, loss: 0.18407451740050448, gap: 1.06e-01
Epoch: 9791, loss: 0.18407451740050448, gap: 1.06e-01
Epoch: 9801, loss: 0.18407451740050448, gap: 1.06e-01
Epoch: 9811, loss: 0.18407451740050446, gap: 1.06e-01
Epoch: 9821, loss: 0.18407451740050448, gap: 1.06e-01
Epoch: 9831, loss: 0.18407451740050448, gap: 1.06e-01
Epoch: 9841, loss: 0.18407451740050448, gap: 1.06e-01
Epoch: 9851, loss: 0.18407451740050446, gap: 1.06e-01
Epoch: 9861, loss: 0.18407451740050448, gap: 1.06e-01
Epoch: 9871, loss: 0.18407451740050448, gap: 1.06e-01
Epoch: 9881, loss: 0.18407451740050448, gap: 1.06e-01
Epoch: 9891, loss: 0.18407451740050446, gap: 1.06e-01
Epoch: 9901, loss: 0.18407451740050446, gap: 1.06e-01
Epoch: 9911, loss: 0.18407451740050446, gap: 1.06e-01
Epoch: 9921, loss: 0.18407451740050446, gap: 1.06e-01
Epoch: 9931, loss: 0.18407451740050446, gap: 1.06e-01
Epoch: 9941, loss: 0.18407451740050446, gap: 1.06e-01
Epoch: 9951, loss: 0.18407451740050446, gap: 1.06e-01
Epoch: 9961, loss: 0.18407451740050443, gap: 1.06e-01
Epoch: 9971, loss: 0.18407451740050446, gap: 1.06e-01
Epoch: 9981, loss: 0.18407451740050446, gap: 1.06e-01
Epoch: 9991, loss: 0.18407451740050446, gap: 1.06e-01

and so on, for the oracle method.

Consider merging clusters in hybrid solver

Right now, in the hybrid solver, we are not actually merging clusters in the implementation; we just set them to exactly the same value. I'm not sure if this is a real problem, but what can happen is of course that we update c[j] to 0.2, which is the same value as c[k] for some k > j, so that c[j] and c[k] now form one cluster. But when we then arrive at the update for c[k], the indices of c[j] haven't actually been included in c[k], so at that point we only update the indices that were in c[k] at the start, which could break the cluster apart again. I hope that made sense. But like I said, I'm not sure it's a problem in practice.

https://github.com/QB3/slopecd/blob/27060e603843568fb88c580bf4a92449e0330767/code/slope/utils.py#L61
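
To make the bookkeeping concrete, here is an illustrative sketch with clusters stored as plain Python lists; the actual solver uses the flat cluster_indices / cluster_ptr representation, so the real change would be more involved:

def merge_cluster(clusters, values, j, new_value):
    """Set cluster j to new_value; if another cluster already sits at that
    value, absorb j into it so that later updates act on the union."""
    for k, v in enumerate(values):
        if k != j and v == new_value:
            clusters[k].extend(clusters[j])
            del clusters[j]
            del values[j]
            return k if k < j else k - 1  # index shifts after deleting j
    values[j] = new_value
    return j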
