
slopecd's Introduction

Coordinate Descent for SLOPE

This repository provides code to reproduce the experiments in the paper Coordinate Descent for SLOPE.

This repository contains the following items:

  • The code folder contains the code for the solvers, the results produced from our experiments, a few (smaller) experiments, as well as the scripts used to generate the figures in the paper.
  • The tex folder contains the source code for the paper.
  • The benchmark_slope folder is a submodule of the benchopt benchmark located at https://github.com/klopfe/benchmark_slope, which was used to run the main benchmarks in the paper.

Installation

First make sure that you have conda available on your computer. Installation instructions are available in the conda documentation.

Then, start by creating a conda environment within which the benchmarks should be run. Here, we also install two R packages that were used in one of the experiments.

conda create -n slopecd -c conda-forge -y \
  python=3.9 r=4.2 r-slope=0.4 r-glmnet=4.1
conda activate slopecd
pip install benchopt

After this, make sure that you have navigated to the root folder of the extracted archive. Then run

pip install code/

to install the Python module slope.

Finally, to install the benchopt benchmark, run

benchopt install -y benchmark_slope/

Running the Experiments

Some experiments are available in code/expes and can be run simply by calling python expes/<experiment> or Rscript expes/<experiment>, where <experiment> is the name of one of the Python or R files in the folder.

To re-run the main benchmarks from the paper, modify benchmark_slope/config.yml to include or exclude objectives, solvers, and datasets by commenting or uncommenting them. Then call

benchopt run benchmark_slope/ --config benchmark_slope/config.yml

to run the benchmark.

Results

The results used in the paper are stored in the code/results folder.

Figures

The figures can be re-created by calling the Python scripts in code/scripts/figures.

slopecd's People

Contributors

jolars, klopfe, mathurinm, qb3


Forkers

antoinesimoes

slopecd's Issues

Add tests for solvers

Some basic tests, such as checking that the solvers converge on a few simulated problems, are probably enough.
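
A minimal sketch of what such a test could look like, reusing the hybrid_cd call pattern and BH-type lambda sequence from the repro scripts further down this page (the exact return signature and gap scaling may differ between versions):

import numpy as np
from benchopt.datasets import make_correlated_data
from scipy import stats

from slope.solvers import hybrid_cd
from slope.utils import dual_norm_slope


def test_hybrid_cd_converges():
    X, y, _ = make_correlated_data(n_samples=50, n_features=20, random_state=0)

    # BH-type lambda sequence scaled to 10% of lambda_max, as in the repro
    # scripts below
    q = 0.1
    p = X.shape[1]
    alphas_seq = stats.norm.ppf(1 - np.arange(1, p + 1) * q / (2 * p))
    alpha_max = dual_norm_slope(X, y / len(y), alphas_seq)
    alphas = 0.1 * alpha_max * alphas_seq

    tol = 1e-8
    res = hybrid_cd(X, y, alphas, fit_intercept=False, max_epochs=10_000, tol=tol)
    gaps = res[3]  # fourth return value is the duality-gap history in the examples below

    # the solver should drive the (possibly scaled) duality gap below tol
    assert gaps[-1] < tol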

Add screening rule to solvers

We should eventually incorporate screening rules into the solvers for more realistic benchmarking. It will probably be sufficient to just use the strong screening rule.
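
For reference, a rough sketch of the plain-lasso strong rule (Tibshirani et al., 2012), which discards feature j at a new penalty level if |x_j^T r| / n < 2 * lambda_new - lambda_prev. The SLOPE version has to account for the sorted penalty, so this only indicates the shape of the screening step, not the rule we would actually implement:

import numpy as np


def strong_rule_mask(X, residual, lambda_new, lambda_prev):
    # residual is the residual at the previous penalty level lambda_prev;
    # True means the feature stays in the working set
    n_samples = X.shape[0]
    grad = np.abs(X.T @ residual) / n_samples
    return grad >= 2 * lambda_new - lambda_prev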

Convergence issue for Rhee2006 for the hybrid method

The hybrid method fails to converge for this example:

import matplotlib.pyplot as plt
import numpy as np

from slope.solvers import hybrid_cd
from slope.utils import preprocess, lambda_sequence
from slope.data import get_data

X, y = get_data("Rhee2006")

fit_intercept = True
reg = 0.1
q = 0.2

# removes one column with zero variance
X = preprocess(X)

lambdas = lambda_sequence(X, y, fit_intercept, reg, q)

w, intercept, primals, gaps, times, _ = hybrid_cd(
    X, y, lambdas, max_epochs=100, fit_intercept=fit_intercept, verbose=False,
    use_reduced_X=False
)

plt.close("all")
plt.semilogy(primals)
plt.show(block=False)

Issue when using hybrid solver on sparse matrix

The code to replicate the error:

import matplotlib.pyplot as plt
import numpy as np
from benchopt.datasets import make_correlated_data
from scipy import stats

from slope.data import get_data
from slope.solvers import hybrid_cd, oracle_cd, prox_grad, admm, newt_alm
from slope.utils import dual_norm_slope

dataset = "Rhee2006"
if dataset == "simulated":
    X, y, _ = make_correlated_data(n_samples=10, n_features=20, random_state=0)
    # X = csc_matrix(X)
else:
    X, y = get_data(dataset)

fit_intercept = True

randnorm = stats.norm(loc=0, scale=1)
q = 0.1
reg = 0.01

alphas_seq = randnorm.ppf(1 - np.arange(1, X.shape[1] + 1) * q / (2 * X.shape[1]))

alpha_max = dual_norm_slope(X, (y - fit_intercept * np.mean(y)) / len(y), alphas_seq)

alphas = alpha_max * alphas_seq * reg
plt.close("all")

max_epochs = 10000
max_time = 100
verbose = True

tol = 1e-4


beta_cd, intercept_cd, primals_cd, gaps_cd, time_cd = hybrid_cd(
    X,
    y,
    alphas,
    fit_intercept=fit_intercept,
    max_epochs=max_epochs,
    verbose=verbose,
    tol=tol,
    max_time=max_time,
    use_reduced_X=False
)

Consider adding covariance updates

We may want to try to implement "covariance updates" for our solver, in precisely the same fashion as in the Friedman paper on the elastic net: we incrementally build the Gram matrix as predictors become nonzero and use it in our updates. It should help a lot when n > p, I think (and possibly in more instances, since we can cache more things efficiently). I wrote a little bit about it in section 2.1.3. As this only really works for least-squares SLOPE, I think it may be of middling interest, however. What do you think?
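
A minimal sketch of the idea, not tied to the actual solver internals (the class and function names here are made up for illustration): cache Gram-matrix columns lazily as features become active, and compute gradients from the cache instead of touching X on every update.

import numpy as np


class GramCache:
    """Lazily cache columns of X^T X for features that have become active."""

    def __init__(self, X):
        self.X = X
        self.cols = {}  # j -> X^T X[:, j]

    def col(self, j):
        if j not in self.cols:
            self.cols[j] = self.X.T @ self.X[:, j]
        return self.cols[j]


def lsq_gradient(cache, Xty, w, j):
    # gradient of 0.5 / n * ||y - X w||^2 w.r.t. w_j, i.e.
    # (X_j^T X w - X_j^T y) / n, using only the cached Gram columns of the
    # nonzero coefficients
    n_samples = cache.X.shape[0]
    g = -Xty[j]
    for k in np.flatnonzero(w):
        g += cache.col(k)[j] * w[k]
    return g / n_samples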

Oracle solver not converging on Scheetz2006

Now I am seeing weird issues with Scheetz2006 and the oracle solver.

Check out the following example.

import matplotlib.pyplot as plt
import numpy as np
from benchopt.datasets import make_correlated_data
from scipy import stats

from slope.data import get_data
from slope.solvers import hybrid_cd, oracle_cd
from slope.utils import dual_norm_slope

dataset = "Scheetz2006"
if dataset == "simulated":
    X, y, _ = make_correlated_data(n_samples=10, n_features=10, random_state=0)
    # X = csc_matrix(X)
else:
    X, y = get_data(dataset)

fit_intercept = False

randnorm = stats.norm(loc=0, scale=1)
q = 0.1
reg = 0.01

alphas_seq = randnorm.ppf(1 - np.arange(1, X.shape[1] + 1) * q / (2 * X.shape[1]))

alpha_max = dual_norm_slope(X, (y - fit_intercept * np.mean(y)) / len(y), alphas_seq)

alphas = alpha_max * alphas_seq * reg

max_epochs = 10000
max_time = 60
tol = 1e-4

beta_cd, intercept_cd, primals_cd, gaps_cd, time_cd = hybrid_cd(
    X,
    y,
    alphas,
    fit_intercept=fit_intercept,
    max_epochs=max_epochs,
    verbose=True,
    tol=tol,
    max_time=max_time,
    cluster_updates=True,
)

beta_oracle, intercept_oracle, primals_oracle, gaps_oracle, time_oracle = oracle_cd(
    X,
    y,
    alphas,
    fit_intercept=fit_intercept,
    max_epochs=max_epochs,
    verbose=True,
    tol=tol,
    max_time=max_time,
    w_star=beta_cd
)

primals_star = np.min(np.hstack((np.array(primals_cd), np.array(primals_oracle))))

plt.clf()

plt.semilogy(time_cd, primals_cd - primals_star, label="cd")
plt.semilogy(time_oracle, primals_oracle - primals_star, label="cd_oracle")

plt.xlabel("Time (s)")

plt.ylabel("suboptimality")
plt.legend()
plt.title(dataset)
plt.show(block=False)

[Figure: suboptimality versus time for the cd and cd_oracle solvers on Scheetz2006]

Improve performance of `slope_threshold()` function

I believe that the slope_threshold() function is likely the current bottleneck for the hybrid solver (and any CD solver).

https://github.com/QB3/slopecd/blob/27060e603843568fb88c580bf4a92449e0330767/code/slope/utils.py#L61

Here is a line profile of the CD part, which is where 75% of the time is spent for this simulation. Note that this is without numba, since profiling doesn't work with numba.

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     9                                           @profile
    10                                           def block_cd_epoch(w, X, R, alphas, cluster_indices, cluster_ptr, c):
    11       686       2321.0      3.4      0.0      n_samples = X.shape[0]
    12     70792      95061.0      1.3      1.7      for j in range(len(c)):
    13     70106     149992.0      2.1      2.6          if c[j] == 0:
    14      4289       4590.0      1.1      0.1              continue
    15     65817     177943.0      2.7      3.1          cluster = cluster_indices[cluster_ptr[j]:cluster_ptr[j+1]]
    16     65817     199659.0      3.0      3.5          sign_w = np.sign(w[cluster])
    17     65817     681207.0     10.4     11.9          sum_X = X[:, cluster] @ sign_w
    18     65817     314579.0      4.8      5.5          L_j = sum_X.T @ sum_X / n_samples
    19     65817      97729.0      1.5      1.7          c_old = c[j]
    20     65817     310375.0      4.7      5.4          x = c_old + (sum_X.T @ R) / (L_j * n_samples)
    21    131634    2043248.0     15.5     35.7          beta_tilde = slope_threshold(
    22     65817     490107.0      7.4      8.6              x, alphas/L_j, cluster_indices, cluster_ptr, c, j)
    23     65817     249870.0      3.8      4.4          c[j] = np.abs(beta_tilde)
    24     65817     310975.0      4.7      5.4          w[cluster] = beta_tilde * sign_w
    25     65817     103474.0      1.6      1.8          if c_old != beta_tilde:
    26     64833     494068.0      7.6      8.6              R += (c_old - beta_tilde) * sum_X

Currently, slope_threshold() searches from the top down, which is of course inefficient. My guess is that the easiest fix is to start from the current cluster, check whether the coefficient will decrease or increase, and search in that direction.

feat: fit intercept

I think we should fit the intercept for all models since it very rarely makes sense not to do so.

Add reduction caching for sparse X

We could try to figure out some way to cache reductions in the sparse X case, like we now optionally do in the dense case. The performance benefit is only large in some cases with a lot of clustering, however, so again I'm not sure it's worth the work.

feat: experiment setup

This is a meta-issue to discuss and list the simulated and real-data setups for the experiments. These are just some ideas off the top of my head. Let's discuss!

Real Data

Gaussian

I suppose we display time-to-optimality curves for something like $\text{reg} \in \lambda_{\text{max}} \times \{0.1, 0.01, 0.001\}$, depending a little on the relationship between $n$ and $p$.

@mathurinm, do you want to patch libsvm to support these Breheny data sets?

Logistic Regression (if we do it!)

All the usual suspects (rcv1, covtype.binary, news20.binary, gisette, etc.).

Simulated data

  • High-dimensional setup: 200 x 20 000, 20 signals, some type of correlation structure (latent or AR process type)
  • Low-dimensional setup: 20 000 x 200, 40 signals, vary over some type of correlation structure (latent or AR process type)
  • High-dimensional sparse setup: 200 x 2 000 000, 20 signals, binary X, sparsity 0.001, some type of correlation structure (AR and/or block)

Lambda sequence settings

I don't think we need to meddle much with the lambda sequence setup other than to vary the $q$ parameter, maybe something like $q \in \{0.05, 0.1, 0.2\}$.
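
For concreteness, this is how q enters the BH-type lambda sequence used throughout the repro scripts on this page (the helper name is just for illustration):

import numpy as np
from scipy import stats


def bh_lambdas(p, q):
    # lambda_i = Phi^{-1}(1 - i * q / (2 p)), i = 1, ..., p
    return stats.norm.ppf(1 - np.arange(1, p + 1) * q / (2 * p))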

OSCAR

In principle I don't see any reason to include OSCAR, but maybe it's clever to do it anyway, since it might help draw the attention of people who are interested in OSCAR. What do you think?

Competitors

  • Proximal gradient descent
  • Anderson acceleration
  • FISTA acceleration
  • ADMM
  • Oracle
  • Hybrid solver (ours)

Fix NumbaPerformanceWarnings

We're currently producing all these warnings, which it would be nice to get rid of. I'm not actually sure whether there are any performance implications. Maybe those of you who've worked more with numba (@mathurinm, @Klopfe) know if that's the case? (In which case we really need to do something about it.)

/home/gerd-jln/research/slopecd/code/slope/solvers/hybrid.py:71: NumbaPerformanceWarning: '@' is faster on contiguous arrays, called on (array(float64, 1d, A), array(float64, 1d, A))
  L_archive[k] = (X_reduced[:, k].T @ X_reduced[:, k]) / n_samples
/home/gerd-jln/.pyenv/versions/3.10.2/lib/python3.10/site-packages/numba/core/typing/npydecl.py:913: NumbaPerformanceWarning: '@' is faster on contiguous arrays, called on (array(float64, 1d, A), array(float64, 1d, A))
  warnings.warn(NumbaPerformanceWarning(msg))
/home/gerd-jln/research/slopecd/code/slope/solvers/hybrid.py:125: NumbaPerformanceWarning: '@' is faster on contiguous arrays, called on (array(float64, 1d, A), array(float64, 1d, C))
  x = c_old + sum_X @ R / (L_j * n_samples)
/home/gerd-jln/research/slopecd/code/slope/solvers/hybrid.py:142: NumbaPerformanceWarning: '@' is faster on contiguous arrays, called on (array(float64, 1d, A), array(float64, 1d, A))
  n_c = update_cluster(
/home/gerd-jln/.pyenv/versions/3.10.2/lib/python3.10/site-packages/numba/core/typing/npydecl.py:913: NumbaPerformanceWarning: '@' is faster on contiguous arrays, called on (array(float64, 1d, A), array(float64, 1d, C))
  warnings.warn(NumbaPerformanceWarning(msg))

One solution is of course to write our own dot-product for loop, but that seems somewhat crude (and maybe we shoot ourselves in the foot if there are additional optimization tricks for dot products that we fail to replicate).
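
For what it's worth, a hand-rolled loop that would silence the warning could look like this (whether it is actually faster than letting numba dispatch '@' is untested):

from numba import njit


@njit
def dot(a, b):
    # explicit loop over 1d arrays; avoids the non-contiguous '@' warning
    out = 0.0
    for i in range(a.shape[0]):
        out += a[i] * b[i]
    return out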

Oracle not converging on rcv1

The oracle method is not converging on the rcv1 data set for some reason, at least
not with reg = 0.02. Try this example:

import matplotlib.pyplot as plt
import numpy as np
from benchopt.datasets import make_correlated_data
from scipy import stats

from slope.data import get_data
from slope.solvers import hybrid_cd, oracle_cd
from slope.utils import dual_norm_slope

dataset = "rcv1.binary"
if dataset == "simulated":
   X, y, _ = make_correlated_data(n_samples=10, n_features=10, random_state=0)
else:
   X, y = get_data(dataset)

fit_intercept = False

randnorm = stats.norm(loc=0, scale=1)
q = 0.1
reg = 0.02

alphas_seq = randnorm.ppf(1 - np.arange(1, X.shape[1] + 1) * q / (2 * X.shape[1]))

alpha_max = dual_norm_slope(X, (y - fit_intercept * np.mean(y)) / len(y), alphas_seq)

alphas = alpha_max * alphas_seq * reg

max_epochs = 10000
max_time = np.inf
tol = 1e-4

beta_cd, intercept_cd, primals_cd, gaps_cd, time_cd = hybrid_cd(
   X,
   y,
   alphas,
   fit_intercept=fit_intercept,
   max_epochs=max_epochs,
   verbose=True,
   tol=tol,
   max_time=max_time,
   cluster_updates=True,
)

beta_oracle, intercept_oracle, primals_oracle, gaps_oracle, time_oracle = oracle_cd(
   X,
   y,
   alphas,
   fit_intercept=fit_intercept,
   max_epochs=max_epochs,
   verbose=True,
   tol=tol,
   max_time=max_time,
)

primals_star = np.min(np.hstack((np.array(primals_cd), np.array(primals_oracle))))

plt.clf()

plt.semilogy(time_cd, primals_cd - primals_star, label="cd")
plt.semilogy(time_oracle, primals_oracle - primals_star, label="cd_oracle")

plt.xlabel("Time (s)")

# plt.semilogy(np.arange(len(gaps_cd))*10, gaps_cd, label="cd")
# plt.xlabel("Epoch")

plt.ylabel("suboptimality")
plt.legend()
plt.title(dataset)
plt.show(block=False)

Gets me:

Epoch: 9771, loss: 0.18407451740050448, gap: 1.06e-01
Epoch: 9781, loss: 0.18407451740050448, gap: 1.06e-01
Epoch: 9791, loss: 0.18407451740050448, gap: 1.06e-01
Epoch: 9801, loss: 0.18407451740050448, gap: 1.06e-01
Epoch: 9811, loss: 0.18407451740050446, gap: 1.06e-01
Epoch: 9821, loss: 0.18407451740050448, gap: 1.06e-01
Epoch: 9831, loss: 0.18407451740050448, gap: 1.06e-01
Epoch: 9841, loss: 0.18407451740050448, gap: 1.06e-01
Epoch: 9851, loss: 0.18407451740050446, gap: 1.06e-01
Epoch: 9861, loss: 0.18407451740050448, gap: 1.06e-01
Epoch: 9871, loss: 0.18407451740050448, gap: 1.06e-01
Epoch: 9881, loss: 0.18407451740050448, gap: 1.06e-01
Epoch: 9891, loss: 0.18407451740050446, gap: 1.06e-01
Epoch: 9901, loss: 0.18407451740050446, gap: 1.06e-01
Epoch: 9911, loss: 0.18407451740050446, gap: 1.06e-01
Epoch: 9921, loss: 0.18407451740050446, gap: 1.06e-01
Epoch: 9931, loss: 0.18407451740050446, gap: 1.06e-01
Epoch: 9941, loss: 0.18407451740050446, gap: 1.06e-01
Epoch: 9951, loss: 0.18407451740050446, gap: 1.06e-01
Epoch: 9961, loss: 0.18407451740050443, gap: 1.06e-01
Epoch: 9971, loss: 0.18407451740050446, gap: 1.06e-01
Epoch: 9981, loss: 0.18407451740050446, gap: 1.06e-01
Epoch: 9991, loss: 0.18407451740050446, gap: 1.06e-01

and so on, for the oracle method.

Consider merging clusters in hybrid solver

Right now, in the hybrid solver, we are not actually merging clusters in the implementation; we just set them to exactly the same value. I'm not sure if this is a real problem, but what can happen is of course that we update c[j] to 0.2, which is the same value as c[k] for some k > j, so that c[j] and c[k] now form one cluster. But when we then arrive at the update for c[k], the indices of c[j] haven't actually been included in c[k], so at that point we only update the indices that were in c[k] at the start, which could break the cluster apart again. I hope that made sense. But like I said, I'm not sure it's a problem in practice.

https://github.com/QB3/slopecd/blob/27060e603843568fb88c580bf4a92449e0330767/code/slope/utils.py#L61
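
To make the bookkeeping concrete, here is an illustrative sketch with clusters stored as plain Python lists; the actual solver uses the flat cluster_indices / cluster_ptr representation, so the real change would be more involved:

def merge_cluster(clusters, values, j, new_value):
    """Set cluster j to new_value; if another cluster already sits at that
    value, absorb j into it so that later updates act on the union."""
    for k, v in enumerate(values):
        if k != j and v == new_value:
            clusters[k].extend(clusters[j])
            del clusters[j]
            del values[j]
            return k if k < j else k - 1  # index shifts after deleting j
    values[j] = new_value
    return j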
