
ghostbasil's Introduction

GhostBASIL

This is a repository of experimental code using the GhostKnockoff and BASIL frameworks for fast and efficient lasso fitting on high-dimensional data. Specifically, this library is intended to be applied to GWAS summary statistics.

Dependencies

Developer Installation

Run the following command to generate a machine-specific .bazelrc:

./generate_bazelrc

Then, install the development dependencies:

mamba update -y conda
mamba env create
conda activate ghostbasil

Our tools require OpenMP to be installed on the system. On Mac, it must be installed with Homebrew, as our build system assumes:

brew install libomp

See the following for installation instructions for each of the sub-packages:

References

ghostbasil's People

Contributors

jamesyang007

Forkers

biona001
ghostbasil's Issues

Group lasso Newton method accelerated?

There are some results on accelerating Newton's method. If there is still a significant bottleneck in the Newton step, it might be worth looking into accelerated methods.

Checkpoint

It would be nice if we could provide checkpoint information so that we can fit at any lambda with a good warm start.

Basil lasso early exit

Currently, BASIL fits the lasso without early exiting. This may add an extra bottleneck; I'm not sure by how much.

`glmnetbasil` support

It would be nice to have a glmnet-like function that takes in individual-level data X, y and performs BASIL. The covariance method as in ghostbasil would be ideal, and the covariance matrix could be constructed on the fly as strong-set variables come in (see the sketch below).
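As a rough illustration of what such a wrapper could do, this sketch forms the summary inputs from individual-level data and materializes only the strong-set block of the covariance matrix. The 1/n scaling, the centering, and the initial strong-set size are assumptions here, not the package's documented behavior:

# Hypothetical sketch, not the ghostbasil API: build summary inputs from X, y
# and only the strong-set block of the covariance matrix.
n  <- nrow(X)
Xc <- scale(X, center = TRUE, scale = FALSE)   # centered design (scaling convention assumed)
yc <- y - mean(y)
r  <- drop(crossprod(Xc, yc)) / n              # gradient at beta = 0
strong   <- head(order(abs(r), decreasing = TRUE), 100)  # initial strong set (size assumed)
A_strong <- crossprod(Xc[, strong]) / n        # covariance restricted to the strong set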

Different strong rule

The strong rule for BASIL is different from the one proposed for glmnet.

At every BASIL iteration, we must do a KKT check in which we compute the gradient for all variables. The current formulation of BASIL uses a simple strong rule: add the delta variables with the highest absolute gradient (see the sketch below). The paper mentions that one extension is to use the usual strong rule, as in glmnet, to add variables in a data-dependent fashion. The usual strong rule rarely discards more than necessary, so it seems like the better rule.
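A minimal sketch of the simple rule described above; grad, strong, and delta are illustrative names, not the package internals:

# After the KKT check has computed the full gradient, add the `delta`
# variables with the largest absolute gradient not already in the strong set.
candidates <- setdiff(seq_along(grad), strong)
new_vars   <- head(candidates[order(abs(grad[candidates]), decreasing = TRUE)], delta)
strong     <- c(strong, new_vars)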

Store gradients?

Not sure how much the KKT check is adding, but when the number of lambdas that pass the KKT check is neither 0 nor the whole batch, we have to recompute the gradient again. This can be avoided if we store the gradients when we check KKT (see the sketch below). Based on a preliminary study, as lambda goes down, all lambdas pass KKT, so we might save memory here.
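An illustrative sketch of the caching idea, assuming the covariance formulation where the gradient at beta is r - A beta; all names here are hypothetical:

# Cache the gradient computed for each lambda during the KKT check so it
# need not be recomputed when only part of the batch passes.
grad_cache <- vector("list", length(lambda_batch))
for (k in seq_along(lambda_batch)) {
  grad_cache[[k]] <- r - A %*% beta_hat[[k]]                       # gradient used by the check
  if (!all(abs(grad_cache[[k]]) <= lambda_batch[k] + 1e-8)) break  # first KKT failure
}
# On a failure at index k, resume from grad_cache[[k - 1]] instead of recomputing.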

Compute full R^2

It might be worth providing an option that computes the full R^2 rather than just the differences (the current implementation); see the note below.
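If the reported values are per-lambda increments (my reading of "the differences"; an assumption about the current output), the full R^2 path would just be their running sum:

# Assuming result$rsqs holds per-lambda increments, accumulate them:
full_rsq <- cumsum(result$rsqs)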

Add `penalty.factor`

This gives a nice explanation of the penalty-factor objective. Trevor recommends mixing the l1 and l2 penalties together, as in this article, and penalizing the two together. I'm not sure that's what we want, though, because we don't have an elastic net penalty.

Change objective into elastic net

Currently, we use the PGR objective, which regularizes the correlation matrix A towards the identity. It's not clear why this is the "correct" thing to do. We propose changing the objective to simply the Gaussian loss plus an elastic net penalty (sketched below). This also prepares us to integrate the penalty factor (#3).
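As a sketch of the proposed objective in the summary-statistics (covariance) formulation; the mixing parameter alpha follows the glmnet convention, and the exact scaling is an assumption:

\min_{\beta} \; -\beta^\top r + \tfrac{1}{2}\,\beta^\top A\,\beta + \lambda \left( \alpha \|\beta\|_1 + \tfrac{1-\alpha}{2} \|\beta\|_2^2 \right)

By contrast, the PGR objective keeps a plain l1 penalty but replaces A with a version pulled towards the identity (e.g. (1-\gamma)A + \gamma I).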

Diagnostic

Create diagnostic information for BASIL to study properties such as the number of BASIL iterations, the number of KKT check passes per BASIL iteration, etc.

GhostBasil very slow for p=1140 problem

Hi James,

Here is an example where GhostBasil runs slowly. This region has 190 variables, and I generate 5 knockoff copies, so overall A is 1140 x 1140. On Sherlock with 12 cores, it takes ~200 seconds to converge.

Could you check if the output below matches your expectation? Hopefully I didn't misuse your package in some way.

library(Matrix)
library(ghostbasil)

# read data
Ci <- as.matrix(read.table('Ci.txt'))
S <- as.matrix(read.table('S.txt'))
r <- scan('r.txt')
lambdas <- scan('lambdas.txt')

# form ghostbasil inputs
S <- BlockMatrix(list(S))            # dim(S) = 190 by 190
A <- BlockGroupGhostMatrix(Ci, S, 6) # dim(A) = 1140 by 1140

# run ghostbasil
result <- ghostbasil(A, r, delta.strong.size=500, user.lambdas=lambdas,
    max.strong.size = nrow(A), n.threads=12, use.strong.rule=F)

Some observations:

  • The model seems to have converged (no error message is printed and result$error is "")
  • result$rsqs is a vector of NaN:
result$rsqs
  [1] NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
 [19] NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
 [37] NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
 [55] NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
 [73] NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
 [91] NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
  • The number of non-zero betas is not monotonically increasing:
colSums(result$betas != 0)
  [1]   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
 [19]   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
 [37]   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
 [55]   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
 [73]   0   0   0   0   0   0   0   0   0   0   0   0   0   1   8  11  18  31
 [91]  37  48  70 104 147 195 286 428 706   5

Validation Method

In addition to r, suppose we now have a validation vector r_validation (e.g., r_validation from another dataset or from pseudo summary statistics). We can then train ghostbasil using r and compute a validation R^2 based on r_validation as

R2_validation <- t(beta) %*% r_validation / sqrt(A$quad_form(beta))

Can we then stop BASIL when R2_validation stops increasing / reaches its maximum? This is currently how I choose the optimal lambda, and simulations show that it works well (a sketch of this selection rule follows below).

From an email discussion with Zihuai.
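A small sketch of that selection rule applied to the fit above, using the quad_form call from the formula; the variable names are illustrative:

# Score every column of the coefficient path against the validation statistics
# and pick the lambda with the highest validation R^2.
val_r2 <- apply(as.matrix(result$betas), 2, function(beta)
  sum(beta * r_validation) / sqrt(A$quad_form(beta)))
best       <- which.max(val_r2)   # NaN entries (all-zero beta) are skipped
lambda_opt <- lambdas[best]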

Block matrix acceleration

For any block matrix, we can make the following optimizations (see the sketch after this list):

  • Parallelize lasso fitting on the strong set for each block.
  • Only refit the lasso for the blocks whose strong set grew.
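An illustrative sketch of both ideas; fit_block_lasso, A_blocks, r_blocks, and strong_grew are hypothetical names, not the ghostbasil API:

# Refit only the blocks whose strong set gained variables, in parallel.
changed <- which(strong_grew)               # blocks with new strong-set variables
fits[changed] <- parallel::mclapply(changed, function(b)
  fit_block_lasso(A_blocks[[b]], r_blocks[[b]], lambda,
                  warm_start = fits[[b]]),  # hypothetical per-block solver
  mc.cores = n_threads)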
