
ghostbasil's Introduction

GhostBASIL

This is a repository of experimental code using the GhostKnockoff and BASIL frameworks for fast and efficient lasso fitting on high-dimensional data. Specifically, this library is intended to be applied to GWAS summary statistics.

Dependencies

Developer Installation

Run the following command to generate a machine-specific .bazelrc:

./generate_bazelrc

Then, install the development dependencies:

mamba update -y conda
mamba env create
conda activate ghostbasil

Our tools require OpenMP to be installed on the system. On Mac, it must be installed with Homebrew, as our build system assumes:

brew install libomp

See the following for installation instructions for each of the sub-packages:

References

ghostbasil's People

Contributors

jamesyang007

Forkers

biona001
ghostbasil's Issues

Group lasso Newton method accelerated?

There are some results on accelerating Newton's method. If there is still a significant bottleneck in the Newton step, it might be worth looking into accelerated methods.

Checkpoint

It would be nice if we could provide checkpoint information so that we can fit at any lambda with a good warm start.

Basil lasso early exit

Currently, BASIL fits the lasso without early exiting. This may add an extra bottleneck; I'm not sure by how much.

`glmnetbasil` support

It would be nice to have a glmnet-like function that takes in individual-level data X, y and performs BASIL. The covariance method as in ghostbasil would be ideal, and the covariance matrix could be constructed on the fly as strong-set variables come in (see the sketch below).
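As a rough illustration of what such a wrapper could do, this sketch forms the summary inputs from individual-level data and materializes only the strong-set block of the covariance matrix. The 1/n scaling, the centering, and the initial strong-set size are assumptions here, not the package's documented behavior:

# Hypothetical sketch, not the ghostbasil API: build summary inputs from X, y
# and only the strong-set block of the covariance matrix.
n  <- nrow(X)
Xc <- scale(X, center = TRUE, scale = FALSE)   # centered design (scaling convention assumed)
yc <- y - mean(y)
r  <- drop(crossprod(Xc, yc)) / n              # gradient at beta = 0
strong   <- head(order(abs(r), decreasing = TRUE), 100)  # initial strong set (size assumed)
A_strong <- crossprod(Xc[, strong]) / n        # covariance restricted to the strong set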

Different strong rule

The strong rule for BASIL is different from the one proposed for glmnet.

At every BASIL iteration, we must do a KKT check in which we compute the gradient for all variables. The current formulation of BASIL uses a simple strong rule: add the delta variables with the highest absolute gradient (see the sketch below). The paper mentions that one extension is to use the usual strong rule, as in glmnet, to add variables in a data-dependent fashion. The usual strong rule rarely discards more than necessary, so it seems like the better rule.
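A minimal sketch of the simple rule described above; grad, strong, and delta are illustrative names, not the package internals:

# After the KKT check has computed the full gradient, add the `delta`
# variables with the largest absolute gradient not already in the strong set.
candidates <- setdiff(seq_along(grad), strong)
new_vars   <- head(candidates[order(abs(grad[candidates]), decreasing = TRUE)], delta)
strong     <- c(strong, new_vars)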

Store gradients?

Not sure how much the KKT check is adding, but when the number of lambdas that pass the KKT check is neither 0 nor the whole batch, we have to recompute the gradient again. This can be avoided if we store the gradients when we check KKT (see the sketch below). Based on a preliminary study, as lambda goes down, all lambdas pass KKT, so we might save memory here.
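An illustrative sketch of the caching idea, assuming the covariance formulation where the gradient at beta is r - A beta; all names here are hypothetical:

# Cache the gradient computed for each lambda during the KKT check so it
# need not be recomputed when only part of the batch passes.
grad_cache <- vector("list", length(lambda_batch))
for (k in seq_along(lambda_batch)) {
  grad_cache[[k]] <- r - A %*% beta_hat[[k]]                       # gradient used by the check
  if (!all(abs(grad_cache[[k]]) <= lambda_batch[k] + 1e-8)) break  # first KKT failure
}
# On a failure at index k, resume from grad_cache[[k - 1]] instead of recomputing.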

Compute full R^2

It might be worth providing an option that computes the full R^2 rather than just the differences (the current implementation); see the note below.
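If the reported values are per-lambda increments (my reading of "the differences"; an assumption about the current output), the full R^2 path would just be their running sum:

# Assuming result$rsqs holds per-lambda increments, accumulate them:
full_rsq <- cumsum(result$rsqs)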

Add `penalty.factor`

This gives a nice explanation of the penalty-factor objective. Trevor recommends mixing the l1 and l2 penalties together, as in this article, and penalizing the two together. I'm not sure that's what we want, though, because we don't have an elastic net penalty.

Change objective into elastic net

Currently, we use the PGR objective, which regularizes the correlation matrix A towards the identity. It's not clear why this is the "correct" thing to do. We propose changing the objective to simply the Gaussian loss plus an elastic net penalty (sketched below). This also prepares us to integrate the penalty factor (#3).
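As a sketch of the proposed objective in the summary-statistics (covariance) formulation; the mixing parameter alpha follows the glmnet convention, and the exact scaling is an assumption:

\min_{\beta} \; -\beta^\top r + \tfrac{1}{2}\,\beta^\top A\,\beta + \lambda \left( \alpha \|\beta\|_1 + \tfrac{1-\alpha}{2} \|\beta\|_2^2 \right)

By contrast, the PGR objective keeps a plain l1 penalty but replaces A with a version pulled towards the identity (e.g. (1-\gamma)A + \gamma I).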

Diagnostic

Create diagnostic information for BASIL to study properties such as the number of BASIL iterations, the number of KKT check passes per BASIL iteration, etc.

GhostBasil very slow for p=1140 problem

Hi James,

Here is an example where GhostBasil runs slowly. This region has 190 variables, and I generate 5 knockoff copies, so overall A is 1140 x 1140. On Sherlock with 12 cores, it takes ~200 seconds to converge.

Could you check if the output below matches your expectation? Hopefully I didn't misuse your package in some way.

library(Matrix)
library(ghostbasil)

# read data
Ci <- as.matrix(read.table('Ci.txt'))
S <- as.matrix(read.table('S.txt'))
r <- scan('r.txt')
lambdas <- scan('lambdas.txt')

# form ghostbasil inputs
S <- BlockMatrix(list(S))            # dim(S) = 190 by 190
A <- BlockGroupGhostMatrix(Ci, S, 6) # dim(A) = 1140 by 1140

# run ghostbasil
result <- ghostbasil(A, r, delta.strong.size=500, user.lambdas=lambdas,
    max.strong.size = nrow(A), n.threads=12, use.strong.rule=F)

Some observations:

  • The model seems to have converged (no error message is printed and result$error is "")
  • result$rsqs is a vector of NaN:
result$rsqs
  [1] NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
 [19] NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
 [37] NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
 [55] NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
 [73] NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
 [91] NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
  • The number of non-zero betas is not monotonically increasing:
colSums(result$betas != 0)
  [1]   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
 [19]   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
 [37]   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
 [55]   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
 [73]   0   0   0   0   0   0   0   0   0   0   0   0   0   1   8  11  18  31
 [91]  37  48  70 104 147 195 286 428 706   5

Validation Method

In addition to r, suppose we now have a validation vector r_validation (e.g., r_validation from another dataset or from pseudo summary statistics). We can then train ghostbasil using r and compute a validation R^2 based on r_validation as

R2_validation <- t(beta) %*% r_validation / sqrt(A$quad_form(beta))

Can we then stop BASIL when R2_validation stops increasing / reaches its maximum? This is currently how I choose the optimal lambda, and simulations show that it works well (a sketch of this selection rule follows below).

From an email discussion with Zihuai.
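A small sketch of that selection rule applied to the fit above, using the quad_form call from the formula; the variable names are illustrative:

# Score every column of the coefficient path against the validation statistics
# and pick the lambda with the highest validation R^2.
val_r2 <- apply(as.matrix(result$betas), 2, function(beta)
  sum(beta * r_validation) / sqrt(A$quad_form(beta)))
best       <- which.max(val_r2)   # NaN entries (all-zero beta) are skipped
lambda_opt <- lambdas[best]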

Block matrix acceleration

For any block matrix, we can make the following optimizations (see the sketch after this list):

  • Parallelize lasso fitting on the strong set for each block.
  • Only refit the lasso for the blocks whose strong set grew.
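An illustrative sketch of both ideas; fit_block_lasso, A_blocks, r_blocks, and strong_grew are hypothetical names, not the ghostbasil API:

# Refit only the blocks whose strong set gained variables, in parallel.
changed <- which(strong_grew)               # blocks with new strong-set variables
fits[changed] <- parallel::mclapply(changed, function(b)
  fit_block_lasso(A_blocks[[b]], r_blocks[[b]], lambda,
                  warm_start = fits[[b]]),  # hypothetical per-block solver
  mc.cores = n_threads)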
