chr1swallace / coloc

Repo for the R package coloc

coloc

The coloc package can be used to perform genetic colocalisation analysis of two potentially related phenotypes, to ask whether they share common genetic causal variant(s) in a given region.

Most of the questions I get relate to misunderstanding the assumptions behind coloc (dense genotypes across a single genomic region) and/or the data structures used. Please read vignette("a02_data",package="coloc") before starting an issue.

version 5

This update (version 5) supersedes the previously published version 4 by introducing the SuSiE approach to deal with multiple causal variants, rather than conditioning or masking. See

Wang, G., Sarkar, A., Carbonetto, P., & Stephens, M. (2020). A simple new approach to variable selection in regression, with application to genetic fine mapping. Journal of the Royal Statistical Society: Series B (Statistical Methodology). https://doi.org/10.1111/rssb.12388

for the full SuSiE paper and

Wallace (2021). A more accurate method for colocalisation analysis allowing for multiple causal variants. PLoS Genetics. https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1009440

for a description of its use in coloc.

To install from R, do

if(!require("remotes"))
   install.packages("remotes") # if necessary
library(remotes)
install_github("chr1swallace/coloc@main",build_vignettes=TRUE)

Note that in all simulations, susie outperforms the earlier conditioning approach, so is recommended. However, it is also new code, so please consider the code "beta" and let me know of any issues that arise - they may be a bug on my part. If you want to use it, the function you want to look at is coloc.susie. It can take raw datasets, but the time consuming part is running SuSiE. coloc runs SuSiE and saves a little extra information using the runsusie function before running an adapted colocalisation on the results. So please look at the docs for runsusie too. I found a helpful recipe is

1. Run runsusie on dataset 1, storing the results
2. Run runsusie on dataset 2, storing the results
3. Run coloc.susie on the two outputs from above
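
The three steps above might look like the following sketch. It assumes coloc version 5 is installed and borrows the `coloc_test_data` example datasets (D3 and D4) used in the SuSiE vignette; with real data these would be your own dataset lists, each including an LD matrix. The wrapper name `run_recipe` is made up for illustration.

```r
# Sketch of the runsusie -> coloc.susie recipe (run_recipe is a hypothetical wrapper).
run_recipe <- function(d1, d2) {
  s1 <- coloc::runsusie(d1)     # step 1: run SuSiE on dataset 1 and keep the result
  s2 <- coloc::runsusie(d2)     # step 2: run SuSiE on dataset 2 and keep the result
  coloc::coloc.susie(s1, s2)    # step 3: colocalise using the two stored results
}

# Only attempted when coloc is available.
if (requireNamespace("coloc", quietly = TRUE)) {
  data("coloc_test_data", package = "coloc")
  res <- run_recipe(coloc_test_data$D3, coloc_test_data$D4)
  print(res$summary)
}
```

Storing the runsusie results separately means the slow step is not repeated if the same dataset is later compared against others.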

More detail is available in the vignette a06_SuSiE.html accessible by

vignette("a06_SuSiE",package="coloc")

Background reading

For usage, please see the vignette at https://chr1swallace.github.io/coloc

Key previous references are:

original proposition of proportional colocalisation Plagnol et al (2009)
proportional colocalisation with type 1 error rate control Wallace et al (2013)
colocalisation by enumerating all the possible causal SNP configurations between two traits, assuming at most one causal variant per trait Giambartolomei et al (2013)
Thoughts about priors in coloc are described in Wallace C (2020) Eliciting priors and relaxing the single causal variant assumption in colocalisation analyses. PLOS Genetics 16(4): e1008720

Frequently Asked Questions

see FAQ

Notes to self

to generate website: https://chr1swallace.github.io/coloc/

Rscript -e "pkgdown::build_site()"

coloc's Issues

input for coloc.abf

Hi! I was trying to run the coloc.abf function and got results I wasn't expecting, and wanted to check on the inputs.

The tutorial says to supply the variance of Beta in the varBeta vector. Should this be a vector of the variance of Beta, the standard error of beta, the standard deviation of beta, or the square of the standard error of Beta?

Thanks!
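
Note: the package documentation describes varbeta as the variance of beta, which for a regression estimate is the squared standard error. A minimal sketch on simulated data (the variable names and the toy regression are illustrative only):

```r
# Toy regression: the sampling variance of the slope is the squared standard error.
set.seed(1)
x <- rbinom(500, 2, 0.3)            # genotype coded 0/1/2
y <- 0.2 * x + rnorm(500)           # quantitative trait

fit  <- summary(lm(y ~ x))$coefficients
beta <- fit["x", "Estimate"]
se   <- fit["x", "Std. Error"]

varbeta <- se^2                     # this is the quantity to supply as varbeta
```

The same value can be read off the model's variance-covariance matrix, `vcov(lm(y ~ x))["x", "x"]`.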

Checking whether dataset is a list

Hi,

While running coloc susie, accidentally my dataset 2 is NULL, but I get this message:

Error in check_dataset(d, suffix, req = c("beta", "varbeta", "LD", "snp")) : 
  dataset 1: is not a list

I think there is a bug that always prints dataset 1 instead of the right one.
Of course I should pre-filter my input datasets, but just want to report this.

Data harmonization for MAF

Hello,

I had several questions regarding data input and harmonization:

With regards to the MAF input, is this referring to the allele frequency of the 'effect allele' against which the beta is oriented?
Moreover, does coloc require data harmonization prior to analysis such that effect alleles are harmonized in the same direction? If so, I imagine a QC step would be to remove strand ambiguous alleles.
If harmonization is indeed required as above, is the same harmonization necessary if supplying sd(Y) directly? The subtext here I suppose is does the sign of beta matter - from understanding of methodology I assume it does not.

Thank you!
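
Note: in the single-variant approximate Bayes factor described by Giambartolomei et al (2013), the evidence at a SNP depends on beta only through z squared, so flipping the effect allele leaves it unchanged. The sketch below re-implements that formula purely for illustration; it is not the package's internal code, and the prior standard deviation W = 0.15 is an assumed value. This argument does not carry over to methods that combine signed z-scores with an LD matrix, such as SuSiE, where alleles have to be consistent with the LD matrix.

```r
# Wakefield log approximate Bayes factor for one SNP (illustrative re-implementation).
# W is an assumed prior standard deviation for the true effect size.
labf <- function(beta, varbeta, W = 0.15) {
  z <- beta / sqrt(varbeta)
  r <- W^2 / (W^2 + varbeta)        # shrinkage factor
  0.5 * (log(1 - r) + r * z^2)      # depends on beta only through z^2
}

labf(0.3, 0.01)    # effect allele A
labf(-0.3, 0.01)   # same SNP with the effect allele flipped: same value

# Where allele frequency enters only through f * (1 - f), either allele's
# frequency gives the same answer.
f <- 0.2
all.equal(f * (1 - f), (1 - f) * (1 - (1 - f)))
```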

coloc.signals without LD info

Dear Chris,

I am trying to run a colocalization analysis using the masking method in the condmask development branch.

Because I don't have the info about the LD structure for any of the datasets, I used method="mask" (which to my understanding is made for cases when you don't have LD info).

GWAS_dataset=list(pvalues=GWAS_data$pval,N=36000,s=0.2,MAF=GWAS_data$minor_allele_frequency,beta=GWAS_data$beta,varbeta=GWAS_data$se,type="cc",snp=GWAS_data$rsid,LD=NULL)
eQTL_dataset=list(eQTL_df$pval_nominal,N=1000,MAF=eQTL_df$maf,beta=eQTL_df$slope,varbeta=eQTL_df$slope_se,type="quant",snp=eQTL_df$rs_id_dbSNP151_GRCh38p7,LD=NULL)
col_analysis=coloc.signals(dataset1 = GWAS_dataset,dataset2 = eQTL_dataset,method ="mask",mode="iterative",p1 = 2.48e-5,p2=0.00161276,p12 = 0.00000626954)

I got two errors:

If you specify LD=NULL in the coloc.signals call rather than in the dataset list, I get an error in finemap.signals at:
LD <- LD[D$snp, D$snp]
Error in LD[D$snp, D$snp] : incorrect number of dimensions

because dataset$LD is only defined at the beginning of coloc.signals if the LD argument is not NULL (default is NULL):
if (!("LD" %in% names(dataset1)) & !is.null(LD))
dataset1$LD <- LD

If I do specify LD=NULL in the dataset list, you then get an error in map_mask at:

friends <- apply(abs(LD[, sigsnps, drop = FALSE]) > sqrt(r2thr),
1, any, na.rm = TRUE)

Error in abs(LD[, sigsnps, drop = FALSE])

Am I doing something wrong or is this a bug?

Thanks for your help!

Best

Sylvain

Input format for eQTL and GWAS analysis

Hi there,

I'm now trying to use this tool for my co-localization analysis of GWAS and eQTL.
What are the required input formats of COLOC tool? Is there any tutorial for eQTL, GWAS analysis by this tool? Thanks!

Simon

Output SNP.PP.H3 from coloc.abf

Hi there!

Thank you very much for this package. It's very convenient to use.

At the moment, I'm trying to have an additional column that would be the probability of H3 (i.e. SNP.PP.H3) per SNP. The coloc.abf function only outputs SNP.PP.H4. I tried looking into the source code (specifically the coloc.abf function and the combine.abf internal function) in an attempt to add some lines that would include the aforementioned column to my results, but alas, to no avail. Could you maybe help me with this or point me in the right direction? Any help would be greatly appreciated!

Confusing results when using beta and varbeta in coloc.abf()

Hi! I am new to coloc and it's really a great tool. However, I got some confusing results.
I read the reference at https://chr1swallace.github.io/coloc/reference/coloc.abf.html , and coloc.abf() recommends using beta, varbeta, N and MAF when sdY is unknown. But when I test this function with the same data as both dataset1 and dataset2 (which I expect to give PPH4=1), the result is not as expected:
l_gwas=list(N=eqtl$N,MAF=eqtl$MAF,beta=gwas$b,varbeta=gwas$SE,type="quant",snp=as.character(gwas$SNP))
coloc.abf(l_gwas,l_gwas)

And I got exactly the same results when I added a p-values column.

The interesting thing is, when I use only the N, MAF and p-values columns, I get the right result!
l_gwas=list(N=gwas$N,pvalues=gwas$p,MAF=gwas$MAF,type="quant",snp=as.character(gwas$SNP))
res=coloc.abf(l_gwas,l_gwas)

According to the article and reference documentation, beta and varbeta should be used in preference. However, the result suggests that I should only provide p values.
By the way, no PPH4 is higher than 0.005 when I use beta, varbeta, N and MAF to colocalise real GWAS and eQTL summary data, which suggests something is wrong.

Everything was run on R 3.6.1 and coloc 4.0-4. Thank you for your help.

Jiang Feng
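
Note: one quick sanity check when p-value input and beta/varbeta input disagree, assuming two-sided p-values from a normal approximation, is to compare the |z| implied by the p-values with |beta| / sqrt(varbeta). A large mismatch often means a standard error was supplied where a variance (SE squared) was expected. The data frame below is invented for illustration; the column names b, SE and p mimic those in the post.

```r
# Hypothetical summary statistics.
gwas <- data.frame(b = c(0.10, -0.25, 0.02), SE = c(0.05, 0.05, 0.05))
gwas$p <- 2 * pnorm(-abs(gwas$b / gwas$SE))            # two-sided p-values

z_from_p   <- qnorm(gwas$p / 2, lower.tail = FALSE)    # |z| implied by the p-values
z_correct  <- abs(gwas$b) / sqrt(gwas$SE^2)            # varbeta = SE^2
z_mistaken <- abs(gwas$b) / sqrt(gwas$SE)              # varbeta = SE (incorrect)

# The first two columns agree; the third does not.
cbind(z_from_p, z_correct, z_mistaken)
```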

question regarding beta's

Dear reader,

I am trying to use coloc for the first time. I want to perform fine mapping and colocalization analysis between two traits. For one trait I only have the summary statistics from a meta-analysis and for the other trait I have QTL info.
For both traits, I have the EAF (instead of MAF), and the beta refers to the effect allele. My question is as follows:
Should I use a beta that also refers to the minor allele, or is this not necessary? I ask this because for some SNPs the effect allele is not the minor allele, so if I wanted to report the beta for the minor allele, I would then have to change the direction.

Hope you can help me out.

thank you very much in advance,

Sergio

Susie parameters

Hi, in the documentation of runsusie it says the default for coloc are

arguments passed to susie_rss. In particular, if you want to match some coloc defaults, set

prior_variance=0.2^2 (if a case-control trait) or (0.15/sd(Y))^2 if a quantitative trait

estimate_prior_variance=FALSE

But it seems that in the code below, these parameters are not set as above

coloc/R/susie.R

Lines 409 to 410 in a2825b8

 res=do.call(susie_rss,
             c(list(z=z, R=LD, max_iter=maxit), susie_args))

I think the PIP from this current setup, which is the default susie_rss, seems to make more sense compared to using prior_variance=0.2^2 and estimate_prior_variance=FALSE.

Which parameters do you exactly recommend using for COLOC?

Thanks!

process.dataset shouldn't create z.df1 etc twice

problem with credible sets in susie branch

Hi Chris,

I'm running the "susie" branch of coloc, using the latest version of susieR@master, and I noticed strange behavior in the Susie vignette. Your example dataset D3 is supposed to have two credible sets, corresponding to variants s25 and s25.1. However, the output that I get is:

S3 <- runsusie(D3,nref=503)
running iterations: 100
converged: TRUE
summary(S3)

Variables in credible sets:

 variable variable_prob cs
      101             1  2

Credible sets summary:

 cs cs_log10bf cs_avg_r2 cs_min_r2 variable
  2 0.01342722         1         1      101

I obtain the correct output if I try to run Susie without the null_weight parameter, like so:

z <- D3$beta/sqrt(D3$varbeta)
R <- D3$LD
S3noPrior <- susieR::susie_rss(z=z, R=R, z_ld_weight=1/503)
summary(S3noPrior)

Variables in credible sets:

 variable variable_prob cs
       25             1  2
       75             1  3

Credible sets summary:

 cs cs_log10bf cs_avg_r2 cs_min_r2 variable
  2   22.78721         1         1       25
  3   22.78702         1         1        7

However, getting rid of null_weight leads to downstream errors in the coloc pipeline. Can you help me figure out what's happening?

Best,
Matei

Posterior probability with p values versus beta and varbeta

Hi,

I've been using the coloc package to determine whether protein QTLs colocalise with eQTLs. I have to say I've enjoyed using this package; I find it very user-friendly. However, I'm just getting to grips with the theory behind it and its usage.

I have been taking a ~1Mb region surrounding a sentinel pQTL variant (dataset1) and then also taking eQTLs for the gene of interest within the same region (dataset2). I have run the coloc.abf function with just p values and MAF (without beta and varbeta) and gotten > 0.9 for PP.H4.abf, then I ran it without p values and instead with beta and varbeta (standard error). In this case, I get a PP.H4.abf of < 0.01. When I include all information, I get the latter (< 0.01).

This has led me to believe there may be something wrong in my input files or in my understanding as these are drastically different results. I just wanted to ask if you could speculate on what might be going on, whether this is to be expected or if there is any guidance on this matter?

Many thanks,
Anthony

Running coloc.abf without MAF for case control GWAS- studies

Hi Chris,

I am a new user of the coloc package. I am using the coloc.abf function for colocalisation analysis between hQTLs and GWAS summary statistics.

One of the GWAS studies is a case-control study, and the available parameters are p-values, beta values and varbeta, but not MAF. At the moment, when I attempt to run the analysis, I get an error saying that for the type "cc" I should have MAFs.

The vignette says I can run basic coloc.abf with the regression coefficient and the variance of this coefficient for each SNP. For the GWAS studies, I do have these parameters, but I get the above error.

My question is: is it possible to run a "cc" type dataset without MAFs, with only regression coefficients and the variance of these coefficients?

`runsusie` for a dataset without varbeta

Hi,

Thank you very much for coloc, it is very helpful!
I am wondering, if it is possible to use runsusie function for a dataset without varbeta? I am trying to co-localize multiple signals between GWAS and eQTL summary stats. Unfortunately, I have only MAF, p-value, z-statistics and number of samples from eQTL dataset. I tried to apply runsusie, assuming that it will work, because coloc.abf can accept data without varbeta. But runsusie failed with an error message "Error in sqrt(d$varbeta) : non-numeric argument to mathematical function". Is there any chance it can work without varbeta?

Thank you for your answer in advance!

Unable to run coloc.abf with beta and varbeta

Dear Chris,

I am trying to use coloc.abf with beta and varbeta. I received the following error:

res=coloc.abf(dataset1=list(snp=gwas$snp,beta=gwas$beta,varbeta=gwas$varbeta,type="cc"),dataset2=list(snp=eqtl$snp,beta=eqtl$beta,varbeta=eqtl$varbeta,type="quant"))
Processing dataset
Error in process.dataset(d = dataset1, suffix = "df1") :
Must give, as a minimum, either (beta, varbeta, type) or (pvalues, MAF, N, type)

Since I have provided beta and varbeta, I am not sure why this error would occur. Could you help me on this? Thanks!

The use of case:control only when P is used

Hi,

I am doing coloc between a GWAS and an eQTL dataset. The GWAS is case-control in this case. For type=cc, coloc.abf also requires s, the proportion of cases, as an input. However, diving into the code, it seems that s is only used when p-values are used, but not when beta and varbeta are provided (see the following line)

coloc/R/claudia.R

Line 212 in 2ca77d0

 abf <- approx.bf.p(p=df$pvalues, f=df$MAF, type=d$type, N=d$N, s=d$s, suffix=suffix) 

Therefore, is it possible to do coloc with beta and varbeta for case-control GWAS without giving s?

Thanks in advance!

Simulation example on coloc conditioning vignette not working

Dear Coloc developers,

Thank you for creating such a helpful user friendly tool. I aim to use the new feature on testing for colocalization on multiple SNPs in a locus between traits and I am trialing the examples on your vignette (https://chr1swallace.github.io/coloc/articles/a05_conditioning.html) on this but the script to simulate a dataset is not working. I will be grateful if you could look into this.

library(mvtnorm)
simx <- function(nsnps,nsamples,S,maf=0.1) {
    mu <- rep(0,nsnps)
    rawvars <- rmvnorm(n=nsamples, mean=mu, sigma=S)
    pvars <- pnorm(rawvars)
    x <- qbinom(1-pvars, 2, maf)
}


sim.data <- function(nsnps=50,nsamples=200,causals=floor(nsnps/2),nsim=1) {
  cat("Generate",nsim,"small sets of data\n")
  ntotal <- nsnps * nsamples * nsim
  S <- (1 - (abs(outer(1:nsnps,1:nsnps,`-`))/nsnps))^4
  X1 <- simx(nsnps,ntotal,S)
  X2 <- simx(nsnps,ntotal,S)
  Y1 <- rnorm(ntotal,rowSums(X1[,causals,drop=FALSE]/2),2)
  Y2 <- rnorm(ntotal,rowSums(X2[,causals,drop=FALSE]/2),2)
  colnames(X1) <- colnames(X2) <- paste("s",1:nsnps,sep="")
  df1 <- cbind(Y=Y1,X1)
  df2 <- cbind(Y=Y2,X2)
  if(nsim==1) {
    return(new("simdata",
               df1=as.data.frame(df1),
               df2=as.data.frame(df2)))
  } else {
    index <- split(1:(nsamples * nsim), rep(1:nsim, nsamples))
    objects <- lapply(index, function(i) new("simdata", df1=as.data.frame(df1[i,]),
                                             df2=as.data.frame(df2[i,])))
    return(objects)
  }
}

set.seed(46411)
data <- sim.data()
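
Note: the post does not quote the error message. One thing visible in the snippet is that it calls new("simdata", ...) without defining that class. If the error is about an undefined class, a definition along the following lines, run before sim.data(), would be one way past it. This is a sketch: the slot names are taken from the new() calls above, and the validity check is an assumption about what the class should enforce.

```r
library(methods)

# Container for a pair of simulated datasets; slot names follow the new() calls above.
setClass("simdata",
         representation(df1 = "data.frame", df2 = "data.frame"))

# Assumed sanity check: both data frames describe the same number of samples.
setValidity("simdata", function(object) {
  if (nrow(object@df1) != nrow(object@df2))
    return("df1 and df2 should have the same number of rows")
  TRUE
})

# Minimal usage with toy data frames.
obj <- new("simdata",
           df1 = data.frame(Y = rnorm(3), s1 = c(0, 1, 2)),
           df2 = data.frame(Y = rnorm(3), s1 = c(1, 1, 0)))
```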

coloc-mask-paper repo not public

This link was given in the recent preprint but isn't live yet:

https://github.com/chr1swallace/coloc-mask-paper

thanks!
Mike

coloc.abf: document defaults for p1 etc, name elements of list returned

error using finemap.signals for method 'cond'

Hi,
I'm running finemap.signals to use the 'cond' method and getting the following error

approximating linear analysis of binary trait Error in if (sum(use) == 1) { : missing value where TRUE/FALSE needed

I tried to fix this but this is causing other issues. I kindly request you to look into this issue.

Many thanks,
Pooja

pcs.prepare should subset to common SNPs

Issue raised by Wenhua.

> pcs <- pcs.prepare(A,B)
Error in rbind(X1, X2) : 
  number of columns of matrices must match (see arg 2)
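
Until this is handled inside the function, callers could restrict both genotype matrices to their shared SNP columns before calling pcs.prepare. A minimal sketch, assuming A and B are matrices with SNP names as column names (the toy matrices below are made up):

```r
# Toy genotype matrices with partially overlapping SNP columns.
A <- matrix(rbinom(5 * 4, 2, 0.3), nrow = 5,
            dimnames = list(NULL, c("s1", "s2", "s3", "s4")))
B <- matrix(rbinom(6 * 3, 2, 0.3), nrow = 6,
            dimnames = list(NULL, c("s2", "s4", "s5")))

common <- intersect(colnames(A), colnames(B))   # SNPs present in both
A2 <- A[, common, drop = FALSE]
B2 <- B[, common, drop = FALSE]

dim(rbind(A2, B2))   # rbind now succeeds: 11 rows, 2 columns
```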

Some small feedback

Hi Chris,

I did a quick scan. Hopefully the following is helpful.

DESCRIPTION: (I am not 100% sure if the following is needed or not) Perhaps you are missing a line specifying that snpStats comes from Bioconductor, i.e.
```
Remotes: bioc::release/snpStats
```
DESCRIPTION: You could include the ORCiDs of yourself and your co-authors in the Authors@R vector. You add this to each person with the comment option, e.g. comment = c(ORCID = "XXXX-XXXX-XXXX-XXXX").
_pkgdown.yaml: Currently your pkgdown/githubpages site is missing the rendered vignettes, i.e. the Articles page is blank. I think this is because your code in _pkgdown.yaml is too complicated. I think you are trying to exclude simdata.Rmd from being shown as a separate vignette so I would code:
```
articles:
- title: Vignettes
  navbar: ~
  contents:
  - a01_intro
  - a02_proportionality
  - a03_enumeration
  - a04_sensitivity
  - a05_conditioning
```
And from simdata.Rmd remove the vignette: section in its header.
README.md:
- To build the vignettes is the code before devtools::build_vignettes() necessary?
- The RStudio folk usually put their build badges just below the title of the README.
GitHub repo: This is missing the link to your githubpages URL in the standard place, i.e. add the githubpages URL in your repo settings

Best wishes
Tom

coloc.abf: priors p value and making sense of output?

hi chris,

I'm using GWAS summary stats and eQTL to conduct coloc analysis. I use coloc.abf using variance and beta for SNPs at default function parameters.

Upon analysing, I check output, for example, SNP1 that has SNP.PP.H4 of 1.13E-04. But in GWAS SNP has P-value of 0.72 and in eQTL, SNP has P-value of 0.08.

I find it odd that P-values from separate analysis are so big (that is non significant), yet the posterior probability is so low (that is more significant).

Also, the defaults for priors are as follows:
p1 prior probability a SNP is associated with trait 1, default 1e-4
p2 prior probability a SNP is associated with trait 2, default 1e-4
p12 prior probability a SNP is associated with both traits, default 1e-5

I'm sorry, but how are these used?

Would you be able to shed some light on these two questions?

I'm using R4.0 and coloc_3.2-1
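
Note: the way the three priors enter can be sketched from the formulas in Giambartolomei et al (2013). Per-SNP log Bayes factors for each trait are combined into one term per hypothesis, weighted by p1, p2 and p12, and the terms are normalised to give the posterior probabilities of H0 to H4. The code below is an illustrative re-implementation with toy numbers, not the package's own code. Separately, SNP.PP.H4 is a posterior probability that a given SNP is the shared causal variant, not a p-value, so a small value for a SNP with large p-values in both studies is what one would expect.

```r
logsum  <- function(x) max(x) + log(sum(exp(x - max(x))))   # log(sum(exp(x)))
logdiff <- function(a, b) a + log1p(-exp(b - a))            # log(exp(a) - exp(b)), a > b

# l1, l2: per-SNP log approximate Bayes factors for traits 1 and 2.
combine_sketch <- function(l1, l2, p1 = 1e-4, p2 = 1e-4, p12 = 1e-5) {
  lsum <- l1 + l2
  lH <- c(
    H0 = 0,
    H1 = log(p1) + logsum(l1),                    # trait 1 only
    H2 = log(p2) + logsum(l2),                    # trait 2 only
    H3 = log(p1) + log(p2) +                      # both traits, different SNPs
         logdiff(logsum(l1) + logsum(l2), logsum(lsum)),
    H4 = log(p12) + logsum(lsum)                  # both traits, same SNP
  )
  exp(lH - logsum(lH))                            # posteriors, summing to 1
}

l1 <- c(0.1, 12, 0.3, -0.2)   # toy values: SNP 2 is strongly associated with trait 1
l2 <- c(0.0, 10, 0.5, 0.1)    # ... and with trait 2
round(combine_sketch(l1, l2), 3)
```

Raising p12 relative to p1 and p2 shifts posterior mass from H3 towards H4, which is why the choice of priors deserves a sensitivity check.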

About Differing N and s Parameters & Matching Two Datasets

Hi,

First of all, thanks a lot for developing coloc, it has been very useful so far for my projects!

I am mainly using coloc.abf (coloc v4.0.4) and my first question is regarding the sample size (N) and case/N (s) inputs. For some GWAS summary statistics files, these tend to be differing; for example, markerX is only tested in 8 out of 10 cohorts of the meta-analysis, but markerY is tested in 6 out of 10 cohorts, making the N and s different for these two markers. In this case, considering that coloc only allows single N and s input, what's the best option, is it calculating N and s based on maximal numbers? (For example, if it is a GWAS in 50K cases and 50K controls -> assign N = 100K and s = 0.5; independent of the fact that some of the markers were measured in 40K cases and 40K controls actually).

Second question: is it best practice to match two datasets prior to coloc, i.e. removing all markers that are found in dataset1 but missing in dataset2 and so forth?

Thanks in advance!
Fahri

coloc.abf is slow

It could be quicker - thin and tag the SNPs better, so we select a sparser set of models. Make sure we tag on the joint set of X, not one then the other.

error when using finemap.abf() function

Dear Chris,
My apologies for disturbing you again.
I am trying to run the finemap.abf() function, but keep getting an error which yields NAs in the SNP.PP column (see below).
The script was running fine when using a small set of SNPs (169), but now I wanted to run it on a whole chromosome (552K SNPs) and it gives me this error. The only things that changed in the input file were the headers (which were then adjusted in the dataset= parameters) and that the SNP ID is now CPTID instead of rsid.
Do you perhaps know how to solve this?

Thank you very much in advance.

Kind regards,

Sergio

MAF question

Hi Chris

Thanks for the great package.

I have coded all effect alleles as non reference and adjusted the betas accordingly. Should I provide the allele frequency for the alternative/effect allele or the least frequent (which is usually non-reference but not always)?

Secondly does it matter if I use the study AF or external sources i.e. 1kg (assuming the same ethnicity)?

Many thanks
Matt

coloc.signals asks for N when s is present

my.res <- coloc.signals(
              dataset1 = list(N = ncol(pheno) - 4, beta = tmpQtl$slope, varbeta = tmpQtl$var,
                              type = "quant", MAF = tmpQtl$MAF, snp = tmpQtl$sid),
              dataset2 = list(beta = log(tmpGwas$`OR(A1)`), varbeta = tmpGwas$var, type = "cc",
                              s = 0.29, snp = tmpGwas$snp),
              method = "mask", mode = "iterative", maxhits = 5, p12 = 1e-5)

I have this set up for coloc.signals, but I got this error

Error in check.dataset(dataset2, 2) : dataset 2: sample size N not set

Error when using coloc.signals

Hello,

First, I wanted to thank you for this great tool. Second, I wanted to run coloc.signals but I am encountering what seems to be a memory error and some warnings regarding sdY.est (even when I provide varbeta, MAF, and N):

Error: vector memory exhausted (limit reached?)
In addition: Warning messages:
1: In sdY.est(vbeta = varbeta, maf = MAF, n = N) :
estimating sdY from maf and varbeta, please directly supply sdY if known
2: In sdY.est(d$varbeta, d$MAF, d$N) :
estimating sdY from maf and varbeta, please directly supply sdY if known

I have tried running it locally on Rstudio and also on a cluster and no matter how much memory I provide, it still fails. I know coloc.signals is part of the new coloc-4.0-4 version so I was wondering if this might be a bug. Thank you for your help.

Nelson Barrientos

Issue with the `est_cond` function when running conditional coloc

Dear Dr. Wallace,

Thank you for the recent update to coloc, we're happy to see that this popular method is still improving.

We're trying to run the new conditional coloc on some simulated data for benchmarking purposes, but cannot get the functions finemap.signals() and coloc.signals() to work. Both functions emit the following error:

Error in if (sum(use) == 1) { : missing value where TRUE/FALSE needed
Calls: finemap.signals -> map_cond -> est_cond
Execution halted

I'm attaching the code and summary statistics file to this message, so that you can reproduce the issue.
code_conditional_coloc_github_issue.txt
associations_github_issue.txt

Unfortunately, the LD matrix is too large to share here, therefore I'm attaching it as a link.

Perhaps we're doing something wrong here?

Thank you for your time and best wishes
Adriaan

finemap.signals without beta and varbeta

Dear Chris,

Can one use masking without providing beta and varbeta? Now when using (pvalues, MAF, N, type) as input, it throws an error in coloc.signals():

Error in eval(jsub, SDenv, parent.frame()) : object 'varbeta' not found

The error message seems to originate from the finemap.signals -> map_mask(D, LD, r2thr, sigsnps = names(hits)) function. Within that function absolute values of z-statistics are used (calculated from beta and varbeta), which can be estimated just based on the p-values too. Are the signed z-statistics that are the output of finemap.signals() used somewhere downstream in the coloc.signals() function? It seems that just the names of the output vector of finemap.signals() are used as an input to coloc.process().

Related to use of finemap.signals() function in coloc.signals(), would it be possible to specify the independent variants/separate signals to use in coloc.signals()? For example, to specify the fm1 object beforehand as an input for coloc.signals() and not estimating that within the function?

Thanks,
Silva

method to simulate data doesn't show up at beginning of vignette.pdf

error when running coloc.SuSie

Dear Chris,

Thanks a lot for the new SuSiE colocalization tool!
I have been trying to get it running to perform colocalization analysis.
I had my datasets checked by the check.dataset function, and they were found correct (NULL exit).
I then tried to run coloc.susie but I keep getting the output below, which results in an empty results object.
running iterations: 100
converged: TRUE
running iterations: 100
converged: TRUE
Warning messages:
1: In (function (z, R, maf = NULL, maf_thresh = 0, z_ld_weight = 0, :
The z_ld_weight > 0 feature is under development.
2: In (function (z, R, maf = NULL, maf_thresh = 0, z_ld_weight = 0, :
The z_ld_weight > 0 feature is under development.

I was wondering if you know how to troubleshoot this.

Thanks in advance,

Implementing coloc's conditional analysis in moloc

Hello,

I am trying to implement the new conditional analysis included in coloc to moloc. I am not very familiar with the statistics that you use, so I am somewhat blindly adapting coloc scripts to work with moloc. I hope to get your input as to whether or not my proposed code makes sense and whether or not there are any statistical problems with applying coloc's conditional analysis to moloc.

If I understand coloc.signals correctly, you first identify independent variants associated with the individual traits (finemap.signals) and then calculate the estimates conditioning on these independent signals (est_all_cond). Then you perform the Bayesian colocalization on every combination of the conditional estimates (coloc.detail). If colocalizing on the conditional estimates of all traits, coloc.process does not do much other than reformat the data for easier interpretation.

Does it make sense then to replace coloc.detail and coloc.process with moloc's moloc_test function? I've attached some code that I adapted from coloc.signals that implements this approach (just replace .txt with .R). Are there any statistical problems with doing this?

Let me know if I should post this on moloc's github instead.

Best,
James

moloc.signals.txt

coloc.susie vs coloc.abf

Hi Chris,

I've been testing out coloc.susie to look for colocalization between various trait GWAS vs molecular QTL signals using UKBB as the reference panel. I've found the coloc.susie to be really useful at many loci that clearly have multiple independent signals - thanks for this functionality. One thing I wanted to pick your brain on -

Sometimes, SuSiE doesn't identify any credible sets for a trait (even after lowering the coverage and min_abs_corr parameters), usually when there’s a GWAS/QTL association identified but the signal is weak - we just use coloc.abf in that case. In other cases where SuSiE does identify one or more signals - I compared PP H4 from coloc.susie (best PP H4 among all pairs) vs coloc.abf for the same region. Attached is the plot for type 2 diabetes GWAS vs chromatin accessibility QTL in our tissue of interest, colored by if SuSiE identified one signal for each trait or more than one signal for a trait. Apart from cases where coloc.susie clearly identifies relevant signals and colocalization, coloc.abf gives somewhat higher PP H4 values, often in regions where SuSiE identified only one signal per trait. Is this expected? I would think that SuSiE should help get a clean credible set so PP H4 from coloc.susie should be used regardless of how many signals were identified. On the other hand, we are using a reference panel for SuSiE which can be messy. Would you suggest using coloc.abf where there are clearly single signals for both traits? Or a more manual, visual inspection at each locus of interest?

fig.pp_comparison1.pdf

merge coloc.abf, coloc.abf.imputed

coloc.abf and coloc.abf.imputed should really be one function, allowing either

p values + sample sizes + priors for variance
or
beta + se

Coloc.abf does not output probabilities in R

Dear Chris

I am trying to run coloc.abf on 2 datasets. I input p-values and N samples for both (one is a case-control study), as well as the MAF (taken from one of the studies). The total N SNPs is 1,558,262. When I run coloc.abf this is the output:

PP.H0.abf PP.H1.abf PP.H2.abf PP.H3.abf PP.H4.abf
NaN NaN NaN NaN NaN
[1] "PP abf for shared variant: NaN%"

So essentially it is unable to calculate the posterior probabilities.

I tried with subsets of the same data, and it works - so I get the probabilities if I run for, say, 1,000 SNPs, and it does so for multiple randomly-selected rows - but I want to investigate the entire list.

Would you be able to let me know why this happens? I checked the results of coloc.abf and it provides me with data on each SNP, but not the probability. I am looking forward to your reply.

Best
Miruna

Inconsistent test data results with demo by using runsusie()

Hello,

I am testing the new coloc method by following the article "https://chr1swallace.github.io/coloc/articles/a06_SuSiE.html".

The result of coloc.abf() is consistent with the article, while the result of runsusie() is not. (I have uploaded the code and result info.)

I am very confused by this. Could it be caused by the R package versions? My coloc version is 5.0.0.9002 and susieR is 0.11.7. I would be very grateful if you could help me.

Matching variants between phenotypes

Hi,

I have two datasets, a GWAS and eQTL data. If I want to colocalize those two datasets, do I need to subset the variants to those that are present in both datasets? Or is this done automatically?

Is there a downside to such a subsetting step?

Thanks
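
Note: for anyone who prefers to make the overlap explicit, the sketch below subsets two coloc-style dataset lists to their shared SNPs. It assumes each dataset is a list whose per-SNP vectors are aligned with its snp element and whose LD matrix, if present, has SNP ids as row and column names. The helper `subset_dataset` is hypothetical and not part of coloc.

```r
# Hypothetical helper: keep only the SNPs in `keep`, in that order.
subset_dataset <- function(d, keep) {
  idx <- match(keep, d$snp)
  n <- length(d$snp)
  for (nm in names(d)) {
    x <- d[[nm]]
    if (is.matrix(x)) {
      d[[nm]] <- x[keep, keep, drop = FALSE]   # e.g. LD, indexed by SNP id
    } else if (length(x) == n) {
      d[[nm]] <- x[idx]                        # per-SNP vectors
    }                                          # scalars (N, type, s) are left alone
  }
  d
}

# Toy datasets with partially overlapping SNPs in different orders.
gwas <- list(snp = c("rs1", "rs2", "rs3"), beta = c(0.1, 0.2, 0.3),
             varbeta = c(0.01, 0.01, 0.02), type = "cc")
eqtl <- list(snp = c("rs3", "rs1", "rs9"), beta = c(-0.4, 0.5, 0.1),
             varbeta = rep(0.02, 3), type = "quant", N = 1000, sdY = 1)

common <- intersect(gwas$snp, eqtl$snp)
gwas2 <- subset_dataset(gwas, common)
eqtl2 <- subset_dataset(eqtl, common)
```

Comparing `length(common)` with the size of each input is a cheap way to spot identifier mismatches (for example rsid versus chr:pos) before running any analysis.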

Variants with zero variance raise an error

In the estimation of sdY from varbeta, the variance of beta has to be a non-infinite non-zero value. In my eQTL mapping, there are some rare cases that P=1, therefore Z=0, and thus variance might also be 0. I am currently removing these few SNPs.
Is there a better way to handle this?
Many thanks

Separate MAFs for cases and controls

Hello,

I am using summary statistics from a case-control GWAS where the MAFs of the cases and controls are provided in separate columns. Would you have recommendations on the best way to use these columns with coloc.abf?

Thank you!

Error in data.frame(V, z, r, lABF) : arguments imply differing number of rows: 4, 3

Hi Chris,

I'm using the 'coloc.abf' function on the summary p-value data, and I get an error like this:
Error in data.frame(V, z, r, lABF) :
arguments imply differing number of rows: 4, 3

Is there a problem in the eqtl dataset?

The dataset I used is:

dataset_eqtl
 pvalues     N      MAF  type    s         snp
2.30e-05  1000 0.073901 quant  0.5  rs10421291
6.21e-05  1000 0.232738 quant  0.5     rs10119
9.24e-05  1000 0.465455 quant  0.5  rs11083773
1.84e-06  1000 0.227231 quant  0.5   rs3760842

dataset_gwas
 pvalues     N      MAF  type    s         snp
 0.00129  2000 0.073901    cc  0.5  rs10421291
 0.00000  2000 0.232738    cc  0.5     rs10119
 0.81400  2000 0.465455    cc  0.5  rs11083773
 0.22200  2000 0.227231    cc  0.5   rs3760842

The command I used is:
coloc.abf(dataset_gwas,dataset_eqtl,MAF=eqtl.tmp$freq)
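
One thing that can be checked in the data above is the GWAS p-value printed as 0.00000. If it is stored as exactly zero, converting it to a z-score gives an infinite value. The sketch below shows only that conversion in base R; it is an assumption that this is relevant to the reported error, and the conversion shown is the standard two-sided one rather than a quote of coloc's internals.

```r
# GWAS p-values as printed in the post above.
p <- c(0.00129, 0, 0.814, 0.222)

# Standard two-sided p-value to |z| conversion.
z <- qnorm(0.5 * p, lower.tail = FALSE)

# Flag SNPs whose z-score is not finite; these come from p-values of 0.
bad <- which(!is.finite(z))
```

Here `bad` points at the second SNP (rs10119), so that row is worth inspecting before running the analysis.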

Sensitivity function returning non-numeric matrix extent when running with non-default p1,p2, and p12

Dear Chris and collaborators,

I am trying to run a sensitivity analysis on my coloc object, which I obtained using priors based on the number of SNPs in my testing region (5108): p1=1/nsnps, p2=p1, p12=p1/10. This gives p1=1.957713e-04, p2=1.957713e-04, and p12=1.957713e-05. However, when I run sensitivity using these values I get the following error:

"Error in matrix(f(p12), nrow = nrow(pr1), ncol = ncol(pr1), byrow = TRUE) : non-numeric matrix extent"

This only happens when I set priors different to the coloc defaults.

Do you think this could be a bug or is it that I'm doing something wrong? Thanks in advance for your help with this issue, and thanks for such a useful package!

Best,

Isaac
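
For reference, the prior values quoted in this post follow directly from the number of SNPs. A short base-R sketch of that arithmetic, using only the numbers given above:

```r
nsnps <- 5108        # number of SNPs in the region, as stated in the post

p1  <- 1 / nsnps     # prior that a SNP is associated with trait 1 only
p2  <- p1            # same prior for trait 2
p12 <- p1 / 10       # prior that a SNP is associated with both traits

# Rounded to 7 significant figures, as quoted in the post.
priors <- signif(c(p1 = p1, p2 = p2, p12 = p12), 7)
```

The three values are plain numeric scalars, so they can be passed on unchanged to whichever function takes `p1`, `p2` and `p12`.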

Add pretty hypothesis picture to vignette

coloc.test needs extension to work with correlated beta coefficients

@mdfortune is working on this, see https://gist.github.com/mdfortune/5813081

Case proportion or case:control ratio?

Hi,

While using coloc.abf, I got confused by the s parameter. The R package documentation says s is "for a case control dataset, the proportion of samples in dataset 1 that are cases". However, other sources say that we need the "ratio of cases:controls".

Therefore, I want to make sure which number I should actually use. Thank you so much!
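
The two quantities in this question differ as follows. The sketch uses made-up sample sizes and only shows the arithmetic; the quoted package documentation describes s as the proportion.

```r
# Made-up sample sizes for illustration.
n_cases    <- 2000
n_controls <- 8000

# Proportion of samples that are cases (the definition quoted from the docs).
s <- n_cases / (n_cases + n_controls)        # 0.2

# Case:control ratio, which is a different number.
ratio <- n_cases / n_controls                # 0.25

# Converting a ratio back to a proportion.
s_from_ratio <- ratio / (1 + ratio)
```

The proportion always lies between 0 and 1, whereas the ratio exceeds 1 whenever cases outnumber controls, which gives a quick sanity check on which one a given source means.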

Set the maximum number of iterations to a lower amount

In the "runsusie" function, currently the "max_iter" is set to 100, is it possible to set it to a smaller amount?

a bug when using 'mask'

When I use the 'mask' method for colocalisation, an error occurs. I think the reason is that I did not pass any LD parameter, yet the following code was still called

if(!is.null(sigsnps)) {
   expectedz <- rep(0,nrow(x))
   friends <- apply(abs(LD[,sigsnps,drop=FALSE])>sqrt(r2thr),1,any,na.rm=TRUE)

in map_mask(). I do not know why. Please help; many thanks. (I have attached my data: dataset.zip.) The error appears only with this dataset, not with others.

coloc.signals requires sdY even when MAF and N is given

Hi,

I am trying to run coloc.signals on my dataset. For my QTL dataset, I supplied N, beta, varbeta, type, LD, MAF, snp, method. However I got this output:

PP.H0.abf PP.H1.abf PP.H2.abf PP.H3.abf PP.H4.abf 
7.19e-146 2.97e-146  6.63e-01  2.74e-01  6.38e-02 
[1] "PP abf for shared variant: 6.38%"
Error in process.dataset(d = dataset1, suffix = "df1") : 
  dataset df1: must give sdY for type quant, or, if sdY unknown, MAF and N so it can be estimated
In addition: Warning message:
In sdY.est(d$varbeta, d$MAF, d$N) :
  estimating sdY from maf and varbeta, please directly supply sdY if known

It is strange to me because PP4 can be calculated and I got the warning message "estimating sdY from maf and varbeta, please directly supply sdY if known", just like with coloc.abf. But I also got the error message "dataset df1: must give sdY for type quant, or, if sdY unknown, MAF and N so it can be estimated", which I did not see when using coloc.abf.

Thanks so much!

error in coloc.signals

Hello,

I am having some issues with the newly implemented conditional analysis (coloc.signals). I am running this on multiple loci and most of the time it works fine, however in some cases I get an error message about the beta containing missing values, although there are none.

In one example I tried splitting up the dataframe containing the input data to find the SNP causing the error. Strangely when trying rows 1:2000 I get the error, however when running rows 1:1000 or 1001:2000 it works.

I also can run the coloc.abf() function on the same datasets without any problems.

The error message looks like this:

colocres_cond = coloc.signals(df1, df2, p12=1e-5, LD=ld, method="cond", pthr=5e-8, mode="allbutone", maxhits=3)
approximating linear analysis of binary trait
quality of linear approximation (ideal is 1): 1.1913
Error in check.dataset(x, req = c("beta", "varbeta", "MAF")) :
dataset : beta contains missing values
In addition: Warning messages:
1: In sdY.est(vbeta = varbeta, maf = MAF, n = N) :
estimating sdY from maf and varbeta, please directly supply sdY if known
2: In bin2lin(D) : linear approx quality is not good

The beta and varbeta for the two datasets look like this:

summary(df1$beta)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.673529 -0.029262 0.006979 -0.009623 0.034412 0.319705
summary(df1$varbeta)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0003323 0.0004039 0.0006114 0.0015925 0.0017810 0.0092896
sum(is.na(df1$beta))
[1] 0
sum(is.na(df1$varbeta))
[1] 0

summary(df2$beta)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.0890000 -0.0086000 0.0003000 0.0008649 0.0100000 0.2600000
summary(df2$varbeta)
Min. 1st Qu. Median Mean 3rd Qu. Max.
3.600e-05 4.761e-05 7.225e-05 2.250e-04 2.250e-04 4.410e-02
sum(is.na(df2$beta))
[1] 0
sum(is.na(df2$varbeta))
[1] 0

Any help on this would be appreciated.

	res=do.call(susie_rss,
	c(list(z=z, R=LD, max_iter=maxit), susie_args))
