xihaoli / staarpipeline
An R package for performing association analysis of whole-genome/whole-exome sequencing (WGS/WES) studies using STAARpipeline
License: GNU General Public License v3.0
Hi all,
from reading and trying to understand/run the pipeline, the relatedness parameter is set to TRUE during the creation of the null model object, irrespective of whether kins is NULL or a matrix, cf. line 159 in fit_nullmodel.R.
As a direct consequence, STAAR will call STAAR_O_SMMAT (since sparse_kins is set to TRUE, cf. line 81 in fit_nullmodel.R) rather than STAAR_O. From the outside, it would thus appear that a function designed to account for population structure/relatedness is called despite no information about the population structure/relatedness being provided, which is surprising.
Was this intended? Or should the relatedness element of obj_nullmodel be set to FALSE, such that STAAR_O would be called?
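The behaviour described above can be illustrated with a toy dispatcher. This is only a sketch of the question being asked, not the STAAR source: toy_null_model and toy_dispatch are hypothetical stand-ins, and only the element names relatedness and sparse_kins and the two routine names come from the report.

```r
# Hypothetical stand-in for the null-model object. Only the element names
# `relatedness` and `sparse_kins` are taken from the report above.
toy_null_model <- function(kins = NULL) {
  list(relatedness = TRUE,   # set regardless of whether kins is NULL
       sparse_kins = TRUE,
       kins = kins)
}

# Hypothetical dispatcher: returns the name of the routine that would be chosen.
toy_dispatch <- function(obj_nullmodel) {
  if (isTRUE(obj_nullmodel$relatedness)) "STAAR_O_SMMAT" else "STAAR_O"
}

# No kinship information supplied, yet the relatedness branch is taken.
toy_dispatch(toy_null_model(kins = NULL))
```

Under this reading, setting the relatedness element to FALSE whenever kins is NULL would route the call to the other branch.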
I would suggest adding the following lines for the prerequisite packages. These packages seem to need to be installed manually in this order (for example, STAAR depends on SeqArray).
BiocManager::install("SeqArray")
BiocManager::install("SeqVarTools")
devtools::install_github("hanchenphd/GMMAT")
BiocManager::install("GENESIS")
devtools::install_github("xihaoli/STAAR",ref="main")
BiocManager::install("TxDb.Hsapiens.UCSC.hg38.knownGene")
BiocManager::install("GenomicFeatures")
devtools::install_github("zilinli1988/SCANG")
I am also confused about the Intel Math Kernel Library.
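For the package prerequisites listed above, a quick way to see what is still missing before installing STAARpipeline is sketched below. The package names are copied from the install commands in this report; the variable names are only illustrative, and the check says nothing about the Intel Math Kernel Library.

```r
# Package names are taken from the install commands above, in the same order.
prereq <- c("SeqArray", "SeqVarTools", "GMMAT", "GENESIS", "STAAR",
            "TxDb.Hsapiens.UCSC.hg38.knownGene", "GenomicFeatures", "SCANG")

# requireNamespace() returns FALSE (without attaching) when a package is missing.
is_installed <- vapply(prereq, requireNamespace, logical(1), quietly = TRUE)

missing_pkgs <- prereq[!is_installed]
if (length(missing_pkgs) > 0) {
  message("Still to install, in this order: ",
          paste(missing_pkgs, collapse = ", "))
} else {
  message("All prerequisite packages are available.")
}
```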
Hello and first many thanks for generating such a nice and useful pipeline!
I've seen that you recently improved STAAR for imbalanced binary scenarios. Is this already integrated into the STAARpipeline approach?
Many thanks in advance!
Best
Andi
I have a dataset with ~700 samples, eventually to increase to ~1000 but I'm testing out STAARpipeline on what I have so far. I know that is very small for a human GWAS study, but I thought the gene centric and/or sliding window analysis, in combination with the weights from annotations used in STAARpipeline, might bolster the power enough to be worthwhile.
I noticed that my results files from the STAARpipeline_Gene_Centric_Coding.R script just contained a list of NULL values. I went back and stepped through the script manually, and got the following message printed for the first gene:
# of selected samples: 721
# of selected variants: 103
# of selected samples: 721
# of selected variants: 12
# of selected samples: 721
# of selected variants: 0
Error in STAAR(Geno, obj_nullmodel, Anno.Int.PHRED.sub.category, rare_maf_cutoff = rare_maf_cutoff, :
genotype is not a matrix!
# of selected samples: 721
# of selected variants: 0
Error in STAAR(Geno, obj_nullmodel, Anno.Int.PHRED.sub.category, rare_maf_cutoff = rare_maf_cutoff, :
genotype is not a matrix!
# of selected samples: 721
# of selected variants: 3
Error in STAAR(Geno, obj_nullmodel, Anno.Int.PHRED.sub.category, rare_maf_cutoff = rare_maf_cutoff, :
Number of rare variant in the set is less than 2!
# of selected samples: 721
# of selected variants: 9
Error in STAAR(Geno, obj_nullmodel, Anno.Int.PHRED.sub.category, rare_maf_cutoff = rare_maf_cutoff, :
Number of rare variant in the set is less than 2!
# of selected samples: 721
# of selected variants: 0
Error in STAAR(Geno, obj_nullmodel, Anno.Int.PHRED.sub.category, rare_maf_cutoff = rare_maf_cutoff, :
genotype is not a matrix!
# of selected samples: 721
# of selected variants: 3,724,472
If I do str(results):
List of 5
$ plof : NULL
$ plof_ds : NULL
$ missense : NULL
$ disruptive_missense: NULL
$ synonymous : NULL
Is my dataset just too small, or is there some other issue to be fixed? I am running 0.9.6 from the Docker container because I was working on this back in October, but can change to 0.9.7 if you think that would help.
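The log above suggests why every entry ends up NULL: when a call wrapped in try() fails, the assignment inside it never happens. A minimal sketch of that pattern, and of recording the error message per category instead, follows. toy_test and run_category are hypothetical stand-ins for illustration, not STAARpipeline functions; only the error texts are copied from the log.

```r
# Hypothetical stand-in for a set-based test that needs at least 2 rare variants.
toy_test <- function(geno) {
  if (!is.matrix(geno)) stop("genotype is not a matrix!")
  if (ncol(geno) < 2) stop("Number of rare variant in the set is less than 2!")
  0.5  # placeholder p-value
}

# With try(), a failing call leaves the target variable at its previous value.
pvalues <- NULL
try(pvalues <- toy_test(matrix(0, 5, 1)), silent = TRUE)
is.null(pvalues)

# Returns the p-value on success, or the error message on failure,
# so a failed category can be told apart from a silently NULL one.
run_category <- function(geno) {
  tryCatch(list(pvalue = toy_test(geno), error = NULL),
           error = function(e) list(pvalue = NULL, error = conditionMessage(e)))
}

genos <- list(plof = matrix(0, 5, 1),
              missense = matrix(0, 5, 3),
              synonymous = NULL)
results <- lapply(genos, run_category)
str(results)
```

Keeping the message next to each category makes it easier to tell "too few rare variants in this cohort" apart from a genuine processing problem.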
Hi Dr. Li
In determining rare variants in the gene-centric coding pipeline, does STAARpipeline use allele frequency information from population databases such as gnomAD, 1000 Genomes, ESP6500, etc. to determine whether a variant in an exonic region is also rare in all populations? Also, how do you define 'rare' in the other pipelines: rare in the cohort, or rare in the population databases?
Thank you very much
Hi all,
It would be nice if the Dockerfile containing all the STAAR-related packages were added to biocontainers. Any interest?
Hi,
I want to use STAARpipeline to detect rare variation information in plants. I have my own vcf file and its annotation information. How can I generate the variants list to be annotated? I read the FAVORdatabase_chrsplit.csv in the example, and I don't know what it means, so I don't know how to start.
Looking forward to your reply!
Ayn
Hello,
I am trying to extract burden effects for all genes present in the RVAS results, rather than only the significant genes, in order to perform a meta-analysis. Is this possible and, if so, how could it be done?
Regards
Akhil
When I run the pipeline up to the step STAARpipeline/0.2.1Varinfo_gds.R and provide a sex chromosome as the chromosome, it is automatically recognized as chrNA. I would like to know whether this procedure supports calculations for sex chromosomes.
Hello,
I am trying to find out whether this pipeline has been implemented on the Google Cloud Platform, or whether a Docker image is available for running it there, as I have seen for the DNAnexus platform with UK Biobank.
Regards
Akhil
Hi everyone!
Is there any possibility of providing a small tutorial on how to use the docker image from STAAR-pipeline?
Without any directions I'm not sure how to execute the correct steps with it and I'm not sure how to proceed. I'm working on implementing the staar-pipeline for our cluster and the docker image would be the best way to do so.
Thanks
Hello,
I performed an RVAS analysis with STAARpipeline on the TOPMed dataset and got results. When I repeated the analysis on the updated dataset, however, I did not get the same results as before: genes that were reported as significant in the original analysis are either no longer significant or not present in the results files. Can you help me with this?
Thank you so much for your help.
When trying to use the Dynamic_Window_SCANG.r function, I ran into an error that comes from the comparison class(genotype)=="dgCMatrix". Looking deeper into this, I found that the problem is that the genotype object (= Geno) has two classes, "matrix" and "array", so the comparison cannot be done. Matrices have carried both classes since R version 4.0.0 (https://cran.r-project.org/doc/manuals/r-release/NEWS.html).
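The class problem above can be reproduced in base R without any genotype data. The sketch below uses a small dense matrix as a stand-in for Geno and shows inherits() as one way to get a single TRUE/FALSE. This is a suggestion under that assumption, not the fix adopted in the package.

```r
# Stand-in for a dense genotype matrix.
m <- matrix(0, nrow = 2, ncol = 2)

# In R >= 4.0.0 a base matrix reports two classes, so `==` returns two values,
# which is not a valid condition for if().
class(m)
class(m) == "dgCMatrix"

# inherits() always returns a single TRUE/FALSE, whatever the class vector length.
is_sparse <- inherits(m, "dgCMatrix")
if (is_sparse) {
  message("sparse genotype matrix")
} else {
  message("dense genotype matrix")
}
```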
Hello, I have a question about the STAAR code and its underlying publications.
In the code it says:
#' For each noncoding functional category, the conditional STAAR-B p-value is a p-value from an omnibus test
#' that aggregated conditional Burden(1,25) and Burden(1,1),
#' together with conditional p-values of each test weighted by each annotation using Cauchy method.
--> So this seems to me like calculating separate burden tests for each set of weights (annotation weights, allele-frequency weights, etc.) and integrating them afterwards.
But based on your publication, it seems to me that you integrated the beta allele-frequency weighting directly with the functional variant annotation and the variant score to generate, e.g., QBurden.
Which one is correct?
Best
Andi
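To make the first reading in the question above concrete, here is a minimal sketch of combining several p-values with the Cauchy method. It assumes that Burden(a,b) denotes Beta(MAF; a, b) allele-frequency weights; cauchy_combine and the example p-values are hypothetical, and this is not the package's implementation.

```r
# Cauchy combination of p-values with non-negative weights (equal by default).
cauchy_combine <- function(pvals, weights = rep(1, length(pvals))) {
  stopifnot(length(pvals) == length(weights), all(pvals > 0 & pvals < 1))
  w <- weights / sum(weights)
  t_stat <- sum(w * tan((0.5 - pvals) * pi))
  0.5 - atan(t_stat) / pi
}

# Hypothetical p-values from burden tests run under different weighting schemes.
p_by_scheme <- c(burden_1_25 = 0.004, burden_1_1 = 0.03, burden_1_25_anno = 0.001)
p_omnibus <- cauchy_combine(p_by_scheme)

# Beta-density allele-frequency weights for the two schemes named in the comment.
maf <- c(0.001, 0.005, 0.01)
w_1_25 <- dbeta(maf, 1, 25)  # up-weights rarer variants
w_1_1  <- dbeta(maf, 1, 1)   # equal weights (all 1)
```

In this reading, each weighting scheme yields its own test and p-value, and only the p-values are aggregated at the end.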
Hi Dr. Li,
Thank you for the tool. It looks handy for users working through a GWAS analysis. I would like to know whether toy data is provided for newcomers like me.
Looking forward to your reply!
Hi,
I have recently been using your pipeline on Docker to run burden tests with dynamic window analysis.
After generating the aGDS file, I ran step0-step1-step5. I tested with WES data of 6 people (2 trios) on a chr1 region. Here are my original VCF file, the generated aGDS file and the commands:
dynamic_window_test.zip
As I only tested one chromosome, I changed some lines of code:
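#### Number of jobs for each chromosome
library(SeqArray)  # provides seqOpen(), seqGetData(), seqClose()

# agds_dir and QC_label are set earlier in the script.
# One row per chromosome: chromosome, first and last PASS position.
jobs_num <- matrix(rep(0, 3), nrow = 1)
for (chr in 1:1)
{
  print(chr)
  gds.path <- agds_dir[1]  # aGDS file of the single chromosome tested
  genofile <- seqOpen(gds.path)

  # Keep only variants whose QC label is "PASS"
  filter <- seqGetData(genofile, QC_label)
  SNVlist <- filter == "PASS"

  position <- as.numeric(seqGetData(genofile, "position"))
  position_SNV <- position[SNVlist]

  jobs_num[chr, 1] <- chr
  jobs_num[chr, 2] <- min(position_SNV)
  jobs_num[chr, 3] <- max(position_SNV)

  seqClose(genofile)
}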
As for groupid and arrayid in step 0, I directly used groupid = arrayid = scang_num.
Though no errors were reported, the output was NULL. I am not sure whether these changes were correct.
I have no idea which step went wrong. Could you give me some advice?
Thanks!
Hi, I'm interested in your software. I would like to use this software on other species, but there are no directly available annotation files, only vcf files. Can you provide test data in the tutorial? It would be even better to provide a pipeline that starts with the raw vcf.
Thank you.
Hello,
While running the Gene_Centric_Coding function, I noticed a strange issue while processing a list of genes for a specific chromosome.
On any given gene, the function seems to work properly until the internal coding function attempts to run the STAAR function:
try(pvalues <- STAAR(Geno, obj_nullmodel, Anno.Int.PHRED.sub.category,
rare_maf_cutoff = rare_maf_cutoff, rv_num_cutoff = rv_num_cutoff),
silent = silent)
I am receiving the following error, and thus no results from the current gene:
Error in STAAR(Geno, obj_nullmodel, Anno.Int.PHRED.sub.category, rare_maf_cutoff = rare_maf_cutoff, : Dimensions don't match for genotype and annotation!
This error occurs for virtually all genes. Looking into this, it appears the issue is how the annotation data is subset for the final list of variants that are lof in plof:
Anno.Int.PHRED.sub.category <- Anno.Int.PHRED.sub[lof.in.plof, ]
When I run this, lof.in.plof is a vector of NAs, TRUEs, and FALSEs, with the number of TRUEs corresponding to the final filtered number of variants to use (in my case, 5). When the annotation data in Anno.Int.PHRED.sub is subset using this vector, however, the final dimensions of the table still contain the number of rows that correspond to the previous number of variants (which, in my case, was 129).
The Geno matrix has the dimensions [n samples x 5 variants]. When Anno.Int.PHRED.sub.category is passed to the STAAR function, however, its dimensions are still [n samples x 129 variants], causing the error.
If I wrap the which function around lof.in.plof, the dimensions of the resulting table are [n samples x 5] and STAAR is able to run properly and gives no error:
Anno.Int.PHRED.sub.category <- Anno.Int.PHRED.sub[which(lof.in.plof),]
I assume this fix makes sense and there shouldn't be a reason Anno.Int.PHRED.sub.category should still contain rows with NA data? The final dimensions of this annotation table should indeed match that of the genotype matrix, no?
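The subsetting behaviour described in this report is standard R and can be reproduced with a toy table. In the sketch below, anno and keep are made-up stand-ins for Anno.Int.PHRED.sub and lof.in.plof.

```r
# Toy annotation table with 4 variants (values are made up).
anno <- data.frame(variant = paste0("v", 1:4), score = c(10, 20, 30, 40))

# A logical selector containing NA, as in the report.
keep <- c(TRUE, NA, FALSE, TRUE)

nrow(anno[keep, ])         # 3: the NA produces an extra all-NA row
nrow(anno[which(keep), ])  # 2: which() keeps only the TRUE positions

# Equivalent alternative that keeps the selector logical.
nrow(anno[!is.na(keep) & keep, ])  # 2
```

Each NA in a logical index yields one all-NA row, so the row count equals the number of TRUEs plus the number of NAs, which is why the table no longer matches the genotype matrix.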