Comments (16)
I would like to add that my WGS VCFs did not have the AVGDP annotation, so I added that manually by copying the DP values for every variant as follows for every chromosome.
`
#open GDS
gds.path <- paste0(dir_geno,gds_file_name_1,chr,gds_file_name_2)
genofile <- seqOpen(gds.path, readonly = FALSE)
avgdp<- seqGetData(genofile , "annotation/info/DP")
print( length(avgdp)) # for chr22 [1] 730895
seqAddValue(genofile, "annotation/info/AVGDP", avgdp ,"dummy sequencing depth added by samreen")
seqClose(genofile)
`
from staarpipeline-tutorial.
Hi @samreenzafer,
Thanks for reaching out. It's OK that your aGDS files do not have a specific annotation named AVGDP
(average DP), and what you have done makes sense. Generally, we would want to use a set of high-quality genetic variants to compute ancestry PCs and sparse GRM for null model fitting, and AVGDP
is one of the filtering criteria.
For your question, I wonder whether the sample.id
from your phenotype data can be matched with the sample.id
stored in your aGDS file. It is also OK if your phenotype sample ids are a subset of the sample ids in the aGDS files. Could you please double check?
Best,
Xihao
from staarpipeline-tutorial.
Thank you for responding.
for the sampleID question , yes they are overlapping.
phenotype.id <- obj_nullmodel$id_include
genotype.id<- seqGetData(genofile, "sample.id")
print(length(phenotype.id))
[1] 1004
print(length(seqGetData(genofile, "sample.id")))
[1] 1026
length(intersect( phenotype.id,genotype.id))
[1] 1004
from staarpipeline-tutorial.
It seems like your outcome variable is binary, and if so, you should use family=binomial(link="logit")
instead of family=gaussian(link="identity")
and make sure that your outcome is coded as numeric 0/1 instead of a factor variable. Also, while it is true that "groups"
need only be used if your outcome is a continuous variable, you could still consider adding your "group"
variable as a fixed effect term in your logistic mixed model.
Hope this helps.
from staarpipeline-tutorial.
For group, do you mean I should use my "group_race"
variable from the phenotype file for "groups"
I will try the family=binomial(link="logit")
and let you know.
And yes my outcome variable was coded as 1/2 instead of 0/1. I will also change that.
I hadn't seen any instructions on the tutorial page to format the phenotype file, and had just assumed the plink-like notation. That was my bad.
from staarpipeline-tutorial.
I got this error while using the suggested logit
Error in glmmkin(fixed = fixed, data = data, kins = kins, id = id, random.slope = random.slope, :
Error: heteroscedastic linear mixed models are only applicable when "family" is gaussian.
Calls: fit_nullmodel -> glmmkin
Execution halted
obj_nullmodel <- fit_nullmodel(pheno~SNPSEX+PC1+PC2+PC3+PC4+PC5+PC6+PC7+PC8+PC9+PC10+as.factor(group_race), data=phenotype,kins=sgrm,use_sparse=TRUE,kins_cutoff=0.022,id="FID", groups="group_race",family=binomial(link="logit"),verbose=TRUE)
from staarpipeline-tutorial.
Hi @samreenzafer, for binary traits, please leave groups
parameter as NULL
.
from staarpipeline-tutorial.
I'm not sure what I'm doing wrong but I get errors.
obj_nullmodel <- fit_nullmodel(pheno~SNPSEX+PC1+PC2+PC3+PC4+PC5+PC6+PC7+PC8+PC9+PC10+as.factor(group_race), data=phenotype,kins=sgrm,use_sparse=TRUE,kins_cutoff=0.022,id="FID", family=binomial(link="logit"),verbose=TRUE, groups=NULL)
it ran for 3 iterations and then gave the error
Error in h(simpleError(msg, call)) : error in evaluating the argument 'x' in selecting a method for function 'chol2inv': not a positive definite matrix Calls: fit_nullmodel ... chol2inv -> chol -> chol -> .local -> as -> asMethod Execution halted
Then I edited the command to remove the as.factor(group_race)
obj_nullmodel <- fit_nullmodel(pheno~as.factor(SNPSEX)+PC1+PC2+PC3+PC4+PC5+PC6+PC7+PC8+PC9+PC10 ,
data=phenotype,kins=sgrm,use_sparse=TRUE,kins_cutoff=0.022,id="FID",
family=binomial(link="logit"),verbose=TRUE, groups=NULL)
And I still get an error, but after 500 iterations.
[1] "kins is a sparse matrix."
Fixed-effect coefficients:
(Intercept) SNPSEX PC1 PC2 PC3 PC4
16.0408287 -18.3676970 -27.0664801 9.4036154 -5.5903104 3.7215324
PC5 PC6 PC7 PC8 PC9 PC10
6.9741816 20.4901888 -6.8478820 -2.8771294 0.6520779 1.4433908
Iteration 1 :
Variance component estimates:
[1] 1.000000 7.234874
Fixed-effect coefficients:
[1] 17.1054427 -19.5066252 -30.2674194 9.8406604 -6.9357156 4.8385081
[7] 8.1287788 24.0488017 -12.4647225 -3.3129108 -0.1664124 2.1897960
...
...
...
...
Iteration 499 :
Variance component estimates:
[1] 1.000000 1.040721
Fixed-effect coefficients:
[1] 12.9362534 -15.2646269 -27.1436580 9.3835255 -5.6920925 3.8344001
[7] 7.0144809 20.7832368 -7.6375962 -2.8396810 0.5219339 1.5484195
Iteration 500 :
Variance component estimates:
[1] 1.0000000 0.5132395
Fixed-effect coefficients:
[1] 13.9159314 -16.2561242 -27.1981325 9.4333121 -5.6752385 3.8187186
[7] 7.0241264 20.8224936 -7.4864314 -2.8504011 0.5512372 1.5232672
Fixed-effect coefficients:
(Intercept) SNPSEX PC1 PC2 PC3 PC4
16.0408287 -18.3676970 -27.0664801 9.4036154 -5.5903104 3.7215324
PC5 PC6 PC7 PC8 PC9 PC10
6.9741816 20.4901888 -6.8478820 -2.8771294 0.6520779 1.4433908
Iteration 1 :
Error: Not a matrix.
In addition: Warning message:
Average Information REML not converged, refitting model using Brent method...
Execution halted
I don't understand how the program runs, so not sure what is it that is " not a matrix " here.
from staarpipeline-tutorial.
Is it even okay for me to have all 3 of my ethnic group of samples analysed together (the creating of the SGRM, fitting null model and all the variant analyses)? Or should this pipeline be run separately for the 3 ethnic groups?
from staarpipeline-tutorial.
And here is my GRM object
> grm<- get(load("chrall.degree2.sparseGRM.sGRM.RData"))
> str(grm)
Formal class 'dsCMatrix' [package "Matrix"] with 7 slots
..@ i : int [1:1027] 0 0 1 2 3 4 5 6 7 8 ...
..@ p : int [1:1027] 0 1 3 4 5 6 7 8 9 10 ...
..@ Dim : int [1:2] 1026 1026
..@ Dimnames:List of 2
.. ..$ : chr [1:1026] "0_NA19331" "0_NA19334" "0_BKP000674" "0_BKP000684" ...
.. ..$ : chr [1:1026] "0_NA19331" "0_NA19334" "0_BKP000674" "0_BKP000684" ...
..@ x : num [1:1027] 0.5 0.23 0.494 0.499 0.497 ...
..@ uplo : chr "U"
..@ factors : list()
> min(grm@x)
[1] 0.2297566
> max(grm@x)
[1] 0.5668696
Which I had created using the following command.
Rscript ../extdata/runPipeline_wrapper.R --prefix.in chrall_pruned --prefix.in.unfiltered chrall --prefix.out chrall.degree2.sparseGRM --degree 2 --num_threads 8 --no_pcs 20 --KINGformat.out TRUE
outputting
chrall.degree2.sparseGRM.kins chrall.degree2.sparseGRM.score chrall.degree2.sparseGRM.sGRM.RData
from staarpipeline-tutorial.
Hi @xihaoli
I tried a few things, and realized that the "SNPSEX" in the phenotype is causing the problems for me.
I refitted the null object without using SNPSEX as a covariate (also did NOT use group_race)
obj_nullmodel <- fit_nullmodel(pheno~ PC1+PC2+PC3+PC4+PC5+PC6+PC7+PC8+PC9+PC10 ,
data=phenotype,kins=sgrm,use_sparse=TRUE,kins_cutoff=0.022,id="FID",
family=binomial(link="logit"),verbose=TRUE, groups=NULL)
The code ran to completion in only 16 iterations (as opposed to 500 iterations when I used SNPSEX which had given me the "not a matrix" error).
Then I tried one randome Individual_Analysis on arrayid=130
Rscript STAARpipeline_Individual_Analysis_newSampleIDs.noSexnoRace.R 130
which gave the output log
[1] 1004
[1] 1026
[1] "IDs of phenotype: "
[1] "0_NA20821" "0_NA20822" "0_NA20826" "0_NA20827" "0_NA20828" "0_NA20832"
[1] "IDs of genotype: "
[1] "0_NA20821" "0_NA20822" "0_NA20826" "0_NA20827" "0_NA20828" "0_NA20832"
[1] 1004
# of selected samples: 1,004
# of selected variants: 5,000
# of selected samples: 1,026
# of selected variants: 2,941,749
# of selected samples: 1,004
# of selected variants: 5,000
# of selected samples: 1,026
# of selected variants: 2,941,749
# of selected samples: 1,004
# of selected variants: 5,000
# of selected samples: 1,026
# of selected variants: 2,941,749
# of selected samples: 1,004
# of selected variants: 5,000
# of selected samples: 1,026
# of selected variants: 2,941,749
# of selected samples: 1,004
# of selected variants: 5,000
# of selected samples: 1,026
# of selected variants: 2,941,749
# of selected samples: 1,004
# of selected variants: 5,000
# of selected samples: 1,026
# of selected variants: 2,941,749
# of selected samples: 1,004
# of selected variants: 5,000
# of selected samples: 1,026
# of selected variants: 2,941,749
# of selected samples: 1,004
# of selected variants: 5,000
# of selected samples: 1,026
# of selected variants: 2,941,749
# of selected samples: 1,004
# of selected variants: 5,000
# of selected samples: 1,026
# of selected variants: 2,941,749
# of selected samples: 1,004
# of selected variants: 5,000
# of selected samples: 1,026
# of selected variants: 2,941,749
# of selected samples: 1,004
# of selected variants: 5,000
# of selected samples: 1,026
# of selected variants: 2,941,749
# of selected samples: 1,004
# of selected variants: 5,000
# of selected samples: 1,026
# of selected variants: 2,941,749
# of selected samples: 1,004
# of selected variants: 5,000
# of selected samples: 1,026
# of selected variants: 2,941,749
# of selected samples: 1,004
# of selected variants: 5,000
# of selected samples: 1,026
# of selected variants: 2,941,749
# of selected samples: 1,004
# of selected variants: 5,000
# of selected samples: 1,026
# of selected variants: 2,941,749
# of selected samples: 1,004
# of selected variants: 5,000
# of selected samples: 1,026
# of selected variants: 2,941,749
# of selected samples: 1,004
# of selected variants: 5,000
# of selected samples: 1,026
# of selected variants: 2,941,749
# of selected samples: 1,004
# of selected variants: 5,000
# of selected samples: 1,026
# of selected variants: 2,941,749
# of selected samples: 1,004
# of selected variants: 2,290
# of selected samples: 1,026
# of selected variants: 2,941,749
Time difference of 14.57882 secs
and output of
> df<- get(load("analysis/Individual_Analysis.noSexnoRace_130.Rdata"))
> head(df)
CHR POS REF ALT ALT_AF MAF N pvalue pvalue_log10
1 7 20022978 G A 0.08850000 0.08850000 1004 0.1671100 0.77699757
2 7 20023277 AAAAAC A 0.18661972 0.18661972 1004 0.7529397 0.12323983
3 7 20023350 G A 0.08133733 0.08133733 1004 0.7110876 0.14807688
4 7 20026620 T C 0.05700000 0.05700000 1004 0.7886436 0.10311921
5 7 20026886 CT C 0.19687815 0.19687815 1004 0.7687004 0.11424288
6 7 20027095 T C 0.08941059 0.08941059 1004 0.9202903 0.03607516
Score Score_se Est Est_se
1 3.2924486 2.383156 0.57971434 0.4196116
2 -1.4022395 4.454869 -0.07065652 0.2244735
3 -1.1257320 3.039268 -0.12187006 0.3290266
4 0.8054780 3.004703 0.08921759 0.3328116
5 -1.3251899 4.506296 -0.06525875 0.2219118
6 0.3447801 3.445455 0.02904349 0.2902374
which now has numeric Pvalues in the PValue column. I think I can now run for all the entire job array.
Question: Why do you think SNPSEX could be causing a problem?
This is what my phenotype file is coded as
FID SNPSEX pheno group_race PC1 PC2 ...PC10
0_NA20814 1 0 EUR -0.0233194936208413 0.0177841478492174
0_NA20815 1 0 EUR -0.0232876963243618 0.0178482237390411
0_NA20818 2 0 EUR -0.023596640369937 0.0174856248889635
0_NA20819 2 0 EUR -0.0235735234473422 0.016707921936772
0_NA20821 2 0 EUR -0.0233292853163099 0.0167566107200051
0_NA20822 2 0 EUR -0.0233028712187108 0.0174146357365599
0_NA20826 2 0 EUR -0.0235409212779821 0.0179969624535894
0_NA20827 1 0 EUR -0.0234523001514743 0.0176404968800983
0_NA20828 2 0 EUR -0.0233579084370098 0.0173618077100926
0_NA20832 2 0 EUR -0.0232000420926164 0.0169872294492342
where SNPSEX is 1 (for male) or 2 (for female),
pheno is 0 (control) or 1 (case) - I made this change after you corrected my error.
from staarpipeline-tutorial.
Hi @samreenzafer,
Thanks for following up. First of all, your output log of arrayid=130
looks good to me, so you may proceed with running other arrayid
's and summarize/visualize all results.
For your question, yes it is possible that the non-convergence issue is on SNPSEX
due to the collinearity between SNPSEX
and intercept or some PCs. A similar issue is here. Could you please check?
Lastly, even if you end up not including SNPSEX
, you may also consider adding as.factor(group_race)
to the fixed effect formula when fitting the null model.
from staarpipeline-tutorial.
That makes sense. The cases in our cohort are all male, whereas controls have males and females (only for the European cohort). You can see the tables below.
Also we have small number of cases in the Afr and Hispanic cohorts as opposed to the European cohort.
So I think adding as.factor(group_race)
to the fixed effect formula when fitting the null model , and Excluding SNPSEX
makes most sense.
pheno == 0. (i.e. controls)
AFR EUR HISP
Num males 321 213 169
Num females 0 212 0
pheno == 1 (i.e. cases)
AFR EUR HISP
Num males 12 61 16
Num males 0 0 0
from staarpipeline-tutorial.
Hi @samreenzafer,
Thanks for checking. It all makes sense now. Here you may not be able to include SNPSEX
in the null model fitting, because knowing a sample with SNPSEX=2
guarantees that pheno=0
.
Also we have small number of cases in the Afr and Hispanic cohorts as opposed to the European cohort.
So I think adding as.factor(group_race) to the fixed effect formula when fitting the null model , and Excluding SNPSEX makes most sense.
This sounds good.
from staarpipeline-tutorial.
from staarpipeline-tutorial.
Hi Samreen,
This is great to hear. You may close this case for now, and if you have any questions, please feel free to reopen it.
Best,
Xihao
from staarpipeline-tutorial.
Related Issues (20)
- vcf to gds HOT 7
- #SNVs per case/control? HOT 1
- Fitting NULL model for binary outcomes HOT 5
- Error in Gene Centric Analysis HOT 1
- Error in results_plof_genome[, "cMAC"] : subscript out of bounds HOT 2
- Followup Question to Issue #28 HOT 2
- STAARpipeline_Gene_Centric_Noncoding HOT 2
- Dynamic Window dim(X) error HOT 3
- Can't annotate individual variant results HOT 2
- [Suggestion-Implementation] Add information to summary and annotations of results HOT 1
- Conditional analysis - Summary Gene Centric Noncoding not running to completion HOT 6
- Ukbiobank Agds files generation HOT 16
- Plots for gene centric ncRNA regions HOT 5
- FATAL ERROR - Too many first alleles as the major allele (~21.5%). HOT 1
- warning messages in generating the annotated GDS (aGDS) file. HOT 3
- Controls / cases counts inverted when using binary model HOT 7
- kinship matrix HOT 2
- variant set in gene-centric coding/noncoding analysis HOT 2
- in the Step 2: Individual (single-variant) analysis, Error in if (chr == 1) { : argument is of length zero HOT 3
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from staarpipeline-tutorial.