thierrygosselin / grur
grur: an R package tailored for RADseq data imputations
Home Page: https://thierrygosselin.github.io/grur/
Hi Thierry,
I get the following error when running missing_visualization on the tutorial file "example_vcf2dadi_ferchaud_2015.vcf" and on a .vcf of my own:
$ ibm <- grur::missing_visualization(data = "example_vcf2dadi_ferchaud_2015.vcf", strata = "strata.stickleback.tsv")
Folder created:
missing_visualization_20181209@1523
Importing data
Reading VCF...
Generated a filters parameters file: [email protected]
Number of SNPs: 31802
Number of samples: 177
conversion timing: 1 sec
VCF: biallelic SNPs
Cleaning VCF sample names
Synchronizing sample IDs in VCF and strata...
Reads assembly: reference-assisted
Filters parameters file: updated
Number of chromosome/contig/scaffold: 196
Number of locus: 17095
Number of markers: 31802
Number of individuals: 177
Working time: 1 sec
Deprecated function, update your code to use: filter_monomorphic
Scanning for monomorphic markers...
Number of markers before/blacklisted/after: 31802/0/31802
Tidy genomic data:
Number of markers: 31802
Number of chromosome/contig/scaffold: 196
Number of individuals: 177
Number of populations: 8
Informations:
Number of populations: 8
Number of individuals: 177
Number of ind/pop:
HAD = 21
HAL = 19
KIB = 17
KRO = 20
MOS = 20
MAR = 20
NOR = 20
ODD = 40
Number of duplicate id: 0
Number of chrom/scaffolds: 196
Number of locus: 17095
Number of SNPs: 31802
Proportion of missing genotypes (overall): 0.022169
Identity-by-missingness (IBM) analysis using
Principal Coordinate Analysis (PCoA)...
Generating Identity by missingness plot
Redundancy analysis...
Error in UseMethod("ungroup") :
no applicable method for 'ungroup' applied to an object of class "character"
In addition: Warning message:
Trying to compute distinct() for variables not found in the data:
strata.select
version
_
platform x86_64-apple-darwin15.6.0
arch x86_64
os darwin15.6.0
system x86_64, darwin15.6.0
status
major 3
minor 5.1
year 2018
month 07
day 02
svn rev 74947
language R
version.string R version 3.5.1 (2018-07-02)
nickname Feather Spray
Hi Dr. Gosselin,
I used the radiator package to read in my vcf
vcf <- read_vcf(data = "populations.snps.vcf", strata = "Whitefish.strata.tsv", parallel.core = 1L)
I then attempted to use the grur package to perform the identity by missingness analysis and got an error.
ibm.whitefish<-missing_visualization(
vcf,
strata = "Whitefish.strata.tsv",
distance.method = "euclidean",
ind.missing.geno.threshold = c(2, 5, 10, 20, 30, 40, 50, 60, 70, 80, 90),
filename = NULL,
parallel.core = 1L,
write.plot = TRUE
)
Here is the output
Default "..." arguments assigned in missing_visualization:
path.folder = NULL
Folder created: missing_visualization_20220713@1009
File written: [email protected]
Importing data
File written: individuals qc info and stats summary
File written: individuals qc plot
Informations:
Number of populations: 33
Number of individuals: 545
Number of ind/pop:
CclPEND21C = 15
PabBERL20C = 16
PgeBERL19C = 32
PspBERL20C = 16
PwiBGLR2021C = 26
PwiBHNR20C = 14
PwiBIGW0321C = 47
PwiBOIS20C = 18
PwiBRUN21C = 15
PwiCLFR21C = 16
PwiCLWR20C = 16
PwiEFSW14C = 16
PwiGALL20C = 16
PwiHFRK21C = 15
PwiKOTN21C = 14
PwiKOTN21C_1 = 15
PwiLKGC20C = 15
PwiLOGN20C = 16
PwiLSTW20C = 16
PwiMDSN20C = 16
PwiMFWR20C = 16
PwiNFBR21C = 12
PwiNFSR21C = 16
PwiRFLC20C = 16
PwiSFBR21C = 5
PwiSFSN20C = 15
PwiSQPR20C = 11
PwiSTMP20C = 16
PwiTETR20C = 16
PwiTRUC09C = 13
PwiWBSL20C = 16
PwiWLWL20C = 15
PwoYLOR21C = 8
Number of duplicate id: 0
Number of chrom/scaffolds: 1
Number of locus: 182
Number of SNPs: 350
Proportion of missing genotypes (overall): 0.144619
Identity-by-missingness (IBM) analysis using
Principal Coordinate Analysis (PCoA)...
fstcore package v0.9.12
(OpenMP detected, using 12 threads)
Generating Identity by missingness plot
Redundancy analysis...
Error in eigen(delta1) : infinite or missing values in 'x'
In addition: Warning message:
package ‘fstcore’ was built under R version 4.1.3
Computation time, overall: 5 sec
I saw in another error report that someone got the same error when running missing_visualization on a genlight object, and you suggested it was because one of the pop groups had a size of 1. That is not the case here: I have 33 populations (strata) and the smallest has 5 individuals.
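For readers who want to rule out the small-group explanation on their own data, here is a minimal sketch in base R that tabulates individuals per stratum and flags strata below a chosen size. The data frame, its column names (INDIVIDUALS, STRATA) and the threshold are assumptions for illustration only; they are not taken from the attached strata file.

```r
# Hypothetical strata table. In practice it could be read with something like
# read.delim("Whitefish.strata.tsv"); the column names used here are assumed.
strata <- data.frame(
  INDIVIDUALS = paste0("ind", 1:8),
  STRATA = c("A", "A", "A", "B", "B", "B", "B", "C"),
  stringsAsFactors = FALSE
)

# Number of individuals per stratum
n_per_stratum <- table(strata$STRATA)

# Strata with fewer than `min_n` individuals (the threshold is an assumption)
min_n <- 2
small_strata <- names(n_per_stratum)[n_per_stratum < min_n]

print(n_per_stratum)
print(small_strata)
```

An empty `small_strata` only rules out this one explanation; it does not show what else triggers the eigen() error.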
Here are the results from running session_info()
devtools::session_info()
Session info ----------------------------------------------------------------------------------------
setting value
version R version 4.1.0 (2021-05-18)
os Windows 10 x64 (build 19044)
system x86_64, mingw32
ui RStudio
language (EN)
collate English_United States.1252
ctype English_United States.1252
tz America/Denver
date 2022-07-13
rstudio 2022.02.3+492 Prairie Trillium (desktop)
pandoc NA
Packages --------------------------------------------------------------------------------------------
package * version date (UTC) lib source
abind 1.4-5 2016-07-21 [1] CRAN (R 4.1.0)
ade4 1.7-19 2022-04-19 [1] CRAN (R 4.1.3)
adegenet 2.1.7 2022-06-06 [1] CRAN (R 4.1.3)
adegraphics 1.0-16 2021-09-16 [1] CRAN (R 4.1.3)
adephylo 1.1-11 2017-12-18 [1] CRAN (R 4.1.3)
adespatial 0.3-16 2022-03-31 [1] CRAN (R 4.1.3)
ape 5.6-2 2022-03-02 [1] CRAN (R 4.1.3)
assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.1.1)
backports 1.4.1 2021-12-13 [1] CRAN (R 4.1.2)
BiocGenerics 0.40.0 2021-10-26 [1] Bioconductor
Biostrings 2.62.0 2021-10-26 [1] Bioconductor
bit 4.0.4 2020-08-04 [1] CRAN (R 4.1.2)
bit64 4.0.5 2020-08-30 [1] CRAN (R 4.1.2)
bitops 1.0-7 2021-04-24 [1] CRAN (R 4.1.1)
boot 1.3-28 2021-05-03 [2] CRAN (R 4.1.0)
broom 1.0.0 2022-07-01 [1] CRAN (R 4.1.3)
cachem 1.0.6 2021-08-19 [1] CRAN (R 4.1.2)
callr 3.7.1 2022-07-13 [1] CRAN (R 4.1.0)
car 3.1-0 2022-06-15 [1] CRAN (R 4.1.3)
carData 3.0-5 2022-01-06 [1] CRAN (R 4.1.2)
class 7.3-19 2021-05-03 [2] CRAN (R 4.1.0)
classInt 0.4-7 2022-06-10 [1] CRAN (R 4.1.3)
cli 3.3.0 2022-04-25 [1] CRAN (R 4.1.3)
cluster 2.1.2 2021-04-17 [2] CRAN (R 4.1.0)
codetools 0.2-18 2020-11-04 [2] CRAN (R 4.1.0)
colorspace 2.0-3 2022-02-21 [1] CRAN (R 4.1.3)
cowplot 1.1.1 2020-12-30 [1] CRAN (R 4.1.0)
crayon 1.5.1 2022-03-26 [1] CRAN (R 4.1.3)
data.table 1.14.2 2021-09-27 [1] CRAN (R 4.1.2)
DBI 1.1.3 2022-06-18 [1] CRAN (R 4.1.3)
deldir 1.0-6 2021-10-23 [1] CRAN (R 4.1.1)
devtools 2.4.3 2021-11-30 [1] CRAN (R 4.1.2)
digest 0.6.29 2021-12-01 [1] CRAN (R 4.1.3)
dplyr 1.0.9 2022-04-28 [1] CRAN (R 4.1.3)
e1071 1.7-11 2022-06-07 [1] CRAN (R 4.1.3)
EFGLmh 0.1.0 2021-06-14 [1] Github (delomast/EFGLmh@dbb9612)
ellipsis 0.3.2 2021-04-29 [1] CRAN (R 4.1.0)
fansi 1.0.3 2022-03-24 [1] CRAN (R 4.1.3)
farver 2.1.1 2022-07-06 [1] CRAN (R 4.1.3)
fastmap 1.1.0 2021-01-25 [1] CRAN (R 4.1.0)
fs 1.5.2 2021-12-08 [1] CRAN (R 4.1.2)
fst 0.9.8 2022-02-08 [1] CRAN (R 4.1.3)
fstcore * 0.9.12 2022-03-23 [1] CRAN (R 4.1.3)
gdsfmt 1.30.0 2021-10-26 [1] Bioconductor
generics 0.1.3 2022-07-05 [1] CRAN (R 4.1.3)
GenomeInfoDb 1.30.1 2022-01-30 [1] Bioconductor
GenomeInfoDbData 1.2.7 2022-02-14 [1] Bioconductor
GenomicRanges 1.46.1 2021-11-18 [1] Bioconductor
ggplot2 3.3.6 2022-05-03 [1] CRAN (R 4.1.3)
ggpubr 0.4.0 2020-06-27 [1] CRAN (R 4.1.0)
ggsignif 0.6.3 2021-09-09 [1] CRAN (R 4.1.2)
glue 1.6.2 2022-02-24 [1] CRAN (R 4.1.3)
gridExtra 2.3 2017-09-09 [1] CRAN (R 4.1.2)
grur * 0.1.4 2022-07-12 [1] Github (d31c423)
gtable 0.3.0 2019-03-25 [1] CRAN (R 4.1.0)
hms 1.1.1 2021-09-26 [1] CRAN (R 4.1.2)
htmltools 0.5.2 2021-08-25 [1] CRAN (R 4.1.2)
httpuv 1.6.5 2022-01-05 [1] CRAN (R 4.1.2)
httr 1.4.3 2022-05-04 [1] CRAN (R 4.1.3)
igraph 1.3.2 2022-06-13 [1] CRAN (R 4.1.3)
interp 1.1-2 2022-05-10 [1] CRAN (R 4.1.3)
IRanges 2.28.0 2021-10-26 [1] Bioconductor
jpeg 0.1-9 2021-07-24 [1] CRAN (R 4.1.1)
KernSmooth 2.23-20 2021-05-03 [2] CRAN (R 4.1.0)
labeling 0.4.2 2020-10-20 [1] CRAN (R 4.1.0)
later 1.3.0 2021-08-18 [1] CRAN (R 4.1.2)
lattice 0.20-44 2021-05-02 [2] CRAN (R 4.1.0)
latticeExtra 0.6-30 2022-07-04 [1] CRAN (R 4.1.3)
lazyeval 0.2.2 2019-03-15 [1] CRAN (R 4.1.3)
lifecycle 1.0.1 2021-09-24 [1] CRAN (R 4.1.2)
magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.1.3)
MASS 7.3-54 2021-05-03 [2] CRAN (R 4.1.0)
Matrix 1.3-3 2021-05-04 [2] CRAN (R 4.1.0)
memoise 2.0.1 2021-11-26 [1] CRAN (R 4.1.2)
mgcv 1.8-35 2021-04-18 [2] CRAN (R 4.1.0)
mime 0.12 2021-09-28 [1] CRAN (R 4.1.1)
munsell 0.5.0 2018-06-12 [1] CRAN (R 4.1.0)
nlme 3.1-152 2021-02-04 [2] CRAN (R 4.1.0)
permute 0.9-7 2022-01-27 [1] CRAN (R 4.1.2)
phylobase 0.8.10 2020-03-01 [1] CRAN (R 4.1.3)
pillar 1.7.0 2022-02-01 [1] CRAN (R 4.1.2)
pkgbuild 1.3.1 2021-12-20 [1] CRAN (R 4.1.2)
pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.1.0)
pkgload 1.3.0 2022-06-27 [1] CRAN (R 4.1.3)
plyr 1.8.7 2022-03-24 [1] CRAN (R 4.1.3)
png 0.1-7 2013-12-03 [1] CRAN (R 4.1.1)
prettyunits 1.1.1 2020-01-24 [1] CRAN (R 4.1.0)
processx 3.7.0 2022-07-07 [1] CRAN (R 4.1.3)
progress 1.2.2 2019-05-16 [1] CRAN (R 4.1.0)
promises 1.2.0.1 2021-02-11 [1] CRAN (R 4.1.0)
proxy 0.4-27 2022-06-09 [1] CRAN (R 4.1.3)
ps 1.7.1 2022-06-18 [1] CRAN (R 4.1.3)
purrr 0.3.4 2020-04-17 [1] CRAN (R 4.1.0)
R6 2.5.1 2021-08-19 [1] CRAN (R 4.1.2)
radiator * 1.2.2 2022-07-12 [1] Github (thierrygosselin/radiator@6efdf14)
raster 3.5-21 2022-06-27 [1] CRAN (R 4.1.3)
RColorBrewer 1.1-3 2022-04-03 [1] CRAN (R 4.1.3)
Rcpp 1.0.8.3 2022-03-17 [1] CRAN (R 4.1.3)
RCurl 1.98-1.7 2022-06-09 [1] CRAN (R 4.1.3)
readr 2.1.2 2022-01-30 [1] CRAN (R 4.1.2)
remotes 2.4.2 2021-11-30 [1] CRAN (R 4.1.2)
reshape2 1.4.4 2020-04-09 [1] CRAN (R 4.1.2)
rlang 1.0.3 2022-06-27 [1] CRAN (R 4.1.3)
rncl 0.8.6 2022-03-18 [1] CRAN (R 4.1.3)
RNeXML 2.4.7 2022-05-13 [1] CRAN (R 4.1.3)
rstatix 0.7.0 2021-02-13 [1] CRAN (R 4.1.0)
rstudioapi 0.13 2020-11-12 [1] CRAN (R 4.1.1)
s2 1.0.7 2021-09-28 [1] CRAN (R 4.1.2)
S4Vectors 0.32.4 2022-04-03 [1] Bioconductor
scales 1.2.0 2022-04-13 [1] CRAN (R 4.1.3)
SeqArray 1.34.0 2021-10-26 [1] Bioconductor
seqinr 4.2-16 2022-05-19 [1] CRAN (R 4.1.3)
sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.1.2)
sf 1.0-7 2022-03-07 [1] CRAN (R 4.1.3)
shiny 1.7.1 2021-10-02 [1] CRAN (R 4.1.2)
sp 1.5-0 2022-06-05 [1] CRAN (R 4.1.3)
spData 2.0.1 2021-10-14 [1] CRAN (R 4.1.2)
spdep 1.2-4 2022-04-18 [1] CRAN (R 4.1.3)
stringi 1.7.6 2021-11-29 [1] CRAN (R 4.1.2)
stringr 1.4.0 2019-02-10 [1] CRAN (R 4.1.0)
terra 1.5-34 2022-06-09 [1] CRAN (R 4.1.3)
tibble 3.1.7 2022-05-03 [1] CRAN (R 4.1.3)
tidyr 1.2.0 2022-02-01 [1] CRAN (R 4.1.3)
tidyselect 1.1.2 2022-02-21 [1] CRAN (R 4.1.3)
tzdb 0.3.0 2022-03-28 [1] CRAN (R 4.1.3)
units 0.8-0 2022-02-05 [1] CRAN (R 4.1.2)
UpSetR 1.4.0 2019-05-22 [1] CRAN (R 4.1.3)
usethis 2.1.6 2022-05-25 [1] CRAN (R 4.1.3)
utf8 1.2.2 2021-07-24 [1] CRAN (R 4.1.3)
uuid 1.1-0 2022-04-19 [1] CRAN (R 4.1.3)
vctrs 0.4.1 2022-04-13 [1] CRAN (R 4.1.3)
vegan 2.6-2 2022-04-17 [1] CRAN (R 4.1.3)
vroom 1.5.7 2021-11-30 [1] CRAN (R 4.1.2)
withr 2.5.0 2022-03-03 [1] CRAN (R 4.1.3)
wk 0.6.0 2022-01-03 [1] CRAN (R 4.1.2)
XML 3.99-0.10 2022-06-09 [1] CRAN (R 4.1.3)
xml2 1.3.3 2021-11-30 [1] CRAN (R 4.1.2)
xtable 1.8-4 2019-04-21 [1] CRAN (R 4.1.0)
XVector 0.34.0 2021-10-26 [1] Bioconductor
zlibbioc 1.40.0 2021-10-26 [1] Bioconductor
I've attached my strata file and the first 100 lines of my vcf file as txt files
trunc.populations.snps.vcf.txt
Whitefish.strata.tsv.txt
Thanks for any help,
Kat
I tried to use missing_visualization to get a summary of my genomic data.
However, I cannot produce the plots and tables, such as the heatmap, the missing summary table, and the Manhattan and violin plots.
Here is my code.
library(grur)
missing_visualization(data = "populations.snps.vcf", strata = "popmap.tsv", parallel.core = 1)
And here is my error message
Analysing percentage missing ...
Error: Column positions must be scalar
Call rlang::last_error() to see a backtrace
Computation time, overall: 24 sec
############################ missing_visualization #############################
rlang::last_error()
<error>
message: Column positions must be scalar
class: rlang_error
backtrace:
%<>%
(...) ] with 7 more calls
rlang::last_trace() to see the full backtrace
How can I solve this problem?
Is there a dependency that I need to install?
Attached is the file I used.
Hi Dr. Gosselin,
I downloaded sticklebacks_Danish.vcf from https://datadryad.org/stash/dataset/doi:10.5061%2Fdryad.kp11q and the strata.stickleback.tsv from https://www.dropbox.com/s/ely3wp4j4tulkrc/strata.stickleback.tsv?dl=0. I executed the following code
library("grur")
ibm <- grur::missing_visualization(
data = "sticklebacks_Danish.vcf",
strata = "strata.stickleback.tsv", parallel.core = 1L)
And I get the following error
Analysing percentage missing ...
Error in purrr::map():
i In index: 1.
Caused by error in env_get():
! object 'term' not found
Run rlang::last_error() to see where the error occurred.
Warning message:
package ‘fstcore’ was built under R version 4.1.3
rlang::last_error()
Backtrace:
I tried again with a dataset of mine from Stacks population output and a popmap turned into a strata file and I got the same result.
The entire output I got is below
################################################################################
######################## grur::missing_visualization ###########################
################################################################################
Execution date/time: 20230106@1055
::grurmissing_visualization function call arguments:
data = sticklebacks_Danish.vcf
strata = strata.stickleback.tsv
strata.select = POP_ID
distance.method = euclidean
ind.missing.geno.threshold = 2, 5, 10, 20, 30, 40, 50, 60, 70, 80, 90
filename = NULL
parallel.core = 1
write.plot = TRUE
Default "..." arguments assigned in ::grurmissing_visualization:
path.folder = NULL
Folder created: missing_visualization_20230106@1055
File written: [email protected]
Importing data
Found more than one class "Annotated" in cache; using the first, from namespace 'RNeXML'
Also defined by ‘S4Vectors’
Found more than one class "Annotated" in cache; using the first, from namespace 'RNeXML'
Also defined by ‘S4Vectors’
Found more than one class "Annotated" in cache; using the first, from namespace 'RNeXML'
Also defined by ‘S4Vectors’
Found more than one class "Annotated" in cache; using the first, from namespace 'RNeXML'
Also defined by ‘S4Vectors’
Found more than one class "Annotated" in cache; using the first, from namespace 'RNeXML'
Also defined by ‘S4Vectors’
Found more than one class "Annotated" in cache; using the first, from namespace 'RNeXML'
Also defined by ‘S4Vectors’
Reading VCF...
Data summary:
number of samples: 177
number of markers: 31802
Filter monomorphic markers
Number of individuals / strata / chrom / locus / SNP:
Blacklisted: 0 / 0 / 0 / 0 / 0
Filter common markers:
Number of individuals / strata / chrom / locus / SNP:
Blacklisted: 0 / 0 / 0 / 0 / 0
Number of chromosome/contig/scaffold: 196
Number of locus: 17095
Number of markers: 31802
Number of strata: 8
Number of individuals: 177
Number of ind/strata:
HAD = 21
HAL = 19
KIB = 17
KRO = 20
MOS = 20
MAR = 20
NOR = 20
ODD = 40
Number of duplicate id: 0
radiator Genomic Data Structure (GDS) file: [email protected]
File written: individuals qc info and stats summary
File written: individuals qc plot
Informations:
Number of populations: 8
Number of individuals: 177
Number of ind/pop:
HAD = 21
HAL = 19
KIB = 17
KRO = 20
MOS = 20
MAR = 20
NOR = 20
ODD = 40
Number of duplicate id: 0
Number of chrom/scaffolds: 196
Number of locus: 17095
Number of SNPs: 31802
Proportion of missing genotypes (overall): 0.022169
Identity-by-missingness (IBM) analysis using
Principal Coordinate Analysis (PCoA)...
fstcore package v0.9.12
(OpenMP detected, using 12 threads)
Generating Identity by missingness plot
Redundancy analysis...
Redundancy Analysis using strata: POP_ID
RDA model formula: data.pcoa$vectors ~ POP_ID
Permutation test for Redundancy Analysis using strata: POP_ID
Hypothesis based on the strata provided
Null Hypothesis (H0): No pattern of missingness in the data between strata
Alternative Hypothesis (H1): Presence of pattern(s) of missingness in the data between strata
STRATA VARIANCE P_VALUE
1 POP_ID 0.00597 0.000999
note: low p-value -> reject the null hypothesis
Analysing percentage missing ...
Error in purrr::map():
i In index: 1.
Caused by error in env_get():
! object 'term' not found
Run rlang::last_error() to see where the error occurred.
Warning message:
package ‘fstcore’ was built under R version 4.1.3
Computation time, overall: 46 sec
############################ missing_visualization #############################
Hi Thierry,
I'm trying to impute a set of ddRAD SNPs (not a huge set, see details in the output), but it fails, giving the following output:
GBS_data$imputed.data <- grur_imputations(data = GBS_data$tidy.data, parallel.core=16)
###############################################################################
########################### grur::grur_imputations ############################
###############################################################################
Imputation method: rf
Hierarchical levels: populations
On-the-fly-imputations options:
number of trees to grow: 50
minimum terminal node size: 1
non-negative integer value used to specify random splitting: 10
number of iterations: 10
Number of CPUs: 16
Note: If you have speed issues: follow grur's vignette on parallel computing
Number of populations: 4
Number of individuals: 95
Number of markers: 355251
Proportion of missing genotypes before imputations: 0.44623
Scanning dataset for population(s) with monomorphic marker(s)...
Simple strawman imputations conducted on 605994 markers/pops combo
On-the-fly-imputations using Random Forests algorith
Imputations computed by populations, take a break...
Error: segfault from C stack overflow
The data was imported using genomic_converter() from the radiator package (I initially tried to do the imputation on-the-fly during the import, but got the same segfault error, so I separated the two processes).
I'm using R 3.4.0 on Ubuntu (see the complete sessionInfo() output below).
> sessionInfo()
R version 3.4.0 (2017-04-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.10
Matrix products: default
BLAS: /home/ibar/.Renv/versions/3.4.0/lib/R/lib/libRblas.so
LAPACK: /home/ibar/.Renv/versions/3.4.0/lib/R/lib/libRlapack.so
locale:
[1] LC_CTYPE=en_AU.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_AU.UTF-8 LC_COLLATE=en_AU.UTF-8
[5] LC_MONETARY=en_AU.UTF-8 LC_MESSAGES=en_AU.UTF-8
[7] LC_PAPER=en_AU.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] bindrcpp_0.2 grur_0.0.6
loaded via a namespace (and not attached):
[1] amap_0.8-14 colorspace_1.3-2 seqinr_3.4-5
[4] deldir_0.1-14 htmlTable_1.9 base64enc_0.1-3
[7] listenv_0.6.0 gsl_1.9-10.3 DT_0.2
[10] mvtnorm_1.0-6 ranger_0.8.0 codetools_0.2-15
[13] splines_3.4.0 knitr_1.17 pegas_0.10
[16] polyclip_1.6-1 ade4_1.7-8 Formula_1.2-2
[19] swfscMisc_1.2 cluster_2.0.6 apex_1.0.2
[22] stabledist_0.7-1 copula_0.999-18 shiny_1.0.5
[25] readr_1.1.1 compiler_3.4.0 randomForestSRC_2.5.0
[28] strataG_2.0.2 backports_1.1.0 assertthat_0.2.0
[31] Matrix_1.2-9 lazyeval_0.2.0 acepack_1.4.1
[34] htmltools_0.3.6 tools_3.4.0 igraph_1.1.2
[37] coda_0.19-1 gtable_0.2.0 glue_1.1.1
[40] reshape2_1.4.2 dplyr_0.7.2 maps_3.2.0
[43] gmodels_2.16.2 spatstat_1.52-1 fastmatch_1.1-0
[46] Rcpp_0.12.12 RJSONIO_1.3-0 spdep_0.6-15
[49] gdata_2.18.0 ape_4.1 nlme_3.1-131
[52] pinfsc50_1.1.0 stringr_1.2.0 globals_0.10.2
[55] mapdata_2.2-6 mime_0.5 phangorn_2.2.0
[58] gtools_3.5.0 goftest_1.1-1 stringdist_0.9.4.6
[61] future_1.6.0 SNPRelate_1.11.2 radiator_0.0.4
[64] LearnBayes_2.15 MASS_7.3-47 scales_0.5.0
[67] spatstat.utils_1.7-1 hms_0.3 gdsfmt_1.12.0
[70] parallel_3.4.0 expm_0.999-2 RColorBrewer_1.1-2
[73] gridExtra_2.2.1 ggplot2_2.2.1 purrrlyr_0.0.2
[76] UpSetR_1.3.3 rpart_4.1-11 latticeExtra_0.6-28
[79] stringi_1.1.5 pcaPP_1.9-72 checkmate_1.8.3
[82] permute_0.9-4 boot_1.3-19 rlang_0.1.2
[85] pkgconfig_2.0.1 lattice_0.20-35 tensor_1.5
[88] purrr_0.2.3 bindr_0.1 htmlwidgets_0.9
[91] tidyselect_0.2.0 plyr_1.8.4 magrittr_1.5
[94] R6_2.2.2 Hmisc_4.0-3 ADGofTest_0.3
[97] foreign_0.8-67 mgcv_1.8-17 abind_1.4-5
[100] survival_2.41-3 sp_1.2-5 nnet_7.3-12
[103] pspline_1.0-18 tibble_1.3.4 shinyFiles_0.6.2
[106] xgboost_0.6-4 rmetasim_3.0.5 vcfR_1.5.0
[109] adegenet_2.0.1 grid_3.4.0 data.table_1.10.4
[112] vegan_2.4-4 digest_0.6.12 pbmcapply_1.2.4
[115] xtable_1.8-2 numDeriv_2016.8-1 tidyr_0.7.1
[118] httpuv_1.3.5 stats4_3.4.0 munsell_0.4.3
[121] fst_0.7.2 viridisLite_0.2.0 quadprog_1.5-5
BTW, both grur and radiator import a plethora of dependencies, which a) makes the installation more complicated, especially for grur, and b) actually exceeds the default DLL limit (100), so it requires modifying the R_MAX_NUM_DLLS variable (only possible in R >= 3.4).
Thanks, Ido
I just installed this package today (Ubuntu v16.04, R v3.4.1), and the install seemed to go successfully. When I ran this code:
library(grur)
ibm <- missing_visualization(data = "../Inputs/OL-c85-t88-Breps-m50x68-maf025-u.vcf", strata = "../Making_Files/OL-c85-t88-Breps.pop")
I received this message:
Folder created:
missing_visualization_20180313@1706
Importing data
VCF is biallelic
Error in dyn.load(file, DLLpath = DLLpath, ...) : unable to load shared object '/home/ksilliman/R/x86_64-pc-linux-gnu-library/3.4/tidyselect/libs/tidyselect.so: maximal number of DLLs reached...
Thanks!
Hi Thierry,
I have filtered a .vcf dataset to create a version with low missingness of sites and samples. I then ran the following line to impute genotypes and output .rad and .tped files, with and without rf imputation:
test2 <- genomic_converter(data = "UMBELLA_Erumb1_samples_gt_90pct_snps_gt_40pct_covered.recode.vcf",output="plink",filename = "test_2",strata="strata_eu_40.tsv", imputation.method = "rf")
The run finishes with the following results section:
############################### RESULTS ###############################
Data format of input: vcf.file
Biallelic data
Number of common markers: 156
Number of chromosome/contig/scaffold: 113
Number of individuals 551
Number of populations 56
I then checked the imputation accuracy, but all 156 markers were dropped, evidently for not being shared:
> imp_acc <- imputations_accuracy("test_2.rad","test_2_imputed.rad")
Data provided still contains missing genotypes,
accuracy will be mesured on common non-missing genotypes
Removing 156 markers not in common between datasets
$Information dropped from the analysis
[1] "Erumb1_s01095803__85__621" "Erumb1_s00047668__13__4936"
[3] "Erumb1_s00648045__68__686" "Erumb1_s01308175__88__428"
[5] "Erumb1_s02281967__111__51" "Erumb1_s02289709__112__65" .....
Misclassification Error: by populations
A tibble: 0 x 2
... with 2 variables: POP_ID , ME
Misclassification Error: by individuals
A tibble: 0 x 2
... with 2 variables: INDIVIDUALS , ME
Misclassification Error: by markers
A tibble: 0 x 2
... with 2 variables: MARKERS , ME
Misclassification Error: overall
[1] NaN
I think this is a bug. Do you? If you want, I can provide you with the datasets. Just let me know.
Best,
Peter
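One way to see which markers two data sets actually share is to compare their marker identifiers directly. The following is a minimal base R sketch under stated assumptions: the two character vectors are hypothetical stand-ins for the marker ID columns of the non-imputed and imputed data, and it does not reproduce what imputations_accuracy does internally.

```r
# Hypothetical marker IDs standing in for the marker column of each data set
markers_a <- c("locus1__10", "locus2__25", "locus3__40")
markers_b <- c("locus1__10", "locus2__25", "locus4__7")

# Markers present in both data sets
common <- intersect(markers_a, markers_b)

# Markers found in only one of the two data sets
only_a <- setdiff(markers_a, markers_b)
only_b <- setdiff(markers_b, markers_a)

print(common)
print(only_a)
print(only_b)
```

If `common` comes back empty on real data, printing a few IDs from each side shows whether the identifiers differ in formatting or in content.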
Hi Thierry,
I'm running missing_visualization on these data sets:
dat1
/// GENIND OBJECT /////////
// 264 individuals; 12,091 loci; 24,182 alleles; size: 30 Mb
// Basic content
@tab: 264 x 24182 matrix of allele counts
@loc.n.all: number of alleles per locus (range: 2-2)
@loc.fac: locus factor for the 24182 columns of @tab
@all.names: list of allele names for each locus
@ploidy: ploidy of each individual (range: 2-2)
@type: codom
@call: .local(x = x, i = i, j = j, loc = ..1, drop = drop)
// Optional content
@pop: population of each individual (group size range: 21-31)
head(strata)
INDIVIDUALS STRATA library
1 GM1 GM Pw1
2 GM26 GM Pw1
3 GM40 GM Pw1
4 GM31 GM Pw1
5 GM15 GM Pw1
6 GM24 GM Pw1
My strata file has multiple "STRATA" (populations) and libraries (>10 for each).
Here is my call & output:
miss.dat1 <- missing_visualization(dat1, strata=strata)
#######################################################################
#################### grur::missing_visualization ######################
#######################################################################
Folder created:
missing_visualization_20180424@1454
Importing data
Alleles names for each markers will be converted to factors and padded with 0
Scanning for monomorphic markers...
Number of markers before = 12091
Number of monomorphic markers removed = 0
Tidy genomic data:
Number of markers: 12091
Number of chromosome/contig/scaffold: no chromosome info
Number of individuals: 264
Number of populations: 1
Informations:
Number of populations: 1
Number of individuals: 264
Number of ind/pop:
NA
Number of duplicate id: 0
Number of SNPs: 12091
Proportion of missing genotypes (overall): 0.298188
Identity-by-missingness (IBM) analysis using
Principal Coordinate Analysis (PCoA)...
Generating Identity by missingness plot
Error in seq.default(h[1], h[2], length.out = n) :
'to' must be a finite number
In addition: There were 42 warnings (use warnings() to see them)
Any ideas what might be causing the seq.default error?
Thanks!
Brenna
Hello, I created a filtered vcf using radiator::filter_rad(), and would like to examine missingness using grur::missing_visualization.
However, when I ran the following command:
ibm = grur::missing_visualization(data='./filter_rad_20230531@1305/13_filtered/[email protected]', strata='strata_pb.txt')
I got the following error:
Error: 'generate_id_stats' is not an exported object from 'namespace:radiator'
Is there a simple fix that I am missing?
Bug to fix: rf_pred option
Hi Thierry,
I am opening this issue here for grur because I wanted to document that I am getting the same error message as in the currently open issue in radiator. When I do:
miss_eu <- grur::missing_visualization(
data = "UMBELLA_Erumb1_samples_gt_50pct_covered.recode.vcf.gz",
strata = "strata_eu.tsv",
strata.select = c("POP_ID", "FLOWCELL", "variety2", "MACHINE", "year"),
filename = "UMBELLA_erumb1_gt_90pct.RData")
I get:
#######################################################################
#################### grur::missing_visualization ######################
#######################################################################
Folder created:
missing_visualization_20181205@1819
Importing data
Show Traceback
Error in stringi::stri_replace_all_fixed(str = as.character(x), pattern = c("_", : object 'input' not found
The traceback is:
stringi::stri_replace_all_fixed(str = as.character(x), pattern = c("_", ":", " "), replacement = c("-", "-", ""), vectorize_all = FALSE)
radiator::clean_ind_names(input$INDIVIDUALS)
radiator::tidy_genomic_data(data = data, vcf.metadata = FALSE, blacklist.id = blacklist.id, blacklist.genotype = blacklist.genotype, whitelist.markers = whitelist.markers, monomorphic.out = monomorphic.out, snp.ld = snp.ld, common.markers = common.markers, strata = strata, ...
grur::missing_visualization(data = "UMBELLA_Erumb1_samples_gt_50pct_covered.recode.vcf.gz", strata = "strata_eu.tsv", strata.select = c("POP_ID", "FLOWCELL", "variety2", "MACHINE", "year"), filename = "UMBELLA_erumb1_gt_90pct.RData")
It looks like the problem originates in radiator. missing_visualization() worked fine on 16-11-2018. Maybe this is due to commit c9ffe79 of radiator?
Hi Thierry,
As I've mentioned in my other issue, grur requires quite a lot of dependencies, which in my case raised a "maximal number of DLLs reached" error when it was loaded with a few other packages.
It seems that the issue is with Rdynload.c in the base R code: #define MAX_NUM_DLLS 100.
In R versions >= 3.4, you can set a different maximum number of DLLs using the environment variable R_MAX_NUM_DLLS (taken from this SO thread).
From the release notes:
The maximum number of DLLs that can be loaded into R e.g. via dyn.load() can now be increased by setting the environment variable R_MAX_NUM_DLLS before starting R.
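To check how close a session is to the limit, the following base R sketch counts the DLLs currently loaded and reports the value of R_MAX_NUM_DLLS as seen by the running session. The value 150 in the comment is an arbitrary illustration, not a recommendation from the release notes.

```r
# Number of DLLs currently loaded in this R session
n_dlls <- length(getLoadedDLLs())

# Value of R_MAX_NUM_DLLS as seen by this session; "" means it is not set
max_dlls_env <- Sys.getenv("R_MAX_NUM_DLLS", unset = "")

message("Loaded DLLs: ", n_dlls)
message("R_MAX_NUM_DLLS: ",
        if (nzchar(max_dlls_env)) max_dlls_env else "(not set)")

# The variable has to be set before R starts, for example with a line such as
#   R_MAX_NUM_DLLS=150
# in ~/.Renviron (150 is an arbitrary illustrative value).
```

Calling Sys.setenv() from inside a running session will not raise the limit, since the variable is read at startup.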
I'm pretty sure some of the packages are not entirely needed, such as maps, mapdata, and non-standard tidyverse packages, such as glue, tidyselect, purrrlyr, reshape2 (isn't it a part of tidyr now?), plyr (mostly replaced by dplyr), etc.
Thanks, Ido
Hi Dr. Gosselin,
I'm trying to work through the missing data analysis vignette using the sample data and I'm running into problems. First of all, the vcf file created by the line
writeBin(httr::content(httr::GET("http://datadryad.org/bitstream/handle/10255/dryad.97237/sticklebacks_Danish.vcf?sequence=1"), "raw"), "stickleback_ferchaud_2015.vcf")
generated an empty file. Therefore, I went to Dryad and downloaded sticklebacks_Danish.vcf instead. I also downloaded strata.stickleback.tsv for the strata. I executed this line of code
ibm <- grur::missing_visualization(data = "sticklebacks_Danish.vcf", strata = "strata.stickleback.tsv")
Here is the output where the error message occurs:
Number of duplicate id: 0
radiator Genomic Data Structure (GDS) file: [email protected]
Error in dplyr::mutate():
! Problem while computing MISSING_PROP = round(...).
Caused by error in .DynamicClusterCall():
! One of the nodes produced an error: Can not open file 'C:\Test\missing_visualization_20230105@1248\[email protected]'. The process cannot access the file because it is being used by another process.
Run rlang::last_error() to see where the error occurred.
rlang::last_error()
Backtrace:
Any help would be appreciated.
Kat
Hi Thierry,
Any suggestions for troubleshooting this error?
> genlight
/// GENLIGHT OBJECT /////////
// 51 genotypes, 13,936 binary SNPs, size: 1.9 Mb
200845 (28.26 %) missing data
// Basic content
@gen: list of 51 SNPbin
// Optional content
@ind.names: 51 individual labels
@loc.names: 13936 locus labels
@chromosome: factor storing chromosomes of the SNPs
@position: integer storing positions of the SNPs
@pop: population of each individual (group size range: 1-17)
@other: a list containing: elements without names
Proportion of missing genotypes (overall): 0.282587
> missing_dat <- grur::missing_visualization(genlight)
#######################################################################
#################### grur::missing_visualization ######################
#######################################################################
Folder created:
missing_visualization_20180412@1213
Importing data
Scanning for monomorphic markers...
Number of markers before = 13936
Number of monomorphic markers removed = 0
Tidy genomic data:
Number of markers: 13936
Number of chromosome/contig/scaffold: 1
Number of individuals: 51
Number of populations: 15
Informations:
Number of populations: 15
Number of individuals: 51
Number of ind/pop:
Arbon,_ID = 5
Ashton,_ID = 1
Conrad,_MT = 2
Denton,_MT = 2
Hermiston,_OR = 17
Kalispell,_MT = 1
Kimberly = 1
Kimberly,_ID = 3
McAmmon,_ID = 2
Neely,_ID = 2
Picabo,_ID = 4
Rexbug,_ID = 2
Ririe,_ID = 4
Soda_Springs,_ID = 3
Townsend,_MT = 2
Number of duplicate id: 0
Number of chrom/scaffolds: 1
Number of locus: 13936
Number of SNPs: 13936
Proportion of missing genotypes (overall): 0.282587
Identity-by-missingness (IBM) analysis using
Principal Coordinate Analysis (PCoA)...
Generating Identity by missingness plot
Error in eigen(delta1) : infinite or missing values in 'x'