thierrygosselin / grur

grur: an R package tailored for RADseq data imputations

Home Page: https://thierrygosselin.github.io/grur/

R 100.00%
genomics genomics-visualization genomic-data-analysis imputation radseq radseq-data gbs machine-learning random-forest boosting-algorithms pca-analysis missing-data

grur's People

Contributors

anne-laureferchaud, ericarcher, thierrygosselin

grur's Issues

Missing_visualization: Error in UseMethod("ungroup") : no applicable method

Hi Thierry,
I get the following error when running missing_visualization on the tutorial file "example_vcf2dadi_ferchaud_2015.vcf" and on a .vcf of my own:

$ ibm <- grur::missing_visualization(data = "example_vcf2dadi_ferchaud_2015.vcf", strata = "strata.stickleback.tsv")
Folder created:
missing_visualization_20181209@1523

Importing data

Reading VCF...
Generated a filters parameters file: [email protected]

Number of SNPs: 31802
Number of samples: 177

conversion timing: 1 sec

VCF: biallelic SNPs
Cleaning VCF sample names

Synchronizing sample IDs in VCF and strata...
Reads assembly: reference-assisted
Filters parameters file: updated

Number of chromosome/contig/scaffold: 196
Number of locus: 17095
Number of markers: 31802
Number of individuals: 177

Working time: 1 sec

Deprecated function, update your code to use: filter_monomorphic

Scanning for monomorphic markers...
Number of markers before/blacklisted/after: 31802/0/31802

Tidy genomic data:
Number of markers: 31802
Number of chromosome/contig/scaffold: 196
Number of individuals: 177
Number of populations: 8

Informations:
Number of populations: 8
Number of individuals: 177
Number of ind/pop:
HAD = 21
HAL = 19
KIB = 17
KRO = 20
MOS = 20
MAR = 20
NOR = 20
ODD = 40

Number of duplicate id: 0
Number of chrom/scaffolds: 196
Number of locus: 17095
Number of SNPs: 31802

Proportion of missing genotypes (overall): 0.022169

Identity-by-missingness (IBM) analysis using
Principal Coordinate Analysis (PCoA)...
Generating Identity by missingness plot

Redundancy analysis...

Error in UseMethod("ungroup") :
no applicable method for 'ungroup' applied to an object of class "character"
In addition: Warning message:
Trying to compute distinct() for variables not found in the data:
- strata.select
This is an error, but only a warning is raised for compatibility reasons.
The operation will return the input unchanged.

The ibm object is not generated. No blacklists or PCoA statistics are generated. I am on a Mac, with updated RStudio 1.1.383. The command was run from the RStudio console. Can anyone replicate this behavior, or is it just my system? Any suggestions?

version
_
platform x86_64-apple-darwin15.6.0
arch x86_64
os darwin15.6.0
system x86_64, darwin15.6.0
status
major 3
minor 5.1
year 2018
month 07
day 02
svn rev 74947
language R
version.string R version 3.5.1 (2018-07-02)
nickname Feather Spray

Error in eigen(delta1) : infinite or missing values in x

Hi Dr. Gosselin,
I used the radiator package to read in my vcf
vcf <- read_vcf(data = "populations.snps.vcf", strata = "Whitefish.strata.tsv", parallel.core = 1L)

I then attempted to use the grur package to perform the identity by missingness analysis and got an error.
ibm.whitefish<-missing_visualization(
vcf,
strata = "Whitefish.strata.tsv",
distance.method = "euclidean",
ind.missing.geno.threshold = c(2, 5, 10, 20, 30, 40, 50, 60, 70, 80, 90),
filename = NULL,
parallel.core = 1L,
write.plot = TRUE
)

Here is the output

Default "..." arguments assigned in missing_visualization:
path.folder = NULL

Folder created: missing_visualization_20220713@1009
File written: [email protected]

Importing data
File written: individuals qc info and stats summary
File written: individuals qc plot

Informations:
Number of populations: 33
Number of individuals: 545
Number of ind/pop:
CclPEND21C = 15
PabBERL20C = 16
PgeBERL19C = 32
PspBERL20C = 16
PwiBGLR2021C = 26
PwiBHNR20C = 14
PwiBIGW0321C = 47
PwiBOIS20C = 18
PwiBRUN21C = 15
PwiCLFR21C = 16
PwiCLWR20C = 16
PwiEFSW14C = 16
PwiGALL20C = 16
PwiHFRK21C = 15
PwiKOTN21C = 14
PwiKOTN21C_1 = 15
PwiLKGC20C = 15
PwiLOGN20C = 16
PwiLSTW20C = 16
PwiMDSN20C = 16
PwiMFWR20C = 16
PwiNFBR21C = 12
PwiNFSR21C = 16
PwiRFLC20C = 16
PwiSFBR21C = 5
PwiSFSN20C = 15
PwiSQPR20C = 11
PwiSTMP20C = 16
PwiTETR20C = 16
PwiTRUC09C = 13
PwiWBSL20C = 16
PwiWLWL20C = 15
PwoYLOR21C = 8

Number of duplicate id: 0
Number of chrom/scaffolds: 1
Number of locus: 182
Number of SNPs: 350

Proportion of missing genotypes (overall): 0.144619
Identity-by-missingness (IBM) analysis using
Principal Coordinate Analysis (PCoA)...
fstcore package v0.9.12
(OpenMP detected, using 12 threads)
Generating Identity by missingness plot
Redundancy analysis...
Error in eigen(delta1) : infinite or missing values in 'x'
In addition: Warning message:
package ‘fstcore’ was built under R version 4.1.3

Computation time, overall: 5 sec

I saw in another error report that someone got the same error when attempting missing_visualization on a genlight object, and you suggested it was because one of the pop groups had a size of 1, but that is not the case here. I have 33 populations (strata) and the smallest has 5 individuals.
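A minimal sketch of how the group-size check described above could be done in base R. The strata table here is made-up illustration data, and the column names INDIVIDUALS and STRATA are assumptions about the strata file layout; in practice the table might be read from the strata file with read.delim().

```r
# Hypothetical strata table (illustration only, not the Whitefish data).
strata <- data.frame(
  INDIVIDUALS = c("ind1", "ind2", "ind3", "ind4", "ind5", "ind6"),
  STRATA = c("popA", "popA", "popA", "popB", "popB", "popC"),
  stringsAsFactors = FALSE
)

# Number of individuals per stratum.
group_sizes <- table(strata$STRATA)

# Names of strata represented by a single individual.
singletons <- names(group_sizes)[as.vector(group_sizes) == 1]

print(group_sizes)
print(singletons)
```

An empty `singletons` vector would rule out the single-individual explanation, as reported here.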

Here are the results from running session_info()
devtools::session_info()

  • Session info ----------------------------------------------------------------------------------------
    setting value
    version R version 4.1.0 (2021-05-18)
    os Windows 10 x64 (build 19044)
    system x86_64, mingw32
    ui RStudio
    language (EN)
    collate English_United States.1252
    ctype English_United States.1252
    tz America/Denver
    date 2022-07-13
    rstudio 2022.02.3+492 Prairie Trillium (desktop)
    pandoc NA

  • Packages --------------------------------------------------------------------------------------------
    package * version date (UTC) lib source
    abind 1.4-5 2016-07-21 [1] CRAN (R 4.1.0)
    ade4 1.7-19 2022-04-19 [1] CRAN (R 4.1.3)
    adegenet 2.1.7 2022-06-06 [1] CRAN (R 4.1.3)
    adegraphics 1.0-16 2021-09-16 [1] CRAN (R 4.1.3)
    adephylo 1.1-11 2017-12-18 [1] CRAN (R 4.1.3)
    adespatial 0.3-16 2022-03-31 [1] CRAN (R 4.1.3)
    ape 5.6-2 2022-03-02 [1] CRAN (R 4.1.3)
    assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.1.1)
    backports 1.4.1 2021-12-13 [1] CRAN (R 4.1.2)
    BiocGenerics 0.40.0 2021-10-26 [1] Bioconductor
    Biostrings 2.62.0 2021-10-26 [1] Bioconductor
    bit 4.0.4 2020-08-04 [1] CRAN (R 4.1.2)
    bit64 4.0.5 2020-08-30 [1] CRAN (R 4.1.2)
    bitops 1.0-7 2021-04-24 [1] CRAN (R 4.1.1)
    boot 1.3-28 2021-05-03 [2] CRAN (R 4.1.0)
    broom 1.0.0 2022-07-01 [1] CRAN (R 4.1.3)
    cachem 1.0.6 2021-08-19 [1] CRAN (R 4.1.2)
    callr 3.7.1 2022-07-13 [1] CRAN (R 4.1.0)
    car 3.1-0 2022-06-15 [1] CRAN (R 4.1.3)
    carData 3.0-5 2022-01-06 [1] CRAN (R 4.1.2)
    class 7.3-19 2021-05-03 [2] CRAN (R 4.1.0)
    classInt 0.4-7 2022-06-10 [1] CRAN (R 4.1.3)
    cli 3.3.0 2022-04-25 [1] CRAN (R 4.1.3)
    cluster 2.1.2 2021-04-17 [2] CRAN (R 4.1.0)
    codetools 0.2-18 2020-11-04 [2] CRAN (R 4.1.0)
    colorspace 2.0-3 2022-02-21 [1] CRAN (R 4.1.3)
    cowplot 1.1.1 2020-12-30 [1] CRAN (R 4.1.0)
    crayon 1.5.1 2022-03-26 [1] CRAN (R 4.1.3)
    data.table 1.14.2 2021-09-27 [1] CRAN (R 4.1.2)
    DBI 1.1.3 2022-06-18 [1] CRAN (R 4.1.3)
    deldir 1.0-6 2021-10-23 [1] CRAN (R 4.1.1)
    devtools 2.4.3 2021-11-30 [1] CRAN (R 4.1.2)
    digest 0.6.29 2021-12-01 [1] CRAN (R 4.1.3)
    dplyr 1.0.9 2022-04-28 [1] CRAN (R 4.1.3)
    e1071 1.7-11 2022-06-07 [1] CRAN (R 4.1.3)
    EFGLmh 0.1.0 2021-06-14 [1] Github (delomast/EFGLmh@dbb9612)
    ellipsis 0.3.2 2021-04-29 [1] CRAN (R 4.1.0)
    fansi 1.0.3 2022-03-24 [1] CRAN (R 4.1.3)
    farver 2.1.1 2022-07-06 [1] CRAN (R 4.1.3)
    fastmap 1.1.0 2021-01-25 [1] CRAN (R 4.1.0)
    fs 1.5.2 2021-12-08 [1] CRAN (R 4.1.2)
    fst 0.9.8 2022-02-08 [1] CRAN (R 4.1.3)
    fstcore * 0.9.12 2022-03-23 [1] CRAN (R 4.1.3)
    gdsfmt 1.30.0 2021-10-26 [1] Bioconductor
    generics 0.1.3 2022-07-05 [1] CRAN (R 4.1.3)
    GenomeInfoDb 1.30.1 2022-01-30 [1] Bioconductor
    GenomeInfoDbData 1.2.7 2022-02-14 [1] Bioconductor
    GenomicRanges 1.46.1 2021-11-18 [1] Bioconductor
    ggplot2 3.3.6 2022-05-03 [1] CRAN (R 4.1.3)
    ggpubr 0.4.0 2020-06-27 [1] CRAN (R 4.1.0)
    ggsignif 0.6.3 2021-09-09 [1] CRAN (R 4.1.2)
    glue 1.6.2 2022-02-24 [1] CRAN (R 4.1.3)
    gridExtra 2.3 2017-09-09 [1] CRAN (R 4.1.2)
    grur * 0.1.4 2022-07-12 [1] Github (d31c423)
    gtable 0.3.0 2019-03-25 [1] CRAN (R 4.1.0)
    hms 1.1.1 2021-09-26 [1] CRAN (R 4.1.2)
    htmltools 0.5.2 2021-08-25 [1] CRAN (R 4.1.2)
    httpuv 1.6.5 2022-01-05 [1] CRAN (R 4.1.2)
    httr 1.4.3 2022-05-04 [1] CRAN (R 4.1.3)
    igraph 1.3.2 2022-06-13 [1] CRAN (R 4.1.3)
    interp 1.1-2 2022-05-10 [1] CRAN (R 4.1.3)
    IRanges 2.28.0 2021-10-26 [1] Bioconductor
    jpeg 0.1-9 2021-07-24 [1] CRAN (R 4.1.1)
    KernSmooth 2.23-20 2021-05-03 [2] CRAN (R 4.1.0)
    labeling 0.4.2 2020-10-20 [1] CRAN (R 4.1.0)
    later 1.3.0 2021-08-18 [1] CRAN (R 4.1.2)
    lattice 0.20-44 2021-05-02 [2] CRAN (R 4.1.0)
    latticeExtra 0.6-30 2022-07-04 [1] CRAN (R 4.1.3)
    lazyeval 0.2.2 2019-03-15 [1] CRAN (R 4.1.3)
    lifecycle 1.0.1 2021-09-24 [1] CRAN (R 4.1.2)
    magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.1.3)
    MASS 7.3-54 2021-05-03 [2] CRAN (R 4.1.0)
    Matrix 1.3-3 2021-05-04 [2] CRAN (R 4.1.0)
    memoise 2.0.1 2021-11-26 [1] CRAN (R 4.1.2)
    mgcv 1.8-35 2021-04-18 [2] CRAN (R 4.1.0)
    mime 0.12 2021-09-28 [1] CRAN (R 4.1.1)
    munsell 0.5.0 2018-06-12 [1] CRAN (R 4.1.0)
    nlme 3.1-152 2021-02-04 [2] CRAN (R 4.1.0)
    permute 0.9-7 2022-01-27 [1] CRAN (R 4.1.2)
    phylobase 0.8.10 2020-03-01 [1] CRAN (R 4.1.3)
    pillar 1.7.0 2022-02-01 [1] CRAN (R 4.1.2)
    pkgbuild 1.3.1 2021-12-20 [1] CRAN (R 4.1.2)
    pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.1.0)
    pkgload 1.3.0 2022-06-27 [1] CRAN (R 4.1.3)
    plyr 1.8.7 2022-03-24 [1] CRAN (R 4.1.3)
    png 0.1-7 2013-12-03 [1] CRAN (R 4.1.1)
    prettyunits 1.1.1 2020-01-24 [1] CRAN (R 4.1.0)
    processx 3.7.0 2022-07-07 [1] CRAN (R 4.1.3)
    progress 1.2.2 2019-05-16 [1] CRAN (R 4.1.0)
    promises 1.2.0.1 2021-02-11 [1] CRAN (R 4.1.0)
    proxy 0.4-27 2022-06-09 [1] CRAN (R 4.1.3)
    ps 1.7.1 2022-06-18 [1] CRAN (R 4.1.3)
    purrr 0.3.4 2020-04-17 [1] CRAN (R 4.1.0)
    R6 2.5.1 2021-08-19 [1] CRAN (R 4.1.2)
    radiator * 1.2.2 2022-07-12 [1] Github (thierrygosselin/radiator@6efdf14)
    raster 3.5-21 2022-06-27 [1] CRAN (R 4.1.3)
    RColorBrewer 1.1-3 2022-04-03 [1] CRAN (R 4.1.3)
    Rcpp 1.0.8.3 2022-03-17 [1] CRAN (R 4.1.3)
    RCurl 1.98-1.7 2022-06-09 [1] CRAN (R 4.1.3)
    readr 2.1.2 2022-01-30 [1] CRAN (R 4.1.2)
    remotes 2.4.2 2021-11-30 [1] CRAN (R 4.1.2)
    reshape2 1.4.4 2020-04-09 [1] CRAN (R 4.1.2)
    rlang 1.0.3 2022-06-27 [1] CRAN (R 4.1.3)
    rncl 0.8.6 2022-03-18 [1] CRAN (R 4.1.3)
    RNeXML 2.4.7 2022-05-13 [1] CRAN (R 4.1.3)
    rstatix 0.7.0 2021-02-13 [1] CRAN (R 4.1.0)
    rstudioapi 0.13 2020-11-12 [1] CRAN (R 4.1.1)
    s2 1.0.7 2021-09-28 [1] CRAN (R 4.1.2)
    S4Vectors 0.32.4 2022-04-03 [1] Bioconductor
    scales 1.2.0 2022-04-13 [1] CRAN (R 4.1.3)
    SeqArray 1.34.0 2021-10-26 [1] Bioconductor
    seqinr 4.2-16 2022-05-19 [1] CRAN (R 4.1.3)
    sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.1.2)
    sf 1.0-7 2022-03-07 [1] CRAN (R 4.1.3)
    shiny 1.7.1 2021-10-02 [1] CRAN (R 4.1.2)
    sp 1.5-0 2022-06-05 [1] CRAN (R 4.1.3)
    spData 2.0.1 2021-10-14 [1] CRAN (R 4.1.2)
    spdep 1.2-4 2022-04-18 [1] CRAN (R 4.1.3)
    stringi 1.7.6 2021-11-29 [1] CRAN (R 4.1.2)
    stringr 1.4.0 2019-02-10 [1] CRAN (R 4.1.0)
    terra 1.5-34 2022-06-09 [1] CRAN (R 4.1.3)
    tibble 3.1.7 2022-05-03 [1] CRAN (R 4.1.3)
    tidyr 1.2.0 2022-02-01 [1] CRAN (R 4.1.3)
    tidyselect 1.1.2 2022-02-21 [1] CRAN (R 4.1.3)
    tzdb 0.3.0 2022-03-28 [1] CRAN (R 4.1.3)
    units 0.8-0 2022-02-05 [1] CRAN (R 4.1.2)
    UpSetR 1.4.0 2019-05-22 [1] CRAN (R 4.1.3)
    usethis 2.1.6 2022-05-25 [1] CRAN (R 4.1.3)
    utf8 1.2.2 2021-07-24 [1] CRAN (R 4.1.3)
    uuid 1.1-0 2022-04-19 [1] CRAN (R 4.1.3)
    vctrs 0.4.1 2022-04-13 [1] CRAN (R 4.1.3)
    vegan 2.6-2 2022-04-17 [1] CRAN (R 4.1.3)
    vroom 1.5.7 2021-11-30 [1] CRAN (R 4.1.2)
    withr 2.5.0 2022-03-03 [1] CRAN (R 4.1.3)
    wk 0.6.0 2022-01-03 [1] CRAN (R 4.1.2)
    XML 3.99-0.10 2022-06-09 [1] CRAN (R 4.1.3)
    xml2 1.3.3 2021-11-30 [1] CRAN (R 4.1.2)
    xtable 1.8-4 2019-04-21 [1] CRAN (R 4.1.0)
    XVector 0.34.0 2021-10-26 [1] Bioconductor
    zlibbioc 1.40.0 2021-10-26 [1] Bioconductor

I've attached my strata file and the first 100 lines of my vcf file as txt files
trunc.populations.snps.vcf.txt
Whitefish.strata.tsv.txt

Thanks for any help,
Kat

Column positions must be scalar

I tried to use missing_visualization to get a summary of my genomic data.
However, I cannot produce the plots and tables, such as the heatmap, the missing summary table, and the Manhattan and violin plots.

Here is my code.
library(grur)
missing_visualization(data = "populations.snps.vcf", strata = "popmap.tsv", parallel.core = 1)

And here is my error message

Analysing percentage missing ...
Error: Column positions must be scalar
Call rlang::last_error() to see a backtrace

Computation time, overall: 24 sec
############################ missing_visualization #############################

rlang::last_error()
<error>
message: Column positions must be scalar
class: rlang_error
backtrace:

  1. grur::missing_visualization(...)
  2. purrr::map(...)
  3. grur:::.f(.x[[i]], ...)
  4. [ %<>%(...) ] with 7 more calls
  5. dplyr:::rename.data.frame(., STRATA_SELECT = data[[!!(strata.select)]])
  6. tidyselect::vars_rename(names(.data), !!!enquos(...))
  7. tidyselect:::vars_rename_eval(quos, .vars)
  8. purrr::map2_chr(renamed, names(quos), validate_renamed_var, vars)
  9. tidyselect:::.f(.x[[1L]], .y[[1L]], ...)
  10. rlang::switch_type(...)
Call rlang::last_trace() to see the full backtrace

How can I solve this problem?
Is there a dependency that I need to install?

Attached is the file I used.

population.zip

Error in purrr::map()

Hi Dr. Gosselin,
I downloaded sticklebacks_Danish.vcf from https://datadryad.org/stash/dataset/doi:10.5061%2Fdryad.kp11q and the strata.stickleback.tsv from https://www.dropbox.com/s/ely3wp4j4tulkrc/strata.stickleback.tsv?dl=0. I executed the following code

library("grur")
ibm <- grur::missing_visualization(
data = "sticklebacks_Danish.vcf",
strata = "strata.stickleback.tsv", parallel.core = 1L)

And I get the following error
Analysing percentage missing ...
Error in purrr::map():
i In index: 1.
Caused by error in env_get():
! object 'term' not found
Run rlang::last_error() to see where the error occurred.
Warning message:
package ‘fstcore’ was built under R version 4.1.3

rlang::last_error()
Backtrace:

  1. grur::missing_visualization(...)
  2. tidyr:::pivot_wider.data.frame(...)
  3. tidyr::build_wider_spec(...)
  4. tidyselect::eval_select(enquo(names_from), data)
  5. tidyselect:::eval_select_impl(...)
  6. tidyselect:::vars_select_eval(...)
  7. tidyselect:::walk_data_tree(expr, data_mask, context_mask)
  8. tidyselect:::eval_sym(expr, data_mask, context_mask)
  9. rlang::env_get(env, name, default = missing_arg(), inherit = TRUE)

I tried again with a dataset of mine from Stacks population output and a popmap turned into a strata file and I got the same result.

The entire output I got is below

################################################################################
######################## grur::missing_visualization ###########################
################################################################################
Execution date/time: 20230106@1055

::grurmissing_visualization function call arguments:
data = sticklebacks_Danish.vcf
strata = strata.stickleback.tsv
strata.select = POP_ID
distance.method = euclidean
ind.missing.geno.threshold = 2, 5, 10, 20, 30, 40, 50, 60, 70, 80, 90
filename = NULL
parallel.core = 1
write.plot = TRUE

Default "..." arguments assigned in ::grurmissing_visualization:
path.folder = NULL

Folder created: missing_visualization_20230106@1055
File written: [email protected]

Importing data
Found more than one class "Annotated" in cache; using the first, from namespace 'RNeXML'
Also defined by ‘S4Vectors’
Found more than one class "Annotated" in cache; using the first, from namespace 'RNeXML'
Also defined by ‘S4Vectors’
Found more than one class "Annotated" in cache; using the first, from namespace 'RNeXML'
Also defined by ‘S4Vectors’
Found more than one class "Annotated" in cache; using the first, from namespace 'RNeXML'
Also defined by ‘S4Vectors’
Found more than one class "Annotated" in cache; using the first, from namespace 'RNeXML'
Also defined by ‘S4Vectors’
Found more than one class "Annotated" in cache; using the first, from namespace 'RNeXML'
Also defined by ‘S4Vectors’

Reading VCF...

Data summary:
number of samples: 177
number of markers: 31802

Filter monomorphic markers
Number of individuals / strata / chrom / locus / SNP:
Blacklisted: 0 / 0 / 0 / 0 / 0

Filter common markers:
Number of individuals / strata / chrom / locus / SNP:
Blacklisted: 0 / 0 / 0 / 0 / 0

Number of chromosome/contig/scaffold: 196
Number of locus: 17095
Number of markers: 31802
Number of strata: 8
Number of individuals: 177

Number of ind/strata:
HAD = 21
HAL = 19
KIB = 17
KRO = 20
MOS = 20
MAR = 20
NOR = 20
ODD = 40

Number of duplicate id: 0
radiator Genomic Data Structure (GDS) file: [email protected]
File written: individuals qc info and stats summary
File written: individuals qc plot

Informations:
Number of populations: 8
Number of individuals: 177
Number of ind/pop:
HAD = 21
HAL = 19
KIB = 17
KRO = 20
MOS = 20
MAR = 20
NOR = 20
ODD = 40

Number of duplicate id: 0
Number of chrom/scaffolds: 196
Number of locus: 17095
Number of SNPs: 31802

Proportion of missing genotypes (overall): 0.022169

Identity-by-missingness (IBM) analysis using
Principal Coordinate Analysis (PCoA)...
fstcore package v0.9.12
(OpenMP detected, using 12 threads)
Generating Identity by missingness plot
Redundancy analysis...
Redundancy Analysis using strata: POP_ID
RDA model formula: data.pcoa$vectors ~ POP_ID
Permutation test for Redundancy Analysis using strata: POP_ID

Hypothesis based on the strata provided
Null Hypothesis (H0): No pattern of missingness in the data between strata
Alternative Hypothesis (H1): Presence of pattern(s) of missingness in the data between strata

A tibble: 1 x 3
  STRATA VARIANCE  P_VALUE
1 POP_ID  0.00597 0.000999
note: low p-value -> reject the null hypothesis

Analysing percentage missing ...
Error in purrr::map():
i In index: 1.
Caused by error in env_get():
! object 'term' not found
Run rlang::last_error() to see where the error occurred.
Warning message:
package ‘fstcore’ was built under R version 4.1.3

Computation time, overall: 46 sec
############################ missing_visualization #############################

Segfault from C stack overflow during imputation

Hi Thierry,

I'm trying to impute a set of ddRAD SNPs (not a huge set, see details in the output), but it fails, giving the following output:

GBS_data$imputed.data <- grur_imputations(data = GBS_data$tidy.data, parallel.core=16)

###############################################################################
########################### grur::grur_imputations ############################
###############################################################################
Imputation method: rf
Hierarchical levels: populations
On-the-fly-imputations options:
    number of trees to grow: 50
    minimum terminal node size: 1
    non-negative integer value used to specify random splitting: 10
    number of iterations: 10
Number of CPUs: 16
Note: If you have speed issues: follow grur's vignette on parallel computing


Number of populations: 4
Number of individuals: 95
Number of markers: 355251

Proportion of missing genotypes before imputations: 0.44623
Scanning dataset for population(s) with monomorphic marker(s)...
    Simple strawman imputations conducted on 605994 markers/pops combo
On-the-fly-imputations using Random Forests algorith
    Imputations computed by populations, take a break...
Error: segfault from C stack overflow

The data was imported using genomic_converter() from the radiator package (I initially tried to do the imputation on the fly during the import, but got the same segfault error, so I separated the two processes).
I'm using R 3.4.0 on Ubuntu (see complete sessionInfo() output below).

> sessionInfo()
R version 3.4.0 (2017-04-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.10

Matrix products: default
BLAS: /home/ibar/.Renv/versions/3.4.0/lib/R/lib/libRblas.so
LAPACK: /home/ibar/.Renv/versions/3.4.0/lib/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_AU.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_AU.UTF-8        LC_COLLATE=en_AU.UTF-8
 [5] LC_MONETARY=en_AU.UTF-8    LC_MESSAGES=en_AU.UTF-8
 [7] LC_PAPER=en_AU.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] bindrcpp_0.2 grur_0.0.6

loaded via a namespace (and not attached):
  [1] amap_0.8-14           colorspace_1.3-2      seqinr_3.4-5
  [4] deldir_0.1-14         htmlTable_1.9         base64enc_0.1-3
  [7] listenv_0.6.0         gsl_1.9-10.3          DT_0.2
 [10] mvtnorm_1.0-6         ranger_0.8.0          codetools_0.2-15
 [13] splines_3.4.0         knitr_1.17            pegas_0.10
 [16] polyclip_1.6-1        ade4_1.7-8            Formula_1.2-2
 [19] swfscMisc_1.2         cluster_2.0.6         apex_1.0.2
 [22] stabledist_0.7-1      copula_0.999-18       shiny_1.0.5
 [25] readr_1.1.1           compiler_3.4.0        randomForestSRC_2.5.0
 [28] strataG_2.0.2         backports_1.1.0       assertthat_0.2.0
 [31] Matrix_1.2-9          lazyeval_0.2.0        acepack_1.4.1
 [34] htmltools_0.3.6       tools_3.4.0           igraph_1.1.2
 [37] coda_0.19-1           gtable_0.2.0          glue_1.1.1
 [40] reshape2_1.4.2        dplyr_0.7.2           maps_3.2.0
 [43] gmodels_2.16.2        spatstat_1.52-1       fastmatch_1.1-0
 [46] Rcpp_0.12.12          RJSONIO_1.3-0         spdep_0.6-15
 [49] gdata_2.18.0          ape_4.1               nlme_3.1-131
 [52] pinfsc50_1.1.0        stringr_1.2.0         globals_0.10.2
 [55] mapdata_2.2-6         mime_0.5              phangorn_2.2.0
 [58] gtools_3.5.0          goftest_1.1-1         stringdist_0.9.4.6
 [61] future_1.6.0          SNPRelate_1.11.2      radiator_0.0.4
 [64] LearnBayes_2.15       MASS_7.3-47           scales_0.5.0
 [67] spatstat.utils_1.7-1  hms_0.3               gdsfmt_1.12.0
 [70] parallel_3.4.0        expm_0.999-2          RColorBrewer_1.1-2
 [73] gridExtra_2.2.1       ggplot2_2.2.1         purrrlyr_0.0.2
 [76] UpSetR_1.3.3          rpart_4.1-11          latticeExtra_0.6-28
 [79] stringi_1.1.5         pcaPP_1.9-72          checkmate_1.8.3
 [82] permute_0.9-4         boot_1.3-19           rlang_0.1.2
 [85] pkgconfig_2.0.1       lattice_0.20-35       tensor_1.5
 [88] purrr_0.2.3           bindr_0.1             htmlwidgets_0.9
 [91] tidyselect_0.2.0      plyr_1.8.4            magrittr_1.5
 [94] R6_2.2.2              Hmisc_4.0-3           ADGofTest_0.3
 [97] foreign_0.8-67        mgcv_1.8-17           abind_1.4-5
[100] survival_2.41-3       sp_1.2-5              nnet_7.3-12
[103] pspline_1.0-18        tibble_1.3.4          shinyFiles_0.6.2
[106] xgboost_0.6-4         rmetasim_3.0.5        vcfR_1.5.0
[109] adegenet_2.0.1        grid_3.4.0            data.table_1.10.4
[112] vegan_2.4-4           digest_0.6.12         pbmcapply_1.2.4
[115] xtable_1.8-2          numDeriv_2016.8-1     tidyr_0.7.1
[118] httpuv_1.3.5          stats4_3.4.0          munsell_0.4.3
[121] fst_0.7.2             viridisLite_0.2.0     quadprog_1.5-5

BTW, both grur and radiator import a plethora of dependencies, which (a) makes the installation more complicated, especially for grur, and (b) exceeds the default DLL limit (100), so the R_MAX_NUM_DLLS variable has to be modified (only possible in R >= 3.4).

Thanks, Ido

Error running missing_visualization: Maximal number of DLLs reached...

I just installed this package today (Ubuntu v16.04, R v3.4.1), and the install seemed to go successfully. When I ran this code:

library(grur)
ibm <- missing_visualization(data = "../Inputs/OL-c85-t88-Breps-m50x68-maf025-u.vcf", strata = "../Making_Files/OL-c85-t88-Breps.pop")

I received this message:
Folder created:
missing_visualization_20180313@1706
Importing data
VCF is biallelic
Error in dyn.load(file, DLLpath = DLLpath, ...) : unable to load shared object '/home/ksilliman/R/x86_64-pc-linux-gnu-library/3.4/tidyselect/libs/tidyselect.so: maximal number of DLLs reached...

Full traceback:
screenshot from 2018-03-13 17-10-53

Thanks!

imputations_accuracy() lists all markers as not in common and drops all from analysis

Hi Thierry,
I have filtered a .vcf dataset so as to create a version with low missingness of sites and samples. I then ran the following line to impute genotypes and output .rad and .tped files, with and without rf imputation:

test2 <- genomic_converter(data = "UMBELLA_Erumb1_samples_gt_90pct_snps_gt_40pct_covered.recode.vcf",output="plink",filename = "test_2",strata="strata_eu_40.tsv", imputation.method = "rf")

The run finishes with the following results section:

############################### RESULTS ###############################
Data format of input: vcf.file
Biallelic data
Number of common markers: 156
Number of chromosome/contig/scaffold: 113
Number of individuals 551
Number of populations 56

I then check the imputation accuracy, but all 156 markers are dropped, evidently for not being shared:

> imp_acc <- imputations_accuracy("test_2.rad","test_2_imputed.rad")

Data provided still contains missing genotypes,
accuracy will be mesured on common non-missing genotypes
Removing 156 markers not in common between datasets
$Information dropped from the analysis
$Information dropped from the analysis$markers.dropped
[1] "Erumb1_s01095803__85__621" "Erumb1_s00047668__13__4936"
[3] "Erumb1_s00648045__68__686" "Erumb1_s01308175__88__428"
[5] "Erumb1_s02281967__111__51" "Erumb1_s02289709__112__65" .....

Misclassification Error: by populations
A tibble: 0 x 2
... with 2 variables: POP_ID , ME

Misclassification Error: by individuals
A tibble: 0 x 2
... with 2 variables: INDIVIDUALS , ME

Misclassification Error: by markers
A tibble: 0 x 2
... with 2 variables: MARKERS , ME

Misclassification Error: overall
[1] NaN
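A minimal sketch of how the overlap between the marker names of two datasets could be inspected in base R. The marker names below are hypothetical, and the differing separators are only one possible reason for an empty overlap, not a confirmed diagnosis of this report.

```r
# Hypothetical marker names standing in for the MARKERS column of each dataset.
markers_original <- c("locus1__10", "locus2__25", "locus3__40")
markers_imputed  <- c("locus1_10", "locus2_25", "locus3_40")

# Markers shared by both datasets, and those found only in the first one.
shared <- intersect(markers_original, markers_imputed)
only_original <- setdiff(markers_original, markers_imputed)

length(shared)         # 0: no name matches exactly
length(only_original)  # 3: every marker would be dropped
```

If the two name vectors differ only in formatting, every marker ends up in `only_original`, which would match the "0 x 2" tibbles and the NaN above.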

I think this is a bug. Do you? If you want, I can provide you with the datasets. Just let me know.
Best,
Peter

Error in seq.default with missing_visualization

Hi Thierry,

I'm running missing_visualization on these data sets:

dat1
/// GENIND OBJECT /////////

// 264 individuals; 12,091 loci; 24,182 alleles; size: 30 Mb

// Basic content
@tab: 264 x 24182 matrix of allele counts
@loc.n.all: number of alleles per locus (range: 2-2)
@loc.fac: locus factor for the 24182 columns of @tab
@all.names: list of allele names for each locus
@ploidy: ploidy of each individual (range: 2-2)
@type: codom
@call: .local(x = x, i = i, j = j, loc = ..1, drop = drop)

// Optional content
@pop: population of each individual (group size range: 21-31)

head(strata)
  INDIVIDUALS STRATA library
1         GM1     GM     Pw1
2        GM26     GM     Pw1
3        GM40     GM     Pw1
4        GM31     GM     Pw1
5        GM15     GM     Pw1
6        GM24     GM     Pw1

My strata file has multiple "STRATA" (populations) and libraries (>10 for each).

Here is my call & output:

miss.dat1 <- missing_visualization(dat1, strata=strata)
#######################################################################
#################### grur::missing_visualization ######################
#######################################################################
Folder created:
missing_visualization_20180424@1454

Importing data
Alleles names for each markers will be converted to factors and padded with 0
Scanning for monomorphic markers...
Number of markers before = 12091
Number of monomorphic markers removed = 0

Tidy genomic data:
Number of markers: 12091
Number of chromosome/contig/scaffold: no chromosome info
Number of individuals: 264
Number of populations: 1

Informations:
Number of populations: 1
Number of individuals: 264
Number of ind/pop:
NA

Number of duplicate id: 0
Number of SNPs: 12091

Proportion of missing genotypes (overall): 0.298188

Identity-by-missingness (IBM) analysis using
Principal Coordinate Analysis (PCoA)...
Generating Identity by missingness plot
Error in seq.default(h[1], h[2], length.out = n) :
'to' must be a finite number
In addition: There were 42 warnings (use warnings() to see them)

Any ideas what might be causing the seq.default error?
Thanks!
Brenna

grur::missing_visualization error

Hello, I created a filtered vcf using radiator::filter_rad(), and would like to examine missingness using grur::missing_visualization.

However, when I ran the following command:

ibm = grur::missing_visualization(data='./filter_rad_20230531@1305/13_filtered/[email protected]', strata='strata_pb.txt')

I got the following error:

Error: 'generate_id_stats' is not an exported object from 'namespace:radiator'

Is there a simple fix that I am missing?

missing_visualization() doesn't load .vcf.gz

Hi Thierry,
I am opening this issue here for grur because I wanted to document that I am getting the same error message as in the currently open issue in radiator. When I do:
miss_eu <- grur::missing_visualization( data = "UMBELLA_Erumb1_samples_gt_50pct_covered.recode.vcf.gz", strata = "strata_eu.tsv", strata.select = c("POP_ID","FLOWCELL","variety2", "MACHINE","year"), filename = "UMBELLA_erumb1_gt_90pct.RData")

I get:

#######################################################################
#################### grur::missing_visualization ######################
#######################################################################
Folder created:
missing_visualization_20181205@1819

Importing data
Show Traceback
Error in stringi::stri_replace_all_fixed(str = as.character(x), pattern = c("_", : object 'input' not found

The traceback is:

  1. stringi::stri_replace_all_fixed(str = as.character(x), pattern = c("_", ":", " "), replacement = c("-", "-", ""), vectorize_all = FALSE)
  2. radiator::clean_ind_names(input$INDIVIDUALS)
  3. radiator::tidy_genomic_data(data = data, vcf.metadata = FALSE, blacklist.id = blacklist.id, blacklist.genotype = blacklist.genotype, whitelist.markers = whitelist.markers, monomorphic.out = monomorphic.out, snp.ld = snp.ld, common.markers = common.markers, strata = strata, ...
  4. grur::missing_visualization(data = "UMBELLA_Erumb1_samples_gt_50pct_covered.recode.vcf.gz", strata = "strata_eu.tsv", strata.select = c("POP_ID", "FLOWCELL", "variety2", "MACHINE", "year"), filename = "UMBELLA_erumb1_gt_90pct.RData")

Looks like the problem originates in radiator. missing_visualization() worked fine on 16112018. Maybe this is due to commit c9ffe79 of radiator?

Package dependencies

Hi Thierry,

As I've mentioned in my other issue, grur requires quite a lot of dependencies, which in my case raised a "maximal number of DLLs reached" error when loaded with a few other packages.
It seems that the issue is with Rdynload.c in the base R code: #define MAX_NUM_DLLS 100.
In R versions >= 3.4, you can set a different maximum number of DLLs using the environment variable R_MAX_NUM_DLLS (taken from this SO thread).
From the release notes:

The maximum number of DLLs that can be loaded into R e.g. via dyn.load() can now be increased by setting the environment variable R_MAX_NUM_DLLS before starting R.
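A minimal sketch, using only base R, of how one might check how close a session is to the limit and whether the variable is set. The value 500 in the comment is an arbitrary illustration, not a recommendation.

```r
# Count the DLLs currently loaded in this R session.
n_dlls <- length(getLoadedDLLs())

# Read the limit from the environment; an empty string means the variable
# is unset and R falls back to its built-in default.
limit <- Sys.getenv("R_MAX_NUM_DLLS", unset = "")

message("DLLs loaded: ", n_dlls)
message("R_MAX_NUM_DLLS: ", if (nzchar(limit)) limit else "(unset, default applies)")

# To raise the limit, a line such as the following can be added to
# ~/.Renviron before restarting R (500 is only an illustrative value):
# R_MAX_NUM_DLLS=500
```

Because the variable is read when R starts, calling Sys.setenv() inside a running session should not be expected to change the limit.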

I'm pretty sure some of the packages are not entirely needed, such as maps, mapdata, and non-standard tidyverse packages, such as glue, tidyselect, purrrlyr, reshape2 (isn't it a part of tidyr now?), plyr (mostly replaced by dplyr), etc.

Thanks, Ido

Missing data analysis vignette

Hi Dr. Gosselin,
I'm trying to work through the missing data analysis vignette using the sample data and I'm running into problems. First of all, the line

writeBin(httr::content(httr::GET("http://datadryad.org/bitstream/handle/10255/dryad.97237/sticklebacks_Danish.vcf?sequence=1"), "raw"), "stickleback_ferchaud_2015.vcf")

generated an empty file. Therefore, I went to Dryad and downloaded sticklebacks_Danish.vcf instead. I also downloaded strata.stickleback.tsv for the strata. I executed this line of code

ibm <- grur::missing_visualization(data = "sticklebacks_Danish.vcf", strata = "strata.stickleback.tsv")

Here is the output where the error message occurs:

Number of duplicate id: 0
radiator Genomic Data Structure (GDS) file: [email protected]
Error in dplyr::mutate():
! Problem while computing MISSING_PROP = round(...).
Caused by error in .DynamicClusterCall():
! One of the nodes produced an error: Can not open file 'C:\Test\missing_visualization_20230105@1248\[email protected]'. The process cannot access the file because it is being used by another process.
Run rlang::last_error() to see where the error occurred.

rlang::last_error()
Backtrace:

  1. grur::missing_visualization(...)
  2. SeqArray::seqMissing(gdsfile = gds, per.variant = FALSE, parallel = parallel.core)
  3. SeqArray::seqParallel(...)
  4. SeqArray::seqParallel(...)
  5. SeqArray:::.DynamicClusterCall(...)
  6. base::stop("One of the nodes produced an error: ", as.character(dv))

Any help would be appreciated.
Kat
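
On the empty download reported above, a minimal sketch of a check that could catch the problem before the file is passed to grur. check_nonempty_file() is a hypothetical helper, not part of grur or radiator, and the temporary files below only stand in for a real download.

```r
# Hypothetical helper: stop early if a file is missing or has zero bytes.
check_nonempty_file <- function(path) {
  if (!file.exists(path)) {
    stop("File not found: ", path)
  }
  size <- file.size(path)
  if (is.na(size) || size == 0) {
    stop("File is empty: ", path)
  }
  invisible(size)
}

# Demonstration with temporary files standing in for a download.
empty_file <- tempfile(fileext = ".vcf")
file.create(empty_file)
ok_file <- tempfile(fileext = ".vcf")
writeLines("##fileformat=VCFv4.2", ok_file)

# The empty file triggers the error; capture its message instead of stopping.
empty_result <- tryCatch(
  check_nonempty_file(empty_file),
  error = function(e) conditionMessage(e)
)

# The non-empty file passes and its size in bytes is returned.
ok_result <- check_nonempty_file(ok_file)
```

When the file is fetched with httr, calling httr::stop_for_status() on the response before writeBin() may also help, since it turns an HTTP error into an R error rather than leaving a silently written file.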

Error in eigen(delta1) : infinite or missing values in 'x'

Hi Thierry,

Any suggestions for troubleshooting this error?

> genlight
/// GENLIGHT OBJECT /////////

// 51 genotypes, 13,936 binary SNPs, size: 1.9 Mb
200845 (28.26 %) missing data

// Basic content
@gen: list of 51 SNPbin

// Optional content
@ind.names: 51 individual labels
@loc.names: 13936 locus labels
@chromosome: factor storing chromosomes of the SNPs
@position: integer storing positions of the SNPs
@pop: population of each individual (group size range: 1-17)
@other: a list containing: elements without names

Proportion of missing genotypes (overall): 0.282587

> missing_dat <- grur::missing_visualization(genlight)
#######################################################################
#################### grur::missing_visualization ######################
#######################################################################
Folder created:
missing_visualization_20180412@1213

Importing data
Scanning for monomorphic markers...
Number of markers before = 13936
Number of monomorphic markers removed = 0

Tidy genomic data:
Number of markers: 13936
Number of chromosome/contig/scaffold: 1
Number of individuals: 51
Number of populations: 15

Informations:
Number of populations: 15
Number of individuals: 51
Number of ind/pop:
Arbon,_ID = 5
Ashton,_ID = 1
Conrad,_MT = 2
Denton,_MT = 2
Hermiston,_OR = 17
Kalispell,_MT = 1
Kimberly = 1
Kimberly,_ID = 3
McAmmon,_ID = 2
Neely,_ID = 2
Picabo,_ID = 4
Rexbug,_ID = 2
Ririe,_ID = 4
Soda_Springs,_ID = 3
Townsend,_MT = 2

Number of duplicate id: 0
Number of chrom/scaffolds: 1
Number of locus: 13936
Number of SNPs: 13936

Proportion of missing genotypes (overall): 0.282587

Identity-by-missingness (IBM) analysis using
Principal Coordinate Analysis (PCoA)...
Generating Identity by missingness plot
Error in eigen(delta1) : infinite or missing values in 'x'
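
As a troubleshooting aid, a minimal base R sketch showing that eigen() stops in this way whenever its input contains NA, NaN or Inf. The matrix m and the helper rows_with_nonfinite() are hypothetical stand-ins; this does not reproduce grur's internal computation and does not identify the cause in the data above.

```r
# A symmetric matrix with a missing entry, standing in for a
# distance-derived matrix.
m <- matrix(c(0,  1, NA,
              1,  0,  2,
              NA, 2,  0), nrow = 3, byrow = TRUE)

# eigen() refuses non-finite input; capture the message instead of stopping.
msg <- tryCatch(
  {
    eigen(m)
    "no error"
  },
  error = function(e) conditionMessage(e)
)
print(msg)

# Hypothetical helper: indices of rows containing any non-finite entry.
rows_with_nonfinite <- function(x) {
  which(rowSums(!is.finite(x)) > 0)
}

bad_rows <- rows_with_nonfinite(m)
print(bad_rows)
```

Applied to a distance matrix between individuals, a check of this kind would point to the samples whose pairwise distances could not be computed.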
