repvar

The goal of repvar is to find the minimum number of samples that will represent all variables in a data set. This was built for population genetic data, but is generalizable to any discrete data type that can be represented as an integer matrix.

Installation

This package is not currently on CRAN, but you can install it like so:

# install.packages("remotes") # or devtools or ghit
remotes::install_github("zkamvar/repvar")

Example

Here is a basic example of how you can identify the minimum set. We will use the pre-packaged monilinia data set from Everhart and Scherm 2016.

options(width = 120)

library("repvar")
data("monilinia")
dim(monilinia)
#> [1] 264  95
loci <- sapply(strsplit(colnames(monilinia), "[.]"), "[", 1)
rpv_image(monilinia, f = loci) # show the data

# Shuffle the data set 200 times to find an optimal number of samples
set.seed(2018)
id_list <- rpv_find(monilinia, n = 200, cut = TRUE, progress = FALSE)
id_list
#> [[1]]
#>  [1] "A233" "A610" "A154" "A603" "A666" "A163" "A293" "A339" "A590" "A071" "A085" "A218" "A269" "A074" "A182" "A417"
#> [17] "A681" "A176" "A366" "A489" "A216" "A172" "A488" "A406" "A390" "A039" "A010" "A016" "A692" "A129"
#> 
#> [[2]]
#>  [1] "A233" "A610" "A154" "A603" "A666" "A163" "A293" "A339" "A590" "A071" "A085" "A218" "A269" "A074" "A182" "A417"
#> [17] "A681" "A176" "A367" "A489" "A191" "A172" "A488" "A408" "A390" "A404" "A387" "A016" "A692" "A571"
#> 
#> [[3]]
#>  [1] "A233" "A610" "A154" "A603" "A666" "A163" "A293" "A339" "A590" "A071" "A085" "A218" "A269" "A074" "A182" "A417"
#> [17] "A681" "A176" "A367" "A522" "A191" "A172" "A488" "A408" "A390" "A547" "A385" "A480" "A692" "A088"
lengths(id_list)
#> [1] 30 30 30
rpv_image(monilinia, f = loci, highlight = id_list[[1]])

Here, the yellow bands highlight the IDs that we found.

Real-world example

Because you get a list of ids, it's good to see which ones are actually useful. For this, you can calculate entropy. We will use the tidyverse to first create a table of samples and data, calculate entropy for each row, and then join them together. In general, we will want higher entropy values. For this, we will load three tidyverse packages:

library("tibble")
library("tidyr")
library("dplyr")
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

# Generate and filter the possible data sets ------------------------------
set.seed(2018 - 03 - 07)
res <- rpv_find(monilinia, n = 10000, cut = TRUE, progress = FALSE) %>%
  tibble::enframe(name = "index", value = "ids") %>%            # create a data frame of of list columns
  dplyr::mutate(n = lengths(ids)) %>%                           # count the number of indices in each row
  dplyr::rowwise() %>%                                          # set the data frame to be computed by row
  dplyr::mutate(dat = list(monilinia[ids, ,drop = FALSE])) %>%  # add the original data set to each row
  dplyr::mutate(nall = sum(colSums(dat, na.rm = TRUE) > 0))     # count the number of columns

# calculate entropy for each data set -------------------------------------
# The statistics returns a data frame, but because we've embedded the data,
# we must calculate this separately and then merge it later.
entro <- res %>% 
  dplyr::mutate(e = list(rpv_stats(dat))) %>% # calculate stats for each row
  dplyr::select(index, e) %>%               # retain only stats and index
  tidyr::unnest()                           # spread out the columns

# sort by E5, and missingness. ----------------------------
res_sort <- res %>%
  dplyr::inner_join(entro, by = "index") %>%
  dplyr::arrange(-E5, missing)
res_sort
#> # A tibble: 50 x 10
#>    index ids            n dat              nall    eH     G    E5 lambda missing
#>    <int> <list>     <int> <list>          <int> <dbl> <dbl> <dbl>  <dbl>   <dbl>
#>  1    46 <chr [30]>    30 <int [30 × 95]>    95  55.9  40.6 0.721  0.978 0.00877
#>  2    12 <chr [30]>    30 <int [30 × 95]>    95  55.7  40.4 0.721  0.978 0.00877
#>  3    28 <chr [30]>    30 <int [30 × 95]>    95  55.8  40.4 0.719  0.978 0.00877
#>  4    26 <chr [30]>    30 <int [30 × 95]>    95  55.5  40.2 0.719  0.978 0.00772
#>  5    17 <chr [30]>    30 <int [30 × 95]>    95  55.4  40.1 0.718  0.978 0.00772
#>  6    24 <chr [30]>    30 <int [30 × 95]>    95  55.4  40.1 0.718  0.978 0.00772
#>  7    41 <chr [30]>    30 <int [30 × 95]>    95  55.1  39.8 0.718  0.977 0.00772
#>  8    23 <chr [30]>    30 <int [30 × 95]>    95  55.1  39.8 0.717  0.977 0.00772
#>  9    48 <chr [30]>    30 <int [30 × 95]>    95  55.5  40.1 0.717  0.978 0.00596
#> 10    27 <chr [30]>    30 <int [30 × 95]>    95  55.3  39.9 0.717  0.977 0.00772
#> # ... with 40 more rows

From here, we can see the samples

cat(res_sort$ids[[1]], sep = ", ")
#> A233, A610, A154, A603, A666, A163, A293, A339, A590, A071, A085, A218, A269, A074, A182, A417, A681, A176, A367, A489, A191, A172, A488, A406, A390, A404, A385, A016, A692, A168

We can use this to then visualize the distributon of the sub-sampled data:

mc <- colSums(monilinia, na.rm = TRUE)
barplot(sort(mc), las = 3)
rc <- colSums(res_sort$dat[[1]], na.rm = TRUE)
barplot(rc[order(mc)], xaxt = "n", add = TRUE, col = "blue")
legend("topleft", fill = c("grey", "blue"), legend = c("full data", "subsampled"))

Calculating rpv_stats by groups

We can group our variables as well.

f <- gsub("[.][0-9]+", "", colnames(monilinia))
f <- factor(f, unique(f))
entromean <- res %>% 
  dplyr::mutate(e = list(rpv_stats(dat, f = f))) %>% # calculate stats for each row
  dplyr::select(index, e) %>%               # retain only stats and index
  tidyr::unnest() %>%                       # spread out the columns
  dplyr::group_by(index) %>%
  dplyr::summarize_if(is.numeric, mean)     # calculate mean over all loci

# sort by E5, and missingness. ----------------------------
resmean_sort <- res %>%
  dplyr::inner_join(entromean, by = "index") %>%
  dplyr::arrange(-E5, missing)
resmean_sort
#> # A tibble: 50 x 10
#>    index ids            n dat              nall    eH     G    E5 lambda missing
#>    <int> <list>     <int> <list>          <int> <dbl> <dbl> <dbl>  <dbl>   <dbl>
#>  1    46 <chr [30]>    30 <int [30 × 95]>    95  4.67  3.60 0.716  0.702 0.0103 
#>  2    36 <chr [30]>    30 <int [30 × 95]>    95  4.64  3.62 0.716  0.698 0.00513
#>  3    26 <chr [30]>    30 <int [30 × 95]>    95  4.64  3.59 0.715  0.699 0.00769
#>  4    28 <chr [30]>    30 <int [30 × 95]>    95  4.67  3.62 0.715  0.700 0.0103 
#>  5    17 <chr [30]>    30 <int [30 × 95]>    95  4.62  3.60 0.714  0.698 0.00769
#>  6    27 <chr [30]>    30 <int [30 × 95]>    95  4.63  3.60 0.713  0.697 0.00769
#>  7    24 <chr [30]>    30 <int [30 × 95]>    95  4.63  3.59 0.713  0.698 0.00769
#>  8     3 <chr [30]>    30 <int [30 × 95]>    95  4.62  3.59 0.713  0.696 0.00769
#>  9    20 <chr [30]>    30 <int [30 × 95]>    95  4.62  3.60 0.713  0.696 0.00513
#> 10    22 <chr [30]>    30 <int [30 × 95]>    95  4.61  3.57 0.712  0.694 0.00769
#> # ... with 40 more rows

The factor can be used to color our barplot

rd1 <- colSums(resmean_sort$dat[[1]], na.rm = TRUE) 
barplot(rd1[order(mc)], las = 3, col = f[order(mc)])

Aaaand it doesn't work -_-

library("repvar")
set.seed(2018)
x <- replicate(200, sample(nrow(monilinia)))[, 75]
colSums(monilinia[get_minimum_set(monilinia[x, ]), ], na.rm = TRUE)
#>  CHMFc4.224  CHMFc4.231  CHMFc4.238   CHMFc5.85   CHMFc5.97  CHMFc5.111 
#>          10          17           3           4          25           0 
#>  CHMFc5.105  CHMFc5.107  CHMFc5.113 CHMFc12.163 CHMFc12.159 CHMFc12.169 
#>           1           0           0           9          20           0 
#>     SEA.132     SEA.156     SEA.160     SEA.128     SEA.136     SEA.152 
#>          17           5           4           0           1           1 
#>     SEA.144     SEA.140     SEA.164     SEA.176     SEA.148     SED.133 
#>           0           2           0           0           0          14 
#>     SED.129     SED.145     SED.109     SED.137     SED.141     SED.149 
#>           4           3           1           4           4           0 
#>     SED.125     SED.187     SEE.156     SEE.148     SEE.160     SEE.152 
#>           0           0          28           2           0           0 
#>     SEE.164     SEG.144     SEG.126     SEG.138     SEG.120     SEG.154 
#>           0          14           5          10           0           0 
#>     SEG.149     SEG.132     SEG.159     SEI.116     SEI.113     SEI.119 
#>           0           0           0          17           6           5 
#>     SEI.122     SEI.110     SEI.103     SEI.100     SEL.143     SEL.147 
#>           0           1           0           1          12          11 
#>     SEL.139     SEL.151     SEL.135     SEN.227     SEN.231     SEN.235 
#>           5           1           0          10           4           6 
#>     SEN.223     SEN.219     SEN.273     SEN.256     SEN.215     SEN.248 
#>           7           2           0           0           1           0 
#>     SEP.257     SEP.261     SEP.273     SEP.237     SEP.265     SEP.247 
#>          10          14           0           0           3           2 
#>     SEP.245     SEP.254     SEQ.142     SEQ.136     SEQ.134     SEQ.138 
#>           0           0           2           6          10           2 
#>     SEQ.152     SEQ.140     SEQ.132     SEQ.154     SEQ.146     SEQ.144 
#>           4           1           2           1           0           1 
#>     SEQ.130     SER.145     SER.153     SER.149     SER.141     SER.137 
#>           0           4           2          15           4           2 
#>     SER.157     SER.165     SER.133     SER.182     SER.147 
#>           1           0           2           0           0
all(colSums(monilinia[get_minimum_set(monilinia[x, ]), ], na.rm = TRUE) > 0)
#> [1] FALSE

Session info

devtools::session_info()
#> Session info -------------------------------------------------------------
#>  setting  value                       
#>  version  R version 3.4.3 (2017-11-30)
#>  system   x86_64, darwin15.6.0        
#>  ui       X11                         
#>  language (EN)                        
#>  collate  en_US.UTF-8                 
#>  tz       America/Chicago             
#>  date     2018-02-16
#> Packages -----------------------------------------------------------------
#>  package   * version    date       source                          
#>  backports   1.1.2      2017-12-13 CRAN (R 3.4.3)                  
#>  base      * 3.4.3      2017-12-07 local                           
#>  compiler    3.4.3      2017-12-07 local                           
#>  datasets  * 3.4.3      2017-12-07 local                           
#>  devtools    1.13.4     2017-11-09 CRAN (R 3.4.2)                  
#>  digest      0.6.15     2018-01-28 cran (@0.6.15)                  
#>  evaluate    0.10.1     2017-06-24 CRAN (R 3.4.1)                  
#>  formatR     1.5        2017-04-25 CRAN (R 3.4.0)                  
#>  graphics  * 3.4.3      2017-12-07 local                           
#>  grDevices * 3.4.3      2017-12-07 local                           
#>  htmltools   0.3.6      2017-04-28 CRAN (R 3.4.0)                  
#>  knitr       1.19       2018-01-29 CRAN (R 3.4.3)                  
#>  magrittr    1.5        2014-11-22 CRAN (R 3.4.0)                  
#>  memoise     1.1.0      2017-04-21 CRAN (R 3.4.0)                  
#>  methods   * 3.4.3      2017-12-07 local                           
#>  Rcpp        0.12.15    2018-01-20 cran (@0.12.15)                 
#>  repvar    * 0.1.0      2018-02-16 local                           
#>  rmarkdown   1.8        2017-11-17 cran (@1.8)                     
#>  rprojroot   1.3-2      2018-01-03 CRAN (R 3.4.3)                  
#>  stats     * 3.4.3      2017-12-07 local                           
#>  stringi     1.1.6      2017-11-17 CRAN (R 3.4.2)                  
#>  stringr     1.2.0      2017-02-18 CRAN (R 3.4.0)                  
#>  tools       3.4.3      2017-12-07 local                           
#>  utils     * 3.4.3      2017-12-07 local                           
#>  withr       2.1.1.9000 2018-01-09 Github (jimhester/withr@df18523)
#>  yaml        2.1.16     2017-12-12 CRAN (R 3.4.3)

zkamvar / repvar Goto Github PK

repvar's Introduction

repvar

Installation

Example

Real-world example

Calculating rpv_stats by groups

repvar's People

Contributors

Watchers

repvar's Issues

Aaaand it doesn't work -_-

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent

Jobs