GithubHelp home page GithubHelp logo

zkamvar / repvar Goto Github PK

View Code? Open in Web Editor NEW
0.0 2.0 0.0 2.9 MB

R package to find samples that represent all variables

Home Page: https://zkamvar.github.io/repvar

License: Other

R 98.04% Makefile 1.96%
r genetics microsatellite alleles

repvar's Introduction

repvar

lifecycle Travis build status Coverage status

The goal of repvar is to find the minimum number of samples that will represent all variables in a data set. This was built for population genetic data, but is generalizable to any discrete data type that can be represented as an integer matrix.

Installation

This package is not currently on CRAN, but you can install it like so:

# install.packages("remotes") # or devtools or ghit
remotes::install_github("zkamvar/repvar")

Example

Here is a basic example of how you can identify the minimum set. We will use the pre-packaged monilinia data set from Everhart and Scherm 2016.

options(width = 120)

library("repvar")
data("monilinia")
dim(monilinia)
#> [1] 264  95
loci <- sapply(strsplit(colnames(monilinia), "[.]"), "[", 1)
rpv_image(monilinia, f = loci) # show the data

# Shuffle the data set 200 times to find an optimal number of samples
set.seed(2018)
id_list <- rpv_find(monilinia, n = 200, cut = TRUE, progress = FALSE)
id_list
#> [[1]]
#>  [1] "A233" "A610" "A154" "A603" "A666" "A163" "A293" "A339" "A590" "A071" "A085" "A218" "A269" "A074" "A182" "A417"
#> [17] "A681" "A176" "A366" "A489" "A216" "A172" "A488" "A406" "A390" "A039" "A010" "A016" "A692" "A129"
#> 
#> [[2]]
#>  [1] "A233" "A610" "A154" "A603" "A666" "A163" "A293" "A339" "A590" "A071" "A085" "A218" "A269" "A074" "A182" "A417"
#> [17] "A681" "A176" "A367" "A489" "A191" "A172" "A488" "A408" "A390" "A404" "A387" "A016" "A692" "A571"
#> 
#> [[3]]
#>  [1] "A233" "A610" "A154" "A603" "A666" "A163" "A293" "A339" "A590" "A071" "A085" "A218" "A269" "A074" "A182" "A417"
#> [17] "A681" "A176" "A367" "A522" "A191" "A172" "A488" "A408" "A390" "A547" "A385" "A480" "A692" "A088"
lengths(id_list)
#> [1] 30 30 30
rpv_image(monilinia, f = loci, highlight = id_list[[1]])

Here, the yellow bands highlight the IDs that we found.

Real-world example

Because you get a list of ids, it's good to see which ones are actually useful. For this, you can calculate entropy. We will use the tidyverse to first create a table of samples and data, calculate entropy for each row, and then join them together. In general, we will want higher entropy values. For this, we will load three tidyverse packages:

library("tibble")
library("tidyr")
library("dplyr")
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
# Generate and filter the possible data sets ------------------------------
set.seed(2018 - 03 - 07)
res <- rpv_find(monilinia, n = 10000, cut = TRUE, progress = FALSE) %>%
  tibble::enframe(name = "index", value = "ids") %>%            # create a data frame of of list columns
  dplyr::mutate(n = lengths(ids)) %>%                           # count the number of indices in each row
  dplyr::rowwise() %>%                                          # set the data frame to be computed by row
  dplyr::mutate(dat = list(monilinia[ids, ,drop = FALSE])) %>%  # add the original data set to each row
  dplyr::mutate(nall = sum(colSums(dat, na.rm = TRUE) > 0))     # count the number of columns

# calculate entropy for each data set -------------------------------------
# The statistics returns a data frame, but because we've embedded the data,
# we must calculate this separately and then merge it later.
entro <- res %>% 
  dplyr::mutate(e = list(rpv_stats(dat))) %>% # calculate stats for each row
  dplyr::select(index, e) %>%               # retain only stats and index
  tidyr::unnest()                           # spread out the columns

# sort by E5, and missingness. ----------------------------
res_sort <- res %>%
  dplyr::inner_join(entro, by = "index") %>%
  dplyr::arrange(-E5, missing)
res_sort
#> # A tibble: 50 x 10
#>    index ids            n dat              nall    eH     G    E5 lambda missing
#>    <int> <list>     <int> <list>          <int> <dbl> <dbl> <dbl>  <dbl>   <dbl>
#>  1    46 <chr [30]>    30 <int [30 × 95]>    95  55.9  40.6 0.721  0.978 0.00877
#>  2    12 <chr [30]>    30 <int [30 × 95]>    95  55.7  40.4 0.721  0.978 0.00877
#>  3    28 <chr [30]>    30 <int [30 × 95]>    95  55.8  40.4 0.719  0.978 0.00877
#>  4    26 <chr [30]>    30 <int [30 × 95]>    95  55.5  40.2 0.719  0.978 0.00772
#>  5    17 <chr [30]>    30 <int [30 × 95]>    95  55.4  40.1 0.718  0.978 0.00772
#>  6    24 <chr [30]>    30 <int [30 × 95]>    95  55.4  40.1 0.718  0.978 0.00772
#>  7    41 <chr [30]>    30 <int [30 × 95]>    95  55.1  39.8 0.718  0.977 0.00772
#>  8    23 <chr [30]>    30 <int [30 × 95]>    95  55.1  39.8 0.717  0.977 0.00772
#>  9    48 <chr [30]>    30 <int [30 × 95]>    95  55.5  40.1 0.717  0.978 0.00596
#> 10    27 <chr [30]>    30 <int [30 × 95]>    95  55.3  39.9 0.717  0.977 0.00772
#> # ... with 40 more rows

From here, we can see the samples

cat(res_sort$ids[[1]], sep = ", ")
#> A233, A610, A154, A603, A666, A163, A293, A339, A590, A071, A085, A218, A269, A074, A182, A417, A681, A176, A367, A489, A191, A172, A488, A406, A390, A404, A385, A016, A692, A168

We can use this to then visualize the distributon of the sub-sampled data:

mc <- colSums(monilinia, na.rm = TRUE)
barplot(sort(mc), las = 3)
rc <- colSums(res_sort$dat[[1]], na.rm = TRUE)
barplot(rc[order(mc)], xaxt = "n", add = TRUE, col = "blue")
legend("topleft", fill = c("grey", "blue"), legend = c("full data", "subsampled"))

Calculating rpv_stats by groups

We can group our variables as well.

f <- gsub("[.][0-9]+", "", colnames(monilinia))
f <- factor(f, unique(f))
entromean <- res %>% 
  dplyr::mutate(e = list(rpv_stats(dat, f = f))) %>% # calculate stats for each row
  dplyr::select(index, e) %>%               # retain only stats and index
  tidyr::unnest() %>%                       # spread out the columns
  dplyr::group_by(index) %>%
  dplyr::summarize_if(is.numeric, mean)     # calculate mean over all loci

# sort by E5, and missingness. ----------------------------
resmean_sort <- res %>%
  dplyr::inner_join(entromean, by = "index") %>%
  dplyr::arrange(-E5, missing)
resmean_sort
#> # A tibble: 50 x 10
#>    index ids            n dat              nall    eH     G    E5 lambda missing
#>    <int> <list>     <int> <list>          <int> <dbl> <dbl> <dbl>  <dbl>   <dbl>
#>  1    46 <chr [30]>    30 <int [30 × 95]>    95  4.67  3.60 0.716  0.702 0.0103 
#>  2    36 <chr [30]>    30 <int [30 × 95]>    95  4.64  3.62 0.716  0.698 0.00513
#>  3    26 <chr [30]>    30 <int [30 × 95]>    95  4.64  3.59 0.715  0.699 0.00769
#>  4    28 <chr [30]>    30 <int [30 × 95]>    95  4.67  3.62 0.715  0.700 0.0103 
#>  5    17 <chr [30]>    30 <int [30 × 95]>    95  4.62  3.60 0.714  0.698 0.00769
#>  6    27 <chr [30]>    30 <int [30 × 95]>    95  4.63  3.60 0.713  0.697 0.00769
#>  7    24 <chr [30]>    30 <int [30 × 95]>    95  4.63  3.59 0.713  0.698 0.00769
#>  8     3 <chr [30]>    30 <int [30 × 95]>    95  4.62  3.59 0.713  0.696 0.00769
#>  9    20 <chr [30]>    30 <int [30 × 95]>    95  4.62  3.60 0.713  0.696 0.00513
#> 10    22 <chr [30]>    30 <int [30 × 95]>    95  4.61  3.57 0.712  0.694 0.00769
#> # ... with 40 more rows

The factor can be used to color our barplot

rd1 <- colSums(resmean_sort$dat[[1]], na.rm = TRUE) 
barplot(rd1[order(mc)], las = 3, col = f[order(mc)])

repvar's People

Contributors

zkamvar avatar

Watchers

 avatar  avatar

repvar's Issues

Aaaand it doesn't work -_-

library("repvar")
set.seed(2018)
x <- replicate(200, sample(nrow(monilinia)))[, 75]
colSums(monilinia[get_minimum_set(monilinia[x, ]), ], na.rm = TRUE)
#>  CHMFc4.224  CHMFc4.231  CHMFc4.238   CHMFc5.85   CHMFc5.97  CHMFc5.111 
#>          10          17           3           4          25           0 
#>  CHMFc5.105  CHMFc5.107  CHMFc5.113 CHMFc12.163 CHMFc12.159 CHMFc12.169 
#>           1           0           0           9          20           0 
#>     SEA.132     SEA.156     SEA.160     SEA.128     SEA.136     SEA.152 
#>          17           5           4           0           1           1 
#>     SEA.144     SEA.140     SEA.164     SEA.176     SEA.148     SED.133 
#>           0           2           0           0           0          14 
#>     SED.129     SED.145     SED.109     SED.137     SED.141     SED.149 
#>           4           3           1           4           4           0 
#>     SED.125     SED.187     SEE.156     SEE.148     SEE.160     SEE.152 
#>           0           0          28           2           0           0 
#>     SEE.164     SEG.144     SEG.126     SEG.138     SEG.120     SEG.154 
#>           0          14           5          10           0           0 
#>     SEG.149     SEG.132     SEG.159     SEI.116     SEI.113     SEI.119 
#>           0           0           0          17           6           5 
#>     SEI.122     SEI.110     SEI.103     SEI.100     SEL.143     SEL.147 
#>           0           1           0           1          12          11 
#>     SEL.139     SEL.151     SEL.135     SEN.227     SEN.231     SEN.235 
#>           5           1           0          10           4           6 
#>     SEN.223     SEN.219     SEN.273     SEN.256     SEN.215     SEN.248 
#>           7           2           0           0           1           0 
#>     SEP.257     SEP.261     SEP.273     SEP.237     SEP.265     SEP.247 
#>          10          14           0           0           3           2 
#>     SEP.245     SEP.254     SEQ.142     SEQ.136     SEQ.134     SEQ.138 
#>           0           0           2           6          10           2 
#>     SEQ.152     SEQ.140     SEQ.132     SEQ.154     SEQ.146     SEQ.144 
#>           4           1           2           1           0           1 
#>     SEQ.130     SER.145     SER.153     SER.149     SER.141     SER.137 
#>           0           4           2          15           4           2 
#>     SER.157     SER.165     SER.133     SER.182     SER.147 
#>           1           0           2           0           0
all(colSums(monilinia[get_minimum_set(monilinia[x, ]), ], na.rm = TRUE) > 0)
#> [1] FALSE
Session info
devtools::session_info()
#> Session info -------------------------------------------------------------
#>  setting  value                       
#>  version  R version 3.4.3 (2017-11-30)
#>  system   x86_64, darwin15.6.0        
#>  ui       X11                         
#>  language (EN)                        
#>  collate  en_US.UTF-8                 
#>  tz       America/Chicago             
#>  date     2018-02-16
#> Packages -----------------------------------------------------------------
#>  package   * version    date       source                          
#>  backports   1.1.2      2017-12-13 CRAN (R 3.4.3)                  
#>  base      * 3.4.3      2017-12-07 local                           
#>  compiler    3.4.3      2017-12-07 local                           
#>  datasets  * 3.4.3      2017-12-07 local                           
#>  devtools    1.13.4     2017-11-09 CRAN (R 3.4.2)                  
#>  digest      0.6.15     2018-01-28 cran (@0.6.15)                  
#>  evaluate    0.10.1     2017-06-24 CRAN (R 3.4.1)                  
#>  formatR     1.5        2017-04-25 CRAN (R 3.4.0)                  
#>  graphics  * 3.4.3      2017-12-07 local                           
#>  grDevices * 3.4.3      2017-12-07 local                           
#>  htmltools   0.3.6      2017-04-28 CRAN (R 3.4.0)                  
#>  knitr       1.19       2018-01-29 CRAN (R 3.4.3)                  
#>  magrittr    1.5        2014-11-22 CRAN (R 3.4.0)                  
#>  memoise     1.1.0      2017-04-21 CRAN (R 3.4.0)                  
#>  methods   * 3.4.3      2017-12-07 local                           
#>  Rcpp        0.12.15    2018-01-20 cran (@0.12.15)                 
#>  repvar    * 0.1.0      2018-02-16 local                           
#>  rmarkdown   1.8        2017-11-17 cran (@1.8)                     
#>  rprojroot   1.3-2      2018-01-03 CRAN (R 3.4.3)                  
#>  stats     * 3.4.3      2017-12-07 local                           
#>  stringi     1.1.6      2017-11-17 CRAN (R 3.4.2)                  
#>  stringr     1.2.0      2017-02-18 CRAN (R 3.4.0)                  
#>  tools       3.4.3      2017-12-07 local                           
#>  utils     * 3.4.3      2017-12-07 local                           
#>  withr       2.1.1.9000 2018-01-09 Github (jimhester/withr@df18523)
#>  yaml        2.1.16     2017-12-12 CRAN (R 3.4.3)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.