
polkas / micefast


R environment - fast imputations :dragon:

Home Page: https://polkas.github.io/miceFast/

R 61.90% C++ 38.10%
r package imputation rcpp rcpparmadillo cpp grouping weighting vif fast-imputations

micefast's Introduction

miceFast

Maciej Nasinski

Check the miceFast website for more details


Fast imputations under the object-oriented programming paradigm. The package also offers a few functions built to work with popular R packages such as 'data.table' or 'dplyr'. The biggest gain in time performance can be achieved for calculations where a grouping variable has to be used. A single evaluation of a quantitative model for multiple imputations is another major enhancement. A newer major improvement is one of the fastest predictive mean matching (PMM) implementations in the R world, thanks to presorting and binary search.
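To make the grouping idea concrete, here is a minimal sketch in plain C++ (standard library only). It is not the miceFast implementation: the function name fill_by_group_mean is hypothetical, and a simple group mean stands in for the real imputation models. The point it illustrates is that sorting row indices by the grouping variable once turns every group into one contiguous run, so each group can be processed in a single pass.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper (not part of miceFast): replace NaN entries of `y`
// with the mean of the observed values that share the same group id.
std::vector<double> fill_by_group_mean(const std::vector<double>& y,
                                       const std::vector<int>& group) {
  assert(y.size() == group.size());
  const std::size_t n = y.size();

  // Pre-sort row indices by group: every group becomes one contiguous run.
  std::vector<std::size_t> order(n);
  std::iota(order.begin(), order.end(), std::size_t{0});
  std::stable_sort(order.begin(), order.end(),
                   [&group](std::size_t a, std::size_t b) {
                     return group[a] < group[b];
                   });

  std::vector<double> out(y);
  std::size_t start = 0;
  while (start < n) {
    // Scan one run, accumulating the observed (non-NaN) values.
    std::size_t end = start;
    double sum = 0.0;
    std::size_t observed = 0;
    while (end < n && group[order[end]] == group[order[start]]) {
      const double v = y[order[end]];
      if (!std::isnan(v)) {
        sum += v;
        ++observed;
      }
      ++end;
    }
    // Fill the gaps of this run; a group with no observed value stays NaN.
    if (observed > 0) {
      const double mean = sum / static_cast<double>(observed);
      for (std::size_t i = start; i < end; ++i) {
        if (std::isnan(out[order[i]])) out[order[i]] = mean;
      }
    }
    start = end;
  }
  return out;
}
```

The result keeps the original row order, because only the index vector is sorted, not the data itself.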

Performance benchmarks: see the performance_validity.R file in the extdata folder.

Advanced Usage - Vignette

Installation

install.packages('miceFast')

or

# install.packages("devtools")
devtools::install_github("polkas/miceFast")

It is recommended to install an optimized BLAS library, which can be up to x100 faster:

  • Windows users: MRO MKL: https://mran.microsoft.com/download
  • Linux users: optimized BLAS (linear algebra) library: sudo apt-get install libopenblas-dev
  • macOS users: Apple vecLib BLAS:
cd /Library/Frameworks/R.framework/Resources/lib
ln -sf /System/Library/Frameworks/Accelerate.framework/Frameworks/vecLib.framework/Versions/Current/libBLAS.dylib libRblas.dylib

Quick Implementation

library(miceFast)

set.seed(1234)
data(air_miss)

# plot NA structure
upset_NA(air_miss, 6)

naive_fill_NA(air_miss)

# Check out the vignette for advanced usage
# A thorough examination is required

# Other packages - popular simple solutions
# Hmisc
data.frame(Map(function(x) Hmisc::impute(x, 'random'), air_miss))

#mice
mice::complete(mice::mice(air_miss, printFlag = FALSE))

Quick Reference Table

Function          Description
new(miceFast)     OOP instance with a bunch of methods - check out the vignette
fill_NA()         imputation - lda, lm_pred, lm_bayes, lm_noise
fill_NA_N()       multiple imputation - pmm, lm_bayes, lm_noise
VIF()             variance inflation factor
naive_fill_NA()   auto imputations
compare_imp()     comparing imputations
upset_NA()        visualize NA structure - UpSetR::upset

Summing up, miceFast offers a relevant reduction in calculation time for:

  • Linear Discriminant Analysis (around x5)
  • calculations where a grouping variable has to be used (around x10 depending on data dimensions and the number of groups, and even more than x100; compared to data.table, though, only a few times faster or even the same), because of pre-sorting by the grouping variable
  • multiple imputations, faster by around x(the number of multiple imputations), because the core of the model is evaluated only once
  • Variance inflation factors (VIF) (x5), because the unnecessary linear regression is not evaluated - only the inverse of X'X is needed
  • Predictive mean matching (PMM) (x3), because of pre-sorting and binary search (the mice algorithm was improved too).
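The "evaluate the model once" point for multiple imputations can be sketched as follows. This is an illustration under stated assumptions, not miceFast code: the names LinearFit, fit_once and impute_m_times are hypothetical, the model is a one-predictor least-squares fit, and normal noise scaled by the residual standard deviation stands in for an lm_noise-style draw. The expensive fit runs once; each of the m imputations only reuses the stored coefficients and redraws the noise.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical illustration types and names; none of them come from miceFast.
struct LinearFit {
  double intercept;
  double slope;
  double sigma;  // residual standard deviation
};

// Step 1 (done once): least squares for y = a + b * x on the complete cases.
LinearFit fit_once(const std::vector<double>& x, const std::vector<double>& y) {
  assert(x.size() == y.size() && x.size() > 2);
  const std::size_t n = x.size();
  double mean_x = 0.0, mean_y = 0.0;
  for (std::size_t i = 0; i < n; ++i) {
    mean_x += x[i];
    mean_y += y[i];
  }
  mean_x /= static_cast<double>(n);
  mean_y /= static_cast<double>(n);

  double sxx = 0.0, sxy = 0.0;
  for (std::size_t i = 0; i < n; ++i) {
    sxx += (x[i] - mean_x) * (x[i] - mean_x);
    sxy += (x[i] - mean_x) * (y[i] - mean_y);
  }
  assert(sxx > 0.0);  // x must not be constant

  LinearFit fit;
  fit.slope = sxy / sxx;
  fit.intercept = mean_y - fit.slope * mean_x;

  double rss = 0.0;
  for (std::size_t i = 0; i < n; ++i) {
    const double r = y[i] - fit.intercept - fit.slope * x[i];
    rss += r * r;
  }
  fit.sigma = std::sqrt(rss / static_cast<double>(n - 2));
  return fit;
}

// Step 2 (repeated m times): reuse the stored fit; only the noise is redrawn.
std::vector<std::vector<double>> impute_m_times(const LinearFit& fit,
                                                const std::vector<double>& x_miss,
                                                int m, unsigned seed) {
  assert(m >= 0);
  std::mt19937 gen(seed);
  std::normal_distribution<double> z(0.0, 1.0);
  std::vector<std::vector<double>> draws(
      static_cast<std::size_t>(m), std::vector<double>(x_miss.size()));
  for (std::size_t d = 0; d < draws.size(); ++d) {
    for (std::size_t i = 0; i < x_miss.size(); ++i) {
      draws[d][i] = fit.intercept + fit.slope * x_miss[i] + fit.sigma * z(gen);
    }
  }
  return draws;
}
```

With this split, the cost of m imputations is one fit plus m cheap prediction passes, which is where a speed-up roughly proportional to m would come from.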

Environment: R 4.2.1 Mac M1

If you are interested in the procedure for testing performance and validity, check the performance_validity.R file in the extdata folder.

micefast's People

Contributors: ol-oxy, polkas

micefast's Issues

Incomplete predictor fields cause NAs in prediction

Hi there,

I'm not sure if this is intended behaviour, but when I'm trying to predict a field I only get predictions when all the other fields for that record are complete (when using fill_NA or fill_NA_N). Here's my reproducible example. I'm expecting air_miss$Solar.R_imp[5] not to be NA. This gets filled when using naive_fill_NA() but your documentation suggests not to use that function:

library(miceFast)
library(data.table)
library(dplyr)

data(air_miss)

air_miss <- air_miss %>% 
  select(Ozone:Temp) %>% 
  head(10)


air_miss[, Solar.R_imp := fill_NA(.SD,
                                  model = "lm_bayes",
                                  posit_y = "Solar.R",
                                  posit_x = c("Ozone", "Wind", "Temp"))]

print(air_miss)

>     Ozone Solar.R Wind Temp Solar.R_imp
>  1:    41     190  7.4   67      190.00
>  2:    36     118  8.0   72      118.00
>  3:    12     149 12.6   74      149.00
>  4:    18     313 11.5   62      313.00
>  5:    NA      NA 14.3   56          NA
>  6:    28      NA 14.9   66    -1187.08
>  7:    23     299  8.6   65      299.00
>  8:    19      99 13.8   59       99.00
>  9:     8      19 20.1   61       19.00
> 10:    NA     194  8.6   69      194.00

naive_fill_NA(air_miss)

>        Ozone  Solar.R Wind Temp Solar.R_imp
>  1: 41.00000 190.0000  7.4   67    190.0000
>  2: 36.00000 118.0000  8.0   72    118.0000
>  3: 12.00000 149.0000 12.6   74    149.0000
>  4: 18.00000 313.0000 11.5   62    313.0000
>  5: 15.28918 144.9681 14.3   56    312.1653
>  6: 28.00000 501.6784 14.9   66  -1187.0801
>  7: 23.00000 299.0000  8.6   65    299.0000
>  8: 19.00000  99.0000 13.8   59     99.0000
>  9:  8.00000  19.0000 20.1   61     19.0000
> 10: 21.29695 194.0000  8.6   69    194.0000

Here's my session info:

- Session info -------------------------------------------------------------------------------
 setting  value                       
 version  R version 4.0.2 (2020-06-22)
 os       Windows 10 x64              
 system   x86_64, mingw32             
 ui       RStudio                     
 language (EN)                        
 collate  English_United Kingdom.1252 
 ctype    English_United Kingdom.1252 
 tz       Europe/London               
 date     2020-07-09                  

- Packages -----------------------------------------------------------------------------------
 package     * version date       lib source        
 assertthat    0.2.1   2019-03-21 [1] CRAN (R 4.0.0)
 cli           2.0.2   2020-02-28 [1] CRAN (R 4.0.0)
 codetools     0.2-16  2018-12-24 [1] CRAN (R 4.0.2)
 crayon        1.3.4   2017-09-16 [1] CRAN (R 4.0.0)
 data.table  * 1.12.8  2019-12-09 [1] CRAN (R 4.0.0)
 dplyr       * 1.0.0   2020-05-29 [1] CRAN (R 4.0.0)
 ellipsis      0.3.1   2020-05-15 [1] CRAN (R 4.0.0)
 fansi         0.4.1   2020-01-08 [1] CRAN (R 4.0.0)
 generics      0.0.2   2018-11-29 [1] CRAN (R 4.0.0)
 glue          1.4.1   2020-05-13 [1] CRAN (R 4.0.0)
 lifecycle     0.2.0   2020-03-06 [1] CRAN (R 4.0.0)
 magrittr      1.5     2014-11-22 [1] CRAN (R 4.0.0)
 miceFast    * 0.6.1   2020-07-06 [1] CRAN (R 4.0.2)
 pillar        1.4.4   2020-05-05 [1] CRAN (R 4.0.0)
 pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.0.0)
 purrr         0.3.4   2020-04-17 [1] CRAN (R 4.0.0)
 R6            2.4.1   2019-11-12 [1] CRAN (R 4.0.0)
 Rcpp          1.0.5   2020-07-06 [1] CRAN (R 4.0.2)
 rlang         0.4.6   2020-05-02 [1] CRAN (R 4.0.0)
 rstudioapi    0.11    2020-02-07 [1] CRAN (R 4.0.0)
 sessioninfo   1.1.1   2018-11-05 [1] CRAN (R 4.0.2)
 tibble        3.0.1   2020-04-20 [1] CRAN (R 4.0.0)
 tidyselect    1.1.0   2020-05-11 [1] CRAN (R 4.0.0)
 vctrs         0.3.1   2020-06-05 [1] CRAN (R 4.0.0)
 withr         2.2.0   2020-04-20 [1] CRAN (R 4.0.0)

Any help would be great.
Thank you

Plot for NA values

Plot the NA values of every observation at once; for big data sets, plot a sampled fraction.
Also allow plotting the whole data set.

Lack of an 'auto' function

Many users expect a less accurate but easier-to-implement solution. auto_fill_NA could be a well-suited proposal.


Faster implementation of PMM, even more than 1000x faster than the mice solution

The vector is pre-sorted and then searched with a binary search.


#include <Rcpp.h>
#include <algorithm>
using namespace Rcpp;

// Binary search in the sorted range arr[low..high]: returns the index of the
// last element <= x, or low when every element is greater than x.
int findCrossOver(const NumericVector& arr, int low, int high, double x)
{
  if (arr[high] <= x) // x is greater than or equal to all elements
    return high;
  if (arr[low] > x) // x is smaller than all elements
    return low;

  // Find the middle point (written this way to avoid integer overflow)
  int mid = low + (high - low) / 2;

  // Crossover found: arr[mid] <= x < arr[mid + 1]
  if (arr[mid] <= x && arr[mid + 1] > x)
    return mid;

  // arr[mid + 1] <= x as well, so the crossover lies in arr[mid + 1..high]
  if (arr[mid] <= x)
    return findCrossOver(arr, mid + 1, high, x);

  // arr[mid] > x, so the crossover lies in arr[low..mid - 1]
  return findCrossOver(arr, low, mid - 1, x);
}

// Returns one of the k values of the sorted vector arr closest to x,
// chosen at random.
double Kclosestrand(const NumericVector& arr, double x, int k)
{
  int n = arr.size();
  // Find the crossover point
  int l = findCrossOver(arr, 0, n - 1, x);
  // Right index starts just after the crossover; starting at l itself
  // would let the crossover element be collected twice
  int r = l + 1;
  int count = 0; // number of neighbours collected so far
  NumericVector resus(k);

  // Compare elements on the left and right of the crossover
  // point to find the k closest elements
  while (l >= 0 && r < n && count < k)
  {
    if (x - arr[l] < arr[r] - x)
      resus[count] = arr[l--];
    else
      resus[count] = arr[r++];
    count++;
  }

  // No more elements on the right side: take the left elements
  while (count < k && l >= 0)
  {
    resus[count] = arr[l--];
    count++;
  }

  // No more elements on the left side: take the right elements
  while (count < k && r < n)
  {
    resus[count] = arr[r++];
    count++;
  }

  // Draw with R's RNG (instead of rand()) so that set.seed() applies
  int goal = std::min(count - 1,
                      static_cast<int>(R::runif(0.0, static_cast<double>(count))));

  return resus[goal];
}

// [[Rcpp::export]]
NumericVector neibo(NumericVector y, NumericVector miss, int k) {
  int n_y = y.size();
  if (n_y == 0) stop("y must contain at least one value");
  k = (k <= n_y) ? k : n_y;
  k = (k >= 1) ? k : 1;

  // Sort a copy so that the caller's vector is left untouched
  NumericVector y_new = clone(y);
  std::sort(y_new.begin(), y_new.end());

  int n_miss = miss.size();
  NumericVector resus(n_miss);

  for (int i = 0; i < n_miss; i++) {
    resus[i] = Kclosestrand(y_new, miss[i], k);
  }

  return resus;
}

/* Driver program to check above functions */

/*** R

vals = rnorm(100)

ss = rnorm(100)

neibo(vals,ss,2)[1:10]

vals[mice:::matcher(vals,ss,2)][1:10]

microbenchmark::microbenchmark(neibo(vals,ss,2),
                               
                               mice:::matcher(vals,ss,2)
)

vals = rnorm(10000)

ss = rnorm(1000)

neibo(vals,ss,2)[1:10]

vals[mice:::matcher(vals,ss,2)][1:10]

microbenchmark::microbenchmark(neibo(vals,ss,2),
                               
                               mice:::matcher(vals,ss,2)
)


vals = rnorm(10000)

ss = rnorm(1000)

neibo(vals,ss,200)[1:10]

vals[mice:::matcher(vals,ss,200)][1:10]

microbenchmark::microbenchmark(neibo(vals,ss,200),
                               
                               mice:::matcher(vals,ss,200)
)


vals = 1:10000

ss = 1:100

neibo(vals,ss,2)[1:10]

vals[mice:::matcher(vals,ss,2)][1:10]

microbenchmark::microbenchmark(neibo(vals,ss,2),
                               
                               mice:::matcher(vals,ss,2)
)
*/
