boennecd / parglm

R package that provides a parallel estimation method for generalized linear models

R 28.62% C++ 70.87% C 0.51%
generalized-linear-models parallel-computing

Contributors: boennecd, eddelbuettel

parglm's Issues

adding control= parameter turns off nthreads= when nthreads= gets passed directly to parglm()

hi, hopefully i'm reading my windows task manager correctly.. not sure if you care about this user error, but it tripped me up and i finally realized why my processes were running so slowly

x <- mtcars[ rep( seq( nrow( mtcars ) ) , 100000 ) , ]

library(parglm)

# does multithread
system.time( parglm::parglm( mpg ~ wt + qsec + cyl + disp + hp + drat + gear + am + vs , data = x , nthreads = 16 ) )

# does not multithread
system.time( parglm::parglm( mpg ~ wt + qsec + cyl + disp + hp + drat + gear + am + vs , data = x , nthreads = 16 , control = list(maxit = 25) ) )

# does multithread
system.time( parglm::parglm( mpg ~ wt + qsec + cyl + disp + hp + drat + gear + am + vs , data = x , control = list(maxit = 25, nthreads = 16) ) )
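
A possible explanation for the three timings above, sketched with a toy function rather than parglm itself. It assumes parglm() follows glm()'s signature convention of `control = list(...)`, which this thread does not confirm. `toy_fit` and its defaults are hypothetical names used only for illustration.

```r
# Toy stand-in (hypothetical) for a fitting function whose `control`
# argument defaults to list(...), as in stats::glm().
toy_fit <- function(..., control = list(...)) {
  defaults <- list(maxit = 25L, nthreads = 1L)
  # entries supplied in `control` override the defaults
  modifyList(defaults, control)
}

# nthreads passed directly: it reaches `control` through the default list(...)
direct <- toy_fit(nthreads = 16)$nthreads

# explicit control= replaces the default, so the direct nthreads is dropped
overridden <- toy_fit(nthreads = 16, control = list(maxit = 25))$nthreads

# nthreads inside control= is honoured
inside <- toy_fit(control = list(maxit = 25, nthreads = 16))$nthreads

print(c(direct = direct, overridden = overridden, inside = inside))
```

Under that assumption, supplying control= explicitly means any nthreads= passed at the top level never reaches the control list, which would match the single-threaded second call.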

Could qr return more closely match glm?

Feature request rather than a bug!

As noted in the documentation, the object returned by parglm is largely identical to the one returned by base glm apart from the qr element, which means some uses of the predict() function fail.

Is it possible to change the returned list so the qr element permits use of predict?

Dependent Variable as Factor Crashes Model

Fitting a parglm model whose response variable is a factor fails with the following error:

Error in Summary.factor(c(1= NaN,2= NaN,3= NaN,4 = NaN, :
  ‘sum’ not meaningful for factors
In addition: Warning messages:
1: In Ops.factor(y, mu) : ‘-’ not meaningful for factors
2: In Ops.factor(weights, y) : ‘*’ not meaningful for factors

Running the same model with speedglm or glm doesn't return an error:

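library(parglm)
library(speedglm)

Dataset <- data.frame(
  y = c(0,1,1,0,0,0,0,0,1),
  lot1 = c(118,58,42,35,27,25,21,19,18),
  lot2 = c(69,35,26,21,18,16,13,12,12))

# a factor response triggers the error in parglm
Dataset$y <- factor(Dataset$y)

# note: parglm.control(nthreads = 4L), not parglm.control((nthreads = 4L)),
# which would pass 4L positionally as the first argument instead
parglm(y ~ . , data = Dataset, family = "binomial", control = parglm.control(nthreads = 4L))
speedglm(y ~ . , data = Dataset, family = binomial())
glm(y ~ . , data = Dataset, family = "binomial")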
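
Until factor responses are handled, one workaround might be to recode the factor as 0/1 before fitting. The sketch below uses base glm only, to show that the recoding gives the same coefficients as the factor response. It relies on the documented glm convention that the first factor level is treated as failure and all others as success. Whether parglm then accepts the numeric column is an assumption that is not tested here. `y01` is a name introduced for illustration.

```r
Dataset <- data.frame(
  y = c(0,1,1,0,0,0,0,0,1),
  lot1 = c(118,58,42,35,27,25,21,19,18),
  lot2 = c(69,35,26,21,18,16,13,12,12))

y_factor <- factor(Dataset$y)

# binomial glm convention: first level = failure (0), any other level = success (1)
Dataset$y01 <- as.integer(y_factor != levels(y_factor)[1L])
Dataset$y <- y_factor

fit_factor  <- glm(y   ~ lot1 + lot2, data = Dataset, family = "binomial")
fit_numeric <- glm(y01 ~ lot1 + lot2, data = Dataset, family = "binomial")

# both fits use the same 0/1 response internally
same_coefs <- isTRUE(all.equal(coef(fit_factor), coef(fit_numeric)))
print(same_coefs)
```

The numeric column y01 could then be used as the response in the parglm call in place of the factor.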

lots of variables causes crash (not error)

hi, the crashing seems inconsistent.. i've gotten this instead of the crash sometimes

error: chol(): decomposition failed
Error in parallelglm(X = x, Ys = y, family = paste0(family$family, "_",  :
  chol(): decomposition failed

here's the code to reproduce the crash, maybe run it a few times and see if it crashes locally for you?

library(parglm)

sessionInfo()

variables <- 10000

x <- data.frame( a0 = sample( 1:1000 , 1000 ) )

for( i in seq( variables ) ) x[ , paste0( 'a' , i ) ] <- sample( 1:1000 , 1000 )
this_formula <- as.formula( paste0( 'a0 ~ ' , paste0( "a" , seq( variables ) , collapse = " + " ) ) )
parglm( this_formula , data = x , nthreads = parallel::detectCores() )

command window output, you can see the crash (not error) when it just returns to the C:\Users\anthonyd> prompt at the very bottom

Microsoft Windows [Version 10.0.17134.619]
(c) 2018 Microsoft Corporation. All rights reserved.

C:\Users\anthonyd>"c:\Program Files\r\R-3.6.0\bin\x64\Rterm.exe"

R version 3.6.0 (2019-04-26) -- "Planting of a Tree"
Copyright (C) 2019 The R Foundation for Statistical Computing
Platform: x86_64-w64-mingw32/x64 (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

>
>
>
>
> library(parglm)
>
> sessionInfo()
R version 3.6.0 (2019-04-26)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 17134)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] parglm_0.1.3

loaded via a namespace (and not attached):
[1] compiler_3.6.0  Matrix_1.2-17   Rcpp_1.0.1      grid_3.6.0
[5] lattice_0.20-38
>
>
>
> variables <- 10000
>
> x <- data.frame( a0 = sample( 1:1000 , 1000 ) )
>
> for( i in seq( variables ) ) x[ , paste0( 'a' , i ) ] <- sample( 1:1000 , 10$
> this_formula <- as.formula( paste0( 'a0 ~ ' , paste0( "a" , seq( variables )$
> parglm( this_formula , data = x , nthreads = parallel::detectCores() )

C:\Users\anthonyd>

library must be loaded for parglm to work

hi, not sure if you care that users must always load the library..

fresh R session:

# works
glm( mpg ~ gear , data = mtcars )

# fails
parglm::parglm( mpg ~ gear , data = mtcars )
# Error in glm(formula = mpg ~ gear, data = mtcars, method = parglm.fit,  : 
  # object 'parglm.fit' not found
  
library(parglm)

# works
parglm( mpg ~ gear , data = mtcars )

thanks!

dimnames crash at 125 independent variables?

hi, here's code to reproduce an error on my windows laptop that works with glm() but fails with parglm()

library(parglm)

sessionInfo()

for( variables in 1:500 ){

	x <- data.frame( a0 = sample( 1:1000 , 1000 ) )

	for( i in seq( variables ) ) x[ , paste0( 'a' , i ) ] <- sample( 1:1000 , 1000 )
	this_formula <- as.formula( paste0( 'a0 ~ ' , paste0( "a" , seq( variables ) , collapse = " + " ) ) )
	glm( this_formula , data = x )
	parglm( this_formula , data = x , nthreads = parallel::detectCores() )

}

# Error in dimnames(Rmat) <- list(xnames, xnames) : 
  # length of 'dimnames' [1] not equal to array extent

print( variables )
# [1] 125

in case it's helpful, here's the entirety of the command window with sessionInfo()

Microsoft Windows [Version 10.0.17134.619]
(c) 2018 Microsoft Corporation. All rights reserved.

C:\Users\anthonyd>"c:\Program Files\r\R-3.6.0\bin\x64\Rterm.exe"

R version 3.6.0 (2019-04-26) -- "Planting of a Tree"
Copyright (C) 2019 The R Foundation for Statistical Computing
Platform: x86_64-w64-mingw32/x64 (64-bit)

>
> library(parglm)
>
> sessionInfo()
R version 3.6.0 (2019-04-26)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 17134)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] parglm_0.1.3

loaded via a namespace (and not attached):
[1] compiler_3.6.0  Matrix_1.2-17   Rcpp_1.0.1      grid_3.6.0
[5] lattice_0.20-38
>
> for( variables in 1:500 ){
+
+ x <- data.frame( a0 = sample( 1:1000 , 1000 ) )
+
+ for( i in seq( variables ) ) x[ , paste0( 'a' , i ) ] <- sample( 1:1000 , 10$
+ this_formula <- as.formula( paste0( 'a0 ~ ' , paste0( "a" , seq( variables )$
+ glm( this_formula , data = x )
+ parglm( this_formula , data = x , nthreads = parallel::detectCores() )
+
+ }
Error in dimnames(Rmat) <- list(xnames, xnames) :
  length of 'dimnames' [1] not equal to array extent
> print( variables )
[1] 125

Does parglm support stepwise selection?

Does parglm support stepwise model selection with step(), like glm does?

> clotting <- data.frame(
+   u = c(5,10,15,20,30,40,60,80,100),
+   lot1 = c(118,58,42,35,27,25,21,19,18),
+   lot2 = c(69,35,26,21,18,16,13,12,12))
> f1 <- glm   (lot1 ~ log(u), data = clotting, family = Gamma)
> f2 <- parglm(lot1 ~ log(u), data = clotting, family = Gamma,
+              control = parglm.control(nthreads = 1L))
> all.equal(coef(f1), coef(f2))
[1] TRUE
> step(f1)
Start:  AIC=37.99
lot1 ~ log(u)

         Df Deviance     AIC
<none>        0.0167   37.99
- log(u)  1   3.5128 1465.27

Call:  glm(formula = lot1 ~ log(u), family = Gamma, data = clotting)

Coefficients:
(Intercept)       log(u)  
   -0.01655      0.01534  

Degrees of Freedom: 8 Total (i.e. Null);  7 Residual
Null Deviance:	    3.513 
Residual Deviance: 0.01673 	AIC: 37.99
> step(f2)
Start:  AIC=37.99
lot1 ~ log(u)

Error in glm.control(epsilon = 1e-08, maxit = 25, trace = FALSE, nthreads = 1L,  : 
  unused arguments (nthreads = 1, block_size = NULL, method = "LINPACK")
> 
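
While step() fails on parglm fits as shown above, a single backward step can be done by hand: refit the model with each term dropped and compare AIC values. The sketch below is a minimal illustration using base glm as the fitting function. `manual_drop1` and its `fit_fun` argument are hypothetical names, and swapping in parglm for `fit_fun` assumes it accepts the same formula, family and data arguments. Because the values come from AIC() on full refits, they need not match the table printed by step().

```r
# Refit the model once per dropped term and collect the AIC of each refit.
manual_drop1 <- function(formula, data, family, fit_fun = glm, ...) {
  full <- fit_fun(formula, family = family, data = data, ...)
  aics <- c("<none>" = AIC(full))
  for (term in attr(terms(formula), "term.labels")) {
    # e.g. ". ~ . - log(u)" removes log(u) from the right-hand side
    reduced <- update(formula, paste(". ~ . -", term))
    fit <- fit_fun(reduced, family = family, data = data, ...)
    aics[paste("-", term)] <- AIC(fit)
  }
  sort(aics)
}

clotting <- data.frame(
  u = c(5,10,15,20,30,40,60,80,100),
  lot1 = c(118,58,42,35,27,25,21,19,18),
  lot2 = c(69,35,26,21,18,16,13,12,12))

res <- manual_drop1(lot1 ~ log(u), data = clotting, family = Gamma)
print(res)
```

Repeating the call on the best reduced formula until "<none>" has the lowest AIC gives a crude backward selection.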

crash on powerful windows computer

pretty simple code to reproduce:

library(parglm)
this_df <- data.frame( a = sample( 1:1000000 , 20 ) / 100 , b = 1 )
parglm( a ~ b , data = this_df , nthreads = parallel::detectCores() ) # sometimes you need to run this line five or six times

this only occurs on a powerful computer, not on my local machine

Microsoft Windows [Version 6.3.9600]
(c) 2013 Microsoft Corporation. All rights reserved.

C:\Users\AnthonyD>"c:\Program Files\r\R-3.5.2\bin\x64\Rterm.exe"

R version 3.5.2 (2018-12-20) -- "Eggshell Igloo"
Copyright (C) 2018 The R Foundation for Statistical Computing
Platform: x86_64-w64-mingw32/x64 (64-bit)

>
> sessionInfo()
R version 3.5.2 (2018-12-20)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows Server 2012 R2 x64 (build 9600)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

loaded via a namespace (and not attached):
[1] compiler_3.5.2
> parallel::detectCores()
[1] 64
> library(parglm)
> this_df <- data.frame( a = sample( 1:1000000 , 20 ) / 100 , b = 1 )
> parglm( a ~ b , data = this_df , nthreads = parallel::detectCores() )

C:\Users\AnthonyD>

Allow for box constraints on coefficients?

I was wondering whether it would be possible to support extra arguments lower.limits and upper.limits, taking vectors of lower and upper box constraints on the coefficients. I would like to fit a nonnegative Poisson GLM, but without nonnegativity constraints the fit always blows up. In my experience, imposing these constraints in each IRLS iteration by clipping the coefficients to the allowed range works quite well. Alternatively, the osqp quadratic programming solver is quite efficient, allows formal box constraints, and works for both dense and sparse covariate matrices:

constrainedLS_osqp <- function(y, X,
                               lower=rep(0, ncol(X)), upper=rep(Inf, ncol(X)),
                               x.start = NULL, y.start = NULL) {
  require(osqp)
  require(Matrix)
  XtX = crossprod(X,X)
  Xty = crossprod(X,y)
  
  settings = osqpSettings(verbose = FALSE, eps_abs = 1e-8, eps_rel = 1e-8, linsys_solver = 0L,
                          warm_start = FALSE)
  pff = .sparseDiagonal(ncol(X))
  model <- osqp(XtX, -Xty, pff, l=lower, u=upper, pars=settings)
  
  if (!is.null(x.start)) model$WarmStart(x=x.start, y=y.start)

  coefs = model$Solve()$x # fitted coefficients

  coefs = pmax(lower, pmin(coefs, upper) ) # fitted coefficients sometimes go very slightly outside constraint zone due to numerical inaccuracies in solver - this is fixed here via clipping

  return(coefs)
}
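
For comparison, the simpler clipping approach mentioned above could look roughly like the following. This is a minimal base R sketch for a Poisson GLM with identity link, not parglm code: `clipped_irls_poisson` is a hypothetical name, the 1e-8 floor on the mean is an arbitrary choice, and clipping after each weighted least squares step is a heuristic rather than a formal constrained solve.

```r
# IRLS for a Poisson GLM with identity link, clipping the coefficients to
# [lower, upper] after every iteration.
clipped_irls_poisson <- function(X, y,
                                 lower = rep(0, ncol(X)),
                                 upper = rep(Inf, ncol(X)),
                                 maxit = 100L, tol = 1e-8) {
  clip <- function(b) pmax(lower, pmin(b, upper))
  # start from the clipped ordinary least squares solution
  beta <- clip(drop(solve(crossprod(X), crossprod(X, y))))
  for (it in seq_len(maxit)) {
    mu <- pmax(drop(X %*% beta), 1e-8)  # keep the Poisson mean positive
    w <- 1 / mu                         # IRLS weights: (dmu/deta)^2 / Var(mu)
    # weighted least squares step; with the identity link the working
    # response is y itself
    beta_new <- clip(drop(solve(crossprod(X, w * X), crossprod(X, w * y))))
    converged <- max(abs(beta_new - beta)) < tol * (1 + max(abs(beta)))
    beta <- beta_new
    if (converged) break
  }
  beta
}

# simulated example with a true coefficient of zero on the last column
set.seed(1)
X <- cbind(1, matrix(runif(200), ncol = 2))
y <- rpois(nrow(X), drop(X %*% c(2, 3, 0)))
coefs <- clipped_irls_poisson(X, y)
print(coefs)
```

The loop is bounded by maxit, so it always returns, but convergence to the constrained optimum is not guaranteed by clipping alone.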

tweedie family and out of core

1. Does parglm support the Tweedie family?
2. For big datasets that cannot fit in memory, does parglm support out-of-core fitting, like the shglm function in speedglm?

Allow for sparse (dgCMatrix) covariate matrix X in parglm.fit?

I was just wondering whether it would be a lot of hassle to allow X to be sparse in parglm.fit. Right now I believe a sparse covariate matrix is always coerced to dense... RcppEigen, for example, has a nice interface to the fast sparse and dense solvers from the Eigen C++ library; the Cholesky one works very well and is very fast (there is also a sparse least squares conjugate gradient solver). The Armadillo ones have the downside that they fall back on the installed BLAS, so timings differ massively depending on whether one has, e.g., an R version compiled against Intel MKL installed or not (and with Microsoft R Open, which came with Intel MKL, being phased out, access is becoming more difficult; OpenBLAS will now be the easier alternative).

This is what I was using to solve a least squares system with the Eigen solvers in the sparse and dense cases:

// [[Rcpp::depends(RcppEigen)]]
#include <RcppEigen.h>

// Solve Ax = b using Cholesky decomposition for sparse or dense covariate matrix A
// adapted from https://github.com/nk027/sanic/blob/master/src/solve_Cholesky.cpp

typedef Eigen::SparseMatrix<double> SpMat;

// Factorize A with the requested solver type, then solve Ax = b.
// The solver is declared exactly once here: redeclaring `solver` inside an
// if-block only creates a shadowed local and leaves the default in use.
template <typename Solver, typename Mat>
Rcpp::List solve_with(const Mat &A, const Eigen::Map<Eigen::MatrixXd> &b) {
  Solver solver;
  solver.compute(A);

  if(solver.info() != Eigen::Success) {
    // factorization failed
    return Rcpp::List::create(Rcpp::Named("status") = false);
  }

  Eigen::MatrixXd x = solver.solve(b);
  if(solver.info() != Eigen::Success) {
    // solving failed
    return Rcpp::List::create(Rcpp::Named("status") = false);
  }

  return Rcpp::List::create(Rcpp::Named("status") = true,
                            Rcpp::Named("coefficients") = x);
}

// Solve Ax = b using sparse LLT (pivot = 0) or LDLT (pivot = 1) Cholesky decomposition
// with AMD ordering (ordering = 0) or natural ordering (ordering = 1)
// [[Rcpp::export]]
Rcpp::List solve_sparse_chol(
    const Eigen::MappedSparseMatrix<double> A,
    const Eigen::Map<Eigen::MatrixXd> b,
    unsigned int pivot = 1, unsigned int ordering = 0) {

  if(pivot > 1) {
    Rcpp::warning("No valid pivoting scheme requested -- using default.");
  }
  if(ordering > 1) {
    Rcpp::warning("No valid ordering requested -- using default.");
  }

  const SpMat As(A); // plain sparse copy accepted by every solver type below
  const bool natural = (ordering == 1);

  if(pivot == 0) { // LLT
    if(natural) {
      return solve_with<Eigen::SimplicialLLT<SpMat, Eigen::Lower, Eigen::NaturalOrdering<int> > >(As, b);
    }
    return solve_with<Eigen::SimplicialLLT<SpMat, Eigen::Lower, Eigen::AMDOrdering<int> > >(As, b);
  }

  // LDLT (default)
  if(natural) {
    return solve_with<Eigen::SimplicialLDLT<SpMat, Eigen::Lower, Eigen::NaturalOrdering<int> > >(As, b);
  }
  return solve_with<Eigen::SimplicialLDLT<SpMat, Eigen::Lower, Eigen::AMDOrdering<int> > >(As, b);
}


// Solve Ax = b using dense LLT (pivot = 0) or LDLT (pivot = 1) Cholesky decomposition
// [[Rcpp::export]]
Rcpp::List solve_dense_chol(
    const Eigen::Map <Eigen::MatrixXd> A,
    const Eigen::Map <Eigen::MatrixXd> b,
    unsigned int pivot = 1) {

  if(pivot > 1) {
    Rcpp::warning("No valid pivoting scheme requested -- using default.");
  }

  if(pivot == 0) {
    return solve_with<Eigen::LLT<Eigen::MatrixXd> >(A, b);
  }
  return solve_with<Eigen::LDLT<Eigen::MatrixXd> >(A, b); // default
}

larger data set still crashes (not error)

hi, ten million records x 501 columns still crashes-- :-/ thank you for your time with these!

library(parglm)

sessionInfo()

variables <- 500L
n <- 10000000L

set.seed(1)
x <- cbind(data.frame(a0 = sample.int(n, n)),
		   replicate(variables, sample(n, n)))

parglm(a0 ~ ., data = x, nthreads = 3L)

console output:

C:\Users\AnthonyD>"c:\Program Files\r\R-3.6.0\bin\x64\Rterm.exe"

R version 3.6.0 (2019-04-26) -- "Planting of a Tree"
Copyright (C) 2019 The R Foundation for Statistical Computing
Platform: x86_64-w64-mingw32/x64 (64-bit)

> library(parglm)
> variables <- 500L
> n <- 10000000L
>
> set.seed(1)
> x <- cbind(data.frame(a0 = sample.int(n, n)),
+            replicate(variables, sample(n, n)))
>
> parglm(a0 ~ ., data = x, nthreads = 3L)

C:\Users\AnthonyD>
