boennecd / parglm
R package that provides a parallel estimation method for generalized linear models
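hi, hopefully i'm reading my windows task manager correctly. not sure if you care about this user error, but it tripped me up and i finally realized why my processes were running so slowly: nthreads appears to be ignored when it is passed as a top-level argument alongside an explicit control list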
x <- mtcars[ rep( seq( nrow( mtcars ) ) , 100000 ) , ]
library(parglm)
# does multithread
system.time( parglm::parglm( mpg ~ wt + qsec + cyl + disp + hp + drat + gear + am + vs , data = x , nthreads = 16 ) )
# does not multithread
system.time( parglm::parglm( mpg ~ wt + qsec + cyl + disp + hp + drat + gear + am + vs , data = x , nthreads = 16 , control = list(maxit = 25) ) )
# does multithread
system.time( parglm::parglm( mpg ~ wt + qsec + cyl + disp + hp + drat + gear + am + vs , data = x , control = list(maxit = 25, nthreads = 16) ) )
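The pattern above matches how stats::glm builds its control list: its signature has `control = list(...)`, so extra arguments such as `maxit` or `nthreads` only end up in the control list when `control` itself is not supplied. The sketch below shows that behaviour with base glm only. It assumes parglm hands its arguments on to glm in the same way, which the `glm(formula = ..., method = parglm.fit, ...)` call quoted in another report here suggests but which is not verified.

```r
# Base-R illustration: how glm() treats extra arguments when `control` is given.

# A top-level `maxit` is folded into the default control list ...
fit_top <- glm(mpg ~ wt, data = mtcars, maxit = 3)
fit_top$control$maxit

# ... but it is no longer used once `control` is supplied explicitly.
fit_both <- glm(mpg ~ wt, data = mtcars, maxit = 3,
                control = list(epsilon = 1e-8))
fit_both$control$maxit

# Keeping every setting inside `control` avoids the surprise.
fit_ctrl <- glm(mpg ~ wt, data = mtcars,
                control = list(epsilon = 1e-8, maxit = 3))
fit_ctrl$control$maxit
```

So the safe habit is to put nthreads inside the control list whenever any other control setting is given, as in the third call of the report.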
Feature request rather than a bug!
Currently, as noted in the documentation, the object returned by parglm is largely identical to the one from base glm apart from the qr element, which means some uses of predict() fail.
Is it possible to change the returned list so the qr element permits use of predict?
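In the meantime, one possible workaround for link-scale standard errors is to compute them from the coefficient covariance matrix instead of the qr element. This is only a sketch, demonstrated on a base glm fit: `link_se` is a hypothetical helper, and it assumes that `vcov()`, `terms()` and the `xlevels` element behave on a parglm fit the way they do on a glm fit, which is not checked here.

```r
# Standard errors of the linear predictor, sqrt(diag(X V X')),
# computed without touching the fit's qr element.
link_se <- function(fit, newdata) {
  tt <- delete.response(terms(fit))
  X  <- model.matrix(tt, data = newdata, xlev = fit$xlevels)
  V  <- vcov(fit)
  sqrt(rowSums((X %*% V) * X))
}

fit <- glm(am ~ wt + hp, data = mtcars, family = binomial())

se_manual  <- link_se(fit, mtcars)
se_predict <- predict(fit, newdata = mtcars, type = "link",
                      se.fit = TRUE)$se.fit

# The two routes should agree up to numerical error.
all.equal(unname(se_manual), unname(se_predict))
```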
Fitting a parglm model whose response is a factor fails with the following error:
Error in Summary.factor(c(1 = NaN, 2 = NaN, 3 = NaN, 4 = NaN, :
  ‘sum’ not meaningful for factors
In addition: Warning messages:
1: In Ops.factor(y, mu) : ‘-’ not meaningful for factors
2: In Ops.factor(weights, y) : ‘*’ not meaningful for factors
Running the same model with speedglm or glm doesn't return an error
Dataset <- data.frame(
y = c(0,1,1,0,0,0,0,0,1),
lot1 = c(118,58,42,35,27,25,21,19,18),
lot2 = c(69,35,26,21,18,16,13,12,12))
Dataset$y <- factor(Dataset$y)
parglm(y ~ . , data = Dataset, family = "binomial", control = parglm.control(nthreads = 4L))
speedglm(y ~ . , data = Dataset, family = binomial())
glm(y ~ . , data = Dataset, family = "binomial")
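A possible workaround is to recode the factor response as 0/1 before fitting. stats::glm treats the first factor level as failure and every other level as success, so the recoding below gives the same model. The sketch checks this with base glm only; `factor_to_binary` is a hypothetical helper, and passing the recoded data to parglm is not tested here.

```r
# First level -> 0 (failure), any other level -> 1 (success),
# which is the convention stats::glm uses for factor responses.
factor_to_binary <- function(y) as.integer(y != levels(y)[1L])

Dataset <- data.frame(
  y    = factor(c(0, 1, 1, 0, 0, 0, 0, 0, 1)),
  lot1 = c(118, 58, 42, 35, 27, 25, 21, 19, 18),
  lot2 = c(69, 35, 26, 21, 18, 16, 13, 12, 12))

Numeric <- transform(Dataset, y = factor_to_binary(y))

fit_factor  <- glm(y ~ ., data = Dataset, family = binomial())
fit_numeric <- glm(y ~ ., data = Numeric, family = binomial())

# Same coefficients either way.
all.equal(coef(fit_factor), coef(fit_numeric))
```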
hi, the crashing seems inconsistent. sometimes i've gotten this error instead of the crash:
error: chol(): decomposition failed
Error in parallelglm(X = x, Ys = y, family = paste0(family$family, "_", :
chol(): decomposition failed
here's the code to reproduce the crash, maybe run it a few times and see if it crashes locally for you?
library(parglm)
sessionInfo()
variables <- 10000
x <- data.frame( a0 = sample( 1:1000 , 1000 ) )
for( i in seq( variables ) ) x[ , paste0( 'a' , i ) ] <- sample( 1:1000 , 1000 )
this_formula <- as.formula( paste0( 'a0 ~ ' , paste0( "a" , seq( variables ) , collapse = " + " ) ) )
parglm( this_formula , data = x , nthreads = parallel::detectCores() )
command window output: you can see the crash (not an error) where it just returns to the C:\Users\anthonyd> prompt at the very bottom
Microsoft Windows [Version 10.0.17134.619]
(c) 2018 Microsoft Corporation. All rights reserved.
C:\Users\anthonyd>"c:\Program Files\r\R-3.6.0\bin\x64\Rterm.exe"
R version 3.6.0 (2019-04-26) -- "Planting of a Tree"
Copyright (C) 2019 The R Foundation for Statistical Computing
Platform: x86_64-w64-mingw32/x64 (64-bit)
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
Natural language support but running in an English locale
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
>
>
>
>
> library(parglm)
>
> sessionInfo()
R version 3.6.0 (2019-04-26)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 17134)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] parglm_0.1.3
loaded via a namespace (and not attached):
[1] compiler_3.6.0 Matrix_1.2-17 Rcpp_1.0.1 grid_3.6.0
[5] lattice_0.20-38
>
>
>
> variables <- 10000
>
> x <- data.frame( a0 = sample( 1:1000 , 1000 ) )
>
> for( i in seq( variables ) ) x[ , paste0( 'a' , i ) ] <- sample( 1:1000 , 10$
> this_formula <- as.formula( paste0( 'a0 ~ ' , paste0( "a" , seq( variables )$
> parglm( this_formula , data = x , nthreads = parallel::detectCores() )
C:\Users\anthonyd>
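One thing that stands out in this reproduction is that it has 10,000 predictors but only 1,000 rows, so the design matrix cannot have full column rank. Whether that is what triggers the crash is not established in this report. The sketch below is just a pre-flight check one could run before fitting; `check_design` is a hypothetical helper, and it is shown on a scaled-down version of the data (10 rows, 20 predictors).

```r
# Report the shape and numerical rank of the design matrix for a formula.
check_design <- function(formula, data) {
  X <- model.matrix(formula, data = data)
  r <- qr(X)$rank
  list(n_rows = nrow(X), n_cols = ncol(X), rank = r,
       full_rank = r == ncol(X))
}

set.seed(1)
small <- data.frame(a0 = sample(1:10, 10))
for (i in 1:20) small[, paste0("a", i)] <- sample(1:10, 10)

res <- check_design(a0 ~ ., small)
str(res)   # more columns than rows, so full_rank is FALSE
```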
hi, not sure if you care, but parglm::parglm() fails unless the package has first been attached with library(parglm).
fresh R session:
# works
glm( mpg ~ gear , data = mtcars )
# fails
parglm::parglm( mpg ~ gear , data = mtcars )
# Error in glm(formula = mpg ~ gear, data = mtcars, method = parglm.fit, :
# object 'parglm.fit' not found
library(parglm)
# works
parglm( mpg ~ gear , data = mtcars )
thanks!
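The error message hints at the mechanism: the call that fails is `glm(..., method = parglm.fit, ...)`, so `parglm.fit` seems to be referred to by bare name and then looked up somewhere the package's functions are only visible once it is attached. The toy below reproduces that kind of lookup failure in base R. Everything in it (`pkg`, `fit_impl`, `by_name`, `by_value`) is made up for illustration and is not parglm's actual code.

```r
pkg <- new.env()                      # stands in for a package namespace
pkg$fit_impl <- function(x) mean(x)   # lives in `pkg`, not on the search path

# Refers to fit_impl by bare name and evaluates in the caller's environment,
# so the lookup fails unless fit_impl is visible to the caller.
by_name <- function(x) {
  eval(quote(fit_impl(x)), list(x = x), parent.frame())
}

# Embeds the function object itself in the call, so no name lookup is needed.
by_value <- function(x) {
  eval(as.call(list(pkg$fit_impl, x)))
}

res_name  <- tryCatch(by_name(1:4), error = function(e) conditionMessage(e))
res_value <- by_value(1:4)

res_name    # an error message mentioning fit_impl
res_value   # the mean of 1:4
```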
hi, here's code to reproduce an error on my windows laptop: the model fits with glm() but fails with parglm()
library(parglm)
sessionInfo()
for( variables in 1:500 ){
x <- data.frame( a0 = sample( 1:1000 , 1000 ) )
for( i in seq( variables ) ) x[ , paste0( 'a' , i ) ] <- sample( 1:1000 , 1000 )
this_formula <- as.formula( paste0( 'a0 ~ ' , paste0( "a" , seq( variables ) , collapse = " + " ) ) )
glm( this_formula , data = x )
parglm( this_formula , data = x , nthreads = parallel::detectCores() )
}
# Error in dimnames(Rmat) <- list(xnames, xnames) :
# length of 'dimnames' [1] not equal to array extent
print( variables )
# [1] 125
in case it's helpful, here's the entirety of the command window with sessionInfo()
Microsoft Windows [Version 10.0.17134.619]
(c) 2018 Microsoft Corporation. All rights reserved.
C:\Users\anthonyd>"c:\Program Files\r\R-3.6.0\bin\x64\Rterm.exe"
R version 3.6.0 (2019-04-26) -- "Planting of a Tree"
Copyright (C) 2019 The R Foundation for Statistical Computing
Platform: x86_64-w64-mingw32/x64 (64-bit)
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
Natural language support but running in an English locale
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
>
> library(parglm)
>
> sessionInfo()
R version 3.6.0 (2019-04-26)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 17134)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] parglm_0.1.3
loaded via a namespace (and not attached):
[1] compiler_3.6.0 Matrix_1.2-17 Rcpp_1.0.1 grid_3.6.0
[5] lattice_0.20-38
>
> for( variables in 1:500 ){
+
+ x <- data.frame( a0 = sample( 1:1000 , 1000 ) )
+
+ for( i in seq( variables ) ) x[ , paste0( 'a' , i ) ] <- sample( 1:1000 , 10$
+ this_formula <- as.formula( paste0( 'a0 ~ ' , paste0( "a" , seq( variables )$
+ glm( this_formula , data = x )
+ parglm( this_formula , data = x , nthreads = parallel::detectCores() )
+
+ }
Error in dimnames(Rmat) <- list(xnames, xnames) :
length of 'dimnames' [1] not equal to array extent
> print( variables )
[1] 125
Does parglm support stepwise selection with step(), like glm does?
> clotting <- data.frame(
+ u = c(5,10,15,20,30,40,60,80,100),
+ lot1 = c(118,58,42,35,27,25,21,19,18),
+ lot2 = c(69,35,26,21,18,16,13,12,12))
> f1 <- glm (lot1 ~ log(u), data = clotting, family = Gamma)
> f2 <- parglm(lot1 ~ log(u), data = clotting, family = Gamma,
+ control = parglm.control(nthreads = 1L))
> all.equal(coef(f1), coef(f2))
[1] TRUE
> step(f1)
Start: AIC=37.99
lot1 ~ log(u)
         Df Deviance     AIC
<none>        0.0167   37.99
- log(u)  1   3.5128 1465.27
Call: glm(formula = lot1 ~ log(u), family = Gamma, data = clotting)
Coefficients:
(Intercept)       log(u)
   -0.01655      0.01534
Degrees of Freedom: 8 Total (i.e. Null); 7 Residual
Null Deviance: 3.513
Residual Deviance: 0.01673 AIC: 37.99
> step(f2)
Start: AIC=37.99
lot1 ~ log(u)
Error in glm.control(epsilon = 1e-08, maxit = 25, trace = FALSE, nthreads = 1L, :
unused arguments (nthreads = 1, block_size = NULL, method = "LINPACK")
>
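The error shows glm.control being handed nthreads, block_size and method, which it does not accept, so step() appears to refit through plain glm with parglm's control list. Until that is handled, a drop-one comparison can be done by refitting by hand. The sketch below uses base glm as the fitting function; `refit_drop_one` is a hypothetical helper, and passing parglm as `fit_fun` assumes AIC() works on parglm fits, which is not tested here. Note that it compares AIC() of full refits, which need not match the approximate AIC column printed by step() for families with an estimated dispersion.

```r
# Refit the model with each term dropped in turn and collect AIC values.
refit_drop_one <- function(fit_fun, formula, data, ...) {
  full   <- fit_fun(formula, data = data, ...)
  labels <- attr(terms(formula), "term.labels")
  aics <- vapply(labels, function(term) {
    reduced <- update(formula, paste(". ~ . -", term))
    AIC(fit_fun(reduced, data = data, ...))
  }, numeric(1))
  c("<none>" = AIC(full), aics)
}

clotting <- data.frame(
  u    = c(5, 10, 15, 20, 30, 40, 60, 80, 100),
  lot1 = c(118, 58, 42, 35, 27, 25, 21, 19, 18),
  lot2 = c(69, 35, 26, 21, 18, 16, 13, 12, 12))

res <- refit_drop_one(glm, lot1 ~ log(u), clotting, family = Gamma)
res
```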
pretty simple code to reproduce:
library(parglm)
this_df <- data.frame( a = sample( 1:1000000 , 20 ) / 100 , b = 1 )
parglm( a ~ b , data = this_df , nthreads = parallel::detectCores() ) # sometimes you need to run this line five or six times
this only occurs on a powerful computer, not on my local machine
Microsoft Windows [Version 6.3.9600]
(c) 2013 Microsoft Corporation. All rights reserved.
C:\Users\AnthonyD>"c:\Program Files\r\R-3.5.2\bin\x64\Rterm.exe"
R version 3.5.2 (2018-12-20) -- "Eggshell Igloo"
Copyright (C) 2018 The R Foundation for Statistical Computing
Platform: x86_64-w64-mingw32/x64 (64-bit)
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
Natural language support but running in an English locale
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
>
> sessionInfo()
R version 3.5.2 (2018-12-20)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows Server 2012 R2 x64 (build 9600)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] compiler_3.5.2
> parallel::detectCores()
[1] 64
> library(parglm)
> this_df <- data.frame( a = sample( 1:1000000 , 20 ) / 100 , b = 1 )
> parglm( a ~ b , data = this_df , nthreads = parallel::detectCores() )
C:\Users\AnthonyD>
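The reproduction asks for 64 threads on a 20-row data set. Whether that mismatch is the cause of the crash is not confirmed here, but capping the thread count by the amount of data is a cheap precaution. The sketch below uses a made-up rule of thumb (at least as many rows per thread as there are coefficients); `cap_threads` is a hypothetical helper, not part of parglm.

```r
# Limit the requested thread count so that every thread gets at least
# `n_coef` rows; always return at least one thread.
cap_threads <- function(requested, n_rows, n_coef) {
  if (is.na(requested)) requested <- 1L   # detectCores() may return NA
  by_rows <- max(1L, n_rows %/% max(1L, n_coef))
  as.integer(min(requested, by_rows))
}

this_df <- data.frame(a = (1:20) / 100, b = 1)
n_coef  <- ncol(model.matrix(a ~ b, this_df))   # intercept and b

nthreads <- cap_threads(parallel::detectCores(), nrow(this_df), n_coef)
nthreads   # never more than 10 for 20 rows and 2 coefficients
```

The result could then be passed as the nthreads setting instead of the raw parallel::detectCores() value.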
I was wondering if it would be possible by any chance to support extra arguments lower.limits and upper.limits with vectors of lower & upper box constraints on the coefficients. I would like to fit a nonnegative Poisson GLM, but in the absence of nonnegativity constraints this always blows up. In my experience, just imposing these constraints in each IRLS iteration by clipping the coefficients to the allowed range works quite well. Otherwise, the osqp quadratic programming solver is also quite efficient, would allow formal box constraints, and works for both dense and sparse covariate matrices:
constrainedLS_osqp <- function(y, X,
lower=rep(0, ncol(X)), upper=rep(Inf, ncol(X)),
x.start = NULL, y.start = NULL) {
require(osqp)
require(Matrix)
XtX = crossprod(X,X)
Xty = crossprod(X,y)
settings = osqpSettings(verbose = FALSE, eps_abs = 1e-8, eps_rel = 1e-8, linsys_solver = 0L,
warm_start = FALSE)
pff = .sparseDiagonal(ncol(X))
model <- osqp(XtX, -Xty, pff, l=lower, u=upper, pars=settings)
if (!is.null(x.start)) model$WarmStart(x=x.start, y=y.start)
coefs = model$Solve()$x # fitted coefficients
coefs = pmax(lower, pmin(coefs, upper) ) # fitted coefficients sometimes go very slightly outside constraint zone due to numerical inaccuracies in solver - this is fixed here via clipping
return(coefs)
}
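The simpler clipping idea mentioned above can be sketched in a few lines of base R. This is a minimal illustration for a Poisson model with log link, assuming a dense X and no offsets or prior weights; `irls_poisson_clipped` is a hypothetical name, and clipping after each weighted least-squares step is a heuristic, not a formal constrained solver.

```r
irls_poisson_clipped <- function(X, y,
                                 lower = rep(0, ncol(X)),
                                 upper = rep(Inf, ncol(X)),
                                 maxit = 100L, tol = 1e-10) {
  eta  <- log(y + 0.1)        # mirrors the y + 0.1 start of the poisson family
  beta <- rep(0, ncol(X))
  for (it in seq_len(maxit)) {
    mu <- exp(eta)                       # inverse log link
    z  <- eta + (y - mu) / mu            # working response
    w  <- mu                             # working weights (Poisson, log link)
    beta_new <- drop(solve(crossprod(X, w * X), crossprod(X, w * z)))
    beta_new <- pmin(pmax(beta_new, lower), upper)   # clip to the box
    converged <- max(abs(beta_new - beta)) < tol * (1 + max(abs(beta_new)))
    beta <- beta_new
    eta  <- drop(X %*% beta)
    if (converged) break
  }
  beta
}

set.seed(1)
n <- 200
X <- cbind(1, runif(n), runif(n))
y <- rpois(n, exp(drop(X %*% c(0.5, 1, -0.5))))

# With infinite bounds this is ordinary IRLS and should match glm.fit.
b_free   <- irls_poisson_clipped(X, y, lower = rep(-Inf, 3))
b_glm    <- unname(coef(glm.fit(X, y, family = poisson())))
# Default bounds: all coefficients forced to be nonnegative.
b_nonneg <- irls_poisson_clipped(X, y)
```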
1. Does parglm support the Tweedie family?
2. Does parglm support datasets that are too big to fit in memory, like the shglm function in speedglm does?
Was just wondering if it would be a lot of hassle to allow for X to be sparse in parglm.fit? Right now I believe a sparse covariate matrix is always coerced to dense. RcppEigen, for example, has a nice interface to the fast sparse & dense solvers from the Eigen C++ library: the Cholesky one works very well & is very fast, and there is also a sparse least squares conjugate gradient solver. The Armadillo ones have the downside that they fall back on the installed BLAS, so timings will be massively different depending on whether one has an R version compiled against Intel MKL installed or not (and with Microsoft R Open, which came with Intel MKL, being phased out, access is becoming more difficult; OpenBLAS will now be the easier alternative).
This was what I was using to solve a least square system using the Eigen solvers for the sparse & dense case:
// [[Rcpp::depends(RcppEigen)]]
#include <Rcpp.h>
#include <RcppEigen.h>
// Solve Ax = b using Cholesky decomposition for sparse or dense covariate matrix A
// adapted from https://github.com/nk027/sanic/blob/master/src/solve_Cholesky.cpp
// Helper: factorise the sparse matrix A with the given solver type and solve for x
template <typename Solver>
Rcpp::List solve_sparse_with(const Eigen::MappedSparseMatrix<double>& A,
                             const Eigen::Map<Eigen::MatrixXd>& b) {
  Solver solver;
  solver.compute(A);
  if(solver.info() != Eigen::Success) {
    // decomposition failed
    return Rcpp::List::create(Rcpp::Named("status") = false);
  }
  Eigen::MatrixXd x = solver.solve(b);
  if(solver.info() != Eigen::Success) {
    // solving failed
    return Rcpp::List::create(Rcpp::Named("status") = false);
  }
  return Rcpp::List::create(Rcpp::Named("status") = true,
                            Rcpp::Named("coefficients") = x);
}
// Solve Ax = b using sparse LLT (pivot = 0) or LDLT (pivot = 1) Cholesky decomposition
// The solver type is chosen at the call site: a solver declared inside an
// if-block would go out of scope at the closing brace and never be used.
// [[Rcpp::export]]
Rcpp::List solve_sparse_chol(
    const Eigen::MappedSparseMatrix<double> A,
    const Eigen::Map<Eigen::MatrixXd> b,
    unsigned int pivot = 1, unsigned int ordering = 0) {
  typedef Eigen::SparseMatrix<double> SpMat;
  if(ordering > 1) {
    Rcpp::warning("No valid ordering requested -- using default.");
  }
  if(pivot > 1) {
    Rcpp::warning("No valid pivoting scheme requested -- using default.");
  }
  const bool natural = (ordering == 1); // otherwise use AMDOrdering (default)
  if(pivot == 0) { // LLT
    if(natural) {
      return solve_sparse_with< Eigen::SimplicialLLT<SpMat, Eigen::Lower, Eigen::NaturalOrdering<int> > >(A, b);
    }
    return solve_sparse_with< Eigen::SimplicialLLT<SpMat, Eigen::Lower, Eigen::AMDOrdering<int> > >(A, b);
  }
  // LDLT (default)
  if(natural) {
    return solve_sparse_with< Eigen::SimplicialLDLT<SpMat, Eigen::Lower, Eigen::NaturalOrdering<int> > >(A, b);
  }
  return solve_sparse_with< Eigen::SimplicialLDLT<SpMat, Eigen::Lower, Eigen::AMDOrdering<int> > >(A, b);
}
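// Helper: factorise the dense matrix A with the given solver type and solve for x
template <typename Solver>
Rcpp::List solve_dense_with(const Eigen::Map<Eigen::MatrixXd>& A,
                            const Eigen::Map<Eigen::MatrixXd>& b) {
  Solver solver;
  solver.compute(A);
  if(solver.info() != Eigen::Success) {
    // decomposition failed
    return Rcpp::List::create(Rcpp::Named("status") = false);
  }
  Eigen::MatrixXd x = solver.solve(b);
  if(solver.info() != Eigen::Success) {
    // solving failed
    return Rcpp::List::create(Rcpp::Named("status") = false);
  }
  return Rcpp::List::create(Rcpp::Named("status") = true,
                            Rcpp::Named("coefficients") = x);
}
// Solve Ax = b using dense LLT (pivot = 0) or LDLT (pivot = 1) Cholesky decomposition
// The solver type is chosen at the call site: a solver declared inside an
// if-block would go out of scope at the closing brace and never be used.
// [[Rcpp::export]]
Rcpp::List solve_dense_chol(
    const Eigen::Map <Eigen::MatrixXd> A,
    const Eigen::Map <Eigen::MatrixXd> b,
    unsigned int pivot = 1) {
  if(pivot == 0) {
    return solve_dense_with< Eigen::LLT<Eigen::MatrixXd> >(A, b);
  } else if(pivot > 1) {
    Rcpp::warning("No valid pivoting scheme requested -- using default.");
  }
  return solve_dense_with< Eigen::LDLT<Eigen::MatrixXd> >(A, b);
}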
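For comparison, the same idea (keep X sparse and solve the normal equations) can be tried from plain R with the Matrix package that ships with R. This is only a sketch on simulated data, not a proposal for how parglm.fit should do it; as far as I know, Matrix solves a symmetric sparse system like this one through a sparse factorisation rather than by converting it to dense.

```r
library(Matrix)

set.seed(1)
n <- 500; p <- 20
X    <- rsparsematrix(n, p, density = 0.1)      # sparse design (dgCMatrix)
beta <- rnorm(p)
y    <- as.vector(as.matrix(X %*% beta)) + rnorm(n, sd = 0.1)

XtX <- crossprod(X)                 # stored as a sparse symmetric matrix
Xty <- as.matrix(crossprod(X, y))   # p x 1 dense right-hand side

coef_sparse <- as.vector(as.matrix(solve(XtX, Xty)))

# Dense reference fit on the same data.
coef_dense <- unname(lm.fit(as.matrix(X), y)$coefficients)
all.equal(coef_sparse, coef_dense)
```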
hi, ten million records x 501 columns still crashes-- :-/ thank you for your time with these!
library(parglm)
sessionInfo()
variables <- 500L
n <- 10000000L
set.seed(1)
x <- cbind(data.frame(a0 = sample.int(n, n)),
replicate(variables, sample(n, n)))
parglm(a0 ~ ., data = x, nthreads = 3L)
console output:
C:\Users\AnthonyD>"c:\Program Files\r\R-3.6.0\bin\x64\Rterm.exe"
R version 3.6.0 (2019-04-26) -- "Planting of a Tree"
Copyright (C) 2019 The R Foundation for Statistical Computing
Platform: x86_64-w64-mingw32/x64 (64-bit)
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
Natural language support but running in an English locale
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
> library(parglm)
> variables <- 500L
> n <- 10000000L
>
> set.seed(1)
> x <- cbind(data.frame(a0 = sample.int(n, n)),
+ replicate(variables, sample(n, n)))
>
> parglm(a0 ~ ., data = x, nthreads = 3L)
C:\Users\AnthonyD>
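A quick size estimate may be useful context for this one. A dense double-precision design matrix with ten million rows and 501 columns (intercept plus 500 predictors) needs roughly 37 GiB on its own, before any working copies. Whether running out of memory is what causes the crash is not established in this report; `design_matrix_gib` below is a hypothetical helper for the arithmetic only.

```r
# Size in GiB of a dense numeric matrix (8 bytes per double by default).
design_matrix_gib <- function(n_rows, n_cols, bytes_per_cell = 8) {
  n_rows * n_cols * bytes_per_cell / 2^30
}

gib <- design_matrix_gib(1e7, 501)
round(gib, 1)   # about 37.3
```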