tidymodels / broom
Convert statistical analysis objects from R into tidy format
Home Page: https://broom.tidymodels.org
License: Other
Would you consider expanding this package to work on objects of type lmerMod (from lme4)?
I've added two additional S3 methods along with the original tidy. Each of them tidies data from a model, but in a different way. The new methods are
augment: add columns to the original data, containing (e.g.) predictions and residuals
glance: construct a one-row data.frame summary, containing (e.g.) R^2 or deviance
I've added a new branch, augment-glance, with this change. It can be installed with
library(devtools)
install_github("broom", "dgrtwo", "augment-glance")
The new vignettes explain these three methods and how they interrelate.
Before I merge this into the main branch I'd be very interested in any feedback. While the package is very new, I would like advice from those interested in using it. (For instance, are augment and glance helpful names for these functions? Do the vignettes make the difference clear? Which methods should be implemented next?)
I have had some trouble tidying aovlist objects. When performing a repeated measures analysis of variance, the output object is usually a list of aov objects for the different error strata.
datafilename <- "http://personality-project.org/r/datasets/R.appendix5.data"
aov_example <- read.table(datafilename, header = T)
aov_example <- aov(Recall ~ (Task*Valence*Gender*Dosage) + Error(Subject/(Task*Valence)) +(Gender*Dosage), aov_example)
class(aov_example)
[1] "aovlist" "listof"
(example stolen from http://personality-project.org/r/#anova)
To date, it is not possible to tidy these objects directly because fix_data_frame() doesn't take lists.
library("broom")
tidy(aov_example)
Error in as.data.frame.default(x) :
cannot coerce class "c("aovlist", "listof")" to a data.frame
So I tried to use lapply(), but this throws an error:
lapply(aov_example, tidy)
Error in `colnames<-`(`*tmp*`, value = c("df", "sumsq", "meansq", "statistic", :
Attribute 'names' [5] must be of same length as vector [3]
Is there an easy fix for this? It would be convenient if tidy() accepted aovlist objects.
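One possible workaround, sketched with base R only and not broom's own implementation: summarise each error stratum separately and pad the columns a stratum lacks (a stratum that holds only residuals has no F value or p-value, which is consistent with the length mismatch in the error above). The built-in npk data and the helper name tidy_stratum are assumptions made for illustration.

```r
# Repeated-measures style fit with an Error() term, using the built-in npk data
npk_aov <- aov(yield ~ N * P * K + Error(block), data = npk)

# summary() returns one ANOVA table per error stratum
strata <- summary(npk_aov)
wanted <- c("Df", "Sum Sq", "Mean Sq", "F value", "Pr(>F)")

# Hypothetical helper: turn one stratum into a data frame with a fixed set of columns
tidy_stratum <- function(name) {
  tab <- strata[[name]][[1]]
  out <- data.frame(stratum = name, term = trimws(rownames(tab)),
                    stringsAsFactors = FALSE)
  for (col in wanted) {
    # Pad with NA when the stratum has no such column (e.g. residuals only)
    out[[col]] <- if (col %in% names(tab)) tab[[col]] else NA_real_
  }
  out
}

tidy_npk <- do.call(rbind, lapply(names(strata), tidy_stratum))
rownames(tidy_npk) <- NULL
tidy_npk
```

Padding to a fixed column set is what lets rbind() combine strata of different shapes.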
I think it would be useful if some version of the data.frame that represents the result also included a column with the number of observations that were used to build the model.
An easy way to access this would be e.g. nobs() for stats::lm(), but I'm sure other models report this in a similar way.
It would be even more useful if it could include the actual number of observations for each logical TRUE or factor LEVEL.
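A minimal sketch of what such a count could look like with base R, using mtcars as a stand-in data set (an assumption; the request above names no data). nobs() gives the number of rows used in the fit, and tabulating the model frame gives per-level counts.

```r
fit <- lm(mpg ~ wt + factor(cyl), data = mtcars)

# Number of observations actually used to fit the model
n_used <- nobs(fit)

# Observations per factor level, taken from the model frame (rows with NA already dropped)
level_counts <- table(model.frame(fit)[["factor(cyl)"]])

n_used
level_counts
```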
If an ill-specified model, in this case one with linearly dependent columns, is estimated by lm(), lm() will just drop the respective coefficient from the estimation and return NA as the estimate for that coefficient. Thus, the user should notice the mistake. If the user does not look at the raw output of lm() (or summary(lm_object)), or has a script running, there is no notification (the script is likely to fail at some other point). If one uses broom::tidy() without looking at the original output of lm(), as I often do because I like the layout, the ill-specified model is effectively hidden from the user.
Can we have a warning for such cases?
See the example below for lm() and also for plm() (from package plm):
library(plm)
data(Grunfeld, package="plm")
pGrunfeld <- pdata.frame(Grunfeld, index = c("firm", "year"), drop.index = FALSE)
# make a duplicate column (i.e. a linearly dependent column)
Grunfeld$capital2 <- Grunfeld$capital
pGrunfeld$capital2 <- pGrunfeld$capital
# lm model
mod_lm <- lm(inv ~ value + capital + capital2, data=Grunfeld)
mod_lm # NA coefficient is displayed
broom::tidy(mod_lm) # NA coefficient is silently dropped, user is not aware of the mistake
# plm model
mod_plm <- plm(inv ~ value + capital + capital2, data=pGrunfeld, model="pooling")
mod_plm # NA is not displayed
broom::tidy(mod_plm) # fails (outside of broom), which is somehow good as the user gets notified about the ill-specified model
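A sketch of the kind of check being requested, written against base R's lm() only. The function name warn_if_aliased is hypothetical and this is not how broom handles the case; it only shows that the dropped terms can be detected from coef().

```r
# Hypothetical helper: warn when lm() had to drop linearly dependent terms
warn_if_aliased <- function(fit) {
  est <- coef(fit)
  dropped <- names(est)[is.na(est)]
  if (length(dropped) > 0) {
    warning("Coefficients not estimated (linearly dependent columns?): ",
            paste(dropped, collapse = ", "))
  }
  invisible(dropped)
}

# Duplicate a column so the design matrix is rank deficient
dat <- transform(mtcars, wt2 = wt)
fit <- lm(mpg ~ wt + wt2, data = dat)

# suppressWarnings() only keeps the example quiet; normally the warning should surface
dropped <- suppressWarnings(warn_if_aliased(fit))
dropped
```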
I'm getting the following error message when I try to tidy a coxph fit from the survival package:
Error in seq_len(nrow(x)) :
argument must be coercible to non-negative integer
In addition: Warning message:
In seq_len(nrow(x)) : first element used of 'length.out' argument
Here is code to replicate the problem. Note that I first confirm that tidy works on a regression fit and then attempt it on a coxph fit:
library(survival)
library(broom)
# Create data frame for the survival analysis
demoDat <- structure(list(daysFromInvestigation = c(1L, 6L, 7L, 5L, 5L,
2L, 3L, 1L, 2L, 5L, 1L, 4L, 5L, 7L, 3L, 1L, 6L, 4L, 2L, 6L, 6L,
7L, 7L, 6L, 4L, 2L, 2L, 4L, 0L, 5L, 5L, 4L, 4L, 1L, 7L, 7L, 1L,
2L, 0L, 7L, 6L, 7L, 1L, 1L, 2L, 2L, 2L, 1L, 7L, 1L, 2L, 3L, 1L,
1L, 2L), Sex = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L,
2L, 2L, 1L, 2L, 1L, 1L, 2L, 2L, 2L, 1L, 2L, 2L, 1L, 2L, 2L, 2L,
2L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 2L,
1L, 2L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 2L), .Label = c("F ",
"M "), class = "factor"), dead = c(1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1)), row.names = c(NA, -55L), class = "data.frame", .Names = c("daysFromInvestigation",
"Sex", "dead"))
# To be sure that I've got the packages loaded and things work, demonstrate with a glm regression
demoFit <- glm(daysFromInvestigation ~ Sex, data=demoDat)
print(demoFit)
tidy(demoFit)
# Works great
# Now do coxph
demoCoxFit <- coxph(Surv(demoDat$daysFromInvestigation, demoDat$dead) ~ Sex, data = demoDat)
print(demoCoxFit)
Which produces the following:
Call:
coxph(formula = Surv(demoDat$daysFromInvestigation, demoDat$dead) ~
Sex, data = demoDat)
coef exp(coef) se(coef) z p
SexM 0.2 1.22 0.34 0.588 0.56
Likelihood ratio test=0.36 on 1 df, p=0.549 n= 55, number of events= 55
But when I run tidy on the fit, I get the error message at the beginning of this issue report
tidy(demoCoxFit)
Due to an implementation of predict() for panel models, which I cannot follow, augment() with panel models calculates wrong fitted values. In fact, the values are only wrong for within and random models, as they have individual intercepts. There is an implementation of fitted(), but it is not exported and it produces strange values (I think).
I think we need a workaround here. Here is my test file to demonstrate the numbers:
[see below for a pull request for that: 4ce62e7 and 97cb53f should be a new pull request for #73]
require(broom)
require(plm)
data("Grunfeld")
# Use lm() of pooled OLS and fixed effects
lm_pool <- lm(inv ~ value + capital, data = Grunfeld)
lm_fe <- lm(inv ~ value + capital + factor(firm), data = Grunfeld)
# Use plm() for pooled OLS and fixed effects
plm_pool <- plm(inv ~ value + capital, data=Grunfeld, model = "pooling", index=c("firm", "year"))
plm_fe <- plm(inv ~ value + capital, data=Grunfeld, model = "within", index=c("firm", "year"))
plm_re <- plm(inv ~ value + capital, data=Grunfeld, model = "random", index=c("firm", "year"))
# calculate augmented data.frame
aug_lm <- augment(lm_pool)
aug_plm_pool <- augment(plm_pool)
aug_plm_fe <- augment(plm_fe)
aug_plm_re <- augment(plm_re)
# Is column .fitted correct?
all(abs((aug_plm_pool$.fitted + aug_plm_pool$.resid) - model.frame(plm_pool)[, 1]) < 0.000000001)
all(abs((aug_plm_fe$.fitted + aug_plm_fe$.resid) - model.frame(plm_fe)[, 1]) < 0.000000001)
all(abs((aug_plm_re$.fitted + aug_plm_re$.resid) - model.frame(plm_re)[, 1]) < 0.000000001)
# calculate fitted values "by hand"
pred_plm_pool <- model.frame(plm_pool)[, 1] - residuals(plm_pool)
pred_plm_fe <- model.frame(plm_fe)[, 1] - residuals(plm_fe)
pred_plm_re <- model.frame(plm_re)[, 1] - residuals(plm_re)
# test hand calculated fitted values
all(abs((pred_plm_pool + aug_plm_pool$.resid) - model.frame(plm_pool)[, 1]) < 0.000000001)
all(abs((pred_plm_fe + aug_plm_fe$.resid) - model.frame(plm_fe)[, 1]) < 0.000000001)
all(abs((pred_plm_re + aug_plm_re$.resid) - model.frame(plm_re)[, 1]) < 0.000000001)
Deep in the internals of building the ggplot2 book, I get:
Error in tidy.lm(.[[object]][[1]], ...) : could not find function "is"
Calls: render_chapter ... do_ -> do_.grouped_df -> eval -> eval -> func -> tidy.lm
Execution halted
Any ideas? It's from a line like coefs <- models %>% tidy(mod) where mod is a linear model.
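One plausible cause, which the thread does not confirm: is() lives in the methods package, and Rscript in R versions before 3.5.0 did not attach methods by default, so an unqualified is() call could not be found in non-interactive builds. A hedged sketch of the two usual remedies:

```r
# Remedy 1: qualify the call so it works even when methods is not attached
fit <- lm(mpg ~ wt, data = mtcars)
is_lm <- methods::is(fit, "lm")

# Remedy 2: attach methods explicitly at the top of the script
library(methods)
is_lm_again <- is(fit, "lm")

is_lm
```

Inside a package, importing is() from methods in the NAMESPACE serves the same purpose.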
There's a good article on yhat: Reducing your R memory footprint by 7000x.
Do you think this is the package for a generic trimming function to sit in? If so, I'd be happy to help.
If an input data.frame to fix_data_frame has a column named "a", it will get renamed to a.1:
library(broom)
df <- data.frame(a=1:10, b=letters[1:10])
rownames(df) <- tail(LETTERS, 10)
(fdf <- fix_data_frame(df))
## term a.1 b
## 1 Q 1 a
## 2 R 2 b
## 3 S 3 c
## ...
Redefining fix_data_frame to something like this can fix it.
fix_data_frame <- function(x, newnames = NULL, newcol = "term") {
  if (is.character(newnames)) {
    ## Shouldn't we check that length(newnames) == ncol(x) ?
    x <- setNames(x, newnames)
  }
  if (all(rownames(x) == seq_len(nrow(x)))) {
    # don't need to move rownames into a new column
    ret <- data.frame(x, stringsAsFactors = FALSE)
  } else {
    if (newcol %in% names(x)) {
      # pick a unique name if the target column name is already taken
      nc <- tail(make.names(c(names(x), newcol), unique = TRUE), 1L)
      warning("The column name for the rownames already exists, \n ",
              "the new column name will be: ", nc)
      newcol <- nc
    }
    # build the rownames column first, then rename it to newcol
    ret <- data.frame(a = rownames(x), stringsAsFactors = FALSE)
    names(ret) <- newcol
    ret <- cbind(ret, x)
  }
  broom:::unrowname(ret)
}
which will provide
fix_data_frame(df)
## term a b
## 1 Q 1 a
## 2 R 2 b
## 3 S 3 c
I can provide a patch that can be fixed up if you agree.
Note, I changed some of the rest of the function to simplify it. I'm also curious whether we should check that the newnames param is an appropriate length.
This package is a great idea! Unfortunately for me it doesn't work with the model class I currently work with the most, S3 objects of class mer from package lme4.0 and all versions of lme4 prior to 1.1-7. Would be really great if it did.... I might try taking a crack at it myself, but could take quite a while.
This package is a brilliant idea; I do this manually all the time these days. Have you taken a look at supporting the output from optim()? Seems like it would be relatively straightforward to return a data frame with a column for each parameter in par, the minimized value, convergence, etc.
Sometimes one wants to rbind such data.frames from different models that have different numbers of parameters (e.g. particularly nested models, where parameters are fixed in simpler versions), so that would probably still be left up to the user. Still, optim is widely used and this would be a great way to facilitate comparing across models, as you've already illustrated with less generic models.
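A rough sketch of what this could look like, using only the documented components of optim()'s return value (par, value, counts, convergence). The objective function, the parameter names a and b, and the data frame names are assumptions for illustration; this is not an existing broom method. It uses a long layout (one row per parameter) rather than one column per parameter, because that stacks more easily across models with different numbers of parameters.

```r
# Toy objective with its minimum at a = 1, b = 2
objective <- function(p) (p[1] - 1)^2 + (p[2] - 2)^2
start <- c(a = 0, b = 0)
res <- optim(start, objective, method = "BFGS")

# tidy-like: one row per parameter
tidy_optim <- data.frame(parameter = names(start),
                         value = unname(res$par),
                         stringsAsFactors = FALSE)

# glance-like: one row per fit
glance_optim <- data.frame(value = res$value,
                           function.count = res$counts[["function"]],
                           gradient.count = res$counts[["gradient"]],
                           convergence = res$convergence)

tidy_optim
glance_optim
```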
felm supports multiple lhs. This estimates the model to each lhs, speeding up computations compared to an lapply loop. Should broom methods support multiple lhs for felm too?
tidy would row_bind results for multiple lhs, augment would create more columns (.fitted_nameofvar1, .fitted_nameofvar2 etc), and glance would create a data.frame with nrow = length(lhs).
I want to write a pull request, but wanted to check with you first.
Would you consider expanding this package to work on objects of class lme (from nlme) and objects of type geeglm (from geepack)?
Really like the idea for this package. For those working in health research, results of logistic regression models are normally presented in exponentiated form with 95% confidence intervals rather than the coefficient and standard error. It would be great if there were an option in the tidy() function, or a separate function, that would produce this output.
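For reference, the requested output can be sketched in base R as follows. This uses Wald intervals from confint.default() and mtcars as a stand-in data set; both choices are assumptions, and profile-likelihood intervals would be an equally reasonable alternative.

```r
fit <- glm(am ~ wt, data = mtcars, family = binomial)

# Wald 95% confidence limits on the log-odds scale, then exponentiated
ci <- exp(confint.default(fit, level = 0.95))

or_table <- data.frame(term = names(coef(fit)),
                       odds.ratio = unname(exp(coef(fit))),
                       conf.low = unname(ci[, 1]),
                       conf.high = unname(ci[, 2]),
                       stringsAsFactors = FALSE)
or_table
```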
Hi!
It would be cool to have tidy output for the lavaan-package (https://github.com/yrosseel/lavaan) which is used for structural equation modeling and cfa.
Thanks =)
@dgrtwo, I like your slide explaining the modern R workflow:
I would like to make a few modifications for an upcoming presentation:
readr
and readxl
Do you have the source for the image, so I could make the changes? Or would you be interested in releasing the updated slide? Will attribute, of course.
Hi,
I have noticed that broom returns the scaled coefficients for lm.ridge objects. E.g.:
library(MASS)
names(longley)[1] <- "y"
fit <- lm.ridge(y ~ ., longley, lambda = seq(0.001, .05, .001))
library(broom)
td <- tidy(fit)
head(td)
## lambda GCV term estimate
## 1 0.001 0.1240 GNP 23.02
## 2 0.002 0.1217 GNP 21.27
## 3 0.003 0.1205 GNP 19.88
## 4 0.004 0.1199 GNP 18.75
## 5 0.005 0.1196 GNP 17.80
## 6 0.006 0.1196 GNP 16.99
head(coef(fit))
## GNP Unemployed Armed.Forces Population Year Employed
## 0.001 1895.97527 0.2392348 0.03100610 0.009372158 -1.643803 -0.87657471 0.10560725
## 0.002 1166.33337 0.2209952 0.02719073 0.008243201 -1.565026 -0.50108472 0.03029054
## 0.003 635.78843 0.2066111 0.02440554 0.007514565 -1.496246 -0.22885815 -0.01475570
## 0.004 236.65772 0.1948539 0.02230066 0.007043302 -1.434886 -0.02473192 -0.04056629
## 0.005 -71.53274 0.1849806 0.02066688 0.006744636 -1.379323 0.13231532 -0.05366319
## 0.006 -314.43247 0.1765137 0.01937157 0.006565392 -1.328460 0.25560068 -0.05811937
In broom's code, there is something to extract the scales as well, but that's only if one value for lambda is given. Information about scales is lost for a range of lambdas (with > 1 elements):
tidy.ridgelm <- function(x, ...) {
  if (is.numeric(x$x2)) {
    # only one choice of lambda
    ret <- data.frame(lambda = x$lambda, term = names(x$coef),
                      estimate = x$coef,
                      scale = x$scales, xm = x$xm)
    return(unrowname(ret))
  }
  # otherwise, multiple lambdas/coefs/etc, have to tidy
  co <- data.frame(t(x$coef), lambda = x$lambda, GCV = x$GCV)
  cotidy <- tidyr::gather(co, term, estimate, -lambda, -GCV)
  cotidy
}
I think the if statement in the above code should be something like if (length(x$lambda) == 1) to catch the case of only one value for lambda (not a range with 2 or more values). There does not seem to be a value/vector x2 in an lm.ridge object (x$x2). Also: do we need to extract the column means (x$xm)? That is also only done for the case where the lambda range has length 1.
As I am new to ridge regression, I am not sure if the scaled or unscaled coefficients should be returned by broom. However, given that coef(lm.ridge_object) returns the unscaled coefficients (with lambda=0 you get the (unscaled) OLS estimates), I think broom should return the unscaled ones too?
[Also there is a NULL in code line 36 in ridgelm_tidiers.R.]
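To make the scaled/unscaled distinction concrete, here is a small sketch using only components of the lm.ridge object that appear in the code above (coef, scales, lambda). It assumes MASS is installed and is not a proposed patch: dividing the stored coefficients by the scales recovers the slopes that coef() reports.

```r
library(MASS)

dat <- longley
names(dat)[1] <- "y"
fit <- lm.ridge(y ~ ., dat, lambda = seq(0.001, 0.05, 0.001))

# The check suggested above for the single-lambda case
single_lambda <- length(fit$lambda) == 1

# fit$coef holds coefficients on the scaled predictors (one column per lambda);
# dividing each row by its scale puts them back on the original scale
scaled <- t(fit$coef)
unscaled <- t(fit$coef / fit$scales)

# coef() adds an intercept column; the remaining columns match the unscaled slopes
slopes_from_coef <- coef(fit)[, -1]
all.equal(unname(unscaled), unname(slopes_from_coef))
```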
Hi.
oneway.test output (htest S3 class) contains 2 values for the parameter (df). As a result we get two rows instead of one.
> tidy(oneway.test(extra ~ group, data = sleep))
statistic p.value parameter
1 3.463 0.07939 1.00
2 3.463 0.07939 17.78
Warning
In data.frame(statistic = 3.46262676078045, p.value = 0.0793941401873581, :
row names were found from a short variable and have been discarded
Calls: tidy ... as.data.frame -> as.data.frame.list -> eval -> eval -> data.frame
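One way to keep the result on a single row, sketched in base R, is to spread the two degrees of freedom into separate columns. The column names num.df and denom.df are assumptions for illustration, not necessarily what broom uses.

```r
ht <- oneway.test(extra ~ group, data = sleep)

# ht$parameter has two elements: numerator and denominator degrees of freedom
tidy_row <- data.frame(statistic = unname(ht$statistic),
                       p.value = ht$p.value,
                       num.df = unname(ht$parameter[1]),
                       denom.df = unname(ht$parameter[2]))
tidy_row
```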
Hi,
according to the help page augment does this:
Given an R statistical model or other non-tidy object, add columns to the original dataset such as predictions, residuals and cluster assignments.
But even if you provide the original data, augment (at least for lm) will return the dataset after na.omit, i.e. it is a reduced dataset without NA-rows.
I would prefer if augment really did augment the original data (with NA for the other rows) but it would be nice if you at least changed the documentation to make clear that it returns the reduced data. :)
Ciao,
Stefan
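Until augment handles this, a base R workaround is to fit with na.action = na.exclude, so that fitted() and residuals() are padded with NA for the dropped rows and line up with the original data. A minimal sketch, with missing values planted in mtcars purely for illustration:

```r
dat <- mtcars
dat$wt[c(2, 5)] <- NA  # introduce missing predictor values

# na.exclude drops the rows for fitting but remembers where they were
fit <- lm(mpg ~ wt, data = dat, na.action = na.exclude)

# fitted() and residuals() now return one value per original row, NA where dropped
augmented <- cbind(dat, .fitted = fitted(fit), .resid = residuals(fit))
nrow(augmented)
```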
Applying tidy to the results of anova on more than one model yields a data frame where all the column names are off by one (and the last column has name NA):
m1 <- lm(mpg ~ wt + qsec + disp, mtcars)
m2 <- lm(mpg ~ wt, mtcars)
a <- anova(m1, m2)
Now print(a) gives
Analysis of Variance Table
Model 1: mpg ~ wt + qsec + disp
Model 2: mpg ~ wt
Res.Df RSS Df Sum of Sq F Pr(>F)
1 28 195.46
2 30 278.32 -2 -82.859 5.9348 0.007099 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
but tidy(a) gives
df sumsq meansq statistic p.value NA
1 28 195.4626 NA NA NA NA
2 30 278.3219 -2 -82.85933 5.934796 0.007099496
It looks like tidy.anova assumes a has five columns, but it has six; the method misses the third column of a (the Df column). Apologies for not putting together a pull request, and thanks for this fantastic package.
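A sketch of a more defensive approach: rename columns by looking up their existing names rather than assigning a fixed-length vector by position. The target names (res.df, rss and so on) are assumptions chosen for illustration and this is not the actual tidy.anova code.

```r
m1 <- lm(mpg ~ wt + qsec + disp, mtcars)
m2 <- lm(mpg ~ wt, mtcars)
a <- anova(m1, m2)

# Map each known anova column name to a tidy name
renamer <- c("Res.Df" = "res.df", "RSS" = "rss", "Df" = "df",
             "Sum of Sq" = "sumsq", "F" = "statistic", "Pr(>F)" = "p.value")

tidy_a <- as.data.frame(a)
known <- names(tidy_a) %in% names(renamer)
# Only rename columns we recognise; anything else keeps its original name
names(tidy_a)[known] <- unname(renamer[names(tidy_a)[known]])
tidy_a
```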
It would be nice if broom functions could be applied to rms class models from Frank Harrell's rms package. It is a pivotal package providing fantastic summary functions in the field of clinical prediction modelling.
Thanks
I don't get a column of residuals; I only get the standardized residuals, not the residuals as in the example.
When I follow the example on the github page:
lmfit <- lm(mpg ~ wt, mtcars)
head(augment(lmfit))
I get this:
.rownames | mpg | wt | .fitted | .se.fit | .hat | .sigma | .cooksd | .std.resid |
---|---|---|---|---|---|---|---|---|
Mazda RX4 | 21.0 | 2.620 | 23.28261 | 0.6335798 | 0.0432690 | 3.067494 | 0.0132741 | -0.7661677 |
Mazda RX4 Wag | 21.0 | 2.875 | 21.91977 | 0.5714319 | 0.0351968 | 3.093068 | 0.0017240 | -0.3074305 |
Datsun 710 | 22.8 | 2.320 | 24.88595 | 0.7359177 | 0.0583757 | 3.072127 | 0.0154394 | -0.7057525 |
Hornet 4 Drive | 21.4 | 3.215 | 20.10265 | 0.5384424 | 0.0312502 | 3.088268 | 0.0030206 | 0.4327511 |
Hornet Sportabout | 18.7 | 3.440 | 18.90014 | 0.5526562 | 0.0329218 | 3.097722 | 0.0000760 | -0.0668188 |
Valiant | 18.1 | 3.460 | 18.79325 | 0.5552829 | 0.0332355 | 3.095184 | 0.0009211 | -0.2314831 |
sessionInfo()
R version 3.2.0 (2015-04-16)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.10.3 (Yosemite)
locale:
[1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8
attached base packages:
[1] grid stats graphics grDevices
[5] utils datasets methods base
other attached packages:
[1] knitr_1.9 broom_0.3.7
[3] gridExtra_0.9.1 dplyr_0.4.1.9000
[5] ggplot2_1.0.1 GGally_0.5.0
loaded via a namespace (and not attached):
[1] Rcpp_0.11.6 git2r_0.10.1
[3] formatR_1.1 plyr_1.8.1
[5] bitops_1.0-6 tools_3.2.0
[7] rpart_4.1-9 digest_0.6.8
[9] memoise_0.2.1 evaluate_0.7
[11] gtable_0.1.2 psych_1.5.1
[13] shiny_0.11.1.9004 DBI_0.3.1
[15] yaml_2.1.13 parallel_3.2.0
[17] proto_0.3-10 httr_0.6.1
[19] stringr_1.0.0 rversions_1.0.0
[21] devtools_1.8.0 reshape_0.8.5
[23] R6_2.0.1 XML_3.98-1.1
[25] rmarkdown_0.6.1 reshape2_1.4.1
[27] tidyr_0.2.0.9000 magrittr_1.5
[29] scales_0.2.4 htmltools_0.2.6
[31] MASS_7.3-40 assertthat_0.1
[33] mnormt_1.5-2 mime_0.3
[35] xtable_1.7-4 colorspace_1.2-6
[37] httpuv_1.3.2 labeling_0.3
[39] stringi_0.5-1 RCurl_1.95-4.6
[41] munsell_0.4.2
packageVersion("broom")
[1] ‘0.3.7’
Am I doing something wrong?
Because the bootstrap function produces a grouped_df I expected it to work with dplyr::summarize as below. Apparently not. Could you give an example?
mtcars %>%
group_by(vs) %>%
summarize(mean = mean(mpg))
Source: local data frame [2 x 2]
vs mean
1 0 16.61667
2 1 24.55714
mtcars %>%
bootstrap(10) %>%
summarize(mean = mean(mpg))
Error: corrupt 'grouped_df', contains %d rows, and %s rows in groups
You can explore the branch here:
https://github.com/joranE/broom/tree/quantreg
I will split up the documentation for the rq, rqs and nlrq methods at some point; didn't realize it would be that long.
Could you please have a look?
checking examples ... ERROR
Running examples in ‘broom-Ex.R’ failed
The error most likely occurred in:
> base::assign(".ptime", proc.time(), pos = "CheckExEnv")
> ### Name: felm_tidiers
> ### Title: Tidying methods for models with multiple group fixed effects
> ### Aliases: augment.felm felm_tidiers glance.felm tidy.felm
>
> ### ** Examples
>
> if (require("lfe", quietly = TRUE)) {
+ N=1e2
+ DT <- data.frame(
+ id = sample(5, N, TRUE),
+ v1 = sample(5, N, TRUE),
+ v2 = sample(1e6, N, TRUE),
+ v3 = sample(round(runif(100,max=100),4), N, TRUE),
+ v4 = sample(round(runif(100,max=100),4), N, TRUE)
+ )
+
+ result_felm <- felm(v2~v3, DT)
+ tidy(result_felm)
+ augment(result_felm)
+ result_felm <- felm(v2~v3|id+v1, DT)
+ tidy(result_felm, fe = TRUE)
+ augment(result_felm)
+ v1<-DT$v1
+ v2 <- DT$v2
+ v3 <- DT$v3
+ id <- DT$id
+ result_felm <- felm(v2~v3|id+v1)
+ tidy(result_felm)
+ augment(result_felm)
+ glance(result_felm)
+ }
Warning in FUN(newX[, i], ...) : non-factor id coerced to factor
Warning in FUN(newX[, i], ...) : non-factor v1 coerced to factor
Error: data_frames can not contain data.frames, matrices or arrays
Execution halted
I was going to dip my toe gently into exploring broom by trying to add an augment method for loess objects, but I'm not sure how you'd prefer that to be done.
Initially, it seemed like a simple matter of adding an S3 method augment.loess that just called augment_columns, along the lines of augment.lm. But in keeping with base R's sometimes inconsistent argument naming, predict.loess uses the argument se rather than se.fit.
It's easy enough to write a completely self-contained augment.loess function, but it wasn't clear to me looking at the rest of the package whether you preferred that type of approach, or whether you prefer code that's a bit more centralized.
lm() supports building column-wise models, but broom fails to clean them up.
a = matrix(1:20, nrow=10, ncol=2)
b = a + rnorm(length(a))
result = lm(b~a)
broom::tidy(result)
# Error in 1:ncol(co) : argument of length 0
#2: tidy.lm(lm(b ~ a))
#1: broom::tidy(lm(b ~ a))
Note that the output format needs to be adjusted to accommodate multiple models, either by adding columns for the indices or by reporting the term with index.
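As a sketch of the first option (an extra column identifying the response), base R already splits a multi-response fit into one summary per response, which can then be stacked. The random predictors and the column name response are assumptions for illustration.

```r
set.seed(42)
a <- matrix(rnorm(20), nrow = 10, ncol = 2)
b <- a + rnorm(length(a))
fit <- lm(b ~ a)

# summary() on a multi-response fit returns one summary.lm per response column
per_response <- lapply(summary(fit), function(s) {
  co <- coef(s)
  data.frame(term = rownames(co),
             estimate = co[, "Estimate"],
             std.error = co[, "Std. Error"],
             statistic = co[, "t value"],
             p.value = co[, "Pr(>|t|)"],
             row.names = NULL, stringsAsFactors = FALSE)
})

# Stack the per-response tables, labelling each with its response name
tidy_mlm <- do.call(rbind, Map(function(name, df) cbind(response = name, df),
                               names(per_response), per_response))
rownames(tidy_mlm) <- NULL
tidy_mlm
```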
Thanks a lot for this package. It would be great if you can incorporate the coeftest class from the lmtest package. Thank you.
I think an interesting direction would be for xtable / stargazer to support the output of tidy and glance. For now, these packages don't use multiple dispatch and they support a relatively small set of statistical commands. But tidy and glance correspond exactly to the kind of operations needed to consistently print results from different models.
If this is implemented, for a given S3 object, writing a tidy and glance method would then directly make it compatible with xtable/stargazer.
I think the only element missing to print a table from tidy and glance is the name of the dependent variable.
I'm working on an R package of my own, which will provide functions to knit scientific manuscripts.
To this end, I'm building convenience functions that assemble text strings from analysis objects, such as htest or summary.lm etc. I am considering using broom to tidy those objects. The only thing that is currently keeping me from using your package is the fact that when I tidy objects I lose all information about what the estimates actually are (differences of means, means of differences, correlation coefficients, etc.); the same is, of course, true for other columns of the tidied output.
Are there any plans to add this information to tidied data.frames? I think retaining this information would be helpful for other purposes and programming literacy in general.
An unobtrusive way of doing this would be to simply add attributes to the data.frame from the original object (adding a row to the data.frame would be another possibility). I've thrown something together for an object from t.test() to illustrate what I mean (this should generalize to other htest objects with some minor adaptations):
> t_test <- t.test(extra ~ group, data = sleep)
> tidy_t_test <- tidy(t_test)
> tidy_t_test
estimate estimate1 estimate2 statistic p.value parameter conf.low conf.high
1 -1.58 0.75 2.33 -1.860813 0.07939414 17.77647 -3.365483 0.2054832
> vars <- lapply(t_test, attr, "names")
> vars <- vars[!unlist(lapply(vars, is.null))]
> conf_level <- attr(t_test$conf.int, "conf.level") * 100
> conf_levels <- paste(c((100 - conf_level) / 2, 100 - (100 - conf_level) / 2), "%")
> attr(tidy_t_test, "vars") <- c(vars$null.value, vars$estimate, vars$statistic, "p.value", vars$parameter, conf_levels)
> str(tidy_t_test)
'data.frame': 1 obs. of 8 variables:
$ estimate : num -1.58
$ estimate1: num 0.75
$ estimate2: num 2.33
$ statistic: num -1.86
$ p.value : num 0.0794
$ parameter: num 17.8
$ conf.low : num -3.37
$ conf.high: num 0.205
- attr(*, "vars")= chr "difference in means" "mean in group 1" "mean in group 2" "t" ...
> attr(tidy_t_test, "vars")
[1] "difference in means" "mean in group 1" "mean in group 2" "t" "p.value" "df" "2.5 %"
[8] "97.5 %"
Readme.md (front page of the repo) does not reflect all available tidiers. At the least, the tidiers for plm (package plm) and lm.ridge (package MASS) could be added. I think this is really interesting information for people checking whether their model is supported by broom.
Suppose fitManova is the result of a manova analysis.
tidy(fitManova, test = "Wilks")
will return Wilks's Lambda but the label will still be Pillai's trace.
(I'm putting this here more as "Hey, I want to remember to do this later," rather than "Hey, you should do this." Consider this a means to solicit feedback prior to a future pull request.)
I frequently make plots/tables of the results of multiple models simultaneously, which entails wrangling them all into one data frame. It would be nice to automate the process of combining them, and the tidy generic provides a nice way of doing that.
Hypothetical output for the "straighten" function I'm imagining:
R> set.seed(94)
R> dat <- data.frame(y = rbinom(100, 1, 0.5), x = rnorm(100))
R> fit_lm <- lm(y ~ x, data = dat)
R> fit_glm <- glm(y ~ x, data = dat, family = binomial)
R> straighten(linear = fit_lm, logit = fit_glm)
model term estimate stderror statistic p.value
1 linear (Intercept) 0.520188 0.050336 10.33437 2.2787e-17
2 linear x -0.037276 0.051917 -0.71798 4.7448e-01
3 logit (Intercept) 0.081331 0.200702 0.40523 6.8531e-01
4 logit x -0.150389 0.208490 -0.72132 4.7071e-01
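A base R sketch of how such a function could be put together. It reads coefficient tables with coef(summary()) rather than calling tidy, so it only covers models whose summary has a four-column coefficient matrix (an assumption that holds for lm and glm); the name straighten comes from the proposal above and everything else is illustrative.

```r
straighten <- function(...) {
  fits <- list(...)
  rows <- Map(function(name, fit) {
    # Coefficient matrix: estimate, standard error, test statistic, p-value
    co <- coef(summary(fit))
    data.frame(model = name, term = rownames(co),
               estimate = co[, 1], std.error = co[, 2],
               statistic = co[, 3], p.value = co[, 4],
               row.names = NULL, stringsAsFactors = FALSE)
  }, names(fits), fits)
  out <- do.call(rbind, rows)
  rownames(out) <- NULL
  out
}

set.seed(94)
dat <- data.frame(y = rbinom(100, 1, 0.5), x = rnorm(100))
fit_lm <- lm(y ~ x, data = dat)
fit_glm <- glm(y ~ x, data = dat, family = binomial)
combined <- straighten(linear = fit_lm, logit = fit_glm)
combined
```

Swapping the coef(summary()) step for a call to tidy would extend this to every class broom supports.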
for consistency
First off, thank you for this package. I'm using it quite frequently.
I'm currently working on a paper where I use the Matching package to do some Genetic Matching. The output is an object that I am unable to coerce to a data frame, limiting my ability to do even something simple like report the findings in a table.
https://cran.r-project.org/web/packages/Matching/Matching.pdf
I know, it is a slightly different use case, but: I just wanted to extract the RHS of a formula in a data.frame rowwise and broom came to my mind. As a formula is neither an estimated model nor are there numbers/data to be extracted/displayed, do you think it makes sense to have a tidier to transform a formula's RHS in a "tidy" format?
In my pre-broom code, I just plugged the model fits (like R^2 of (p)lm models) in a separate column with the same value in each line, ie for each coefficient as it is just one value for the whole model. Sure, this is not tidy data.
Would be nice to have a command to extract all model fit indicators in a tidy format (one column per indicator and just one line). For comparing different models this would be really handy as one can just rbind those tidy outputs. This idea could be related to the list discussion in #40. What do you think?
Good work - can you lay down a version tag?
I ran a poisson regression model (log link) and received this error when using tidy:
Warning message: In tidy.lm(dt_poisson, exponentiate = TRUE) : Exponentiating coefficients, but model did not use a logit link function
The link function is below, and is clearly "log". Perhaps you could include both logit and log in the error handling?
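A sketch of the suggested check, reading the link from the fitted model's family object as shown in the str() output in this report. The count data are made up for illustration and the accepted set of links (log and logit) follows the suggestion above rather than any broom code.

```r
# Small Poisson regression on made-up count data
counts <- c(18, 17, 15, 20, 10, 20, 25, 13, 12)
outcome <- gl(3, 1, 9)
treatment <- gl(3, 3)
fit <- glm(counts ~ outcome + treatment, family = poisson())

link <- family(fit)$link

# Exponentiating is meaningful for log links (rate ratios) and logit links (odds ratios)
if (!link %in% c("log", "logit")) {
  warning("Exponentiating coefficients, but model did not use a log or logit link")
}
rate_ratios <- exp(coef(fit))
rate_ratios
```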
$ family :List of 12
..$ family : chr "poisson"
..$ link : chr "log"
..$ linkfun :function (mu)
..$ linkinv :function (eta)
I have to say, I was impressed that the package even considered that people might exponentiate incorrectly and provides a warning. That is great.
broom looks really great overall. I have been thinking for a long time about updating my coefplot2 package (on R-forge, but somewhat defunct there), and it looks like broom could serve as the back-end, but I have a bunch of questions and ideas about extensions. Sorry that the following is so long, but I hope I'm starting a useful discussion. Please feel free to ignore these ideas if they're too awkward or can't be implemented without breaking compatibility too badly, but I do think they could be useful things to think about.
arm::coefplot knows about bugs, glm, lm, and polr objects (showMethods("coefplot"))
coefplot2 uses a coeftab() method to extract coefficient tables: in addition to the classes that coefplot handles, coefplot2 knows about glmmadmb, glmmML, mcmc, MCMCglmm, rjags, merMod, and mer (old lme4). Right now it extracts lower/upper confidence intervals and standard deviations if they're available; it only gets the Wald intervals for merMod objects, leaving the interval/uncertainty estimates out for the random effects parameters. It uses
ctype: (character) one of "quad" (quadratic/Wald), "profile" (likelihood profile), "quantile" (quantiles of posterior density), "HPDinterval" (highest posterior density)
ptype: parameter type: one or more of "fixef" (default: fixed effect parameters), "ranef" (posterior modes of random effects), "vcov" (parameters of random effects variance-covariance matrices)
If some of the interface issues can be sorted out I'd be happy to contribute methods for all of these.
In general there are a lot of different kinds of coefficients one might want to retrieve from a model. I'm going to focus on two issues, parameter types and confidence intervals.
At the moment, the tidy method for merMod objects offers the choice of "fixed" or "random". The "fixed" case makes sense, but the other possibilities get a little bit interesting. My own most common use case for random-effects parameters is to want to retrieve the standard deviations and correlations of the random effects, not the estimated values of the coefficients for each level of the grouping variable(s). Thus I would implement something like
library("lme4")
fm1 <- lmer(Reaction ~ Days + (Days | Subject), sleepstudy)
aa <- as.data.frame(VarCorr(fm1))
aa2 <- aa[1:(nrow(aa)-1),] ## skip Residual variance
## lots of choices here; lme4:::tnames could be exported if necessary
termnames <- lme4:::tnames(fm1,old=FALSE,prefix=c("sd","cor"))
data.frame(terms=termnames,estimate=aa2[,"sdcor"],std.error=NA,statistic=NA)
##
## terms estimate std.error statistic
## Subject.(Intercept) 24.74044761 NA NA
## Subject.Days 5.92213324 NA NA
## Subject.(Intercept).Days 0.06555134 NA NA
There are more decisions/possible user inputs here:
match.arg() won't work any more
coef() was more uniform across packages and model types,
Another useful place to look is at the coef() method for hurdle and zeroinfl models from the pscl package: from ?predict.hurdle,
coef(object, model = c("full", "count", "zero"), ...)
(where these refer to parameters determining the probability of a structural zero, the expected number of counts in non-structural-zero cases, or both)
I guess the question is how much can be passed through tidy as optional arguments to coef vs. trying to give the tidy methods a unified interface ...
While many model types have well-defined and useful standard errors, others don't. In particular the Wald standard errors for GLMs can be very unreliable; profile confidence intervals are more reliable. The same is often true for random effects parameters. It would be nice to have the option to incorporate lower and upper confidence intervals in a tidy data frame, although it's a little hard to know how to get this -- would you pass a confidence-interval data frame or a likelihood profile? (For most models including glm objects it's easy and lightweight to recompute the profile confidence intervals via confint(), but for merMod objects this can be an expensive operation ...)
One thing to think about here is that the default column names that R uses for confidence intervals (2.5 %, 97.5 %) are a nuisance -- I usually use lwr and upr, although these don't specify the level -- maybe lwr_0.025, upr_0.975?
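To make the naming question concrete, here is a minimal sketch using an lm fit on mtcars. The lwr_/upr_ suffix scheme is just the suggestion above, and the object names (fit, tidy_ci) are placeholders for illustration, not anything broom provides.

```r
# Sketch only: attach confidence limits to a coefficient table under
# level-specific column names instead of R's default "2.5 %" / "97.5 %".
fit <- lm(mpg ~ wt, data = mtcars)

level <- 0.95
ci <- confint(fit, level = level)           # matrix with "2.5 %", "97.5 %" columns
probs <- c((1 - level) / 2, 1 - (1 - level) / 2)

tidy_ci <- data.frame(
  term     = rownames(ci),
  estimate = unname(coef(fit)),
  lwr      = ci[, 1],
  upr      = ci[, 2],
  row.names = NULL,
  stringsAsFactors = FALSE
)

# Encode the level in the column names, e.g. lwr_0.025 and upr_0.975
names(tidy_ci)[3:4] <- paste0(c("lwr_", "upr_"), sprintf("%.3f", probs))
tidy_ci
```

Names like these are syntactically valid in R, so they can be used in formulas and with $ without backquoting, which is the main nuisance with the defaults.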
There are sometimes additional decisions: for Bayesian models, should highest posterior density (coda::HPDinterval) or quantiles of the marginal posterior be used?
Some things that I've wanted coefplot2 to be able to do:
arm::coefplot does this, I think, but only
As shown in this question, when do is used on an ungrouped data frame, it cannot be used with rowwise_df_tidiers:
mtcars %>% do(model = lm(mpg ~ wt, .))
#> Source: local data frame [1 x 1]
#>
#> model
#> 1 <S3:lm>
mtcars %>% do(model = lm(mpg ~ wt, .)) %>% glance(model)
#> Error in stats::complete.cases(x): invalid 'type' (list) of argument
This is because the class of the resulting tbl_df is not a rowwise_df, but simply a tbl_df and data.frame.
To fix this, I'll add a special case for tidy.tbl_df that checks if it's one row, and if so, applies the tidy.rowwise_df operation.
Would it be possible to enhance the documentation of tidy? The current documentation gives
Usage
tidy(x, ...)
Arguments
x An object to be converted into a tidy data.frame
... extra arguments
What the extra arguments might be isn't described anywhere I can find. I know of conf.int and exponentiate from issue #5; are there others?
I was performing some regressions similar to the example below when I noticed that p.adjust does not seem to do anything to p.values after a glance() has been called.
library(dplyr)
library(broom)
data(mtcars)
mtcars %>%
group_by(gear) %>%
do(mod = lm(mpg~wt, data=.)) %>%
glance(mod) %>%
mutate(fdr=p.adjust(p.value, method="fdr")) %>%
select(p.value, fdr)
$ Source: local data frame [3 x 3]
$ Groups: gear
$
$ gear p.value fdr
$ 1 3 0.0006048395 0.0006048395
$ 2 4 0.0010104804 0.0010104804
$ 3 5 0.0012815520 0.0012815520
Am I missing something here?
$ version R version 3.1.2 (2014-10-31)
$ system x86_64, darwin13.4.0
$ broom 0.3.6 2015-02-18 CRAN (R 3.1.2)
$ dplyr 0.4.1 2015-01-14 CRAN (R 3.1.2)
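A likely explanation, going by the "Groups: gear" line in the output above: the result is still grouped by gear, so mutate() calls p.adjust once per group, each time on a single p-value, and adjusting one p-value on its own returns it unchanged. The base-R sketch below reproduces that effect with the three p-values from the output; the variable names are placeholders. In the dplyr pipeline, inserting ungroup() before mutate() should have the same effect as the "all together" call here, though I have not run it against the versions listed.

```r
# p-values copied from the output above
p <- c(0.0006048395, 0.0010104804, 0.0012815520)

# What happens inside one-row groups: each p-value is adjusted alone (n = 1),
# so nothing changes.
one_at_a_time <- vapply(p, p.adjust, numeric(1), method = "fdr")

# What was presumably intended: adjust across all three tests at once.
all_together <- p.adjust(p, method = "fdr")

data.frame(p.value = p, one_at_a_time = one_at_a_time, all_together = all_together)
```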
Add tidiers for objects from the vegan package for community ecology.
Currently broom::tidy.gam() seems to work only for the gam package, not for mgcv. The latter does not have summary(x)$parametric.anova, but a data frame could be constructed from the output of the summary anyway.
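A rough sketch of what that could look like, assuming the p.table (parametric coefficients) and s.table (smooth terms) components of mgcv's summary.gam() output. The output column names and the object names are my own choice for illustration and are not an existing broom method.

```r
library(mgcv)

fit <- gam(mpg ~ s(wt) + am, data = mtcars)
s <- summary(fit)

# Parametric terms: Estimate, Std. Error, t value, Pr(>|t|)
parametric <- data.frame(term = rownames(s$p.table), unname(s$p.table),
                         row.names = NULL, stringsAsFactors = FALSE)
names(parametric) <- c("term", "estimate", "std.error", "statistic", "p.value")

# Smooth terms: edf, Ref.df, F, p-value
smooth_terms <- data.frame(term = rownames(s$s.table), unname(s$s.table),
                           row.names = NULL, stringsAsFactors = FALSE)
names(smooth_terms) <- c("term", "edf", "ref.df", "statistic", "p.value")

parametric
smooth_terms
```

The two tables have different columns, so a tidy method would need to either return them separately or pad the missing columns with NA.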
FYI: when using tidy with pairwise.t.test, the following code gives an error:
library(broom)
dat <- mtcars
dat$vs <- as.factor(mtcars$vs)
res <- pairwise.t.test(dat[,"mpg"], dat[,"vs"])
tidy(res)
Error in eval(expr, envir, enclos) : object 'group1' not found
Recoding the levels seems to 'fix' the issue.
vs <- as.character(mtcars$vs)
vs[vs == 1] <- "manual"
vs[vs == 0] <- "automatic"
dat$vs2 <- as.factor(vs)
res <- pair
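Separately from the error above, the p-value matrix stored in a pairwise.t.test result can be reshaped by hand in base R. This is only a workaround sketch, not how broom does it; the column names group1, group2 and p.value are chosen to mirror the error message, and cyl is used instead of vs so that there is more than one comparison.

```r
res <- pairwise.t.test(mtcars$mpg, factor(mtcars$cyl))

# res$p.value is a lower-triangular matrix with NA in the unused cells;
# as.table() + as.data.frame() turns it into one row per cell.
long <- as.data.frame(as.table(res$p.value), stringsAsFactors = FALSE)
names(long) <- c("group1", "group2", "p.value")

# Drop the NA padding so only the actual comparisons remain
long <- long[!is.na(long$p.value), ]
long
```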
Really like Broom. Thanks!
Feature request: handling plm() objects (panel data regression) would be great. I wrote my own extraction function (adapted from an lm() example I found online) and used the following commands to extract information (self-explanatory, I guess):
t(x[, "Estimate" ])
t(x[, "Pr(>|t|)" ])
t(x[ , "Std. Error" ])
t(x[ , "t-value" ])
dim(x$coefficients)[1] # number of coefficients
x$r.squared[1] # R^2
x$r.squared[2] # adj. R^2
sum(x$residuals^2) # sum of squares
x$fstatistic[[4]] # p-value F statistic
When the formula contains terms such as : or *, it would be nice to split term into two columns. For instance
tidy(lm(y ~ x:as.factor(z)))
gives
term
(Intercept)
x*as.factor(z):1
x*as.factor(z):2
x*as.factor(z):3
but it could give two supplementary columns
term
(Intercept) NA NA
x*as.factor(z):1 x 1
x*as.factor(z):2 x 2
x*as.factor(z):3 x 3
This would allow plotting the estimate as a function of z, for instance.
I'm not sure what's the best way to do it without regex.
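One regex-free possibility, sketched below under the assumption that interaction terms are joined by a literal ":" and that the factor's name and levels are available from the fit's xlevels component. It uses mtcars rather than the y/x/z example, and the column names var1, var2 and level are made up for illustration.

```r
fit <- lm(mpg ~ wt:factor(cyl), data = mtcars)
term_names <- names(coef(fit))

# Split on the literal ":" (fixed = TRUE, so no regex is involved)
parts <- strsplit(term_names, ":", fixed = TRUE)
pick <- function(i) {
  vapply(parts, function(p) if (length(p) == 2) p[i] else NA_character_,
         character(1))
}
split_terms <- data.frame(term = term_names, var1 = pick(1), var2 = pick(2),
                          stringsAsFactors = FALSE)

# Strip the factor's name from var2 to leave only the level,
# using the name recorded in fit$xlevels rather than pattern matching.
fac <- names(fit$xlevels)[1]
has_fac <- !is.na(split_terms$var2) & startsWith(split_terms$var2, fac)
split_terms$level <- NA_character_
split_terms$level[has_fac] <- substring(split_terms$var2[has_fac], nchar(fac) + 1)
split_terms
```

This only handles two-way terms with a single factor; variable names that themselves contain ":" would need a different approach.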
Right now some of the behavior is handled differently across functions. Should have standardized test cases. Also needs to ensure it works with or without row names.