tidymodels / broom

Convert statistical analysis objects from R into tidy format

Home Page: https://broom.tidymodels.org

License: Other

R 100.00%

r tidy-data modeling

broom's People

meyera arturochian xtmgah snewhouse andremikulec montrealrusergroup chmue nishantsbi kismsu cabiling carlesla fboehm alcideschaux eemaa26 njkhan786 arijitnayak ashander pjpan cybernetics jorane parthasen kalebka55 zeehio junkka jiho bayesybrad lwjohnst86 rmc2 gavinsimpson andrewjlm sinhrks codetasks pkq ahoho jayhesselberth lselzer jrnold aayala15 joelgombin nathania larmarange muuksi jasonabr jimhester rtaph mbojan jlegewie tjmahr nistara andrewpbray paulhendricks jaredlander lepennec keniajin brainiarc7 stopcontrol englianhu mdlincoln kghub duytran16 jbiesanz omarbenites drsimonj jkgrain lingtax rlugojr marcuswalz stillmatic solertis m-sostero ashleylester talgalili erinacrandall patr1ckm remkoduursma rentrop kevinykuo mdancho84 mnel hughjonesd xmur gregce danli-ds germanium amarendaralwala89 puterleat stefanfritsch bensoltoff njtierney mcdussault apreshill muntasirmasum atantos eipi10 mrkem598 mjskay corybrunson anishsingh20 lemna dchiu911

broom's Issues

lmerMod?

Would you consider expanding this package to work on objects of type lmerMod (from lme4)?

Major revision: adding augment and glance S3 methods

I've added two additional S3 methods along with the original tidy. Each of them tidies data from a model, but in a different way. The new methods are

augment: add columns to the original data, containing (e.g.) predictions and residuals
glance: construct a one-row data.frame summary, containing (e.g.) R^2 or deviance

I've added a new branch, augment-glance with this change. It can be installed with

library(devtools)
install_github("dgrtwo/broom", ref = "augment-glance")

The new vignettes explain these three methods and how they interrelate.
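
As a rough illustration of the three levels these methods target, here is a base-R sketch that builds comparable outputs by hand for a linear model. It is not broom's own code, and the column names are only illustrative.

```r
# Base-R sketch of coefficient-, observation- and model-level output.
fit <- lm(mpg ~ wt, data = mtcars)

# Coefficient level (the kind of table tidy() produces): one row per term.
ctab <- coef(summary(fit))
coef_level <- data.frame(term      = rownames(ctab),
                         estimate  = ctab[, "Estimate"],
                         std.error = ctab[, "Std. Error"],
                         row.names = NULL)

# Observation level (what augment() adds): original data plus new columns.
obs_level <- data.frame(mtcars[, c("mpg", "wt")],
                        .fitted = fitted(fit),
                        .resid  = residuals(fit))

# Model level (what glance() reports): a single row per model.
model_level <- data.frame(r.squared = summary(fit)$r.squared,
                          deviance  = deviance(fit))
```

The point of keeping the three levels apart is that each has a different number of rows, so they cannot share one tidy data frame.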

Before I merge this into the main branch I'd be very interested in any feedback. While the package is very new, I would like advice from those interested in using it. (For instance, are augment and glance helpful names for these functions? Do the vignettes make the difference clear? Which methods should be implemented next?)

lapplying tidy() to aovlist-objects

I have had some trouble tidying aovlist objects. When performing a repeated measures analysis of variance, the output object is usually a list of aov objects for the different error strata.

datafilename <- "http://personality-project.org/r/datasets/R.appendix5.data"
aov_example <- read.table(datafilename, header = TRUE)
aov_example <- aov(Recall ~ (Task*Valence*Gender*Dosage) + Error(Subject/(Task*Valence)) +(Gender*Dosage), aov_example)
class(aov_example)

[1] "aovlist" "listof"

(example stolen from http://personality-project.org/r/#anova)

To date, it is not possible to tidy these objects directly because fix_data_frame() doesn't take lists.

library("broom")
tidy(aov_example)

Error in as.data.frame.default(x) :
cannot coerce class "c("aovlist", "listof")" to a data.frame

So I tried to use lapply(), but this throws an error:

lapply(aov_example, tidy)

Error in colnames<-(*tmp*, value = c("df", "sumsq", "meansq", "statistic", :
Attribute 'names' [5] must be of same length as vector [3]

Is there an easy fix for this? It would be convenient if tidy() accepted aovlist-objects.
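
Until tidy() handles aovlist objects, one possible workaround is to go through summary(), which returns one ANOVA table per error stratum, and to fill in the F and p-value columns only where a stratum has them. The sketch below uses the built-in npk data rather than the dataset above; tidy_stratum() is a hypothetical helper, not part of broom.

```r
# Sketch: tidy each error stratum of an aovlist by hand (base R only).
fit <- aov(yield ~ N * P * K + Error(block), data = npk)
strata <- unclass(summary(fit))   # named list, one summary per stratum

# Hypothetical helper: columns missing from a stratum become NA.
tidy_stratum <- function(stratum, tab) {
  pick <- function(col) if (col %in% names(tab)) tab[[col]] else NA_real_
  data.frame(stratum   = stratum,
             term      = trimws(rownames(tab)),
             df        = pick("Df"),
             sumsq     = pick("Sum Sq"),
             meansq    = pick("Mean Sq"),
             statistic = pick("F value"),
             p.value   = pick("Pr(>F)"),
             stringsAsFactors = FALSE)
}

pieces <- Map(function(nm, s) tidy_stratum(nm, s[[1]]), names(strata), strata)
tidied <- do.call(rbind, pieces)
rownames(tidied) <- NULL
```

Looking columns up by name instead of assigning five names positionally avoids the length mismatch reported above for strata that contain only a residual row.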

Including number of observations used to build the model

I think it would be useful if some version of the data.frame that represents the result also included a column with the number of observations that were used to build the model.

An easy way to access would be using e.g. nobs() for stats::lm(), but I'm sure other models have similar reporting of this.

It would be even more useful if it could include the actual number of observations for logicalTRUE or factorLEVEL.
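
A minimal base-R sketch of both ideas, assuming treatment-coded dummy variables: nobs() gives the total, and column sums of the model matrix give the count behind each 0/1 term. The column names n.term and n.obs are made up for illustration.

```r
# Sketch: attach observation counts to a coefficient table.
fit <- lm(mpg ~ factor(cyl), data = mtcars)

n_total <- nobs(fit)              # observations used in the fit
X <- model.matrix(fit)

# Counts only make sense for 0/1 columns (intercept, factor levels, logicals).
is_dummy <- apply(X, 2, function(col) all(col %in% c(0, 1)))
n_term <- colSums(X)
n_term[!is_dummy] <- NA

result <- data.frame(term     = names(coef(fit)),
                     estimate = unname(coef(fit)),
                     n.term   = unname(n_term[names(coef(fit))]),
                     n.obs    = n_total)
```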

warn user if lm() produced coefficients with NA (ill-specified model)

If an ill-specified model, in this case one with linearly dependent columns, is estimated by lm(), lm() will just drop the respective coefficient from estimation and return NA as the estimate for that coefficient. Normally the user would notice the mistake from this. But if the user does not look at the raw output of lm() (or summary(lm_object)), or has a script running, there is no notification (the script is likely to fail at some other point). If one uses broom::tidy() without looking at the original output of lm(), as I often do because I like the layout, the ill-specified model is hidden from the user.

Can we have a warning for such cases?

See the example below for lm() and also for plm() (from package plm):

library(plm)
data(Grunfeld, package="plm")
pGrunfeld <- pdata.frame(Grunfeld, index = c("firm", "year"), drop.index = FALSE)

# make a duplicate column (i.e. a linearly dependent column)
Grunfeld$capital2 <- Grunfeld$capital 
pGrunfeld$capital2 <- pGrunfeld$capital

# lm model
mod_lm <- lm(inv ~ value + capital + capital2, data=Grunfeld)
mod_lm # NA coefficient is displayed
broom::tidy(mod_lm) # NA coefficient is silently dropped, user is not aware of his mistake

# plm model
mod_plm <- plm(inv ~ value + capital + capital2, data=pGrunfeld, model="pooling")
mod_plm # NA is not displayed
broom::tidy(mod_plm) # fails (outside of broom), which is somehow good as the user gets notified about the ill-specified model
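
One way such a warning could look, sketched in base R without plm: check coef() for NA before tidying. check_na_coefs() is a hypothetical helper name, and the example uses mtcars with a duplicated column instead of the Grunfeld data.

```r
# Sketch: warn when a fit contains coefficients that could not be estimated.
check_na_coefs <- function(fit) {          # hypothetical helper
  co <- coef(fit)
  dropped <- names(co)[is.na(co)]
  if (length(dropped) > 0) {
    warning("Coefficients not estimated (NA): ",
            paste(dropped, collapse = ", "), call. = FALSE)
  }
  invisible(dropped)
}

d <- mtcars
d$wt2 <- d$wt                              # linearly dependent column
fit <- lm(mpg ~ wt + wt2, data = d)

dropped <- suppressWarnings(check_na_coefs(fit))
```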

Error message from tidy(coxphFit)

I'm getting the following error message when I try to tidy a coxph fit from the survival package:

Error in seq_len(nrow(x)) : 
  argument must be coercible to non-negative integer
In addition: Warning message:
In seq_len(nrow(x)) : first element used of 'length.out' argument

Here is code to replicate the problem. Note that I first confirm that tidy works on a regression fit and then attempt it on a coxph fit:

library(survival)
library(broom)

# Create data frame for the survival analysis
demoDat <- structure(list(daysFromInvestigation = c(1L, 6L, 7L, 5L, 5L, 
2L, 3L, 1L, 2L, 5L, 1L, 4L, 5L, 7L, 3L, 1L, 6L, 4L, 2L, 6L, 6L, 
7L, 7L, 6L, 4L, 2L, 2L, 4L, 0L, 5L, 5L, 4L, 4L, 1L, 7L, 7L, 1L, 
2L, 0L, 7L, 6L, 7L, 1L, 1L, 2L, 2L, 2L, 1L, 7L, 1L, 2L, 3L, 1L, 
1L, 2L), Sex = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 
2L, 2L, 1L, 2L, 1L, 1L, 2L, 2L, 2L, 1L, 2L, 2L, 1L, 2L, 2L, 2L, 
2L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 
1L, 2L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 2L), .Label = c("F  ", 
"M  "), class = "factor"), dead = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1)), row.names = c(NA, -55L), class = "data.frame", .Names = c("daysFromInvestigation", 
"Sex", "dead"))


# To be sure that I've got the packages loaded and things work, demonstrate with a glm regression
demoFit <- glm(daysFromInvestigation ~ Sex, data=demoDat)
print(demoFit)
tidy(demoFit)

# Works great

# Now do coxph 
demoCoxFit <- coxph(Surv(demoDat$daysFromInvestigation, demoDat$dead) ~ Sex, data = demoDat)

print(demoCoxFit)

Which produces the following:

Call:
coxph(formula = Surv(demoDat$daysFromInvestigation, demoDat$dead) ~ 
    Sex, data = demoDat)


       coef exp(coef) se(coef)     z    p
SexM    0.2      1.22     0.34 0.588 0.56

Likelihood ratio test=0.36  on 1 df, p=0.549  n= 55, number of events= 55

But when I run tidy on the fit, I get the error message at the beginning of this issue report

tidy(demoCoxFit)
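
The cause of the error is not established here. As a stopgap, the coefficient matrix stored in summary() of a coxph fit can be turned into a data frame by hand. This sketch uses the lung data shipped with survival rather than demoDat, and the output column names are only a guess at what a tidier would use.

```r
# Sketch: build a coefficient table from summary.coxph by hand.
library(survival)

fit <- coxph(Surv(time, status) ~ sex, data = lung)
ctab <- summary(fit)$coefficients   # matrix: coef, exp(coef), se(coef), z, Pr(>|z|)

tidied <- data.frame(term      = rownames(ctab),
                     estimate  = ctab[, "coef"],
                     std.error = ctab[, "se(coef)"],
                     statistic = ctab[, "z"],
                     p.value   = ctab[, "Pr(>|z|)"],
                     row.names = NULL)
```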

augment() for plm models does not calculate fitted values correctly

Due to an implementation of predict() for panel models, which I cannot follow, augment() with panel models calculates wrong fitted values. In fact, the values are only wrong for within and random models, as they have individual intercepts. There is an implementation for fitted(), but it is not exported and (I think) it produces strange values.

I think we need a workaround here. Here is my test file to demonstrate the numbers:

[see below for a pull request for that:
4ce62e7 and 97cb53f should be a new pull request for #73]

require(broom)
require(plm)
data("Grunfeld")

# Use lm() of pooled OLS and fixed effects
lm_pool <- lm(inv ~ value + capital, data = Grunfeld)
lm_fe   <- lm(inv ~ value + capital + factor(firm), data = Grunfeld)

# Use plm() for pooled OLS and fixed effects
plm_pool <- plm(inv ~ value + capital, data=Grunfeld, model = "pooling", index=c("firm", "year"))
plm_fe   <- plm(inv ~ value + capital, data=Grunfeld, model = "within", index=c("firm", "year"))
plm_re   <- plm(inv ~ value + capital, data=Grunfeld, model = "random", index=c("firm", "year"))

# calculate augmented data.frame
aug_lm       <- augment(lm_pool)
aug_plm_pool <- augment(plm_pool)
aug_plm_fe   <- augment(plm_fe)
aug_plm_re   <- augment(plm_re)

# Is column .fitted correct?
all(abs((aug_plm_pool$.fitted + aug_plm_pool$.resid) - model.frame(plm_pool)[, 1]) < 0.000000001)
all(abs((aug_plm_fe$.fitted   + aug_plm_fe$.resid)   - model.frame(plm_fe)[, 1])   < 0.000000001)
all(abs((aug_plm_re$.fitted   + aug_plm_re$.resid)   - model.frame(plm_re)[, 1])   < 0.000000001)

# calculate fitted values "by hand"
pred_plm_pool <- model.frame(plm_pool)[, 1] - residuals(plm_pool)
pred_plm_fe   <- model.frame(plm_fe)[, 1]   - residuals(plm_fe)
pred_plm_re   <- model.frame(plm_re)[, 1]   - residuals(plm_re)

# test hand calculated fitted values
all(abs((pred_plm_pool   + aug_plm_pool$.resid) - model.frame(plm_pool)[, 1]) < 0.000000001)
all(abs((pred_plm_fe     + aug_plm_fe$.resid)   - model.frame(plm_fe)[, 1])   < 0.000000001)
all(abs((pred_plm_re     + aug_plm_re$.resid)   - model.frame(plm_re)[, 1])   < 0.000000001)

could not find function "is"

Deep in the internals of building the ggplot2 book, I get:

Error in tidy.lm(.[[object]][[1]], ...) : could not find function "is"
Calls: render_chapter ... do_ -> do_.grouped_df -> eval -> eval -> func -> tidy.lm
Execution halted

Any ideas? It's from a line like coefs <- models %>% tidy(mod) where mod is a linear model.
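
One common cause of this message, though not confirmed for this build, is that is() lives in the methods package, which Rscript did not attach by default before R 3.5.0. A sketch of the two usual remedies:

```r
# Sketch: make is() available in non-interactive sessions.
library(methods)                 # remedy 1: attach methods explicitly in the script

fit <- lm(mpg ~ wt, data = mtcars)
via_attach <- is(fit, "lm")

# Remedy 2 (inside a package): call it with its namespace, or import methods.
via_namespace <- methods::is(fit, "lm")
```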

Is this the package for a trimming function to sit in?

There's a good article on yhat: Reducing your R memory footprint by 7000x.

Do you think this is the package for a generic trimming function to sit in? If so, I'd be happy to help.

fix_data_frame tramples an input df that has a colname == 'a'

If an input data.frame to fix_data_frame has a column named "a", it will get renamed to a.1:

library(broom)
df <- data.frame(a=1:10, b=letters[1:10])
rownames(df) <- tail(LETTERS, 10)

(fdf <- fix_data_frame(df))
##    term a.1 b
## 1     Q   1 a
## 2     R   2 b
## 3     S   3 c
## ...

Redefining fix_data_frame to something like this can fix it.

fix_data_frame <- function(x, newnames=NULL, newcol="term") {
    if (is.character(newnames)) {
       ## Shouldn't we check that length(newnames) == ncol(x) ?
       x <- setNames(x, newnames)
    }
    if (all(rownames(x) == seq_len(nrow(x)))) {
        # don't need to move rownames into a new column
        ret <- data.frame(x, stringsAsFactors = FALSE)
    }
    else {
        if (newcol %in% names(x)) {
          nc <- tail(make.names(c(names(x), newcol), unique=TRUE), 1L)
          warning("The column name for the rownames already exists, \n  ",
                  "the new column name will be: ", nc)
          newcol <- nc
        }
        ret <- data.frame(a=rownames(x), stringsAsFactors=FALSE)
        names(ret) <- newcol
        ret <- cbind(ret, x)
    }
    broom:::unrowname(ret)
}

which will provide

fix_data_frame(df)
##    term  a b
## 1     Q  1 a
## 2     R  2 b
## 3     S  3 c

I can provide a patch that can be fixed up if you agree.

Note, I changed some of the rest of the function to simplify it. Also curious if we shouldn't check that the newnames param is an appropriate length.

request method for S3 object of class mer, package lme4.0

This package is a great idea! Unfortunately for me it doesn't work with the model class I currently work with the most, S3 objects of class mer from package lme4.0 and all versions of lme4 prior to 1.1-7. Would be really great if it did.... I might try taking a crack at it myself, but could take quite a while.

support `optim()` output?

This package is a brilliant idea; I do this manually all the time these days. Have you taken a look at supporting the output from optim()? Seems like it would be relatively straightforward to return a data frame with a column for each parameter in par, the minimized value, convergence, etc.

Sometimes one wants to rbind such data.frames from different models that have different numbers of parameters (e.g. particularly nested models, where parameters are fixed in simpler versions), so that would probably still be left up to the user. Still, optim is widely used and this would be a great way to facilitate comparing across models, as you've already illustrated with less generic models.
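
A minimal sketch of what such a one-row summary might look like, using only base R. tidy_optim() is a hypothetical name, and the toy objective is chosen only because its minimum is known.

```r
# Sketch: flatten optim() output into a one-row data frame.
sq_loss <- function(p) (p[1] - 1)^2 + (p[2] - 2)^2   # minimum at a = 1, b = 2

res <- optim(c(a = 0, b = 0), sq_loss, method = "BFGS")

tidy_optim <- function(res) {                        # hypothetical helper
  data.frame(as.list(res$par),                       # one column per parameter
             value       = res$value,
             convergence = res$convergence,
             n.fn        = res$counts[["function"]])
}

out <- tidy_optim(res)
```

Because the parameter columns take their names from par, results from models with different parameter sets would need their columns aligned before rbind, which is the part left to the user above.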

multiple lhs

felm supports multiple lhs. This estimates the model to each lhs, speeding up computations compared to an lapply loop. Should broom methods support multiple lhs for felm too?
tidy would row_bind results for multiple lhs, augment would create more columns (.fitted_nameofvar1, .fitted_nameofvar2 etc), and glance would create a data.frame with nrow = length(lhs).

I want to write a pull request, but wanted to check with you first.

nlme and geepack

Would you consider expanding this package to work on objects of class lme (from nlme) and objects of type geeglm (from geepack)?

Exponentiated values

Really like the idea for this package. For those working in health research, results of logistic regression models are normally presented in exponentiated form with 95% confidence intervals rather than the coefficient and standard error. It would be great if there were an option in the tidy() function, or a separate function, that would produce this output.
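
For reference, this is roughly the table being asked for, computed by hand in base R. The intervals here are Wald intervals from confint.default(); profile-likelihood intervals would be another choice, and the column names are illustrative.

```r
# Sketch: odds ratios with 95% Wald confidence intervals for a logistic model.
fit <- glm(am ~ wt, data = mtcars, family = binomial)

ci <- confint.default(fit)        # intervals on the log-odds scale

or_table <- data.frame(term       = names(coef(fit)),
                       odds.ratio = exp(unname(coef(fit))),
                       conf.low   = exp(unname(ci[, 1])),
                       conf.high  = exp(unname(ci[, 2])))
```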

tidy lavaan output

Hi!
It would be cool to have tidy output for the lavaan-package (https://github.com/yrosseel/lavaan) which is used for structural equation modeling and cfa.
Thanks =)

Updated workflow figure or source

@dgrtwo, I like your slide explaining the modern R workflow:

I would like to make a few modifications for an upcoming presentation:

all black
addition of readr and readxl

Do you have the source for the image, so I could make the changes? Or would you be interested in releasing the updated slide? Will attribute, of course.

unscaled output of tidy() for lm.ridge

Hi,

I have noticed, broom returns the scaled coefficients for lm.ridge objects. E.g.:


library(MASS)
names(longley)[1] <- "y"
fit <- lm.ridge(y ~ ., longley, lambda = seq(0.001, .05, .001))
library(broom)
td <- tidy(fit)
head(td)
##   lambda    GCV term estimate
## 1  0.001 0.1240  GNP    23.02
## 2  0.002 0.1217  GNP    21.27
## 3  0.003 0.1205  GNP    19.88
## 4  0.004 0.1199  GNP    18.75
## 5  0.005 0.1196  GNP    17.80
## 6  0.006 0.1196  GNP    16.99

head(coef(fit))
##                       GNP Unemployed Armed.Forces Population        Year    Employed
## 0.001 1895.97527 0.2392348 0.03100610  0.009372158  -1.643803 -0.87657471  0.10560725
## 0.002 1166.33337 0.2209952 0.02719073  0.008243201  -1.565026 -0.50108472  0.03029054
## 0.003  635.78843 0.2066111 0.02440554  0.007514565  -1.496246 -0.22885815 -0.01475570
## 0.004  236.65772 0.1948539 0.02230066  0.007043302  -1.434886 -0.02473192 -0.04056629
## 0.005  -71.53274 0.1849806 0.02066688  0.006744636  -1.379323  0.13231532 -0.05366319
## 0.006 -314.43247 0.1765137 0.01937157  0.006565392  -1.328460  0.25560068 -0.05811937

In broom's code, there is something to extract the scales as well, but that's only if one value for lambda is given. Information about scales is lost for a range of lambdas (with > 1 elements):


tidy.ridgelm <- function(x, ...) {
    if (is.numeric(x$x2)) {
        # only one choice of lambda
        ret <- data.frame(lambda = x$lambda, term = names(x$coef),
                          estimate = x$coef,
                          scale = x$scales, xm = x$xm)
        return(unrowname(ret))
    }

    # otherwise, multiple lambdas/coefs/etc, have to tidy
    co <- data.frame(t(x$coef), lambda = x$lambda, GCV = x$GCV)
    cotidy <- tidyr::gather(co, term, estimate, -lambda, -GCV)

    cotidy
}

I think the if-statement in the above code should be something like if (length(x$lambda) == 1) to catch the case of only one value for lambda (not a range with 2 or more values). There does not seem to be a value/vector x2 in an lm.ridge object (x$x2). Also: do we need to extract the column means (x$xm)? That is also only done when a single lambda is given.

As I am new to ridge regression, I am not sure if the scaled or unscaled coefficients should be returned by broom. However, given that coef(lm.ridge_object) returns the unscaled coefficients (with lambda=0 you get the (unscaled) OLS estimates), I think broom should return the unscaled ones too?
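
The relation between the two sets of numbers can be checked directly: as far as I can tell from the coef method in MASS, the unscaled slopes are the stored coefficients divided by the stored scales. A short sketch on the same longley fit:

```r
# Sketch: recover unscaled ridge slopes from the scaled ones stored in the fit.
library(MASS)

d <- longley
names(d)[1] <- "y"
fit <- lm.ridge(y ~ ., data = d, lambda = seq(0.001, 0.05, 0.001))

scaled   <- t(fit$coef)                  # lambda x term, as tidy() reports them
unscaled <- t(fit$coef / fit$scales)     # lambda x term, on the data scale

# coef() additionally prepends the intercept as its first column.
slopes_from_coef <- coef(fit)[, -1]
```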

[Also there is a NULL in code line 36 in ridgelm_tidiers.R.]

Compatibility with oneway.test

Hi.

The oneway.test output (htest S3 class) contains 2 values for the parameter (df). As a result we get two rows instead of one.

> tidy(oneway.test(extra ~ group, data = sleep))
  statistic p.value parameter
1     3.463 0.07939      1.00
2     3.463 0.07939     17.78
Warning
In data.frame(statistic = 3.46262676078045, p.value = 0.0793941401873581,  :
  row names were found from a short variable and have been discarded
Calls: tidy ... as.data.frame -> as.data.frame.list -> eval -> eval -> data.frame
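
One possible shape for a one-row result is to split the two-element parameter into separate columns. A base-R sketch; the names num.df and den.df are an assumption, not what broom uses.

```r
# Sketch: one row for an htest whose parameter has two elements.
ow <- oneway.test(extra ~ group, data = sleep)

out <- data.frame(statistic = unname(ow$statistic),
                  num.df    = ow$parameter[["num df"]],
                  den.df    = ow$parameter[["denom df"]],
                  p.value   = ow$p.value)
```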

Augment shouldn't drop NA rows

Hi,

according to the help page augment does this:

Given an R statistical model or other non-tidy object, add columns to the original dataset such as predictions, residuals and cluster assignments.

But even if you provide the original data, augment (at least for lm) will return the dataset after na.omit, i.e. it is a reduced dataset without NA-rows.

I would prefer if augment really did augment the original data (with NA for the other rows) but it would be nice if you at least changed the documentation to make clear that it returns the reduced data. :)
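
For lm at least, base R already offers the padding behaviour through na.action = na.exclude, which makes fitted() and residuals() return NA for the dropped rows. A sketch of the idea, independent of how augment() is implemented:

```r
# Sketch: keep all original rows, with NA where the model could not use them.
d <- mtcars
d$wt[c(2, 5)] <- NA

fit <- lm(mpg ~ wt, data = d, na.action = na.exclude)

aug <- data.frame(d[, c("mpg", "wt")],
                  .fitted = fitted(fit),      # padded to nrow(d)
                  .resid  = residuals(fit))   # padded to nrow(d)
```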

Ciao,
Stefan

tidy method for anova misses a column name

Applying tidy to the results of anova on more than one model yields a data frame where all the column names are off by one (and the last column has name NA):

m1 <- lm(mpg ~ wt + qsec + disp, mtcars)
m2 <- lm(mpg ~ wt, mtcars)
a <- anova(m1, m2)

Now print(a) gives

Analysis of Variance Table

Model 1: mpg ~ wt + qsec + disp
Model 2: mpg ~ wt
  Res.Df    RSS Df Sum of Sq      F   Pr(>F)   
1     28 195.46                                
2     30 278.32 -2   -82.859 5.9348 0.007099 **

---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

but tidy(a) gives

  df    sumsq meansq statistic  p.value          NA
1 28 195.4626     NA        NA       NA          NA
2 30 278.3219     -2 -82.85933 5.934796 0.007099496

It looks like tidy.anova assumes a has five columns, but it has six; the method misses the third column of a (the Df column). Apologies for not putting together a pull request, and thanks for this fantastic package.
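
A sketch of a name-based renaming that would not depend on the number of columns. The target names in the lookup are illustrative choices, not necessarily the ones broom settled on.

```r
# Sketch: rename anova columns by name instead of by position.
m1 <- lm(mpg ~ wt + qsec + disp, mtcars)
m2 <- lm(mpg ~ wt, mtcars)
a <- anova(m1, m2)

lookup <- c("Res.Df" = "res.df", "RSS" = "rss", "Df" = "df",
            "Sum of Sq" = "sumsq", "F" = "statistic", "Pr(>F)" = "p.value")

tidied <- as.data.frame(a)
known <- names(tidied) %in% names(lookup)
names(tidied)[known] <- unname(lookup[names(tidied)[known]])
```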

tidy output for rms models

It would be nice if broom functions could be applied to models from Frank Harrell's rms package. It is a pivotal package providing fantastic summary functions in the field of clinical prediction modelling.

Thanks

no .resid column

I don't get a column of residuals (.resid) as in the example; I only get the standardized residuals.

When I follow the example on the github page:

lmfit <- lm(mpg ~ wt, mtcars)
head(augment(lmfit))

I get this:

.rownames	mpg	wt	.fitted	.se.fit	.hat	.sigma	.cooksd	.std.resid
Mazda RX4	21.0	2.620	23.28261	0.6335798	0.0432690	3.067494	0.0132741	-0.7661677
Mazda RX4 Wag	21.0	2.875	21.91977	0.5714319	0.0351968	3.093068	0.0017240	-0.3074305
Datsun 710	22.8	2.320	24.88595	0.7359177	0.0583757	3.072127	0.0154394	-0.7057525
Hornet 4 Drive	21.4	3.215	20.10265	0.5384424	0.0312502	3.088268	0.0030206	0.4327511
Hornet Sportabout	18.7	3.440	18.90014	0.5526562	0.0329218	3.097722	0.0000760	-0.0668188
Valiant	18.1	3.460	18.79325	0.5552829	0.0332355	3.095184	0.0009211	-0.2314831

sessionInfo()
R version 3.2.0 (2015-04-16)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.10.3 (Yosemite)

locale:
[1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8

attached base packages:
[1] grid      stats     graphics  grDevices
[5] utils     datasets  methods   base     

other attached packages:
[1] knitr_1.9        broom_0.3.7     
[3] gridExtra_0.9.1  dplyr_0.4.1.9000
[5] ggplot2_1.0.1    GGally_0.5.0    

loaded via a namespace (and not attached):
 [1] Rcpp_0.11.6       git2r_0.10.1     
 [3] formatR_1.1       plyr_1.8.1       
 [5] bitops_1.0-6      tools_3.2.0      
 [7] rpart_4.1-9       digest_0.6.8     
 [9] memoise_0.2.1     evaluate_0.7     
[11] gtable_0.1.2      psych_1.5.1      
[13] shiny_0.11.1.9004 DBI_0.3.1        
[15] yaml_2.1.13       parallel_3.2.0   
[17] proto_0.3-10      httr_0.6.1       
[19] stringr_1.0.0     rversions_1.0.0  
[21] devtools_1.8.0    reshape_0.8.5    
[23] R6_2.0.1          XML_3.98-1.1     
[25] rmarkdown_0.6.1   reshape2_1.4.1   
[27] tidyr_0.2.0.9000  magrittr_1.5     
[29] scales_0.2.4      htmltools_0.2.6  
[31] MASS_7.3-40       assertthat_0.1   
[33] mnormt_1.5-2      mime_0.3         
[35] xtable_1.7-4      colorspace_1.2-6 
[37] httpuv_1.3.2      labeling_0.3     
[39] stringi_0.5-1     RCurl_1.95-4.6   
[41] munsell_0.4.2

packageVersion("broom")
[1] ‘0.3.7’

Am I doing something wrong?

Using the bootstrap function

Because the bootstrap function produces a grouped_df I expected it to work with dplyr::summarize as below. Apparently not. Could you give an example?

mtcars %>%
group_by(vs) %>%
summarize(mean = mean(mpg))
Source: local data frame [2 x 2]

vs mean
1 0 16.61667
2 1 24.55714

mtcars %>%
bootstrap(10) %>% 
summarize(mean = mean(mpg))
Error: corrupt 'grouped_df', contains %d rows, and %s rows in groups
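
While the interaction with summarize is sorted out, the same quantity can be computed in base R. This sketch draws ten resamples of mpg and records the mean of each; it does not use broom's bootstrap() at all.

```r
# Sketch: bootstrap the mean of mpg with base R.
set.seed(1)

boot_means <- replicate(10, mean(sample(mtcars$mpg, replace = TRUE)))

boot_df <- data.frame(replicate = seq_along(boot_means),
                      mean      = boot_means)
```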

Experimental quantile regression support

You can explore the branch here:

https://github.com/joranE/broom/tree/quantreg

I will split up the documentation for the rq, rqs and nlrq methods at some point; didn't realize it would be that long.

R CMD check fails with dev dplyr

Could you please have a look?

checking examples ... ERROR
Running examples in ‘broom-Ex.R’ failed
The error most likely occurred in:

> base::assign(".ptime", proc.time(), pos = "CheckExEnv")
> ### Name: felm_tidiers
> ### Title: Tidying methods for models with multiple group fixed effects
> ### Aliases: augment.felm felm_tidiers glance.felm tidy.felm
> 
> ### ** Examples
> 
> if (require("lfe", quietly = TRUE)) {
+     N=1e2
+     DT <- data.frame(
+       id = sample(5, N, TRUE),
+       v1 =  sample(5, N, TRUE),
+       v2 =  sample(1e6, N, TRUE),
+       v3 =  sample(round(runif(100,max=100),4), N, TRUE),
+       v4 =  sample(round(runif(100,max=100),4), N, TRUE)
+     )
+ 
+     result_felm <- felm(v2~v3, DT)
+     tidy(result_felm)
+     augment(result_felm)
+     result_felm <- felm(v2~v3|id+v1, DT)
+     tidy(result_felm, fe = TRUE)
+     augment(result_felm)
+     v1<-DT$v1
+     v2 <- DT$v2
+     v3 <- DT$v3
+     id <- DT$id
+     result_felm <- felm(v2~v3|id+v1)
+     tidy(result_felm)
+     augment(result_felm)
+     glance(result_felm)
+ }
Warning in FUN(newX[, i], ...) : non-factor id coerced to factor
Warning in FUN(newX[, i], ...) : non-factor v1 coerced to factor
Error: data_frames can not contain data.frames, matrices or arrays
Execution halted

augment for loess object

I was going to dip my toe gently into exploring broom by trying to add an augment method for loess objects, but I'm not sure how you'd prefer that to be done.

Initially, it seemed like a simple matter of adding an S3 method augment.loess that just called augment_columns, along the lines of augment.lm. But, in keeping with base R's sometimes inconsistent argument naming, predict.loess uses the argument se rather than se.fit.

It's easy enough to write a completely self-contained augment.loess function, but it wasn't clear to me looking at the rest of the package whether you preferred that type of approach, or whether you prefer code that's a bit more centralized.
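
For concreteness, the self-contained variant could be as small as the sketch below. It only shows the se versus se.fit difference and is not meant as the final method; the column names follow the augment.lm convention.

```r
# Sketch: a self-contained augment-style helper for loess fits.
fit <- loess(dist ~ speed, data = cars)

# predict.loess takes `se`, not `se.fit`, but the result has a $se.fit element.
p <- predict(fit, se = TRUE)

aug <- data.frame(cars,
                  .fitted = as.numeric(p$fit),
                  .se.fit = as.numeric(p$se.fit),
                  .resid  = as.numeric(residuals(fit)))
```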

broom fails for simple `lm()`

lm() supports building column-wise models, but broom fails to clean them up.

a = matrix(1:20, nrow=10, ncol=2)
b = a + rnorm(length(a))
result = lm(b~a)
broom::tidy(result)
# Error in 1:ncol(co) : argument of length 0
#2: tidy.lm(lm(b ~ a))
#1: broom::tidy(lm(b ~ a))

Note that the output format needs to be adjusted to accommodate multiple models, either by adding columns for the indices or by reporting the term with index.
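
A sketch of the "add a column for the index" option: for a multi-response lm, coef() is a term-by-response matrix, which can be unrolled into long form with a response column. The example uses mtcars instead of the random matrices above so that the result is reproducible.

```r
# Sketch: long-format coefficients for a multi-response lm ("mlm").
fit <- lm(cbind(mpg, disp) ~ wt, data = mtcars)

co <- coef(fit)                   # rows = terms, columns = responses

long <- data.frame(response = rep(colnames(co), each = nrow(co)),
                   term     = rep(rownames(co), times = ncol(co)),
                   estimate = as.vector(co))   # column-major, matches the reps
```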

tidy output for coeftest

Thanks a lot for this package. It would be great if you can incorporate the coeftest class from the lmtest package. Thank you.

Compatibility with stargazer or xtable

I think an interesting direction would be for xtable / stargazer to support the output of tidy and glance. For now, these packages don't use method dispatch and they support a relatively small set of statistical commands. But tidy and glance correspond exactly to the kind of operations needed to consistently print results from different models.

If this is implemented, for a given S3 object, writing a tidy and glance method would then directly make it compatible with xtable/stargazer.

I think the only element missing to print a table from tidy and glance is the name of the dependent variable.

Retaining statistics names

I'm working on an R package of my own, which will provide functions to knit scientific manuscripts.

To this end, I'm building convenience functions that assemble text strings from analysis objects, such as htest or summary.lm etc. I am considering using broom to tidy those objects. The only thing currently keeping me from using your package is the fact that when I tidy objects I lose all information about what the estimates actually are (differences of means, means of differences, correlation coefficients, etc.). The same is, of course, true for other columns of the tidied output.

Are there any plans to add this information to tidied data.frames? I think retaining this information would be helpful for other purposes and programming literacy in general.

An unobtrusive way of doing this would be to simply add attributes to the data.frame from the original object (adding a row to the data.frame would be another possibility). I've thrown something together for an object from t.test() to illustrate what I mean (this should generalize to other htest objects with some minor adaptations):

> t_test <- t.test(extra ~ group, data = sleep)    
> tidy_t_test <- tidy(t_test)
> tidy_t_test

  estimate estimate1 estimate2 statistic    p.value parameter  conf.low conf.high
1    -1.58      0.75      2.33 -1.860813 0.07939414  17.77647 -3.365483 0.2054832

> vars <- lapply(t_test, attr, "names")
> vars <- vars[!unlist(lapply(vars, is.null))]
> conf_level <- attr(t_test$conf.int, "conf.level") * 100
> conf_levels <- paste(c((100 - conf_level) / 2, 100 - (100 - conf_level) / 2), "%")

> attr(tidy_t_test, "vars") <- c(vars$null.value, vars$estimate, vars$statistic, "p.value", vars$parameter, conf_levels)
> str(tidy_t_test)

'data.frame':  1 obs. of  8 variables:
$ estimate : num -1.58
$ estimate1: num 0.75
$ estimate2: num 2.33
$ statistic: num -1.86
$ p.value  : num 0.0794
$ parameter: num 17.8
$ conf.low : num -3.37
$ conf.high: num 0.205
- attr(*, "vars")= chr  "difference in means" "mean in group 1" "mean in group 2" "t" ...

> attr(tidy_t_test, "vars")
[1] "difference in means" "mean in group 1" "mean in group 2" "t" "p.value" "df" "2.5 %"              
[8] "97.5 %"

Update Readme.md to reflect all available tidiers

Readme.md (front page of the repo) does not reflect all available tidiers. At least the tidiers for plm (package plm) and lm.ridge (package MASS) could be added. I think this is really useful information for people checking whether their model is supported by broom.

`tidy.manova` table always has pillai's trace label even when other statistics are requested

suppose fitManova is the result of a manova analysis
tidy(fitManova, test = "Wilks") will return Wilks's Lambda but the label will still be Pillai's trace.
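
The label could be taken from the summary itself, since the statistics matrix names its second column after the requested test. A base-R sketch on iris; stat_label() is a hypothetical helper.

```r
# Sketch: read the test-statistic label off summary.manova's stats matrix.
fit <- manova(cbind(Sepal.Length, Petal.Length) ~ Species, data = iris)

stat_label <- function(fit, test) {       # hypothetical helper
  stats <- summary(fit, test = test)$stats
  setdiff(colnames(stats), c("Df", "approx F", "num Df", "den Df", "Pr(>F)"))
}

wilks_label  <- stat_label(fit, "Wilks")
pillai_label <- stat_label(fit, "Pillai")
```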

Tidy multiple models at once

(I'm putting this here more as "Hey, I want to remember to do this later," rather than "Hey, you should do this." Consider this a means to solicit feedback prior to a future pull request.)

I frequently make plots/tables of the results of multiple models simultaneously, which entails wrangling them all into one data frame. It would be nice to automate the process of combining them, and the tidy generic provides a nice way of doing that.

Hypothetical output for the "straighten" function I'm imagining:

R> set.seed(94)
R> dat <- data.frame(y = rbinom(100, 1, 0.5), x = rnorm(100))
R> fit_lm <- lm(y ~ x, data = dat)
R> fit_glm <- glm(y ~ x, data = dat, family = binomial)
R> straighten(linear = fit_lm, logit = fit_glm)
   model        term  estimate stderror statistic    p.value
1 linear (Intercept)  0.520188 0.050336  10.33437 2.2787e-17
2 linear           x -0.037276 0.051917  -0.71798 4.7448e-01
3  logit (Intercept)  0.081331 0.200702   0.40523 6.8531e-01
4  logit           x -0.150389 0.208490  -0.72132 4.7071e-01
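
A rough base-R sketch of how such a function might work, to make the idea concrete. It reads the coefficient matrix from summary() by position, which happens to work for lm and glm but is an assumption for other model classes; a real version would presumably call tidy() instead.

```r
# Sketch of the proposed straighten(): stack coefficient tables with a model label.
straighten <- function(...) {             # hypothetical, as proposed above
  fits <- list(...)
  pieces <- Map(function(name, fit) {
    ctab <- coef(summary(fit))            # estimate, SE, statistic, p-value
    data.frame(model     = name,
               term      = rownames(ctab),
               estimate  = ctab[, 1],
               std.error = ctab[, 2],
               statistic = ctab[, 3],
               p.value   = ctab[, 4],
               row.names = NULL, stringsAsFactors = FALSE)
  }, names(fits), fits)
  out <- do.call(rbind, pieces)
  rownames(out) <- NULL
  out
}

set.seed(94)
dat <- data.frame(y = rbinom(100, 1, 0.5), x = rnorm(100))
fit_lm  <- lm(y ~ x, data = dat)
fit_glm <- glm(y ~ x, data = dat, family = binomial)
combined <- straighten(linear = fit_lm, logit = fit_glm)
```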

stderror (in the tidy output) should be renamed std.error

for consistency

Request to include class "Match" objects

First off, thank you for this package. I'm using it quite frequently.

I'm currently working on a paper where I use the Matching package to do some Genetic Matching. The output is an object that I am unable to coerce to a data frame, which limits my ability to do even something simple like report the findings in a table.

https://cran.r-project.org/web/packages/Matching/Matching.pdf

tidier for formula?

I know, it is a slightly different use case, but: I just wanted to extract the RHS of a formula in a data.frame rowwise and broom came to my mind. As a formula is neither an estimated model nor are there numbers/data to be extracted/displayed, do you think it makes sense to have a tidier to transform a formula's RHS in a "tidy" format?
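
For what it's worth, base R's terms() already exposes the expanded right-hand side, so a formula tidier could be a thin wrapper around it. A sketch with an illustrative formula:

```r
# Sketch: one row per right-hand-side term of a formula.
f <- y ~ a + b * c
tt <- terms(f)

rhs <- data.frame(term  = attr(tt, "term.labels"),
                  order = attr(tt, "order"),     # 1 = main effect, 2 = two-way, ...
                  stringsAsFactors = FALSE)
```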

What about model fits?

In my pre-broom code, I just plugged the model fits (like R^2 of (p)lm models) in a separate column with the same value in each line, ie for each coefficient as it is just one value for the whole model. Sure, this is not tidy data.

Would be nice to have a command to extract all model fit indicators in a tidy format (one column per indicator and just one line). For comparing different models this would be really handy as one can just rbind those tidy outputs. This idea could be related to the list discussion in #40. What do you think?

Version tags

Good work - can you lay down a version tag?

warning in glm on poisson model

I ran a Poisson regression model (log link) and received this warning when using tidy:
Warning message: In tidy.lm(dt_poisson, exponentiate = TRUE) : Exponentiating coefficients, but model did not use a logit link function

The link function is below, and is clearly "log". Perhaps you could accept both logit and log in this check?
$ family :List of 12
..$ family : chr "poisson"
..$ link : chr "log"
..$ linkfun :function (mu)
..$ linkinv :function (eta)

I have to say, I was impressed that the package even considered that people might exponentiate incorrectly and provides a warning. That is great.
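
A sketch of what the relaxed check could look like, written against base R only. The set of accepted links is an assumption; which links make exponentiation meaningful is a judgment call.

```r
# Sketch: only warn about exponentiating when the link is neither logit nor log.
fit <- glm(breaks ~ wool + tension, data = warpbreaks, family = poisson)

link <- family(fit)$link
ok_links <- c("logit", "log")             # assumed set of acceptable links

if (!link %in% ok_links) {
  warning("Exponentiating coefficients, but model did not use a log or logit link")
}

rate_ratios <- exp(coef(fit))             # rate ratios for the Poisson model
```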

various extensions, esp for mixed models

broom looks really great overall. I have been thinking for a long time about updating my coefplot2 package (on R-forge, but somewhat defunct there), and it looks like broom could serve as the back-end, but I have a bunch of questions and ideas about extensions. Sorry that the following is so long, but I hope I'm starting a useful discussion. Please feel free to ignore these ideas if they're too awkward or can't be implemented without breaking compatibility too badly, but I do think they could be useful things to think about.

Model types

arm::coefplot knows about bugs, glm, lm, and polr objects (showMethods("coefplot"))

coefplot2 uses a coeftab() method to extract coefficient tables: in addition to the classes that coefplot handles, coefplot2 knows about glmmadmb, glmmML, mcmc, MCMCglmm, rjags, merMod, and mer (old lme4). Right now it extracts lower/upper confidence intervals and standard deviations if they're available; it only gets the Wald intervals for merMod objects, leaving the interval/uncertainty estimates out for the random effects parameters. It uses

ctype: (character) one of "quad" (quadratic/Wald), "profile" (likelihood profile), "quantile" (quantiles of posterior density), "HPDinterval" (highest posterior density)

ptype: parameter type: one or more of "fixef" (default: fixed effect parameters), "ranef" (posterior modes of random effects), "vcov" (parameters of random effects variance-covariance matrices)

If some of the interface issues can be sorted out I'd be happy to contribute methods for all of these.

Interface issues

In general there are a lot of different kinds of coefficients one might want to retrieve from a model. I'm going to focus on two issues, parameter types and confidence intervals.

Parameter types

At the moment, the tidy method for merMod objects offers the choice of "fixed" or "random". The "fixed" case makes sense, but the other possibilities get a little bit interesting. My own most common use case for random-effects parameters is to retrieve the standard deviations and correlations of the random effects, not the estimated values of the coefficients for each level of the grouping variable(s). Thus I would implement something like

library("lme4")
fm1 <- lmer(Reaction ~ Days + (Days | Subject), sleepstudy)
aa <- as.data.frame(VarCorr(fm1))
aa2 <- aa[1:(nrow(aa)-1),]  ## skip Residual variance
## lots of choices here; lme4:::tnames could be exported if necessary
termnames <- lme4:::tnames(fm1,old=FALSE,prefix=c("sd","cor"))
data.frame(terms=termnames,estimate=aa2[,"sdcor"],std.error=NA,statistic=NA)
##
##                    terms    estimate std.error statistic
##      Subject.(Intercept) 24.74044761        NA        NA
##             Subject.Days  5.92213324        NA        NA
## Subject.(Intercept).Days  0.06555134        NA        NA

There are more decisions/possible user inputs here:

is residual variance estimate included (if it exists -- it doesn't for most GLMMs)?
there are three possible parameterizations: variance/covariance, standard deviation/correlation, or Cholesky decomposition scale
it would be nice to allow the user to specify a vector of parameter types to extract (although this will obviously make it harder to check the types -- match.arg() won't work any more)
some users might also want to get conditional modes/BLUPs
it would sure be nice if the interface for coef() were more uniform across packages and model types

Another useful place to look is at the coef() method for hurdle and zeroinfl models from the pscl package: from ?predict.hurdle,

coef(object, model = c("full", "count", "zero"), ...)
(where these refer to parameters determining the probability of a structural zero, the expected number of counts in non-structural-zero cases, or both)

I guess the question is how much can be passed through tidy as optional arguments to coef vs. trying to give the tidy methods a unified interface ...

Confidence intervals etc.

While many model types have well-defined and useful standard errors, others don't. In particular the Wald standard errors for GLMs can be very unreliable; profile confidence intervals are more reliable. The same is often true for random effects parameters. It would be nice to have the option to incorporate lower and upper confidence intervals in a tidy data frame, although it's a little hard to know how to get this -- would you pass a confidence-interval data frame or a likelihood profile? (For most models including glm objects it's easy and lightweight to recompute the profile confidence intervals via confint(), but for merMod objects this can be an expensive operation ...)

One thing to think about here is that the default column names that R uses for confidence intervals (2.5 %, 97.5 %) are a nuisance -- I usually use lwr and upr, although these don't specify the level -- maybe lwr_0.025, upr_0.975?

There are sometimes additional decisions: for Bayesian models, should highest posterior density (coda::HPDinterval) or quantiles of the marginal posterior be used?
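
To make the Wald vs. profile distinction concrete, here is a small sketch for a glm object that puts both kinds of interval in one data frame. The column names (`lwr_wald`, `upr_profile`, etc.) follow the lwr/upr idea above and are only illustrative; in older versions of R the profile method for glm objects is supplied by the MASS package.

```r
fit <- glm(vs ~ mpg, data = mtcars, family = binomial())

wald <- confint.default(fit)                   # Wald: estimate +/- z * std.error
prof <- suppressMessages(confint(fit))         # profile-likelihood intervals

tidy_ci <- data.frame(
  term        = names(coef(fit)),
  estimate    = unname(coef(fit)),
  lwr_wald    = wald[, 1],
  upr_wald    = wald[, 2],
  lwr_profile = prof[, 1],
  upr_profile = prof[, 2],
  row.names   = NULL
)
tidy_ci
```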

Wish lists

Some things that I've wanted coefplot2 to be able to do:

automatically rescale parameters, i.e. take a model that was fitted without centering and scaling parameters and adjust the parameters and SEs accordingly. This is fairly straightforward if the means and standard deviations of the original predictor variables are known (so is the inverse transformation). arm::coefplot does this, I think, but only by re-fitting the model, which is an unnecessary expense.
automatically change scale of parameter/confidence intervals; for example, Wald estimates of variance parameters are much better on the log scale, and this conversion is easy [i.e. if the std dev estimate is b and its standard error is s, then the confidence intervals based on a log scale are exp(log(b) +/- 1.96*s/b)]
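
The log-scale conversion can be sketched in a few lines. The standard error `s` below is a made-up value for illustration (the estimate `b` is the Subject.Days standard deviation from the example output above); the s/b term is the delta-method standard error of log(b).

```r
b <- 5.92   # std dev estimate (from the example output above)
s <- 1.25   # HYPOTHETICAL standard error, for illustration only
z <- qnorm(0.975)

# Wald interval on the original scale: symmetric, can go below zero.
ci_raw <- b + c(-1, 1) * z * s

# Wald interval on the log scale, back-transformed: always positive.
ci_log <- exp(log(b) + c(-1, 1) * z * s / b)

rbind(raw = ci_raw, log_scale = ci_log)
```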

Tidying a column doesn't work on a data frame with one row

As shown in this question, when do is used on an ungrouped data frame, it cannot be used with rowwise_df_tidiers:

mtcars %>% do(model = lm(mpg ~ wt, .)) 
#> Source: local data frame [1 x 1]
#> 
#>     model
#> 1 <S3:lm>

mtcars %>% do(model = lm(mpg ~ wt, .)) %>% glance(model)
#> Error in stats::complete.cases(x): invalid 'type' (list) of argument

This is because the class of the resulting tbl_df is not a rowwise_df, but simply a tbl_df and data.frame.

To fix this, I'll add a special case for tidy.tbl_df that checks if it's one row, and if so, applies the tidy.rowwise_df operation.

Enhance documentation of tidy

Would it be possible to enhance the documentation of tidy? The current documentation gives

Usage
tidy(x, ...)

Arguments
x An object to be converted into a tidy data.frame
... extra arguments

What the extra arguments might be isn't described anywhere that I can find. I know of conf.int and exponentiate from issue #5; are there others?

dplyr, glance and p.adjust

I was performing some regressions similar to the below example, when I noticed that p.adjust does not seem to do anything to p.values after a glance() has been called.

library(dplyr)
library(broom)
data(mtcars)
mtcars %>% 
    group_by(gear) %>% 
    do(mod = lm(mpg~wt, data=.)) %>% 
    glance(mod) %>% 
    mutate(fdr=p.adjust(p.value, method="fdr")) %>% 
    select(p.value, fdr)

$ Source: local data frame [3 x 3]
$ Groups: gear
$
$ gear      p.value          fdr
$ 1    3 0.0006048395 0.0006048395
$ 2    4 0.0010104804 0.0010104804
$ 3    5 0.0012815520 0.0012815520

Am I missing something here?
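
One plausible explanation, offered as an assumption rather than a confirmed diagnosis: the output above is still grouped by gear, so mutate() calls p.adjust() once per group on a single p-value, which leaves it unchanged. The sketch below reproduces that effect with the three p-values from the output and shows that ungrouping first changes the result.

```r
library(dplyr)

# p-values copied from the output above.
pvals <- data.frame(
  gear    = c(3, 4, 5),
  p.value = c(0.0006048395, 0.0010104804, 0.0012815520)
)

# Grouped: p.adjust() sees one p-value at a time, so nothing is adjusted.
grouped <- pvals %>%
  group_by(gear) %>%
  mutate(fdr = p.adjust(p.value, method = "fdr"))

# Ungrouped: p.adjust() sees all three p-values at once.
ungrouped <- pvals %>%
  ungroup() %>%
  mutate(fdr = p.adjust(p.value, method = "fdr"))

grouped
ungrouped
```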

$ version  R version 3.1.2 (2014-10-31)
$ system   x86_64, darwin13.4.0       
$ broom           0.3.6       2015-02-18 CRAN (R 3.1.2)
$ dplyr           0.4.1       2015-01-14 CRAN (R 3.1.2)

Add tidiers for vegan package

Add tidiers for objects from the vegan package for community ecology.

Compatibility with mgcv::gam()

The current broom::tidy.gam() seems to work only for objects from package gam, and not for those from package mgcv. The latter does not have summary(x)$parametric.anova, but a data frame could be constructed from the output of the summary anyway.
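
A rough sketch of how such a data frame could be built from an mgcv summary, using its `p.table` (parametric terms) and `s.table` (smooth terms) components. The output column names are assumptions chosen to resemble broom's conventions, and the model is just an example.

```r
library(mgcv)

fit <- gam(mpg ~ s(wt) + am, data = mtcars)
s <- summary(fit)

# Parametric coefficients: estimate, standard error, t statistic, p-value.
parametric <- data.frame(term = rownames(s$p.table), s$p.table,
                         row.names = NULL, check.names = FALSE)
names(parametric) <- c("term", "estimate", "std.error", "statistic", "p.value")

# Smooth terms: effective df, reference df, F statistic, p-value.
smooth <- data.frame(term = rownames(s$s.table), s$s.table,
                     row.names = NULL, check.names = FALSE)
names(smooth) <- c("term", "edf", "ref.df", "statistic", "p.value")

parametric
smooth
```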

Error in eval(expr, envir, enclos) : object 'group1' not found

FYI: when using tidy with pairwise.t.test, the following code gives an error:

library(broom)
dat <- mtcars
dat$vs <- as.factor(mtcars$vs)
res <- pairwise.t.test(dat[,"mpg"], dat[,"vs"])
tidy(res)

Error in eval(expr, envir, enclos) : object 'group1' not found

Recoding the levels seems to 'fix' the issue.

vs <- as.character(mtcars$vs)
vs[vs == 1] <- "manual"
vs[vs == 0] <- "automatic"
dat$vs2 <- as.factor(vs)
res <- pair

Really like Broom. Thanks!
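
For reference, a hand-rolled sketch of tidying a pairwise.htest object in base R, which sidesteps the error by reshaping the p-value matrix directly. The function name `tidy_pairwise` is hypothetical; the `group1`/`group2` column names are taken from the error message above.

```r
# Hypothetical helper: long-format p-values from a pairwise.htest object.
tidy_pairwise <- function(res) {
  long <- as.data.frame(as.table(res$p.value), stringsAsFactors = FALSE)
  names(long) <- c("group1", "group2", "p.value")
  long <- long[!is.na(long$p.value), ]   # drop the empty upper triangle
  row.names(long) <- NULL
  long
}

# Two-level factor, as in the report above: a single comparison.
res_vs <- pairwise.t.test(mtcars$mpg, factor(mtcars$vs))
tidy_pairwise(res_vs)

# Three-level factor: three pairwise comparisons.
res_cyl <- pairwise.t.test(mtcars$mpg, factor(mtcars$cyl))
tidy_pairwise(res_cyl)
```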

handle plm() objects

feature request: handling plm() objects (panel data regression) would be great; wrote my own extraction function (adapted some lm() example I found online), used the following commands to extract information [self-explanatory, I guess]:

t(x[, "Estimate" ])
t(x[, "Pr(>|t|)" ])
t(x[ , "Std. Error" ])
t(x[ , "t-value" ])

dim(x$coefficients)[1] # number of coefficients
x$r.squared[1] # R^2
x$r.squared[2] # adj. R^2
sum(x$residuals^2) # sum of squares
x$fstatistic[[4]] # p-value F statistic

Factor variables

When the formula contains terms such as : or *, it would be nice to split term into two columns. For instance

tidy(lm(y ~ x:as.factor(z)))

gives

term 
(Intercept)     
x*as.factor(z):1
x*as.factor(z):2
x*as.factor(z):3

but it could give two supplementary columns

term              
(Intercept)       NA   NA     
x*as.factor(z):1   x    1
x*as.factor(z):2   x    2
x*as.factor(z):3   x    3

This would allow to plot the estimate as a function of z, for instance.
I'm not sure what's the best way to do it that avoids regex.
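
One regex-free possibility, sketched under assumptions: split each term on a literal ":" and look the pieces up against the factor levels that lm stores in `xlevels`. The simulated data and the new column names `variable` and `level` are hypothetical.

```r
set.seed(1)
d <- data.frame(x = rnorm(30), z = rep(1:3, 10))
d$y <- d$x * d$z + rnorm(30)
fit <- lm(y ~ x:as.factor(z), data = d)

term   <- names(coef(fit))
pieces <- strsplit(term, ":", fixed = TRUE)   # fixed = TRUE: no regex involved

# Map "factorname + level" strings back to the bare level, via fit$xlevels.
f      <- names(fit$xlevels)[1]
lookup <- setNames(fit$xlevels[[f]], paste0(f, fit$xlevels[[f]]))

level <- vapply(pieces, function(p) {
  hit <- lookup[p[p %in% names(lookup)]]
  if (length(hit)) unname(hit[1]) else NA_character_
}, character(1))

variable <- vapply(pieces, function(p) {
  rest <- p[!p %in% names(lookup)]
  if (length(p) > 1 && length(rest)) rest[1] else NA_character_
}, character(1))

data.frame(term, variable, level, estimate = unname(coef(fit)))
```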

Handle NAs consistently across lm/glm/nls/lme4

Right now some of the behavior is handled differently across functions. Should have standardized test cases. Also needs to ensure it works with or without row names.
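
As one possible starting point for a standardized test case (this reads the issue as being about missing values in the data, which is an assumption): with lm, the default na.omit drops incomplete rows from the residuals, while na.exclude pads them with NA so they line up with the original data.

```r
d <- mtcars
d$wt[c(2, 5)] <- NA   # introduce two missing predictor values

fit_omit <- lm(mpg ~ wt, data = d)                          # default: na.omit
fit_excl <- lm(mpg ~ wt, data = d, na.action = na.exclude)

n_omit <- length(residuals(fit_omit))   # rows with NA are dropped
n_excl <- length(residuals(fit_excl))   # rows with NA are kept as NA

c(omit = n_omit, exclude = n_excl, data = nrow(d))
```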