Comments (20)
Personally, I don't see this as a particularly good fit for broom. Formulas are implicitly trees, and any attempt to put them into a data frame is going to be somewhat unnatural. Instead, I think you'd be better off learning how to work with formulas as trees (e.g. http://adv-r.had.co.nz/Expressions.html#ast-funs)
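To illustrate the "formulas are trees" point, here is a minimal base-R sketch that prints the nested structure of a formula. The helper names show_tree and node_label are hypothetical and are not part of broom or any other package.

```r
# Label for a node: symbols print as their name, everything else is deparsed.
node_label <- function(x) {
  if (is.symbol(x)) as.character(x) else paste(deparse(x), collapse = " ")
}

# Recursively print a language object, indenting by depth.
show_tree <- function(x, depth = 0) {
  indent <- strrep("  ", depth)
  if (is.call(x)) {
    # x[[1]] is the operator or function; the remaining elements are arguments
    cat(indent, node_label(x[[1]]), "\n", sep = "")
    for (arg in as.list(x)[-1]) show_tree(arg, depth + 1)
  } else {
    cat(indent, node_label(x), "\n", sep = "")
  }
  invisible(NULL)
}

show_tree(a + b ~ c + d * e)
```

The top node is the ~ call, with one subtree per side; d * e sits one level deeper than c, which is the nesting a flat table has to give up.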
from broom.
@mbojan I think some tools like this for working with formulae are sorely needed. I don't think broom is the place for them however. recipes has some solid work in this vein, and I imagine that eventually there'll be a formulae toolkit somewhere in tidymodels.
In the meantime I'm going to close this because I agree with Hadley that it doesn't make much sense to represent a formula as a flat table.
from broom.
If it were to be done, I would think the most usable format would be a data frame of three columns.
- side: either 'lhs' or 'rhs'
- position: an integer giving the position of the term from left to right. The value would start at one in both 'lhs' and 'rhs'
- term: character string giving the term
I'm not sure how broad its usage would be, but there are a couple of fringe cases I can think of where this would have been useful.
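As a rough illustration of the three-column layout described above, the sketch below builds such a data frame by hand for a + b ~ c + d and pastes it back into a formula. It assumes every term is joined by +, which the proposed columns alone cannot confirm; the object names are hypothetical.

```r
# Hand-built example of the proposed layout (not produced by any broom function)
proposed <- data.frame(
  side     = c("lhs", "lhs", "rhs", "rhs"),
  position = c(1L, 2L, 1L, 2L),
  term     = c("a", "b", "c", "d"),
  stringsAsFactors = FALSE
)

# Rebuild each side by joining its terms in order, assuming "+" between them
sides   <- tapply(proposed$term, proposed$side, paste, collapse = " + ")
rebuilt <- stats::as.formula(paste(sides[["lhs"]], "~", sides[["rhs"]]))
rebuilt
```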
from broom.
I like this idea and @nutterb's approach. Here's a first try:
library(dplyr)  # for %>%, mutate() and row_number()
library(broom)  # for the tidy() generic

tidy.formula <- function(x, ...) {
  lhs <- tidy(x[[2]]) %>% mutate(position = row_number())
  rhs <- tidy(x[[3]]) %>% mutate(position = row_number())
  rbind(cbind(side = "lhs", lhs),
        cbind(side = "rhs", rhs))
}
tidy.call <- function(x, ...) {
  # skip the operator or function itself
  plyr::ldply(x[-1], tidy)
}
tidy.name <- function(x, ...) {
  data.frame(term = as.character(x))
}
For example:
tidy(a + b ~ c + d * e - f)
results in
side term position
1 lhs a 1
2 lhs b 2
3 rhs c 1
4 rhs d 2
5 rhs e 3
6 rhs f 4
If folks are in favor of this tidied result, I'll build it in. The one thing I'm concerned about is that we're ignoring which operators are being included. I'm not sure there's a simple way to include them.
from broom.
I was thinking about that on my drive home and thought of another minor issue with this plan. While I would agree that d * e represents two separate terms, I would argue that d:e is only one term.
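For reference, this is how base R's terms() expands the two operators; a small check using only the stats package:

```r
# d * e expands to the main effects plus the interaction
star_labels  <- attr(stats::terms(~ d * e), "term.labels")
# d:e is the interaction alone
colon_labels <- attr(stats::terms(~ d:e), "term.labels")

star_labels   # "d"   "e"   "d:e"
colon_labels  # "d:e"
```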
In a perfect world, I think I would want to be able to reconstitute the formula from the tidy data frame. So for the call
tidy(a + b ~ c + d * e - f + c:d)
I would want to see
side term position operand_append
1 lhs a 1 +
2 lhs b 2 NA
3 rhs c 1 +
4 rhs d 2 *
5 rhs e 3 -
6 rhs f 4 +
7 rhs c:d 5 NA
Probably the only way to do this is with regular expressions. Fortunately, we really only have three operators to split on. The regular expression "([+]|[-]|[*])" in str_extract and strsplit ought to break them up well. I will try to play with this after I get the kids to bed and some other work done.
from broom.
@nutterb No need for regular expression splitting, we could treat : (maybe others?) as a special case in recursion:
tidy.call <- function(x, ...) {
  if (as.character(x[[1]]) == ":") {
    # special case; treat interactions as one term
    return(data.frame(term = deparse(x)))
  }
  # skip the operator or function itself
  plyr::ldply(x[-1], tidy)
}
from broom.
Wow. I wish you'd taught me that trick six months ago.
from broom.
I have my moments.
from broom.
Great that you guys like my idea. I think it is also handy in interactive mode, i.e. for printing formulas; print(formula) is hard to read once there are more than just a few terms.
from broom.
As I've played with this more, I've found that it has some limitations.
> tidy(a + b ~ c + d * e + log(k))
side term position
1 lhs a 1
2 lhs b 2
3 rhs c 1
4 rhs d 2
5 rhs e 3
6 rhs k 4
I'm not sure that I like having completely lost the log in log(k). When you add in more complex components,
tidy(a+b ~ rcs(w, 3))
Error in rbind(deparse.level, ...) :
  numbers of columns of arguments do not match
In addition: Warning message:
In tidy.default(X[[i]], ...) :
  No method for tidying an S3 object of class numeric , using as.data.frame
How does something like this look?
tidy.formula <- function(x, ...)
{
  if (length(x) == 2)
    return(sideTerms(x[[2]], "rhs"))
  else
    return(rbind(sideTerms(x[[2]], "lhs"),
                 sideTerms(x[[3]], "rhs")))
}
sideTerms <- function(f, side)
{
  f <- deparse(f)
  operand <-
    unlist(stringr::str_extract_all(f, stringr::regex("([*]|[+]|[-]|[/])(?![^(]*[)])")))
  f <- unlist(strsplit(f,
                       "([*]|[+]|[-]|[/])(?![^(]*[)])",
                       perl=TRUE))
  f <- stringr::str_trim(f)
  data.frame(side = rep(side, length(f)),
             position = 1:length(f),
             term = f,
             operand_append = c(operand, ""),
             stringsAsFactors=FALSE)
}
> tidy(a + b ~ c + d * e - f + c:f + func(x * y) + rcs(w, 3) + log(k) + g^2)
side position term operand_append
1 lhs 1 a +
2 lhs 2 b
3 rhs 1 c +
4 rhs 2 d *
5 rhs 3 e -
6 rhs 4 f +
7 rhs 5 c:f +
8 rhs 6 func(x * y) +
9 rhs 7 rcs(w, 3) +
10 rhs 8 log(k) +
11 rhs 9 g^2
> tidy(a + b ~ c + d * e + log(k))
side position term operand_append
1 lhs 1 a +
2 lhs 2 b
3 rhs 1 c +
4 rhs 2 d *
5 rhs 3 e +
6 rhs 4 log(k)
> tidy(a+b ~ rcs(w, 3))
side position term operand_append
1 lhs 1 a +
2 lhs 2 b
3 rhs 1 rcs(w, 3)
> tidy(~ rcs(w, 3))
side position term operand_append
1 rhs 1 rcs(w, 3)
> tidy(a + b ~ c / e)
side position term operand_append
1 lhs 1 a +
2 lhs 2 b
3 rhs 1 c /
4 rhs 2 e
> tidy(a + b ~ (c / e))
side position term operand_append
1 lhs 1 a +
2 lhs 2 b
3 rhs 1 (c/e)
from broom.
If it is too tricky, peeking at how a formula is parsed in lm (or maybe there is some general routine somewhere else) could be worthwhile?
from broom.
Hmm, I'm still not entirely sold on using regular expressions, but I'll take another look. We could still keep calls such as log in a recursive solution.
@Helix123 formulas in lm are parsed by model.frame and model.matrix, which I probably wouldn't apply here.
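One way the recursive idea might keep calls such as log(k) intact is to recurse only through the operators that join terms and to deparse every other call whole. This is a sketch under that assumption, not the approach broom adopted; collect_terms, tidy_formula_sketch and split_ops are hypothetical names.

```r
# Operators that join terms; any other call (log(), rcs(), `:` ...) stays whole.
split_ops <- c("+", "-", "*", "/")

collect_terms <- function(x) {
  if (is.call(x) && is.symbol(x[[1]]) &&
      as.character(x[[1]]) %in% split_ops) {
    # Recurse into the operator's arguments, left to right
    return(unlist(lapply(as.list(x)[-1], collect_terms)))
  }
  # Leaf: a symbol, a constant, or a call that is kept as a single term
  paste(trimws(deparse(x)), collapse = " ")
}

tidy_formula_sketch <- function(f) {
  stopifnot(inherits(f, "formula"))
  lhs <- if (length(f) == 3) collect_terms(f[[2]]) else character(0)
  rhs <- collect_terms(f[[length(f)]])
  data.frame(
    side     = rep(c("lhs", "rhs"), c(length(lhs), length(rhs))),
    position = c(seq_along(lhs), seq_along(rhs)),
    term     = c(lhs, rhs),
    stringsAsFactors = FALSE
  )
}

res <- tidy_formula_sketch(a + b ~ c + d * e + log(k) + c:d + rcs(w, 3))
res
```

Parenthesised groups such as (c / e) are also kept whole here, because ( is itself a call that is not in split_ops.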
from broom.
I don't blame you for not being sold on regex. They're really messy and unintelligible when done succinctly. But I got stuck on how to get the transformations to come out with the terms.
Also, I do see some value in having the terms without the transformations. I'm not sure the regex approach is compatible with that.
from broom.
I'll keep working on the transformation, I'm sure we'll find it.
(Hope you're both dropping in on the rOpenSci community call about broom tomorrow!)
from broom.
I almost had it! Have a look at the factors attribute of the terms output. It returns a matrix, and those rownames are suspiciously similar to what we want.
> form <- a + b ~ c + d * e - f + c:f + func(x * y) + rcs(w, 3) + log(k) + g^2
> terms(form)
a + b ~ c + d * e - f + c:f + func(x * y) + rcs(w, 3) + log(k) +
g^2
attr(,"variables")
list(a + b, c, d, e, f, func(x * y), rcs(w, 3), log(k), g)
attr(,"factors")
            c d e func(x * y) rcs(w, 3) log(k) g d:e c:f
a + b       0 0 0           0         0      0 0   0   0
c           1 0 0           0         0      0 0   0   2
d           0 1 0           0         0      0 0   1   0
e           0 0 1           0         0      0 0   1   0
f           0 0 0           0         0      0 0   0   1
func(x * y) 0 0 0           1         0      0 0   0   0
rcs(w, 3)   0 0 0           0         1      0 0   0   0
log(k)      0 0 0           0         0      1 0   0   0
g           0 0 0           0         0      0 1   0   0
So I had the code below before I realized I lost the g^2 transformation. I'll share the code just in case it helps inspire anyone, but I think I'll have to sit back and learn from the masters now.
tidy.formula <- function(x, ...){
  if (length(x) == 2)
    # one-sided formula: "rhs" must be passed to sideTerms(), not to return()
    return(sideTerms(x[[2]], "rhs"))
  else
    return(rbind(sideTerms(x[[2]], "lhs"),
                 sideTerms(x[[3]], "rhs")))
}
sideTerms <- function(f, side){
  f <- stats::as.formula(paste("~", deparse(f)))
  term_transform <- rownames(attributes(terms(f))$factors)
  data.frame(side = rep(side, length(term_transform)),
             position = 1:length(term_transform),
             term_transform = term_transform,
             stringsAsFactors = FALSE)
}
> tidy(a + b ~ c + d * e - f + c:f + func(x * y) + rcs(w, 3) + log(k) + g^2 + (i+j))
side position term_transform
1 lhs 1 a
2 lhs 2 b
3 rhs 1 c
4 rhs 2 d
5 rhs 3 e
6 rhs 4 f
7 rhs 5 func(x * y)
8 rhs 6 rcs(w, 3)
9 rhs 7 log(k)
10 rhs 8 g
11 rhs 9 i
12 rhs 10 j
from broom.
I would say the approach of attr(terms(formula(y~x+z^2)),"factors") is correct, because the transformation is also lost in the modelling. If you want to use z^2 as a predictor you need to use the I function.
y <- rnorm(10)
x <- 1:10
(fit1 <- lm(y ~ x^2))
Call:
lm(formula = y ~ x^2)
Coefficients:
(Intercept) x
0.299124 0.004762
attr(fit1$terms, "factors")
x
y 0
x 1
(fit2 <- lm(y ~ x))
Call:
lm(formula = y ~ x)
Coefficients:
(Intercept) x
0.299124 0.004762
attr(fit2$terms, "factors")
x
y 0
x 1
(fit3 <- lm(y ~ I(x^2)))
Call:
lm(formula = y ~ I(x^2))
Coefficients:
(Intercept) I(x^2)
0.305254 0.000521
attr(fit3$terms, "factors")
I(x^2)
y 0
I(x^2) 1
from broom.
The deparsing in sideTerms fails for complex formulas:
# Example model from KFAS package
form <- log(drivers) ~ SSMtrend(1, Q = list(NA)) +
SSMseasonal(period = 12, sea.type = "trigonometric", Q = NA) +
log(PetrolPrice) + law
rownames(attr(terms(form), "factors"))
[1] "log(drivers)"
[2] "SSMtrend(1, Q = list(NA))"
[3] "SSMseasonal(period = 12, sea.type = \"trigonometric\", Q = NA)"
[4] "log(PetrolPrice)"
[5] "law"
sideTerms(form[[3]], "rhs")
Error in parse(text = x, keep.source = FALSE) :
<text>:2:9: unexpected '='
1: ~ SSMtrend(1, Q = list(NA)) + SSMseasonal(period = 12, sea.type = "trigonometric",
2: ~ Q =
^
This feels a bit hacky but seems to work, except for the left-hand side, which is not split:
tidy.formula <- function(x, ...){
  if (length(x) == 2)
    return(sideTerms(x, "rhs"))
  else
    return(rbind(sideTerms(x, "lhs"),
                 sideTerms(x, "rhs")))
}
sideTerms <- function(f, side){
  factors <- attr(terms(f), "factors")
  # rows with no entries in any term column belong to the response
  term_transform <-
    if(side == "lhs"){
      rownames(factors)[rowSums(factors) == 0]
    } else rownames(factors)[rowSums(factors) > 0]
  data.frame(side = rep(side, length(term_transform)),
             position = 1:length(term_transform),
             term_transform = term_transform,
             stringsAsFactors = FALSE)
}
form <- a + b ~ c + d * e - f + c:f + func(x * y) + rcs(w, 3) + log(k) + g^2
tidy.formula(form)
side position term_transform
1 lhs 1 a + b
2 rhs 1 c
3 rhs 2 d
4 rhs 3 e
5 rhs 4 f
6 rhs 5 func(x * y)
7 rhs 6 rcs(w, 3)
8 rhs 7 log(k)
9 rhs 8 g
form <- log(drivers) ~ SSMtrend(1, Q = list(NA)) +
SSMseasonal(period = 12, sea.type = "trigonometric", Q = NA) +
log(PetrolPrice) + law
tidy.formula(form)
side position term_transform
1 lhs 1 log(drivers)
2 rhs 1 SSMtrend(1, Q = list(NA))
3 rhs 2 SSMseasonal(period = 12, sea.type = "trigonometric", Q = NA)
4 rhs 3 log(PetrolPrice)
5 rhs 4 law
from broom.
We could preserve the "lhs" split by using a nested paste. Looks ugly, but.....
tidy.formula <- function(x, ...){
  if (length(x) == 2)
    return(sideTerms(x[[2]], "rhs"))
  else
    return(rbind(sideTerms(x[[2]], "lhs"),
                 sideTerms(x[[3]], "rhs")))
}
sideTerms <- function(f, side){
  # deparse() may return several strings; collapse them before re-parsing
  f <- stats::as.formula(paste("~",
                               paste(stringr::str_trim(deparse(f)),
                                     collapse = " ")))
  term_transform <- rownames(attr(terms(f), "factors"))
  data.frame(side = rep(side, length(term_transform)),
             position = 1:length(term_transform),
             term_transform = term_transform,
             stringsAsFactors = FALSE)
}
form <- log(drivers) ~ SSMtrend(1, Q = list(NA)) +
SSMseasonal(period = 12, sea.type = "trigonometric", Q = NA) +
log(PetrolPrice) + law
tidy(form)
form <- a + b ~ c + d * e - f + c:f + func(x * y) + rcs(w, 3) + log(k) + g^2
tidy(form)
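The reason the inner paste matters can be seen in isolation: deparse() wraps long expressions into several strings, so pasting "~" onto its result yields more than one string unless the pieces are collapsed first. A minimal base-R sketch using the KFAS-style formula from above (the SSM* functions do not need to exist, because the formula is never evaluated):

```r
form <- log(drivers) ~ SSMtrend(1, Q = list(NA)) +
  SSMseasonal(period = 12, sea.type = "trigonometric", Q = NA) +
  log(PetrolPrice) + law

# deparse() breaks the long right-hand side into several strings
pieces <- deparse(form[[3]])
length(pieces)

# Collapse to a single string before turning it back into a formula
one_line <- paste(trimws(pieces), collapse = " ")
rhs <- stats::as.formula(paste("~", one_line))
labels <- attr(stats::terms(rhs), "term.labels")
labels
```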
from broom.
I found this thread while trying to tidy() a bunch of models that include interaction terms and categorical variables. I ran into the problem of identifying the tidied rows that correspond to interaction terms involving specific variables and/or levels (values of categorical predictors). At the moment it seems this can only be done with some creative use of regular expressions, which are hard to write generically without hard-coding things like variable names or assuming something about the model specification. I don't think a generic solution can be reduced to post-processing tidy() output.
Perhaps broom could do better, e.g. by providing an option to append to the tidy output one or more columns that more closely represent the model specification?
Most of the models whose output broom tries to tidy are specified using the formula interface. Those formulas, apart from the original set of operators from Wilkinson & Rogers, may now include I() calls, pseudo-functions (like s() for splines), etc. The output of modeling functions essentially flattens the model formula tree (thanks @hadley) to rows in a table of coefficients.
Perhaps one idea would be to have a list column whose elements are vectors of the names of the variables included in each term? I think this could be done by somewhat reverse-engineering the process by which model terms (?terms) and their names are constructed.
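A rough sketch of that list-column idea, using only the "factors" matrix that terms() already provides. term_variables is a hypothetical helper, not a broom function; it assumes the formula has at least one term, and "variables" here are whatever terms() treats as variables, so log(k) is reported rather than k.

```r
term_variables <- function(f) {
  tt     <- stats::terms(f)
  fac    <- attr(tt, "factors")       # rows: variables, columns: terms
  labels <- attr(tt, "term.labels")
  # For each term, keep the row names with a non-zero entry in its column
  vars <- lapply(labels, function(lbl) rownames(fac)[fac[, lbl] > 0])
  out <- data.frame(term = labels, stringsAsFactors = FALSE)
  out$variables <- vars               # list column, one vector per term
  out
}

res <- term_variables(y ~ a * b + log(k))
res$term
res$variables
```

Mapping coefficient rows for individual factor levels back to these terms would need more than this, for example the "assign" attribute of the model matrix, which is outside what the sketch covers.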
from broom.
This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.
from broom.