tidier for formula? (broom, closed)

tidymodels commented on July 19, 2024
tidier for formula?

from broom.

Comments (20)

hadley commented on July 19, 2024

Personally, I don't see this as a particularly good fit for broom. Formulas are implicitly trees, and any attempt to put them into a data frame is going to be somewhat unnatural. Instead, I think you'd be better off learning how to work with formulas as trees (e.g. http://adv-r.had.co.nz/Expressions.html#ast-funs).
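As a rough illustration of the tree view (a minimal sketch, not code from the thread; the helper name show_tree is made up for this example), a formula is just a call to `~` whose pieces can be indexed and walked recursively:

```r
# A formula is a call to `~`, so it can be indexed like any other call.
form <- a + b ~ c + d * e - f

form[[1]]   # the operator: `~`
form[[2]]   # left-hand side:  a + b
form[[3]]   # right-hand side: c + d * e - f

# Hypothetical helper: render an expression tree in nested prefix notation.
show_tree <- function(x) {
  if (is.call(x)) {
    # Recurse into the arguments, then wrap them in the function/operator name.
    parts <- vapply(as.list(x)[-1], show_tree, character(1))
    paste0(deparse(x[[1]]), "(", paste(parts, collapse = ", "), ")")
  } else {
    # Leaves are symbols or constants.
    deparse(x)
  }
}

show_tree(form[[3]])
```

The nesting in the result shows why a flat table loses information: `d * e` sits one level deeper than `c` and `f`.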

alexpghayes commented on July 19, 2024

@mbojan I think some tools like this for working with formulae are sorely needed. I don't think broom is the place for them however. recipes has some solid work in this vein, and I imagine that eventually there'll be a formulae toolkit somewhere in tidymodels.

In the meantime I'm going to close this because I agree with Hadley that it doesn't make much sense to represent a formula as a flat table.

nutterb commented on July 19, 2024

If it were to be done, I would think the most usable format would be a data frame of three columns.

  • side: either 'lhs' or 'rhs'
  • position: an integer giving the position of the term from left to right. The value would start at one in both 'lhs' and 'rhs'
  • term: character string giving the term

I'm not sure how broad its usage would be, but there are a couple of fringe cases I can think of where this would have been useful.

dgrtwo commented on July 19, 2024

I like this idea and @nutterb's approach. Here's a first try:

tidy.formula <- function(x, ...) {
    lhs <- tidy(x[[2]]) %>% mutate(position = row_number())
    rhs <- tidy(x[[3]]) %>% mutate(position = row_number())

    rbind(cbind(side = "lhs", lhs),
          cbind(side = "rhs", rhs))
}

tidy.call <- function(x, ...) {
    # skip the operator or function itself
    plyr::ldply(x[-1], tidy)
}

tidy.name <- function(x, ...) {
    data.frame(term = as.character(x))
}

For example:

tidy(a + b ~ c + d * e - f)

results in

  side term position
1  lhs    a        1
2  lhs    b        2
3  rhs    c        1
4  rhs    d        2
5  rhs    e        3
6  rhs    f        4

If folks are in favor of this tidied result, I'll build it in. The one thing I'm concerned about is that we're ignoring which operators are being included. I'm not sure there's a simple way to include them.
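One possible way to keep the operators, sketched in base R under the assumption that recording each term's enclosing operator is enough. The helper name collect_terms and the column parent_op are invented for illustration and are not part of broom:

```r
# Hypothetical helper: walk the expression tree and remember, for every
# leaf, the operator (or function) of the call that directly contains it.
collect_terms <- function(x, op = NA_character_) {
  if (is.call(x)) {
    parent <- deparse(x[[1]])
    # Skip the operator itself and recurse into its arguments.
    do.call(rbind, lapply(as.list(x)[-1], collect_terms, op = parent))
  } else {
    data.frame(term = deparse(x), parent_op = op, stringsAsFactors = FALSE)
  }
}

rhs <- collect_terms(quote(c + d * e - f))
rhs
```

This records the parent operator rather than the operator that follows each term, so it is not enough on its own to rebuild the original formula.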

nutterb commented on July 19, 2024

I was thinking about that on my drive home and thought of another minor issue with this plan. While I would agree that d * e represents two separate terms, I would argue that d:e is only one term.

In a perfect world, I think I would want to be able to reconstitute the formula from the tidy data frame. So for the call

tidy(a + b ~ c + d * e - f + c:d)

I would want to see

  side term position operand_append
1  lhs    a        1             + 
2  lhs    b        2            NA 
3  rhs    c        1             + 
4  rhs    d        2             * 
5  rhs    e        3             - 
6  rhs    f        4             + 
7  rhs  c:d        5            NA 

Probably the only way to do this is with regular expressions. Fortunately, we really only have three operators to split on. The regular expression "([+]|[-]|[*])" in str_extract and strsplit ought to break them up well. I will try to play with this after I get kids to bed and some other work done.
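A quick sketch of that splitting idea, using base R's strsplit, gregexpr and regmatches in place of stringr (an assumption made only to keep the example self-contained). The last lines show one limitation: operators inside a function call are split as well.

```r
rhs <- "c + d * e - f + c:d"
pattern <- "([+]|[-]|[*])"

# Terms are what is left between the operators.
pieces <- trimws(strsplit(rhs, pattern)[[1]])

# The operators themselves, in order of appearance.
ops <- regmatches(rhs, gregexpr(pattern, rhs))[[1]]

pieces
ops

# Limitation: a "+" inside parentheses is treated like any other "+".
broken <- trimws(strsplit("c + log(a + b)", pattern)[[1]])
broken
```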

dgrtwo commented on July 19, 2024

@nutterb No need for regular expression splitting, we could treat : (maybe others?) as a special case in recursion:

tidy.call <- function(x, ...) {
    if (as.character(x[[1]]) == ":") {
        # special case; treat interactions as one term
        return(data.frame(term = deparse(x)))
    }
    # skip the operator or function itself
    plyr::ldply(x[-1], tidy)
}

nutterb commented on July 19, 2024

Wow. I wish you'd taught me that trick six months ago.

dgrtwo commented on July 19, 2024

I have my moments.

 avatar commented on July 19, 2024

Great that you guys like my idea. I think it is also handy in interactive mode, i.e. for printing formulas; print(formula) is hard to read once there are more than just a few terms.

nutterb commented on July 19, 2024

As I've played with this more, I've found that it has some limitations.

> tidy(a + b ~ c + d * e + log(k))
  side term position
1  lhs    a        1
2  lhs    b        2
3  rhs    c        1
4  rhs    d        2
5  rhs    e        3
6  rhs    k        4

I'm not sure that I like having completely lost the log in log(k). When you add in more complex components,

tidy(a+b ~ rcs(w, 3))

Error in rbind(deparse.level, ...) : 
  numbers of columns of arguments do not match
In addition: Warning message:
In tidy.default(X[[i]], ...) :
  No method for tidying an S3 object of class numeric , using as.data.frame

How does something like this look?

tidy.formula <- function(x, ...)
{
  if (length(x) == 2)
    return(sideTerms(x[[2]], "rhs"))
  else 
    return(rbind(sideTerms(x[[2]], "lhs"),
                 sideTerms(x[[3]], "rhs")))
}

sideTerms <- function(f, side)
{
  f <- deparse(f)
  operand <- 
    unlist(stringr::str_extract_all(f, stringr::regex("([*]|[+]|[-]|[/])(?![^(]*[)])")))
  f <- unlist(strsplit(f, 
                       "([*]|[+]|[-]|[/])(?![^(]*[)])", 
                       perl=TRUE))
  f <- stringr::str_trim(f)

  data.frame(side = rep(side, length(f)),
             position = 1:length(f),
             term = f,
             operand_append = c(operand, ""),
             stringsAsFactors=FALSE)
}

> tidy(a + b ~ c + d * e - f + c:f + func(x * y) + rcs(w, 3) + log(k) + g^2)
   side position        term operand_append
1   lhs        1           a              +
2   lhs        2           b               
3   rhs        1           c              +
4   rhs        2           d              *
5   rhs        3           e              -
6   rhs        4           f              +
7   rhs        5         c:f              +
8   rhs        6 func(x * y)              +
9   rhs        7   rcs(w, 3)              +
10  rhs        8      log(k)              +
11  rhs        9         g^2          

> tidy(a + b ~ c + d * e + log(k))
  side position   term operand_append
1  lhs        1      a              +
2  lhs        2      b               
3  rhs        1      c              +
4  rhs        2      d              *
5  rhs        3      e              +
6  rhs        4 log(k)      

> tidy(a+b ~ rcs(w, 3))
  side position      term operand_append
1  lhs        1         a              +
2  lhs        2         b               
3  rhs        1 rcs(w, 3)     

> tidy(~ rcs(w, 3))
  side position      term operand_append
1  rhs        1 rcs(w, 3)           

> tidy(a + b ~ c / e)
  side position term operand_append
1  lhs        1    a              +
2  lhs        2    b               
3  rhs        1    c              /
4  rhs        2    e     

> tidy(a + b ~ (c / e))
  side position  term operand_append
1  lhs        1     a              +
2  lhs        2     b               
3  rhs        1 (c/e)

 avatar commented on July 19, 2024

If it is too tricky, peeking at how a formula is parsed in lm (or maybe there is some general routine somewhere else) could be worthwhile?

dgrtwo commented on July 19, 2024

Hmm, I'm still not entirely sold on using regular expressions, but I'll take another look. We could still keep calls such as log in a recursive solution.

@Helix123 formulas in lm are parsed by model.frame and model.matrix, which I probably wouldn't apply here.
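A recursive solution that keeps calls such as log might look roughly like the following. This is a sketch under the assumption that only `+`, `-`, `*` and `/` should be split on; the names formula_ops and split_terms are hypothetical.

```r
# Operators that combine terms; any other call is kept as a single term.
formula_ops <- c("+", "-", "*", "/")

split_terms <- function(x) {
  is_op <- is.call(x) && is.name(x[[1]]) &&
    as.character(x[[1]]) %in% formula_ops
  if (is_op) {
    # Descend into the operands of a term-combining operator.
    unlist(lapply(as.list(x)[-1], split_terms))
  } else {
    # Symbols, interactions (c:f), powers (g^2) and function calls stay whole.
    paste(deparse(x), collapse = " ")
  }
}

result <- split_terms(quote(c + d * e - f + c:f + log(k) + rcs(w, 3) + g^2))
result
```

Because the recursion works on the parsed expression, an operator inside a function call such as func(x * y) is never split, which is what the regular expression needed a lookahead to achieve.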

nutterb commented on July 19, 2024

I don't blame you for not being sold on regex. They're really messy and unintelligible when done succinctly. But I got stuck on how to get the transformations to come out with the terms.

Also, I do see some value in having the terms without the transformations. I'm not sure the regex approach is compatible with that.
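For the terms without the transformations, base R's all.vars() may already be enough. A small example, offered as a pointer rather than a proposed broom interface:

```r
form <- a + b ~ c + d * e + log(k) + rcs(w, 3) + g^2

# Every variable name in the formula, with function calls stripped away.
vars_all <- all.vars(form)

# Right-hand side only.
vars_rhs <- all.vars(form[[3]])

vars_all
vars_rhs
```

all.vars() only inspects the expression, so functions such as rcs() do not need to be defined.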

dgrtwo commented on July 19, 2024

I'll keep working on the transformation, I'm sure we'll find it.

(Hope you're both dropping in on the rOpenSci community call about broom tomorrow!)

nutterb commented on July 19, 2024

I almost had it! Have a look at the factors attribute of the terms output. It returns a matrix, and those rownames are suspiciously similar to what we want.

> form <- a + b ~ c + d * e - f + c:f + func(x * y) + rcs(w, 3) + log(k) + g^2
> terms(form)
a + b ~ c + d * e - f + c:f + func(x * y) + rcs(w, 3) + log(k) + 
    g^2
attr(,"variables")
list(a + b, c, d, e, f, func(x * y), rcs(w, 3), log(k), g)
attr(,"factors")
            c d e func(x * y) rcs(w, 3) log(k) g d:e c:f
a + b       0 0 0           0         0      0 0   0   0
c           1 0 0           0         0      0 0   0   2
d           0 1 0           0         0      0 0   1   0
e           0 0 1           0         0      0 0   1   0
f           0 0 0           0         0      0 0   0   1
func(x * y) 0 0 0           1         0      0 0   0   0
rcs(w, 3)   0 0 0           0         1      0 0   0   0
log(k)      0 0 0           0         0      1 0   0   0
g           0 0 0           0         0      0 1   0   0

So I had the code below before I realized I lost the g^2 transformation. I'll share the code just in case it helps inspire anyone, but I think I'll have to sit back and learn from the masters now.

tidy.formula <- function(x, ...){
  if (length(x) == 2)
    return(sideTerms(x[[2]], "rhs"))
  else 
    return(rbind(sideTerms(x[[2]], "lhs"),
                 sideTerms(x[[3]], "rhs")))
}

sideTerms <- function(f, side){
  f <- stats::as.formula(paste("~", deparse(f)))
  term_transform <- rownames(attributes(terms(f))$factors)
  data.frame(side = rep(side, length(term_transform)),
             position = 1:length(term_transform),
             term_transform = term_transform,
             stringsAsFactors = FALSE)
}

> tidy(a + b ~ c + d * e - f + c:f + func(x * y) + rcs(w, 3) + log(k) + g^2 + (i+j))
   side position term_transform
1   lhs        1              a
2   lhs        2              b
3   rhs        1              c
4   rhs        2              d
5   rhs        3              e
6   rhs        4              f
7   rhs        5    func(x * y)
8   rhs        6      rcs(w, 3)
9   rhs        7         log(k)
10  rhs        8              g
11  rhs        9              i
12  rhs       10              j
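Besides the row names of the factors matrix, the terms object also stores the expanded model terms in its term.labels attribute (these are the column names of that matrix). A short example, separate from the code above:

```r
form <- a + b ~ c + d * e - f + c:f + log(k) + g^2
tt <- terms(form)

# Expanded terms: main effects first, then interactions.
labels <- attr(tt, "term.labels")
labels

# Row names of the factors matrix: the variables, including the response.
rownames(attr(tt, "factors"))
```

Note that `d * e` has been expanded to `d`, `e` and `d:e`, that `f` was removed by `- f`, and that `g^2` appears simply as `g`.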

helske commented on July 19, 2024

I would say the approach of attr(terms(formula(y~x+z^2)),"factors") is correct, because the transformation is also lost in the modelling. If you want to use z^2 as a predictor, you need to use the I() function.

y <- rnorm(10)
x <- 1:10
(fit1 <- lm(y ~ x^2))

Call:
lm(formula = y ~ x^2)

Coefficients:
(Intercept)            x  
   0.299124     0.004762 

attr(fit1$terms, "factors")
  x
y 0
x 1

(fit2 <- lm(y ~ x))
Call:
lm(formula = y ~ x)

Coefficients:
(Intercept)            x  
   0.299124     0.004762  

attr(fit2$terms, "factors")
  x
y 0
x 1

(fit3 <- lm(y ~ I(x^2)))
Call:
lm(formula = y ~ I(x^2))

Coefficients:
(Intercept)       I(x^2)  
   0.305254     0.000521  

attr(fit3$terms, "factors")
       I(x^2)
y           0
I(x^2)      1
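The reason is that inside a formula `^` means crossing to a given degree, not arithmetic. A brief check using only terms():

```r
# x^2 crosses x with itself, which is just x.
single <- attr(terms(y ~ x^2), "term.labels")

# (x + z)^2 expands to main effects plus the two-way interaction.
crossed <- attr(terms(y ~ (x + z)^2), "term.labels")

# I() protects the expression so that ^ is treated as arithmetic.
protected <- attr(terms(y ~ I(x^2)), "term.labels")

single
crossed
protected
```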

helske commented on July 19, 2024

The deparsing in sideTerms fails for complex formulas:

# Example model from KFAS package
form <- log(drivers) ~ SSMtrend(1, Q = list(NA)) + 
  SSMseasonal(period = 12, sea.type = "trigonometric", Q = NA) + 
  log(PetrolPrice) + law

rownames(attr(terms(form), "factors"))
[1] "log(drivers)"                                                  
[2] "SSMtrend(1, Q = list(NA))"                                     
[3] "SSMseasonal(period = 12, sea.type = \"trigonometric\", Q = NA)"
[4] "log(PetrolPrice)"                                              
[5] "law"

sideTerms(form[[3]], "rhs")
Error in parse(text = x, keep.source = FALSE) : 
  <text>:2:9: unexpected '='
1: ~ SSMtrend(1, Q = list(NA)) + SSMseasonal(period = 12, sea.type = "trigonometric", 
2: ~     Q =
           ^
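The failure appears to come from deparse() returning several strings for a long expression, so paste("~", ...) produces more than one fragment. A minimal sketch of the cause and one possible workaround (collapsing the pieces before parsing):

```r
form <- log(drivers) ~ SSMtrend(1, Q = list(NA)) +
  SSMseasonal(period = 12, sea.type = "trigonometric", Q = NA) +
  log(PetrolPrice) + law

# Long expressions are broken into several strings by deparse().
pieces <- deparse(form[[3]])
length(pieces)

# Collapse the pieces into a single string before building a formula.
one_line <- paste(trimws(pieces), collapse = " ")
rhs_formula <- stats::as.formula(paste("~", one_line))

rhs_labels <- attr(terms(rhs_formula), "term.labels")
length(rhs_labels)
```

In R 4.0.0 and later, deparse1() performs the same collapse in a single call.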

This feels a bit hacky but seems to work, except for the left-hand side, which is not split:

tidy.formula <- function(x, ...){
  if (length(x) == 2)
    return(sideTerms(x, "rhs"))
  else 
    return(rbind(sideTerms(x, "lhs"),
                 sideTerms(x, "rhs")))
}

sideTerms <- function(f, side){
  factors <- attr(terms(f), "factors")
  term_transform <- 
    if(side == "lhs"){
      rownames(factors)[rowSums(factors) == 0]
    } else rownames(factors)[rowSums(factors) > 0]
  data.frame(side = rep(side, length(term_transform)),
             position = 1:length(term_transform),
             term_transform = term_transform,
             stringsAsFactors = FALSE)
}

form <- a + b ~ c + d * e - f + c:f + func(x * y) + rcs(w, 3) + log(k) + g^2
tidy.formula(form)
  side position term_transform
1  lhs        1          a + b
2  rhs        1              c
3  rhs        2              d
4  rhs        3              e
5  rhs        4              f
6  rhs        5    func(x * y)
7  rhs        6      rcs(w, 3)
8  rhs        7         log(k)
9  rhs        8              g

form <- log(drivers) ~ SSMtrend(1, Q = list(NA)) + 
  SSMseasonal(period = 12, sea.type = "trigonometric", Q = NA) + 
  log(PetrolPrice) + law

tidy.formula(form)
  side position                                               term_transform
1  lhs        1                                                 log(drivers)
2  rhs        1                                    SSMtrend(1, Q = list(NA))
3  rhs        2 SSMseasonal(period = 12, sea.type = "trigonometric", Q = NA)
4  rhs        3                                             log(PetrolPrice)
5  rhs        4                                                          law

nutterb commented on July 19, 2024

We could preserve the "lhs" split by using a nested paste. Looks ugly, but.....

tidy.formula <- function(x, ...){
  if (length(x) == 2)
    return(sideTerms(x[[2]], "rhs"))
  else 
    return(rbind(sideTerms(x[[2]], "lhs"),
                 sideTerms(x[[3]], "rhs")))
}

sideTerms <- function(f, side){
  f <- stats::as.formula(paste("~", 
                               paste(stringr::str_trim(deparse(f)), 
                                     collapse = " "
                                     )
                               )
                         )
  term_transform <- rownames(attr(terms(f), "factors"))
  data.frame(side = rep(side, length(term_transform)),
             position = 1:length(term_transform),
             term_transform = term_transform,
             stringsAsFactors = FALSE)
}

form <- log(drivers) ~ SSMtrend(1, Q = list(NA)) + 
  SSMseasonal(period = 12, sea.type = "trigonometric", Q = NA) + 
  log(PetrolPrice) + law

tidy(form)

form <- a + b ~ c + d * e - f + c:f + func(x * y) + rcs(w, 3) + log(k) + g^2

tidy(form)

mbojan commented on July 19, 2024

I found this thread while trying to tidy() a bunch of models that include interaction terms and categorical variables. I ran into the problem of identifying the tidied rows that correspond to interaction terms involving specific variables and/or levels (values of categorical predictors). At the moment it seems this can only be done with some creative use of regular expressions, which are hard to write in a generic way without hard-coding things like variable names or assuming something about the model specification. I don't think a generic solution can be reduced to post-processing of tidy() output.

Perhaps broom could do better by e.g. providing a possibility to append the tidy output with column(s) that would more closely represent the model specification?

Most of the model outputs that broom tries to tidy are specified using the formula interface. Formulas, apart from the original set of operators from Wilkinson & Rogers, might now include I() calls, pseudo-functions (like s() for splines), etc. The output of modeling functions essentially flattens the model formula tree (thanks @hadley) to rows in a table of coefficients.

Perhaps one idea would be to have a column of mode list whose elements are vectors of the names of the variables included in the particular term? I think this could be done by somewhat reverse-engineering the process in which model terms (?terms) and their names are constructed?
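A rough sketch of what such a list column could look like, built from the factors attribute of terms(). The function name term_variables is hypothetical, and the sketch assumes the formula has at least one term on the right-hand side; it covers term labels only, not the coefficient names a fitted model would produce.

```r
# Hypothetical helper: one row per model term, with a list column holding
# the names of the variables involved in that term.
term_variables <- function(form) {
  fac <- attr(terms(form), "factors")
  labels <- colnames(fac)
  vars <- lapply(labels, function(lbl) {
    # Rows with a non-zero entry are the components of this term.
    components <- rownames(fac)[fac[, lbl] > 0]
    # Parse each component and pull out the bare variable names.
    unique(unlist(lapply(components, function(s) all.vars(parse(text = s)))))
  })
  out <- data.frame(term = labels, stringsAsFactors = FALSE)
  out$variables <- vars
  out
}

res <- term_variables(y ~ a * b + log(k) + I(x^2))
res$term
res$variables
```

With a column like this, rows for interactions involving a given variable can be found with a membership test instead of a regular expression.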

github-actions commented on July 19, 2024

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.
