tidy output for rms models about broom HOT 30 CLOSED

sumprain avatar sumprain commented on August 20, 2024 1
tidy output for rms models

from broom.

Comments (30)

jenniferthompson avatar jenniferthompson commented on August 20, 2024 1

True. summary.rms works differently than, say, summary.lm in a couple of ways: it wants datadist to be set so it knows what values of each covariate to compare, and it combines all terms for a particular covariate, for example in the presence of splines or interactions. So when fitting olsfit <- ols(mpg ~ rcs(wt, 3) * cyl, data = mtcars), rather than getting the six coefficients that the equivalent summary(lm(mpg ~ rcs(wt, 3) * cyl, data = mtcars)) fit would give, you only get two lines, by default comparing the 75th to the 25th percentile of each covariate. That's nice since that's often a more reasonable comparison than, say, one year of age, but it does make things tricky in this context.
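To make the 75th-versus-25th-percentile comparison concrete, here is a minimal base R sketch. It uses lm rather than ols and a purely linear term, so it only illustrates the idea; the model, the choice of wt, and the quantile() defaults are assumptions for illustration, not a description of what summary.rms does internally.

```r
# Illustration only: an interquartile-range effect for a linear term,
# computed by hand with base R (lm stands in for an rms fit here).
fit <- lm(mpg ~ wt + hp, data = mtcars)

# 25th and 75th percentiles of the covariate of interest
q <- quantile(mtcars$wt, probs = c(0.25, 0.75))

# For a linear term, the effect over that range is slope * (high - low)
effect_iqr <- unname(coef(fit)["wt"] * diff(q))

# The same quantity from predictions at the two values, holding hp fixed
nd <- data.frame(wt = unname(q), hp = median(mtcars$hp))
effect_pred <- unname(diff(predict(fit, newdata = nd)))

c(per_unit = unname(coef(fit)["wt"]), iqr = effect_iqr, from_predict = effect_pred)
```

With splines or interactions there is no single slope to multiply, so the prediction-based form is the one that generalizes.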

For broom to get the same information in the same format as for other models, I don't think it could use summary.rms; print(olsfit) has the information, but calculates it on the fly. It might have to get the pieces individually - eg, using coef(olsfit) and sqrt(diag(vcov(olsfit))).
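A minimal sketch of that piece-by-piece approach, assuming only that the fit has coef(), vcov(), and df.residual() methods. The helper name tidy_from_coef is hypothetical, and the example runs on a base R lm fit so it does not depend on rms; the thread does not confirm that every rms class supplies df.residual(), hence the explicit df argument.

```r
# Hypothetical helper (not part of broom or rms): build a coefficient table
# from coef() and vcov() alone, without going through summary().
tidy_from_coef <- function(fit, df = df.residual(fit)) {
  est  <- coef(fit)
  se   <- sqrt(diag(vcov(fit)))
  stat <- unname(est / se)
  data.frame(
    term      = names(est),
    estimate  = unname(est),
    std.error = unname(se),
    statistic = stat,
    # two-sided p-value from the t distribution
    p.value   = 2 * pt(abs(stat), df = df, lower.tail = FALSE),
    stringsAsFactors = FALSE
  )
}

# Demonstrated on lm; the thread suggests the same calls for an ols fit
fit <- lm(mpg ~ wt + cyl, data = mtcars)
tidy_from_coef(fit)
```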

nutterb avatar nutterb commented on August 20, 2024 1

[question removed from the record]

er..ignore the clown from Kentucky. It seems asking really stupid questions is a prerequisite I must fill before I can have my moments of clarity.

kylelundstedt avatar kylelundstedt commented on August 20, 2024 1

Bump

alexpghayes avatar alexpghayes commented on August 20, 2024 1

Once I get 0.7.0 out this summer I'll try to bump this up the priority list.

I also wrote the author, but he's generally unhelpful.

Mood.

Deleetdk avatar Deleetdk commented on August 20, 2024 1

@alexpghayes Bumping. As I detailed in this bug report over at rms, the broom 0.7.0 update breaks the existing hack for extracting model information from an rms ols object via summary.lm(). Granted, much of the difficulty in streamlining rms functionality is that the rms code is not structured well. But since I depend on summary.lm() in some of my own code, I am currently forced to stay on broom 0.5.X to avoid this break, so I am keen to see if we can find some solution!

grasshoppermouse avatar grasshoppermouse commented on August 20, 2024

+1. Came here looking for exactly this.

dgrtwo avatar dgrtwo commented on August 20, 2024

I've had trouble writing tidiers for the rms package because almost none of the package's examples work (most are nested in "dontrun").

Which classes did you have in mind? I could probably tidy the "cph" object. For anything else, it would be very helpful if you could suggest reproducible code that would generate an object that should be tidied!

grasshoppermouse avatar grasshoppermouse commented on August 20, 2024

library(rms)
m = lrm(am ~ mpg, data=mtcars)

dylanbeaudette avatar dylanbeaudette commented on August 20, 2024

+1.

library(rms)

x.1 <- rnorm(20)
x.2 <- rnorm(20)
x.3 <- rnorm(20)
y <- x.1 + x.2 + x.3 + x.1 * x.2 + rnorm(20)

ols(y ~ x.1 + x.2 + x.3)

andremrsantos avatar andremrsantos commented on August 20, 2024

+1

jenniferthompson avatar jenniferthompson commented on August 20, 2024

+1!

If you need more examples than the above, I'm happy to help - I use rms frequently, mainly because of the nice anova and summary methods, but combining models and extracting info is a challenge.

nutterb avatar nutterb commented on August 20, 2024

Just out of curiosity, when tidying a summary.rms object, what should the expectations be regarding the datadist option? It is actually quite difficult to get summary to run without datadist being defined prior to calling summary.

jenniferthompson avatar jenniferthompson commented on August 20, 2024

I know that feeling! And thanks for pointing out that you have tidy methods for some rms functions - that's super helpful!

nutterb avatar nutterb commented on August 20, 2024

Before I submit this pull request, I want to ask an intelligent question this time:

I put together tidiers for ols, lrm, cph, summary.rms, and validate. I also have a glance.rms method defined (all located in my fork)

If you use the code below for an example of the summary.rms output, I'm not sure that the way I've constructed it is particularly palatable (and arguably not tidy). Do you have any feedback on how this should look in the summaries when there are Hazard Ratios (or Odds Ratios) present? My source of dissatisfaction is the term column. I have similar misgivings about the tidy.anova.rms output.

library(rms)
library(broom)  # tidy()/glance() methods for rms objects are the ones in my fork

n <- 1000
set.seed(731)
DF <- data.frame(
    age = 50 + 12*rnorm(n),
    sex = factor(sample(c('Male','Female'), n, 
                        replace=TRUE, prob=c(.6, .4))),
    cens = 15*runif(n)
)

h <- .02*exp(.04*(DF$age-50)+.8*(DF$sex=='Female'))
DF$dt <- -log(runif(n) / h)
DF$e <- ifelse(DF$dt <= DF$cens,1,0)
DF$dt <- pmin(DF$dt, DF$cens)

fit <- cph(Surv(dt, e) ~ rcs(age,4) + sex, data = DF, 
           x=TRUE, y=TRUE, surv = TRUE)
tidy(fit)
tidy(anova(fit))
glance(fit)

dd <- datadist(DF)
options(datadist = 'dd')
tidy(summary(fit))

Output of tidied summary object

                            term      low     high     diff     effect
1                            age 40.87213 57.38464 16.51252 -0.3966150
2               age-Hazard Ratio 40.87213 57.38464 16.51252  0.6725929
3              sex - Female:Male  2.00000  1.00000       NA -0.6759644
4 sex - Female:Male-Hazard Ratio  2.00000  1.00000       NA  0.5086656
   std.error   conf.low  conf.high type
1 0.08479186 -0.5628040 -0.2304260    1
2         NA  0.5696096  0.7941952    2
3 0.06612014 -0.8055575 -0.5463713    1
4         NA  0.4468387  0.5790472    2
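In the output above, each Hazard Ratio row is the exponential of the matching coefficient-scale row, so the ratio rows carry no information that the coefficient rows lack. The sketch below checks this with numbers copied from that output; the data frame and column selection are illustrative only.

```r
# Coefficient-scale rows copied from the tidied summary output above
coef_rows <- data.frame(
  term      = c("age", "sex - Female:Male"),
  effect    = c(-0.3966150, -0.6759644),
  conf.low  = c(-0.5628040, -0.8055575),
  conf.high = c(-0.2304260, -0.5463713),
  stringsAsFactors = FALSE
)

# Exponentiate the estimate and both confidence limits to get the ratio scale
ratio_rows <- transform(
  coef_rows,
  effect    = exp(effect),
  conf.low  = exp(conf.low),
  conf.high = exp(conf.high)
)
ratio_rows$type <- "ratio"

ratio_rows
```

This is one argument for keeping a type column: a user who only wants ratios can filter, and a user who only has coefficients can still recover them.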

nutterb avatar nutterb commented on August 20, 2024

As a separate question directed at @dgrtwo, what is broom's philosophy on adding measures to the output. With the rms models, I've often been more interested in the concordance index, which is a transformation of the discrimination index reported by rms . Is it inappropriate to calculate the concordance while tidying, or should that be left to the user?
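If the discrimination index meant here is Somers' Dxy (an assumption; the comment does not name it), the transformation is the standard rank-correlation identity C = 0.5 + Dxy / 2. A sketch with hypothetical helper names:

```r
# Hypothetical helpers: convert between Somers' Dxy and the concordance index C.
# Assumes the "discrimination index" in question is Dxy.
dxy_to_c <- function(dxy) 0.5 + dxy / 2
c_to_dxy <- function(c_index) 2 * (c_index - 0.5)

# Dxy = 0 is chance-level concordance; Dxy = 1 is perfect concordance
dxy_to_c(c(0, 0.5, 1))
```

Since this is a fixed, cheap transformation of a value the package already reports, it costs nothing extra at tidy time.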

jenniferthompson avatar jenniferthompson commented on August 20, 2024

I'm just now exploring broom, so definitely don't know what would be most consistent with the typical package mentality. But I do have lots of experience doing my own version of "tidying" these results for (hopefully) clear and concise presentation! Usually I do the following:

  1. Separate the variable names completely from the adjust-to values, and replace the numeric codes for categorical levels with their character counterparts (naturally that does force the low and high columns to be formatted as character as well). Eg, replace "2.0000" and "1.0000" with "Female" and "Male".
  2. Somehow get the data in a format where I can quickly and easily get the HR/ORs. That could mean a couple of things (at least), and I'm not sure what would be best. (I usually don't need the values on the original scale, so end up subsetting the data to just the ratio rows after some preliminary work.)
    1. Separate columns for the original and ratio quantities. Great for preserving all information and quickly selecting the ratio-related values, but doesn't lend itself to consistency with methods for ols.
    2. Separate rows with an indicator for type, much like Frank already has (type 1 or 2), but with variable names instead of "Hazard Ratio," for example.

I've included some code I wrote for your tidy.summary.rms that does the above. I'm sure it could be more elegant, and I'll apologize in advance if a PR would be better - I'm very new at using Github for collaborating! I have a function that I've been working on that does some similar things, but it's not production-ready and definitely not in the tidy/broom format.

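library(broom)  # fix_data_frame()
library(dplyr)  # %>%, mutate(), select(), lag()
library(tidyr)  # separate(), gather(), spread()

res <- fix_data_frame(summary(fit))
names(res) <- c("term", "low", "high", "diff", "effect", "std.error", "conf.low", "conf.high", "type")
res$term <- trimws(res$term)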

res <- res %>%
  ## Separate variable names from adjust-to values
  separate(term, into = c("term", "highc", "lowc"), sep = " *[-:] *") %>%
  ## Create character versions of all low/high values for each row
  mutate(highc = ifelse(!is.na(highc), highc,
                 ifelse(term %in% paste(c("Hazard", "Odds"), "Ratio"), NA,
                        as.character(round(high, 2)))),
         highc = ifelse(is.na(highc), dplyr::lag(highc), highc),
         lowc = ifelse(!is.na(lowc), lowc,
                ifelse(term %in% paste(c("Hazard", "Odds"), "Ratio"), NA,
                       as.character(round(low, 2)))),
         lowc = ifelse(is.na(lowc), dplyr::lag(lowc), lowc),
         ## "term" will temporarily be "variable, [coef or ratio]"
         term = ifelse(term %in% c("Hazard Ratio", "Odds Ratio"),
                       sprintf("%s, ratio", dplyr::lag(term)),
                       sprintf("%s, coef", term))) %>%
  dplyr::select(-type, -low, -high) %>%
  separate(term, into = c("term", "type"))

## Stop here for separate **rows** for coefficient and ratio effects

Output with separate rows:

> res
  term  type  highc  lowc     diff     effect  std.error   conf.low  conf.high
1  age  coef  57.38 40.87 16.51252 -0.3966150 0.08479186 -0.5628040 -0.2304260
2  age ratio  57.38 40.87 16.51252  0.6725929         NA  0.5696096  0.7941952
3  sex  coef Female  Male       NA -0.6759644 0.06612014 -0.8055575 -0.5463713
4  sex ratio Female  Male       NA  0.5086656         NA  0.4468387  0.5790472

Or, keep going for separate variables for coefficient and ratio effects:

res <- res %>%
  ## Reshape to get separate variables for original, ratio values
  gather(key = quantity, value = quantval, effect:conf.high) %>%
  mutate(type.quant = paste(type, quantity, sep = '.')) %>%
  dplyr::select(-quantity, -type) %>%
  spread(key = type.quant, value = quantval) %>%
  dplyr::select(-ratio.std.error) ## always NA, meaningless

Output with separate columns - might be easier to combine with tidy anova results:

> res
  term  highc  lowc     diff coef.conf.high coef.conf.low coef.effect coef.std.error ratio.conf.high
1  age  57.38 40.87 16.51252     -0.2304260    -0.5628040  -0.3966150     0.08479186       0.7941952
2  sex Female  Male       NA     -0.5463713    -0.8055575  -0.6759644     0.06612014       0.5790472
  ratio.conf.low ratio.effect
1      0.5696096    0.6725929
2      0.4468387    0.5086656

dgrtwo avatar dgrtwo commented on August 20, 2024

Thank you both for this really interesting discussion. I know rms tidiers have been a long wait. The truth is that I don't understand the package and statistical methods nearly well enough to get the tidiers right myself.

@nutterb Generally I support adding measures if they fit roughly these conditions:

  1. They are uncontroversial: no one would argue that they should be calculated differently or that they're unhelpful. Generally I trust the package authors in this regard, which means if the values are calculated this way elsewhere (e.g. in the print method or in another function) I'm happy to do them here
  2. They are fast to compute. (If they're slow but important, you could consider adding a "quick" parameter, probably default FALSE, that allows them to be skipped. We do this in tidy.lm)
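The "quick" idea can be sketched as follows. This is a hypothetical illustration of the pattern, not broom's actual tidy.lm code; tidy_sketch is an invented name and the example uses a base R lm fit.

```r
# Hypothetical sketch of a "quick" switch: return only the cheap columns
# when quick = TRUE, and add the more expensive ones otherwise.
tidy_sketch <- function(fit, quick = FALSE) {
  est <- coef(fit)
  out <- data.frame(
    term     = names(est),
    estimate = unname(est),
    stringsAsFactors = FALSE
  )
  if (quick) {
    return(out)
  }
  # Standard errors require the variance-covariance matrix
  out$std.error <- unname(sqrt(diag(vcov(fit))))
  out
}

fit <- lm(mpg ~ wt, data = mtcars)
tidy_sketch(fit, quick = TRUE)
tidy_sketch(fit)
```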

@jenniferthompson This is a great place to discuss those questions and share some draft code.

  • I definitely like your separation of the term column into three. I'm nervous about storing the numbers as part of a character vector but it may be unavoidable. What does the c in highc and lowc stand for, though? Could we replace it with something like value1 and value2?
  • I prefer the four-row version to the two-row version; I don't think we should have two different sets of confidence intervals as columns. It's a tough call, though. I think this is one of the harder models to tidy that we've tried.

nutterb avatar nutterb commented on August 20, 2024

@jenniferthompson I agree with @dgrtwo that it is better to use the four-row version. We can make it easy enough to filter and select, but the decision to go to the two-row version seems like a decision the end user should make. Truthfully, if I were to want the ratios separate from the coefficients, I'd most likely filter the four-row version at the time that I needed it.

I've also made a counter-proposal for the high and low values. Output displayed below. I think it's better to leave the values in their numeric form--broom is pretty consistent in maintaining values for use in plots. I'm not sure that anyone would ever want to plot those values, but if they did, converting them to characters would make the task a pretty hefty nightmare. So instead, I've added two more columns with the level names for categorical levels. I think it's cosmetically less appealing, but leaves the plotting option on the table.

I also used the terminology low.val and high.val, as I wasn't sure what lowc and highc meant (low constant and high constant?). (For @dgrtwo: rms has this neat little feature in its summary method that presents the effect as the change over the difference from the low to high value instead of as the per-unit value. This can really help with interpretation when your independent variables vary from, say, 0.27 to 0.33. A 1-unit change doesn't have a practical interpretation in those instances.)

  term high.level low.level  low.val high.val     diff     effect  std.error
1  age       <NA>      <NA> 40.87213 57.38464 16.51252 -0.3966150 0.08479186
2  age       <NA>      <NA> 40.87213 57.38464 16.51252  0.6725929         NA
3  sex     Female      Male  2.00000  1.00000       NA -0.6759644 0.06612014
4  sex     Female      Male  2.00000  1.00000       NA  0.5086656         NA
    conf.low  conf.high  type
1 -0.5628040 -0.2304260  coef
2  0.5696096  0.7941952 ratio
3 -0.8055575 -0.5463713  coef
4  0.4468387  0.5790472 ratio

Then, for consistency, I took a similar approach with the tidy.anova.rms method; I used the term alone and added a logical column to identify the non-linear rows.

   term nonlinear statistic df   p.value
1   age     FALSE 121.13441  3 0.0000000
2   age      TRUE   2.22537  2 0.3286752
3   sex     FALSE 104.51535  1 0.0000000
4 TOTAL     FALSE 218.98439  4 0.0000000

Thoughts or complaints?

jenniferthompson avatar jenniferthompson commented on August 20, 2024

highc and lowc were just what I named the character versions of the original "high" and "low" columns - I'm not attached to those names, especially if there's an applicable naming convention that better fits the broom framework. I'd prefer something with high/low rather than value1 and value2, so as to make it clear which value is the reference and which is the comparison value. Usually apparent with continuous variables, but not always with categorical!

@nutterb - I was thinking along the same lines for tidy.anova.rms. I did wonder if it would be better to have, say, age and age' for the two age rows, rather than the two variables for term + nonlinear, because this would make it easier to combine with tidy.summary.rms output (no duplicate information when joining); I'm not sure what would be most consistent with other broom functions. How does your latest version handle all the "(Factor+Higher Order Factors)", "f(A,B)", etc text that comes with a model that includes nonlinear terms and interactions (eg, fit <- cph(Surv(dt, e) ~ rcs(age,4) * sex, data = DF, x=TRUE, y=TRUE, surv = TRUE))? Those have historically been some of the trickiest pieces to deal with for me.

For the tidy.summary.rms output, I think the solution above looks like a good compromise. For my own use I'd probably nearly always combine the high.level/low.level and low.val high.val columns. (I usually combine summary.rms and anova.rms output in a single table for reports - having broom methods will shorten my code and hopefully motivate me to get a general public version ready!) I can't recall a time I've ever needed to plot those values, but agree that it would be good to preserve the numeric type for as long as it's not absolutely necessary to change it.

nutterb avatar nutterb commented on August 20, 2024

Those have historically been some of the trickiest pieces to deal with for me

No kidding. Someone should give the rms author a stern talking to :)

With some modifications I can handle the ANOVA output in one of two ways.

The first has your age and age' approach:

         term                                    type   statistic df   p.value
1         age           (Factor+Higher Order Factors) 123.5313164  6 0.0000000
2        age'                        All Interactions   2.3927544  3 0.4949848
3        age' Nonlinear (Factor+Higher Order Factors)   2.2887471  4 0.6828186
4         sex           (Factor+Higher Order Factors) 106.5338906  4 0.0000000
5        sex'                        All Interactions   2.3927544  3 0.4949848
6   age * sex           (Factor+Higher Order Factors)   2.3927544  3 0.4949848
7  age * sex'                               Nonlinear   0.1370057  2 0.9337908
8  age * sex'   Nonlinear Interaction : f(A,B) vs. AB   0.1370057  2 0.9337908
9       TOTAL                               NONLINEAR   2.2887471  4 0.6828186
10     TOTAL'                 NONLINEAR + INTERACTION   4.6770938  5 0.4565440
11     TOTAL'                                    <NA> 231.0912826  7 0.0000000

The second is similar to my earlier approach, but I had to add a second logical column to indicate the interaction rows as well.

        term interaction nonlinear                                    type   statistic df   p.value
1        age       FALSE     FALSE           (Factor+Higher Order Factors) 123.5313164  6 0.0000000
2        age        TRUE     FALSE                        All Interactions   2.3927544  3 0.4949848
3        age       FALSE      TRUE Nonlinear (Factor+Higher Order Factors)   2.2887471  4 0.6828186
4        sex       FALSE     FALSE           (Factor+Higher Order Factors) 106.5338906  4 0.0000000
5        sex        TRUE     FALSE                        All Interactions   2.3927544  3 0.4949848
6  age * sex       FALSE     FALSE           (Factor+Higher Order Factors)   2.3927544  3 0.4949848
7  age * sex       FALSE      TRUE                               Nonlinear   0.1370057  2 0.9337908
8  age * sex        TRUE      TRUE   Nonlinear Interaction : f(A,B) vs. AB   0.1370057  2 0.9337908
9      TOTAL       FALSE      TRUE                               NONLINEAR   2.2887471  4 0.6828186
10     TOTAL        TRUE      TRUE                 NONLINEAR + INTERACTION   4.6770938  5 0.4565440
11     TOTAL       FALSE     FALSE                                    <NA> 231.0912826  7 0.0000000
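One way such flags could be derived from the type labels is plain pattern matching. This is a sketch, not necessarily how the fork builds them; the type vector is copied from the table above.

```r
# Row labels copied from the "type" column of the table above
type <- c(
  "(Factor+Higher Order Factors)",
  "All Interactions",
  "Nonlinear (Factor+Higher Order Factors)",
  "(Factor+Higher Order Factors)",
  "All Interactions",
  "(Factor+Higher Order Factors)",
  "Nonlinear",
  "Nonlinear Interaction : f(A,B) vs. AB",
  "NONLINEAR",
  "NONLINEAR + INTERACTION",
  NA
)

# Case-insensitive matching; grepl() returns FALSE for NA labels
flags <- data.frame(
  interaction = grepl("interaction", type, ignore.case = TRUE),
  nonlinear   = grepl("nonlinear", type, ignore.case = TRUE)
)
flags
```

On these eleven labels the result matches the interaction and nonlinear columns shown in the table.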

Despite the added goofiness of the logical columns, I'm kind of partial to the second version. The primary reason for my preference is that there really isn't anywhere else in broom that I'm aware of that adds a suffix to the term name like you've proposed. Then again, nowhere else in broom do we seem to be dealing with such a wild set of objects. So I'm leaning toward consistency.

For merging with the summary output, it isn't a big leap to do

left_join(tidy(summary(rms.obj)),
          filter(tidy(anova(rms.obj)), !interaction & !nonlinear),
          by = c("term" = "term"))

jenniferthompson avatar jenniferthompson commented on August 20, 2024

I agree that consistency and clarity would be the highest goals here! The second option looks reasonable to me, and thinking through how I often present rms results, I think it would offer flexibility for further combination and presentation, as you've noted.

xiaodaigh avatar xiaodaigh commented on August 20, 2024

bump

Deleetdk avatar Deleetdk commented on August 20, 2024

Bump, currently, errors are returned that aren't very informative.

> g_model_1 %>% broom::tidy()
Error in summary.rms(x) : 
  adjustment values not defined here or with datadist for g1_g_loading

alexpghayes avatar alexpghayes commented on August 20, 2024

On the docket but very low priority. If you provide a reprex I can look into a more informative error in the meantime.

nutterb avatar nutterb commented on August 20, 2024

@Deleetdk The error you're getting is not a broom error. It looks like you need to run the following before building your model (I think this is right; I'm just working from memory):

dd <- datadist(g1_g_loading)
options(datadist = 'dd')

That should make that particular error go away. I'm not sure what else may go wrong after you resolve this. The rms tidiers aren't part of broom yet, and the code in this thread is somewhat old.
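On the question of a more informative error, one possible shape is a wrapper that catches the rms message and appends a hint. This is only a sketch in base R; with_datadist_hint is a hypothetical name, and the matching on the word "datadist" is an assumption based on the error text quoted in this thread.

```r
# Hypothetical wrapper: re-raise errors that mention datadist with a hint
# about the setup step, and pass every other error through unchanged.
with_datadist_hint <- function(expr) {
  tryCatch(
    expr,
    error = function(e) {
      msg <- conditionMessage(e)
      if (grepl("datadist", msg, fixed = TRUE)) {
        stop(
          msg,
          "\nHint: run dd <- datadist(<your data>) and ",
          "options(datadist = 'dd') before summarising the fit.",
          call. = FALSE
        )
      }
      stop(e)
    }
  )
}

# Stand-in for a call that fails the way summary.rms does in this thread
failing_call <- function() {
  stop("adjustment values not defined here or with datadist for x")
}
result <- tryCatch(with_datadist_hint(failing_call()),
                   error = function(e) conditionMessage(e))
cat(result, "\n")
```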

Deleetdk avatar Deleetdk commented on August 20, 2024

@nutterb

Here's a simple example. Goal: get the standard errors and p values from rms linear model.

> library(tidyverse)
> library(rms)
> iris_model = ols(Sepal.Width ~ Petal.Width, data = iris)
> iris_model
Linear Regression Model
 
 ols(formula = Sepal.Width ~ Petal.Width, data = iris)
 
                Model Likelihood     Discrimination    
                   Ratio Test           Indexes        
 Obs     150    LR chi2     21.59    R2       0.134    
 sigma0.4070    d.f.            1    R2 adj   0.128    
 d.f.    148    Pr(> chi2) 0.0000    g        0.182    
 
 Residuals
 
      Min       1Q   Median       3Q      Max 
 -1.09907 -0.23626 -0.01064  0.23345  1.17532 
 
 
             Coef    S.E.   t     Pr(>|t|)
 Intercept    3.3084 0.0621 53.28 <0.0001 
 Petal.Width -0.2094 0.0437 -4.79 <0.0001 
 
> #try to get from broom
> iris_model %>% broom::tidy()
Error in summary.rms(x) : 
  adjustment values not defined here or with datadist for Petal.Width
> #capture output
> iris_model_out = capture.output(print(iris_model))
> iris_model_out
 [1] "Linear Regression Model"                                 " "                                                      
 [3] " ols(formula = Sepal.Width ~ Petal.Width, data = iris)"  " "                                                      
 [5] "                Model Likelihood     Discrimination    " "                   Ratio Test           Indexes        "
 [7] " Obs     150    LR chi2     21.59    R2       0.134    " " sigma0.4070    d.f.            1    R2 adj   0.128    "
 [9] " d.f.    148    Pr(> chi2) 0.0000    g        0.182    " " "                                                      
[11] " Residuals"                                              " "                                                      
[13] "      Min       1Q   Median       3Q      Max "          " -1.09907 -0.23626 -0.01064  0.23345  1.17532 "         
[15] " "                                                       " "                                                      
[17] "             Coef    S.E.   t     Pr(>|t|)"              " Intercept    3.3084 0.0621 53.28 <0.0001 "             
[19] " Petal.Width -0.2094 0.0437 -4.79 <0.0001 "              " "                                                      
> #get the values for predictor
> iris_model_out[19] %>% str_match_all("(-?\\d+\\.\\d+)") %>% {.[[1]][, 2] %>% as.numeric()}
[1] -0.2094  0.0437 -4.7900  0.0001
> #cheat and use the lm() broom function manually
> iris_model %>% summary.lm() %>% tidy()
# A tibble: 2 x 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 Intercept      3.31     0.0621     53.3  1.84e-98
2 Petal.Width   -0.209    0.0437     -4.79 4.07e- 6

I.e., these are calculated when printing the model, but aren't actually found in the model object as far as I can tell. One can attempt to call broom, but this requires setting other variables beforehand, which is not a good idea to rely upon. So somehow, the print.ols() call calculates the parameter values we want but doesn't return them, so one has to get them from the printed output.

Same issue here:

https://stackoverflow.com/questions/47724189/extract-all-model-statistics-from-rms-fits

I also wrote the author, but he's generally unhelpful.

alexpghayes avatar alexpghayes commented on August 20, 2024

Deleetdk avatar Deleetdk commented on August 20, 2024

Broom 0.7.3 fixes this, at least for now. https://cran.r-project.org/web/packages/broom/news/news.html

github-actions avatar github-actions commented on August 20, 2024

This issue has been automatically closed due to inactivity.

github-actions avatar github-actions commented on August 20, 2024

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.
