support `optim()` output? (broom issue, 17 comments, closed)

cboettig avatar cboettig commented on August 20, 2024
support `optim()` output?

from broom.

Comments (17)

dgrtwo avatar dgrtwo commented on August 20, 2024 2

@cboettig Thanks Carl, so glad to hear!

There's a little of this philosophy in the broom paper, particularly the k-means example. (That actually uses the broom function inflate, which basically expand.grids an existing data frame). And I'm working on some blog posts that cover it as well.

cboettig avatar cboettig commented on August 20, 2024 1

@ashander Ironically I was just telling @dgrtwo how helpful this pattern has been to me on the same day you posted this. In general I've found the pattern of expand.grid -> input_data.frame -> group_by -> do -> tidy very helpful myself.

If it helps in your thinking about generalizing this in runr, here's the example I was working through when I was first asking drob about this:

This exercise is basically replicating Perry's paper on the weak evidence for density dependence after you account for measurement error (i.e. kalman filter Gompertz model), for all appropriate timeseries in the global population dynamics database. I think the code I show there could be improved upon / abstracted further, and I think this kind of research could benefit from such a higher-level abstraction syntax that makes it easier both to do this kind of analysis and to quickly see what's going on. I found Perry's Kalman filter models to be a nice level of complexity for this exercise; so curious what you think.

Thanks for sharing the runr link, will try and take a look, though I'll leave it to @dgrtwo (or maybe post as a stackoverflow q?) to comment on ideas improving the syntax and dropping the do.call.

dgrtwo avatar dgrtwo commented on August 20, 2024

Thanks! I would certainly like to cover it, but one thing worth noting is that what optim returns is just a list, not an S3 object, so we can't just have a tidy.optim S3 method.

I've considered having a method tidy.list that would choose between methods, something like:

tidy.list <- function(x, ...) {
    # optim() returns a plain list with these five elements
    if (all(c("par", "value", "counts", "convergence", "message") %in% names(x))) {
        tidy_optim(x, ...)
    } else if (all(c("d", "u", "v") %in% names(x))) {
        # svd() returns a list with elements d, u and v
        tidy_svd(x, ...)
    } else {
        stop("No tidying method recognized for this list")
    }
}

and so on, where tidy_optim and tidy_svd are functions that are not exported. This might be a good enough reason to go through with it.

Secondly, I'm not certain "one column per parameter" is the tidy form here, since it's fairly rare that a tidy output has a potentially arbitrary number of columns. (Though recombining models with different numbers of parameters wouldn't be too painful with dplyr::bind_rows, which would fill in missing cases with NAs) I think the glance method would be the minimized value, convergence, counts, etc, but I would think of tidy as "one row per parameter". Which is more likely, to plot parameter1 and parameter2 (across many models) against each other in a scatter plot, or to have them faceted in histograms?

So there'd be two columns of tidy(o), parameter and value. If the parameter vector is unnamed it would give them names. For instance:

func <- function(x) {
    (x[1] - 2)^2 + (x[2] - 3)^2
}

Then tidy(optim(c(2, 1), func)) should become

   parameter value
1 parameter1     2
2 parameter2     3

and tidy(optim(c(a = 2, b = 1), func)) should become

  parameter value
1         a     2
2         b     3
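
A minimal base-R sketch of what such a one-row-per-parameter tidier could look like. The name `tidy_optim_sketch` is hypothetical and this is not broom's actual implementation; it only relies on the documented structure of the list that `optim()` returns.

```r
# Hypothetical helper (not broom's implementation): one row per parameter.
tidy_optim_sketch <- function(x) {
  nms <- names(x$par)
  # Unnamed parameter vectors get default names parameter1, parameter2, ...
  if (is.null(nms)) nms <- paste0("parameter", seq_along(x$par))
  data.frame(parameter = nms, value = unname(x$par), stringsAsFactors = FALSE)
}

func <- function(x) {
  (x[1] - 2)^2 + (x[2] - 3)^2
}

unnamed <- tidy_optim_sketch(optim(c(2, 1), func))
named   <- tidy_optim_sketch(optim(c(a = 2, b = 1), func))
print(unnamed)
print(named)
```

The fitted values are numerical approximations, so they will be close to 2 and 3 rather than exactly equal to them.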

Your thoughts?

cboettig avatar cboettig commented on August 20, 2024

Yeah, good point about optim's output never actually being given an S3 class. I like your tidy.list approach though, that seems like the perfect way to handle such cases.

Re 'one column per parameter', that's a good question -- I could see it both ways. I often end up fitting the same model to many datasets -- a parametric bootstrap is the most obvious example of this approach, where you have many simulated realizations of the same model; though often I'm working with many real datasets to which I'm fitting the same underlying model. In this way, each data set is a 'realization' of the process, hence it makes sense to think of datasets as observations in the rows and parameters as the variables in the columns; particularly when you might have thousands of datasets and only a handful of parameters.

A different case is that of one data set and many models, though the right layout isn't clear here (indeed it's not clear that any tabular layout is possible).

The case of a single model with a single dataset probably isn't that interesting, since the thing is small enough that it is easily manipulated without dplyr etc. I think the strength of the broom approach is that it makes it easy to combine model outputs when you're looping over something (like multiple datasets / subsets being fit to the same model).

In the "parameters as rows" model; maybe columns would be "dataset", "model", "parameter", "value", which would allow for adding multiple models and datasets? Now that I think about it, that does probably work more cleanly than my suggestion of "parameters as columns"
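
To illustrate why the "parameters as rows" layout loses nothing, here is a small base-R sketch that goes from the long layout back to "parameters as columns" with `reshape()`. The data values are made up for illustration and are not real fits.

```r
# Illustrative values only (made up for the sketch, not real fits).
long <- data.frame(
  dataset   = rep(c("dataset1", "dataset2"), each = 2),
  model     = "model1",
  parameter = rep(c("a", "b"), times = 2),
  value     = c(2.0, 3.0, 2.1, 2.9),
  stringsAsFactors = FALSE
)

# Long -> wide: one row per dataset/model, one column per parameter.
wide <- reshape(long, idvar = c("dataset", "model"),
                timevar = "parameter", direction = "wide")
print(wide)
```

Going the other way (wide to long) is equally mechanical, so the long layout can serve as the common format and the wide one can be derived when a particular plot or table needs it.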

dgrtwo avatar dgrtwo commented on August 20, 2024

I think that's exactly the idea! Each observation in a tidy model would be a dataset-model-parameter combination. (Indeed, we could go further- if there are different input settings the algorithm could take, we could end up with one dataset-model-setting-parameter combination per row, which is all the more reason not to get caught up with a plan of "one-dataset-per-row")

I often combine multiple tidy models where each model gets multiple rows (see for instance the split-apply-combine of linear models in my broom manuscript). Plus, it will let you do cool things like

ggplot(combined_optims, aes(x = value, color = model)) +
    geom_density() +
    facet_grid(dataset ~ parameter)

I've added an optim tidying method for now- please give it a shot :)

cboettig avatar cboettig commented on August 20, 2024

@dgrtwo I think it would be necessary to have tidy return the "value" and the "convergence" data from optim as well; probably just as additional rows? The value (usually the minus log likelihood if we're using optim in model fitting) is particularly crucial, as it is the thing we most often want to compare across the models fitted to the same data.

dgrtwo avatar dgrtwo commented on August 20, 2024

See the glance method, which returns those values.
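
A rough base-R sketch of a one-row-per-model summary of an `optim()` result. The helper name `glance_optim_sketch` is hypothetical and this is not broom's implementation; the column names simply mirror the ones discussed in this thread.

```r
# Hypothetical helper (not broom's implementation): one row per optim() fit.
glance_optim_sketch <- function(x) {
  data.frame(
    value          = x$value,                       # minimized objective
    function.count = unname(x$counts["function"]),  # calls to fn
    gradient.count = unname(x$counts["gradient"]),  # NA for Nelder-Mead
    convergence    = x$convergence                  # 0 means success
  )
}

func <- function(x) (x[1] - 2)^2 + (x[2] - 3)^2
glanced_one <- glance_optim_sketch(optim(c(2, 1), func))
print(glanced_one)
```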

cboettig avatar cboettig commented on August 20, 2024

@dgrtwo right, but it returns them as a row, and without the parameters. In the spirit of broom here I'm looking to get a single table that includes all the outputs I need. For instance, one might want to drop parameter values for which the method did not converge. It's cleaner to do this if convergence is a column. Does that make sense?

dgrtwo avatar dgrtwo commented on August 20, 2024

I think it's important to keep the levels of observational units separate in the returned values; that is, tidy returns one-row-per-parameter, and glance returns one-row-per-model.

In your proposal, I'd separate out the tidied outputs from the glances, into something like tidied and glanced. If I don't want to run the optims twice (and I don't!), I'd keep it as a list column. Pseudo-R:

optim_results <- mydata %>%
    group_by(dataset, model) %>%
    do(optim = optim(<arguments>))

optim_results looks like:

   dataset  model     optim
1 dataset1 model1 <list[5]>
2 dataset1 model2 <list[5]>
3 dataset2 model1 <list[5]>
...

There's a shortcut in broom for tidying these rowwise models:

tidied <- tidy(optim_results, optim)
glanced <- glance(optim_results, optim)

At which point tidied looks like

   dataset  model  parameter        value
1 dataset1 model1 parameter1 0.0001274686
2 dataset1 model1 parameter2 0.0001447624
3 dataset1 model2 parameter1 0.0001274686

While a row of glance would look something like

   dataset  model        value function.count gradient.count convergence
1 dataset1 model1 3.720441e-08             65             NA           0

To filter for models that converged, I'd do

converged <- glanced %>% filter(convergence == 0)
converged_params <- tidied %>% semi_join(converged, by = c("dataset", "model"))

Or just combine them first:

converged_params <- tidied %>%
    inner_join(glanced, by = c("dataset", "model")) %>%
    filter(convergence == 0)

I know I'm oversimplifying the above, but with a variation on that first line you can go a long way. See here for an example of keeping tidy, augment and glance outputs separate then recombining them after.


Of course that's more work than pre-combining them. You could always write a wrapper function that calls tidy and glance and recombines them.
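
Such a wrapper might look like the following base-R sketch. The function name `tidy_glance_sketch` and the column names `estimate` and `objective` are hypothetical, chosen here to avoid two columns both called `value`; this is not part of broom.

```r
# Hypothetical wrapper: parameters plus model-level columns in one table.
# The model-level values are repeated on every parameter row.
tidy_glance_sketch <- function(x) {
  nms <- names(x$par)
  if (is.null(nms)) nms <- paste0("parameter", seq_along(x$par))
  data.frame(
    parameter   = nms,
    estimate    = unname(x$par),
    objective   = x$value,        # minimized value, recycled across rows
    convergence = x$convergence,  # 0 indicates successful completion
    stringsAsFactors = FALSE
  )
}

func <- function(x) (x[1] - 2)^2 + (x[2] - 3)^2
combined <- tidy_glance_sketch(optim(c(a = 2, b = 1), func))

# Keep only parameters from fits that converged.
converged_params <- combined[combined$convergence == 0, ]
print(converged_params)
```

The trade-off is the one described above: the table is convenient to filter, but it mixes two observational units, so the model-level columns are duplicated once per parameter.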

But I think it doesn't fit in broom's philosophy to have that be the default. If we add columns to tidy output that are actually only one-value-per-model, would that be true everywhere? Should we include r.squared, adj.r.squared, and so on columns to tidy.lms output? (Someone could want to filter out model fits where R2 fell below a threshold). Should we add the values to augment results as well?

ETA: I definitely don't want to add the convergence as an additional row. The whole purpose of tidy data is so we can say "each row represents a <>", and not "each row represents a parameter, except for the last which represents information about the model fit"

cboettig avatar cboettig commented on August 20, 2024

I see, thanks for the patient explanation; as you've said, it's not always obvious how best to summarize model outputs as data frames, and having done quite a lot of model classes already you have a pretty strong intuition for this and I'm still catching up.

You've hit the nail on the head about wanting to avoid multiple calls to optim(), that's how I got thinking about this in the first place. Unfortunately, that's really my motivation for wanting to combine these into a single data frame as well.

I don't really follow the notion of grouping the data by model. The data is grouped by dataset, e.g. I have 10 datasets of timeseries data say. I want to fit each model to each of the datasets, so I define my own function

f <- function(df, ...) {
    # gr = NULL is named so that df is passed on to the model function
    # rather than being matched to optim's gradient argument
    model1 <- optim(init1, model1_function, gr = NULL, df, ...)
    model2 <- optim(init2, model2_function, gr = NULL, df, ...)
    model3 <- optim(init3, model3_function, gr = NULL, df, ...)

    # create some data.frame summarizing the outputs of all models
    rbind(cbind(tidy(model1), glance(model1)),
          cbind(tidy(model2), glance(model2)),
          cbind(tidy(model3), glance(model3)))
}

and then do:

df %>% group_by(dataset) %>% do(f(.)) -> fits

Perhaps it is more elegant to have a separate do call for each model; followed perhaps by inner_joins for each model and then rbind the resulting data.frames for each model together, but I was hoping for a workflow that abstracted away more of the joining. Maybe I'm just not thinking about this in the right way.

I agree that including one-value-per-model elements as repeated values in columns really isn't great.
Maybe this highlights why I initially imagined structuring the table with parameters, convergence and likelihood as columns: instead of the cbind shown above, it would return one row per model per dataset.

dgrtwo avatar dgrtwo commented on August 20, 2024

While I'm still developing my own approach to such problems, my general strategy is

  1. Develop a function that takes only numeric/character inputs.
  2. Use expand.grid to get all combinations of input parameters
  3. Group by all the input parameters and use do.

Thus I would come up with something like:

models <- list("model1" = list(func = model1_function, init = init1),
               "model2" = list(func = model2_function, init = init2),
               "model3" = list(func = model3_function, init = init3))

datasets <- list("dataset1" = dataset1,
                 "dataset2" = dataset2,
                 "dataset3" = dataset3)

optim_func <- function(model_name, dataset_name) {
    model <- models[[model_name]]
    # gr = NULL is named so the dataset is passed on to the model function
    optim(model$init, model$func, gr = NULL,
          datasets[[dataset_name]])
}

combinations <- expand.grid(model = names(models),
                            dataset = names(datasets),
                            stringsAsFactors = FALSE) %>%
    group_by(model, dataset) %>%
    do(o = optim_func(.$model, .$dataset))

combinations_tidied <- tidy(combinations, o)
combinations_glanced <- glance(combinations, o)

There are elements here that can be improved. For instance, I wish there were a group_by_all feature in dplyr. (rowwise doesn't do it for me since it drops the group names afterwards). Incidentally, the do line could be replaced by do.call(optim_func, .), I just left it for transparency.

But what I like about it is its flexibility. If I later want to add another input parameter (perhaps a parameter of the model that is not fit by optimization), I just add it to the optim_func function, and to the expand.grid. I've used this pattern to handle a number of factorial and model selection problems.
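
For a version of this pattern that runs end to end, here is a self-contained sketch in base R only (no dplyr or broom). The toy models, the noise-free datasets and the helper name `fit_one` are all made up for illustration; `Map()` over the rows of the grid stands in for `group_by()` plus `do()`.

```r
# Toy models (hypothetical): an objective function and starting values each.
models <- list(
  linear    = list(func = function(p, d) sum((d$y - (p[1] + p[2] * d$x))^2),
                   init = c(a = 0, b = 0)),
  quadratic = list(func = function(p, d) sum((d$y - (p[1] * d$x + p[2] * d$x^2))^2),
                   init = c(a = 0, b = 0))
)

# Toy datasets (made up, noise-free).
x <- 1:10
datasets <- list(
  dataset1 = data.frame(x = x, y = 1 + 2 * x),
  dataset2 = data.frame(x = x, y = 3 - x)
)

# Takes only character inputs and returns a tidy data frame immediately.
fit_one <- function(model, dataset) {
  m <- models[[model]]
  # gr = NULL is named so the dataset is passed on to the objective function.
  fit <- optim(m$init, m$func, gr = NULL, datasets[[dataset]])
  data.frame(model = model, dataset = dataset,
             parameter = names(fit$par), value = unname(fit$par),
             stringsAsFactors = FALSE)
}

# All model/dataset combinations, one fit per row of the grid.
grid <- expand.grid(model = names(models), dataset = names(datasets),
                    stringsAsFactors = FALSE)
tidied <- do.call(rbind, unname(Map(fit_one, grid$model, grid$dataset)))
rownames(tidied) <- NULL
print(tidied)
```

Adding another input (for example an optimizer setting) would mean adding one argument to `fit_one` and one column to the `expand.grid` call, which is the flexibility described above.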

Still, it doesn't get at the core of your question. My favorite uses of the above pattern have the function return a tidy data frame immediately, not an object that can be turned into multiple. But I think it's very important to keep types of observational units in separate tables. Similarly, tidy outputs don't typically have an arbitrary number of columns; when columns are optional it's usually something like "If Cook's distance can be computed on this, a .cooksd column is added." (If this is the only application you're interested in, you could always write something like a tidy_glance function that splits it out this way!)

This is definitely a conversation worth starting.

cboettig avatar cboettig commented on August 20, 2024

Thanks for this, I really like the general strategy you outline and have been thinking about this a lot in the past few days. I agree that it's most elegant if the function returns the tidy data frame immediately, but the general approach of defining abstract function, expand.grid, group_by+do is quite powerful; (even if piping with a data.frame of names rather than the actual data seems a bit unusual for a dplyr workflow; maybe that's just me.)

One thing that bugs me about this particular example is that it suggests I convert my single, grouped data.frame of all datasets into a list of data.frames (probably with some kludgy lapply thing), or otherwise replace datasets[[dataset_name]] with an SE version of filter_().
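
For what it's worth, base R's `split()` is one way to do that conversion without an explicit lapply. A small sketch with made-up data (the column names `dataset`, `x` and `y` are assumptions):

```r
# Made-up data: one data frame holding several datasets.
all_data <- data.frame(
  dataset = rep(c("dataset1", "dataset2", "dataset3"), each = 3),
  x = rep(1:3, times = 3),
  y = c(2, 4, 6, 1, 1, 2, 9, 7, 5)
)

# split() returns a named list with one data frame per dataset,
# which can then be indexed as datasets[[dataset_name]].
datasets <- split(all_data, all_data$dataset)
print(names(datasets))
print(datasets[["dataset2"]])
```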

Would you have written this any differently if calling the function twice was not a constraint? (As you mentioned before, it feels like that would simplify things, allowing you to create a tidy data frame immediately; but I'm not actually sure that lifting this constraint would give a cleaner syntax.)

cboettig avatar cboettig commented on August 20, 2024

@dgrtwo Just wanted to say that your expand.grid + do approach has profoundly cleaned up the way I run analyses; at least as big an improvement as tidyr/dplyr thinking. Like anything, there's always edge cases, but I'm continually surprised how far it goes. Not sure where you would send it, but I definitely think you should write this pattern up somewhere. Many thanks!

ashander avatar ashander commented on August 20, 2024

Sorry to comment on this closed issue, but the discussion here is great. Thanks David for laying out these ideas clearly here. I've been struggling with similar issues in trying to find a useful abstraction to manage many simulation runs on parameter grids.

I just wrote a small package that abstracts out my workflow with dplyr. The idea is, for a function fun <- function(a1, <...>, aN, fixed_params) and a data frame with N columns, pass the columns as arguments like

require(dplyr)
require(lazyeval)
run <- function(data, fun, fixed_parameters, ...) {
  ## ....
  ## argument checking
  ##

  fixed_parameters <- as.environment(fixed_parameters)
  data %>%
    rowwise %>%
    do_(interp( ~ do.call(fun, c(., fixed_parameters, ...)))) %>%
    as.data.frame()
}

More details here: https://github.com/ashander/runr

I'm curious what you think of this, and if either of you has suggestions for how to make the do_ call less clunky. In particular, is do.call necessary, or is there some way to get around this in lazyeval?

ashander avatar ashander commented on August 20, 2024

Thanks @cboettig for sharing those links. I realized my interp is superfluous, but haven't dropped do.call yet:

require(dplyr)
run <- function(data, fun, fixed_parameters, ...) {
   ## ....
   ## argument checking
   ##

   fixed_parameters <- as.environment(fixed_parameters)
   grouped_out <- do_(rowwise(data), ~ do.call(fun, c(., fixed_parameters, ...)))
   ungroup(grouped_out)
 } 

I did post on SO, but forgot to link it here. No bites yet http://stackoverflow.com/questions/36345558/using-standard-evaluation-and-do-to-run-simulations-on-a-grid-of-parameters-wit

ashander avatar ashander commented on August 20, 2024

Thanks to a kind commenter, I discovered purrr::invoke_rows, which works nicely! http://stackoverflow.com/a/37336356/4598520

github-actions avatar github-actions commented on August 20, 2024

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.
