wwiecek / baggr

R package for Bayesian meta-analysis models, using Stan

License: GNU General Public License v3.0

R 89.05% Stan 6.76% C++ 0.02% TeX 4.09% Shell 0.08%
meta-analysis stan bayesian-statistics treatment-effects quantile-regression

baggr's Introduction

baggr: Bayesian aggregation package for R, v0.7.8 (2024)

This is baggr, an R package for Bayesian meta-analysis using Stan. Baggr is intended to be user-friendly and transparent so that it’s easier to understand the models you are building and criticise them.

Baggr provides a suite of models that work with both summary data and full data sets, to synthesise evidence collected from different groups, contexts or time periods. The baggr() command automatically detects the data type and, by default, fits a partial pooling model (which you may know as random effects models) with weakly informative priors by calling Stan to carry out Bayesian inference. Modelling of variances or quantiles, standardisation and transformation of data is also possible.

The current version is a stable version of a tool that’s in active development so we are counting on your feedback.

Installation

Before starting, please follow the installation instructions for RStan, which is responsible for Bayesian inference in baggr. If you don’t have Stan, it’s worth following the instructions step-by-step.

The package itself is available on CRAN:

install.packages("baggr")

You can also install the most up-to-date version of baggr directly from GitHub; this is what we recommend, but to do that you will need the remotes package:

#installation this way may take 5-15 minutes
remotes::install_github("wwiecek/baggr", 
                        ref = "devel", #if problems try changing to ref = "master"
                        build_vignettes = TRUE, quiet = TRUE,
                        build_opts = c("--no-resave-data", "--no-manual"))

The most common installation problem is outdated dependencies. Try updating your packages (and make sure R is at least version 4) before running the remotes command.

Basic use case

baggr is designed to work well with both individual-level (“full”) and aggregate/summary (“group”) data on treatment effect. In basic cases, only the summary information on treatment effects (such as means and their standard errors) is needed. Data are always specified in a single input data frame and the same baggr() function is used for different models.

For the “standard” cases of modelling means, the appropriate model is detected from the shape of data.

library(baggr)
df_pooled <- data.frame("tau" = c(28,8,-3,7,-1,1,18,12),
                        "se"  = c(15,10,16,11,9,11,10,18))
bg <- baggr(df_pooled, pooling = "partial")

You can specify the model type from several choices, the pooling type ("none", "partial" or "full"), and certain aspects of the priors, as well as other options for data preparation, prediction and more. You can access the underlying stanfit object through bg$fit.
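
To make the pooling types more concrete, here is a minimal base-R sketch of the two extremes for the data above. It is not how baggr fits its models: it ignores priors entirely, so the numbers only approximate what baggr() would report with pooling = "none" or pooling = "full".

```r
# Summary data from the example above (effects and their standard errors)
tau_hat <- c(28, 8, -3, 7, -1, 1, 18, 12)
se      <- c(15, 10, 16, 11, 9, 11, 10, 18)

# No pooling: each group keeps its own estimate and standard error
no_pooling <- data.frame(mean = tau_hat, sd = se)

# Full pooling: a single common effect, estimated by inverse-variance weighting
w         <- 1 / se^2
full_mean <- sum(w * tau_hat) / sum(w)
full_se   <- sqrt(1 / sum(w))

round(c(mean = full_mean, se = full_se), 2)
```

Partial pooling sits between these two: group estimates are pulled toward the common mean by an amount that depends on the estimated between-group variation.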

Flexible plotting methods are included, together with an automatic comparison of multiple models (e.g. comparing no, partial and full pooling) through baggr_compare() command. Various statistics can be calculated: in particular, pooling() for pooling metrics and loocv() for leave-one-group-out cross-validation, allowing us to then compare and select models via loo_compare(). Forest plots and plots of treatment effects are available.
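
As a rough illustration of what a pooling metric measures, the sketch below computes the pooling factor of Gelman and Pardoe (2006), se_k^2 / (se_k^2 + sigma_tau^2), for one assumed value of the hyper-SD. The values of sigma_tau and mu are assumptions chosen for illustration; pooling() works on the fitted model rather than on fixed values like these.

```r
tau_hat <- c(28, 8, -3, 7, -1, 1, 18, 12)
se      <- c(15, 10, 16, 11, 9, 11, 10, 18)

sigma_tau <- 5   # assumed hyper-SD (not estimated here)
mu        <- 8   # assumed hypermean (not estimated here)

# Pooling factor per group: 0 means no pooling, 1 means full pooling
pooling_k <- se^2 / (se^2 + sigma_tau^2)

# Implied group estimates, conditional on the assumed mu and sigma_tau:
# noisier groups (larger se) are shrunk more strongly toward mu
shrunk <- (1 - pooling_k) * tau_hat + pooling_k * mu

round(data.frame(tau_hat, se, pooling = pooling_k, shrunk), 2)
```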

Try vignette('baggr') for an overview of these functions and an example of meta-analysis workflow with baggr. If working with binary data, try vignette("baggr_binary"). Compiled vignettes are available on CRAN.

Current and future releases

Included in baggr v0.7 (2023):

  • Meta-analysis of continuous and binary outcomes
  • Both full and aggregate data sets can be used
  • Summaries and plots specific to meta-analysis, typical diagnostic plots
  • Meta-regression / fixed effects modelling
  • Compatibility with rstan and bayesplot features
  • Automatic choice of priors or “plain-text” specification of priors
  • Calculation of pooling/heterogeneity metrics
  • Cross-validation (leave-one-group-out)
  • Prior and posterior predictive distributions

Check [NEWS.md] for more information on recent changes to the package.

baggr's People

Contributors

andrjohns, be-green, dannytoomey, rmeager, wwiecek

baggr's Issues

re-enable prepare_ma()

In cases where no covariates are present, we will convert individual-level data to summaries.
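
A minimal base-R sketch of that conversion, assuming columns named group, treatment (0/1) and outcome, and using a simple difference in means with an unpooled standard error. The helper summarise_group() and the simulated data are hypothetical; prepare_ma() may compute its summaries differently.

```r
set.seed(1)

# Simulated individual-level data: 2 groups, 20 control and 20 treated units each
ipd <- data.frame(
  group     = rep(c("A", "B"), each = 40),
  treatment = rep(rep(0:1, each = 20), times = 2),
  outcome   = rnorm(80)
)
ipd$outcome <- ipd$outcome + 2 * ipd$treatment  # illustrative treatment effect

# Hypothetical helper: one row of (tau, se) per group
summarise_group <- function(d) {
  y1 <- d$outcome[d$treatment == 1]
  y0 <- d$outcome[d$treatment == 0]
  data.frame(
    group = d$group[1],
    tau   = mean(y1) - mean(y0),
    se    = sqrt(var(y1) / length(y1) + var(y0) / length(y0))
  )
}

summ <- do.call(rbind, lapply(split(ipd, ipd$group), summarise_group))
summ
```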

8 schools vignette

Should feature all of baggr functionality, probably the best default type of presentation aimed at new users, many of whom will know 8 schools already. It should be a gentle dive into the package but also show that it's fully featured & flexible.

  • Visually compare different pooling types
  • View pooling metrics
  • View LOO CV summary
  • Show posterior dist. of variance parameter; change its prior to Inv-Gamma? (same as in Gelman 2008)
  • ??? change 2nd level distribution?

New way to specify priors

Once #5 is complete, we will introduce a new naming convention for priors in the model, probably with rstanarm style function-specifications for priors, e.g. normal(), cauchy(). @rmeager will suggest list of priors that we want to allow for various parameters (in "rubin", "mutau") at this stage.

Document rstan::sampling() arguments

It may be necessary for people running meta-analysis models with little data to change the adaptation parameters or the number of iterations. Right now all of those are hidden within the ... arguments to baggr(). It might be good to have this documented for users who are unfamiliar with rstan, or who just forget these kinds of things. I'm happy to take this on as a pull request.
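
For example, something along these lines could go into the docs. This is a sketch, assuming that baggr() forwards these arguments unchanged to rstan::sampling(); iter, chains and control = list(adapt_delta = ...) are standard rstan::sampling() arguments, and the particular values are only illustrative.

```r
# Sampler settings to be forwarded to rstan::sampling() through `...`
sampler_args <- list(
  iter    = 4000,                       # iterations per chain, including warmup
  chains  = 4,
  control = list(adapt_delta = 0.99)    # more cautious adaptation, fewer divergences
)

df_pooled <- data.frame(tau = c(28, 8, -3, 7, -1, 1, 18, 12),
                        se  = c(15, 10, 16, 11, 9, 11, 10, 18))

# Only run the model if baggr (and therefore Stan) is installed
if (requireNamespace("baggr", quietly = TRUE)) {
  # Same as baggr(df_pooled, pooling = "partial", iter = 4000, chains = 4, control = ...)
  bg <- do.call(baggr::baggr,
                c(list(data = df_pooled, pooling = "partial"), sampler_args))
}
```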

clearly label 2nd level parameters

We need to choose and consistently apply rules about what is denoted by y's, mu's, taus, alphas, betas, sigmas and thetas. Yes, it's that many parameters!

Allow users to specify different column names for effects

One of the things that confused me when I first read through the documentation for this package was that the treatment effect was named tau and the variance was sigma_tau. On the rstan getting started page, for example, where they list the 8 schools example, what we call tau_k they call y_i. What we call sigma_tau they call tau! Anyway I just thought it might be good to allow people to list the column name that contains all the different effects instead of making them change the names in their data.frame or data.table, because if they have their own notations/conventions their code might be hard to read. Just a thought!
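
Until something like that exists, a workaround is to rename on a copy just before the call. A small base-R sketch, where my_data and its column names (school, y, sigma) are hypothetical and follow the rstan notation mentioned above:

```r
# Hypothetical user data in the rstan "getting started" notation
my_data <- data.frame(
  school = LETTERS[1:8],
  y      = c(28, 8, -3, 7, -1, 1, 18, 12),
  sigma  = c(15, 10, 16, 11, 9, 11, 10, 18)
)

# Map the user's column names (left) to the names expected for summary data (right)
col_map <- c(school = "group", y = "tau", sigma = "se")

# Rename on a copy, so the original data keeps the user's own notation
baggr_data <- my_data
names(baggr_data) <- unname(col_map[names(my_data)])

names(baggr_data)
```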

Possible compilation issues in Windows

I really don't know what's going on with all these models, but both MuTau and Logit now crash my R session. I'll post updates as I try to debug the behavior...

Warn in low-dimensional cases

If N = 2 we should consider stopping people from meta-analysing
If N = 3 we should suggest a prior for scale (half-Cauchy)

That means we also need to implement half-Cauchy
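
A possible shape for the check, as a sketch only. The helper name check_n_groups() is hypothetical, and whether N = 2 should be a warning or a hard stop is still open; this version warns.

```r
# Hypothetical helper: inspect the number of groups in summary-level data
check_n_groups <- function(data) {
  n <- nrow(data)
  if (n < 2) {
    stop("Meta-analysis needs at least 2 groups.")
  } else if (n == 2) {
    warning("Only 2 groups: hierarchical estimates will depend almost entirely on the prior.")
  } else if (n == 3) {
    message("Only 3 groups: consider an informative prior on the scale, e.g. half-Cauchy.")
  }
  invisible(n)
}

check_n_groups(data.frame(tau = c(1, 2, 3), se = c(1, 1, 1)))
```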

Warn users if IPD does not have 2 levels of treatment

schools_ipd <- data.frame()
N <- 10
for(i in 1:8)
  schools_ipd <- rbind(schools_ipd, 
                       data.frame(group = schools$group[i],
                                  outcome = rnorm(N, schools$tau[i], schools$se[i]*sqrt(N)),
                                  treatment = 1)) #this is wrong, i know
baggr(schools_ipd)
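
A sketch of what the check could look like. The helper check_treatment_levels() is hypothetical, and the column names group and treatment are assumed to match the example above.

```r
# Hypothetical helper: warn about groups that lack exactly 2 treatment levels
check_treatment_levels <- function(data, treatment_col = "treatment", group_col = "group") {
  n_levels <- tapply(data[[treatment_col]], data[[group_col]],
                     function(x) length(unique(x)))
  bad <- names(n_levels)[n_levels != 2]
  if (length(bad) > 0) {
    warning("Groups without exactly 2 treatment levels: ",
            paste(bad, collapse = ", "))
  }
  invisible(bad)
}

# Data with treated units only, mimicking the faulty example above
toy_ipd <- data.frame(group = rep(c("A", "B"), each = 5),
                      outcome = rnorm(10),
                      treatment = 1)
bad_groups <- suppressWarnings(check_treatment_levels(toy_ipd))
bad_groups
```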

Add predict. methods and pp_check()

The idea is to do a posterior predictive check against the original input data. That means

  • predict() function that returns new draws for original data
  • pp_check() function that visualises this (using bayesplot functionality)

The first pass at it (thanks to Brice) is in devel-predict. It works nicely with Rubin model, but

  • doesn't work with mu & tau model
  • lacks some documentation
  • does not have tests
  • does not pass checks (S3)
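
For reference, the core of the posterior predictive step for the Rubin model is small. The sketch below is not the devel-predict code: theta_draws is a simulated stand-in for posterior draws of the group effects (which would normally come from the stanfit object), and the replicated data are drawn as tau_hat_rep ~ normal(theta_k, se_k).

```r
set.seed(42)

se      <- c(15, 10, 16, 11, 9, 11, 10, 18)
K       <- length(se)
n_draws <- 1000

# Stand-in for posterior draws of group effects (n_draws x K); simulated, not fitted
theta_draws <- matrix(rnorm(n_draws * K, mean = 8, sd = 5), nrow = n_draws, ncol = K)

# One replicated data set per posterior draw: add sampling noise with each group's se
noise <- matrix(rnorm(n_draws * K), nrow = n_draws, ncol = K)
y_rep <- theta_draws + sweep(noise, 2, se, `*`)

dim(y_rep)
```

A matrix shaped like y_rep (draws in rows, observations in columns) is the form that bayesplot functions such as ppc_dens_overlay(y, yrep) expect.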

mutau crashes my R session

No idea why this is happening but it's really fun... Will update/edit when I get more info...

I was trying to work on the predict pull request and implement the predict function for mutau but it's crashing locally for me. Maybe a memory issue? I remember some weirdness with my compiler optimization, but I don't think that would cause this. Anyway I'll report back.

specifying baseline prob prior in logit model

Currently logit.stan:62 has

  baseline ~ normal(0, 10);

  //hypermean priors:
  if(pooling_type > 0)
    target += prior_increment_real(prior_hypermean_fam, mu[1], prior_hypermean_val);
  else{
    for(k in 1:K)
      target += prior_increment_real(prior_hypermean_fam, eta[k], prior_hypermean_val);
  }

  //hyper-SD priors:
  if(pooling_type == 1)
    target += prior_increment_real(prior_hypersd_fam, tau[1], prior_hypersd_val);

  //fixed effect coefficients
  target += prior_increment_vec(prior_beta_fam, beta, prior_beta_val);

We should have one more category of flexible priors, i.e. baseline ~.
We could also consider a hierarchical baseline by analogy with "mu & tau" type of models

mutau model crashes R session

I was trying to figure out how to work with the various models and found that I kept crashing my R session. As a test I ran prepare_ma with the simplified microcredit dataset, and when I plugged that into the mutau model it began sampling and then crashed. Here's my code:

mt_data <- prepare_ma(microcredit_simplified, outcome = "consumerdurables")
baggr(mt_data)

Sometimes it crashes right away and sometimes later. This seems like an issue on the Stan side since it's crashing during sampling.

Not really sure whether it's something on my end or not, but if it is then it would be nice to have an error instead of crashing the session.

Session Info:

R version 3.6.0 (2019-04-26)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 17134)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] stats graphics grDevices utils
[5] datasets methods base

other attached packages:
[1] baggr_0.0.0.9000 Rcpp_1.0.1

loaded via a namespace (and not attached):
[1] pillar_1.4.1 compiler_3.6.0
[3] plyr_1.8.4 prettyunits_1.0.2
[5] tools_3.6.0 packrat_0.5.0
[7] bayesplot_1.7.0 pkgbuild_1.0.3
[9] tibble_2.1.3 gtable_0.3.0
[11] lattice_0.20-38 pkgconfig_2.0.2
[13] rlang_0.3.4 Matrix_1.2-17
[15] cli_1.1.0 rstudioapi_0.10
[17] parallel_3.6.0 SparseM_1.77
[19] loo_2.1.0 gridExtra_2.3
[21] dplyr_0.8.1 stringr_1.4.0
[23] MatrixModels_0.4-1 stats4_3.6.0
[25] grid_3.6.0 tidyselect_0.2.5
[27] glue_1.3.1.9000 inline_0.3.15
[29] R6_2.4.0 processx_3.3.1
[31] rstan_2.18.2 callr_3.2.0
[33] ggplot2_3.1.1 purrr_0.3.2
[35] reshape2_1.4.3 magrittr_1.5
[37] codetools_0.2-16 matrixStats_0.54.0
[39] StanHeaders_2.18.1 ps_1.3.0
[41] scales_1.0.0 ggridges_0.5.1
[43] fortunes_1.5-4 rstantools_1.5.1
[45] assertthat_0.2.1 colorspace_1.4-1
[47] quantreg_5.40 stringi_1.4.3
[49] lazyeval_0.2.2 munsell_0.5.0
[51] crayon_1.3.4

Different pooling in full model and Rubin model

I thought that full would have similar pooling to rubin, but now I realise that estimating sigma in each group means the two models are different. Is there a way to reconcile this?

# Generate IPD dataset based on 8 schools -----
# Means will differ of course because we plug in original SEs
schools_ipd <- data.frame()
N <- c(rep(10, 4), rep(20, 4))
for(i in 1:8){
  x <- rnorm(N[i])
  x <- (x-mean(x))/sd(x)
  x <- x*schools$se[i]*sqrt(N[i])/1.41 + schools$tau[i]
  
  y <- rnorm(N[i])
  y <- (y-mean(y))/sd(y)
  y <- y*schools$se[i]*sqrt(N[i])/1.41
  
  schools_ipd <- rbind(schools_ipd,
                       data.frame(group = schools$group[i], outcome = x, treatment = 1),
                       # This is just so that we don't trip off prepare_ma:
                       data.frame(group = schools$group[i], outcome = y, treatment = 0))
}

library(dplyr) # for %>%, group_by() and summarise()
summ <- prepare_ma(schools_ipd) #similar SEs, different means, obvs
summ
schools_ipd %>% group_by(treatment, group) %>% summarise(sd(outcome))
summ$se <- summ$se.tau; summ$mu <- NULL; summ$se.mu <- NULL; summ$se.tau <- NULL
# group_by(schools_ipd, group, treatment) %>% summarise(sd(outcome))

w1 <- baggr(schools_ipd)
w2 <- baggr(summ)

Vignette for model convergence checks + some automated (?) tests

Currently we have one vignette for continuous data and one for binary data, but neither shows how users should check convergence. I think there should be a short section about:

  1. What could go wrong
  2. When users should check for problems
  3. What plots to produce, what could a "basic" user without understanding of Bayesian inference do
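
For point 3, even a very small check on R-hat would help a "basic" user. A sketch, assuming the R-hat values are taken from the summary of the underlying stanfit object; the helper flag_nonconverged() and the 1.01 threshold are assumptions, not package defaults.

```r
# Hypothetical helper: names of parameters whose R-hat exceeds a threshold
flag_nonconverged <- function(rhat, threshold = 1.01) {
  names(rhat)[!is.na(rhat) & rhat > threshold]
}

# With a fitted baggr object `bg`, R-hat values could be extracted with:
#   rhat <- summary(bg$fit)$summary[, "Rhat"]
# Here a made-up vector stands in for that output.
rhat <- c(mu = 1.002, tau = 1.08, "eta[1]" = 1.000, "eta[2]" = NA)

flagged <- flag_nonconverged(rhat)
flagged
```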

Update priors documentation

Three docs contain info about prior:

  1. ?baggr
  2. ?normal etc. (a single file with prior definitions)
  3. baggr.Rmd with the following text:

As discussed above, it is possible for the user to override the default priors and specify her own priors. The first parameter you can change is the upper bound of the uniform prior on $\sigma_{\tau}$, called "prior_upper_sigma_tau", the second is the mean of the prior on $\tau$ itself called "prior_tau_mean", the third is the scale (standard deviation) of the prior on $\tau$ itself called "prior_tau_scale". Here I have changed all three as an example (at this stage in baggr if you want to alter one you must provide the full vector):

baggr_schools <- baggr(schools, "rubin", prior = c("prior_upper_sigma_tau" = 10000, 
                                                   "prior_tau_mean" = -10, 
                                                   "prior_tau_scale" = 100))
baggr_schools

rewrite old Stan code

Problem #1 is to check if this is a problem: Left-hand side of sampling statement (~) may contain a non-linear transform of a parameter or local variable.

Problem #2 is to modify individual-level models that create sufficient statistics... are they faster anyway?

Do not use Uniform as default prior for Rubin model

Currently if I do

baggr(schools, ppd=T)

schools is used to draw a default prior and the Stan model is run on priors only (data ignored).
I can also do this:

baggr(schools, ppd=T, prior_hypersd = normal(0,10), prior_hypermean = normal(0,5))

In this case schools only suggests that model="rubin", as all priors are specified manually. So it makes sense to have this syntax:

baggr(ppd=T, model = "rubin",   #data=NULL
           prior_hypersd = uniform(0,100), prior_hypermean = normal(0,5))

Alas, this is not something I thought of before.

Prior and posterior predictive functions

This is a placeholder for the ppc checks / ppd functionality in baggr, to be added for the "rubin" and "mutau" models in Oct.

WW will review some ideas from #19 as the first step, then add a detailed plan here.

Cross-validation for full and logit models

We need to check and harmonise three cases

  1. CV with the same groups as already included
  2. CV for new groups
  3. Mixture of 1. and 2.

All of this will need tests

#using baggr_binary
ww <- baggr(df_ind[1:100,], iter = 500, test_data = df_ind[101:200,])

baggr_compare: hyper=T doesn't do anything

Currently compare = c("groups", "effects"), where effects is a "hypothetical study" (effect_draw).

I'd like to add:

  • hypermeans (Basically show what is currently returned as first table)
  • hypersds (second table)
  • groups + hypermeans (what you would get with baggr_plot(model, hyper=TRUE), but now for many models)

PS: And these tables need to be outputted differently, as per the other issue #29

More priors in the "mutau" model

Currently you can only specify i.i.d. priors for SDs and a joint multivariate normal prior for the hypermean. I have a new version of the model that is more universal, but the R code to specify priors would need rewriting.

quantile modelling vignette

Show how to run quantiles on microcredit data + plotting and output options

A chance to check 1/ pooling, 2/ cross-validation in case of quantiles?

Standardise output of `baggr_compare`

Currently it's a mix of print within the function (bad!), ggplot output and numerical output.
We definitely need to have a print.baggr_compare S3, perhaps also plotting method. The output should be minimal.

Rubin Model Speed

The Rubin Model is quite slow compared to a stripped down version.

At first I thought there might be overhead (depending on the implementation) from the if-else statements that are embedded in the current implementation of the model. At the same time, the Stan manual says that if-else is lazily evaluated so it shouldn't be a huge performance overhead (at least not relative to the time it takes to sample).

Just to compare, there's the current model:


data {
  int<lower=0> K; // number of sites
  real tau_hat_k[K]; // estimated treatment effects
  real<lower=0> se_tau_k[K]; // s.e. of effect estimates
  int pooling_type; //0 if none, 1 if partial, 2 if full
  real prior_upper_sigma_tau;
  real prior_tau_mean;
  real prior_tau_scale;
}
transformed data {
  int K_pooled; // number of modelled sites if we take into account pooling
  if(pooling_type == 2)
    K_pooled = 0;
  if(pooling_type != 2)
    K_pooled = K;
}
parameters {
  real tau;
  real<lower=0> sigma_tau;
  vector[K] eta;
  real tau_k[K_pooled];
}
transformed parameters {
  vector[K] theta = tau + sigma_tau * eta;        // school treatment effects
}
model {

  if(pooling_type == 0){
      // we don't care about the other parameter, let it wander!
      // but in that case we need to give a sensible prior for individual tau's and their SE's?
      tau ~ normal(0, 1); //wander tau but avoid divergent transitions
      tau_k ~ normal(prior_tau_mean, prior_tau_scale); //tau_k 'take on' tau's prior
      tau_hat_k ~ normal(tau_k, se_tau_k);
  }
  
  if(pooling_type == 1){
      tau ~ normal(0, 5); // prior on mean
      se_tau_k ~ cauchy(0,5); // prior on se
      tau_hat_k ~ normal(theta, se_tau_k);
  }
  
  if(pooling_type == 2){
      tau ~ normal(prior_tau_mean, prior_tau_scale);
      tau_hat_k ~ normal(tau, se_tau_k);
    }
  
}

After some testing (which I will leave aside), I think the issue is the inclusion of a parameter that the model does not actually use. In your model, you include tau_k as a parameter no matter which pooling type you are evaluating. For whatever reason it _really_ slows things down. Compare the performance of these two models, which differ only in the unused tau_k parameter:

data {
  int<lower=0> K; // number of sites
  real tau_hat_k[K]; // estimated treatment effects
  real<lower=0> se_tau_k[K]; // s.e. of effect estimates
}
parameters {
  real tau;
  real<lower=0> sigma_tau;
  vector[K] eta;
  
}
transformed parameters {
  vector[K] theta = tau + sigma_tau * eta;        // school treatment effects
}
model {
  tau ~ normal(0, 5); // prior on mean
  eta ~ normal(0,1); // implies theta ~ normal(tau,sigma_tau)
  se_tau_k ~ cauchy(0,5); // prior on se
  tau_hat_k ~ normal(theta, se_tau_k);
}

data {
  int<lower=0> K; // number of sites
  real tau_hat_k[K]; // estimated treatment effects
  real<lower=0> se_tau_k[K]; // s.e. of effect estimates
}
parameters {
  real tau;
  real<lower=0> sigma_tau;
  vector[K] eta;
  real tau_k[K];

}
transformed parameters {
  vector[K] theta = tau + sigma_tau * eta;        // school treatment effects
}
model {
  tau ~ normal(0, 5); // prior on mean
  eta ~ normal(0,1); // implies theta ~ normal(tau,sigma_tau)
  se_tau_k ~ cauchy(0,5); // prior on se
  tau_hat_k ~ normal(theta, se_tau_k);
}

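library(rstan) # provides sampling()

test_data <- list(
  K = 8,
  tau_hat_k = schools$tau, # same as y
  se_tau_k = schools$se # same as sigma
)

# partpooled_stripped_down and partpooled_extra_param are the two models above,
# compiled beforehand (e.g. with rstan::stan_model())
times <- microbenchmark::microbenchmark(
  stripped_down <- sampling(partpooled_stripped_down, data = test_data),
  extra_param <- sampling(partpooled_extra_param, data = test_data),
  times = 1)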

This is the output:

> print(times)
Unit: seconds
                                                             expr
    stripped_down <- sampling(partpooled_stripped_down , data = test_data)
 extra_param <- sampling(partpooled_extra_param, data = test_data)
      min       lq     mean   median       uq      max neval
  1.51534  1.51534  1.51534  1.51534  1.51534  1.51534     1
 34.19129 34.19129 34.19129 34.19129 34.19129 34.19129     1
