mcp's People

Contributors

lindeloev

mcp's Issues

Fixed values

Add the functionality to fix some parameter values rather than having them inferred. Since this is just a 100% prior of a certain value, this should be done in prior. They have to be numerical values.

prior = list(
  int_1 = "dnorm(0, 1)",
  cp_1 = 10,
  int_2 = 20,
  cp_2 = "dnorm(40, 10) T(cp_1, )"
)

mcp(segments, data, prior)

I think it should still be included in summaries, just so that they always express the full model.
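A minimal sketch of how fixed values could be told apart from distributional priors, assuming numbers mean "fixed" and strings mean "estimated". `split_prior()` is a hypothetical helper for illustration, not an mcp function.

```r
# Hypothetical helper (not part of mcp): separate fixed numeric values
# from priors given as JAGS distribution strings.
split_prior <- function(prior) {
  is_fixed <- vapply(prior, is.numeric, logical(1))
  is_dist <- vapply(prior, is.character, logical(1))
  if (!all(is_fixed | is_dist)) {
    stop("Each prior must be a number (fixed) or a string (distribution).")
  }
  list(
    fixed = unlist(prior[is_fixed]),    # named numeric vector
    estimated = unlist(prior[is_dist])  # named character vector
  )
}

prior <- list(
  int_1 = "dnorm(0, 1)",
  cp_1 = 10,
  int_2 = 20,
  cp_2 = "dnorm(40, 10) T(cp_1, )"
)
parts <- split_prior(prior)
names(parts$fixed)      # parameters held constant
names(parts$estimated)  # parameters to be inferred
```

The fixed part could then be passed to JAGS as data or written into the model as constants, while still being listed in summaries.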

Implement MA(N), AR(N), and ARMA(N)

This should be possible:

  • mcp(..., variance = ~ ma(1)). Parameter will be named ma1.
  • mcp(..., variance = ~ ar(3)). Parameters will be named ar1, ar2, ar3
  • mcp(..., variance = ~ arma(2)); same as mcp(..., variance = ~ ma(2) + ar(2)). Parameters will be named ma1, ma2, ar1, ar2.

As with other effects in #56, it would apply to the whole model.
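To make the naming and the AR part concrete, here is a rough base-R sketch. Both `ar_par_names()` and `simulate_ar()` are hypothetical helpers, and the recursion is the textbook AR(N) definition rather than anything taken from mcp's code.

```r
# Hypothetical helpers, for illustration only.

# Parameter names for an AR(N) term: ar1, ar2, ...
ar_par_names <- function(order) paste0("ar", seq_len(order))

# Simulate residuals from an AR(N) process:
# resid[i] = sum_k ar[k] * resid[i - k] + eps[i]
simulate_ar <- function(n, ar, sigma = 1) {
  eps <- rnorm(n, 0, sigma)
  resid <- numeric(n)
  for (i in seq_len(n)) {
    for (k in seq_along(ar)) {
      if (i - k >= 1) resid[i] <- resid[i] + ar[k] * resid[i - k]
    }
    resid[i] <- resid[i] + eps[i]
  }
  resid
}

ar_par_names(3)
set.seed(42)
resid <- simulate_ar(100, ar = c(0.5, 0.2))
```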

Hypothesis tests

mcp already supports loo. Maybe we should also support more classical hypothesis tests.

  • Do something akin to brms::hypothesis. Should be straightforward using tidybayes::tidy_draws() and eval(parse(text = hypothesis))?
    • Requires sampling of the prior for Savage-Dickey. Replicate the prior-writer and suffix it with something like _nolik, e.g., cp_1_nolik and cp_2_id_nolik[5]?
  • Write an article/vignette (#46) with how-to and warnings about interpretability.
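The eval(parse(text = hypothesis)) idea could look roughly like this. It is a sketch under assumptions: `draws` stands in for a data frame of posterior draws (simulated here purely for illustration), `test_hypothesis()` is a hypothetical name, and only directional hypotheses are covered. Savage-Dickey tests would additionally need prior samples.

```r
# Stand-in for posterior draws: one column per parameter.
# These values are simulated for illustration, not real fit output.
set.seed(1)
draws <- data.frame(
  cp_1 = rnorm(4000, 25, 1),
  int_1 = rnorm(4000, 20, 5),
  int_2 = rnorm(4000, -20, 5)
)

# Hypothetical helper: evaluate a hypothesis string against the draws
test_hypothesis <- function(draws, hypothesis) {
  # Columns of `draws` are visible as variables inside the expression
  is_true <- eval(parse(text = hypothesis), envir = draws)
  p <- mean(is_true)  # proportion of draws where the hypothesis holds
  data.frame(hypothesis = hypothesis, p = p, odds = p / (1 - p))
}

result <- test_hypothesis(draws, "cp_1 > 24")
result
```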

External validation of fits in unit test

Test accuracy of fits:

  • Add strong priors so that parameter estimates are (should be) altered considerably.
  • Validate parameter recovery in one-segment models against rstanarm, and loo and waic too.
  • Validate parameter recovery in multi-segment models by letting rstanarm or lme4 fit individual segments and compare fits. If reasonably vague priors and reasonably well-defined change points are used, they should be practically identical.
  • Plots: Check out the vdiffr package which ggplot2 uses.

plot() method for type = "overlay" not working

I suspect this has to do with tidybayes version.

Reprex below.

library(mcp)

my_data <- data.frame(
  x = 1:50,
  y = c(
    rep(30, 25) * abs(rnorm(25)),
    rep(30, 25) * -abs(rnorm(25))
  )
)

# Define segments
segments <- list(
  y ~ 1 + x, # Intercept
  1 ~ 1 # Intercept
)

# Start sampling
fit <- mcp(segments, my_data, cores = 1)
#> Compiling data graph
#>    Resolving undeclared variables
#>    Allocating nodes
#>    Initializing
#>    Reading data back into data table
#> Compiling model graph
#>    Resolving undeclared variables
#>    Allocating nodes
#> Graph information:
#>    Observed stochastic nodes: 50
#>    Unobserved stochastic nodes: 5
#>    Total graph size: 631
#> 
#> Initializing model
#> 
#>    user  system elapsed 
#>     2.2     0.5     2.7

summary(fit)
#> Warning: unnest() has a new interface. See ?unnest for details.
#> Try `df %>% unnest(c(.lower, .upper))`, with `mutate()` if needed
#> # A tibble: 5 x 4
#>   name     mean .lower  .upper
#>   <chr>   <dbl>  <dbl>   <dbl>
#> 1 cp_1   25.5    24.0   27.0  
#> 2 int_1  21.4     7.54  35.1  
#> 3 int_2 -24.4   -31.2  -17.7  
#> 4 sigma  16.9    13.3   20.5  
#> 5 x_1    -0.354  -1.29   0.652

# Plot fit
plot(fit)
#> Error in spread_draws_long_(tidy_draws(model), variable_names, dimension_names, : No variables found matching spec: draws

plot(fit, "combo")

plot(fit, "overlay")
#> Error in spread_draws_long_(tidy_draws(model), variable_names, dimension_names, : No variables found matching spec: draws
packageVersion("tidybayes")
#> [1] '1.1.0'

Support relative parameterization and use absolute by default

Specify parameters as relative or absolute. E.g. relative parameters represent changes from the former segment while absolute is more like "classical" Piecewise Linear Regression. This will be useful in some cases where changes are more meaningful. I think that absolute parameters should be the default. Changing to relative parameterization could be done like this:

segments = list(
    y ~ 1 + x, 
    rel(1) ~ rel(1) + rel(x))

All parameters in the first segment must be absolute. Fail if rel is used there. But a change in relative intercept following a slope-only segment is meaningful enough and should work.

  • rel on RHS
  • rel on LHS
  • Fail on rel in segment 1
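A small sketch of what the relative parameterization means numerically, restricted to one parameter type (e.g., slopes) across segments. `rel_to_abs()` is a hypothetical helper, not mcp code.

```r
# Hypothetical helper: convert per-segment values to absolute values.
# `values` has one entry per segment; `is_rel[i]` is TRUE if values[i]
# is a change relative to segment i - 1.
rel_to_abs <- function(values, is_rel) {
  if (is_rel[1]) stop("rel() is not allowed in segment 1.")
  abs_values <- values
  for (i in seq_along(values)[-1]) {
    if (is_rel[i]) abs_values[i] <- abs_values[i - 1] + values[i]
  }
  abs_values
}

# Slope of 3, then 2 lower than the former segment, then an absolute 0.5
rel_to_abs(c(3, -2, 0.5), is_rel = c(FALSE, TRUE, FALSE))
```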

Prettier printing of lists and characters

Do this by adding print.mcplist and print.mcpstr methods and corresponding classes to the following:

Lists:

  • fit$pars
  • fit$prior
  • fit$model
  • attr(fit$data$y, "simulated")

Code:

  • fit$jags_code. Users should not need to do cat(fit$jags_code)
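Roughly what the two S3 methods could look like. The class names are the ones proposed above; the method bodies and the example objects are assumptions.

```r
# Print code strings with real line breaks instead of "\n" escapes
print.mcpstr <- function(x, ...) {
  cat(x, "\n", sep = "")
  invisible(x)
}

# Print one "name = value" line per list element
print.mcplist <- function(x, ...) {
  for (name in names(x)) {
    cat(name, " = ", format(x[[name]]), "\n", sep = "")
  }
  invisible(x)
}

# Example objects carrying the proposed classes
jags_code <- structure(
  "model {\n  cp_1 ~ dunif(0, 10)\n}",
  class = c("mcpstr", "character")
)
prior <- structure(
  list(cp_1 = "dunif(0, 10)", int_1 = 20),
  class = c("mcplist", "list")
)

print(jags_code)
print(prior)
```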

Move example data/fits to a function

The datasets in ex_demo, ex_ar, etc. clutter up the namespace and the documentation. Move it to a function just like https://github.com/stan-dev/posterior/blob/master/R/example_draws.R.

So mcp_example("demo") returns a list with the fields

  • $model: the model used to generate the data.
  • $data: the data, with the simulation values in the "simulated" attribute.
  • $fit: an mcpfit. Defaults to NULL.

A typical example in the README would then look like this:

model = list(
    ....,
    ....
)
fit = mcp(model, mcp_example("demo")$data)

or

ex =  mcp_example("demo")
ex$model  # Show it
fit = mcp(ex$model, ex$data)

Undecided:

  • Consider whether to add a field for the simulation values explicitly ($simulated) or just have the user extract them via attr($data$response, "simulated").
  • Having a briefer name would be nice (e.g., ex("demo")) but less semantically clear.

TO DO:

  • Update README
  • Update function doc examples
  • Update vignettes
  • Move descriptions to mcp_example() doc
  • Delete old data

Readme

Hey! Great package! Initially I had trouble installing it because some of my packages were outdated. But after updating everything, it ran pretty smoothly.

  • It took me a bit to understand the logic of the list. I think a simple comment in the quick-start would fix this (e.g. "between each entry (a "segment") of the list, a changepoint is modelled"). After understanding this, the toolbox was intuitive to use

  • readme: sampling the prior:
    empty = mcp(segments, sample=FALSE)
    Here it is implicitly assumed that segments defines "x" somehow.

  • I think I'm a bit confused what the underlying model is. I get discrete changepoints but at points where there are no samples. This is still confusing me tbh.

  • the rel(1) command lets you parameterize the parameter relative to the last segment. Is it also possible to parameterize relative to any other segment? I am thinking of a situation where two changepoints define a plateau that differs from the rest, after which the signal goes back to the initial value, i.e.
    [image]
    In this example, I might want to assume that the first and last segment / plateau have identical parameters (or at least put a prior that the difference is quite small)

Pretty good job! worked fine for me so far. I only ran it on simulated data, I have to check for real data :-) My problem with real data is that I usually have strong autocorrelation, i.e. changes are not really discrete hinges, but smoothed over time. I guess one could fit plateaus & slopes, but still no smoothness in the fit.

Add "variance" argument to model sigma/ARMA/varying

There should be an mcp(..., sigma = ~ [formula]) argument to set it up for the whole model. The default should be mcp(..., sigma = ~ 1). The estimated sigma should just be called sigma instead of sigma_1. If any sigma() is specified in the segments, the mcp argument would apply up to that point, but the parameter would be called sigma_1, etc.

The downside is that it is redundant to specifying + sigma(1) in segment 1 (which is already the default if not done explicitly). Having two ways of doing the same thing is confusing.

I judge that the advantages outweigh this downside, though:

  • it is congruent with other packages, e.g., brms and others.
  • Less confusion about whether it only applies to segment 1 or later ones
  • Congruent with the implementation of upcoming functions, like the upcoming mcp(..., autocor).

One thing to decide:

  • Should it be sigma = ~ 1 or sigma = ~ sigma(1)? The latter is more verbose, but more consistent with usage in the formulas and the upcoming autocor.

Add posterior change point densities to the default plot

Would be great to:

  • Plot dens_overlay on the x-axis
  • Color it by change point (cp_1 is red, cp_2 is blue, etc.)
  • Show one line per chain as dens_overlay to inspect convergence.

This would bring the non-normal nature of the change point posteriors to the fore. It would also constitute an in-your-face inspection of convergence issues.

I would like it to be default, but you should be able to turn it off, plot(fit, show_dens = FALSE). I need to think of a good argument name here.

Support more link functions

  • "identity" for binomial() and bernoulli()
  • "log" for gaussian()
  • "identity" for poisson()?

The hard part is coming up with priors where these don't fail too often.

Move rel() to priors

Implementing #90, I now see that rel() becomes cumbersome to support henceforth (it already was) and that it becomes somewhat ambiguous for categorical predictors. Users seem to misunderstand it anyway.

  • Remove all code for rel()
  • Add support for prior = list(groupX_2 = "groupX_1+ DEFAULT", catY_3 = "catY_1 + dnorm(0, 1) T(0, )"). The returned parameter should be the distribution part so the JAGS code needs to be like catY_1_return ~ dnorm(0, 1) T(0, ); catY_1 = catY_1 + catY_3
  • Add explicit informative deprecation error if a user uses rel().
  • Update code documentation
  • Update vignettes

An advantage of this approach is that it expands functionality: allows relativity across segments (not just the former), it allows percentage-wise relativity (divide), etc.

stan backend

mcp 2.0 will support stan in addition to JAGS. It is far out in the future but this issue collects working points.

  • Obviously, generate a stan model, pass data, and sample it.
  • Support bridgesampling-based Bayes Factors
  • Can jags-functions and stan-functions be dropped as dependencies, only to be installed upon first use? (call mcp(model, data, backend = "stan")). Otherwise, the dependencies would be quite heavy for non-JAGS and non-stan users.
  • Check if stan samples more effectively using a continuous step function, e.g., as in this post.
  • Option or default to no prior for non-intercept and non-changepoint parameters? Cf. #122.

mcp() doesn't start compile

I tried basic rjags examples and they work fine (model compiled in seconds), to ensure JAGS is running properly.

Then, I experimented with mcp, but running mcp() doesn't seem to kick off JAGS at all - keeps loading, without any message; this was tried on both R 3.5.1 and 3.6.0.

I proceed with killing the run and the following message shows:

> library(mcp)
> 
> my_data <- data.frame(x = 1:50,
+                       y = c(rep(30, 25) * abs(rnorm(25)),
+                             rep(30, 25) * -abs(rnorm(25))))
> 
> # Define segments
> segments = list(
+   y ~ 1 + x,  # Intercept
+   1 ~ 1  # Intercept
+ )
> # Start sampling
> fit = mcp(segments, my_data)

Warning message:
In system(cmd, wait = FALSE, input = "") :
  'CreateProcess' failed to run 'C:\Users\Public\Data\R\R-35~1.1\bin\x64\Rscript.exe --default-packages=datasets,utils,grDevices,graphics,stats,methods -e "parallel:::.slaveRSOCK()" MASTER=localhost PORT=11743 OUT="/dev/null" SETUPTIMEOUT=120 TIMEOUT=2592000 XDR=TRUE'
>

Session info

> sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 17134)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] mcp_0.1     rjags_4-9   coda_0.19-2

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.2.2           rstudioapi_0.10.0-9002 magrittr_1.5           tidyselect_0.2.5      
 [5] munsell_0.5.0          colorspace_1.4-1       lattice_0.20-38        R6_2.4.0              
 [9] rlang_0.4.0            stringr_1.4.0          dplyr_0.8.3            tools_3.5.1           
[13] parallel_3.5.1         grid_3.5.1             packrat_0.5.0          gtable_0.3.0          
[17] loo_2.1.0              ellipsis_0.3.0         matrixStats_0.54.0     lazyeval_0.2.2        
[21] assertthat_0.2.1       lifecycle_0.1.0        tibble_2.1.3           crayon_1.3.4          
[25] tidyr_1.0.0            purrr_0.3.2            ggplot2_3.2.1          vctrs_0.2.0.9001      
[29] zeallot_0.1.0          glue_1.3.1             stringi_1.4.3          compiler_3.5.1        
[33] pillar_1.4.2           backports_1.1.5        scales_1.0.0           pkgconfig_2.0.3 

Support "non-linear" terms

This should work:

segments = list(
    y ~ 1 + x + I(x^2.4),
    1 ~ 0 + log(x)
)

The best solution would be to allow anything that can run in JAGS. See user manual page 42.

Add vignettes

The README is growing quite big. Write more focused vignettes and link to them:

  • Typical workflow (probably in README.md?)
  • Understanding the theory and models by inspecting fit$func_y and fit$jags_code.
  • Binomial
  • Poisson
  • Priors
  • Varying effects

Parameterize JAGS dnorm/dcauchy priors with SD instead of precision

JAGS uses precision (1/sigma^2) instead of SD as the spread parameter in dnorm, dcauchy, and dt. This is confusing. Allow users to specify using SD instead:

prior = list(
    cp_1 = "dnorm(10, 5)",
    int_2 = "dcauchy(1, 0.2)"
)

should generate JAGS code

cp_1 ~ dnorm(10, 1/5^2)
int_2 ~ dcauchy(1, 1/0.2^2)

If users want to specify precision, they can do this:

prior = list(
    cp_1 = "dnorm(10, sqrt(1/5))",
    int_2 = "dcauchy(1, sqrt(1/0.2))"
)
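One way the translation could be sketched with a regular expression. `sd_to_precision()` is a hypothetical helper; it only handles plain numeric SD values and deliberately leaves everything else untouched.

```r
# Hypothetical helper: rewrite the spread argument of dnorm()/dcauchy()
# from SD to precision. Only plain numeric SDs are handled; anything
# else (other distributions, expressions) is returned unchanged.
sd_to_precision <- function(prior_str) {
  pattern <- "^(dnorm|dcauchy)\\(([^,]+), *([0-9.]+)\\)(.*)$"
  if (!grepl(pattern, prior_str)) return(prior_str)
  # Keep distribution name, location, and any truncation suffix
  sub(pattern, "\\1(\\2, 1/\\3^2)\\4", prior_str)
}

sd_to_precision("dnorm(10, 5)")
sd_to_precision("dcauchy(1, 0.2) T(0, )")
sd_to_precision("dunif(0, 60)")
```

A real implementation would also have to cover dt and non-numeric spread arguments, such as references to other parameters.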

Add summary.mcpfit and fitted.mcpfit

Look to brms for inspiration

  • Simple summary(fit)
  • Proper summary(fit) with info about model, convergence, etc.
  • fitted(fit). This is already done in plot.mcpfit. Could also be predict(fit)
  • ranef(fit) would also be nice.

Make a prior predictive plot

This should be fairly simple to do. Just run the JAGS model without data and the predictive formula, and collect samples. Two ideas:

mcp(segments, data = "prior")

Although philosophically sound (the prior acts in the same way as data), it does look kind of weird. Another option is to take a dedicated argument like brms::brm:

mcp(segments, sample_prior = TRUE)

Perhaps these samples should be stored in fit$prior_samples. Down the line, they could be useful for computing Bayes Factors of point-null models.

I immediately discarded a third option, which is to take a missing data argument to mean that only the priors should be sampled (because sample == TRUE by default). Users who forget to provide data would get very confused that the model "ignores" it:

mcp(segments)

Custom families

Family-specific functions are currently scattered over several locations in the code and dealt with in an if-else like fashion. This hard-coding of response families makes it harder to implement new ones and maintain existing ones.

I propose storing all of this in an mcpfamily() object which extends gaussian(), etc.

  • Move priors to mcpfamily()
  • Move the JAGS likelihood code to mcpfamily()
  • Move R random generator code to mcpfamily() (used in fit$simulate()).
  • Assign a list of distributional parameters (dpar) for each family, e.g., c("mu", "sigma") for gaussian() and build regression models for each of these. Use intercept-models for all that are not explicitly included in the formulas.
  • Also somehow code whether it supports ar(), weight(), etc.
  • Document it in a vignette.

This is relevant for #89. Once implemented, support for stan will also be easier (#100).
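A rough sketch of the object structure, assuming mcpfamily() simply decorates a base-R family. The field names (dpar, default_prior, supports_ar) and the placeholder prior string are assumptions for illustration.

```r
# Sketch: extend a base-R family object with mcp-specific fields.
mcpfamily <- function(family, dpar, default_prior, supports_ar = FALSE) {
  stopifnot(inherits(family, "family"))
  family$dpar <- dpar                    # distributional parameters
  family$default_prior <- default_prior  # named list of prior strings
  family$supports_ar <- supports_ar      # whether ar() terms are allowed
  class(family) <- c("mcpfamily", class(family))
  family
}

gauss <- mcpfamily(
  gaussian(),
  dpar = c("mu", "sigma"),
  default_prior = list(sigma = "dnorm(0, 1) T(0, )"),  # placeholder prior
  supports_ar = TRUE
)

class(gauss)
gauss$linkfun(2)  # the base family's link functions are still available
```

Keeping "family" in the class vector means existing code that expects gaussian(), binomial(), etc. should continue to work.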

NB support

Hi there,
Can I use glm.nb for the regression?
Thank you!

Function to return posterior draws/lines displayed by `plot.mcpfit`

Hello Jonas Kristoffer Lindeløv,
Great work on a package which serves a very good purpose and which I am sure will see plenty of uptake. This is a question. I would like to extract the data behind the posterior draws/lines in the plot.mcpfit function. Is there a user-exposed function which can create these data by passing the mcpfit object and a few arguments? Many thanks,
Stuart.

plot() error given arbitrary y and x names in data

Reprex below.

The tweak for plot() method suggested in #30 is not quite perfect, as it generates the following error:

library(mcp)
  
# Define the segments that are separated by change points
segments = list(
  score ~ 1 + year,  # intercept + slope
  1 ~ 0 + year,  # joined slope
  rel(1) ~ 0,  # joined plateau starting at relative change point
  1 ~ rel(1)  # disjoined plateau with relative intercept parameterization
)

# Get an mcpfit object without samples
empty = mcp(segments, sample=FALSE)

# Now use empty$func_y() to generate data from this model.
# Set some parameter values to your liking:
data = data.frame(
  year = 1:100,  # Evaluate func_y for each of these
  score = empty$func_y(
    year = 1:100,  # x
    sigma = 12,  # standard deviation
    cp_1 = 20, cp_2 = 35, cp_3 = 80,  # change points 
    int_1 = 20, int_4 = 20,  # intercepts
    year_1 = 3, year_2 = -2  # slopes
  )
)

prior = list(
  year_1 = "dnorm(0, 5)",  # Slope of segment 1
  int_1 = "dt(10, 30, 1) T(0, )",  # t-distributed prior. Truncated to be positive.
  cp_2 = "dunif(0, 60)",  # Change point is after the first, but within 60 years.
  year_2 = -2  # Fixed slope of segment 2.
)

fit = mcp(segments, data, prior)

# adapted from plot.mcpfit
my_custom_plot(fit, "overlay")

Error: (list) object cannot be coerced to type 'double'

One more tweak to make it work; please refer to code marked with # <<< below:

[...]
    # First, let's get all the predictors in shape for func_y
    Q = Q %>%
      tidyr::expand_grid(!!x$pars$x := eval_at) %>%  # correct name of x-var
      
      # Add fitted draws (vectorized)
      dplyr::mutate(!!x$pars$y := purrr::invoke(func_y, ., type = "fitted")) %>%
      
      # Add line ID to separate lines. Mark a new line when "eval_at" repeats.
      dplyr::mutate(
        line = !!sym(x$pars$x) == min(eval_at),  # <<<
        line = cumsum(line)
      )
[...]

Support binomial, Bernoulli, etc., and associated default priors

  • family = binomial()
  • show all priors in fit$prior and in jags_code.
  • family = bernoulli()
  • family = poisson()

This should work:

segments = list(
    y | trials(N) ~ 1 + x,
    1 ~ 0)

mcp(segments, data, prior, family = binomial())

This means that all parameters are on rates/probabilities rather than observed values. family = binomial() will need to take an additional column to specify the number of trials. I have a hard-coded model working with binomial so it is very doable.

joined slopes not working

Reprex below.

Should the user initialize where the change point may be in the data?

library(mcp)

my_data <- data.frame(x = 1:50,
                      y = c(1:25 * 3 + abs(rnorm(25)),
                            1:25 * -3 + abs(rnorm(25) + 75)))

plot(x = my_data$x, y = my_data$y)

segments = list(
  y ~ 1 + x,  # intercept + slope
  1 ~ 0 + x  # joined slope
)

fit = mcp(segments, my_data, cores = 1)
#> Compiling data graph
#>    Resolving undeclared variables
#>    Allocating nodes
#>    Initializing
#>    Reading data back into data table
#> Compiling model graph
#>    Resolving undeclared variables
#>    Allocating nodes
#> Deleting model
#> Error in rjags::jags.model(model, data, n.chains = n.chains, n.adapt = n.adapt, : RUNTIME ERROR:
#> Compilation error on line 39.
#> Unknown variable cp_2
#> Either supply values for this variable with the data
#> or define it  on the left hand side of a relation.

Created on 2019-10-27 by the reprex package (v0.2.1)

Bug when fitting relative slope in a formula form?

I have a question regarding how to configure relative slope in a formula form.

I read the documentation and it seems we can use "rel" to fit the relative change. So I did the same for my time series use case, as follows.
[screenshot: Screen Shot 2020-06-15 at 12 54 56 PM]

But it gives me an upward slope, whereas our domain knowledge says it should show a downward trend. So I inspected the "jags_code". It seems the relative part is modeled as "(x_i-1 + x_i)". What I would like is to model the relative change as a multiplier on the previous slope.

Thanks to the "custom_jags" argument, I tweaked the "jags_code" a little by specifying a multiplier parameter "f_2" that follows a uniform distribution (0.1 to 1) and constructing the last slope as "x_2 = f_2*x_1" to replace the previous "(x_2+x_1)" component. That gives me a much better fit in the sense of being consistent with our prior knowledge.
[screenshot: Screen Shot 2020-06-15 at 12 55 17 PM]

I'm wondering whether this is a bug in fitting a relative slope, as I'm not sure how the additive formulation "(x_i-1 + x_i)" can capture the relative change. Or do you have any other suggestions for specifying my goal in formula form (so I don't need to tweak the JAGS code)?

Thanks very much!

Predictions for unobserved change points

As discussed in the last part of this article mcp can be "hacked" to model change points in regions with no data. This is useful when forecasting. A more user-friendly API should be included.

This is a spin-off of #78 where some details are already discussed.

Random effects in RHS

Include optional random effects for all parameters. Here is a great guide for JAGS and stan

Or add a varying slope too:

list(y ~ 1 + x,
     (1|id) ~ 0 + (1|x))

Keep it intercept-only at first with uncorrelated random effects. Think about whether to allow for fixed and random of the same parameter (mean-centering random):

list(y ~ 1 + x,
     1 + (1|id) ~ 0 + x + (1|x))

Rename `ranef` to `ranef.mcpfit`

Hi!
I was wondering if you could help me with this. When I run this code:

model = list(
acc | trials(N) ~ 1,
1 + (1|id) ~ 0 + numerosity)

fit<- mcp(model, data = df, family = binomial())
summary(fit)
ranef(fit)

I get the following error:

Error in UseMethod("ranef") :
no applicable method for 'ranef' applied to an object of class "mcpfit"

Before that I tried using the following code with a slightly different dataset so I could use the bernoulli family:

model = list(
acc ~ 1,
1 + (1|id) ~ 0 + numerosity
)

fit<- mcp(model, data = df, family = bernoulli())
summary(fit)
ranef(fit)

I got the same error.

Thanks in advance!!

tidybayes integration

Saw the announcement of the new release on Twitter, looks awesome!

I had two questions about tidybayes integration:

  1. I see you're using tidy_samples instead of implementing a generic for tidybayes::tidy_draws... Are there limitations of the tidy_draws interface that motivate this choice or is there a difference in semantics? If there are changes in tidybayes that would make it possible to make the interfaces consistent I'd be open to it.

  2. Any interest in implementing add_fitted_draws/add_predicted_draws/etc? I haven't come up with a good way to make that easy on package developers yet but if there was interest on your part this might be the opportunity to think through how to make that happen. There is also a feature discussion for {posterior} related to generics for these kinds of functions (stan-dev/posterior#39) that I will probably be basing future implementations of add_fitted_draws/add_predicted_draws/etc on, having your input there might be helpful too.

Update plot.mcpfit and add fit$func

It currently draws using the fitted values y_[i]. This bloats the fit$samples and is very slow in every way.

Instead, we need to make a function which respects fit$segments like

fit$func = function(x, cp_1, int_1, x_1, x_2) {
    ...
}

... and use that for plotting and possibly prediction by running it for some/all MCMC draws. The fit$func may be useful for other purposes as well.
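For illustration, here is what such a function might look like if the segments are y ~ 1 + x followed by a joined slope (~ 0 + x). That choice of segments is an assumption made to match the signature above, and the draws are made-up numbers.

```r
# Sketch of a model function for segments: y ~ 1 + x, then ~ 0 + x.
func <- function(x, cp_1, int_1, x_1, x_2) {
  # Segment 1: intercept plus slope, with x capped at the change point.
  # Segment 2: joined slope, counted from the change point onwards.
  int_1 + x_1 * pmin(x, cp_1) + (x >= cp_1) * x_2 * (x - cp_1)
}

x_grid <- c(0, 5, 10, 15)
func(x_grid, cp_1 = 10, int_1 = 2, x_1 = 1, x_2 = -0.5)

# Evaluate the function for a few MCMC draws (made-up values here)
draws <- data.frame(
  cp_1 = c(10, 11), int_1 = c(2, 2.5), x_1 = c(1, 0.9), x_2 = c(-0.5, -0.4)
)
fitted_lines <- sapply(seq_len(nrow(draws)), function(i) {
  do.call(func, c(list(x = x_grid), as.list(draws[i, ])))
})
dim(fitted_lines)  # one row per x value, one column per draw
```

Only the parameters need to be stored in the samples; fitted lines are computed on demand for as many draws as the plot needs.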

Compute loglik post-hoc

log-likelihood is currently computed during sampling. This is >90% of the size of the fit object and may slow sampling too. It's only used for loo and waic so WWBD (What Would brms Do)?:

  • Remove computation of loglik during sampling.
  • Add method fit = mcp::add_loglik(fit) which fills fit$loglik with a (Nchains*Nsample) x Ndata matrix
  • Call add_loglik in loo and waic if is.null(fit$loglik)
  • Update all documentation
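A sketch of the post-hoc computation for a Gaussian model. `add_loglik_sketch()` and `plateaus()` are hypothetical names, the data and draws are made up, and a real add_loglik would read the draws and the model function from the fit object instead.

```r
# Hypothetical sketch: log-likelihood matrix with one row per MCMC draw
# and one column per observation (Gaussian likelihood assumed).
add_loglik_sketch <- function(draws, obs, func) {
  loglik <- matrix(NA_real_, nrow = nrow(draws), ncol = nrow(obs))
  for (s in seq_len(nrow(draws))) {
    mu <- func(obs$x, draws$cp_1[s], draws$int_1[s], draws$int_2[s])
    loglik[s, ] <- dnorm(obs$y, mean = mu, sd = draws$sigma[s], log = TRUE)
  }
  loglik
}

# Two plateaus separated by one change point
plateaus <- function(x, cp_1, int_1, int_2) ifelse(x < cp_1, int_1, int_2)

# Made-up data and draws, for illustration only
obs <- data.frame(x = 1:6, y = c(1, 1.2, 0.9, 3.1, 2.8, 3))
draws <- data.frame(
  cp_1 = c(3.5, 3.6), int_1 = c(1, 1.1),
  int_2 = c(3, 2.9), sigma = c(0.2, 0.25)
)

loglik <- add_loglik_sketch(draws, obs, plateaus)
dim(loglik)
```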

Ideas for the far future

Here are some ideas which could use some discussion and careful consideration. It extends the current model specification: https://lindeloev.github.io/mcp/articles/formulas.html

In the order from "soonish" (top) to "in your dreams":

Survival models

Survival models are relatively simple and we should support them, including censoring too. The API for the model itself would be something like brms:

model = list(
    eventtime | cens(status) ~ 0,
    ~ 0 + x
)

It should support both exponential decay and Cox proportional hazards. This would probably be specified via the mcp(..., family = ) argument, but I'm unsure what would be the best.

Slope on change points

If there are multiple (piecewise) lines over a single change point, and each line is associated with a different parameter x, we can use that to predict the change point. For example, we could assess subitizing in participants of varying age, and it would be reasonable to expect the subitizing range (location of first change point) to increase with age in childhood and decrease with age in adulthood.

How to implement this in a formula, I'm unsure. Maybe it has to be in the random effect: (1 + x | age) since that specifies the grouping of multiple lines. This would also ensure that the parameter names stay intact. cp_i, cp_i_age, and then probably cp_i_x.

Multivariate regression

Multiple response variables predicted from single change points. Something like

model = list(
    c(y1, y2) ~ 1 + x,  # segment 1
    ~ 0  # seg2: joined intercept
)

This could be merged with "Variance change (Specify y)" to specify a change in one response but not the other:

model = list(
    c(y1, y2) ~ 1 + x,  # segment 1
    ~ 0,  # seg2: joined intercept on y1 and y2, segment 2
    y1 ~ 1 ~ x  # seg3: slope change in y1 but not y2
)

ARIMA

mcp currently supports AR(N) models. It should go more general. Take a look at:

Observation weights

I have got a list of observations along a spatial axis, that I would like to fit a simple intercept model to (as described here).

However, I am not equally confident in all of my observations, so I would like to add weights to account for uncertainty. I could do this for instance with ecp::e.divisive.

Is this something you are considering for mcp as well?

Otherwise, there is of course the possibility to explicitly model the generative process in JAGS, but that would remove the convenience of mcp.

Update package overview

The overview should include the yet-to-be-reviewed/tested packages mentioned at the top, and perhaps an additional example: https://lindeloev.github.io/mcp/articles/packages.html.

Also update:

  • strucchange::breakpoints does take time series and returns confidence intervals on the break points.
  • changepoint.geo. It reduces multivariate problems to angle+length in vector space, and calls cpt.mean and cpt.var on this using PELT. So it's changepoint but with a tweak to allow for high-dimensional data.
  • Searching for "change detection" yields a set of new packages, e.g., changedetection.

Turn off autocorrelation

Turning off would just be changing to 0th order:

segments = list(
    y ~ 1 + x,
    ~ 0 + ar(1),  # start AR(1)
    ~ 0 + ar(0)  # Stop all AR
)

Implementation-wise, the "0" would just set zeros for the involved parameters, e.g., ar_[i_] = 0, i.e., leaving it purely to sigma to model the residuals.

Coloring the posterior estimates by CP would make interpretation of plots easier

First, fantastic package! Thank you.

When plotting the results of the model fit it can often be challenging to determine which posterior cp estimates (blue lines at bottom) correspond to the visual change points shown in the upper model curves. This is particularly difficult when one or more of the posterior estimates are bimodal, and worse still if overlapping :-(

If the individual CP posterior estimates were color coded things would be easier. You will note in my graph an attempt to identify the CP range using shaded regions and lines for the mean value.

Thanks,
Jim

# Define the model with 3 CP

model = list(
y ~ 1, # plateau (int_1)
~ 0 + x, # joined slope (time_2) at cp_1
~ 1 + x, # disjoined slope (int_3, time_3) at cp_2
~ 1 + x # disjoined slope (int_4, time_4) at cp_3
)

model.string <- paste(sapply(model, function(x) Reduce(paste, deparse(x))), collapse = ", ")

# Add prior knowledge to improve model

prior = list(
int_1 = 0, # Constant, not estimated
cp_1 = "dunif( 0, 200)", # has to occur in this interval
cp_3 = "dunif(300, 400)" # has to occur in this interval
)

EnvCpt clarification

I was happy to see that EnvCpt is included in your comparison documentation, thanks for including it.

In a couple of places you write "I suspect EnvCpt uses cpt.np() in the background" for the change in mean parts. I wanted to clarify that it is using cpt.meanvar() from the changepoint package. You will see that the changepoint.np package is not a dependency and therefore it could not use this in the background.

This explains the differences you see between EnvCpt and the cpt.mean() function.

Multiple predictors

Each segment should take an arbitrary number of linear predictors. As with the segmented package, the only requirement is that one continuous predictor (say, x) is the dimension of the change point. The change point is simply the value on x where the predictions of y change to a different regression model (parameter structure and/or values).

So this API should work. It has the following features:

  • Support categorical predictors via dummy coding.
  • Support interactions
  • Each segment can have different numbers of predictors.

model = list(
    y ~ 1 + x*group + z + sigma(1 + group),  # interactions and main effects and a covariate.
    ~ 0 + x + ar(2, z),  # only one slope
    ~ 1 + group  # a range of x where group is the only predictor
)

JAGS-wise the indicator functions would be the same but now we additionally pass design matrices (X1_, X2_, etc.) and use inprod() per segment. The model above would be something like:

    # Priors for individual parameters
    cp_1 ~ ... T(MINX, MAXX)
    cp_2 ~ ... T(cp_1, MAXX)
    int_1 ~ dnorm(0, 1^-2)
    int_3 ~ ...
    xGroupFemale_1 ~ dnorm(0, 1)  # check R naming convention
    z_1 ~ dunif(0, 100)
    x_2 ~ dnorm(4, 3^-2)
    xGroupFemale_3 ~ ...

    # Model and likelihood
    for(i_ in 1:length(x1_)) {
        y_[i_] = (x_[i_] > cp_0) * (int_1 + inprod(c(xGroupFemale_1, z_1), x1_[i_, ])) + 
                (x_[i_] > cp_1) * (0 + inprod(c(x_2), x2_[i_, ])) +
                (x_[i_] > cp_2) * (int_3 + inprod(c(xGroupFemale_3), x3_[i_, ]))

        response[i_] ~ dnorm(y_[i_], sigma_[i_])
    }

where xi_ is a model matrix that is built R-side and x_ is par_x along which change points are defined. Implementing this adds the following work points:

Data structure

  • Rename intercept: Intercept_i instead of int_i.
  • Require par_x if there is not exactly one continuous predictor.
  • Decide parameter naming: stay close to lm and brms but add _segmentnumber.
  • Split out RHS parameters from the segment table to a new long-format structure.
  • Set appropriate priors. Distinguish between intercepts, dummies, and slopes?
  • Raise error on rank deficiency using base::qr().
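
The rank deficiency check in the last point could look roughly like the following. The helper name `assert_full_rank` and the toy data are assumptions made for illustration; only base::qr() and stats::model.matrix() are existing functions.

```r
# Hypothetical helper: stop if the design matrix of a segment is rank deficient.
assert_full_rank <- function(formula, data) {
  X <- model.matrix(formula, data = data)
  rank <- qr(X)$rank
  if (rank < ncol(X)) {
    stop("Rank-deficient design matrix: rank ", rank, " but ", ncol(X), " columns.")
  }
  invisible(X)
}

# Toy data where x_twice is perfectly collinear with x.
df <- data.frame(
  x = 1:6,
  x_twice = 2 * (1:6)
)

assert_full_rank(~ 1 + x, df)  # passes silently

# The collinear model raises an error. Capture the message instead of stopping.
result <- tryCatch(
  assert_full_rank(~ 1 + x + x_twice, df),
  error = function(e) conditionMessage(e)
)
```

In a per-segment setup, a check like this would presumably run once per design matrix before any JAGS code is generated.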

Modeling, sampling, and summaries

  • Update get_formula() to match the new segment table.
  • Update get_jagscode()
  • Update run_jags() and get_jags_data() to work with design matrices. One per (dpar-segment) combo.
  • Make sure that it applies for sigma(1 + x * group) etc. (I think it will out of the box).
  • Verify that it works for plot_pars(), hypothesis(), summary(), fixef(), etc.
  • get_summary(): Translate between code parameter name and user-facing parameter name.

Simulated, fitted, and predicted values

  • Change args to fit$simulate(fit, data, par1, par2, ...), i.e., add fit and replace par_x with data.frame/tibble.
  • Check that the columns in data have the correct format. Set factor levels to match the original data.
  • Make fit$simulate() a wrapper around a lower-level fast function to use internally. Call it fit$.internal$simulate_vec(par_x, cp_1, ..., rhs_par1, rhs_par2, ...). Only the former should do asserts, call add_simulated(), etc.
  • Implement for ar()
  • Implement for varying change points.
  • Call fit$.internal$simulate_vec() from all internal functions instead of fit$simulate().
  • Ensure that it works for fitted(), predict(), pp_check(), etc.
  • Fix existing mcp_examples
  • Add new mcp_examples?
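
Setting factor levels to match the original data could be sketched as below. The helper `match_factor_levels` is hypothetical and not part of mcp; it only shows the idea of checking new data against the factor columns of the data used for fitting.

```r
# Hypothetical helper: coerce categorical columns in newdata to the
# factor levels found in the original data.
match_factor_levels <- function(newdata, original) {
  for (col in names(original)) {
    if (is.factor(original[[col]])) {
      if (!col %in% names(newdata)) {
        stop("Missing column in newdata: ", col)
      }
      unknown <- setdiff(
        unique(as.character(newdata[[col]])),
        levels(original[[col]])
      )
      if (length(unknown) > 0) {
        stop("Unknown levels in ", col, ": ", paste(unknown, collapse = ", "))
      }
      newdata[[col]] <- factor(newdata[[col]], levels = levels(original[[col]]))
    }
  }
  newdata
}

original <- data.frame(x = 1:4, group = factor(c("a", "b", "c", "a")))
newdata <- data.frame(x = c(10, 20), group = c("c", "a"))

aligned <- match_factor_levels(newdata, original)
levels(aligned$group)
```

Keeping the full set of levels matters because a design matrix built from `aligned` then has the same dummy columns as the one used for fitting, even when some levels are absent from the new data.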

Plot

  • Allow faceting by any unique combination of (categorical) variables using plot(fit, facet_by = c("my_rhs", "my_varying_cp")). Still default to no facets.
  • Control spaghetti groupings and colors using color_by = c("my_categorical1", "my_categorical2"). It defaults to color_by = "all_categorical", i.e., all unique combinations of categorical levels on RHS. This will also set the grouping for spaghettis. I think that color_by should pertain solely to the RHS which share change points. Varying change points will not be accepted.
  • Add a way to include a subset of the predictors: plot(fit, effects = "my_categorical1"). It's like brms::marginal_effects(). This should probably be implemented in tidy_samples().
  • Display only a subset of the levels using plot(fit, filter = data.frame(my_categorical1 = c("levelA", "levelB"), my_categorical2 = "level1")). This is like brms::marginal_effects(), only filter using a data.frame replaces int_conditions which is a named list. For variables in effects that are not in filter, all levels will be included. This should probably be implemented in tidy_samples().
  • (Add option to facet_by group levels in pp_check?)
  • Add pages to plot_pars()
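
One possible way to derive the grouping for color_by from the unique combinations of categorical levels is base::interaction(). This is a sketch under assumptions: the toy data and the column name `color_group` are invented here, and nothing in it reflects how plot() is actually implemented.

```r
# Toy data with two categorical predictors (names taken from the bullets above).
df <- data.frame(
  my_categorical1 = factor(c("levelA", "levelA", "levelB", "levelB")),
  my_categorical2 = factor(c("level1", "level2", "level1", "level1"))
)

color_by <- c("my_categorical1", "my_categorical2")

# One grouping factor per observed combination of levels.
# drop = TRUE removes combinations that never occur in the data.
df$color_group <- interaction(df[color_by], drop = TRUE, sep = ":")

levels(df$color_group)
```

The same factor could serve both as the color aesthetic and as the grouping for spaghetti lines.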

Tests

  • Pass existing test suite.
  • Write tests for combinations of main effects, factor-factor interactions, and factor-continuous interactions on ct, sigma, and ar. Also test the number of expected model parameters with and without intercepts.
  • Expect errors on multiple terms inside base functions, e.g., ~exp(1 + x ).
  • Test the new plot functions and tidy_draws()

More articles

  • Model comparison (once bridgesampling is implemented)
  • Bernoulli (merge with binomial?)
  • Model diagnostics.
  • Comparison to other packages: changepoint, ecp, bcp, segmented, strucchange, tsmcp, robts:changerob.
    Some of these are superior when it comes to speed (at the cost of accurate posteriors) and unsupervised change point detection.

Missing output columns in fit()

Reprex below.

Using ex_demo, there are match and sim columns in the mcpfit object, but not when using my_demo.

Is this driven by Rhat or n.eff values?

library(mcp)
#> Warning: package 'mcp' was built under R version 3.5.3

# Define the model
model <- list(
  response ~ 1, # plateau (int_1)
  ~ 0 + time, # joined slope (time_2) at cp_1
  ~ 1 + time # disjoined slope (int_3, time_3) at cp_2
)

# Fit it. The `ex_demo` dataset is included in mcp
fit <- mcp(model, data = ex_demo)
#> Compiling model graph
#>    Resolving undeclared variables
#>    Allocating nodes
#> Graph information:
#>    Observed stochastic nodes: 100
#>    Unobserved stochastic nodes: 7
#>    Total graph size: 1731
#> 
#> Initializing model
#> Finished sampling in 7.5 seconds
fit
#> Family: gaussian(link = 'identity')
#> Iterations: 9000 from 3 chains.
#> Segments:
#>   1: response ~ 1
#>   2: response ~ 1 ~ 0 + time
#>   3: response ~ 1 ~ 1 + time
#> 
#> Population-level parameters:
#>     name match  sim  mean lower upper Rhat n.eff
#>     cp_1    OK 30.0 30.57 22.57 37.96    1   355
#>     cp_2    OK 70.0 69.78 69.28 70.24    1  5621
#>    int_1    OK 10.0 10.27  8.77 11.59    1  1152
#>    int_3    OK  0.0  0.46 -2.51  3.41    1   776
#>  sigma_1    OK  4.0  4.01  3.46  4.62    1  4109
#>   time_2    OK  0.5  0.54  0.41  0.67    1   379
#>   time_3    OK -0.2 -0.22 -0.38 -0.04    1   741

my_demo <- structure(list(
  response = c(
    138.989, 97.232, 45.717, 25.919,
    12.67, 39.103, 57.598, 39.518, 43.226, 2.374, 7.972, 6.779
  ),
  time = 1:12
), class = c("tbl_df", "tbl", "data.frame"), row.names = c(
  NA,
  -12L
))

fit <- mcp(model, data = my_demo)
#> Compiling model graph
#>    Resolving undeclared variables
#>    Allocating nodes
#> Graph information:
#>    Observed stochastic nodes: 12
#>    Unobserved stochastic nodes: 7
#>    Total graph size: 235
#> 
#> Initializing model
#> Finished sampling in 0.9 seconds
fit
#> Family: gaussian(link = 'identity')
#> Iterations: 9000 from 3 chains.
#> Segments:
#>   1: response ~ 1
#>   2: response ~ 1 ~ 0 + time
#>   3: response ~ 1 ~ 1 + time
#> 
#> Population-level parameters:
#>     name  mean lower upper Rhat n.eff
#>     cp_1   2.2   1.0   5.1  1.3   110
#>     cp_2   3.3   1.2   9.1  1.3    67
#>    int_1 108.9  46.2 151.7  1.1   212
#>    int_3  36.4  -1.4  69.0  1.1   306
#>  sigma_1  24.7  12.7  41.8  1.1   236
#>   time_2  -1.1 -12.9  11.3  1.0  1191
#>   time_3  -2.0  -6.8   3.4  1.0   818

Include demo datasets

Give users something to toy with to get acquainted with mcp.

  • simulated data
  • real-world data, ideally from academic papers on change point estimation or where it could/should be used.

Should it be loaded with mcp or should one run data("mcp_sim1")?

Once done, update the README and mcp examples to use these datasets.

Unit test updates

Test that it does not break, and that the data structures are intact:

  • Test priors, including fixing and truncation.
  • Use set.seed and add an extra test for more exact values, now that randomness from data generation has been avoided.
  • Test allowed data types
  • Test underscores in variable names
  • Test y and varying effects
  • Begin sampling and make 1 draw on a very small dataset - just for speed. This will test the model compilation and initialization.
  • Test loo and waic
  • Plots.
  • Add more testing environments (R < 3.6 and Linux), cf. #27
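
The set.seed point could be handled with a small data generator like the one below. The function `simulate_test_data` and its parameter values are made up for illustration. Note that this only fixes the randomness in data generation; the sampler has its own random number state.

```r
# Hypothetical data generator for a test: a plateau followed by a slope.
simulate_test_data <- function(seed) {
  set.seed(seed)
  x <- 1:50
  y <- ifelse(x < 25, 10, 10 + 0.5 * (x - 25)) + rnorm(length(x), sd = 2)
  data.frame(x = x, y = y)
}

# The same seed gives identical data, so tests can check more exact values.
first <- simulate_test_data(42)
second <- simulate_test_data(42)
stopifnot(identical(first, second))
```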

Is there a "predict" function?

It seems there isn't a "predict" function. Please correct me if I'm wrong?

I think the predicted values would be generated from the model parameters in the last segment, correct?
