
romainkp / stremr


Streamlined Estimation for Static, Dynamic and Stochastic Treatment Regimes in Longitudinal Data

License: MIT License

R 99.84% TeX 0.16%
propensity-scores censoring-events ipw-msm grid-search tuning-parameters tmle survival machine-learning targeted-learning time-varying-confounding

stremr's Introduction

R/stremr: Streamlined Causal Inference for Static, Dynamic and Stochastic Regimes in Longitudinal Data

CRAN Status Badge | Travis-CI Build Status | codecov | Project Status: Active – The project has reached a stable, usable state and is being actively developed.

stremr analyzes longitudinal data with a continuous or time-to-event (binary) outcome and time-varying confounding. It adjusts for all measured time-varying confounding and informative right-censoring, and estimates the expected counterfactual outcome under static, dynamic or stochastic interventions. It includes the doubly robust and semi-parametrically efficient Targeted Minimum Loss-Based Estimator (TMLE), along with several other estimators, and supports data-adaptive estimation of the outcome and treatment models with the Super Learner implemented in sl3.

Authors: Oleg Sofrygin, Mark van der Laan, Romain Neugebauer

Available estimators

Currently available estimators can be roughly categorized into four groups; see the section Details on some implemented estimators below.

Input data format

The exposure, monitoring and censoring variables can be coded as either binary, categorical or continuous. Each can be multivariate (e.g., one can use more than one column of dummy indicators for different censoring events). The input data needs to be in long format; a minimal sketch of this format follows the list below.

  • Possibly right-censored data must be in long format.
  • Each row must contain a subject identifier (ID) and an integer indicator of the current time (t), e.g., day, week, month, year.
  • The package assumes that the temporal ordering of covariates in each row is fixed according to (ID, t, L, C, A, N, Y), where
    • L -- Time-varying and baseline covariates.
    • C -- Indicators of right censoring events at time t; this can be either a single categorical or several binary columns.
    • A -- Exposure (treatment) at time t; this can be multivariate (more than one column) and each column can be binary, categorical or continuous.
    • N -- Indicator of being monitored at time point t+1 (binary).
    • Y -- Outcome (binary 0/1 or continuous between 0 and 1).
  • Categorical censoring can be useful for representing all of the censoring events with a single column (variable).
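
For reference, here is a minimal sketch (not part of the package documentation) of what such long-format input could look like, with one row per subject-time and hypothetical values:

library(data.table)
## Hypothetical long-format input: one row per (ID, t), columns ordered as (ID, t, L, C, A, N, Y).
toyDT <- data.table(
  ID = c(1L, 1L, 1L, 2L, 2L),       # subject identifier
  t  = c(0L, 1L, 2L, 0L, 1L),       # integer time index
  L  = c(0.2, 0.5, 0.4, 1.1, 0.9),  # time-varying covariate
  C  = c(0L, 0L, 0L, 0L, 1L),       # right-censoring indicator at t
  A  = c(1L, 1L, 1L, 0L, 0L),       # exposure at t
  N  = c(1L, 1L, 1L, 1L, 0L),       # indicator of being monitored at t+1
  Y  = c(0L, 0L, 1L, 0L, NA)        # outcome (NA once censored)
)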

Estimation of the outcome and treatment models

  • Separate models are fit for the observed censoring, exposure and monitoring mechanisms. Jointly, these make up what is known as the propensity score.
  • Separate outcome regression models can be specified for each time-point.
  • Each propensity score model can be stratified (a separate model is fit) by time or any other user-specified criteria. Each stratum is defined by a single logical expression that selects specific observations/rows in the observed data (see the sketch after this list).
  • By default, all models are fit with logistic regression.
  • Alternatively, model fitting can be performed via any machine learning (ML) algorithm available in the sl3 and gridisl R packages. See xgboost and h2o for descriptions of some of the available ML algorithms.
  • One can select the best model from an ensemble of many learners via model stacking or Super Learning (Breiman, 1996; van der Laan, Polley, and Hubbard, 2007), which finds the optimal convex combination of all models in the ensemble via cross-validation.
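
For illustration, a minimal sketch of such stratification, assuming the fitPropensity interface and the treatment variable TI from the simulated-data example later in this README:

## Hedged sketch: fit a separate treatment model at baseline versus all follow-up time-points.
stratify_TRT <- list(TI = c("t == 0", "t > 0"))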

Brief overview of stremr

Installation

To install the development version (requires the devtools package):

devtools::install_github('osofr/stremr')

For ensemble learning with the Super Learner algorithm, we recommend installing the latest development version of the sl3 R package. It can be installed as follows:

devtools::install_github('jeremyrcoyle/sl3')

For optimal performance, we also recommend installing the latest development version of the data.table package:

remove.packages("data.table")                         # First remove the current version
install.packages("data.table", type = "source",
    repos = "http://Rdatatable.github.io/data.table") # Then install devel version

Issues

If you encounter any bugs or have any specific feature requests, please file an issue.

Documentation

To obtain documentation for the relevant functions in the stremr package (a package-level help call is also shown after this list):

?stremr
?importData
?fitPropensity
?getIPWeights
?directIPW
?survNPMSM
?survMSM
?fit_GCOMP
?fit_iTMLE
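
An index of all documented functions in the installed package can also be listed with base R's help system:

help(package = "stremr")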

Simulated data example

Load the data:

require("magrittr")
#> Loading required package: magrittr
require("data.table")
#> Loading required package: data.table
require("stremr")
#> Loading required package: stremr

data(OdataNoCENS)
datDT <- as.data.table(OdataNoCENS, key=c("ID", "t"))

Define some summaries (lags):

datDT[, ("N.tminus1") := shift(get("N"), n = 1L, type = "lag", fill = 1L), by = ID]
datDT[, ("TI.tminus1") := shift(get("TI"), n = 1L, type = "lag", fill = 1L), by = ID]

Define the counterfactual exposures. In this example we define one intervention as always treated and another as never treated. Such interventions can also be defined conditionally on other variables (dynamic interventions), or as the probability that the counterfactual exposure is 1 at each time-point t (stochastic interventions); a sketch follows the code below.

datDT[, ("TI.set1") := 1L]
datDT[, ("TI.set0") := 0L]

Import the input data into a stremr DataStorageClass object and define the relevant covariates:

OData <- importData(datDT, ID = "ID", t = "t", covars = c("highA1c", "lastNat1", "N.tminus1"), CENS = "C", TRT = "TI", OUTCOME = "Y.tplus1")

Once the data has been imported, it is still possible to inspect it and modify it, as shown in this example:

get_data(OData)[, ("TI.set0") := 1L]
get_data(OData)[, ("TI.set0") := 0L]

Define the regression formulas for modeling the propensity scores for censoring (CENS) and exposure (TRT). By default, each of these propensity scores is fit with a common model that pools across all available time-points (smoothing over time):

gform_CENS <- "C ~ highA1c + lastNat1"
gform_TRT <- "TI ~ CVD + highA1c + N.tminus1"

Stratification, that is, fitting separate models for different time-points, is enabled with logical expressions in arguments stratify_... (see ?fitPropensity). For example, the logical expression below states that we want to fit the censoring mechanism with a separate model for time point 16, while pooling with a common model fit over time-points 0 to 15. Any logical expression can be used to define such stratified modeling. This can be similarly applied to modeling the exposure mechanism (stratify_TRT) and the monitoring mechanism (stratify_MONITOR).

stratify_CENS <- list(C=c("t < 16", "t == 16"))

Fit the propensity scores for censoring, exposure and monitoring:

OData <- fitPropensity(OData,
                       gform_CENS = gform_CENS,
                       gform_TRT = gform_TRT,
                       stratify_CENS = stratify_CENS)

Estimate survival based on non-parametric/saturated IPW-MSM (IPTW-ADJUSTED KM):

AKME.St.1 <- getIPWeights(OData, intervened_TRT = "TI.set1") %>%
             survNPMSM(OData) %$%
             estimates

The result is a data.table containing the estimates of the counterfactual survival for each time-point, for the treatment regimen TI.set1. In this particular case, the column St.NPMSM contains the IPW-NPMSM survival estimates and the first row represents the estimated proportion alive at the end of the first cycle/time-point. Note that the column St.KM contains the unadjusted/crude survival estimates (these should be equivalent to the standard Kaplan-Meier estimates in most cases).

head(AKME.St.1[],2)
#>    est_name time sum_Y_IPW sum_all_IPAW   ht.NPMSM  St.NPMSM      ht.KM
#> 1:    NPMSM    0 1.6610718     38.13840 0.04355379 0.9564462 0.04733728
#> 2:    NPMSM    1 0.8070748     48.10323 0.01677797 0.9403990 0.01863354
#>        St.KM rule.name
#> 1: 0.9526627   TI.set1
#> 2: 0.9349112   TI.set1

Estimate survival with bounded IPW:

IPW.St.1 <- getIPWeights(OData, intervened_TRT = "TI.set1") %>%
            directIPW(OData) %$%
            estimates

As before, the result is a data.table with estimates of the counterfactual survival for each time-point, for the treatment regimen TI.set1, located in column St.directIPW.

head(IPW.St.1[],2)
#>     est_name time sum_Y_IPW  sum_IPW St.directIPW rule.name
#> 1: directIPW    0  9.828827 225.6710    0.9564462   TI.set1
#> 2: directIPW    1 14.841714 308.6067    0.9519073   TI.set1

Estimate the hazard with an IPW-MSM and then map it into a survival estimate, using two regimens and smoothing over intervals of time-points defined by tbreaks:

wts.DT.1 <- getIPWeights(OData = OData, intervened_TRT = "TI.set1", rule_name = "TI1")
wts.DT.0 <- getIPWeights(OData = OData, intervened_TRT = "TI.set0", rule_name = "TI0")
survMSM_res <- survMSM(list(wts.DT.1, wts.DT.0), OData, tbreaks = c(1:8, 12, 16) - 1)

In this particular case the output is a little different, with separate survival tables for each regimen. The output of survMSM is hence a list, with one item for each counterfactual treatment regimen considered during the estimation. The actual estimates of survival are located in the column(s) St.MSM. Note that survMSM output also contains the standard error estimates of survival at each time-point in column(s) SE.MSM. Finally, the output table also contains the subject-specific estimates of the influence-curve (influence-function) in column(s) IC.St. These influence function estimates can be used for constructing the confidence intervals of the counterfactual risk-differences for two contrasting treatments (see help for get_RDs function for more information).

head(survMSM_res[["TI0"]][["estimates"]],2)
#>    est_name time      ht.MSM    St.MSM      SE.MSM rule.name
#> 1:      MSM    0 0.004214338 0.9957857 0.002105970       TI0
#> 2:      MSM    1 0.013068730 0.9827720 0.004100295       TI0
#>                                                                       IC.St
#> 1: 0.004543242,0.004543242,0.004543242,0.004543242,0.004543242,0.004543242,
#> 2: 0.004483868,0.016683119,0.016797415,0.017900770,0.017900770,0.017900770,
head(survMSM_res[["TI1"]][["estimates"]],2)
#>    est_name time     ht.MSM    St.MSM     SE.MSM rule.name        IC.St
#> 1:      MSM    0 0.04355379 0.9564462 0.01521910       TI1 0,0,0,0,0,0,
#> 2:      MSM    1 0.01677797 0.9403990 0.01786105       TI1 0,0,0,0,0,0,

Longitudinal GCOMP (G-formula) and TMLE

Define time-points of interest, regression formulas and software to be used for fitting the sequential outcome models:

tvals <- c(0:10)
Qforms <- rep.int("Qkplus1 ~ CVD + highA1c + N + lastNat1 + TI + TI.tminus1", (max(tvals)+1))
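
Since Qforms holds one formula per time-point, the formulas need not be identical across time-points. A hedged sketch (not part of the original example) that uses a simpler regression, without the lagged treatment term, for the first entry:

Qforms_vary <- Qforms
Qforms_vary[1] <- "Qkplus1 ~ CVD + highA1c + N + lastNat1 + TI"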

To run the iterative means substitution estimator (G-Computation), where all at-risk observations are pooled for fitting each outcome regression (Q-regression):

gcomp_est <- fit_GCOMP(OData, tvals = tvals, intervened_TRT = "TI.set1", Qforms = Qforms)

The output table of fit_GCOMP contains the following information, with the column St.GCOMP containing the survival estimates for each time period:

head(gcomp_est$estimates[],2)
#>    est_name time  St.GCOMP St.TMLE   type    cum.inc              IC.St
#> 1:    GCOMP    0 0.9837583      NA pooled 0.01624168 NA,NA,NA,NA,NA,NA,
#> 2:    GCOMP    1 0.9699022      NA pooled 0.03009778 NA,NA,NA,NA,NA,NA,
#>    fW_fit rule.name
#> 1:   NULL   TI.set1
#> 2:   NULL   TI.set1

To run the longitudinal long-format Targeted Minimum Loss-based Estimation (TMLE), stratified by rule-followers when fitting each outcome regression (Q-regression):

tmle_est <- fit_TMLE(OData, tvals = tvals, intervened_TRT = "TI.set1", Qforms = Qforms)
#> GLM TMLE update cannot be performed since the outcomes (Y) are either all 0 or all 1, setting epsilon to 0
#> GLM TMLE update cannot be performed since the outcomes (Y) are either all 0 or all 1, setting epsilon to 0

The output table of fit_TMLE contains the following information, with the column St.TMLE containing the survival estimates for each time period. In addition, the column SE.TMLE contains the standard error estimates and the column IC.St contains the subject-specific estimates of the efficient influence curve. The latter are useful for constructing confidence intervals of the risk differences for two contrasting treatments (see the help for the get_RDs function for more information).

head(tmle_est$estimates[],2)
#>    est_name time St.GCOMP   St.TMLE   type    cum.inc     SE.TMLE
#> 1:     TMLE    0       NA 0.9839271 pooled 0.01607286 0.003449949
#> 2:     TMLE    1       NA 0.9707676 pooled 0.02923243 0.004492235
#>                                                                             IC.St
#> 1: -0.007292922,-0.007292922,-0.010190141,-0.007292922,-0.007292922,-0.007292922,
#> 2: -0.009469707,-0.009469707,-0.010503891,-0.009469707,-0.009469707,-0.009469707,
#>    fW_fit rule.name
#> 1:   NULL   TI.set1
#> 2:   NULL   TI.set1

To parallelize estimation over several time-points (tvals) for either GCOMP or TMLE, use the argument parallel = TRUE:

require("doParallel")
registerDoParallel(cores = parallel::detectCores())
tmle_est <- fit_TMLE(OData, tvals = tvals, intervened_TRT = "TI.set1", Qforms = Qforms, parallel = TRUE)

Data-adaptive estimation, cross-validation and Super Learning

Nuisance parameters can be modeled with any machine learning algorithm supported by the sl3 R package. For example, for GLMs use the learner Lrnr_glm_fast, for xgboost use Lrnr_xgboost, for h2o GLMs use Lrnr_h2o_glm, for any other ML algorithm implemented in h2o use Lrnr_h2o_grid$new(algorithm = "algo_name"), and for glmnet use Lrnr_glmnet. Together, these learners provide access to a wide variety of ML algorithms, including GLMs, regularized GLMs, Distributed Random Forests (RF), Extreme Gradient Boosting (GBM) and Deep Neural Nets.

Model selection can be performed via V-fold cross-validation or random validation splits. Model stacking and Super Learner combination can be accomplished with the learner Lrnr_sl, by specifying a meta-learner (e.g., Lrnr_solnp). In the example below we define a Super Learner ensemble consisting of several learning algorithms.

First, we define sl3 learners for xgboost, two types of GLMs and glmnet. Then we stack these learners into a single learner called Stack:

library("sl3")
lrn_xgb <- Lrnr_xgboost$new(nrounds = 5)
lrn_glm <- Lrnr_glm_fast$new()
lrn_glm2 <- Lrnr_glm_fast$new(covariates = c("CVD"))
lrn_glmnet <- Lrnr_glmnet$new(nlambda = 5, family = "binomial")
## Stack the above candidates:
lrn_stack <- Stack$new(lrn_xgb, lrn_glm, lrn_glm2, lrn_glmnet)

Next, we define a Super Learner on the above stack by feeding it into a Lrnr_sl object and specifying the meta-learner that will find the optimal convex combination of the learners in the stack (Lrnr_solnp):

lrn_sl <- Lrnr_sl$new(learners = lrn_stack, metalearner = Lrnr_solnp$new())

We now use stremr to estimate the exposure/treatment propensity model with the Super Learner defined above (lrn_sl):

OData <- fitPropensity(OData,
                       gform_CENS = gform_CENS,
                       gform_TRT = gform_TRT,
                       models_TRT = lrn_sl,
                       stratify_CENS = stratify_CENS)

Details on some implemented estimators

Currently implemented estimators include:

  • Kaplan-Meier Estimator. No adjustment for time-varying confounding or informative right-censoring.
  • Inverse Probability Weighted (IPW) Kaplan-Meier (survNPMSM). Also known as the Adjusted Kaplan-Meier (AKME) or the saturated (non-parametric) IPW-MSM estimator of the survival hazard. This estimator inverse-weights each observation based on the exposure/censoring model fits (propensity scores).
  • Bounded Inverse Probability Weighted (B-IPW) Estimator of Survival (directIPW). Estimates the survival directly (without estimating the hazard), also based on the exposure/censoring model fits (propensity scores).
  • Inverse Probability Weighted Marginal Structural Model (survMSM) for the hazard function, mapped into survival. Currently only logistic regression is allowed, with time-points and regime/rule indicators as covariates. This estimator is also based on the exposure/censoring model fits (propensity scores), but allows additional smoothing over multiple time-points and includes optional weight stabilization.
  • Longitudinal G-formula (GCOMP). Also known as the iterative G-Computation formula or Q-learning. Directly estimates the outcome model while adjusting for time-varying confounding. Estimation can be stratified by rule/regime followed or pooled across all rules/regimes.
  • Longitudinal Targeted Minimum-Loss-based Estimator (TMLE). Also known as L-TMLE. Doubly robust and semi-parametrically efficient estimator that de-biases each outcome regression fit with a targeting step, using IPW.
  • Iterative TMLE (iterTMLE) for longitudinal data. Fits sequential G-Computation and then iteratively performs targeting for all pooled Q's until convergence.
  • Infinite-dimensional TMLE (iTMLE) for longitudinal data. Fits sequential G-Computation and performs additional infinite-dimensional targeting to achieve sequential double robustness.

Citation

To cite stremr in publications, please use:

Sofrygin O, van der Laan MJ, Neugebauer R (2017). stremr: Streamlined Estimation for Static, Dynamic and Stochastic Treatment Regimes in Longitudinal Data. R package version x.x.xx

Funding

This work was partially supported through a Patient-Centered Outcomes Research Institute (PCORI) Award (ME-1403-12506). All statements in this work, including its findings and conclusions, are solely those of the authors and do not necessarily represent the views of the Patient-Centered Outcomes Research Institute (PCORI), its Board of Governors or Methodology Committee. This work was also supported through an NIH grant (R01 AI074345-07).

Copyright

The contents of this repository are distributed under the MIT license.

The MIT License (MIT)

Copyright (c) 2015-2017 Oleg Sofrygin 

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

References

[1] L. Breiman. "Stacked regressions". In: Machine learning 24.1 (1996), pp. 49-64.

[2] H. Bang and J. Robins. "Doubly robust estimation in missing data and causal inference models". In: Biometrics 61 (2005), pp. 962-972. DOI: 10.1111/j.1541-0420.2005.00377.x.

[3] M. J. van der Laan, E. C. Polley and A. E. Hubbard. "Super learner". In: Statistical applications in genetics and molecular biology 6.1 (2007).

[4] M. van der Laan and S. Gruber. "Targeted minimum loss based estimation of causal effects of multiple time point interventions". In: The International Journal of Biostatistics 8 (2012). DOI: 10.1515/1557-4679.1370.

[5] A. R. Luedtke, O. Sofrygin, M. J. van der Laan, et al. "Sequential Double Robustness in Right-Censored Longitudinal Models". In: arXiv preprint arXiv:1705.02459 (2017). arXiv: 1705.02459.

stremr's People

Contributors

osofr


stremr's Issues

older version of stremr and new version of sl3 are not compatible

When calling fitPropensity:

With version 0.8.99 of stremr, when the treatment is continuous I get this error:
Error: 'Lrnr_condensier' is not an exported object from 'namespace:sl3'

When I install version 1 or 2, even the binary treatment does not work. All of the cases give me this error:
Error: outvar is not a character vector

My project is on hold until I can solve this problem. Your help is appreciated. Best regards.

Add option for observation weights

Hello,

I mentioned this in person but thought I'd start a quick feature request issue to support observation weights. I believe Mark is on board with simply passing the weights through to the Q & g estimation as well as the fluctuation. For the variance of the IC he suggested that the weights be normalized to sum to 1 (possibly also applied to the Q & g estimation as well, not sure of the specifics).

Personally I could use this for a study where we have 30 million observations but only a few categorical covariates, so we can aggregate the replicated observations and incorporate observation weights to drastically speed up the superlearning & reduce memory usage.

Susan's tmle package doesn't support observation weights but ltmle does, so I'm using that right now. Mark suggested that observation weights would generally be important to support because they can be used to solve many different problems.

Thanks,
Chris

better docs for the output

  • Describe the output results table(s) for each function
  • Define each column and describe what it contains

For example:

     est_name time   sum_Y_IPW    sum_IPW St.directIPW rule.name
 1: directIPW    0    9.828827   225.6710    0.9564462   TI.set1
 2: directIPW    1   14.841714   308.6067    0.9519073   TI.set1
 3: directIPW    2   21.627479   430.0012    0.9497037   TI.set1
 4: directIPW    3   21.627479   606.0012    0.9643112   TI.set1
 5: directIPW    4   43.571094   868.0025    0.9498030   TI.set1
 6: directIPW    5   61.760385  1241.4314    0.9502507   TI.set1
 7: directIPW    6   66.382274  1802.6454    0.9631751   TI.set1
 8: directIPW    7  116.322109  2638.1985    0.9559085   TI.set1
 9: directIPW    8  198.486284  3871.7562    0.9487348   TI.set1
10: directIPW    9  254.941362  5714.0214    0.9553832   TI.set1
11: directIPW   10  254.941362  8456.0006    0.9698508   TI.set1
12: directIPW   11  397.563386 12563.9928    0.9683569   TI.set1
13: directIPW   12 1225.200094 18784.3937    0.9347756   TI.set1
14: directIPW   13 1816.300699 27581.8941    0.9341488   TI.set1
15: directIPW   14 1816.300699 40628.9102    0.9552954   TI.set1
16: directIPW   15 4927.103677 60323.6409    0.9183222   TI.set1
17: directIPW   16 4927.103677 88105.7038    0.9440774   TI.set1

get_MSM_RDs with a single t0

RDtables.t0.1.2year <- get_MSM_RDs(MSM.IPAW, c(3,7), getSEs = TRUE) ## BUG
RDtables.t0.1.year <- get_MSM_RDs(MSM.IPAW, c(3), getSEs = TRUE) ## BUG

Error in se.RDscale.Sdt.K[d1.idx, d2.idx] <- getSE.RD.d1.minus.d2(nID = nID, :
replacement has length zero

Note that when only a single t0 is specified in the argument (value 3), there is an error, while the routine runs without a problem with 2+ t0's.

Fit treatment propensity based only on initial time point

Hello,

I'm trying to conduct an intent-to-treat analysis and am having difficulty fitting the treatment propensity score using only t == 0 (baseline) and then using that estimated baseline treatment propensity across all time points with time-varying censoring.

It looks like when I stratify and only calculate it on t == 0, the treatment propensity score for all remaining time points (1 to 170) may be set to 1 by default? Any recs or hints on how to apply the t = 0 propensity score to all future time points? Censoring on the other hand I do want to be estimated at each time point (or in a pooled manner). Also just fyi at t == 0, 8.7% of the sample is treated.

This is using scrambled data from Romain and attempting to replicate Marcus, J. L., Neugebauer, R. S., Leyden, W. A., Chao, C. R., Xu, L., Quesenberry Jr, C. P., ... & Silverberg, M. J. (2016). Use of abacavir and risk of cardiovascular disease among HIV-infected individuals. JAIDS Journal of Acquired Immune Deficiency Syndromes, 71(4), 413-419.

Here is my rough code:

vars = list(
  id = "ID",
  time = "intnum",
  treatment = "exposure",
  censoring = "censor",
  outcome = "outcome",
  covars_time_varying =
    c("cd4", "I.cd4", "vl", "I.vl", "egfr.low", "I.egfr.low", "diabetes", "htn.meds",
      "llt.meds")
  # covars_baseline will be defined after factors_to_indicators conversion.
)

odata =
  importData(dt,
             ID = vars$id,
             t = vars$time,
             covars = c(vars$covars_baseline, vars$covars_time_varying),
             CENS = vars$censoring,
             TRT = vars$treatment,
             OUTCOME = vars$outcome)

# Try simpler propensity score specification to get a better distribution.
(gform_TRT = paste(vars$treatment, "~", paste(c("eversmoke", "everdrug", "sex_m", "art.naive"), collapse = " + ")))

# For ITT parameter only estimate treatment propensity at baseline.
# (uses rlang::list2 and !! so that we can substitute the treatment variable into the list name)
(stratify_TRT = list2(!!vars$treatment := paste0(vars$time, " == 0L")))

(gform_CENS = paste(vars$censoring, "~", paste(c(vars$covariates), collapse = " + ")))

odata = fitPropensity(odata,
                      gform_TRT = gform_TRT,
                      gform_CENS = gform_CENS,
                      stratify_TRT = stratify_TRT)

# Excessive right skewing: 1st quartile to max are all 1.0
# Min is 6.7%.
# Perhaps because we need to examine only t = 0?
summary(odata$g_preds$g0.A)
qplot(odata$g_preds$g0.A)

Thank you and happy to provide any more context if it would be helpful. Sorry if this is an easy fix.

contrast of joint (dx,bar{n}*) interventions defined by same dx but different bar{n}*

Example: UACR/TI 90days.

dx with x=7.5
bar{n}* = continuous 1, 101010..., 1001001...

In this example, we are interested in an MSM for 3 rules.

survMSM however ignores the fact that each intervention is indexed by a particular monitoring intervention and defines model terms solely based on the various levels of treatment interventions. In this case, there is only one dx so survMSM gives:

MSM.IPAW$MSM.fit
$coef
Periods.0to0_new.d7.5 Periods.1to1_new.d7.5 Periods.2to2_new.d7.5 Periods.3to3_new.d7.5
-5.362056 -3.953137 -4.067354 -3.152000
Periods.4to4_new.d7.5 Periods.5to5_new.d7.5 Periods.6to6_new.d7.5 Periods.7to7_new.d7.5
-6.936838 -3.870630 -4.138810 -2.931822
Periods.8to8_new.d7.5 Periods.9to17_new.d7.5
-5.218133 -5.028533

$linkfun
[1] "logit_linkinv"

$fitfunname
[1] "h2o.glm"

I'll email template code that describes the problem in more detail.

inconsistency between doc and implementation for getIPWeights

Documentation of getIPWeights states:

"The output is person-specific data with evaluated weights, โ€˜wts.DTโ€™, only observation-times with non-zero weight are kept"

The returned object can actually contain person-time obs with cum.IPAW = 0

IPW for Categorical Exposure with 4 levels

IPW_for_Categorical_Exposure_with_4_Levels.docx

fitPropensity produces an error when the exposure is categorical with more than 2 levels ('max' not meaningful for factors). Please see the example below.

stremr version 0.8.99: install the stremr package from GitHub and load the packages


# ----------------------------------------------------------------------
# Install stremr Version 0.8.99 Data
# ----------------------------------------------------------------------
#knitr::opts_chunk$set(echo = TRUE)

library(devtools)
#install_github("osofr/stremr", ref = "experimental_master")

# ----------------------------------------------------------------------
# Get Libraries
# ----------------------------------------------------------------------
library(stremr)
library(data.table)
library(magrittr)
library(h2o)
options(stremr.verbose=TRUE)
sessionInfo()

R version 3.4.2 (2017-09-28)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] rmarkdown_1.8 repmis_0.5 sl3_1.0.0 h2o_3.16.0.2
[5] magrittr_1.5 data.table_1.10.5 devtools_1.13.4 haven_1.1.0
[9] stremr_0.8.99

Get Source Data from another Github repository

Read sampleAD.RData from Soudi00 GitHub repository

library(repmis)

source_data("https://github.com/Soudi00/Multi-Treatment-Causal-Modeling/blob/master/sampleAD.RData?raw=True")

AD = as.data.table(AD, key = c("ID", "SEQ"))

Use importData to prepare the data for propensity score estimation; also define the censoring and exposure regressions.

# ----------------------------------------------------------------------
# Import Data
# ----------------------------------------------------------------------
OData.2  <-  importData(AD, ID = "ID", t_name = "SEQ", 
                        covars = c("CAT_VAR1","CAT_VAR2","CONT_VAR1"),           
                        CENS = c("CNS","ADM_CNS"), 
                        TRT = "TRTN",
                        MONITOR = NULL, OUTCOME = "STATUS",
                        weights = NULL, remove_extra_rows = TRUE,
                        verbose = getOption("stremr.verbose"))

# ----------------------------------------------------------------------
# Look at the input data object
# ----------------------------------------------------------------------
print(OData.2)

# ----------------------------------------------------------------------
# Access the input data
# ----------------------------------------------------------------------
get_data(OData.2)

# ----------------------------------------------------------------------
# Regression formula for Right Censoring and Administrative
# Censoring and  Exposure
# ----------------------------------------------------------------------
gform_CENS <- "CNS + ADM_CNS ~ CAT_VAR1 + CONT_VAR1"
gform_TRT = "TRTN ~ CAT_VAR1 + CAT_VAR2 + CONT_VAR1"

#Error in fitPropensity (not meaningful for factors)

I tried different options in fitPropensity but none of them works for a categorical exposure with more than 2 levels.


# ----------------------------------------------------------------------
# Estimate Propensity Scores
# fitPropensity with all default options produces an error
# ----------------------------------------------------------------------

OData.2 <- fitPropensity(OData.2, gform_CENS = gform_CENS,ngform_TRT = gform_TRT )

Using the default regression formula: TRTN ~ CAT_VAR1_2+CAT_VAR1_3+CAT_VAR1_4+CAT_VAR1_6+CAT_VAR1_7+CAT_VAR2_2+CAT_VAR2_3+CAT_VAR2_4+CAT_VAR2_5+CONT_VAR1
[1] "New 'ModelBinomial' regression defined:"
[1] "P(CNS|CAT_VAR1_2, CAT_VAR1_3, CAT_VAR1_4, CAT_VAR1_6, CAT_VAR1_7, CONT_VAR1);\ outvar.class: binomial;\ Stratify: ;\ N: NA"
[1] "New 'ModelBinomial' regression defined:"
[1] "P(ADM_CNS|CNS, CAT_VAR1_2, CAT_VAR1_3, CAT_VAR1_4, CAT_VAR1_6, CAT_VAR1_7, CONT_VAR1);\ outvar.class: binomial;\ Stratify: CNS == 0;\ N: NA"
[1] "New 'ModelCategorical' regression defined:"
[1] "P(TRTN|CAT_VAR1_2, CAT_VAR1_3, CAT_VAR1_4, CAT_VAR1_6, CAT_VAR1_7, CAT_VAR2_2, CAT_VAR2_3, CAT_VAR2_4, CAT_VAR2_5, CONT_VAR1);\ outvar.class: categorical;\ Stratify: ;\ N: NA"
[1] "fitting the model: P(CNS|CAT_VAR1_2, CAT_VAR1_3, CAT_VAR1_4, CAT_VAR1_6, CAT_VAR1_7, CONT_VAR1);\ outvar.class: binomial;\ Stratify: ;\ N: NA"
[1] "fitting the model: P(ADM_CNS|CNS, CAT_VAR1_2, CAT_VAR1_3, CAT_VAR1_4, CAT_VAR1_6, CAT_VAR1_7, CONT_VAR1);\ outvar.class: binomial;\ Stratify: CNS == 0;\ N: NA"
[1] "fitting the model: P(TRTN|CAT_VAR1_2, CAT_VAR1_3, CAT_VAR1_4, CAT_VAR1_6, CAT_VAR1_7, CAT_VAR2_2, CAT_VAR2_3, CAT_VAR2_4, CAT_VAR2_5, CONT_VAR1);\ outvar.class: categorical;\ Stratify: ;\ N: NA"
Failed on Lrnr_condensier_c("equal.mass", "equal.len", "dhist")_5_20_FALSE_NA_FALSE_NULL
Error in Summary.factor(structure(c(3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, :
'max' not meaningful for factors

sl3 error debugging info:
[1] "Error in Summary.factor(structure(c(3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, : \n โ€˜maxโ€™ not meaningful for factors\n"
attr(,"class")
[1] "try-error"
attr(,"condition")
<simpleError in Summary.factor(structure(c(3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 1L, 2L, 1L, 1L, 2L, 2L, 1L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 3L, 3L, 3L, 1L, 1L, 3L, 3L, 3L, 1L, 3L, 3L, 3L, 1L, 3L, 1L, 1L, 3L, 1L, 3L, 1L, 3L, 1L, 1L, 1L, 1L, 3L, 3L, 3L, 3L, 1L, 1L, 3L, 3L, 3L, 1L, 1L, 3L, 1L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 2L, 2L, 3L, 1L, 1L, 1L, 2L, 2L, 3L, 4L, 2L, 2L, 2L, 2L, 3L, 1L, 2L, 3L, 3L, 1L, 1L, 1L, 2L, 2L, 3L, 4L, 3L, 2L, 3L, 1L, 2L, 3L, 2L, 3L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 1L, 2L, 2L, 3L, 2L, 2L, 2L, 3L, 1L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 1L, 3L, 1L, 1L, 2L, 1L, 3L, 3L, 3L, 1L, 1L, 3L, 3L, 3L, 1L, 3L, 3L, 3L, 3L, 1L, 3L, 1L, 1L, 3L, 1L, 3L, 3L, 3L, 1L, 2L, 1L, 1L, 2L, 1L, 3L, 4L, 2L, 2L, 2L, 2L, 2L, 4L, 2L, 2L, 4L, 4L, 4L, 2L, 2L, 2L, 4L, 2L, 4L, 4L, 4L, 2L, 2L, 2L, 4L, 4L, 4L, 2L, 4L, 2L, 2L, 4L, 2L, 4L, 4L, 4L, 2L, 2L, 2L, 4L, 4L, 2L, 2L, 2L), .Label = c("1", "2", "3", "4"), class = "factor"), na.rm = FALSE): 'max' not meaningful for factors>
...trying to run Lrnr_glm_fast as a backup...
Warning in Ops.factor(y, mu): '-' not meaningful for factors
Warning in Ops.factor(eta, offset): '-' not meaningful for factors
Warning in Ops.factor(y, mu): '-' not meaningful for factors
Warning in Ops.factor(y, mu): '-' not meaningful for factors
Warning in Ops.factor(weights, y): '*' not meaningful for factors
Warning in Ops.factor(y, mu): '-' not meaningful for factors
Warning in Ops.factor(y, mu): '-' not meaningful for factors
Error in private$PsAsW.models[[k_i]]$predictAeqa(newdata = newdata, n = n, : some of the modeling predictions resulted in NAs, which indicates an error in a prediction routine

I tried modeling the treatment with gradient boosting machines and got the same error.


# ----------------------------------------------------------------------
# Fitting treatment model with Gradient Boosting machines:
# ----------------------------------------------------------------------
require("h2o")
h2o::h2o.init(nthreads = -1)
gform_CENS <- "CNS + ADM_CNS ~ CAT_VAR1 + CONT_VAR1"
models_TRT <- sl3::Lrnr_h2o_grid$new(algorithm = "gbm")
OData.2 <- fitPropensity(OData.2, gform_CENS = gform_CENS,
                        gform_TRT = gform_TRT,
                        models_TRT = models_TRT)

# Use `H2O-3` distributed implementation of GLM for treatment model estimator:
models_TRT <- sl3::Lrnr_h2o_glm$new(family = "multinomial")
OData.2 <- fitPropensity(OData.2, gform_CENS = gform_CENS,
                        gform_TRT = gform_TRT,
                        models_TRT = models_TRT)

UACR/TI 90days - continuous N=1 - error

The code below leads to this error:

Error in runglmMSM(OData, wts_data, all_dummies, Ynode, verbose) :
trying to get slot "model" from an object (class "try-error") that is not an S4 object

Note that the function g.Nstatic10001 below implies that the monitoring intervention is 11111 when number.0.between.1s <- 0

# ESTIMATE SURVIVAL FOR ONE TREATMENT REGIMEN WITH STATIC (1,0,0,0,1,0,...) INTERVENTIONS ON MONITORING REGIME
# get the likelihood for the following static g^* N(t): 1,0,0,0,1,0,0,0,1,0
g.Nstatic10001 <- function(){
  function(OdataDT, gn.N = "g0.N", SHIFTED.OUTCOME = "outcome.tplus1", ID = "ID", MONITOR = "N", t = "t", ...){
    ID.expression <- as.name(ID)
    Odata_sel <- OdataDT[, c(ID, MONITOR, t, SHIFTED.OUTCOME, gn.N), with = FALSE]
    number.0.between.1s <- 0 ## every k weeks means k-1 0's between 1's and this should be set to k-1
    N.star.t <- as.integer((Odata_sel[[t]]%%(number.0.between.1s+1))%in%number.0.between.1s) 
    Odata_sel[, g.N := as.numeric(as.integer(Odata_sel[[MONITOR]]) == N.star.t)] 
    return(Odata_sel[["g.N"]])
  }
}
# Define N(t) rule followers under static N.g.star: (1,0,1,0) and a column to the observed data.table in OData:
g.star.N.follow <- g.Nstatic10001()
OData$dat.sVar[, N.star.stat10001 := g.star.N.follow(OData$dat.sVar, gn.N = "g0.N", SHIFTED.OUTCOME = "outcome.tplus1", ID = "StudyID", MONITOR = "N", t = "intnum")]

#1. Obtain weighted data sets by rule:
wts.St.d7 <- getIPWeights(OData, gstar_TRT = "new.d7", gstar_MONITOR = "N.star.stat10001")
wts.St.d7.5 <- getIPWeights(OData, gstar_TRT = "new.d7.5", gstar_MONITOR = "N.star.stat10001")
wts.St.d8 <- getIPWeights(OData, gstar_TRT = "new.d8", gstar_MONITOR = "N.star.stat10001")
wts.St.d8.5 <- getIPWeights(OData, gstar_TRT = "new.d8.5", gstar_MONITOR = "N.star.stat10001")
wts.all <- list(d7 = wts.St.d7, d7.5 = wts.St.d7.5, d8 = wts.St.d8, d8.5 = wts.St.d8.5)
print(object.size(wts.all), units = "MB")
wts.all <- rbindlist(wts.all)
wts.all <- wts.all[!is.na(cumm.IPAW) & !is.na(outcome.tplus1) & (cumm.IPAW > 0), ]
print(object.size(wts.all), units = "MB")

data.table::setDTthreads(20) # will help load data into h2o
# MSM for hazard with regular weights:
t.breaks.byquarter <- c(1:9)-1 ## need to change or will get error due to bins with no event when computing IC - need to add as bug on github
MSM.IPAW <- survMSM(OData, wts_data = wts.all, t_breaks = t.breaks.byquarter, use_weights = TRUE, est_name = "IPAW", getSEs = getSEs)
# MSM for hazard with truncated weights:
MSM.trunc <- survMSM(OData, wts_data = wts.all, t_breaks = t.breaks.byquarter, use_weights = TRUE, trunc_weights = 20, est_name = "IPAWtrunc", getSEs = getSEs)
# crude MSM for hazard without any weights:
MSM.crude <- survMSM(OData, wts_data = wts.all, t_breaks = t.breaks.byquarter, use_weights = FALSE, est_name = "crude", getSEs = getSEs)
# save(list = c("MSM.IPAW", "MSM.trunc", "MSM.crude"), file = "./MSMs.Rdata")

new make_report argument

Currently, a call to make_report_rmd causes a browser or a PDF reader to open automatically and display the report of analysis results. This is problematic when make_report_rmd is called several times from a larger program, as one ends up with a cluttered desktop.

Is it possible to add an argument that prevents the pop-up of the display windows?

sequential randomization

The trial is a sequentially randomized trial with possible re-randomization based on previous response. At baseline, participants are randomized 1:1:1 into three arms, texting intervention, mobile app intervention, and standard of care. At 3 month increments participants in the text and app arms who do not "improve" their risk behaviors are eligible for re-randomization into a third intervention, e-coaching.

Because it's a randomized trial, at each time I know the probability of receiving each treatment given the past. I don't necessarily have to input the known probabilities, but I can't figure out how to properly stratify the fitPropensity command and can't digest what the error messages are actually telling me.

Attached is an example data table.

  • ID = partic. id
  • covariate = a binary baseline covariate
  • Y = the outcome at month 12
  • txt = on the texting intervention
  • app = on the app intervention
  • t = time (0 = baseline, 3, 6, 9, 12 months)
  • monitor = whether they are seen at t + 3
  • risk = risk behavior score; if risk at t+3 is greater than or equal to risk at t (and risk at t != 0) and txt == 1 | app == 1, then they are eligible at t+3 for re-randomization to e-coaching. Y is the sum of risk at 3, 6, 9, and 12.
  • trt = multinomial treatment indicator 1 = txt, 2 = app
  • Y_scale = Y scaled by maximum value of Y
  • trt.setXX = treatment levels for rules I'm interested in (no int, all txt int, all app int, txt + e-coaching when risk doesn't improve, app + e-coaching when risk doesn't improve)
  • eligible = whether or not someone is eligible for re-randomization at t
  • ecoach = on the e-coach intervention
  • coach.tminus1 = on e-coach intervention at previous time point
  • cens = censoring = 0 for everyone, since it seemed like providing monitoring status made stremr happier than censoring + missing values

Here's what I've tried:

gform_MONITOR <- "monitor ~ risk + covariate"
# would perhaps prefer this to be intercept only, but ~1 syntax seems to piss everything off
gform_TRT <- "trt ~ risk"
stratify_TRT <- list(trt = c(
  "t == 0", # baseline randomization prob.
  "t == 3 & eligible == 1", # at time 3 prob. of getting e-coaching amongst those eligible (should be 1/2)
  "t == 6 & eligible == 1", # at time 6 prob. of getting e-coaching amongst those eligible (should be 1/2)
  "t == 9 & eligible == 1", # at time 9, etc... 
  "t == 3 & txt == 1 & eligible == 0", # at time 3, txt folks should stay on txt with prob. 1 if not eligible for e-coach
  "t == 6 & txt == 1 & eligible == 0", # at time 6, txt folks should stay on txt with prob. 1 if not eligible for e-coach
  "t == 9 & txt == 1 & eligible == 0", # etc...
  "t == 3 & app == 1 & eligible == 0", # ditto app people who are not eligible
  "t == 6 & app == 1 & eligible == 0", # etc 
  "t == 9 & app == 1 & eligible == 0", # etc
  "t == 3 & app == 0 & txt == 0", # ditto standard of care people, i.e. trt == 0
  "t == 6 & app == 0 & txt == 0", # etc
  "t == 9 & app == 0 & txt == 0", # etc
  "t == 6 & ecoach.tminus1 == 1",  # once you get to e-coaching, you stay on e-coaching with prob. 1
  "t == 9 & ecoach.tminus1 == 1"))

I'm just wondering how to properly specify these regressions and whether there is a way to make them intercept only or not.

testDT.RData.zip

option: return_wts = TRUE does not work for survNPMSM and only returns results when used in directIPW

Please see the example from the documentation of the functions below:

The return_wts option does not return the weights when the survNPMSM function is called.

----------------------------------------------------------------------

Simulated Data

----------------------------------------------------------------------

data(OdataNoCENS)
OdataDT <- as.data.table(OdataNoCENS, key = c("ID", "t"))

Define lagged N; the first value is always 1 (always monitored at the first time point):

OdataDT[, ("N.tminus1") := shift(get("N"), n = 1L, type = "lag", fill = 1L), by = ID]
OdataDT[, ("TI.tminus1") := shift(get("TI"), n = 1L, type = "lag", fill = 1L), by = ID]

----------------------------------------------------------------------

Define intervention (always treated):

----------------------------------------------------------------------

OdataDT[, ("TI.set1") := 1L]
OdataDT[, ("TI.set0") := 0L]

----------------------------------------------------------------------

Import Data

----------------------------------------------------------------------

OData <- importData(OdataDT, ID = "ID", t = "t", covars = c("highA1c", "lastNat1", "N.tminus1"),
CENS = "C", TRT = "TI", MONITOR = "N", OUTCOME = "Y.tplus1")

----------------------------------------------------------------------

Look at the input data object

----------------------------------------------------------------------

print(OData)

----------------------------------------------------------------------

Access the input data

----------------------------------------------------------------------

get_data(OData)

----------------------------------------------------------------------

Model the Propensity Scores

----------------------------------------------------------------------

gform_CENS <- "C ~ highA1c + lastNat1"
gform_TRT = "TI ~ CVD + highA1c + N.tminus1"
gform_MONITOR <- "N ~ 1"
stratify_CENS <- list(C=c("t < 16", "t == 16"))

----------------------------------------------------------------------

Fit Propensity Scores

----------------------------------------------------------------------

OData <- fitPropensity(OData, gform_CENS = gform_CENS,
gform_TRT = gform_TRT,
gform_MONITOR = gform_MONITOR,
stratify_CENS = stratify_CENS)

----------------------------------------------------------------------

IPW-Adjusted KM or Saturated MSM

----------------------------------------------------------------------

require("magrittr")
AKME.St.1 <- getIPWeights(OData, intervened_TRT = "TI.set1") %>%
survNPMSM(OData, return_wts = TRUE) %$%
estimates
AKME.St.1$wts_data

----------------------------------------------------------------------

Bounded IPW

----------------------------------------------------------------------

IPW.St.1 <- getIPWeights(OData, intervened_TRT = "TI.set1") %>%
directIPW(OData, return_wts = TRUE)
IPW.St.1$wts_data

getIPWeights - need to add a check that user has specified PS model for all nodes on which we intervene

Say the user specifies stratified PS models for the A (resp. C or N) nodes such that the union of the person-time observations on A that fall in each stratum is not equal to the total number of person-time observations in the dataset.

Say that the user then asks stremr to compute an effect estimate for interventions on a subset of the A nodes (resp. C or N) such that some of these person-time observations on A do not fall into any of the strata specified for gA.

Stremr should then return an error message to warn the user that they did not specify a sufficient number of strata for gA. Currently, stremr will return an output which will be incorrect.

Idea to fix this: add an argument check in getIPWeights that verifies that the A (resp. C and N) person-time observations on which the user wishes to intervene are all contained in the union of the person-time observations on A that fall within each stratum specified for gA. If not, stremr should give an error asking the user to modify or extend the strata specified for the PS model for gA.

ID cannot be a factor, all character variables will be ignored

Add a test that the ID variable provided by the user is NOT a factor within importData and return an error to the user otherwise.

Make it clear in the doc that the ID variable should not be provided as a factor. Otherwise, the current implementation will create dummies for every study ID.

Make it clear in the doc that all character variables are ignored by stremr (i.e., they won't be converted to factors).
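
A minimal sketch of the suggested check (hypothetical code, not the package's implementation), assuming data is the input data.table and ID holds the name of the identifier column:

## Stop early if the user-supplied ID column is a factor.
if (is.factor(data[[ID]])) {
  stop("The ID variable '", ID, "' must not be a factor; convert it to integer or character before calling importData().")
}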

factor with 1 level only

When a factor variable can take on only one value, importData creates a dummy of class "list" which cannot be handled by subsequent stremr routines (Error message is produced).

Stremr should catch the problem, print a warning and create a single dummy variable of class "integer".

bug in defineMONITORvars

    DT <- data.table(data, key = c("ID", "t"))

I think "ID" and "t" should be replaced with IDvar and tvar

using stremr without monitoring

Is it possible to use stremr without a monitoring variable? There does not seem to be a default for MONITOR in defineIntervedTRT(), so do I have to construct a pseudo monitoring variable for the function to work?

Thanks

Bug in `QlearnModel` when ML-based regression fails

When doing Q-learning, especially with stratification, a single algorithm might fail. In this case a fallback fit with either glm or speedglm should be performed. For some reason this doesn't happen here (./tests/examples/2_building_blocks_example.R):

The following error is produced:

unable to run randomForest with h2o for: intercept only models or designmat with zero rows or  constant outcome (y) ...
Error in UseMethod("predictP1") : 
  no applicable method for 'predictP1' applied to an object of class "try-error"
In addition: Warning messages:

The code from ./tests/examples/2_building_blocks_example.R:

data(OdataNoCENS)
OdataDT <- as.data.table(OdataNoCENS, key = c("ID", "t"))
OdataDT[, ("N.tminus1") := shift(get("N"), n = 1L, type = "lag", fill = 1L), by = ID]
OdataDT[, ("TI.tminus1") := shift(get("TI"), n = 1L, type = "lag", fill = 1L), by = ID]
OdataDT[, ("TI.set1") := 1L]
OdataDT[, ("TI.set0") := 0L]
OData <- importData(OdataDT, ID = "ID", t = "t", covars = c("highA1c", "lastNat1", "N.tminus1"),
                    CENS = "C", TRT = "TI", MONITOR = "N", OUTCOME = "Y.tplus1")
gform_CENS <- "C ~ highA1c + lastNat1"
gform_TRT = "TI ~ CVD + highA1c + N.tminus1"
gform_MONITOR <- "N ~ 1"
stratify_CENS <- list(C=c("t < 16", "t == 16"))
require("h2o")
h2o::h2o.init(nthreads = -1)
params_TRT = list(fit.package = "h2o", fit.algorithm = "gbm", ntrees = 50,
    learn_rate = 0.05, sample_rate = 0.8, col_sample_rate = 0.8,
    balance_classes = TRUE)
params_CENS = list(fit.package = "speedglm", fit.algorithm = "glm")
params_MONITOR = list(fit.package = "speedglm", fit.algorithm = "glm")
OData <- fitPropensity(OData,
            gform_CENS = gform_CENS, stratify_CENS = stratify_CENS, params_CENS = params_CENS,
            gform_TRT = gform_TRT, params_TRT = params_TRT,
            gform_MONITOR = gform_MONITOR, params_MONITOR = params_MONITOR)
t.surv <- c(0:5)
Qforms <- rep.int("Q.kplus1 ~ CVD + highA1c + N + lastNat1 + TI + TI.tminus1", (max(t.surv)+1))
params_Q = list(fit.package = "h2o", fit.algorithm = "randomForest",
                ntrees = 100, learn_rate = 0.05, sample_rate = 0.8,
                col_sample_rate = 0.8, balance_classes = TRUE)
tmle_est <- fitTMLE(OData, t_periods = t.surv, intervened_TRT = "TI.set1",
            Qforms = Qforms, params_Q = params_Q,
            stratifyQ_by_rule = TRUE)

Replace IPAW with IPW

In the report and in the argument values of the make_report routine.

This is to match the most common terminology used in the literature.

IC routine crash

The IC routine crashes when the MSM parameterization results in no events for a time interval under a given rule.

stremr output

Can you help me understand the output of stremr? In the optimal dynamic treatment rule example, in the output table IPW_estimates, there are the variables: sum_Y_IPAW, sum_all_IPAW, ht, St.IPTW, ht.KM. Can you help me understand how to interpret each of these variables?

allow stratified modeling with Q learning

All the functionality is already there, except that extraction of the prediction results requires writing custom access functions. Specifically, we need to extract the final predictions for each stratum of the Q prediction and combine them into N predictions.

Problem with CVTMLE

Hi, thanks a lot for this wonderful package. I wanted to run CVTMLE, and I am getting this error:
Error in UseMethod("predict_SL") :
no applicable method for 'predict_SL' applied to an object of class "c('Lrnr_sl', 'Lrnr_base', 'R6')"

I guess it has something to do with gridisl::predict_SL.
My function:
params <- gridisl::defModel(estimator = "speedglm__glm")
IPW <- getIPWeights(Odata, intervened_TRT = "Intervention_1", holdout = TRUE)
qwigh <- quantile(IPW[cum.IPAW>0,][["cum.IPAW"]], c(0.99))
gcomp_est <- fit_GCOMP(Odata, tvals = tvals,
                       TMLE = TRUE,
                       CVTMLE = TRUE,
                       Qforms = Qforms,
                       intervened_TRT = "Intervention_1",
                       models = params,
                       stratifyQ_by_rule = FALSE,
                       stratify_by_last = TRUE,
                       trunc_weights = qwigh,
                       fit_method = "cv",
                       byfold_Q = FALSE,
                       IPWeights = IPW)

Regards!

Error when running example code

It seems some definitions are missing in stratify_ when following your example code.

> OData <- fitPropensity(OData, gform_CENS = gform_CENS, gform_TRT = gform_TRT, gform_MONITOR = gform_MONITOR, stratify_CENS = stratify_CENS)
Error in create_subset_expr(outvars = res$outvars, stratify.EXPRS = stratify.EXPRS) : 
  Could not locate the appropriate regression variable(s) within the supplied stratification list stratify_CENS, stratify_TRT or stratify_MONITOR.
The regression outcome variable(s) specified in gform_CENS, gform_TRT or gform_MONITOR were: ( 'C,TI,N' )
However, the item names in the matching stratification list were: ( 'C' )

Return a warning when g^*=NA

Sometimes the user may set the intervention node (g^*) to NA. If this value is actually being applied, then we should produce a warning, since the resulting weights will also be NA.

The best place to catch it is probably here:
https://github.com/osofr/stremr/blob/master/R/DeterministicBinaryOutcomeModel.R#L29-L44

    fit = function(overwrite = FALSE, data, ...) { # Move overwrite to a field? ... self$overwrite
      self$n <- data$nobs
      self$define.subset.idx(data)
      private$probA1 <- data$get.outvar(TRUE, self$gstar.Name)
      # private$.isNA.probA1 <- is.na(private$probA1)
      # self$subset_idx <- rep.int(TRUE, self$n)
      self$subset_idx <- seq_len(self$n)
      private$.outvar <- data$get.outvar(TRUE, self$getoutvarnm) # Always a vector of 0/1
      # private$.isNA.outvar <- is.na(private$.outvar)
      self$is.fitted <- TRUE
      # **********************************************************************
      # to save RAM space when doing many stacked regressions wipe out all internal data:
      # self$wipe.alldat
      # **********************************************************************
      invisible(self)
    },

Or here (private$probA1 is the g^*):
https://github.com/osofr/stremr/blob/master/R/DeterministicBinaryOutcomeModel.R#L53-L67

  predictAeqa = function(newdata, ...) { # P(A^s[i]=a^s|W^s=w^s) - calculating the likelihood for indA[i] (n vector of a`s)
      assert_that(self$is.fitted)
      if (missing(newdata)) {
        indA <- self$getoutvarval
      } else {
        indA <- newdata$get.outvar(self$getsubset, self$getoutvarnm) # Always a vector of 0/1
      }
      assert_that(is.integerish(indA)) # check that observed exposure is always a vector of integers
      probAeqa <- rep.int(1L, self$n) # for missing values, the likelihood is always set to P(A = a) = 1.
      # probA1 <- private$probA1[self$getsubset]
      probA1 <- private$probA1
      probAeqa[self$getsubset] <- probA1^(indA) * (1 - probA1)^(1L - indA)
      self$wipe.alldat # to save RAM space when doing many stacked regressions wipe out all internal data:
      return(probAeqa)
    },
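A minimal sketch of the suggested check (hypothetical, not the package's code), placed right after private$probA1 is assigned in fit():

## Warn if the user-supplied g^* column contains NAs, since the resulting weights will also be NA.
if (anyNA(private$probA1)) {
  warning("The counterfactual intervention column '", self$gstar.Name,
          "' contains NA values; the resulting IP weights will also be NA.")
}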

Allow input of counterfactual (intervention) probabilities for TRT and MONITOR

The current set-up only permits the input of probabilities/indicators of an observation following an intervention-specific treatment and/or monitoring rule (arguments gstar.TRT and gstar.MONITOR).

In some cases it might be more intuitive to have as input the counterfactual values of treatment and monitoring based on some intervention rule (the current functionality of ltmle and tmlenet). For dynamic rules, these would be evaluated for each observation and could also be specified as probabilities of TRT[t]=1 and MONITOR[t]=1, under some intervention rule. In this case, the rule followers/non-followers (gstar.TRT & gstar.N) would be evaluated automatically by comparing the observed TRT/MONITOR to their corresponding counterfactual values/probabilities.

stratifyQ_by_rule

The documentation in fit_GCOMP() for stratifyQ_by_rule says,

Set to 'TRUE' for stratifying the fit of Q (the outcome model) by rule-followers only.

Does this mean that non-rule-followers are excluded at each time point from the hazard estimate? I don't see separate "stratified" estimates.

Thanks,

Defining counterfactual dynamic treatment nodes with multiple dummy exposure

Multivaraite_TRT__4_dummy_exposure_.docx

Install stremr Package Version: 0.8.99 and get Libraries


# ----------------------------------------------------------------------
# Install stremr Version 0.8.99 Data
# ----------------------------------------------------------------------
#knitr::opts_chunk$set(echo = TRUE)

library(devtools)
#install_github("osofr/stremr", ref = "experimental_master")

# ----------------------------------------------------------------------
# Install stremr Version 0.8.99 Data
# ----------------------------------------------------------------------
library(stremr)
library(data.table)
library(magrittr)
library(h2o)
options(stremr.verbose=TRUE)
sessionInfo()

Get Source Data from another Github Repository

library(repmis)

source_data("https://github.com/Soudi00/Multi-Treatment-Causal-Modeling/blob/master/sampleAD.RData?raw=True")

AD = as.data.table(AD, key = c("ID", "SEQ"))

# I have 4 different treatments; should I only use 3 of the dummies in importData, or should I include all of them?


# ----------------------------------------------------------------------
# Define intervention (always on TRT1):
# ----------------------------------------------------------------------
AD[, ("zero.set1") := 0L]
AD[, ("zero.set2") := 0L]
AD[, ("zero.set3") := 0L]
AD[, ("TRT1.set") := 1L]

# ----------------------------------------------------------------------
# Import Data in to stremr object
# ----------------------------------------------------------------------
OData.1  <-  importData(AD, ID = "ID", t_name = "SEQ", 
                        covars = c("CAT_VAR1","CAT_VAR2","CONT_VAR1"),           
                        CENS = c("CNS","ADM_CNS"), 
                        TRT = c("TRT1","TRT2","TRT3","TRT4"),
                        MONITOR = NULL, OUTCOME = "STATUS",
                        weights = NULL, remove_extra_rows = TRUE,
                        verbose = getOption("stremr.verbose"))

# ----------------------------------------------------------------------
# Look at the input data object
# ----------------------------------------------------------------------
print(OData.1)

# ----------------------------------------------------------------------
# Access the input data
# ----------------------------------------------------------------------
get_data(OData.1)

# ----------------------------------------------------------------------
# Model the Right Censoring and Administrative Censoring and Exposure
# ----------------------------------------------------------------------
gform_CENS <- "CNS + ADM_CNS ~ CAT_VAR1 + CONT_VAR1"
gform_TRT = "TRT1+TRT2+TRT3+TRT4 ~ CAT_VAR1 + CAT_VAR2 + CONT_VAR1"

# ----------------------------------------------------------------------
# Fit Propensity Scores
# ----------------------------------------------------------------------

OData.1 <- fitPropensity(OData.1, gform_CENS = gform_CENS,ngform_TRT = gform_TRT )

What should be the dimension of intervened_TRT when we are using multiple dummy treatments?

I have my own defined dynamic treatment patterns of interest (5 dummy variables for the 5 patterns). That is:

  • Always TRT1 (PATH1)
  • Always TRT2 (PATH2)
  • Always TRT3 (PATH3)
  • Start TRT1, switch at any time to TRT3 (PATH4)
  • Start TRT2, switch at any time to TRT3 (PATH5)
# ----------------------------------------------------------------------
#  Error: length(intervened_NODE) not equal to length(NodeNames)
# ----------------------------------------------------------------------

wts.DT.1 <- getIPWeights(OData = OData.1, intervened_TRT="PATH1")

# ----------------------------------------------------------------------
# Error in modelfit.g$getPsAsW.models()[[i]] : subscript out of bounds
# ----------------------------------------------------------------------

wts.DT.1 <- getIPWeights(OData = OData.1, intervened_TRT=c("TRT1.set","zero.set1","zero.set2","zero.set3"))

# ----------------------------------------------------------------------
# using different intervened_TRT didn't make a difference in the result
# ----------------------------------------------------------------------

wts.DT.1 <- getIPWeights(OData = OData.1, useonly_t_TRT="PATH1==1",rule_name ="Only TRT1")
wts.DT.1

wts.DT.2 <- getIPWeights(OData = OData.1, useonly_t_TRT="PATH2==1", rule_name = "Only TRT2")
wts.DT.2
