pzivich / zepid Goto Github PK

Epidemiology analysis package

License: MIT License

Python 100.00%

epidemiology inverse-probability-weights risk-ratio epidemiology-analysis risk-difference g-formula ipw aipw g-computation tmle

zepid's Introduction

zEpid

zEpid is an epidemiology analysis package, providing easy to use tools for epidemiologists coding in Python 3.5+. The purpose of this library is to provide a toolset to make epidemiology e-z. A variety of calculations and plots can be generated through various functions. For a sample walkthrough of what this library is capable of, please look at the tutorials available at https://github.com/pzivich/Python-for-Epidemiologists

A few highlights: basic epidemiology calculations, functional form assessment plots, effect measure plots, and causal inference tools. Implemented estimators include: inverse probability of treatment weights, inverse probability of censoring weights, inverse probability of missing weights, augmented inverse probability of treatment weights, time-fixed g-formula, Monte Carlo g-formula, iterative conditional g-formula, and targeted maximum likelihood (TMLE). Additionally, generalizability/transportability tools are available, including: inverse probability of sampling weights, g-transport formula, and doubly robust generalizability/transportability formulas.

If you have any requests for items to be included, please contact me and I will work on adding any requested features. You can contact me either through GitHub (https://github.com/pzivich), email (gmail: zepidpy), or twitter (@zepidpy).

Installation

Installing:

You can install zEpid using pip install zepid

Dependencies:

pandas >= 0.18.0, numpy, statsmodels >= 0.7.0, matplotlib >= 2.0, scipy, tabulate

Module Features

Measures

Calculate measures directly from a pandas DataFrame object. Implemented measures include: risk ratio, risk difference, odds ratio, incidence rate ratio, incidence rate difference, number needed to treat, sensitivity, specificity, population attributable fraction, and attributable community risk.

Measures can be directly calculated from a pandas DataFrame object or using summary data.
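As an illustration of the summary-data route, here is a minimal sketch of a risk ratio with a Wald confidence interval computed from the four cells of a 2x2 table. The function name `risk_ratio_summary` and its arguments are hypothetical and are not zEpid's API; the sketch only shows the arithmetic behind the measure.

```python
import math
from statistics import NormalDist


def risk_ratio_summary(a, b, c, d, alpha=0.05):
    """Risk ratio and Wald CI from 2x2 counts (hypothetical helper, not zEpid's API).

    a: exposed with outcome      b: exposed without outcome
    c: unexposed with outcome    d: unexposed without outcome
    """
    risk_exposed = a / (a + b)
    risk_unexposed = c / (c + d)
    rr = risk_exposed / risk_unexposed
    # Standard error of log(RR) for cumulative-incidence data
    se_log_rr = math.sqrt(1 / a - 1 / (a + b) + 1 / c - 1 / (c + d))
    z = NormalDist().inv_cdf(1 - alpha / 2)
    lower = math.exp(math.log(rr) - z * se_log_rr)
    upper = math.exp(math.log(rr) + z * se_log_rr)
    return rr, lower, upper


rr, lower, upper = risk_ratio_summary(a=25, b=25, c=10, d=40)
print(rr, lower, upper)
```

The interval is built on the log scale and exponentiated, which keeps the bounds positive.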

Other handy features include: splines, Table 1 generator, interaction contrast, interaction contrast ratio, positive predictive value, negative predictive value, screening cost analyzer, counternull p-values, convert odds to proportions, and convert proportions to odds.

For guided tutorials with Jupyter Notebooks: https://github.com/pzivich/Python-for-Epidemiologists/blob/master/3_Epidemiology_Analysis/a_basics/1_basic_measures.ipynb

Graphics

Uses matplotlib in the background to generate some useful plots. Implemented plots include: functional form assessment (with statsmodels output), p-value function plots, spaghetti plot, effect measure plot (forest plot), receiver-operator curve, dynamic risk plots, and L'Abbe plots.

For examples see: http://zepid.readthedocs.io/en/latest/Graphics.html

Causal

The causal branch includes various estimators for causal inference with observational data. Details on currently implemented estimators are below:

G-Computation Algorithm

Current implementation includes: time-fixed exposure g-formula, Monte Carlo g-formula, and iterative conditional g-formula

Inverse Probability Weights

Current implementation includes: IP Treatment W, IP Censoring W, and IP Missing W. Diagnostics are also available for IPTW. IPMW supports monotone missing data

Augmented Inverse Probability Weights

Current implementation includes the augmented-IPTW estimator described by Funk et al 2011 AJE

Targeted Maximum Likelihood Estimator

TMLE can be estimated through standard logistic regression model, or through user-input functions. Alternatively, users can input machine learning algorithms to estimate probabilities. Supported machine learning algorithms include sklearn

Generalizability / Transportability

For generalizing results or transporting to a different target population, several estimators are available. These include inverse probability of sampling weights, g-transport formula, and doubly robust formulas

Tutorials for the usage of these estimators are available at: https://github.com/pzivich/Python-for-Epidemiologists/tree/master/3_Epidemiology_Analysis/c_causal_inference

G-estimation of Structural Nested Mean Models

Single time-point g-estimation of structural nested mean models is supported.

Sensitivity Analyses

Includes trapezoidal distribution generator, corrected Risk Ratio

Tutorials are available at: https://github.com/pzivich/Python-for-Epidemiologists/tree/master/3_Epidemiology_Analysis/d_sensitivity_analyses

zepid's Issues

Write tests

Need to write tests for all functionalities

Add plot function to measure classes

Add some type of plot functionality to measure classes, like lifelines Cox PH model

Should be a function that all measure classes can access. Option to log scale plots (default is linear scale)

Stochastic Treatment TMLE

This is another valuable addition to TMLE (that I also need as part of a project I am working on). Essentially, this would allow more complex treatments than treat-all vs. treat-none, similar to custom treatments in the g-formula.

It shifts the distribution of the probability of A. However, the single-step convergence of TMLE is no longer valid. I would need to iteratively estimate Q* until epsilon converges to 0. This should be easy enough. After convergence, the remainder of the TMLE procedure follows

Likely best if I make this separate from TMLE

Starting points;
https://github.com/tlverse/tmle3shift
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4117410/#SD1

causal_comparisons.py in the correct spot?

The file docs/causal_comparisons.py seems out of place (and references some files on your local machine). If this is intended as an example, I suggest moving to an examples/ folder and converting to a jupyter notebook.

Bayesian G-Formula

https://www.researchgate.net/publication/287250321_A_Bayesian_approach_to_the_g-formula

Most likely make a new class. Valid for both time-fixed and longitudinal

Add sample longitudinal data set

For sequential regression g-formula examples. Also will be used later for LTMLE

Update standardized difference calculator

Not discussed in the Austin, Stuart article is the formula for categorical SMD. I will need to update this to work as intended.

Currently standardized_mean_differences and plot_love assume that variables are either binary or continuous. There is no option for variables to be categorical. This is a problem, since the formulation for categorical variables is distinct from binary variables

Will need to interact with patsy's C(...) functionality to detect variable types properly.

Reference
http://support.sas.com/resources/papers/proceedings12/335-2012.pdf
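
A minimal sketch of the categorical standardized difference, assuming the multinomial-covariance form from the SAS paper referenced above: with category proportions T (treated) and C (control), one level is dropped as reference and d = sqrt((T - C)' S^-1 (T - C)), where S averages the two groups' multinomial covariance matrices. The function name `categorical_smd` is hypothetical and this is not the current zEpid implementation.

```python
import numpy as np


def categorical_smd(props_treated, props_control):
    """Standardized difference for a K-level categorical variable (illustrative sketch).

    Both inputs hold the K category proportions (each summing to 1).
    The first category is dropped as the reference level.
    """
    t = np.asarray(props_treated, dtype=float)[1:]
    c = np.asarray(props_control, dtype=float)[1:]
    # Off-diagonal terms: averaged multinomial covariances, -p_k * p_l
    s = -(np.outer(t, t) + np.outer(c, c)) / 2.0
    # Diagonal terms: averaged multinomial variances, p_k * (1 - p_k)
    np.fill_diagonal(s, (t * (1 - t) + c * (1 - c)) / 2.0)
    diff = t - c
    return float(np.sqrt(diff @ np.linalg.solve(s, diff)))


# With two levels, the result reduces to the usual binary formula
binary = categorical_smd([0.6, 0.4], [0.5, 0.5])
three_level = categorical_smd([0.5, 0.3, 0.2], [0.4, 0.3, 0.3])
print(binary, three_level)
```

A category with zero proportion in both groups makes S singular, so such levels would need to be dropped before calling the function.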

Add G-formula

One lofty goal is to implement the G-formula. Would need to code two versions: time-fixed and time-varying. The Chapter by Robins & Hernan is a good reference. I have code that implements the g-formula using pandas. It is reasonably fast.

TODO: generalize to a class, allow input models then predict, need to determine how to allow users to input custom treatment regimes (all/none/natural course are easy to do), compare results (https://www.ncbi.nlm.nih.gov/pubmed/25140837)

Time-fixed version will be relatively easy to write up

Time-varying will need the ability to specify a large amount of models and specify the order in which the models are fit.

Note: I am also considering reorganizing in v0.2.0 so that IPW/g-formula/doubly robust are all contained within a folder called causal, rather than adding to the current ipw folder

Suggestion: use NamedTuples in calc utils

This became clear in reading the tests like:

    def test_match_sas_ci(self):
        sas_ci = (0.361409618, 0.638590382)
        r = risk_ci(25, 50, confint='wald')
        npt.assert_allclose(r[1:3], sas_ci)

It's not clear what r[1:3] represents, or r[0], etc.

My suggestion is to create a NamedTuple at the top of the file:

from typing import NamedTuple
...

class CalcResults(NamedTuple):
    name: str
    point_estimate: float
    lower_bound: float
    upper_bound: float
    standard_error: float

def incidence_rate_ci(events, time, alpha=0.05):
    ...

    return CalcResults('incidence_rate', ir, lower, upper, sd)

and then users (and yourself) can access elements easier:

result = incidence_rate_ci(...)
result.point_estimate
result.standard_error

A NamedTuple is like a light-weight class - another option is lean in and create a full-on class with methods, etc.
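
For reference, a self-contained sketch of the suggestion applied to a Wald risk confidence interval. `risk_ci_sketch` is a hypothetical stand-in, not the library's `risk_ci`, and the field names follow the `CalcResults` proposal above.

```python
import math
from statistics import NormalDist
from typing import NamedTuple


class CalcResults(NamedTuple):
    name: str
    point_estimate: float
    lower_bound: float
    upper_bound: float
    standard_error: float


def risk_ci_sketch(events, total, alpha=0.05):
    """Wald confidence interval for a risk (hypothetical stand-in for risk_ci)."""
    risk = events / total
    se = math.sqrt(risk * (1 - risk) / total)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return CalcResults("risk", risk, risk - z * se, risk + z * se, se)


result = risk_ci_sketch(25, 50)
# Fields are read by name instead of by position
print(result.lower_bound, result.upper_bound)
```

With this layout, the SAS comparison in the test could check `result.lower_bound` and `result.upper_bound` directly rather than slicing the tuple.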

Implement permutation weighting

I think it would fit well in ipw/, and the method looks quite easy to implement.

https://arxiv.org/pdf/1901.01230.pdf

Add density plots to IPTW diagnostics

I want to add density plots as an alternative to histograms in the IPTW probability diagnostics. Should be easy with SciPy kernel density. Likely use gaussian_kde
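
A minimal sketch of the density calculation, assuming the predicted probabilities are available as an array. `scipy.stats.gaussian_kde` is the real SciPy class; the helper name `probability_density` and the simulated probabilities are illustrative only.

```python
import numpy as np
from scipy.stats import gaussian_kde


def probability_density(probabilities, grid_points=101):
    """Evaluate a Gaussian kernel density estimate on an evenly spaced grid over [0, 1]."""
    grid = np.linspace(0, 1, grid_points)
    density = gaussian_kde(probabilities)(grid)
    return grid, density


# Simulated predicted probabilities, standing in for fitted model output
rng = np.random.default_rng(0)
treated = rng.beta(4, 2, size=200)
untreated = rng.beta(2, 4, size=200)

grid, dens_treated = probability_density(treated)
_, dens_untreated = probability_density(untreated)
```

Each curve could then be drawn with `ax.plot(grid, density)` on a matplotlib axes, one line per treatment group.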

Add time-varying RD/RR plots

Reminder to create time-varying Risk Difference / Risk Ratio plots in the vein of Table 3 in
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4325676/ as recommended by
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4325680/

Inputs should be two pandas Series for the risks indexed by time (like 1 - km.survival_function_ from a fitted lifelines Kaplan-Meier model). Duplicated risks should be dropped, the Series merged by index (time), forward filled, then plotted. Include as part of 0.1.4

Issues to consider: confidence intervals (if IP-weighted would need to be bootstrapped), scale options (like effectmeasureplot), LOWESS curve options (like functionalform).
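
The data preparation described above (drop duplicated risks, merge by time, forward fill) can be sketched as follows. The function name `risk_difference_over_time` and the column names are hypothetical, and plotting and confidence intervals are left out.

```python
import pandas as pd


def risk_difference_over_time(risk1, risk0):
    """Align two risk Series (indexed by time) and compute the risk difference at each time."""
    # Keep only the first time each risk value appears (risk functions are step functions)
    r1 = risk1.drop_duplicates(keep="first").rename("r1")
    r0 = risk0.drop_duplicates(keep="first").rename("r0")
    # Outer-merge on the time index, then carry the last observed risk forward
    merged = pd.concat([r1, r0], axis=1).sort_index().ffill()
    merged["rd"] = merged["r1"] - merged["r0"]
    return merged


risk1 = pd.Series([0.0, 0.1, 0.1, 0.3], index=[0, 1, 2, 3])
risk0 = pd.Series([0.0, 0.05, 0.2], index=[0, 2, 4])
merged = risk_difference_over_time(risk1, risk0)
print(merged)
```

If one Series starts later than the other, its leading rows stay missing after the forward fill, so both inputs should begin at the same origin time.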

IPMW for Non-monotone Missing Data

Similar to the resources in #55 it would be nice to have both potential options for IPMW implemented (this is the framework that something like AIPMW would use anyways).

For monotone, need a recursive process to estimate each probability. Then calculate the weights.

Tchetgen has the unconstrained version for nonmonotone data. See his paper referenced in #55
The real difficulty for implementation will be the Bayesian Constrained (BC) estimator. This estimation process uses Bayesian tools to constrain the weights. The rationale for the BC is that the unconstrained version does not converge well

Framework:
IPMW(..., missing=['var1', 'var2'], monotone=True)
For the list of missing variables, each would have its respective probability of observation calculated (in the monotone scenario). Their order is key for the monotonic missing data
For nonmonotonic (monotone=False), the unconstrained log-likelihood would be maximized based on the calculated observed variable

To implement:

Write monotone detector (to check users entered the data correctly)
start with monotone, since it is the easier case. Rely on loop through all the missing variables
Create example data set with monotone missing data. Use to write tests compared to R
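
A minimal sketch of the monotone detector from the list above, assuming the user passes the variables in their intended order. The function name `is_monotone_missing` is hypothetical.

```python
import numpy as np
import pandas as pd


def is_monotone_missing(df, missing):
    """Return True if missingness is monotone for the ordered variables in `missing`.

    Monotone here means: once a variable is missing in a row, every later
    variable in the list is missing in that row too.
    """
    indicators = df[missing].isna().to_numpy().astype(int)
    # Moving left to right, the missing indicator may only go from 0 to 1
    return bool((np.diff(indicators, axis=1) >= 0).all())


monotone = pd.DataFrame({"var1": [1.0, 2.0, np.nan, 4.0],
                         "var2": [1.0, np.nan, np.nan, 0.0]})
nonmonotone = pd.DataFrame({"var1": [1.0, np.nan],
                            "var2": [np.nan, 1.0]})

print(is_monotone_missing(monotone, ["var1", "var2"]))
print(is_monotone_missing(nonmonotone, ["var1", "var2"]))
```

The check depends on the order given: the first data set is monotone for `['var1', 'var2']` but not for the reversed list.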

Add LTMLE

Need to add a longitudinal version of TMLE https://www.ncbi.nlm.nih.gov/pubmed/28992064

Add as LTMLE in the doubly robust branch

Transition from AIPW to AIPTW

For clarity, the current AIPW will transition to AIPTW. I will need to update this and generate a user warning for the upcoming change. Mentioned in #55

AIPW will no longer be supported in v0.5

Name change is necessary for eventual addition of Augmented IPW for missingness (AIPMW). Additionally, the proposed naming would better match the naming conventions in ipw/

G-formula estimation

Currently, the g-formula uses the Monte Carlo estimation procedure. An alternative is to use sequential regression, which has the advantage of needing fewer models to be specified. Kreif et al. 2017 has a nice description of sequential regression estimation. This is the same paper that describes LTMLE

Network TMLE

Add network TMLE for causal inference with network data

https://arxiv.org/abs/1705.08527

TimeVaryGFormula in Cython

To speed up the TimeVaryGFormula, I should try cython to see how big of a speed boost I can get. Currently, SAS destroys python in regards to time to complete the g-formula. Keil et al. is a great example of this

Transition ipw to causal branch

In version 0.2.0, ipw will be moved to a new branch labelled causal. Reason for transition is to better group IPW with similar methods (doubly robust, G-computation)

AIPW for nonmonotone missing data

IPMW handles only monotone missing data (only a single variable with missing data). This is fine, but data missing for multiple variables simultaneously is more common in practice. There has been a recent proposal to use an AIPW to correct for missing data (under the MAR assumption)

Papers say it is easy to implement with existing software. Will have to see how true this is...

https://academic.oup.com/aje/article/187/3/585/4642953
https://amstat.tandfonline.com/doi/abs/10.1080/01621459.2016.1256814#.XDOLZVxKhGM

Will require the renaming of the current AIPW to AIPTW (something I have been meaning to do)

Add option to include weights in GFormula

Need to add option for weights in g-formula. Would allow for sampling weights or other options (IPCW / IPMW)

Add stratify option to summary measures

For all the basic summary measures, it might be useful to include a stratify option. Default would be None. If specified with a variable, it would calculate the summary measure for each stratum and then provide the Mantel-Haenszel estimate for that measure. It should also conduct the homogeneity test (and warn users if homogeneity seems to be violated).

Plot should become more complex (plot all the stratified and summary).

Need to think about how to use for multivariate exposures...
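
As a sketch of the pooled estimate mentioned above, here is the Mantel-Haenszel risk ratio computed from per-stratum counts. The function name and the tuple layout are hypothetical; the homogeneity test and variance are not included.

```python
def mantel_haenszel_rr(strata):
    """Mantel-Haenszel risk ratio across strata (illustrative sketch).

    strata: iterable of (a, n1, c, n0) tuples, where
      a  = exposed with outcome,   n1 = total exposed,
      c  = unexposed with outcome, n0 = total unexposed.
    """
    numerator = 0.0
    denominator = 0.0
    for a, n1, c, n0 in strata:
        total = n1 + n0
        numerator += a * n0 / total
        denominator += c * n1 / total
    return numerator / denominator


# Both strata have a stratum-specific risk ratio of 2
strata = [(10, 100, 5, 100), (40, 200, 10, 100)]
pooled = mantel_haenszel_rr(strata)
print(pooled)
```

When every stratum shares the same risk ratio, as in this example, the pooled estimate equals that common value.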

Add Survival TMLE

Add a survival-analysis TMLE. Should be similar to the LTMLE, but need to find a good example

Add survival analysis implementation of AIPW

Update website documentation

Need to review all the documentation on the website. Additionally, I have a wishlist of items I would like to add.

Add math formulas for the various estimators. Need to look up math for .rst
Create new section of documentation that goes over all the documentation for each function. Much like how pandas is set up

Recommendations:

Follow the format of NetworkX's documentation but retain the narrative structure as the front-end
Update the documentation to reflect rST, so it can all be rendered to the website

DeprecationWarning

statsmodels v0.10 raises the following warning;

DeprecationWarning: Calling Family(..) with a link class as argument is deprecated. Use an instance of a link class instead.

Need to update the various parts of code to avoid this warning

Travis CI and matplotlib

I got Travis CI to check that all plots return axes objects once. Now it's having some trouble...

Everything passes locally and Travis is using the same version of matplotlib that I am using locally. Maybe something with the OS?...

Network G-Formula

Another for the wishlist. This will require me to learn Gibbs sampling to implement. Will also have to add NetworkX to the list of dependencies (or optional dependency)

Reference:
https://arxiv.org/pdf/1709.01577.pdf

Branch Plan:
either

---causal
      |
      -gformula

---causal
      |
      -interference

Second option might be better for the NetworkX optional import

Verification:
Probably just adapt some existing network simulation code I have

Other:
Need to add NetworkX as a(n) (optional) dependency

G-estimation of Structural Nested Models

Add SNM to the zepid.causal branch. After this addition, all of Robins' g-methods will be implemented in zEpid.

SNM are discussed in the Causal Inference book (https://www.hsph.harvard.edu/miguel-hernan/causal-inference-book/) and The Chapter. SAS code for search-based and closed-form solvers is available at the site. Ideally will have both implemented. Will start with time-fixed estimator

Add interference

Later addition, but since statsmodels 0.9.0 has GLIMMIX, I would like to add something to deal with interference for the causal branch. I don't have any part of this worked out, so I will need to take some time to really learn what is happening in these papers

References:
https://www.ncbi.nlm.nih.gov/pubmed/21068053
https://onlinelibrary.wiley.com/doi/abs/10.1111/biom.12184
https://github.com/bsaul/inferference

Branch plan:

---causal
      |
      -interference

Verification:
inferference the R package has some datasets that I can compare results with

Other:
Will need to update requirements to need statsmodels 0.9.0

Update Documents

Reminder to update and ensure all function documentation is up-to-date

Verify all functions

Reminder to verify all functions. Checklist of what has been verified

Base functions

Calculation Branch

Graphics

func_form_plot
effectmeasure_plot
pvalue_plot

Causal

Sensitivity Analyses

rr_corr
trapezoidal

Stochastic Treatment g-formula

It might be good to add a stochastic treatment plan g-formula. Good preparation for me before stochastic TMLE (#52). Maybe add something like .fit_stochastic()

At this point, I would only add it to the TimeFixedGFormula. Uses a re-sampling procedure. Have some parameter like resample to set. Default should be 100 or something. So stochastic fit would take the mean marginal outcome of the treatment plan

Complexities:

Easy to naively treat the population at p percent. Need to think about how to implement treating those with X=x at p and those with X=y at q
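
The re-sampling procedure described above can be sketched as follows, assuming the outcome model has already produced each individual's predicted outcome under treatment and under no treatment. The function name, its arguments, and the example predictions are hypothetical; this is the simple case where everyone is treated with the same probability p.

```python
import random


def stochastic_marginal_outcome(pred_treated, pred_untreated, p, resamples=100, seed=None):
    """Mean marginal outcome when each individual is treated with probability p.

    pred_treated / pred_untreated: per-individual predicted outcomes under
    treatment and no treatment (assumed to come from a fitted outcome model).
    """
    rng = random.Random(seed)
    means = []
    for _ in range(resamples):
        total = 0.0
        for y1, y0 in zip(pred_treated, pred_untreated):
            # Randomly assign treatment, then take the matching prediction
            total += y1 if rng.random() < p else y0
        means.append(total / len(pred_treated))
    # Average the marginal outcome over all resamples
    return sum(means) / len(means)


pred_treated = [0.2, 0.4, 0.6, 0.8]
pred_untreated = [0.1, 0.2, 0.3, 0.4]
estimate = stochastic_marginal_outcome(pred_treated, pred_untreated, p=0.5,
                                       resamples=2000, seed=1)
print(estimate)
```

Treating subgroups at different probabilities (X=x at p, X=y at q) would amount to replacing the single `p` with a per-individual probability.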

IPCW with Missing Observations

Found an error in the IPCW code that caused it to raise an error for missing predicted values. I have updated the code, but need to verify it.

TMLE prediction models with ML

To allow sklearn ML to generate predictions (or whatever the user would like, e.g. SuPyLearner), the model statement should allow fitted ML models. Due to how sklearn runs predictions (NumPy arrays), I might need to learn patsy to pull out the correct variables from a pandas DataFrame

conf.py missing?

Working on lifelines docs, I was editing my conf.py file - which is a config file for Sphinx and Readthedocs. I don't see a conf.py in zepid, which makes me wonder how zepid.readthedocs.io is even working...

Add TMLE to website

Still need to add TMLE to the EffectMeasurePlot to show difference between methods

Add summary to EffectMeasurePlot

As in forest plots, zEpid should combine the estimates from multiple studies into a single summary estimate (and add that estimate to the bottom of the plot).

This would be for a systematic review. Currently, I made the function to summarize stratified results or results across several models. This would be a further generalization.

I should also clean up this code while I am at it (I wrote it a long time ago and it can be improved). Also it would be ideal to remove t_adjuster and come up with some algorithm to optimally space the table (but leave the option for fine adjustments)

Support for continuous treatments in TimeFixedGFormula

Following AJPH 2016, add support for continuous treatments. Can verify against the simulated data (and R's results)

Looks like it might be a valuable time to add stochastic treatments. This would assign treatment with some probability to individuals. It would repeat the process multiple times (n=100+) and obtain the average of those treatments. This is useful for continuous treatments

Add custom_model to IPTW

Currently IPTW uses parametric logit models to estimate the weights. Like TMLE, we should allow users to specify custom models. While bootstrapped variances are invalid (maybe worth generating a warning?), the variance from GEE is valid for IPTW with machine learning models. This would allow semi-parametric weight estimation

Also not too hard to add, since the code to implement already lives within TMLE

TMLE additions

This is a longer-term project. As I am reading through Targeted Learning, I will add to this list features I would like to add, along with important notes that I have gleaned from the book.

Add support for continuous outcomes (Targeted Learning Ch 7, pg 125-126)
For a continuous Y, it must be bounded between 0 and 1 before starting the process. Transform using Y* = (Y-a) / (b-a), where a = min(Y), b = max(Y)
Psi = (b-a) Psi* to convert from the bounded causal effect back to the original scale
Use some alpha to keep logit(Y*) from being undefined. alpha = 0.0005 maybe?
~~Add support for F(A=1) and F(A=0)~~
~~Need to look up formulation in targeted learning book and IC~~
NOT IMPLEMENTING. James Robins showed in 1988 (Confidence intervals for causal parameters) that the corresponding confidence intervals may only be valid assuming deterministic potential outcomes. As a result, I have decided not to implement this feature (since I don't want to make that assumption)
Natively handle missing data
R tmle will be a good reference for this
Add an option to specify a missing data model. This should be optional to include
Mediation analysis (direct, indirect, total) (Targeted Learning Ch 8)
Can add this, but I am not largely convinced of current mediation analysis. I know that people do like to use it though...
pg 139 has some good notes
Collaborative TMLE (CTMLE)
g-model should be based on TMLE of Q, not the fit to g. CTMLE is an approach to formalize this. Might try to add as an additional class object (depending on utility and substantive differences)
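
The continuous-outcome transformation from the list above can be sketched as follows. How alpha is applied is not specified in the notes; clipping Y* into [alpha, 1 - alpha] is an assumption made here, and both function names are hypothetical.

```python
def bound_outcome(y, alpha=0.0005):
    """Rescale a continuous outcome to the unit interval: Y* = (Y - a) / (b - a)."""
    a, b = min(y), max(y)
    y_star = [(value - a) / (b - a) for value in y]
    # Assumption: clip away from 0 and 1 so that logit(Y*) stays defined
    y_star = [min(max(value, alpha), 1 - alpha) for value in y_star]
    return y_star, a, b


def unbound_effect(psi_star, a, b):
    """Convert a bounded effect back to the original scale: Psi = (b - a) * Psi*."""
    return (b - a) * psi_star


y_star, a, b = bound_outcome([10.0, 20.0, 30.0])
psi = unbound_effect(0.25, a, b)
print(y_star, psi)
```

The back-transformation only rescales by (b - a) because the shift by a cancels when two bounded means are differenced.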

Optimize MonteCarloGFormula

bottleneck is a pandas dependency. Might be useful to consider a speed up for TimeVaryGFormula without having to add Cython to dependencies

Allow for plot_kde and plot_boxplot to plot log-odds

I'd actually like to get your expert opinion on this question and answer: https://stats.stackexchange.com/questions/378876/why-is-it-easier-and-just-as-valid-to-assess-overlap-with-logit-propensities/

With that in mind, should we allow users to set if they want raw probabilities or log-odds?
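
For reference, the log-odds (logit) transform in question is a one-liner. The helper name `to_log_odds` is hypothetical and is not an existing option of plot_kde or plot_boxplot.

```python
import math


def to_log_odds(probabilities):
    """Convert probabilities in (0, 1) to log-odds: log(p / (1 - p))."""
    return [math.log(p / (1 - p)) for p in probabilities]


log_odds = to_log_odds([0.2, 0.5, 0.8])
print(log_odds)
```

The transform stretches values near 0 and 1, which is why overlap in the tails can be easier to see on this scale; probabilities of exactly 0 or 1 are undefined and would need handling.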

IPTW Balance Diagnostic

There is one diagnostic I still don't have fully addressed, and that is balance. Figure 2 in Austin Stuart 2015 shows the standardized differences for all variables. I have a calculator for standardized differences, but it isn't set up fully.

New function would have to (0) pull variables from patsy, (1) determine variable type (binary, continuous), (2) calculate standardized differences, (3) generate plot

Also should store everything in a DataFrame object to sort and plot by standardized differences (as in Austin Stuart 2015)

Add Risk / Risk Ratio to TMLE

This was planned. Now that I understand what a "clever covariate" is, it shouldn't be too bad. Need to bake in some conditionals

AIPW

Augmented IPW is the proper name for the Funk et al formula. Need to update this for clarity

Update website

Reorganization:

TMLE bound IPW

R-TMLE has the default option of bounding the estimated IPW to correct for near-positivity violations. Essentially, values below the 2.5th percentile or above the 97.5th percentile are set to the corresponding percentile value.

I would like to add this option in. Should be simple. Add in something like bound_ipw=True somewhere to control this. I still need to decide whether to default to this or not. R-TMLE uses this as the default.

R TMLE also allows the bounds to be set to user-specified values. This should be the alternative. Maybe if-then logic to go through; True-> no bounding, False-> 2.5/97.5 bounding, float-> bounding at specification
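
A minimal sketch of the bounding step as described above, with percentile bounds by default and user-specified bounds as the alternative. The function name `bound_values` and its `bounds` argument are hypothetical and do not settle how a `bound_ipw` option would be exposed.

```python
import numpy as np


def bound_values(values, bounds=None):
    """Truncate values at the 2.5th/97.5th percentiles, or at user-specified bounds."""
    values = np.asarray(values, dtype=float)
    if bounds is None:
        # Default: percentile-based bounds computed from the data
        lower, upper = np.percentile(values, [2.5, 97.5])
    else:
        lower, upper = bounds
    return np.clip(values, lower, upper)


fixed = bound_values([0.01, 0.5, 0.99], bounds=(0.05, 0.95))
by_percentile = bound_values(np.arange(101))
print(fixed, by_percentile.min(), by_percentile.max())
```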

Splines and Functional Form

When looking at functional form plots for spline models, they display oddly. Either something is going wrong in the generation of the matplotlib object or the spline calculation is doing something unexpected....

IPCW Prep for late entry

As it stands, ipcw_prep does not drop the time periods that occurred before late entry. Need to add in a line to drop those observations before the entrance time (should be an easy fix)
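
A sketch of the fix, assuming long-format data with one row per person-period. The column names (`t_start`, `t_end`, `enter`) and the function `drop_before_entry` are hypothetical, and the rule used here (keep a period only if it ends after the entry time) is an assumption about what "before late entry" should mean.

```python
import pandas as pd


def drop_before_entry(df, enter, t_end):
    """Drop person-periods that end at or before the individual's entry time."""
    return df.loc[df[t_end] > df[enter]].copy()


data = pd.DataFrame({
    "id": [1, 1, 1, 2, 2],
    "t_start": [0, 1, 2, 0, 1],
    "t_end": [1, 2, 3, 1, 2],
    "enter": [2, 2, 2, 0, 0],  # id 1 enters late at time 2
})
kept = drop_before_entry(data, enter="enter", t_end="t_end")
print(kept)
```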
