pzivich / zepid

Epidemiology analysis package

Home Page: http://zepid.readthedocs.org

License: MIT License

Python 100.00%
epidemiology inverse-probability-weights risk-ratio epidemiology-analysis risk-difference g-formula ipw aipw g-computation tmle

zepid's Introduction

I am an epidemiologist primarily working in infectious diseases, causal inference, statistics, and open-source software. You can find more on my personal website: https://pzivich.github.io/

zepid's People

Contributors

camdavidsonpilon, darrenreger, gitter-badger, joannadiong, pzivich

zepid's Issues

Add density plots to IPTW diagnostics

I want to add density plots as an alternative to histograms in the IPTW probability diagnostics. Should be easy with SciPy kernel density. Likely use gaussian_kde
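A minimal sketch of the density idea, assuming NumPy and SciPy are available. The propensity scores here are simulated and the variable names are hypothetical; this is not the zEpid implementation.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Simulated (hypothetical) propensity scores for each treatment group
rng = np.random.default_rng(0)
ps_treated = rng.beta(4, 2, size=500)
ps_untreated = rng.beta(2, 4, size=500)

# Evaluate each group's kernel density on a shared grid over [0, 1]
grid = np.linspace(0, 1, 201)
dens_treated = gaussian_kde(ps_treated)(grid)
dens_untreated = gaussian_kde(ps_untreated)(grid)

# The two curves could then be drawn with matplotlib, for example
# plt.fill_between(grid, dens_treated, alpha=0.3)
```

Evaluating both groups on the same grid keeps the curves directly comparable when overlaid.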

Update website documentation

Need to review all the documentation on the website. Additionally, I have a wishlist of items I would like to add.

  1. Add math formulas for the various estimators. Need to look up math for .rst
  2. Create new section of documentation that goes over all the documentation for each function. Much like how pandas is set up

Recommendations:

  1. Follow the format of NetworkX's documentation but retain the narrative structure as the front-end
  2. Update the documentation to reflect rST, so it can all be rendered to the website

IPTW Balance Diagnostic

There is one diagnostic I still don't have fully addressed, and that is balance. Figure 2 in Austin Stuart 2015 shows the standardized differences for all variables. I have a calculator for standardized differences, but it isn't fully set up.

New function would have to (0) pull variables from patsy, (1) determine variable type (binary, continuous), (2) calculate standardized differences, (3) generate plot

Also should store everything in a DataFrame object to sort and plot by standardized differences (as in Austin Stuart 2015)
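Steps (1) and (2) plus the DataFrame storage could look roughly like the sketch below. It assumes pandas and NumPy, skips the patsy step by taking column names directly, and computes unweighted differences only. The function name and example data are hypothetical.

```python
import numpy as np
import pandas as pd

def standardized_differences(df, treatment, variables):
    """Unweighted standardized differences, one row per variable."""
    treated = df[df[treatment] == 1]
    untreated = df[df[treatment] == 0]
    rows = []
    for var in variables:
        # (1) Treat a variable holding only 0/1 values as binary
        is_binary = set(df[var].dropna().unique()) <= {0, 1}
        m1, m0 = treated[var].mean(), untreated[var].mean()
        # (2) Variance is p(1-p) for binary, sample variance otherwise
        if is_binary:
            v1, v0 = m1 * (1 - m1), m0 * (1 - m0)
        else:
            v1, v0 = treated[var].var(), untreated[var].var()
        smd = (m1 - m0) / np.sqrt((v1 + v0) / 2)
        rows.append({"variable": var,
                     "type": "binary" if is_binary else "continuous",
                     "smd": smd})
    out = pd.DataFrame(rows)
    # Sort by absolute standardized difference, largest first
    order = out["smd"].abs().sort_values(ascending=False).index
    return out.reindex(order)

data = pd.DataFrame({
    "A":    [1, 1, 1, 1, 0, 0, 0, 0],
    "male": [1, 1, 1, 0, 1, 0, 0, 0],
    "age":  [50, 60, 55, 65, 40, 45, 50, 55],
})
print(standardized_differences(data, "A", ["male", "age"]))
```

The sorted DataFrame can be passed straight to a plotting routine for step (3).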

Verify all functions

Reminder to verify all functions. Checklist of what has been verified

Base functions

  • RiskRatio
  • RiskDiff
  • NNT
  • OddsRatio
  • IncRateDiff
  • IncRateRatio
  • IC
  • ICR
  • ACR
  • PAF
  • StandMeanDiff
  • Sensitivity
  • Specificity
  • spline

Calculation Branch

  • rr
  • rd
  • nnt
  • oddsratio
  • ird
  • irr
  • acr
  • paf
  • risk_ci
  • ir_ci
  • stand_mean_diff
  • odds_to_prop
  • ppv_conv
  • npv_conv
  • screening_cost_analyzer
  • counternull_pvalue

Graphics

  • func_form_plot
  • effectmeasure_plot
  • pvalue_plot

Causal

  • propensity_score
  • IPTW
  • IPMW
  • IPCW
  • SimpleDoubleRobust
  • TimeFixedGFormula
  • TimeVaryGFormula

Sensitivity Analyses

  • rr_corr
  • trapezoidal

Add Risk / Risk Ratio to TMLE

This was planned. Now that I understand what a "clever covariate" is, it shouldn't be too bad. Need to bake in some conditionals

Add G-formula

One lofty goal is to implement the G-formula. Would need to code two versions: time-fixed and time-varying. The chapter by Robins & Hernan is a good reference. I have code that implements the g-formula using pandas. It is reasonably fast.

TODO: generalize to a class, allow input models then predict, need to determine how to allow users to input custom treatment regimes (all/none/natural course are easy to do), compare results (https://www.ncbi.nlm.nih.gov/pubmed/25140837)

Time-fixed version will be relatively easy to write up

Time-varying will need the ability to specify a large amount of models and specify the order in which the models are fit.

Note: I am also considering reorganizing in v0.2.0 so that IPW, g-formula, and doubly robust estimators are all contained within a folder called causal, rather than adding to the current ipw folder

Add stratify option to summary measures

For all the basic summary measures, it might be useful to include a stratify option. Default would be None. If specified with a variable, it would calculate the summary measure for each stratum and then provide the Mantel-Haenszel estimate for that measure. Also should conduct the homogeneity test (warn users if homogeneity seems to be violated).

Plot should become more complex (plot all the stratified and summary).

Need to think about how to use for multivariate exposures...
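As a rough illustration of the Mantel-Haenszel step for the risk ratio, here is a sketch in plain Python. The function name and the input layout are hypothetical, and the homogeneity test is not included.

```python
def mantel_haenszel_rr(strata):
    """Mantel-Haenszel summary risk ratio.

    strata: iterable of (a, n1, c, n0) tuples, one per stratum, where
    a / n1 are cases / total among the exposed and
    c / n0 are cases / total among the unexposed.
    """
    numerator = 0.0
    denominator = 0.0
    for a, n1, c, n0 in strata:
        total = n1 + n0
        numerator += a * n0 / total
        denominator += c * n1 / total
    return numerator / denominator

# Two hypothetical strata, each with a stratum-specific risk ratio of 2
strata = [(10, 100, 5, 100), (40, 200, 10, 100)]
stratum_rr = [(a / n1) / (c / n0) for a, n1, c, n0 in strata]
summary_rr = mantel_haenszel_rr(strata)
```

When every stratum shares the same risk ratio, as in this example, the summary estimate equals that common value.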

IPCW with Missing Observations

Found an error in the IPCW code that caused it to raise an error for missing predicted values. I have updated the code, but need to verify it.

TMLE prediction models with ML

To allow sklearn ML to generate predictions (or whatever the user would like, e.g. SuPyLearner), the model statement should allow fitted ML models. Due to how sklearn runs predictions (NumPy arrays), I might need to learn patsy to pull out the correct variables from a pandas DataFrame

TMLE bound IPW

R-TMLE has the default option of bounding the estimated IPW to correct for near-positivity violations. Essentially, values below the 2.5th percentile or above the 97.5th percentile are set to the respective percentile value.

I would like to add this option in. Should be simple. Add in something like bound_ipw=True somewhere to control this. I still need to decide whether to default to this or not. R-TMLE uses this as the default.

R TMLE also allows the bounds to be set to user-specified values. This should be the alternative. Maybe if-then logic to go through; True-> no bounding, False-> 2.5/97.5 bounding, float-> bounding at specification
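A minimal sketch of percentile-based bounding as described in this issue, assuming NumPy. The helper name bound_weights and its parameters are hypothetical and do not correspond to an existing zEpid or R tmle argument.

```python
import numpy as np

def bound_weights(weights, lower_pct=2.5, upper_pct=97.5):
    """Set weights outside the given percentiles to the percentile values."""
    weights = np.asarray(weights, dtype=float)
    lower, upper = np.percentile(weights, [lower_pct, upper_pct])
    return np.clip(weights, lower, upper)

# Hypothetical weights with one extreme value
w = np.array([0.5, 1.0, 1.2, 1.5, 2.0, 50.0])
bounded = bound_weights(w, lower_pct=10, upper_pct=90)
```

User-specified bounds would only change the two percentile arguments, so the same helper covers both the default and the custom case.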

AIPW for nonmonotone missing data

IPMW handles only monotone missing data (only a single variable with missing data). This is fine, but data missing for multiple variables simultaneously is more common in practice. There has been a recent proposal to use an AIPW estimator to correct for missing data (under the MAR assumption)

Papers say it is easy to implement with existing software. Will have to see how true this is...

https://academic.oup.com/aje/article/187/3/585/4642953
https://amstat.tandfonline.com/doi/abs/10.1080/01621459.2016.1256814#.XDOLZVxKhGM

Will require renaming the current AIPW to AIPTW (something I have been meaning to do)

Add summary to EffectMeasurePlot

Like forest plots, zEpid should combine the results of multiple studies into a single summary estimate (and add that estimate to the bottom of the plot).

This would be for a systematic review. Currently, I made the function to summarize stratified results or results across several models. This would be a further generalization.

I should also clean up this code while I am at it (I wrote it a long time ago and it can be improved). Also it would be ideal to remove t_adjuster and come up with some algorithm to optimally space the table (but leave the option for fine adjustments)

Splines and Functional Form

When looking at functional form plots for spline models, they display oddly. Either something is going wrong in the generation of the matplotlib object or the spline calculation is doing something unexpected....

G-formula estimation

Currently, I use the Monte Carlo estimation procedure for the g-formula. An alternative is to use sequential regression, which has the advantage of needing fewer models to be specified. Kreif et al. 2017 has a nice description of sequential regression estimation. This is the same paper that describes LTMLE

TMLE additions

This is a longer-term project. As I am reading through Targeted Learning, I will add to this list the features I would like to add, along with important notes that I have gleaned from the book.

  • Add support for continuous outcomes (Targeted Learning Ch 7, pg 125, 126)

  • For a continuous Y, it must be bounded between 0-1 before starting the process. Transform using the following Y* = (Y-a) / (b-a) where a = min(Y), b = max(Y)

  • Psi = (b-a) Psi* to convert from the bounded causal effect back to the original

  • Use some alpha to keep logit(Y*) from being undefined. alpha = 0.0005 maybe?

  • Add support for F(A=1) and F(A=0)

  • Need to look up formulation in targeted learning book and IC

  • NOT IMPLEMENTING. James Robins showed in 1988 (Confidence intervals for causal parameters) that the corresponding confidence intervals may only be valid assuming deterministic potential outcomes. As a result, I have decided not to implement this feature (since I don't want to make that assumption)

  • Natively handle missing data

  • R tmle will be a good reference for this

  • Add an option to specify a missing data model. This should be optional to include

  • Mediation analysis (direct, indirect, total) (Targeted Learning Ch 8)

  • Can add this, but I am not largely convinced of current mediation analysis. I know that people do like to use it though...

  • pg 139 has some good notes

  • Collaborative TMLE (CTMLE)

  • g-model should be based on TMLE of Q, not the fit to g. CTMLE is an approach to formalize this. Might try to add as an additional class object (depending on utility and substantive differences)
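The continuous-outcome notes in the list above (bounding Y to 0-1, keeping logit(Y*) defined with alpha, and converting Psi* back) can be sketched in plain Python. The function names are hypothetical, and clipping Y* to [alpha, 1 - alpha] is one assumed way of applying alpha.

```python
import math

def bound_outcome(y, alpha=0.0005):
    """Rescale y to Y* = (Y - a) / (b - a), then keep Y* inside (0, 1)."""
    a, b = min(y), max(y)
    y_star = [(value - a) / (b - a) for value in y]
    # Clip away from exactly 0 and 1 so that logit(Y*) stays defined
    y_star = [min(max(value, alpha), 1 - alpha) for value in y_star]
    return y_star, a, b

def unbound_effect(psi_star, a, b):
    """Convert an effect on the bounded scale back: Psi = (b - a) * Psi*."""
    return (b - a) * psi_star

def logit(p):
    return math.log(p / (1 - p))

y = [2.0, 4.0, 10.0]
y_star, a, b = bound_outcome(y)
logits = [logit(value) for value in y_star]
```

Returning a and b alongside Y* keeps the back-transformation tied to the same bounds used for scaling.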

Write tests

Need to write tests for all functionalities

TimeVaryGFormula in Cython

To speed up the TimeVaryGFormula, I should try Cython to see how big of a speed boost I can get. Currently, SAS destroys Python with regard to the time needed to complete the g-formula. Keil et al. is a great example of this

Transition from AIPW to AIPTW

For clarity, the current AIPW will transition to AIPTW. I will need to update this and generate a user warning for the upcoming change. Mentioned in #55

AIPW will no longer be supported in v0.5

Name change is necessary for the eventual addition of Augmented IPW for missingness (AIPMW). Additionally, the proposed naming would better match the naming conventions in ipw/

IPCW Prep for late entry

As it stands, ipcw_prep does not drop the time periods that occurred before late entry. Need to add in a line to drop those observations before the entrance time (should be easy fix)

Add TMLE to website

Still need to add TMLE to the EffectMeasurePlot to show difference between methods

Add time-varying RD/RR plots

Reminder to create time-varying Risk Difference / Risk Ratio plots in the vein of Table 3 in
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4325676/ as recommended by
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4325680/

Inputs should be two pandas Series for the risks indexed by time (like 1 - km.survival_function_ from a fitted lifelines Kaplan-Meier). Duplicated risks should be dropped, the Series merged by index (time), forward filled, then plotted. Include as part of 0.1.4

Issues to consider: confidence intervals (if IP-weighted would need to be bootstrapped), scale options (like effectmeasureplot), LOWESS curve options (like functionalform).
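The data preparation described above (drop duplicated risks, merge on time, forward fill) could be sketched as follows, assuming pandas. The function name and example risks are hypothetical, and confidence intervals are not handled.

```python
import pandas as pd

def risk_difference_over_time(risk1, risk0):
    """Align two risk curves indexed by time and take their difference."""
    # Keep only the times at which each risk changes
    r1 = risk1.drop_duplicates(keep="first").rename("r1")
    r0 = risk0.drop_duplicates(keep="first").rename("r0")
    # Outer-merge on the time index, then carry the last risk forward
    merged = pd.concat([r1, r0], axis=1).sort_index().ffill()
    merged["rd"] = merged["r1"] - merged["r0"]
    return merged

# Hypothetical cumulative risks indexed by time
risk_treated = pd.Series([0.0, 0.1, 0.1, 0.3], index=[0, 1, 2, 3])
risk_untreated = pd.Series([0.0, 0.05, 0.2], index=[0, 2, 4])
rd_table = risk_difference_over_time(risk_treated, risk_untreated)

# A step plot could then be drawn with, for example,
# rd_table["rd"].plot(drawstyle="steps-post")
```

The risk ratio follows the same pattern with division in place of subtraction, apart from times where the reference risk is zero.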

Update website

Reorganization:

  • Create Causal page
  • Move IPW to Causal page
  • Add time-fixed g-formula doc
  • Add time-varying g-formula doc
  • Add doubly robust doc
  • Add IPMW doc
  • Move dynamic risk plot to graphics (better suited there since it is under that branch)
  • Update index.rst
  • Change LOESS value for functional form plots (I set it too small so it looks silly)
  • Condense p-values plots
  • Consider switching forest plot from my study to causal methods instead
  • Remove histogram from diagnostics
  • Update README.md to reference all causal method implemented

Suggestion: use NamedTuples in calc utils

This became clear in reading the tests like:

    def test_match_sas_ci(self):
        sas_ci = (0.361409618, 0.638590382)
        r = risk_ci(25, 50, confint='wald')
        npt.assert_allclose(r[1:3], sas_ci)

It's not clear what r[1:3] represents, or r[0], etc.

My suggestion is to create a NamedTuple at the top of the file:

from typing import NamedTuple
...

class CalcResults(NamedTuple):
    # Named fields replace positional access such as r[0] or r[1:3]
    name: str
    point_estimate: float
    lower_bound: float
    upper_bound: float
    standard_error: float

def incidence_rate_ci(events, time, alpha=0.05):
    ...

    return CalcResults('incidence_rate', ir, lower, upper, sd)

and then users (and yourself) can access elements easier:

result = incidence_rate_ci(...)
result.point_estimate
result.standard_error

A NamedTuple is like a lightweight class. Another option is to lean in and create a full-on class with methods, etc.

Stochastic Treatment g-formula

Might be a good addition to add a stochastic treatment plan g-formula. Good preparation for me before stochastic TMLE (#52). Maybe add something like .fit_stochastic()

At this point, I would only add it to the TimeFixedGFormula. Uses a re-sampling procedure. Have some parameter like resample to set. Default should be 100 or something. So stochastic fit would take the mean marginal outcome of the treatment plan

Complexities:

  • Easy to naively treat the population at p percent. Need to think about how to implement treating those with X=x at p and those with X=y at q
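The re-sampling procedure could be sketched as below, assuming NumPy and assuming the predicted outcomes under treatment and no treatment are already available from a fitted outcome model. The function name and arguments are hypothetical. Passing an array for p gives each person their own treatment probability, which is one possible way to handle the X=x at p, X=y at q case.

```python
import numpy as np

def stochastic_mean_outcome(y1, y0, p, resamples=100, seed=None):
    """Mean marginal outcome under a plan that treats with probability p.

    y1, y0: predicted outcomes for each person if treated / untreated.
    p: a single probability, or one probability per person.
    """
    rng = np.random.default_rng(seed)
    y1 = np.asarray(y1, dtype=float)
    y0 = np.asarray(y0, dtype=float)
    means = []
    for _ in range(resamples):
        # Randomly assign treatment, then take each person's matching outcome
        treated = rng.binomial(1, p, size=y1.shape[0])
        means.append(np.mean(np.where(treated == 1, y1, y0)))
    # Average the marginal outcome over all re-samples
    return float(np.mean(means))

y1_hat = [0.2, 0.4]
y0_hat = [0.1, 0.3]
half_treated = stochastic_mean_outcome(y1_hat, y0_hat, p=0.5, seed=1)
```

Setting p to 1 or 0 reproduces the treat-all and treat-none plans, which is a convenient check.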

DeprecationWarning

statsmodels v0.10 raises the following warning;

DeprecationWarning: Calling Family(..) with a link class as argument is deprecated. Use an instance of a link class instead.

Need to update the various parts of code to avoid this warning

Travis CI and matplotlib

I got Travis CI to check that all plots return axes objects once. Now it's having some trouble...

Everything passes locally and Travis is using the same version of matplotlib that I am using locally. Maybe something with the OS?...

causal_comparisons.py in the correct spot?

The file docs/causal_comparisons.py seems out of place (and references some files on your local machine). If this is intended as an example, I suggest moving to an examples/ folder and converting to a jupyter notebook.

Add custom_model to IPTW

Currently IPTW uses parametric logit models to estimate the weights. Like TMLE, we should allow users to specify custom models. While bootstrapped variances are invalid (maybe worth generating a warning?), the variance from GEE is valid for IPTW with machine learning models. This would allow semi-parametric weight estimation

Also not too hard to add, since the code to implement already lives within TMLE

Update Documents

Reminder to update and ensure all function documentation is up-to-date

Stochastic Treatment TMLE

This is another valuable addition to TMLE (that I also need as part of a project I am working on). Essentially, this would allow more complex treatments than treat-all vs. treat-none, similar to custom treatments in the g-formula.

What it does is shift the probability distribution of A. However, the single-step convergence of TMLE is no longer valid. I would need to iteratively estimate Q* until epsilon converges to 0. This should be easy enough. After convergence, the remainder of the TMLE procedure follows

Likely best if I make this separate from TMLE

Starting points;
https://github.com/tlverse/tmle3shift
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4117410/#SD1

Transition ipw to causal branch

In version 0.2.0, ipw will be moved to a new branch labelled causal. Reason for transition is to better group IPW with similar methods (doubly robust, G-computation)

conf.py missing?

Working on lifelines docs, I was editing my conf.py file - which is a config file for Sphinx and Readthedocs. I don't see a conf.py in zepid, which makes me wonder how zepid.readthedocs.io is even working...

Add Survival TMLE

Add a survival-analysis TMLE. Should be similar to the LTMLE, but need to find a good example

IPMW for Non-monotone Missing Data

Similar to the resources in #55 it would be nice to have both potential options for IPMW implemented (this is the framework that something like AIPMW would use anyways).

For monotone, need a recursive process to estimate each probability. Then calculate the weights.

Tchetgen has the unconstrained version for nonmonotone data. See his paper referenced in #55
The real difficulty for implementation will be the Bayesian Constrained (BC) estimator. This estimation process uses Bayesian tools to constrain the weights. The rationale for the BC is that the unconstrained version does not converge well

Framework:
IPMW(..., missing=['var1', 'var2'], monotone=True)
For the list of missing variables, each would have its respective probability of observation calculated (in the monotone scenario). Their order is key for the monotonic missing data
For nonmonotonic (monotone=False), the unconstrained log-likelihood would be maximized based on the calculated observed variable

To implement:

  1. Write monotone detector (to check users entered the data correctly)

  2. start with monotone, since it is the easier case. Rely on loop through all the missing variables

  3. Create example data set with monotone missing data. Use to write tests compared to R
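Step 1 could start from a check like the one below, assuming pandas and assuming the variables are listed in their missingness order. The function name and example data are hypothetical.

```python
import pandas as pd

def is_monotone_missing(df, missing):
    """True if, whenever a variable is missing, all later ones are too."""
    observed = df[missing].notna()
    for earlier, later in zip(missing[:-1], missing[1:]):
        # A later variable observed while an earlier one is missing
        # breaks the monotone pattern
        if (observed[later] & ~observed[earlier]).any():
            return False
    return True

example = pd.DataFrame({
    "var1": [1.0, 2.0, None, 4.0],
    "var2": [1.0, None, None, None],
})
right_order = is_monotone_missing(example, ["var1", "var2"])
wrong_order = is_monotone_missing(example, ["var2", "var1"])
```

Checking adjacent pairs is enough, since the condition carries through the ordered list.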

Update standardized difference calculator

Not discussed in the Austin, Stuart article is the formula for categorical SMD. I will need to update this to work as intended.

Currently standardized_mean_differences and plot_love assume that variables are either binary or continuous. There is no option for variables to be categorical. This is a problem, since the formulation for categorical variables is distinct from binary variables

Will need to interact with patsy's C(...) functionality to detect variable types properly.

Reference
http://support.sas.com/resources/papers/proceedings12/335-2012.pdf
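One formulation for a categorical variable, which I believe matches the referenced SAS paper, is a Mahalanobis-type distance between the two vectors of level proportions with one level dropped as the reference. A sketch assuming NumPy follows. The function name is hypothetical, and the formula should be checked against the paper before use.

```python
import numpy as np

def categorical_smd(p_treated, p_untreated):
    """Standardized difference for a K-level categorical variable.

    p_treated, p_untreated: proportions in each level (each sums to 1).
    """
    t = np.asarray(p_treated, dtype=float)[1:]   # drop the reference level
    c = np.asarray(p_untreated, dtype=float)[1:]
    # Average of the two multinomial covariance matrices
    s = -(np.outer(t, t) + np.outer(c, c)) / 2
    np.fill_diagonal(s, (t * (1 - t) + c * (1 - c)) / 2)
    diff = t - c
    return float(np.sqrt(diff @ np.linalg.solve(s, diff)))

# With two levels this reduces to the usual binary formula
binary_case = categorical_smd([0.25, 0.75], [0.75, 0.25])
three_level = categorical_smd([0.2, 0.3, 0.5], [0.4, 0.3, 0.3])
```

Because the two-level case equals the binary standardized difference, the existing binary results can serve as a test of the categorical code.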

Support for continuous treatments in TimeFixedGFormula

Following AJPH 2016, add support for continuous treatments. Can verify against the simulated data (and R's results)

Looks like it might be a valuable time to add stochastic treatments. This would assign treatment with some probability to individuals. It would repeat the process multiple times (n=100+) and obtain the average of those treatments. This is useful for continuous treatments

Add interference

Later addition, but since statsmodels 0.9.0 has GLIMMIX, I would like to add something to deal with interference for the causal branch. I don't have any part of this worked out, so I will need to take some time to really learn what is happening in these papers

References:
https://www.ncbi.nlm.nih.gov/pubmed/21068053
https://onlinelibrary.wiley.com/doi/abs/10.1111/biom.12184
https://github.com/bsaul/inferference

Branch plan:

---causal
      |
      -interference

Verification:
The R package inferference has some datasets that I can compare results with

Other:
Will need to update requirements to need statsmodels 0.9.0

Network G-Formula

Another for the wishlist. This will require me to learn Gibbs sampling to implement. Will also have to add NetworkX to the list of dependencies (or optional dependency)

Reference:
https://arxiv.org/pdf/1709.01577.pdf

Branch Plan:
either

---causal
      |
      -gformula

or

---causal
      |
      -interference

Second option might be better for the NetworkX optional import

Verification:
Probably just adapt some existing network simulation code I have

Other:
Need to add NetworkX as a(n) (optional) dependency

AIPW

Augmented IPW is the proper name for the Funk et al formula. Need to update this for clarity
