healthcatalyst / healthcareai-py

Python tools for healthcare machine learning

Home Page: http://healthcare.ai

License: MIT License

Python 95.84% PowerShell 2.13% Batchfile 1.30% Shell 0.46% Makefile 0.18% Dockerfile 0.10%

python machine-learning healthcare

healthcareai-py's Introduction

healthcareai

The aim of healthcareai is to streamline machine learning in healthcare. The package has two main goals:

Allow one to easily create models based on tabular data, and deploy the best model so that it pushes predictions to a database (such as MSSQL, MySQL, or SQLite) or a CSV flat file.
Provide tools related to data cleaning, manipulation, and imputation.

Installation

Windows

If you haven't already, install 64-bit Python 3.5 via the Anaconda distribution.
- Important: When prompted for the Installation Type, select Just Me (recommended). This makes permissions later in the process much simpler.
Open a terminal (CMD or PowerShell).
Run conda install pyodbc
Upgrade to the latest scipy (note that the upgrade command can take a long time):
Run conda remove scipy
Run conda install scipy
Run conda install scikit-learn
Install healthcareai using one and only one of these methods (ordered from easiest to hardest).
1. Recommended: Install the latest release with pip: run pip install healthcareai
2. If you know what you're doing and instead want the bleeding-edge version direct from our GitHub repo, run pip install https://github.com/HealthCatalyst/healthcareai-py/zipball/master

Why Anaconda?

We recommend using the Anaconda python distribution when working on Windows. There are a number of reasons:

When running anaconda and installing packages using the conda command, you don't need to worry about dependency hell, particularly because packages aren't compiled on your machine; conda installs pre-compiled binaries.
A great example of the pain that conda saves you is the Python package scipy, which, by its developers' own admission, "is difficult".

Linux

You may need to install the following dependencies:

sudo apt-get install python-tk
sudo pip install pyodbc
- Note that you might run into trouble with the pyodbc dependency. You may first need to run sudo apt-get install unixodbc-dev and then retry sudo pip install pyodbc. Credit: Stack Overflow.

Once you have the dependencies satisfied run pip install healthcareai or sudo pip install healthcareai

macOS

pip install healthcareai or sudo pip install healthcareai

Linux and macOS (via docker)

Install docker
Clone this repo (look for the green button on the repo main page)
cd into the cloned directory
run docker build -t healthcareai .
run the docker instance with docker run -p 8888:8888 healthcareai
You should then have a jupyter notebook available on http://localhost:8888.

Verify Installation

To verify that healthcareai installed correctly, open a terminal and run python. This opens an interactive python console (also known as a REPL). Then enter this command: from healthcareai import SupervisedModelTrainer and hit enter. If no error is thrown, you are ready to rock.
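
The same check can be scripted instead of typed into the REPL. The sketch below is one possible way to do it; the helper name check_import is hypothetical and not part of healthcareai, and it approximates the from ... import ... statement with importlib and getattr.

```python
import importlib


def check_import(module_name, attribute):
    """Return (True, "") if `attribute` can be loaded from `module_name`,
    otherwise (False, <error message>)."""
    try:
        module = importlib.import_module(module_name)
        getattr(module, attribute)
    except (ImportError, AttributeError) as error:
        return False, str(error)
    return True, ""


if __name__ == "__main__":
    ok, message = check_import("healthcareai", "SupervisedModelTrainer")
    print("healthcareai is installed" if ok else "Import failed: " + message)
```

If the script reports a failure, the error message it prints is a useful detail to include when asking for help.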

If you did get an error, or run into other installation issues, please let us know or better yet post on Stack Overflow (with the healthcare-ai tag) so we can help others along this process.

Getting started

Read through the Getting Started section of the healthcareai-py documentation.
Read through the example files to learn how to use the healthcareai-py API.
- For examples of how to train and evaluate a supervised model, inspect and run either example_regression_1.py or example_classification_1.py using our sample diabetes dataset.
- For examples of how to use a model to make predictions, inspect and run either example_regression_2.py or example_classification_2.py after running one of the first examples.
- For examples of more advanced use cases, inspect and run example_advanced.py.
To train and evaluate your own model, modify the queries and parameters in either example_regression_1.py or example_classification_1.py to match your own data.
Decide what type of prediction output you want. See Choosing a Prediction Output Type for details.
Set up your database tables to match the schema of the output type you chose.
- If you are working in a Health Catalyst EDW ecosystem (primarily MSSQL), please see the Health Catalyst EDW Instructions for setup.
- Otherwise, please see Working With Other Databases for details about writing to different databases (MSSQL, MySQL, SQLite, CSV)
Congratulations! After running one of the example files with your own data, you should have a trained model. To use your model to make predictions, modify either example_regression_2.py or example_classification_2.py to use your new model. You can then run it to see the results.

For Issues

Double check that the code follows the examples here
If you're still seeing an error, create a post in Stack Overflow (with the healthcare-ai tag) that contains
- Details on your environment (OS, database type, R vs Py)
- Goals (i.e., what you are trying to accomplish)
- Crystal clear steps for reproducing the error
You can also log a new issue in the GitHub repo by clicking here


healthcareai-py's Issues

Add what-if calc/output to Deploy step to provide guidance on improving prob

Essentially what we want to do is provide better guidance compared to what the current deploy method provides.

Instead of just providing the top three columns that were most important in a prediction, we want to provide the top three columns that provide the quickest improvement in probability over a 0.5 std deviation.

For example, if BMI and BP were feeding a prediction of diabetes probability, the output of the method would include not only the probability (as it does now), but also a ranking of BMI vs. BP according to which one would provide the biggest drop in diabetes likelihood for a 0.5 standard deviation drop in that factor, the resulting (lower) value of each factor after moving the person down 0.5 standard deviations, and the associated (lower) probability. So the output columns might look like the Ideal layout below:

Need a risk factor shared data mart!

Tracy Veyo: Archimedes does this what-if scenario visualization

Current:
BindingID, BindingNM, GrainID, CurrentProbability, FirstFactor, SecondFactor, ThirdFactor
0, R, 10001, 0.05, BMI, LDL, Age

Ideal:
BindingID, BindingNM, GrainID, CurrentProbability, FirstFactor, FirstTarget, FirstAdjProb, SecondFactor, SecondTarget, SecondAdjProb
0, R, 10001, 0.70, BMI, 21, 0.65, BP, 120, 0.67

This allows us to not worry about per-row variable importance from our efforts in deconstructing logistic!

We'll start with numeric columns (we use this kind of check for our imputation)

We'll start with a list of alterable factors (as arg to method); eventually we'll pull from a SQL table (created via editable SAM)
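
The what-if calculation described in this issue could be sketched roughly as follows. This is only an illustration under stated assumptions: rank_what_if and toy_model are hypothetical names, the model is a hand-written logistic function standing in for a trained one, and the patient values are made up.

```python
import math
import statistics


def rank_what_if(predict_proba, rows, target_row, alterable, shift_in_sd=0.5):
    """Rank alterable numeric factors by how much lowering each one by
    `shift_in_sd` standard deviations reduces the predicted probability."""
    current = predict_proba(target_row)
    results = []
    for factor in alterable:
        # Standard deviation of this factor across the whole dataset
        sd = statistics.stdev([row[factor] for row in rows])
        adjusted = dict(target_row)
        adjusted[factor] = target_row[factor] - shift_in_sd * sd
        results.append({
            "factor": factor,
            "target": adjusted[factor],
            "adjusted_probability": predict_proba(adjusted),
        })
    # Lowest adjusted probability (biggest improvement) first
    results.sort(key=lambda result: result["adjusted_probability"])
    return current, results


# Hypothetical stand-in for a trained model (coefficients are made up)
def toy_model(row):
    z = -12.0 + 0.25 * row["BMI"] + 0.04 * row["BP"]
    return 1.0 / (1.0 + math.exp(-z))


# Made-up patient rows for illustration only
patients = [
    {"BMI": 22.0, "BP": 110.0},
    {"BMI": 27.0, "BP": 125.0},
    {"BMI": 31.0, "BP": 140.0},
    {"BMI": 36.0, "BP": 150.0},
]
current, ranking = rank_what_if(toy_model, patients, patients[3], ["BMI", "BP"])
print(current, ranking)
```

Each entry in the ranking maps onto a Factor, Target, AdjProb triple in the Ideal layout. Passing the alterable factors as an argument matches the starting point proposed above; a SQL-backed list could replace it later.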

Make sure we understand how sphinx can interact with our github pages

Pull down from HC Community (i.e., Jive).

When finished, upload to both Spark and Jive.

Add mixed effects models

Make similar to linear/rf models already there. Use the statsmodels package.

http://statsmodels.sourceforge.net/devel/mixed_linear.html

Change method name in HCPyTools to linear instead of logit/lasso and standardize overall

Make sure all method names in HCPyTools use underscores (_ as the separator), following this convention:
http://visualgit.readthedocs.io/en/latest/pages/naming_convention.html#methods

Note there are methods for feature importance and ROC needing this as well.

Also, add check in Deploy step method to make sure the user enters either linear or random_forest.

Modifiable risk factors

Expand what was done in what-if work for PyTools issue (#6).

Now we need to add it for categorical cols, such that the what-if functionality works for orderset optimization and changes such as smoker Y/N, etc.

This method will let one see what col Z would become if the clinician goes with treatment A, B, or C.

Test install instructions from README on VM

For windows
For linux

Fix requirements.txt for virtual env install

Ask Faris where he got this list of required packages.

The associated instructions in the readme aren't working.

Add function to convert NLP col to 3,4, or 5 numeric cols

Based on PCA, we want to be able to convert text (using scikit-learn package) to a varying number of numeric columns.

Have an argument letting user specify which is the text col (since we'll expect a dataframe).

Two routes for this (and we can use an argument to specify which way the user wants):

We have a method argument that specifies the number of resultant columns, and we just grab that many eigenvectors (from the large matrix that arises after vectorization).
We have an argument such as percent.var.kept, which tells us how many columns (or eigenvectors) to grab in order to retain that much variance (from the very large matrix that, of course, contains all the variance).

For example, if percent.var.kept=10 gives us three eigenvectors (and so three columns in the resulting dataframe), then percent.var.kept=90 might pull back 30-40 columns. Of course, the number of columns will differ based on how much text is in the input text column.
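
Both routes can be sketched in a few lines. The version below is an assumption-laden illustration, not the package's implementation: the function name text_to_numeric_columns is hypothetical, and a plain word-count matrix plus numpy's SVD stand in for the scikit-learn vectorizer and PCA that the issue has in mind.

```python
import numpy as np


def text_to_numeric_columns(texts, n_columns=None, percent_var_kept=None):
    """Turn a list of strings into a few numeric columns via PCA.

    Pass exactly one of `n_columns` (fixed number of components) or
    `percent_var_kept` (fewest components reaching that percent of variance).
    """
    if (n_columns is None) == (percent_var_kept is None):
        raise ValueError("Specify exactly one of n_columns or percent_var_kept")

    # Vectorize: one row per document, one column per word (raw counts)
    vocabulary = sorted({word for text in texts for word in text.lower().split()})
    index = {word: i for i, word in enumerate(vocabulary)}
    counts = np.zeros((len(texts), len(vocabulary)))
    for row, text in enumerate(texts):
        for word in text.lower().split():
            counts[row, index[word]] += 1

    # PCA via SVD of the column-centered count matrix
    centered = counts - counts.mean(axis=0)
    u, singular_values, _ = np.linalg.svd(centered, full_matrices=False)
    explained = singular_values ** 2
    if explained.sum() == 0:
        raise ValueError("The texts are identical, so there is no variance to keep")

    if n_columns is None:
        cumulative = np.cumsum(explained / explained.sum())
        # Fewest components whose cumulative variance reaches the target
        n_columns = int(np.searchsorted(cumulative, percent_var_kept / 100.0)) + 1
    n_columns = min(n_columns, len(singular_values))

    # Component scores: each document's coordinates on the kept eigenvectors
    return u[:, :n_columns] * singular_values[:n_columns]


# Made-up notes for illustration only
notes = ["chest pain and cough", "cough and fever",
         "knee pain after fall", "fever and chest pain"]
print(text_to_numeric_columns(notes, n_columns=2).shape)
```

The returned array has one row per input string, so its columns can be attached to the original dataframe next to the text column.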

Make DeploySupervisedModel compatible with DBs other than SQL Server

Currently, the DeploySupervisedModel class makes several assumptions about the target database. It assumes that the driver is SQL Server Native Client 11.0 and that the user is authenticating via a trusted connection.

Rather than handling the connection details directly, a better approach may be to refactor the DeploySupervisedModel class to accept a connection object as an argument. This would allow the end user to use the database of their choice.
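
A minimal sketch of the connection-object approach, assuming a DB-API 2.0 connection is passed in. The function write_predictions, the table name, and the column names are hypothetical, and SQLite stands in for the target database so the example runs anywhere.

```python
import sqlite3


def write_predictions(connection, table, rows):
    """Write (grain_id, probability) rows using a DB-API 2.0 connection.

    `table` must come from trusted code, not user input, because it is
    interpolated into the SQL text.
    """
    cursor = connection.cursor()
    cursor.executemany(
        "INSERT INTO {} (GrainID, PredictedProbNBR) VALUES (?, ?)".format(table),
        rows,
    )
    connection.commit()
    return len(rows)


# Usage with an in-memory SQLite database (illustrative values only)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE predictions (GrainID INTEGER, PredictedProbNBR REAL)")
write_predictions(conn, "predictions", [(10001, 0.70), (10002, 0.05)])
print(conn.execute("SELECT * FROM predictions").fetchall())
```

The ? placeholder style works for sqlite3 and pyodbc; some other drivers (for example common MySQL ones) use %s instead, so a real implementation would need to account for the driver's paramstyle.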

Make Dev step work for cols that have names less than three characters

First, test that this is broken. Import a csv without col headers and see if you can run the Dev step with linear/rf.

If broken,

Look for the piece of code that removes the cols with DTS suffixes. To fix, you could use an if statement, that checks to make sure col names are three chars or longer.
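
One possible shape for that guard is sketched below. The function name drop_dts_columns is hypothetical, and the existing code in the Dev step may be structured differently.

```python
def drop_dts_columns(column_names, suffix="DTS"):
    """Return the column names that do not end with the date-time suffix."""
    kept = []
    for name in column_names:
        text = str(name)  # headerless CSVs can yield integer column labels
        # The explicit length check mirrors the fix suggested above;
        # str.endswith would also return False for shorter names
        if len(text) >= len(suffix) and text.endswith(suffix):
            continue
        kept.append(name)
    return kept


print(drop_dts_columns(["a", 0, "AdmitDTS", "BMI"]))
```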

Dockerize for cross-platform awesomeness

A docker-compose and Dockerfile would go a long way towards showing some deployment strategies and setting up a consistent python notebook across dev distros.

Create function to remove columns that are only NaN

Please make sure this works for cols that are 1) NA only and 2) NaN only

implement
unit tests

Can use this to test:

df = pd.DataFrame({'a': [1, None, 2, 3],
                   'b': ['m', 'f', None, 'f'],
                   'c': [3, 4, 5, None],
                   'd': [None, 8, 1, 3],
                   'label': ['Y', 'N', 'Y', 'N']})
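
A minimal sketch of such a function, assuming pandas is available. The name remove_all_null_columns is hypothetical, and the test frame adds two all-missing columns so there is something to remove.

```python
import numpy as np
import pandas as pd


def remove_all_null_columns(df):
    """Return a copy of `df` without columns whose values are all missing."""
    # pandas treats both None and NaN as missing, so one call covers both
    return df.dropna(axis=1, how="all")


df = pd.DataFrame({
    "a": [1, None, 2, 3],
    "all_none": [None, None, None, None],
    "all_nan": [np.nan, np.nan, np.nan, np.nan],
    "label": ["Y", "N", "Y", "N"],
})
print(remove_all_null_columns(df).columns.tolist())  # ['a', 'label']
```

Column a survives because only some of its values are missing; the original dataframe is left unchanged.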

Create grouped lasso functionality for DevelopSupervisedModel step

Using http://contrib.scikit-learn.org/lightning/

We are probably fine without CV, since the lasso penalty already limits over-fitting (although the penalty strength alpha is then fixed rather than tuned).

Can start with this code that includes CV and non-CV implementations using lightning's CDClassifier.

from hcpytools.impute_custom import DataFrameImputer
from hcpytools import modelutilities
from hcpytools.develop_supervised_model import DevelopSupervisedModel
from sklearn.linear_model import RandomizedLogisticRegression, LogisticRegression, Lasso, ElasticNet, ElasticNetCV
from sklearn.feature_selection import RFE, SelectFromModel
from sklearn.pipeline import Pipeline
# sklearn.grid_search and sklearn.cross_validation were deprecated in
# scikit-learn 0.18 in favor of sklearn.model_selection
from sklearn.grid_search import GridSearchCV
from sklearn import cross_validation
from sklearn.metrics import roc_auc_score
from sklearn import datasets
from lightning.classification import CDClassifier
import pandas as pd
import numpy as np
import warnings
import time
import sys

# Silence deprecation warnings and show every column when printing
warnings.filterwarnings("ignore", category=DeprecationWarning)
pd.set_option('display.max_columns', None)

df = pd.read_csv('C:/Users/levi.thatcher/Desktop/Pyscratch/Airline2008.csv', nrows=100000)

print(len(df))

#df.drop(['UniqueCarrier','Month','TailNum','FlightNum','Origin','Dest','CancellationCode'], axis=1, inplace=True)

print(df.dtypes)

df = df[['Cancelled','Month','DayOfWeek','Diverted','ArrTime','TaxiIn','TaxiOut','NASDelay','DepTime','AirTime']]

# Convert from numeric to factor
df['Month'] = df['Month'].astype(object)
df['DayOfWeek'] = df['DayOfWeek'].astype(object)
df['Diverted'] = df['Diverted'].astype(object)

# Convert from 1/0 to Y/N
df['Cancelled'].replace([1,0],['Y','N'], inplace=True)

# Look at data that's been pulled in
print(df.head())
print(df.dtypes)


#Step 1: compare two models
o = DevelopSupervisedModel(modeltype='classification',
                           df=df,
                           predictedcol='Cancelled',
                           graincol='',  #OPTIONAL/ENCOURAGED
                           impute=True,
                           debug=False)

t0 = time.time()
o.linear(cores=1,
         debug=False,
         tune=True)
print('Time: {}\n'.format(time.time() - t0))

t0 = time.time()
o.randomforest(cores=1,
               debug=False)
print('Time: {}\n'.format(time.time() - t0))

# Convert back from Y/N to 1/0 for non-HCRTools measures
df['Cancelled'].replace(['Y','N'],[1,0], inplace=True)

sys.exit()

##############################################

y = df['Cancelled']
df.drop(['Cancelled'], axis=1, inplace=True)
X = df

X = DataFrameImputer().fit_transform(X)
print('After imputation and before dummy creation')
print(X.dtypes)
print(X.head())

X = pd.get_dummies(X, drop_first=True, prefix_sep='.')
print('After dummy creation')
print(X.dtypes)
print(X.head())

X_train, X_test, y_train, y_test = cross_validation.train_test_split(
    X, y, test_size=0.2, random_state=0)

colname = X_train.columns.values
print(y_train.shape)
print(X_train.shape)
print(X_test.shape)

print(y_train.value_counts())


#################### Grouped Lasso from lightning WITH CV ###########################
t0 = time.time()
print('Group Lasso with CV')
clf = CDClassifier(penalty="l1/l2",
                   loss="log",
                   multiclass=False,
                   max_iter=20,
                   alpha=1e-4,
                   C=1.0 / X.shape[0],
                   tol=1e-3)

pipeline = Pipeline([('clf', clf)])
tuned_parameters = [{'clf__alpha': [0.01,0.25,0.5,0.75,.99, 5, 10]}]

estimator = GridSearchCV(pipeline, tuned_parameters, cv = 5)            

estimator.fit(X_train, y_train)

y_true, y_pred = y_test, estimator.predict_proba(X_test)
print('AUC: {}'.format(roc_auc_score(y_true, y_pred[:,1])))
#print(estimator.best_params_)
print(estimator.best_estimator_.named_steps['clf'].coef_)
print('Time: {}\n'.format(time.time() - t0))


#################### Grouped Lasso from lightning ###################################
t0 = time.time()
print('Group Lasso no CV')
clf1 = CDClassifier(penalty="l1/l2",
                    loss="log",
                    multiclass=False,
                    max_iter=20,
                    alpha=1e-4,
                    C=1.0 / X.shape[0],
                    tol=1e-3)

clf1.fit(X_train, y_train)

y_true, y_pred = y_test, clf1.predict_proba(X_test)
print('AUC: {}'.format(roc_auc_score(y_true, y_pred[:,1])))
print(clf1.coef_)
print('Time: {}\n'.format(time.time() - t0))

Create unit test for py ConvertTimeToDummies function

Make sure task #11 is done (i.e., the function already exists)

Create model objects

Add doc strings to Python functions

Verify that these work via the ?NameOfFunction trick

Change InWindow argument to InTestWindow in Deploy class

Make it align well with HCRTools deploy class.

Might have to update the HREmployeeDeploy.csv file in HCPyTools to have col name InTestWindow

Create unit tests for DeploySupervisedModel

Assert that logit and rf results are consistent.

Use test-develop-supervised-model.R in HCRTools as an example.

Create jupyter notebook for Py deploy step

Use the notebooks in the README.md for guidance.

Link this new notebook from README.md.

Re-factor Py constructors to use functions

There's way too much code in the develop and deploy constructors.

We want to move these steps to small functions in modelutilities that can be called in both dev and deploy constructors.

Fix typos and formatting issues in py docs

Change name of package to healthcareai

Talk to group about this.

Potential repo name: healthcareai-py

Package structure would be

healthcareai-py
healthcareai
notebooks

Investigate "build" tools for Python that mimic what appveyor does

Check out this link which talks about the following packages:

pyflakes,pep8,autopep8

The idea is that before checking-in we'd have a build tool that's similar to 'Check' in the RStudio/devtools world.

Fix scikit warning by using new syntax

C:\Python35\lib\site-packages\sklearn\cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
"This module will be removed in 0.20.", DeprecationWarning)
C:\Python35\lib\site-packages\sklearn\grid_search.py:43: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. This module will be removed in 0.20.
DeprecationWarning)

Create AU_PR plot to complement the AU_ROC plot

Create plot that shows precision (y-axis) and recall (x-axis) similarly to how ROC shows TPR vs FPR.

Also print AUPR metric to console near where AUROC is output.

See here: http://stats.stackexchange.com/questions/90779/area-under-the-roc-curve-or-area-under-the-pr-curve-for-imbalanced-data
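
The numbers behind such a plot can be computed without any plotting library. The sketch below is a plain-Python illustration with hypothetical function names; it uses the step-wise sum (often called average precision) as the AUPR metric, which is one of several ways to summarize the curve.

```python
def precision_recall_points(y_true, scores):
    """Return (recall, precision) points, one per distinct score threshold,
    ordered from the highest threshold to the lowest. Labels must be 0/1."""
    total_positives = sum(y_true)
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    points = []
    true_positives = 0
    for i, (score, label) in enumerate(ranked):
        true_positives += label
        # Emit a point only after the last item in a group of tied scores
        is_last_of_group = i + 1 == len(ranked) or ranked[i + 1][0] != score
        if is_last_of_group:
            predicted_positives = i + 1
            points.append((true_positives / total_positives,
                           true_positives / predicted_positives))
    return points


def average_precision(y_true, scores):
    """Area under the PR curve as the step-wise sum of
    (recall_n - recall_{n-1}) * precision_n."""
    area = 0.0
    previous_recall = 0.0
    for recall, precision in precision_recall_points(y_true, scores):
        area += (recall - previous_recall) * precision
        previous_recall = recall
    return area


# Made-up labels and scores for illustration only
print(average_precision([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

For the plot itself, the recall values go on the x-axis and the precision values on the y-axis, mirroring how the ROC plot uses FPR and TPR.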

Fix python sdist (for creating built distribution)

After creating zip and unzipping, installation doesn't happen like it used to.

Also, add description summary and description files to setup.py (if applicable); also fix home page to Jive for now (will later change when our docs have an open url).

Shift from ceODBC to pyodbc

While this'll make reads from SQL Server slower, ceODBC isn't well supported.

And pyodbc is faster than SQLAlchemy:
http://levithatcher.com/2015/11/r-vs-python-performance-benchmarking/

If each row of df has NA and impute=False, raise proper error message

create or find data that has NA in each row of non-label cols

Can use this:
df = pd.DataFrame({'a': [1, None, 2, 3],
                   'b': ['m', 'f', None, 'f'],
                   'c': [3, 4, 5, None],
                   'd': [None, 8, 1, 3],
                   'label': ['Y', 'N', 'Y', 'N']})

Run DevelopSupervisedModel to create object with predicted.col=label
Note that this error arises:

"Traceback (most recent call last):
File "C:/Source/DataScience/HCPyTools/hcpytools/test.py", line 23, in <module>
debug=False)
File "C:\Source\DataScience\HCPyTools\hcpytools\develop_supervised_model.py", line 70, in __init__
print(df.shape)
AttributeError: 'NoneType' object has no attribute 'shape'"

To fix: Check whether every row contains an NA and (if so) raise an error before trying to remove rows with NAs. See here for the proper way to do it: http://stackoverflow.com/a/24065533/5636012
--In our code, look for the line with df = df.dropna(axis=0, how='any', inplace=True). With inplace=True, dropna returns None, which is why df ends up as None in the traceback above.

write unit tests
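
The check described in this issue might look like the sketch below. The function name drop_rows_with_nulls and the wording of the error message are hypothetical; the parameter name predictedcol follows the DevelopSupervisedModel argument used elsewhere in these issues.

```python
import pandas as pd


def drop_rows_with_nulls(df, predictedcol):
    """Drop rows containing missing values, failing early with a clear
    message when that would leave no rows at all."""
    feature_columns = df.columns.drop(predictedcol)
    rows_with_null = df[feature_columns].isnull().any(axis=1)
    if rows_with_null.all():
        raise ValueError(
            "Every row has at least one missing value, so dropping rows "
            "with NAs would leave no data. Consider setting impute=True."
        )
    # Without inplace=True, dropna returns the filtered frame (not None)
    return df.dropna(axis=0, how="any")
```

With the test dataframe from this issue, every row has a missing value in a non-label column, so the function raises the ValueError instead of failing later with the AttributeError.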

Separate code examples into a distinct folder

If one chooses regression and pred col is binary, throw error

This is the current unhelpful message: ValueError: Unable to parse string "N" at position 0

Set error (with helpful message) when classification is chosen and pred col is numeric
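
Both this check and the regression check from the previous issue could live in one small validation step. The sketch below is a suggestion only: validate_predicted_column is a hypothetical name, the messages are placeholders, and the modeltype values follow the DevelopSupervisedModel argument shown elsewhere in these issues.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def validate_predicted_column(df, predictedcol, modeltype):
    """Raise a descriptive error when the predicted column's type does not
    match the chosen model type."""
    numeric = is_numeric_dtype(df[predictedcol])
    if modeltype == "regression" and not numeric:
        raise ValueError(
            "Regression requires a numeric predicted column, but '{}' "
            "holds non-numeric values (for example Y/N).".format(predictedcol)
        )
    if modeltype == "classification" and numeric:
        raise ValueError(
            "Classification requires a categorical predicted column "
            "(for example Y/N), but '{}' is numeric.".format(predictedcol)
        )


# Passes silently: a Y/N column with classification
validate_predicted_column(pd.DataFrame({"label": ["Y", "N"]}), "label", "classification")
```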

Create RESTful API and call model prediction from another VM

So basically, we'll host the model and RESTful API on a workstation and make calls to the API from the ETL machine (which will tie in to the platform extensibility point).

In this task, I will use two Azure VMs within a VNet to test this functionality.

Note that the API call (from the SQL Server VM) will not pass values, but will instead make the call to the workstation VM to run the entire script (including pulling data in from the SQL Server VM).

Find way to check HCPyTools locally same way that 'Check' in RStudio works

Mostly important locally, as Appveyor has CI checks that it does.

Check what CI does for Python packages and try to set up same check locally.

Create function to remove columns from df that have same values in each row

Might look at HCRTools for an example

Doing this because columns with zero variance don't help model.

Please complete in same PR as #16
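
A minimal sketch of such a function, assuming pandas. The name remove_constant_columns is hypothetical, and the HCRTools version may behave differently for missing values.

```python
import pandas as pd


def remove_constant_columns(df):
    """Return `df` without columns that hold the same value in every row."""
    # dropna=False counts a missing value as a value of its own, so a
    # column that is entirely missing is also treated as constant
    unique_counts = df.nunique(dropna=False)
    return df.loc[:, unique_counts > 1]


df = pd.DataFrame({
    "a": [1, 1, 1],
    "b": [1, 2, 3],
    "c": ["x", "x", "x"],
})
print(remove_constant_columns(df).columns.tolist())  # ['b']
```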

Switch import statements to 'from hcpytools import DevelopSupervisedModel'

Create unit test for function that removes cols with same value in each row

Please complete this in same PR as #15

See if healthcareai-py has to be BSD license (or if it can be MIT)

Fix install instructions in readme

This may help: http://stackoverflow.com/questions/15268953/how-to-install-python-package-from-github

Look for a command that doesn't require git to be installed.

Refactor Python Code - Move all train to dev step, create model objects

Create date-time feature engineering function

Use the HCRTools example in common.R as an example.

implement feature engineering
test coverage
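
As a starting point, the feature engineering could look something like the sketch below. It is an assumption about the intended behavior, not a port of common.R: the function name add_datetime_features and the generated column suffixes are hypothetical, and it produces numeric parts rather than dummy columns.

```python
import pandas as pd


def add_datetime_features(df, datetime_col):
    """Return a copy of `df` with the date-time column replaced by
    numeric parts (year, month, day of week, hour)."""
    out = df.copy()
    stamps = pd.to_datetime(out[datetime_col])
    out[datetime_col + "_Year"] = stamps.dt.year
    out[datetime_col + "_Month"] = stamps.dt.month
    out[datetime_col + "_DayOfWeek"] = stamps.dt.dayofweek  # Monday = 0
    out[datetime_col + "_Hour"] = stamps.dt.hour
    return out.drop(columns=[datetime_col])


# Made-up example row
visits = pd.DataFrame({"AdmitDTS": ["2016-03-07 14:30:00"], "BMI": [27.0]})
print(add_datetime_features(visits, "AdmitDTS"))
```

The month and day-of-week columns can then be cast to categorical types if dummy variables are wanted downstream.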

Move unit tests and examples to using DiabetesClinical

move the new data in (and in manifest)
unit tests
examples (including one notebook)

`ceODBC` is platform-specific to Windows — remove references and use a cross-platform tool

ceODBC is breaking installations on non-Windows deployments. We need to use a different driver or perhaps update the DB connection paradigm entirely — like use a better database connection protocol or a python DB cursor / ORM.

Example notebook isn't fully functional

There are some method typos, missing imports, a broken csv path, and a few columns that don't match up to the example data in the csv.

I submitted a PR to resolve this.

Create unit test for function that removes cols that are only NaN

Please complete this in same PR as #17

Make sure this works for cols that are 1) NaN and 2) None

Can use this to test:

df = pd.DataFrame({'a': [1, None, 2, 3],
                   'b': ['m', 'f', None, 'f'],
                   'c': [3, 4, 5, None],
                   'd': [None, 8, 1, 3],
                   'label': ['Y', 'N', 'Y', 'N']})

Add function to convert NLP col to 3,4, or 5 numeric cols

Based on PCA, we want to be able to convert text (using tm package) to a varying number of numeric columns.

Have an argument letting user specify which is the text col (since we'll expect a dataframe).

Two routes for this (and we can use an argument to specify which way the user wants):

We have a method argument that specifies the number of resultant columns, and we just grab that many eigenvectors (from the large matrix that arises after vectorization).
We have an argument such as percent.var.kept, which tells us how many columns (or eigenvectors) to grab in order to retain that much variance (from the very large matrix that, of course, contains all the variance).

Create unit test for when each row in df has null and impute = FALSE

Please complete this in same PR as #13

Make sure the unit test covers dataframes where each row has

None
NaN

Can use this for testing:

df = pd.DataFrame({'a': [1, None, 2, 3],
                   'b': ['m', 'f', None, 'f'],
                   'c': [3, 4, 5, None],
                   'd': [None, 8, 1, 3],
                   'label': ['Y', 'N', 'Y', 'N']})

Set up Appveyor

Create output json (or just flatfile) containing model performance measures

Would be great to include things like performance for the methods used, time spent on processing, and variable importance.

Put date/time stamp in file name.

Main axis of comparison is between lasso/logit and rf.

Make these metrics attributes, such that they're accessible via the environment object.

Coordinate with CAFE team on this (as they've done it already)--in particular, reach out to Justin (and copy Patrick Nelli), such that we can make sure to grab the metrics they're grabbing.

Be sure to include these metrics: PRAUC, AUROC, trimmed list of TPR/FPR at various cutpoints, var importance for each method
Be sure to include figures: Precision-recall, ROC
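
Writing the metrics file itself is straightforward; the sketch below shows one possible layout. Everything in it is an assumption: the function name write_model_metrics, the key names, and the zero values, which are placeholders rather than real results.

```python
import json
import os
import tempfile
from datetime import datetime


def write_model_metrics(metrics, directory, prefix="model_metrics"):
    """Write `metrics` to a JSON file with a date/time stamp in its name
    and return the file path."""
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    path = os.path.join(directory, "{}_{}.json".format(prefix, stamp))
    with open(path, "w") as handle:
        json.dump(metrics, handle, indent=2, sort_keys=True)
    return path


# Placeholder numbers for illustration only, not real results
example_metrics = {
    "linear": {"AUROC": 0.0, "PRAUC": 0.0, "training_seconds": 0.0,
               "variable_importance": {}},
    "random_forest": {"AUROC": 0.0, "PRAUC": 0.0, "training_seconds": 0.0,
                      "variable_importance": {}},
}
output_path = write_model_metrics(example_metrics, tempfile.gettempdir())
print(output_path)
```

Keeping one top-level key per method makes the lasso/logit vs. rf comparison easy to read back, and the same dictionary could be stored as an attribute on the model object.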

Automate important variables return from randomforest

Only return the most important features when using rf in DevelopSupervisedModel instead of returning all the variables with their importance (from 0 to 1).

We want to make it such that the user doesn't have to debate about where the cutoff should lie

So please use the Airline dataset (or some other large dataset) and check with rf in the Develop step whether removing the features with importance below .05 from the initial df worsens the accuracy.

If removing those below .05 does no harm, then we can just automatically return those above that threshold.
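
The filtering step itself could be as small as the sketch below. The function name select_important_features is hypothetical, and the importance values are made up; in practice they would come from the fitted random forest paired with the column names.

```python
def select_important_features(importances, threshold=0.05):
    """Return (name, importance) pairs at or above `threshold`,
    most important first."""
    kept = [(name, value) for name, value in importances.items()
            if value >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Made-up importances for illustration only
example = {"DepTime": 0.40, "TaxiOut": 0.33, "AirTime": 0.24, "Diverted": 0.03}
print(select_important_features(example))
```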

Add COLLABORATE file

Base on collaboration doc Levi has.

Add

SQL Server instructions from HCRTools collab doc
Faris instructions
git config core.ignorecase false

repackage CSVs into a MANIFEST.in

We shouldn't store the CSVs for our examples in the setup package itself. These docs state that the correct solution is to create a MANIFEST.in file, which is a collection of glob-style pattern references to the files in question.
