healthcatalyst / healthcareai-py

Python tools for healthcare machine learning

Home Page: http://healthcare.ai

License: MIT License

healthcareai-py's Introduction

healthcareai

The aim of healthcareai is to streamline machine learning in healthcare. The package has two main goals:

  • Allow one to easily create models based on tabular data, and deploy a best model that pushes predictions to a database such as MSSQL, MySQL, SQLite or csv flat file.
  • Provide tools related to data cleaning, manipulation, and imputation.

Installation

Windows

  • If you haven't, install 64-bit Python 3.5 via the Anaconda distribution
    • Important: When prompted for the Installation Type, select Just Me (recommended). This makes permissions later in the process much simpler.
  • Open the terminal (i.e., CMD or PowerShell, if using Windows)
  • Run conda install pyodbc
  • Upgrade to the latest scipy. The upgrade command can take a very long time, so remove and reinstall scipy instead:
  • Run conda remove scipy
  • Run conda install scipy
  • Run conda install scikit-learn
  • Install healthcareai using one and only one of these methods (ordered from easiest to hardest).
    1. Recommended: Install the latest release with pip: run pip install healthcareai
    2. If you know what you're doing, and instead want the bleeding-edge version direct from our GitHub repo, run pip install https://github.com/HealthCatalyst/healthcareai-py/zipball/master

Why Anaconda?

We recommend using the Anaconda python distribution when working on Windows. There are a number of reasons:

  • When running anaconda and installing packages using the conda command, you don't need to worry about dependency hell, particularly because packages aren't compiled on your machine; conda installs pre-compiled binaries.
  • A great example of the pain that using conda saves you is the python package scipy, which, by their own admission, "is difficult".

Linux

You may need to install the following dependencies:

  • sudo apt-get install python-tk
  • sudo pip install pyodbc
    • Note that you might run into trouble with the pyodbc dependency. You may first need to run sudo apt-get install unixodbc-dev, then retry sudo pip install pyodbc. Credit stackoverflow

Once the dependencies are satisfied, run pip install healthcareai or sudo pip install healthcareai

macOS

  • pip install healthcareai or sudo pip install healthcareai

Linux and macOS (via docker)

  • Install docker
  • Clone this repo (look for the green button on the repo main page)
  • cd into the cloned directory
  • run docker build -t healthcareai .
  • run the docker instance with docker run -p 8888:8888 healthcareai
  • You should then have a jupyter notebook available on http://localhost:8888.

Verify Installation

To verify that healthcareai installed correctly, open a terminal and run python. This opens an interactive python console (also known as a REPL). Then enter this command: from healthcareai import SupervisedModelTrainer and hit enter. If no error is thrown, you are ready to rock.

If you did get an error, or run into other installation issues, please let us know or better yet post on Stack Overflow (with the healthcare-ai tag) so we can help others along this process.
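
If you would rather check from a script than from the REPL, here is a minimal sketch. The helper name is_installed is made up for this example and is not part of healthcareai; it only reports whether Python can locate a package, not whether the package works.

```python
import importlib.util


def is_installed(package_name):
    """Return True if the named top-level package can be located for import.

    Hypothetical helper for illustration; not part of healthcareai.
    """
    return importlib.util.find_spec(package_name) is not None


if __name__ == "__main__":
    for name in ("healthcareai", "sklearn", "pandas"):
        status = "found" if is_installed(name) else "MISSING"
        print("{}: {}".format(name, status))
```

A package reported as MISSING points to an installation problem in the active environment (for example, the wrong conda environment being active).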

Getting started

  1. Read through the Getting Started section of the healthcareai-py documentation.

  2. Read through the example files to learn how to use the healthcareai-py API.

    • For examples of how to train and evaluate a supervised model, inspect and run either example_regression_1.py or example_classification_1.py using our sample diabetes dataset.
    • For examples of how to use a model to make predictions, inspect and run either example_regression_2.py or example_classification_2.py after running one of the first examples.
    • For examples of more advanced use cases, inspect and run example_advanced.py.
  3. To train and evaluate your own model, modify the queries and parameters in either example_regression_1.py or example_classification_1.py to match your own data.

  4. Decide what type of prediction output you want. See Choosing a Prediction Output Type for details.

  5. Set up your database tables to match the schema of the output type you chose.

  6. Congratulations! After running one of the example files with your own data, you should have a trained model. To use your model to make predictions, modify either example_regression_2.py or example_classification_2.py to use your new model. You can then run it to see the results.

For Issues

  • Double check that the code follows the examples here
  • If you're still seeing an error, create a post in Stack Overflow (with the healthcare-ai tag) that contains
    • Details on your environment (OS, database type, R vs Py)
    • Goals (i.e., what are you trying to accomplish)
    • Crystal clear steps for reproducing the error
  • You can also log a new issue in the GitHub repo by clicking here

healthcareai-py's Issues

Add what-if calc/output to Deploy step to provide guidance on improving prob

Essentially what we want to do is provide better guidance compared to what the current deploy method provides.

Instead of just providing the top three columns that were most important in a prediction, we want to provide the top three columns that give the largest improvement in probability for a 0.5 std deviation change.

For example, if BMI and BP were feeding a prediction of diabetes probability, the output of the method would include not only the probability (as it does now), but also a ranking of BMI vs BP according to which one would give the biggest drop in diabetes likelihood for a 0.5 std deviation drop in that factor, the resulting (lower) BMI and BP values after moving the person down 0.5 std deviations, and the associated (lower) probabilities. So the output cols might look like this:

Need a risk factor shared data mart!

Tracy Veyo: Archimedes does this what-if scenario visualization

Current:
BindingID  BindingNM  GrainID  CurrentProbability  FirstFactor  SecondFactor  ThirdFactor
0          R          10001    0.05                BMI          LDL           Age

Ideal:
BindingID  BindingNM  GrainID  CurrentProbability  FirstFactor  FirstTarget  FirstAdjProb  SecondFactor  SecondTarget  SecondAdjProb
0          R          10001    0.70                BMI          21           0.65          BP            120           0.67

This allows us to not worry about per-row variable importance from our efforts in deconstructing logistic!

We'll start with numeric columns (we use this kind of check for our imputation)

We'll start with a list of alterable factors (as arg to method); eventually we'll pull from a SQL table (created via editable SAM)
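
As a rough illustration of the idea for numeric columns, here is a minimal sketch. Everything in it is an assumption made for the example: the function name what_if_ranking, the toy logistic model standing in for a trained model, and the coefficients, feature values, and standard deviations. It lowers each alterable factor by 0.5 std deviations, re-scores the row, and ranks the factors by the resulting probability.

```python
import math


def what_if_ranking(predict, row, std_devs, modifiable, shift=0.5):
    """Rank modifiable numeric factors by the probability they lead to.

    predict    -- callable mapping a dict of feature values to a probability
    row        -- dict of the person's current feature values
    std_devs   -- dict of per-feature standard deviations
    modifiable -- list of feature names that could realistically be changed
    shift      -- how many standard deviations to lower each factor by
    """
    current = predict(row)
    results = []
    for factor in modifiable:
        altered = dict(row)  # copy so the original row is untouched
        altered[factor] = row[factor] - shift * std_devs[factor]
        results.append((factor, altered[factor], predict(altered)))
    # Lowest adjusted probability (biggest improvement) first
    results.sort(key=lambda item: item[2])
    return current, results


# Hypothetical stand-in for a trained model: a hand-written logistic function
def toy_model(features):
    z = -12.0 + 0.25 * features["BMI"] + 0.04 * features["BP"]
    return 1.0 / (1.0 + math.exp(-z))


person = {"BMI": 31.0, "BP": 140.0}   # made-up values
spread = {"BMI": 4.0, "BP": 15.0}     # made-up standard deviations
current_prob, ranked = what_if_ranking(toy_model, person, spread, ["BMI", "BP"])

print("CurrentProbability: {:.2f}".format(current_prob))
for factor, target, adj_prob in ranked:
    print("{}: target {:.1f}, adjusted probability {:.2f}".format(
        factor, target, adj_prob))
```

Each tuple in ranked maps onto one Factor/Target/AdjProb triple in the ideal output. Passing the model in as a callable keeps the what-if logic independent of which algorithm produced the prediction.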

Modifiable risk factors

Expand what was done in what-if work for PyTools issue (#6).

Now we need to add it for categorical cols, such that the what-if functionality works for orderset optimization and changes such as smoker Y/N, etc.

This method will let one see what value col Z would take if the clinician goes with treatment A, B, or C.
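
For categorical columns the same idea becomes an enumeration over the allowed values rather than a shift. A minimal sketch, where the helper what_if_categories, the toy model, and all of its numbers are invented for illustration:

```python
def what_if_categories(predict, row, column, options):
    """Return {option: predicted outcome} for each candidate value of column."""
    outcomes = {}
    for option in options:
        altered = dict(row)  # copy so the original row is untouched
        altered[column] = option
        outcomes[option] = predict(altered)
    return outcomes


# Hypothetical stand-in for a trained model; the numbers are made up
def toy_model(features):
    base = 0.30 if features["Smoker"] == "Y" else 0.20
    effect = {"A": 0.00, "B": -0.05, "C": -0.10}[features["Treatment"]]
    return base + effect


patient = {"Smoker": "Y", "Treatment": "A"}
by_treatment = what_if_categories(toy_model, patient, "Treatment", ["A", "B", "C"])
best = min(by_treatment, key=by_treatment.get)
print("Lowest predicted outcome with treatment", best)
```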

Add function to convert NLP col to 3,4, or 5 numeric cols

Based on PCA, we want to be able to convert text (using scikit-learn package) to a varying number of numeric columns.

Have an argument letting user specify which is the text col (since we'll expect a dataframe).

Two routes for this (and we can use an argument to specify which way the user wants):

  1. We have a method argument that specifies the number of resultant columns, and we just grab that many eigenvectors (from the large matrix that arises after vectorization)

  2. We have an argument such as percent.var.kept, which tells us how many columns (or eigenvectors) to grab that would make sure we get that much variance (from the super large matrix that, of course, contains all the variance).

For example, if percent.var.kept=10 means we grab three eigenvectors and get three cols in the resulting dataframe, then percent.var.kept=90 might pull back 30-40 columns. Of course, the number of columns will differ based on how much text is in the input text column.
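
Both routes can be sketched with scikit-learn. This is an illustration under assumptions, not the package's implementation: the function name text_to_numeric_cols, its arguments, and the sample notes are invented, and var_kept is a fraction (0.9) rather than a percent (90). It uses TfidfVectorizer for vectorization and PCA, which accepts either an integer number of components or a float share of variance when svd_solver='full'.

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer


def text_to_numeric_cols(df, text_col, n_cols=None, var_kept=None):
    """Return a dataframe of PCA columns derived from df[text_col].

    Pass exactly one of:
      n_cols   -- fixed number of output columns (route 1)
      var_kept -- fraction of variance to keep, between 0 and 1 (route 2)
    """
    if (n_cols is None) == (var_kept is None):
        raise ValueError("Specify exactly one of n_cols or var_kept")
    # Vectorize the text into a dense (rows x vocabulary) matrix
    matrix = TfidfVectorizer().fit_transform(df[text_col]).toarray()
    n_components = n_cols if n_cols is not None else var_kept
    # A float n_components keeps enough components to reach that share of
    # variance; scikit-learn requires svd_solver='full' for this
    pca = PCA(n_components=n_components, svd_solver="full")
    components = pca.fit_transform(matrix)
    names = ["{}_pc{}".format(text_col, i + 1)
             for i in range(components.shape[1])]
    return pd.DataFrame(components, columns=names, index=df.index)


# Made-up example notes
notes = pd.DataFrame({"note": [
    "patient reports chest pain and shortness of breath",
    "follow up visit for diabetes management",
    "chest pain resolved after medication",
    "diabetes well controlled with diet",
    "shortness of breath on exertion",
    "routine visit no complaints",
]})

fixed = text_to_numeric_cols(notes, "note", n_cols=3)
by_variance = text_to_numeric_cols(notes, "note", var_kept=0.9)
print(fixed.shape, by_variance.shape)
```

Converting to a dense array is fine for a small example but may not fit in memory for a large vocabulary; TruncatedSVD works directly on sparse input and could be swapped in for route 1.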

Make DeploySupervisedModel compatible with DBs other than SQL Server

Currently, the DeploySupervisedModel class makes several assumptions about the target database. It assumes that the driver is SQL Server Native Client 11.0 and that the user is authenticating via a trusted connection.

Rather than handling the connection details directly, a better approach may be to refactor the DeploySupervisedModel class to accept a connection object as an argument. This would allow the end user to use the database of their choice.
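
A minimal sketch of what accepting a connection object could look like, using the standard library's sqlite3 so it runs anywhere. The function write_predictions and the table and column names are hypothetical and do not describe the current DeploySupervisedModel API.

```python
import sqlite3


def write_predictions(connection, table, rows):
    """Insert (grain_id, probability) pairs using a DB-API 2.0 connection.

    Caveats: the '?' placeholder is the paramstyle used by sqlite3 and
    pyodbc; other drivers may expect a different one. The table name is
    interpolated directly, so it must come from trusted code.
    """
    cursor = connection.cursor()
    cursor.executemany(
        "INSERT INTO {} (GrainID, PredictedProbNBR) VALUES (?, ?)".format(table),
        rows,
    )
    connection.commit()


# The caller owns the connection, so any supported database could be used
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE predictions (GrainID INTEGER, PredictedProbNBR REAL)")
write_predictions(conn, "predictions", [(10001, 0.05), (10002, 0.70)])
saved = conn.execute(
    "SELECT GrainID, PredictedProbNBR FROM predictions ORDER BY GrainID"
).fetchall()
print(saved)
```

With this shape, driver and authentication details stay with the end user, and the deploy code only depends on the cursor interface.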

Make Dev step work for cols that have names less than three characters

First, test that this is broken. Import a csv without col headers and see if you can run the Dev step with linear/rf.

If broken, look for the piece of code that removes the cols with DTS suffixes. To fix, you could use an if statement that checks that col names are three chars or longer.
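
The guard could look something like the sketch below. The helper name columns_with_suffix is invented, and converting names with str() is an assumption based on the fact that a csv read without headers can produce integer column names.

```python
def columns_with_suffix(column_names, suffix="DTS"):
    """Return names ending with suffix, skipping names that are too short."""
    matches = []
    for name in column_names:
        name = str(name)  # headerless csv imports can yield integer names
        if len(name) >= len(suffix) and name.endswith(suffix):
            matches.append(name)
    return matches


print(columns_with_suffix(["AdmitDTS", "ab", 0, "x", "BMI"]))
```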

Dockerize for cross-platform awesomeness

A docker-compose and Dockerfile would go a long way towards showing some deployment strategies and setting up a consistent python notebook across dev distros.

Create function to remove columns that are only NaN

Please make sure this works for cols that are 1) NA only and 2) NaN only

  • implement
  • unit tests

Can use this to test:

df = pd.DataFrame({'a': [1, None, 2, 3],
                   'b': ['m', 'f', None, 'f'],
                   'c': [3, 4, 5, None],
                   'd': [None, 8, 1, 3],
                   'label': ['Y', 'N', 'Y', 'N']})
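
One possible starting point, assuming pandas' own dropna is acceptable for this. The wrapper name remove_nan_only_columns and the all_none and all_nan columns are made up so that the example actually contains columns to remove.

```python
import numpy as np
import pandas as pd


def remove_nan_only_columns(df):
    """Return a copy of df without columns whose values are all missing."""
    return df.dropna(axis=1, how="all")


df = pd.DataFrame({"a": [1, None, 2, 3],
                   "all_none": [None, None, None, None],
                   "all_nan": [np.nan, np.nan, np.nan, np.nan],
                   "label": ["Y", "N", "Y", "N"]})
cleaned = remove_nan_only_columns(df)
print(sorted(cleaned.columns))
```

Columns with only some missing values (such as a above) are kept, which is the behavior a unit test should pin down.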

Create grouped lasso functionality for DevelopSupervisedModel step

Using http://contrib.scikit-learn.org/lightning/

We probably are fine without CV, as lasso prevents over-fitting.

Can start with this code that includes CV and non-CV implementations using lightning's CDClassifier.

import sys
import time
import warnings

import numpy as np
import pandas as pd
from lightning.classification import CDClassifier
from sklearn import cross_validation
from sklearn import datasets
from sklearn.feature_selection import RFE, SelectFromModel
from sklearn.grid_search import GridSearchCV
from sklearn.linear_model import RandomizedLogisticRegression, LogisticRegression, Lasso, ElasticNet, ElasticNetCV
from sklearn.metrics import roc_auc_score
from sklearn.pipeline import Pipeline

from hcpytools import modelutilities
from hcpytools.develop_supervised_model import DevelopSupervisedModel
from hcpytools.impute_custom import DataFrameImputer
warnings.filterwarnings("ignore", category=DeprecationWarning)
pd.set_option('display.max_columns', None)

df = pd.read_csv('C:/Users/levi.thatcher/Desktop/Pyscratch/Airline2008.csv', nrows=100000)

print(len(df))

#df.drop(['UniqueCarrier','Month','TailNum','FlightNum','Origin','Dest','CancellationCode'], axis=1, inplace=True)

print(df.dtypes)

df = df[['Cancelled','Month','DayOfWeek','Diverted','ArrTime','TaxiIn','TaxiOut','NASDelay','DepTime','AirTime']]

# Convert from numeric to factor
df['Month'] = df['Month'].astype(object)
df['DayOfWeek'] = df['DayOfWeek'].astype(object)
df['Diverted'] = df['Diverted'].astype(object)

# Convert from 1/0 to Y/N (assign the result rather than replacing in place on a column selection)
df['Cancelled'] = df['Cancelled'].replace([1, 0], ['Y', 'N'])

# Look at data that's been pulled in
print(df.head())
print(df.dtypes)


#Step 1: compare two models
o = DevelopSupervisedModel(modeltype='classification',
                           df=df,
                           predictedcol='Cancelled',
                           graincol='',  #OPTIONAL/ENCOURAGED
                           impute=True,
                           debug=False)

t0 = time.time()
o.linear(cores=1,
         debug=False,
         tune=True)
print('Time: {}\n'.format(time.time() - t0))

t0 = time.time()
o.randomforest(cores=1,
               debug=False)
print('Time: {}\n'.format(time.time() - t0))

# Convert back from Y/N to 1/0 for non-HCRTools measures
df['Cancelled'] = df['Cancelled'].replace(['Y', 'N'], [1, 0])

sys.exit()

##############################################

y = df['Cancelled']
df.drop(['Cancelled'], axis=1, inplace=True)
X = df

X = DataFrameImputer().fit_transform(X)
print('After imputation and before dummy creation')
print(X.dtypes)
print(X.head())

X = pd.get_dummies(X, drop_first=True, prefix_sep='.')
print('After dummy creation')
print(X.dtypes)
print(X.head())

X_train, X_test, y_train, y_test = cross_validation.train_test_split(
    X, y, test_size=0.2, random_state=0)

colname = X_train.columns.values
print(y_train.shape)
print(X_train.shape)
print(X_test.shape)

print(y_train.value_counts())


#################### Grouped Lasso from lightning WITH CV ###########################
t0 = time.time()
print('Group Lasso with CV')
clf = CDClassifier(penalty="l1/l2",
                   loss="log",
                   multiclass=False,
                   max_iter=20,
                   alpha=1e-4,
                   C=1.0 / X.shape[0],
                   tol=1e-3)

pipeline = Pipeline([('clf', clf)])
tuned_parameters = [{'clf__alpha': [0.01,0.25,0.5,0.75,.99, 5, 10]}]

estimator = GridSearchCV(pipeline, tuned_parameters, cv=5)

estimator.fit(X_train, y_train)

y_true, y_pred = y_test, estimator.predict_proba(X_test)
print('AUC: {}'.format(roc_auc_score(y_true, y_pred[:,1])))
#print(estimator.best_params_)
print(estimator.best_estimator_.named_steps['clf'].coef_)
print('Time: {}\n'.format(time.time() - t0))


#################### Grouped Lasso from lightning ###################################
t0 = time.time()
print('Group Lasso no CV')
clf1 = CDClassifier(penalty="l1/l2",
                   loss="log",
                   multiclass=False,
                   max_iter=20,
                   alpha=1e-4,
                   C=1.0 / X.shape[0],
                   tol=1e-3)

clf1.fit(X_train, y_train)

y_true, y_pred = y_test, clf1.predict_proba(X_test)
print('AUC: {}'.format(roc_auc_score(y_true, y_pred[:,1])))
print(clf1.coef_)
print('Time: {}\n'.format(time.time() - t0))

Re-factor Py constructors to use functions

There's way too much code in the develop and deploy constructors.

We want to move these steps to small functions in modelutilities that can be called in both dev and deploy constructors.

Fix scikit warning by using new syntax

C:\Python35\lib\site-packages\sklearn\cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
"This module will be removed in 0.20.", DeprecationWarning)
C:\Python35\lib\site-packages\sklearn\grid_search.py:43: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. This module will be removed in 0.20.
DeprecationWarning)
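
For reference, a minimal sketch of the newer imports, assuming scikit-learn 0.18 or later. The synthetic data and the LogisticRegression grid are only there to make the example runnable and are not taken from the package.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data stands in for the real dataframe
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Previously: cross_validation.train_test_split(...)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Previously: from sklearn.grid_search import GridSearchCV
search = GridSearchCV(LogisticRegression(), {"C": [0.1, 1.0, 10.0]}, cv=5)
search.fit(X_train, y_train)
print(search.best_params_)
```

As the warning notes, the CV iterator classes in model_selection have a different interface from the old module, so code that builds iterators such as KFold directly needs more than an import change.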

Fix python sdist (for creating built distribution)

After creating zip and unzipping, installation doesn't happen like it used to.

Also, add description summary and description files to setup.py (if applicable); also fix home page to Jive for now (will later change when our docs have an open url).

If each row of df has NA and impute=False, raise proper error message

  1. create or find data that has NA in each row of non-label cols

Can use this:
df = pd.DataFrame({'a': [1, None, 2, 3],
                   'b': ['m', 'f', None, 'f'],
                   'c': [3, 4, 5, None],
                   'd': [None, 8, 1, 3],
                   'label': ['Y', 'N', 'Y', 'N']})

  2. Run DevelopSupervisedModel to create object with predicted.col=label

  3. Note that this error arises:

Traceback (most recent call last):
  File "C:/Source/DataScience/HCPyTools/hcpytools/test.py", line 23, in <module>
    debug=False)
  File "C:\Source\DataScience\HCPyTools\hcpytools\develop_supervised_model.py", line 70, in __init__
    print(df.shape)
AttributeError: 'NoneType' object has no attribute 'shape'

To fix: Check whether every row contains NA and, if so, raise an error before trying to remove rows with NAs. See here for the proper way to do it: http://stackoverflow.com/a/24065533/5636012
In our code, look for the line with df = df.dropna(axis=0, how='any', inplace=True). With inplace=True, dropna returns None, so this assignment sets df to None, which is what produces the AttributeError.

  • write unit tests
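
A sketch of the check, under the assumption that it lives in a small helper. The name drop_rows_with_na and the error wording are invented for this example.

```python
import pandas as pd


def drop_rows_with_na(df):
    """Drop rows containing NA, raising a clear error if none would remain."""
    if df.isnull().any(axis=1).all():
        raise ValueError(
            "Every row contains at least one missing value, so dropping "
            "incomplete rows would leave no data. Consider impute=True.")
    # Return the result; do not combine assignment with inplace=True
    return df.dropna(axis=0, how="any")


df = pd.DataFrame({'a': [1, None, 2, 3],
                   'b': ['m', 'f', None, 'f'],
                   'c': [3, 4, 5, None],
                   'd': [None, 8, 1, 3],
                   'label': ['Y', 'N', 'Y', 'N']})

try:
    drop_rows_with_na(df)
except ValueError as error:
    print("Raised as expected:", error)
```

A unit test can assert that the ValueError is raised for this dataframe and that a dataframe with at least one complete row comes back with only its complete rows.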

Create RESTful API and call model prediction from another VM

So basically, we'll host the model and RESTful API on a workstation and make calls to the API from the ETL machine (which will tie in to the platform extensibility point).

In this task, I will use two Azure VMs within a VNet to test this functionality.

Note that the API call (from the SQL Server VM) will not pass values, but will instead make the call to the workstation VM to run the entire script (including pulling data in from the SQL Server VM).
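
To make the call pattern concrete, here is a minimal sketch using only the Python standard library. The /predict path, the function names, and the placeholder run_prediction_script are all assumptions for illustration; a real deployment would more likely use a web framework and would need authentication.

```python
import http.client
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


def run_prediction_script():
    """Hypothetical placeholder for the full job (pull data, predict, save)."""
    return {"status": "complete"}


class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/predict":
            self.send_error(404)
            return
        body = json.dumps(run_prediction_script()).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example quiet


def start_server(host="127.0.0.1", port=0):
    """Start the API in a background thread; port 0 picks a free port."""
    server = HTTPServer((host, port), PredictHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def trigger_prediction(host, port):
    """What the ETL machine would run: a POST that passes no values."""
    connection = http.client.HTTPConnection(host, port, timeout=10)
    try:
        connection.request("POST", "/predict")
        response = connection.getresponse()
        return json.loads(response.read().decode("utf-8"))
    finally:
        connection.close()


if __name__ == "__main__":
    server = start_server()
    print(trigger_prediction("127.0.0.1", server.server_address[1]))
    server.shutdown()
    server.server_close()
```

Here both ends run in one process for demonstration; in the setup described above, start_server would run on the workstation VM and trigger_prediction on the SQL Server VM.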

Example notebook isn't fully functional

There are some method typos, missing imports, a broken csv path, and a few columns that don't match up to the example data in the csv.

I submitted a PR to resolve this.

Add function to convert NLP col to 3,4, or 5 numeric cols

Based on PCA, we want to be able to convert text (using tm package) to a varying number of numeric columns.

Have an argument letting user specify which is the text col (since we'll expect a dataframe).

Two routes for this (and we can use an argument to specify which way the user wants):

  1. We have a method argument that specifies the number of resultant columns, and we just grab that many eigenvectors (from the large matrix that arises after vectorization)

  2. We have an argument such as percent.var.kept, which tells us how many columns (or eigenvectors) to grab that would make sure we get that much variance (from the super large matrix that, of course, contains all the variance).

For example, if percent.var.kept=10 means we grab three eigenvectors and get three cols in the resulting dataframe, then percent.var.kept=90 might pull back 30-40 columns. Of course, the number of columns will differ based on how much text is in the input text column.

Create output json (or just flatfile) containing model performance measures

Would be great to include things like performance for the methods used, time spent on processing, and variable importance.

Put date/time stamp in file name.

Main axis of comparison is between lasso/logit and rf.

Make these metrics attributes, such that they're accessible via the environment object.

Coordinate with CAFE team on this (as they've done it already)--in particular, reach out to Justin (and copy Patrick Nelli), such that we can make sure to grab the metrics they're grabbing.

Be sure to include these metrics: PRAUC, AUROC, trimmed list of TPR/FPR at various cutpoints, var importance for each method
Be sure to include figures: Precision-recall, ROC
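
A minimal sketch of the file-writing part, using only the standard library. The function name save_metrics, the file name pattern, and the dictionary layout are assumptions; the metric values are left as None placeholders because real numbers would come from the trained models.

```python
import json
import os
import tempfile
from datetime import datetime


def save_metrics(metrics, directory, prefix="model_metrics"):
    """Write metrics to <prefix>_<timestamp>.json and return the file path."""
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    path = os.path.join(directory, "{}_{}.json".format(prefix, stamp))
    with open(path, "w") as handle:
        json.dump(metrics, handle, indent=2, sort_keys=True)
    return path


# Placeholder structure only; no real results are shown here
example_metrics = {
    "randomforest": {"AUROC": None, "PRAUC": None, "seconds": None,
                     "variable_importance": {}},
    "linear": {"AUROC": None, "PRAUC": None, "seconds": None,
               "variable_importance": {}},
}

saved_path = save_metrics(example_metrics, tempfile.gettempdir())
print("Wrote", saved_path)
```

Keeping one top-level key per method makes the lasso/logit vs rf comparison a matter of reading two sibling entries from the same file.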

Automate important variables return from randomforest

Only return the most important features when using rf in DevelopSupervisedModel instead of returning all the variables with their importance (from 0 to 1).

We want to make it such that the user doesn't have to debate about where the cutoff should lie.

So please use the Airline dataset (or some large data) and check with rf in the Develop step whether taking out the features with importance below .05 from the initial df worsens the accuracy.

If removing those below .05 does no harm, then we can just automatically return those above that threshold.
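
The filtering step itself is small. A sketch, with the helper name important_features invented and the importance values made up for illustration; with scikit-learn, the importances would typically come from the feature_importances_ attribute of a fitted random forest.

```python
def important_features(names, importances, threshold=0.05):
    """Return (name, importance) pairs at or above threshold, largest first."""
    kept = [(name, value) for name, value in zip(names, importances)
            if value >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Illustrative values only, not results from a real model
names = ["DepTime", "TaxiOut", "AirTime", "DayOfWeek", "Month"]
importances = [0.30, 0.40, 0.22, 0.05, 0.03]

for name, value in important_features(names, importances):
    print("{}: {:.2f}".format(name, value))
```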

Add COLLABORATE file

Base on collaboration doc Levi has.

Add

  • SQL Server instructions from HCRTools collab doc
  • Faris instructions
  • git config core.ignorecase false

repackage CSVs into a MANIFEST.in

We shouldn't store the csvs for our examples in the setup package itself. These docs state that the correct solution is to create a MANIFEST.in file, which is a collection of pattern-based references (glob-style rather than regex) to the files in question.
