ericpan64 / covid19-hospitalization-prediction

ML model that generates hospitalization probability for COVID-positive patients within 21 days of a positive test result. Part of the COVID-19 EHR DREAM Challenge.

Languages: Jupyter Notebook 53.31%, Python 44.93%, Shell 1.12%, Dockerfile 0.64%

covid19-hospitalization-prediction's People

Contributors

ericpan64, mohammadbakir, santina

covid19-hospitalization-prediction's Issues

Set-up Docker Infrastructure

Goal: have the Dockerfiles, plus whatever else we need, set up per the DREAM website tutorial. Ideally, once the model is trained, we can port it over with minimal effort!

Initial NLP Analysis

Goal: using the CORD dataset, write a script that aggregates word frequencies across the different texts (feel free to add/adjust the analysis as you see fit). Incorporate a Python NLP library of your choice (e.g. spaCy, CoreNLP).
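
A minimal sketch of the word-frequency aggregation described above, using only the standard library. The two sample texts and the function names are hypothetical stand-ins; the real script would read the CORD texts and would likely swap the regex tokenizer for one from an NLP library such as spaCy.

```python
import re
from collections import Counter

# Hypothetical stand-ins for CORD document texts; the real dataset is not loaded here.
documents = [
    "Fever and cough were the most common symptoms.",
    "Cough, fever, and fatigue were reported at admission.",
]


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def aggregate_word_frequencies(texts):
    """Sum word counts across all texts into a single Counter."""
    totals = Counter()
    for text in texts:
        totals.update(tokenize(text))
    return totals


word_counts = aggregate_word_frequencies(documents)
print(word_counts.most_common(3))
```

Keeping the tokenizer as its own function makes it easy to replace later without touching the aggregation step.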

Choices of concept columns

Hey @ericpan64

I'm looking at the Python notebook and FILENAME_CLIN_CONCEPT_MAP in etl.

I'm confused about the choice of concepts. For example, "visit_concept_id" is just the dates (see here), so it wouldn't make sense to make that a feature...

I honestly don't fully understand what "source" is. According to this description on the OMOP website, it sounds like we wouldn't want to use the source value, so if we're planning to use discrete values for features, we should exclude source_concepts.

Thread on any code issues

Let's use this issue to track anything we find in our code that may prevent reproducibility, so we're ready for code submission.
Or we can open individual issues for that. Up for discussion :)

Set-up initial Model Training + Evaluation

Goal: set up a script to train and evaluate different ML approaches. In our case, we need the soft labels from classification models (i.e. the predicted class probabilities rather than hard 0/1 labels). Ideally, with the ETL/Docker infrastructure set up, we can incorporate this part and get a submission in with minimal effort!

Keep me in the loop if there are particular features/formatting you need from the ETL (the default is going to be the format from the HWs).
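
To make the "soft label" requirement concrete, here is a small sketch of how a logistic model turns a feature row into a probability instead of a hard 0/1 label. The weights, bias, and patient rows are made up for illustration and are not the project's trained model; in practice the probabilities would come from whichever classifier the training script fits.

```python
import math


def sigmoid(score):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-score))


def predict_proba(features, weights, bias):
    """Return the soft label P(y=1 | features) for a linear logistic model."""
    score = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(score)


# Hypothetical weights and feature rows, chosen only for illustration.
weights = [0.8, -0.5]
bias = -0.2
patients = {"patient_a": [2.0, 1.0], "patient_b": [0.0, 3.0]}

# Soft labels are the probabilities themselves.
soft_labels = {pid: predict_proba(x, weights, bias) for pid, x in patients.items()}

# Hard labels throw information away by thresholding at 0.5.
hard_labels = {pid: int(p >= 0.5) for pid, p in soft_labels.items()}

print(soft_labels)
print(hard_labels)
```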

Add discrete values when available as features

Currently, for feature ETL, we're aggregating the row count for each concept_id. However, in some cases there is more valuable contextual information that would be much better to use instead of counts.

From initial analysis, this involves the following files:

  • measurement.csv (measurement value, presence of abnormal measurement)
  • observation.csv (observation value)
  • person.csv (age calculated from birthday)
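
As one example of such a contextual value, age can be derived from a birth date and a reference date. This is a sketch only: the function name is hypothetical, and the choice of reference date (for instance the date of the positive test) is an assumption the issue does not specify.

```python
from datetime import date


def age_in_years(birth_date, reference_date):
    """Return the number of completed years between birth_date and reference_date."""
    years = reference_date.year - birth_date.year
    # Subtract one if the birthday has not yet occurred in the reference year.
    if (reference_date.month, reference_date.day) < (birth_date.month, birth_date.day):
        years -= 1
    return years


# Hypothetical values for illustration.
birthday = date(1950, 6, 15)
reference = date(2020, 6, 14)
age = age_in_years(birthday, reference)
print(age)
```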

I'm working on adding this contextual information as additional features in the feature matrix. Current design as follows:

  • Update get_highest_correlation_concept_feature_id_map to include contextual values as new 10-digit concept_ids (the original concept_id padded with 0s until it reaches 10 digits). Provide an option to include or exclude them
    • Based on the Athena Concept ID search, the largest concept_id is 9 digits long, so using >9 digits will avoid any collisions with existing concept_ids
    • Also rename the function to get_concept_feature_id_map_and_corr_series to make it clearer that it returns 2 items
  • Update create_feature_df to include these new concept_ids. Add options for aggregation style and imputation strategy (default both to mean)
  • Write simple test cases in tests.py to confirm generation works
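
The design above can be sketched roughly as follows. Several things here are assumptions rather than the project's code: the issue does not say whether the zeros go on the left or the right, so this sketch left-pads a string ID; the helper name, concept_ids, and rows are made up; and mean aggregation plus mean imputation are shown because they are the stated defaults.

```python
from statistics import mean


def contextual_concept_id(concept_id, width=10):
    """Build a string feature ID by left-padding the concept_id with zeros to `width` digits."""
    return str(concept_id).zfill(width)


# Hypothetical measurement rows: (person_id, concept_id, value).
rows = [
    (1, 1234567, 120.0),
    (1, 1234567, 130.0),
    (2, 1234567, 110.0),
    (3, 7654321, 80.0),
]

# Group the raw values per (person, padded feature ID).
grouped = {}
for person_id, concept_id, value in rows:
    key = (person_id, contextual_concept_id(concept_id))
    grouped.setdefault(key, []).append(value)

# Aggregation step: collapse repeated values per person with the mean.
aggregated = {key: mean(values) for key, values in grouped.items()}

people = sorted({person for person, _ in aggregated})
features = sorted({feature for _, feature in aggregated})

# Imputation step: fill missing (person, feature) cells with that feature's mean.
feature_matrix = {}
for feature in features:
    observed = [aggregated[(p, feature)] for p in people if (p, feature) in aggregated]
    column_mean = mean(observed)
    for person in people:
        feature_matrix[(person, feature)] = aggregated.get((person, feature), column_mean)

print(feature_matrix)
```

Storing the padded ID as a string matters in this sketch, since left-padding an integer with zeros would leave its value unchanged.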

Help our model "Get Good" (or: "Get Good (enough)")

With the initial infrastructure set up for all parts of the project, we've been able to get the features generated (average counts of concept_ids), perform some initial filtering using Pearson Correlation/PCA, run the data through the model selection framework (LR, SVM, Random Forest), and use NLP to identify concept_ids that are promising based on the separate CORD dataset. Good stuff!

However, our model still needs to "get good". Let's evaluate using the framework below; feel free to add/expand/modify as you see fit. Add new posts with major updates (I can organize this during meetings).

Current best:

                  | Local Test Results               | DREAM Test Results
AUC               | 0.66                             | ...
AUPR              | ...                              | ...
Balanced Accuracy | ...                              | ...
Features Used     | 464 clinical-only features       | ...
Best Model        | LR                               | ...
Other Notes       | Pulled info from paper submission | ...
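
For reference, two of the metrics in this table can be computed from soft labels as sketched below. The labels and scores are toy values, not project results, and the function names are hypothetical. AUC is computed here as the fraction of positive/negative pairs ranked correctly (ties count half), and balanced accuracy as the mean of sensitivity and specificity at an assumed 0.5 threshold.

```python
def roc_auc(labels, scores):
    """AUC as the chance that a random positive outscores a random negative (ties count half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def balanced_accuracy(labels, scores, threshold=0.5):
    """Mean of sensitivity and specificity after thresholding the scores."""
    preds = [int(s >= threshold) for s in scores]
    true_pos = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    true_neg = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 0)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return 0.5 * (true_pos / n_pos + true_neg / n_neg)


# Toy labels and predicted probabilities, for illustration only.
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.4, 0.35, 0.2, 0.6]

auc = roc_auc(labels, scores)
bal_acc = balanced_accuracy(labels, scores)
print(auc, bal_acc)
```

The pairwise AUC loop is quadratic in the number of patients, which is fine for a sanity check but slow on a full test set; a library implementation would be the practical choice there.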

Set-up ETL Pipeline + Feature Generation

Goal: set up the initial framework for ETL and Feature Generation using the Q2 data. Ideally, once the framework is established, we can plug the data into the model with minimal effort!
