An ML model that estimates each COVID-positive patient's probability of hospitalization within 21 days of a positive test result. Part of the COVID-19 EHR DREAM Challenge.
Goal: have Dockerfiles (plus whatever else we need) set up per the DREAM website tutorial. Ideally, once the model is trained, we can port it over with minimal effort!
Goal: using the CORD dataset, write a script that aggregates word frequencies across the different texts (feel free to add/adjust the analysis as you see fit). Incorporate a Python NLP library of your choice (e.g. spaCy, CoreNLP).
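Here's a minimal sketch of what such a script could look like, assuming spaCy as the NLP library and a `texts` iterable of document strings (how we load those will depend on the actual CORD file layout):

```python
from collections import Counter

import spacy

# Small English pipeline; parser/NER disabled since we only need tokens/lemmas.
# Requires: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

def aggregate_word_frequencies(texts):
    """Count lemmatized, non-stopword tokens across all texts."""
    counts = Counter()
    for doc in nlp.pipe(texts, batch_size=50):
        counts.update(
            tok.lemma_.lower()
            for tok in doc
            if tok.is_alpha and not tok.is_stop
        )
    return counts

# Placeholder texts standing in for the CORD documents:
freqs = aggregate_word_frequencies([
    "Hospitalization risk increases with patient age.",
    "COVID-19 primarily affects the respiratory system.",
])
print(freqs.most_common(10))
```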
I'm looking at the Python notebook and FILENAME_CLIN_CONCEPT_MAP in etl.
I'm confused about some of the concept choices. For example, "visit_concept_id" is just the dates (see here), so it wouldn't make sense to make that a feature...
I honestly don't fully understand what "source" means. According to this description on the OMOP website, it sounds like we wouldn't want to use source values, so if we're planning to use discrete values for features, we should exclude source_concepts.
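If we do go that route, one way to enforce it in the ETL could be a simple column filter (just a sketch; column names follow the OMOP CDM, and the file path is a placeholder):

```python
import pandas as pd

measurement = pd.read_csv("measurement.csv")

# Keep only standard-concept columns; drop *_source_value / *_source_concept_id.
source_cols = [c for c in measurement.columns if "source" in c]
measurement = measurement.drop(columns=source_cols)
```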
Let's use this issue to track anything we find in our code that may prevent reproducibility, so we're ready for code submission.
Or we can open individual issues for each finding. Up for discussion :)
Goal: set up a script to train and evaluate different ML approaches. In our case, we need the soft labels from classification models (i.e. class probabilities rather than hard 0/1 predictions). Ideally, with the ETL/Docker infrastructure set up, we can incorporate this part and get a submission in with minimal effort!
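For reference, here's a minimal sketch of pulling soft labels out of a scikit-learn classifier via `predict_proba` (the data below is synthetic; the real feature matrix will come from the ETL):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real feature matrix and labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = rng.integers(0, 2, size=200)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
soft_labels = clf.predict_proba(X_test)[:, 1]  # P(hospitalized within 21 days)
```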
Keep me in the loop if there are particular features/formatting you need from the ETL (the default will be the format from the HWs).
Currently, the feature ETL aggregates the row count for each concept_id. However, in some cases there is more valuable contextual information that would be much better to use than raw counts.
From initial analysis, this involves the following files:
measurement.csv (measurement value, presence of abnormal measurement)
observation.csv (observation value)
person.csv (age calculated from birth date; see the sketch below)
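For the age feature, a rough sketch (column names follow the OMOP person table; the index date is a placeholder for each patient's positive-test date):

```python
import pandas as pd

person = pd.read_csv("person.csv", parse_dates=["birth_datetime"])

# Placeholder index date; in practice this would be each patient's test date.
index_date = pd.Timestamp("2020-06-01")
# Integer division by 365 gives an approximate age in years.
person["age"] = (index_date - person["birth_datetime"]).dt.days // 365
```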
I'm working on adding this contextual information as additional features in the feature matrix. The current design is as follows:
Update get_highest_correlation_concept_feature_id_map to include contextual values as new 10-digit concept_ids (the original concept_id padded with 0's until it reaches 10 digits; see the sketch after this list). Provide an option to include or exclude them.
Based on the Athena Concept ID search, the largest concept_id is 9 digits long, so ids longer than 9 digits will avoid any collisions.
Also rename the function to get_concept_feature_id_map_and_corr_series to make it clearer that it returns two items.
Update create_feature_df to include these new concept_ids. Add options for aggregation style and imputation strategy (default both to mean).
Write simple test cases in tests.py to confirm generation works.
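A sketch of one reading of the padding scheme (the function name is hypothetical, and I'm assuming trailing zeros, so the result is numerically 10 digits and can't equal any real ≤9-digit concept_id):

```python
def contextual_feature_id(concept_id: int, width: int = 10) -> int:
    """Pad concept_id with trailing zeros until it is `width` digits long."""
    n_zeros = width - len(str(concept_id))
    return concept_id * 10 ** n_zeros

assert contextual_feature_id(3027018) == 3027018000  # 7 digits -> 10 digits
```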
With the initial infrastructure set up for all parts of the project, we've been able to generate the features (average counts of concept_ids), perform some initial filtering using Pearson correlation/PCA, run the data through the model selection framework (LR, SVM, Random Forest), and use NLP to identify promising concept_ids based on the separate CORD dataset. Good stuff!
However, our model still needs to "get good". Let's evaluate using the framework below; feel free to add/expand/modify as you see fit. Add new posts with major updates (I can organize these during meetings).
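As a reference point for the filtering step mentioned above, here's a minimal sketch of a Pearson-correlation filter (`features` and `labels` are synthetic placeholders for the ETL output; the threshold is arbitrary):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
features = pd.DataFrame(rng.normal(size=(100, 5)), columns=[f"c{i}" for i in range(5)])
labels = pd.Series(rng.integers(0, 2, size=100))

# corrwith computes Pearson correlation by default; keep features above a threshold.
corr = features.corrwith(labels).abs()
selected = features.loc[:, corr > 0.1]
```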
Goal: set up the initial framework for ETL and Feature Generation using the Q2 data. Ideally, once the framework is established, we can plug the data into the model with minimal effort!