An ML model that estimates each COVID-positive patient's probability of hospitalization within 21 days of a positive test result. Part of the COVID-19 EHR DREAM Challenge.
Goal: have Dockerfiles (plus whatever else we need) set up per the DREAM website tutorial. Ideally, once the model is trained, we can port it over with minimal effort!
Goal: using the CORD dataset, write a script that aggregates word frequencies across the different texts (feel free to add/adjust the analysis as you see fit). Incorporate a Python NLP library of your choice (e.g. spaCy, CoreNLP).
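Here's a minimal sketch of what such a script could look like, assuming spaCy as the NLP library and a `texts` iterable of document strings (how we load those will depend on the actual CORD file layout):

```python
from collections import Counter

import spacy

# Small English pipeline; parser/NER disabled since we only need tokens/lemmas.
# Requires: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

def aggregate_word_frequencies(texts):
    """Count lemmatized, non-stopword tokens across all texts."""
    counts = Counter()
    for doc in nlp.pipe(texts, batch_size=50):
        counts.update(
            tok.lemma_.lower()
            for tok in doc
            if tok.is_alpha and not tok.is_stop
        )
    return counts

# Placeholder texts standing in for the CORD documents:
freqs = aggregate_word_frequencies([
    "Hospitalization risk increases with patient age.",
    "COVID-19 primarily affects the respiratory system.",
])
print(freqs.most_common(10))
```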
I'm looking at the Python notebook and FILENAME_CLIN_CONCEPT_MAP in etl.
I'm confused about some of the concept choices. For example, "visit_concept_id" is just the dates (see here), so it wouldn't make sense to make that a feature...
I honestly don't fully understand what "source" means. According to this description on the OMOP website, it sounds like we wouldn't want to use source values, so if we're planning to use discrete values for features, we should exclude source_concepts.
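If we do go that route, one way to enforce it in the ETL could be a simple column filter (just a sketch; column names follow the OMOP CDM, and the file path is a placeholder):

```python
import pandas as pd

measurement = pd.read_csv("measurement.csv")

# Keep only standard-concept columns; drop *_source_value / *_source_concept_id.
source_cols = [c for c in measurement.columns if "source" in c]
measurement = measurement.drop(columns=source_cols)
```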
Let's use this issue to track anything we find in our code that may prevent reproducibility, so we're ready for code submission.
Or we can open individual issues for each finding. Up for discussion :)
Goal: set up a script to train and evaluate different ML approaches. In our case, we need the soft labels from classification models (i.e. class probabilities rather than hard 0/1 predictions). Ideally, with the ETL/Docker infrastructure set up, we can incorporate this part and get a submission in with minimal effort!
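For reference, here's a minimal sketch of pulling soft labels out of a scikit-learn classifier via `predict_proba` (the data below is synthetic; the real feature matrix will come from the ETL):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real feature matrix and labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = rng.integers(0, 2, size=200)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
soft_labels = clf.predict_proba(X_test)[:, 1]  # P(hospitalized within 21 days)
```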
Keep me in the loop if there are particular features/formatting you need from the ETL (the default will be the format from the HWs).
Currently, the feature ETL aggregates the row count for each concept_id. However, in some cases there is more valuable contextual information that would be much better to use than raw counts.
From initial analysis, this involves the following files:
measurement.csv (measurement value, presence of abnormal measurement)
observation.csv (observation value)
person.csv (age calculated from birth date; see the sketch below)
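For the age feature, a rough sketch (column names follow the OMOP person table; the index date is a placeholder for each patient's positive-test date):

```python
import pandas as pd

person = pd.read_csv("person.csv", parse_dates=["birth_datetime"])

# Placeholder index date; in practice this would be each patient's test date.
index_date = pd.Timestamp("2020-06-01")
# Integer division by 365 gives an approximate age in years.
person["age"] = (index_date - person["birth_datetime"]).dt.days // 365
```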
I'm working on adding this contextual information as additional features in the feature matrix. The current design is as follows:
Update get_highest_correlation_concept_feature_id_map to include contextual values as new 10-digit concept_ids (the original concept_id padded with 0's until it reaches 10 digits; see the sketch after this list). Provide an option to include or exclude them.
Based on the Athena Concept ID search, the largest concept_id is 9 digits long, so ids longer than 9 digits will avoid any collisions.
Also rename the function to get_concept_feature_id_map_and_corr_series to make it clearer that it returns two items.
Update create_feature_df to include these new concept_ids. Add options for aggregation style and imputation strategy (default both to mean).
Write simple test cases in tests.py to confirm generation works.
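A sketch of one reading of the padding scheme (the function name is hypothetical, and I'm assuming trailing zeros, so the result is numerically 10 digits and can't equal any real ≤9-digit concept_id):

```python
def contextual_feature_id(concept_id: int, width: int = 10) -> int:
    """Pad concept_id with trailing zeros until it is `width` digits long."""
    n_zeros = width - len(str(concept_id))
    return concept_id * 10 ** n_zeros

assert contextual_feature_id(3027018) == 3027018000  # 7 digits -> 10 digits
```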
With the initial infrastructure set up for all parts of the project, we've been able to generate the features (average counts of concept_ids), perform some initial filtering using Pearson correlation/PCA, run the data through the model selection framework (LR, SVM, Random Forest), and use NLP to identify promising concept_ids based on the separate CORD dataset. Good stuff!
However, our model still needs to "get good". Let's evaluate using the framework below; feel free to add/expand/modify as you see fit. Add new posts with major updates (I can organize these during meetings).
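As a reference point for the filtering step mentioned above, here's a minimal sketch of a Pearson-correlation filter (`features` and `labels` are synthetic placeholders for the ETL output; the threshold is arbitrary):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
features = pd.DataFrame(rng.normal(size=(100, 5)), columns=[f"c{i}" for i in range(5)])
labels = pd.Series(rng.integers(0, 2, size=100))

# corrwith computes Pearson correlation by default; keep features above a threshold.
corr = features.corrwith(labels).abs()
selected = features.loc[:, corr > 0.1]
```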
Goal: set up the initial framework for ETL and Feature Generation using the Q2 data. Ideally, once the framework is established, we can plug the data into the model with minimal effort!