lrsoenksen / haim

This repository contains the code to replicate the data processing, modeling, and reporting of our Holistic AI in Medicine (HAIM) publication in npj Digital Medicine (Soenksen LR, Ma Y, Zeng C, et al., 2022).

License: Apache License 2.0

Python 64.67% Jupyter Notebook 35.33%

haim's Introduction

Integrated multimodal artificial intelligence framework for healthcare applications

This repository contains the code to replicate the data processing, modeling, and reporting of our Holistic AI in Medicine (HAIM) publication in npj Digital Medicine: Soenksen, L.R., Ma, Y., Zeng, C. et al. Integrated multimodal artificial intelligence framework for healthcare applications. npj Digit. Med. 5, 149 (2022). https://doi.org/10.1038/s41746-022-00689-4.

Authors:

Luis R. Soenksen, Yu Ma, Cynthia Zeng, Léonard Boussioux, Kimberly Villalobos Carballo, Liangyuan Na, Holly M. Wiberg, Michael L. Li, Ignacio Fuentes, Dimitris Bertsimas

Artificial intelligence (AI) systems hold great promise to improve healthcare over the next decades. Specifically, AI systems leveraging multiple data sources and input modalities are poised to become a viable method to deliver more accurate results and deployable pipelines across a wide range of applications. In this work, we propose and evaluate a unified Holistic AI in Medicine (HAIM) framework to facilitate the generation and testing of AI systems that leverage multimodal inputs. Our approach uses generalizable data pre-processing and machine learning modeling stages that can be readily adapted for research and deployment in healthcare environments. We evaluate our HAIM framework by training and characterizing 14,324 independent models based on HAIM-MIMIC-MM, a multimodal clinical database (N=34,537 samples) containing 7,279 unique hospitalizations and 6,485 patients, spanning all possible input combinations of 4 data modalities (i.e., tabular, time-series, text, and images), 11 unique data sources and 12 predictive tasks. We show that this framework can consistently and robustly produce models that outperform similar single-source approaches across various healthcare demonstrations (by 6-33%), including 10 distinct chest pathology diagnoses, along with length-of-stay and 48-hour mortality predictions. We also quantify the contribution of each modality and data source using Shapley values, which demonstrates the heterogeneity in data modality importance and the necessity of multimodal inputs across different healthcare-relevant tasks. The generalizable properties and flexibility of our Holistic AI in Medicine (HAIM) framework could offer a promising pathway for future multimodal predictive systems in clinical and operational healthcare settings.

Code

The code uses Python 3.6.9 and is separated into four sections:

0 - Software package requirements

1 - Data Preprocessing. Noteevents.csv is public and available for download at Physionet.org; however, the other "NOTES" data requires pre-release direct permission from Physionet.org, because the discharge, radiology, ECG, and ECHO notes were not yet publicly released for MIMIC-IV as of Sep 2022. These files are ds_icustay.csv, ecg_icustay.csv, echo_icustay.csv, and rad_icustay.csv. To run our code without them, simply comment out the import and usage of these notes, as in the sketch below.
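
For instance, a minimal guard might look like the following sketch (the base path and variable names are illustrative, not necessarily the exact ones used in the notebooks):

    import os
    import pandas as pd

    # Illustrative base path; adjust to your local MIMIC-IV download.
    core_mimiciv_path = './data/HAIM/physionet/files/mimiciv/1.0/'

    # Pre-release note files that require direct PhysioNet permission.
    optional_note_files = ['ds_icustay.csv', 'ecg_icustay.csv',
                           'echo_icustay.csv', 'rad_icustay.csv']

    note_tables = {}
    for fname in optional_note_files:
        fpath = os.path.join(core_mimiciv_path, fname)
        if os.path.exists(fpath):
            note_tables[fname] = pd.read_csv(fpath)
        else:
            # Skip gracefully if you do not have pre-release access.
            print('Skipping ' + fname + ' (not available)')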

2 - Modeling of our three tasks: mortality prediction, length-of-stay prediction, and chest pathology classification

3 - Result Generation: reporting of the AUROC, AUPRC, and F1 scores, as well as code to generate the plots reported in the paper.

Please be advised that sufficient RAM, or access to a cluster for parallel processing, is needed to run these experiments.

UPDATE (Jan. 6, 2023)

The radiology and discharge notes for MIMIC-IV have been officially released at: https://physionet.org/content/mimic-iv-note/2.2/note/#files-panel

UPDATE (Jun. 12, 2023)

For the publication, our team generated the file 'mimic-cxr-2.0.0-jpeg-txt.csv' by compiling an early-release version of the clinical notes and the text associated with the CXR images corresponding to MIMIC-IV. We wanted to add this file to the repository, but the PhysioNet.org data policy states that we cannot directly share this compiled data via GitHub; only PhysioNet has permission to distribute it or subsets of it. This means users need to generate their own mimic-cxr-2.0.0-jpeg-txt.csv from the released notes and CXR files on PhysioNet.org once all notes are released. The dataset structure can be inferred from the code; a rough sketch is given below. As of June 12, 2023, PhysioNet has not fully released these notes, but they likely plan to do so as part of their full release of MIMIC-IV. We are very sorry for any inconvenience this may cause.
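
As a rough illustration of what such a reconstruction could look like once the notes are out (a sketch only: 'cxr_reports.csv' is a hypothetical per-study table of report text, and the real schema must be inferred from the repository code):

    import pandas as pd

    # One row per JPG image in MIMIC-CXR-JPG v2.0.0.
    meta = pd.read_csv('mimic-cxr-2.0.0-metadata.csv')

    # Hypothetical table mapping (subject_id, study_id) to report text,
    # to be built from the officially released free-text notes.
    reports = pd.read_csv('cxr_reports.csv')

    # Join image-level metadata with study-level text and save in the
    # layout expected by the notebooks (columns inferred from the code).
    merged = meta.merge(reports, on=['subject_id', 'study_id'], how='left')
    merged.to_csv('mimic-cxr-2.0.0-jpeg-txt.csv', index=False)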


haim's Issues

Experiment results

Hello!
I am trying to reproduce some of the results from your paper. In particular, I am interested in producing a plot like the one below to find out which combinations of modalities justify a multimodal approach compared to a visual-only one.
[figure: haim_roc — AUROC by modality combination]

For example, for fracture, the task with the least data, I was able to get a 5-fold cross-validation test average macro AUROC of about 0.78 for the unimodal model (fusing per-image and multi-image dense visual embeddings), but when I add new (and less informative) modalities to it, the results stay almost the same (sometimes a bit better, sometimes a bit worse), perhaps because XGBoost handles the curse of dimensionality well. Since the number of combinations of input modalities is high (1,023), I only tested a subset, but I could not get close to 0.84 in average macro AUROC.

Could you please share supporting information about the plot above, e.g., what combination of modalities is considered typical?

Also, a question about the number of experiments performed in the article.
I understand how you got 1023 as the number of possible models for the pathology diagnosis tasks: 1023 = number of models with 1 modality + number of models with 2 modalities + number of models with 3 modalities + number of models with 4 modalities.

Here, the number of models with 1 modality is calculated from the number of combinations of the corresponding sources:
Tabular: 1
Time series: C(3,1) + C(3,2) + C(3,3) = 3 + 3 + 1 = 7
Notes (excluding radiology): C(2,1) + C(2,2) = 2 + 1 = 3
Visual: C(4,1) + C(4,2) + C(4,3) + C(4,4) = 4 + 6 + 4 + 1 = 15

Total: 26

And so on, up to 4 modalities. I also get a total of 1023 experiments.
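
For reference, this is the quick script I used to check my counting (treating every non-empty subset of the 10 sources as one model):

    from itertools import combinations
    from math import prod

    # Sources per modality, radiology notes excluded (pathology tasks).
    sources = {'tabular': 1, 'time_series': 3, 'notes': 2, 'visual': 4}

    # Direct count: all non-empty subsets of the 10 sources.
    print(2 ** sum(sources.values()) - 1)  # 1023

    # Same total, summed over modality combinations as in my breakdown.
    nonempty = {m: 2 ** k - 1 for m, k in sources.items()}  # 1, 7, 3, 15
    total = sum(prod(nonempty[m] for m in group)
                for r in range(1, 5)
                for group in combinations(nonempty, r))
    print(total)  # 1023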

However, I don't get the same number of experiments for the 48-hour length-of-stay and mortality prediction tasks, for which the difference is that radiology notes are included.
Could you please explain how you get 2047 (2046)?

Thank you!

Some files referenced in the repo are missing from the MIMIC datasets?

Hello,

First, Thanks for your great project and code.

While running the 1_Generate_HAIM-MIMIC-MM file, I got this error:

FileNotFoundError: [Errno 2] No such file or directory: './data/HAIM/physionet/files/mimiciv/1.0/mimic-cxr-jpg/2.0.0/mimic-cxr-2.0.0-jpeg-txt.csv'

The path and other things are correct and I have downloaded and extracted the following datasets as mentioned:
https://physionet.org/content/mimiciv/1.0/
https://physionet.org/content/mimic-cxr-jpg/2.0.0/

But in the second link, there is no file named "mimic-cxr-2.0.0-jpeg-txt.csv". How can I access it? And is this MIMIC-CXR version the same one you ran your code on?

Thanks in advance

Number of training samples

Hi,

I'm currently trying to generate your dataset; however, the number of embeddings I get does not match yours. I managed to create all 34,537 pickle files. Then, as I understood it, in "Generate Embeddings from Pickle Files" you iterate over all CXR images available within a patient stay and generate a row in the embedding CSV for each of the images. For me, this leads to over 125,000 rows; however, the embedding file you provided has only 45,050 rows (which also matches the number of samples for mortality and discharge prediction you mention in your paper).

Do you have any idea what the issue could be?
For example, do you use every image of a patient as a single sample, including each view from the same study?

Thanks a lot in advance!

Patient vs Admission level aggregation

I have looked at the code that is used to generate the multimodal dataset. From my understanding, all data except the CXR scans is aggregated at the level of unique admissions (based on hadm_id), but the CXR scans are aggregated at the patient level (based on subject_id), meaning that CXR scans belong to a patient and all of that patient's admissions rather than being attributed to a specific admission.
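
A minimal pandas sketch of what I mean (column names follow the standard MIMIC releases; the file paths are placeholders):

    import pandas as pd

    # MIMIC-IV admissions: one row per hospital admission (hadm_id).
    adm = pd.read_csv('admissions.csv')

    # MIMIC-CXR-JPG metadata: one row per image, keyed only by
    # subject_id and study_id -- there is no hadm_id column.
    cxr = pd.read_csv('mimic-cxr-2.0.0-metadata.csv')

    # Joining on subject_id alone attaches every CXR study of a patient
    # to each of that patient's admissions:
    merged = adm.merge(cxr, on='subject_id', how='inner')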

Can you confirm if I am right with my assumption?
Thank you!

Input window for time-series data

Hi,

Thanks for your exciting work!

I was wondering if all data throughout the patient's stay is used to form the patient embedding.

Especially for mortality and discharge prediction, the paper mentions that the labels are defined relative to patient admission. Does this mean no time-series data is used, since it does not yet exist for the patient at admission time? Or is the entire time series used? If the complete data is used, wouldn't the length of the time-series records alone correlate strongly with the final output label?

Thanks a lot in advance,

Chantal

How to use the repo when files are missing

Hi! Due to the missing file 'mimic-cxr-2.0.0-jpeg-txt.csv', I intend to use the extracted HAIM embeddings you've posted at https://physionet.org/content/haim-multimodal/1.0.1/. When I use 'Sample_Multimodal_Patient_Files' in '2_Generate Embeddings from Pickle Files.ipynb', it reports an error: ModuleNotFoundError: No module named 'src'. I have already downloaded all the files from that page, and I want to know how to use them in this repo.
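
For now I am working around the import error like this (a sketch, assuming the notebook expects the repository's helper modules to be importable from the repo root):

    import os
    import sys

    # Adjust to wherever the haim repository was cloned.
    repo_root = os.path.abspath('.')

    if repo_root not in sys.path:
        sys.path.insert(0, repo_root)

    # After this, `import src...` statements should resolve, provided
    # the helper code actually lives under a top-level src/ package.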

Can't locate file

In your code, you refer to a file named "mimic-cxr-2.0.0-jpeg-txt.csv", which seems to be part of the MIMIC-CXR-JPG dataset. However, I can't find this file anywhere in the PhysioNet data.
Could you point me to where I can find it?
Thank you

Missing file

To generate HAIM-MIMIC-MM data using 1_Generate HAIM-MIMIC-MM from downloaded MIMIC-IV and MIMIC-CXR.ipynb, it seems that the file mimic-cxr-2.0.0-jpeg-txt.csv, which you load to get image paths plus some extras via df_mimic-cxr_jpg = pd.read_csv(core_mimiciv_path + 'mimic-cxr-jpg/2.0.0/mimic-cxr-2.0.0-jpeg-txt.csv'), is actually missing from the MIMIC-CXR-JPG v2.0.0 database available at physionet.org.

How can I access this file?

Could not find "AUPRC_All_Modality_Resources. csv"

Hello,
First, Thanks for your great project and code.
But when I tried running the 2_ and 3_ scripts, the program could not find "AUPRC_All_Modality_Resources.csv", and I could not find the step that generates this file anywhere in the project. What should I do?
Thanks again!

Broken embeddings file on PhysioNet?

This might not be the right place for this issue, as it is about the data that you published on PhysioNet rather than the code you published here, so I would like to apologize in advance for misusing GitHub to bring this up:

I am having trouble with loading the cxr_ic_fusion_1103.csv file, i.e. the extracted HAIM embeddings, from your PhysioNet repository (https://doi.org/10.13026/3f8d-qe93), in particular with the last two lines:

  • Both of the last two lines hold 7173 entries, while all others hold 6405 entries. In other words, there are 768 entries more in the last two lines than in all others.
  • Moreover, both of the last two lines hold three consecutive runs of exactly repeating elements, starting from index 13 (zero-based) and having a length of 768 entries each, with no gaps (so the starting indices of the repetitions are 781 and 1549, respectively).

My first guess would have been that one embedding vector has been repeated accidentally, but this does not make sense as (1) there are three repetitions of 768 elements in each of the two lines, while the lines in total are only 768 elements longer and (2) the starting position at index 13 does not make any sense semantically if one looks at the header (line 0).

So my questions are: (1) Is this a known problem? (2) Is there anything that I can do to reconstruct the last two lines if I want to use all embeddings, or should I just ignore the last two lines? I checked the SHA256 hash of the file by the way, so the download should have not caused the problem.

Update: Just to clarify, by "exactly repeating elements" I do not mean that the entries at indices 13, 14, 15, … all have the same value, but that the entry at index 13 has the same value as the entries at index 781 and 1549, the entry at index 14 has the same value as the entries at index 782 and 1550, and so on.
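
For reproducibility, this is the check I ran (a quick sketch using the indices described above):

    import csv

    # Load the embeddings file and compare the three suspected
    # 768-entry runs starting at indices 13, 781, and 1549 in each
    # of the last two rows.
    with open('cxr_ic_fusion_1103.csv') as f:
        rows = list(csv.reader(f))

    for row in rows[-2:]:
        run_a = row[13:13 + 768]
        run_b = row[781:781 + 768]
        run_c = row[1549:1549 + 768]
        # Expect: 7173 entries per row, and all three runs identical.
        print(len(row), run_a == run_b, run_b == run_c)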

Request for extracted dataset

Thanks for your excellent work! The MIMIC-CXR-JPG dataset is very large, and it is difficult to download in some situations. Could you provide the extracted dataset produced after step 1? I would be very grateful.

Definitions of records

Hi,

I'm trying to create the dataset you used for the research. In MIMIC-CXR there are only the subject_id and study_id identifiers, but in MIMIC-IV there are subject_id, hadm_id, and stay_id. What is the correct way to link images from MIMIC-CXR to data from MIMIC-IV? I understand from your article that records are defined by {subject_id, hadm_id, stay_id}, but I don't understand how I should select the matched images for each record.
Perhaps you can describe in detail what a single record is composed of, in terms of data from all sources?
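
For context, this is how I currently try to match them (a sketch; I am not sure it reflects your approach): link on subject_id and keep studies whose acquisition time falls within the ICU stay window.

    import pandas as pd

    # ICU stays from MIMIC-IV: subject_id, hadm_id, stay_id, intime, outtime.
    stays = pd.read_csv('icustays.csv', parse_dates=['intime', 'outtime'])

    # Image metadata from MIMIC-CXR-JPG: subject_id, study_id,
    # StudyDate (YYYYMMDD) and StudyTime (HHMMSS.ffffff).
    meta = pd.read_csv('mimic-cxr-2.0.0-metadata.csv')

    # Build a study timestamp from the de-identified date and time.
    stamp = (meta['StudyDate'].astype(str) + ' ' +
             meta['StudyTime'].astype(float).astype(int)
                              .astype(str).str.zfill(6))
    meta['study_dt'] = pd.to_datetime(stamp, format='%Y%m%d %H%M%S')

    # Candidate pairs share a subject_id; keep studies inside the stay.
    cand = stays.merge(meta, on='subject_id')
    linked = cand[cand['study_dt'].between(cand['intime'], cand['outtime'])]
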
Thank you

Issues loading biobert

Hey guys, I was trying to run the same experiments you performed, but I've run into an error when running the file 1_1-Create Pickle Files.py:
404 Client Error: Not Found for url: https://huggingface.co/pretrained_bert_tf/biobert_pretrain_output_all_notes_150000//resolve/main/config.json
Traceback (most recent call last):
File "/home/saia/programfiles/anaconda3/envs/haim/lib/python3.6/site-packages/transformers/configuration_utils.py", line 520, in get_config_dict
user_agent=user_agent,
File "/home/saia/programfiles/anaconda3/envs/haim/lib/python3.6/site-packages/transformers/file_utils.py", line 1371, in cached_path
local_files_only=local_files_only,
File "/home/saia/programfiles/anaconda3/envs/haim/lib/python3.6/site-packages/transformers/file_utils.py", line 1534, in get_from_cache
r.raise_for_status()
File "/home/saia/programfiles/anaconda3/envs/haim/lib/python3.6/site-packages/requests/models.py", line 943, in raise_for_status
raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://huggingface.co/pretrained_bert_tf/biobert_pretrain_output_all_notes_150000//resolve/main/config.json

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "1_1-Create Pickle Files.py", line 28, in
from MIMIC_IV_HAIM_API import *
File "/home/saia/files/HAIM/MIMIC_IV_HAIM_API.py", line 114, in
biobert_tokenizer = AutoTokenizer.from_pretrained(biobert_path)
File "/home/saia/programfiles/anaconda3/envs/haim/lib/python3.6/site-packages/transformers/models/auto/tokenization_auto.py", line 534, in from_pretrained
config = AutoConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)
File "/home/saia/programfiles/anaconda3/envs/haim/lib/python3.6/site-packages/transformers/models/auto/configuration_auto.py", line 450, in from_pretrained
config_dict, _ = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
File "/home/saia/programfiles/anaconda3/envs/haim/lib/python3.6/site-packages/transformers/configuration_utils.py", line 532, in get_config_dict
raise EnvironmentError(msg)
OSError: Can't load config for 'pretrained_bert_tf/biobert_pretrain_output_all_notes_150000/'. Make sure that:

  • 'pretrained_bert_tf/biobert_pretrain_output_all_notes_150000/' is a correct model identifier listed on 'https://huggingface.co/models'

  • or 'pretrained_bert_tf/biobert_pretrain_output_all_notes_150000/' is the correct path to a directory containing a config.json file

It seems that the script is trying to load a BioBERT model from Hugging Face's model hub, but the specified path is not found. Do you know why this might be happening?
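
For what it's worth, I can get past the error by pointing biobert_path at a checkpoint that actually exists: either a local directory containing the downloaded BioBERT weights, or a public Hugging Face model ID (sketch below; I am not certain that emilyalsentzer/Bio_ClinicalBERT is the exact checkpoint used for the paper):

    from transformers import AutoModel, AutoTokenizer

    # 'pretrained_bert_tf/biobert_pretrain_output_all_notes_150000/'
    # looks like a local directory path, not a Hub model ID, so it must
    # contain the downloaded weights and a config.json. Alternatively,
    # point at a public clinical BERT checkpoint (assumption: this may
    # differ from the weights the authors actually used).
    biobert_path = 'emilyalsentzer/Bio_ClinicalBERT'
    biobert_tokenizer = AutoTokenizer.from_pretrained(biobert_path)
    biobert_model = AutoModel.from_pretrained(biobert_path)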
