AI for radiographic COVID-19 detection selects shortcuts over signal

Code for paper "AI for radiographic COVID-19 detection selects shortcuts over signal". Please read our preprint at the following link: https://doi.org/10.1101/2020.09.13.20193565

Datasets can be downloaded at the following links:
Dataset I
Cohen et al. Covid-Chestxray-Dataset
ChestXray-14

Dataset II
BIMCV-COVID-19 +
PadChest

Dataset III
BIMCV-COVID-19 +
BIMCV-COVID-19 −

System requirements

This software was originally designed and run on a system running CentOS 7.8.2003, with Python 3.8, PyTorch 1.4, and CUDA 10.1. For a full list of software packages and version numbers, see the Conda environment file environment.yml.

This software leverages graphical processing units (GPUs) to accelerate neural network training and evaluation; systems lacking a suitable GPU will likely take an extremely long time to train or evaluate models. The software was tested with the NVIDIA RTX 2080 TI GPU, though we anticipate that other GPUs will also work, provided that the unit offers sufficient memory (networks may consume upward of 8 GB).

Installation guide

We recommend installation of the required packages using the Conda package manager, available through the Anaconda Python distribution. Anaconda is available free of charge for non-commercial use through Anaconda Inc. After installing Anaconda and cloning this repository, use the conda command to install necessary packages: conda env create -f environment.yml

Total install time is approximately 30 minutes, including 15 minutes for installation of Anaconda and 15 minutes for installation of the required packages. Beyond downloading this repository, no addition time is required for its installation.

Setting up the datasets

While we provide code to load radiographs and associated metadata for training a deep-learning model, you will first need to download images from the above repositories. Be aware that these repositories amount to multiple terabytes of data.

Organize the downloaded data as follows:

./data/
    ChestX-ray14/
        labels/
            Data_Entry_2017.csv
            test_list.txt
            train_val_list.txt
        images/
            (many image files)
    GitHub-COVID/
        metadata.csv
        images/
            (many image files)
    PadChest/
        PADCHEST_chest_x_ray_images_labels_160K_01.02.19.csv
        images/
            (many image files)
    bimcv+ 
        participants.tsv
        derivatives/
            labels/
                labels_covid19_posi.tsv
        sub-S0*/
            (subdirectories containing png images and json metadata)
    bimcv-
        participants.tsv
        derivatives/
            labels
                labels_SARS-cov-2_nega.tsv
        sub-S0*/
            (subdirectories containing png images and json metadata)

ChestX-ray14

Download the files listed above under ./data/ChestX-ray14/. You will need to download and extract all of the zip files from the images directory and organize all of the images into a single directory (./data/Chestx-ray14/images). Note that some file names may change (e.g., Data_Entry_2017.csv may have been renamed to Data_Entry_2017_v2020.csv depending on your download date). It is important that you rename files to match the above scheme.

Cohen et al. Covid-Chestxray-Dataset (a.k.a. "GitHub-COVID" in our manuscript)

Simply clone the repository, check out the following specific revision (git checkout 9b9c2d5) and rename the directory as ./data/GitHub-COVID.

PadChest

Download each of the image zip files as well as the csv file containing metadata. Extract all of the images and organize them into a single directory at ./data/PadChest/images.

BIMCV-COVID19+

Download all of the zip files, which contain both the images and metadata. Place all of the zip files in ./data/bimcv+ and extract them. You should end up with a subdirectory named "derivatives" which includes some of the metadata, as well as many folders named "sub-SXXXXX" (where XXXXX is a number) which contain the images and more metadata.

Since the json files that contain metadata regarding the BIMCV-COVID-19+ radiographs can be unwieldy to work with, parse them to create a csv file that contains key metadata:

cd ./data
python make_csv_bimcv_positive.py

BIMCV-COVID19−

The download process is similar to that of BIMCV-COVID19+. Download all of the zip files, which contain both the images and metadata. Place all of the zip files in ./data/bimcv- and extract them. You should end up with a subdirectory named "derivatives" which includes some of the metadata, as well as many folders named "sub-SXXXXX" (where XXXXX is a number) which contain the images and more metadata.

Since the json files that contain metadata regarding the BIMCV-COVID-19- radiographs can be unwieldy to work with, parse them to create a csv file that contains key metadata:

cd ./data #(if not already in the ./data directory)
python make_csv_bimcv_negative.py

HDF5 Files

For improved data loading performance, create HDF5 files for the image repositories. Note that due to its small size, we do not provide scripts for loading the GitHub-COVID dataset from HDF5 files.

To generate the files, run the following commands:

cd ./data
python make_h5.py -i ChestX-ray14 -o ChestX-ray14/chestxray14.h5
python make_h5.py -i PadChest -o PadChest/padchest.h5
python make_h5.py -i bimcv+ -o bimcv+/bimcv+.h5 
python make_h5.py -i bimcv- -o bimcv-/bimcv-.h5

Check to make sure the output files are organized as follows:

data/
    ChestX-ray14/
        chestxray14.h5
    PadChest/
        padchest.h5
    bimcv+ 
        bimcv+.h5
    bimcv-
        bimcv-.h5

Training the models

After setting up the datasets, train models using the train_covid.py script. This script works via the command line; for more information on using the script, run python train_covid.py --help. The expected training time for a single replicate on an NVIDIA RTX 2080 TI is approximately 5 hours.

Evaluating the models

Once you have trained models on both datasets, evaluate the models using the script roc.py. This will calculate receiver operating characteristic curves for both internal and external test data. First, edit the "options" section of roc.py to match the output paths from model training; the checkpoint files may be found in ./checkpoints. Then, call python roc.py to generate the ROC curves. The outputs of the roc.py script are expected to be similar to Fig. 1c in our manuscript.

To examine the performance of models trained on dataset III, you will need to use the separate roc_bimcv.py script. Similar to the main roc.py script, open the file and edit the options section to point to the checkpoint files of your models trained on dataset III. Then, call python roc_bimcv.py to generate the ROC curves. The outputs of the roc.py script are expected to be similar to Fig. 5 in our manuscript.

Update githubcovid.py

Error Replication
Running python train_models.py --dataset 1 gives the following Error

Traceback (most recent call last):                                                                                                                    
  File "train_covid.py", line 202, in <module>
    main()
  File "train_covid.py", line 193, in main
    train_githubcxr14(args.seed, 
  File "train_covid.py", line 52, in train_githubcxr14
    classifier.train(trainds,
  File "/uss/xrai/nick_folder/cxr_covid/models/cxrclassifier.py", line 178, in train
    valloss, valauroc = self._val_epoch(val_dataloader)
  File "/uss/xrai/nick_folder/cxr_covid/models/cxrclassifier.py", line 262, in _val_epoch
    auroc = sklearn.metrics.roc_auc_score(true[:,-1], probs[:,-1])
  File "/datasets/home/00/300/nil021/.conda/envs/cxr_covid/lib/python3.8/site-packages/sklearn/metrics/_ranking.py", line 387, in roc_auc_score
    return _average_binary_score(partial(_binary_roc_auc_score,
  File "/datasets/home/00/300/nil021/.conda/envs/cxr_covid/lib/python3.8/site-packages/sklearn/metrics/_base.py", line 77, in _average_binary_score
    return binary_metric(y_true, y_score, sample_weight=sample_weight)
  File "/datasets/home/00/300/nil021/.conda/envs/cxr_covid/lib/python3.8/site-packages/sklearn/metrics/_ranking.py", line 221, in _binary_roc_auc_score
    raise ValueError("Only one class present in y_true. ROC AUC score "
ValueError: Only one class present in y_true. ROC AUC score is not defined in that case.

Problem Identification
Looking at the metadata, we see that values in finding columns may have been updated to new values.
Github-COVID feature engineering datasets/githubcovid.py needs to be updated
Current solution gives false for every datapoint, because of line 71: covid_set = ['COVID-19','COVID-19, ARDS']

Solution
Patients w/ COVID now have the following string 'Pneumonia/Viral/COVID-19' instead of 'COVID-19','COVID-19, ARDS']
Pneumonia patients and healthy set must also be updated to correspond with the new changes

suinleelab / cxr_covid Goto Github PK

cxr_covid's Introduction

AI for radiographic COVID-19 detection selects shortcuts over signal

System requirements

Installation guide

Setting up the datasets

ChestX-ray14

Cohen et al. Covid-Chestxray-Dataset (a.k.a. "GitHub-COVID" in our manuscript)

PadChest

BIMCV-COVID19+

BIMCV-COVID19−

HDF5 Files

Training the models

Evaluating the models

cxr_covid's People

Contributors

Stargazers

Watchers

Forkers

cxr_covid's Issues

Recommend Projects

Recommend Topics

Recommend Org

Jobs