suinleelab / cxr_covid

Code for paper "AI for radiographic COVID-19 detection selects shortcuts over signal"

License: Other

Python 39.51% Jupyter Notebook 60.49%

cxr_covid's Introduction

AI for radiographic COVID-19 detection selects shortcuts over signal

Code for paper "AI for radiographic COVID-19 detection selects shortcuts over signal". Please read our preprint at the following link: https://doi.org/10.1101/2020.09.13.20193565

Datasets can be downloaded at the following links:
Dataset I
Cohen et al. Covid-Chestxray-Dataset
ChestXray-14

Dataset II
BIMCV-COVID-19 +
PadChest

Dataset III
BIMCV-COVID-19 +
BIMCV-COVID-19 −

System requirements

This software was originally designed and run on a system running CentOS 7.8.2003, with Python 3.8, PyTorch 1.4, and CUDA 10.1. For a full list of software packages and version numbers, see the Conda environment file environment.yml.

This software leverages graphics processing units (GPUs) to accelerate neural network training and evaluation; systems lacking a suitable GPU will likely take an extremely long time to train or evaluate models. The software was tested with the NVIDIA RTX 2080 Ti GPU, though we anticipate that other GPUs will also work, provided that the unit offers sufficient memory (networks may consume upward of 8 GB).
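
Before starting a long training run, it can help to confirm that PyTorch actually sees a GPU with enough memory. The following is a minimal sketch and not part of this repository: the helper name describe_gpu is hypothetical, and the 8 GiB default simply mirrors the memory figure quoted above.

```python
def describe_gpu(min_gib: float = 8.0) -> str:
    """Return a one-line summary of the first CUDA device, if any.

    Hypothetical helper (not part of cxr_covid); min_gib mirrors the
    8 GB figure mentioned in the system requirements.
    """
    try:
        import torch  # imported lazily so the check also runs without PyTorch
    except ImportError:
        return "PyTorch is not installed in this environment."

    if not torch.cuda.is_available():
        return "No CUDA-capable GPU detected; training would run on the CPU."

    props = torch.cuda.get_device_properties(0)
    total_gib = props.total_memory / 1024 ** 3
    verdict = "meets" if total_gib >= min_gib else "is below"
    return f"{props.name}: {total_gib:.1f} GiB, which {verdict} the {min_gib:g} GiB guideline."


print(describe_gpu())
```

Run it inside the Conda environment described below, so that the same PyTorch build used for training is the one being queried.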

Installation guide

We recommend installation of the required packages using the Conda package manager, available through the Anaconda Python distribution. Anaconda is available free of charge for non-commercial use through Anaconda Inc. After installing Anaconda and cloning this repository, use the conda command to install necessary packages:

conda env create -f environment.yml

Total install time is approximately 30 minutes, including 15 minutes for installation of Anaconda and 15 minutes for installation of the required packages. Beyond downloading this repository, no additional time is required for its installation.

Setting up the datasets

While we provide code to load radiographs and associated metadata for training a deep-learning model, you will first need to download images from the above repositories. Be aware that these repositories amount to multiple terabytes of data.

Organize the downloaded data as follows:

./data/
    ChestX-ray14/
        labels/
            Data_Entry_2017.csv
            test_list.txt
            train_val_list.txt
        images/
            (many image files)
    GitHub-COVID/
        metadata.csv
        images/
            (many image files)
    PadChest/
        PADCHEST_chest_x_ray_images_labels_160K_01.02.19.csv
        images/
            (many image files)
    bimcv+/
        participants.tsv
        derivatives/
            labels/
                labels_covid19_posi.tsv
        sub-S0*/
            (subdirectories containing png images and json metadata)
    bimcv-/
        participants.tsv
        derivatives/
            labels/
                labels_SARS-cov-2_nega.tsv
        sub-S0*/
            (subdirectories containing png images and json metadata)
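
Because a misplaced file only surfaces later as a loading error, it may be worth checking the layout before going further. The sketch below is not part of the repository; the file and directory names are copied from the tree above, while the function name find_missing is an illustrative assumption.

```python
from pathlib import Path

# Relative paths copied from the layout above; the sub-S0* folders are
# matched separately because their names vary.
EXPECTED_FILES = [
    "ChestX-ray14/labels/Data_Entry_2017.csv",
    "ChestX-ray14/labels/test_list.txt",
    "ChestX-ray14/labels/train_val_list.txt",
    "GitHub-COVID/metadata.csv",
    "PadChest/PADCHEST_chest_x_ray_images_labels_160K_01.02.19.csv",
    "bimcv+/participants.tsv",
    "bimcv+/derivatives/labels/labels_covid19_posi.tsv",
    "bimcv-/participants.tsv",
    "bimcv-/derivatives/labels/labels_SARS-cov-2_nega.tsv",
]
EXPECTED_DIRS = ["ChestX-ray14/images", "GitHub-COVID/images", "PadChest/images"]


def find_missing(data_root):
    """Return the expected paths that do not exist under data_root."""
    root = Path(data_root)
    missing = [p for p in EXPECTED_FILES if not (root / p).is_file()]
    missing += [p for p in EXPECTED_DIRS if not (root / p).is_dir()]
    for repo in ("bimcv+", "bimcv-"):
        repo_dir = root / repo
        # Each BIMCV repository should contain at least one subject folder.
        if not (repo_dir.is_dir() and any(repo_dir.glob("sub-S0*"))):
            missing.append(f"{repo}/sub-S0*")
    return missing
```

Calling find_missing("./data") from the repository root returns an empty list when everything listed above is in place; otherwise it returns the relative paths that still need attention. It checks only for presence, not for file contents or image counts.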

ChestX-ray14

Download the files listed above under ./data/ChestX-ray14/. You will need to download and extract all of the zip files from the images directory and organize all of the images into a single directory (./data/ChestX-ray14/images). Note that some file names may change (e.g., Data_Entry_2017.csv may have been renamed to Data_Entry_2017_v2020.csv depending on your download date). It is important that you rename any such files to match the above scheme.
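
The two manual steps above (collecting the extracted images into one directory, and restoring the expected label file name) can be sketched as follows. This is an illustrative helper rather than a script shipped with the repository: the function names are hypothetical, and the .png suffix is an assumption about the extracted image files.

```python
import shutil
from pathlib import Path


def flatten_images(extracted_root, images_dir, suffix=".png"):
    """Move every image found under extracted_root into images_dir.

    Returns the number of files moved. A file whose name already exists
    in images_dir is left where it is rather than overwritten.
    """
    target = Path(images_dir)
    target.mkdir(parents=True, exist_ok=True)
    moved = 0
    # sorted() materializes the listing before any file is moved.
    for src in sorted(Path(extracted_root).rglob(f"*{suffix}")):
        dest = target / src.name
        if dest.exists():
            continue
        shutil.move(str(src), str(dest))
        moved += 1
    return moved


def restore_label_name(labels_dir, downloaded="Data_Entry_2017_v2020.csv",
                       expected="Data_Entry_2017.csv"):
    """Rename a re-versioned label file to the name the layout expects."""
    src = Path(labels_dir) / downloaded
    dest = Path(labels_dir) / expected
    if src.is_file() and not dest.exists():
        src.rename(dest)
        return True
    return False
```

For example, flatten_images("./data/ChestX-ray14/extracted", "./data/ChestX-ray14/images") would gather the images from a hypothetical extracted/ folder, and restore_label_name("./data/ChestX-ray14/labels") would rename the label file only if the newer name is present and the expected one is not.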

Cohen et al. Covid-Chestxray-Dataset (a.k.a. "GitHub-COVID" in our manuscript)

Simply clone the repository, check out the following specific revision (git checkout 9b9c2d5) and rename the directory as ./data/GitHub-COVID.

PadChest

Download each of the image zip files as well as the csv file containing metadata. Extract all of the images and organize them into a single directory at ./data/PadChest/images.

BIMCV-COVID19+

Download all of the zip files, which contain both the images and metadata. Place all of the zip files in ./data/bimcv+ and extract them. You should end up with a subdirectory named "derivatives" which includes some of the metadata, as well as many folders named "sub-SXXXXX" (where XXXXX is a number) which contain the images and more metadata.

Since the json files that contain metadata regarding the BIMCV-COVID-19+ radiographs can be unwieldy to work with, parse them to create a csv file that contains key metadata:

cd ./data
python make_csv_bimcv_positive.py 

BIMCV-COVID19−

The download process is similar to that of BIMCV-COVID19+. Download all of the zip files, which contain both the images and metadata. Place all of the zip files in ./data/bimcv- and extract them. You should end up with a subdirectory named "derivatives" which includes some of the metadata, as well as many folders named "sub-SXXXXX" (where XXXXX is a number) which contain the images and more metadata.

Since the json files that contain metadata regarding the BIMCV-COVID-19- radiographs can be unwieldy to work with, parse them to create a csv file that contains key metadata:

cd ./data #(if not already in the ./data directory)
python make_csv_bimcv_negative.py 

HDF5 Files

For improved data loading performance, create HDF5 files for the image repositories. Because the GitHub-COVID dataset is small, we do not provide scripts for loading it from HDF5 files.

To generate the files, run the following commands:

cd ./data
python make_h5.py -i ChestX-ray14 -o ChestX-ray14/chestxray14.h5
python make_h5.py -i PadChest -o PadChest/padchest.h5
python make_h5.py -i bimcv+ -o bimcv+/bimcv+.h5 
python make_h5.py -i bimcv- -o bimcv-/bimcv-.h5 

Check to make sure the output files are organized as follows:

data/
    ChestX-ray14/
        chestxray14.h5
    PadChest/
        padchest.h5
    bimcv+/
        bimcv+.h5
    bimcv-/
        bimcv-.h5

Training the models

After setting up the datasets, train models using the train_covid.py script. This script works via the command line; for more information on using the script, run python train_covid.py --help. The expected training time for a single replicate on an NVIDIA RTX 2080 Ti is approximately 5 hours.
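
If you plan to train several replicates, it can be convenient to generate the command lines up front. The sketch below only builds the commands and runs nothing. The --dataset flag appears in the issue reports further down this page, whereas --seed is an assumption inferred from args.seed in the traceback quoted there; confirm both with python train_covid.py --help. The function name replicate_commands is hypothetical.

```python
import sys


def replicate_commands(dataset, seeds, script="train_covid.py"):
    """Build one command line per seed; nothing is executed here.

    ASSUMPTION: the script accepts --dataset and --seed. Check
    `python train_covid.py --help` before relying on these flags.
    """
    return [
        [sys.executable, script, "--dataset", str(dataset), "--seed", str(seed)]
        for seed in seeds
    ]


for cmd in replicate_commands(dataset=1, seeds=range(3)):
    print(" ".join(cmd))
```

Each list can then be passed to subprocess.run(cmd, check=True) to run the replicates one after another. At roughly 5 hours per replicate on the hardware quoted above, three replicates would take about 15 hours.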

Evaluating the models

Once you have trained models on both datasets, evaluate the models using the script roc.py. This will calculate receiver operating characteristic curves for both internal and external test data. First, edit the "options" section of roc.py to match the output paths from model training; the checkpoint files may be found in ./checkpoints. Then, call python roc.py to generate the ROC curves. The outputs of the roc.py script are expected to be similar to Fig. 1c in our manuscript.

To examine the performance of models trained on dataset III, you will need to use the separate roc_bimcv.py script. As with the main roc.py script, open the file and edit the options section to point to the checkpoint files of your models trained on dataset III. Then, call python roc_bimcv.py to generate the ROC curves. The outputs of the roc_bimcv.py script are expected to be similar to Fig. 5 in our manuscript.

cxr_covid's People

Contributors

ajd98, jjanizek

cxr_covid's Issues

Update githubcovid.py

Error Replication
Running python train_covid.py --dataset 1 gives the following error:

Traceback (most recent call last):                                                                                                                    
  File "train_covid.py", line 202, in <module>
    main()
  File "train_covid.py", line 193, in main
    train_githubcxr14(args.seed, 
  File "train_covid.py", line 52, in train_githubcxr14
    classifier.train(trainds,
  File "/uss/xrai/nick_folder/cxr_covid/models/cxrclassifier.py", line 178, in train
    valloss, valauroc = self._val_epoch(val_dataloader)
  File "/uss/xrai/nick_folder/cxr_covid/models/cxrclassifier.py", line 262, in _val_epoch
    auroc = sklearn.metrics.roc_auc_score(true[:,-1], probs[:,-1])
  File "/datasets/home/00/300/nil021/.conda/envs/cxr_covid/lib/python3.8/site-packages/sklearn/metrics/_ranking.py", line 387, in roc_auc_score
    return _average_binary_score(partial(_binary_roc_auc_score,
  File "/datasets/home/00/300/nil021/.conda/envs/cxr_covid/lib/python3.8/site-packages/sklearn/metrics/_base.py", line 77, in _average_binary_score
    return binary_metric(y_true, y_score, sample_weight=sample_weight)
  File "/datasets/home/00/300/nil021/.conda/envs/cxr_covid/lib/python3.8/site-packages/sklearn/metrics/_ranking.py", line 221, in _binary_roc_auc_score
    raise ValueError("Only one class present in y_true. ROC AUC score "
ValueError: Only one class present in y_true. ROC AUC score is not defined in that case.

Problem Identification
The metadata shows that the values in the finding column have been updated, so the GitHub-COVID label handling in datasets/githubcovid.py needs to be updated to match.
The current code labels every datapoint as False because of line 71: covid_set = ['COVID-19','COVID-19, ARDS']

Solution
Patients with COVID-19 now carry the string 'Pneumonia/Viral/COVID-19' instead of one of ['COVID-19','COVID-19, ARDS'].
The pneumonia and healthy sets must also be updated to match the new values.
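
One way to make the COVID-19 check tolerant of both label schemes is sketched below. It is not the repository's code: the helper names is_covid and count_positives are hypothetical, and only the three label strings quoted in this issue are used. The updated pneumonia and healthy labels are not given here, so they are left out.

```python
# Exact labels used by the original code (line 71 of datasets/githubcovid.py,
# as quoted in the issue) plus the newer hierarchical label it reports.
LEGACY_COVID_LABELS = {"COVID-19", "COVID-19, ARDS"}
CURRENT_COVID_LABEL = "Pneumonia/Viral/COVID-19"


def is_covid(finding):
    """Return True if a `finding` value denotes COVID-19 under either scheme."""
    if not isinstance(finding, str):
        return False  # missing values (e.g. NaN) are treated as non-COVID
    finding = finding.strip()
    return finding in LEGACY_COVID_LABELS or finding == CURRENT_COVID_LABEL


def count_positives(findings):
    """Count COVID-19 positives; zero means every label would come out False."""
    return sum(is_covid(f) for f in findings)
```

Running count_positives over the finding column of metadata.csv before training gives a quick signal: a result of zero reproduces the condition that leads to the "Only one class present in y_true" error above.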

Update make_csv.py

Include these entries in series_description_map:

'TORAX  AP': 'AP',
'TÓRAX AP': 'AP',
'TORAX BIPE AP': 'AP',
'W034 TÓRAX LAT': 'LAT',
'W033 TÓRAX P.A.': 'PA',

Add another exception to series_description:

# Try each known tag for the series description, in order.
try:
    series_description = metadata['0008103E']['Value'][0]
except Exception:
    try:
        series_description = metadata['00081032']['Value'][0]['00080104']['Value'][0]
    except Exception:
        # Last fallback; any error here propagates to the caller.
        series_description = metadata['00185101']['Value'][0]
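
As more fallback tags accumulate, the nested try blocks become hard to extend. A loop over candidate paths is one alternative, sketched here under the assumption that metadata is the nested dict/list structure loaded from the JSON files. The tag paths are the three quoted above; the names lookup and get_series_description are hypothetical.

```python
# Tag paths quoted in the issue, tried in this order.
SERIES_DESCRIPTION_PATHS = [
    ("0008103E", "Value", 0),
    ("00081032", "Value", 0, "00080104", "Value", 0),
    ("00185101", "Value", 0),
]


def lookup(metadata, path):
    """Follow a sequence of keys/indices into nested dicts and lists."""
    node = metadata
    for key in path:
        node = node[key]
    return node


def get_series_description(metadata, paths=SERIES_DESCRIPTION_PATHS):
    """Return the value at the first path that resolves; raise KeyError if none do."""
    for path in paths:
        try:
            return lookup(metadata, path)
        except (KeyError, IndexError, TypeError):
            continue
    raise KeyError("no series description found in metadata")
```

Unlike the nested version, this sketch catches only lookup-related exceptions and raises a single KeyError when every path fails, so unrelated errors are not silently swallowed. Adding another fallback then means appending one tuple to SERIES_DESCRIPTION_PATHS.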

compile error: make_csv.py

(1)
Replicate Error: run make_csv.py

Error Message:

  File "make_csv.py", line 109
    contains_CR_DX = lambda x: return ('CR' in x) or ('DX' in x)
                               ^
SyntaxError: invalid syntax

Problem:
A lambda body must be a single expression; it cannot contain a return statement.

FIX:
Before: contains_CR_DX = lambda x: return ('CR' in x) or ('DX' in x)
After: contains_CR_DX = lambda x: ('CR' in x) or ('DX' in x)

(2)
Replicate Error: run make_csv.py

Error Message:

  File "make_csv.py", line 113
    is_dir = os.path.isdir(os.path.join(datapath, subject, sessionfile))
    ^
SyntaxError: invalid syntax

Problem:
The preceding line is missing a closing parenthesis, so the parser reports the error on the line that follows it.

FIX:
Before: image_candidates_dir = os.listdir(os.path.join(datapath, subject, sessionfile, 'mod-rx')
After: image_candidates_dir = os.listdir(os.path.join(datapath, subject, sessionfile, 'mod-rx'))
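
Both corrected statements can be exercised in isolation, as in the sketch below. The name contains_CR_DX and the path components come from the issue text; list_mod_rx is a hypothetical wrapper added here for illustration, and the lambda is written as a regular function, which is the more idiomatic form for a named helper.

```python
import os


def contains_CR_DX(x):
    """True if x (a string or a collection of strings) contains 'CR' or 'DX'."""
    return ('CR' in x) or ('DX' in x)


def list_mod_rx(datapath, subject, sessionfile):
    """List the 'mod-rx' folder of one session; note the balanced parentheses."""
    return os.listdir(os.path.join(datapath, subject, sessionfile, 'mod-rx'))
```

Syntax errors of this kind can also be caught without running the script, by compiling it first with python -m py_compile make_csv.py.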

python train_covid.py --dataset 1 returns a ROC AUC ValueError

Ran on environment provided.

-------- Epoch 000 --------
Traceback (most recent call last):                                                                                                               
  File "train_covid.py", line 202, in <module>
    main()
  File "train_covid.py", line 193, in main
    train_githubcxr14(args.seed, 
  File "train_covid.py", line 52, in train_githubcxr14
    classifier.train(trainds,
  File "/uss/xrai/nick_folder/cxr_covid/models/cxrclassifier.py", line 178, in train
    valloss, valauroc = self._val_epoch(val_dataloader)
  File "/uss/xrai/nick_folder/cxr_covid/models/cxrclassifier.py", line 261, in _val_epoch
    auroc = sklearn.metrics.roc_auc_score(true[:,-1], probs[:,-1])
  File "/datasets/home/00/300/nil021/.conda/envs/cxr_covid/lib/python3.8/site-packages/sklearn/metrics/_ranking.py", line 387, in roc_auc_score
    return _average_binary_score(partial(_binary_roc_auc_score,
  File "/datasets/home/00/300/nil021/.conda/envs/cxr_covid/lib/python3.8/site-packages/sklearn/metrics/_base.py", line 77, in _average_binary_score
    return binary_metric(y_true, y_score, sample_weight=sample_weight)
  File "/datasets/home/00/300/nil021/.conda/envs/cxr_covid/lib/python3.8/site-packages/sklearn/metrics/_ranking.py", line 221, in _binary_roc_auc_score
    raise ValueError("Only one class present in y_true. ROC AUC score "
ValueError: Only one class present in y_true. ROC AUC score is not defined in that case.
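
The error arises because ROC AUC compares positives against negatives and is undefined when one of the two groups is empty. As a diagnostic aid, the sketch below computes the same quantity by direct pairwise comparison and returns NaN instead of raising. It is a pure-Python illustration, not the code used in cxrclassifier.py, and the name safe_auroc is hypothetical.

```python
import math


def safe_auroc(y_true, y_score):
    """AUROC via pairwise comparison; returns NaN if only one class is present.

    Equals the fraction of (positive, negative) pairs in which the positive
    scores higher, counting ties as one half. O(P * N), fine for a sketch.
    """
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        return math.nan  # undefined: no positive/negative pairs to compare
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A NaN result on the validation set indicates that the labels need inspecting, not that the metric failed: the guard avoids the crash but does not fix labels that are all negative, which is the cause identified in the "Update githubcovid.py" issue above.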
