Official repository for Benchmarking the utilities of pathology foundation models in whole-slide image analysis (on-going)

License: GNU General Public License v3.0


PathoFMHub

This repository contains the codebase for Pathology Foundation Models for Ovarian Cancer Subtype Classification in Whole-slide Images. We used the dataset from the UBC Ovarian Cancer Subtype Classification and Outlier Detection (UBC-OCEAN) competition (2023) to classify five ovarian cancer subtypes from histopathology whole-slide images (WSIs).

Requirements

Development environment:

  • Python 3.11.5 | CUDA 11.8 | PyTorch 2.2.0 | TensorFlow 2.14.0

Training and feature extraction were performed on an NVIDIA T4 GPU (16 GB memory).
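A quick sanity check for the environment above; this is a minimal sketch that only reports which of the two deep-learning stacks are importable:

```python
import sys

# Report the interpreter version and the availability of the expected frameworks.
print("Python", sys.version.split()[0])
for pkg in ("torch", "tensorflow"):
    try:
        mod = __import__(pkg)
        print(pkg, mod.__version__)
    except ModuleNotFoundError:
        print(pkg, "not installed")
```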

Dependencies and installation

This codebase is adapted from the CLAM project[1]. All additional scripts for reproducibility are located in the scripts directory. The instructions below assume a working virtual environment (e.g. conda or virtualenv) with Python 3.11 and PyTorch 2.2.0 installed.

  1. Install PyTorch and Torchvision following the official instructions, e.g.:
pip install torch torchvision
  2. Install required dependencies:
pip install -r requirements.txt
  3. [Optional] For REMEDIS models, install TensorFlow 2 and TensorFlow Hub:
pip install "tensorflow>=2.0.0"
pip install --upgrade tensorflow-hub

To use the SVM loss[2], please navigate to the smooth-topk directory and install with the following command:

python setup.py install
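After installation you can confirm the package is importable (`topk` is the module name assumed from the smooth-topk project; check your checkout if it differs):

```python
# Verify the smooth-topk installation; the --inst_loss svm option depends on it.
try:
    import topk  # module name assumed from the smooth-topk project
    HAVE_SMOOTH_TOPK = True
except ModuleNotFoundError:
    HAVE_SMOOTH_TOPK = False
print("smooth-topk available:", HAVE_SMOOTH_TOPK)
```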

Data

The ovarian cancer dataset (totaling 795 GB before unzipping) can be downloaded from the UBC-OCEAN competition[3] on Kaggle under the CC BY-NC-ND 4.0 license. It covers five ovarian cancer subtypes: high-grade serous carcinoma (HGSC), clear-cell ovarian carcinoma (CC), endometrioid carcinoma (EC), low-grade serous carcinoma (LGSC), and mucinous carcinoma (MC).

Dataset Preparation

Assuming the dataset has been downloaded and unzipped to /path/to/UBC-OCEAN, follow these steps to prepare it:

  1. Convert .png to pyramidal .tif format (total output ~3.2 TB):
python ./scripts/convert_png_to_tif.py \
    --png_dir /path/to/UBC-OCEAN/train_images \ ## original png images
    --output_dir /path/to/UBC_OCEAN_WSI/tiff \ ## output tiff directory
    --train_csv /path/to/UBC-OCEAN/train.csv ## train csv file
  2. Generate the dataset CSV

This step is optional, as the dataset CSV is already provided in the dataset_csv directory. To reproduce the CSV creation:

python ./scripts/make_dataset_csv_UBC.py --train_csv /path/to/UBC-OCEAN/train.csv
  3. Extract patch coordinates for each WSI:
python create_patches_fp.py \
    --source path/to/UBC_OCEAN_WSI/tiff \
    --save_dir ./data/CLAM_preprocessed/UBC_OCEAN_WSI \
    --step_size 224 \
    --patch_size 224 \
    --seg \
    --stitch \
    --patch
  4. Generate training, validation, and testing splits

This step is optional, as the split CSV files are already provided under the splits directory. To reproduce the split creation:

python create_splits_seq.py --task task_1_UBC_OCEAN_WSI --seed 2024
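The conversion in step 1 can be sketched with pyvips; this is an assumption about the script's internals (convert_png_to_tif.py may use different parameters), but a tiled, pyramidal TIFF is what OpenSlide-based patching expects:

```python
def png_to_pyramidal_tif(png_path: str, tif_path: str) -> None:
    """Convert a flat PNG into a tiled, pyramidal TIFF (sketch only)."""
    import pyvips  # deferred so the sketch parses without pyvips installed

    image = pyvips.Image.new_from_file(png_path, access="sequential")
    image.tiffsave(
        tif_path,
        tile=True,            # tiled layout for random patch access
        pyramid=True,         # multi-resolution levels
        compression="jpeg",   # assumed; lossless options also exist
        Q=90,
        tile_width=512,
        tile_height=512,
    )
```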

Feature extraction

ResNet

For feature extraction using ResNet baseline models, follow the instructions below for each variant:

ResNet-50

python extract_features_fp_resnet.py \
    --data_h5_dir ./data/CLAM_preprocessed/UBC_OCEAN_WSI \
    --data_slide_dir /path/to/UBC_OCEAN_WSI/tiff \
    --slide_ext .tif \
    --csv_path ./data/CLAM_preprocessed/UBC_OCEAN_WSI/process_list_autogen.csv \
    --feat_dir ./data/CLAM_preprocessed/UBC_OCEAN_WSI/resnet50 \ ## path to save the extracted features
    --model resnet50 \
    --batch_size 256

ResNet-152

python extract_features_fp_resnet.py \
    --data_h5_dir ./data/CLAM_preprocessed/UBC_OCEAN_WSI \
    --data_slide_dir /path/to/UBC_OCEAN_WSI/tiff \
    --slide_ext .tif \
    --csv_path ./data/CLAM_preprocessed/UBC_OCEAN_WSI/process_list_autogen.csv \
    --feat_dir ./data/CLAM_preprocessed/UBC_OCEAN_WSI/resnet152 \
    --model resnet152 \
    --batch_size 256

Where:

  • data_h5_dir: the directory containing the preprocessed data for the current task.
  • data_slide_dir: the directory containing the WSI images.
  • slide_ext: the file extension of the WSI images.
  • csv_path: the path to the CSV file containing the slide names and their corresponding cancer subtype labels.
  • feat_dir: the directory to save the extracted features.
  • model: the ResNet model variant to use for feature extraction.

Try adding --compile for accelerated feature extraction.

Offline Use of Pre-trained Models

To use pre-trained models locally, download the weights from the PyTorch model zoo into the workdir directory; they will be loaded from there instead of being fetched at run time.

For ResNet-50:

wget -O ./workdir/resnet50-19c8e357.pth https://download.pytorch.org/models/resnet50-19c8e357.pth

For ResNet-152:

wget -O ./workdir/resnet152-b121ed2d.pth https://download.pytorch.org/models/resnet152-b121ed2d.pth 

PLIP

Downloading the Pre-trained Model

Start by downloading the PLIP[4] model with the following command, which saves it to the workdir directory:

python ./scripts/download_plip.py

Extracting Features

To extract features using the PLIP model:

python extract_features_fp_plip.py \
    --data_h5_dir ./data/CLAM_preprocessed/UBC_OCEAN_WSI \
    --data_slide_dir /path/to/UBC_OCEAN_WSI/tiff \
    --slide_ext .tif \
    --csv_path ./data/CLAM_preprocessed/UBC_OCEAN_WSI/process_list_autogen.csv \
    --feat_dir ./data/CLAM_preprocessed/UBC_OCEAN_WSI/plip \
    --batch_size 256
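PLIP is distributed as a CLIP-style checkpoint on Hugging Face, so loading the downloaded copy could look like the sketch below; the local directory name and the processor usage are assumptions, so check download_plip.py for the repo's actual layout:

```python
def load_plip(local_dir: str = "./workdir/plip"):
    # PLIP follows the CLIP architecture, so the standard CLIP classes apply.
    from transformers import CLIPModel, CLIPProcessor  # deferred import

    model = CLIPModel.from_pretrained(local_dir)
    processor = CLIPProcessor.from_pretrained(local_dir)
    return model.eval(), processor
```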

REMEDIS

According to its Terms of Service, we are not permitted to redistribute the REMEDIS[5] model weights or any image features extracted with them. However, access to the weights can be requested from the Medical AI Research Foundations project on PhysioNet after registering as a credentialed user and acknowledging the usage license.

Once the model weights are obtained, features can be extracted with the following command (shown here for path-50x1-remedis-m):

python extract_features_fp_remedis.py \
    --data_h5_dir ./data/CLAM_preprocessed/UBC_OCEAN_WSI \
    --data_slide_dir /path/to/UBC_OCEAN_WSI/tiff \
    --slide_ext .tif \
    --csv_path ./data/CLAM_preprocessed/UBC_OCEAN_WSI/process_list_autogen.csv \
    --feat_dir  ./data/CLAM_preprocessed/UBC_OCEAN_WSI/path-50x1-remedis-m \
    --remedis_model path-50x1-remedis-m \
    --remedis_weights path/to/medical-ai-research-foundation/1.0.0/path-50x1-remedis-m \
    --batch_size 256

Where:

  • remedis_model: the name of the REMEDIS model variant.
  • remedis_weights: the path to the REMEDIS model weights.

All pathology REMEDIS model variants are supported.

Note: Loading the REMEDIS model requires TensorFlow 2 and TensorFlow Hub.
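The REMEDIS weights are distributed as TensorFlow SavedModels (an assumption based on the PhysioNet release), so restoring one could look like:

```python
def load_remedis(weights_dir: str):
    # Restore a REMEDIS encoder from a SavedModel directory (sketch only).
    import tensorflow as tf  # deferred; requires TensorFlow 2

    return tf.saved_model.load(weights_dir)
```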

Training

To train a classifier using the extracted features:

python main.py \
    --data_root_dir ./data/CLAM_preprocessed \
    --results_dir ./data/CLAM_results \
    --task task_1_UBC_OCEAN_WSI \
    --split_dir ./splits/task_1_UBC_OCEAN_WSI_100 \
    --model_type clam_sb \
    --model_size $model_size \
    --feat_name $feat_name \
    --exp_code clam_sb_${model_size}_dropout \
    --log_data \
    --early_stopping \
    --bag_loss ce \
    --inst_loss svm \
    --weighted_sample \
    --drop_out

Where:

  • feat_name specifies the feature extractor used (e.g., resnet50, resnet152, plip, path-50x1-remedis-m, path-152x2-remedis-m).
  • model_size depends on the feature extractor; for example, plip offers the options plip_small and plip_big. The ResNet and REMEDIS extractors follow the same naming pattern.
  • results_dir: the master directory for saving the trained models.
  • split_dir: the directory containing the split csv files.
  • exp_code: the directory name for saving the trained model.
  • log_data: Enable TensorboardX logging (optional).
  • bag_loss: the loss function for the bag-level classification. Options include ce (cross-entropy) and svm (SVM).
  • inst_loss: the loss function for the instance-level clustering. Options include ce (cross-entropy) and svm (SVM).
  • weighted_sample: Enable weighted sampling for the training data (optional).
  • drop_out: Enable dropout (p=0.25).
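For instance, to train on PLIP features with the small head, the shell variables in the command above could be set as follows (values assumed from the options listed):

```shell
# Concrete values for the placeholders in the training command.
model_size=plip_small
feat_name=plip
exp_code="clam_sb_${model_size}_dropout"
echo "$exp_code"   # clam_sb_plip_small_dropout
```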

Note: The SVM instance loss option (--inst_loss svm) requires smooth-topk to be installed beforehand.
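For intuition, the smooth multiclass SVM idea of [2] replaces the hard max in the hinge loss with a temperature-controlled log-sum-exp. The top-1 sketch below is an illustration of that idea only, not the smooth-topk implementation the repo actually uses:

```python
import torch

def smooth_svm_loss(scores: torch.Tensor, labels: torch.Tensor,
                    tau: float = 1.0, margin: float = 1.0) -> torch.Tensor:
    # Add the margin to every wrong class, none to the true class ...
    margins = torch.full_like(scores, margin)
    margins.scatter_(1, labels.unsqueeze(1), 0.0)
    # ... then smooth the max over classes with a log-sum-exp at temperature tau.
    smoothed_max = tau * torch.logsumexp((scores + margins) / tau, dim=1)
    true_scores = scores.gather(1, labels.unsqueeze(1)).squeeze(1)
    return (smoothed_max - true_scores).mean()

# Well-separated scores for the true classes give a small, positive loss.
scores = torch.tensor([[4.0, 0.5, 0.1], [0.2, 3.0, 0.3]])
labels = torch.tensor([0, 1])
loss = smooth_svm_loss(scores, labels)
print(float(loss))
```

Because the log-sum-exp upper-bounds the true-class score (which receives no margin), the loss is always non-negative.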

Evaluation

To evaluate the trained models:

python eval.py \
    --data_root_dir ./data/CLAM_preprocessed \
    --results_dir ./data/CLAM_results/task_1_UBC_OCEAN_WSI \
    --task task_1_UBC_OCEAN_WSI \
    --splits_dir ./splits/task_1_UBC_OCEAN_WSI_100 \
    --model_type clam_sb \
    --feat_name $feat_name \
    --model_size $model_size \
    --models_exp_code clam_sb_${model_size}_dropout_s2024 \
    --save_exp_code clam_sb_${model_size}_dropout \
    --save_dir ./data/CLAM_evaluation/task_1_UBC_OCEAN_WSI \
    --split test \
    --drop_out

Where:

  • model_size and feat_name take the same values as in the training command.
  • results_dir: the master directory containing the trained models for the current task.
  • models_exp_code: the directory name of the saved trained model.
  • save_exp_code: the directory name for saving the evaluation results.
  • save_dir: the master directory for saving the evaluation results for the current task.

Heatmap

We provide a script create_heatmaps_plip.py, adapted from CLAM's create_heatmaps.py, for generating heatmaps using the PLIP model. This process requires both a .yaml config file and a CSV file containing the slide names along with their corresponding cancer subtype labels. Optional parameters, such as patch size, can also be specified in the CSV file. Sample files and further guidance are available in the heatmaps directory.

To reproduce the heatmap CSV file for the test set:

python ./scripts/make_heatmap_csv_test.py \
    --process_csv ./data/CLAM_preprocessed/UBC_OCEAN_WSI/process_list_autogen.csv \
    --split_csv ./splits/task_1_UBC_OCEAN_WSI_100/splits_0.csv \
    --dataset_csv ./dataset_csv/UBC_OCEAN_WSI.csv \
    --save_csv ./heatmaps/process_list.csv

To generate the .yaml configuration files for the trained models:

python ./scripts/create_heatmap_config_from_results.py \
    --results_dir ./data/CLAM_results \
    --task task_1_UBC_OCEAN_WSI \
    --data_dir path/to/UBC_OCEAN_WSI/tiff \
    --overlap 0 \
    --patch_size 224 \
    --num_workers 2

The generated .yaml config files will be saved to the heatmaps/configs directory, each named after its corresponding checkpoint directory.

To generate heatmaps with PLIP as the feature extractor:

python create_heatmaps_plip.py --config_file ./heatmaps/configs/clam_sb_plip_small_dropout_0.yaml

Acknowledgements

We thank the authors of CLAM, PLIP, and REMEDIS for open-sourcing their codebases and pre-trained models for feature extraction, and the organisers of the UBC-OCEAN competition for making the dataset publicly available.

References

  1. Lu, M.Y., Williamson, D.F., Chen, T.Y., Chen, R.J., Barbieri, M., Mahmood, F.: Data-efficient and weakly supervised computational pathology on whole-slide images. Nature Biomedical Engineering 5(6), 555–570 (2021)

  2. Berrada, L., Zisserman, A., Kumar, M.P.: Smooth loss functions for deep top-k classification. International Conference on Learning Representations (2018)

  3. Bashashati, A., Farahani, H., OTTA Consortium, Karnezis, A., Akbari, A., Kim, S., Chow, A., Dane, S., Zhang, A., Asadi, M.: UBC Ovarian Cancer Subtype Classification and Outlier Detection (UBC-OCEAN) (2023)

  4. Huang, Z., Bianchi, F., Yuksekgonul, M., Montine, T.J., Zou, J.: A visual–language foundation model for pathology image analysis using medical Twitter. Nature Medicine 29(9), 2307–2316 (2023)

  5. Azizi, S., Culp, L., Freyberg, J., Mustafa, B., Baur, S., Kornblith, S., Chen, T., Tomasev, N., Mitrović, J., Strachan, P., et al.: Robust and data-efficient generalization of self-supervised machine learning for diagnostic imaging. Nature Biomedical Engineering pp. 1–24 (2023)
