
Evaluation of 3D detection and diagnosis performance, geared towards prostate cancer detection in MRI.

Home Page: https://pi-cai.grand-challenge.org/

License: Apache License 2.0



Evaluation Utilities for 3D Detection and Diagnosis in Medical Imaging


This repository contains standardized functions to evaluate 3D detection and diagnosis performance in medical imaging, with its evaluation strategy geared towards clinically significant prostate cancer (csPCa) detection in MRI. It is used for the official evaluation pipeline of the PI-CAI challenge.

Supported Evaluation Metrics

  • Average Precision (AP)
  • Area Under the Receiver Operating Characteristic curve (AUROC)
  • Overall AI Ranking Metric of the PI-CAI challenge: (AUROC + AP) / 2 (see the short sketch after this list)
  • Precision-Recall (PR) curve
  • Receiver Operating Characteristic (ROC) curve
  • Free-Response Receiver Operating Characteristic (FROC) curve
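
For reference, the overall ranking metric is a plain average of the patient-level AUROC and the lesion-level AP. A minimal sketch of that combination (illustrative only; in practice both values come from the evaluation pipeline described below):

def picai_ranking_score(auroc: float, ap: float) -> float:
    # overall AI ranking metric of the PI-CAI challenge: mean of AUROC and AP
    return (auroc + ap) / 2

print(picai_ranking_score(auroc=0.85, ap=0.60))  # 0.725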

Additional Supported Functionalities

  • Subset Analysis: By providing a list of case identifiers, performance can be evaluated for only that specific subset.
  • Case-Wise Sample Weighting: Sample weighting can help facilitate inverse probability weighting. Note, when this feature is used in conjunction with lesion-level evaluation, the same weight is applied to all lesion candidates of the same case. Lesion-wise sample weighting is currently not supported.
  • Statistical Tests: Permutation tests and bootstrapping techniques to facilitate AI vs AI/radiologists comparisons in the PI-CAI challenge.

Installation

picai_eval is pip-installable:

pip install picai_eval

Evaluation Pipeline

Figure: Detection/diagnosis evaluation pipeline of the PI-CAI challenge. (top) Lesion-level csPCa detection (modeled by 'AI'): For a given patient case, using the bpMRI exam, predict a 3D detection map of non-overlapping, non-connected csPCa lesions (with the same dimensions and resolution as the T2W image). For each predicted lesion, all voxels must comprise a single floating point value between 0-1, representing that lesion’s likelihood of harboring csPCa. (bottom) Patient-level csPCa diagnosis (modeled by 'f(x)'): For a given patient case, using the predicted csPCa lesion detection map, compute a single floating point value between 0-1, representing that patient’s overall likelihood of harboring csPCa. For instance, f(x) can simply be a function that takes the maximum of the csPCa lesion detection map, or it can be a more complex heuristic (defined by the AI developer).
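
As an illustration of the simplest choice of f(x) described in the caption above, the case-level likelihood can be derived by taking the maximum of the detection map (a sketch; any other heuristic defined by the AI developer would work equally well):

import numpy as np

def case_level_likelihood(detection_map: np.ndarray) -> float:
    # simplest f(x): the highest lesion likelihood anywhere in the detection map
    return float(detection_map.max()) if detection_map.size else 0.0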

Usage

Expected Predictions and Annotations

Our evaluation pipeline expects detection maps and annotations in the following format:

  • Detection Maps: 3D volumes with non-connected, non-overlapping lesion detections. Each lesion detection is a connected component (in 3D) with the same confidence or likelihood score (floating point) per voxel. Each detection map may contain an arbitrary number of such lesion detections.

  • Annotations: 3D volumes of the same shape as their corresponding detection maps, with non-connected, non-overlapping ground-truth lesions. Each ground-truth lesion is a connected component (in 3D) with the integer value 1 per voxel. Background voxels are represented by the integer value 0.

Note, we define a connected component as all non-zero voxels with squared connectivity equal to three. This means that in a 3×3×3 neighbourhood all voxels are connected to the centre voxel. See 26-Connectivity for an illustration.
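
To illustrate this definition, the sketch below labels a toy volume with a full 3×3×3 structuring element (assuming SciPy is available; this mirrors the 26-connectivity described above, but is not the package's own implementation):

import numpy as np
from scipy import ndimage

volume = np.zeros((4, 4, 4), dtype=np.uint8)
volume[0, 0, 0] = 1
volume[1, 1, 1] = 1  # touches the first voxel only diagonally

# 26-connectivity: every voxel in a 3×3×3 neighbourhood is connected to the centre voxel
structure = np.ones((3, 3, 3), dtype=int)
_, num_components = ndimage.label(volume, structure=structure)
print(num_components)  # 1, i.e. diagonal neighbours belong to the same lesion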

Evaluate Detection Maps with Python

To run evaluation scripts from Python, import the evaluate function and provide detection maps (y_det) and annotations (y_true):

from picai_eval import evaluate

subject_list = [
    "case-0",
    "case-1",
    "case-2",
]

metrics = evaluate(
    y_det=y_det,
    y_true=y_true,
    subject_list=subject_list,  # optional
)
  • y_det: Iterable of all detection maps to evaluate. Each detection map is a 3D volume with non-connected, non-overlapping lesion detections. Each lesion detection is a connected component (in 3D) with the same confidence or likelihood score per voxel. Each detection map may contain an arbitrary number of such lesion detections. Alternatively, y_det may contain filenames of detection maps ending in .nii.gz/.mha/.mhd/.npy/.npz, which will be loaded on-the-fly.

  • y_true: Iterable of all ground-truth annotations. Each annotation should be a 3D volume of the same shape as its corresponding detection map, with non-connected, non-overlapping ground-truth lesions. 1 is used to encode ground-truth lesions, and 0 to encode the background. Alternatively, y_true may contain filenames of binary annotations ending in .nii.gz/.mha/.mhd/.npy/.npz, which will be loaded on-the-fly.
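
For illustration, a minimal synthetic input could look as follows (a sketch with arbitrary shapes and values, containing one positive and one negative case):

import numpy as np
from picai_eval import evaluate

shape = (16, 64, 64)

# case-0: one ground-truth lesion, detected by an overlapping candidate with likelihood 0.9
y_true_0 = np.zeros(shape, dtype=np.uint8)
y_true_0[5:9, 20:30, 20:30] = 1
y_det_0 = np.zeros(shape, dtype=np.float32)
y_det_0[5:9, 21:31, 21:31] = 0.9

# case-1: negative case with one low-confidence false-positive candidate
y_true_1 = np.zeros(shape, dtype=np.uint8)
y_det_1 = np.zeros(shape, dtype=np.float32)
y_det_1[2:4, 5:10, 5:10] = 0.3

metrics = evaluate(
    y_det=[y_det_0, y_det_1],
    y_true=[y_true_0, y_true_1],
    subject_list=["case-0", "case-1"],
)
print(metrics.AP, metrics.auroc, metrics.score)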

Default parameters will perform evaluation as per the specifications of the PI-CAI challenge. Optionally, the specifications for evaluation can be adapted using the following parameters:

  • sample_weight: Case-level sample weight. When this feature is used in conjunction with lesion-level evaluation, the same weight is applied to all lesion candidates of the same case. Default: equal weight for all cases.

  • subject_list: List of sample identifiers, to give recognizable names to the evaluation results.

  • min_overlap: Defines the threshold of the hit criterion, i.e. the minimal required Intersection over Union (IoU) or Dice similarity coefficient (DSC) between predicted lesion candidates and ground-truth lesions, for predicted lesions to be counted as true positive detections. Default: 0.1.

  • overlap_func: Function used to calculate the basis of the hit criterion, i.e. the object overlap between predicted lesion candidates and ground-truth lesions. This can be set as 'IoU' to use Intersection over Union, or 'DSC' to use Dice similarity coefficient. Alternatively, any other function can also be provided with the signature func(detection_map, annotation) -> overlap [0, 1] (a sketch of such a custom function is shown after this list). Default: 'IoU'.

  • case_confidence_func: Function used to derive the case-level prediction or confidence from lesion-level detections or confidences (as denoted by 'f(x)' in 'Evaluation Pipeline'). Default: 'max' (which simply takes the maximum of the detection map as the case-level prediction).

  • multiple_lesion_candidates_selection_criteria: Used to account for split-merge scenarios. When multiple lesion candidates have sufficient overlap with the ground-truth lesion, this condition determines which lesion candidate is selected as the true positive, and which lesion candidates are discarded or counted as false positives. Default: 'overlap' (which selects the lesion candidate with the highest degree of overlap).

  • allow_unmatched_candidates_with_minimal_overlap: Used to account for split-merge scenarios. When multiple lesion candidates have sufficient overlap with the ground-truth lesion, this condition determines whether non-selected lesion candidates are discarded or count as false positives. Default: True (i.e. non-selected lesion candidates are not counted as false positives).

  • num_parallel_calls: Number of CPU threads used to process evaluation. Default: 3.
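
As referenced in the overlap_func item above, custom functions can be plugged into the hit criterion. The sketch below shows a Dice-style overlap function following the stated signature (illustrative only; it assumes the two arrays passed in can be treated as binary masks, and reuses y_det/y_true from the example above):

import numpy as np
from picai_eval import evaluate

def dice_overlap(detection_map: np.ndarray, annotation: np.ndarray) -> float:
    # Dice similarity coefficient between a lesion candidate and a ground-truth lesion,
    # treating both inputs as binary masks (an assumption of this sketch)
    a, b = detection_map > 0, annotation > 0
    denom = a.sum() + b.sum()
    return float(2.0 * np.logical_and(a, b).sum() / denom) if denom else 0.0

metrics = evaluate(
    y_det=y_det,                   # as in the example above
    y_true=y_true,
    min_overlap=0.1,
    overlap_func=dice_overlap,     # instead of the built-in 'IoU'/'DSC' options
    case_confidence_func="max",    # default case-level aggregation
)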

Evaluate all Detection Maps stored in a specific folder

To evaluate numerous detection maps stored on disk, prepare input folders in the following format:

path/to/detection_maps/
├── [case-0]_detection_map.nii.gz
├── [case-1]_detection_map.nii.gz
├── [case-2]_detection_map.nii.gz
...

path/to/annotations/
├── [case-0]_label.nii.gz
├── [case-1]_label.nii.gz
├── [case-2]_label.nii.gz

See here for an example. If the folders containing the detection maps and the annotations are different, the _detection_map and _label suffixes are optional. Allowed file extensions are: .npz (as used in the nnU-Net framework), .npy, .nii.gz, .nii, .mha and .mhd. The first file matching one of these extensions (in the order listed here) is selected.
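
To make the matching rule concrete, a hypothetical helper that resolves the detection map of a single case could look like this (illustrative only; find_detection_map is not part of the library, and the exact handling of optional suffixes may differ):

from pathlib import Path

EXTENSIONS = [".npz", ".npy", ".nii.gz", ".nii", ".mha", ".mhd"]  # search order

def find_detection_map(folder: str, case_id: str) -> Path:
    # try "<case>_detection_map<ext>" first, then "<case><ext>", in extension order
    for suffix in ("_detection_map", ""):
        for ext in EXTENSIONS:
            candidate = Path(folder) / f"{case_id}{suffix}{ext}"
            if candidate.exists():
                return candidate
    raise FileNotFoundError(f"no detection map found for {case_id} in {folder}")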

Using Python:
The following evaluates all cases specified in subject_list. The evaluate_folder function also accepts all parameters described above.

from picai_eval import evaluate_folder

subject_list = [
    "case-0",
    "case-1",
    "case-2",
]

metrics = evaluate_folder(
    y_det_dir="path/to/detection_maps",
    y_true_dir="path/to/annotations",
    subject_list=subject_list,          # optional
)

Using the command line:
The following evaluates all cases found in path/to/detection_maps against the annotations in path/to/annotations, and stores the metrics in path/to/detection_maps/metrics.json. The --labels parameter may be omitted, in which case it defaults to the --input folder. To specify the output location of the metrics, use --output /path/to/metrics.json.

python -m picai_eval --input path/to/detection_maps --labels path/to/annotations

Evaluate Softmax Volumes (instead of Detection Maps)

To evaluate softmax predictions (instead of detection maps), a function to extract lesion candidates from the softmax volume must be provided. For instance, the dynamic lesion extraction method from the report_guided_annotation module can be used for this (see mechanism for a depiction of its working principle).

Evaluating Softmax Volumes using Python:

from picai_eval import evaluate
from report_guided_annotation import extract_lesion_candidates

metrics = evaluate(
    y_det=y_pred,
    y_true=y_true,
    subject_list=subject_list,  # may be omitted
    y_det_postprocess_func=lambda pred: extract_lesion_candidates(pred)[0],
)

For the structure of the inputs and additional parameters, see Evaluate Detection Maps with Python.

Evaluating all Softmax Volumes stored in a specific folder:

from picai_eval import evaluate_folder
from report_guided_annotation import extract_lesion_candidates

metrics = evaluate_folder(
    y_det_dir=in_dir_softmax,
    y_true_dir=in_dir_annot,
    y_det_postprocess_func=lambda pred: extract_lesion_candidates(pred)[0],
)

Accessing Metrics after Evaluation

To access metrics after evaluation, we recommend using the Metrics class:

metrics = ...  # from evaluate, evaluate_folder, or Metrics("/path/to/metrics.json")

# aggregate metrics
AP = metrics.AP
auroc = metrics.auroc
picai_score = metrics.score

# Precision-Recall (PR) curve
precision = metrics.precision
recall = metrics.recall

# Receiver Operating Characteristic (ROC) curve
tpr = metrics.case_TPR
fpr = metrics.case_FPR

# Free-Response Receiver Operating Characteristic (FROC) curve
sensitivity = metrics.lesion_TPR
fp_per_case = metrics.lesion_FPR

For example, these can be used to plot performance curves:

import matplotlib.pyplot as plt
from sklearn.metrics import PrecisionRecallDisplay, RocCurveDisplay

# plot Precision-Recall (PR) curve
disp = PrecisionRecallDisplay(precision=precision, recall=recall, average_precision=AP)
disp.plot()
plt.show()

# plot Receiver Operating Characteristic (ROC) curve
disp = RocCurveDisplay(fpr=fpr, tpr=tpr, roc_auc=auroc)
disp.plot()
plt.show()

# plot Free-Response Receiver Operating Characteristic (FROC) curve
f, ax = plt.subplots()
disp = RocCurveDisplay(fpr=fp_per_case, tpr=sensitivity)
disp.plot(ax=ax)
ax.set_xlim(0.001, 5.0); ax.set_xscale('log')
ax.set_xlabel("False positives per case"); ax.set_ylabel("Sensitivity")
plt.show()

To perform subset analysis, a list of subject IDs can be provided. To view all available subject IDs, run print(metrics.subject_list).

subject_list = [..., ...]  # list of case identifiers
metrics = Metrics("path/to/metrics.json", subject_list=subject_list)
print(metrics)  # prints performance for specified subset

Or, with existing metrics:

metrics = ...  # from evaluate, evaluate_folder, or Metrics("/path/to/metrics.json")
metrics.subject_list = subject_list
print(metrics)  # prints performance for specified subset

All performance metrics for a subset can be accessed in the same manner as for the full set.

Storing and Reading Metrics

Metrics can be easily saved and loaded to/from disk, to facilitate evaluation of multiple models, and subsequent (statistical) analyses. To read metrics, simply provide the path to the saved .json file:

from picai_eval import Metrics

metrics = Metrics("path/to/metrics.json")

To save metrics, provide the path to save a corresponding .json file:

metrics.save("path/to/metrics.json")
# metrics.save_full("path/to/metrics.json")     # also store derived curves
# metrics.save_minimal("path/to/metrics.json")  # only store minimal information to reload Metrics instance

The command line interface (described in 'Evaluate all Detection Maps stored in a specific folder' above) will automatically save metrics to disk. Its output path can be controlled with the --output parameter.


Statistical Tests

The PI-CAI challenge features 'AI vs AI', 'AI vs Radiologists from Clinical Routine' and 'AI vs Radiologists from Reader Study' comparisons. Each of these comparisons comes with a statistical test. For 'AI vs AI', a permutation test with the overall ranking metric is performed. Readers cannot be assigned a ranking metric without introducing bias, so for 'AI vs Radiologists from Reader Study' and 'AI vs Radiologists from Clinical Routine', we compare performance at matched operating points. See each section below for more details.

For the following tests, we assume that each AI algorithm is trained on the same training dataset and evaluated on the same testing dataset, multiple times (5-10x), and all of these independently trained instances are used in each statistical test. By doing so, we account for the performance variance resulting from the stochastic optimization of machine/deep learning models (due to which, the same AI architecture, trained on the same data, for the same number of training steps, typically can exhibit different performance each time). Our goal is to avoid basing any conclusions off of one arbitrary training run (which may prove “lucky” or “unlucky” for a given AI algorithm), and to promote reproducibility. Thus, we statistically evaluate the overall AI algorithm, and not just a single trained instance of that algorithm.

Note: Extended tests to verify whether a given statistical test is well-calibrated (i.e. it does not over-/under-estimate the p-value), will be incorporated in the future.

AI vs AI

Comparison: Between a given pair of AI algorithms, with multiple independently trained instances per AI algorithm.

Statistical Question: What is the probability that one AI algorithm outperforms another, while accounting for the performance variance stemming from each AI algorithm’s training method?

Statistical Test: Permutation test (as applied in Bosma et al., 2021). In each replication, performance metrics (ranking score, AP or AUROC) are shuffled across methods (different AI algorithms) and their instances (independently trained samples of each method). As such, the performance variance between training runs is accounted for. This is different from the permutation test used in e.g. Ruamviboonsuk et al., 2022, McKinney et al., 2020 and Bulten et al., 2022, where predictions are permuted rather than performance metrics. When permuting predictions, the trained model instances are compared, rather than the training methods.
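
The principle can be sketched as follows (a generic permutation test on performance metrics, not picai_eval's own routine, which is shown right after):

import numpy as np

def permutation_test_sketch(scores_a, scores_b, iterations=10_000, seed=42):
    # shuffle performance metrics across the two methods and count how often the
    # shuffled difference in mean score is at least as large as the observed one
    rng = np.random.default_rng(seed)
    scores_a, scores_b = np.asarray(scores_a, float), np.asarray(scores_b, float)
    observed = scores_a.mean() - scores_b.mean()
    pooled = np.concatenate([scores_a, scores_b])
    n = len(scores_a)
    count = 0
    for _ in range(iterations):
        rng.shuffle(pooled)
        count += (pooled[:n].mean() - pooled[n:].mean()) >= observed
    return count / iterations  # one-sided p-value: how often shuffling matches the observed gap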

The permutation test can be used as follows:

from picai_eval.statistical_helper import perform_permutation_test

scores_algorithm_a = [0.96, 0.91, 0.90, 0.85, 0.81, 0.80]
scores_algorithm_b = [0.92, 0.94, 0.95, 0.81, 0.82, 0.86]

# perform permutation tests
p = perform_permutation_test(
    scores_alternative=scores_algorithm_a,
    scores_baseline=scores_algorithm_b,
)

# p-value should be 0.7218614718614719

This will calculate the p-value for the null hypothesis Performance(baseline algorithm) > Performance(alternative algorithm) (given the provided or observed performance metrics). Note, the scores shown above (0.92, 0.94, etc.) are performance metrics (e.g. AUROC, AP), not model predictions (i.e. the likelihood score predicted per case). While using individual predictions provides more (but correlated) samples for the permutation test (numerous individual predictions are associated with the same, single trained instance of an AI model), using overall performance metrics provides fewer, but independent, samples (only a single overall performance metric is associated with each trained instance of an AI model). Hence, we opt for the latter, and use multiple training runs to obtain these samples. Performance metrics can be obtained from the evaluation pipeline, as follows:

from picai_eval import Metrics

scores_algorithm = [
    Metrics(path).score
    for path in [
        "/path/to/algorithm/metrics-restart-1.json",
        "/path/to/algorithm/metrics-restart-2.json",
        "/path/to/algorithm/metrics-restart-3.json",
        ...
    ]
]

AI vs Radiologists from Clinical Routine

Comparison: Between multiple independently trained instances of a given AI algorithm, and the historical reads made by radiologists during clinical routine.

Statistical Question: What is the probability that a given trained AI algorithm outperforms radiologists from clinical routine, while accounting for the performance variance stemming from different cases and the AI algorithm’s training method?

Statistical Test: Paired bootstrapping (as applied in Ruamviboonsuk et al., 2022, McKinney et al., 2020, Rodriguez-Ruiz et al., 2019), using predictions from a given operating point. Here, the operating point is that of radiologists (PI-RADS ≥ 3 or PI-RADS ≥ 4) from clinical routine. Trained AI algorithms are thresholded to match the radiologist's sensitivity/specificity (for patient diagnosis) or recall/precision (for lesion detection). In each of 1M replications, ∼U(0,N) cases are sampled with replacement, and used to calculate the test statistic. Iterations that sample only one class are rejected. Here, the test statistic is the rank of historical reads made by radiologists, with respect to the predictions made by trained AI algorithms, where the rank is determined by the conjugate performance metric.

Note: In contrast to the permutation test, bootstrapping approximates the statistical question. As a result, the p-value from bootstrapping can be miscalibrated (i.e. giving p-values that are higher or lower than they should be). The permutation test does not have this issue, but cannot be applied in this scenario, because we have only a single radiologist prediction per case.
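
The mechanism can be sketched as follows for a single trained AI instance (a simplified, illustrative version: it matches the reader's sensitivity and bootstraps specificity, whereas the picai_eval test additionally handles multiple restarts and uses a rank-based test statistic):

import numpy as np

def matched_bootstrap_sketch(y_true, y_pred_ai, y_pred_reader, iterations=10_000, seed=42):
    # threshold the AI at the reader's sensitivity, then bootstrap cases with replacement
    # and count how often the reader's specificity exceeds the AI's
    rng = np.random.default_rng(seed)
    y_true = np.asarray(y_true, dtype=bool)
    y_pred_ai = np.asarray(y_pred_ai, dtype=float)
    y_pred_reader = np.asarray(y_pred_reader, dtype=bool)

    reader_sensitivity = y_pred_reader[y_true].mean()
    threshold = np.quantile(y_pred_ai[y_true], 1.0 - reader_sensitivity)
    ai_binary = y_pred_ai >= threshold

    n, wins, valid = len(y_true), 0, 0
    for _ in range(iterations):
        idx = rng.integers(0, n, size=n)           # sample cases with replacement
        if y_true[idx].all() or not y_true[idx].any():
            continue                               # reject replications with only one class
        neg = ~y_true[idx]
        wins += (~y_pred_reader[idx][neg]).mean() > (~ai_binary[idx][neg]).mean()
        valid += 1
    return wins / valid if valid else float("nan")  # approximate P(reader outperforms AI)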

The matched bootstrapping test can be used as follows:

import numpy as np
from picai_eval.statistical_helper import perform_matched_boostrapping

# predictions: 3 restarts (rows) of 4 cases (columns)
y_pred_ai = [
    [0.92, 0.23, 0.12, 0.95],
    [0.42, 0.81, 0.13, 0.86],
    [0.26, 0.15, 0.14, 0.67]
]
y_pred_reader = np.array([5, 4, 2, 3]) >= 3
y_true = [1, 1, 0, 0]

# perform matched bootstrapping
p = perform_matched_boostrapping(
    y_true=y_true,
    y_pred_ai=y_pred_ai,
    y_pred_reader=y_pred_reader,
    match='sensitivity',
    iterations=int(1e4),
)

# Probability for Performance(AI) > Performance(Reader): p = 0.3 (approximately)

Note, the numbers shown above (0.92, 0.23, ..., 0.95 and 5, 4, ..., 2 etc.) are predictions (i.e. the likelihood score predicted per case), not performance metrics (e.g. AUROC, AP). All radiologist predictions must be binarized (e.g. thresholded at PI-RADS ≥ 3 or PI-RADS ≥ 4), while all predictions for the algorithm must be likelihood scores between 0 and 1 inclusive. Predictions can be obtained from the evaluation pipeline, as follows:

from picai_eval import Metrics

y_pred_ai = [
    Metrics(path).case_pred
    for path in [
        "/path/to/algorithm/metrics-restart-1.json",
        "/path/to/algorithm/metrics-restart-2.json",
        "/path/to/algorithm/metrics-restart-3.json",
        ...
    ]
]

AI vs Radiologists from Reader Study

Comparison: Between multiple independently trained instances of a given AI algorithm, and a given panel of radiologists or readers.

Statistical Question: What is the probability that a given AI algorithm outperforms the typical reader from a given panel of radiologists, while accounting for the performance variance stemming from different readers, and the AI algorithm’s training method?

Statistical Test: Permutation test (as applied in Ruamviboonsuk et al., 2022, McKinney et al., 2020 and Bulten et al., 2022). Permutation tests are used to statistically compare lesion-level detection and patient-level diagnosis performance at PI-RADS operating points. Here, in each of the replications, performance metrics (reader performance w.r.t. AI performance at reader’s operating point) are shuffled across methods (AI, radiologists) and their instances (independently trained samples of AI algorithm, different readers).

import numpy as np
from picai_eval.statistical_helper import perform_matched_permutation_test

# predictions: 3 restarts (rows) of 4 cases (columns)
y_pred_ai = [
    [0.92, 0.23, 0.12, 0.95],
    [0.82, 0.81, 0.13, 0.42],
    [0.26, 0.90, 0.14, 0.67]
]
y_pred_readers = np.array([
    [5, 4, 2, 2],
    [4, 5, 1, 2],
    [5, 2, 3, 2]
]) >= 3
y_true = [1, 1, 0, 0]

p = perform_matched_permutation_test(
    y_true=y_true,
    y_pred_ai=y_pred_ai,
    y_pred_readers=y_pred_readers,
    match="sensitivity",
    iterations=int(1e4),
)

# Probability for Performance(Panel of readers) > Performance(AI): p = 0.8 (approximately)

Note, the numbers shown above (0.92, 0.23, ..., 0.95 and 5, 4, ..., 2 etc.) are predictions (i.e. the likelihood score predicted per case), not performance metrics (e.g. AUROC, AP). All radiologist predictions must be binarized (e.g. thresholded at PI-RADS ≥ 3 or PI-RADS ≥ 4), while all predictions for the algorithm must be likelihood scores between 0 and 1 inclusive. Predictions can be obtained from the evaluation pipeline, as follows:

from picai_eval import Metrics

y_pred_ai = [
    Metrics(path).case_pred
    for path in [
        "/path/to/algorithm/metrics-restart-1.json",
        "/path/to/algorithm/metrics-restart-2.json",
        "/path/to/algorithm/metrics-restart-3.json",
        ...
    ]
]

Reference

If you are using this codebase or some part of it, please cite the following article:

Saha A, Bosma JS, Twilt JJ, et al. Artificial intelligence and radiologists in prostate cancer detection on MRI (PI-CAI): an international, paired, non-inferiority, confirmatory study. Lancet Oncol 2024; 25: 879–887

BibTeX:

@ARTICLE{SahaBosmaTwilt2024,
  title = {Artificial intelligence and radiologists in prostate cancer detection on MRI (PI-CAI): an international, paired, non-inferiority, confirmatory study},
  journal = {The Lancet Oncology},
  year = {2024},
  issn = {1470-2045},
  volume={25},
  number={7},
  pages={879--887},
  doi = {10.1016/S1470-2045(24)00220-1},
  author = {Anindo Saha and Joeran S Bosma and Jasper J Twilt and Bram {van Ginneken} and Anders Bjartell and Anwar R Padhani and David Bonekamp and Geert Villeirs and Georg Salomon and Gianluca Giannarini and Jayashree Kalpathy-Cramer and Jelle Barentsz and Klaus H Maier-Hein and Mirabela Rusu and Olivier Rouvière and Roderick {van den Bergh} and Valeria Panebianco and Veeru Kasivisvanathan and Nancy A Obuchowski and Derya Yakar and Mattijs Elschot and Jeroen Veltman and Jurgen J Fütterer and Constant R. Noordman and Ivan Slootweg and Christian Roest and Stefan J. Fransen and Mohammed R.S. Sunoqrot and Tone F. Bathen and Dennis Rouw and Jos Immerzeel and Jeroen Geerdink and Chris {van Run} and Miriam Groeneveld and James Meakin and Ahmet Karagöz and Alexandre Bône and Alexandre Routier and Arnaud Marcoux and Clément Abi-Nader and Cynthia Xinran Li and Dagan Feng and Deniz Alis and Ercan Karaarslan and Euijoon Ahn and François Nicolas and Geoffrey A. Sonn and Indrani Bhattacharya and Jinman Kim and Jun Shi and Hassan Jahanandish and Hong An and Hongyu Kan and Ilkay Oksuz and Liang Qiao and Marc-Michel Rohé and Mert Yergin and Mohamed Khadra and Mustafa E. Şeker and Mustafa S. Kartal and Noëlie Debs and Richard E. Fan and Sara Saunders and Simon J.C. Soerensen and Stefania Moroianu and Sulaiman Vesal and Yuan Yuan and Afsoun Malakoti-Fard and Agnė Mačiūnien and Akira Kawashima and Ana M.M. de M.G. {de Sousa Machadov} and Ana Sofia L. Moreira and Andrea Ponsiglione and Annelies Rappaport and Arnaldo Stanzione and Arturas Ciuvasovas and Baris Turkbey and Bart {de Keyzer} and Bodil G. Pedersen and Bram Eijlers and Christine Chen and Ciabattoni Riccardo and Deniz Alis and Ewout F.W. {Courrech Staal} and Fredrik Jäderling and Fredrik Langkilde and Giacomo Aringhieri and Giorgio Brembilla and Hannah Son and Hans Vanderlelij and Henricus P.J. Raat and Ingrida Pikūnienė and Iva Macova and Ivo Schoots and Iztok Caglic and Jeries P. Zawaideh and Jonas Wallström and Leonardo K. Bittencourt and Misbah Khurram and Moon H. Choi and Naoki Takahashi and Nelly Tan and Paolo N. Franco and Patricia A. Gutierrez and Per Erik Thimansson and Pieter Hanus and Philippe Puech and Philipp R. Rau and Pieter {de Visschere} and Ramette Guillaume and Renato Cuocolo and Ricardo O. Falcão and Rogier S.A. {van Stiphout} and Rossano Girometti and Ruta Briediene and Rūta Grigienė and Samuel Gitau and Samuel Withey and Sangeet Ghai and Tobias Penzkofer and Tristan Barrett and Varaha S. Tammisetti and Vibeke B. Løgager and Vladimír Černý and Wulphert Venderink and Yan M. Law and Young J. Lee and Maarten {de Rooij} and Henkjan Huisman},
}

Managed By

Diagnostic Image Analysis Group, Radboud University Medical Center, Nijmegen, The Netherlands


Contributors

anindox8, joeranbosma, nataliaalves13


picai_eval's Issues

Implementation of evaluation metrics (lesion assignment & patient-level scoring)

Dear organizers,
I am trying to understand how the evaluation metrics provided with the Python code work (many thanks for implementing and sharing the code). I relate them to what is described in the illustrative document [1].

  1. Concerning the evaluation of the lesion detection maps:
  • The evaluation metric (i.e. the Average Precision (AP) and the corresponding Precision-Recall (PR) curve) seems to depend on both: i) the confidence levels/scores provided in the detection map (an explicit dependence when constructing the PR curve); ii) the size of the detected lesion blobs (a more implicit dependence, by disregarding detected blobs with too low an overlap (in terms of IoU) against the ground truth). In the function evaluate_case() (in eval.py) a single threshold (default: 0.1) is used to define a "hit criterion". On p.21 of [1] it is suggested to average an evaluation over multiple different thresholds on IoU.
    Does it make sense to extend the evaluation scripts to sweep over a set of thresholds? Or should the AI system be constructed to generate blob sizes in the detection map that are specifically tailored to the "hit criterion" with a fixed threshold of 0.1 (as this appears useful from a clinical perspective)?
  • In the case of multiple matching detected blobs and/or multiple ground-truth lesions, an assignment is made to remove ambiguities between detected and ground-truth blobs. On p.22 of [1] it is suggested to either use the "Greedy by Score" strategy or to optimize a cost function (e.g. "Hungarian Matching") to find an optimal assignment. In contrast, in the provided code (evaluate_case() in eval.py) the assignment in the case of multiple ground-truth lesions works as follows: rank the ground-truth blobs according to spatial location and make assignments in that order. This can lead to a situation where flipping or rotating both the ground truth and the detection map by 180° produces a different assignment than the original pose. Is such an assignment suitable in the sense of the evaluation metric?
  2. Concerning the evaluation on the patient level:
  • Do I understand correctly that the participating teams should/may provide an implementation of the function that generates a patient-level score (from the detection map), which must be compatible with the interfaces of the evaluation scripts? In that sense, the patient-level scores are generated while performing the evaluation?
  • If no function to generate patient-level scores is provided, will the evaluation script pick the maximum confidence value contained anywhere in the detection map as the patient score? Does it not matter at which spatial location that confidence value occurs (even if it does not match a ground-truth lesion)?

many thanks,
Christoph

Refs:
[1] Reinke, et al., "Common Limitations of Image Processing Metrics: A Picture Story"

high auroc values and consistently 0 AP

Hello, I am writing here because I have no idea why, during training, I get increasing AUROC but a constant AP of 0. I evaluate the model every 9 epochs on the validation part of the dataset.

Below is a shortened example (with a single batch) of how I use the evaluate method:

# first I extract the lesion candidates
y_det = extract_lesion_candidates(y_det.cpu().detach().numpy()[1, :, :, :])[0]
# then I save y_det and y_true in files in the given folder

on the validation end

valid_metrics = evaluate(
    y_det=self.list_yHat_val,
    y_true=self.list_gold_val,
    num_parallel_calls=os.cpu_count(),
)

meanPiecaiMetr_auroc=valid_metrics.auroc
meanPiecaiMetr_AP=valid_metrics.AP
meanPiecaiMetr_score=valid_metrics.score

The image below shows different trials and the recorded values: AUROCs increase with training, but AP stays near 0. As both are more or less based on the confusion matrix, I have no idea how this is possible.

[image: recorded AUROC and AP values across training trials]

Evaluate Individual Detection Maps with Python

The documentation is unclear for both Evaluate Individual Detection Maps with Python and Evaluating Individual Softmax Volumes using Python: in both cases it is not specified that the evaluate() function expects an iterable for both y_pred and y_true.

lesion level evaluation only considers positive lesions?

Hi, I am using the eval script for a case where we have TN on lesions, which results in lesion results like this
'11025_1001045': [], '10822_1000838': [],
in the metric.json.

It follows that in this line it is not included, since the list is empty, which results in all true-negative cases not being taken into account for lesion-level evaluation. Is it intended to do so?

Thank you in advance!

Example for evaluation

Hi,

I am trying to understand the idea behind the code of picai_eval. Could you give me a small example of how to evaluate a dataset from the terminal? Just some of the commands you use on your machine to prepare or start the evaluation would be enough for me.

I couldn't start the evaluation with this code:

python -m picai_eval --input path/to/detection_maps --labels path/to/annotations

Issues with Evaluating Results using picai_eval in nnU-Net

Hello,

I am currently working on a project using nnU-Net and have encountered an issue while trying to evaluate my results. I have successfully used nnU-Net for training and nnUNetv2_predict for predicting my test dataset. However, I am facing difficulties in properly evaluating the results, with both Detection Maps and Softmax Volumes.

Here are the details of the issue:

Environment Details:

Path: e:\Hesam\envs\3dunetAghdam\lib\site-packages\sklearn\metrics_ranking.py
Warning: UndefinedMetricWarning: No negative samples in y_true, false positive value should be meaningless
This warning suggests an issue with the evaluation metrics due to the absence of negative samples in y_true.
Evaluation Results:

For Detection Maps:
auroc: NaN
AP: 0.7681818181818182
num_cases: 43
num_lesions: 45
picai_eval_version: 1.4.x
lesion_results: Various values (e.g., 10021_1000021: [1, 1.0, 0.8476386663616025])
For Softmax Volumes:
auroc: NaN
AP: 0.8255997853760775
num_cases: 43
num_lesions: 45
picai_eval_version: 1.4.x
I am unsure why the auroc value is NaN and whether the warning about no negative samples is impacting the evaluation results. Any guidance or suggestions on how to resolve this issue would be greatly appreciated.

Evaluation progress bar

When evaluating predictions, a progress bar should be shown when verbose=1. However, during evaluation the progress bar is not shown (it can be seen from the CPU metrics that evaluation is running). After finishing evaluation, a meaningless progress bar is shown:

Evaluating: 100%|█████████████████████████████████████████████████████| 100/100 [00:00<00:00, 74327.56it/s]

I've traced this issue to the fact that we create a ThreadPoolExecutor pool within the make_evaluation_iterator function:

with ThreadPoolExecutor(max_workers=num_threads) as pool:

The iterator is then returned at the end of the function, but at that point the with ThreadPoolExecutor block is already exited, causing all jobs to be executed. Kinda the point of with, but not the intended behavior here.
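
A minimal sketch of the pattern in question (illustrative; the names differ from the actual code): exiting the with block calls pool.shutdown(wait=True), so by the time the iterator is returned all jobs have already been executed, and iterating over it afterwards only yields precomputed results.

from concurrent.futures import ThreadPoolExecutor
import time

def do_work(x):
    time.sleep(0.1)
    return x * x

def make_iterator(jobs, num_threads=4):
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        iterator = pool.map(do_work, jobs)
    # the with block has exited here: shutdown(wait=True) already ran all jobs
    return iterator

results = list(make_iterator(range(100)))  # iteration is instant, the work is long done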

Nan values in metrics outputs

Hello, I am invoking the evaluate function and I get the auroc and score metrics as NaN.
I invoke the function like this:

from report_guided_annotation import extract_lesion_candidates


y_hat=torch.sigmoid(y_hat) # y_hat is an output of the model
y_det=[extract_lesion_candidates( np.argmax(x.cpu().detach().numpy(),axis=0) )[0] for x in y_hat]
y_true=[np.argmax(x.cpu().detach().numpy(),axis=0) for x in labels]
# argmax to undo one-hot encoding
print( f"suums y_det {np.sum(y_det[0])} y_true  {np.sum(y_true[0])} len { len(y_det) } shapes  y_det {np.shape(y_det[0])} y_true  {np.shape(y_true[0])} ")


valid_metrics = evaluate(y_det=y_det,    y_true=y_true)

printed example output

suums y_det 77 y_true  1084 len 1 shapes  y_det (192, 192, 64) y_true  (192, 192, 64) 
No negative samples in y_true, false positive value should be meaningless
metrics.auroc nan metrics.AP -0.0  metrics.score nan  

As you see, np.sum() returns non-zero values for both the labels and the algorithm output, for both cases in the list.
Hence, for the two validation cases, both the algorithm output and the gold standard are non-zero. The situation repeats for multiple cases (only the numbers change), and every time:

  1. y_hat and y have no NaN values
  2. are non-zero
  3. have the same shape

Hence both the message about no negative samples in y_true and the presence of NaN values are highly mysterious to me.

Additionally, I get valid (approximately decreasing, not NaN-valued) loss function output.

Thank you for your help!
