driver_evaluation

Introduction

This repository contains scripts to evaluate cancer driver prediction methods and compare them against predictions obtained from 18 existing methods on 15 TCGA cancer types. For each cancer type, we provide ready-to-use genomic (point mutation and copy number variation) and transcriptomic data. In addition, we provide a database containing results from the different driver prediction methods (SIFT, PolyPhen2, MutationTaster, MutationAssessor, CHASM, transFIC, fathmm, ActiveDriver, MutSigCV, OncodriveCLUST, OncodriveFM, OncodriveCIS, S2NNetBox, HotNet2, DriverNet, DawnRank and OncoIMPACT). For further details, see Bertrand et al., 2017.

Database containing cancer genomic data and driver predictions

The database contains the evaluation datasets and the predictions from the 18 methods listed above. It is automatically downloaded during the software installation (see below) and is organized as follows:

Genomic and transcriptomic data from 15 TCGA cancer types

Point mutation and copy number variation data for all cancer types were obtained from GDAC Firehose. Expression data (level 3) for tumor and normal samples of all cancer types were downloaded from the TCGA website. Samples lacking any of the three data types were excluded.

The directory EVALUATION_DATA_SET/DATA contains the following files for each cancer type (a loading sketch follows the list):

  • GDAC_somatic_mutations.filtered.maf
    File that contains point mutations in maf format
  • point_mutation_matrix.txt
    Matrix (column: sample ID, row: gene name) where a cell is equal to 1 if the gene is mutated (indels, missense, nonsense and splice site variants), 0 otherwise
  • CNA_matrix.txt
    Matrix (column: sample ID, row: gene name) where a cell is equal to 1 if the gene is part of a focal amplification, -1 if the gene is part of a focal deletion, 0 otherwise
  • normalized_expression_matrix.txt
    Matrix (column: sample ID, row: gene name) where a cell represents the normalized expression value obtained using DESeq
  • differential_expression_matrix.txt
    Matrix (column: sample ID, row: gene name) where a cell represents the fold change obtained using DESeq by comparing each tumor to a set of normal samples (see Bertrand et al., 2017)
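
As an illustration, these matrices can be loaded with standard data-analysis tools. The sketch below is not part of the repository's Perl scripts; it assumes tab-delimited files with gene names in the first column and sample IDs as column headers, and a hypothetical GBM/ subdirectory (check the actual layout and delimiter in your download):

# Illustrative sketch only (assumptions noted above).
import pandas as pd

mut = pd.read_csv("EVALUATION_DATA_SET/DATA/GBM/point_mutation_matrix.txt",
                  sep="\t", index_col=0)   # 0/1 point mutation matrix
cna = pd.read_csv("EVALUATION_DATA_SET/DATA/GBM/CNA_matrix.txt",
                  sep="\t", index_col=0)   # -1/0/1 copy number matrix

# Genes mutated in the first sample, using the 0/1 encoding described above
sample_id = mut.columns[0]
mutated_genes = mut.index[mut[sample_id] == 1]
print(len(mutated_genes), "mutated genes in sample", sample_id)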

Predictions from 18 methods on 15 cancer types

The .result files for the different methods are provided for each cancer type. They can be found in EVALUATION_DATA_SET/RESULT and use the following unified format (a parsing sketch follows the list):

  • Gene_name: HUGO gene name

  • Sample: For methods that provide patient-specific predictions, the list of patient IDs in which the gene is predicted as a driver (separated by ';'); ALL otherwise

  • Rank: Rank of the gene according to the method based on the reported score or p-value

  • Score: Score or p-value reported by the method

  • Info: Additional information reported by the method

  • Sample-specific_score: For methods that provide patient-specific predictions, the list of scores or p-values reported by the method for each patient (separated by ';')
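
The sketch below parses a .result file into per-gene records. It is not part of the repository's scripts and assumes tab-delimited files with a header row naming the six columns above (verify against the files in EVALUATION_DATA_SET/RESULT):

# Illustrative sketch (assumptions noted above).
import csv

def read_result(path):
    # Returns records sorted by rank; 'samples' is None for cohort-level calls (Sample == ALL).
    records = []
    with open(path) as fh:
        for row in csv.DictReader(fh, delimiter="\t"):
            records.append({
                "gene": row["Gene_name"],
                "rank": int(row["Rank"]),
                "score": row["Score"],
                "samples": None if row["Sample"] == "ALL" else row["Sample"].split(";"),
            })
    return sorted(records, key=lambda r: r["rank"])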

Installing the evaluation scripts

  • Download the latest version of the software and unzip it, or clone the repository with git clone https://github.com/CSB5/driver_evaluation.git

  • Run the ./install.pl command to install the software and download the required databases.

Evaluate new methods using the evaluation scripts

  1. Analyze data for the different cancer types using the new method

  2. Convert the output file of the method into the unified result format defined above. The result files should be named according to the cancer type analyzed, with a '.result' extension (e.g. GBM.result), and placed in a single directory; the name of that directory will be used as the method name in the evaluation result files (see the conversion sketch after this list).

  3. Run the evaluation script driver_evaluation.pl using the following options:

    • --method_dir: Directory that contains the '.result' files of the evaluated method

    • --out_dir: The directory that will contain the evaluation result files
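
As a conversion sketch for step 2, the snippet below writes a cohort-level (non patient-specific) prediction list in the unified .result format. It assumes tab-delimited output with a header row; the input (a list of gene/score pairs) and the method directory name are hypothetical:

# Illustrative sketch (assumptions noted above).
import csv

def write_result(gene_scores, out_path):
    # gene_scores: list of (gene_name, score) pairs, higher score = stronger driver call
    ranked = sorted(gene_scores, key=lambda gs: gs[1], reverse=True)
    with open(out_path, "w", newline="") as fh:
        writer = csv.writer(fh, delimiter="\t")
        writer.writerow(["Gene_name", "Sample", "Rank", "Score",
                         "Info", "Sample-specific_score"])
        for rank, (gene, score) in enumerate(ranked, start=1):
            writer.writerow([gene, "ALL", rank, score, "", ""])

# e.g. write_result(my_gbm_scores, "MyMethod/GBM.result")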

You can perform a test run using the following commands:

cd TEST_DATA_SET/
perl ../bin/driver_evaluation.pl --method_dir ConsensusDriver/ --out_dir eval_result

Description of the evaluation result files

Cohort level evaluation

Concordance with gold standard

The methods were evaluated on how well their predictions identify known cancer driver genes, based on three standard measures: precision (the fraction of predictions that belong to the gold standard), recall (the fraction of the gold standard contained in the predictions) and the F1 score, which combines precision and recall. The gold-standard gene lists can be found in driver_evaluation/GOLD_STANDARD/.
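
For reference, a minimal sketch of the three measures on a top-k prediction list (this is not the repository's Perl implementation; 'predictions' is a rank-ordered list of gene names and 'gold_standard' a set of gene names):

# Minimal sketch of precision/recall/F1 over the top-k predictions.
def precision_recall_f1(predictions, gold_standard, k=50):
    top_k = set(predictions[:k])
    true_pos = len(top_k & gold_standard)
    precision = true_pos / len(top_k) if top_k else 0.0
    recall = true_pos / len(gold_standard) if gold_standard else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall > 0 else 0.0
    return precision, recall, f1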

The following files contain the evaluation results using the top 50 predictions (similar files are also available for the top 10 predictions):

  • cancer_gene_CANCER_UNION_precision_RANK_50.dat:
    Matrix (column: method, row: cancer type) where a cell represents the precision

  • cancer_gene_CANCER_UNION_recall_RANK_50.dat:
    Matrix (column: method, row: cancer type) where a cell represents the recall

  • cancer_gene_CANCER_UNION_F1_RANK_50.dat:
    Matrix (column: method, row: cancer type) where a cell represents the F1 score

  • method_name_precision_RANK_50.dat:
    Precision as a function of the number of predictions for method_name.

Patient level evaluation

Number of drivers per patient

  • sample_nb_driver_cat_RANK_ALL.dat:
    Matrix (column: method, row: number of predicted driver category [0, 1, 2-3, 4-8, 9-15, 16-25, >26]) where a cell represents the fraction of patients for which the predicted number of drivers falls in the given category.

Concordance with gold standard

The methods were evaluated on how well their patient-specific predictions identified cancer driver genes at the patient level based on three standard measures: precision, recall and the F1 score.

The following files contain the evaluation results using the top 5 predictions (similar files are also available for the top 3 and top 10 predictions):

  • sample_precision_RANK_5.dat:
    Matrix (column: method, row: sample ID) where a cell represents the precision

  • sample_recall_RANK_5.dat:
    Matrix (column: method, row: sample ID) where a cell represents the recall

  • sample_F1_RANK_5.dat:
    Matrix (column: method, row: sample ID) where a cell represents the F1 score

Prediction of actionable genes

The methods were evaluated on their ability to identify patient-specific drivers that are potentially actionable. The actionable gene list was obtained by combining the actionable gene lists from intOGen and OncoKB; it can be found in driver_evaluation/ACTIONABLE_GENES/combine_target.dat. The following files contain the evaluation results using the top 5 predictions (similar files are also available for the top 10 predictions):

  • sample_actionable_profile_5_all.dat:
    Matrix (column: method, row: actionable gene category [0: approved drug, 1: investigational target, 2: research target, 3: not actionable]). For each patient, the predicted driver that falls in the best actionable gene category is retained (0 being the best and 3 the worst); each cell represents the fraction of patients falling in a given category (see the sketch after this list).

  • sample_actionable_profile_5_cancer_type.dat:
    Matrix (column: method, row: cancer type) where a cell represents the fraction of patients with a predicted actionable gene.
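
The per-patient logic can be sketched as follows. This assumes a gene-to-category mapping loaded from combine_target.dat (its exact format should be checked against the file); genes absent from the mapping are treated as not actionable (category 3):

# Illustrative sketch (assumptions noted above).
def best_actionable_category(predicted_drivers, actionable):
    # actionable: dict mapping gene name -> category 0-2; missing genes default to 3
    categories = [actionable.get(gene, 3) for gene in predicted_drivers]
    return min(categories) if categories else 3

def actionable_profile(patient_predictions, actionable):
    # patient_predictions: dict patient ID -> list of top-5 predicted driver genes
    counts = {0: 0, 1: 0, 2: 0, 3: 0}
    for drivers in patient_predictions.values():
        counts[best_actionable_category(drivers, actionable)] += 1
    n = len(patient_predictions)
    return {cat: c / n for cat, c in counts.items()} if n else counts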

Additional files

  • driver_number.dat:
    Matrix (column: method, row: cancer type) where a cell represents the number of drivers predicted

Contact:

Please direct any questions or feedback to Denis Bertrand ([email protected]) and Niranjan Nagarajan ([email protected]).
