
lima1 / purecn


Copy number calling and variant classification using targeted short read sequencing

Home Page: https://bioconductor.org/packages/devel/bioc/html/PureCN.html

License: Artistic License 2.0

R 97.56% Dockerfile 0.61% TeX 1.84%
copy-number cell-free-dna tumor-heterogeneity tumor-purity tumor-mutational-burden loh bioconductor-package

purecn's Introduction


PureCN

A tool developed for tumor-only diagnostic sequencing using hybrid-capture protocols. It provides copy number adjusted for purity and ploidy and can classify mutations by somatic status and clonality. It requires a pool of process-matched normals for coverage normalization and artifact filtering. PureCN was parameterized using large collections of diverse samples, ranging from low coverage whole-exome to ultra-deep sequenced plasma gene-panels.

Installation

To install this package, start R and enter:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("PureCN")

If your R/Bioconductor version is outdated, this will install an old and unsupported version.

For outdated R/Bioconductor versions, you can try backporting the latest stable version (this should work fine for Bioconductor 3.3 and later):

BiocManager::install("lima1/PureCN", ref = "RELEASE_3_19")

If you want the latest and greatest from the developer branch:

BiocManager::install("lima1/PureCN")

To get the latest stable version from Conda (the unstable version is currently only available from GitHub directly):

conda install -c bioconda bioconductor-purecn=2.8.1

A Dockerhub image of the latest stable version with recommended dependencies such as GenomicsDB and GATK 4 pre-installed:

docker pull markusriester/purecn:latest

Tutorials

To get started:

vignette("Quick", package = "PureCN")

For the R package and more detailed information:

vignette("PureCN", package = "PureCN")

These tutorials are also available on the Bioconductor project page (devel, stable).

Bugs

Before posting a bug report:

  • update to the latest version
  • confirm with sessionInfo() that the latest version is used
  • if this is a first PureCN attempt, closely follow the Quick vignette (devel, stable)
  • make sure that the issue is not covered in the Support section of the main vignette
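
To confirm which version is actually installed, as the checklist above asks, the version can be queried without attaching the package. A minimal sketch; `installed_version` is a hypothetical helper, not part of PureCN, and it only relies on `requireNamespace()` and `utils::packageVersion()`:

```r
# Hypothetical helper: returns the installed version of a package as a string,
# or NA if the package is not installed.
installed_version <- function(pkg) {
    if (requireNamespace(pkg, quietly = TRUE)) {
        as.character(utils::packageVersion(pkg))
    } else {
        NA_character_
    }
}

installed_version("PureCN")
```

Including this value together with the sessionInfo() output makes it easier to tell whether a report refers to the current release.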

Papers

  • Main paper describing the likelihood model:

    Riester M, Singh A, Brannon A, Yu K, Campbell C, Chiang D and Morrissey M (2016). “PureCN: Copy number calling and SNV classification using targeted short read sequencing.” Source Code for Biology and Medicine, 11, pp. 13. doi: 10.1186/s13029-016-0060-z.

  • Validation paper, including description of novel additions, such as off-target support, tangent normalization and tweaks to the likelihood model:

    Oh S, Geistlinger L, Ramos M, Morgan M, Waldron L, Riester M (2020). Reliable analysis of clinical tumor-only whole exome sequencing data. JCO Clinical Cancer Informatics. doi: 10.1200/CCI.19.00130;
    bioRxiv. doi: 10.1101/552711

Selected citations

Pereira et al. (2021). "Cell-free DNA captures tumor heterogeneity and driver alterations in rapid autopsies with pre-treated metastatic cancer". Nature Communications. doi: 10.1038/s41467-021-23394-4.

Dummer et al. (2020). "Combined PD-1, BRAF and MEK inhibition in advanced BRAF-mutant melanoma: safety run-in and biomarker cohorts of COMBI-i". Nature Medicine. doi: 10.1038/s41591-020-1082-2.

Bertucci et al. (2019). "Genomic characterization of metastatic breast cancers". Nature. doi: 10.1038/s41586-019-1056-z.

Dagogo-Jack et al. (2018). "Tracking the evolution of resistance to ALK tyrosine kinase inhibitors through longitudinal analysis of circulating tumor DNA". JCO Precision Oncology. doi: 10.1200/PO.17.00160.

Orlando et al. (2018). "Genetic mechanisms of target antigen loss in CAR19 therapy of acute lymphoblastic leukemia". Nature Medicine. doi: 10.1038/s41591-018-0146-z.

Pal et al. (2018). "Efficacy of BGJ398, a fibroblast growth factor receptor 1-3 inhibitor, in patients with previously treated advanced urothelial carcinoma with FGFR3 alterations". Cancer Discovery. doi: 10.1158/2159-8290.CD-18-0229.

Pitt et al. (2018). "Characterization of Nigerian breast cancer reveals prevalent homologous recombination deficiency and aggressive molecular features". Nature Communications. doi: 10.1038/s41467-018-06616-0.

purecn's People

Contributors

andrewrech, aoles, chapmanb, ddrichel, dtenenba, hpages, jwokaty, lima1, link-ny, lshep, nturaga, vobencha


purecn's Issues

calculateMappingBiasVcf with pool of normals containing duplicate samples

Dear Markus,

I have a pool of processed matched normals that I am including in a workflow where there are no tumour-normal pairs. Some of these normals are the same sample sequenced in different batches.

My question is - if I am including duplicate samples in the normal panel vcf, should I alter the parameters of calculateMappingBiasVcf?

Thank you.

Copy ratio raw and adjusted, table column documentation

The segments, genes, and variants output files all contain columns with log2 coverage ratio values in them, unadjusted for purity and ploidy. Here are several suggestions related to this:

  1. Make the tables describing output file columns easier to find. Have an entry in the table of contents for them, and perhaps put each one on a separate page. And, make their titles refer to filename arguments to PureCN R program files, not to PureCN functions (which I think you are no longer recommending the user directly call; in fact, the entire document should be changed to reflect the recommended method of using the PureCN programs rather than internal functions).
  2. Make sure each time a ratio or copy number column appears, the description says whether or not it is adjusted for purity or ploidy (e.g. log.ratio column in variants file), and whether or not it has been rounded to an integer value. For log ratio, always say log2 in the description.
  3. Add a short section to the guide that gives the relationships between raw and adjusted copy number and copy ratio. Use the equations from the paper you referred me to, Zack, Travis I., et al. "Pan-cancer patterns of somatic copy number alteration." Nature genetics 45.10 (2013): 1134, but look into my assertion that his equation for R'(x) is wrong according to his preceding equations.
  4. Consider adding additional output file columns containing: (a) raw copy ratio (apply 2^ to log2 ratio); (b) adjusted copy ratio; (c) ploidy; (d) purity (c and d copied from the QC file). If not, just make sure (3) above makes it clear to user what to do to transform the log2 ratio into an adjusted copy ratio.
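
The relationship asked for in points 3 and 4 can be sketched as follows. This assumes the commonly used two-component mixture of tumor cells (at the given purity and tumor ploidy) and diploid normal cells; it has not been checked against the equations in the cited paper or against PureCN's internals, and both function names are hypothetical:

```r
# Assumption: observed ratio = (p * C + (1 - p) * 2) / (p * ploidy + (1 - p) * 2),
# where p is purity, C the tumor copy number and ploidy the tumor ploidy.

# Convert a raw log2 coverage ratio into a purity/ploidy-adjusted copy number.
log2_ratio_to_copy_number <- function(log2.ratio, purity, ploidy, normal.cn = 2) {
    ratio <- 2^log2.ratio                                    # raw copy ratio
    avg.cn <- purity * ploidy + (1 - purity) * normal.cn     # average copies per cell
    (ratio * avg.cn - (1 - purity) * normal.cn) / purity
}

# Inverse: expected raw log2 ratio for a given tumor copy number.
copy_number_to_log2_ratio <- function(cn, purity, ploidy, normal.cn = 2) {
    avg.cn <- purity * ploidy + (1 - purity) * normal.cn
    log2((purity * cn + (1 - purity) * normal.cn) / avg.cn)
}

lr <- copy_number_to_log2_ratio(3, purity = 0.5, ploidy = 2)
log2_ratio_to_copy_number(lr, purity = 0.5, ploidy = 2)
```

The result of the first function is not rounded, so a description of an output column would still need to state whether an integer copy number or the unrounded value is reported.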

VCF output is not generated

Hello,

I ran PureCN 1.10 with --outvcf parameter but it did not generate any VCF files. I want to know which somatic variants are clonal or subclonal. Here is the script that I use to construct the running command:

Rscript --vanilla /scratch/users/berguener/bin/PureCN/inst/extdata/PureCN.R \
    --out $SAMPLE_NAME \
    --tumor "${SAMPLE_NAME}_realigned_coverage_loess.txt" \
    --sampleid $SAMPLE_NAME \
    --vcf mutect1/$SAMPLE_NAME.vcf \
    --statsfile mutect1/$SAMPLE_NAME.mutect \
    --normaldb ref_bam/purecn_coverage/normalDB_TruSeq_b37.rds \
    --normal_panel ref_bam/purecn_coverage/mapping_bias_TruSeq_b37.rds \
    --intervals TruSeq_Exome_targeted_regions_b37_intervals_annotated.txt \
    --targetweightfile ref_bam/purecn_coverage/target_weights_TruSeq_b37.txt \
    --outvcf "${SAMPLE_NAME}_purecn.vcf" \
    --minpurity 0.75 \
    --genome b37 --force --postoptimize --seed 123

Can you please check.
Best, bekir

split.default error is not due to single bp intervals

Related to #28

Intervals

R:       normalDB <- PureCN::createNormalDatabase(ncovf[5:6])
WARN [2018-05-07 14:03:48] Found 2 overlapping intervals, starting at line 5352.
WARN [2018-05-07 14:03:49] Found 2 overlapping intervals, starting at line 5352.
Error in split.default(x$average.coverage, as.character(seqnames(x))) :
  first argument must be a vector
Calls: <Anonymous> -> sapply -> lapply -> FUN -> split -> split.default
No traceback available
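
For reference, in base R this exact message is what split() raises when its first argument is NULL, for example when the requested column does not exist. Whether that is what happens inside createNormalDatabase here is an assumption; the toy data frame below is made up for illustration:

```r
# Toy data: note there is no column called `average.coverage`.
x <- data.frame(coverage = c(10, 20, 30))
chr <- c("chr1", "chr1", "chr2")

# `$` on a missing column silently returns NULL
is.null(x$average.coverage)

# split() then fails because NULL is not a vector
msg <- tryCatch(
    split(x$average.coverage, chr),
    error = function(e) conditionMessage(e)
)
msg

# With an existing column, split() works as expected
by.chr <- split(x$coverage, chr)
by.chr
```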
	library(data.table)
	dt <- fread(interval.file)
	dt[, c("chr", "interval") := Target %>% tstrsplit(":")]
	dt[, c("start", "stop") := interval %>% tstrsplit("-")]
	dt %>% str
Classes ‘data.table’ and 'data.frame':  78060 obs. of  9 variables:
 $ Target          : chr  "chr1:931039-931089" "chr1:939272-939460" "chr1:942410-942488" "chr1:943698-944011" ...
 $ total_coverage  : num  3239 26849 2278 37733 83135 ...
 $ counts          : num  101 489.7 50.9 659.8 1260 ...
 $ on_target       : logi  TRUE TRUE TRUE TRUE TRUE TRUE ...
 $ duplication_rate: num  0.26 0.198 0.154 0.223 0.251 ...
 $ chr             : chr  "chr1" "chr1" "chr1" "chr1" ...
 $ interval        : chr  "931039-931089" "939272-939460" "942410-942488" "943698-944011" ...
 $ start           : chr  "931039" "939272" "942410" "943698" ...
 $ stop            : chr  "931089" "939460" "942488" "944011" ...
 - attr(*, ".internal.selfref")=<externalptr>
dt[start == stop] %>% nrow
[1] 0

Perhaps these 1bp regions?

      dt[, start := start %>% as.numeric]
      dt[, stop := stop %>% as.numeric]
      dt[stop-start < 2]
                      Target total_coverage       counts on_target duplication_rate   chr
 1: chr2:112584648-112584649             NA           NA      TRUE     0.2157894737  chr2
 2: chr2:170178288-170178289             NA           NA      TRUE     0.1917098446  chr2
 3: chr2:179583975-179583976             NA           NA      TRUE     0.2403100775  chr2
 4:   chr3:77646054-77646055  128.295666733 64.147833366      TRUE     0.2560975610  chr3
 5:   chr3:15414374-15414375    6.309622954  3.154811477      TRUE     0.0000000000  chr3
 6:   chr3:50299213-50299214             NA           NA      TRUE     0.2352941176  chr3
 7:   chr7:35801277-35801278  154.585762375 77.818683100      TRUE     0.1395348837  chr7
 8: chr7:150949409-150949410    0.000000000  0.000000000      TRUE               NA  chr7
 9: chr8:143986358-143986359             NA           NA      TRUE     0.2528735632  chr8
10:  chr10:59221683-59221684    0.000000000  0.000000000      TRUE               NA chr10
11:  chr10:76883991-76883992    0.000000000  0.000000000      TRUE               NA chr10
12:  chr14:23421768-23421769             NA           NA      TRUE     0.0000000000 chr14
13:  chr16:66547971-66547972   79.921890752 41.012549202      TRUE     0.2352941176 chr16
               interval     start      stop
 1: 112584648-112584649 112584648 112584649
 2: 170178288-170178289 170178288 170178289
 3: 179583975-179583976 179583975 179583976
 4:   77646054-77646055  77646054  77646055
 5:   15414374-15414375  15414374  15414375
 6:   50299213-50299214  50299213  50299214
 7:   35801277-35801278  35801277  35801278
 8: 150949409-150949410 150949409 150949410
 9: 143986358-143986359 143986358 143986359
10:   59221683-59221684  59221683  59221684
11:   76883991-76883992  76883991  76883992
12:   23421768-23421769  23421768  23421769
13:   66547971-66547972  66547971  66547972
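
The same kind of check can be done without data.table. The sketch below, on made-up targets, flags both very short intervals and intervals that overlap their predecessor. It assumes chromosome names contain neither ":" nor "-", and it only compares each interval with the one directly before it, so nested overlaps could be missed:

```r
# Made-up example targets in the same "chr:start-end" format as above.
targets <- c("chr1:100-200", "chr1:150-250", "chr1:300-301", "chr2:10-50")

# Split "chr:start-end" into three columns.
parts <- do.call(rbind, strsplit(targets, "[:-]"))
iv <- data.frame(
    chr = parts[, 1],
    start = as.numeric(parts[, 2]),
    end = as.numeric(parts[, 3]),
    stringsAsFactors = FALSE
)
iv <- iv[order(iv$chr, iv$start), ]

# Compare each interval with the previous one on the same chromosome.
n <- nrow(iv)
same.chr <- c(FALSE, iv$chr[-1] == iv$chr[-n])
prev.end <- c(NA, iv$end[-n])
iv$overlaps.previous <- same.chr & iv$start <= prev.end

# Intervals spanning fewer than two positions.
iv$short <- iv$end - iv$start < 2

iv[iv$overlaps.previous | iv$short, ]
```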

Cleanup setPriorVcf

setPriorVcf grew organically. Its cutoffs are ad hoc and might not work well for all combinations of germline and somatic databases and their corresponding versions.

  • Use population allele frequencies, PoN counts and COSMIC counts jointly when available (e.g. cases with high allele frequencies are wrongly set to a prior of 0.5 when found in COSMIC with >= 6 counts, despite being also found in the PoN).
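
One way such a joint rule could look is sketched below. This is not the current or planned setPriorVcf logic: the function name, the thresholds `max.af` and `max.pon`, and the hotspot prior of 0.995 are all hypothetical. Only the priors 0.0005 and 0.5 and the COSMIC cutoff of 6 are taken from values mentioned in these issues:

```r
# Hypothetical joint rule: evidence for a germline origin (population allele
# frequency or presence in the pool of normals) takes precedence over COSMIC.
joint_somatic_prior <- function(pop.af, pon.count, cosmic.count,
                                prior.germline = 0.0005,
                                prior.default = 0.5,
                                prior.hotspot = 0.995,   # illustrative value
                                max.af = 0.001,          # illustrative value
                                max.pon = 0,             # illustrative value
                                min.cosmic = 6) {
    likely.germline <- (!is.na(pop.af) & pop.af > max.af) | pon.count > max.pon
    hotspot <- cosmic.count >= min.cosmic & !likely.germline
    ifelse(hotspot, prior.hotspot,
           ifelse(likely.germline, prior.germline, prior.default))
}

# A common variant seen in the PoN stays germline even with many COSMIC hits;
# an unannotated variant with COSMIC support gets the hotspot prior.
joint_somatic_prior(pop.af = c(0.2, NA, NA),
                    pon.count = c(3, 0, 0),
                    cosmic.count = c(10, 10, 0))
```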

Crash in callLOH with NCBI-style chromosome names (1,2,...X,Y)

Error message:
Error in $<-.data.frame(*tmp*, "arm", value = "p") :
replacement has 1 row, data has 0
Calls: write.csv ... callLOH -> .getArmLocations -> $<- -> $<-.data.frame
Execution halted

The problem is that data(centromeres) is not converted to the NCBI naming style.
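
In Bioconductor, GenomeInfoDb::seqlevelsStyle() handles this conversion for range objects. The base-R sketch below only illustrates the idea on plain character vectors; `to_ncbi` and `to_ucsc` are hypothetical helpers that cover the primary chromosomes and the mitochondrial contig, not alternative contigs:

```r
# UCSC style ("chr1", "chrX", "chrM") to NCBI style ("1", "X", "MT").
to_ncbi <- function(x) {
    x <- sub("^chr", "", x)
    x[x == "M"] <- "MT"
    x
}

# NCBI style to UCSC style; names that already start with "chr" are kept.
to_ucsc <- function(x) {
    x[x == "MT"] <- "M"
    ifelse(grepl("^chr", x), x, paste0("chr", x))
}

to_ncbi(c("chr1", "chrX", "chrM"))
to_ucsc(c("1", "X", "MT", "chr2"))
```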

CNV burden and duplication homologous chromosomes

I want to calculate CNV burden, which is the fraction of the genome containing a CNV. However, I realized that in some cases a CNV segment might be present on both homologous chromosomes, so its length would need to be counted twice.

For example, if region 10-30 Mbp of one copy of chr1 is duplicated, and region 20-40 Mbp of the OTHER copy of chr1 is duplicated, I presume PureCN would produce 3 CNV segments:

  1. 10-20 Mbp at copy number = 3
  2. 20-30 Mbp at copy number = 4
  3. 30-40 Mbp at copy number = 3

Is there any output from PureCN that would let me distinguish whether a segment is a CNV on both homologous chromosomes? I would think the germline het SNPs in the region could help distinguish this, although you are already pushing those to find purity and ploidy. Still, once a purity/ploidy solution is obtained, it should be possible to do this. Or maybe this is already done?

And of course, clones in the sample further mess with this.
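
For concreteness, the simple version of the burden calculation on the three-segment example above could look like the sketch below. The column names (chr, start, end, C) and the function name are assumptions, not a documented PureCN output format. Each altered segment is counted once by total copy number, so this does not address the question of whether a gain sits on one or both homologs, and the denominator is the length covered by segments rather than the whole genome:

```r
# Made-up segment table for chr1, in base pairs; C is the total copy number.
segs <- data.frame(
    chr = "chr1",
    start = c(0, 10e6, 20e6, 30e6, 40e6),
    end = c(10e6, 20e6, 30e6, 40e6, 100e6),
    C = c(2, 3, 4, 3, 2)
)

# Fraction of the segmented length whose copy number differs from normal.
cnv_burden <- function(segs, normal.cn = 2) {
    len <- segs$end - segs$start
    sum(len[segs$C != normal.cn]) / sum(len)
}

cnv_burden(segs)
```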

gene-level calls not working - callAlterations

Hi,

I am testing PureCN and it runs fine, except that the gene-level calls are not working, so I can't identify which genes have CNVs just by checking the result table.

I added a "Gene" column to my interval file, so I don't understand why this issue happens. Below is an example of my interval file:

Target gc_bias Gene
chr1:2488048-2488227 0.65 TNFRSF14
chr1:2489099-2489338 0.6625 TNFRSF14
chr1:2489725-2489964 0.641666666666667 TNFRSF14
chr1:2491220-2491459 0.691666666666667 TNFRSF14
chr1:2492018-2492197 0.65 TNFRSF14
chr1:2492909-2493302 0.619289340101523 TNFRSF14

Below is the output of the runAbsoluteCN function (this step is okay).

ret <-runAbsoluteCN(normal.coverage.file=pool, tumor.coverage.file=tumor.coverage.file, genome="hg19", sampleid="HORIZON", gc.gene.file=intervals, normalDB=normalDB, args.segmentation=list(target.weight.file=target.weight.file), post.optimize=FALSE, plot.cnv=FALSE, verbose=TRUE)
INFO [2017-10-09 17:38:29] ------------------------------------------------------------
INFO [2017-10-09 17:38:29] PureCN 1.6.3
INFO [2017-10-09 17:38:29] ------------------------------------------------------------
INFO [2017-10-09 17:38:29] Arguments: -tumor.coverage.file /home/rramalho/projects/CNV/PURECN/manual/coverage_HORIZON.txt -genome hg19 -args.segmentation target_weights.txt -sampleid HORIZON -gc.gene.file /home/rramalho/projects/CNV/PURECN/gc_file.txt -plot.cnv FALSE -post.optimize FALSE -verbose TRUE -normal.coverage.file -normalDB
INFO [2017-10-09 17:38:29] Loading coverage files...
INFO [2017-10-09 17:38:29] Mean coverages: 1006X (tumor) 1032X (normal).
INFO [2017-10-09 17:38:29] Mean coverages: chrX: 730.57, chrY: 0.00, chr1-22: 1021.39.
WARN [2017-10-09 17:38:29] Allosome coverage missing, cannot determine sex.
WARN [2017-10-09 17:38:29] Sex tumor/normal mismatch: tumor =
INFO [2017-10-09 17:38:29] Removing 902 targets with low coverage in normalDB.
INFO [2017-10-09 17:38:29] Removing 474 low coverage (< 15X) targets.
INFO [2017-10-09 17:38:29] Using 5959 targets.
INFO [2017-10-09 17:38:29] AT/GC dropout: 0.55 (tumor), 1.11 (normal).
WARN [2017-10-09 17:38:29] High GC-bias in normal or tumor. Is data GC-normalized?
INFO [2017-10-09 17:38:29] No Gene column in gc.gene.file. You won't get gene-level calls.
INFO [2017-10-09 17:38:29] Sample sex: F
INFO [2017-10-09 17:38:29] Segmenting data...
INFO [2017-10-09 17:38:29] Target weights found, will use weighted CBS.
INFO [2017-10-09 17:38:29] Setting undo.SD parameter to 1.250000.
INFO [2017-10-09 17:38:32] Mean standard deviation of log-ratios: 0.76
INFO [2017-10-09 17:38:32] 2D-grid search of purity and ploidy...
INFO [2017-10-09 17:38:42] Local optima: 0.55/6, 0.82/6, 0.68/6, 0.95/6, 0.68/3.8, 0.72/3.4,
0.75/3, 0.78/2.6, 0.82/2.2, 0.85/1.8, 0.88/1.4, 0.75/2
INFO [2017-10-09 17:38:42] Testing local optimum 1/12 at purity 0.55 and total ploidy 6.00...
INFO [2017-10-09 17:38:42] Recalibrating log-ratios...
INFO [2017-10-09 17:38:42] Testing local optimum 1/12 at purity 0.55 and total ploidy 6.00...
INFO [2017-10-09 17:38:43] Recalibrating log-ratios...
INFO [2017-10-09 17:38:43] Testing local optimum 1/12 at purity 0.55 and total ploidy 6.00...
INFO [2017-10-09 17:38:45] Recalibrating log-ratios...
INFO [2017-10-09 17:38:45] Testing local optimum 1/12 at purity 0.55 and total ploidy 6.00...
INFO [2017-10-09 17:38:46] Testing local optimum 2/12 at purity 0.82 and total ploidy 6.00...
INFO [2017-10-09 17:38:47] Recalibrating log-ratios...
INFO [2017-10-09 17:38:47] Testing local optimum 2/12 at purity 0.82 and total ploidy 6.00...
INFO [2017-10-09 17:38:48] Recalibrating log-ratios...
INFO [2017-10-09 17:38:48] Testing local optimum 2/12 at purity 0.82 and total ploidy 6.00...
INFO [2017-10-09 17:38:50] Recalibrating log-ratios...
INFO [2017-10-09 17:38:50] Testing local optimum 2/12 at purity 0.82 and total ploidy 6.00...
INFO [2017-10-09 17:38:51] Testing local optimum 3/12 at purity 0.68 and total ploidy 6.00...
INFO [2017-10-09 17:38:52] Recalibrating log-ratios...
INFO [2017-10-09 17:38:52] Testing local optimum 3/12 at purity 0.68 and total ploidy 6.00...
INFO [2017-10-09 17:38:53] Recalibrating log-ratios...
INFO [2017-10-09 17:38:53] Testing local optimum 3/12 at purity 0.68 and total ploidy 6.00...
INFO [2017-10-09 17:38:54] Recalibrating log-ratios...
INFO [2017-10-09 17:38:54] Testing local optimum 3/12 at purity 0.68 and total ploidy 6.00...
INFO [2017-10-09 17:38:56] Testing local optimum 4/12 at purity 0.95 and total ploidy 6.00...
INFO [2017-10-09 17:38:57] Recalibrating log-ratios...
INFO [2017-10-09 17:38:57] Testing local optimum 4/12 at purity 0.95 and total ploidy 6.00...
INFO [2017-10-09 17:38:58] Recalibrating log-ratios...
INFO [2017-10-09 17:38:58] Testing local optimum 4/12 at purity 0.95 and total ploidy 6.00...
INFO [2017-10-09 17:38:59] Recalibrating log-ratios...
INFO [2017-10-09 17:38:59] Testing local optimum 4/12 at purity 0.95 and total ploidy 6.00...
INFO [2017-10-09 17:39:01] Testing local optimum 5/12 at purity 0.68 and total ploidy 3.80...
INFO [2017-10-09 17:39:02] Recalibrating log-ratios...
INFO [2017-10-09 17:39:02] Testing local optimum 5/12 at purity 0.68 and total ploidy 3.80...
INFO [2017-10-09 17:39:03] Recalibrating log-ratios...
INFO [2017-10-09 17:39:03] Testing local optimum 5/12 at purity 0.68 and total ploidy 3.80...
INFO [2017-10-09 17:39:05] Recalibrating log-ratios...
INFO [2017-10-09 17:39:05] Testing local optimum 5/12 at purity 0.68 and total ploidy 3.80...
INFO [2017-10-09 17:39:06] Testing local optimum 6/12 at purity 0.72 and total ploidy 3.40...
INFO [2017-10-09 17:39:07] Recalibrating log-ratios...
INFO [2017-10-09 17:39:07] Testing local optimum 6/12 at purity 0.72 and total ploidy 3.40...
INFO [2017-10-09 17:39:08] Recalibrating log-ratios...
INFO [2017-10-09 17:39:08] Testing local optimum 6/12 at purity 0.72 and total ploidy 3.40...
INFO [2017-10-09 17:39:10] Recalibrating log-ratios...
INFO [2017-10-09 17:39:10] Testing local optimum 6/12 at purity 0.72 and total ploidy 3.40...
INFO [2017-10-09 17:39:12] Testing local optimum 7/12 at purity 0.75 and total ploidy 3.00...
INFO [2017-10-09 17:39:12] Recalibrating log-ratios...
INFO [2017-10-09 17:39:12] Testing local optimum 7/12 at purity 0.75 and total ploidy 3.00...
INFO [2017-10-09 17:39:13] Recalibrating log-ratios...
INFO [2017-10-09 17:39:13] Testing local optimum 7/12 at purity 0.75 and total ploidy 3.00...
INFO [2017-10-09 17:39:15] Recalibrating log-ratios...
INFO [2017-10-09 17:39:15] Testing local optimum 7/12 at purity 0.75 and total ploidy 3.00...
INFO [2017-10-09 17:39:17] Testing local optimum 8/12 at purity 0.78 and total ploidy 2.60...
INFO [2017-10-09 17:39:17] Recalibrating log-ratios...
INFO [2017-10-09 17:39:17] Testing local optimum 8/12 at purity 0.78 and total ploidy 2.60...
INFO [2017-10-09 17:39:18] Recalibrating log-ratios...
INFO [2017-10-09 17:39:18] Testing local optimum 8/12 at purity 0.78 and total ploidy 2.60...
INFO [2017-10-09 17:39:20] Recalibrating log-ratios...
INFO [2017-10-09 17:39:20] Testing local optimum 8/12 at purity 0.78 and total ploidy 2.60...
INFO [2017-10-09 17:39:22] Testing local optimum 9/12 at purity 0.82 and total ploidy 2.20...
INFO [2017-10-09 17:39:22] Recalibrating log-ratios...
INFO [2017-10-09 17:39:22] Testing local optimum 9/12 at purity 0.82 and total ploidy 2.20...
INFO [2017-10-09 17:39:24] Recalibrating log-ratios...
INFO [2017-10-09 17:39:24] Testing local optimum 9/12 at purity 0.82 and total ploidy 2.20...
INFO [2017-10-09 17:39:25] Recalibrating log-ratios...
INFO [2017-10-09 17:39:25] Testing local optimum 9/12 at purity 0.82 and total ploidy 2.20...
INFO [2017-10-09 17:39:27] Testing local optimum 10/12 at purity 0.85 and total ploidy 1.80...
INFO [2017-10-09 17:39:28] Recalibrating log-ratios...
INFO [2017-10-09 17:39:28] Testing local optimum 10/12 at purity 0.85 and total ploidy 1.80...
INFO [2017-10-09 17:39:29] Recalibrating log-ratios...
INFO [2017-10-09 17:39:29] Testing local optimum 10/12 at purity 0.85 and total ploidy 1.80...
INFO [2017-10-09 17:39:30] Recalibrating log-ratios...
INFO [2017-10-09 17:39:30] Testing local optimum 10/12 at purity 0.85 and total ploidy 1.80...
INFO [2017-10-09 17:39:32] Testing local optimum 11/12 at purity 0.88 and total ploidy 1.40...
INFO [2017-10-09 17:39:33] Recalibrating log-ratios...
INFO [2017-10-09 17:39:33] Testing local optimum 11/12 at purity 0.88 and total ploidy 1.40...
INFO [2017-10-09 17:39:34] Recalibrating log-ratios...
INFO [2017-10-09 17:39:34] Testing local optimum 11/12 at purity 0.88 and total ploidy 1.40...
INFO [2017-10-09 17:39:35] Recalibrating log-ratios...
INFO [2017-10-09 17:39:35] Testing local optimum 11/12 at purity 0.88 and total ploidy 1.40...
INFO [2017-10-09 17:39:37] Testing local optimum 12/12 at purity 0.75 and total ploidy 2.00...
INFO [2017-10-09 17:39:38] Recalibrating log-ratios...
INFO [2017-10-09 17:39:38] Testing local optimum 12/12 at purity 0.75 and total ploidy 2.00...
INFO [2017-10-09 17:39:39] Recalibrating log-ratios...
INFO [2017-10-09 17:39:39] Testing local optimum 12/12 at purity 0.75 and total ploidy 2.00...
INFO [2017-10-09 17:39:40] Recalibrating log-ratios...
INFO [2017-10-09 17:39:40] Testing local optimum 12/12 at purity 0.75 and total ploidy 2.00...
INFO [2017-10-09 17:39:43] Done.
INFO [2017-10-09 17:39:43] -------------------------

However when I run callAlterations(ret) I had the following error:

gene.calls <- callAlterations(ret)
FATAL [2017-10-09 17:40:01] This function requires gene-level calls. Please add a column 'Gene'

FATAL [2017-10-09 17:40:01] containing gene symbols to the gc.gene.file.

FATAL [2017-10-09 17:40:01]

FATAL [2017-10-09 17:40:01] This is most likely a user error due to invalid input data or

FATAL [2017-10-09 17:40:01] parameters (PureCN 1.6.3).

Erro: This function requires gene-level calls. Please add a column 'Gene'
containing gene symbols to the gc.gene.file.

This is most likely a user error due to invalid input data or
parameters (PureCN 1.6.3).

sessionInfo()
R version 3.4.1 (2017-06-30)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 9 (stretch)

Matrix products: default
BLAS: /opt/R-3.4.1/lib/R/lib/libRblas.so
LAPACK: /opt/R-3.4.1/lib/R/lib/libRlapack.so

locale:
[1] LC_CTYPE=pt_BR.UTF-8 LC_NUMERIC=C
[3] LC_TIME=pt_BR.UTF-8 LC_COLLATE=pt_BR.UTF-8
[5] LC_MONETARY=pt_BR.UTF-8 LC_MESSAGES=pt_BR.UTF-8
[7] LC_PAPER=pt_BR.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=pt_BR.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets
[8] methods base

other attached packages:
[1] PureCN_1.6.3 VariantAnnotation_1.22.3
[3] Rsamtools_1.28.0 Biostrings_2.44.2
[5] XVector_0.16.0 SummarizedExperiment_1.6.5
[7] DelayedArray_0.2.7 matrixStats_0.52.2
[9] Biobase_2.36.2 GenomicRanges_1.28.6
[11] GenomeInfoDb_1.12.2 IRanges_2.10.4
[13] S4Vectors_0.14.6 BiocGenerics_0.22.0
[15] DNAcopy_1.50.1

loaded via a namespace (and not attached):
[1] locfit_1.5-9.1 splines_3.4.1 lattice_0.20-35
[4] colorspace_1.3-2 rtracklayer_1.36.5 GenomicFeatures_1.28.5
[7] blob_1.1.0 XML_3.98-1.9 rlang_0.1.2
[10] DBI_0.7 BiocParallel_1.10.1 bit64_0.9-7
[13] RColorBrewer_1.1-2 lambda.r_1.2 GenomeInfoDbData_0.99.0
[16] plyr_1.8.4 zlibbioc_1.22.0 munsell_0.4.3
[19] gtable_0.2.0 futile.logger_1.4.3 VGAM_1.0-4
[22] memoise_1.1.0 labeling_0.3 biomaRt_2.32.1
[25] AnnotationDbi_1.38.2 Rcpp_0.12.13 edgeR_3.18.1
[28] scales_0.5.0 BSgenome_1.44.2 limma_3.32.7
[31] bit_1.1-12 ggplot2_2.2.1 digest_0.6.12
[34] grid_3.4.1 tools_3.4.1 bitops_1.0-6
[37] RCurl_1.95-4.8 lazyeval_0.2.0 tibble_1.3.4
[40] RSQLite_2.0 futile.options_1.0.0 Matrix_1.2-11
[43] data.table_1.10.4 GenomicAlignments_1.12.2 compiler_3.4.1

Crash in IntervalFile.R

INFO [2018-05-10 15:33:03] Averaging reptiming into bins of size 100000...
Error in aggregate.data.frame(as.data.frame(x), ...) :
no rows to aggregate
Calls: ... aggregate -> aggregate.default -> aggregate.data.frame
In addition: Warning message:
In .Seqinfo.mergexy(x, y) :
The 2 combined objects have no sequence levels in common. (Use
suppressWarnings() to suppress this warning.)

This happens when the reptiming file uses a different chromosome naming style than the intervals.
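
A guard of the following kind would turn the crash into a readable error. This is only a sketch on plain character vectors; `common_seqlevels` is a hypothetical helper, not an existing PureCN function:

```r
# Return the chromosome names shared by two annotations, or stop with a
# message that shows examples from both so the mismatch is obvious.
common_seqlevels <- function(a, b) {
    common <- intersect(unique(a), unique(b))
    if (length(common) == 0) {
        stop("No chromosome names in common: '",
             paste(head(unique(a), 3), collapse = ","), "' vs '",
             paste(head(unique(b), 3), collapse = ","), "'")
    }
    common
}

common_seqlevels(c("chr1", "chr2"), c("chr2", "chr3"))

msg <- tryCatch(
    common_seqlevels(c("chr1", "chr2"), c("1", "2")),
    error = function(e) conditionMessage(e)
)
msg
```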

NormalDB.R command-line requirements differ from what is written in the documentation

The docs ( "Quick.pdf" from the current Bioconductor release) state:

See the main vignette for more details and file formats.
For a production pipeline run we provide again more information about the assay and genome.
Here an CNVkit example (CNVkit runs without normal reference samples are not recommended):

# Provide a normal panel VCF to remove mapping biases, pre-compute
# position-specific bias for much faster runtimes with large panels
# This needs to be done only once for each assay
Rscript $PURECN/NormalDB.R --outdir $OUT_REF --normal_panel $NORMAL_PANEL \
--assay agilent_v6 --genome hg19 --force

But trying to do something similar:

Rscript /usr/lib64/R/library/PureCN/extdata/NormalDB.R --normal_panel normals.vcf.gz --assay agilent_oneseq --genome hg19 --outdir purecn
INFO [2018-07-30 11:47:58] Loading PureCN 1.10.0...
INFO [2018-07-30 11:48:11] Creating mapping bias database.
INFO [2018-07-30 11:48:14] Processing variants 1 to 5000...
INFO [2018-07-30 11:48:18] Processing variants 5001 to 10000...
INFO [2018-07-30 11:48:22] Processing variants 10001 to 15000...
INFO [2018-07-30 11:48:27] Processing variants 15001 to 20000...
INFO [2018-07-30 11:48:31] Processing variants 20001 to 25000...
INFO [2018-07-30 11:48:35] Processing variants 25001 to 30000...
INFO [2018-07-30 11:48:40] Processing variants 30001 to 35000...
INFO [2018-07-30 11:48:49] Processing variants 35001 to 40000...
INFO [2018-07-30 11:48:54] Processing variants 40001 to 45000...
INFO [2018-07-30 11:48:58] Processing variants 45001 to 50000...
INFO [2018-07-30 11:48:58] Position chr17:78934088
INFO [2018-07-30 11:49:02] Processing variants 50001 to 55000...
WARN [2018-07-30 11:49:05] No --coveragefiles provided. Cannot generate normal database.

Error using PureCN.R

...
Removing 259 variants outside intervals.
INFO [2018-02-25 11:14:01] Setting somatic prior probabilities for dbSNP hits to 0.000500 or to 0.500000 otherwise.
Error: logical subscript contains NAs
Execution halted

Support for phased VCFs

  • Obtain test samples
  • Jointly infer M for phased SNPs in a segment
  • In predictSomatic VCF export, provide phasing in unbalanced segments.

Split Simulated Annealing and SNV fit steps

Right now, for every local optimum, we first do the SA step and then fit variants.

Often multiple local optima converge, so we might fit variants unnecessarily multiple times.

Splitting also benefits the parallel processing since there is then slightly less runtime difference between optima.
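
The idea can be sketched as a two-phase loop: run the annealing step for all optima first, collapse solutions that converged to the same purity and ploidy, and only then fit variants. Everything below is illustrative: the optima are made up, the rounding precision is arbitrary, and `fit_variants` is a placeholder for the expensive SNV fit:

```r
# Made-up purity/ploidy solutions after the simulated annealing step;
# the first two have converged to practically the same solution.
optima <- data.frame(purity = c(0.55, 0.551, 0.82),
                     ploidy = c(6.00, 6.01, 2.20))

# Collapse solutions that agree after rounding (precision chosen arbitrarily).
key <- paste(round(optima$purity, 2), round(optima$ploidy, 1))
unique.optima <- optima[!duplicated(key), ]

# Placeholder for the expensive variant fitting step.
fit_variants <- function(purity, ploidy) {
    list(purity = purity, ploidy = ploidy)
}

# Variants are now fitted once per distinct solution.
fits <- lapply(seq_len(nrow(unique.optima)), function(i) {
    fit_variants(unique.optima$purity[i], unique.optima$ploidy[i])
})
length(fits)
```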

preprocessIntervals failed

Hi
I changed all chromosome names to the ones corresponding to b37 and built the target information with preprocessIntervals, but encountered an error.

> reference.file
  A DNAStringSet instance of length 86
         width seq                                          names
 [1] 249250621 NNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNN 1 dna:chromosome ...
 [2] 243199373 NNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNN 2 dna:chromosome ...
 [3] 198022430 NNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNN 3 dna:chromosome ...
 [4] 191154276 NNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNN 4 dna:chromosome ...
 [5] 180915260 NNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNN 5 dna:chromosome ...
 ...       ... ...
[82]    191469 GATCCCTGCCCTAAAACTTTC...AGAAACACCCAAGAATGATC GL000194.1 dna:su...
[83]    211173 GAATTCTCACATGTCTCTGCA...AGAGTTTTGCTGGTGAATTC GL000225.1 dna:su...
[84]    547496 GAATTCATTCACCATTATTCT...TAGGTGCCTATCAGGAATTC GL000192.1 dna:su...
[85]    171823 AGAATTCGTCTTGCTCTATTC...TCGCCCTTATTGCCCTGTTT NC_007605
[86]  35477943 AAGAAACAGAACAAAACAACT...TGCTTTTCTGTTTTTCCCTG hs37d5
> intervals
GRanges object with 989595 ranges and 0 metadata columns:
             seqnames           ranges strand
                <Rle>        <IRanges>  <Rle>
       [1]          1   [14426, 14627]      *
       [2]          1   [14426, 14627]      *
       [3]          1   [14568, 14646]      *
       [4]          1   [14638, 14883]      *
       [5]          1   [14896, 14968]      *
       ...        ...              ...    ...
  [989591] GL000228.1 [103932, 104032]      *
  [989592] GL000228.1 [105686, 105805]      *
  [989593] GL000228.1 [107238, 107338]      *
  [989594] GL000228.1 [112298, 112417]      *
  [989595] GL000228.1 [113850, 113950]      *
  -------
  seqinfo: 27 sequences from an unspecified genome; no seqlengths
> mappability
GRanges object with 21591493 ranges and 1 metadata column:
             seqnames               ranges strand |        name
                <Rle>            <IRanges>  <Rle> | <character>
         [1]        1       [10001, 10014]      * |  0.00277778
         [2]        1       [10015, 10015]      * |    0.333333
         [3]        1       [10016, 10026]      * |         0.5
         [4]        1       [10027, 10031]      * |           1
         [5]        1       [10032, 10036]      * |         0.5
         ...      ...                  ...    ... .         ...
  [21591489]        Y [59363020, 59363314]      * |  0.00277778
  [21591490]        Y [59363315, 59363317]      * |    0.333333
  [21591491]        Y [59363318, 59363318]      * |        0.25
  [21591492]        Y [59363319, 59363320]      * |    0.333333
  [21591493]        Y [59363321, 59363517]      * |         0.5
  -------
  seqinfo: 24 sequences from an unspecified genome; no seqlengths
preprocessIntervals(intervals, reference.file, mappability=mappability, output.file = "human_g1k_v37_decoy_gc_file.txt")

I encounter the following error:

WARN [2018-05-18 12:47:36] Found 2558162 overlapping intervals, starting at line 2.
INFO [2018-05-18 12:47:50] Splitting 51059 large targets to an average width of 400.
WARN [2018-05-18 13:26:59] 372106 intervals without mapping score.
INFO [2018-05-18 13:26:59] Removing 372106 targets with low mappability score (<0.50).
Error in [[<-(*tmp*, label, value = NA) :
1 elements in value to replace 0 elements
In addition: There were 50 or more warnings (use warnings() to see the first 50)

Warning messages:
1: In mean.default(X[[i]], ...) :
argument is not numeric or logical: returning NA
2: In mean.default(X[[i]], ...) :
argument is not numeric or logical: returning NA
3: In mean.default(X[[i]], ...) :
argument is not numeric or logical: returning NA
4: In mean.default(X[[i]], ...) :
argument is not numeric or logical: returning NA
5: In mean.default(X[[i]], ...) :
argument is not numeric or logical: returning NA

traceback()

6: stop(paste(lv, "elements in value to replace", nrx, "elements"))
5: [[<-(*tmp*, label, value = NA)
4: [[<-(*tmp*, label, value = NA)
3: .addScoreToGr(interval.gr, reptiming, "reptiming")
2: .annotateIntervalsReptiming(interval.gr, reptiming)
1: preprocessIntervals(intervals, reference.file, mappability = mappability,
output.file = "human_g1k_v37_decoy_gc_file.txt")

PureCN and WGS data

Hi,

I am very interested in applying this program to our data. We are currently working with WGS pairs, and I was curious whether there are options or modifications that would allow testing PureCN on this type of data.

Thanks for your time.

Best,
Rob

createNormalDatabase returns error if bed has overlapping intervals

Hello Markus,

I note that if and only if an input .bed file has overlapping intervals:

WARN [2018-05-06 21:03:02] Coverage data contains single nucleotide intervals.
WARN [2018-05-06 21:03:02] Found 2 overlapping intervals, starting at line 5352.
WARN [2018-05-06 21:03:03] Coverage data contains single nucleotide intervals.
WARN [2018-05-06 21:03:03] Found 2 overlapping intervals, starting at line 5352.

createNormalDatabase returns an error due to

split(x$average.coverage, as.character(seqnames(x)))

in PureCN:::getSexFromCoverage:

Error in split.default(x$average.coverage, as.character(seqnames(x))) :
  first argument must be a vector
Calls: <Anonymous> -> sapply -> lapply -> FUN -> split -> split.default

I am using the latest dev version of the package on the latest R. I have created about 40 normal databases without error if the overlapping intervals message does not appear.

Minimal reproducible example:

PureCN::createNormalDatabase(c("file1", "file2"))

Should I pre-filter these, or perhaps add a tryCatch here?
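
One way to pre-filter, sketched under assumptions: merge overlapping intervals in the BED file before coverage is calculated. The helper merge_overlapping() and the column names chrom, start and end are hypothetical and not part of PureCN, and the sketch assumes BED-style half-open coordinates. With Bioconductor installed, GenomicRanges::reduce() does a similar job on GRanges objects.

```r
# Hypothetical helper, not part of PureCN: merge overlapping intervals per chromosome.
# Assumes columns chrom, start, end with BED-style half-open coordinates.
merge_overlapping <- function(bed) {
  bed <- bed[order(bed$chrom, bed$start, bed$end), ]
  pieces <- lapply(split(bed, as.character(bed$chrom)), function(x) {
    # Largest end seen so far among the preceding intervals
    prev_end <- head(cummax(x$end), -1)
    # A new merged interval starts when there is no overlap with anything before it
    grp <- cumsum(c(TRUE, x$start[-1] >= prev_end))
    data.frame(chrom = x$chrom[1],
               start = as.numeric(tapply(x$start, grp, min)),
               end   = as.numeric(tapply(x$end, grp, max)))
  })
  merged <- do.call(rbind, pieces)
  rownames(merged) <- NULL
  merged
}

# Made-up example: the first two intervals on chromosome 1 overlap
bed <- data.frame(chrom = c("1", "1", "1", "2"),
                  start = c(100, 150, 300, 10),
                  end   = c(200, 250, 400, 20))
merged <- merge_overlapping(bed)
print(merged)
```

Merging changes the interval set, so the same merged file would have to be used for every sample that goes into the normal database.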

Thank you very much,

Andrew

Improve predictSomatic VCF annotation

  • All sample-specific annotation should go into FORMAT, not INFO.

  • Add all columns in the data.frame version to the VCF

  • Provide option to generate minimal VCF annotation that only provides the most likely state plus posterior probability instead of all states

error using Mutect2 vcf

Hi,
I am trying to use PureCN with a vcf generated by Mutect2, but I get an error as follows when I run filterVcfMuTect2(vcf):
Error in if (fractionContaminated > 0) { : missing value where TRUE/FALSE needed

The vcf was obtained in tumour-only mode with the following command for GATK4 (as recommended here, but without the --panel-of-normals option):

  gatk Mutect2 \
   -R ref_fasta.fa \
   -I tumor.bam \
   -tumor tumor_sample_name \
   --germline-resource af-only-gnomad.vcf.gz \
   -O tumor_unmatched_m2_snvs_indels.vcf.gz

It turns out PureCN is looking for a geno(vcf) field named FA, but this field doesn't exist in my vcf. Using a GATK3 MuTect2 vcf gives the same error. I feel like I may be missing something, so apologies for posting this as an issue. Any help would be appreciated!

Instead of removing bad variants, flag and ignore them

In order to use PureCN's variant annotation in production settings, it would be better to flag filtered variants, not completely remove them. Low priority for now, since users can easily re-annotate the input VCF.

Pass explicit tab separator when reading CNVkit files

Some CNVkit files have missing data in the RepeatMasker column (in particular cnn files) when the data is not available for that position.
PureCN uses read.table to load them, but with the default separator (sep=""), which treats any run of whitespace as a single separator. When the RepeatMasker value is missing, the column is empty and the line contains \t\t. read.table collapses this into one separator, so parsing fails (the header has 9 fields, but the collapsed field leaves only 8 on that line).
Setting an explicit tab separator (which is what CNVkit outputs to, anyway) should fix this behavior.
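
The behavior can be reproduced with base R alone. The sketch below uses a hypothetical 6-column excerpt (real CNVkit .cnn files have more columns, and the column names here are assumptions); only the read.table arguments matter.

```r
# Hypothetical 6-column excerpt; row 2 has an empty rmask field (\t\t)
cnn_text <- paste(
  "chromosome\tstart\tend\tgene\trmask\tdepth",
  "1\t100\t200\tA\t0.1\t1.5",
  "1\t300\t400\tB\t\t2.0",
  sep = "\n"
)

# Default sep = "": runs of whitespace collapse, so row 2 appears to have 5 fields
default_result <- tryCatch(
  read.table(text = cnn_text, header = TRUE),
  error = function(e) e
)
print(inherits(default_result, "error"))

# Explicit tab separator: the empty field is kept and read as NA
tab_result <- read.table(text = cnn_text, header = TRUE, sep = "\t")
print(tab_result)
```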

Dx.R message "Loading required package: deconstructSigs"

When I ran Dx.R without the --signature argument, it nevertheless produced the message:

"Loading required package: deconstructSigs"

(which worked because I do happen to have that package installed, but it should skip the load if not being used).

Crash in PureCN.R when interval.file contains chrM

Error in $<-.data.frame(*tmp*, "end", value = 27325) :
replacement has 1 row, data has 0
Calls: plotAbs -> .plotTypeBAF -> $<- -> $<-.data.frame

Workaround until fix: Don't provide chrM in baits BED files.

BAF segmentation

Hello,

I am observing erroneous segmentation on the BAF plots. Some of the segments also don't seem to align well with the average BAF values. Is BAF segmentation performed independently from the log2 coverage ratios?

Please see the attached image as an example case. In this image you can also see 2 cnLOH events were missed by the segmentation.

Best,
Bekir
[Attached image: purecn_baf_segmentation]

Priors for private germline rate in callMutationBurden

We currently use flat priors for private variants. This can result in artificially high private germline rates in hyper-mutated samples or low rates in silent genomes, especially in high purity samples. (Thus resulting in under- and overestimating somatic rates, respectively.)

  • Provide function to estimate rate based on callMutationBurden output from many samples

  • Adjust probabilities, probably only in callMutationBurden, not in fitting.

Allow unmatched cols in .annotatePosteriorsVcf

.annotatePosteriorsVcf fails when gene.symbol is missing because match returns NA. Allow omission?

idxColsMisc <- match(idxColsMisc, colnames(pp))
idxColsMisc <- na.omit(idxColsMisc)

Happy to explain further if unclear

Love your work, Markus, wonderful job with this package.

Overlapping reads

Thank you for this very well designed and tested software, it is greatly appreciated.

I have a question about PureCN's function for calculating coverage in a bam file - does this function count the overlapping ends of a paired read twice? In FFPE samples, the fragments are often short so the ends of the read overlap.
Reading the source code I can't tell if the coverage function takes this into account or not.
GATK3 DepthOfCoverage apparently does not count overlapping reads twice.

Calculate Chromosomal Instability

  • Literature review about common metrics (fraction altered vs metrics that use both focal and broad)

  • Implement callCIN function

  • Add to Dx.R script

PureCN.R --rds option no change in output files

After running PureCN.R once on a sample, I edited the .csv file that holds the purity/ploidy and Curated flag, and changed the purity and ploidy to a different PureCN solution than the maximum-likelihood solution, and set the Curated flag TRUE. I used readCurationFile() to read the .rds file so that it would reorder the solutions (but I did not save the new rds object, maybe that's the problem). Then I moved all PureCN output files except the .rds and curation .csv files to another location and reran PureCN.R, this time specifying the --rds option. PureCN.R regenerated all the output files. However, when I compared them to the original files, they were identical (except the curation .csv file was left untouched and still had the curated values in it).

Is this expected??? I expected that some values in some of the output .csv files would be a function of the selected solution's purity and ploidy, and that some plots in the .pdf files would be based on the selected solution's purity and ploidy. For example, the _segmentation.pdf file I expected to be different, but it wasn't.

New mapping bias database

Parsing the normal panel VCF is slow and unnecessary. A BED file with mapping bias and PON counts would be sufficient. Probably no need for HDF5.

  • Add function that converts the normal panel VCF into BED/tabix format

  • Add feature to NormalDB.R script that calls this function

  • Support this BED/tabix file format in setMappingBiasVcf

Error in .Call2("solve_user_SEW0", start, end, width, PACKAGE = "IRanges")

Hello Markus,

Here is another reproducible error using input files processed identically to other samples that completed successfully, with no apparent qualitative difference in input files. All files were created using the indicated PureCN version in the log.

Error in .Call2("solve_user_SEW0", start, end,
width, PACKAGE = "IRanges") : 
solving row 9: negative widths are not allowed

Better flagging of samples

We currently flag so many samples that this feature becomes rather useless. Need to become better at figuring out if a PureCN run was successful.

  • Document flags in main vignette
  • Incorporate noise and fit in a smarter way
  • Lower purity cutoff when sample quality is high and genome duplication or significant heterogeneity is unlikely

Eliminate unlikely solutions early

PureCN still spends a lot of time optimizing unlikely solutions. Ironically, this mostly affects clearly diploid solutions with few somatic events.

  • Test for balanced allelic fractions and set max.ploidy to 3 when highly balanced

  • Look at many datasets to find a good cutoff for "highly balanced", probably around 80%.

Suggest allowing multiple samples in VCF file.

Suggestion: add arg to specify normal sample ID also. Then, allow the VCF file to contain more than two samples. In my case with multiregional sequencing, the easiest way for me to make an appropriate VCF for PureCN is to make one that has all samples of a person, one normal and multiple tumor samples. I would run PureCN once for each tumor sample, specifying the same VCF each time, the same normal sample ID, and a different tumor sample ID.

Whitelist for homozygous deletions

The current cutoff for homozygous deletions is large because we rarely see such huge deletions at MTAP/CDKN2A. Might be better to decrease the default and add a whitelist of known deletions.

  • Query TCGA for most common deletions.

  • Provide whitelist for hg19/hg38

  • Use size distribution of whitelist for finding good size cutoffs, understand how SNP6 cutoffs translate to small targets with 150-200kb off-target bins.

  • Do not count known deletions toward the max.homozygous.loss cutoffs.

Minor documentation issue

In Quick vignette Dx.R description, add ".R" to:

Rscript $PureCN/FilterCallableLoci --genome hg19

Crash with indels from Mutect2 VCFs

Error in data.frame(seqnames = as.factor(seqnames(x)), start = start(x), :
duplicate row.names: rs147304884, rs774418471, rs72106118
Calls: runAbsoluteCN ... eval -> as.data.frame -> as.data.frame -> data.frame

Happens when indels overlap with segment breakpoints

Error when fitting variants (!sum(mapping.bias.ok))

See #29 for hardware and session info.

Note that this file is aggressively downsampled for testing purposes, which may be contributing to the error. More than 50 other identically pipelined and processed samples completed successfully. This is one of three files that generated errors, all with the same message as indicated below.

Thank you very much for your help, as always.

ERROR

Error in if (!sum(mapping.bias.ok)) { :
  missing value where TRUE/FALSE needed
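
This message is what R raises whenever an if() condition evaluates to NA. A minimal sketch of one way that can happen, assuming (not confirmed from the PureCN source) that a logical vector such as mapping.bias.ok contains NA; the values below are made up:

```r
mapping_bias_ok <- c(TRUE, NA, FALSE)  # made-up values, not from the failing sample

# sum() propagates NA, so the condition is NA and if() stops with an error
res <- tryCatch(
  if (!sum(mapping_bias_ok)) "none ok" else "some ok",
  error = function(e) conditionMessage(e)
)
print(res)

# Dropping NAs gives a well-defined condition
safe <- if (!sum(mapping_bias_ok, na.rm = TRUE)) "none ok" else "some ok"
print(safe)
```

Whether ignoring the NA entries is appropriate here depends on why they arise, which the log below does not show.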

LOGGING

WARN [2018-05-08 16:16:44] gc.gene.file was renamed to interval.file.
INFO [2018-05-08 16:16:44] ------------------------------------------------------------
INFO [2018-05-08 16:16:44] PureCN 1.11.1
INFO [2018-05-08 16:16:44] ------------------------------------------------------------
INFO [2018-05-08 16:16:53] Arguments: -tumor.coverage.file Homo_sapiens_assembly38_FFPEv2_Hg38_10c3a27f-484a-4367-b288-2ae95ad75143_TN_M_R_xE_T_dna_rg_br_loess.txt -vcf.file 12-211_T.sorted_sortn_dna_md_rg_br.vcf -genome hg38 -args.segmentation Homo_sapiens_assembly38_FFPEv2_Hg38_tw.txt -max.candidate.solutions 50 -max.non.clonal 0.1 -max.pon 5 -iterations 60 -gc.gene.file Homo_sapiens_assembly38_FFPEv2_Hg38_gc_file.txt -plot.cnv FALSE -cosmic.vcf.file /mnt/ephm/refs//CosmicCodingMuts.vcf.gz -post.optimize TRUE -speedup.heuristics 1 -log.file H_38_FFPE2_H38_2635-69-4552-98-1040040_TN_M_R_E_T_mc-15.ip-50.mbq-25.tp-015.tc-7.mp-6.mnc-01.pcn.log -verbose TRUE -normal.coverage.file <data> -normalDB <data> -args.filterVcf <data> -args.setMappingBiasVcf <data> -fun.segmentation <data> -max.ploidy <data> -test.num.copy <data> -test.purity <data> -min.coverage <data>
INFO [2018-05-08 16:16:53] Loading coverage files...
INFO [2018-05-08 16:16:53] Mean target coverages: 174X (tumor) 172X (normal).
WARN [2018-05-08 16:16:53] Allosome coverage missing, cannot determine sex.
WARN [2018-05-08 16:16:53] Allosome coverage missing, cannot determine sex.
INFO [2018-05-08 16:16:54] Removing 13 intervals with missing log.ratio.
INFO [2018-05-08 16:16:54] Removing 16 targets excluded in normalDB.
INFO [2018-05-08 16:16:54] normalDB provided. Setting minimum coverage for segmentation to 0.0015X.
INFO [2018-05-08 16:16:54] Removing 232 low coverage (< 0.0015X) targets.
INFO [2018-05-08 16:16:54] Using 2574 intervals (2574 on-target, 0 off-target).
INFO [2018-05-08 16:16:54] No off-target intervals. If this is hybrid-capture data, consider adding them.
INFO [2018-05-08 16:16:54] AT/GC dropout: 1.01 (tumor), 1.01 (normal).
INFO [2018-05-08 16:16:54] Loading VCF...
INFO [2018-05-08 16:17:51] Found 3140875 variants in VCF file.
INFO [2018-05-08 16:17:58] 740586 (23.6%) variants annotated as likely germline (DB INFO flag).
INFO [2018-05-08 16:18:07] 36747998-e48f-436f-a372-827f882cf91b_TN_M_R_xE_T_dna is tumor in VCF file.
INFO [2018-05-08 16:18:19] 928 homozygous and 10440 heterozygous variants on chrX.
INFO [2018-05-08 16:18:19] Sex from VCF: F (Fisher's p-value: < 0.0001, odds-ratio: 1.93).
WARN [2018-05-08 16:18:19] Duplicated arguments in filterVcf
INFO [2018-05-08 16:18:31] Removing 2966776 non heterozygous (in matched normal) germline SNPs.
INFO [2018-05-08 16:18:45] Initial testing for significant sample cross-contamination: maybe
INFO [2018-05-08 16:18:46] Removing 109928 variants with AF < 0.030 or AF >= 1.000 or less than 3 supporting reads or depth < 15.
INFO [2018-05-08 16:18:46] Removing 33977 low quality variants with BQ < 25.
INFO [2018-05-08 16:18:46] Total size of targeted genomic region: 0.45Mb (0.69Mb with 50bp padding).
INFO [2018-05-08 16:18:47] 8.3% of targets contain variants.
INFO [2018-05-08 16:18:47] Removing 29971 variants outside intervals.
INFO [2018-05-08 16:18:47] Reading COSMIC VCF...
INFO [2018-05-08 16:18:49] Found SOMATIC annotation in VCF.
INFO [2018-05-08 16:18:49] Setting somatic prior probabilities for somatic variants to 0.999000 or to 0.000100 otherwise.
INFO [2018-05-08 16:18:49] Found SOMATIC annotation in VCF. Setting mapping bias to 1.021.
INFO [2018-05-08 16:18:49] Scanning artfdet_comb5.vcf.gz...
INFO [2018-05-08 16:18:52] Imputing mapping bias for 97 variants...
INFO [2018-05-08 16:18:55] Sample sex: ?
INFO [2018-05-08 16:18:55] Segmenting data...
INFO [2018-05-08 16:18:55] Target weights found, will use weighted CBS.
INFO [2018-05-08 16:18:55] Loading pre-computed boundaries for DNAcopy...
INFO [2018-05-08 16:18:55] Setting undo.SD parameter to 0.500000.
INFO [2018-05-08 16:18:55] Setting prune.hclust.h parameter to 0.100000.
INFO [2018-05-08 16:18:55] Found 25 segments with median size of 68.28Mb.
INFO [2018-05-08 16:18:55] Using 223 variants.
INFO [2018-05-08 16:18:55] Mean standard deviation of log-ratios: 0.17
INFO [2018-05-08 16:18:55] 2D-grid search of purity and ploidy...
INFO [2018-05-08 16:19:00] Local optima: 0.17/2.8, 0.15/2.4, 0.17/2, 0.29/3.2, 0.19/1.8, 0.48/4.4, 0.45/3.8, 0.31/2.6, 0.33/3, 0.68/5.4, 0.55/4.2, 0.82/6, 0.38/2.4, 0.42/2.8, 0.48/2, 0.78/2.8,
0.58/2.6, 0.52/3, 0.42/1.6, 0.62/1.4, 0.82/1.2, 0.95/3, 0.95/1
INFO [2018-05-08 16:19:00] Testing local optimum 1/23 at purity 0.17 and total ploidy 2.80...
INFO [2018-05-08 16:19:00] Recalibrating log-ratios...
INFO [2018-05-08 16:19:00] Testing local optimum 1/23 at purity 0.17 and total ploidy 2.80...
INFO [2018-05-08 16:19:01] Recalibrating log-ratios...
INFO [2018-05-08 16:19:01] Testing local optimum 1/23 at purity 0.17 and total ploidy 2.80...
INFO [2018-05-08 16:19:01] Recalibrating log-ratios...
INFO [2018-05-08 16:19:01] Testing local optimum 1/23 at purity 0.17 and total ploidy 2.80...
INFO [2018-05-08 16:19:01] Testing local optimum 2/23 at purity 0.15 and total ploidy 2.40...
INFO [2018-05-08 16:19:01] Fitting variants for purity 0.15, tumor ploidy 4.62 and contamination 0.01.

Copy Ratio adjustments for purity/ploidy are incorrect

In some places in PureCN there are adjustments of copy ratio to account for purity and ploidy, or vice versa, to take an adjusted value and return it to an unadjusted "observed" ratio value. Markus Riester gave the following reference for the method used to do this:

https://www.nature.com/articles/ng.2760 (section Impurity-corrected GISTIC)

The equations in that section contain an algebra mistake. I'm concerned that PureCN might incorporate that mistake in one or more places. I have verified that it uses the correct equation in at least two places.

Details:

The above reference shows this derivation:

  R(x) = (aq(x)+2(1-a))/D
  D = aT + 2(1-a)
  q(x) = DR(x)/a - 2(1-a)/a
  R'(x) = q(x)/T = R(x)/a - 2(1-a)/aT

where:

  R(x) = raw (observed) coverage ratio (PureCN's seg.mean, gene.mean, etc.)
  R'(x) = adjusted coverage ratio (in tumor cells)
  q(x) = integer copy number in cancer cells
  D = average ploidy across all cells of tumor (of sample)
  a = sample purity
  T = tumor ploidy

However, in the last step where q(x) is substituted in q(x)/T, the algebra is wrong. The correct algebra is:

R'(x) = q(x)/T = DR(x)/aT - 2(1-a)/aT = (aT + 2(1-a))R(x)/aT - 2(1-a)/aT
      = R(x) + 2(1-a)R(x)/aT - 2(1-a)/aT
      = [aTR(x) + 2(1-a)R(x) - 2(1-a)]/aT

As a test, say that purity a = 0.5, tumor ploidy T = 2, and raw coverage ratio R(x) = 1.5. Then we expect the adjusted coverage ratio to be 2 (the tumor segment is amplified 2X (4 copies), and this becomes a raw ratio of 1.5 when purity is 1/2: [0.5*4 + 0.5*2] / 2 = 1.5):

Paper:     R'(x) = 1.5/0.5 - 2(0.5)/(0.5 * 2) = 3 - 2(0.5) = 2 (correct)
Corrected: R'(x) = [0.5*2*1.5 + 2(0.5)1.5 - 2(0.5)] / (0.5*2) = 1.5 + 1.5 - 1 = 2 (correct)

But now suppose that tumor ploidy=T=4, and we still have purity=a=0.5. Say raw coverage ratio=1.0, meaning there is no tumor amplification and the number of copies at any locus is the same as the mean number of copies, in both the 2X normal and 4X tumor tissue. We expect the adjusted coverage ratio to also be 1:

Paper:     R'(x) = 1/0.5 - 2(0.5)/(0.5 * 4) = 2 - 2(0.5)/2 = 2 - 1/2 = 1.5 (wrong)
Corrected: R'(x) = [0.5*4*1 + 2(0.5)1 - 2(0.5)] / (0.5 * 4) = [2 + 1 - 1] / 2 = 2 / 2 = 1 (correct)
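
The two test cases can be checked numerically. This is a sketch, not PureCN code: both function names are hypothetical, and the argument names a (purity), ploidy (T) and R (raw ratio) follow the definitions above.

```r
# R'(x) as printed in the paper: R(x)/a - 2(1-a)/(aT)
adjust_ratio_paper <- function(R, a, ploidy) {
  R / a - 2 * (1 - a) / (a * ploidy)
}

# R'(x) = q(x)/T with q(x) = D R(x)/a - 2(1-a)/a and D = aT + 2(1-a)
adjust_ratio_corrected <- function(R, a, ploidy) {
  D <- a * ploidy + 2 * (1 - a)
  q <- (D * R - 2 * (1 - a)) / a
  q / ploidy
}

# Case 1: a = 0.5, T = 2, R = 1.5 (both versions agree)
print(adjust_ratio_paper(1.5, 0.5, 2))
print(adjust_ratio_corrected(1.5, 0.5, 2))

# Case 2: a = 0.5, T = 4, R = 1.0 (the versions disagree)
print(adjust_ratio_paper(1.0, 0.5, 4))
print(adjust_ratio_corrected(1.0, 0.5, 4))
```

The two versions coincide when T = 2 because then D = 2a + 2(1-a) = 2 = T for any purity, which is why the first test case cannot tell them apart.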

Markus Riester indicated PureCN uses the following:

    rds <- readRDS("Sampleid.rds")
    r <- rds$results[[1]]
    r$seg$seg.mean.adjusted <- r$seg$seg.mean/r$purity - 2*(1-r$purity)/(r$purity*r$ploidy)

The above equation matches the incorrect one in the paper, so PureCN must be wrong wherever the above appears in the PureCN code, although I haven't found where that is (but I didn't search exhaustively).

The PureCN function .calcExpectedRatio() does the inverse operation, computing R(x) from R'(x), and it is using the correct equation.

PureCN function runAbsoluteCN() has this line:

    opt.C <- (2^(seg$seg.mean + log.ratio.offset) *  total.ploidy)/p - ((2 * (1 - p))/p)

and since tumor copy number C = tumor coverage ratio * tumor ploidy, this equation at first appears to be the paper's (incorrect) R'(x) * ploidy. However, "total.ploidy" is probably the ploidy in the sample as a whole, which corresponds to the paper's "D", and in that case, the above equation is correct. The algebra error only creeps in in the last derivation, R'(x) = ...
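
A quick numeric check of that reading, as a sketch: it assumes total.ploidy is the sample-wide ploidy D = pT + 2(1-p), which is a guess from the variable name and not verified against the PureCN source. All variable names and values below are made up for illustration.

```r
p <- 0.5       # purity
T_tumor <- 4   # tumor ploidy
q <- 4         # true tumor copy number of the segment

D <- p * T_tumor + 2 * (1 - p)   # sample-wide ploidy
R <- (p * q + 2 * (1 - p)) / D   # observed coverage ratio for this segment
seg_mean <- log2(R)
log_ratio_offset <- 0

# Same expression as the opt.C line above, with the ploidy passed in explicitly
opt_C <- function(total_ploidy) {
  (2^(seg_mean + log_ratio_offset) * total_ploidy) / p - (2 * (1 - p)) / p
}

print(opt_C(D))        # recovers q when total.ploidy is D
print(opt_C(T_tumor))  # does not recover q when the tumor ploidy T is used
```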

In summary, my concerns are (1) the above equation for seg.mean.adjusted (I don't know where in the code that is); (2) Could there be other places in PureCN where the algebra bug crept in?

Full support for GATK4

  • Parsing coverage

  • Parsing log-ratios

  • Mutect2 tumor-only

  • Mutect2 tumor/normal

  • CallableBases equivalent

  • Parsing segmentations

  • PoN VCFs

LOH Database

  • Find a good test dataset we are allowed to use, maybe TCGA ABSOLUTE or ASCAT.

  • Test on "difficult" indications like lung

  • Add it as an lohDB runAbsoluteCN argument

PureCN.R Error

I'm trying to run PureCN.R after Coverage.R and I'm getting a cryptic error (see at the very bottom) that I can't debug. I notice there's a lot of warnings for low coverage but I'm not sure if this is what is causing the error. Any thoughts?

-todd

INFO [2018-06-29 11:17:15] Loading PureCN 1.11.10...
WARN [2018-06-29 11:17:34] Multiple chromosomes with very low coverage: 1,10,11,12,13,14,15,16,17,18,19,2,20,21,22,3,4,5,6,7,8,9,X,Y
INFO [2018-06-29 11:17:58] Mean coverages: chrX: 0.08, chrY: 0.05, chr1-22: 0.12.
INFO [2018-06-29 11:17:58] Sample sex: M
INFO [2018-06-29 11:18:06] ------------------------------------------------------------
INFO [2018-06-29 11:18:06] PureCN 1.11.10
INFO [2018-06-29 11:18:06] ------------------------------------------------------------
INFO [2018-06-29 11:18:06] Arguments: -tumor.coverage.file /ts18/ngs/studies/ngs_000261/bwa/G04971TB02_S24.sorted_pos.dedup.recal_coverage_loess.txt -seg.file -vcf.file /ts18/ngs/studies/ngs_000261/bwa/G04971TB02_S24.mutect2_sel.dbsnp.vcf.gz -genome hg19 -sex ? -args.setMappingBiasVcf NULL -args.segmentation NULL,/Biomarker/sih/Databases/PureCN_data/NormalDB/b37/target_weights_SS_v6_b37.txt,0.005,NULL -sampleid G04971TB02_S24 -min.ploidy 1 -max.ploidy 6 -max.non.clonal 0.2 -log.ratio.calibration 0.1 -model.homozygous FALSE -error 0.001 -interval.file /Biomarker/ngs/database/pureCN/b37/baits_b37_V6_intervals.txt -max.segments 300 -plot.cnv TRUE -DB.info.flag DB -model beta -post.optimize FALSE -BPPARAM -log.file /ts18/ngs/studies/ngs_000261/bwa/G04971TB02_S24.log -normal.coverage.file -normalDB -args.filterVcf -fun.segmentation -test.num.copy -test.purity -speedup.heuristics
INFO [2018-06-29 11:18:06] Loading coverage files...
WARN [2018-06-29 11:18:15] Multiple chromosomes with very low coverage: 1,10,11,12,13,14,15,16,17,18,19,2,20,21,22,3,4,5,6,7,8,9,X,Y
INFO [2018-06-29 11:18:20] Mean target coverages: 0X (tumor) 0X (normal).
WARN [2018-06-29 11:18:20] Large difference in coverage of tumor and normal.
INFO [2018-06-29 11:18:23] Mean coverages: chrX: 0.08, chrY: 0.05, chr1-22: 0.12.
INFO [2018-06-29 11:18:25] Mean coverages: chrX: 0.00, chrY: 0.03, chr1-22: 0.00.
INFO [2018-06-29 11:19:10] Removing 232918 intervals with missing log.ratio.
INFO [2018-06-29 11:19:29] Removing 492 intervals excluded in normalDB.
INFO [2018-06-29 11:19:29] Removing 54020 intervals with low total coverage in normal (< 150.00 reads).
INFO [2018-06-29 11:19:29] normalDB provided. Setting minimum coverage for segmentation to 0.0015X.
INFO [2018-06-29 11:19:29] Using 0 intervals (0 on-target, 0 off-target).
INFO [2018-06-29 11:19:29] No off-target intervals. If this is hybrid-capture data, consider adding them.
Error in sort(abs(diff(genomdat)))[1:n.keep] :
only 0's may be mixed with negative subscripts
Calls: runAbsoluteCN -> smooth.CNA -> trimmed.variance
Execution halted

Sex column contents better as FEMALE / MALE

The PureCN.R output file containing the purity and sex estimation uses "M" and "F" in the "Sex" column. R reads "F" as "FALSE" unless you specify the type of each column. I suggest you change the contents of the column to "MALE" and "FEMALE".
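
Until the output changes, the column type can be fixed on the reading side. A minimal sketch with a made-up CSV (the column names Sampleid, Purity and Sex are assumptions about the file layout):

```r
# Made-up CSV in which every sample is female
csv_text <- "Sampleid,Purity,Sex\nS1,0.6,F\nS2,0.4,F\n"

# Default type guessing: a column containing only "F" becomes logical FALSE
default_read <- read.csv(text = csv_text)
print(class(default_read$Sex))

# Declaring the column as character keeps the original values
fixed_read <- read.csv(text = csv_text, colClasses = c(Sex = "character"))
print(fixed_read$Sex)
```

A column that mixes "M" and "F" is read as character or factor anyway, so the problem shows up mainly in all-female cohorts or single-sample files.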
