
License: MIT License

hrd-score genomic-scar-scores wxs cancer hrd-loh telomeric-allelic-imbalances heterozygosity loh

scarhrd's Introduction

scarHRD R package Manual

Introduction

scarHRD is an R package which determines the levels of homologous recombination deficiency (telomeric allelic imbalance, loss of heterozygosity, number of large-scale transitions) based on NGS (WES, WGS) data.

The first genomic scar based homologous recombination deficiency measures were produced using SNP arrays. Since this technology has been largely replaced by next generation sequencing it has become important to develop algorithms that derive the same type of genomic scar-scores from next generation sequencing (WXS, WGS) data. In order to perform this analysis, here we introduce the scarHRD R package and show that using this method the SNP-array based and next generation sequencing based derivation of HRD scores show good correlation.

Contact

Zsofia Sztupinszki
Boston Children's Hospital
contact: [email protected]

Getting started

Minimum requirements

library(devtools)
install_bitbucket('sequenza_tools/sequenza')

Installation

scarHRD can be installed via devtools from github:

library(devtools)
install_github('sztup/scarHRD',build_vignettes = TRUE)

Running on GRCh38

A modification of the copynumber R package needs to be used which can be installed via devtools from github:

library(devtools)
install_github('aroneklund/copynumber')

Citation

Please cite the following paper: Sztupinszki et al, Migrating the SNP array-based homologous recombination deficiency measures to next generation sequencing data of breast cancer, npj Breast Cancer, https://www.nature.com/articles/s41523-018-0066-6.



Workflow overview

A typical workflow of determining the genomic scar scores for a tumor sample has the following steps:

  1. Call allele specific copy number profile on paired normal-tumor BAM files. This step has to be executed before running scarHRD. We recommend using Sequenza (Favero et al. 2015) http://www.cbs.dtu.dk/biotools/sequenza/ for copy number segmentation. Other tools (e.g. ASCAT (Van Loo et al. 2010)) may also be used in this step.
    This step is time-consuming and compute-intensive.
    Example for using Sequenza:

sequenza-utils bam2seqz -gc /reference/GRCh38.gc50Base.txt.gz --fasta /reference/GRCh38.d1.vd1.fa -n /data/normal.bam --tumor /data/tumor.bam -C chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19 chr20 chr21 chr22 chr23 chr24 chrX | sequenza-utils seqz_binning -w 50 -s - | gzip > /results/tumor_small.seqz.gz

    Further details can be found in the Vignette of Sequenza: https://cran.r-project.org/web/packages/sequenza/vignettes/sequenza.pdf

  2. Determine the scar scores with scarHRD R package.
    This step only takes a few minutes.

Input file examples

The scarHRD input may be a detailed segmentation file from Sequenza. If a reliable estimate of the ploidy of the tumor sample is known, it should be submitted in the ploidy argument of the scar_score function; otherwise ploidy values between 1 and 5.5 will be tested:

a<-read.table("/examples/test1.small.seqz.gz", header=T)
head(a)
##   chromosome position base.ref depth.normal depth.tumor depth.ratio    Af
## 1       chr1    12975        N            7          20       2.841 1.000
## 2       chr1    13020        A            8          28       3.500 0.964
## 3       chr1    13026        N           15          43       2.964 1.000
## 4       chr1    13038        T           11          35       3.182 0.971
## 5       chr1    13041        A           11          37       3.364 0.946
## 6       chr1    13077        N           26          65       2.465 1.000
##   Bf zygosity.normal GC.percent good.reads AB.normal AB.tumor tumor.strand
## 1  0             hom         60         51         N        .            0
## 2  0             hom         60         28         A   G0.036         G1.0
## 3  0             hom         59         51         N        .            0
## 4  0             hom         59         35         T   C0.029         C1.0
## 5  0             hom         59         37         A   G0.054         G0.5
## 6  0             hom         62         51         N        .            0

or a simplified file that includes the total and allele-specific copy numbers:

a<-read.table("/examples/test2.txt", header=T)
head(a)
##         SampleID Chromosome Start_position End_position total_cn A_cn B_cn
## 1 SamplePatient1       chr1          14574       952448        5    0    5
## 2 SamplePatient1       chr1         953394      1259701        3    0    3
## 3 SamplePatient1       chr1        1278085      4551743        2    0    2
## 4 SamplePatient1       chr1        4551885     14124232        2    0    2
## 5 SamplePatient1       chr1       14161231     31062374        3    1    2
## 6 SamplePatient1       chr1       31074785     47428120        4    2    2
##   ploidy
## 1    3.7
## 2    3.7
## 3    3.7
## 4    3.7
## 5    3.7
## 6    3.7
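
The simplified format can also be assembled by hand from the output of another copy-number caller. The following is a minimal base R sketch, assuming that the eight columns and their order shown in the example above are what scar_score expects; the sample values are made up for illustration and the file name is hypothetical.

```r
# Build a small allele-specific segmentation table with the same eight
# columns as the example above (all values are illustrative only).
# Coordinates are integers so they are not written in scientific notation.
seg <- data.frame(
  SampleID       = "SamplePatient1",
  Chromosome     = c("chr1", "chr1", "chr2"),
  Start_position = c(14574L, 953394L, 10000L),
  End_position   = c(952448L, 1259701L, 5000000L),
  total_cn       = c(5, 3, 2),
  A_cn           = c(0, 0, 1),
  B_cn           = c(5, 3, 1),
  ploidy         = 3.7,
  stringsAsFactors = FALSE
)

# In the example above total_cn equals A_cn + B_cn; check the same here
stopifnot(all(seg$total_cn == seg$A_cn + seg$B_cn))

# Write a tab-delimited file with a header line (hypothetical file name)
out_file <- file.path(tempdir(), "my_segments.txt")
write.table(seg, out_file, sep = "\t", quote = FALSE, row.names = FALSE)

# Read it back the same way as in the example above
a <- read.table(out_file, header = TRUE)
head(a)
```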

Usage example

library("scarHRD")
scar_score("F:/Documents/scarHRD/examples/test1.small.seqz.gz",reference = "grch38", seqz=TRUE)
## Preprocessing started...

## Processing chr1: 18 variant calls; 6290 heterozygous positions; 549112 homozygous positions.
## Processing chr2: 22 variant calls; 4934 heterozygous positions; 394216 homozygous positions.

##                                                                     
  |=================================================================| 100%
## Preprocessing finished 
## Determining HRD-LOH, LST, TAI

##      HRD Telomeric AI LST HRD-sum
## [1,]   1            2   0       3
scar_score("F:/Documents/scarHRD/examples/test2.txt",reference = "grch38", seqz=FALSE)
## Determining HRD-LOH, LST, TAI

##      HRD Telomeric AI LST HRD-sum
## [1,]  25           35  33      93
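
In both outputs above, the HRD-sum column equals the sum of the three individual scores. A quick check in R, using only the numbers printed above (the vector names are arbitrary):

```r
# Scores copied from the two example outputs above
res1 <- c(HRD = 1,  TAI = 2,  LST = 0)
res2 <- c(HRD = 25, TAI = 35, LST = 33)

sum(res1)  # 3, as in the HRD-sum column of the first output
sum(res2)  # 93, as in the HRD-sum column of the second output
```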

Parameters

seg -- input file name
reference -- the reference genome used: grch38, grch37 or mouse (default: grch38)
seqz -- TRUE if the input file is a small.seqz.gz file, otherwise FALSE (default: TRUE)
ploidy -- optional, previously estimated ploidy of the sample
outputdir -- optional, the path to the output directory
chr.in.names -- optional, default: TRUE, set to FALSE if input file does not contain 'chr' in chromosome names.
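
Whether chr.in.names should be TRUE or FALSE can be checked from the chromosome column before calling scar_score. A minimal sketch, where has_chr_prefix is a hypothetical helper and not part of scarHRD:

```r
# Hypothetical helper: TRUE if every chromosome name starts with "chr"
has_chr_prefix <- function(chrom) {
  all(grepl("^chr", as.character(chrom)))
}

has_chr_prefix(c("chr1", "chr2", "chrX"))  # TRUE
has_chr_prefix(c("1", "2", "X"))           # FALSE

# The result could then be passed on, for example (not run here):
# scar_score("my_segments.txt", reference = "grch38", seqz = FALSE,
#            chr.in.names = has_chr_prefix(seg$Chromosome))
```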



Genomic scar scores

Loss of Heterozygosity (HRD-LOH)

The HRD-LOH score was first described in an investigation of SNP-array-based copy number profiles of ovarian cancer (Abkevich et al. 2012). In this paper the authors showed that samples with deficient BRCA1 or BRCA2 have higher HRD-LOH scores than BRCA-intact samples, thus this measurement may be a reliable tool to estimate the sample's homologous recombination capacity.
A sample's HRD-LOH score is defined as the number of LOH regions longer than 15 Mb that do not cover the whole chromosome. In the first paper publishing the HRD-LOH score (Abkevich et al. 2012), the authors examined the correlation between the HRD-LOH score and HR deficiency for different LOH region length cut-offs. The 15 Mb cut-off, approximately in the middle of the interval, was arbitrarily selected for further analysis. The authors argue that the rationale for this selection, rather than selecting the cut-off with the lowest p-value, is that the latter is more sensitive to statistical noise present in the data.
In our manuscript we also investigated whether this 15 Mb cutoff is appropriate for the WXS-based HRD-LOH score. We followed the same principles as Abkevich et al.: while there was a small difference between the p-values for the different minimum length cutoff values, we chose to use the same 15 Mb limit. We also calculated the Spearman rank correlation between the SNP-array-based and WXS-based HRD-LOH scores for the different minimum LOH length cutoffs (manuscript, Supplementary Figure S3C). Here the 14 Mb and 15 Mb cutoff-based WXS-HRD-LOH scores had the highest correlation with the SNP-array-based HRD-LOH score (0.700 and 0.695, respectively). This result supported our choice of using the 15 Mb cutoff, as in the SNP-array-based HRD-LOH score.

Figure 1.A: Visual representation of the HRD-LOH score on short theoretical chromosomes. Figure 1.B: Calculating HRD-LOH from a biallelic copy-number profile; LOH regions a and c would both increase the score by 1, while neither b nor d would add to its value (b does not pass the length requirement, and d covers a whole chromosome).
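
The definition of the HRD-LOH score can be illustrated with a small base R sketch. This is not the scarHRD implementation; count_hrd_loh is a hypothetical function, and the rules it uses to call a segment LOH, to merge neighbouring LOH segments and to detect whole-chromosome LOH are simplifying assumptions stated in the comments.

```r
# Sketch: count LOH regions longer than min_len that do not cover a whole
# chromosome. Assumptions (not taken from the scarHRD source):
#  - a segment is LOH when one allele has 0 copies and the other has > 0
#  - consecutive LOH segments on a chromosome are merged into one region
#  - a chromosome whose segments are all LOH counts as whole-chromosome LOH
count_hrd_loh <- function(seg, min_len = 15e6) {
  score <- 0
  for (chr in unique(as.character(seg$Chromosome))) {
    s <- seg[seg$Chromosome == chr, ]
    s <- s[order(s$Start_position), ]
    loh <- pmin(s$A_cn, s$B_cn) == 0 & pmax(s$A_cn, s$B_cn) > 0
    if (all(loh)) next                     # whole chromosome: not counted
    runs   <- rle(loh)                     # runs of consecutive LOH / non-LOH
    ends   <- cumsum(runs$lengths)
    starts <- ends - runs$lengths + 1
    for (i in which(runs$values)) {
      region_len <- s$End_position[ends[i]] - s$Start_position[starts[i]]
      if (region_len > min_len) score <- score + 1
    }
  }
  score
}

# Toy profile: chr1 has a 20 Mb LOH region (counted) and a 10 Mb LOH
# region (too short); chr2 is LOH along its whole length (not counted)
seg <- data.frame(
  Chromosome     = c("chr1", "chr1", "chr1", "chr2"),
  Start_position = c(0, 20e6, 50e6, 0),
  End_position   = c(20e6, 50e6, 60e6, 100e6),
  A_cn           = c(0, 1, 0, 0),
  B_cn           = c(2, 1, 1, 2),
  stringsAsFactors = FALSE
)
count_hrd_loh(seg)  # 1
```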

Large Scale Transitions (LST)

The presence of Large Scale Transitions in connection with homologous recombination deficiency was first studied in basal-like breast cancer (Popova et al. 2012). Based on SNP-array derived copy number profiles, BRCA1-inactivated cases showed a higher number of large scale transitions.
A large scale transition is defined as a chromosomal break between adjacent regions of at least 10 Mb, with a distance between them not larger than 3 Mb.

Figure 2.A: Visual representation of the LST score on short theoretical chromosomes. Figure 2.B: Calculating LST scores from a biallelic copy-number profile; events that are marked with green check marks would increase the score, while events marked with red crosses would not. The grey areas represent the centromeric regions. (From left to right; Chromosome 1: the first event passes the definition of an LST, the second is bounded on the right by a segment shorter than 10 Mb, the third is bounded on the left by a segment which extends to the centromere, and the gap of the fourth is greater than 3 Mb. Chromosome 2: the first event is a valid LST, the second and third are not because they are bounded by centromeric segments, and the fourth is a valid LST.)
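
The LST definition above can be sketched in base R as follows. This is a simplified illustration and not the scarHRD implementation: count_lst is a hypothetical function, it treats each chromosome as a single unit, and it leaves out the centromere handling described in the figure caption.

```r
# Sketch: count breaks between adjacent segments that are both at least
# min_len long and separated by a gap of at most max_gap.
# Assumptions (not taken from the scarHRD source):
#  - a break is a change in A_cn or B_cn between neighbouring segments
#  - centromeric regions are ignored
count_lst <- function(seg, min_len = 10e6, max_gap = 3e6) {
  score <- 0
  for (chr in unique(as.character(seg$Chromosome))) {
    s <- seg[seg$Chromosome == chr, ]
    s <- s[order(s$Start_position), ]
    n <- nrow(s)
    if (n < 2) next
    len     <- s$End_position - s$Start_position
    gap     <- s$Start_position[-1] - s$End_position[-n]
    changed <- s$A_cn[-1] != s$A_cn[-n] | s$B_cn[-1] != s$B_cn[-n]
    is_lst  <- len[-n] >= min_len & len[-1] >= min_len &
               gap <= max_gap & changed
    score <- score + sum(is_lst)
  }
  score
}

# Toy profile: on chr1 only the first break qualifies (the others touch a
# 5 Mb segment); on chr2 the gap between the segments is 5 Mb
seg <- data.frame(
  Chromosome     = c("chr1", "chr1", "chr1", "chr1", "chr2", "chr2"),
  Start_position = c(0, 20.5e6, 40e6, 50e6, 0, 20e6),
  End_position   = c(20e6, 40e6, 45e6, 70e6, 15e6, 40e6),
  A_cn           = c(1, 1, 0, 1, 1, 1),
  B_cn           = c(1, 2, 2, 1, 1, 2),
  stringsAsFactors = FALSE
)
count_lst(seg)  # 1
```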

Number of Telomeric Allelic Imbalances

Allelic imbalance (AI) is the unequal contribution of parental allele sequences, with or without changes in the overall copy number of the region. Our group has previously found that the number of telomeric AIs is indicative of defective DNA repair in ovarian cancer and triple-negative breast cancer, and that a higher number of telomeric AIs is associated with better response to cisplatin treatment (Birkbak et al. 2012).
The number of telomeric allelic imbalances is the number of AIs that extend to the telomeric end of a chromosome.

Figure 3.A: Visual representation of the ntAI on short theoretical chromosomes. Figure 3.B: Illustration of possible telomeric allelic imbalances in an allele specific copy number profile.
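
As a rough illustration of the definition, the sketch below counts imbalanced segments at the ends of each chromosome in base R. It is not the scarHRD implementation: count_tai is a hypothetical function, and both the test for imbalance and the way telomeric segments are identified are simplifying assumptions.

```r
# Sketch: count allelic imbalances that reach a chromosome end.
# Assumptions (not taken from the scarHRD source):
#  - a segment is imbalanced when A_cn differs from B_cn
#  - the first and last segment of a chromosome reach the telomeres
#  - a chromosome covered by a single segment is skipped
#  - centromere positions are ignored
count_tai <- function(seg) {
  score <- 0
  for (chr in unique(as.character(seg$Chromosome))) {
    s <- seg[seg$Chromosome == chr, ]
    s <- s[order(s$Start_position), ]
    n <- nrow(s)
    if (n < 2) next
    imbalanced <- s$A_cn != s$B_cn
    score <- score + sum(imbalanced[c(1, n)])
  }
  score
}

# Toy profile: chr1 starts with an imbalanced segment, chr2 is a single
# segment (skipped), chr3 ends with an imbalanced segment
seg <- data.frame(
  Chromosome     = c("chr1", "chr1", "chr1", "chr2", "chr3", "chr3"),
  Start_position = c(0, 30e6, 60e6, 0, 0, 50e6),
  End_position   = c(30e6, 60e6, 90e6, 80e6, 50e6, 70e6),
  A_cn           = c(0, 1, 1, 0, 1, 1),
  B_cn           = c(2, 1, 1, 2, 1, 2),
  stringsAsFactors = FALSE
)
count_tai(seg)  # 2
```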

References

Abkevich, V., K. M. Timms, B. T. Hennessy, J. Potter, M. S. Carey, L. A. Meyer, K. Smith-McCune, et al. 2012. “Patterns of genomic loss of heterozygosity predict homologous recombination repair defects in epithelial ovarian cancer.” Br. J. Cancer 107 (10): 1776–82.

Birkbak, N. J., Z. C. Wang, J. Y. Kim, A. C. Eklund, Q. Li, R. Tian, C. Bowman-Colin, et al. 2012. “Telomeric allelic imbalance indicates defective DNA repair and sensitivity to DNA-damaging agents.” Cancer Discov 2 (4): 366–75.

Favero, F., T. Joshi, A. M. Marquard, N. J. Birkbak, M. Krzystanek, Q. Li, Z. Szallasi, and A. C. Eklund. 2015. “Sequenza: allele-specific copy number and mutation profiles from tumor sequencing data.” Ann. Oncol. 26 (1): 64–70.

Popova, T., E. Manie, G. Rieunier, V. Caux-Moncoutier, C. Tirapo, T. Dubois, O. Delattre, et al. 2012. “Ploidy and large-scale genomic instability consistently identify basal-like breast carcinomas with BRCA1/2 inactivation.” Cancer Res. 72 (21): 5454–62.

Van Loo, P., S. H. Nordgard, O. C. Lingjærde, H. G. Russnes, I. H. Rye, W. Sun, V. J. Weigman, et al. 2010. “Allele-specific copy number analysis of tumors.” Proc. Natl. Acad. Sci. U.S.A. 107 (39): 16910–5.


scarhrd's Issues

Writing output issues

Hello, I believe there might be minor tweaking, but the error I am getting is

scarHRD::scar_score("./seqz_inscarhrd.tsv", chr.in.names=FALSE,
                    seqz = FALSE,
                    reference = "grch37")

Determining HRD-LOH, LST, TAI 
Error in write.table(t(HRDresulst), paste0(outputdir, "/", run_name, "_HRDresults.txt"),  : 
  invalid 'row.names' specification

Could you please send me suggestions to resolve this?

Thanks

No license

There is no license in either the repository or the source files. This makes the software non-distributable at the very best.

ERROR: first argument must be a vector

Hi,

After loading the scarHRD package, I run the command
scar_score("small.seqz.gz",reference = "grch37", seqz=TRUE)
but it returns
Error in split.default(lis_obj, names(lis_obj)) : first argument must be a vector

seqz file:

head(a)
chromosome position base.ref depth.normal depth.tumor depth.ratio Af Bf
1 1 131036 N 2 27 19.167 1.000 0
2 1 131087 N 2 42 21.743 1.000 0
3 1 131138 N 3 64 22.624 1.000 0
4 1 131174 A 2 57 28.500 0.982 0
5 1 131189 N 2 66 36.526 1.000 0
6 1 131240 N 2 51 32.949 1.000 0
zygosity.normal GC.percent good.reads AB.normal AB.tumor tumor.strand
1 hom 70 51 N . 0
2 hom 73 50 N . 0
3 hom 62 51 N . 0
4 hom 62 57 A C0.018 C1.0
5 hom 63 51 N . 0
6 hom 58 49 N . 0

The difference between the input file I provided and the test file in repo is the 'chr' label in chrom col.

Any suggestions?

Thanks

Error in Vroom connection size !

Hi all, I ran scarHRD normally for my seqz files and I've been getting this error. I updated my R and RStudio. I also increased my Vroom connection size with Sys.setenv(VROOM_CONNECTION_SIZE = 131072*3), but I'm still getting this error.
The error is as follows:

Preprocessing started...
Collecting GC information ............... done

Processing chr1:
7 variant calls.
3 copy-number segments.
397 heterozygous positions.
793879 homozygous positions.
Processing chr2:
Error: The size of the connection buffer (393216) was not large enough
to fit a complete line:

  • Increase it by setting Sys.setenv("VROOM_CONNECTION_SIZE")
    In addition: There were 50 or more warnings (use warnings () to see the first 50)

Also, if I increase the size (VROOM_CONNECTION_SIZE = 131072*10000), I get this error:

Processing chr1:
Error: cannot create std:vector larger than max_size()

Could anyone kindly help me with this?

questions about the model

I am a user of sequenza. My name is Jay. I read your description of sequenza and I have some questions as follows.
1. Why did you choose a t-distribution to fit the segment data?
2. Why did you set the degrees of freedom to 5?
Since I am a newcomer to statistics, I hope you can help me with these questions.

ERROR of scar_score

I'm using the tool you developed (scarHRD), but I ran into an error that I can't resolve.
Other samples run without issues; only one sample fails with the error message shown below.
I would really appreciate it if you could tell me how to solve it.
For reference, this sample has more mutations than the other samples.
Best Regards.

score <- scar_score (seq.file, 
                     reference = "grch37",
                     chr.in.name = FALSE,
                     seqz=TRUE)

Error Message

Preprocessing started...
Collecting GC information .............................................................................................................................. done

Processing 1:
   54296 variant calls.
   865 copy-number segments.
   170822 heterozygous positions.
   6090996 homozygous positions.
Processing 2:
   117284 variant calls.
   1629 copy-number segments.
   174767 heterozygous positions.
   7265508 homozygous positions.
Processing 3:
   96222 variant calls.
   1248 copy-number segments.
   152559 heterozygous positions.
   5716194 homozygous positions.
Processing 4:
   106524 variant calls.
   1354 copy-number segments.
   161308 heterozygous positions.
   5742369 homozygous positions.
Processing 5:
   96567 variant calls.
   1086 copy-number segments.
   130998 heterozygous positions.
   5684501 homozygous positions.
Processing 6:
   66154 variant calls.
   1351 copy-number segments.
   146514 heterozygous positions.
   4446091 homozygous positions.
Processing 7:
   72779 variant calls.
   1124 copy-number segments.
   136465 heterozygous positions.
   4477575 homozygous positions.
Processing 8:
   69686 variant calls.
   1013 copy-number segments.
   116426 heterozygous positions.
   4138932 homozygous positions.
Processing 9:
   49857 variant calls.
   815 copy-number segments.
   106605 heterozygous positions.
   3265619 homozygous positions.
Processing 10:
   33108 variant calls.
   914 copy-number segments.
   117085 heterozygous positions.
   3314602 homozygous positions.
Processing 11:
   47075 variant calls.
   1030 copy-number segments.
   104890 heterozygous positions.
   3470945 homozygous positions.
Processing 12:
   49337 variant calls.
   923 copy-number segments.
   98882 heterozygous positions.
   3569681 homozygous positions.
Processing 13:
   53195 variant calls.
   631 copy-number segments.
   77144 heterozygous positions.
   2847099 homozygous positions.
Processing 14:
   5800 variant calls.
   340 copy-number segments.
   69161 heterozygous positions.
   2114784 homozygous positions.
Processing 15:
   39846 variant calls.
   557 copy-number segments.
   65863 heterozygous positions.
   2290936 homozygous positions.
Processing 16:
   30577 variant calls.
   551 copy-number segments.
   76606 heterozygous positions.
   2171227 homozygous positions.
Processing 17:
   17017 variant calls.
   494 copy-number segments.
   56532 heterozygous positions.
   2058491 homozygous positions.
Processing 18:
   46878 variant calls.
   691 copy-number segments.
   59575 heterozygous positions.
   2435681 homozygous positions.
Processing 19:
   9118 variant calls.
   350 copy-number segments.
   48231 heterozygous positions.
   1479047 homozygous positions.
Processing 20:
   23453 variant calls.
   469 copy-number segments.
   54373 heterozygous positions.
   1712294 homozygous positions.
Processing 21:
   13865 variant calls.
   374 copy-number segments.
   40845 heterozygous positions.
   948099 homozygous positions.
Processing 22:
   5169 variant calls.
   302 copy-number segments.
   30691 heterozygous positions.
   868607 homozygous positions.

  |                                                  | 0 % ~calculating  Error in if (!is.na(mat[x, ]$Bf) & !is.na(mat[x, ]$sd.Bf/sqrt(mat[x, ]$weight.Bf))) { : 
  argument is of length zero

what to do without sequenza

Hi,
I'm trying to install scarHRD, but since the sequenza package is no longer available, installing scarHRD fails with the error:
ERROR: dependency ‘sequenza’ is not available for package ‘scarHRD’
─ removing ‘/tmp/RtmpeWcghn/Rinst284cb4536be01/scarHRD’
-----------------------------------
ERROR: package installation failed
Error: Failed to install 'scarHRD' from GitHub:
! System command 'R' failed
What can I do? Moreover, I want to run scarHRD by using ASCAT output, not sequenza...
Thanks!

scarHRD input

The mutation file I have only contains: gene names as rows, sample names as columns, and numeric CN_values as data (I think it is a log2 copy ratio). Is it possible to transform this type of file to be used in scarHRD? Thanks a lot.

warning message

I run scarHRD using sequenza *.small.seq.gz as input file, and I got warning message as below.

What's wrong?

scar_score("/run/media/skanematsu/9ce21fae-49c9-4d74-a898-7ac57ac02b29/sequenza/results/WGS/WG0015__WG0016_WG0015__WG0016_tumor/WG0015__WG0016_WG0015__WG0016_tumor.out.small.seqz", seqz=TRUE, chr.in.names=FALSE)
Preprocessing started...
Collecting GC information ............................................................................................................................................................................................................ done

Processing 1:
544 variant calls.
374 copy-number segments.
159858 heterozygous positions.
9756090 homozygous positions.
Processing 2:
617 variant calls.
243 copy-number segments.
167178 heterozygous positions.
11252969 homozygous positions.
Processing 3:
339 variant calls.
152 copy-number segments.
151357 heterozygous positions.
8463792 homozygous positions.
Processing 4:
491 variant calls.
257 copy-number segments.
151892 heterozygous positions.
8191443 homozygous positions.
Processing 5:
444 variant calls.
175 copy-number segments.
129249 heterozygous positions.
8348594 homozygous positions.
Processing 6:
435 variant calls.
299 copy-number segments.
137295 heterozygous positions.
7330989 homozygous positions.
Processing 7:
400 variant calls.
282 copy-number segments.
123755 heterozygous positions.
6701081 homozygous positions.
Processing 8:
385 variant calls.
217 copy-number segments.
114530 heterozygous positions.
6202254 homozygous positions.
Processing 9:
296 variant calls.
184 copy-number segments.
92695 heterozygous positions.
4927459 homozygous positions.
Processing 10:
298 variant calls.
230 copy-number segments.
105259 heterozygous positions.
5652674 homozygous positions.
Processing 11:
316 variant calls.
203 copy-number segments.
99418 heterozygous positions.
5739393 homozygous positions.
Processing 12:
306 variant calls.
187 copy-number segments.
95666 heterozygous positions.
5723343 homozygous positions.
Processing 13:
272 variant calls.
124 copy-number segments.
74928 heterozygous positions.
4205669 homozygous positions.
Processing 14:
137 variant calls.
75 copy-number segments.
67596 heterozygous positions.
3625828 homozygous positions.
Processing 15:
188 variant calls.
128 copy-number segments.
63433 heterozygous positions.
3498831 homozygous positions.
Processing 16:
205 variant calls.
132 copy-number segments.
65664 heterozygous positions.
3656314 homozygous positions.
Processing 17:
190 variant calls.
161 copy-number segments.
56260 heterozygous positions.
3437042 homozygous positions.
Processing 18:
227 variant calls.
140 copy-number segments.
57844 heterozygous positions.
3287200 homozygous positions.
Processing 19:
157 variant calls.
147 copy-number segments.
45993 heterozygous positions.
2521251 homozygous positions.
Processing 20:
148 variant calls.
103 copy-number segments.
45145 heterozygous positions.
2629299 homozygous positions.
Processing 21:
109 variant calls.
83 copy-number segments.
35472 heterozygous positions.
1570443 homozygous positions.
Processing 22:
88 variant calls.
96 copy-number segments.
27945 heterozygous positions.
1514616 homozygous positions.
|++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=11m 52s
Preprocessing finished
Determining HRD-LOH, LST, TAI
HRD Telomeric AI LST HRD-sum
[1,] 4 2 2 8
Warnings:
mean.default(extract$gc$adj[, 2]):
argument is not numeric or logical: returning NA

program interrupted without any error information

Hi All,

A seqz file (around 0.9 GB) was generated from WGS data and used as input for scarHRD. Unfortunately, the program was interrupted without any error message on RStudio Server v1.3.1073. The scarHRD version is 0.1.1. Any suggestions for this situation?

Regards,
Victor

Could you please tag your project.

To promote reproducible science, could you please use git tags. Creating a tag also creates a release for your project. We require tagged releases when building scientific software. Pulling from the master is not reproducible.

I would also recommend using the standard for semantic versioning, Semver (https://semver.org/).
The version number has the form Major.Minor.Patch. Please do not
follow the git examples by putting a "v" as the leading character. Github will create a "release" when the tag is pushed.

Thank you for making your software available

git tag 1.0.0
git push origin 1.0.0

Cannot install scarHRD - error message

Hi there,
I am trying to install your package, but it is not working. I believe there might be some problem with the sequenza package. This is what I am getting:

devtools::install_github("sztup/scarHRD")

Error in read.dcf(path) :
Found continuation line starting ' sequenza (>= 2.1.2 ...' at begin of record.

I did install the latest version of sequenza beforehand. I am putting the sessionInfo below.
Any help will greatly appreciated.
Thanks!

sessionInfo()

R version 3.5.1 (2018-07-02)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=Danish_Denmark.1252 LC_CTYPE=Danish_Denmark.1252 LC_MONETARY=Danish_Denmark.1252
[4] LC_NUMERIC=C LC_TIME=Danish_Denmark.1252

attached base packages:
[1] parallel stats graphics grDevices utils datasets methods base

other attached packages:
[1] sequenza_2.1.2 squash_1.0.8 copynumber_1.22.0 BiocGenerics_0.28.0

loaded via a namespace (and not attached):
[1] Rcpp_1.0.1 compiler_3.5.1 GenomeInfoDb_1.18.2 XVector_0.22.0
[5] remotes_2.0.2 prettyunits_1.0.2 bitops_1.0-6 tools_3.5.1
[9] zlibbioc_1.28.0 digest_0.6.18 pkgbuild_1.0.3 pkgload_1.0.2
[13] memoise_1.1.0 rlang_0.3.2 cli_1.1.0 rstudioapi_0.10
[17] curl_3.3 yaml_2.2.0 xfun_0.5 GenomeInfoDbData_1.2.0
[21] withr_2.1.2 knitr_1.22 fs_1.2.7 desc_1.2.0
[25] S4Vectors_0.20.1 IRanges_2.16.0 devtools_2.0.1 stats4_3.5.1
[29] rprojroot_1.3-2 glue_1.3.1 R6_2.4.0 processx_3.3.0
[33] sessioninfo_1.1.1 callr_3.2.0 magrittr_1.5 usethis_1.4.0
[37] backports_1.1.3 ps_1.3.0 htmltools_0.3.6 GenomicRanges_1.34.0
[41] assertthat_0.2.1 RCurl_1.95-4.12 crayon_1.3.4

scar_score() with allele-specific segmentation file fails with Error in `[.data.frame`(seg, , 8) : undefined columns selected

Hi,
I have generated an allele-specific segmentation file with CNVkit. I adapted the format to fit your input
"... allele-specific segmentation file with the following columns: 1st column: sample name, 2nd column: chromosome, 3rd column: segmentation start, 4th column: segmentation end, 5th column: total copynumber, 6th column: copy number of A allele, 7th column: copy number of B allele"

sample_name chromosome segment_start segment_end tcn nA nB
model1 1 62676830 62677251 30 19 11
model1 1 173836085 177898343 3 2 1
model1 2 677597 1217588 5 4 1

But I get the error:

scar_score("./x.tsv", reference = "grch37")
Error in [.data.frame(seg, , 8) : undefined columns selected

It seems the function is expecting an additional column?!

Thank you for your support,
Thomas

hrd-sum of test1.small.seqz.gz not the same as it on introduction

Hi
I am trying to run the example data, test1.small.seqz.gz. The result on the web page is different from what I got. Am I doing something wrong?

The commands i used is:
library("scarHRD")
scar_score("test2.txt",reference = "grch38", seqz=FALSE)

my result is
HRD Telomeric AI LST HRD-sum
[1,] 2 2 3 7

the result on introduction is

HRD Telomeric AI LST HRD-sum

[1,] 1 2 0 3

questions about input for scarHRD

Hi,

I'm reading the input example files from here (https://github.com/sztup/scarHRD )

##   chromosome position base.ref depth.normal depth.tumor depth.ratio    Af
## 1       chr1    12975        N            7          20       2.841 1.000
## 2       chr1    13020        A            8          28       3.500 0.964
## 3       chr1    13026        N           15          43       2.964 1.000
## 4       chr1    13038        T           11          35       3.182 0.971
## 5       chr1    13041        A           11          37       3.364 0.946
## 6       chr1    13077        N           26          65       2.465 1.000
##   Bf zygosity.normal GC.percent good.reads AB.normal AB.tumor tumor.strand
## 1  0             hom         60         51         N        .            0
## 2  0             hom         60         28         A   G0.036         G1.0
## 3  0             hom         59         51         N        .            0
## 4  0             hom         59         35         T   C0.029         C1.0
## 5  0             hom         59         37         A   G0.054         G0.5
## 6  0             hom         62         51         N        .            0

I wonder if you could explain a bit more for some of the columns?

  • depth.ratio doesn't appear to be identical to depth.tumor / depth.normal
  • I assume Af is the ref allele frequency in the tumor sample, but what does Bf mean? Af + Bf don't always add up to 1. Isn't that weird?
  • GC.percent. How do you have GC.percent for a single nucleotide position?
  • good.reads. What's the meaning?
  • tumor.strand. What does this mean?
  • Why do we have N for base.ref at some positions?
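
The first point can be reproduced from the six rows shown above with a small base R check. The variable names simply mirror the column names; the 0.001 tolerance is an arbitrary choice to allow for rounding to three decimals.

```r
# Values copied from the six example rows above
base.ref     <- c("N", "A", "N", "T", "A", "N")
depth.normal <- c(7, 8, 15, 11, 11, 26)
depth.tumor  <- c(20, 28, 43, 35, 37, 65)
depth.ratio  <- c(2.841, 3.500, 2.964, 3.182, 3.364, 2.465)

# Plain ratio of the two depth columns, compared with the reported value
naive    <- depth.tumor / depth.normal
mismatch <- abs(naive - depth.ratio) > 0.001

data.frame(base.ref, naive = round(naive, 3), depth.ratio, mismatch)

# Rows where the plain ratio differs from the reported depth.ratio
which(mismatch)
```

In these six rows the mismatches fall on the positions where base.ref is N. This may be related to the seqz_binning step in the workflow, but that is only a guess and not something stated in the README.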

Thanks

Huan

question about LST

Hi,
I'm confused about the picture in the LST section. At the second "X" mark, the gap is shorter than 3 Mb and both sides of the gap are longer than 10 Mb, so why is it marked with an "X"? I have the same confusion about chromosome 2.

question about TAI

I read the source code for the TAI section and have some questions:

  1. why ploidy = A_CN(longest segment)
  2. why:
    if(!ploidy %in% c(1,seq(2, 200,by=2))){
    sample.seg[,'AI'] <- c(0,2)[match(sample.seg[,7] + sample.seg[,8] == ploidy & sample.seg[,7] != ploidy, c('TRUE', 'FALSE'))]
    }
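
For readers following the snippet above, this sketch only shows how the expression evaluates on made-up numbers; it does not explain the reasoning behind it. The vectors a_cn and b_cn are stand-ins for columns 7 and 8 of sample.seg, and the ploidy value is chosen so that the condition is met.

```r
# Illustrative allele-specific copy numbers (stand-ins for columns 7 and 8)
a_cn   <- c(2, 3, 1, 2)
b_cn   <- c(1, 0, 1, 2)
ploidy <- 3

# `!ploidy %in% ...` parses as !(ploidy %in% ...): it is TRUE when ploidy
# is neither 1 nor an even number between 2 and 200
takes_branch <- !ploidy %in% c(1, seq(2, 200, by = 2))

# The logical vector is matched against c("TRUE", "FALSE"), so TRUE maps
# to index 1 and FALSE to index 2 of c(0, 2)
expected_state <- a_cn + b_cn == ploidy & a_cn != ploidy
ai <- c(0, 2)[match(expected_state, c("TRUE", "FALSE"))]

takes_branch  # TRUE
ai            # 0 2 2 2
```

With these numbers only the first segment (2 + 1 copies) gets the value 0; every other segment gets 2.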

Err in scar_score

Error in if (!is.na(mat[x, ]$Bf) & !is.na(mat[x, ]$sd.Bf/sqrt(mat[x, ]$weight.Bf))) { :
Calls: scar_score ... lapply -> FUN -> baf.bayes -> mapply ->
Warning messages:
1: In b[which(diff(b) == 0) + 1] <- b[diff(b) == 0] + offset :
2: In b[which(diff(b) == 0) + 1] <- b[diff(b) == 0] + offset :
3: In max(segs.all$sd.BAF, na.rm = TRUE) :
How can I fix this problem?

Where is chrominfo_grch38?

Dear scarHRD developers,

When I read the scar_score code and try to run it, it tells me that chrominfo_grch38 cannot be found.
Can you tell me where chrominfo_grch38 comes from?
if (reference == "grch38"){
chrominfo = chrominfo_grch38
} else if(reference == "grch37"){
chrominfo = chrominfo_grch37
} else {
stop()
}

Error in split.default(lis_obj, names(lis_obj)) :

I have tried running scarHRD on my data.

scar_score("sequenza/0200135321.seqz.gz",reference = "grch37", seqz=TRUE)
Preprocessing started...
Collecting GC information ........... done

Error in split.default(lis_obj, names(lis_obj)) :
first argument must be a vector

Below is the output of sessionInfo():

R version 4.0.0 (2020-04-24)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Mojave 10.14.6

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] grid stats graphics grDevices utils datasets methods base

other attached packages:
[1] scarHRD_0.1.0 devtools_2.3.0 usethis_1.6.1 forcats_0.5.0 purrr_0.3.4 readr_1.3.1
[7] tibble_3.0.1 tidyverse_1.3.0 cowplot_1.0.0 stringr_1.4.0 tidyr_1.1.0 data.table_1.12.8
[13] ComplexHeatmap_2.4.2 ggplot2_3.3.0 dplyr_0.8.5 GenVisR_1.20.0 maftools_2.4.0

loaded via a namespace (and not attached):
[1] colorspace_1.4-1 copynumber_1.15.0 rjson_0.2.20 ellipsis_0.3.1
[5] rprojroot_1.3-2 circlize_0.4.9 XVector_0.28.0 GenomicRanges_1.40.0
[9] GlobalOptions_0.1.1 fs_1.4.1 clue_0.3-57 rstudioapi_0.11
[13] remotes_2.1.1 bit64_0.9-7 AnnotationDbi_1.50.0 fansi_0.4.1
[17] lubridate_1.7.8 xml2_1.3.2 splines_4.0.0 pkgload_1.0.2
[21] jsonlite_1.6.1 Rsamtools_2.4.0 FField_0.1.0 broom_0.5.6
[25] cluster_2.1.0 dbplyr_1.4.4 png_0.1-7 BiocManager_1.30.10
[29] compiler_4.0.0 httr_1.4.1 backports_1.1.7 assertthat_0.2.1
[33] Matrix_1.2-18 cli_2.0.2 prettyunits_1.1.1 tools_4.0.0
[37] gtable_0.3.0 glue_1.4.1 GenomeInfoDbData_1.2.3 rappdirs_0.3.1
[41] Rcpp_1.0.4.6 Biobase_2.48.0 cellranger_1.1.0 vctrs_0.3.0
[45] Biostrings_2.56.0 nlme_3.1-147 rtracklayer_1.48.0 ps_1.3.3
[49] testthat_2.3.2 rvest_0.3.5 lifecycle_0.2.0 gtools_3.8.2
[53] XML_3.99-0.3 zlibbioc_1.34.0 scales_1.1.1 BSgenome_1.56.0
[57] VariantAnnotation_1.34.0 hms_0.5.3 parallel_4.0.0 SummarizedExperiment_1.18.1
[61] RColorBrewer_1.1-2 curl_4.3 pbapply_1.4-2 memoise_1.1.0
[65] gridExtra_2.3 biomaRt_2.44.0 stringi_1.4.6 RSQLite_2.2.0
[69] desc_1.2.0 S4Vectors_0.26.1 GenomicFeatures_1.40.0 BiocGenerics_0.34.0
[73] pkgbuild_1.0.8 BiocParallel_1.22.0 shape_1.4.4 GenomeInfoDb_1.24.0
[77] rlang_0.4.6 pkgconfig_2.0.3 matrixStats_0.56.0 bitops_1.0-6
[81] lattice_0.20-41 GenomicAlignments_1.24.0 processx_3.4.2 bit_1.1-15.2
[85] tidyselect_1.1.0 plyr_1.8.6 magrittr_1.5 R6_2.4.1
[89] IRanges_2.22.2 generics_0.0.2 squash_1.0.9 DelayedArray_0.14.0
[93] DBI_1.1.0 pillar_1.4.4 haven_2.3.0 withr_2.2.0
[97] survival_3.1-12 RCurl_1.98-1.2 seqminer_8.0 modelr_0.1.8
[101] crayon_1.3.4 utf8_1.1.4 sequenza_3.0.0 BiocFileCache_1.12.0
[105] viridis_0.5.1 GetoptLong_0.1.8 progress_1.2.2 readxl_1.3.1
[109] callr_3.4.3 blob_1.2.1 reprex_0.3.0 digest_0.6.25
[113] openssl_1.4.1 stats4_4.0.0 munsell_0.5.0 viridisLite_0.3.0
[117] iotools_0.3-1 sessioninfo_1.1.1 askpass_1.1
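
The cause of the split.default error is not established in this report. A cheap first check before calling scar_score is to confirm that the .seqz.gz file can be read and looks as expected. The sketch below uses base R only; `peek_seqz` is a hypothetical helper, and it assumes the file is tab-separated with a header row and the chromosome name in the first column.

```r
# Hypothetical pre-flight check for a .seqz(.gz) file; not part of scarHRD.
# Assumes a tab-separated file with a header and chromosome in column 1.
peek_seqz <- function(path, n = 1000) {
  stopifnot(file.exists(path))
  # gzfile() reads both gzip-compressed and plain-text files.
  con <- gzfile(path, open = "rt")
  on.exit(close(con))
  head_rows <- read.delim(con, nrows = n, stringsAsFactors = FALSE)
  list(
    columns = names(head_rows),          # header names as parsed by R
    n_rows_read = nrow(head_rows),       # at most n data rows
    chromosomes = unique(as.character(head_rows[[1]]))
  )
}

# Usage with the path from the report above:
# str(peek_seqz("sequenza/0200135321.seqz.gz"))
```

An empty file, a missing header, or unexpected chromosome labels in this summary would point to a problem with the input rather than with scarHRD itself.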

about hg38

Thanks for sharing this tool. You mentioned that we need to run install_github('aroneklund/copynumber') in order to run on GRCh38. My guess is that this allows the sequenza.extract function to accept the hg38 assembly, because originally it only accepts hg19, hg18 and hg17.

Having checked your source code in scar_score.R and preporcess.seqz.R, it seems that the call to sequenza.extract in preporcess.seqz.R does not pass any assembly parameter, so sequenza.extract would use hg19 (the default) no matter which reference (grch38 or grch37) is specified in scar_score.

If so, install_github('aroneklund/copynumber') is not needed to run on GRCh38. Am I understanding this correctly? Looking forward to your clarification.

Best,
Corey
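
The mechanism described in the question above can be illustrated with a minimal sketch: when a wrapper does not forward an argument, the inner function falls back to its declared default. Both functions below are hypothetical stand-ins, not the real signatures of sequenza.extract or of scarHRD's preprocessing code.

```r
# Stand-in for a function with a defaulted 'assembly' argument.
# (Hypothetical; not sequenza.extract's real signature.)
extract_stub <- function(file, assembly = "hg19") {
  list(file = file, assembly = assembly)
}

# Stand-in for a wrapper that accepts 'reference' but never forwards it.
preprocess_stub <- function(file, reference = "grch38") {
  extract_stub(file)
}

# formals() lists a function's declared arguments and their defaults.
formals(extract_stub)$assembly

# The inner call still uses the default, whatever 'reference' says.
res <- preprocess_stub("sample.seqz.gz", reference = "grch38")
res$assembly
```

With sequenza installed, `names(formals(sequenza::sequenza.extract))` would show whether the real function declares an assembly argument and therefore whether the guess in the question holds.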
