vccri / sierra


Discover differential transcript usage from polyA-captured single cell RNA-seq data

License: GNU General Public License v3.0

R 100.00%

scrna-seq isoforms alternative-polyadenylation alternative-splicing sierra

sierra's Issues

User-defined pseudo-bulk samples?

Hi,

I have data with cells from various individuals representing two different groups, and I'm interested in comparing the same cell types between the groups. I would like to use the individuals as replicates when running DEXSeq for DTU. Is there an easy way to modify the existing code so that the pseudo-bulk samples correspond to all the cells of interest in the available individuals?

Best,
Daniel
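
A minimal base R sketch of the aggregation step, assuming a peak-by-cell count matrix and a per-cell vector of individual labels. The names `counts` and `individual` are hypothetical and the numbers are made up. This does not use or modify Sierra's own pseudo-bulk code; it only illustrates summing cells per individual so that each individual becomes one replicate column.

```r
# Toy peak-by-cell count matrix: 3 peaks x 6 cells (all values are made up).
counts <- matrix(
  c(1, 0, 2, 0, 1, 3,
    0, 4, 1, 1, 0, 0,
    2, 2, 0, 5, 1, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("peak", 1:3), paste0("cell", 1:6))
)

# Hypothetical per-cell metadata: the individual each cell came from.
individual <- c("ind1", "ind1", "ind2", "ind2", "ind3", "ind3")

# rowsum() sums rows by group, so transpose to put cells in rows,
# aggregate by individual, then transpose back to peaks x individuals.
pseudo.bulk <- t(rowsum(t(counts), group = individual))
print(pseudo.bulk)
```

In a real analysis the cells would first be subset to the cell type of interest, and the resulting matrix would still need a matching sample table (individual and group) before it could be passed to DEXSeq.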

Does the Sierra package have a detailed protocol? I would like to find one.

Interpret results

Hello, thank you very much for developing this package.
I have a few questions about how to interpret the table obtained from DUTest.

In the row names, for example "TMBIM6:12:50152188-50156873:1", do the coordinates correspond to the start and end positions of the peak?
What do the two columns population1_pct and population2_pct represent?
Is Log2_fold_change calculated from those two columns?
Is it possible to tell from this result whether there is 3' UTR shortening?

Thanks!
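
For the row-name question, the CountPeaks code quoted in another issue on this page builds peak IDs by pasting gene, chromosome, Fit.start-Fit.end, and strand, separated by colons. A small sketch that splits an ID under that assumption follows. `ParsePeakID` is a hypothetical helper, not a Sierra function, and it assumes the gene name itself contains no colon.

```r
# Split a peak ID of the form Gene:Chr:Start-End:Strand.
# ParsePeakID is a hypothetical helper, not a function exported by Sierra.
ParsePeakID <- function(peak.id) {
  parts <- strsplit(peak.id, ":", fixed = TRUE)[[1]]
  coords <- as.integer(strsplit(parts[3], "-", fixed = TRUE)[[1]])
  data.frame(gene = parts[1], chr = parts[2],
             start = coords[1], end = coords[2],
             strand = parts[4], stringsAsFactors = FALSE)
}

peak <- ParsePeakID("TMBIM6:12:50152188-50156873:1")
print(peak)
```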

Error in AnnotatePeaksFromGTF()

Hi,

I am trying to use this on human scRNA-seq data generated using the 10x platform.

I got this error after running

genome <- BSgenome.Hsapiens.UCSC.hg38::BSgenome.Hsapiens.UCSC.hg38
AnnotatePeaksFromGTF(peak.sites.file = peak.merge.output.file, 
                     gtf.file = reference.file,
                     output.file = "peak_annotations.txt", 
                     genome = genome)

Error in asMethod(object) :
  The character vector to convert to a GRanges object must contain
  strings of the form "chr:start-end" or "chr:start-end:strand", with end
 >= start - 1, or "chr:pos" or "chr:pos:strand". For example:
  "chr1:2501-2900", "chr1:2501-2900:+", or "chr1:740". Note that ".." is
  a valid alternate start/end separator. Strand can be "+", "-", "*", or
  missing.

It turns out that some of the regions in the peak file have a start position greater than the end position. For example:

 peaks.use.chr.update[250:260]

[1] "chr16:2853291-2856833:-" "chr16:2854597-2855173:-"
[3] "chr19:18857128-18867664:+" "chr19:18863469-18866739:+"
[5] "chr19:18867222-18868236:+" "chr19:18865704-18866607:+"
[7] "chr19:18867002-18867680:+" "chr19:18866058-18867445:+"
[9] "chr19:18857450-18860932:+" "chr19:18867147-18867135:+"
[11] "chr7:26864407-26864659:-"

Do you know why that might be happening?
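
As a possible workaround while the cause is unclear, the offending entries can be located before conversion. The sketch below uses three of the strings from the output above and plain base R; the variable names are illustrative, and dropping the flagged peaks is a stopgap rather than a fix for whatever produced them.

```r
peaks <- c("chr19:18857450-18860932:+",
           "chr19:18867147-18867135:+",
           "chr7:26864407-26864659:-")

# Pull out the start and end coordinates with a regular expression.
m <- regmatches(peaks, regexec("^[^:]+:([0-9]+)-([0-9]+):", peaks))
start <- as.integer(vapply(m, `[`, character(1), 2))
end   <- as.integer(vapply(m, `[`, character(1), 3))

# The error message requires end >= start - 1, so flag anything that violates it.
bad <- end < start - 1
print(peaks[bad])

# Keep only the intervals that can be converted to a GRanges object.
peaks.ok <- peaks[!bad]
```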

peak discrepancy

Hi,

I'm trying to re-analyze some previously published data. The published analysis, based on gene counts, detects 17473 features and 56902 cells, but my dataset yields only 394 peaks and 170706 cells.
This is odd, since the number of peaks should be at least equal to the number of gene features. Is there a step where something could have gone wrong that I should check? Thanks.

seuratPeaks <- NewPeakSeurat(
peak.counts,
peak.annotations,
min.cells = 0,
min.peaks = 0 )

Same data but with function:
peaks.seurat <- PeakSeuratFromTransfer(peak.data = peak.counts,
genes.seurat = published.seurat_object,
annot.info = peak.annotations,
min.cells = 0, min.peaks = 0)
Gives instead:
394 peaks and 56902 cells

I don't understand why the two functions give 2 different outputs.

Thanks!

Error in if (start > stop) { : missing value where TRUE/FALSE needed in Analysing genomic motifs

I've been running through a lot of datasets with Sierra pretty smoothly. With a tumor dataset, I'm getting the following error with AnnotatePeaksFromGTF:

[1] "Annotating 60112 peak coordinates."

Annotating 3' UTRs
Annotating 5' UTRs
Annotating introns
Annotating exons
Annotating CDS
Analysing genomic motifs surrounding peaks (this can take some time)

|==================================================================================================================== | 88%
Error in if (start > stop) { : missing value where TRUE/FALSE needed

I'm using a custom GTF with transgene elements in it (EGFP, TdTomato, WPRE) but I've had no problems using CellRanger, Kallisto/Bustools, Seurat, Scanpy with it. Also, I got the same error with using the CellRanger mm10 (which is what I modified to add the transgenes). I also got some warnings related to the custom GTF:

1: In .get_cds_IDX(mcols0$type, mcols0$phase) :
The "phase" metadata column contains non-NA values for features of type stop_codon, exon, transcript. This information was
ignored.
2: In .Seqinfo.mergexy(x, y) :
Each of the 2 combined objects has sequence levels not in the other:

in 'x': M

in 'y': MT, WPRE, YAPRELApA, BFP2, GL456210.1, GL456211.1, GL456212.1, GL456216.1, GL456219.1, GL456221.1, GL456233.1, GL456239.1, GL456350.1, GL456354.1, GL456372.1, GL456381.1, GL456385.1, JH584292.1, JH584293.1, JH584294.1, JH584295.1, JH584296.1, JH584297.1, JH584298.1, JH584299.1, JH584303.1, JH584304.1, mGFP, mTom
Make sure to always combine/compare objects based on the same reference
genome (use suppressWarnings() to suppress this warning).

Any suggestions? Thanks in advance!

Update: I tried another "normal" non-tumor sample with the custom reference and got the same error at the same place. Might there be a quick fix? I'm re-aligning one of my samples to the normal mm10 to see if that fixes it.
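
One thing that may be worth ruling out, given the sequence-level warnings above, is that some peaks sit on sequences the genome object does not contain (transgenes, unplaced scaffolds, or MT versus M naming). This is only a guess at the cause. The sketch below filters a toy peak table against a stand-in vector of genome sequence names; the table, the "chr" prefix mapping, and `genome.seqs` are all assumptions for illustration.

```r
# Toy peak table; the Chr values are illustrative.
peak.table <- data.frame(
  Gene = c("Actb", "mt-Rnr1", "mGFP", "Gm1"),
  Chr  = c("5", "MT", "mGFP", "GL456210.1"),
  Fit.start = c(100L, 115L, 10L, 500L),
  Fit.end   = c(400L, 1024L, 300L, 900L),
  stringsAsFactors = FALSE
)

# Stand-in for the sequence names of the genome object; with a BSgenome
# object these would typically come from seqnames(genome).
genome.seqs <- paste0("chr", c(1:19, "X", "Y", "M"))

# Map annotation-style names to UCSC-style names: add "chr", and MT -> chrM.
ucsc.chr <- ifelse(peak.table$Chr == "MT", "chrM", paste0("chr", peak.table$Chr))

in.genome <- ucsc.chr %in% genome.seqs
print(peak.table[!in.genome, ])   # peaks with no matching genome sequence
peak.table.use <- peak.table[in.genome, ]
```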

Which alignment method and indexing options are suitable to use with Sierra?

Hello everyone,
Thank you for developing Sierra. I want to use it on some scRNA-seq data, and I assume the alignment step is crucial here. I believe STAR works fine. Is it okay to use other aligners such as hisat2?

Also, indexing the genome with STAR involves many parameters, including the splice junction database parameters. Do these need to be set during indexing for Sierra to perform at its best?

cell barcode tag

Hi there,

Thank you for developing this useful tool for single cell dataset analysis.
I tried to use this on my dataset, which was processed by the Drop-seq pipeline. That pipeline tags the cell barcode as XC rather than CB. I replaced CB with XC in the source code and it works. I was wondering whether the cell barcode tag could be made an input parameter for convenience in future. Thank you in advance.

PeakSeuratFromTransfer

Thanks again for all of the help.

Almost everything from the vignette now works on my dataset.

However, I can't use the function PeakSeuratFromTransfer, only NewPeakSeurat (which works fine and lets me proceed through the vignette).

Specifically, using PeakSeuratFromTransfer I get the error:

[1] "Creating Seurat object with 49254 peaks and 0 cells"
[1] "Preparing feature table for DEXSeq"
Performing log-normalization
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Error in new.data[new.features, colnames(x = object), drop = FALSE] :
invalid or not-yet-implemented 'Matrix' subsetting

I used a standard Seurat V3 workflow (not using SCTransform though I tried that at first with no success) to make the Seurat Object from the 10X matrix and computed the tsne. Is the "0 cells" notation troubling?

NewPeakSeurat gives the following:

[1] "Creating Seurat object with 49254 peaks and 9128 cells"
[1] "Preparing feature table for DEXSeq"
Performing log-normalization
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
[1] "No t-SNE coodinates included"
[1] "No UMAP coordinates included"
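
A "0 cells" message would be consistent with no cell barcodes being shared between the peak count matrix and the gene-level Seurat object, for example because one set carries a "-1" suffix and the other does not. That is an assumption about the cause, not something confirmed here. A quick base R check with made-up barcodes:

```r
# Made-up barcodes standing in for colnames(peak.counts) and the
# cell names of the gene-level Seurat object.
peak.barcodes <- c("AAACCTG-1", "AAACGGG-1", "AAAGATG-1", "AAATGCC-1")
gene.barcodes <- c("AAACCTG", "AAACGGG", "TTTGTCA")

# Direct overlap: zero here because of the "-1" suffix.
shared <- intersect(peak.barcodes, gene.barcodes)

# Overlap after stripping a trailing "-<digits>" suffix from both sets.
StripSuffix <- function(x) sub("-[0-9]+$", "", x)
shared.stripped <- intersect(StripSuffix(peak.barcodes), StripSuffix(gene.barcodes))

cat("shared as-is:", length(shared), "\n")
cat("shared after stripping suffix:", length(shared.stripped), "\n")
```

If the second number is much larger than the first, renaming the columns of the peak matrix to match the Seurat object's cell names before the transfer would be the natural next thing to try.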

Error:"the supplied start/end lead to a negative width" in AnnotatePeaksFromGTF

Dear developing team,

I get the following error when I run AnnotatePeaksFromGTF:

This issue comes from BSgenome::getSeq. I managed to track down the row that caused it. Ultimately, the error is given by a function in the IRanges package. You can reproduce it by running the function with the following peaks file.
Gene	Chr	Strand	MaxPosition	Fit.max.pos	Fit.start	Fit.end	mu	sigma	k	exon.intron	exon.pos	polyA_ID
mt-Rnr1	chrM	1	636	634	115	1024	298.217592563921	173.813560963883	3632.99934014036	no-junctions	NA	mt-Rnr1:chrM:115-1024:1

Now, I am not sure why this error comes up. But, given the default parameters used in these functions, if I had to make a wild guess, it would be that the peak is too close to the start of the chromosome (closer than 250bp). I could be wrong of course, but if that is the case, it would be good to adapt the code to handle these cases or deliberately exclude them from the peaks file.

thanks,
amisios
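
To make the guess above concrete, here is the arithmetic for the reported peak, assuming a 250 bp flanking window is subtracted from Fit.start. The window size is taken from the guess in the report, not from Sierra's source, and the variable names are illustrative.

```r
# Peak from the report above: mt-Rnr1 on chrM, Fit.start = 115, Fit.end = 1024.
fit.start <- 115L
fit.end   <- 1024L
flank     <- 250L   # assumed window size, taken from the guess in the report

# A naive upstream window would begin before position 1.
window.start <- fit.start - flank
print(window.start)

# Clamping to 1 keeps the requested range on the chromosome.
safe.start <- max(1L, window.start)
print(safe.start)
```

Under that assumption the requested range starts at a negative coordinate, which is the kind of input that could produce a negative-width error; clamping or skipping such peaks are the two obvious ways to handle it.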

FindPeaks error

Hi,
Sierra is really a great tool for analyzing APA in scRNA-seq.
I tried to run Sierra using the vignette example:
extdata_path <- system.file("extdata",package = "Sierra")
reference.file <- paste0(extdata_path,"/Vignette_cellranger_genes_subset.gtf")
junctions.file <- paste0(extdata_path,"/Vignette_example_TIP_sham_junctions.bed")
bamfile <- c(paste0(extdata_path,"/Vignette_example_TIP_sham.bam"),
paste0(extdata_path,"/Vignette_example_TIP_MI.bam") )
whitelist.bc.file <- c(paste0(extdata_path,"/example_TIP_sham_whitelist_barcodes.tsv"),
paste0(extdata_path,"/example_TIP_MI_whitelist_barcodes.tsv"))

peak.output.file <- c("Vignette_example_TIP_sham_peaks.txt",
"Vignette_example_TIP_MI_peaks.txt")
FindPeaks(output.file = peak.output.file[1],   # output filename
          gtf.file = reference.file,           # gene model as a GTF file
          bamfile = bamfile[1],                # BAM alignment filename
          junctions.file = junctions.file,     # BED filename of splice junctions existing in BAM file
          ncores = 1)                          # number of cores to use

however, I got the following error:
Error in validObject(.Object) :
invalid class "GFF2File" object: undefined class for slot "resource" ("characterORconnection")

Do you know how to cope with this issue?
Best

Error in CountPeaks()

Hello Sierra team,

I was hoping to get some help regarding an issue (see below) that I'm having when running Sierra CountPeaks with a specific sample (https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR13043563 and SRR13043565). The pipeline works fine with both the vignette dataset and with other samples from the same experiment/patient (ex. SRR13043564).

I've tried some of the common suggestions found in other issues (ex. checking my gtf annotations, barcode file, bam file index and etc...) but so far nothing worked.

I'm using cellranger's annotation as a reference (refdata-gex-GRCh38-2020-A) and mapped my samples with STAR aligner (since the original experiment is MARS-seq) and followed STARsolo guidelines to make it as close to cellranger as possible.

Thanks in advance for any help you can provide.

Best regards,
Felipe

Error in {: task 32 failed - "1 elements in value to replace 0 elements"
Traceback:

CountPeaks(peak.sites.file = peak.merge.output.file, gtf.file = reference.file,
. bamfile = bam.file, whitelist.file = whitelist.bc.file, output.dir = outfile,
. countUMI = TRUE, ncores = 16)
foreach::foreach(each.chr = chr.names, .combine = "rbind", .packages = c("magrittr")) %dopar%
. {
. mat.per.chr <- c()
. message("Processing chr: ", each.chr)
. for (strand in c(1, -1)) {
. message(" and strand ", strand)
. isMinusStrand <- if (strand == 1)
. FALSE
. else TRUE
. peak.sites.chr <- dplyr::filter(peak.sites, Chr ==
. each.chr & Strand == strand) %>% dplyr::select(Gene,
. Chr, Fit.start, Fit.end, Strand)
. peak.sites.chr$Fit.start <- as.integer(peak.sites.chr$Fit.start)
. peak.sites.chr$Fit.end <- as.integer(peak.sites.chr$Fit.end)
. peak.sites.chr <- dplyr::filter(peak.sites.chr, Fit.start <
. Fit.end)
. if (nrow(peak.sites.chr) == 0) {
. next
. }
. isMinusStrand <- if (strand == 1)
. FALSE
. else TRUE
. which <- GenomicRanges::GRanges(seqnames = each.chr,
. ranges = IRanges::IRanges(1, max(peak.sites.chr$Fit.end)))
. param <- Rsamtools::ScanBamParam(tag = c(CBtag, UMItag),
. which = which, flag = Rsamtools::scanBamFlag(isMinusStrand = isMinusStrand))
. aln <- GenomicAlignments::readGAlignments(bamfile,
. param = param)
. nobarcodes <- which(unlist(is.na(GenomicRanges::mcols(aln)[CBtag])))
. noUMI <- which(unlist(is.na(GenomicRanges::mcols(aln)[UMItag])))
. to.remove <- dplyr::union(nobarcodes, noUMI)
. if (length(to.remove) > 0) {
. aln <- aln[-to.remove]
. }
. whitelist.pos <- which(unlist(GenomicRanges::mcols(aln)[CBtag]) %in%
. whitelist.bc)
. aln <- aln[whitelist.pos]
. if (countUMI) {
. GenomicRanges::mcols(aln)$CB_UB <- paste0(unlist(GenomicRanges::mcols(aln)[CBtag]),
. "_", unlist(GenomicRanges::mcols(aln)[UMItag]))
. uniqUMIs <- which(!duplicated(GenomicRanges::mcols(aln)$CB_UB))
. aln <- aln[uniqUMIs]
. }
. aln <- GenomicRanges::split(aln, unlist(GenomicRanges::mcols(aln)[CBtag]))
. polyA.GR <- GenomicRanges::GRanges(seqnames = peak.sites.chr$Chr,
. IRanges::IRanges(start = peak.sites.chr$Fit.start,
. end = as.integer(peak.sites.chr$Fit.end)))
. n.polyA <- length(polyA.GR)
. barcodes.gene <- names(aln)
. res <- sapply(barcodes.gene, function(x) GenomicRanges::countOverlaps(polyA.GR,
. aln[[x]]))
. res.mat <- matrix(0L, nrow = n.polyA, ncol = n.bcs)
. res.mat[, match(barcodes.gene, whitelist.bc)] <- res
. mat.per.strand <- Matrix::Matrix(res.mat, sparse = TRUE)
. polyA.ids <- paste0(peak.sites.chr$Gene, ":", peak.sites.chr$Chr,
. ":", peak.sites.chr$Fit.start, "-", peak.sites.chr$Fit.end,
. ":", peak.sites.chr$Strand)
. rownames(mat.per.strand) <- polyA.ids
. if (is.null(mat.per.chr)) {
. mat.per.chr <- mat.per.strand
. }
. else {
. mat.per.chr <- rbind(mat.per.chr, mat.per.strand)
. }
. }
. return(mat.per.chr)
. }
e$fun(obj, substitute(ex), parent.frame(), e$data)

long reads

Hi @reprobate

I tried to use Sierra on single-cell long-read data. I know it has not been tested on this kind of data. I came across the following error while running DUTest. Could it be related to the small amount of data?

[1] "5145 expressed peaks in feature types exon"
[1] "1289 genes detected with multiple peak sites expressed"
[1] "4910 individual peak sites to test"
converting counts to integer mode
[1] "Running DEXSeq test..."
-- note: fitType='parametric', but the dispersion trend was not well captured by the
   function: y = a/x + b, and a local regression fit was automatically substituted.
   specify fitType='local' or 'mean' to avoid this message next time.
Error in countsThis[as.character(newMf[i, "exon"]), as.character(newMf[i, : subscript out of bounds

Error in DUTest

Hi Developers,

Thanks for this cool method.

I'm currently encountering an error when I run DUTest with a Seurat object. The error message is as follows.

[1] "Running DEXSeq test..."
Error in checkSlotAssignment(object, name, value) :
assignment of an object of class "DFrame" is not valid for slot 'elementMetadata' in an object of class "DEXSeqResults"; is(value, "DataTable_OR_NULL") is not TRUE
In addition: Warning message:
In DESeqDataSet(rse, design, ignoreRank = TRUE) :
some variables in design formula are characters, converting to factors

Can I get some advice to resolve it?

Thanks!

Best,
Soobeom

sessionInfo()
R version 4.0.0 (2020-04-24)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux

Matrix products: default
BLAS: /gpfs/share/apps/R/4.0.0/lib64/R/lib/libRblas.so
LAPACK: /gpfs/share/apps/R/4.0.0/lib64/R/lib/libRlapack.so

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] stringr_1.4.0 Sierra_0.99.24 SeuratObject_4.0.0 Seurat_4.0.0

loaded via a namespace (and not attached):
[1] reticulate_1.18 tidyselect_1.1.0
[3] RSQLite_2.2.3 AnnotationDbi_1.52.0
[5] htmlwidgets_1.5.3 grid_4.0.0
[7] BiocParallel_1.24.1 Rtsne_0.15
[9] munsell_0.5.0 codetools_0.2-18
[11] ica_1.0-2 statmod_1.4.35
[13] future_1.21.0 miniUI_0.1.1.1
[15] colorspace_2.0-0 Biobase_2.50.0
[17] knitr_1.31 rstudioapi_0.13
[19] stats4_4.0.0 SingleCellExperiment_1.12.0
[21] ROCR_1.0-11 tensor_1.5
[23] listenv_0.8.0 MatrixGenerics_1.2.0
[25] GenomeInfoDbData_1.2.4 harmony_1.0
[27] hwriter_1.3.2 polyclip_1.10-0
[29] bit64_4.0.5 parallelly_1.23.0
[31] vctrs_0.3.6 generics_0.1.0
[33] xfun_0.20 biovizBase_1.38.0
[35] BiocFileCache_1.14.0 R6_2.5.0
[37] GenomeInfoDb_1.26.2 locfit_1.5-9.4
[39] AnnotationFilter_1.14.0 bitops_1.0-6
[41] spatstat.utils_2.0-0 cachem_1.0.1
[43] DelayedArray_0.16.1 assertthat_0.2.1
[45] promises_1.1.1 scales_1.1.1
[47] nnet_7.3-14 gtable_0.3.0
[49] globals_0.14.0 goftest_1.2-2
[51] ensembldb_2.14.0 rlang_0.4.10
[53] genefilter_1.72.1 splines_4.0.0
[55] rtracklayer_1.50.0 lazyeval_0.2.2
[57] dichromat_2.0-0 checkmate_2.0.0
[59] reshape2_1.4.4 abind_1.4-5
[61] GenomicFeatures_1.42.1 backports_1.2.1
[63] httpuv_1.5.5 Hmisc_4.4-2
[65] tools_4.0.0 ggplot2_3.3.3
[67] ellipsis_0.3.1 RColorBrewer_1.1-2
[69] BiocGenerics_0.36.0 ggridges_0.5.3
[71] Rcpp_1.0.6 plyr_1.8.6
[73] base64enc_0.1-3 progress_1.2.2
[75] zlibbioc_1.36.0 purrr_0.3.4
[77] RCurl_1.98-1.2 prettyunits_1.1.1
[79] rpart_4.1-15 openssl_1.4.3
[81] deldir_0.2-9 pbapply_1.4-3
[83] cowplot_1.1.1 S4Vectors_0.28.1
[85] zoo_1.8-8 SummarizedExperiment_1.20.0
[87] ggrepel_0.9.1 cluster_2.1.0
[89] magrittr_2.0.1 data.table_1.13.6
[91] scattermore_0.7 lmtest_0.9-38
[93] RANN_2.6.1 ProtGenerics_1.22.0
[95] fitdistrplus_1.1-3 matrixStats_0.57.0
[97] hms_1.0.0 patchwork_1.1.1
[99] mime_0.9 xtable_1.8-4
[101] XML_3.99-0.5 jpeg_0.1-8.1
[103] IRanges_2.24.1 gridExtra_2.3
[105] compiler_4.0.0 biomaRt_2.46.2
[107] tibble_3.0.6 KernSmooth_2.23-18
[109] crayon_1.3.4 htmltools_0.5.1.1
[111] mgcv_1.8-33 later_1.1.0.1
[113] Formula_1.2-4 geneplotter_1.68.0
[115] tidyr_1.1.2 DBI_1.1.1
[117] dbplyr_2.0.0 MASS_7.3-53
[119] rappdirs_0.3.2 Matrix_1.2-18
[121] parallel_4.0.0 Gviz_1.34.0
[123] igraph_1.2.6 GenomicRanges_1.42.0
[125] pkgconfig_2.0.3 GenomicAlignments_1.26.0
[127] foreign_0.8-80 plotly_4.9.3
[129] xml2_1.3.2 foreach_1.5.1
[131] annotate_1.68.0 XVector_0.30.0
[133] DEXSeq_1.36.0 VariantAnnotation_1.36.0
[135] digest_0.6.27 sctransform_0.3.2
[137] RcppAnnoy_0.0.18 spatstat.data_1.7-0
[139] Biostrings_2.58.0 leiden_0.3.6
[141] htmlTable_2.1.0 uwot_0.1.10
[143] curl_4.3 shiny_1.6.0
[145] Rsamtools_2.6.0 lifecycle_0.2.0
[147] nlme_3.1-151 jsonlite_1.7.2
[149] viridisLite_0.3.0 askpass_1.1
[151] BSgenome_1.58.0 pillar_1.4.7
[153] lattice_0.20-41 fastmap_1.1.0
[155] httr_1.4.2 survival_3.2-7
[157] glue_1.4.2 spatstat_1.64-1
[159] png_0.1-7 iterators_1.0.13
[161] bit_4.0.4 stringi_1.5.3
[163] blob_1.2.1 DESeq2_1.30.0
[165] latticeExtra_0.6-29 memoise_2.0.0
[167] dplyr_1.0.3 irlba_2.3.3
[169] future.apply_1.7.0

Error in scan in FindPeaks

Hello,

I am trying to run FindPeaks on my data and am getting the following output. It appears to work fine but then after running for about 2 hours (using 32gb ram on 4 threads), I get an error that the line doesn't have 12 elements. The only potential problem I can think of is that the bam file has the tag for cell barcodes as BC:Z and UMI as U8:Z. Could that be the issue? The bam file was produced using long read single cell sequencing.

Thank you so much in advance for your help!

Error in dataframe while running FindPeaks

Hi,

I am using Sierra for 10x data. While running FindPeaks, I get the following error:
Error in data.frame(EnsemblID = gtf_gr@elementMetadata@listData$gene_id, : arguments imply differing number of rows: 208940, 0

The GTF file I use here is not from Ensembl. It was downloaded from NCBI.
Any suggestion is appreciated.
Thanks in advance!
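
The "208940, 0" in the error suggests that gene_id was found for every record while some other attribute the code expects came back empty. Which attribute that is has not been confirmed here; gene_name is used below only as a plausible example. The sketch checks made-up GTF attribute strings for the presence of a key using base R, and `HasAttr` is a hypothetical helper.

```r
# Two made-up GTF attribute strings: the first carries both gene_id and
# gene_name, the second imitates a file that has gene_id but no gene_name.
attrs <- c(
  'gene_id "G1"; gene_name "GeneA"; gene_biotype "protein_coding";',
  'gene_id "GeneA"; transcript_id "T1"; gbkey "mRNA";'
)

# TRUE where the attribute key is present in the string.
HasAttr <- function(x, key) grepl(paste0("(^|; *)", key, " \""), x)

attr.check <- data.frame(gene_id   = HasAttr(attrs, "gene_id"),
                         gene_name = HasAttr(attrs, "gene_name"))
print(attr.check)
```

On a real file, importing the GTF with rtracklayer::import() and inspecting the metadata column names would show the same thing without any regular expressions.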

PeakSeuratFromTransfer or NewPeakSeurat Error

Hi, I'm new to working with Seurat objects as I'm used to various scRNA-seq Python tools. I'm having trouble with both PeakSeuratFromTransfer and NewPeakSeurat if you could help me out. I think Sierra is a really cool package and I'm excited to use it :)

Using an existing Seurat object and running PeakSeuratFromTransfer, I get the following:


peaks_seurat <- PeakSeuratFromTransfer(peak.data = peak_counts,
                                       genes.seurat = sdata, 
                                       annot.info = peak_annotations, 
                                       min.cells = 0, min.peaks = 0, 
                                       )


Error in if (gene.cov != "") { : missing value where TRUE/FALSE needed

Trying to make a new Seurat object from peak_counts and peak_annotations I get:

peaks.seurat <- NewPeakSeurat(peak.data = peak_counts, 
                              annot.info = peak_annotations,
                              project.name = "cr_st5",
                              filter.gene.mismatch = FALSE,
                              verbose = TRUE)

[1] "Creating Seurat object with 0 peaks and 2921 cells"
Error in annot.info[peaks.use, feature.names] : 
  incorrect number of dimensions

If I set filter.gene.mismatch = TRUE I get a different error: Error in if (gene.cov != "") { : missing value where TRUE/FALSE needed, which is the same as the PeakSeuratFromTransfer error.

I've tried digging into the source code to see what might be wrong, but I'm not super familiar with R so I'm having a bit of trouble.

Thanks!

Ashley

Error in CountPeaks step

I'm attempting to figure out the source of the following error:
Error in (function (classes, fdef, mtable) :
unable to find an inherited method for function 'writeMM' for signature '"NULL"'

I just re-installed Sierra after finding the same error described here: #35
I have previously run Sierra successfully on 10x data from the same pipeline (STARsolo). The current dataset is substantially larger, about 1.5 billion reads per sample. Running via slurm with 128GB RAM provided.

Edit: subsetting the BAM to chr19 only did not resolve the error so I don't think this is a memory issue.

Command:

CountPeaks(peak.sites.file = peak.output.file, # FindPeaks runs fine
gtf.file = gtf, # Ensembl build 99 GTF for mouse, filtered per CellRanger spec
bamfile = bam, # indexed bam file from STARsolo
whitelist.file = bcs, # text file containing one column with barcode sequences
output.dir = count.dir,
countUMI = TRUE,
chr.names = c("1","2","3"), # subset to three chromosomes to expedite testing
filter.chr=TRUE, # (your documentation describes this option incorrectly)
ncores = 8)

Full output below:

Warning messages:
1: replacing previous import 'GenomicRanges::union' by 'dplyr::union' when loading 'Sierra'
2: replacing previous import 'GenomicRanges::intersect' by 'dplyr::intersect' when loading 'Sierra'
3: replacing previous import 'GenomicRanges::setdiff' by 'dplyr::setdiff' when loading 'Sierra'
4: replacing previous import 'Gviz::tail' by 'utils::tail' when loading 'Sierra'
5: replacing previous import 'Gviz::head' by 'utils::head' when loading 'Sierra'
There are 8266 whitelist barcodes.
Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
There are 168884 sites
Doing counting for each site...
Processing chr: 3
and strand 1
Processing chr: 2
and strand 1
Processing chr: 1
and strand 1
and strand -1
and strand -1
and strand -1
Error in (function (classes, fdef, mtable) :
unable to find an inherited method for function 'writeMM' for signature '"NULL"'
Calls: CountPeaks -> ->
In addition: Warning message:
In .get_cds_IDX(mcols0$type, mcols0$phase) :
The "phase" metadata column contains non-NA values for features of type
stop_codon. This information was ignored.
Execution halted

thank you,

Sierra Installation error

Hi,

While trying to install the package i get the following error:
Error: package or namespace load failed for ‘Sierra’:
.onLoad failed in loadNamespace() for 'checkmate', details:
call: get0(oNam, envir = ns)
error: lazy-load database '/Users/samhas/Library/R/3.6/library/backports/R/backports.rdb' is corrupt
In addition: Warning messages:
1: In get0(oNam, envir = ns) : restarting interrupted promise evaluation
2: In get0(oNam, envir = ns) : internal error -3 in R_decompress1

Thanks in advance!

Issue Preparing input data sets - 3.1.2 Splice junctions file - Unsure if Sierra data or Regtools error?

Hi!

Love the idea of Sierra, and am looking forward to applying the pipeline to our data.

I've run the Sierra vignette example using the pre-formatted inputs. However, before working with our own larger data sets, I wanted to test generating the required splice junctions .bed file from the example data provided in the vignette.

I do want to preface that I'm unsure if the following is a Regtools error, or an issue with the Sierra data.

I pulled the current regtools docker image from their repo, and after downloading a fresh copy of the Vignette_example_TIP_sham.bam file from here, I ran the following:

PS C:> docker run griffithlab/regtools regtools junctions extract -s 1 C:\Vignette_example_TIP_sham.bam -o C:\testoutput.bed

Program: regtools
Version: 0.5.2
Minimum junction anchor length: 8
Minimum intron length: 70
Maximum intron length: 500000
Alignment: C:\Vignette_example_TIP_sham.bam
Output file: C:\testoutput.bed

[E::hts_open_format] fail to open file 'C:\Vignette_example_TIP_sham.bam'
Unable to open BAM/SAM file.

Is this a docker / regtools issue (i.e., not Sierra)? Or is this a Sierra data issue? Any help or pointers on how to resolve this issue would be great! If I can reproducibly run your vignette, then I am certain I can get it working with our data.

Andrew

failed re-building ‘Sierra_vignette.rmd’

E creating vignettes (36.4s)
--- re-building ‘Sierra_vignette.rmd’ using rmarkdown
Quitting from lines 297-298 (Sierra_vignette.rmd)
Error: processing vignette 'Sierra_vignette.rmd' failed with diagnostics:
argument is of length zero
--- failed re-building ‘Sierra_vignette.rmd’

Error counting UMI

Hi,
I am running Sierra CountPeaks() on ~80000 cells across 6 10x lanes, with a cell barcode structure like "R1-L1_NNNNNNNNNNNNNNNNNN" (R indicates replicate, L indicates lane). The parsed BAM is about 600GB and the number of peaks is ~180000.

Running
CountPeaks(peak.sites.file = peak.output.file, gtf.file = reference.file, bamfile = bamfile, whitelist.file = whitelist.bc.file, output.dir = count.dirs, countUMI = TRUE, ncores = 6)

leads to the following output and error:

There are 77835 whitelist barcodes.
Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
There are 184034 sites
Doing counting for each site...
Error in (function (classes, fdef, mtable) :
unable to find an inherited method for function 'writeMM' for signature '"NULL"'
In addition: Warning messages:
1: In .get_cds_IDX(mcols0$type, mcols0$phase) :
The "phase" metadata column contains non-NA values for features of type stop_codon, exon. This information was ignored.
2: In mclapply(argsList, FUN, mc.preschedule = preschedule, mc.set.seed = set.seed, :
scheduled cores 1, 2, 3, 4, 5, 6 did not deliver results, all values of the jobs will be affected
Registered S3 method overwritten by 'spatstat':
method from
print.boxx cli

Error when running CountPeaks

I get an error when using the CountPeaks function for one of my samples:

There are 14963 whitelist barcodes.
Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
There are 7  sites
Doing counting for each site...
Error in (function (classes, fdef, mtable)  : 
  unable to find an inherited method for function 'writeMM' for signature '"NULL"'
In addition: Warning message:
In .get_cds_IDX(mcols0$type, mcols0$phase) :
  The "phase" metadata column contains non-NA values for features of type stop_codon. This information
  was ignored.

Do you know what the problem might be? (I saw a similar issue posted by another user, but did not find a solution.)
Thanks !

Installation error on R.3.6.2

Hello, I got the following error after the installation.
Thank you for the help in advance.

devtools::install_github("VCCRI/Sierra", build = TRUE, build_vignettes = TRUE, build_opts = c("--no-resave-data", "--no-manual"))
Downloading GitHub repo VCCRI/Sierra@HEAD
✔ checking for file ‘/private/var/folders/28/jd9d90s50mdctp6_0jk1wthw0000gq/T/RtmpRyVoSj/remotesa75d3f6bab38/VCCRI-Sierra-ef71a45/DESCRIPTION’ ...
─ preparing ‘Sierra’:
✔ checking DESCRIPTION meta-information ...
─ installing the package to build vignettes
✔ creating vignettes (34s)
─ checking for LF line-endings in source and make files and shell scripts
─ checking for empty or unneeded directories
─ building ‘Sierra_0.99.22.tar.gz’

installing source package ‘Sierra’ ...
** using staged installation
** R
** inst
** byte-compile and prepare package for lazy loading
Warning: replacing previous import 'GenomicRanges::union' by 'dplyr::union' when loading 'Sierra'
Warning: replacing previous import 'GenomicRanges::intersect' by 'dplyr::intersect' when loading 'Sierra'
Warning: replacing previous import 'GenomicRanges::setdiff' by 'dplyr::setdiff' when loading 'Sierra'
Warning: replacing previous import 'Gviz::tail' by 'utils::tail' when loading 'Sierra'
Warning: replacing previous import 'Gviz::head' by 'utils::head' when loading 'Sierra'
** help
*** installing help indices
** building package indices
** installing vignettes
** testing if installed package can be loaded from temporary location
Warning: replacing previous import 'GenomicRanges::union' by 'dplyr::union' when loading 'Sierra'
Warning: replacing previous import 'GenomicRanges::intersect' by 'dplyr::intersect' when loading 'Sierra'
Warning: replacing previous import 'GenomicRanges::setdiff' by 'dplyr::setdiff' when loading 'Sierra'
Warning: replacing previous import 'Gviz::tail' by 'utils::tail' when loading 'Sierra'
Warning: replacing previous import 'Gviz::head' by 'utils::head' when loading 'Sierra'
** testing if installed package can be loaded from final location
Warning: replacing previous import 'GenomicRanges::union' by 'dplyr::union' when loading 'Sierra'
Warning: replacing previous import 'GenomicRanges::intersect' by 'dplyr::intersect' when loading 'Sierra'
Warning: replacing previous import 'GenomicRanges::setdiff' by 'dplyr::setdiff' when loading 'Sierra'
Warning: replacing previous import 'Gviz::tail' by 'utils::tail' when loading 'Sierra'
Warning: replacing previous import 'Gviz::head' by 'utils::head' when loading 'Sierra'
** testing if installed package keeps a record of temporary installation path
DONE (Sierra)

library(Sierra)
Error: package or namespace load failed for ‘Sierra’ in namespaceImportFrom(ns, loadNamespace(j <- i[[1L]], c(lib.loc, :
lazy-load database '/Library/Frameworks/R.framework/Versions/3.6/Resources/library/magrittr/R/magrittr.rdb' is corrupt
In addition: Warning messages:
1: In namespaceImportFrom(ns, loadNamespace(j <- i[[1L]], c(lib.loc, :
restarting interrupted promise evaluation
2: In namespaceImportFrom(ns, loadNamespace(j <- i[[1L]], c(lib.loc, :
internal error -3 in R_decompress1

Error in peak calling

I'm having issues trying to perform peak calling.
Here are the error messages I got:

Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
1 gene entries to process
There are 0 unfiltered sites and 0 filtered sites
Error in `$<-.data.frame`(`*tmp*`, "polyA_ID", value = "::-:") : 
  replacement has 1 row, data has 0
In addition: Warning message:
In .get_cds_IDX(mcols0$type, mcols0$phase) :
  The "phase" metadata column contains non-NA values for features of type stop_codon. This information
  was ignored.

Do you know what the problem could be?
Many thanks for your feedback
Best, Isa

GTF problem

Hello,
Can I run Sierra with a different GTF version from the one used for sequence alignment?
For example, use the Ensembl GRCm38 release 98 GTF file during alignment, then use the Ensembl GRCm38 release 102 GTF file when running Sierra.

Clarification on UMI counting

Hello,

I have a custom dataset that has multiple read pairs that share the same UMI, would the counting algorithm take into account those multiple peaks as 1 count for each of those exons/introns?

Best,
Chang
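
For reference, UMI-aware counting in general collapses reads that share the same cell barcode and UMI into a single molecule per feature. The sketch below illustrates that idea on a toy table in base R. The column names (`cell`, `umi`, `peak`) and the data are hypothetical, and this is not a description of how `CountPeaks(countUMI = TRUE)` works internally.

```r
# Toy read table: each row is one aligned read (hypothetical data).
reads <- data.frame(
  cell = c("AAAC", "AAAC", "AAAC", "TTTG", "TTTG"),
  umi  = c("UMI1", "UMI1", "UMI2", "UMI1", "UMI1"),
  peak = c("peakA", "peakA", "peakA", "peakA", "peakB"),
  stringsAsFactors = FALSE
)

# Collapse reads sharing the same cell barcode, UMI and peak into one molecule.
molecules <- unique(reads[, c("cell", "umi", "peak")])

# Count molecules per peak (rows) and cell (columns).
umi.counts <- table(molecules$peak, molecules$cell)
print(umi.counts)
```

In this toy example, a UMI whose reads fall in two different peaks contributes one count to each peak, while duplicate reads within the same peak are counted once.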

Using data with batch effects in Sierra

Thank you for your great work on Sierra. I have a question. Is it possible to use Sierra on multiple datasets with batch effects (for example, two 10X datasets from different experiments, or one 10X and one Smart-Seq2 dataset)? Do you think this kind of analysis would bias the results?

Thank you in advance.

Error in DetectUTRLengthShift.

Dear developing team,

When I run the mentioned function I get the following error.
Error in data.frame(PeakID = peaks.expressed, row.names = peaks.expressed.granges, : duplicate row.names: chr16:92085167-92085395:+
Btw, I'm using mm10 with the latest Ensembl (100) annotation, where three genes share this region:
"Mrps6:chr16:92085167-92085395:1" "Slc5a3:chr16:92085167-92085395:1" "Gm49711:chr16:92085167-92085395:1"

In fact, I get multiple such errors. These genomic regions correspond to areas shared by more than one gene in the annotation; one can be a pseudogene, an intronic region, or something else. The errors come from peak count tables that have the same peak annotated more than once (for a different gene each time). For example, two peaks:
"Lypla1:chr1:4845940-4847188:1" "Gm37988:chr1:4845940-4847188:1"
This probably qualifies as a bug, since such overlaps are possible in annotations, yet they are not accounted for.
I'm not sure there is a straightforward deterministic way to solve this. It would make sense for the algorithm to pick at least an exonic peak in this case, even if it picks a random exonic annotation.

best,
amisios
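
As a user-side workaround sketch (not a fix inside Sierra), peaks whose coordinates are annotated to more than one gene can be detected from the peak IDs before running the test. The example below assumes IDs of the form `Gene:chr:start-end:strand`, as quoted above, and gene names without colons. The `Xkr4` entry is a made-up non-duplicated peak added for contrast, and keeping the first annotation per coordinate is an arbitrary tie-break.

```r
# Peak IDs in the "Gene:chr:start-end:strand" form quoted in the report.
# The last entry is a made-up, non-duplicated peak for contrast.
peak.ids <- c("Mrps6:chr16:92085167-92085395:1",
              "Slc5a3:chr16:92085167-92085395:1",
              "Gm49711:chr16:92085167-92085395:1",
              "Lypla1:chr1:4845940-4847188:1",
              "Gm37988:chr1:4845940-4847188:1",
              "Xkr4:chr1:3214482-3216968:-1")

# Drop the leading gene name to obtain the bare coordinates.
coords <- sub("^[^:]+:", "", peak.ids)

# Coordinates that occur under more than one gene name.
shared <- unique(coords[duplicated(coords)])
print(shared)

# Keep only the first annotation per coordinate (arbitrary tie-break).
peak.ids.unique <- peak.ids[!duplicated(coords)]
print(peak.ids.unique)
```

The resulting `peak.ids.unique` vector could then be used to subset the rows of the peak count matrix and the annotation table, so that each coordinate appears once.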

DUTest() with genes containing ':' or space

Running DUTest() on a dataset with gene names containing colons (:) or spaces results in:

Error in `.rowNamesDF<-`(x, value = value) : 
  duplicate 'row.names' are not allowed
In addition: Warning message:
non-unique values when setting 'row.names':

Because DEXSeq removes spaces and colons in the gene names:

Warning messages:
1: In DEXSeq::DEXSeqDataSet(peak.matrix, sampleData = sampleTable,  :
  empty spaces or ':' characters were found either in your groupIDs or in your featureIDs, these will be removed from the identifiers

Causing mismatches between gene names and DEXSeq output.

This can be solved by changing differential_usage.R line 671 to:

# removing colons and spaces to match output of DEXSeq                                     
pid_gene_names <- gsub('[: ]', '', dexseq.feature.table$Gene_name)
rownames(dexseq.feature.table) <- paste0(pid_gene_names, ":", dexseq.feature.table$Peak_number)
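
The mismatch can be illustrated in base R without DEXSeq. In the sketch below, the feature table and its gene names are hypothetical, and the `gsub` call mimics the clean-up described in the DEXSeq warning rather than calling DEXSeq itself.

```r
# Hypothetical feature table with gene names containing ':' or spaces.
feature.table <- data.frame(
  Gene_name   = c("HLA:A", "HLA:A", "my gene"),
  Peak_number = c(1, 2, 1),
  stringsAsFactors = FALSE
)

# Identifiers built from the raw gene names.
raw.ids <- paste0(feature.table$Gene_name, ":", feature.table$Peak_number)

# Identifiers built after stripping ':' and spaces from the gene names,
# mimicking the clean-up described in the DEXSeq warning.
clean.names <- gsub("[: ]", "", feature.table$Gene_name)
clean.ids   <- paste0(clean.names, ":", feature.table$Peak_number)

# None of the raw identifiers match the cleaned ones.
print(raw.ids %in% clean.ids)

# Using the cleaned identifiers as row names keeps them unique.
rownames(feature.table) <- clean.ids
print(rownames(feature.table))
```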

multithreading is not working for apply_DEXSeq_test_seurat module

Hi Ralph,
Hope everything is well,
It seems that multithreading is not working for functions that use apply_DEXSeq_test_seurat (this only happens when multithreading is enabled).
Here is an example of the error I'm getting. I would appreciate any comments, as I need to speed up the process: I'm running hundreds of these tests. Thanks.
'''

> apa.res <- DetectAEU(peaks.object = peaks_so, 
+                      gtf_gr = gtf_gr,
+                      gtf_TxDb = gtf_TxDb,
+                      do.MAPlot = T,
+                      population.1 = cell1,
+                      population.2 = cell2,
                        ncores = 20)
[1] "10082 expressed peaks in feature types UTR3"
[1] "2232 genes detected with multiple peak sites expressed"
[1] "6803 individual peak sites to test"
converting counts to integer mode
[1] "Running DEXSeq test..."
Error: $ operator is invalid for atomic vectors
In addition: Warning message:
In DESeqDataSet(rse, design, ignoreRank = TRUE) :
 
 Error: $ operator is invalid for atomic vectors

'''

umap coordinates for NewPeakSeurat

Hi there, thank you for creating this interesting package!

I am attempting to create my peak-level Seurat object. I initially received an error when using PeakSeuratFromTransfer saying that there were no matching barcodes, despite arranging them in the same order when aggregating peaks (the combined prepared Seurat object has barcode suffixes _1-1, _2-1, etc., which is how I understood the Sierra barcode appending to work as well, but maybe I am incorrect?).

When I tried NewPeakSeurat I encountered another error. It seems to want tsne coordinates even though I am supplying umap coordinates in the function.

Here is what I did along with the error

umap.coords <- query[["proj.umap"]]@cell.embeddings

peaks.seurat <- NewPeakSeurat(peak.data = peak.counts,
                              annot.info = peak.annotations,
                              cell.idents = populations,
                              umap.coords = umap.coords,
                              min.cells = 0, min.peaks = 0)

[1] "Creating Seurat object with 35146 peaks and 60227 cells"
Warning: The following arguments are not used: row.names
[1] "Preparing feature table for DEXSeq"
Performing log-normalization
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
[1] "No t-SNE coodinates included"
Error in umap.coords[colnames(peaks.seurat), ] : subscript out of bounds

Any help would be appreciated!

Thanks
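
The `subscript out of bounds` error is consistent with cell barcodes in the peak matrix that are missing from the row names of the UMAP embedding. A minimal check in base R might look like the sketch below. The matrices are toy stand-ins, and the barcode suffixes are hypothetical rather than Sierra's or Seurat's actual naming scheme.

```r
# Toy stand-ins for the peak count matrix and the UMAP embedding.
# Barcode suffixes are hypothetical and differ between the two on purpose.
peak.counts <- matrix(0, nrow = 2, ncol = 3,
                      dimnames = list(c("peak1", "peak2"),
                                      c("AAAC-1-Sham", "TTTG-1-Sham", "GGGC-1-MI")))
umap.coords <- matrix(0, nrow = 3, ncol = 2,
                      dimnames = list(c("AAAC_1-1", "TTTG_1-1", "GGGC_2-1"),
                                      c("UMAP_1", "UMAP_2")))

# Barcodes present in the count matrix but absent from the embedding.
missing.cells <- setdiff(colnames(peak.counts), rownames(umap.coords))
print(length(missing.cells))

# Indexing the embedding by a missing name raises "subscript out of bounds".
res <- try(umap.coords[colnames(peak.counts), ], silent = TRUE)
print(inherits(res, "try-error"))
```

If `missing.cells` is non-empty on real data, renaming the barcodes on one side so both use the same suffix convention would be the first thing to try.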

Problems with DUTest

Hi guys,

Thanks for being so responsive and fixing all my issues. Unfortunately I have now found a new issue in the DUTest function when I run it using the vignette code for the SingleCellExperiment way. I get the following error message:

res.table = DUTest(peaks.sce, population.1 = "BalloonCells", population.2 = "Astrocytes",
+                    exp.thresh = 0.1, feature.type = c("exon"))
Error: $ operator not defined for this S4 class

This seems to be caused by line 1028 in differential_usage.R (get_expressed_peaks_sce function). I am pretty sure the line should be this.data <- peaks.sce.object@assays@data$counts.

Cheers,

Saskia

PlotCoverage error with 'zoom_3UTR=TRUE'

Hi,

Thanks for creating this great tool! I'm running into an error when using the zoom_3UTR=TRUE argument in the PlotCoverage function, e.g.:

PlotCoverage(genome_gr = gtf_gr,
             geneSymbol = "CXCL8",
             genome = "hg38",
             pdf_output = FALSE,
             bamfiles = c("Disease.CXCL8.bam", "Control.CXCL8.bam"),
             zoom_3UTR = TRUE,
             bamfile.tracknames = c("Disease", "Control"))

Error in h(simpleError(msg, call)) :
  error in evaluating the argument 'x' in selecting a method for function 'to': error in evaluating the argument 'query' in selecting a method for function 'findOverlaps': In range 1: at least two out of 'start', 'end', and 'width', must be supplied.

Do you know what could be causing this error? I am using R version 4.0.5 (2021-03-31), and Sierra_0.99.24.

Thanks

3'end database

Hi, your program is a very useful tool. I was wondering if there's a way to provide a 3' end database up front and only run the counting step, instead of calling peaks?

thanks

PlotRelativeExpressionBox looks for tSNE

The function PlotRelativeExpressionBox looks for tSNE cell embeddings, and throws an error if there is no tSNE run (I've only run a UMAP):

Error in PlotRelativeExpressionBox(peaks.seurat, peaks.to.plot = peaks.to.plot) : 
  trying to get slot "cell.embeddings" from an object of a basic class ("NULL") with no slots

As far as I understand, a boxplot shouldn't require tSNE coordinates.

Error in x$.self$finalize()

Hi Ralph,
Thanks for this awesome tool.
Can you please help me with this Error that happens in CountPeaks? >> "Error in x$.self$finalize() : attempt to apply non-function "

There are 3028 whitelist barcodes.
Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
There are 118455  sites
Doing counting for each site...
Processing chr: chrX
 and strand 1
Processing chr: chr20
 and strand 1
Processing chr: chr1
 and strand 1
Processing chr: chr6
 and strand 1
Processing chr: chr3
 and strand 1
Processing chr: chr7
 and strand 1
Processing chr: chr12
 and strand 1
Processing chr: chr11
 and strand 1
Processing chr: chr4
 and strand 1
Processing chr: chr17
 and strand 1
Processing chr: chr2
 and strand 1
Processing chr: chr16
 and strand 1
Processing chr: chr8
 and strand 1
Processing chr: chr19
 and strand 1
Processing chr: chr9
 and strand 1
Processing chr: chr13
 and strand 1
Processing chr: chr14
 and strand 1
Processing chr: chr5
 and strand 1
Processing chr: chr22
 and strand 1
Processing chr: chr10
 and strand 1
Processing chr: chrY
 and strand 1
Processing chr: chr18
 and strand 1
Processing chr: chr15
 and strand 1
Processing chr: chr21
and strand 1
Processing chr: chrM
 and strand 1
Processing chr: KI270713.1
 and strand 1
 and strand -1
Processing chr: KI270711.1
 and strand 1
 and strand -1
Processing chr: GL000205.2
 and strand 1
 and strand -1
Processing chr: KI270728.1
 and strand 1
 and strand -1
Processing chr: GL000219.1
 and strand 1
 and strand -1
Processing chr: KI270727.1
 and strand 1
 and strand -1
Processing chr: GL000194.1
 and strand 1
 and strand -1
Error in x$.self$finalize() : attempt to apply non-function
In addition: Warning message:
In .get_cds_IDX(mcols0$type, mcols0$phase) :
  The "phase" metadata column contains non-NA values for features of type
  stop_codon. This information was ignored.

Cheers,

PeakSeuratFromTransfer error

Hi,

I am trying out the package Sierra and I'd like to thank you for the really useful functions it has.
Nonetheless, I think I found a problem that I would like to bring to your attention.

I had trouble running the function PeakSeuratFromTransfer, which returned the error
Error in if (gene.cov != "") { : missing value where TRUE/FALSE needed
I ran the code as:
peaks_seurat <- PeakSeuratFromTransfer(peak.data = peak.counts, genes.seurat = seurat.object, annot.info = peak.annotations, min.cells = 1, min.peaks = 1, filter.gene.mismatch = T)

I think the issue was with the function AnnotatePeaksFromGTF, which introduced some NAs in the names of the annotated peaks (I am using BSgenome.Rnorvegicus.UCSC.rn6 as the reference genome), since I was able to solve the issue by filtering out NAs when reading in the result from AnnotatePeaksFromGTF. This is what I did, and I think a check on NAs would be helpful.
peak_annotations <- read.table("data/merged_annotated_peaks.txt", header = T, sep = "\t", row.names = 1, stringsAsFactors = FALSE) %>% filter(!is.na(gene_id))

Hope this is helpful. If this was just a mistake on my side, please close and delete the issue.

Best,
Nicola

CountPeaks

Tried the vignette and it worked perfectly and so I'm excited to try Sierra on other datasets. I was trying the 10X E18 v3 NextGem sample (https://support.10xgenomics.com/single-cell-gene-expression/datasets/3.0.2/5k_neuron_v3_nextgem) and got to the CountPeaks step, which stopped with the following error:

Processing chr: 1
and strand 1
[W::hts_idx_load2] The index file is older than the data file: 5k_neuron_v3_nextgem_possorted_genome_bam.bam.bai
Processing chr: 12
and strand 1
[W::hts_idx_load2] The index file is older than the data file: 5k_neuron_v3_nextgem_possorted_genome_bam.bam.bai
Processing chr: 19
and strand 1
[W::hts_idx_load2] The index file is older than the data file: 5k_neuron_v3_nextgem_possorted_genome_bam.bam.bai
Processing chr: MT
and strand 1
[W::hts_idx_load2] The index file is older than the data file: 5k_neuron_v3_nextgem_possorted_genome_bam.bam.bai
Error in { : task 1 failed - "1 elements in value to replace 0 elements"
In addition: Warning message:
In .get_cds_IDX(mcols0$type, mcols0$phase) :
The "phase" metadata column contains non-NA values for features of type stop_codon. This
information was ignored.

The previous step had seemingly completed normally:

Gene Chr Strand MaxPosition Fit.max.pos Fit.start Fit.end mu sigma k exon.intron
1 Gnai3 3 -1 108107503 108107434 108107280 108107683 231.8151 83.64921 4430.0914 non-juncs
2 Gnai3 3 -1 108112568 108112569 108111881 108118443 301.0178 101.11622 2918.5156 non-juncs
3 Gnai3 3 -1 108107807 108107849 108107726 108107972 342.2765 41.31000 570.5898 non-juncs
4 Gnai3 3 -1 108144083 108144086 108143879 108144293 303.7170 69.29031 451.9037 non-juncs
5 Gnai3 3 -1 108126299 108126308 108126083 108126533 309.5252 75.23088 376.4667 non-juncs
6 Gnai3 3 -1 108127845 108127856 108127652 108128060 311.5164 68.22383 326.3526 non-juncs
exon.pos
1
2 (108111881,108112087)(108112473,108112601)(108115763,108115890)(108118301,108118443)
3
4
5
6
There are 65746 unfiltered sites and 64181 filtered sites
There are 64132 sites following duplicate removal
Warning message:
In .get_cds_IDX(mcols0$type, mcols0$phase) :
The "phase" metadata column contains non-NA values for features of type stop_codon. This
information was ignored.

Any suggestions on what is happening?

Installation error

Hi everyone,
I'm trying to install Sierra and getting this Error.

"Error: Failed to install 'Sierra' from GitHub:
System command 'R' failed, exit status: 1, stdout + stderr (last 10 lines):
E> Quitting from lines 27-41 (Sierra_vignette.rmd)
E> Error: processing vignette 'Sierra_vignette.rmd' failed with diagnostics:
E> argument is of length zero
E> --- failed re-building ‘Sierra_vignette.rmd’
E>
E> SUMMARY: processing the following file failed:
E> ‘Sierra_vignette.rmd’
E>
E> Error: Vignette re-building failed.
E> Execution halted"

Seems you need to update the vignette file or do you guys have any idea?

Cheers
Aiden

Error in (function (classes, fdef, mtable)

Hi Sierra team,

I am trying to calculate the peak matrix from scRNA-seq data from a human tumor. The vignette runs well, but I get some problems when annotating peaks. The reference genome and GTF file were downloaded from 10x Genomics. The junction file was computed following the vignette, except that "-s" was set to 0, since the option is required. The error output and code are enclosed below.

Looking forward to your suggestion.

Regards,
Nelson

Commands to run regtools:

regtools junctions extract possorted_genome_bam.bam -o L28_junction.bed -s 0

Error:

> AnnotatePeaksFromGTF(peak.sites.file = peak.merge.output.file, 
+                      gtf.file = reference.file,
+                      output.file = "TIP_merged_peak_annotations.txt", 
+                      genome = genome)
Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
[1] "Annotating  65378  peak coordinates."

Annotating 3' UTRs
Annotating 5' UTRs
Annotating introns
Annotating exons
Annotating CDS
Error in (function (classes, fdef, mtable)  : 
  unable to find an inherited method for function 'getSeq' for signature '"character"'
In addition: Warning messages:
1: In for (i in seq_len(n)) { :
  closing unused connection 4 (/home/ot/R/x86_64-pc-linux-gnu-library/3.6/Sierra/extdatagenes.gtf)
2: In for (i in seq_len(n)) { :
  closing unused connection 3 (/home/ot/R/x86_64-pc-linux-gnu-library/3.6/Sierra/extdatagenes.gtf)
3: In .get_cds_IDX(mcols0$type, mcols0$phase) :
  The "phase" metadata column contains non-NA values for features of type
  stop_codon. This information was ignored.
4: In .Seqinfo.mergexy(x, y) :
  Each of the 2 combined objects has sequence levels not in the other:
  - in 'x': M
  - in 'y': MT, GL000009.2, GL000194.1, GL000195.1, GL000205.2, GL000213.1, GL000218.1, GL000219.1, KI270711.1, KI270713.1, KI270721.1, KI270726.1, KI270727.1, KI270728.1, KI270731.1, KI270734.1
  Make sure to always combine/compare objects based on the same reference
  genome (use suppressWarnings() to suppress this warning).
5: In .Seqinfo.mergexy(x, y) :
  Each of the 2 combined objects has sequence levels not in the other:
  - in 'x': M
  - in 'y': MT, GL000009.2, GL000194.1, GL000195.1, GL000205.2, GL000213.1, GL000218.1, GL000219.1, KI270711.1, KI270713.1, KI270721.1, KI270726.1, KI270727.1, KI270728.1, KI270731.1, KI270734.1
  Make sure to always combine/compare objects based on the same reference
  genome (use suppressWarnings() to suppress this warning).
6: In .Seqinfo.mergexy(x, y) :
  Each of the 2 combined objects has sequence levels not in the other:
  - in 'x': M
  - in 'y': MT, GL000009.2, GL000194.1, GL000195.1, GL000205.2, GL000213.1, GL000218.1, GL000219.1, KI270711.1, KI270713.1, KI270721.1, KI270726.1, KI270727.1, KI270728.1, KI270731.1, KI270734.1
  Make sure to always combine/compare objects based on the same reference
  genome (use suppressWarnings() to suppress this warning).
7: In .Seqinfo.mergexy(x, y) :
  Each of the 2 combined objects has sequence levels not in the other:
  - in 'x': M
  - in 'y': MT, GL000009.2, GL000194.1, GL000195.1, GL000205.2, GL000213.1, GL000218.1, GL000219.1, KI270711.1, KI270713.1, KI270721.1, KI270726.1, KI270727.1, KI270728.1, KI270731.1, KI270734.1
  Make sure to always combine/compare objects based on the same reference
  genome (use suppressWarnings() to suppress this warning).
8: In .Seqinfo.mergexy(x, y) :
  Each of the 2 combined objects has sequence levels not in the other:
  - in 'x': M
  - in 'y': MT, GL000009.2, GL000194.1, GL000195.1, GL000205.2, GL000213.1, GL000218.1, GL000219.1, KI270711.1, KI270713.1, KI270721.1, KI270726.1, KI270727.1, KI270728.1, KI270731.1, KI270734.1
  Make sure to always combine/compare objects based on the same reference
  genome (use suppressWarnings() to suppress this warning).
9: In .Seqinfo.mergexy(x, y) :
  Each of the 2 combined objects has sequence levels not in the other:
  - in 'x': M
  - in 'y': MT, GL000009.2, GL000194.1, GL000195.1, GL000205.2, GL000213.1, GL000218.1, GL000219.1, KI270711.1, KI270713.1, KI270721.1, KI270726.1, KI270727.1, KI270728.1, KI270731.1, KI270734.1
  Make sure to always combine/compare objects based on the same reference
  genome (use suppressWarnings() to suppress this warning).
10: In .Seqinfo.mergexy(x, y) :
  Each of the 2 combined objects has sequence levels not in the other:
  - in 'x': M
  - in 'y': MT, GL000009.2, GL000194.1, GL000195.1, GL000205.2, GL000213.1, GL000218.1, GL000219.1, KI270711.1, KI270713.1, KI270721.1, KI270726.1, KI270727.1, KI270728.1, KI270731.1, KI270734.1
  Make sure to always combine/compare objects based on the same reference
  genome (use suppressWarnings() to suppress this warning).
> library(Sierra)
> sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.3 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1

locale:
 [1] LC_CTYPE=en_HK.UTF-8       LC_NUMERIC=C               LC_TIME=en_HK.UTF-8       
 [4] LC_COLLATE=en_HK.UTF-8     LC_MONETARY=en_HK.UTF-8    LC_MESSAGES=en_HK.UTF-8   
 [7] LC_PAPER=en_HK.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=en_HK.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods  
[9] base     

other attached packages:
[1] GenomicRanges_1.38.0 GenomeInfoDb_1.22.0  IRanges_2.20.1       S4Vectors_0.24.1    
[5] BiocGenerics_0.32.0  Sierra_0.1.0        

loaded via a namespace (and not attached):
  [1] ProtGenerics_1.18.0         bitops_1.0-6                matrixStats_0.55.0         
  [4] flock_0.7                   bit64_0.9-7                 doParallel_1.0.15          
  [7] RColorBrewer_1.1-2          progress_1.2.2              httr_1.4.1                 
 [10] tools_3.6.1                 backports_1.1.5             R6_2.4.1                   
 [13] rpart_4.1-15                Hmisc_4.3-0                 DBI_1.0.0                  
 [16] lazyeval_0.2.2              Gviz_1.30.0                 colorspace_1.4-1           
 [19] nnet_7.3-12                 tidyselect_0.2.5            gridExtra_2.3              
 [22] prettyunits_1.0.2           bit_1.1-14                  curl_4.3                   
 [25] compiler_3.6.1              Biobase_2.46.0              htmlTable_1.13.2           
 [28] DelayedArray_0.12.0         rtracklayer_1.46.0          scales_1.1.0               
 [31] checkmate_1.9.4             askpass_1.1                 rappdirs_0.3.1             
 [34] stringr_1.4.0               digest_0.6.23               Rsamtools_2.2.1            
 [37] foreign_0.8-72              XVector_0.26.0              base64enc_0.1-3            
 [40] dichromat_2.0-0             pkgconfig_2.0.3             htmltools_0.4.0            
 [43] ensembldb_2.10.2            BSgenome_1.54.0             dbplyr_1.4.2               
 [46] htmlwidgets_1.5.1           rlang_0.4.2                 rstudioapi_0.10            
 [49] RSQLite_2.1.3               BiocParallel_1.20.0         acepack_1.4.1              
 [52] dplyr_0.8.3                 VariantAnnotation_1.32.0    RCurl_1.95-4.12            
 [55] magrittr_1.5                GenomeInfoDbData_1.2.2      Formula_1.2-3              
 [58] Matrix_1.2-18               Rcpp_1.0.3                  munsell_0.5.0              
 [61] lifecycle_0.1.0             stringi_1.4.3               SummarizedExperiment_1.16.0
 [64] zlibbioc_1.32.0             plyr_1.8.4                  BiocFileCache_1.10.2       
 [67] grid_3.6.1                  blob_1.2.0                  crayon_1.3.4               
 [70] lattice_0.20-38             Biostrings_2.54.0           splines_3.6.1              
 [73] GenomicFeatures_1.38.0      hms_0.5.2                   zeallot_0.1.0              
 [76] knitr_1.26                  pillar_1.4.2                reshape2_1.4.3             
 [79] codetools_0.2-16            biomaRt_2.42.0              XML_3.98-1.20              
 [82] glue_1.3.1                  biovizBase_1.34.0           latticeExtra_0.6-28        
 [85] BiocManager_1.30.10         data.table_1.12.6           foreach_1.4.7              
 [88] vctrs_0.2.0                 gtable_0.3.0                openssl_1.4.1              
 [91] purrr_0.3.3                 assertthat_0.2.1            ggplot2_3.2.1              
 [94] xfun_0.11                   AnnotationFilter_1.10.0     survival_2.44-1.1          
 [97] SingleCellExperiment_1.8.0  tibble_2.1.3                iterators_1.0.12           
[100] GenomicAlignments_1.22.1    AnnotationDbi_1.48.0        memoise_1.1.0              
[103] cluster_2.1.0

R script:

library(Sierra)

peak.output.file <- c("Vignette_example_TIP_sham_peaks.txt",
                      "Vignette_example_TIP_MI_peaks.txt")
FindPeaks(output.file = peak.output.file[2],   # output filename
          gtf.file = "genes.gtf",           # gene model as a GTF file
          bamfile = "./L07/possorted_genome_bam.bam",                # BAM alignment filename.
          junctions.file = "./L07/L07_junction.bed",     # BED filename of splice junctions existing in BAM file. 
          ncores = 4)                          # number of cores to use


FindPeaks(output.file = peak.output.file[1],   # output filename
          gtf.file = "genes.gtf",           # gene model as a GTF file
          bamfile = "./L28/possorted_genome_bam.bam",                # BAM alignment filename.
          junctions.file = "./L28/L28_junction.bed",     # BED filename of splice junctions existing in BAM file. 
          ncores = 4)   


### Merge data
### Read in the tables, extract the peak names and run merging ###

peak.dataset.table = data.frame(Peak_file = peak.output.file,
                                Identifier = c("TIP-example-Sham", "TIP-example-MI"), 
                                stringsAsFactors = FALSE)

peak.merge.output.file = "TIP_merged_peaks.txt"
MergePeakCoordinates(peak.dataset.table, output.file = peak.merge.output.file, ncores = 1)


### Count Peak
count.dirs <- c("example_TIP_sham_counts", "example_TIP_MI_counts")

#sham data set
CountPeaks(peak.sites.file = peak.merge.output.file, 
           gtf.file = "genes.gtf",
           bamfile = "./L28/possorted_genome_bam.bam", 
           whitelist.file = "./L28/barcodes.tsv.gz",
           output.dir = count.dirs[1], 
           countUMI = TRUE, 
           ncores = 4)

# MI data set
CountPeaks(peak.sites.file = peak.merge.output.file, 
           gtf.file = "genes.gtf",
           bamfile = "./L07/possorted_genome_bam.bam", 
           whitelist.file = "./L07/barcodes.tsv.gz",
           output.dir = count.dirs[2], 
           countUMI = TRUE, 
           ncores = 4)


### Integration
peak.merge.output.file <- "TIP_merged_peaks.txt"
count.dirs <- c("example_TIP_sham_counts", "example_TIP_MI_counts")


# New definition
out.dir <- "example_TIP_aggregate"

# Now aggregate the counts for both sham and MI treatments
AggregatePeakCounts(peak.sites.file = peak.merge.output.file,
                    count.dirs = count.dirs,
                    exp.labels = c("Sham", "MI"),
                    output.dir = out.dir)


### Annotation peak
# As previously defined
peak.merge.output.file <- "TIP_merged_peaks.txt"
reference.file <- "genes.gtf"

# New definitions
genome <- "/media/ot/Data/Nelson/refdata-cellranger-GRCh38-3.0.0/fasta"

AnnotatePeaksFromGTF(peak.sites.file = peak.merge.output.file, 
                     gtf.file = reference.file,
                     output.file = "TIP_merged_peak_annotations.txt", 
                     genome = genome)
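
The `getSeq` error reported above states that the method was called on a "character" object, and in this script `genome` is set to a directory path string. A plausible but unconfirmed reading is that `AnnotatePeaksFromGTF` expects a genome object here, as in the call elsewhere on this page that passes `BSgenome.Hsapiens.UCSC.hg38::BSgenome.Hsapiens.UCSC.hg38`. The helper below is hypothetical (it is not part of Sierra) and only sketches an early check on the argument type.

```r
# Hypothetical helper (not part of Sierra): stop early if 'genome' is a
# plain character string, such as a file path, rather than a genome object.
check_genome_arg <- function(genome) {
  if (is.character(genome)) {
    stop("'genome' is a character string; a genome object such as a ",
         "BSgenome is probably expected here.")
  }
  invisible(TRUE)
}

# A path string, as used in the script above, is rejected by the check.
res <- try(check_genome_arg("/path/to/fasta"), silent = TRUE)
print(inherits(res, "try-error"))
```

If a BSgenome object is used with 10x references, chromosome naming (for example `M` versus `MT`, as in the warnings above) may also need to be reconciled.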

Install failure

In trying to fix my other open issue, I wanted to reinstall Sierra. However on 4 different machines (3 windows and 1 ubuntu) I'm getting the same error (R version 3.6.1 and 3.6.2). On two of them, I have wiped my library directory clean, done a fresh install of R, and retried with the same error. Any suggestions?
Thanks again!

devtools::install_github("VCCRI/Sierra", build = TRUE, build_vignettes = TRUE, build_opts = c("--no-resave-data", "--no-manual"))
Downloading GitHub repo VCCRI/Sierra@master
✓ checking for file ‘/tmp/RtmprLptpC/remotes218c1f0f2761/VCCRI-Sierra-adca482/DESCRIPTION’ ...
─ preparing ‘Sierra’:
✓ checking DESCRIPTION meta-information ...
Warning: /tmp/Rtmper8hhG/Rbuild23fb72ed048e/Sierra/man/FindPeaks.Rd:25: unknown macro '\item'
Warning: /tmp/Rtmper8hhG/Rbuild23fb72ed048e/Sierra/man/FindPeaks.Rd:27: unknown macro '\item'
Warning: /tmp/Rtmper8hhG/Rbuild23fb72ed048e/Sierra/man/FindPeaks.Rd:29: unknown macro '\item'
Warning: /tmp/Rtmper8hhG/Rbuild23fb72ed048e/Sierra/man/FindPeaks.Rd:31: unknown macro '\item'
Warning: /tmp/Rtmper8hhG/Rbuild23fb72ed048e/Sierra/man/FindPeaks.Rd:33: unknown macro '\item'
Warning: /tmp/Rtmper8hhG/Rbuild23fb72ed048e/Sierra/man/FindPeaks.Rd:35: unexpected section header '\value'
Warning: /tmp/Rtmper8hhG/Rbuild23fb72ed048e/Sierra/man/FindPeaks.Rd:38: unexpected section header '\description'
Warning: /tmp/Rtmper8hhG/Rbuild23fb72ed048e/Sierra/man/FindPeaks.Rd:41: unexpected section header '\examples'
Warning: /tmp/Rtmper8hhG/Rbuild23fb72ed048e/Sierra/man/FindPeaks.Rd:45: unexpected END_OF_INPUT '
'
─ installing the package to build vignettes
✓ creating vignettes (30s)
─ checking for LF line-endings in source and make files and shell scripts
─ checking for empty or unneeded directories
─ building ‘Sierra_0.2.3.tar.gz’

Installing package into ‘/home/[username]/R/x86_64-pc-linux-gnu-library/3.6’
(as ‘lib’ is unspecified)

installing source package ‘Sierra’ ...
** using staged installation
** R
** inst
** byte-compile and prepare package for lazy loading
** help
Error : (converted from warning) /tmp/RtmphJPah2/R.INSTALL24699426d65/Sierra/man/FindPeaks.Rd:25: unknown macro '\item'
ERROR: installing Rd objects failed for package ‘Sierra’
removing ‘/home/[username]/R/x86_64-pc-linux-gnu-library/3.6/Sierra’
Error: Failed to install 'Sierra' from GitHub:
(converted from warning) installation of package ‘/tmp/RtmprLptpC/file218c9a1eee6/Sierra_0.2.3.tar.gz’ had non-zero exit status

Using STARSolo SJ.out.tab?

Hello,

I was wondering whether Sierra could accept splice junctions generated by STAR/STARsolo (the SJ.out.tab file).
I noticed that the junctions called differ slightly from those of regtools, while the data format is somewhat similar.
The format is outlined in the STAR manual.

Best,
Chang
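
As a rough sketch of what a conversion could look like, the base R code below turns rows in STAR's `SJ.out.tab` layout into a simple six-column BED table. The two input rows are made up, the `JUNC` names are arbitrary, and using the uniquely mapped read count as the BED score is an assumption. Whether `FindPeaks` accepts such a file through `junctions.file` has not been tested here.

```r
# Two made-up rows in STAR's SJ.out.tab layout (tab-separated, no header):
# chr, intron start (1-based), intron end (1-based), strand (0/1/2),
# motif, annotated flag, unique reads, multi-mapped reads, max overhang.
sj.lines <- c("1\t14830\t14969\t2\t2\t1\t25\t3\t40",
              "1\t15039\t15795\t1\t1\t0\t7\t0\t33")
sj <- read.table(text = sj.lines, sep = "\t", header = FALSE,
                 col.names = c("chr", "start", "end", "strand", "motif",
                               "annotated", "n.unique", "n.multi", "overhang"),
                 colClasses = c("character", rep("integer", 8)),
                 stringsAsFactors = FALSE)

# Map STAR strand codes to BED symbols: 0 = undefined, 1 = '+', 2 = '-'.
strand.map <- c("0" = ".", "1" = "+", "2" = "-")

bed <- data.frame(
  chrom  = sj$chr,
  start  = sj$start - 1L,   # BED starts are 0-based
  end    = sj$end,          # BED ends are exclusive, so the value is unchanged
  name   = paste0("JUNC", seq_len(nrow(sj))),
  score  = sj$n.unique,     # uniquely mapped reads used as the score (assumption)
  strand = unname(strand.map[as.character(sj$strand)]),
  stringsAsFactors = FALSE
)
print(bed)
```

This output carries intron coordinates only. A regtools junction file may contain additional columns, such as anchor blocks, so the two formats are not guaranteed to be interchangeable.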

AnnotatePeaks

Hi,

I am having problems annotating the peaks. I get the following error message:

AnnotatePeaksFromGTF(peak.sites.file = peak.merge.output.file, 
+                      gtf.file = reference.file,
+                      output.file = "data/peaks/merged_peak_annotations.txt", 
+                      genome = genome)
Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
[1] "Annotating  264857  peak coordinates."
Each of the 2 combined objects has sequence levels not in the other:
  - in 'x': M
  - in 'y': MT, GL000009.2, GL000194.1, GL000195.1, GL000205.2, GL000213.1, GL000218.1, GL000219.1, KI270711.1, KI270713.1, KI270721.1, KI270726.1, KI270727.1, KI270728.1, KI270731.1, KI270734.1
  Make sure to always combine/compare objects based on the same reference
  genome (use suppressWarnings() to suppress this warning).
No samples matched
Error: $ operator is invalid for atomic vectors

Here is my sessionInfo:

R version 3.6.1 (2019-07-05)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS:   /stornext/System/data/apps/R/R-3.6.1/lib64/R/lib/libRblas.so
LAPACK: /stornext/System/data/apps/R/R-3.6.1/lib64/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8   
 [6] LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] GenomicRanges_1.38.0 GenomeInfoDb_1.22.0  IRanges_2.20.2       S4Vectors_0.24.3     BiocGenerics_0.32.0  Sierra_0.2.3        

loaded via a namespace (and not attached):
  [1] ProtGenerics_1.16.0               bitops_1.0-6                      matrixStats_0.55.0                bit64_0.9-7                      
  [5] RColorBrewer_1.1-2                progress_1.2.2                    httr_1.4.1                        rstan_2.19.2                     
  [9] backports_1.1.5                   tools_3.6.1                       R6_2.4.1                          rpart_4.1-15                     
 [13] Hmisc_4.3-0                       DBI_1.1.0                         lazyeval_0.2.2                    Gviz_1.28.3                      
 [17] colorspace_1.4-1                  nnet_7.3-12                       tidyselect_1.0.0                  gridExtra_2.3                    
 [21] prettyunits_1.1.0                 processx_3.4.1                    curl_4.3                          bit_1.1-15.2                     
 [25] compiler_3.6.1                    cli_2.0.1                         Biobase_2.44.0                    htmlTable_1.13.3                 
 [29] DelayedArray_0.12.2               rtracklayer_1.46.0                checkmate_1.9.4                   scales_1.1.0                     
 [33] callr_3.4.0                       stringr_1.4.0                     digest_0.6.23                     Rsamtools_2.2.2                  
 [37] StanHeaders_2.21.0-1              foreign_0.8-75                    XVector_0.26.0                    dichromat_2.0-0                  
 [41] htmltools_0.4.0                   base64enc_0.1-3                   jpeg_0.1-8.1                      pkgconfig_2.0.3                  
 [45] ensembldb_2.8.1                   BSgenome_1.52.0                   htmlwidgets_1.5.1                 rlang_0.4.4                      
 [49] rstudioapi_0.10                   RSQLite_2.2.0                     BiocParallel_1.20.1               acepack_1.4.1                    
 [53] dplyr_0.8.3                       VariantAnnotation_1.30.1          inline_0.3.15                     RCurl_1.98-1.1                   
 [57] magrittr_1.5                      GenomeInfoDbData_1.2.2            Formula_1.2-3                     loo_2.2.0                        
 [61] Matrix_1.2-18                     Rcpp_1.0.3                        munsell_0.5.0                     fansi_0.4.1                      
 [65] lifecycle_0.1.0                   stringi_1.4.5                     SummarizedExperiment_1.16.1       zlibbioc_1.30.0                  
 [69] plyr_1.8.5                        pkgbuild_1.0.6                    grid_3.6.1                        blob_1.2.1                       
 [73] crayon_1.3.4                      lattice_0.20-38                   Biostrings_2.54.0                 splines_3.6.1                    
 [77] GenomicFeatures_1.36.4            hms_0.5.3                         BSgenome.Hsapiens.UCSC.hg38_1.4.1 knitr_1.24                       
 [81] ps_1.3.0                          pillar_1.4.3                      codetools_0.2-16                  biomaRt_2.40.5                   
 [85] XML_3.99-0.3                      glue_1.3.1                        biovizBase_1.32.0                 latticeExtra_0.6-29              
 [89] data.table_1.12.8                 BiocManager_1.30.10               foreach_1.4.7                     png_0.1-7                        
 [93] vctrs_0.2.2                       gtable_0.3.0                      purrr_0.3.3                       assertthat_0.2.1                 
 [97] ggplot2_3.2.1                     xfun_0.12                         AnnotationFilter_1.8.0            survival_3.1-8                   
[101] SingleCellExperiment_1.8.0        tibble_2.1.3                      iterators_1.0.12                  GenomicAlignments_1.22.1         
[105] AnnotationDbi_1.48.0              memoise_1.1.0                     cluster_2.1.0
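
The warning above reports chromosome names that differ between the two objects being combined ('M' on one side, 'MT' and several scaffolds on the other). A minimal base-R sketch of how such a mismatch could be listed and, if appropriate, renamed is given below. The vectors and the `rename.map` are toy assumptions, and the sketch does not claim that this mismatch is the cause of the final error.

```r
# Toy chromosome names; in practice these would come from the peak file and the GTF
peak.chr <- c("1", "2", "X", "M")
gtf.chr  <- c("1", "2", "X", "MT", "GL000009.2")

# Names present on one side only
only.in.peaks <- setdiff(peak.chr, gtf.chr)
only.in.gtf   <- setdiff(gtf.chr, peak.chr)
print(only.in.peaks)
print(only.in.gtf)

# Hypothetical renaming map applied to the peak-side names
rename.map <- c(M = "MT")
hits <- peak.chr %in% names(rename.map)
peak.chr.fixed <- peak.chr
peak.chr.fixed[hits] <- unname(rename.map[peak.chr[hits]])

# Peak-side names that are still missing from the annotation after renaming
print(setdiff(peak.chr.fixed, gtf.chr))
```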

Error in SplitBam (extra argument to Rsamtools::ScanBamParam)

Hi,

I'm following the vignette in the wiki and running the code in chunks. The following chunk raised an error:

outdir = "bam_subsets/"
dir.create(outdir)
SplitBam(bamfile[1], cells.df, outdir)

and the error reads:

Error in Rsamtools::ScanBamParam(tag = bamTags, what = what, tag = bamTags) :
formal argument "tag" matched by multiple actual arguments

Looking at the code in split_bams.R, all the function calls to Rsamtools::ScanBamParam contain the tag argument twice. Shouldn't the argument be passed just once?

Best,
Daniel
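
The error can be reproduced without Rsamtools: R refuses any call in which the same named argument is supplied twice. In the sketch below, `make_param` is a made-up stand-in for a function with a `tag` formal, and the tag and field values are illustrative only.

```r
# Stand-in for a function with a `tag` formal, such as Rsamtools::ScanBamParam
make_param <- function(tag = character(0), what = character(0)) {
  list(tag = tag, what = what)
}

bamTags <- c("CB", "UB")      # illustrative tag names
what <- c("qname", "flag")    # illustrative field names

# Passing `tag` twice fails during argument matching, before the body runs
res <- tryCatch(
  make_param(tag = bamTags, what = what, tag = bamTags),
  error = function(e) conditionMessage(e)
)
print(res)

# Passing it once works
param <- make_param(tag = bamTags, what = what)
str(param)
```

If the duplicated `tag` is indeed present in every call, dropping one of the two occurrences would be the natural fix.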

Error in AggregatePeakCounts()

Hello,
It's a really impressive tool for broadening single-cell analysis, and I would like to use it in my study. However, when I run

peak.merge.output.file <- "AML_ALL_merged_peaks.txt"
out.dir <- "/mnt/data/user_data/xiangyu/workshop/scRNA/scAPA/sierra/AML_all_merge/AML_all_merge"
AggregatePeakCounts(peak.sites.file = peak.merge.output.file,
                    count.dirs = count.dirs,
                    exp.labels = c("WBM","HSPC","T0","T1_L","T1_R","T2_2R","T2_N","Tend_1L","Tend_2L"),
                    output.dir = out.dir)

I get this error:

Error in intI(i, n = x@Dim[1], dn[[1]], give.dn = FALSE) :
  invalid character indexing

I have checked the source code

function (peak.sites.file, count.dirs, output.dir, exp.labels = NULL)
{
    if (!is.null(exp.labels)) {
        .......    
       peak.table <- read.table(peak.sites.file, sep = "\t", header = TRUE,
        stringsAsFactors = FALSE)
    all.peaks <- peak.table$polyA_ID
    aggregate.counts <- c()
    for (i in 1:length(count.dirs)) {
        this.dir <- count.dirs[i]
        this.data <- ReadPeakCounts(this.dir)
        cell.names = colnames(this.data)
        barcodes = sub("(.*)-\\d", "\\1", cell.names)
        if (is.null(exp.labels)) {
            cell.names.update = paste0(barcodes, "-", i)
        }
        else {
            cell.names.update = paste0(barcodes, "-", exp.labels[i])
        }
        colnames(this.data) = cell.names.update
        this.data <- this.data[all.peaks, ]
        aggregate.counts <- cbind(aggregate.counts, this.data)
    }
    if (!dir.exists(output.dir)) {
...........

and I found that the row names of this.data do not all match the peak IDs in all.peaks.
I generated all.peaks following your pipeline:

peak.dataset.table = data.frame(Peak_file = peak.output.file,
  Identifier = c("WBM","HSPC","T0","T1_L","T1_R","T2_2R","T2_N","Tend_1L","Tend_2L"), 
  stringsAsFactors = FALSE)
peak.merge.output.file = "AML_ALL_merged_peaks.txt"
MergePeakCoordinates(peak.dataset.table, output.file = peak.merge.output.file, ncores = 30)

Could you help me work it out?

I also tried another approach, using the following code:

names <- c("Too_BAOHONG","Too_HSPC","","T1_L","T1_R","T2_2R","T2_N","Tend_1L","Tend_2L")
new_obj <- list()
for (i in 1:length(count.dirs)) {
  this.dir <- count.dirs[i]
  this.data <- ReadPeakCounts(this.dir)
  cell.names = colnames(this.data)
  barcodes = sub("(.*)-\\d", "\\1", cell.names)
  cell.names.update = paste0(names[i],"_",barcodes)
  colnames(this.data) = cell.names.update
  new_obj[[i]] <- CreateSeuratObject(counts = this.data)
  message(names[i]," is done")
}

all_data <- merge(x=new_obj[[1]],y=new_obj[2:length(names)])
aggregate.counts <- GetAssayData(all_data,slot="counts")
peak.sites.file = peak.merge.output.file
peak.table <- read.table(peak.sites.file, sep = "\t", header = TRUE,stringsAsFactors = FALSE)
all.peaks <- peak.table$polyA_ID
both_id <- intersect(all.peaks,rownames(aggregate.counts))
output.dir = out.dir
aggregate.counts <- aggregate.counts[both_id,]
Matrix::writeMM(aggregate.counts, file = paste0(output.dir,"/matrix.mtx"))
writeLines(colnames(aggregate.counts), paste0(output.dir,"/barcodes.tsv"))
writeLines(rownames(aggregate.counts), paste0(output.dir,"/sitenames.tsv"))

I used Seurat's merge() function to merge the peak counts from all samples, and kept only the peaks present in both the MergePeakCoordinates() result and the Seurat merge() result.
I then continued with the next steps. Is this a valid approach?
Thanks!
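
The row-name mismatch described in this thread can be checked directly before subsetting. The sketch below uses a small sparse matrix from the Matrix package with made-up peak IDs and barcodes. It lists the peak IDs that are absent from a sample's matrix and then subsets on the shared IDs only, which avoids the indexing error but silently drops peaks.

```r
library(Matrix)

# Toy per-sample count matrix (peaks x cells); all names are made up
this.data <- Matrix(c(1, 0, 2, 0, 3, 0), nrow = 3, ncol = 2, sparse = TRUE,
                    dimnames = list(c("GeneA:1:100-200:1",
                                      "GeneA:1:500-600:1",
                                      "GeneB:2:50-90:-1"),
                                    c("AAAC-1", "AAAG-1")))
all.peaks <- c("GeneA:1:100-200:1", "GeneB:2:50-90:-1", "GeneC:3:10-40:1")

# Peak IDs requested but absent from this sample's matrix
missing.peaks <- setdiff(all.peaks, rownames(this.data))
print(missing.peaks)

# Subsetting on the shared IDs avoids indexing by names that do not exist
shared <- intersect(all.peaks, rownames(this.data))
subset.counts <- this.data[shared, , drop = FALSE]
print(dim(subset.counts))
```

A non-empty `missing.peaks` may be worth investigating before intersecting, since it could mean the counts were produced from a different peak file than the merged one.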

does Sierra use normalised peak counts?

Hi! Very excited about this package!

I just have a question: are the peak counts that Sierra gets from the PeakSeuratFromTransfer function normalised by gene-level counts or not?

If not, is there any reason why the absolute peak counts are used instead of peak counts relative to each gene? (basically junction PSI value)

Additionally, if the peak counts are not already normalised by gene expression, how would I go about doing the normalisation myself? I'm not very familiar with manipulating S4 objects.

Thank you! :)
Angel
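
As a rough illustration of the gene-relative normalisation asked about here, the sketch below divides each peak's count by the summed counts of all peaks of the same gene, per cell. It assumes a plain peak-by-cell matrix whose row names start with the gene symbol followed by a colon; the matrix is toy data, and this is not a description of what Sierra does internally.

```r
# Toy peak-by-cell count matrix; peak IDs are assumed to start with the gene name
counts <- matrix(c(4, 0,
                   6, 5,
                   1, 5,
                   3, 0),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(c("GeneA:1:100-200:1", "GeneA:1:500-600:1",
                                   "GeneB:2:50-90:-1", "GeneB:2:300-350:-1"),
                                 c("cell1", "cell2")))

# Gene name = text before the first colon
gene <- sub(":.*$", "", rownames(counts))

# Per-gene totals in each cell, expanded back to one row per peak
gene.totals <- rowsum(counts, group = gene)[gene, , drop = FALSE]

# Fraction of each gene's counts found at each peak
# (NaN wherever a gene has no counts at all in a cell)
rel.usage <- counts / gene.totals
print(round(rel.usage, 2))
```

For a sparse matrix taken from a Seurat object, the same idea applies, but the per-gene sums would need a sparse-aware aggregation rather than `rowsum`.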

Problems reading genes with a quote (') in the name

Hi,

I'm using Sierra on single cell data of Drosophila, and the gtf contains gene symbols with a quote in the name (e.g. beta'COP).

This causes problems in the FindPeaks function. As a last step, the output table is read back in for filtering, which produces these warnings:

Warning messages:
1: In scan(file = file, what = what, sep = sep, quote = quote, dec = dec,  :
  EOF within quoted string
2: In scan(file = file, what = what, sep = sep, quote = quote, dec = dec,  :
  number of items read is not a multiple of the number of columns

More importantly, the table is truncated on read: only the first ~4k of ~50k lines are read in.

I don't think you can expect quoted values in the input data table, so I think you can safely change (line 724; count_polyA.R)

 peak.sites <- read.table(peak.sites.file, header = T, sep = "\t",
                            stringsAsFactors = FALSE)

Into:

peak.sites <- read.table(peak.sites.file, header = T, sep = "\t", quote = '',
                         stringsAsFactors = FALSE)

I've sent a pull request
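
The effect of the proposed change can be checked on a small file. By default, read.table treats both " and ' as quote characters, so an apostrophe in a gene symbol opens a quoted string. The sketch below writes a toy table (the column names and values are made up) and reads it back with quote processing disabled.

```r
# Toy peak table containing a gene symbol with an apostrophe
tmp <- tempfile(fileext = ".txt")
writeLines(c("Gene\tChr\tStart",
             "beta'COP\tX\t100",
             "Act5C\tX\t200",
             "Rpl12\t2L\t300"), tmp)

# quote = "" disables quote processing, so the apostrophe is read as ordinary text
peak.sites <- read.table(tmp, header = TRUE, sep = "\t", quote = "",
                         stringsAsFactors = FALSE)
print(peak.sites)
```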

what do the genomic intervals in the results mean?

hi,

I have successfully run DUTest() on my dataset, and now I am trying to work out what a called DTU peak means in terms of differential junction usage.

The results table (return.dexseq.res = FALSE) successfully gives the gene names and genomic intervals like "RPL12:9:127449689-127451405:-1".

However, I found that this does not seem to refer to any single splice junctions which were in the BAM file. Are these genomic intervals supposed to describe the regions of the genome where local splicing variations are taking place?

I am more interested in finding the splice junctions contributing to each DTU peak detected. However, the intermediate output files from Sierra don't seem to have the information that allows me to at least trace back each differential peak to the junctions. Is there any other way I can achieve this? Or is my interpretation of the output wrong...?

Many Thanks
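
To make the row names easier to inspect, they can be split into separate fields. The sketch below assumes the layout gene:chromosome:start-end:strand, as in the example IDs quoted in these threads. The helper `parse_peak_ids` is hypothetical, and the sketch does not settle what the interval represents biologically.

```r
# Hypothetical helper: split peak IDs into fields,
# assuming the layout gene:chromosome:start-end:strand
parse_peak_ids <- function(ids) {
  parts <- strsplit(ids, ":", fixed = TRUE)
  stopifnot(all(lengths(parts) == 4))
  coords <- strsplit(vapply(parts, `[`, character(1), 3), "-", fixed = TRUE)
  data.frame(
    gene   = vapply(parts, `[`, character(1), 1),
    chr    = vapply(parts, `[`, character(1), 2),
    start  = as.integer(vapply(coords, `[`, character(1), 1)),
    end    = as.integer(vapply(coords, `[`, character(1), 2)),
    strand = vapply(parts, `[`, character(1), 4),
    stringsAsFactors = FALSE
  )
}

peaks <- parse_peak_ids(c("RPL12:9:127449689-127451405:-1",
                          "TMBIM6:12:50152188-50156873:1"))
peaks$width <- peaks$end - peaks$start + 1L
print(peaks)
```

With the coordinates in a table, each interval can be compared against junction coordinates from the BAM file or a junction file.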

Two Issues with AnnotatePeaksFromGTF

Dear developer team,

I have two issues when I run AnnotatePeaksFromGTF.

Issue 1:
When I run the function with genome=NULL I get an error:
Error in [.data.frame(peak.table, peaks.keep.idx, "exon.intron") : object 'peaks.keep.idx' not found
I looked in the Annotate.R script (master). In line 138 you have annot.df$Junctions <- peak.table[peaks.keep.idx, "exon.intron"].
This is where the error comes from: peaks.keep.idx is only defined inside the if (!is.null(genome) & isS4(genome)) block.
Is this function intended to be run exclusively with a genome?

Issue 2:
I only encountered the previous issue because there is an issue when I use a genome. The error is the following:
Error in strsplit(coord, split = ":") : non-character argument
This is coming from the Annotate.R script again, from line 561.
This is strange, because the example of the vignette works.
coord is a "row" of a GRanges object, so strsplit fails on it in my console (could I be using a different version of GenomicRanges than you?). The problem seems to be solved by setting coord = as.character(gr[i]), which appears in a comment but not in the actual code.

I would appreciate any quick feedback on how to proceed.
Thanks,
amisios
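
The second error can be illustrated without GenomicRanges, since strsplit() accepts only character vectors. In the sketch below a factor stands in for the non-character object (an assumption made for illustration; the thread describes an element of a GRanges object), and an explicit as.character() conversion lets the split succeed.

```r
# A factor stands in for any non-character object
coord <- factor("9:127449689-127451405:-")

# strsplit() only accepts character vectors, so this call raises an error
res <- tryCatch(strsplit(coord, split = ":"),
                error = function(e) conditionMessage(e))
print(res)

# Converting explicitly first works
fields <- strsplit(as.character(coord), split = ":")[[1]]
print(fields)
```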
