bimsbbioinfo / rcas

R package for the RNA Centric Annotation System (RCAS)

bioconductor interactive-plots par-clip cage rna modification reporting msigdb go protein-rna-interactions

rcas's Introduction

RCAS project

Introduction

RCAS is an R/Bioconductor package designed as a generic reporting tool for the functional analysis of transcriptome-wide regions of interest detected by high-throughput experiments. Such transcriptomic regions could be, for instance, signal peaks detected by CLIP-Seq analysis for protein-RNA interaction sites, RNA modification sites (alias the epitranscriptome), CAGE-tag locations, or any other collection of query regions at the level of the transcriptome. RCAS produces in-depth annotation summaries and coverage profiles based on the distribution of the query regions with respect to transcript features (exons, introns, 5’/3’ UTR regions, exon-intron boundaries, promoter regions). Moreover, RCAS can carry out functional enrichment analyses and discriminative motif discovery. RCAS supports all genome versions that are listed by BSgenome::available.genomes().

installation:

Installing from Bioconductor

if (!requireNamespace("BiocManager", quietly=TRUE)) install.packages("BiocManager")

BiocManager::install('RCAS')

Installing the development version from Github

library('devtools')
devtools::install_github('BIMSBbioinfo/RCAS')

Installing via Bioconda channel

conda install bioconductor-rcas -c bioconda

Installing via Guix

guix package -i r r-rcas

usage:

Package vignettes and reference manual

For detailed instructions on how to use RCAS, please see:

package vignette for single sample analysis
package vignette for multi-sample analysis
reference manual for more information about the detailed functions available in RCAS.

Use cases from published RNA-based omics datasets

Multi-sample analysis use case

See an example report comparing the peak regions discovered via CLIP-sequencing experiments of the RNA-binding protein FUS by Nakaya et al, 2013, Synaptic Functional Regulator FMR1 by Ascano et al. 2012, and Eukaryotic initiation factor 4A-III by Sauliere et al, 2012.

Single Sample Analysis Use Cases

RCAS report for PUM2 RNA-binding sites detected by PAR-CLIP technique (Hafner et al, 2010)
- input: PARCLIP_PUM2_Hafner2010b_hg19
RCAS report for QKI RNA-binding sites detected by PAR-CLIP technique (Hafner et al, 2010)
- input: PARCLIP_QKI_Hafner2010c_hg19
RCAS report for IGF2BP1, IGF2BP2 and IGF2BP3 RNA-binding sites detected by PAR-CLIP technique (Hafner et al, 2010)
- input: PARCLIP_IGF2BP123_Hafner2010d_hg19
RCAS report for tiny RNA (tiRNA) loci detected by deepCAGE analysis (Taft et al. 2009)
- input: human_FANTOM4_tiRNAs.bed
RCAS report for m¹A methylation sites (Dominissini et al, 2016)
- input: GSE70485_human_peaks.txt.gz

Citation

In order to cite RCAS, please use:

Bora Uyar, Dilmurat Yusuf, Ricardo Wurmus, Nikolaus Rajewsky, Uwe Ohler, Altuna Akalin; RCAS: an RNA centric annotation system for transcriptome-wide regions of interest. Nucleic Acids Res 2017 gkx120. doi: 10.1093/nar/gkx120

See our publication here.

Acknowledgements

RCAS is developed in the group of Altuna Akalin (head of the Scientific Bioinformatics Platform) by Bora Uyar (Bioinformatics Scientist), Dilmurat Yusuf (Bioinformatics Scientist) and Ricardo Wurmus (System Administrator) at the Berlin Institute of Medical Systems Biology (BIMSB) at the Max-Delbrueck-Center for Molecular Medicine (MDC) in Berlin.

RCAS is developed as a bioinformatics service as part of the RNA Bioinformatics Center, which is one of the eight centers of the German Network for Bioinformatics Infrastructure (de.NBI).

rcas's Issues

calculateCoverageProfileList argument clarification

calculateCoverageProfileList has an argument that accepts either a list or a GRangesList; this needs to be clearer. At the moment it looks like a plain list is expected, but a GRangesList should be used if the elements of the list are GRanges objects.

https://github.com/BIMSBbioinfo/RCAS/blob/rcas_R/rpackage/RCAS/R/report_functions.R#L351
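For readers unsure about the difference, the sketch below shows how a plain list of GRanges objects can be converted to a GRangesList with the GenomicRanges package. The coordinates and the names exons, introns and target_regions are made up for illustration, and the snippet does not call calculateCoverageProfileList itself, so which class that function ultimately expects should be checked against the installed RCAS version.

```r
library(GenomicRanges)

# Two toy GRanges objects (illustrative coordinates, not real annotation)
exons <- GRanges("chr1",
                 IRanges(start = c(100, 300), end = c(200, 400)),
                 strand = "+")
introns <- GRanges("chr1",
                   IRanges(start = 201, end = 299),
                   strand = "+")

# A plain R list whose elements are GRanges objects
plain_list <- list(exons = exons, introns = introns)

# Convert the plain list to a GRangesList; element names are kept
target_regions <- GRangesList(plain_list)

class(plain_list)       # base R list
class(target_regions)   # a GRangesList subclass
```

A GRangesList guarantees that every element is a GRanges object, which a plain list does not, so passing one makes the intent explicit.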

Error with `runReport`

Hi!

I'm trying to get an annotation summary for an m6A peak-calling result. My BED file is named "Mod.bed".
I tried to run the following but got an error, and I don't know why:

library(RCAS)
hg19_mod_path <- "/path/to/Mod.bed"
gff_path <- "/path/to/GRCh37_RefSeq_24.gff"
runReport( queryFilePath = hg19_mod_path,
           gffFilePath = gff_path,
           motifAnalysis = FALSE,
           goAnalysis = FALSE )

However, I can get enriched motifs without any problem by running the following code:

queryRegions <- importBed(filePath = hg19_mod_path, sampleN = 10000)
gff <- importGtf(filePath = "/path/to/GRCh37_RefSeq_24.gff")
motifResults <- runMotifDiscovery(queryRegions = queryRegions, 
                                  resizeN = 15, sampleN = 10000,
                                  genomeVersion = 'hg19', motifWidth = 5,
                                  motifN = 3, nCores = 5)
ggseqlogo::ggseqlogo(motifResults$matches_query)
summary <- getMotifSummaryTable(motifResults)
knitr::kable(summary)

Did I do anything wrong in how I used the code? Is there any way to avoid the error in runReport and get the annotation summary?

suggestions

I really love this tool and the output it gives, even for users like me.

From an RBP researcher's point of view, I have a few small suggestions:

- Normalization of overlaps, as is done in HOMER (I think it normalizes to the total length of the feature).
- Motif and GO term detection restricted to features, so that, for example, you can search for motifs only in 3' UTRs and introns and annotate only the GO terms related to those features.
- As a nice add-on to motif discovery, you could perhaps integrate secondary structure prediction tools.

However, this is a really nice tool and I hope my suggestions are useful!

Best,

Deniz
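One possible reading of the normalization suggested above is to divide the number of overlaps per feature type by the total length of that feature type. The base R sketch below uses invented counts and lengths, and the column names are hypothetical; it is not how RCAS or HOMER computes anything, only an illustration of the arithmetic.

```r
# Illustrative numbers only: overlap counts and total feature lengths
feature_stats <- data.frame(
  feature = c("fiveUTRs", "cds", "introns", "threeUTRs"),
  overlaps = c(120, 900, 3000, 1500),
  total_length_bp = c(2e6, 3.5e7, 1.2e9, 6e7),
  stringsAsFactors = FALSE
)

# Overlaps per kilobase of feature, so long features (e.g. introns)
# no longer dominate simply because they cover more sequence
feature_stats$overlaps_per_kb <-
  feature_stats$overlaps / (feature_stats$total_length_bp / 1000)

# Rank features by normalized overlap density
feature_stats[order(-feature_stats$overlaps_per_kb), ]
```

With these toy numbers, introns have the most raw overlaps but the lowest density, while 5' UTRs rank first after normalization.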

Increased variety of genomic annotations and genome versions

We plan to keep adding more and more useful plots and tables that can enrich the biological context of the given input datasets. For instance, plots that show the distribution of mutations and polymorphisms in relation to the input transcript segments could provide useful insights in prioritizing potential targets for follow-up studies. Moreover, we would like to provide support for an expanded collection of species (e.g. other eukaryotic model organisms such as Zebrafish) and additional genome builds for the currently supported species (e.g. hg38 for human, mm10 for mouse).

meta-analysis capability

As of RCAS version 1.1.1, the package is designed to prepare reports for one input dataset at a time. If a user wanted to do comparative analysis of multiple experimental datasets (for instance, to detect differences between case-control conditions), they would need to run the runReport function multiple times (once for each condition and setting the ‘printProcessedTables’ argument to TRUE). Using ‘printProcessedTables’ argument would enable the user to get the processed data for each experiment that can later be used for down-stream case-control comparisons. However, this would require additional scripting, which may not be ideal for non-programmer users. Therefore, in our next major release with Bioconductor 3.5, we plan to integrate functions that will enable meta-analysis of two or more input datasets at the same time to save the users from the need to do additional programming.
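The per-sample workflow described above could be scripted roughly as follows. This is a sketch under assumptions: the wrapper name run_reports_per_sample and its arguments are hypothetical, and it assumes the installed runReport accepts the queryFilePath, gffFilePath, outDir and printProcessedTables arguments. The function is only defined here, not run, because it needs RCAS and real input files.

```r
# Hypothetical wrapper (not part of RCAS): run one report per sample and
# keep the processed tables for later case-control comparisons.
run_reports_per_sample <- function(bed_paths, gff_path, out_dir = getwd()) {
  # bed_paths: named character vector, e.g. c(case = "case.bed", ...)
  for (sample_name in names(bed_paths)) {
    # One output folder per sample so reports do not overwrite each other
    sample_dir <- file.path(out_dir, sample_name)
    dir.create(sample_dir, recursive = TRUE, showWarnings = FALSE)
    RCAS::runReport(queryFilePath = bed_paths[[sample_name]],
                    gffFilePath = gff_path,
                    outDir = sample_dir,
                    printProcessedTables = TRUE)
  }
  # Return the output folders invisibly for downstream scripting
  invisible(file.path(out_dir, names(bed_paths)))
}

# Example call (paths are placeholders):
# run_reports_per_sample(c(case = "/path/to/case.bed",
#                          control = "/path/to/control.bed"),
#                        gff_path = "/path/to/annotation.gtf")
```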

line length shouldn't exceed 80 characters

This is important for some reviewers at BioC; they will comment on it if lines exceed 80 characters.
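A quick way to find offending lines is a small base R helper like the one below. The function name find_long_lines is hypothetical and the helper is not part of RCAS or of any Bioconductor tooling; it simply reports the line numbers and widths of lines longer than a given limit.

```r
# Hypothetical helper: list lines in a file that exceed max_width characters
find_long_lines <- function(path, max_width = 80L) {
  lines <- readLines(path, warn = FALSE)
  widths <- nchar(lines, type = "chars")
  too_long <- which(widths > max_width)
  data.frame(file = rep(path, length(too_long)),
             line = too_long,
             width = widths[too_long],
             stringsAsFactors = FALSE)
}

# Small demonstration on a temporary file
demo_file <- tempfile(fileext = ".R")
writeLines(c("x <- 1", strrep("y", 81)), demo_file)
long_lines <- find_long_lines(demo_file)

# Applied to a package, this could be run over all files in the R/ folder:
# r_files <- list.files("R", pattern = "\\.R$", full.names = TRUE)
# do.call(rbind, lapply(r_files, find_long_lines))
```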

Error in previously working function getFeatureBoundaryCoverage

Dear RCAS team,

First, thanks for creating the RCAS package, I use it frequently to interrogate my CLIP datasets, and it's extremely useful.

Recently, when I try to run the function getFeatureBoundaryCoverage, I get the following error:
invalid class "ScoreMatrix" object: superclass "mMatrix" not defined in the environment of the object's class

This happens in a code chunk that previously worked. I could not work out what is happening, and it does not seem to depend on my datasets.
Any idea what could be causing it?

Many thanks.

Best regards,
Raul

More fine-tuned control on the generated HTML reports

Another planned revision of the package is to provide users with more fine-tuned control over the reports that the runReport function generates. Currently, users can turn certain analysis modules on or off, but they do not have control over which plots and tables are generated in each module. We believe that providing such flexibility would be a useful feature for non-programmer users, especially considering that the package will keep expanding with a greater variety of plots and tables in each module.

potential bug in runMotifRG

runMotifRG doesn't report any motifs when nCores is set to '1', while it can reproducibly find the same motifs when nCores is set to 2 or more.

non-unique values when setting 'row.names': ‘Ighv1-13’, ‘Ighv5-8’

Hi,
when I use RCAS to process multiple samples:

WT_rep1_path <-"D:/WT_rep1.bed"
WT_rep2_path <-"D:/WT_rep2.bed"
WT_rep3_path <-"D:/WT_rep3.bed"
KO_rep1_path <-"D:/KO_rep1.bed"
KO_rep2_path <-"D:/KO_rep2.bed"
KO_rep3_path <-"D:/KO_rep3.bed"
projData <- data.frame('sampleName' = c('WT_1', 'WT_2', 'WT_3', 'KO_1', 'KO_2', 'KO_3'),
                       'bedFilePath' = c(WT_rep1_path, WT_rep2_path, WT_rep3_path,
                                         KO_rep1_path, KO_rep2_path, KO_rep3_path),
                       stringsAsFactors = FALSE)
projDataFile <- "D:/myProjDataFile.tsv"
write.table(projData, projDataFile, sep = '\t', quote =FALSE, row.names = FALSE)
gtfFilePath <- "D:/Mus_musculus.GRCm38.102.gtf"
databasePath <-"D:/myProject.sqlite"
createDB(dbPath = databasePath, projDataFile = projDataFile,
         gtfFilePath = gtfFilePath, genomeVersion = 'mm10',
         update = TRUE, motifAnalysis = FALSE)

It shows this error:
Importing GTF annotations
importing gtf file from D:/Mus_musculus.GRCm38.102.gtf
Keeping standard chromosomes only
File D:/Mus_musculus.GRCm38.102.gtf.granges.rds already exists.
Use overwriteObjectAsRds = TRUE to overwrite the file
Parsing transcript features
|++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=02m 50s
Saving interval datasets in 'bedData' table
Calculating annotation summaries
|++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=02m 28s
Saving annotation summaries in 'annotationSummaries' table
Running function: getIntervalOverlapMatrix for tablegeneOverlaps
|++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=14s
Error in .rowNamesDF<-(x, value = value) :
duplicate 'row.names' are not allowed
In addition: Warning message:
non-unique values when setting 'row.names': ‘Ighv1-13’, ‘Ighv5-8’

I think my row names are unique, so why does this happen?
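The error message itself can be reproduced in base R whenever a vector with repeated values is used as row names. The sketch below uses invented data and does not touch RCAS internals; it only illustrates one plausible cause, namely that the same gene name appears more than once in the annotation, and one generic workaround (make.unique). Whether this is what happens inside createDB for this GTF file is an assumption.

```r
# Toy gene names standing in for gene names taken from a GTF (invented data)
gene_names <- c("Ighv1-13", "Actb", "Ighv5-8", "Ighv1-13", "Ighv5-8")

# Names that occur more than once
dup_names <- unique(gene_names[duplicated(gene_names)])

# Toy overlap table with one row per gene record
overlaps <- data.frame(WT_1 = c(1, 0, 1, 0, 1),
                       KO_1 = c(0, 1, 1, 1, 0))

# Using the repeated names as row names fails; capture the error message
result <- tryCatch(
  suppressWarnings({
    rownames(overlaps) <- gene_names
    "assigned"
  }),
  error = function(e) conditionMessage(e)
)

# Generic workaround: make the names unique by appending suffixes
rownames(overlaps) <- make.unique(gene_names)
rownames(overlaps)
```

Running a check like dup_names on the gene names of the imported annotation would show whether ‘Ighv1-13’ and ‘Ighv5-8’ are repeated there rather than in the BED files.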

RCAS r package should have its own repository

I think this is a bit crowded. Ideally, the RCAS R package should have its own repository, but this is up for debate.

Issue with the `runReport` function.

Thank you for providing this tool. However, I have come across a small issue.
The following part of the code in the runReport function causes an error due to the changed structure of the BSgenome package.

  db <- checkSeqDb(genomeVersion)
  # get species name 
  # this is needed for gprofiler functional enrichment 
  fields <- unlist(strsplit(db@organism, ' '))

which gives the error

Loading required package: rtracklayer
Error in strsplit(db@organism, " ") : 
  no slot of name "organism" for this object of class "BSgenome"

This can be fixed by updating the line to

  db <- checkSeqDb(genomeVersion)
  # get species name 
  # this is needed for gprofiler functional enrichment 
  fields <- unlist(strsplit(db@metadata$organism, ' '))

I am not familiar with pull request policy, therefore I am raising it as an issue here.
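A version-tolerant variant of the fix above could check which slot is present before reading it. The sketch below is not RCAS code: get_organism is a hypothetical helper, and OldGenome and NewGenome are mock S4 classes that only imitate the two object layouts described in this issue, so the snippet runs without BSgenome installed.

```r
library(methods)

# Hypothetical helper: read the organism name from either object layout
get_organism <- function(db) {
  if (.hasSlot(db, "organism")) {
    return(db@organism)
  }
  if (.hasSlot(db, "metadata") && !is.null(db@metadata$organism)) {
    return(db@metadata$organism)
  }
  stop("no organism information found on object of class ", class(db)[1])
}

# Mock classes imitating the old and new layouts (not real BSgenome objects)
setClass("OldGenome", representation(organism = "character"))
setClass("NewGenome", representation(metadata = "list"))

old_db <- new("OldGenome", organism = "Homo sapiens")
new_db <- new("NewGenome", metadata = list(organism = "Homo sapiens"))

# Same downstream code works for both layouts
fields_old <- unlist(strsplit(get_organism(old_db), " "))
fields_new <- unlist(strsplit(get_organism(new_db), " "))
```

If the installed BSgenome provides an organism() accessor for these objects, calling that accessor instead of reading slots directly would likely be the more robust choice.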

check if input BED file is correctly formatted

If the BED file doesn't follow the specifications or contains incomplete columns (e.g. missing the "name" field), some functions may fail (e.g. queryGff).
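A pre-flight check along these lines could be sketched in base R as follows. The helper name check_bed_file is hypothetical and not part of RCAS, and the default of six required columns (chrom, start, end, name, score, strand) is an assumption about what downstream functions need.

```r
# Hypothetical helper: return a character vector of problems found in a
# BED file (empty vector means no problems were detected).
check_bed_file <- function(path, min_cols = 6L) {
  lines <- readLines(path, warn = FALSE)
  # Drop blank lines, comments, and UCSC "track"/"browser" header lines
  lines <- lines[nzchar(lines) & !grepl("^(#|track|browser)", lines)]
  if (length(lines) == 0L) {
    return("file contains no interval records")
  }
  fields <- strsplit(lines, "\t", fixed = TRUE)
  n_cols <- lengths(fields)
  problems <- character(0)
  if (any(n_cols < min_cols)) {
    problems <- c(problems,
                  sprintf("%d line(s) have fewer than %d columns",
                          sum(n_cols < min_cols), min_cols))
  }
  # Check coordinates on lines that have at least chrom, start, end
  has_coords <- fields[n_cols >= 3L]
  start <- suppressWarnings(
    as.numeric(vapply(has_coords, `[`, character(1), 2L)))
  end <- suppressWarnings(
    as.numeric(vapply(has_coords, `[`, character(1), 3L)))
  bad <- is.na(start) | is.na(end) | start < 0 | end < start
  if (any(bad)) {
    problems <- c(problems,
                  sprintf("%d line(s) have invalid start/end coordinates",
                          sum(bad)))
  }
  problems
}

# Demonstration on two small temporary files
good_bed <- tempfile(fileext = ".bed")
writeLines(c("chr1\t100\t200\tpeak1\t0\t+",
             "chr2\t5\t50\tpeak2\t0\t-"), good_bed)
bad_bed <- tempfile(fileext = ".bed")
writeLines(c("chr1\t100\t200",
             "chr2\t50\t5\tpeak2\t0\t-"), bad_bed)

good_problems <- check_bed_file(good_bed)
bad_problems <- check_bed_file(bad_bed)
```

Returning a vector of messages rather than stopping at the first problem lets a caller report everything wrong with the file in one pass.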