
suhrig / arriba


Fast and accurate gene fusion detection from RNA-Seq data

License: Other

Makefile 0.66% R 12.10% Shell 4.63% C++ 82.30% Dockerfile 0.30%
gene-fusions rna-seq cancer gene-fusion fusion-gene fusion-genes variant-calling structural-variation star virus-integration

arriba's Introduction

About

Arriba is a command-line tool for the detection of gene fusions from RNA-Seq data. It was developed for use in a clinical research setting, so short runtimes and high sensitivity were important design criteria. It is based on the ultrafast STAR aligner, and the post-alignment runtime is typically just ~2 minutes. Arriba's workflow produces fully reusable alignments, which can serve as input to other common analyses, such as quantification of gene expression. In contrast to many other fusion detection tools that build on STAR, Arriba does not require reducing the STAR parameter --alignIntronMax to detect fusions arising from focal deletions. Reducing this parameter impairs mapping of reads to genes with long introns and may therefore affect expression quantification.

Apart from gene fusions, Arriba can detect other structural rearrangements with potential clinical relevance, including viral integration sites, internal tandem duplications, whole exon duplications, intragenic inversions, enhancer hijacking events involving immunoglobulin/T-cell receptor loci, translocations affecting genes with many paralogs such as DUX4, and truncations of genes (i.e., breakpoints in introns or intergenic regions).

Arriba is the winner of the DREAM SMC-RNA Challenge, an international competition organized by ICGC, TCGA, IBM, and Sage Bionetworks to determine the current gold standard for the detection of gene fusions from RNA-Seq data. The final results of the challenge are posted on the Round 5 Leaderboard and discussed in the accompanying publication.

Get help

Use the GitHub issue tracker to get help or to report bugs.

Citation

Sebastian Uhrig, Julia Ellermann, Tatjana Walther, Pauline Burkhardt, Martina Fröhlich, Barbara Hutter, Umut H. Toprak, Olaf Neumann, Albrecht Stenzinger, Claudia Scholl, Stefan Fröhling and Benedikt Brors: Accurate and efficient detection of gene fusions from RNA sequencing data. Genome Research. March 2021 31: 448-460; Published in Advance January 13, 2021. doi: 10.1101/gr.257246.119

License

The code, software and database files of Arriba are distributed under the MIT/Expat License, with the exception of the script draw_fusions.R, which is distributed under the GNU GPL v3 due to dependencies on GPL-licensed R packages. The terms and conditions of both licenses can be found in the LICENSE file.

User manual

Please refer to the user manual for installation instructions and information about usage. Note: You should not use git clone to download Arriba, because the git repository does not include the blacklist and other database files!

  1. Quickstart

  2. Workflow

  3. Input files

  4. Output files

  5. Visualization

  6. Command line options

  7. Interpretation of results

  8. Utility scripts

  9. Current limitations

  10. Internal algorithm

arriba's People

Contributors

iainrb, kant, micknudsen, suhrig


arriba's Issues

-U not being passed to read subsampling stage

Hi, no matter what I set -U to (e.g. 1000), I still receive the runtime message:

Finding fusions and counting supporting reads (total=WARNING: Some fusions were subsampled, because they have more than 300 supporting reads 12129)

It would appear that the parameter is not being passed to the relevant subsampling routine.

Chimeric.out.sam vs Aligned.out.bam

Dear Arriba,
Inspecting events in IGV requires two files, Chimeric.out.sam and Aligned.out.bam, but run_arriba.sh generates only Aligned.out.bam. Is it possible to derive Chimeric.out.sam from Aligned.out.bam? Or is there an option to get Chimeric.out.sam as well as Aligned.out.bam from run_arriba.sh?

ERROR: could not find sequence of contig 'Y'

Hi,
I tried running Arriba using the following command -

/Users/chahat/Documents/DRG/STAR-master/bin/MacOSX_x86_64/STAR --runThreadN 3 --genomeDir /Users/chahat/Documents/DRG/STAR-master/genome --genomeLoad NoSharedMemory --readFilesIn /Volumes/bam/DRG/fastq_50/PhenoInfoAvailable/23T2L.fastq --outStd BAM_Unsorted --outSAMtype BAM Unsorted --outSAMunmapped Within --outBAMcompression 0 --outFilterMultimapNmax 1 --outFilterMismatchNmax 3 --chimSegmentMin 10 --chimOutType WithinBAM SoftClip --chimJunctionOverhangMin 10 --chimScoreMin 1 --chimScoreDropMax 30 --chimScoreJunctionNonGTAG 0 --chimScoreSeparation 1 --alignSJstitchMismatchNmax 5 -1 5 5 --chimSegmentReadGapMax 3 | /Users/chahat/Downloads/arriba_v1.2.0/arriba -x /dev/stdin -o fusions.tsv -O fusions.discarded.tsv -a /Volumes/bam/DRG/annotations/hg38.fa -g /Volumes/bam/DRG/annotations/gencode.v32.annotation.gtf.gz -b /Users/chahat/Downloads/arriba_v1.2.0/database/blacklist_hg38_GRCh38_2018-11-04.tsv.gz -T -P

Where, I downloaded the latest hg38 fasta file from http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/latest/hg38.fa.gz and the latest Gencode GTF annotation from ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_32/gencode.v32.annotation.gtf.gz

On running, I got -

[2020-01-15T19:26:31] Loading annotation from '/Volumes/bam/DRG/annotations/gencode.v32.annotation.gtf.gz'
[2020-01-15T19:27:11] Loading assembly from '/Volumes/bam/DRG/annotations/hg38.fa'
ERROR: could not find sequence of contig 'Y'

Any ideas why this error could be occurring?
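One way to narrow down an error like this is to compare the contig names in the annotation with those in the assembly. The following is a minimal sketch, not part of Arriba: the helper names are made up for illustration, and names are compared after dropping a leading `chr`, on the assumption (suggested only by the error message naming 'Y' rather than 'chrY') that the prefix is ignored.

```python
import gzip

def open_text(path):
    """Open a plain or gzip-compressed text file for reading."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "rt")

def fasta_contigs(lines):
    """Collect contig names from FASTA header lines ('>name description')."""
    return {line[1:].split()[0]
            for line in lines
            if line.startswith(">") and line[1:].strip()}

def gtf_contigs(lines):
    """Collect contig names from the first column of a GTF file."""
    return {line.split("\t", 1)[0]
            for line in lines
            if line.strip() and not line.startswith("#")}

def strip_chr(name):
    """Drop a leading 'chr' so that 'chrY' and 'Y' compare as equal (assumption)."""
    return name[3:] if name.startswith("chr") else name

def missing_contigs(gtf_lines, fasta_lines):
    """Return contigs that are annotated in the GTF but absent from the FASTA."""
    in_fasta = {strip_chr(c) for c in fasta_contigs(fasta_lines)}
    in_gtf = {strip_chr(c) for c in gtf_contigs(gtf_lines)}
    return sorted(in_gtf - in_fasta)

# Tiny in-memory example; for real files pass open_text(path) instead of lists.
fasta_lines = [">chr1 example\n", "ACGT\n", ">chrX\n", "ACGT\n"]
gtf_lines = [
    "#comment line\n",
    "\t".join(["chr1", "src", "gene", "1", "4", ".", "+", ".", 'gene_id "g1";']) + "\n",
    "\t".join(["chrY", "src", "gene", "1", "4", ".", "+", ".", 'gene_id "g2";']) + "\n",
]
print(missing_contigs(gtf_lines, fasta_lines))
```

If the list is non-empty, the assembly file does not contain every contig that the annotation refers to, which is worth checking before looking elsewhere.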

All fusions going to "fusions.discarded.tsv"

Hi,

First of all, thanks for this amazing tool, it has been very useful to predict some of the protein fusions of my samples.

I have used arriba before with a different batch of samples without any issue. However, the last sequencing samples I tried to run with Arriba gave me some warning/errors and the "fusions.tsv" file is empty.

The output:

Loading annotation from '/mnt/lustre/users/k/RNAseq_RD/ReferenceGenome/Homo_sapiens.GRCh38.77.gtf.gz'
Loading assembly from '/mnt/lustre/users/k/RNAseq_RD/ReferenceGenome/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz'
Reading chimeric alignments from '/dev/stdin' (total=86)
Filtering multi-mappers and single mates (remaining=86)
Detecting strandedness (no)
Annotating alignments
Filtering duplicates (remaining=75)
Filtering mates which do not map to interesting contigs (1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X Y) (remaining=74)
Estimating mate gap distributionWARNING: not enough chimeric reads to estimate mate gap distribution, using default values
Filtering read-through fragments with a distance <=10000bp (remaining=47)
Filtering inconsistently clipped mates (remaining=41)
Filtering breakpoints adjacent to homopolymers >=6nt (remaining=41)
Filtering fragments with small insert size (remaining=33)
Filtering alignments with long gaps (remaining=33)
Filtering fragments with both mates in the same gene (remaining=33)
Filtering fusions arising from hairpin structures (remaining=33)
Filtering reads with a mismatch p-value <=0.01 (remaining=31)
Filtering reads with low entropy (k-mer content >=60%) (remaining=29)
Finding fusions and counting supporting reads (total=27)
Merging adjacent fusion breakpoints (remaining=27)
Estimating expected number of fusions by random chance (e-value)
Filtering fusions with both breakpoints in adjacent non-coding/intergenic regions (remaining=27)
Filtering intragenic fusions with both breakpoints in exonic regions (remaining=27)
Filtering fusions with <2 supporting reads (remaining=5)
Filtering fusions with an e-value >=0.3 (remaining=5)
Filtering fusions with both breakpoints in intronic/intergenic regions (remaining=5)
Filtering PCR fusions between genes with an expression above the 99.8% quantile (remaining=5)
Searching for fusions with spliced split reads (remaining=5)
Selecting best breakpoints from genes with multiple breakpoints (remaining=3)
Searching for fusions with >=4 spliced events (remaining=3)
Filtering blacklisted fusions in '/users/k1470099/arriba_v1.1.0/database/blacklist_hg38_GRCh38_2018-11-04.tsv.gz' (remaining=1)
Filtering fusions with anchors <=23nt (remaining=1)
Filtering end-to-end fusions with low support (remaining=1)
Filtering fusions with no coverage around the breakpoints (remaining=1)
Indexing gene sequences
Filtering genes with >=30% identity (remaining=0)
Re-aligning chimeric reads to filter fusions with >=80% mis-mappers (remaining=0)
Selecting best breakpoints from genes with multiple breakpoints (remaining=0)
Searching for additional isoforms (remaining=0)
Assigning confidence scores to events
Writing fusions to file '/mnt/lustre/users/k/RNAseq_RD/output/fusions.tsv'
Writing discarded fusions to file '/mnt/lustre/users/k/RNAseq_RD/output/fusions.discarded.tsv'

Are there any parameters I should change? So far I have been using the default settings but as I said they work with previous samples.

The script I am running:
`#$ -S /bin/bash
#$ -o /mnt/lustre/users/k
#$ -e /mnt/lustre/users/k
#$ -l h_vmem=40G

module load bioinformatics/STAR/2.7.0f

STAR --runThreadN 8 \
--runMode alignReads \
--genomeDir /mnt/lustre/users/k/RNAseq_RD/ReferenceGenome \
--readFilesIn /mnt/lustre/users/k/RNAseq_RD/R1_001_1.fastq.gz /mnt/lustre/users/k1470099/RNAseq_RD/R2_001_2.fastq.gz --readFilesCommand zcat \
--outStd BAM_Unsorted --outSAMtype BAM Unsorted --outSAMunmapped Within --outBAMcompression 0 \
--outFilterMultimapNmax 1 --outFilterMismatchNmax 3 \
--chimSegmentMin 10 --chimOutType WithinBAM SoftClip --chimJunctionOverhangMin 10 --chimScoreMin 1 --chimScoreDropMax 30 --chimScoreJunctionNonGTAG 0 --chimScoreSeparation 1 --alignSJstitchMismatchNmax 5 -1 5 5 --chimSegmentReadGapMax 3 --winBinNbits 15 |
/users/k/arriba_v1.1.0/./arriba -x /dev/stdin \
-g /mnt/lustre/users/k/RNAseq_RD/ReferenceGenome/Homo_sapiens.GRCh38.77.gtf.gz -a /mnt/lustre/users/k/RNAseq_RD/ReferenceGenome/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz -b /users/k/arriba_v1.1.0/database/blacklist_hg38_GRCh38_2018-11-04.tsv.gz \
-o /mnt/lustre/users/k/RNAseq_RD/output/tes_fusions.tsv -O /mnt/lustre/users/k/RNAseq_RD/output/tes_fusions.discarded.tsv `

Thanks in advance!
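To see at a glance which filter removes the most candidates in a log like the one in this report, the `(total=N)` and `(remaining=N)` counters can be parsed. This is only an illustrative sketch that relies on the line format shown in the output above; the function names are hypothetical and nothing here is part of Arriba. A `total=` line starts a new count (reads before it, fusion candidates after it), so no drop is computed across that boundary.

```python
import re

# Matches e.g. "Filtering fusions with <2 supporting reads (remaining=5)"
STEP = re.compile(r"^(?P<step>.*?)\s*\((?P<kind>remaining|total)=(?P<count>\d+)\)\s*$")

def parse_counts(log_text):
    """Return (step, kind, count) for every log line that ends with a counter."""
    steps = []
    for line in log_text.splitlines():
        match = STEP.match(line)
        if match:
            steps.append((match.group("step"), match.group("kind"),
                          int(match.group("count"))))
    return steps

def biggest_drops(steps, top=3):
    """Return up to `top` (drop, step) pairs, largest drop first."""
    drops = []
    for previous, current in zip(steps, steps[1:]):
        # Only 'remaining' lines continue the previous count.
        if current[1] == "remaining" and previous[2] > current[2]:
            drops.append((previous[2] - current[2], current[0]))
    return sorted(drops, key=lambda item: item[0], reverse=True)[:top]

# A few lines copied from the log in this report.
log = """\
Finding fusions and counting supporting reads (total=27)
Merging adjacent fusion breakpoints (remaining=27)
Filtering fusions with <2 supporting reads (remaining=5)
Selecting best breakpoints from genes with multiple breakpoints (remaining=3)
Filtering blacklisted fusions in '/users/k1470099/arriba_v1.1.0/database/blacklist_hg38_GRCh38_2018-11-04.tsv.gz' (remaining=1)
Filtering genes with >=30% identity (remaining=0)
"""

for drop, step in biggest_drops(parse_counts(log)):
    print(f"-{drop}\t{step}")
```

For the excerpt above, the largest drop is at the filter for fusions with fewer than 2 supporting reads, which matches the small number of chimeric reads (86) at the start of the log.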

Arriba error while running STAR outputs

Hi! I am trying to use arriba, but I get some errors:

Loading assembly from 'genome.fna'
ERROR: could not find sequence of contig '10'

That was my Arriba command-line:
arriba -c S9Chimeric.out.junction -x S9Aligned.sortedByCoord.out.bam -g genannotation.gtf -a genome.fna -f blacklist -o fusions.tsv -O fusions.discarded.tsv-d S9_sorted.vcf -s auto -V 0.1 -T -P -I
I was wondering whether the problem is the chromosome names in my files, so I tried stating them with the -i option for contigs, but without success. Do you have any clue what is going on?

Thanks in Advance!

Deterministic behavior

Hi,

Thanks a lot for providing this fantastic, well documented tool.
Would it be possible to add a seed option to enable deterministic behavior of the subsampling routine (-U)?

Also, I have repeatedly run a few samples (v.0.12) with identical settings and most differences can possibly be explained by subsampling differences. There is, however, one difference in the confidence of a fusion with only 2 supporting reads that I cannot explain:

Run 1:
WDTC1 TBCEL +/+ +/+ 1:27561443 11:120957487 splice-site splice-site translocation downstream upstream 1 1 0 low . . mismatches(1) GCGCCCCCCcTCCCGGGAGAGGGGCCGCCCCCCCCGGACGGACATGGGCTCCTGAAGTTGCGCCGCTGCCGGTCGGGGGAAGAGACCTGACAG|GTATCATGAACTGATCACTAAATATGGGAAGTTGGAGCCTTTGGCAGAAGTGGACCTAAGACCCCAGAGCAGTGCAAAAGTAGAAGTCCACTTTAACGATCAGGTGGAAGAAATGAGCATTCGTCTGGACCAAACAGTGGCA .
Run 2:
WDTC1 TBCEL +/+ +/+ 1:27561443 11:120957487 splice-site splice-site translocation downstream upstream 1 1 0 medium . . mismatches(1) GCGCCCCCCcTCCCGGGAGAGGGGCCGCCCCCCCCGGACGGACATGGGCTCCTGAAGTTGCGCCGCTGCCGGTCGGGGGAAGAGACCTGACAG|GTATCATGAACTGATCACTAAATATGGGAAGTTGGAGCCTTTGGCAGAAGTGGACCTAAGACCCCAGAGCAGTGCAAAAGTAGAAGTCCACTTTAACGATCAGGTGGAAGAAATGAGCATTCGTCTGGACCAAACAGTGGCA .

Why is the same breakpoint once assigned with low- and once with medium confidence?
Is there a way to achieve completely reproducible results?
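For illustration, this is what a seeded subsampling step could look like in general. It is a sketch of the requested behavior, not Arriba's implementation: the `seed` parameter and the function name are hypothetical, and the limit of 300 is taken from the warning message quoted in another issue on this page.

```python
import random

def subsample(read_ids, max_reads, seed=None):
    """Return at most max_reads read IDs.

    Sorting first makes the result independent of the input order,
    and a fixed seed makes the random choice repeatable.
    """
    reads = sorted(read_ids)
    if len(reads) <= max_reads:
        return reads
    rng = random.Random(seed)  # private generator, does not touch global state
    return sorted(rng.sample(reads, max_reads))

reads = [f"read{i}" for i in range(1000)]
first = subsample(reads, 300, seed=42)
second = subsample(reversed(reads), 300, seed=42)
print(first == second)  # True: same seed, same selection
```

With `seed=None` the generator is seeded from the system, so two runs may pick different reads, which is the kind of run-to-run difference described above.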

Many thanks!

Using Draw.fusion.r with Input from others callers ?

Hello,
First, thanks for this tool, it's really quick and useful.

I'm working on diagnostic fusion calling. I've been asked to be as sensitive as possible, even at the cost of false positives. To this end, I use several callers (StarFusion & FusionCatcher). I was wondering whether it is possible to use the tool Draw.fusion.r with other inputs?
I know it won't be the same quality, since your fusion calls include a lot of information that the others don't.
Or maybe the other way around: supply a list of already-called fusions, a bit like the "-k" argument.

Best,

Conda recipe

Would it be possible to wrap the tool into a conda package and upload it to bioconda? I am working on an rnafusion pipeline which consists of multiple tools for fusion detection. It would be super nice to add your tool to the stack 🎉

memory corruption

Writing fusions to file 'fusions.tsv'
*** Error in `arriba': free(): invalid size: 0x00002aab3eab6010 ***
======= Backtrace: =========
/lib64/libc.so.6(+0x7c503)[0x2aaaab784503]
arriba[0x409589]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x2aaaab729b35]
arriba[0x405779]
======= Memory map: ========
00400000-00515000 r-xp 00000000 00:2e 3054027100                         /research/rgs01/home/clusterHome/edavis5/.local/bin/arriba
00715000-00717000 rw-p 00115000 00:2e 3054027100                         /research/rgs01/home/clusterHome/edavis5/.local/bin/arriba
00717000-75b4b5000 rw-p 00000000 00:00 0                                 [heap]
2aaaaaaab000-2aaaaaacb000 r-xp 00000000 08:02 204877411                  /usr/lib64/ld-2.17.so
2aaaaaacb000-2aaaaaacd000 r-xp 00000000 00:00 0                          [vdso]
2aaaaaacd000-2aaaaaacf000 rw-p 00000000 00:00 0
2aaaaaae5000-2aaaaaaeb000 rw-p 00000000 00:00 0
2aaaaacca000-2aaaaaccb000 r--p 0001f000 08:02 204877411                  /usr/lib64/ld-2.17.so
2aaaaaccb000-2aaaaaccc000 rw-p 00020000 08:02 204877411                  /usr/lib64/ld-2.17.so
2aaaaaccc000-2aaaaaccd000 rw-p 00000000 00:00 0
2aaaaaccd000-2aaaaadb6000 r-xp 00000000 08:02 206003601                  /usr/lib64/libstdc++.so.6.0.19
2aaaaadb6000-2aaaaafb5000 ---p 000e9000 08:02 206003601                  /usr/lib64/libstdc++.so.6.0.19
2aaaaafb5000-2aaaaafbd000 r--p 000e8000 08:02 206003601                  /usr/lib64/libstdc++.so.6.0.19
2aaaaafbd000-2aaaaafbf000 rw-p 000f0000 08:02 206003601                  /usr/lib64/libstdc++.so.6.0.19
2aaaaafbf000-2aaaaafd4000 rw-p 00000000 00:00 0
2aaaaafd4000-2aaaab0d4000 r-xp 00000000 08:02 205237963                  /usr/lib64/libm-2.17.so
2aaaab0d4000-2aaaab2d4000 ---p 00100000 08:02 205237963                  /usr/lib64/libm-2.17.so
2aaaab2d4000-2aaaab2d5000 r--p 00100000 08:02 205237963                  /usr/lib64/libm-2.17.so
2aaaab2d5000-2aaaab2d6000 rw-p 00101000 08:02 205237963                  /usr/lib64/libm-2.17.so
2aaaab2d6000-2aaaab2eb000 r-xp 00000000 08:02 206003591                  /usr/lib64/libgcc_s-4.8.5-20150702.so.1
2aaaab2eb000-2aaaab4ea000 ---p 00015000 08:02 206003591                  /usr/lib64/libgcc_s-4.8.5-20150702.so.1
2aaaab4ea000-2aaaab4eb000 r--p 00014000 08:02 206003591                  /usr/lib64/libgcc_s-4.8.5-20150702.so.1
2aaaab4eb000-2aaaab4ec000 rw-p 00015000 08:02 206003591                  /usr/lib64/libgcc_s-4.8.5-20150702.so.1
2aaaab4ec000-2aaaab503000 r-xp 00000000 08:02 205304005                  /usr/lib64/libpthread-2.17.so
2aaaab503000-2aaaab702000 ---p 00017000 08:02 205304005                  /usr/lib64/libpthread-2.17.so
2aaaab702000-2aaaab703000 r--p 00016000 08:02 205304005                  /usr/lib64/libpthread-2.17.so
2aaaab703000-2aaaab704000 rw-p 00017000 08:02 205304005                  /usr/lib64/libpthread-2.17.so
2aaaab704000-2aaaab708000 rw-p 00000000 00:00 0
2aaaab708000-2aaaab8be000 r-xp 00000000 08:02 204940282                  /usr/lib64/libc-2.17.so
2aaaab8be000-2aaaababe000 ---p 001b6000 08:02 204940282                  /usr/lib64/libc-2.17.so
2aaaababe000-2aaaabac2000 r--p 001b6000 08:02 204940282                  /usr/lib64/libc-2.17.so
2aaaabac2000-2aaaabac4000 rw-p 001ba000 08:02 204940282                  /usr/lib64/libc-2.17.so
2aaaabac4000-2aaaae0a1000 rw-p 00000000 00:00 0
2aaab00a1000-2aab3c0a1000 rw-p 00000000 00:00 0
2aab3eab6000-2aab41e84000 rw-p 00000000 00:00 0
2aab440a1000-2aaba40a1000 rw-p 00000000 00:00 0
2aabc44a8000-2aabe79be000 rw-p 00000000 00:00 0
2aacac0a1000-2aaccc0a1000 rw-p 00000000 00:00 0
7ffffffdc000-7ffffffff000 rw-p 00000000 00:00 0                          [stack]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]
/lsf_jobspool/1559613367.79853720: line 8: 17716 Aborted                 (core dumped) arriba -c Chimeric.out.sam -x Aligned.sortedByCoord.out.bam -g /home/edavis5/arriba_v1.1.0/RefSeq_hg19.gtf -o fusions.tsv -b /home/edavis5/arriba_v1.1.0/database/blacklist_hg19_hs37d5_GRCh37_2018-11-04.tsv.gz -a /home/edavis5/arriba_v1.1.0/hg19.fa

file batch analysis

Dear arriba developers,
Is there an option to analyze multiple samples at a time?

Thank you,
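Whether Arriba itself offers a multi-sample mode is not stated here. Independent of that, a simple loop over the input files can build one command per sample. The sketch below uses only the options that appear in the command lines quoted elsewhere on this page (-x, -a, -g, -b, -o, -O); the file names and the output layout are made up for illustration.

```python
from pathlib import Path

def arriba_command(bam, out_dir, assembly, annotation, blacklist):
    """Build the argument list for one sample (hypothetical output naming)."""
    sample = Path(bam).name.split(".")[0]
    out = Path(out_dir)
    return [
        "arriba",
        "-x", str(bam),
        "-a", str(assembly),
        "-g", str(annotation),
        "-b", str(blacklist),
        "-o", str(out / f"{sample}.fusions.tsv"),
        "-O", str(out / f"{sample}.fusions.discarded.tsv"),
    ]

# Placeholder inputs; replace with real paths.
bams = ["sampleA.bam", "sampleB.bam"]
commands = [
    arriba_command(bam, "results", "hg38.fa", "annotation.gtf", "blacklist.tsv.gz")
    for bam in bams
]

for command in commands:
    print(" ".join(command))
    # To actually run each sample, one could use:
    # import subprocess; subprocess.run(command, check=True)
```

Passing the arguments as a list avoids shell quoting problems with unusual file names.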

Event Prediction Specificity if Many Overlapping Pairs of Reads

I have a data set consisting of read pairs, each read being 150 bases long. The median insert size is 140 bases, so more than half of the read pairs in a sample overlap each other completely (I have done the essential adapter trimming with cutadapt to remove TruSeq adapters from the read ends). It seems like arriba would have problems with this data set:

... and three alignments for split reads (alignments of the first and second read and a supplementary alignment of the clipped segment) ... Fragments with too few or too many alignments are removed.

If two reads that completely overlap each other have a split in them, the split would appear in both reads and so there would be four alignments and it sounds like they would be discarded by arriba. Is it so?

Also, I notice many chimeric read pairs at a low level all across many different genes. I read on SEQanswers that fragmenting the RNA to very short lengths causes short fragments to randomly ligate to each other. Is there a way for arriba to handle this 'background level' of chimeras?

separate IDs of split-reads from those of discordant-reads

Hi suhrig,
First, thank you so much for making such a wonderful tool, fast and sensitive.
Secondly, I have a suggestion regarding the read_identifiers column in the result file. Could you please separate the IDs of split reads from the IDs of discordant reads, i.e. save them in two separate columns instead of a single one? Also, the IDs of duplicates and mismatches would be better listed separately, since they are not real supporting evidence.

Thank you.
Z

Support for Hisat2?

Hi Sebastian,

We would like to investigate the effect of some large copy number gains (SV, CNV), but also screen some of our patients for unexpected events. Currently we align mainly using Hisat2 and use Kallisto for counting. I am not a bioinformatician myself, but have some support.
As far as I understand, neither tool gives output compatible with Arriba. Do you know whether this understanding is correct? And are you thinking of adding support for other aligners or a file converter?
Best regards, Jasper Saris

GTF file malformed (Arriba)

Hello,

I want to run Arriba directly (without STAR) because I already have the BAM file of my RNA-Seq sample. When I supply the necessary files on the command line (./arriba, .bam file, .gtf file, .fa file, etc.) and run the analysis, it says each time that my GTF file is malformed. I have also tried the GTF file from your script "download_references.sh", which does not work either:

ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_19/gencode.v19.annotation.gtf.gz

BTW I'm running Arriba on Mac .

Thank you!

Marie

fusion strand

Hello,

I am curious about the strand1 and strand2 columns in arriba output. I read in the documentation that the strand before "/" is the strand of the gene from the gtf and the second strand (that is after "/") is from the assembled fusion.

However I'm not sure how arriba assembles the fusion supporting reads to get the strand information from the STAR BAM file. Can you please explain?

Thanks!

Warning when reading in custom made GTF annotation

Hi to arriba developers,

I tried using arriba to filter for fusion transcripts in my samples between virus and host. I supplied a custom made GTF file for the virus as there are no public GTF available for the virus. However, I always receive a message saying WARNING: exon belongs to unknown gene with ID: HIV_vif and it could not read in the virus gene annotations.

Could you advise on how I can fix the GTF file so that the annotations can be read? I have tried multiple times with various modifications, to no avail.

Here are the lines for the first 3 genes of the virus annotation.

hiv-1_HXB2	RefSeq	transcript	456	9636	0	+	.	transcript_id "vif_01"; gene_name "Vif"; gene_id "HIV_vif";
hiv-1_HXB2	RefSeq	exon	456	742	0	+	.	transcript_id "vif_01"; gene_name "Vif"; gene_id "HIV_vif";
hiv-1_HXB2	RefSeq	exon	4912	9636	0	+	.	transcript_id "vif_01"; gene_name "Vif"; gene_id "HIV_vif";
hiv-1_HXB2	RefSeq	CDS	5041	5619	0	+	.	transcript_id "vif_01"; gene_name "Vif"; gene_id "HIV_vif";

hiv-1_HXB2	RefSeq	transcript	456	9636	0	+	.	transcript_id "vpr_01"; gene_name "Vpr"; gene_id "HIV_vpr";
hiv-1_HXB2	RefSeq	exon	456	742	0	+	.	transcript_id "vpr_01"; gene_name "Vpr"; gene_id "HIV_vpr";
hiv-1_HXB2	RefSeq	exon	5389	9636	0	+	.	transcript_id "vpr_01"; gene_name "Vpr"; gene_id "HIV_vpr";
hiv-1_HXB2	RefSeq	CDS	5559	5850	0	+	.	transcript_id "vpr_01"; gene_name "Vpr"; gene_id "HIV_vpr";

hiv-1_HXB2	RefSeq	transcript	456	9636	0	+	.	transcript_id "vpu_01"; gene_name "Vpu"; gene_id "HIV_vpu";
hiv-1_HXB2	RefSeq	exon	456	742	0	+	.	transcript_id "vpu_01"; gene_name "Vpu"; gene_id "HIV_vpu";
hiv-1_HXB2	RefSeq	exon	5975	9636	0	+	.	transcript_id "vpu_01"; gene_name "Vpu"; gene_id "HIV_vpu";
hiv-1_HXB2	RefSeq	CDS	6062	6310	0	+	.	transcript_id "vpu_01"; gene_name "Vpu"; gene_id "HIV_vpu";

Hope to hear back from you soon. Thank you very much.
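The annotation shown above contains transcript, exon and CDS records but no `gene` records. One thing that may be worth trying, on the unconfirmed assumption that the warning appears because no `gene` line exists for the given gene_id, is to add one `gene` record per gene_id. The sketch below does this; the function name is hypothetical and the approach is a guess rather than a documented requirement.

```python
import re

GENE_ID = re.compile(r'gene_id "([^"]+)"')
GENE_NAME = re.compile(r'gene_name "([^"]+)"')

def add_gene_records(gtf_lines):
    """Insert a 'gene' record before the first record of each gene lacking one."""
    records = []
    for line in gtf_lines:
        fields = line.rstrip("\n").split("\t")
        if line.strip() and not line.startswith("#") and len(fields) >= 9:
            records.append(fields)

    spans = {}         # gene_id -> [smallest start, largest end]
    with_gene = set()  # gene_ids that already have a 'gene' record
    for fields in records:
        match = GENE_ID.search(fields[8])
        if match is None:
            continue
        gene_id = match.group(1)
        if fields[2] == "gene":
            with_gene.add(gene_id)
        start, end = int(fields[3]), int(fields[4])
        span = spans.setdefault(gene_id, [start, end])
        span[0], span[1] = min(span[0], start), max(span[1], end)

    output, emitted = [], set()
    for fields in records:
        match = GENE_ID.search(fields[8])
        gene_id = match.group(1) if match else None
        if gene_id and gene_id not in with_gene and gene_id not in emitted:
            attributes = f'gene_id "{gene_id}";'
            name = GENE_NAME.search(fields[8])
            if name:
                attributes += f' gene_name "{name.group(1)}";'
            start, end = spans[gene_id]
            output.append("\t".join([fields[0], fields[1], "gene", str(start),
                                     str(end), fields[5], fields[6], ".", attributes]))
            emitted.add(gene_id)
        output.append("\t".join(fields))
    return output

# Two records taken from the annotation above.
attributes = 'transcript_id "vif_01"; gene_name "Vif"; gene_id "HIV_vif";'
example = [
    "\t".join(["hiv-1_HXB2", "RefSeq", "transcript", "456", "9636", "0", "+", ".", attributes]),
    "\t".join(["hiv-1_HXB2", "RefSeq", "exon", "456", "742", "0", "+", ".", attributes]),
]
for line in add_gene_records(example):
    print(line)
```

Running the function on its own output adds nothing further, so it is safe to apply more than once.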

DNA fusion detect

Since Arriba is a command-line tool for the detection of gene fusions from RNA-Seq data, I would like to know how to handle DNA-seq data. Looking forward to hearing from you.

'lzma.h' file not found

Hi!

I am trying to install Arriba and on running make, it gives the error -

gcc -g -Wall -O2 -I. -Ihtslib/htslib -c -o cram/cram_io.o cram/cram_io.c
cram/cram_io.c:61:10: fatal error: 'lzma.h' file not found
#include <lzma.h>
         ^~~~~~~~
1 error generated.
make[1]: *** [cram/cram_io.o] Error 1
make: *** [htslib/libhts.a] Error 2

Do you have any idea what I could do to resolve this error? I was expecting that all files necessary for the installation would be contained within the arriba folder.

Thanks!

forever running at the filtering mismappers step

Hi Sebastian,

I really like your tool arriba, it's really fast.
But recently I encountered this weird issue. What is the possible reason for it? Is it because of some GTF records? Very similar splice junction structures? Or anything else?

Thanks.

Known fusion format

Hi,

I am wondering what the expected file format for -k known_fusions.tsv is.

I recently downloaded the known fusions from the COSMIC Complete Fusion Export, as recommended in the Arriba guide. However, the CosmicFusionExport.tsv.gz file has the following format, which I think is not what Arriba expects:

Sample ID	Sample name	Primary site	Site subtype 1	Site subtype 2	Site subtype 3	Primary histology	Histology subtype 1	Histology subtype 2	Histology subtype 3	Fusion ID	Translocation Name	5' Chromosome	5' Genome start from	5' Genome start to	5' Genome stop from	5' Genome stop to	5' Strand	3' Chromosome	3' Genome start from	3' Genome start to	3' Genome stop from	3' Genome stop to	3' Strand	Fusion type	Pubmed_PMID
749711	HCC1187	breast	NS	NS	NS	carcinoma	ductal_carcinoma	NS	NS	665	ENST00000360863.10(RGS22):r.1_3555_ENST00000369518.1(SYCP1):r.2100_3452	8	99981937	99981937	100106116	100106116	-	1	114944339	114944339	114995367	114995367	+	Inferred Breakpoint	20033038
749711	HCC1187	breast	NS	NS	NS	carcinoma	ductal_carcinoma	NS	NS	665	ENST00000360863.10(RGS22):r.1_3555_ENST00000369518.1(SYCP1):r.2100_3452	8	99981937	99981937	100106116	100106116	-	1	114944339	114944339	114995367	114995367	+	Observed mRNA	20033038
749711	HCC1187	breast	NS	NS	NS	carcinoma	ductal_carcinoma	NS	NS	689	ENST00000324093.8(PLXND1):r.1_2864_ENST00000393238.7(TMCC1):r.918_5992	3	129574336	129574336	129606818	129606818	-	3	129647792	129647792	129671264	129671264	-	Inferred Breakpoint	20033038
749711	HCC1187	breast	NS	NS	NS	carcinoma	ductal_carcinoma	NS	NS	689	ENST00000324093.8(PLXND1):r.1_2864_ENST00000393238.7(TMCC1):r.918_5992	3	129574336	129574336	129606818	129606818	-	3	129647792	129647792	129671264	129671264	-	Observed mRNA	20033038
749711	HCC1187	breast	NS	NS	NS	carcinoma	ductal_carcinoma	NS	NS	695	ENST00000285518.10(AGPAT5):r.1_898_ENST00000344683.9(MCPH1):r.2529_8039	8	6708357	6708357	6741751	6741751	+	8	6642994	6642994	6648504	6648504	+	Inferred Breakpoint	20033038
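If the file passed to -k is expected to list gene pairs in two tab-separated columns (an assumption; the expected format is exactly what this question asks about), the gene names could be pulled out of the "Translocation Name" column of the COSMIC export. The following sketch shows the idea with a hypothetical helper, treating the first gene in the name as the 5' partner and the last as the 3' partner.

```python
import csv
import io
import re

GENE_IN_PARENTHESES = re.compile(r"\(([^)]+)\)")

def cosmic_gene_pairs(tsv_text):
    """Return unique (5' gene, 3' gene) pairs in order of first appearance."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    pairs, seen = [], set()
    for row in reader:
        genes = GENE_IN_PARENTHESES.findall(row.get("Translocation Name") or "")
        if len(genes) < 2:
            continue  # no recognizable gene pair in this row
        pair = (genes[0], genes[-1])
        if pair not in seen:
            seen.add(pair)
            pairs.append(pair)
    return pairs

# Reduced example with two of the columns and values from the export above.
example = "\n".join([
    "Sample name\tTranslocation Name",
    "HCC1187\tENST00000360863.10(RGS22):r.1_3555_ENST00000369518.1(SYCP1):r.2100_3452",
    "HCC1187\tENST00000360863.10(RGS22):r.1_3555_ENST00000369518.1(SYCP1):r.2100_3452",
    "HCC1187\tENST00000324093.8(PLXND1):r.1_2864_ENST00000393238.7(TMCC1):r.918_5992",
    "HCC1187\tENST00000285518.10(AGPAT5):r.1_898_ENST00000344683.9(MCPH1):r.2529_8039",
])

for five_prime, three_prime in cosmic_gene_pairs(example):
    print(f"{five_prime}\t{three_prime}")
```

Duplicate rows, such as the "Inferred Breakpoint" and "Observed mRNA" entries for the same fusion, collapse into a single pair.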

-i doesn't take a space separated list

Hi @suhrig,

When using the -i option, passing a space-separated list only grabs the first one in the list:

arriba -x ${test_bam} -g ${test_gtf} -a ${test_fasta} -o ${sample}-withi-fusions.tsv -o ${sample}-withi-fusions.discarded.tsv -i 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X Y

Filtering mates which do not map to interesting contigs (1) (remaining=73956)
arriba -x ${TEST_BAM} -g ${TEST_GTF} -a ${TEST_FASTA} -o ${SAMPLE}-withi-fusions.tsv -o ${SAMPLE}-withi-fusions.discarded.tsv -i 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,X,Y

Filtering mates which do not map to interesting contigs (1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X Y) (remaining=1543505)

The documentation says spaces or commas work, though.
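A possible explanation, offered as an assumption rather than a statement about Arriba's option parser, is that an unquoted space-separated list reaches the program as separate arguments, so only the first one follows -i. The sketch below uses Python's `shlex` module to show how a POSIX shell would tokenize the two variants.

```python
import shlex

contigs = [str(number) for number in range(1, 23)] + ["X", "Y"]

# Space-separated: every contig becomes its own argument.
spaced = shlex.split("arriba -i " + " ".join(contigs))
# Comma-separated: the whole list is a single argument after -i.
joined = shlex.split("arriba -i " + ",".join(contigs))

print(len(spaced), spaced[:4])  # 26 ['arriba', '-i', '1', '2']
print(len(joined), joined[2])   # 3, followed by the full comma-separated list
```

Until the behavior is clarified, the comma-separated form is the one that produced the expected list of contigs in the log above.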

std::bad_alloc when running ./download_references.sh

Full Error:

dags@bio:~/Desktop/FUSIONS/arriba_v1.1.0$ ./download_references.sh hg19+GENCODE19
Downloading assembly: http://hgdownload.cse.ucsc.edu/goldenpath/hg19/bigZips/chromFa.tar.gz
Downloading annotation: ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_19/gencode.v19.annotation.gtf.gz
Jul 30 10:43:59 ..... started STAR run
Jul 30 10:43:59 ... starting to generate Genome files
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
./download_references.sh: line 111: 7880 Aborted (core dumped) STAR --runMode genomeGenerate --genomeDir STAR_index_${ASSEMBLY}_${ANNOTATION} --genomeFastaFiles "$ASSEMBLY.fa" --sjdbGTFfile "$ANNOTATION.gtf" --runThreadN "$THREADS" --sjdbOverhang 200

High false positive rate

Hello,
I ran into a problem with Arriba. Before using it on my samples, I am testing it and other tools on test datasets (positive, negative & a real breast cancer cell line, paired-end data).
It appears that I get a lot of false positives for the negative dataset made with Beers, which I took from the Jaffa datasets, available here:
https://github.com/Oshlack/JAFFA/wiki/Download

For comparison, I ran the analysis on the same data with FusionCatcher, Star-Fusion & Infusion. For these 3 tools I got 1, 8 and 38 false positives respectively, while I got 196 fusions with Arriba! I tried the following parameters to improve my results, but without success.

  • Max e-value set to 0.05 (instead of the default 0.3): 186 fusions.

  • Anchor length set to 40 bp (instead of the default 23): 196.

Any ideas how to improve this? I don't understand why there are so many false positives. I would have thought that changing the anchor length to 40 would drastically decrease the number of fusions, but it stays the same...

Thank you in advance for your answer.

Explanation of best-select filter ?

Hi Sebastian,

I am coming back to you after a few weeks of extended use and integration of arriba (still a great tool :) ). I am running it with the -T -P options.

In one of my samples, I had a series of results that I have been wondering about :

CSNK1D SUZ12P1 -/- +/+ 17:82251379 17:30734897 splice-site splice-site inversion upstream upstream 0 3 14 902 149 high . . duplicates(14),mismatches(1) GCTACCCTT___CCGAATTTGCCACATACCTGAATTTCTGCCGTTCCTTGCGTTTTGACGACAAGCCTGACTACTCGTACCTGCGGCAGCTTTTCCGGAATCTGTTCCATCGCCAGGGCTTCTCCTATGACTACGTGTTCGACTGGAACATGCTCAAATTT|AGCCAACACAGATCTATAGATTTCTTTGAACTCGGAATCTCATAGCA___CCAATATTTTTGCACAGAACTCTTACTTACATGTCTCATCGAAACTCCAGAACAAACATCAAAAG___GAA...AGCTTGTCAGCTCATTTGCAGCTTACATTTTTGGTTTCTT out-of-frame YPSEFATYLNFCRSLRFDDKPDYSYLRQLFRNLFHRQGFSYDYVFDWNMLKF|sqhrsidffelgis* .
CSNK1D SUZ12P1 -/- +/+ 17:82251379 17:30759491 splice-site splice-site inversion upstream upstream 0 1 4 902 278 medium . . duplicates(3),mismatches(1) CATACCTGAATTTCTGCCGTTCCTTGCGTTTTGACGACAAGCCTGACTACTCGTACCTGCGGCAGCTTTTCCG...CATCGCCAGGGCTTCTCCTATGACTACGTGTTCGACTGGAACATGCTCAAATTT|CTTGTCAGCTCATTTGCAGCTTACATTTTTGGTTTCTTCCACAAAAATG___ATAAGCCATCA...AAAATGAACAAAATTCTGTTACCCTGGAAGTCCTGCTTGTGAAAGTTTGC out-of-frame HRQGFSYDYVFDWNMLKF|lvssfaayifgffhkndkp .
CSNK1D SUZ12P1 -/- +/+ 17:82252434 17:30766503 splice-site splice-site inversion upstream upstream 0 1 0 1016 238 low . . duplicates(1) ATCGAAGTGTTGTGTAAAGGCTACCCTT|ATAAGCCATCACCAAACTCAGA...TCCAATAAGGCAAGTTCCCACAGGTAAAAAGCAGGTGCCTTTGAATCCTG out-of-frame IEVLCKGYP|ykpspns .

As you can see, arriba detected several fusions between the same pair of two genes (all of them at different splice sites of SUZ12P1). However, according to the best-select filter description, "If there are multiple breakpoints detected between the same pair of genes, this filter discards all but the most credible one." What is happening in this case ? Does another filter overrule the best-select ?

Thank you in advance for your time,
Best,
Bruno

tarball contains .git

Hi Suhrig,

It looks like the release tarball has the .git directory included in it, is that intentional?

arriba_v1.1.0/.git/objects/82/cd1803664f28f864d8e84e5e65e86c25eff653
arriba_v1.1.0/.git/objects/82/4b29862c4e75ac77ccedaaa7c37d7c2e504f5f
arriba_v1.1.0/.git/objects/82/8dd1cff22901bff9b303d5164ac578e3c2b9ce
arriba_v1.1.0/.git/objects/74/
arriba_v1.1.0/.git/objects/74/60c082bbf9c511783691f0405ce57dd90ca2d9

etc.

Compatibility with STAR >2.7.2a

Hi, do you know if Arriba will work as intended with versions of STAR greater than 2.7.2a? I note that, according to the release notes for this STAR release, the behaviour with respect to chimeric reads changed:

Chimeric read reporting now requires that the chimeric read alignment score higher than the alternative non-chimeric alignment to the reference genome. The Chimeric.out.junction file now includes the scores of the chimeric alignments and non-chimeric alternative alignments, in addition to the PEmerged bool attribute.

My concern is specifically the presence of scores for non-chimeric alternatives, which I assume is a new feature of STAR. I don't see anything in the release notes for Arriba 1.2.0 about the changes in STAR, and I'm worried that alterations to the Chimeric.out.junction format might result in unwanted behaviour or in non-chimeric alternatives being parsed.

Accept Structural Variant VCF

Most structural variant callers output results in VCF. To make it easier to use the output of one program as the input to another, it would be desirable to be able to specify a VCF file using -d.

How to use GENCODE annotation with arriba?

I am attempting to use the hg38 annotation from GENCODE to run arriba. However it gave me "ERROR: failed to parse GTF file, please consider using -G". I added the following: "-G gene_name=gene_id gene_id=gene_id" but I got "ERROR: Malformed GTF features: gene_id=gene_id." This is where I got the annotation file from: ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_33/gencode.v33.annotation.gtf.gz
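Before deciding what to pass to -G, it can help to check which attribute names the annotation actually uses. This is a minimal Python sketch, assuming the usual GTF layout of nine tab-separated columns with key "value"; pairs in the last column. The function name is hypothetical and the example line uses made-up identifiers.

```python
import re

# Matches attribute pairs of the form: key "value"
ATTRIBUTE = re.compile(r'(\S+)\s+"([^"]*)"')


def gtf_attributes(line):
    """Return the quoted attributes of one GTF line as a dict.

    Unquoted attributes (for example numeric ones) are skipped, and if a key
    occurs several times only its last value is kept.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError("expected 9 tab-separated GTF columns")
    return dict(ATTRIBUTE.findall(fields[8]))


# Illustrative line with made-up identifiers, not copied from a real annotation.
example = (
    "chr1\tHAVANA\tgene\t1000\t2000\t.\t+\t.\t"
    'gene_id "ENSG00000000001.1"; gene_type "protein_coding"; '
    'gene_name "GENE1"; level 2;'
)
print(sorted(gtf_attributes(example)))
```

Running this over the first non-comment lines of the annotation shows whether attributes such as gene_id and gene_name are present under those names.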

Gene IDs in output

Hello,

When reporting fusion genes, is there a way to output gene IDs as well as names?
I was trying to use htseq-count to count reads mapped to each exon of fusion genes. I also use STAR-Fusion together with Arriba; STAR-Fusion reports both the gene name and the gene ID.

Thanks.

compile error

I cloned this repository with git, but I get a compile error (see below) when I run the make command on CentOS 6.
How can I fix this error?

$ make
make -C source arriba
make[1]: Entering directory `/share/Data01/liwujiao/biosoft/arriba/source'
g++ -c -pthread -std=c++0x -O2 -w -I. -I.. -I../samtools-1.3 -I../samtools-1.3/htslib-1.3 annotation.cpp -lz
g++ -c -pthread -std=c++0x -O2 -w -I. -I.. -I../samtools-1.3 -I../samtools-1.3/htslib-1.3 assembly.cpp -lz
g++ -c -pthread -std=c++0x -O2 -w -I. -I.. -I../samtools-1.3 -I../samtools-1.3/htslib-1.3 options_arriba.cpp -lz
options_arriba.cpp: In function 'void print_usage(const std::string&)':
options_arriba.cpp:141: error: call of overloaded 'to_string(float&)' is ambiguous
/usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../include/c++/4.4.7/bits/basic_string.h:2604: note: candidates are: std::string std::to_string(long long int)
/usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../include/c++/4.4.7/bits/basic_string.h:2610: note: std::string std::to_string(long long unsigned int)
/usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../include/c++/4.4.7/bits/basic_string.h:2616: note: std::string std::to_string(long double)
options_arriba.cpp:144: error: call of overloaded 'to_string(unsigned int&)' is ambiguous
/usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../include/c++/4.4.7/bits/basic_string.h:2604: note: candidates are: std::string std::to_string(long long int)
/usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../include/c++/4.4.7/bits/basic_string.h:2610: note: std::string std::to_string(long long unsigned int)
/usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../include/c++/4.4.7/bits/basic_string.h:2616: note: std::string std::to_string(long double)
options_arriba.cpp:147: error: call of overloaded 'to_string(float&)' is ambiguous
/usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../include/c++/4.4.7/bits/basic_string.h:2604: note: candidates are: std::string std::to_string(long long int)
/usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../include/c++/4.4.7/bits/basic_string.h:2610: note: std::string std::to_string(long long unsigned int)
/usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../include/c++/4.4.7/bits/basic_string.h:2616: note: std::string std::to_string(long double)
options_arriba.cpp:150: error: call of overloaded 'to_string(float&)' is ambiguous
/usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../include/c++/4.4.7/bits/basic_string.h:2604: note: candidates are: std::string std::to_string(long long int)
/usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../include/c++/4.4.7/bits/basic_string.h:2610: note: std::string std::to_string(long long unsigned int)
/usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../include/c++/4.4.7/bits/basic_string.h:2616: note: std::string std::to_string(long double)
options_arriba.cpp:152: error: call of overloaded 'to_string(unsigned int&)' is ambiguous
/usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../include/c++/4.4.7/bits/basic_string.h:2604: note: candidates are: std::string std::to_string(long long int)
/usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../include/c++/4.4.7/bits/basic_string.h:2610: note: std::string std::to_string(long long unsigned int)
/usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../include/c++/4.4.7/bits/basic_string.h:2616: note: std::string std::to_string(long double)
options_arriba.cpp:159: error: call of overloaded 'to_string(unsigned int&)' is ambiguous
/usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../include/c++/4.4.7/bits/basic_string.h:2604: note: candidates are: std::string std::to_string(long long int)
/usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../include/c++/4.4.7/bits/basic_string.h:2610: note: std::string std::to_string(long long unsigned int)
/usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../include/c++/4.4.7/bits/basic_string.h:2616: note: std::string std::to_string(long double)
options_arriba.cpp:162: error: call of overloaded 'to_string(unsigned int&)' is ambiguous
/usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../include/c++/4.4.7/bits/basic_string.h:2604: note: candidates are: std::string std::to_string(long long int)
/usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../include/c++/4.4.7/bits/basic_string.h:2610: note: std::string std::to_string(long long unsigned int)
/usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../include/c++/4.4.7/bits/basic_string.h:2616: note: std::string std::to_string(long double)
options_arriba.cpp:167: error: call of overloaded 'to_string(unsigned int&)' is ambiguous
/usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../include/c++/4.4.7/bits/basic_string.h:2604: note: candidates are: std::string std::to_string(long long int)
/usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../include/c++/4.4.7/bits/basic_string.h:2610: note: std::string std::to_string(long long unsigned int)
/usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../include/c++/4.4.7/bits/basic_string.h:2616: note: std::string std::to_string(long double)
options_arriba.cpp:169: error: call of overloaded 'to_string(unsigned int&)' is ambiguous
/usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../include/c++/4.4.7/bits/basic_string.h:2604: note: candidates are: std::string std::to_string(long long int)
/usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../include/c++/4.4.7/bits/basic_string.h:2610: note: std::string std::to_string(long long unsigned int)
/usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../include/c++/4.4.7/bits/basic_string.h:2616: note: std::string std::to_string(long double)
options_arriba.cpp:172: error: call of overloaded 'to_string(float&)' is ambiguous
/usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../include/c++/4.4.7/bits/basic_string.h:2604: note: candidates are: std::string std::to_string(long long int)
/usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../include/c++/4.4.7/bits/basic_string.h:2610: note: std::string std::to_string(long long unsigned int)
/usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../include/c++/4.4.7/bits/basic_string.h:2616: note: std::string std::to_string(long double)
options_arriba.cpp:175: error: call of overloaded 'to_string(float&)' is ambiguous
/usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../include/c++/4.4.7/bits/basic_string.h:2604: note: candidates are: std::string std::to_string(long long int)
/usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../include/c++/4.4.7/bits/basic_string.h:2610: note: std::string std::to_string(long long unsigned int)
/usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../include/c++/4.4.7/bits/basic_string.h:2616: note: std::string std::to_string(long double)
options_arriba.cpp:180: error: call of overloaded 'to_string(unsigned int&)' is ambiguous
/usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../include/c++/4.4.7/bits/basic_string.h:2604: note: candidates are: std::string std::to_string(long long int)
/usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../include/c++/4.4.7/bits/basic_string.h:2610: note: std::string std::to_string(long long unsigned int)
/usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../include/c++/4.4.7/bits/basic_string.h:2616: note: std::string std::to_string(long double)
options_arriba.cpp:185: error: call of overloaded 'to_string(unsigned int&)' is ambiguous
/usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../include/c++/4.4.7/bits/basic_string.h:2604: note: candidates are: std::string std::to_string(long long int)
/usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../include/c++/4.4.7/bits/basic_string.h:2610: note: std::string std::to_string(long long unsigned int)
/usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../include/c++/4.4.7/bits/basic_string.h:2616: note: std::string std::to_string(long double)
make[1]: *** [options_arriba.o] Error 1
make[1]: Leaving directory `/some/dir/arriba/source'
make: *** [arriba] Error 2

Can't compile with make release

When I try to compile Arriba with the release flag I get the following error:

$ make release
...
make LIBS_SO="" LIBS_A="htslib-1.8/libhts.a " CPPFLAGS="-DHAVE_LIBDEFLATE -Ihtslib-1.8/htslib -I../static_libs_centos6.9"
make[1]: Entering directory '/dev/shm/arriba_v1.1.0'
make -C htslib-1.8 CPPFLAGS="-DHAVE_LIBDEFLATE -Ihtslib-1.8/htslib -I../static_libs_centos6.9" LDFLAGS="" libhts.a
make[2]: Entering directory '/dev/shm/arriba_v1.1.0/htslib-1.8'
gcc -g -Wall -O2 -I. -DHAVE_LIBDEFLATE -Ihtslib-1.8/htslib -I../static_libs_centos6.9 -c -o bgzf.o bgzf.c
bgzf.c:39:10: fatal error: libdeflate.h: No such file or directory
 #include <libdeflate.h>
          ^~~~~~~~~~~~~~
compilation terminated.
Makefile:121: recipe for target 'bgzf.o' failed
make[2]: *** [bgzf.o] Error 1
make[2]: Leaving directory '/dev/shm/arriba_v1.1.0/htslib-1.8'
Makefile:27: recipe for target 'htslib-1.8/libhts.a' failed
make[1]: *** [htslib-1.8/libhts.a] Error 2
make[1]: Leaving directory '/dev/shm/arriba_v1.1.0'
Makefile:34: recipe for target 'release' failed
make: *** [release] Error 2

Is the htslib-1.8 folder supposed to include the libdeflate library or how is the release flag supposed to work? And can you document this in the installation instructions? Also what is supposed to be in STATIC_LIBS or static_libs_centos6.9?

Make --sjdbOverhang mutable via command-line in download_references.sh

The script you provide for downloading the references + annotations assumes a read length of 201bp.

Specifically the final stage of the script builds a STAR index with:

STAR --runMode genomeGenerate --genomeDir STAR_index_${ASSEMBLY}_${ANNOTATION} --genomeFastaFiles "$ASSEMBLY.fa" --sjdbGTFfile "$ANNOTATION.gtf" --runThreadN "$THREADS" --sjdbOverhang 200

However, the STAR documentation describes --sjdbOverhang for index generation as follows:

Ideally, this length should be equal to the ReadLength-1, where ReadLength is the length of the reads

Consequently, would hardwiring this to a fixed value of 200 be unwise? For instance, in my case I have 76bp reads, so my optimum value of --sjdbOverhang would be 75, assuming the STAR recommendations apply?
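The arithmetic behind the STAR recommendation quoted above (ReadLength-1) can be sketched in a few lines of Python. The function names are hypothetical, and the read length is estimated from the first records of a FASTQ stream, assuming the common four-lines-per-record layout.

```python
def max_read_length(fastq_lines, max_records=100000):
    """Return the longest read among the first max_records FASTQ records."""
    longest = 0
    for index, line in enumerate(fastq_lines):
        if index // 4 >= max_records:
            break
        if index % 4 == 1:  # second line of each 4-line record is the sequence
            longest = max(longest, len(line.strip()))
    return longest


def sjdb_overhang(read_length):
    """Apply the ReadLength-1 rule from the STAR documentation."""
    if read_length < 2:
        raise ValueError("read length must be at least 2")
    return read_length - 1


# Two toy records: one 76bp read and one 70bp read.
reads = ["@read1", "A" * 76, "+", "I" * 76, "@read2", "C" * 70, "+", "I" * 70]
print(sjdb_overhang(max_read_length(reads)))  # prints 75
```

For a gzipped file, the lines can be supplied with gzip.open(path, "rt"). With 201bp reads the same rule gives the value 200 that the script currently hardcodes.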

Results from the SMC-RNA Challenge

Hi,
I read on your home page that, as of the 4th round of the SMC-RNA challenge, Arriba has topped the leaderboard. Could you please share the results from that competition? This would save me from duplicating effort in benchmarking this tool against others.

Thanks,
RV

Circos plot Q: How to make the chromosome labels appear clearer?

In my circos plot, some of the chromosome labels are partly hidden by the numerous intrachromosomal fusions. What would you suggest to make the chromosome labels stand out?

Is it possible to make the lines connecting the gene label to the chromosomal band longer, or thinner to not hide the chromosome label? Perhaps a different color font for the chromosomal label?

So far I got this parameter to work:
circos.genomicLabels(geneLabels, labels.column=4, side="outside", cex=fontSize,connection_height = convert_height(20, "mm"))

But adding the other parameters gives me 50 warnings:

circos.genomicLabels(geneLabels, labels.column=4, side="outside", cex=fontSize,connection_height = convert_height(20, "mm"),line_col = par(col="gray"), line_lwd = par(lwd=0.8), line_lty = par(lty=4))

FusionCatcher Dataset

Hi,

I am trying to catch 'em all with the FusionCatcher dataset + Arriba:
https://github.com/ndaniel/fusioncatcher/tree/master/test

I tried with the following command line arguments:

STAR \
    '--runMode alignReads' \
    --alignIntronMax \
    1000000 \
    --alignIntronMin \
    20 \
    --alignMatesGapMax \
    1000000 \
    --alignSJDBoverhangMin \
    1 \
    --alignSJoverhangMin \
    8 \
    --chimJunctionOverhangMin \
    15 \
    --chimOutType \
    WithinBAM \
    SoftClip \
    --chimSegmentMin \
    15 \
    --genomeDir \
    /var/lib/cwl/stg12a41017-ebd1-4396-9fdb-ef251725790f/star \
    --genomeLoad \
    NoSharedMemory \
    --limitBAMsortRAM \
    60000000000 \
    --limitOutSAMoneReadBytes \
    90000000 \
    --outFilterIntronMotifs \
    RemoveNoncanonical \
    --outFilterMismatchNmax \
    10 \
    --outFilterMismatchNoverLmax \
    0.1 \
    --outFilterMultimapNmax \
    10 \
    --outFilterType \
    BySJout \
    --outReadsUnmapped \
    Fastx \
    --outSAMmapqUnique \
    255 \
    --outSAMstrandField \
    intronMotif \
    --outSAMtype \
    BAM \
    SortedByCoordinate \
    --outSAMunmapped \
    Within \
    --outSAMmode \
    Full \
    --readFilesCommand \
    zcat \
    --runThreadN \
    8 \
    --seedSearchStartLmax \
    30 \
    --readFilesIn \
    /var/lib/cwl/stgda69f516-ac0d-4ab1-a638-7b3621f260fe/joinedfiles.dat \
    /var/lib/cwl/stg0b23d764-be65-4426-85b3-c6a7ac523b44/joinedfiles.dat

    arriba \
    -g \
    /var/lib/cwl/stgc586d5c5-6f47-4507-b3c8-32bc67801d2b/gencode.v32.annotation.gtf \
    -a \
    /var/lib/cwl/stg807396cf-c9a5-41d9-a791-8b4420370377/GRCh38.primary_assembly.genome.fa \
    -b \
    /var/lib/cwl/stga879b704-1ea1-4158-bc0a-761a72af1f02/blacklist_hg38_GRCh38_2018-11-04.tsv.gz \
    -x \
    /var/lib/cwl/stgb4d5992c-53d9-4c45-adee-e4de4f54027c/Aligned.sortedByCoord.out.bam \
    -O \
    fusions.discarded.tsv \
    -o \
    fusions.tsv \
    -P \
    -P

and got ten:

#gene1  gene2   strand1(gene/fusion)    strand2(gene/fusion)    breakpoint1     breakpo$
FGFR3   TACC3   +/+     +/+     4:1806934	4:1727977	splice-site     CDS    $
FIP1L1  PDGFRA  +/+     +/+     4:53425965	4:54274925	splice-site     CDS    $
HOOK3   RET     +/+     +/+     8:42968214	10:43116584     splice-site     splice-$
AKAP9   BRAF    +/+     -/-     7:92003235	7:140787584     splice-site     splice-$
EWSR1   ATF1    +/+     +/+     22:29287134     12:50814280     splice-site     splice-$
ETV6    NTRK3   +/+     -/-     12:11869969     15:87940753     splice-site     splice-$
EML4    ALK     +/+     -/-     2:42301394	2:29223584	splice-site     5'UTR  $
BRD4    NUTM1   -/-     +/+     19:15254152     15:34347969     splice-site     splice-$
GOPC    ROS1    -/-     -/-     6:117566854     6:117321394     splice-site     splice-$
TMPRSS2 ETV1    -/-     -/-     21:41494375     7:13935838	CDS     CDS     translo$

Then I tried changing MaxReads and MaxEValue, which didn't increase the number of fusions. Next, I tried disabling a few filters (min_support and many_spliced):

    STAR \
    '--runMode alignReads' \
    --alignIntronMax \
    1000000 \
    --alignIntronMin \
    20 \
    --alignMatesGapMax \
    1000000 \
    --alignSJDBoverhangMin \
    1 \
    --alignSJoverhangMin \
    8 \
    --chimJunctionOverhangMin \
    15 \
    --chimOutType \
    WithinBAM \
    SoftClip \
    --chimSegmentMin \
    15 \
    --genomeDir \
    /var/lib/cwl/stgf7406dee-faff-40b1-9b06-303b961e77c7/star \
    --genomeLoad \
    NoSharedMemory \
    --limitBAMsortRAM \
    60000000000 \
    --limitOutSAMoneReadBytes \
    90000000 \
    --outFilterIntronMotifs \
    RemoveNoncanonical \
    --outFilterMismatchNmax \
    10 \
    --outFilterMismatchNoverLmax \
    0.1 \
    --outFilterMultimapNmax \
    10 \
    --outFilterType \
    BySJout \
    --outReadsUnmapped \
    Fastx \
    --outSAMmapqUnique \
    255 \
    --outSAMstrandField \
    intronMotif \
    --outSAMtype \
    BAM \
    SortedByCoordinate \
    --outSAMunmapped \
    Within \
    --outSAMmode \
    Full \
    --readFilesCommand \
    zcat \
    --runThreadN \
    8 \
    --seedSearchStartLmax \
    30 \
    --readFilesIn \
    /var/lib/cwl/stgdb0d3990-f97d-40cc-9f55-c06d95ba9968/joinedfiles.dat \
    /var/lib/cwl/stg89bb78b7-93ff-43f0-a1e7-214f45ec2bce/joinedfiles.dat

arriba \
    -g \
    /var/lib/cwl/stg4b736520-cb0e-4049-a556-69e167158000/gencode.v32.annotation.gtf \
    -a \
    /var/lib/cwl/stg28d39c0b-6b89-4638-b092-bbcd871a1bc7/GRCh38.primary_assembly.genome.fa \
    -b \
    /var/lib/cwl/stg82c7408b-af40-42df-971d-809ef1387304/blacklist_hg38_GRCh38_2018-11-04.tsv.gz \
    -x \
    /var/lib/cwl/stg6832edbb-f979-4f0d-9128-843c9e69cf6c/Aligned.sortedByCoord.out.bam \
    -f \
    min_support \
    many_spliced \
    -O \
    fusions.discarded.tsv \
    -o \
    fusions.tsv \
    -E \
    1 \
    -U \
    50 \
    -P \
    -P

which gave me 15 fusions, but not the ones I was looking for, and I don't think disabling the filters is the right way to expand the set of fusions:

FGFR3   TACC3   +/+     +/+     4:1806934	4:1727977	splice-site     CDS     dupl$
FIP1L1  PDGFRA  +/+     +/+     4:53425965	4:54274925	splice-site     CDS     dele$
HOOK3   RET     +/+     +/+     8:42968214	10:43116584     splice-site     splice-site $
AKAP9   BRAF    +/+     -/-     7:92003235	7:140787584     splice-site     splice-site $
EWSR1   ATF1    +/+     +/+     22:29287134     12:50814280     splice-site     splice-site $
ETV6    NTRK3   +/+     -/-     12:11869969     15:87940753     splice-site     splice-site $
EML4    ALK     +/+     -/-     2:42301394	2:29223584	splice-site     5'UTR   inve$
BRD4    NUTM1   -/-     +/+     19:15254152     15:34347969     splice-site     splice-site $
GOPC    ROS1    -/-     -/-     6:117566854     6:117321394     splice-site     splice-site $
TMPRSS2 ETV1    -/-     -/-     21:41494375     7:13935838	CDS     CDS     translocatio$
CD74    ROS1    -/-     -/-     5:150404680     6:117324415     splice-site     splice-site $
GNAS    BRAF    +/+     -/-     20:58909359     7:140783038     CDS     CDS     translocatio$
SEPTIN9 BRAF    +/+     -/-     17:77499723     7:140753339     3'UTR   CDS     translocatio$
EWSR1   FLI1    +/+     +/+     22:29287134     11:128807180    splice-site     splice-site $
NTRK3   ATP2B1  -/-     -/-     15:87880322     12:89630643     CDS     CDS     translocatio$

Any idea how I can catch 'em all?

FULL SET OF FUSIONS:

-   FGFR3-TACC3  (short reads from [2]),
-   FIP1L1-PDGFRA  (short reads from [3]),
-   GOPC-ROS1  (short reads from [4]),
-   EWS-ATF1  (short reads from [1]),
-   TMPRSS2-ETV1  (short reads from [1]),
-   EWS-FLI1  (short reads from [1]),
-   NTRK3-ETV6  (short reads from [1]),
-   CD74-ROS1  (short reads from [1]),
-   HOOK3-RET  (short reads from [1]),
-   EML4-ALK  (short reads from [1]),
-   AKAP9-BRAF  (short reads from [1]),
-   BRD4-NUT  (short reads from [1]),
-   MALT1-IGH  (short reads from [5]),
-   IGH-CRLF2  (short reads from [6]),
-   DUX4-IGH  (short reads from [7]),
-   NPM1-ALK  (short reads from [8]), and
-   CIC-DUX4  (short reads from [9]).
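Comparing the expected list with the 15 reported gene pairs by eye is error-prone, because the two lists differ in gene order (NTRK3-ETV6 vs. ETV6-NTRK3) and in gene symbols (EWS vs. EWSR1, NUT vs. NUTM1). This is a minimal Python sketch of the comparison, treating each fusion as an unordered pair. The alias table and function names are illustrative assumptions, not part of Arriba or FusionCatcher.

```python
# Older symbols used in the expected list, mapped to the symbols in the output.
ALIASES = {"EWS": "EWSR1", "NUT": "NUTM1"}

EXPECTED = [
    "FGFR3-TACC3", "FIP1L1-PDGFRA", "GOPC-ROS1", "EWS-ATF1", "TMPRSS2-ETV1",
    "EWS-FLI1", "NTRK3-ETV6", "CD74-ROS1", "HOOK3-RET", "EML4-ALK",
    "AKAP9-BRAF", "BRD4-NUT", "MALT1-IGH", "IGH-CRLF2", "DUX4-IGH",
    "NPM1-ALK", "CIC-DUX4",
]

DETECTED = [
    "FGFR3-TACC3", "FIP1L1-PDGFRA", "HOOK3-RET", "AKAP9-BRAF", "EWSR1-ATF1",
    "ETV6-NTRK3", "EML4-ALK", "BRD4-NUTM1", "GOPC-ROS1", "TMPRSS2-ETV1",
    "CD74-ROS1", "GNAS-BRAF", "SEPTIN9-BRAF", "EWSR1-FLI1", "NTRK3-ATP2B1",
]


def normalize(pair):
    """Turn 'GENE1-GENE2' into an unordered pair of current gene symbols."""
    gene1, gene2 = pair.split("-")
    return frozenset(ALIASES.get(gene, gene) for gene in (gene1, gene2))


def missing_fusions(expected, detected):
    """Return the expected fusions that have no counterpart among the detected ones."""
    found = {normalize(pair) for pair in detected}
    return [pair for pair in expected if normalize(pair) not in found]


print(missing_fusions(EXPECTED, DETECTED))
```

With the lists above, the fusions still missing are MALT1-IGH, IGH-CRLF2, DUX4-IGH, NPM1-ALK and CIC-DUX4.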

Thanks,
-WaO

pinning to htslib 1.8 in bioconda

Hey everyone,

Congrats on the great showing in the DREAM challenge! I am looking at what it would take to integrate this into our workflow. I tried installing the package from bioconda, but pinning htslib to 1.8 causes a number of other packages that we install to be removed. Is the pinning to 1.8 necessary? If it is, do you think we could work on getting it to build against htslib 1.9?

How to cite Arriba?

Hi,

What would be the best way to cite Arriba? A bioRxiv preprint would be excellent! Thanks,

Philip

Translation function from arriba

Hello Sebastian,

I am writing to you because I have started using arriba (beautiful tool BTW) and the output of the translation feature of Arriba (called by the -P parameter) is a bit unclear to me.

As per the documentation: "If the fusion transcript contains an ellipsis (...), the sequence beyond the ellipsis is trimmed before translation, because the reading frame cannot be determined reliably." That is a very sensible choice, and in my output it is indeed the case for the part after the fusion. However, the output for the part before the fusion confuses me. As examples, here are two events I found:

CCCTTTGGACCTTTGgCACCAGGCTGGG___AAAAAAGAGTGGATTCAACAGACAGGGTTTACTTTGTGAATCATAACA...GGAAGATCCAAGAACTCAAGG|ATAAAAGATCTGCAGCTATGGAATTCTTCTCCATGACATTTTCACAGAACATACTACTTGTGATTTATATCATGTCCTTACTCAAG___GTAAGGAACTGCAAGTGATCAATATTGC

gives

EDPRTQg|*

And

AAAGAAGACTGGGCCTACAAAGAAGAAAGTGAAAGAACTGAGAATTTTGG...ATCGCATGCCATATGAAGACATAAGAAACGTTATTCTGGAGGTTAATGAAGACATGCTGAGTGAGGCTTTAATTCAG|ATATGTTTAAAGGGTAAGGTGCAC...CACAGCCTCTCACAGACAG___TATGGAAGATTTTTATCCAAATAAAAATCATGGCCCT

gives

RMPYEDIRNVILEVNEDMLSEALIQ|ICLKGKV

I gather that in those cases (ellipsis in the first part of the fusion), arriba only translates from the last ellipsis of the first transcript part onwards, but how? Do you translate a full reconstruction of the CDS from the assembly and the GTF, and then only take the appropriate part (which I'm guessing is the case from the doc: "Translation starts [...] when the start codon is encountered in the 5' gene.")? If so, how do you handle alternate start codons? Does this translation method only happen when you are certain of the transcript start / transcript ID?
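The first residues of the two outputs above appear consistent with translating only the sequence after the last ellipsis on the 5' side, in a reading frame that is not necessarily the first one (offset 1 in the first example, offset 2 in the second). This minimal Python sketch lets you check that by hand. It is not Arriba's implementation: the function names are hypothetical, and the frame offset must be supplied by the caller because the snippet reads no annotation.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translatable_parts(fusion_transcript):
    """Return the 5' sequence after its last ellipsis and the 3' sequence
    before its first ellipsis, without '_' markers and in upper case."""
    five_prime, three_prime = fusion_transcript.split("|", 1)
    five_prime = five_prime.rsplit("...", 1)[-1]
    three_prime = three_prime.split("...", 1)[0]
    return (five_prime.replace("_", "").upper(),
            three_prime.replace("_", "").upper())


def translate(sequence, frame=0):
    """Translate from the given frame offset up to and including the first stop codon."""
    peptide = []
    for start in range(frame, len(sequence) - 2, 3):
        amino_acid = CODON_TABLE.get(sequence[start:start + 3], "X")
        peptide.append(amino_acid)
        if amino_acid == "*":
            break
    return "".join(peptide)


# Abbreviated version of the first example above.
transcript = "CCCTTTGGACC...GGAAGATCCAAGAACTCAAGG|ATAAAAGATCTGCAGC"
five_prime, three_prime = translatable_parts(transcript)
print(translate(five_prime + three_prime, frame=1))  # prints EDPRTQG*
```

The frame offsets here were found by trial. Which offset a tool would choose presumably depends on the annotated coding sequence of the 5' gene.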

Additionally, I have events looking like this:

GCGGATCTGGGGCCGTCCTCAG|CATAAGCTGTGGCCATGACTACTGAAGT...GGACTCTAGCCAGTTAGGAACAGATGCAACCAAGGAAAAACCTAAAGAAG

translated like that:

ADLGPSS|a*

which I'm guessing is the other behavior of the translation tool described in the doc: "Translation starts at the start of the assembled fusion transcript".

My question here is then: how do you choose which translation method to use? Do you check whether the natural (annotated) 5' start codon is present and translate from it if you can, and from the start of the transcript if not? In addition, translating from the start of the transcript seems to lack some biological relevance. Why not translate from the first encountered start codon?

I'm sorry for the wall of text and the numerous questions but the translation output of arriba is the part I am most interested in, so I figured I would try to clarify my confusion.

Thanks in advance, and thanks for a great tool too!

Bruno

Lower evidence for fusions with Arriba

Hi,

We have been observing consistently low evidence (split_reads) for fusions detected in cell lines/solid tumor samples as compared to when I use FusionCatcher or Pizzly. Is this expected or is there parameter tuning that can regulate this behavior?

Thanks,
Prateek

Multi-Threading

Hi, just wanted to say how awesome this package is. I have found the results to be robust and comprehensive in terms of information provided (compared to any other package I've used).

I had a quick question about multi-threading: is there any consideration of implementing it as a flag option? It might speed up processing for those of us using WGS/WES BAM files.

Thank you again!

Outdated Bioconductor Installation Instructions

The user guide advises the user to execute

source("https://bioconductor.org/biocLite.R")
biocLite("GenomicAlignments")
biocLite("GenomicRanges")

but about a year ago, this was stopped because of security issues. The correct way to install packages now is to install BiocManager from CRAN and then

library(BiocManager)
install("GenomicAlignments")

There is no need for the user to separately install GenomicRanges because that will automatically happen during the installation of GenomicAlignments because GenomicRanges is in the Depends field of the DESCRIPTION file of GenomicAlignments, so the install command will take care of it.

Also, there's nothing in the user guide which explains if there's a Google Groups site or how to contact the developer with questions.

how to add "chr" prefix to output exon coordinates

Hi,
When I tried to use "draw_fusions.R" to visualize arriba output, it reported "error : exon coordinates not found in gff3". Is there a way to get output with a "chr" prefix in the exon coordinates?
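If the error does come from differing contig names (for example "1" in one file and "chr1" in the other), one workaround is to rewrite the first column of the annotation so that both files use the same convention. This is a minimal Python sketch under that assumption. The function name is hypothetical, and special cases such as the mitochondrial contig (MT vs. chrM) are not handled.

```python
def add_chr_prefix(line, prefix="chr"):
    """Prefix the contig name in the first tab-separated column of a line.

    Comment lines, empty lines and contigs that already carry the prefix
    are returned unchanged.
    """
    if line.startswith("#") or not line.strip():
        return line
    contig, separator, rest = line.partition("\t")
    if not contig.startswith(prefix):
        contig = prefix + contig
    return contig + separator + rest


# Illustrative annotation lines (made-up coordinates).
lines = [
    "#!genome-build GRCh38\n",
    "17\tsource\texon\t100\t200\t.\t-\t.\tgene_id \"G1\";\n",
    "chr17\tsource\texon\t300\t400\t.\t-\t.\tgene_id \"G1\";\n",
]
for line in lines:
    print(add_chr_prefix(line), end="")
```

The same idea works in the opposite direction (stripping the prefix). Which direction is appropriate depends on the contig names used in the alignments.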

Thank you,
Xiucz

Nice software

Hi Sebastian,

I had a chat with you on arriba in the PhD poster presentation last month, and it impressed me a lot!

I made posts to briefly introduce the software in two popular Chinese bioinformatics forums:
Arriba I
Arriba II

Sorry, it is almost all Chinese...

I hope more Chinese researchers would know your software and prefer to use it!

GL to your DREAM SMC RNA Challenge contest,

Wenhu Cao

Blacklist not available outside release?

When I git clone this repo, I don't get the database subdirectory with the hg19 blacklist that's present in the release version v0.11.0. The README mentions the blacklist but doesn't include it in the example run, and doesn't describe its use. This is a bit confusing.

can't install through conda

I am getting this error over and over again:

Solving environment: failed
Initial quick solve with frozen env failed. Unfreezing env and trying again.

split reads

Hello suhrig,

Thank you for developing the tool, it's very informative. However I had a couple of questions that I believe are not covered in the document, if it's already available please point me to it.

  1. Would you agree that split reads1 + split reads2 == spanning reads, and that this sum can then be used to determine how many spanning reads support each call? I am asking specifically for read-based filtering purposes.
  2. For discordant reads, does Arriba strictly require support from at least 1 read to call a true fusion? Some high/medium-confidence calls have 0 discordant/junction reads.
  3. How do you recommend we annotate Arriba calls with databases such as ChimericDB etc.?
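For read-based filtering as described in question 1, the supporting reads per call can be summed directly from the output table. This is a minimal Python sketch, assuming the columns are named split_reads1, split_reads2 and discordant_mates. Those names, the function names and the threshold of 3 are assumptions for illustration, and whether the sum matches a given definition of "spanning reads" is exactly the open question above.

```python
import csv
import io


def supporting_reads(row):
    """Sum the three read-support columns of one output row."""
    return (int(row["split_reads1"])
            + int(row["split_reads2"])
            + int(row["discordant_mates"]))


def filter_by_support(tsv_text, min_reads=3):
    """Return (gene1, gene2, support) for rows with at least min_reads supporting reads."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    kept = []
    for row in reader:
        support = supporting_reads(row)
        if support >= min_reads:
            kept.append((row["#gene1"], row["gene2"], support))
    return kept


# Abbreviated table; the read counts are taken from the CSNK1D/SUZ12P1 rows quoted earlier.
example = (
    "#gene1\tgene2\tsplit_reads1\tsplit_reads2\tdiscordant_mates\tconfidence\n"
    "CSNK1D\tSUZ12P1\t0\t1\t4\tmedium\n"
    "CSNK1D\tSUZ12P1\t0\t1\t0\tlow\n"
)
print(filter_by_support(example))
```

In this example only the first row (0 + 1 + 4 = 5 reads) passes the threshold; the second row has a single supporting read.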

Thank you,
Krutika
