Comments (5)
Dear nicodemus88,
Arriba requires genes to be explicitly defined in the GTF file. If you have records of type transcript, exon or CDS, which reference a gene via the gene_id attribute, then there must be a record of type gene introducing a gene with this ID.
Moreover, Arriba requires that records of type gene specify the attribute gene_type as protein_coding for protein-coding genes or some other value for non-coding genes.
Given your example, the following three lines are missing:
1 unknown gene 456 9636 . + . gene_id "HIV_vpr"; gene_name "HIV_vpr"; gene_type "protein_coding";
1 unknown gene 456 9636 . + . gene_id "HIV_vpu"; gene_name "HIV_vpu"; gene_type "protein_coding";
1 unknown gene 456 9636 . + . gene_id "HIV_vif"; gene_name "HIV_vif"; gene_type "protein_coding";
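To see which gene IDs in a custom GTF file are affected, a check along the following lines might help. This is only a rough sketch and not part of Arriba: the function name check_gtf_genes is made up for illustration, and it assumes a tab-separated GTF file whose ninth column contains a gene_id "..." attribute.

```shell
# Hypothetical helper (not part of Arriba): list gene IDs that have no
# record of type "gene", or whose "gene" record has no gene_type attribute.
# Assumes a tab-separated GTF with the attributes in column 9.
check_gtf_genes() {
  awk -F'\t' '
    # skip empty lines and comment lines
    /^$/ || /^#/ { next }
    # skip records without a gene_id attribute
    $9 !~ /gene_id "/ { next }
    {
      # extract gene_id
      gene_id = $9
      sub(/.*gene_id "/, "", gene_id)
      sub(/".*/, "", gene_id)
      seen[gene_id] = 1
      # remember which genes have a gene record and a gene_type attribute
      if ($3 == "gene") {
        has_gene[gene_id] = 1
        if ($9 ~ /gene_type "/) has_type[gene_id] = 1
      }
    }
    END {
      for (gene_id in seen) {
        if (!(gene_id in has_gene))
          print gene_id "\tmissing gene record"
        else if (!(gene_id in has_type))
          print gene_id "\tmissing gene_type attribute"
      }
    }
  ' "$1" | sort
}
```

Calling check_gtf_genes YOUR_CUSTOM_GTF_FILE.gtf prints one line per problematic gene ID. No output means that every gene ID has a gene record with a gene_type attribute.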
I am currently working on the next release of Arriba, which features more robust parsing of GTF files. This release will be able to deal with the GTF file you gave as an example out of the box. Until then, you can use the following awk script to add the missing records and attributes:
awk -v FS='\t' -v OFS='\t' '!(/^$/){
    # extract gene_id
    gene_id=$9
    sub(/.*gene_id "/, "", gene_id)
    sub(/".*/, "", gene_id);
    # determine boundaries of genes
    gene_contigs[gene_id]=$1
    gene_strands[gene_id]=$7
    if (!gene_starts[gene_id] || $4 < gene_starts[gene_id]) gene_starts[gene_id]=$4
    if (!gene_ends[gene_id] || $5 > gene_ends[gene_id]) gene_ends[gene_id]=$5
    # find out if gene is protein coding based on existence of "CDS" lines
    if ($3 == "CDS") {
        gene_type[gene_id]="gene_type \"protein_coding\""
    } else if (!(gene_id in gene_type)) {
        gene_type[gene_id]="gene_type \"non_coding\""
    }
    print
}
END{
    # print genes
    for (gene_id in gene_starts)
        print gene_contigs[gene_id],"unknown","gene",gene_starts[gene_id],gene_ends[gene_id],".",gene_strands[gene_id],".","gene_id \""gene_id"\"; gene_name \""gene_id"\"; "gene_type[gene_id]";"
}' YOUR_CUSTOM_GTF_FILE.gtf > FIXED_GTF_FILE.gtf
Two more side notes:
- It seems you are using RefSeq annotation. Depending on how the GTF file was prepared, you might need to run the above awk script on the GTF, too, in order to add missing records of type gene. In addition, there are other minor issues with RefSeq annotation, such as gene IDs being reused for multiple copies of a gene, which might cause events to not be reported for such genes. For this reason, I generally recommend using GENCODE annotation. Moreover, the sensitivity improves with GENCODE annotation due to better annotation of splice-sites.
- Please note that you must add the contig containing the viral DNA sequence to the list of interesting contigs using the parameter -I of arriba or else arriba will not report any events relating to your custom contig.
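As for the reused gene IDs mentioned in the first note, a rough way to spot candidates is to look for gene IDs that occur on more than one contig or strand. The following is a sketch under that assumption only: the function name find_reused_gene_ids is made up for illustration, and copies of a gene on the same contig and strand are not detected.

```shell
# Hypothetical helper (not part of Arriba): list gene IDs that appear on
# more than one contig/strand combination in a tab-separated GTF file.
find_reused_gene_ids() {
  awk -F'\t' '
    # skip empty lines and comment lines
    /^$/ || /^#/ { next }
    # skip records without a gene_id attribute
    $9 !~ /gene_id "/ { next }
    {
      # extract gene_id
      gene_id = $9
      sub(/.*gene_id "/, "", gene_id)
      sub(/".*/, "", gene_id)
      # count distinct contig/strand combinations per gene_id
      location = $1 ":" $7
      if (!((gene_id, location) in seen)) {
        seen[gene_id, location] = 1
        count[gene_id]++
      }
    }
    END {
      for (gene_id in count)
        if (count[gene_id] > 1) print gene_id
    }
  ' "$1" | sort
}
```

Calling find_reused_gene_ids YOUR_ANNOTATION.gtf prints each suspicious gene ID once.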
Let me know if you need further help.
Regards,
Sebastian
from arriba.
Hi Sebastian,
Thank you for your reply.
I will try out the changes you suggested and see if it works.
On a side note, the GTF file was made by myself, not retrieved from any databases. So if I changed the RefSeq to GENCODE, would it affect the GTF file? I initially put RefSeq because I was imitating the GTF file I was using previously. However, now I have recently changed to using the GENCODE version for hg19.
Thank you for your assistance and work! I really appreciate it!
from arriba.
It does not matter at all what the second column of your GTF file says (GENCODE or RefSeq or any other value). This column is not interpreted at all. I just assumed that apart from your custom annotation you usually use RefSeq annotation, because in your example it said "RefSeq".
What really matters are the slight differences between RefSeq and GENCODE annotation, such as having records of type gene and gene_type attributes.
from arriba.
Thank you @suhrig for your comment.
I tried out the changes as per your suggestion and it works.
Thanks again!
On a side note, are there any plans to update the algorithm for filtering multi-mappers? As far as I am aware, the recent STAR update allows for multi-mapping chimeras, although only in the Chimeric.out.junction file for now.
from arriba.
> I tried out the changes as per your suggestion and it works.
Glad to hear! I'm closing this issue as resolved, then.
> are there any plans on updating the algorithm for filtering multi-mappers?
Yes, definitely! But this will only be possible if STAR reports multi-mapping reads in the Chimeric.out.sam file.
from arriba.