
Comments (5)

suhrig commented on June 15, 2024

Dear nicodemus88,

Arriba requires genes to be explicitly defined in the GTF file. If you have records of type transcript, exon or CDS, which reference a gene via the gene_id attribute, then there must be a record of type gene introducing a gene with this ID.

Moreover, Arriba requires that records of type gene specify the attribute gene_type as protein_coding for protein-coding genes or some other value for non-coding genes.

Given your example, the following three lines are missing:

1       unknown gene    456     9636    .       +       .       gene_id "HIV_vpr"; gene_name "HIV_vpr"; gene_type "protein_coding";
1       unknown gene    456     9636    .       +       .       gene_id "HIV_vpu"; gene_name "HIV_vpu"; gene_type "protein_coding";
1       unknown gene    456     9636    .       +       .       gene_id "HIV_vif"; gene_name "HIV_vif"; gene_type "protein_coding";
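Both requirements can be checked before running Arriba. The following is only a sketch under stated assumptions: the function name check_gtf_genes is hypothetical and not part of Arriba, and it assumes a tab-separated GTF file whose ninth column contains a gene_id attribute. It lists every gene ID that has no record of type gene, or whose gene record has no gene_type attribute.

```shell
# Hypothetical helper (not part of Arriba): list gene IDs that lack a
# usable record of type "gene".
# Usage: check_gtf_genes annotation.gtf
check_gtf_genes() {
  awk -F '\t' '
    # skip comments, empty lines and records without a gene_id attribute
    /^#/ || /^$/ || $9 !~ /gene_id "/ { next }
    {
      # extract the gene_id attribute from column 9
      gene_id = $9
      sub(/.*gene_id "/, "", gene_id)
      sub(/".*/, "", gene_id)
      seen[gene_id] = 1

      # remember which genes are explicitly defined and carry gene_type
      if ($3 == "gene") {
        has_gene[gene_id] = 1
        if ($9 ~ /gene_type "/) has_type[gene_id] = 1
      }
    }
    END {
      for (gene_id in seen) {
        if (!(gene_id in has_gene))
          print gene_id "\tno record of type gene"
        else if (!(gene_id in has_type))
          print gene_id "\tgene record lacks gene_type"
      }
    }
  ' "$1" | sort
}
```

An empty result means that every referenced gene_id is introduced by a gene record with a gene_type attribute.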

I am currently working on the next release of Arriba, which features more robust parsing of GTF files. This release will be able to deal with the GTF file you gave as an example out of the box. Until then, you can use the following awk script to add the missing records and attributes:

awk -v FS='\t' -v OFS='\t' '
# pass comment lines (such as GTF headers) through unchanged
/^#/ { print; next }
!/^$/ {

  # extract gene_id
  gene_id=$9
  sub(/.*gene_id "/, "", gene_id)
  sub(/".*/, "", gene_id);
  
  # determine boundaries of genes
  gene_contigs[gene_id]=$1
  gene_strands[gene_id]=$7
  if (!gene_starts[gene_id] || $4 < gene_starts[gene_id]) gene_starts[gene_id]=$4
  if (!gene_ends[gene_id] || $5 > gene_ends[gene_id]) gene_ends[gene_id]=$5
  
  # find out if gene is protein coding based on existence of "CDS" lines
  if ($3 == "CDS") {
    gene_type[gene_id]="gene_type \"protein_coding\""
  } else if (!(gene_id in gene_type)) {
    gene_type[gene_id]="gene_type \"non_coding\""
  }
  
  print
}
END{
  # print genes
  for (gene_id in gene_starts)
    print gene_contigs[gene_id],"unknown","gene",gene_starts[gene_id],gene_ends[gene_id],".",gene_strands[gene_id],".","gene_id \""gene_id"\"; gene_name \""gene_id"\"; "gene_type[gene_id]";"
}' YOUR_CUSTOM_GTF_FILE.gtf > FIXED_GTF_FILE.gtf

Two more side notes:

  • It seems you are using RefSeq annotation. Depending on how the RefSeq GTF file was prepared, you might need to run the above awk script on it, too, in order to add missing records of type gene. In addition, RefSeq annotation has other minor issues, such as gene IDs being reused for multiple copies of a gene, which might cause events involving such genes to go unreported. For this reason, I generally recommend using GENCODE annotation. Moreover, sensitivity improves with GENCODE annotation because splice sites are annotated better.

  • Please note that you must add the contig containing the viral DNA sequence to the list of interesting contigs using the parameter -I of arriba or else arriba will not report any events relating to your custom contig.
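Regarding the first side note, reused gene IDs can be spotted with a rough heuristic. This is a sketch, not Arriba's own logic: the function name find_reused_gene_ids is hypothetical, and it assumes that a gene ID appearing on more than one contig or strand indicates multiple copies. Copies located on the same contig and strand are not detected by this check.

```shell
# Hypothetical helper (not part of Arriba): list gene IDs that occur on
# more than one contig/strand combination.
# Usage: find_reused_gene_ids annotation.gtf
find_reused_gene_ids() {
  awk -F '\t' '
    # skip comments, empty lines and records without a gene_id attribute
    /^#/ || /^$/ || $9 !~ /gene_id "/ { next }
    {
      # extract the gene_id attribute from column 9
      gene_id = $9
      sub(/.*gene_id "/, "", gene_id)
      sub(/".*/, "", gene_id)

      # count each distinct contig/strand combination once per gene
      location = $1 " " $7
      if (!((gene_id, location) in seen)) {
        seen[gene_id, location] = 1
        count[gene_id]++
      }
    }
    END {
      for (gene_id in count)
        if (count[gene_id] > 1) print gene_id
    }
  ' "$1" | sort
}
```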

Let me know if you need further help.

Regards,
Sebastian

from arriba.

nicodemus88 commented on June 15, 2024

Hi Sebastian,

Thank you for your reply.

I will try out the changes you suggested and see if it works.

On a side note, I made the GTF file myself; it was not retrieved from any database. So if I changed RefSeq to GENCODE, would it affect the GTF file? I initially put RefSeq because I was imitating the GTF file I was using previously. However, I have recently switched to the GENCODE version for hg19.

Thank you for your assistance and work! I really appreciate it!


suhrig commented on June 15, 2024

It does not matter at all what the second column of your GTF file says (GENCODE, RefSeq, or any other value). This column is not interpreted at all. I just assumed that, apart from your custom annotation, you usually use RefSeq annotation, because your example said "RefSeq".

What really matters are the slight differences between RefSeq and GENCODE annotation, such as having records of type gene and gene_type attributes.
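A quick tally of the feature types in column 3 shows whether a given GTF file has these properties. The sketch below makes some assumptions: summarize_gtf_features is a hypothetical name, and the input is expected to be a tab-separated GTF file. It counts records per feature type and, separately, the gene records that carry a gene_type attribute.

```shell
# Hypothetical helper (not part of Arriba): count records per feature type
# (column 3) and the gene records that carry a gene_type attribute.
# Usage: summarize_gtf_features annotation.gtf
summarize_gtf_features() {
  awk -F '\t' -v OFS='\t' '
    # skip comment and empty lines
    /^#/ || /^$/ { next }
    # tally every feature type
    { count[$3]++ }
    # additionally tally gene records that specify gene_type
    $3 == "gene" && $9 ~ /gene_type "/ { count["gene with gene_type"]++ }
    END {
      for (feature in count) print feature, count[feature]
    }
  ' "$1" | sort
}
```

If the output has no line for gene, or the count for "gene with gene_type" is lower than the count for gene, the file lacks the records or attributes described above.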


nicodemus88 commented on June 15, 2024

Thank you @suhrig for your comment.
I tried out the changes as per your suggestion and it works.
Thanks again!

On a side note, are there any plans on updating the algorithm for filtering multi-mappers? As far as I am aware, the recent STAR update allows multi-mapping chimeras, although only in the Chimeric.out.junction file for now.


suhrig commented on June 15, 2024

> I tried out the changes as per your suggestion and it works.

Glad to hear! I'm closing this issue as resolved, then.

> are there any plans on updating the algorithm for filtering multi-mappers?

Yes, definitely! But this will only be possible if STAR reports multi-mapping reads in the Chimeric.out.sam file.

