
Comments (9)

afinit commented on September 28, 2024

Since posting, I went back and tried a couple of previous versions. The last version to work is v1.1.2; when I tried running v1.2.0, I got the same error as reported above. I guess I'll stick with the older version for now, but it would be nice to know what changed in the GFFReader that leaves stringtie unable to find transcripts in my gff files.

from stringtie.

gpertea commented on September 28, 2024

The example you've shown here seems to be just a single-exon transcript. The format seems to be GFF3-like, but there are no well-formed child features (exons/CDS) belonging to a transcript parent feature. Indeed, I have tightened the requirements for GFF3 parsing in the last version, so a parent transcript feature is now expected, with well-defined child feature(s) (i.e. having a Parent attribute with the same value as the ID attribute of its parent). I did that because the parsing was way too loose before, leading to confusion and loss of transcript data in some cases. All major annotation sources nowadays use either this kind of GFF3 format (with matching ID/Parent attribute values), or the older GTF format with all features using just transcript_id.
Assuming that all tabs are where they should be (even though they look like spaces here), you could try a combination of the 3rd line followed by the 4th line as shown in your attempts, in order to represent this transcript, but make sure you replace ID= with Parent= in the CDS feature line, like this:

Xam668_contig195        Prodigal:2.6    mRNA    700     1398    .       -       .     ID=xam668_04238;gene=xam668_04238;inference=ab initio prediction:Prodigal:2.6,similar to AA sequence:RefSeq:WP_007974205.1;locus_tag=xam668_04238;product=partition protein
Xam668_contig195        Prodigal:2.6    CDS     700     1398    .       -       0     Parent=xam668_04238
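For a whole file of such records, the same rewrite could be scripted. The following is only a rough sketch, not part of stringtie: the function name add_parent is hypothetical, and it assumes tab-separated columns, one stand-alone CDS line per gene, and that the CDS carries an ID attribute.

```python
from typing import List


def add_parent(line: str) -> List[str]:
    """Turn one stand-alone CDS line into an mRNA parent plus a CDS child.

    Hypothetical helper: lines that are not 9-column CDS records, or that
    already have a Parent attribute, are returned unchanged.
    """
    text = line.rstrip("\n")
    cols = text.split("\t")
    if len(cols) != 9 or cols[2] != "CDS":
        return [text]
    # Parse "key=value" pairs from the 9th column.
    attrs = dict(p.split("=", 1) for p in cols[8].split(";") if "=" in p)
    if "Parent" in attrs or "ID" not in attrs:
        return [text]
    # Parent line: same coordinates, type mRNA, no phase, all attributes kept.
    parent = cols[:2] + ["mRNA"] + cols[3:7] + [".", cols[8]]
    # Child line: the original CDS, pointing back at the parent's ID.
    child = cols[:8] + ["Parent=" + attrs["ID"]]
    return ["\t".join(parent), "\t".join(child)]
```

Applying add_parent to every line of the file and writing out the returned lines in order would give the two-line form shown above for each gene.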

There is some info (and a simple example) about the minimal GFF3 format expected by stringtie and other programs, that you might find helpful, on this page: http://ccb.jhu.edu/software/stringtie/gff.shtml#format


afinit commented on September 28, 2024

I'm currently working with a bacterial genome, and since these genes rarely have introns, I don't need parent lines. This file was created by a bacterial genome de novo annotation program called prokka. So are you saying that there is no way I can run this new version without creating parent lines for my gff? Or are you saying that I can use the GTF format for one-line genes?


afinit commented on September 28, 2024

It seems a more informative error message would have addressed this issue. Would it be difficult to throw a more detailed warning or error describing what in particular is causing the problem? For instance: "Error: missing Parent attribute".
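Until the tool reports this itself, a small pre-check run before stringtie could point at the offending lines. This is only a sketch and does not reflect how stringtie parses GFF3: the function name missing_parent is made up, and it assumes tab-separated columns and that only CDS and exon features need a Parent attribute.

```python
def missing_parent(lines):
    """Return one message per CDS/exon line that has no Parent attribute.

    Hypothetical stand-alone check; comment lines and anything that does
    not have exactly 9 tab-separated columns are skipped.
    """
    problems = []
    for number, line in enumerate(lines, start=1):
        cols = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(cols) != 9:
            continue
        # Collect the attribute names from the 9th column.
        keys = {part.split("=", 1)[0].strip() for part in cols[8].split(";")}
        if cols[2] in ("CDS", "exon") and "Parent" not in keys:
            problems.append(
                "line {}: {} feature has no Parent attribute".format(number, cols[2])
            )
    return problems
```

Calling missing_parent(open("annotation.gff")) (the file name is a placeholder) returns an empty list when every CDS and exon line names a parent.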


gpertea commented on September 28, 2024

Yes, in this case you might find it easier to use the GTF format instead, with just the CDS (or exon) features and transcript_id - but don't forget the double quotes for this format! It would look something like this:

Xam668_contig195        Prodigal:2.6    CDS   700     1398    .       -       0     transcript_id "xam668_04238"; gene_id "xam668_04238"; inference "ab initio prediction:Prodigal:2.6,similar to AA sequence:RefSeq:WP_007974205.1"; locus_tag "xam668_04238"; product "partition protein"

Please keep in mind that stringtie does not care about any other attributes there, so if you really want to save space, just transcript_id would be enough.
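A conversion along these lines could be scripted roughly as follows. This is a sketch under assumptions, not a tested converter: the function names are hypothetical, the input is assumed to be tab-separated with an ID attribute on each CDS line, and the ID is reused for both transcript_id and gene_id, as in the example above.

```python
def gff3_cds_to_gtf(line):
    """Convert one GFF3 CDS line to a GTF line, or return None to skip it.

    Hypothetical helper: only transcript_id and gene_id are written, both
    taken from the GFF3 ID attribute; all other attributes are dropped.
    """
    cols = line.rstrip("\n").split("\t")
    if line.startswith("#") or len(cols) != 9 or cols[2] != "CDS":
        return None
    attrs = dict(p.split("=", 1) for p in cols[8].split(";") if "=" in p)
    tid = attrs.get("ID")
    if tid is None:
        return None
    # GTF attribute values must be double-quoted.
    cols[8] = 'transcript_id "{0}"; gene_id "{0}";'.format(tid)
    return "\t".join(cols)


def convert(lines):
    """Yield GTF lines for every convertible CDS line in an iterable."""
    for line in lines:
        gtf = gff3_cds_to_gtf(line)
        if gtf is not None:
            yield gtf
```

Lines without exactly nine tab-separated columns, such as a FASTA section appended to the annotation, are skipped.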

(snipped wrong opinion comment)


gpertea commented on September 28, 2024

As a side note, I find it intriguing that you're using something like StringTie, which is mainly a sophisticated isoform assembler/resolver, on bacterial genomes and transcripts. I guess you are using it mostly for the abundance estimation, but I am pretty sure there are more suitable tools out there for doing this on prokaryotic genomes.


gpertea commented on September 28, 2024

OK, it seems I was plain wrong in my assumption that matching ID/Parent values are somehow required by the GFF3 format. I just saw the Bacteriophage f1 example at http://www.sequenceontology.org/gff3.shtml and it is exactly as you described prokka writing it: just a CDS with an ID attribute should be enough to represent the whole gene. I guess I've been focusing on multi-exon transcripts for so long that I forgot such a GFF3 record is perfectly OK.

So I am going to mark this down as a GFF3 parsing issue (a regression bug!) that should be fixed in my code. Thank you for bringing this problem to my attention.


afinit commented on September 28, 2024

Thank you for addressing my issues. I'll use the older version of stringtie until I get around to writing a script to reformat my gff files.


gpertea commented on September 28, 2024

This should have been fixed in an earlier commit and should make it into the next release.

