GithubHelp home page GithubHelp logo

gpertea / gffcompare Goto Github PK

View Code? Open in Web Editor NEW
193.0 14.0 32.0 511 KB

classify, merge, tracking and annotation of GFF files by comparing to a reference annotation GFF

License: MIT License

Makefile 1.22% C++ 96.92% Shell 1.03% Perl 0.83%

gffcompare's Introduction

GffCompare

  • compare and evaluate the accuracy of RNA-Seq transcript assemblers (Cufflinks, Stringtie).
  • collapse (merge) duplicate transcripts from multiple GTF/GFF3 files (e.g. resulted from assembly of different samples)
  • classify transcripts from one or multiple GTF/GFF3 files as they relate to reference transcripts provided in a annotation file (also in GTF/GFF3 format)

More details and usage examples can be found in the paper: DOI: 10.12688/f1000research.23297.1 which can be also used to cite this software.

The official documentation and download packages for this utility can be found online here: http://ccb.jhu.edu/software/stringtie/gffcompare.shtml

For more information about the GFF3/GTF file formats expected by GffCompare please refer to the available online documentation -- a quick review can be found at: http://ccb.jhu.edu/software/stringtie/gff.shtml

The original version of this program was distributed as part of the Cufflinks suite, under the name "CuffCompare".

The overall functionality and most of the options of CuffCompare are still supported by GffCompare, while new functionality is being added to GffCompare only, as it is the program which is actively maintained.

An example of a new feature of GffCompare (compared to its predecessor CuffCompare) is this: when a single query GTF/GFF file is given as input for analysis, along with a reference annotation (-r option), GffCompare switches into annotation mode and it generates a .annotated.gtf file instead of the .combined.gtf produced by CuffCompare with the same parameters. This file has the same general format as CuffCompare's .combined.gtf file (with "class codes" assigned to transcripts as per their relationship with the matching/overlapping reference transcript), but the original transcript IDs are preserved, so GffCompare can thus be used as a simple way of annotating a set of transcripts.

Another important difference is that the input transcripts are by default no longer discarded when they are found to be "intron redundant", i.e. contained within other, longer isoforms. CuffCompare had the -G option to prevent collapsing of such intron redundant isoforms into their longer "containers", but GffCompare has made this the default mode of operation (hence the -G option is no longer needed and is simply ignored when given). However please note that "matching" transcripts with fully identical intron chains (i.e. with the same exact intron coordinates, hence the same intron-exon structure except the terminal exon ends) are still discarded when GTF/GFF files are loaded.

trmap

Some pipelines can produce a very large number of potential or partial transcripts ("transfrags"), for example when merging the transcript assemblies from tens or hundreds of RNA-Seq experiments assemblies with stringtie --merge. Running GffCompare on such large GTF/GFF files could be slow and memory intensive (because GffCompare always loads the whole transcript data in memory for clustering and other analysis). One may only be interested to know if and how these many transcripts overlap the reference annotation, and further analyze only those which have specific types of overlap with the reference annotation transcripts (or none at all, i.e. if they do not overlap any of it, which may be the case for putative novel transcripts). That's where the trmap utility comes in, as this program reports, for each query transcript, all the reference overlaps found, along with their classification codes as described in the GffCompare documentation. The main feature of 'trmap' is that it allows streaming of a very large file of query transcripts (in GFF or BED format) to be checked and classified against a reference annotation file (again, in GFF or BED format).

trmap ("Tanscript vs. Refererence MAPping") first loads the reference annotation file in memory as an interval tree and then streams the query file (which can be also provided at stdin) while checking and reporting any overlaps found, and classifies the relationship with reference transcripts using a (subset of) the "class codes" like those assigned by gffcompare (see http://ccb.jhu.edu/software/stringtie/gffcompare.shtml).

The streaming input GFF query input file to be streamed must be well-formed -- i.e. exons MUST be grouped together by transcript ID and immediately follow their parent feature if present. (for BED this is always the case due to the fact that exons are embedded in the same line).

Building from source

Steps for building this package from github:

  cd /some/build/dir
  git clone https://github.com/gpertea/gffcompare
  cd gffcompare
  make release

If you downloaded the standalone source package gffread-*.tar.gz then just unpack that, change to the unpacked directory and run make release there.

This should build the gffcompare and trmap binary executables in the current directory.

gffcompare's People

Contributors

alevar avatar gpertea avatar roryk avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

gffcompare's Issues

Memory corruption while running gffcompare

Hello!

I have been following the procedures provided in the article (HISAT2, Pertea et al., 2016) and successfully merged 34 single-end RNA-seq samples into a stringtie_merged.gtf. I am having trouble running the gffcompare program where it simply crashes out. The error looks like this:

gffcompare -r /home/slab/s6ma/archive/bin/hisat_indices/gene_annotations/Homo_sapiens.GRCh38.91.gtf -G -o merged stringtie_merged.gtf
*** glibc detected *** gffcompare: malloc(): memory corruption: 0x000000000c3ea820 ***
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x76e46)[0x7f69f4576e46]
/lib/x86_64-linux-gnu/libc.so.6(+0x79eb3)[0x7f69f4579eb3]
/lib/x86_64-linux-gnu/libc.so.6(__libc_malloc+0x70)[0x7f69f457bcd0]
gffcompare[0x422440]
gffcompare[0x42a379]
gffcompare[0x4406fc]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xfd)[0x7f69f451eeed]
gffcompare[0x401fe9]
======= Memory map: ========
00400000-00465000 r-xp 00000000 00:27 61384051 /home/slab/s6ma/archive/src/gffcompare-0.10.2.Linux_x86_64/gffcompare
00664000-00666000 rw-p 00064000 00:27 61384051 /home/slab/s6ma/archive/src/gffcompare-0.10.2.Linux_x86_64/gffcompare
00666000-00680000 rw-p 00000000 00:00 0
020ae000-119d8000 rw-p 00000000 00:00 0 [heap]
7f69f0000000-7f69f0021000 rw-p 00000000 00:00 0
7f69f0021000-7f69f4000000 ---p 00000000 00:00 0
7f69f42e8000-7f69f42fd000 r-xp 00000000 00:11 1060115 /lib/x86_64-linux-gnu/libgcc_s.so.1
7f69f42fd000-7f69f44fd000 ---p 00015000 00:11 1060115 /lib/x86_64-linux-gnu/libgcc_s.so.1
7f69f44fd000-7f69f44fe000 rw-p 00015000 00:11 1060115 /lib/x86_64-linux-gnu/libgcc_s.so.1
7f69f4500000-7f69f4682000 r-xp 00000000 00:11 3164071 /lib/x86_64-linux-gnu/libc-2.13.so
7f69f4682000-7f69f4881000 ---p 00182000 00:11 3164071 /lib/x86_64-linux-gnu/libc-2.13.so
7f69f4881000-7f69f4885000 r--p 00181000 00:11 3164071 /lib/x86_64-linux-gnu/libc-2.13.so
7f69f4885000-7f69f4886000 rw-p 00185000 00:11 3164071 /lib/x86_64-linux-gnu/libc-2.13.so
7f69f4886000-7f69f488b000 rw-p 00000000 00:00 0
7f69f4890000-7f69f4911000 r-xp 00000000 00:11 3164075 /lib/x86_64-linux-gnu/libm-2.13.so
7f69f4911000-7f69f4b10000 ---p 00081000 00:11 3164075 /lib/x86_64-linux-gnu/libm-2.13.so
7f69f4b10000-7f69f4b11000 r--p 00080000 00:11 3164075 /lib/x86_64-linux-gnu/libm-2.13.so
7f69f4b11000-7f69f4b12000 rw-p 00081000 00:11 3164075 /lib/x86_64-linux-gnu/libm-2.13.so
7f69f4b18000-7f69f4b38000 r-xp 00000000 00:11 3164068 /lib/x86_64-linux-gnu/ld-2.13.so
7f69f4d17000-7f69f4d37000 rw-p 00000000 00:00 0
7f69f4d37000-7f69f4d38000 r--p 0001f000 00:11 3164068 /lib/x86_64-linux-gnu/ld-2.13.so
7f69f4d38000-7f69f4d39000 rw-p 00020000 00:11 3164068 /lib/x86_64-linux-gnu/ld-2.13.so
7f69f4d39000-7f69f4d3a000 rw-p 00000000 00:00 0
7f69f4d3c000-7f69f4d41000 rw-p 00000000 00:00 0
7fffd4764000-7fffd4785000 rw-p 00000000 00:00 0 [stack]
7fffd4800000-7fffd4801000 r-xp 00000000 00:00 0 [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]

Does anyone know how to fix this?
I have tried to re-install the gffcompare program but this error still occurs. I have also tried gffcompare on unmerged gtf files but still would not work. So I am highly certain that the issue is within the gffcompare program. Thank you

no refmap/tmap generated from gffcompare?

Hello,

I've run gffcompare with the following command:
gffcompare -r reference.gtf -R -C -K -o sample1 query.gff

It generated sample1.combined.gtf, sample1.redundant.gtf, sample1.tracking, sample1.loci, sampe1.stats . However, according to this page, shouldn't I also get sample1.refmap and sample1.tmap? I can't find them, even when I drop "-C" and "-K" option.

The reference gtf file I used looks like this:

chr1 BestRefSeq exon 11874 12227 . + . transcript_id "NR_046018"; gene_id "100287102";
chr1 BestRefSeq exon 12613 12721 . + . transcript_id "NR_046018"; gene_id "100287102";
chr1 BestRefSeq exon 13221 14409 . + . transcript_id "NR_046018"; gene_id "100287102";
chr1 BestRefSeq exon 14362 14829 . - . transcript_id "NR_024540"; gene_id "653635";

and the sample1.gff looks like this:

chr1 pinfish mRNA 14421 195419 0 - . gene_id "847e65b5-7e7b-428b-936e-6a376f435050"; transcript_id "1afec315-0f47-448d-bcf1-fee886bdfd83|3";
chr1 pinfish exon 14421 14829 0 - . transcript_id "1afec315-0f47-448d-bcf1-fee886bdfd83|3";
chr1 pinfish exon 14970 15038 0 - . transcript_id "1afec315-0f47-448d-bcf1-fee886bdfd83|3";
chr1 pinfish exon 186317 186469 0 - . transcript_id "1afec315-0f47-448d-bcf1-fee886bdfd83|3";

Should I modify my reference or query file to get the refmap/tmap? Any advice would be really helpful!

Warning: reference transcript has undetermined strand, discarded.

Error while executing the below command
gffcompare -V -R -r merged.gtf -o gffcompare_output -i gtf_list
Error happen: Warning: reference transcript has undetermined strand, discarded.

merged.gtf comes form the output of stringtie --merge
gtf_list is a list of gtf files of 12 samples which are input file of stringtie --merge

Tagged Versions?

It would be great if you could identify tagged releases of gffcompare and gclib that will work to build this, as I don't want to use an untagged version on our compute cluster. Also, for what it's worth, adding a flag to identify where gclib is would definitely beat hard coding "../gclib".

Why the total transcripts in the .tmp file was less than the total transcripts in the input .gtf file

Hi,
I am taking a few stringtie assemblies of human single-cell RNAseq data and annotating them back against Gencode v27 gtf files using gffcompare. I have noticed some differences between the total number of transcripts in the one input.gtf and the output file of .tmap after I used the many default modes of gffcompare.
The command is used as follows:
gffcompare-0.10.1.Linux_x86_64/gffcompare -R -r /ReferenceAnnotation/gencode.v27.annotation.gtf -G
-o gffResultPrefix stringtie_result/SRR.gtf
I have got some information about the differences between the "GffCompare" and "CuffCompare" from the website at https://github.com/gpertea/gffcompare, as follows:
Another important difference is that the input transcripts are no longer discarded when they are found to be "intron redundant", i.e. contained within other, longer isoforms. CuffCompare had the -G option to prevent collapsing of such intron redundant isoforms into their longer "containers", but GffCompare has made this the default mode of operation (hence the -G option is no longer needed and is simply ignored when given).
Maybe I am missing something obvious, so could you please give advice on how I can run the newest version of gffcompare to preserve the missing transcripts?

gffcompare shown wrong result when using strict-match

Hello,
Recently, I was running the GFFcompare and found some confusing results.
The query file has one single-exon transcript which is slightly different from the reference file at the boundary, and the query file has also one multiple-exon transcript which is slightly different from the reference file at the boundary. When I using GFFCompare with "--strict-match -e 0 -d 0", I was expected that the transcript level of two data should be 0% due to the error at the boundary, but the results have shown 100%. What are the possible reasons this might happens?

Thank you
grassking100

gffcmp.stats

# gffcompare v0.11.5 | Command line was:
#gffcompare --strict-match --chr-stats --debug -T -e 0 -d 0 --no-merge -r subanswer.gff3 subpredict.gff3
#
#> Genomic sequence: region_0385 
#     Query mRNAs :       1 in       1 loci  (0 multi-exon transcripts)
#            (0 multi-transcript loci, ~1.0 transcripts per locus)
# Reference mRNAs :       1 in       1 loci  (0 multi-exon)
# Super-loci w/ reference transcripts:        1
#-----------------| Sensitivity | Precision  |
        Base level:   100.0     |    99.8    |
        Exon level:     0.0     |     0.0    |
  Transcript level:   100.0     |   100.0    |
       Locus level:   100.0     |   100.0    |

     Matching intron chains:       0
       Matching transcripts:       1
              Matching loci:       1

          Missed exons:       0/1	(  0.0%)
           Novel exons:       0/1	(  0.0%)
           Missed loci:       0/1	(  0.0%)
            Novel loci:       0/1	(  0.0%)
#> Genomic sequence: region_0333 
#     Query mRNAs :       1 in       1 loci  (1 multi-exon transcripts)
#            (0 multi-transcript loci, ~1.0 transcripts per locus)
# Reference mRNAs :       1 in       1 loci  (1 multi-exon)
# Super-loci w/ reference transcripts:        1
#-----------------| Sensitivity | Precision  |
        Base level:   100.0     |    99.9    |
        Exon level:    66.7     |    66.7    |
      Intron level:   100.0     |   100.0    |
Intron chain level:   100.0     |   100.0    |
  Transcript level:   100.0     |   100.0    |
       Locus level:   100.0     |   100.0    |

     Matching intron chains:       1
       Matching transcripts:       1
              Matching loci:       1

          Missed exons:       0/3	(  0.0%)
           Novel exons:       0/3	(  0.0%)
        Missed introns:       0/2	(  0.0%)
         Novel introns:       0/2	(  0.0%)
           Missed loci:       0/1	(  0.0%)
            Novel loci:       0/1	(  0.0%)

#= Summary for dataset: subpredict.gff3 
#     Query mRNAs :       2 in       2 loci  (1 multi-exon transcripts)
#            (0 multi-transcript loci, ~1.0 transcripts per locus)
# Reference mRNAs :       2 in       2 loci  (1 multi-exon)
# Super-loci w/ reference transcripts:        2
#-----------------| Sensitivity | Precision  |
        Base level:   100.0     |    99.8    |
        Exon level:    50.0     |    50.0    |
      Intron level:   100.0     |   100.0    |
Intron chain level:   100.0     |   100.0    |
  Transcript level:   100.0     |   100.0    |
       Locus level:   100.0     |   100.0    |

     Matching intron chains:       1
       Matching transcripts:       2
              Matching loci:       2

          Missed exons:       0/4	(  0.0%)
           Novel exons:       0/4	(  0.0%)
        Missed introns:       0/2	(  0.0%)
         Novel introns:       0/2	(  0.0%)
           Missed loci:       0/2	(  0.0%)
            Novel loci:       0/2	(  0.0%)

 Total union super-loci across all input datasets: 2 
2 out of 2 consensus transcripts written in gffcmp.annotated.gtf (0 discarded as redundant)

subanswer.gff3

##gff-version 3
region_0385	.	gene	500	1432	.	+	.	ID=region_0769_gene_1;Parent=region_0769
region_0385	.	mRNA	500	1432	.	+	.	ID=region_0769_gene_1_mRNA;Parent=region_0769_gene_1
region_0385	.	exon	500	1432	.	+	.	ID=region_0769_gene_1_mRNA_exon_1;Parent=region_0769_gene_1_mRNA
###
region_0333	.	gene	501	2878	4	-	.	ID=AT2G26870_gene
region_0333	.	mRNA	501	2878	4	-	.	ID=AT2G26870;Parent=AT2G26870_gene
region_0333	.	exon	501	1315	4	-	.	Parent=AT2G26870
region_0333	.	exon	1409	1781	4	-	.	Parent=AT2G26870
region_0333	.	exon	2383	2878	4	-	.	Parent=AT2G26870

subpredict.gff3

##gff-version 3
region_0385	.	gene	500	1434	.	+	.	ID=region_0769_gene_1;Parent=region_0769
region_0385	.	mRNA	500	1434	.	+	.	ID=region_0769_gene_1_mRNA;Parent=region_0769_gene_1
region_0385	.	exon	500	1434	.	+	.	ID=region_0769_gene_1_mRNA_exon_1;Parent=region_0769_gene_1_mRNA
###
region_0333	.	gene	499	2878	.	-	.	ID=region_0666_gene_1;Parent=region_0666
region_0333	.	mRNA	499	2878	.	-	.	ID=region_0666_gene_1_mRNA;Parent=region_0666_gene_1
region_0333	.	exon	499	1315	.	-	.	ID=region_0666_gene_1_mRNA_exon_3;Parent=region_0666_gene_1_mRNA
region_0333	.	exon	1409	1781	.	-	.	ID=region_0666_gene_1_mRNA_exon_2;Parent=region_0666_gene_1_mRNA
region_0333	.	exon	2383	2878	.	-	.	ID=region_0666_gene_1_mRNA_exon_1;Parent=region_0666_gene_1_mRNA

Re: Deleted

Deleted, put comment in under wrong username

-G option no longer exists

This option is listed both in the current README.md as an example, as well as in Box 1 of the Nature Protocols paper (though with different meanings, it seems). However, gffcompare v0.9.7 does not list -G in the usage statement and does not accept it as input.

An explanation (and/or alternative) in the README would be appreciated!

Also (off topic) it's not instantly clear what's in the 6 different output files. A quick primer would save a new user a lot of time.

Apart from that - where has this tool been all my life?! So useful, thanks!

question about code "="

I always has one doubt for the code "=", if the query is an exact match of intron chain, but the last exon is longer than the reference, why is the query not an alternative splice.

gffcompare run on stringtie --merge gtf file results in zeros in the summary text

Hello,

I have a problem of empty output for the
Each set of paired end sequence data were aligned using hisat2 -- according to the published protocol for string tie. And then stringtie run on them with a command similar to:

stringtie -B -G Homo_sapiens.GRCh38.88.gtf rnaseq1.bam -o rnaseq1.gtf
stringtie -B -G Homo_sapiens.GRCh38.88.gtf rnaseq2.bam -o rnaseq2.gtf
...
stringtie -B -G Homo_sapiens.GRCh38.88.gtf rnaseqn.bam -o rnaseqn.gtf

putting these files in a text file I call stringtiefilelist

rnaseq1.gtf
rnaseq2.gtf
...
rnaseqn.gtf

I run stringtie merge as follows:

stringtie --merge  -G Homo_sapiens.GRCh38.88.gtf  -o stringtie_merged.gtf stringtiefilelist

Each of these steps produces expected output. But then when I run gffcompare

gffcompare -r gencode.v.25.annotation.gtf -o stringtie_merged.txt stringtie_merged.gtf

I get empty output so to speak

#= Summary for dataset: stringtie_merged.gtf
#     Query mRNAs :  238442 in   34534 loci  (238442 multi-exon transcripts)
#            (21005 multi-transcript loci, ~6.9 transcripts per locus)
# Reference mRNAs :  171986 in   33152 loci  (171986 multi-exon)
# Super-loci w/ reference transcripts:        0
#-----------------| Sensitivity | Precision  |
        Base level:     0.0     |     0.0    |
        Exon level:     0.0     |     0.0    |
      Intron level:     0.0     |     0.0    |
Intron chain level:     0.0     |     0.0    |
  Transcript level:     0.0     |     0.0    |
       Locus level:     0.0     |     0.0    |

     Matching intron chains:       0
       Matching transcripts:       0
              Matching loci:       0

          Missed exons:  544338/544338  (100.0%)
           Novel exons:  634910/634910  (100.0%)
        Missed introns:  350157/350157  (100.0%)
         Novel introns:  398915/398915  (100.0%)
           Missed loci:   33152/33152   (100.0%)
            Novel loci:   34534/34534   (100.0%)

Is there perhaps a problem in the parameters I am using?

Do I need to convert the gtf file to a gff file? A successful run I had used as input a gff file. Perhaps this is the reason.

Question: How do you find novel transcripts using GFFcompare?

I am trying to find novel transcripts from an RNA-seq database. I tried to follow the protocol provided in the Nature Protocols paper that described the use of Stringtie; the paper also suggested using GFFcompare for comparing the assembled transcripts with the reference, to quantify and find the novel transcripts.

I followed the entire protocol, and this was the output of GFFcompare when comparing the reference GTF with that of one of the samples after mapping -

#= Summary for dataset: ./hisat2/ERR188044_chrX.gtf 
#-----------------| Sensitivity | Precision  |
        Base level:    51.7     |    79.5    |
        Exon level:    46.7     |    85.2    |
      Intron level:    47.2     |    93.9    |
Intron chain level:    31.2     |    64.4    |
  Transcript level:    31.6     |    52.3    |
       Locus level:    36.6     |    50.1    |

     Matching intron chains:     580
       Matching transcripts:     664
              Matching loci:     397

          Missed exons:    4395/8804	( 49.9%)
           Novel exons:     426/4874	(  8.7%)
        Missed introns:    3832/7946	( 48.2%)
         Novel introns:      83/3992	(  2.1%)
           Missed loci:     610/1086	( 56.2%)
            Novel loci:     273/795	( 34.3%)

 Total union super-loci across all input datasets: 749 
1270 out of 1270 consensus transcripts written in 188044_only.annotated.gtf (0 discarded as redundant)

In the paper however, the number of novel genes and transcripts is clearly mentioned -

enter image description here

So, I was wondering how they were able to calculate the number of novel transcripts and genes, from the data that GFFcompare provides (they even mention in the paper that they got it from GFFcompare). I see the numbers related to novel exons and novel introns, but how do I translate it to the number and identities of the novel transcripts/genes?

gff compare gives identical results for multiple files

Hi there,

I'm doing something weird but I can't find my mistake. I'm running gff compare v 0.9.8.Linux_x86_64.
I have 16 stringtie assemblies, but when I run gff compare on the individual assemblies, I get identical results in the .stats files. However, the file paths indicate that I have correctly provided different .gtf files. If I run multiple assemblies at once, the .stats file only contains the command run.

The reference genome is Atlantic salmon (Salmo salar). I'm also very skeptical that I have 100% sensitivity and precision.

Here are the contents of a stats file:

gffcompare v0.9.8 | Command line was:
./gffcompare -r /media/genetics/Data1/Jeff_workstation_1/subread-1.5.0-p1/GCF_000233375.1_ICSASG_v2_genomic.gff -o /media/genetics/Data1/Jeff_workstation_1/gffcompare-0.9.8.Linux_x86_64/compare/comp /media/genetics/Data1/Jeff_workstation_1/stringtie-1.3.1c.Linux_x86_64/outputs/assembled/11-B5690_S3_mRNA_ARGU1_NGSLib3b.gtf

= Summary for dataset: /media/genetics/Data1/Jeff_workstation_1/stringtie-1.3.1c.Linux_x86_64/outputs/assembled/11-B5690_S3_mRNA_ARGU1_NGSLib3b.gtf
Query mRNAs : 136272 in 81266 loci (111377 multi-exon transcripts)
(20690 multi-transcript loci, ~1.7 transcripts per locus)
Reference mRNAs : 136039 in 81267 loci (111148 multi-exon)
Super-loci w/ reference transcripts: 80037
-----------------| Sensitivity | Precision |
Base level: 100.0 | 100.0 |
Exon level: 100.0 | 100.0 |
Intron level: 99.5 | 99.5 |
Intron chain level: 100.0 | 99.8 |
Transcript level: 100.0 | 99.8 |
Locus level: 100.0 | 100.0 |

 Matching intron chains:  111148
   Matching transcripts:  136039
          Matching loci:   81267

      Missed exons:       0/562650	(  0.0%)
       Novel exons:       0/562955	(  0.0%)
    Missed introns:    2190/471248	(  0.5%)
     Novel introns:       0/471248	(  0.0%)
       Missed loci:       0/81267	(  0.0%)
        Novel loci:       0/81266	(  0.0%)

Total union super-loci across all input datasets: 81266

Only 1 output file, input GTF is overwritten (v0.9.9)

I'm using the command below to compare a merged transcript GTF file from StringTie with annotated UCSC transcripts. gffcompare (v0.9.9) completes but only produces one output file, test_gffcompare.stats, which is empty except for 2 header lines of information about the command call. Also, my input merged_transcripts.gtf was overwritten with the correct gene names, so it appears that gffcompare is working in terms of reference annotation matching. Is overwriting the input file correct behavior? Where are the other output files (e.g. test_gffcompare.combined.gtf, test_gffcompare.loci)? I basically followed the steps outlined here, and I cloned the gffcompare repo just this morning.

Thanks :]
Mallory

gffcompare -r UCSC_Main_Human_refGene.gtf -T -V -o test_gffcompare merged_transcripts.gtf

how to deal with falsely merged gene regions into one super-locus?

Hello,
I used gffcompare to merge two gtf files with the following command: gffcompare -i gtffiles.txt

However, it seems that there are a falsely merged gene regions that are in fact two different genes. Do you have suggestions to deal with this?
Is there a way to specify to only merge after they share at least 2 exons or at least X amount of overlapping base pairs?

I would be gratefull for your reply,
Kind regards,
Ruth

Fuzz matching of intron chains and exons

Hi Geo,

Thanks for a great tool!

I'm using this to compare various short and long read "assemblies" and it would be great if there was an option to allow a "fuzz" distance between intron and exon starts and ends.

I started looking at the code and noticed that you had something for this already:

//strictMatching=false; // really match *all* exon coords for '=' class code!

and also later in the file where you do the actual matching.

I'm guessing there was a good reason for commenting it out and was wondering what that might be?

I'd be interested in making my own changes to re-enable it for my work, but wanted to find out the gotchas that you might have encountered before diving in.

Thanks,
Chris

gffcompare giving 'NA's

I have ~ 2000 transcripts that don't get given a class code when running gff compare. I am comparing the gencode.gtf with my stringtie_merged.gtf.

I'm confused as when I produce a transcript count matrix (via Stringtie/prepDE.py) there are transcripts that have lots of counts but supposedly aren't present in stringtie_merged.gtf.
eg ENST00000632434.1 has lots of read counts for all samples and has 'NA' class code.
When I grep ENST00000632434.1 from stringtie_merged.gtf I get an MSTRG id:
chr5 StringTie transcript 116085026 116293286 1000 + . gene_id "MSTRG.32862"; transcript_id "ENST00000632434.1"; gene_name "COMMD10"; ref_gene_id "ENSG00000145781.8";

When I then look at the MSTRG.32862 transcript in my count matrix is has very low counts.
If ENST00000632434.1 is present in all samples (as indicated by the transcript count matrix) then why isn't it being given a class code??

Sensitivity lower for stringtie merged gff than for component files

I've merged couple of gff files with stringtie merge tool. The resulting gff has lower sensitivity than each of single gff files used to create merged file. I'm comparing all gff files to the same reference (human, ensembl). Both tools: stringtie and gffcompare are called with default options. The difference is quite high: 30 - 50. Am I missing something? How is that possible?
Update: merged set created with gffcompare (with -i option) behaves according to my expectations: sensitivity is higher than for each component files. It seems more like a stringtie issue now...

Class code 'NA' when it should be '='?

I have run stringtie merge and I then use stringtie_merged.gtf to compare to the latest version of the gencode annotation with gffcompare. All of the files should be formatted the same (i.e. both files have 'chr1') and I mapped the data with the same gencode annotation file.
gffcompare -r gencode.v27.primary_assembly.annotation.gtf -o gff stringtie_merged.gtf
There are ~ 1000 (<1%) of transcripts that are not being given a class code.
For example when I create a transcript count matrix (with prepDE.py) transcript ENST00000640668.1 has read counts for all samples in the transcript count matrix but has no class code. When I search for this transcript in stringtie merged.gtf and look at each exon of this transcript and compare it to the genocde.gtf all of the co-ordinates exactly match so should gffcompare be giving it an '=' class code???
I have attached my stringtie_merged.gtf if this help to troubleshoot.
stringtie_merged.gtf.gz

Extract transcript from assembly based on class code using gffread

Hi,

I used CuffCompare to compare the genome guided assembled GTF file with the genome annotation. Now I want to extract transcripts with class codes annotated with “u”, “x,” and “i” using gffread program.

Please share command line to extract these transcripts.

Thanks

0 reference transcripts loaded.

Hi,

I am trying to use gffcompare to compare my assembled transcriptome to a reference gtf that contains information about small open reading frames (sORFs). The reference gtf was obtained by processing downloaded data from several sORF databases. The reference gtf is called all_38.gtf and is attached at the end.

mytranscriptome.gtf is generated with reference to ensembl hg38 v99 reference.

I used the command: /home/e0470749/gffcompare/gffcompare -r /gpfs/eplab/Nathaniel/norf_replicate/all_38.gtf -o /gpfs/eplab/Nathaniel/norf_replicate/norfs mytranscriptome.gtf.

This outputs a message file:

0 reference transcripts loaded.
  237788 query transfrags loaded.
  2714 duplicate query transfrags discarded.

All the expected output files are generated except for the .refmap file which I need for downstream analysis. I assume this is because 0 reference transcripts were loaded into the program. Is there something wrong with my custom reference gtf file?

all_38.gtf.zip

Sensitivity/Precision Stats

For the stats produced by gffcompare. Can you explain how they are calculated?
If, 𝑷𝒓𝒆𝒄𝒊𝒔𝒊𝒐𝒏=𝑻𝑷/(𝑻𝑷+𝑭𝑷)
What is a false positive?
And how do you calculate this on the 'base level'?

 #-----------------| Sensitivity | Precision  |      
 Base level:    82.4     |    76.5    |
        Exon level:    81.2     |    82.9    |
      Intron level:    86.1     |    94.8    |
Intron chain level:    56.9     |    52.4    |
  Transcript level:    55.2     |    38.9    |
       Locus level:    70.1     |    48.0    |

How can I extract the novel exons?

I am having the following statistic from gffcompare

      Missed exons:   81439/302312  ( 26.9%)
       Novel exons:    2536/146326  (  1.7%)
    Missed introns:   38931/183784  ( 21.2%)
     Novel introns:     556/100652  (  0.6%)
       Missed loci:       0/11150   (  0.0%)
        Novel loci:     166/10328   (  1.6%)

I can see there is 2536 novel exons in my input file. Is there any way I can extract this exons?

Thanks

why -x and -y option print empty file

I use command:
gffread -x sample.CDS.fa -y sample.CDS.protein.fa -g GRCh38.fasta sample.gtf
and No errors appear,but both sample.CDS.fa and sample.CDS.protein.fa are empty file,Why is this happening?

the head of the sample.gtf:

1	StringTie	transcript	356184	357780	1000	+	.	gene_id "MSTRG.49"; transcript_id "MSTRG.49.1"; 
1	StringTie	exon	356184	356438	1000	+	.	gene_id "MSTRG.49"; transcript_id "MSTRG.49.1"; exon_number "1"; 
1	StringTie	exon	357540	357780	1000	+	.	gene_id "MSTRG.49"; transcript_id "MSTRG.49.1"; exon_number "2"; 
1	StringTie	transcript	1488502	1572128	1000	-	.	gene_id "MSTRG.70"; transcript_id "MSTRG.70.1"; 
1	StringTie	exon	1488502	1488533	1000	-	.	gene_id "MSTRG.70"; transcript_id "MSTRG.70.1"; exon_number "1"; 
1	StringTie	exon	1565130	1572128	1000	-	.	gene_id "MSTRG.70"; transcript_id "MSTRG.70.1"; exon_number "2"; 
1	StringTie	transcript	2147579	2148044	1000	-	.	gene_id "MSTRG.79"; transcript_id "MSTRG.79.1"; 
1	StringTie	exon	2147579	2147793	1000	-	.	gene_id "MSTRG.79"; transcript_id "MSTRG.79.1"; exon_number "1"; 
1	StringTie	exon	2147992	2148044	1000	-	.	gene_id "MSTRG.79"; transcript_id "MSTRG.79.1"; exon_number "2"; 

gffcompare fails to match gtf without chr prepended to chromosome names

The HISAT index provided by http://ccb.jhu.edu/software/hisat2/index.shtml does not have "chr" in the chromosome names, therefore I modified the gencode standard gtf to remove the "chr" before each chromosome as well as renaming "chrM" to "MT" to match the HISAT index.

However, after assembly with Stringtie, I attempt to use gffcompare to compare to the modified gtf, and I find that gffcompare discards every transcript except the unaligned contigs. In the stats file I see

# Reference mRNAs : 271 in 262 loci (27 multi-exon)

Running again with "chr" prepended to the transcript names in the gtf all of the transcripts are kept

# Reference mRNAs : 142428 in 22607 loci (135360 multi-exon)

Why not two '='

This is my gtf file for test, saved as 'Test.gtf':

1       StringTie       transcript      9182204 9192089 1000    +       .       gene_id "origin.STRG.178"; transcript_id "origin.STRG.178.1"; reference_id "ENST00000412639"; ref_gene_id "ENSG00000234546"; ref_gene_name "LINC01759"; cov "3.691087"; FPKM "0.354073"; TPM "0.688700";
1       StringTie       exon    9182204 9182316 1000    +       .       gene_id "origin.STRG.178"; transcript_id "origin.STRG.178.1"; exon_number "1"; reference_id "ENST00000412639"; ref_gene_id "ENSG00000234546"; ref_gene_name "LINC01759"; cov "0.261039";
1       StringTie       exon    9183633 9183807 1000    +       .       gene_id "origin.STRG.178"; transcript_id "origin.STRG.178.1"; exon_number "2"; reference_id "ENST00000412639"; ref_gene_id "ENSG00000234546"; ref_gene_name "LINC01759"; cov "4.634286";
1       StringTie       exon    9191393 9192089 1000    +       .       gene_id "origin.STRG.178"; transcript_id "origin.STRG.178.1"; exon_number "3"; reference_id "ENST00000412639"; ref_gene_id "ENSG00000234546"; ref_gene_name "LINC01759"; cov "4.010363";
1       StringTie       transcript      9182752 9182952 1000    .       .       gene_id "origin.STRG.179"; transcript_id "origin.STRG.179.1"; cov "6.796020"; FPKM "0.651919"; TPM "1.268033";
1       StringTie       exon    9182752 9182952 1000    .       .       gene_id "origin.STRG.179"; transcript_id "origin.STRG.179.1"; exon_number "1"; cov "6.796020";

I typed following command:

gffcompare -r Test.gtf Test.gtf 

and this is gffcmp.tracking:

TCONS_00000001  XLOC_000001     origin.STRG.178|origin.STRG.178.1       =       q1:origin.STRG.178|origin.STRG.178.1|3|0.354073|0.688700|3.691087|985
TCONS_00000002  XLOC_000002     origin.STRG.178|origin.STRG.178.1       i       q1:origin.STRG.179|origin.STRG.179.1|1|0.651919|1.268033|6.796020|201

See, I compare this file with itself, and two transcripts were not marked two '=', instead, origin.STRG.179 was marked i compared with origin.STRG.178. Is there any explanation?

cannot locate input file: –r [likely solution]

Error while executing the below command

gffcompare -R –r Reference.gtf –o merged stringtie_merged.gtf
Error: cannot locate input file: –r

Reference.gtf is available in the same folder with full permission

Version: gffcompare v0.9.7

Interpreting sensitivity and precision

Hello,

I mapped my RNA-seq reads with HISAT2 and assembled them with stringtie while using the Salmo salar whole genome as a reference in both stages. I understand that "Sensitivity is the proportion of coding nucleotides that have been correctly predicted as coding, and Specificity is the proportion of noncoding nucleotides that have been correctly predicted as noncoding." (From Burset and Guigo 1996)

Is Precision synonymous with Specificity in this case?

What could lead to the discrepancy between Sensitivity and Precision seen in my data? Thanks!

gffcompare v0.9.8 | Command line was:
./gffcompare -r /media/genetics/Data1/Jeff_workstation_1/subread-1.5.0-p1/GCF_000233375.1_ICSASG_v2_genomic.gff -o /media/genetics/Data1/Jeff_workstation_1/gffcompare-0.9.8.Linux_x86_64/compare/8-B7877_S4_1a_mRNA-e-disabled /media/genetics/Data1/Jeff_workstation_1/stringtie-1.3.1c.Linux_x86_64/outputs/assembled-e-disabled/8-B7877_S4_1a_mRNA-e-disabled.gtf

= Summary for dataset: /media/genetics/Data1/Jeff_workstation_1/stringtie-1.3.1c.Linux_x86_64/outputs/assembled-e-disabled/8-B7877_S4_1a_mRNA-e-disabled.gtf
Query mRNAs : 20484 in 19076 loci (13136 multi-exon transcripts)
(998 multi-transcript loci, ~1.1 transcripts per locus)
Reference mRNAs : 136039 in 81267 loci (111148 multi-exon)
Super-loci w/ reference transcripts: 12245
-----------------| Sensitivity | Precision |
Base level: 12.2 | 86.9 |
Exon level: 8.9 | 73.9 |
Intron level: 9.5 | 96.7 |
Intron chain level: 5.7 | 48.6 |
Transcript level: 5.6 | 37.3 |
Locus level: 8.3 | 35.4 |

 Matching intron chains:    6378
   Matching transcripts:    7647
          Matching loci:    6744

      Missed exons:  498437/562650	( 88.6%)
       Novel exons:    5960/66229	(  9.0%)
    Missed introns:  421469/471248	( 89.4%)
     Novel introns:     935/46255	(  2.0%)
       Missed loci:   68711/81267	( 84.5%)
        Novel loci:    4714/19076	( 24.7%)

Total union super-loci across all input datasets: 17075

Error building gffcompare

Hi @gpertea ,

I followed the instructions to install gffcompare and I got the error below with make release. Could you provide some help? Thanks!

g++ -O3 -DNDEBUG -std=c++0x -Wall -Wextra -I../gclib -D_REENTRANT -fno-exceptions -fno-rtti -c ../gclib/GFastaIndex.cpp -o ../gclib/GFastaIndex.o
g++ -O3 -DNDEBUG -std=c++0x -Wall -Wextra -I../gclib -D_REENTRANT -fno-exceptions -fno-rtti -c ../gclib/GFaSeqGet.cpp -o ../gclib/GFaSeqGet.o
g++ -O3 -DNDEBUG -std=c++0x -Wall -Wextra -I../gclib -D_REENTRANT -fno-exceptions -fno-rtti -c ../gclib/gff.cpp -o ../gclib/gff.o
../gclib/gff.cpp: In member function ‘bool GffObj::processGeneSegments(GffReader*)’:
../gclib/gff.cpp:1988: error: template argument for ‘template class GVec’ uses local type ‘GffObj::processGeneSegments(GffReader*)::GeneCDS’
../gclib/gff.cpp:1988: error: trying to instantiate ‘template class GVec’
../gclib/gff.cpp:1989: error: template argument for ‘template class GArray’ uses local type ‘GffObj::processGeneSegments(GffReader*)::GSegMatch’
../gclib/gff.cpp:1989: error: trying to instantiate ‘template class GArray’
../gclib/gff.cpp: In member function ‘void GffObj::processGeneSegments(GffReader*)::GeneCDSChain::addCDS(int, uint, uint)’:
../gclib/gff.cpp:1998: error: request for member ‘Add’ in ‘((GffObj::processGeneSegments(GffReader*)::GeneCDSChain*)this)->GffObj::processGeneSegments(GffReader*)::GeneCDSChain::cdsList’, which is of non-class type ‘int’
../gclib/gff.cpp: In member function ‘void GffObj::processGeneSegments(GffReader*)::GeneCDSChain::addMatch(int, int, int)’:
../gclib/gff.cpp:2003: error: request for member ‘Add’ in ‘((GffObj::processGeneSegments(GffReader*)::GeneCDSChain*)this)->GffObj::processGeneSegments(GffReader*)::GeneCDSChain::mxs’, which is of non-class type ‘int’
../gclib/gff.cpp: In member function ‘bool GffObj::processGeneSegments(GffReader*)::GeneCDSChain::singleExonCDSMatch(uint, uint, int&)’:
../gclib/gff.cpp:2009: error: request for member ‘Count’ in ‘((GffObj::processGeneSegments(GffReader*)::GeneCDSChain*)this)->GffObj::processGeneSegments(GffReader*)::GeneCDSChain::cdsList’, which is of non-class type ‘int’
../gclib/gff.cpp:2011: error: request for member ‘Count’ in ‘((GffObj::processGeneSegments(GffReader*)::GeneCDSChain*)this)->GffObj::processGeneSegments(GffReader*)::GeneCDSChain::cdsList’, which is of non-class type ‘int’
../gclib/gff.cpp:2012: error: invalid types ‘int[int]’ for array subscript
../gclib/gff.cpp:2012: error: invalid types ‘int[int]’ for array subscript
../gclib/gff.cpp: In member function ‘bool GffObj::processGeneSegments(GffReader*)::GeneCDSChain::singleCDStoExon(GffObj&, int&)’:
../gclib/gff.cpp:2022: error: invalid types ‘int[int]’ for array subscript
../gclib/gff.cpp:2023: error: invalid types ‘int[int]’ for array subscript
../gclib/gff.cpp:2024: error: invalid types ‘int[int]’ for array subscript
../gclib/gff.cpp:2026: error: invalid types ‘int[int]’ for array subscript
../gclib/gff.cpp:2026: error: invalid types ‘int[int]’ for array subscript
../gclib/gff.cpp: In member function ‘bool GffObj::processGeneSegments(GffReader*)::GeneCDSChain::multiCDStoExon(GffObj&, int&)’:
../gclib/gff.cpp:2043: error: request for member ‘Count’ in ‘((GffObj::processGeneSegments(GffReader*)::GeneCDSChain*)this)->GffObj::processGeneSegments(GffReader*)::GeneCDSChain::cdsList’, which is of non-class type ‘int’
../gclib/gff.cpp:2047: error: invalid types ‘int[int]’ for array subscript
../gclib/gff.cpp:2048: error: invalid types ‘int[int]’ for array subscript
../gclib/gff.cpp:2050: error: invalid types ‘int[int]’ for array subscript
../gclib/gff.cpp:2050: error: invalid types ‘int[int]’ for array subscript
../gclib/gff.cpp:2063: error: invalid types ‘int[int]’ for array subscript
../gclib/gff.cpp:2064: error: invalid types ‘int[int]’ for array subscript
../gclib/gff.cpp:2074: error: request for member ‘Count’ in ‘((GffObj::processGeneSegments(GffReader*)::GeneCDSChain*)this)->GffObj::processGeneSegments(GffReader*)::GeneCDSChain::cdsList’, which is of non-class type ‘int’
../gclib/gff.cpp:2076: error: invalid types ‘int[int]’ for array subscript
../gclib/gff.cpp: In member function ‘bool GffObj::processGeneSegments(GffReader*)::GeneCDSChain::containedBy(GffObj&, int&)’:
../gclib/gff.cpp:2097: error: request for member ‘Count’ in ‘((GffObj::processGeneSegments(GffReader*)::GeneCDSChain*)this)->GffObj::processGeneSegments(GffReader*)::GeneCDSChain::cdsList’, which is of non-class type ‘int’
../gclib/gff.cpp: In member function ‘bool GffObj::processGeneSegments(GffReader*)’:
../gclib/gff.cpp:2105: error: template argument for ‘template class GHash’ uses local type ‘GffObj::processGeneSegments(GffReader*)::GeneCDSChain’
../gclib/gff.cpp:2105: error: trying to instantiate ‘template class GHash’
../gclib/gff.cpp:2105: error: invalid type in declaration before ‘(’ token
../gclib/gff.cpp:2106: error: template argument for ‘template class GPVec’ uses local type ‘GffObj::processGeneSegments(GffReader*)::GeneCDSChain’
../gclib/gff.cpp:2106: error: trying to instantiate ‘template class GPVec’
../gclib/gff.cpp:2106: error: invalid type in declaration before ‘;’ token
../gclib/gff.cpp:2128: error: request for member ‘Find’ in ‘cdsChainById’, which is of non-class type ‘int’
../gclib/gff.cpp:2133: error: request for member ‘Add’ in ‘cdsChains’, which is of non-class type ‘int’
../gclib/gff.cpp:2134: error: request for member ‘shkAdd’ in ‘cdsChainById’, which is of non-class type ‘int’
../gclib/gff.cpp:2143: error: request for member ‘Count’ in ‘cdsChains’, which is of non-class type ‘int’
../gclib/gff.cpp:2144: error: invalid types ‘int[int]’ for array subscript
../gclib/gff.cpp:2153: error: request for member ‘Count’ in ‘gc->GffObj::processGeneSegments(GffReader*)::GeneCDSChain::mxs’, which is of non-class type ‘int’
../gclib/gff.cpp:2158: error: request for member ‘First’ in ‘gc->GffObj::processGeneSegments(GffReader*)::GeneCDSChain::mxs’, which is of non-class type ‘int’
../gclib/gff.cpp:2159: error: request for member ‘Count’ in ‘gc->GffObj::processGeneSegments(GffReader*)::GeneCDSChain::cdsList’, which is of non-class type ‘int’
../gclib/gff.cpp:2160: error: invalid types ‘int[int]’ for array subscript
../gclib/gff.cpp:2161: error: invalid types ‘int[int]’ for array subscript
../gclib/gff.cpp:2165: error: request for member ‘First’ in ‘gc->GffObj::processGeneSegments(GffReader*)::GeneCDSChain::mxs’, which is of non-class type ‘int’
make: *** [../gclib/gff.o] Error 1

How to statistics the novel transcripts

Hello,
I read the Transcript-level expression analysis of RNA-seq experiments with HISAT, StringTie and Ballgown.
then I download the file chrX_data.tar.gz to practice.
I have a problem for novel transcripts. I use one simple,ERR188044,to practice ,like this:
gffcompare -r /disk/chrX_data/genes/chrX.gtf -G -o /disk/luping/chrX_data/gffcompare/ERR188044 /disk/chrX_data/gtf/ERR188044_chrX.gtf.

#= Summary for dataset: /disk/chrX_data/gtf/ERR188044_chrX.gtf

Query mRNAs : 1234 in 786 loci (890 multi-exon transcripts)

(226 multi-transcript loci, ~1.6 transcripts per locus)

Reference mRNAs : 2102 in 1086 loci (1856 multi-exon)

Super-loci w/ reference transcripts: 461

#-----------------| Sensitivity | Precision |
Base level: 51.6 | 80.3 |
Exon level: 46.7 | 85.4 |
Intron level: 47.2 | 93.8 |
Intron chain level: 31.2 | 65.2 |
Transcript level: 31.6 | 53.8 |
Locus level: 36.6 | 50.6 |

 Matching intron chains:     580
   Matching transcripts:     664
          Matching loci:     397

      Missed exons:    4398/8804	( 50.0%)
       Novel exons:     418/4863	(  8.6%)
    Missed introns:    3832/7946	( 48.2%)
     Novel introns:      79/3995	(  2.0%)
       Missed loci:     581/1086	( 53.5%)
        Novel loci:     260/786	( 33.1%)

Total union super-loci across all input datasets: 741
1234 out of 1234 consensus transcripts written in /disk/luping/chrX_data/gffcompare/ERR188044.annotated.gtf (0 discarded as redundant)

then I statistics the novel transcripts using class_code "j",but this is different from the results of table7

ERR188044 Novel transcripts 615 Did I have any mistake?

Difficulty matching behavior between older and newer gffcompare

Hi,

I am taking a few stringtie merged assemblies of mouse RNAseq data and annotating them back against Gencode vM12 gtf files using gffcompare. I have noticed some differences between older versions of gffcompare and the newer versions where the arguments and default modes of gffcompare have been changed.

With version v0.9.6f, I would annotate the merged gtf file against the gencode gtf with the -G -C options which were defined as doing the following:

-G generic GFF input file(s): do not assume Cufflinks/Stringtie GTF input, (do not discard intron-redundant transfrags)
-C include the "contained" transcripts in the .combined.gtf file

In the newest version cloned from github, v0.9.9d, -G is now the default and -C appears to do the opposite of what -C was defined as doing in the older versions, though I could be misinterpreting this.

-C discard the "contained" transfrags in the .combined.gtf (i.e. collapse intron-redundant transfrags across all query files)

So, I figured I would want to run the newer versions of gffcompare without -G and -C in order to mimic the behavior of the older versions of gffcompare with the -G and -C options. However, I am seeing cases where transcripts are being treated as duplicates, even when they are not the same as each other and then getting merged together. So, I tried running with -C and -C -K options, but generally, no matter what I do, I haven't been able to identify the right combination of arguments to mimic the behavior of older versions of gffcompare with -G -C. I also noticed that whenever I run with -C option, that I get a .combined.gtf file instead of a .annotated.gtf file when running without the -C option.

I've attached an image of IGV showing gencode annotatoin for a single exon gene with two transcripts of differing lengths in the gencode annotation. I also show the stringtie assembly and then the older version of gffcompare with -G -C and the newest version of gffcompare with multiple different options. You can see that in the newer version, the contained transcript is removed. This is a simple example, but I see this in multi exon genes as well.

case of merging mouse gffcompare

Maybe I am missing something obvious, so could you please give advice on how I can run the newest version of gffcompare to preserve the contained transcripts?

straight annotation mode for single input file

Currently the *.combined.gtf file renames the transcripts as they are "merged". There should be a new "annotation mode" option which would simply keep the original transcript names (without trying to rename them), using the longer isoform's transcript ID in case of merging (which shouldn't happen at all if that file is the output of stringtie --merge) or, optionally, even disabling the merging of compatible/equivalent isoforms (separate option?). Sure the latter option would make the name of this file a bit awkward (in which case perhaps it should be changed to *.annotated.gtf or *.class.gtf ?).

update release tags

We're installing gffcompare on our cluster and it would help us to quickly find and update to the latest stable release if you could keep the tags up-to-date. The latest tag says 0.9.8, but the version string in gffcompare.cpp says 0.9.9d.

Many thanks for your consideration

How to interpret sensitivity and precision?

Hello,

Thank you very much for this helpful tool. But I'd like to ask how are sensitivity and precision at different levels (bases, exons, transcripts) calculated? In other words, how do you define a "truth positive" or a "match"? Can you provide more details here?

Thanks!

Exons and introns coordinates

Hi

The .stats file, mentions a number of novel exons and a number of novel introns identified. Is there an easy way to get the coordinates of these regions?

e.g.
Novel exons:   16055/631658  (  2.5%)
Novel introns:    5926/381131  (  1.6%)

Thank you in advance
Foivos

No coverage in .tmap file

Hi,
I am running stringtie merge and then gffcompare:
gffcompare -r gencode.gtf -G -o gff stringtie_merged.gtf
When I look at the gff.stringtie_merged.gtf.tmap all values are zero for FPKM,TPM, and coverage for all transcripts??

Error allocating memory

Hello, I've been following the protocol published on Nature and running gffcompare on two RNAseq replicates that I have. I removed the -G flag as (from what I understand) it's now the default behaviour.

Still, when I run
gffcompare -r hg19.gtf merged.gtf
I get the following error:
Error allocating memory. Aborted (core dumped)

When I run the tool in debugging mode with -D I see a lot of
GFF Warning: unusual segment inclusion: exon(X-X) within CDS(X-X) (ID=X)
and of
GFF warning: merging adjacent/overlapping segments of X on X (X-X, X-X)
and then:

Kept X ref transcripts out of Y
 W duplicate reference transcripts discarded.
Prefix for output files: gffcmp
Loading query file #1: merged.gtf
Error allocating memory.
Aborted (core dumped)

Any idea on what could be causing this?
I am working on an Ubuntu 16.04 machine and I tried both with the precompiled binaries and with my own build (following instruction in the README file).

Many thanks in advance!
Cheers

Build fails

make release
/software/packages/spack/opt/spack/linux-centos7-x86_64/gcc-8.2.0/gcc-6.3.0-ceghynmzmemy2imu275hslrgvwoh4pch/bin/g++ -O3 -DNDEBUG -Wall -Wextra -I../gclib -D_REENTRANT -fno-exceptions -fno-rtti -c ../gclib/GFastaIndex.cpp -o ../gclib/GFastaIndex.o
/software/packages/spack/opt/spack/linux-centos7-x86_64/gcc-8.2.0/gcc-6.3.0-ceghynmzmemy2imu275hslrgvwoh4pch/bin/g++ -O3 -DNDEBUG -Wall -Wextra -I../gclib -D_REENTRANT -fno-exceptions -fno-rtti -c ../gclib/GFaSeqGet.cpp -o ../gclib/GFaSeqGet.o
/software/packages/spack/opt/spack/linux-centos7-x86_64/gcc-8.2.0/gcc-6.3.0-ceghynmzmemy2imu275hslrgvwoh4pch/bin/g++ -O3 -DNDEBUG -Wall -Wextra -I../gclib -D_REENTRANT -fno-exceptions -fno-rtti -c ../gclib/gff.cpp -o ../gclib/gff.o
/software/packages/spack/opt/spack/linux-centos7-x86_64/gcc-8.2.0/gcc-6.3.0-ceghynmzmemy2imu275hslrgvwoh4pch/bin/g++ -O3 -DNDEBUG -Wall -Wextra -I../gclib -D_REENTRANT -fno-exceptions -fno-rtti -c ../gclib/gdna.cpp -o ../gclib/gdna.o
/software/packages/spack/opt/spack/linux-centos7-x86_64/gcc-8.2.0/gcc-6.3.0-ceghynmzmemy2imu275hslrgvwoh4pch/bin/g++ -O3 -DNDEBUG -Wall -Wextra -I../gclib -D_REENTRANT -fno-exceptions -fno-rtti -c ../gclib/codons.cpp -o ../gclib/codons.o
/software/packages/spack/opt/spack/linux-centos7-x86_64/gcc-8.2.0/gcc-6.3.0-ceghynmzmemy2imu275hslrgvwoh4pch/bin/g++ -O3 -DNDEBUG -Wall -Wextra -I../gclib -D_REENTRANT -fno-exceptions -fno-rtti -c ../gclib/GBase.cpp -o ../gclib/GBase.o
/software/packages/spack/opt/spack/linux-centos7-x86_64/gcc-8.2.0/gcc-6.3.0-ceghynmzmemy2imu275hslrgvwoh4pch/bin/g++ -O3 -DNDEBUG -Wall -Wextra -I../gclib -D_REENTRANT -fno-exceptions -fno-rtti -c ../gclib/GStr.cpp -o ../gclib/GStr.o
/software/packages/spack/opt/spack/linux-centos7-x86_64/gcc-8.2.0/gcc-6.3.0-ceghynmzmemy2imu275hslrgvwoh4pch/bin/g++ -O3 -DNDEBUG -Wall -Wextra -I../gclib -D_REENTRANT -fno-exceptions -fno-rtti -c ../gclib/GArgs.cpp -o ../gclib/GArgs.o
/software/packages/spack/opt/spack/linux-centos7-x86_64/gcc-8.2.0/gcc-6.3.0-ceghynmzmemy2imu275hslrgvwoh4pch/bin/g++ -O3 -DNDEBUG -Wall -Wextra -I../gclib -D_REENTRANT -fno-exceptions -fno-rtti -c gtf_tracking.cpp -o gtf_tracking.o
gtf_tracking.cpp: In function ‘void read_transcripts(FILE*, GList&, bool)’:
gtf_tracking.cpp:606:7: error: ‘class GffReader’ has no member named ‘keepingAttrs’; did you mean ‘keepAttrs’?
gffr.keepingAttrs(keepAttrs, true);
^~~~~~~~~~~~
gtf_tracking.cpp:607:10: error: ‘class GffReader’ has no member named ‘mergingCloseExons’; did you mean ‘mergeCloseExons’?
gffr.mergingCloseExons(true);
^~~~~~~~~~~~~~~~~
gtf_tracking.cpp: In function ‘void read_mRNAs(FILE*, GList&, GList, bool, int, const char, bool)’:
gtf_tracking.cpp:640:8: error: ‘class GffReader’ has no member named ‘mergingCloseExons’; did you mean ‘mergeCloseExons’?
gffr->mergingCloseExons(true);
^~~~~~~~~~~~~~~~~
gtf_tracking.cpp:641:8: error: ‘class GffReader’ has no member named ‘keepingAttrs’; did you mean ‘keepAttrs’?
gffr->keepingAttrs(!isRefData, isRefData || gtf_tracking_largeScale );
^~~~~~~~~~~~
make: *** [gtf_tracking.o] Error 1

Any suggestions?

Output path of refmap and tmap

Hi,

This is only a very small issue, but I wanted to tell about it.
It was not clear to me that the refmap and tmap is not written to the current directory, but in the directory of the GFF files.
It is not a big deal, but I asked myself for about 2 hours why gffcompare generates all other files like .stats, .tracking and so on, but no tmap or refmap.

Sorry for bothering with this issue.

Cheers
Chris

how to unnderstand the *.stats file?

hello, sorry to botter you. I am a beginner.I wonder what these numbers mean in stats file.
What does the number after 'Novel exons' represent?

            Missed exons:   70627/364162  ( 19.4%)
           Novel exons:    3595/181415  (  2.0%)
        Missed introns:   36856/226049  ( 16.3%)
         Novel introns:    1791/139213  (  1.3%)
           Missed loci:       0/13156   (  0.0%)
            Novel loci:      92/14996   (  0.6%)

thank you .

Question on gffcompare output

Hello, I am currently working on the isoform detection analysis using gffcompare.
I was wondering if there are some cases on the reference (Gencode for instance) that have quite a similar isoform structure variation where it might be difficult for the input transcript to be discriminated, how would the input transcript be decided for its annotation? If the input transcript gets annotated to one of the closest matching reference transcripts, what would be the deciding factor? (more matching bases?)
I could not get a clear answer that I wanted from the utility page. If you help with this issue, I would really appreciate it!

Jungwoo.

Error running gffcompare

I am trying to run the following command:

gffcompare -r ref.gff -G -o merged stringtie-merged.gtf

ref.gff - downloaded from NCBI

string-merged.gtf - obtained from stringtie --merge command

Error encountered : GFF Error: overlapping duplicate transcript feature (ID=gene29892)

When I grep "gene29892" from both the ref.gff and merged stringtie-merged.gtf

From ref.gff

NC_007957.1	RefSeq	gene	74631	74744	.	+	.	ID=gene29892;Dbxref=GeneID:4025012;Name=rps12;exception=trans-splicing;gbkey=Gene;gene=rps12;gene_biotype=protein_coding;locus_tag=ViviCp045;part=1/2
NC_007957.1	RefSeq	gene	146276	147073	.	+	.	ID=gene29892;Dbxref=GeneID:4025012;Name=rps12;exception=trans-splicing;gbkey=Gene;gene=rps12;gene_biotype=protein_coding;locus_tag=ViviCp045;part=2/2
NC_007957.1	RefSeq	CDS	74631	74744	.	+	0	ID=cds41168;Parent=gene29892;Dbxref=Genbank:YP_567100.1,GeneID:4025012;Name=YP_567100.1;exception=trans-splicing;gbkey=CDS;gene=rps12;product=ribosomal protein S12;protein_id=YP_567100.1;transl_table=11
NC_007957.1	RefSeq	CDS	146276	146507	.	+	0	ID=cds41168;Parent=gene29892;Dbxref=Genbank:YP_567100.1,GeneID:4025012;Name=YP_567100.1;exception=trans-splicing;gbkey=CDS;gene=rps12;product=ribosomal protein S12;protein_id=YP_567100.1;transl_table=11
NC_007957.1	RefSeq	CDS	147048	147073	.	+	2	ID=cds41168;Parent=gene29892;Dbxref=Genbank:YP_567100.1,GeneID:4025012;Name=YP_567100.1;exception=trans-splicing;gbkey=CDS;gene=rps12;product=ribosomal protein S12;protein_id=YP_567100.1;transl_table=11
NC_007957.1	RefSeq	exon	146276	146507	.	+	.	ID=id318095;Parent=gene29892;Dbxref=GeneID:4025012;exon_number=1;gbkey=exon;gene=rps12
NC_007957.1	RefSeq	exon	147048	147073	.	+	.	ID=id318096;Parent=gene29892;Dbxref=GeneID:4025012;exon_number=2;gbkey=exon;gene=rps12

From stringtie-merged.gtf

NC_007957.1	StringTie	transcript	74631	147073	1000	+	.	gene_id "MSTRG.117"; transcript_id "gene29892"; gene_name "rps12"; ref_gene_id "gene29892"; 
NC_007957.1	StringTie	exon	74631	74744	1000	+	.	gene_id "MSTRG.117"; transcript_id "gene29892"; exon_number "1"; gene_name "rps12"; ref_gene_id "gene29892"; 
NC_007957.1	StringTie	exon	146276	146507	1000	+	.	gene_id "MSTRG.117"; transcript_id "gene29892"; exon_number "2"; gene_name "rps12"; ref_gene_id "gene29892"; 
NC_007957.1	StringTie	exon	147048	147073	1000	+	.	gene_id "MSTRG.117"; transcript_id "gene29892"; exon_number "3"; gene_name "rps12"; ref_gene_id "gene29892"; 
NC_007957.1	StringTie	transcript	146276	147073	1000	+	.	gene_id "MSTRG.117"; transcript_id "gene29892"; gene_name "rps12"; ref_gene_id "gene29892"; 
NC_007957.1	StringTie	exon	146276	147073	1000	+	.	gene_id "MSTRG.117"; transcript_id "gene29892"; exon_number "1"; gene_name "rps12"; ref_gene_id "gene29892"; 

What could be possibly wrong as I cannot see any duplicate values!

How to compare CDS gff with genome guided assembled gff

I am working on plant species transcriptome. Its draft genome is known. We want to identify some new transcripts in genome-guided assembly.

I need your help to remove overlapping sequencing by comparing already reported CDS gff file with assembled transcript gff file using StringTie.

Can we also remove the redundancy of transcript which mapping to the same location on the genome by using gff file.

Thanks

Missing tmap and refmap files

Hi

I tried version 0.11.2 (from anaconda) and 0.11.4 (precompiled binary) as following:

gffcompare \
-r reference.gtf \
-s .genome.fa \
-o prefix \
merged_annotation.gtf

where merged_annotation.gtf is the output of stringtie merge. After this run I get the following files:

  • .tracking
  • .annotated.gtf
  • .loci
  • .stats

but I am missing the .tmap and .refmap

Am I doing something wrong?

Thank you in advance
Foivos

gffcompare class code "x", what is the minimum overlap?

Hello,

What is the minimum overlap required on the opposite strand for the class code "x" to be assigned? I am curious if this can be altered at all within the script as I am interested in search for antisense transcription in my dataset but would like to know how that is being classified using gffcompare. I've looked in your papers and in the cuffcompare repository/website but can't seem to find this information.

Any help is appreciated!

Thank you,

Diego

too many input files (limit set to 500 at compile time)

I used cuffcompare to process 1000 files and get an error below:

Error: too many input files (limit set to 500 at compile time)
(if you need to raise this limit set a new value for
MAX_QFILES in gtf_tracking.h and recompile)

But I can't find the "MAX_QFILES" option in the gtf_tracking.h

How can I raise the limit of the input files?

single-exon transcript in reference

Here is my testing code.
gffcompare -r test.gff test1.gff
The two gff files are identical, but I got this summary.

#= Summary for dataset: test1.gff
#     Query mRNAs :   89544 in   56270 loci  (57969 multi-exon transcripts)
#            (10996 multi-transcript loci, ~1.6 transcripts per locus)
# Reference mRNAs :   57969 in   25225 loci  (57969 multi-exon)
...

When i run this order with -M, I got this summary. It works good.

#= Summary for dataset: test1.gff
#     Query mRNAs :   57969 in   25225 loci  (57969 multi-exon transcripts)
#            (10569 multi-transcript loci, ~2.3 transcripts per locus)
# Reference mRNAs :   57969 in   25225 loci  (57969 multi-exon)

Brifly, single-exon reference transcripts are discarded even without -N option.

how to find novel genes and transcripts in gffcompare?

I am also new to RNA-SEQ analysis. I also followed this same protocol for my dataset. And i am having some problem for identifying novel transcripts and genes from the below gffcompare ".stats" file

i was running the following command $ gffcompare –r chrX_data/genes/chrX.gtf –G –o merged stringtie_merged.gtf. and i got the result file as
= Summary for dataset: stringtie_merged.gtf
Query mRNAs : 253181 in 70635 loci (216728 multi-exon transcripts)
(23264 multi-transcript loci, ~3.6 transcripts per locus)
Reference mRNAs : 216257 in 60158 loci (189357 multi-exon)
Super-loci w/ reference transcripts: 51791
-----------------| Sensitivity | Precision |

Base level: 100.0 | 93.0 |
Exon level: 99.9 | 93.6 |

Intron level: 99.4 | 94.0 |

Intron chain level: 99.7 | 87.1 | Transcript level: 99.8 | 85.2 | Locus level: 100.0 | 84.4 |

Matching intron chains: 188838
Matching transcripts: 215729
Matching loci: 60158

Missed exons: 0/623537 ( 0.0%)
Novel exons: 23741/674273 ( 3.5%)
Missed introns: 2160/383827 ( 0.6%)
Novel introns: 4747/405880 ( 1.2%)
Missed loci: 0/60158 ( 0.0%)
Novel loci: 10239/70635 ( 14.5%)

Total union super-loci across all input datasets: 70632 253181 out of 253181 consensus transcripts written in merged.annotated.gtf (0 discarded as redundant).

from this result where i can find the novel genes and transcripts?

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.